Hate Speech and AI: Issues in Detection

Hate speech is a form of expression that attacks a person, most often on the basis of race, gender, ethnicity, or sexual orientation. Hate speech itself is nothing new; however, with the expansion of the internet and social media, it has reached its most accelerated form. According to a Pew Research Center report, 41% of Americans have now experienced some form of online harassment. Likewise, the strong correlation between suicide rates and verbal harassment in migrant groups shows how crucial it is to detect and prevent the spread of hate speech. As a recent example, after the mass shooting at the Pittsburgh synagogue, it emerged that the perpetrator had been constantly posting hateful messages about Jews before the attack.

 

 

Retrieved from: https://www.kqed.org/news/11702239/why-its-so-hard-to-scrub-hate-speech-off-social-media

 

Furthermore, the same Pew Research Center report suggests that 79% of Americans think that detecting hate speech and online harassment is the responsibility of online service providers. Accordingly, many online service providers are aware of the importance of the issue and work closely with AI engineers to address it.

When it comes to how hate speech detection actually works, there are many complications. First, much of the complexity comes from current AI technologies' limited ability to understand the context of human language. Current systems fail to detect hate speech, or produce false positives, when context shifts. For instance, researchers from Carnegie Mellon University have suggested that the toxicity of a piece of speech may differ with the race, gender, and ethnicity of its author. According to the researchers, identifying the characteristics of the author while identifying hate speech and rating its toxicity therefore improves both the quality of the data and of the detection. Such identification can also reduce the bias that current algorithms carry.

Retrieved from: https://www.pewresearch.org/internet/2017/07/11/online-harassment-2017/pi_2017-07-11_online-harassment_0-01/

 

However, current AI technologies have difficulty detecting such characteristics. First, it is hard to identify the demographics and characteristics of authors, since in most cases that information is simply not available online, which makes distinguishing hate speech harder. Second, even when authors clearly state such information, detection can still be complicated by the cultural nuances of the given context. The dynamics of countries, and even of regions within countries, are in constant flux and are closely tied to culture and language. These differences and shifting factors strongly affect the outcomes: some systems may miss hate speech or flag false positives because of cultural differences that do not show up in the statistics.

 

 

Language is one of the most complicated and most significant faculties of humankind. There are many different ways and contexts of communicating through language, which even neuroscientists have not fully mapped yet. With artificial intelligence, however, scientists are one step closer to describing the patterns and mechanisms of language. In that sense, hate speech detection, a crucially important task in the age of the internet, has an advantage, since machine learning algorithms make it much easier to detect online harassment. Nevertheless, given the issues detection still faces, there is no way for humans to step out of the detection loop with today's technology.

 

References 

https://bdtechtalks.com/2019/08/19/ai-hate-speech-detection-challenges/

https://deepsense.ai/artificial-intelligence-hate-speech/

https://www.kqed.org/news/11702239/why-its-so-hard-to-scrub-hate-speech-off-social-media

 

Data Analysis and Visualization with Python – 2

We continue making visualizations on the Iris dataset I used in my previous article. There are two libraries that are used most frequently for data visualization. Of these, Matplotlib is the one many people already know, just as I do; the second is Seaborn. In this article, we will see how data can be visualized with the help of these libraries.

🔐 You can reach the Colab notebook I use via the link.

Data Visualization Libraries

1. Seaborn: Statistical Data Visualization Library

Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Visit the installation page to see how you can download the package and start using it.

Seaborn

We can say that its difference compared to Matplotlib is that it offers more customization options out of the box.

Seaborn Samples

The image above shows how we can visualize data with Seaborn. It is possible to display our data in many different kinds of graphics and forms.

2. Matplotlib: Visualization with Python

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Matplotlib Logo

Matplotlib was originally written by John D. Hunter and has had an active development community ever since.

Plots

Likewise, the visual I have included here shows the kinds of visualizations that can be made with Matplotlib.

🧷 Click on the link to view the plot types, or graphics, available in the Matplotlib library.

  • Line Plots: They show the relationship between two variables as a line.

Line plots

  • Scatter Plots: As the name suggests, the relationship between two variables is shown as scattered points (a minimal sketch of both plot types follows below).

Scatter Plots
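Just to make these two plot types concrete, here is a minimal, self-contained sketch; it uses synthetic data rather than the Iris set, purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Some simple example data (purely illustrative)
x = np.linspace(0, 10, 50)
y = np.sin(x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Line plot: the relationship between x and y drawn as a line
ax1.plot(x, y, color="steelblue")
ax1.set_title("Line Plot")

# Scatter plot: the same relationship shown as individual points
ax2.scatter(x, y, color="darkorange")
ax2.set_title("Scatter Plot")

plt.show()
```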

✨ I wanted to use the seaborn library to measure the relationship between the variables in the Iris data set.

Uploading Seaborn

After including the Seaborn library in our project, we produce the graph by passing various parameters. Here we have compared the relationship between the sepal_length and petal_width attributes over the dataframe. The cmap parameter determines the color palette used in our chart and can be changed as desired. The s parameter sets the size of the points shown in this scatter chart.
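The notebook cells themselves are shown as images, but a rough sketch of this kind of plot could look like the following. It assumes df is the Iris dataframe loaded in the previous article with its usual column names; the exact parameters in my notebook may differ.

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # apply Seaborn's default styling to Matplotlib plots

# Assumes df is the Iris dataframe, with columns sepal_length,
# petal_width, petal_length and species.
colors = df["species"].astype("category").cat.codes  # one color per species

plt.scatter(
    df["sepal_length"],          # x axis
    df["petal_width"],           # y axis
    c=colors,                    # point color by species
    cmap="viridis",              # the color palette; change it as you like
    s=50 * df["petal_length"],   # point size driven by another feature
    alpha=0.7,
)
plt.xlabel("sepal_length")
plt.ylabel("petal_width")
plt.show()
```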

Data Visualization

We have come to the end of another article. Stay healthy ✨

REFERENCES

  1. https://seaborn.pydata.org.
  2. https://matplotlib.org.
  3. Machine Learning Days | Merve Noyan | Data Visualization | Study Jams 2 |, https://www.youtube.com/watch?v=JL35pUrth4g&t=640s.
  4. Matplotlib, Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Matplotlib.
  5. https://jakevdp.github.io/PythonDataScienceHandbook/04.02-simple-scatter-plots.html.
  6. https://jakevdp.github.io/PythonDataScienceHandbook/04.01-simple-line-plots.html.
  7. https://matplotlib.org/3.1.1/tutorials/colors/colormaps.html.

 

 

 

 

 

 

 

Data Analysis and Visualization with Python

Hello, and welcome to one more beautiful day! In this article, we will continue coding in Python together. So what are we doing today? We will talk about one of my favorite topics, data analysis. You can get your dataset from sites such as Kaggle or UCI. For this article, I did some research and chose the Iris Flower Dataset for you.

The Iris flower dataset is a multivariate dataset presented by the British statistician and biologist Ronald Fisher in his 1936 article on the use of multiple measurements in taxonomic problems. It is sometimes referred to as the Anderson Iris dataset because Edgar Anderson collected the data to measure the morphological variation of Iris flowers of three related species. The dataset consists of 50 samples from each of three Iris species (Iris setosa, Iris virginica, and Iris versicolor).

Four properties were extracted from each sample:

    1. The length of the sepals in centimeters
    2. The width of the sepals in centimeters
    3. The length of the petals in centimeters
    4. The width of the petals in centimeters

This dataset has become a typical test case for many statistical classification techniques in machine learning, such as support vector machines.

Iris dataset

The visual you see above is also included in the notebook I created in Colab; it shows examples from the dataset. You can access the notebook via the Colab link at the end of the article. The Iris data is already established in the literature as one of the most frequently used and fundamental datasets in data science.

STEPS

✨ The necessary libraries must first be imported in Colab, and then the path of the dataset file must be specified. After that, you can print the df variable to see the dataset's contents, or use the df.head( ) command to access the first 5 rows.

Importing the Dataset and Libraries

Examining the Dataset
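As a rough sketch of what those notebook cells do (the file name here is only a placeholder; adjust the path to wherever you uploaded the Iris CSV in your Colab environment):

```python
import pandas as pd

# The file name below is an assumption; use your own path in Colab.
df = pd.read_csv("IRIS.csv")

# Print the df variable to see the dataset's contents
print(df)
```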

✨ Now let's run the df.head( ) command and see what kind of output we get.

The head( ) Command

✨ Above we see the values of the features in the dataset. Variables such as sepal_length and petal_width are numerical variables, while the flower-type feature referred to as species is a categorical variable. First of all, it is useful to know which type each variable falls into.

⚠️ If we want to predict the categorical variable, namely the flower species, from the numerical variables (the features from sepal_length to petal_width), this is a classification problem.
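Just to make that concrete, here is a tiny sketch of such a classification setup; it is not part of the original notebook, it assumes df is the dataframe loaded above, and it uses an SVM simply as one possible classifier:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Numerical columns are the predictors, species is the categorical target
X = df.drop(columns=["species"])
y = df["species"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = SVC()  # one possible classifier; any classifier would illustrate the point
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```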

Descriptive Statistics

✨ Descriptive statistics are printed with Pandas' describe method. If you want to follow along, you can consult the official Pandas documentation. This output reports how many values each feature contains, which also makes missing data visible, along with the standard deviation, mean, minimum, and maximum values of each feature.

Describe Method
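In code, this step looks roughly like this, assuming df is the dataframe loaded above:

```python
# Descriptive statistics for the numerical columns:
# count, mean, std, min, the 25% / 50% / 75% quartiles, and max
print(df.describe())

# A single statistic can also be pulled out directly, for example the
# standard deviation of sepal_length discussed below
print(df["sepal_length"].std())
```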

For example, in this data, the sepal_length feature has a count of 150 rows in total, and the standard deviation of these values is approximately 0.83.

⏳ The 25% and 75% values are known as quartiles. By examining these values, the spread of the data can be analyzed.

✨ To get general information about the dataset, the df.info( ) command should be run.

According to this information, we see that there is no row with an empty value. We also see that the numerical features are of float type.

✨ The df.isna( ) command checks whether there is missing data (NaN, Not a Number) in the dataset. We expect rows with missing data to show 'True'. However, as seen above, we do not have any missing data.

NaN Any

✨ The df.isna( ).any( ) command, when checking for missing data, returns True for a column if it contains even one missing value.

Not a Number Value
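Putting these three checks together, the commands look roughly like this:

```python
# General information: number of rows, column dtypes and non-null counts
df.info()

# Cell-by-cell check for missing values (True wherever a value is NaN)
print(df.isna().head())

# One True/False answer per column: does it contain at least one missing value?
print(df.isna().any())
```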

🖇 NOTE: You can reach the Colab notebook I mentioned above via the link.

In the second article of the series, I will touch on the finer points of data analysis and visualization. Stay healthy ✨

REFERENCES

  1. https://pandas.pydata.org/pandas-docs/stable/index.html.
  2. https://www.kaggle.com/arshid/iris-flower-dataset.
  3. Machine Learning Days | Merve Noyan | Data Visualization | Study Jams 2 |, https://www.youtube.com/watch?v=JL35pUrth4g.
  4. https://www.kaggle.com/peterchang77/exploratory-data-analysis.

 

Support Vector Machines Part 1

Hello everyone. Image classification is among the most common application areas of artificial intelligence. There are many ways to classify images, but in this blog I want to talk about support vector machines.

In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Since the algorithm does not require any information about the joint distribution function of the data, they are distribution-free learning algorithms. A Support Vector Machine (SVM) can be used for both classification and regression problems, but it is mostly used for classification.

How to solve the classification problem with SVM?

In this algorithm, we plot each data item as a point in n-dimensional space. Next, we classify by finding the hyperplane that separates the two classes well. The separating line is positioned between the two classes so that it lies as far as possible from the nearest elements of each class. SVM is a nonparametric classifier. It can classify both linear and nonlinear data, but it generally tries to separate the data linearly.

SVMs apply a classification strategy that uses a margin-based geometric criterion instead of a purely statistical one. In other words, SVMs do not need estimates of the statistical distributions of the classes in order to carry out the classification task; they define the classification model using the concept of margin maximization.

In the SVM literature, a predictor variable is called an attribute, and a transformed attribute used to define the hyperplane is called a feature. The task of choosing the most suitable representation is known as feature selection. A set of features that describes one case is called a vector.

Thus, the purpose of SVM modeling is to find the optimal hyperplane separating the vector sets, with the cases belonging to one category of the target variable on one side of the plane and the cases belonging to the other category on the other side.

Classification with SVM

The mathematical algorithms behind SVM were originally designed for the classification problem of two-class linear data and were later generalized to the classification of multi-class and nonlinear data. The working principle of SVM is based on finding the most suitable decision function that can distinguish the two classes, in other words, on defining the hyperplane that separates the two classes from each other in the most appropriate way (Vapnik, 1995; Vapnik, 2000). In recent years, intensive studies have been carried out on the use of SVMs in remote sensing, where they have been applied successfully in many areas (Foody et al., 2004; Melgani et al., 2004; Pal et al., 2005; Kavzoglu et al., 2009). In order to determine the optimum hyperplane, two hyperplanes parallel to it that mark its boundaries must also be determined. The points that lie on these boundary hyperplanes are called support vectors.

How to Identify the Correct Hyperplane?

It is quite easy to find the correct hyperplane with packages such as R and Python, but we can also find it manually with simple methods. Let's consider a few simple examples.

Here we have three different hyperplanes: a, b, and c. Now let's determine the correct hyperplane to classify the stars and the circles. Hyperplane b is chosen because it correctly separates the stars and circles in this graph.

If all of our hyperplanes separate classes well, how can we detect the correct hyperplane?

Here, maximizing the distance between the hyperplane and the nearest data points of either class will help us decide on the correct hyperplane. This distance is called the margin.

We can see that the margin of hyperplane C is large compared to both A and B. Hence, we choose C as the correct hyperplane.
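The article itself does not include code, but as a sketch of the same idea in Python with scikit-learn (my own choice of library here), a linear SVM can be fitted to two separable groups of points and the maximum-margin hyperplane and support vectors read off directly:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two linearly separable groups of points, standing in for the stars and circles
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.6, random_state=0)

# A linear SVM finds the maximum-margin hyperplane between the two classes
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The points lying on the margin boundaries are the support vectors
print("Number of support vectors per class:", clf.n_support_)

# For 2-D data the hyperplane is the line w1*x1 + w2*x2 + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
print(f"Hyperplane: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
```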

SVM for linearly inseparable data

In many problems, such as the classification of satellite images, it is not possible to separate the data linearly. In this case, the problem caused by some of the training data falling on the wrong side of the optimum hyperplane is solved by introducing positive slack variables. The balance between maximizing the margin and minimizing misclassification errors can be controlled by a regularization parameter denoted C, which takes positive values (0 < C < ∞) (Cortes et al., 1995). In this way, the data can be separated and a hyperplane between the classes can still be determined. Support vector machines can also perform nonlinear transformations mathematically with the help of a kernel function, thus allowing the data to be separated linearly in a higher-dimensional space.

For a classification process performed with support vector machines (SVM), it is essential to determine the kernel function to be used and the optimum parameters of this function. The most commonly used kernel functions in the literature are the polynomial, radial basis function, PUK, and normalized polynomial kernels.
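As a small illustration of choosing the kernel and its parameters (here with scikit-learn, which provides the linear, polynomial, and RBF kernels but not the PUK kernel), a simple grid search over kernels and C values might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate kernels and parameter values; the best combination depends
# entirely on the data set at hand
param_grid = {
    "kernel": ["linear", "poly", "rbf"],
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", "auto"],
}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy:", search.best_score_)
```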

SVM is used for tasks such as disease recognition in medicine, consumer credit decisions in banking, and face recognition in artificial intelligence. In the next blog, I will try to talk about applying SVMs in software packages. Goodbye until we meet again…

REFERENCES

  1. https://dergipark.org.tr/en/download/article-file/65371
  2. https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
  3. http://nek.istanbul.edu.tr:4444/ekos/TEZ/43447.pdf
  4. https://www.harita.gov.tr/images/dergi/makaleler/144_7.pdf
  5. https://www.slideshare.net/oguzhantas/destek-vektr-makineleri-support-vector-machine
  6. https://tez.yok.gov.tr/UlusalTezMerkezi/tezSorguSonucYeni.jsp#top2
  7. https://medium.com/@k.ulgen90/makine-%C3%B6%C4%9Frenimi-b%C3%B6l%C3%BCm-4-destek-vekt%C3%B6r-makineleri-2f8010824054
  8. https://www.kdnuggets.com/2016/07/support-vector-machines-simple-explanation.html

 

Basic Information About Feature Selection

Machine learning, deep learning, and artificial intelligence, which we now actively encounter in every part of our lives, are fields that almost everyone is working on, and whose predictions are measured with a success score. In business processes, machine learning has critical importance. The data that a company holds or collects itself arrives at the Feature Engineering stage, where it is carefully examined from many angles, prepared into its final form, and handed over to the person working as a Data Scientist, who can then draw inferences for the firm by making sense of the data. If the product or service developed is tested by offering it to customers and meets the required success criteria, its performance can be made sustainable. One of the most important steps here is the scalability of the product and how quickly it can be adapted to business processes. Another is to obtain the importance levels of the features, determined via correlation from the dataset, to make sense of them, and to have the Feature Engineer decide on them before the modeling phase. We can think of Feature Engineers as an additional force that accelerates and eases the Data Scientist's work.

 

 

When looking for a job, we may frequently come across 'Feature Engineer' postings. The critical information we learn from the data is obtained during the feature selection step of the data preparation phase. Feature selection methods aim to reduce the number of input variables to those believed to be most useful for a model in predicting the target feature. When applied sensibly during data pre-processing, feature selection greatly eases the work by reducing the workload as much as possible; as I mentioned, there is even a dedicated job role for it. Feature selection affects how well the data performs in modeling and directly affects the accuracy of the values to be predicted. For this reason, the most important part of the journey from raw data to product is the practitioner's decision about which features to choose. If that goes well, the product comes to life in a short time. Making statistical inferences from the data is as important as determining, through algorithms, which data matter and how much. In general, the science of statistics should play a role in data science processes.

 

 

There are also feature selection methods based on statistical filters, and the appropriate measure differs depending on the types of the features. Unfortunately, many people working in this field do not care enough about statistical significance; among some people working on data science and artificial intelligence, writing code is seen as the essence of the work. A dataset can contain categorical and numerical variables, and these are further divided within themselves: numerical features are known as integer and float, while categorical variables are known as nominal, ordinal, and boolean. You can find this summarized in the image I put below. These variable types are literally vital to feature selection. In line with the operations performed, these variables can be reviewed with a statistician during the evaluation phase, so that the analysis of the selected features rests on a solid foundation. One of the most necessary skills for people working in this field is the ability to interpret and analyze well; this way, they can present the data they prepare as products whose foundations match the logic behind them.
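As a small, purely illustrative sketch of these variable types in Python with pandas (the column names and values below are made up):

```python
import pandas as pd

# A toy dataframe mixing the variable types mentioned above (illustrative only)
df = pd.DataFrame({
    "age": [25, 32, 47],                      # numerical, integer
    "income": [3200.5, 5400.0, 7100.25],      # numerical, float
    "city": ["Ankara", "Izmir", "Istanbul"],  # categorical, nominal
    "education": ["BSc", "MSc", "PhD"],       # categorical, ordinal
    "is_customer": [True, False, True],       # categorical, boolean
})

# Separating numeric and non-numeric columns before feature selection
numeric_cols = df.select_dtypes(include=["number"]).columns
other_cols = df.select_dtypes(exclude=["number"]).columns
print(list(numeric_cols), list(other_cols))
```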

 

 

There is almost no single exact method. Feature selection for each dataset has to be evaluated with careful analysis, because the appropriate operations vary by feature: one dataset may contain mostly integer or float values, while another you are working on may be boolean. Therefore, the feature selection methods may differ for each dataset. The important thing is to adapt quickly, understand what the dataset offers us, and produce solutions accordingly. With this approach, the decisions taken during processing proceed on a healthier footing. Categorical variables can be assessed with methods such as the chi-square test; used well, this method is powerful and its efficiency can reach high levels. Feature selection throughout the product or service development stages is the most important step contributing to the success criteria of a model.
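As an example of such a statistical filter, here is a minimal scikit-learn sketch that scores features with the chi-square test and keeps the best ones (using the Iris data only because it is readily available):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-square score against the target
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print("Chi-square scores:", selector.scores_)
print("Selected feature indices:", selector.get_support(indices=True))
```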

 

References:

https://globalaihub.com/basic-statistics-information-series-2/

https://globalaihub.com/temel-istatistik-tanimlari-ve-aciklamalari/

https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/#:~:text=Feature%20selection%20is%20the%20process,the%20performance%20of%20the%20model.

https://www.istmer.com/regresyon-analizi-ve-degisken-secimi/

https://towardsdatascience.com/the-5-feature-selection-algorithms-every-data-scientist-need-to-know-3a6b566efd2


Data Labeling Tools For Machine Learning

The process of tagging data is a crucial step in any supervised machine learning project. Tagging is the process of defining regions in an image and describing which object belongs to each region. By labeling the data, we prepare it for ML projects and make it more interpretable. In most of the projects I have worked on, I created the sets in the dataset, did the tagging myself, and trained with the tagged images. In this article, I will introduce the data labeling tools I encounter most often, sharing my own experience in this field with you.
Labeling Image

📍COLABELER

Colabeler is a program that allows labeling for localization and classification problems. It is a labeling program frequently used in the fields of computer vision, natural language processing, artificial intelligence, and speech recognition [2]. The visual example you see below shows the labeling of an image; the classes you see here mostly correspond to the car class. In the tool section on the left side, you can mark objects with curves, polygons, or rectangles. This choice may vary depending on the boundaries of the data you want to tag.
Labeling Colabeler
Then, in the section labeled 'Label Info', you type the names of the objects you want to tag. After you finish all the tags, you save them by confirming with the blue tick button, and you can move to the next image with Next. Note that every image we save is listed to the left of this blue button, so it is also possible to review the images you have already annotated. One of the things I like most about Colabeler is that it can also use artificial intelligence algorithms.
📌 I performed tagging via Colabeler in a project I worked on before, and it is a software with an incredibly easy interface.
📽 The video on Colabeler’s authorized websites describes how to make the labeling.
Localization of Bone Age
Above, I gave a sample image from a project I worked on earlier. Because this is a localization project in the machine learning sense, the labeling was done accordingly. Localization means isolating the subregion of the image where a feature is located; for example, for this project, defining bone regions simply means drawing rectangles around the bone regions in the image [3]. In this way, I labeled the classes likely to be extracted from the bone images as ROI zones. I then exported these tags as XML/JSON using the export function provided by Colabeler. Many machine learning practitioners will like this part; it worked very well for me!

♻️ Export Of Labels

Exporting XML Output
At this stage, I saved the output as JSON because I will be working with JSON data, but you can save your data in different formats. In the image below, you can see where the classes I created appear in the JSON output. In this way, your data has been prepared in labeled form.
JSON Format
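Once the labels are exported, the JSON file can be read back in Python. The file name and field names below ("objects", "label", "bbox") are only placeholders, since Colabeler's exact export schema is not reproduced here; check the structure of your own file:

```python
import json

# Read the JSON file exported from Colabeler.
# NOTE: the file name and field names are assumptions for illustration.
with open("labels.json", "r", encoding="utf-8") as f:
    annotations = json.load(f)

# Print each labeled object's class name and bounding box
for obj in annotations.get("objects", []):
    print(obj.get("label"), obj.get("bbox"))
```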

📍ImageJ

ImageJ is a Java-based image processing program developed at the National Institutes of Health and the Laboratory for Optical and Computational Instrumentation (LOCI, University of Wisconsin). ImageJ’s plugin architecture and built-in development environment have made it a popular platform for teaching image processing [3].

Above, you can see a screenshot of ImageJ taken from Wikipedia. As can be seen, this software is not overly complex; it is a tool used in many areas regardless of profession. 📝 The documentation provided as a user's guide on the official ImageJ website describes how to perform labeling and how to use the software.
📌 I have also used the Fiji-ImageJ software for images that I had to tag in a machine learning project. I think its interface looks much older than the other labeling programs I have worked with. Of course, you can still carry out the operations you need from a software point of view, but for me, software should also satisfy the user from a design point of view.
Fiji-ImageJ
The image I gave above is a screenshot I took on my personal computer during the project I was working on. In order to use the data while working on the MATLAB platform, the software first had to be updated; after updating, I continued identifying the images. Below is the package that is installed when setting up the MATLAB plugin for ImageJ users.
ImageJ Matlab

📍Matlab Image Labeler

The Image Labeler app provides an easy way to mark rectangular region of interest (ROI) labels, polyline ROI labels, pixel ROI labels, and scene labels in a video or image sequence. For example, using this app you can [4]:

  • Manually label an image frame from an image collection
  • Automatically label across image frames using an automation algorithm
  • Export the labeled ground truth data

Image Toolbox Matlab
In the image above, we perform segmentation using the MATLAB Image Labeler software; more precisely, it is possible to label by dividing the data into ROI regions. In addition, you can use the built-in algorithms, or test and run your own algorithm on the data.
Selection ROI
In this image, taken from MATLAB's official documentation, the label names of the bounding regions you select are entered in the menu on the left, and a label color is assigned according to the class of the object. It is quite possible for us to create our labels in this way as well. In the next article, I will talk about other labeling tools. Hope to see you ✨

REFERENCES
  1. https://medium.com/@abelling/comparison-of-different-labelling-tools-for-computer-vision-f3afd678da76.
  2. http://www.colabeler.com.
  3. From Wikipedia, The Free Encyclopedia, ImageJ, https://en.wikipedia.org/wiki/ImageJ.
  4. MathWorks, Get Started with the Image Labeler, https://www.mathworks.com/help/vision/ug/get-started-with-the-image-labeler.html.
  5. https://chatbotslife.com/how-to-organize-data-labeling-for-machine-learning-approaches-and-tools-5ede48aeb8e8.
  6. https://blog.cloudera.com/learning-with-limited-labeled-data/.

Data Mining and Being a Data Miner

Hello everyone. As a statistician, I can say that most statisticians dream of becoming a data miner, but the road to get there is long and bumpy. According to Google Trends data, "data mining" and "data miner" searches in Google Web Search are very popular around the world. So what makes data mining so attractive?
Today, the sheer volume of data and the difficulty of extracting the required information from it have increased the need for data mining.
Data mining is an automatic or semi-automated technical process used to analyze and interpret large amounts of scattered information and turn it into knowledge. Data mining is frequently used in marketing, retail, banking, healthcare, and e-commerce.
Stages of Data Mining

We can summarize the data mining process basically as follows (a small preprocessing sketch follows the list):

  1. Obtain and secure the data stack
  2. Smoothing
  3. Dummy variable creation and optimization
  4. Data Reduction
  5. Normalization
  6. Applying Related Data Mining Algorithms
  7. Testing and training results in related software languages (R, Python, Java)
  8. Evaluation and presentation of results
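As a small, illustrative Python sketch of the preprocessing stages above (smoothing, dummy coding, and normalization) on a made-up dataframe:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# A toy dataframe standing in for the raw data stack (illustrative only)
df = pd.DataFrame({
    "age": [25, None, 47, 38],
    "city": ["Ankara", "Izmir", "Istanbul", "Izmir"],
    "income": [3200, 5400, 7100, 4800],
})

# 2. Smoothing: fill the missing age with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# 3. Dummy coding: turn the categorical city column into 0/1 columns
df = pd.get_dummies(df, columns=["city"])

# 5. Normalization: scale the numeric columns into the 0-1 range
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])

print(df)
```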

Becoming a data miner requires programming, mathematics, statistics, machine learning, and some personal skills. Let's examine these requirements in a little more detail together.
1) Programming:

  • Algorithmic approach
  • Programming logic
  • Big data technologies (Spark, Hive, Impala, DBS, etc.)
  • SQL (databases), NoSQL, Bash Script, R, Python, Scala, SPSS, SAS, MATLAB, etc.
  • Cloud technologies (AWS, Google Cloud, Microsoft Azure, IBM, etc.)

2) Statistical Learning (SL):

  • Tidy data process and data preprocessing
  • Regression Models
  • Linearity and causality
  • Inference Statistics
  • Multivariate Statistical Methods

3) Machine Learning (ML)

  • Classification
  • Clustering
  • Association Rule Learning
  • Text Mining, NLP
  • Reinforcement Learning
  • Deep Learning

4) Personal Skills

  • Being Able To Ask The Right Questions
  • Analytical Perspective
  • Problem Solving Ability
  • Storytelling and presentation ability

In summary, in this blog we talked briefly about the definition, stages, and requirements of data mining. Hope to see you in the next one.
REFERENCES
https://iskulubu.com/teknoloji/veri-madenciligi-data-mining-nedir/
https://vizyonergenc.com/icerik/5-temel-soruda-veri-madenciligi-data-mining-nedir
https://www.veribilimiokulu.com/nasil-veri-bilimci-olunur/
https://trends.google.com/trends/explore?q=data%20mining,data%20miner
https://www.dreamstime.com/four-stages-data-mining-process-image194483251
https://www.kozmoslisesi.com/veri-madenciligi-data-mining-nedir/