I have often explained how important data is for machine learning projects, and in this article I am going to cover one of the most important and necessary steps of data preparation.
Data preparation is the main step in our projects. No matter which programming language or algorithm we use, if our data is not good, there is no way to improve our accuracy score. Data exploration consists of handling missing values, detecting and treating outliers, variable identification, etc. Today’s data exploration topic is outlier detection in Python.
What Is an Outlier and Why Do We Need to Detect Outliers?
An outlier is any data point that differs greatly from the rest of the observations in a dataset. Outliers exist for different reasons: for example, an analyst made a data entry error, a machine made an error, or an analyst entered false information intentionally. Outliers can distort statistical models and keep us from getting accurate results. There are different ways of detecting and treating outliers; among the most common are the box-plot method for detection and removal for treatment. However, simply removing outliers from our data without considering how that will change the results can be a serious mistake. Also, real-life datasets usually have many variables, which is why working on multivariate outliers is more valuable.
Multivariate Outliers Analysis
Local Outlier Factor Method
First, I am going to import the libraries and load the data. I found this dataset on Kaggle, and it is food-related:
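The loading step can be sketched as below. Since the Kaggle file is not named here, this is only a minimal sketch that reads a tiny in-memory CSV as a stand-in; the column names and values are made up for illustration. With the real file, you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the food-related Kaggle CSV.
# These columns and values are invented for illustration only.
csv_text = """name,category,calories,protein_g
apple,fruit,52,0.3
bread,grain,265,9.0
cheese,dairy,402,25.0
"""

# With the real dataset, replace the buffer with the path to the CSV file.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # number of rows and columns
print(df.dtypes)  # shows which columns are numeric and which are text
```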
Here, I only want to use the float and integer variables because I do not need the categorical ones.
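One way to keep only the numeric columns is `select_dtypes`. The sketch below uses a small made-up DataFrame (hypothetical column names). Dropping rows with missing values is my own assumption, added because scikit-learn estimators generally do not accept NaN.

```python
import pandas as pd

# Hypothetical data: two text columns and two numeric columns.
df = pd.DataFrame({
    "name": ["apple", "bread", "cheese"],
    "category": ["fruit", "grain", "dairy"],
    "calories": [52, 265, 402],
    "protein_g": [0.3, 9.0, None],
})

# Keep only the float and integer columns; text columns are dropped.
numeric_df = df.select_dtypes(include=["float64", "int64"])

# Assumption: remove rows with missing values before outlier detection.
numeric_df = numeric_df.dropna()

print(numeric_df)
```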
The Local Outlier Factor method gives us a score for each observation. These scores are what we use to set a threshold, and I am going to walk through the steps up to determining that threshold value, importing the necessary libraries as well.
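Computing the scores might look like the sketch below. It uses scikit-learn's `LocalOutlierFactor` on synthetic data with one planted outlier, not the Kaggle dataset. `n_neighbors=20` is scikit-learn's default and is an assumption about the setting used here.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: 200 points around the origin plus one planted outlier.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X[0] = [10.0, 10.0]

# n_neighbors=20 is the library default; tune it for your own data.
clf = LocalOutlierFactor(n_neighbors=20)
labels = clf.fit_predict(X)  # -1 marks outliers, 1 marks inliers

# One score per observation; the more negative, the more abnormal.
scores = clf.negative_outlier_factor_

print(scores[:5])
```

Scores close to -1 indicate points whose local density matches that of their neighbors, while strongly negative scores indicate isolated points.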
As you can see, there are values between 0 and 20. Among these values, we need to pick our threshold score, and the choice is somewhat arbitrary. Before deciding, however, check whether there are large jumps between consecutive values. Based on the jumps, the third value is suitable for our case.
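The idea of looking for jumps can be illustrated with a few hand-written scores. The numbers below are hypothetical, chosen so that the sorted values flatten out from the third one onward; they are not the scores from the dataset.

```python
import numpy as np

# Hypothetical LOF scores (negative outlier factors), for illustration only.
scores = np.array([-1.02, -6.40, -0.98, -3.10, -1.05,
                   -1.60, -1.01, -1.55, -0.99, -1.10])

# Sort ascending so the most abnormal observations come first.
sorted_scores = np.sort(scores)

# Gaps between consecutive values among the five lowest scores.
lowest = sorted_scores[:5]
jumps = np.diff(lowest)
print(lowest)
print(jumps)

# The gaps shrink sharply after the second value, so the third sorted
# score is taken as the threshold.
threshold = sorted_scores[2]
print(threshold)
```

Plotting the sorted scores makes the same jumps visible as an elbow in the curve.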
These are the outliers:
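Putting the steps together, selecting the outliers could be sketched as follows. This runs on synthetic data with two planted outliers and hypothetical column names; treating observations with a score strictly below the third sorted score as outliers is my reading of the threshold step, not a confirmed detail.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Synthetic numeric data with two planted outliers in rows 0 and 1.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["feature_a", "feature_b"])
df.iloc[0] = [10.0, 10.0]
df.iloc[1] = [-9.0, 9.0]

# Fit LOF and read one score per row.
clf = LocalOutlierFactor(n_neighbors=20)
clf.fit_predict(df)
scores = clf.negative_outlier_factor_

# Use the third lowest score as the threshold (assumption, see lead-in).
threshold = np.sort(scores)[2]

# Rows scoring below the threshold are treated as outliers.
outliers = df[scores < threshold]
print(outliers)
```

Before dropping these rows, it is worth checking how the model results change with and without them.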
We can encounter unusual data such as outliers, and there are different techniques for univariate and multivariate data that can be used to detect and remove them. In this article, I just wanted to show one of the most used techniques for multivariate outliers.