Data Preprocessing with Python

Machine learning and artificial intelligence continue to take over the world unabated. To keep up with the times, we continue our work, and we do it with pleasure. Join me on this enjoyable journey and let's take our first steps slowly, together, just like a newborn baby learning to walk a little at a time 👶

As you know, Kaggle has become an indispensable resource for data. I get many of the data sets I work on from Kaggle for free, and I recommend it to you. You can also develop your skills by participating in the many competitions on Kaggle that I find useful, and you can learn from other people working on the same dataset. We keep talking about data, but how can we actually use this data in machine learning?

For machines to make sense of data, we need to do for them what humans do when they try to make sense of it. Let me elaborate with a small example. Today, object detection is almost everywhere! Shopping malls, smart devices like phones and computers, cameras and more 🛒 🤳 Many technological devices you can think of! Have you noticed? For a long time now, Facebook has been tagging the photos you add to the platform automatically, without you doing anything. This is where object recognition, and with it AI, comes into play. When we analyze this process through the eyes of the machines, we can perhaps infer some specific features of the data. And of course, that data first needs to be processed 🌀

While processing image data in a project I was working on, I encountered NaN (Not a Number) values and uncertain results. I then realized this was not actually a problem; I just needed to explore something new 🚀 Later in this post I will make sense of these seemingly meaningless words, so let's continue on our way quickly. I will use Python as the programming language and Jupyter as the platform 🐍 Let's go on this expedition together!


The first step is to import Python's Pandas 🐼 and NumPy 📊 libraries. I recommend that you examine these libraries, because they are indispensable for anyone who will work in this field. Pandas is a powerful software library dedicated to data science in Python. It makes our work easier by providing structures called data frames, numerical tables we can use to examine and process data.
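A minimal sketch of the imports, using the conventional aliases that the rest of this post assumes:

```python
# Import the two libraries under their conventional aliases.
import pandas as pd
import numpy as np
```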


You can access the data set I use from Kaggle for free via the link above. The reason I chose this data set is that it contains NaN values, so we can perform preprocessing by cleaning the data. Since we imported the Pandas library under the abbreviation pd, we can now call it as pd everywhere in the code. Because the data set I will work on is a CSV file, we obtain the data with the read_csv( ) method.
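A minimal sketch of loading the file; the file name here is a placeholder, substitute the CSV you downloaded from Kaggle:

```python
# Hypothetical file name -- replace with your Kaggle CSV.
df = pd.read_csv("dataset.csv")

# A quick first look at the data frame.
print(df.head())
```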

🚩 The df.info( ) method gives a summary of the data set: the column names, their data types (dtypes), the number of non-null values, and the memory in use.

🚩 Statistical summaries are produced with the df.describe( ) method.

🌟 One problem frequently behind a drop in accuracy in ML projects is the failure to clean up missing (incomplete) or incorrectly entered data. With the df.isna( ) method, we identify the NaN (Not a Number) values so that they do not pose a problem later. I would like to draw your attention to the True and False values in the image above: the result is True for each cell that holds a NaN value and False for each cell that does not. If you want to see how many NaN values each column (feature) contains in total, you can chain the sum( ) method, as in the sketch below.
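The three methods above, assuming df is the data frame loaded earlier:

```python
# Column names, dtypes, non-null counts and memory usage.
df.info()

# Descriptive statistics: count, mean, std, min, quartiles, max.
print(df.describe())

# Boolean mask: True where a cell is NaN, False otherwise.
print(df.isna())

# Number of NaN values per column.
print(df.isna().sum())
```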

For example, when the output in row 13 is examined, we see that there are no NaN values in the column named Date, but 61154 of them in the column named down. Since these cells do not hold numerical values, they prevent us from reaching the desired result. We will use the dropna( ) method to eliminate them.

🍂 Let's look at the syntax of dropna( ), which you can also find on the official Pandas site. First, the axis parameter selects where NaN values will be dropped: axis=0 operates on rows (the X axis) and axis=1 on columns (the Y axis). When how='any' is selected, a row or column is deleted if it contains even a single NaN value; when how='all' is selected instead, it is deleted only if all of its values are NaN. The thresh parameter sets a threshold: it is the minimum number of non-NaN values a row or column must contain in order to be kept.

In the line of code seen above, the thresh parameter acts as a threshold for us. In other words, any column (axis=1) containing fewer than 100 non-NaN values will be deleted, as in the sketch below.
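A short sketch of the three parameters in action; the threshold of 100 matches the example above, the rest are illustrative:

```python
# Drop every row that contains at least one NaN value.
df_rows_cleaned = df.dropna(axis=0, how="any")

# Drop a column only if ALL of its values are NaN.
df_cols_cleaned = df.dropna(axis=1, how="all")

# Keep only columns with at least 100 non-NaN values;
# columns below this threshold are removed.
df_thresh = df.dropna(axis=1, thresh=100)
```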

🚩 With the df.drop( ) method, it is possible to remove exactly the columns or rows you want to delete. I want to completely remove a column with a lot of NaN values, because that column will not help much when producing results. For this reason, we can eliminate it by passing the column name as labels together with the axis I want to delete along. If you want more detail on this subject, you can find it at the link.

If we want to check whether the deleted column still exists, you can easily observe that the column I marked in orange is no longer included in the next image. A sketch of the whole step follows.
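A minimal sketch; the column name here is hypothetical, substitute the NaN-heavy column you want to remove:

```python
# Hypothetical column name -- replace with the column to remove.
df = df.drop(labels="unhelpful_column", axis=1)

# Verify that the column is gone.
print("unhelpful_column" in df.columns)  # False
print(df.columns)
```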

Imputation Of Missing Values

🍂 Scikit-learn is a machine learning library written in the Python programming language. SimpleImputer is a transformer class from the sklearn library that we use to replace missing values.

⬅️ The image on the side contains the signature of the SimpleImputer class. Missing data can be repaired by supplying the necessary parameters.

The values to be replaced are specified with the missing_values parameter. Since we want to target NaN values, we can pass np.nan from the NumPy library. The next parameter, strategy, decides which of 4 strategies will be used, that is, according to which method the values will be converted. These strategies are listed below, followed by a short sketch:

🔓 The “mean” strategy replaces missing values with the mean of each column. Available only with numeric data.

🔓 The “median” strategy replaces missing values with the median of each column. Available only with numeric data.

🔓 The “most_frequent” strategy replaces missing values with the most frequently occurring value in each column. Can be used with string or numeric data.

🔓 The “constant” strategy replaces missing values with fill_value. Can be used with string or numeric data.
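A minimal sketch of how these parameters map to code; the fill_value of 0 is only an example:

```python
from sklearn.impute import SimpleImputer
import numpy as np

# Replace NaN values with the column mean (numeric columns only).
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Or a constant fill instead: every NaN becomes 0.
constant_imputer = SimpleImputer(missing_values=np.nan,
                                 strategy="constant", fill_value=0)
```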

We will use SimpleImputer to convert the NaN values in the data set into more accurate values. It is usually more appropriate to fill NaN-valued columns with sensible values than to remove them entirely, as we did with the column we dropped earlier. Therefore, we will select the down column and fill in the missing data with the mean strategy. Let's do it together 🏋

⬇️ The NaN values in the down column in the image below will be filled in with the mean of that column.

Column View With Missing Data
Status Of Missing Data After Conversion
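A minimal sketch of the conversion, assuming df still contains the down column from the dataset:

```python
from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform expects a 2-D array, so select the column
# with double brackets to keep it as a data frame.
df[["down"]] = imputer.fit_transform(df[["down"]])

# No NaN values should remain in the column.
print(df["down"].isna().sum())  # 0
```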

We are getting closer to our target, step by step. I will touch on the other data preprocessing steps in my next post. Hope to meet you there ✨

