Classification of IMDB Data : Binary Classification

Hello everyone, in this article we will go with you a little deeper learning basics. In fact, we will work with you so that we will have a long-term adventure together with the data set we have, from pre-preparation to training. In fact, we will work with you so that it will be a long-term adventure with the data set, from pre-preparation to training. If you have any prior knowledge of binary classification, let us continue on our way. But if you are not interested in the previous or if you have a small question in your mind, I would definitely recommend reviewing the link I left. The reason why the data set we will examine today is recognized as a binary classification problem is because we have two classes (positive and negative).

It is important to make a good sense of the data when classifying. Before starting a business, you should always examine your data well. We will work together to code the IMDB data set from the ready-made data sets provided by the Keras library in Python programming language. For this project, instead of Jupyter, I will use the free version of Colab, which Google offers us as an opportunity. For users using Jupyter notebook, this platform will come very hot. You can access your Colab Notebook reviews with a click.

The data set to be used contains a total of 50,000 pieces of data. Let’s start by creating a notebook as seen in the image and import the data set from the Keras library. We then need to separate the cluster for this data set to be used in the training and testing process.

So why don’t we just use the data set as a training set? Because you need to test with a test set whether the data you are training gives accurate results. If you try to test the training data with the training set, this is not the right method. Because when you show the machine something it has never learned, it will be confused and stunning. You can think of it just like a person entering the exams by memorizing it. You will be surprised when you encounter a different problem in the exam only when you memorize the given problems. Because memorizing is not a good method. As in real life, of course, we are not in favor of memorizing the machines. In addition, there are many ways to overcome this overfitting problem, but it is, of course, necessary to continue this situation.

Based on the film data, we will classify it as positive or negative. According to the notes I read, the IMDB data set downloads approximately 80 MB of data on the first installation, so it will vary according to the speed of the computer, CPU, or GPU usage. Wait patiently ✨

As I said above, it is very important to recognize the data set. For this reason, 50,000 data is divided into 25,000 training and 25,000 test sets. Both datasets contain 50% positive and 50% negative data. Let’s keep this information in mind.

If you notice when loading our data set with the load_data method, it contains the num_words parameter. This parameter holds the number of the most commonly used words.

View of the list variable that hosts train_data movie data

List variable of 0 and 1s – 1 positive data

List variable of 0 and 1s – 0 negative data

In the codes shown above, the labels of the 0th training and test data represent. When the label of 500th data is wanted to be printed instead of the 0th indexes, it will result as follows.

Now we’ll do a little data classification. We’ll create a maximum for loop for the parameter selected above as 15000.

You can specify the maximum number of iterations with the Python-specific method. In this way, 15000 limits will not be exceeded.

The convertWords is created and the index numbers in the data are converted back into words. Let’s go back to small operations on the data.

If you notice in the variable convertWords, the index values are taken by omitting three by three. The reason is that indices 0, 1, and 2 are parsed to encode unnecessary data.


Leave a Reply