Data Analysis and Visualization with Python – 2

We continue making visualizations on the Iris dataset I used in my previous article. Two libraries are used most frequently for data visualization. The first, Matplotlib, is known by many people, myself included; the second is Seaborn. In this article, we will see how data can be visualized with the help of these libraries.

🔐 You can open the Colab notebook I use via the link.

Data Visualization Libraries

1. Seaborn: Statistical Data Visualization Library

Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Visit the installation page to see how you can download the package and start using it.

Seaborn

We can say that its difference compared to Matplotlib is that it offers more ready-made customization options.

Seaborn Samples

In the image above, we see how data can be visualized thanks to Seaborn. It is possible to display our data in many different chart types and forms.

2. Matplotlib: Visualization with Python

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Matplotlib Logo

Matplotlib was originally written by John D. Hunter and has had an active development community ever since.

Plots

Likewise, the visual given here shows the kinds of plots that can be made with Matplotlib.

🧷 Click the link to view the plot types, or graphics, available in the Matplotlib library.

  • Line Plots: show the relationship between two variables as a continuous line.

Line plots

  • Scatter Plots: as the name suggests, the relationship between two variables is shown as individually scattered points. A sketch of both plot types follows the image below.

Scatter Plots
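As a minimal sketch of both plot types (the sine-curve data here is invented purely for illustration and is not part of the article's dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: 100 evenly spaced x values and a smooth function of x
x = np.linspace(0, 10, 100)
y = np.sin(x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Line plot: the relationship between x and y drawn as a continuous line
ax1.plot(x, y)
ax1.set_title("Line plot")

# Scatter plot: the same relationship shown as individual points
ax2.scatter(x, y)
ax2.set_title("Scatter plot")

plt.show()
```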

✨ I wanted to use the Seaborn library to examine the relationship between the variables in the Iris dataset.

Importing Seaborn

After importing the Seaborn library into our project, we produce the graph by passing various parameters. Here we compare the relationship between the sepal_length and petal_width attributes of the dataframe. The cmap parameter determines the color palette used in the chart and can be changed as desired. Another parameter sets the size of the points shown in the scatter plot.
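Since the notebook's code is not reproduced here, below is a minimal sketch of how such a plot can be produced. The column names come from the Iris dataframe; the specific colormap ("viridis"), the point-size value, and the use of petal_length for the colors are my own assumptions:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Seaborn ships a copy of the Iris dataset, so no local file is needed here
df = sns.load_dataset("iris")

sns.set()  # apply Seaborn's default styling to Matplotlib figures

# Scatter plot of sepal_length against petal_width;
# c= colors each point by petal_length, cmap= picks the palette,
# and s= controls the size of the points.
plt.scatter(df["sepal_length"], df["petal_width"],
            c=df["petal_length"], cmap="viridis", s=40)
plt.colorbar(label="petal_length")
plt.xlabel("sepal_length")
plt.ylabel("petal_width")
plt.show()
```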

Data Visualization

We have come to the end of another article. Stay healthy ✨

REFERENCES

  1. https://seaborn.pydata.org.
  2. https://matplotlib.org.
  3. Machine Learning Days | Merve Noyan | Data Visualization | Study Jams 2 |, https://www.youtube.com/watch?v=JL35pUrth4g&t=640s.
  4. Matplotlib, Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Matplotlib.
  5. https://jakevdp.github.io/PythonDataScienceHandbook/04.02-simple-scatter-plots.html.
  6. https://jakevdp.github.io/PythonDataScienceHandbook/04.01-simple-line-plots.html.
  7. https://matplotlib.org/3.1.1/tutorials/colors/colormaps.html.

Data Analysis and Visualization with Python

Hello, one more beautiful day! In this article, we will continue coding in Python together. So what are we doing today? We will talk about one of my favorite topics: data analysis. You can get your dataset from sites such as Kaggle or UCI. For this article, I researched the Iris Flower Dataset and chose it for you.

The Iris flower dataset is a multivariate dataset presented by the British statistician and biologist Ronald Fisher in his 1936 article on the use of multiple measurements in taxonomic problems. It is sometimes referred to as the Anderson Iris dataset because Edgar Anderson collected the data to measure the morphological variation of Iris flowers of three related species. The dataset consists of 50 samples from each of three Iris species (Iris setosa, Iris virginica, and Iris versicolor).

Four properties were extracted from each sample:

    1. The length of the sepals in centimeters
    2. The width of the sepals in centimeters
    3. The length of the petals in centimeters
    4. The width of the petals in centimeters

This dataset became a typical test case for many statistical classification techniques in machine learning, such as support vector machines.

Iris dataset

The visual you see above is also included in the notebook I created in Colab; it shows examples from the dataset. You can access the notebook via the Colab link at the end of the article. The Iris dataset is already established in the literature as one of the most frequently used and most fundamental datasets in the field of data science.

STEPS

✨ The necessary libraries must first be imported in Colab, and then the path of the dataset file must be specified. After that, you can print the df variable to see the dataset's contents, or use the df.head() command to access the first 5 rows. A minimal loading sketch follows the images below.

Importing the dataset and the libraries

Examining the dataset
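A minimal sketch of this step is below; the file name IRIS.csv and its /content path are assumptions, so adjust them to wherever you uploaded the dataset in Colab:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read the CSV file uploaded to Colab (assumed path; change it to your own)
df = pd.read_csv("/content/IRIS.csv")

# Print the whole dataframe, or just the first 5 rows with head()
print(df)
print(df.head())
```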

✨ If you wish, let's run the df.head() command and see what output we get.

The head command

✨ Above we see the values of the features in the dataset. Variables such as sepal_length and petal_width are numerical variables, while the flower-type feature, called species, is a categorical variable. First of all, it is useful to know which type each variable falls into.

⚠️ If we want to predict the categorical variable, namely the type of flower, from the numerical variables (the features from sepal_length to petal_width), this is a classification problem.

Descriptive Statistics

✨ Descriptive statistics are printed with Pandas' describe method. If you want to follow along, you can consult the official Pandas documentation. This output reports how many values each feature contains, which also makes it possible to spot missing data, and shows the standard deviation, mean, minimum, and maximum values of the features. A short sketch follows the figure below.

Describe Method
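A minimal sketch, assuming the df dataframe loaded earlier:

```python
# Descriptive statistics: count, mean, std, min, quartiles, and max per column
print(df.describe())
```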

For example, in this data, the sepal_length feature has a count of 150 rows in total, and the standard deviation of these values is approximately 0.83.

⏳ The 25% and 75% values are known as quartiles. By checking these values, the data can be analyzed.

✨ To get general information about the dataset, the df.info() command should be run.

According to this information, we see that there is no row with a missing value. In addition, we learn that the numerical features are of float type.
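A minimal sketch of this step, again assuming the df loaded earlier:

```python
# Column names, non-null counts, and data types of the dataframe
df.info()
```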

✨ The df.isna() command checks whether there is missing data (NaN, Not a Number) in the dataset. We expect entries with missing data to be shown as 'True'. However, as we saw above, we do not have any missing data.

NaN Any

✨ When checking for missing data, the df.isna().any() command returns True for a column if it contains even a single missing value.

Not a Number Value
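A minimal sketch of both checks, assuming the df loaded earlier:

```python
# Element-wise check: True wherever a value is missing (NaN)
print(df.isna().head())

# Per-column check: True if a column contains at least one missing value
print(df.isna().any())
```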

🖇 NOTE: Click the link to open the Colab notebook I mentioned above.

In the second article of the series, I will touch on the finer points of data analysis and visualization. Stay healthy ✨

REFERENCES

  1. https://pandas.pydata.org/pandas-docs/stable/index.html.
  2. https://www.kaggle.com/arshid/iris-flower-dataset.
  3. Machine Learning Days | Merve Noyan | Data Visualization | Study Jams 2 |, https://www.youtube.com/watch?v=JL35pUrth4g.
  4. https://www.kaggle.com/peterchang77/exploratory-data-analysis.

 

Importance of Data Quality and Data Processing

The subject that the whole world talks about, and that is now seen as the most important element of the new order, is data. Data is processed in many different ways and prepared so that information can be extracted from it. On its own, it is a force that adds a new dimension to the world and changes its direction. Today, companies effectively exist only to the extent of the knowledge they hold. Ready-made data may be inferior to data you have collected yourself and whose details you know; because of this, you may spend a lot of time on data processing and extend the project timeline, which can be a big disadvantage for you. It is entirely up to you to measure the quality of the incoming data and arrange it into a certain order. If the data quality is really poor, it can still be integrated into the system once the final preparations are made by carefully applying the data processing steps to it.
 
The biggest mistake made by beginner-level developers is to work only with data that has already been cleanly prepared. To level up, you can create a dataset yourself and analyze it. While this builds self-confidence, it is the solutions you find in the face of the difficulties you encounter that will move you forward the most, so that you develop the 'problem-solving ability' that many big companies care about. Dealing with data you collect yourself will prepare you for real-life problems. People who want to pursue a career in data science should solve a real problem by collecting their own data and shaping it until they can finally reach the product stage. Thanks to the project phases they have gone through, they can comfortably continue their careers with a high level of experience in processing information, developing products, and finding solutions to real-life problems.
 

 
The most important issue for the data scientist is data. If there is no data, no solution can be found, and the people who hold the data will hold the power in the new era. In the future world order, we can say that data is what will fully steer the world. Data flows live at every stage of life, and processing it and making logical inferences from it is an extremely important skill for the century we live in. Understanding the information obtained from the data well and finding solutions to the problems that may arise is another skill that will make it easier to find a job in the future. The most important prerequisite for any artificial intelligence work is a project and the existence of quality data for that project. Data quality largely determines how long the project will take to form and how far it can ultimately go. There is no matter as important as data quality, because if the data is of poor quality, many problems will follow.
 
Another issue, which is as important as data quality, is performing the data processing steps correctly. Data science, machine learning, deep learning, or artificial intelligence, whatever you call it, all these fields need data in order to become a product. Moreover, the quality of this data and how well the data processing steps are prepared directly affect the progress of the projects that go by these names. The most critical step for topics that will be presented as products is getting through the data processing stage. After you have overcome such vital points, you can move quickly by applying mathematical, engineering, or statistical knowledge in the part of the work that turns it into a product. This accelerates your project and motivates you. Thus, you can move to a different dimension by acting with the motivation you gain and the momentum it gives you.
 

 
The conditions of the world will continue to change throughout the century we live in. It is data itself that is destined to lead this change. Data, called the new oil, is literally the petroleum of the new century. Processing it and obtaining logical results from it is everyone's main goal. People working in this field must have strong numerical knowledge and experience in data processing. They should benefit the teams they work with by actively using their problem-solving intelligence from the first moment they receive the data. Data processing is a technique that has the power to change success scores in machine learning and deep learning. If used correctly, it can easily reach the maximum achievable score levels.
 
What has driven the development of smart systems and their full penetration into our lives is quality data. If you want to produce quality work for the project at hand, you must first collect your data on solid, quality foundations. If that is not possible, you can still keep your project on track by performing a very thorough data processing phase before the project begins. This both saves you time and gives you confidence in the quality of the data you will be working with, so you will only need to solve minimal data problems during the project steps. Data quality is the life source of projects. People who have had the opportunity to work with good data know exactly what I mean. Remember: good data means a good project, a good workflow, and good results.

 
I hope you liked it. If you did, you can let me know in the comments.
 
REFERENCES

  1. https://www.wired.com/insights/2014/07/data-new-oil-digital-economy/#:~:text=Like%20oil%2C%20for%20those%20who,the%20government%20to%20local%20companies.
  2. http://globalaihub.com/basic-statistics-information-series/
  3. http://globalaihub.com/python-ile-veri-on-isleme-data-preprocessing/
  4. https://searchdatamanagement.techtarget.com/definition/data-quality.