Is Data “Science” Really Science?

With the development of technology, many new terms enter our lives, applications diversify along with them, and professions start to thrive around these developments. In particular, the concept of “Data Science” has emerged in the information age we live in. It is often said that the salaries of the people who work with big data are as big as the data itself. This profession, which attracts attention for reasons beyond its salaries, was named “The Sexiest Job of the 21st Century” by the Harvard Business Review in 2012. However, the world has not reached full agreement on this concept. On the one hand, some people argue that data science is not really “science” and that the name is misleading; on the other hand, there are people who argue that it is a true science.

So, who is right? Can we say that data science is science, and is there an absolute answer to this question? Let’s examine it together.

What is Data Science and who is a Data Scientist?

Data science is a field that examines structured or unstructured data using scientific methods, processes, and algorithms. Structured data is highly organized and follows a fixed format, whereas unstructured data can be described as irregular.

Figure 1: Structured Data vs Unstructured Data
https://www.igneous.io/blog/structured-data-vs-unstructured-data
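To make the distinction concrete, here is a minimal Python sketch. The sample records, the review text, and every name in it are made up for illustration. The structured part has a fixed schema, so a question like "what is the average age?" is a one-line computation; the unstructured part is free text, and any value has to be dug out with a heuristic that may well be wrong.

```python
import csv
import io
import re

# Structured data: every record follows the same fixed schema (hypothetical sample).
structured_csv = "customer_id,age,city\n1,34,Ankara\n2,27,Izmir\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Because the "age" column is known in advance, aggregation is straightforward.
average_age = sum(int(row["age"]) for row in rows) / len(rows)

# Unstructured data: free text with no schema (hypothetical sample).
review = "I am 34 and live in Ankara. Delivery was slow but the product is great!"

# There is no "age" field to read; we can only guess that a short number is an age.
ages_found = [int(number) for number in re.findall(r"\b\d{1,3}\b", review)]

print(average_age)
print(ages_found)
```

The regular expression is only a rough guess at where an age might appear; real unstructured data usually needs far more careful processing before it can be analyzed.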

Data science was defined by Chikio Hayashi in 1998 as a concept that combines statistics, data analysis, machine learning, and similar methods. Today, data science is regarded as a multidisciplinary field that combines mathematics, statistics, and computer science.

Although data science is a well-accepted field today, it is still the subject of many controversies. The most prominent of these is the claim that the field is not new and is merely another name for statistics. Many statisticians advocate this view, while others argue that statistics is not an indispensable part of data science.

We can briefly define data scientists as the people who practice this “science”. Data scientists analyze structured and unstructured data. They draw meaningful results from these data for the institutions they work for, and the institutions prepare appropriate action plans in line with those results. According to Glassdoor data, the average annual income of data scientists is about 113 thousand dollars. Their salaries give an idea of how important these people are to companies.

In data science, as in academic research, particular methods are used to reach meaningful results. Let’s examine the CRISP-DM methodology, one of the most popular of these methods.




Figure 2: The CRISP-DM Methodology
https://www-01.ibm.com/events/wwe/grp/grp304.nsf/vLookupPDFs/Polong%20Lin%20Presentation/$file/Polong%20Lin%20Presentation.pdf

The first stage, “business understanding”, clarifies what you intend to do as a company or organization and what your road map is for that purpose. It addresses the issues that need to be resolved in order to make progress.

The second stage, “analytic approach”, frames the identified problem in terms of statistics and machine learning by asking questions such as “How can I offer customers more customized products?” or “Does this patient have disease x or disease y?”

The three stages following the analytic approach (data requirements, data collection, data understanding) are collectively called the “data compilation” stages. The data required for the problem are determined, collected, and examined to gain insights.

In the “data preparation” phase, the data at hand are properly formatted, gaps in them are identified, and redundant data are removed.

In the “modeling” phase, the situation is modeled with many different algorithms. This step is repeated frequently throughout the whole process.

In the “evaluation” phase, the model built with the selected algorithm is examined, and conclusions are drawn about how well it works. Based on this evaluation, the model can be tested in the “deployment” phase, and its performance there is sent back to the modeling phase as “feedback”.
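The modeling and evaluation stages can be sketched in a few lines of Python. This is an illustrative toy, not part of the methodology itself: the data is synthetic, the two candidate "algorithms" (a mean baseline and a least-squares line) and all function names are assumptions chosen to keep the example self-contained. Several candidates are fitted on training data, each is scored on held-out data, and the best one is selected.

```python
import random
from statistics import mean


def fit_mean(xs, ys):
    """Baseline model: always predict the average of the training targets."""
    average = mean(ys)
    return lambda x: average


def fit_line(xs, ys):
    """Simple linear regression fitted with ordinary least squares."""
    mean_x, mean_y = mean(xs), mean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_squared_error(model, xs, ys):
    """Evaluation metric: average squared difference between prediction and truth."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


def model_and_evaluate(seed=0):
    # Synthetic data (an assumption for this sketch): y = 2x + 1 plus noise.
    rng = random.Random(seed)
    xs = [float(i) for i in range(40)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]

    # Hold out every fourth point for evaluation; train on the rest.
    train_x = [x for i, x in enumerate(xs) if i % 4 != 0]
    train_y = [y for i, y in enumerate(ys) if i % 4 != 0]
    test_x = [x for i, x in enumerate(xs) if i % 4 == 0]
    test_y = [y for i, y in enumerate(ys) if i % 4 == 0]

    # Modeling: fit each candidate. Evaluation: score it on unseen data.
    candidates = {"mean_baseline": fit_mean, "linear": fit_line}
    scores = {}
    for name, fit in candidates.items():
        model = fit(train_x, train_y)
        scores[name] = mean_squared_error(model, test_x, test_y)

    # The candidate with the lowest held-out error is selected.
    best = min(scores, key=scores.get)
    return best, scores


best_name, all_scores = model_and_evaluate()
print(best_name, all_scores)
```

In a real project, the scores would feed back into another round of modeling, for example by trying further algorithms or revisiting data preparation.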

Why do we analyze data?

Figure 3
https://vizyonergenc.com/storage/posts/September2019/Eg3XKkPZoVt6F0uyxHiV.jpg

Computing power is developing rapidly today. Computers now have more storage space and are powerful enough to handle complex jobs. As a result, access to information has become easier. In the middle of the 20th century, we entered a new era: the Information Age. The most important feature of this age is the growth in computer usage and internet access. As a result, large amounts of structured and unstructured data are produced on the Internet every second. In this context, we can associate data analysis with the Information Age.

Since computers entered our lives, we have used them for many different jobs. When they first entered our lives, they were calculators that could carry out complex operations for us. They could also be used to break ciphers. Later, with the internet, their capabilities grew and they became a communication tool. In time, we not only used them for communication but also started to create content for the internet. At this point, websites realized that their users were leaving them valuable information through their actions, so they started collecting data such as the number of visits and clicks. As smart devices that can connect to the internet entered our lives, the variety of user data collected increased; information such as location could now be collected. With the Internet of Things, devices became more involved in users’ daily habits, and the data obtained became more comprehensive.

And so, people learned why data was valuable and how they could earn money from it.

The large amount of information produced every day, every second, began to be stored in large data centers. Based on people’s consumption habits, sellers began to show them products that might interest them: personalized shopping recommendations for each user, and therefore increased sales.
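As a rough illustration of how consumption habits can turn into recommendations, here is a small Python sketch. The purchase histories, the user names, and the `recommend` function are all hypothetical, and the scoring rule (counting overlapping purchases) is a deliberately simple stand-in for what real recommendation systems do.

```python
from collections import Counter

# Hypothetical purchase histories: user -> set of products bought.
purchases = {
    "ayse": {"laptop", "mouse", "keyboard"},
    "mehmet": {"laptop", "mouse"},
    "zeynep": {"laptop", "mouse", "monitor"},
    "can": {"novel", "bookmark"},
}


def recommend(user, purchases, top_n=2):
    """Suggest products bought by users whose baskets overlap with the given user's."""
    owned = purchases[user]
    scores = Counter()
    for other, basket in purchases.items():
        if other == user:
            continue
        overlap = len(owned & basket)  # shared products act as a similarity weight
        if overlap == 0:
            continue  # ignore users with nothing in common
        for product in basket - owned:
            scores[product] += overlap
    # Sort by score (descending), then by name, so the result is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [product for product, _ in ranked[:top_n]]


print(recommend("mehmet", purchases))
```

Here the user who bought a laptop and a mouse is shown the keyboard and the monitor that similar shoppers bought, while the unrelated book purchases are ignored.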

Collecting data isn’t only a tool for making money. Applications such as Facebook and Instagram study the habits of their users and make changes to keep them in the applications longer. These companies do not openly share their data, so that their competitors cannot copy their algorithms. Although such companies cannot process all the data they have, they store it because they are aware of the value it will gain in the future. Many companies are ready to pay millions of dollars to purchase this data, but its owners prefer to wait for the right time, because they expect the power and money they will gain once the data is processed to exceed the millions they would earn by selling it.

In addition, some companies sell part of their data so that buyers can access the datasets required for their own applications. The academic community and non-profit organizations offer their datasets to the public free of charge. Communities have emerged that focus solely on building reliable and free datasets to assist academic studies. These communities build detailed datasets for particular sciences.

Nowadays, data has become a goal in itself for most companies and people, rather than just the material used by analysts. Companies collect as much data as they can, even if they know they cannot process it today. They believe that the sheer volume of what they collect will empower them in the future.

What is Science?

The definition of science is a “collection of information about the behavior and structure of the natural and physical world, consisting of provable arguments.” It is a way of justifying information.

Most people do not need to justify the practical information they use in their daily lives; should the information turn out to be incorrect, they update it and carry on. For example, when information accepted as absolute truth is falsified, people adapt their discourse to the new truth without questioning it further. In science, when a piece of information proves to be wrong, it is replaced by knowledge that has not yet been falsified, and this knowledge is considered correct until it is falsified in turn. Unlike practical, everyday knowledge, however, scientific knowledge is never considered absolutely correct, because there is always the possibility of falsification.

To summarize, science is a field open to ideas, changes, and criticism. It respects everyone’s rational thoughts and aims to make the world a better place through collective effort.

Many methods are used in scientific research for different purposes. However, there is one scheme that most people think of when the scientific research method is mentioned. Although it does not look very scientific, it summarizes the scientific research process in fewer than 10 steps. So, let’s take a look at the “scientific research method”:




Figure 4: The Scientific Method
https://www.sciencebuddies.org/science-fair-projects/science-fair/steps-of-the-scientific-method#hypothesis

The scientific method begins with asking questions about an observation made in a particular environment. Information is then collected about the observed subject, and inferences are drawn from this information in the form of a hypothesis. The hypothesis is tested experimentally, and when reliable results are obtained, a conclusion is drawn from the findings. If the experiment does not yield sufficient data, the experiment step is reviewed to determine where the problem might have been. And if the findings of the experiment do not support the hypothesis, the hypothesis is revised in the light of the new data.

Data Science as a Branch of Science

Now that we have a general understanding of data science and science-related terms, we can start discussing whether data science is really “science”.

If data science is a branch of science, it means that it was born out of science. And if data science was born out of science, both should approach their processes in the same way. Therefore, we have to compare the methods of the two fields.

When carrying out a study in data science, we first go through the “business understanding” and “analytic approach” steps. Likewise, as the first step of a scientific study, we observe our environment and make inferences about the problems found there. Since the perspective in data science is a business one, we can conclude that “business understanding” is simply observation restricted to the corporate environment.

The three-step phase that data science calls “data compilation” covers the preliminary information needed to solve the problem once it has been identified. Likewise, in the scientific method, once the problem has been identified we make it the focus of our research and collect the information that will help us.

We can compare the “data preparation” phase to the hypothesis phase in scientific studies. Since there is no single universal scientific research method, the steps in science are interchangeable, so we cannot draw a sharp line at this transition. Nevertheless, since we select the data based on our inferences, we can say that the hypothesis is established at this stage and that the data are prepared for the experimental stage in accordance with it.

After these stages, the experimental part begins. It corresponds to “modeling” in data science: by modeling, we test our hypothesis. As in the scientific method, our modeling may or may not give the desired results. Depending on the output, we repeat the necessary steps until we get a satisfactory result. The best model is then chosen as the outcome of the research.

Given the similarity of their methods, one can say that data science uses the scientific method. In other words, the report written at the end of any data science project can qualify as a scientific report, provided the method has been followed. From this point of view, we can say that data science is science.

However, other questions arise at this point. Can we say that data science is a distinct branch of science? If so, given that there are branches of science to which data science bears a high degree of similarity, wouldn’t we need to create many new fields of the same nature as data science? If data science is not an original field, which branch of science does it belong to? And if we consider data science a branch of statistics, we must accept mathematics as a branch of science. But this is another controversial issue: Is mathematics a science?

In the end, whether or not we accept data science as a “science”, we face other dilemmas. Therefore, I think there is no universal and definite answer to this question. What do you think? Do you think data science is science?

Bibliography

“Data Scientist: The Sexiest Job of the 21st Century” https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

“Steps of Scientific Method” https://www.sciencebuddies.org/science-fair-projects/science-fair/steps-of-the-scientific-method#hypothesis

“Structured and nonstructured data” https://www.igneous.io/blog/structured-data-vs-unstructured-data

“The Data Science Process” https://www-01.ibm.com/events/wwe/grp/grp304.nsf/vLookupPDFs/Polong%20Lin%20Presentation/$file/Polong%20Lin%20Presentation.pdf

“What is Data Science? Fundamental Concepts and a Heuristic Example” https://www.springer.com/gp/book/9784431702085

“What is a Data Scientist?” https://www.mastersindatascience.org/careers/data-scientist/
