Data science is closely tied to statistics, and knowing statistics helps with understanding almost every aspect of the machine learning world. There is no doubt that we first need to understand the problems machine learning deals with, and that is where statistics for data science comes on stage.
I am starting a new series about R for machine learning, and to open the series I will first touch on the main statistics topics.
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.
It also covers inferential statistics. What does that involve? Making assumptions, population and sample, main statistics (mean, mode, median, standard deviation), probability, confidence intervals, hypothesis tests, correlation and regression, and parametric and non-parametric distributions.
- Making assumptions: We want to predict the future even in our private lives. Imagine you work for a tech company as a data scientist or data analyst and you want to come up with well-founded assumptions to impress your boss. Yes, at that moment you will want to use your wonderful statistics knowledge :). To achieve that goal, you need at least basic knowledge of inferential statistics.
- Population and sample: A population is any group of interest, that is, any group that researchers want to learn more about. The population can be people or other things, such as animals or objects, about which information is desired. A sample is a subset of data drawn from the population of interest. The population has a critical role in machine learning. As you know, we build machine learning models and then test whether our conclusions hold up. Of course, no one can test on the whole population, which means we need a sample, and this sample needs to represent the important characteristics of the population in order to lead to the right decisions.
- Main statistics: These are the mean, mode, median, and standard deviation.
- Confidence Intervals: A confidence interval presents a calculated statistic as a range rather than a single value. The range is built so that, at a chosen confidence level (for example 95%), it is expected to contain the true population value.
- Probability: Probability is the science of how likely events are to happen.
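Let’s look at the important probability distributions:
- Binomial Distribution: Gives the probability of a given number of successes in n independent trials, each with the same probability of success.
- Poisson Distribution: Gives the probability of a given number of events occurring in a given area or time interval, typically for events that are rarely encountered.
- Normal Distribution: A symmetric, bell-shaped continuous distribution described by its mean and standard deviation. Values close to the mean are the most likely, and values far from it become increasingly rare.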
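As a quick illustration of these three distributions, here is a minimal sketch using R's built-in distribution functions. All the numbers (10 trials with success probability 0.5, an average of 2 events per interval, a mean of 170 with standard deviation 10) are made up for the example.

```r
# Binomial: probability of exactly 3 successes in 10 trials with p = 0.5
# (made-up numbers)
p_binom <- dbinom(3, size = 10, prob = 0.5)

# Poisson: probability of seeing 0 events when the average is 2 per interval
p_pois <- dpois(0, lambda = 2)

# Normal: probability that a value falls below the mean (mean 170, sd 10)
p_norm <- pnorm(170, mean = 170, sd = 10)

print(c(binomial = p_binom, poisson = p_pois, normal = p_norm))
```

The `d*` functions return the probability (or density) at a point, while the `p*` functions return the cumulative probability up to that point.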
- Hypothesis Tests: A hypothesis test is a statistical technique used to decide whether the data support a claim about a population. A statistician applies the following steps:
- Establishing hypotheses
- Determination of the significance level (alpha) and the table (critical) value
- Calculation of test statistics
- Comparison of the test statistic with the table value (or of the p-value with alpha)
However, R has many functions for applying different kinds of hypothesis tests. We do not need to know every detail in depth; we just need to know what each step is for.
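To make the four steps concrete, here is a rough sketch that walks through them by hand for a one-sided test of a mean. The sample `x`, the reference value `mu0 = 180`, and the significance level of 0.05 are all assumptions made up for illustration.

```r
# Hypothetical sample (made up for illustration)
x <- c(172, 168, 181, 175, 169, 177, 174, 170, 179, 173)

# Step 1: establish the hypotheses. H0: mean >= 180, H1: mean < 180
mu0 <- 180

# Step 2: significance level and table (critical) value
alpha <- 0.05
n <- length(x)
t_crit <- qt(alpha, df = n - 1)  # lower-tail critical value of the t distribution

# Step 3: calculate the test statistic
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Step 4: compare the test statistic with the table value
reject_h0 <- t_stat < t_crit
print(c(t_stat = t_stat, t_crit = t_crit))
print(reject_h0)
```

In practice a single call to a test function performs all of these steps, but seeing them once by hand shows what the function output means.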
- One-Sample t-Test: It is used to test the mean of a sample against a reference value and to calculate a confidence interval for it. It requires a continuous variable. The t-test has different variations, and as a rule of thumb we choose between a z-test and a t-test depending on the sample size n (n > 30 or n < 30). Also, if we use SPSS for example, the first step has to be checking whether the data is normally distributed. The good part of using R is that a single function, t.test(), covers the variants: it runs a one-sample t-test when given one sample and, by default, a Welch t-test when given two samples. Note that t.test() does not check normality itself, so that check remains a separate step (for example with shapiro.test()).
We need to identify our problem:
Assume we have a list of values called ‘degerler’. To see its main statistics:
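The original data is not shown here, so the sketch below uses made-up values for `degerler` purely for illustration. Base R has no built-in function for the statistical mode, so `get_mode` is a small hypothetical helper written for this example.

```r
# Hypothetical values; the original 'degerler' data is not shown in the text
degerler <- c(172, 168, 181, 175, 169, 177, 174, 170, 179, 173, 175)

# Helper for the most frequent value
# (base R's mode() returns the storage type, not the statistical mode)
get_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

print(summary(degerler))   # min, quartiles, median, mean, max
print(mean(degerler))
print(median(degerler))
print(get_mode(degerler))
print(sd(degerler))
```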
Hypothesis test for t-test:
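A minimal sketch of the test, again with made-up values for `degerler` because the original data is not shown. It assumes, inferred from the interpretation of this problem, that H0 is M >= 180 and the alternative is M < 180, and it uses a 95% confidence level.

```r
# Hypothetical values; the original 'degerler' data is not shown in the text
degerler <- c(172, 168, 181, 175, 169, 177, 174, 170, 179, 173, 175)

# One-sample, one-sided t-test.
# Assumed hypotheses: H0: M >= 180 vs H1: M < 180
result <- t.test(degerler, mu = 180, alternative = "less", conf.level = 0.95)
print(result)

# The p-value and the confidence interval can also be read directly
print(result$p.value)
print(result$conf.int)
```

The printed output includes the t statistic, the degrees of freedom, the p-value, and the confidence interval for the mean.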
According to the result of this code, we decide whether to reject H0. If p < 0.05, we reject H0; otherwise we fail to reject it.
The interpretation for this problem is: if we reject H0, we accept the alternative that the mean is below 180 (M < 180).
I will continue with one-sample proportion tests later.