February 28, 2020 at 8:49 pm #86851 · @oluwole (Participant)
Data Science (DS) is a multidisciplinary field that employs scientific methods and processes to extract knowledge from data in different forms and solve analytically complex problems. The core of this field is data, which must be used in creative ways to generate business value.
Data scientists use methods from many disciplines, including statistics, mathematics, computer science, and information science. Statistics, which is the use of mathematics to perform technical analysis of data, can be a powerful tool in the hands of a data scientist.
There is a certain degree of overlap between the two fields, such that the definition of one discipline could describe the other. However, there are vital differences between the two practices. Data scientists ask questions like:
- How much should this automobile cost?
- How is Bing correctly “guessing” my search query?
Statistics is the art of representing these questions with numbers, investigating them, and obtaining answers. It provides quantitative connections to very qualitative questions.
A data scientist knows more programming than a statistician and more statistics than a programmer.
Good knowledge of statistics, as well as other branches of sciences, is a must-have for any data scientist who seeks successful scientific solutions to problems based on suitable approaches. In this post, we will discuss five important statistical concepts in the field of Data Science.
Important Statistical Concepts
Statistics is a vast field laden with a variety of techniques that accomplish different objectives. Some of these techniques, however, come in very handy for data scientists, and we shall discuss them under these broad categories:
- Descriptive Statistics
- Bayesian Statistics
- Sampling
- Probability Distribution
- Dimensionality Reduction
Descriptive Statistics
Descriptive statistics is the process of summarizing and describing a data set, which can be either a sample of a population or a representation of the entire population.
Descriptive Statistics is distinguished from another branch of statistics called Inferential Statistics (or Inductive Statistics). While in descriptive statistics, the objective is to describe, present, summarize and organize your data, inferential statistics involves using more complex mathematical calculations that enable us to infer trends and make predictions about our data.
Descriptive statistics are useful for two purposes:
- To provide basic information about the variables in your data
- To highlight potential relationships and connections between those variables
Accordingly, the most common descriptive statistics can be generally categorized as any of the following:
- Measures of Central Tendency – mean, median, mode.
- Measures of Dispersion – range, standard deviation, variance, skew, quartiles, kurtosis.
- Measures of Association – correlation, chi-square.
- Graphical or Pictorial Methods – histograms, scatter plots, sociograms, Geographic Information Systems (GIS).
Descriptive statistics include exploratory data analysis, clustering, unsupervised learning, as well as essential data summaries. With descriptive statistics, we can formulate hypotheses that can be tested later with more scientific methods.
An excellent example of a descriptive statistic is Shot Accuracy in soccer, which summarizes the performance of a soccer player or a team. It is the number of shots on target divided by the number of shots taken. Thus, a player with a Shot Accuracy of 25% puts approximately one of every four shots on target.
Therefore, descriptive statistics provide you with a better understanding of your data and present your raw data in a more meaningful way that allows for a more straightforward interpretation.
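The broad categories above can be sketched with NumPy. A minimal illustration on hypothetical match data (all numbers invented), covering central tendency, dispersion, a summary statistic like Shot Accuracy, and a measure of association:

```python
import numpy as np

# Hypothetical match data for one player across ten games
shots_taken = np.array([4, 2, 5, 3, 4, 1, 6, 3, 2, 4])
shots_on_target = np.array([1, 1, 2, 0, 1, 0, 2, 1, 0, 1])

# Measures of central tendency and dispersion
mean_shots = shots_taken.mean()
median_shots = np.median(shots_taken)
std_shots = shots_taken.std(ddof=1)  # sample standard deviation

# A simple summary statistic: shots on target / shots taken
shot_accuracy = shots_on_target.sum() / shots_taken.sum()

# A measure of association: correlation between the two variables
corr = np.corrcoef(shots_taken, shots_on_target)[0, 1]

print(mean_shots, median_shots, round(shot_accuracy, 3))
```

Each line maps directly onto one of the categories listed above; a histogram or scatter plot of the same arrays would cover the graphical methods.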
Bayesian Statistics
Bayesian statistics is a mathematical system that describes epistemological uncertainty using probability. It is the application of probabilities to statistical problems. Bayesian inference is a statistical activity that employs Bayes’ theorem to update probabilities after more evidence has been obtained.
These statistical methods begin with existing ‘prior’ beliefs and update these beliefs using data to give ‘posterior’ beliefs. They may then be used as the basis for inferential decisions.
Generally, Bayesian statistical methods are used in three situations:
- When there is no alternative but to include prior quantitative judgments, either because data are lacking or because assumptions must be made about the biases involved.
- When the problem is moderate-sized and there are multiple sources of evidence.
- When a large joint probability model must be constructed.
Some areas of Bayesian statistics include hypothesis testing, model criticism, robustness and reporting, and model choice. Bayesian machine learning builds on Bayesian statistics: it allows us to encode our prior beliefs about what we expect the model to be, independent of what the data says.
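The prior-to-posterior update described above can be shown with the simplest conjugate case, a Beta prior on a coin's heads probability. This is a minimal sketch with invented numbers, not a general-purpose inference routine:

```python
# Prior belief about a coin's heads probability: Beta(alpha, beta).
# Beta(2, 2) is a weak prior centred on 0.5.
alpha, beta = 2, 2

# Observed evidence: 7 heads in 10 flips
heads, tails = 7, 3

# With a conjugate prior, Bayes' theorem reduces to simple addition:
# the posterior is Beta(alpha + heads, beta + tails)
alpha_post = alpha + heads
beta_post = beta + tails

posterior_mean = alpha_post / (alpha_post + beta_post)
print(posterior_mean)  # (2+7) / (2+7+2+3) = 9/14, roughly 0.643
```

Note how the posterior mean sits between the prior mean (0.5) and the observed frequency (0.7): the data has updated, but not entirely replaced, the prior belief.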
Sampling
Data is a continuum; there is no end to it. Although data scientists usually handle big data, it is always just a part of a whole. Thus, samples need to be drawn in creative and efficient ways from the population so that they accurately represent the whole population. This process of selection is referred to as Sampling, and it is essential in statistics.
Sampling, as a phenomenon, is practised in our everyday lives. Say you want to decide which movie to watch: you watch the trailer of each, which you expect to be an accurate representation of the film. You then make your selection based on the sample that appeals most to you. The ground rule for sampling, therefore, is that it should represent the population as accurately as possible.
There are several ways to sample, and they include:
- Simple Random Sampling, where every item in the population has an equal chance of being included in the sample. There is no bias towards any category. Therefore, they are usually reasonably representative.
- Stratified Random Sampling, where the population is first split into strata (groups or layers), and the overall sample consists of items from each level. It ensures that the sample is representative of each group.
- Cluster Random Sampling, where the population is split into clusters based on shared features or attributes, and whole clusters are then selected at random. This method helps when each cluster represents the population as a whole.
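The first two methods above can be sketched in a few lines of Python using the standard library. The population of 100 items and the four equal strata are hypothetical:

```python
import random

random.seed(0)
population = list(range(1, 101))  # hypothetical population of 100 items

# Simple random sampling: every item has an equal chance of selection
simple = random.sample(population, 10)

# Stratified random sampling: split the population into strata,
# then draw the same number of items from each stratum
strata = [population[i:i + 25] for i in range(0, 100, 25)]  # four equal strata
stratified = [x for stratum in strata for x in random.sample(stratum, 2)]

print(len(simple), len(stratified))
```

The stratified sample is guaranteed to contain two items from every quarter of the population, whereas the simple random sample might, by chance, miss a stratum entirely.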
Inadequate data sampling methods lead to skewed or biased results. When the data classes are imbalanced, resampling techniques are used to create more uniform data sets. The two types of resampling are oversampling and undersampling.
If a class is over-represented, undersampling can be used to balance it against the minority class, mainly when the data at hand is sufficient. Conventional undersampling methods include cluster centroids and Tomek links.
Conversely, when one class of data is under-represented, oversampling techniques may be used to form a more uniform dataset, especially when the data at hand is insufficient. One of the more popular oversampling methods is Synthetic Minority Over-sampling Technique (SMOTE).
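To make the idea concrete, here is a sketch of naive random oversampling on an invented imbalanced dataset. SMOTE goes further than this by synthesising new points between minority-class neighbours rather than duplicating existing ones:

```python
import random

random.seed(42)

# Hypothetical imbalanced dataset: 8 majority samples, 2 minority samples
majority = [("A", i) for i in range(8)]
minority = [("B", i) for i in range(2)]

# Naive random oversampling: duplicate minority samples (with replacement)
# until the two classes are the same size
extra = [random.choice(minority) for _ in range(len(majority) - len(minority))]
oversampled = minority + extra

balanced = majority + oversampled
print(len(balanced))  # 16 samples, 8 per class
```

Undersampling would do the reverse: randomly discard majority samples until the class counts match, at the cost of throwing information away.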
Probability Distribution
A probability distribution is a function that gives the possible outcomes of a random variable together with their respective likelihoods within a given range. The range is bounded by the maximum and minimum possible values, but where a possible value is likely to fall on the distribution depends on factors like the mean, standard deviation, and skewness of the distribution.
Probability distributions are also used to create cumulative distribution functions that start at zero and end at 100% by adding up the individual probabilities of each outcome. There are different classes of probability distributions for different purposes and processes. Some of the classes are Poisson distribution, binomial distribution, chi-square distribution, and normal distribution.
The normal distribution is the most commonly used. Its standard form has a mean of zero, a standard deviation of one, a skew of zero, and a kurtosis of three. The binomial distribution models the probability of an event occurring a certain number of times over a specified number of trials, given the likelihood of the event in each trial.
Binomial distributions are discrete, since the only valid outcomes are whole-number counts of successes. Continuous distributions, on the other hand, represent all possible values within their range.
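Both distributions can be sampled directly with NumPy's random generator. A small sketch contrasting a discrete binomial with a continuous standard normal (the trial count and probability are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Binomial: number of successes in n trials, success probability p per trial
n, p = 10, 0.5
binomial_draws = rng.binomial(n, p, size=100_000)

# Standard normal: mean 0 and standard deviation 1
normal_draws = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Sample statistics should sit close to the theoretical values:
# binomial mean = n * p = 5; normal mean = 0, std = 1
print(binomial_draws.mean(), normal_draws.mean(), normal_draws.std())
```

Note that every binomial draw is a whole number between 0 and 10, while the normal draws take arbitrary real values, which is exactly the discrete/continuous distinction above.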
Dimensionality Reduction
Dimensionality reduction is the process of converting a dataset with many dimensions or features into one with fewer dimensions without significant loss of information. The more features a dataset has, the more complex it is to handle. Some of these features may also be correlated with one another, and hence redundant.
Dimension reduction is used to obtain better features for classification and regression tasks. It helps to:
- Compress data, thus reducing the required storage space
- Speed up computations
- Improve model performance by removing redundant variables
- Make clearer and more precise plots and visuals.
Some dimensionality reduction techniques include:
- Missing Values Ratio
- Low Variance Filter
- High Correlation Filter
- Random Forests / Ensemble Trees
- Principal Component Analysis (PCA)
- Backward Feature Elimination
- Forward Feature Construction
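Of the techniques listed, PCA is perhaps the easiest to demonstrate from scratch. A minimal SVD-based sketch on invented data where a third feature is nearly a copy of the first, so two components capture almost all the variance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3-feature dataset; the third feature is almost redundant
x = rng.normal(size=(200, 2))
data = np.column_stack([x, x[:, 0] + 0.01 * rng.normal(size=200)])

# PCA via SVD: centre the data, decompose, keep the top k components
centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

k = 2
reduced = centered @ vt[:k].T        # project onto the first k components

explained = (s ** 2) / (s ** 2).sum()  # variance explained per component
print(reduced.shape, round(explained[:k].sum(), 4))
```

Because the third feature carries almost no independent information, dropping a dimension here compresses the data and removes a redundant variable while keeping essentially all of the variance, which is precisely the goal stated above.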
Statistics is an integral element of Data Science. It provides data scientists with tools to find insights in their data sets and answer their many questions. These tools come in the form of techniques, broadly categorized under descriptive statistics, Bayesian statistics, sampling, probability distributions, and dimensionality reduction. A good understanding of these concepts is, therefore, a necessity for a successful data science career.