

February 6, 2020 at 5:29 pm · @oluwole (Participant)
An Introduction to Data Preprocessing
That we live in the age of data is an idea that is quickly becoming a cliché. Indeed, it is as common as it is true – the notion that our access to unprecedented amounts of data strongly influences the dynamics of various industries. From predicting financial market trends to understanding consumer behaviour, organizations of all kinds make use of large data sets in their operations.
One of the key steps in data analysis is data preprocessing. It is not uncommon for businesses to skip this process, even though it is critical; it is also common for a data scientist or analyst to go about it wrongly. We'll take a look at data preprocessing – its importance and the steps involved.
What is Data Preprocessing?
Data preprocessing is a data mining procedure involving the transformation of raw data into an understandable format. Real-world data is often inconsistent (containing discrepancies), incomplete and laden with errors. Data preprocessing is a useful technique to prepare and transform the data, thus making insight discovery more efficient.
The major issues with raw data can be categorized into three:
 Noisy data – this comprises mislabels, exceptions, outliers and human errors that are present in the data set but are meaningless.
 Inaccurate data – this usually occurs when there is an issue at the data collection phase. It refers to cases of missing data which leave gaps in field entries that are potentially relevant to the analysis.
 Inconsistent data – this is reflected in duplicates and in similar entries having different formats. It causes deviations in the data set and must be treated before analysis.
Why Data Preprocessing is Important
In typical machine learning projects, data preprocessing takes between 60 and 80 per cent of the analytical pipeline's time. This is because the accuracy of the analysis hinges on the integrity of the data set; mistakes, missing values and redundancies must be properly treated to obtain the best result. The approach to data preprocessing is dynamic, depending on the nature of the problem at hand. However, there are basic measures that must be taken in order to make the data set “neat” and organized.
Data Preprocessing Steps
Data preprocessing can be viewed from four fronts:
 Data Cleaning
 Data Integration
 Data Transformation
 Data Reduction
Data Cleaning
Data cleaning, as the name suggests, is the process of cleaning or cleansing the data. Generally, the techniques employed in data cleaning fall into two categories based on their purpose. They are used either for:
 handling missing values; or
 removing noise from data.
Handling Missing Values
The methods for handling and filling missing values are:
 Ignoring the tuple: This involves dropping a tuple (a row of the data set containing missing values). Removing tuples is generally advisable only for large data sets, where their exclusion has no effect on the information relayed by the data. It is discouraged for small data sets, as there is a significant risk of losing valuable information.
 Manual filling: This approach involves manually making data entries if the nature of the data is already understood. It is time-consuming and discouraged for large data sets.
 Using a standard value: You can replace missing values with a global constant such as “N/A” or “Unknown”.
 Using central tendency: Based on the data's summary statistics, the mean (for normal distributions) or the median (for non-normal distributions) can be used in place of the missing values.
 Interpolation: The interpolation technique is the use of regression, Bayesian formulation or other methods to develop a relationship among the data attributes. The missing values are then predicted and replaced with the most probable and accurate values.
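The central-tendency approach above can be sketched in a few lines of plain Python; the function name `fill_missing` is illustrative, and in practice a library such as pandas (`DataFrame.fillna`) would handle this:

```python
from statistics import mean, median

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known) if strategy == "mean" else median(known)
    return [fill if v is None else v for v in values]

ages = [23, None, 31, 27, None, 29]
print(fill_missing(ages, "mean"))    # gaps filled with 27.5, the mean of the known values
print(fill_missing(ages, "median"))  # gaps filled with 28, the median of the known values
```

The choice between mean and median follows the rule above: the median is more robust when the distribution is skewed.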
Handling Noisy Data
Noise is any kind of random error or variance in the data's attributes. As stated earlier, it could take the form of wrong labels, exceptions and outliers. Noisy data can be smoothed using the following methods:
 Visualization: This is an intuitive and informative approach to dealing with noise. It is the use of graphs and plots of the data to spot anomalies; box plots and scatter diagrams are the most common informative plots used.
 Binning: Here, the data is first sorted, then partitioned into bins, and each bin is smoothed using the values it contains. There are different approaches to binning, as explained in the case study below.
Given the sorted data – 3, 4, 8, 11, 14, 16, 18, 22 – split into two equal-size bins:
Bin 1 = 3, 4, 8, 11
Bin 2 = 14, 16, 18, 22
Smoothing by bin means – each member of a bin is replaced with the mean of that bin. Therefore, each bin becomes:
Bin 1 = 6.5, 6.5, 6.5, 6.5
Bin 2 = 17.5, 17.5, 17.5, 17.5
Smoothing by bin medians – the values in a bin are replaced with the median of the bin's values. Therefore:
Bin 1 = 6, 6, 6, 6
Bin 2 = 17, 17, 17, 17
Smoothing by bin boundaries – each value is replaced with the nearest boundary value of its bin. Thus:
Bin 1 = 3, 3, 11, 11
Bin 2 = 14, 14, 22, 22
 Regression: The data is smoothed by fitting it to a regression function – linear regression in the case of one independent variable, or multiple regression for several independent variables.
 Outlier analysis: Here, the data is clustered, i.e. grouped into clusters based on similarity. Outliers are the values that fall outside the clusters, and they are not considered for analysis.
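The binning case study above can be reproduced with a short Python sketch (the function name `smooth_bins` is illustrative, and equal-size bins are assumed, as in the example):

```python
from statistics import mean, median

def smooth_bins(sorted_data, n_bins, method="mean"):
    """Split already-sorted data into equal-size bins and smooth each bin."""
    size = len(sorted_data) // n_bins
    bins = [sorted_data[i * size:(i + 1) * size] for i in range(n_bins)]
    smoothed = []
    for b in bins:
        if method == "mean":
            smoothed.append([mean(b)] * len(b))
        elif method == "median":
            smoothed.append([median(b)] * len(b))
        else:  # "boundary": replace each value with the nearer bin edge
            lo, hi = b[0], b[-1]
            smoothed.append([lo if v - lo < hi - v else hi for v in b])
    return smoothed

data = [3, 4, 8, 11, 14, 16, 18, 22]
print(smooth_bins(data, 2, "mean"))      # [[6.5, 6.5, 6.5, 6.5], [17.5, 17.5, 17.5, 17.5]]
print(smooth_bins(data, 2, "boundary"))  # [[3, 3, 11, 11], [14, 14, 22, 22]]
```

Running it reproduces the three smoothed versions of the case-study bins shown above.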
Data Integration
This is the combination of data from multiple sources into a coherent data set; as such, it is also referred to as data merging. Data that is yet to be integrated usually contains duplicates, redundant tuples and data conflicts, all of which are bad for analysis. Proper use of metadata is important to avoid errors during data consolidation.
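One common redundancy check during integration is a chi-squared test of association between two nominal attributes. Below is a from-scratch sketch of Pearson's chi-squared statistic for a contingency table (the function name is illustrative; in practice a library routine such as `scipy.stats.chi2_contingency` would be used):

```python
def chi_squared(table):
    """Pearson's chi-squared statistic for a 2D contingency table."""
    rows = [sum(r) for r in table]              # row totals
    cols = [sum(c) for c in zip(*table)]        # column totals
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / total  # expected count under independence
            stat += (observed - expected) ** 2 / expected
    return stat

# Counts exactly proportional to the row/column totals give a statistic of 0
# (no association), while a strong association gives a large statistic,
# suggesting one of the two attributes may be redundant.
print(chi_squared([[10, 20], [20, 40]]))  # 0.0
print(chi_squared([[30, 0], [0, 30]]))    # 60.0
```

A large statistic (relative to the chi-squared distribution for the table's degrees of freedom) flags a strong association between the two attributes.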
Correlation analysis is a useful technique for the detection of redundancies in data sets. The chi-squared test is well suited to nominal data, while the covariance test is useful for numerical data.
Data Transformation
After cleaning and merging the data, the next phase of data preprocessing is data transformation. This employs various techniques to change the data into forms appropriate for the analysis ahead. The methods involved include:
 Normalization
This is the process of rescaling the original data to a predefined range (e.g. -1.0 to 1.0) without any loss of the data's attributes. It is especially useful for algorithms involving neural networks. Some normalization techniques are min-max normalization, z-score (zero-mean) normalization, and decimal scaling.
 Aggregation
This involves carrying out summary or aggregation operations over the data. A good example is adding up daily wages to compute monthly and annual income.
 Attribute or feature construction
Feature construction, or feature engineering, is the process of constructing new features/attributes by observing the relationships between existing attributes. It is useful for generating more information from vague data sets, especially when there are few features but they contain hidden insights.
 Discretization
This is the replacement of the raw values of numerical features with discrete intervals or conceptual values. The former is exemplified by the categorization of age into groups (e.g. 0 – 15, 15 – 30, 30 – 45), while the latter makes use of non-numerical categories (e.g. child, young adult, adult, etc.).
 Concept hierarchy generation
This involves grouping the values of nominal data into more general categories (e.g. street → city → country).
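Two of these transformations – min-max normalization and discretization into conceptual values – can be sketched in plain Python (function names and the age cut-offs mirror the examples above and are illustrative):

```python
def min_max(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

def discretize_age(age):
    """Map a numeric age onto a conceptual category."""
    if age < 15:
        return "child"
    elif age < 30:
        return "young adult"
    return "adult"

print(min_max([10, 20, 30, 40, 50]))             # [-1.0, -0.5, 0.0, 0.5, 1.0]
print([discretize_age(a) for a in (8, 22, 41)])  # ['child', 'young adult', 'adult']
```

The linear rescaling preserves the relative spacing of the values – no information about the attribute is lost, only its scale changes.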
Data Reduction
Working with large data sets can be laborious and time-consuming even when the process is automated, sometimes rendering the analysis essentially impractical – hence the data reduction stage of data preprocessing. It is the reduction of a data set to only its most critical information, without compromising its integrity, while still yielding quality insights. Data reduction therefore serves to increase storage efficiency and the speed of analysis.
The steps to data reduction are:
 Data Cube Aggregation
This is the organization of the data set into multidimensional arrays of values called data cubes. Aggregation operations that produce a single value for a group of values, such as means/averages, are used.
 Attribute Subset Selection
The objective here is to select only the most relevant attributes and discard the rest. This can be achieved by defining a minimum threshold that attributes must reach in order to be considered; attributes below the threshold are discarded.
 Numerosity Reduction
This is the replacement of the original data with a smaller representative data set. It can be done through clustering, sampling, regression, log-linear models, and the use of histograms.
 Dimensionality Reduction
This is the detection and removal of irrelevant or redundant attributes.
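The threshold-based attribute subset selection described above can be sketched using variance as the relevance measure – an attribute with near-zero variance carries almost no information. The function name, the variance criterion, and the threshold value are all illustrative assumptions (scikit-learn's `VarianceThreshold` implements the same idea):

```python
from statistics import pvariance

def select_attributes(columns, threshold):
    """Keep only the attributes whose population variance meets the threshold.

    `columns` maps attribute names to their lists of values.
    """
    return {name: vals for name, vals in columns.items()
            if pvariance(vals) >= threshold}

data = {
    "constant_flag": [1, 1, 1, 1],       # zero variance: carries no information
    "age": [23, 35, 41, 29],
    "score": [0.5, 0.5, 0.6, 0.5],
}
kept = select_attributes(data, threshold=0.001)
print(sorted(kept))  # ['age', 'score'] -- the constant attribute is discarded
```

The appropriate threshold depends on the scale of each attribute, which is one reason normalization (from the transformation stage) is often applied first.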
Summary
What we’ve learnt so far:
 Data preprocessing is an important procedure in data analysis.
 It involves the preparation of the raw data and its transformation into more understandable formats.
 It leads to better, cleaner and more manageable data sets.
 Data preprocessing proceeds in four phases: cleaning, integration, transformation and reduction.
Conclusion
The odds are that whatever raw data you receive, missing values, noise and inconsistencies are bound to be inherent in it. Checking through the data, then cleaning, simplifying and transforming it, is necessary for better data analysis. Although data preprocessing usually takes quite a while, it saves you from a lot of errors, inaccuracies and inefficiency. Data preprocessing is therefore a must for businesses intending to extract meaningful insights from their data sets.
