February 4, 2020 at 11:04 am #85141IdowuParticipant@idowu
According to IBM statistics, Data Scientists are likely to spend 80% of their time cleaning data. Most of the available raw data in nature will always contain inconsistencies, which when left un-engineered, will significantly affect the outcome of machine learning algorithms.
Let’s cover some loopholes together by working on one of such raw data sets in this article. You can download the data set here.
Load the Libraries
For this task, I’ll be using the following libraries:12345""" load the necessary libraries"""from scipy.stats import modeimport pandas as pdimport numpy as npimport matplotlib.pyplot as plt
Load the dataset
The name of the data I’ll be working with is raw_data.xlsx. Ensure that it’s in the same directory with your script file1df = pd.read_excel(r'raw_data.xlsx')
To start cleaning the data set, we might want to look into its general characteristics. There are a number of ways to do this:
First, let’s print the first 20 columns of the data set by running the following codes :1df.head(20)
Looking at the rows manually might be a little bit strenuous and you might not even take note of anything, as this data set is a bit large.
df.isnull().sum()can be used to scan quickly through our data set in order to be able to ascertain the number of missing values.
Let’s now take a look at the summary of our data set by simply checking for missing values:1df.isnull().sum()
You would see that each column has been returned iteratively, showing the missing values and the name of the columns at the same time.
You can also use
df.shapeto know the number of columns and rows therein.
Looking at the summary of my data set, you would notice that most of the columns consist of several missing values, which is one of the major problems that are associated with raw data, mostly because of the way in which some of them were entered by respondents.
Other irregularities also do exist in raw data, many of which we’ll come across while working on the data set provided in this article.
Dealing with Missing Data
There are several ways of dealing with missing data. You can either decide to fill them with default values, which are computed from statistics, or you may decide to remove the “nan” values. Although, unlike the former method, the latter option is not common in practice, as removing missing data might shorten our data and reduce the accuracy of our model.
More also, depending on the type of response variable, which could be categorical, discrete or continuous, you can fill missing values with mean, median or mode. In some cases, predictive modeling is also used (this will not be covered in this article).
Filling your Data Set with Default Values
Well, the above explanation is just for the literature, but what we actually want to do is fill all“nan” values in our data set with the default value, which is the mode as shown:1234"""Fill the nan with default value mode"""df.fillna(df.mode, inplace=True)"""Print out the summary of the dataset again"""df.isnull().sum()
Now all the columns have zero null values
If you so wish, depending on what you want, you can also fill a particular column with its default value. Generally, if the column contains categorical data, you would want to fill it with the mode, as calculating mean or median for such data type will be irrelevant.
However, if you have a continuous or discrete variable, you should consider using the mean or median. Mode can also be used for some continuous variables like age, e.g, if your intention is to fill the columns uniquely, take for instance, the DOB column, you should use:12df['DOB'].fillna(df['DOB'].mode, inplace=True)"""ensure that you include inplace=True to avoid reassignment"""
Now, since we’ve dealt generally with the missing data, for the purpose of this article, I will be focusing on one continuous variable, which is the “DOB” column and one categorical variable, which is the “gender” column and we will handle the irregularities in both columns.
Let’s Handle the “DOB” Column first
This column consist of time series, so, we will be working with the time series. Print out the “DOB” column by calling:1df.DOB
Convert the “DOB” which is in form of “yyyy-mm-dd” to the actual age by using the line of codes below:123456789"""from datetime import date"""from datetime import date"""Function to convert the DOB to age"""def calculate_age(birthdate):today=date.today()age=today.year-birthdate.year-((today.month, today.day)<(birthdate.month, birthdate.day))return ageAge=(df['DOB'].apply(calculate_age))Print(Age)
Add age to the DataFrame1234"""Add Age to the DataFrame """df['Age']=Age"""Print out df.Age and DOB back to back"""df[['Age'], ['DOB']]
Now, age has been added to the DataFrame.
Let’s do one last thing with this column by calling a Boxplot in order to see if there are things like outliers or invalid attributes:1Box = df.boxplot(column='Age')
You can see the outliers well outlined12"""Get the highest values to know what the problem is"""Maximum = df.Age.max())
Using IQR Method to Remove Outliers
Let’s handle the outliers with the Interquartile Range (IQR) method, this uses the concept of middle fifty to sort our data and automatically remove outliers
The codes below will solve this problem for us:123456789101112"""IQR method of to remove outliers"""lower_boundary=(0.1)upper_boundary=(0.95)outliers=df.Age.quantile([lower_boundary, upper_boundary])print(outliers)Accepted=((outliers.loc[lower_boundary]<df.Age.values)&(df.Age.values<outliers.loc[upper_boundary]))rejected_age=((df.Age.values<outliers.loc[lower_boundary])&(df.Age.values>outliers.loc[upper_boundary]))valid=df.Age[Accepted]print(valid)df['valid']=validdf.boxplot(column='valid')
Outliers are now gone
Dealing with Categorical Data (gender column)
Let’s take a look at the gender column:1df.gender
Furthermore, because there are always some typographical errors in categorical data, you might consider using a Boxplot to monitor the characteristic of this column:123"""use boxplot to view gender characteristics""""""plot a boxplot"""box=df.boxplot(column='Age', by= 'gender')
That's actually a very faulty data. It’s very inappropriate to collect data this way, the problem was that the forms were probably filled manually by users instead of using a defined drop down button. But, let’s deal with those:123456789101112131415161718192021222324252627282930"""Use the codes below to replace the inconsistencies with the correct values, by using the output of the boxplot above""""""We assume Femal to be Female: """Gender=df.gender.replace('Femal', 'Female')"""Insert into df:"""df['Gender']=Gender"""We assume that F is Female:"""Gender_1=df.Gender.replace('F', 'Female')"""Insert into df:"""df['Gender_1']=Gender_1"""We also assume that M is Male:"""Main_gender=df.Gender_1.replace('M', 'Male')"""Insert into df:"""df['Main_gender']=Main_gender"""Finally, let's replace U by NaN value, we're not really sure what it is"""Formatted_gender = df.Main_gender.replace('U', np.NaN)"""Insert into df"""df['Formatted_gender']=Formatted_gender"""Display boxplot for formatted_gender with the valid that was computed from Age earlier:"""box=df.boxplot(column='valid', by= 'Formatted_gender')
From the boxplot, it’s clear that the age of the females lie generally between 25 and 65 years, while majority of them are between ages 37 and 51 years.
However, by overviewing the age characteristics of the males, it lies between 25 and 64, but majority are between ages 36 and 51 years.
Now we’ve formatted gender and age, and we have the valid format: Female and Male. Remember we’ve replaced the “U” field with “NaN”, let’s do well by checking for null values in gender:123"""Missing values for Formatted_gender"""Missing = df['Formatted_gender'].isnull().sum()print('Missing gender is now', Missing)
Let’s apply some finishing touches by filling up the NaN :1234"""fill the NaN with mode and view missing values again"""modes = df.Formatted_gender.fillna(mode(Formatted_gender).mode, inplace=True)Missing = df['Formatted_gender'].isnull().sum()print('Missing gender is now', Missing)
We are now okay, you can also use the code
df.drop(column_name, inplace=True, axis=1)to drop columns if you so wish while formatting your data set.
Depending on how creative you are, there are several methods of cleaning and sorting data. You can follow the steps highlighted in this article and play around your data. Working with raw data could be tedious, but the beauty is getting it well formatted.
- This topic has 0 replies, 1 voice, and was last updated 2 weeks, 1 day ago by .
Viewing 1 post (of 1 total)
Viewing 1 post (of 1 total)