According to this data science report, data scientists spend only about 20 percent of their time on actual data analysis. The single most time-consuming activity data scientists perform is data cleaning and organization. Here’s how the full breakdown looks:
This is understandable because most raw data found in the wild contains inconsistencies. Left unaddressed, these inconsistencies will significantly affect the outcome of any machine learning model built on that data.
This article will practically demonstrate how to sort and clean data using Python and its libraries. We will first expose some inconsistencies in our sample data set, then perform actual data cleaning on both continuous and categorical data.
For this tutorial, we will be working with this sample raw data: an Excel file named raw_data.xlsx.
Download the data set. We will get to it in a bit.
Load The Libraries
For this task, we will be using the following libraries:
```python
""" Load the necessary libraries """
from scipy.stats import mode
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```
Load The Data Set
Download raw_data.xlsx if you have not already done so and save it to the same directory that contains your Python script file.
Now, load the data set with this line of code:
df = pd.read_excel(r'raw_data.xlsx')
Identifying Inconsistencies In Our Data Set
To start cleaning the data set, we might want to first look into its general characteristics. There are a number of ways to do this.
Let’s start by printing out the first 20 rows of the data set with df.head(20):
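Since the sample file is not reproduced here, this minimal sketch uses a made-up DataFrame (the column names are assumptions, not the real raw_data.xlsx schema) to show how head() previews the top rows:

```python
import pandas as pd

# Toy stand-in for raw_data.xlsx -- these column names are
# assumptions, not the real schema
df = pd.DataFrame({
    'DOB': pd.to_datetime(['1990-01-15', '1985-06-30', '2001-11-02']),
    'gender': ['Female', 'M', 'Femal'],
})

# head(20) returns up to the first 20 rows without modifying df
print(df.head(20))
```

With fewer than 20 rows, head(20) simply returns them all.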
Our output looks like this:
Since the data in our Excel file is relatively large (about 4,000 rows), everything might look perfect if you’re just scanning the top 20 rows (as above). There don’t seem to be any inconsistencies so far. The truth is, 20 rows is nowhere near enough, and visually scanning your data like that isn’t a great way to identify inconsistencies.
Instead, let’s scan through the entire data set to count the missing values. The command df.isnull().sum() does exactly this; run it and you will get a summary of the missing values in our sample data set.
Here’s the output of the command:
Notice that the output lists each column name alongside its number of missing values.
Note: Use df.shape if you want to know the total number of rows and columns in your data set.
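As a quick sketch of what both commands report, here is a made-up DataFrame with deliberate gaps (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy DataFrame with deliberate gaps -- illustrative column names
df = pd.DataFrame({
    'gender': ['Female', None, 'Male', None],
    'Age': [34, 29, np.nan, 41],
})

print(df.isnull().sum())  # missing values per column
print(df.shape)           # (number of rows, number of columns)
```

Here isnull().sum() reports 2 missing genders and 1 missing age, and shape reports (4, 2).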
Looking at the summary of our data set in the screenshot above, you will notice that most of the columns have several missing values. This is one of the major problems associated with raw data, and it often happens because survey respondents leave questions unanswered.
Other irregularities also do exist in raw data. We will come across many of them while working on our sample data set (provided above).
Dealing With Missing Data
There are several ways to deal with missing data. You can either fill the gaps with default values computed from statistics, or remove every row containing “NaN” values from the data set entirely. In practice, however, the latter method is uncommon: removing entire rows because of a few missing fields shrinks the data set and reduces the accuracy of our model.
The practice of filling up missing values with calculated defaults is a lot more popular. Depending on the type of response variable (which could be categorical, discrete, or continuous), you can fill missing values with mean, median, or mode. In some cases, predictive modeling is also used (but this will not be covered in this article).
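As a sketch of that rule of thumb, here is a made-up DataFrame (the column names are hypothetical) filling a continuous column with its median and a categorical column with its mode:

```python
import numpy as np
import pandas as pd

# Made-up columns for illustration
df = pd.DataFrame({
    'income': [40000.0, np.nan, 52000.0, 61000.0],  # continuous
    'city': ['Lagos', 'Abuja', None, 'Lagos'],      # categorical
})

# Continuous variable: fill with the median (robust to outliers)
df['income'] = df['income'].fillna(df['income'].median())

# Categorical variable: fill with the mode (most frequent value)
df['city'] = df['city'].fillna(df['city'].mode()[0])

print(df)
```

The missing income becomes 52000.0 (the median of the observed values) and the missing city becomes 'Lagos' (the most frequent value).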
Filling Missing Data With Default Values
In this demo, we will deal with our missing data by filling all “NaN” values in the data set with each column’s mode. We will use this piece of code:

```python
""" Fill the NaN values with each column's mode """
df.fillna(df.mode().iloc[0], inplace=True)

""" Print out the summary of the data set again """
df.isnull().sum()
```
Now all the columns have zero null values.
The magic was done by the df.fillna(df.mode().iloc[0], inplace=True) command shown above. Note that df.mode() returns a DataFrame of the most frequent values per column, so we take its first row with .iloc[0] before filling.
In the above, we filled in all “NaN” values across our entire data set. But if you like, you could also do this for specific columns that you define. Generally, if the column contains categorical data, you would want to fill it with the mode, since calculating the mean or median for such data type will be irrelevant. However, if you have a continuous or discrete variable, you should consider using the mean or median.
The mode can also be used for some continuous variables, like age. If you want to fill a specific column individually (take the DOB column, for instance), you should use:

```python
""" inplace=True modifies the column in place, avoiding reassignment """
df['DOB'].fillna(df['DOB'].mode()[0], inplace=True)
```
Now, since we have dealt generally with the missing data, for the purpose of this article, let’s focus on one continuous variable, the “DOB” column, and one categorical variable, the “gender” column. We will handle the irregularities in both columns differently.
Cleaning The “DOB” Column (Continuous Variable)
This column consists of dates, making it a time series. Print out the “DOB” column with the df.DOB command. The output will look like this:
Now let’s convert the date-of-birth values (which are in the yyyy-mm-dd format) to actual ages using these lines of code:
```python
from datetime import date

""" Function to convert the DOB to age """
def calculate_age(birthdate):
    today = date.today()
    age = today.year - birthdate.year - ((today.month, today.day) < (birthdate.month, birthdate.day))
    return age

Age = df['DOB'].apply(calculate_age)
print(Age)
```
Let’s add this new “Age” column to our DataFrame:
```python
""" Add Age to the DataFrame """
df['Age'] = Age

""" Print out Age and DOB side by side """
df[['Age', 'DOB']]
```
Age has been added to the DataFrame. It looks like this when we print it out:
Let’s do one last thing with this column: draw a boxplot to check for outliers or invalid values.

Run this command: Box = df.boxplot(column='Age')
You can see the outliers clearly. Let’s find the highest value to see what the problem is.

Run the Maximum = df.Age.max() command. The output will be 176.0
Clearly, nobody is 176 years old!
Using The IQR Method To Remove Outliers
Let’s handle the outliers with the interquartile range (IQR) approach. The idea is the “middle fifty”: keep the central mass of the data and discard the extremes. (Strictly speaking, our code below trims on the 0.1 and 0.95 quantiles rather than the quartiles, but the principle is the same.)
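For reference, the classic IQR fences keep everything within 1.5 × IQR of the quartiles. A minimal sketch of that textbook version, on made-up ages:

```python
import pandas as pd

# Made-up ages; 176 is the implausible outlier
age = pd.Series([25, 30, 34, 37, 41, 45, 51, 58, 64, 176])

q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values inside the fences
valid = age[(age >= lower) & (age <= upper)]
print(valid.max())  # the 176 outlier is gone
```

The fences here work out to roughly 2.5 and 88.5, so every plausible age survives and only 176 is dropped.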
The code below will solve this problem for us:
```python
""" Using the IQR method to remove outliers """
lower_boundary = 0.1
upper_boundary = 0.95
outliers = df.Age.quantile([lower_boundary, upper_boundary])
print(outliers)

""" Ages inside the boundaries are accepted; the rest are rejected """
Accepted = ((outliers.loc[lower_boundary] < df.Age.values)
            & (df.Age.values < outliers.loc[upper_boundary]))
rejected_age = ((df.Age.values < outliers.loc[lower_boundary])
                | (df.Age.values > outliers.loc[upper_boundary]))

valid = df.Age[Accepted]
print(valid)

df['valid'] = valid
df.boxplot(column='valid')
```
Outliers are now gone.
Cleaning Categorical Data (The gender Column)
Let’s now take a look at the gender column. Run the df.gender command to view the data.
Since there will usually be typographical errors within categorical data, you might consider using a boxplot to monitor the characteristics of this column:
```python
""" Use a boxplot to view Age grouped by gender """
box = df.boxplot(column='Age', by='gender')
```
Here’s what you will get:
This data is very inconsistent, and it is very inappropriate to collect data this way. Problems like these typically arise because the survey form fields were filled in manually by users (via free-text boxes) instead of drop-downs with predefined values.
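A quicker way to enumerate the distinct spellings is pandas’ value_counts (a sketch on made-up values mirroring the inconsistencies the boxplot revealed):

```python
import pandas as pd

# Made-up gender entries mirroring the inconsistencies above
gender = pd.Series(['Female', 'Femal', 'F', 'Male', 'M', 'Male', 'U', 'Female'])

# value_counts lists every distinct spelling with its frequency,
# making typos like 'Femal' easy to spot
print(gender.value_counts())
```

Each misspelling shows up as its own category, so you can see at a glance exactly which values need to be merged.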
Anyways, let’s deal with this. Use the code below to replace the inconsistencies with their correct values. This code is based on the output we saw from the boxplot above:
```python
""" We assume Femal to be Female """
Gender = df.gender.replace('Femal', 'Female')
""" Insert into DataFrame """
df['Gender'] = Gender

""" We assume that F is Female """
Gender_1 = df.Gender.replace('F', 'Female')
""" Insert into DataFrame """
df['Gender_1'] = Gender_1

""" We also assume that M is Male """
Main_gender = df.Gender_1.replace('M', 'Male')
""" Insert into DataFrame """
df['Main_gender'] = Main_gender

""" Finally, replace U with NaN; we're not really sure what it is """
Formatted_gender = df.Main_gender.replace('U', np.nan)
""" Insert into DataFrame """
df['Formatted_gender'] = Formatted_gender

""" Boxplot of the valid ages (computed from Age earlier) grouped by Formatted_gender """
box = df.boxplot(column='valid', by='Formatted_gender')
```
From the boxplot, it is clear that the ages of the females generally lie between 25 and 65 years, with the majority between 37 and 51 years. Similarly, the ages of the males lie between 25 and 64 years, with the majority between 36 and 51 years.
Now we’ve formatted our gender and age variables correctly. For gender, we now have a valid format of Female and Male. Remember that we replaced the “U” entries with “NaN”? Let’s now check for null values in our gender column:
```python
""" Missing values for Formatted_gender """
Missing = df['Formatted_gender'].isnull().sum()
print('Missing gender is now', Missing)
```
The output of this will be: Missing gender is now 88
Let’s apply some finishing touches by filling up the NaN values:
```python
""" Fill the NaN with the mode and view the missing values again """
df['Formatted_gender'].fillna(df['Formatted_gender'].mode()[0], inplace=True)
Missing = df['Formatted_gender'].isnull().sum()
print('Missing gender is now', Missing)
```
Now, our printed output will be: Missing gender is now 0
This means we are now okay: our data is clean (at least for our specific columns of interest).
Tip: You can use the command df.drop(column_name, inplace=True, axis=1) to drop columns you no longer need while formatting your data set.
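For instance, you could drop the intermediate helper columns created during the gender cleanup (a sketch on a toy DataFrame; the column names mirror the ones used above):

```python
import pandas as pd

# Toy DataFrame with the helper columns from the gender cleanup
df = pd.DataFrame({
    'Gender': ['Female'],
    'Gender_1': ['Female'],
    'Main_gender': ['Female'],
    'Formatted_gender': ['Female'],
})

# axis=1 targets columns; inplace=True mutates df directly
df.drop(['Gender', 'Gender_1', 'Main_gender'], inplace=True, axis=1)
print(df.columns.tolist())  # only Formatted_gender remains
```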
Depending on how creative you are, there are several methods of cleaning and sorting data. The process demonstrated in this article shows you how to get hands-on with your data.
Working with raw data can be tedious, but it is satisfying to finally have your data properly formatted.