The Art of Exploratory Data Analysis (Part 2)
Tagged: Data science, EDA
 This topic has 0 replies, 1 voice, and was last updated 1 week, 3 days ago by Simileoluwa.


In Part 1 of this series, I introduced the concept of Exploratory Data Analysis (EDA) and we analyzed Numerical data in R. In this second part, we will discuss Categorical datatypes and then analyze Categorical and Numerical data together.
What are Categorical datatypes?
They are datatypes whose values fall into a limited set of groups. Examples of Categorical variables are race, sex, age group, and educational level. Categorical data can also be coded with numbers (for example, 1 for female and 0 for male). Note that these numbers are only labels with no mathematical meaning, unlike Numerical datatypes, which allow arithmetic such as sums and differences. Categorical data has two subtypes: Nominal and Ordinal.
 Nominal Data: This datatype has no intrinsic ordering to its categories. A typical example is gender with two values (Male or Female), neither of which is greater than the other. Because Nominal data has no order, changing the order in which its values are listed does not change their meaning.
 Ordinal Data: An Ordinal scale is one where the order of the values matters but the difference between them is not defined. This type of data can express ranks; examples include survey questions with answers such as strongly disagree, disagree, agree, and strongly agree. The answers rank the possible responses, but the gap between adjacent ranks is not necessarily equal.
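In R, the Nominal/Ordinal distinction can be expressed with unordered and ordered factors. The snippet below is a minimal sketch using made-up survey answers (the vectors gender and opinion are invented for illustration and are not part of the toy dataset):

```r
# Hypothetical survey answers, invented for illustration only
gender <- c("Male", "Female", "Female", "Male")
opinion <- c("agree", "strongly disagree", "strongly agree", "disagree", "agree")

# Nominal: an unordered factor; the levels are just labels
gender_f <- factor(gender)

# Ordinal: an ordered factor with an explicit ranking of the levels
opinion_f <- factor(
  opinion,
  levels = c("strongly disagree", "disagree", "agree", "strongly agree"),
  ordered = TRUE
)

is.ordered(gender_f)
is.ordered(opinion_f)

# Rank comparisons are only defined for the ordered factor
opinion_f >= "agree"

# Counts are listed in rank order rather than alphabetical order
table(opinion_f)
```

Declaring the level order explicitly matters because R would otherwise sort the levels alphabetically, which rarely matches the intended ranking.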
Categorical Data Analysis with R
Earlier, in the first part of this series, we performed some EDA on solely Numerical datatypes. In this part we will see how to use EDA to draw insights from Categorical datatypes using the toy dataset downloadable here. It is a fictional dataset containing 150000 rows and 6 columns, generated to be suitable for different kinds of analytical practice.
First, load the libraries required for this exercise:

```r
library(tidyverse)
library(polycor)
library(reshape2)
```

Set your working directory to where your data is kept and load the dataset of interest:

```r
#setwd()

#Load in the dataset and store it in a variable
toy_dataset <- read.csv('toy_dataset.csv')

#We will keep a copy of the original dataset
toy_copy <- toy_dataset

#Examine structure of dataset
str(toy_dataset)
```

We have Gender and Illness columns which, alongside City, should be Categorical variables.
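One way to check which columns are candidates for conversion is to look at each column's class and its number of distinct values. The sketch below uses a small made-up data frame (demo, with invented values) rather than the real file, so that it runs on its own; only the column names are borrowed from the post:

```r
# Small stand-in for the toy dataset; the values are invented for illustration
demo <- data.frame(
  City = c("Dallas", "New York", "Los Angeles", "New York"),
  Gender = c("Male", "Female", "Female", "Male"),
  Income = c(40367, 45084, 52483, 61850),
  Illness = c("No", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Class of every column: text columns show up as "character"
col_classes <- sapply(demo, class)
col_classes

# Character columns with few distinct values are candidates for factors
n_distinct <- sapply(
  demo[col_classes == "character"],
  function(x) length(unique(x))
)
n_distinct
```

On the real data the same two calls can be pointed at toy_dataset instead of demo.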
Note that all of these Categorical variables are of Nominal type with no particular ordering. We start by recoding the Illness and Gender columns as numbers:

```r
#For Illness column
toy_dataset$Illness <- plyr::mapvalues(toy_dataset$Illness, from = c("Yes", "No"), to = c(1, 0))

#For Gender
toy_dataset$Gender <- plyr::mapvalues(toy_dataset$Gender, from = c("Male", "Female"), to = c(1, 0))

#Examine structure of data now
str(toy_dataset)
```

Notice that we did not attach the plyr package with library(). Some of its functions clash with functions of the same name in dplyr, which is part of the tidyverse, so to avoid errors we call it with plyr:: only where we need it.
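If you would rather avoid the plyr dependency altogether, the same recoding can be sketched in base R with a named lookup vector. This is an alternative to the approach used in this post, not a replacement for it, and the input vectors below are invented for illustration:

```r
# Invented values standing in for the Illness and Gender columns
illness <- c("Yes", "No", "No", "Yes", "No")
gender <- c("Male", "Female", "Female", "Male", "Male")

# A named vector acts as a lookup table:
# the names are the old values, the elements are the new codes
illness_codes <- c(Yes = 1, No = 0)
gender_codes <- c(Male = 1, Female = 0)

# Index the lookup table by the old values; unname() drops the leftover names
illness_num <- unname(illness_codes[illness])
gender_num <- unname(gender_codes[gender])

illness_num
gender_num
```

One difference to keep in mind: this version returns numeric vectors, so the codes still need to be passed through factor() before being treated as categories.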
In the mapvalues() calls above, I recoded the grouping variables as numbers, so that Yes becomes 1 and No becomes 0; the same applies to the Gender column (Male = 1, Female = 0). As defined earlier, Nominal variables can take on numbers such as 1 and 0, but these carry no mathematical implication. The values have changed, but the datatype still appears to be character, so we will now convert the columns to factor variables:

```r
#Convert to factor variables
#Save data in new variable
toys <- toy_dataset

toys$Gender <- factor(toys$Gender)
toys$Illness <- factor(toys$Illness)
toys$City <- factor(toys$City)

#Examine structure
str(toys)
```

Now we will apply some EDA techniques suited to Categorical datatypes. Let's build a dataframe containing only these Categorical columns:

```r
#Save only Categorical datatypes
toys_cat <- toys[, c(2,3,6)]
```

From here on, we apply the various visualization techniques available to us.
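Before plotting, it can help to look at the raw counts that a bar chart of a Categorical column would display. A minimal base R sketch, using an invented city vector rather than the real City column:

```r
# Invented values standing in for one Categorical column
city <- factor(c("New York", "Los Angeles", "New York",
                 "Dallas", "New York", "Los Angeles"))

# Absolute counts per category, the numbers a bar chart would show
city_counts <- table(city)
city_counts

# Relative frequencies (proportions that sum to 1)
city_props <- prop.table(city_counts)
round(city_props, 2)

# Name of the most frequent category
names(which.max(city_counts))
```

On the real data, table(toys_cat$City) would give the counts behind the first bar chart.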

```r
#First we check the distribution of cities
ggplot(toys_cat) +
  geom_bar(mapping = aes(City, fill = City)) +
  coord_flip() +
  labs(title = "Distribution of Cities")
```

Gender distribution:

```r
#For Gender
ggplot(toys_cat) +
  geom_bar(mapping = aes(Gender, fill = Gender)) +
  labs(title = "Distribution of Gender",
       subtitle = "0 = Female, 1 = Male")
```

Illness distribution:

```r
#For Illness
ggplot(toys_cat) +
  geom_bar(mapping = aes(Illness, fill = Illness)) +
  labs(title = "Distribution of Illness",
       subtitle = "0 = No, 1 = Yes")
```

Let's make some count plots. We will look at the distribution of Illness across cities in terms of covariation, the tendency of two variables to vary together:

```r
#Covariation between City and Illness
ggplot(toys_cat, aes(x = City, y = Illness)) +
  geom_count() +
  labs(title = "Illness Across Cities")
```

The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.
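The counts behind a plot like this can also be inspected numerically with a contingency table. The sketch below goes one step further and computes a chi-squared test and Cramer's V as a rough measure of association; these are additions for illustration, not steps from the original analysis, and the data is invented:

```r
# Invented data: illness status recorded in two cities
city <- c(rep("New York", 50), rep("Los Angeles", 50))
illness <- c(rep("Yes", 10), rep("No", 40),   # New York
             rep("Yes", 12), rep("No", 38))   # Los Angeles

# Contingency table: one cell per (City, Illness) combination,
# the same counts that geom_count() draws as circle sizes
tab <- table(City = city, Illness = illness)
tab

# Row-wise proportions: share of Yes/No within each city
prop.table(tab, margin = 1)

# Chi-squared test of independence between the two variables
test <- chisq.test(tab, correct = FALSE)
test$p.value

# Cramer's V: 0 means no association, 1 means perfect association
cramers_v <- sqrt(unname(test$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
cramers_v
```

A value of Cramer's V close to 0, as in this made-up example, suggests that illness rates barely differ between the cities.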

```r
#Distribution of Gender across cities
ggplot(toys_cat, aes(x = City, y = Gender)) +
  geom_count() +
  labs(title = "Gender Across Cities") +
  theme(axis.text.x = element_text(angle = 20))
```

From here we can see that Los Angeles and New York have larger populations than the other cities, and therefore higher counts of both Genders.

```r
#We will examine the distribution of illness across genders
ggplot(toys_cat, aes(x = Illness, y = Gender)) +
  geom_count() +
  labs(title = "Illness Across Gender")
```

Interestingly, we can also choose not to convert the variables to numbers and work with the original labels instead. We will convert them to factors directly, without the numeric recoding we did earlier with the plyr package:

```r
#We will use the copy of the dataset we kept earlier on
toy_copy$City <- factor(toy_copy$City)
toy_copy$Gender <- factor(toy_copy$Gender)
toy_copy$Illness <- factor(toy_copy$Illness)
```

Replicating a few of the earlier plots, we will examine the distribution of Illness across Genders and of Genders across Cities:

```r
#We will examine the distribution of Illness across Genders
ggplot(toy_copy, aes(x = Illness, y = Gender)) +
  geom_count() +
  labs(title = "Illness Across Gender")

#Distribution of Genders across Cities
ggplot(toy_copy, aes(x = City, y = Gender)) +
  geom_count() +
  labs(title = "Gender Across Cities")
```

You can see an improvement in readability, since the axes now show the original labels rather than 0 and 1. We can also view this covariation with a tile plot:
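
```r
#Count each City/Gender combination, then map the count to the tile colour
toy_copy %>%
  count(City, Gender) %>%
  ggplot(mapping = aes(x = City, y = Gender)) +
  geom_tile(mapping = aes(fill = n))
```

This gives a more appealing visualization. R also has packages for examining correlation within a dataframe that mixes Categorical and Numerical columns. The polycor package is useful here; its hetcor() function computes a heterogeneous correlation matrix. Note that hetcor() treats factors as Ordinal, so coefficients involving a Nominal variable with more than two levels, such as City, should be read with caution: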

```r
#Load in the required package for this exercise
library(polycor)

df_cor <- hetcor(toy_copy) #Insert the entire dataframe in this function

df_cor$correlations %>% #Extract the correlation results
  reshape2::melt() %>% #Convert to a dataframe
  ggplot(., aes(Var1, Var2, fill = value)) + #Pipe in ggplot
  geom_tile() +
  scale_fill_gradient(low = "grey", high = "darkred") +
  geom_text(aes(Var1, Var2, label = round(value,2)), size = 2) +
  labs(title = "Correlation Matrix", x = "Numeric column", y = "Numeric Column",
       fill = "Coefficient Range") +
  theme(axis.text.x = element_text(vjust = 0.5, angle = 45),
        plot.title = element_text(face = "bold", hjust = 0.5))
```

Let us now make some interesting plots using both Categorical and Numerical variables.
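Plots that combine a Categorical and a Numerical variable pair well with a quick numeric summary per group. The base R sketch below uses an invented data frame (demo) with the same Gender and Income column names as the toy dataset; the numbers are made up:

```r
# Invented incomes for two groups
demo <- data.frame(
  Gender = c("Male", "Female", "Male", "Female", "Male", "Female"),
  Income = c(52000, 48000, 61000, 50000, 58000, 46000)
)

# Mean income per gender
mean_income <- tapply(demo$Income, demo$Gender, mean)
mean_income

# Median income per gender, returned as a data frame
median_income <- aggregate(Income ~ Gender, data = demo, FUN = median)
median_income
```

The medians computed this way correspond to the centre lines that a boxplot of Income by Gender would draw.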

```r
#A boxplot comparing Income across genders
ggplot(toy_copy, aes(Gender, Income, fill = Gender)) +
  geom_boxplot() +
  labs(title = "Income plot by Gender")
```

```r
#Boxplot of income across Cities
ggplot(toy_copy, aes(City, Income, fill = City)) +
  geom_boxplot() +
  labs(title = "Income plot by City") +
  coord_flip()
```

Aside from boxplots, histograms and density plots are also great for visualizing relationships between Numerical and Categorical datatypes:

```r
#Distribution of Income over gender using a histogram
ggplot(toy_copy, aes(Income, fill = Gender)) +
  geom_histogram() +
  labs(title = "Income Distribution by Gender")
```

```r
#Distribution of Income over city using a density plot
ggplot(toy_copy, aes(Income, fill = City)) +
  geom_density() +
  labs(title = "Income Distribution by City")
```

```r
#Distribution of Income over Gender using a density plot
ggplot(toy_copy, aes(Income, fill = Gender)) +
  geom_density() +
  labs(title = "Income Distribution by Gender")
```

Conclusion
When it comes to the art of EDA, it is very important to know the wide variety of tools available and to spend time practicing with all kinds of datasets. Real-world problems differ from one to the next and you will hardly face the same problem twice, so you must be flexible enough to adapt the concepts you have learnt to the problems you encounter. Remember, it is an art and thus requires mastery, not a one-size-fits-all design.
