Technical Notes Of Ehi Kioya

The Art of Exploratory Data Analysis (Part 2)

Tagged: Data science, EDA

  • February 8, 2020 at 8:03 pm #85456
    @simileoluwa

    In Part 1 of this series, I introduced the concept of Exploratory Data Analysis (EDA) and we analyzed Numerical datasets in R. In this second part, we will discuss Categorical datatypes and then analyze both Categorical and Numerical datasets.

    What are Categorical datatypes?

    They are datatypes whose values fall into a limited set of groups. Examples of Categorical variables are race, sex, age group, and educational level. Categorical data can also be coded with numbers (for example, 1 for female and 0 for male). Note that those numbers have no mathematical meaning, unlike Numerical datatypes, which allow arithmetic such as sums and averages. Categorical data has two subtypes: Nominal and Ordinal.

    1. Nominal Data: This datatype has no intrinsic ordering to its categories. A typical example is gender with two values (Male or Female), neither of which is greater than the other. Because nominal data has no order, changing the order of its values does not change the meaning.
    2. Ordinal Data: An ordinal scale is one where the order matters but the difference between values does not. This type of data can show ranks; examples include questions with answers such as strongly disagree, disagree, agree, strongly agree. The answers rank the possible responses available to respondents without saying how far apart they are.
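
    As a rough illustration of the two subtypes, the sketch below builds one unordered and one ordered factor in base R. The vectors gender, answers and likert are made-up names and values, not part of the toy dataset.

```r
# Nominal: an unordered factor; the levels have no ranking
gender <- factor(c("Male", "Female", "Female", "Male"))

# Ordinal: an ordered factor; the levels argument fixes the ranking
answers <- c("agree", "strongly disagree", "strongly agree", "disagree")
likert <- factor(
  answers,
  levels = c("strongly disagree", "disagree", "agree", "strongly agree"),
  ordered = TRUE
)

is.ordered(gender)   # FALSE
is.ordered(likert)   # TRUE

# Comparisons are only meaningful for the ordered factor
likert[1] > likert[4]   # TRUE: "agree" ranks above "disagree"
```

    Comparing two values of the unordered factor with > would return NA with a warning, which is R's way of saying the question has no answer for Nominal data.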

    Categorical Data Analysis with R

    In the first part of this series we performed EDA on solely Numerical datatypes. In this part we will see how to use EDA to draw insights from Categorical datatypes using the toy dataset downloadable here. It is a fictional dataset containing 150000 rows and 6 columns, generated to be suitable for different kinds of analytical practice.

    Load the libraries required for this exercise:

    library(tidyverse)
    library(polycor)
    library(reshape2)

    Set your working directory to where your data is kept and load in the dataset of interest:

    #setwd()
    #Load in the dataset; stringsAsFactors = FALSE keeps text columns as character in every R version
    toy_dataset <- read.csv('toy_dataset.csv', stringsAsFactors = FALSE)

    #We will keep a copy of the original dataset
    toy_copy <- toy_dataset

    #Examine structure of dataset
    str(toy_dataset)

    The Gender and Illness columns should be Categorical variables, alongside City.

    Note that all the Categorical variables are of Nominal type with no particular ordering. We first recode the values of the Illness and Gender columns as 1 and 0:

    #For illness column
    toy_dataset$Illness <- plyr::mapvalues(toy_dataset$Illness, from = c("Yes", "No"), to = c(1, 0))
     
    #For Gender
    toy_dataset$Gender <- plyr::mapvalues(toy_dataset$Gender, from = c("Male", "Female"), to = c(1, 0))
     
    #Examine structure of data now
    str(toy_dataset)

    Notice that we didn’t load the plyr package explicitly with library(). This is because some of its functions clash with those of the dplyr package in the tidyverse. To avoid errors, we call it with the plyr:: prefix only when we need it.

    As defined earlier, Nominal variables can take on numbers such as 1 and 0 with no mathematical implication. I converted the grouping variables to numbers such that Yes takes 1 and No takes 0. The same applies to the Gender column, where Male takes 1 and Female takes 0. The values have changed but the datatype is still character, so we will now convert the columns to factor variables:

    #Save data in a new variable
    toys <- toy_dataset

    #Convert to factor variables
    toys$Gender <- factor(toys$Gender)
    toys$Illness <- factor(toys$Illness)
    toys$City <- factor(toys$City)

    #Examine structure of data now
    str(toys)
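
    The recode and the factor conversion can also be done in a single step with base R. This is only a sketch of an alternative, not the approach used in this post: the illness vector below holds made-up sample values standing in for the Illness column.

```r
# Hypothetical sample values standing in for the Illness column
illness <- c("No", "Yes", "No", "No", "Yes")

# levels = the values found in the data, labels = the codes to show instead
illness_f <- factor(illness, levels = c("No", "Yes"), labels = c("0", "1"))

str(illness_f)     # a factor with 2 levels: "0" and "1"
table(illness_f)   # number of observations per level
```

    Because the labels are attached to the levels, the result is a factor straight away and there is no intermediate character column.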

    Now we will apply some EDA techniques specific to Categorical datatypes. Let's build a dataframe containing only these Categorical columns:

    #Keep only the Categorical columns
    toys_cat <- toys[, c("City", "Gender", "Illness")]

    From here on, we apply various visualization techniques available to us:

    #First we check the distribution of cities
    ggplot(toys_cat) +
    geom_bar(mapping = aes(City, fill = City)) +
    coord_flip() +
    labs(title = "Distribution of Cities")

    Gender distribution:

    #For Gender
    ggplot(toys_cat) +
    geom_bar(mapping = aes(Gender, fill = Gender)) +
    labs(title = "Distribution of Gender",
    subtitle = "0 = Female, 1 = Male")

    Illness distribution:

    #For Illness
    ggplot(toys_cat) +
    geom_bar(mapping = aes(Illness, fill = Illness)) +
    labs(title = "Distribution of Illness",
    subtitle = "0 = No, 1 = Yes")

    Let's make some count plots. We will look at the distribution of Illness across Cities using covariation, the tendency for the values of two variables to vary together:

    #Covariation between City and Illness
     
    ggplot(toys_cat, aes(x = City, y = Illness)) +
    geom_count() + labs(title = "Illness Across Cities")

    The size of each circle in the plot shows how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.
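
    The counts behind those circles can also be read as a plain table. The sketch below uses a small made-up data frame (demo, with invented City and Illness values) in place of toys_cat, so the numbers are for illustration only.

```r
# Hypothetical stand-in for toys_cat (values invented for illustration)
demo <- data.frame(
  City    = c("Dallas", "Dallas", "Boston", "Boston", "Boston", "Dallas"),
  Illness = c("0", "1", "0", "0", "1", "0")
)

# Cross-tabulate: one cell per (City, Illness) combination
counts <- table(demo$City, demo$Illness)
counts

# Row proportions: share of each Illness value within a City
prop.table(counts, margin = 1)
```

    Each cell of counts corresponds to one circle in a count plot. The row proportions are often easier to compare than raw counts when the groups differ a lot in size.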

    The same kind of plot for the distribution of Gender across Cities:

    #Distribution of Gender across cities
    ggplot(toys_cat, aes(x = City, y = Gender)) +
    geom_count() + labs(title = "Gender Across Cities")+
    theme(axis.text.x = element_text(angle = 20))

    From here we can see that Los Angeles and New York have larger populations than the other cities and higher counts of both Genders.

    #We will examine the distribution of Illness across Genders
    ggplot(toys_cat, aes(x = Illness, y = Gender)) +
    geom_count() + labs(title = "Illness Across Gender")

    Interestingly, we can decide not to convert the variables to numbers and just work with the original labels. We will convert them to factors without the number recoding we did earlier with the plyr package:

    #We will use the copy of the dataset we kept earlier on
    toy_copy$City <- factor(toy_copy$City)
    toy_copy$Gender <- factor(toy_copy$Gender)
    toy_copy$Illness <- factor(toy_copy$Illness)

    Replicating a few of the earlier plots, we will examine the distribution of Illness across Genders and Genders across Cities:

    #We will examine the distribution of Illness across Genders
    ggplot(toy_copy, aes(x = Illness, y = Gender)) +
    geom_count() + labs(title = "Illness Across Gender")
     
    #Distribution of Genders across Cities
    ggplot(toy_copy, aes(x = City, y = Gender)) +
    geom_count() + labs(title = "Gender Across Cities")

    You can see an improvement in readability. We can also view this covariation with a tile plot:

    toy_copy %>%
    count(City, Gender) %>% #Count observations per City and Gender pair
    ggplot(mapping = aes(x = City, y = Gender)) +
    geom_tile(mapping = aes(fill = n)) #Fill each tile by its count

    This gives a more appealing visualization. There are also techniques and packages in R which allow us to examine the correlation between the columns of a dataframe containing both Categorical and Numerical datatypes. The polycor package is useful for this:

    #Load in the required package for this exercise
     
    library(polycor)
     
    df_cor <- hetcor(toy_copy) #Insert the entire dataframe in this function
     
    df_cor$correlations %>% #Extract the correlation results
    reshape2::melt() %>% #Convert to a dataframe
    ggplot(., aes(Var1, Var2, fill = value)) + #Pipe in ggplot
    geom_tile() +
    scale_fill_gradient(low = "grey", high = "darkred") +
    geom_text(aes(Var1, Var2, label = round(value,2)), size = 2)+
    labs(title = "Correlation Matrix", x = "Variable", y = "Variable",
    fill = "Coefficient Range") +
    theme(axis.text.x = element_text(vjust = 0.5, angle = 45),
    plot.title = element_text(face = "bold", hjust = 0.5))
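
    To see what the melt step does to the correlation matrix, here is a base R sketch of the same wide-to-long reshaping. It uses as.data.frame(as.table()) rather than reshape2::melt(), and the matrix m holds made-up values, not output from hetcor().

```r
# Hypothetical 2 x 2 correlation matrix (made-up values)
m <- matrix(c(1.0, 0.3,
              0.3, 1.0),
            nrow = 2,
            dimnames = list(c("Age", "Income"), c("Age", "Income")))

# One row per cell: Var1 = row name, Var2 = column name, value = coefficient
long <- as.data.frame(as.table(m), responseName = "value")
long
```

    The long format gives ggplot one row per tile, which is why Var1, Var2 and value can be mapped directly to x, y and fill.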

    Let us make some interesting plots using both Categorical and Numerical variables:

    #A boxplot comparing Income across genders
     
    ggplot(toy_copy, aes(Gender,Income, fill = Gender)) +
    geom_boxplot() + labs(title = "Income plot by Gender")

    #Boxplot of income across Cities
     
    ggplot(toy_copy, aes(City,Income, fill = City)) +
    geom_boxplot() + labs(title = "Income plot by City") + coord_flip()
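
    A boxplot is a picture of a few summary numbers per group, and it can help to print those numbers next to the plot. The sketch below assumes a small made-up data frame (demo, with invented Gender and Income values) rather than toy_copy.

```r
# Hypothetical stand-in for toy_copy (values invented for illustration)
demo <- data.frame(
  Gender = factor(c("Male", "Female", "Male", "Female", "Male", "Female")),
  Income = c(52000, 41000, 61000, 45000, 58000, 39000)
)

# Median Income per Gender: the line drawn inside each box
med <- tapply(demo$Income, demo$Gender, median)
med

# Five-number summary per group: min, lower hinge, median, upper hinge, max
aggregate(Income ~ Gender, data = demo, FUN = fivenum)
```

    The hinges from fivenum() are close to, but not always identical to, the quartiles that geom_boxplot() draws, so treat the table as a companion to the plot rather than an exact readout of it.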

    Aside from boxplots, histograms and density plots are also great for visualizing relationships between Numerical and Categorical datatypes:

    #Distribution of Income over gender using a histogram
    ggplot(toy_copy, aes(Income, fill = Gender)) +
    geom_histogram() + labs(title = "Income Distribution by Gender")

    #Distribution of Income over city using a density plot
    ggplot(toy_copy, aes(Income, fill = City)) +
    geom_density() + labs(title = "Income Distribution by City")

    #Distribution of Income over Gender using a density plot
    ggplot(toy_copy, aes(Income, fill = Gender)) +
    geom_density() + labs(title = "Income Distribution by Gender")

    Conclusion

    When it comes to the art of EDA, it is very important to know the wide variety of tools available and to spend time practicing with all kinds of datasets. Real-world problems usually differ from one to the next, and you hardly face the same problem twice. You must therefore be flexible enough to adapt the concepts you have learnt to the problems you encounter. Remember, EDA is an art: it requires mastery, not a one-size-fits-all design.
