- February 6, 2020 at 7:06 pm #85297 Participant @simileoluwa
We produce large volumes of data daily, and there has to be a way to explore these data for insights and to examine whether they can support further decision-making. Exploratory Data Analysis (EDA) involves procedures (mainly visualization) undertaken so that the underlying and interesting features of a dataset become obvious. EDA is an approach/philosophy for data analysis that employs a variety of techniques to:
• Show underlying patterns and outliers
• Display anomalies
• Understand data structure for accurate modeling.
“Exploratory data analysis” is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there. - John W. Tukey
Unlike statistical modeling, EDA does not make assumptions before examining a dataset. Instead, one explores the dataset and dissects it in detail with no premeditated hypothesis, much like looking for a treasure you have not seen before. Assumptions and models are chosen after the peculiarities of the dataset have been properly understood, which helps ensure that an appropriate model and appropriate assumptions are used. EDA has its own set of techniques for dealing with different datasets and data types. In the first part of this article we will focus on the numerical components of a dataset.
Numerical Data Analysis in R
Before delving into the practical, it is important to understand what numerical data is. This is a type of data that is measurable, e.g. height, rainfall, temperature, weight, etc. One way to identify numerical data is to check whether the values can meaningfully be added together. In fact, you should be able to perform just about any mathematical operation on numerical data. You can also sort the data in ascending (least to greatest) or descending (greatest to least) order.
There are two types of Numerical data namely (1) discrete and (2) continuous.
- Discrete: These values can be counted and take distinct, separate values, e.g. population or age in completed years.
- Continuous: These can be considered the opposite of discrete. They cannot be counted, because they can take any value within a range, e.g. temperature, rainfall. Any recorded value is an approximation limited by the precision of the measuring instrument.
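One rough way to tell the two kinds apart in R is to check whether every value in a numeric column is a whole number. The following is a minimal sketch under some assumptions: it uses R's built-in mtcars data as a stand-in for your own data frame, the helper name is_discrete is hypothetical, and the whole-number test is only a heuristic (a continuous measurement that was rounded to integers would also pass it).

```r
# Hypothetical helper: TRUE when every non-missing value is a whole number
is_discrete <- function(x) {
  x <- x[!is.na(x)]
  is.numeric(x) && all(x == round(x))
}

# mtcars ships with R; it stands in here for any numeric data frame
kinds <- sapply(mtcars, is_discrete)
kinds[c("cyl", "mpg")]
```

Here cyl (a cylinder count) is flagged as discrete, while mpg (a measured fuel economy) is not.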
Remember that there is no one-size-fits-all approach in EDA as we explore a little experimentation in R. Download the dataset for this tutorial here.

    # Load libraries
    library(tidyverse)
    library(corrplot)
    library(gridExtra)
    library(GGally)
    library(knitr)
    library(reshape2)
Set your working directory, load the dataset, and examine its structure:

    # Read in the dataset (assumes Wine.csv is in the working directory)
    Wine <- read.csv("Wine.csv")
    str(Wine)  # Examine the structure, e.g. to spot categorical variables
We will remove the customer segment column, since it is a grouping (categorical) variable:

    # Remove the customer segment column (column 14)
    new_wine <- Wine[, -14]
The Customer Segment column has now been removed, so only numerical data types are left. Some EDA techniques can now be applied:

    # Examine descriptive statistics
    summary(new_wine)
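summary() reports the minimum, quartiles, median, mean and maximum of each column, but not the spread. As a small sketch of how one might add the standard deviation and interquartile range, the block below uses three columns of the built-in mtcars data as a stand-in for the wine data; the names num_data and stats are illustrative only.

```r
# mtcars stands in for the wine data frame (an assumption for this sketch)
num_data <- mtcars[, c("mpg", "hp", "wt")]

# One row per column: mean, median, standard deviation and interquartile range
stats <- t(sapply(num_data, function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
}))
round(stats, 2)
```

A mean that sits far from the median, or a standard deviation that is large relative to the interquartile range, is often a hint of skew or outliers worth plotting.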
Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. Since we have 13 variables, we can compute pairwise correlations and visualize them easily:

    # A correlation matrix
    cor(new_wine) %>%  # Create a pairwise correlation table
      melt() %>%       # Convert it to long data frame form
      ggplot(aes(Var1, Var2, fill = value)) +  # Pipe into ggplot
      geom_tile() +
      scale_fill_gradient(low = "grey", high = "darkred") +
      geom_text(aes(Var1, Var2, label = round(value, 2)), size = 2) +
      labs(title = "Correlation Matrix", x = "Numeric column", y = "Numeric Column",
           fill = "Coefficient Range") +
      theme(axis.text.x = element_text(vjust = 0.5, angle = 45),
            plot.title = element_text(face = "bold", hjust = 0.5))
From the graph we can see a high correlation between:
- Flavanoids and Total_Phenols
- Flavanoids and OD280
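Reading the strongest pairs off a heatmap gets harder as the number of columns grows, so it can help to rank them in a table as well. This is a minimal base R sketch, again using the built-in mtcars data as a stand-in for the wine data; the names cors and cor_pairs are illustrative.

```r
# mtcars stands in for the wine data frame (an assumption for this sketch)
cors <- cor(mtcars)

# Keep each pair once by blanking the diagonal and the lower triangle
cors[lower.tri(cors, diag = TRUE)] <- NA

# Reshape to one row per pair, drop the blanks, sort by absolute correlation
cor_pairs <- as.data.frame(as.table(cors), stringsAsFactors = FALSE)
names(cor_pairs) <- c("var1", "var2", "r")
cor_pairs <- cor_pairs[!is.na(cor_pairs$r), ]
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
head(cor_pairs, 5)
```

Sorting on the absolute value keeps strong negative relationships near the top instead of burying them at the bottom of the list.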
Visualizing these relationships:

    # A regression line is a good way to visualize each relationship on its own
    plot1 <- ggplot(new_wine, aes(Flavanoids, Total_Phenols)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE) +  # Add a regression line
      labs(title = "A Simple Linear Regression",
           subtitle = "Examine Relationship Between Flavanoids and phenol")

    plot2 <- ggplot(new_wine, aes(Flavanoids, OD280)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE) +  # Add a regression line
      labs(title = "A Simple Linear Regression",
           subtitle = "Examine Relationship Between Flavanoids and OD280")

    # We can visualize both plots side by side
    grid.arrange(plot1, plot2, ncol = 2)
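geom_smooth(method = "lm") draws the fitted line but does not report any numbers. If you also want the slope and the strength of the fit, one option is to fit the same model directly with lm(). This is a sketch under assumptions: mtcars stands in for the wine data, with wt and mpg standing in for the two wine columns.

```r
# mtcars stands in for the wine data; wt and mpg stand in for the two columns
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)               # Intercept and slope of the fitted line
summary(fit)$r.squared  # Share of the variance explained by the line

# For a one-predictor model, R-squared equals the squared Pearson correlation
all.equal(summary(fit)$r.squared, cor(mtcars$wt, mtcars$mpg)^2)
```

The last line ties the plot back to the correlation matrix: for a single predictor, the correlation coefficient and the regression fit describe the same linear relationship.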
A histogram is another useful way to visualize a numerical variable. For a single histogram:

    ggplot(new_wine, aes(x = Flavanoids)) +
      geom_histogram(colour = "black", show.legend = FALSE, fill = "orange", bins = 35) +
      labs(title = "Histogram of Flavanoids") +
      theme_bw()
Remember we have 13 columns in our data frame, so does that mean 13 individual plots? Not really: ggplot2 provides a way to create the same plot for different subsets of the data. The facet approach partitions a plot into a matrix of panels, and each panel shows a different subset of the data:

    new_wine %>%
      gather(1:13, key = "Variable", value = "Values") %>%  # Reshape to long form
      ggplot(aes(x = Values, fill = Variable)) +
      geom_histogram(colour = "black", show.legend = FALSE) +
      facet_wrap(~Variable, scales = "free_x") +  # The important line creating the facets
      labs(x = "Values", y = "Frequency",
           title = "Histograms for Individual subsets") +
      theme_bw()
Implementing the same procedure for the density plot:

    # Density plot
    new_wine %>%
      gather(1:13, key = "Variable", value = "Values") %>%  # Reshape to long form
      ggplot(aes(x = Values, fill = Variable)) +
      geom_density(colour = "black", show.legend = FALSE) +
      facet_wrap(~Variable, scales = "free_x") +  # The important line creating the facets
      labs(x = "Values", y = "Density",
           title = "DensityPlot for Individual subsets") +
      theme_bw()
In this first part we have examined how to perform some simple exploratory data analysis. In the next part we will deal with another kind of data: categorical data. As stated earlier, there is no one-size-fits-all in EDA, so master the different tools available and practice to get better.
- February 16, 2020 at 2:08 pm #85947 Participant @ebere
Thank you very much. I’m really interested. Beginner level aspiring Data analyst. I know basic Excel. I would like more tutorial articles in R, Python and SQL. Warm regards
- February 16, 2020 at 4:42 pm #85953 Keymaster @ehi-kioya
Great to see you here @ebere, and welcome to the forums!
I strongly suggest you add more details to your profile so that we can get to know you a little better. For example, it might be a good idea to include a link to your LinkedIn profile in your bio. You may also want to put up a picture of yourself. The picture improves your authenticity, and as for the LinkedIn profile link, you never know where your next opportunity could come from.
Now that you are registered, you can even go ahead and write your own articles to boost your portfolio. Or if you have any specific questions to ask, feel free to do so. If you ask a question and mention me (using @ehi-kioya), I will get notified immediately and I will try to offer assistance quickly (either by answering your question directly or by linking you with someone who can help).
Again, welcome aboard. Looking forward to seeing more from you.