Tagged: Data Analysis, Python, Statistics
- This topic has 0 replies, 1 voice, and was last updated 10 months, 2 weeks ago by
Simileoluwa.
- March 1, 2020 at 4:21 pm, #86945, Participant @simileoluwa
Descriptive Statistics is an integral part of any Data Analysis procedure; its concepts help analysts summarize data and, in turn, support better business decisions. Statistics in general is defined as the study of the collection, analysis, interpretation, presentation, and organization of data. Raw data is hard to draw inferences from at a glance, and this is where Descriptive Statistics comes in: it provides methods for presenting data in a meaningful way and eases interpretation. We can classify its importance into two parts:
- To provide basic information about the characteristics of the variables in a dataset.
- To identify existing relationships amidst variables in a dataset.
In a world of Big Data, it is becoming increasingly important for organizations to derive meaning from their voluminous datasets, and Descriptive Statistics is one of the most basic procedures for doing so. Several kinds of measures are used, along with some visual methods, all aimed at providing insights into a dataset:
- Measures of Central Tendency: These are the most widely used, and perhaps the most informative, measures for describing the characteristics of a population. There are three notable Measures of Central Tendency:
  - Mean: The sum of the values of a variable, divided by the number of values.
  - Median: The middle value of a variable once its values are sorted.
  - Mode: The most frequently occurring value of a variable.
- Visualization Techniques: The most prominent are Histograms, Density Plots, and Boxplots. Visualizations are an efficient way to summarize data and to identify patterns, outliers, and so on.
- Measures of Dispersion and Frequency: These measures show how spread out the data is and how often a value or response occurs in a dataset. Some of the prominent measures include:
  - Range: The difference between the highest and lowest values.
  - Frequency: The number of times an event occurs or a response is given.
  - Standard Deviation: A measure of the amount of variation or dispersion of a set of values around their mean.
- Measures of Position: These include Percentiles and Quantile Ranges.
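As a quick illustration of the measures listed above, here is a minimal sketch on a small, made-up list of prices. The numbers are invented for illustration and do not come from the Melbourne dataset; only Python's standard `statistics` module is used.

```python
import statistics

# Hypothetical prices (in thousands); invented purely for illustration
prices = [600, 600, 750, 830, 1200]

mean_price = statistics.mean(prices)      # sum of values / number of values
median_price = statistics.median(prices)  # middle value once sorted
mode_price = statistics.mode(prices)      # most frequently occurring value
price_range = max(prices) - min(prices)   # highest minus lowest
std_price = statistics.stdev(prices)      # sample standard deviation

print(mean_price, median_price, mode_price, price_range)
print(round(std_price, 2))
```

Note that `statistics.stdev` computes the sample standard deviation, which is also what pandas' `Series.std()` does by default.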
Implementing Descriptive Statistics in Python
We will address the four different Descriptive Statistics measures using a dataset that can be downloaded from Kaggle. The “MELBOURNE_HOUSE_PRICES_LESS.csv” dataset contains 13 columns and 63023 rows; it provides details of houses in Melbourne and how prices differ based on the location and characteristics of the house. To implement the techniques listed above in Python, we first need to set up our libraries:
```python
# Import the necessary libraries
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
```

Lastly, for our setup, we want our Jupyter Notebook to show all results, not just the last expression in each cell, so we will run the code below:
```python
# Display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
```

The downloaded dataset needs to be loaded into our session. Once it is loaded, we will also examine its contents, as shown below:
```python
# Load the dataset (adjust the path to wherever the file was downloaded)
df = pd.read_csv(r"C:\Users\SIMIYOUNG\Downloads\MELBOURNE_HOUSE_PRICES_LESS.csv")

# Examine the structure of the dataset
print(df)

# Print the column names
df.columns

# Examine the tail of the dataset
df.tail()
```
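Printing a large DataFrame in full can be unwieldy. As a hedged sketch, the snippet below uses a small made-up DataFrame (the column names mirror ones used later in this tutorial, but the values are invented) to show how `shape` and `dtypes` give a compact view of the structure.

```python
import pandas as pd

# Toy data standing in for the real CSV; all values are invented
toy = pd.DataFrame({
    "Rooms": [2, 3, 3, 4],
    "Type": ["u", "h", "h", "t"],
    "Price": [600000.0, 830000.0, None, 1200000.0],
})

n_rows, n_cols = toy.shape   # (number of rows, number of columns)
column_types = toy.dtypes    # data type of each column

print(n_rows, n_cols)
print(column_types)
```

The same two attributes can be read from `df` once the real file is loaded.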
Examining the dataset shows us the existing columns and the number of rows and columns. An important step in any analytical process is to determine whether there are missing values in the dataset, which is easily done in Python:

```python
# Count the missing values in each column
df.isnull().sum()

# Visualize where the missing values fall
sn.heatmap(df.isnull(), cbar=False)
```
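To make the output of the missing-value check concrete, here is a minimal sketch on an invented toy DataFrame (not the Melbourne data). It counts the missing entries per column and, as an optional extra, the share of missing entries.

```python
import pandas as pd

# Toy data with two missing prices; all values are invented
toy = pd.DataFrame({
    "Rooms": [2, 3, 3, 4],
    "Price": [600000.0, None, 830000.0, None],
})

missing_per_column = toy.isnull().sum()   # number of NaN values per column
missing_share = toy.isnull().mean()       # fraction of NaN values per column

print(missing_per_column)
print(missing_share)
```

Taking the mean of a boolean mask works because `True` counts as 1 and `False` as 0.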
Examining the dataset reveals 14590 missing values, all of them in the Price column. There are many ways to deal with missing values, but that is not the focus of this tutorial. We will adopt the simple method of dropping all rows containing missing values in our dataset, using the function provided in the pandas library:

```python
# Drop all rows containing NaN (missing values)
df = df.dropna()
```

We have dropped the rows containing missing prices. The Measures of Central Tendency can be implemented as shown in the code below:
```python
# Mean
df["Price"].mean()

# Median
df["Price"].median()

# Mode
df["Price"].mode()
```

The results for these measures are 997898.2414882415, 830000.0, and 600000.0 respectively. We can now move on to the visual methods, which include, but are not limited to, Histograms, Density Plots, and Boxplots:
```python
# Visualizing price using a histogram
# (histplot needs seaborn 0.11 or later; older releases used distplot)
sn.histplot(df["Price"], kde=False)
plt.title("Histogram for Prices")
plt.show()
```

```python
# Visualizing using a density plot
# (fill=True replaces the older shade=True argument)
sn.kdeplot(data=df["Price"], fill=True)
plt.title("Density Plot for Prices")
plt.show()
```

A Boxplot is usually very effective for visualizing the relationship between a categorical and a numeric variable. The Type column is categorical and the Price column is numeric, so we can use a Boxplot to see how the Price of houses varies across the Type of houses available.
```python
# Boxplot to see how price varies across housing types
sn.boxplot(x="Type", y="Price", data=df, orient="v")
plt.title("Type vs Price")
plt.show()
```

From the resulting plot, we can easily determine which housing type has the highest prices. The third group of Descriptive Statistics methods is the Measures of Dispersion and Frequency, some of which can be implemented as shown below:
```python
# Standard deviation
df["Price"].std()

# Assessing the range: the highest and lowest prices, then their difference
df["Price"].max()
df["Price"].min()
df["Price"].max() - df["Price"].min()

# Frequency count: how many houses have each number of rooms
rooms = df["Rooms"].value_counts()
```

In the code chunk above, we have assessed the standard deviation, obtained the range using the min and max functions, and executed a frequency count that tells us which number of rooms is most common across the houses in Melbourne. We can also visualize the rooms variable we created using a bar graph for ease of inference:
```python
# Visualizing the most common room counts
rooms.plot.bar()
plt.title("Most common room types")
plt.xlabel("Room Numbers")
plt.ylabel("Count")
plt.show()
```

Another useful frequency count is to find out which region in Melbourne is most represented in the dataset:
```python
# Count plot of the number of houses per region
sn.set(style="darkgrid")
plt.figure(figsize=(10, 5))
ax = sn.countplot(x="Regionname", data=df)
plt.setp(ax.get_xticklabels(), rotation=20)
plt.show()
```

The resulting plot, produced from a frequency count drawn as a bar plot, helps to visualize which region is most heavily represented. The last method mentioned for Descriptive Statistics is the Measures of Position; we can easily compute the quantiles of a column in the dataset:
```python
# Examining the quantiles at 0.1, 0.2, ..., 0.9
import numpy as np
df["Price"].quantile(np.linspace(0.1, 1, 9, endpoint=False))
```

The code above returns the prices at the 0.1 through 0.9 quantiles (the deciles), which divide the distribution of the data into ten equal parts.
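To check how the quantile call behaves, here is a minimal sketch on invented values from 1 to 100 rather than the Melbourne prices. It also computes the interquartile range, one common example of a quantile range; that addition is an illustration and not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Invented values 1..100 so the results are easy to check by eye
values = pd.Series(np.arange(1, 101))

# Probabilities 0.1, 0.2, ..., 0.9 (endpoint=False leaves out 1.0)
probs = np.linspace(0.1, 1, 9, endpoint=False)
deciles = values.quantile(probs)

# Interquartile range: spread of the middle 50% of the data
iqr = values.quantile(0.75) - values.quantile(0.25)

print(deciles)
print(iqr)
```

pandas interpolates linearly between neighbouring values by default, so a quantile need not be a value that actually appears in the data.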
Conclusion
This tutorial has covered the basic procedures of Descriptive Statistics. The main objective was to discuss the various ways of summarizing a dataset and to implement these procedures in Python. It does not cover everything there is to Descriptive Statistics, but it does give an introduction to the subject.