Unsupervised Machine Learning For Data Clustering


February 15, 2020 at 10:56 pm, by @simileoluwa
Unsupervised Machine Learning helps us understand the relationships that exist within a dataset. It is a branch of Machine Learning that learns from data that has not been labeled, classified, or categorized. In a Multiple Linear Regression problem (a Supervised Machine Learning Algorithm), you would typically have independent variables and a response variable. In Unsupervised Learning, however, there are only independent variables and no target class in the dataset; we rely on the Learning Algorithm to find patterns in the data and infer a possible grouping.
Imagine showing a Learning Algorithm a mixed collection of fruits. Although it does not know that there are three types of fruit, it learns by examining the characteristics of each fruit and then groups similar fruits together based on its observations.
Unsupervised Machine Learning Algorithms group unstructured data according to similarities and distinct patterns in the dataset. The Unsupervised Algorithm handles data without prior training: it is a function that does its job with the data at its disposal. The goal of Unsupervised Learning is generally to cluster the data into characteristically different groups; this can be more challenging than Supervised Learning because of the absence of labels. There are various applications of Unsupervised Machine Learning; the main ones include:
 Clustering
 Visualization
 Dimensionality Reduction
 Finding Association Rules
 Anomaly Detection
Importance of Unsupervised Machine Learning:
 It allows a Machine to tackle problems that humans might find insurmountable either due to limited capacity or a bias.
 It is ideal for exploring raw and unknown data.
 Unsupervised methods help you to find features that can be useful for categorization.
 It is often easier to obtain unlabeled data than labeled data, which requires human intervention.
 Unsupervised Learning is very useful in exploratory analysis because it can automatically identify structure in data.
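As a small, self-contained illustration of these ideas, the sketch below uses R's built-in iris dataset (not the wine data used later in this tutorial): we drop the species labels and let kmeans recover the group structure on its own.

```r
# Illustrative sketch: cluster the built-in iris data without its labels
data(iris)
unlabeled <- iris[, 1:4]   # drop the Species column

set.seed(42)
fit <- kmeans(scale(unlabeled), centers = 3, nstart = 25)

# Compare the discovered clusters with the (hidden) species labels;
# a strongly diagonal table means the structure was recovered
table(fit$cluster, iris$Species)
```

The algorithm never sees the Species column, yet the clusters it finds largely line up with the three species.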
Implementing Unsupervised Machine Learning in R
In this tutorial, we will implement an Unsupervised Machine Learning Algorithm called k-means clustering. The dataset for the exercise is from Kaggle and contains the characteristics of different wine types. Using the k-means algorithm, we can assess patterns in the dataset and group the wines based on the observed patterns.
We can set up the libraries required for this process and then take a look at our dataset:

```r
# Load libraries
library(tidyverse)
library(factoextra)
library(skimr)
library(NbClust)
library(gridExtra)

# Set your working directory and read the CSV file
# Load in the dataset
Wine <- read.csv("Wine.csv")

# Some descriptive statistics
descriptive <- skim_to_wide(Wine)
```

After observing the dataset, note that the Customer_Segment column is a classification column. Since we want to build a Learning Algorithm that classifies the dataset based on observed patterns alone, we will leave that column out by executing the code below:
```r
# Remove the Customer_Segment column (the 14th column)
Wine <- Wine[, -14]
```

The purpose of clustering analysis is to identify patterns in your data and create groups according to those patterns. If two points have similar characteristics, they share the same pattern and consequently belong to the same group. By doing clustering analysis we should be able to check which features usually appear together and see what characterizes each group.
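The notion of "similar characteristics" is usually made precise with a distance measure; k-means relies on Euclidean distance. A minimal sketch with made-up attribute values (the numbers are purely illustrative, not taken from the wine dataset):

```r
# Two wines with similar attribute profiles, and one that differs (made-up numbers)
wine_a <- c(Flavanoids = 2.9, Total_Phenols = 2.8)
wine_b <- c(Flavanoids = 3.0, Total_Phenols = 2.7)
wine_c <- c(Flavanoids = 0.5, Total_Phenols = 1.0)

# Euclidean distance between attribute vectors
d_ab <- sqrt(sum((wine_a - wine_b)^2))
d_ac <- sqrt(sum((wine_a - wine_c)^2))

d_ab < d_ac   # TRUE: a and b are closer, so they would tend to share a cluster
```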
To begin this procedure, k-means clustering requires normalization of the variables. This means adjusting values measured on different scales to a common scale. We will normalize the dataset, save it in a new variable, and then plot the original data alongside the scaled data:

```r
# Let us examine this visually
wine_scaled <- as.data.frame(scale(Wine))

# Plot the same two variables from the original and the scaled dataset
p_1 <- ggplot(Wine, aes(Flavanoids, Total_Phenols)) +
  geom_point() +
  labs(title = "Wines Attributes", subtitle = "Original Data")

p_2 <- ggplot(wine_scaled, aes(Flavanoids, Total_Phenols)) +
  geom_point() +
  labs(title = "Wines Attributes", subtitle = "Scaled Data")

# Arrange both plots on the same grid
grid.arrange(p_1, p_2, ncol = 2)
```

Although the scale of the axes changes, the shape of the plots is the same.
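What scale() actually does can be verified on any numeric data: each column is shifted and divided so that it ends up with mean 0 and standard deviation 1. A tiny sketch with a made-up matrix whose columns live on very different scales:

```r
# Two columns on very different scales (made-up numbers)
m <- matrix(c(10, 20, 30, 40,
              1000, 2000, 3000, 4000), ncol = 2)
m_scaled <- scale(m)

round(colMeans(m_scaled), 10)   # both column means are 0
apply(m_scaled, 2, sd)          # both column standard deviations are 1
```

Without this step, variables with large numeric ranges would dominate the distance computation inside k-means.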
Since the dataset is now scaled, we can implement the k-means algorithm on it:

```r
# Execution of kmeans with k = 2
set.seed(1234)
wines_clusters <- kmeans(wine_scaled, centers = 2)
```

We have just built a clustering model; however, we cannot be sure that the data has been grouped accurately, given that we started with two centers, which forces every observation into one of only two groups. Because clustering groups observations by similarity, it is important to find the optimal number of clusters k so that data points are grouped correctly. There are several ways to determine it.
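Before tuning k, it helps to know what a fitted kmeans object contains. The sketch below uses synthetic data with two obvious groups, so it runs without the wine CSV; the field names are part of R's standard kmeans() return value:

```r
# Synthetic data: two well-separated point clouds
set.seed(1234)
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
fit <- kmeans(pts, centers = 2)

fit$cluster        # cluster assignment for each observation
fit$centers        # coordinates of the cluster centers
fit$tot.withinss   # total within-cluster sum of squares (lower means tighter clusters)
fit$size           # number of points in each cluster
```

tot.withinss is the quantity the elbow method below plots against k.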
```r
# Three methods can be used to compute the optimal k:
# wss (elbow), silhouette, and gap statistic
a <- fviz_nbclust(wine_scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2) +
  labs(subtitle = "Elbow method")

b <- fviz_nbclust(wine_scaled, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")

c <- fviz_nbclust(wine_scaled, kmeans, nstart = 25, method = "gap_stat", nboot = 50) +
  labs(subtitle = "Gap statistic method")

# Arrange all three plots for visual inspection
grid.arrange(a, b, c, ncol = 1)
```
Each method suggests an optimal number of clusters; we plot the results and inspect them visually. From these plots, it can easily be deduced that 3 is the optimal number of clusters. We can also apply another, more robust method to be sure. The NbClust package provides 30 indices for determining the number of clusters and proposes the best clustering scheme from the different results obtained by varying the number of clusters, distance measures, and clustering methods.
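For intuition, the elbow criterion that fviz_nbclust computes can also be done by hand: fit kmeans for a range of k values and plot the total within-cluster sum of squares, looking for the "elbow" where the curve flattens. Sketched here on synthetic data with three planted groups, so it runs independently of the wine CSV:

```r
# Manual elbow method: three well-separated synthetic groups
set.seed(1234)
pts <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 5), ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2))

# Total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) kmeans(pts, centers = k, nstart = 25)$tot.withinss)

plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The curve drops sharply up to k = 3 (the true number of planted groups) and flattens afterwards, which is exactly the pattern read off the elbow plot above.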
```r
nc <- NbClust(wine_scaled, min.nc = 2, max.nc = 15, method = "kmeans")
nc$Best.nc[1, ]
# When counted, we can see that 15 indices chose 3 as the best k value

# This can also be represented visually
nc$Best.nc[1, ] %>%
  table() %>%
  barplot()
# We can see that 15 indices chose 3
```
After running the NbClust function and visualizing the result, we can see that 15 of the indices propose 3 as the optimal number of clusters. We can now rebuild our model using 3 centers:

```r
# Re-implementing kmeans, this time with k = 3
set.seed(1234)
wines_k3 <- kmeans(wine_scaled, centers = 3)

# Mean values of each cluster:
# compute the mean of each variable by cluster, using the original data
aggregate(Wine, by = list(wines_k3$cluster), mean)
```

The data has now been grouped into three clusters. When you examine the Customer_Segment column, you will discover that it also contains three groups. To examine the accuracy of the algorithm, you can compute the mean value of each variable according to its class, as demonstrated below:
```r
# Reload the dataset (with the Customer_Segment column)
Wine <- read.csv("Wine.csv")

# Compute the mean of each variable using Customer_Segment as the grouping variable
agg <- Wine %>%
  group_by(Customer_Segment) %>%
  summarise_all(mean)
```

Comparing the means computed from the actual Customer_Segment groups with those from our k-means model, we find that they are similar, with only slight differences. This indicates that our model has performed well and can thus be deployed.
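Beyond comparing group means, a direct cross-tabulation of the model's cluster labels against the known segments shows how well the groups line up. Sketched here on synthetic data with a known grouping (so it runs without the wine CSV); note that cluster numbering is arbitrary, so cluster 1 need not correspond to segment 1:

```r
# Cross-tabulate discovered clusters against known labels (synthetic sketch)
set.seed(1234)
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 10))  # two well-separated groups
truth <- rep(c("A", "B"), each = 50)

fit <- kmeans(x, centers = 2)
table(cluster = fit$cluster, truth = truth)
# Large counts concentrated in one cell per row indicate good agreement;
# the cluster numbers themselves carry no meaning
```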
Conclusion
We have implemented an Unsupervised Machine Learning model using the k-means clustering technique. This is not to say that there aren't other models; however, k-means is widely used because of its simplicity and the quality of the groupings it produces.
