Introduction to Machine Learning in R
Tagged: Data science, Machine Learning
 This topic has 0 replies, 1 voice, and was last updated 2 weeks, 2 days ago by Simileoluwa.


Machine Learning is a branch of Artificial Intelligence (AI) that uses algorithms to enable computers to learn by examining patterns in historical data in order to predict future events. AI involves building smart machines capable of performing tasks or making decisions that usually require human intelligence. Artificial Intelligence: A Modern Approach, by Stuart Russell and Peter Norvig, explores four different approaches that have continually shaped the field of AI:
 Thinking humanly
 Thinking rationally
 Acting humanly
 Acting rationally.
The first two involve processes of reasoning, while the last two involve execution or behavior. Machine learning models loosely mirror this split: they learn from historical data (input/reasoning) and predict outcomes on new sets of data (action).
Applications cut across many fields, such as:
 Banking: preventing fraud and customer churn
 Government: analyzing sensor data in order to identify ways to increase efficiency and save money
 Communications: image (facial) recognition, fingerprint recognition, etc.
 Recommendations: movies on Netflix, applications on the Google Play Store.
It is important to recognize that artificial intelligence is likely to shape our future more powerfully than most other innovations this century. Anyone who does not understand it may soon feel left behind, waking up in a world full of technology that feels more and more like magic.
Terminologies in Machine Learning
Some terminology is essential to working successfully with machine learning. We will cover a few key terms here.
Dataset: This is simply a collection of data. Data are observations or measurements (unprocessed or processed) represented as text, numbers, or multimedia.
Training set: This is the dataset on which a machine learning model is built, that is, the data used to fit the model.
Test set: This is data that the model has never seen before. Every machine learning model needs to be tested on such data to measure how robust its predictions will be in the real world.
Think of the training set as a soldier in training and the test set as a soldier at war.
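To make the training/test idea concrete, here is a minimal sketch of a 75/25 split using only base R. It uses mtcars, a small dataset bundled with R, purely for illustration; the names train_set and test_set are placeholders and not part of the tutorial's own code.

```r
# Minimal sketch: a 75/25 train/test split using only base R.
# mtcars ships with R and is used here only as a stand-in dataset.
set.seed(42)                                    # make the random split reproducible
n <- nrow(mtcars)                               # 32 rows
train_idx <- sample(seq_len(n), size = floor(0.75 * n))

train_set <- mtcars[train_idx, ]                # rows the model learns from
test_set  <- mtcars[-train_idx, ]               # rows held back for evaluation

nrow(train_set)   # 24
nrow(test_set)    # 8
```

The key property is that no row appears in both sets, so the test set really is data the model has never seen.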
Learning: Two of the most widely adopted machine learning methods are supervised learning, which trains algorithms on example input and output data labeled by humans, and unsupervised learning, which gives the algorithm no labeled data so that it can find structure within its input data.
Regression: A supervised machine learning technique that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. Regression models estimate continuous values (real-valued output).
Classification: A supervised machine learning technique that attempts to predict which class an observation belongs to by examining patterns in the input data. Classification models output a discrete class (Boolean values or categories).
Both regression and classification are types of supervised learning. Some popular examples of supervised machine learning algorithms are:
 Linear regression for regression problems
 Random forest for classification and regression problems
 Support vector machines for classification problems
 Artificial Neural Networks for classification and Regression Problems
 Decision trees for classification and regression.
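As a rough illustration of the difference between the two supervised tasks, the sketch below fits one regression model and one classifier on the built-in mtcars data using base R only. Logistic regression (glm with family = binomial) is swapped in here as a simple classifier; it is not one of the algorithms listed above, and the variable choices are assumptions made for the example.

```r
# Regression: predict a continuous value (miles per gallon) from weight and horsepower.
reg_fit  <- lm(mpg ~ wt + hp, data = mtcars)
reg_pred <- predict(reg_fit, newdata = mtcars)
head(reg_pred)                       # real-valued predictions

# Classification: predict a discrete class (am: 0 = automatic, 1 = manual).
# Logistic regression is used here only as a simple stand-in classifier.
clf_fit   <- glm(am ~ wt, data = mtcars, family = binomial)
clf_prob  <- predict(clf_fit, newdata = mtcars, type = "response")  # probabilities
clf_class <- ifelse(clf_prob > 0.5, 1, 0)                           # threshold into classes

# Cross-tabulate predicted against actual classes
table(predicted = clf_class, actual = mtcars$am)
```

The regression model returns numbers on a continuous scale, while the classifier's output is reduced to one of two categories.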
The most common unsupervised learning method is cluster analysis, which is used in exploratory data analysis to find hidden patterns or groupings in data. Other, more complex learning types include reinforcement learning and semi-supervised learning.
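A small sketch of cluster analysis, assuming k-means as the clustering method and the built-in iris data as the example (neither is used elsewhere in this tutorial). The species labels are hidden from the algorithm and only used afterwards to inspect the result.

```r
# Cluster the four numeric iris measurements without using the species labels.
set.seed(123)                              # k-means starts from random centers
features <- scale(iris[, 1:4])             # standardize so no column dominates
km <- kmeans(features, centers = 3, nstart = 20)

# Compare the discovered clusters with the known species (for inspection only).
table(cluster = km$cluster, species = iris$Species)
```

The number of clusters (centers = 3) is chosen by hand here; in practice it is usually something you have to explore.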
Implementing a Supervised Model in R
For this tutorial, we will learn how to implement a linear regression model and an Artificial Neural Network. Download the required dataset, kc_house_data.csv, from Kaggle.
Set up your libraries:
    # Required libraries
    library(dplyr)
    library(ggplot2)
    library(caret)
    library(caTools)
    library(skimr)
    library(reshape2)

Set your working directory and load your dataset:
    # Read the housing data
    housing <- read.csv("kc_house_data.csv")

    # Examine the structure of the data
    str(housing)

The goal is to predict house prices from other attributes such as house size and views (a typical regression problem).
We use a correlation matrix to visualize the strength of the pairwise relationships between variables:
    # Correlation matrix
    housing %>%
      select(-date) %>%   # drop the non-numeric date column
      cor() %>%           # pairwise correlations between all numeric columns
      melt() %>%          # reshape the correlation matrix into long format
      ggplot(., aes(Var1, Var2, fill = value)) +   # plotting
      geom_tile() +
      scale_fill_gradient(low = "grey", high = "darkred") +
      geom_text(aes(Var1, Var2, label = round(value, 2)), size = 2) +
      labs(title = "Correlation Matrix", x = "Numeric column", y = "Numeric column",
           fill = "Coefficient Range") +
      theme(axis.text.x = element_text(vjust = 0.5, angle = 45),
            plot.title = element_text(face = "bold", hjust = 0.5))

Next, we split the dataset into training and test data using the caTools package, as described earlier:
    # Remove the date and id columns since they are just identifiers
    new_housing <- housing %>%
      select(-date, -id)

    # Split the data into train and test sets (75% - 25%)
    # sample.split() expects the outcome vector, so we pass the price column
    sample_2 <- sample.split(new_housing$price, SplitRatio = 0.75)
    train_2 <- subset(new_housing, sample_2 == TRUE)    # 75% goes here
    test_2 <- subset(new_housing, sample_2 == FALSE)    # the remaining 25% is kept here

Implementing a linear regression model:
    fit2 <- lm(price ~ ., data = train_2)
    summary(fit2)

Having fitted the model, we now use the predict() function to see how it performs on the training set before moving on to the test data:
    # Create predictions from the model
    train_pred <- predict(fit2, train_2)
    # Bind the predictions to the actual price data
    new_pred <- cbind.data.frame(actual = train_2$price, prediction = train_pred)
    # Examine the correlation between the predicted and actual prices
    cor(new_pred)

Note: There are other evaluation metrics, such as the mean squared error. We now apply the previously fitted model to the test data obtained earlier:
    # Get predictions using the test data
    test_pred <- predict(fit2, test_2)
    # Bind the predicted results from the test data to the actual test results
    new_pred_test <- cbind.data.frame(actual = test_2$price, prediction = test_pred)
    # Find the correlation between both columns of data
    cor(new_pred_test)
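Correlation is only one way to compare predictions with actual values. As a sketch of two error-based alternatives, the helpers below compute the root mean squared error and the mean absolute error. The function names rmse and mae are hypothetical helpers defined here, not functions from the packages loaded above, and the four-value vectors are made-up toy numbers.

```r
# Hypothetical helper functions (defined here, not taken from any package).
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Toy values for illustration only.
actual    <- c(3, 5, 2, 7)
predicted <- c(2, 5, 4, 6)

rmse(actual, predicted)   # sqrt(1.5), roughly 1.22
mae(actual, predicted)    # 1
```

The same helpers could be applied to the actual and predicted price columns of new_pred_test. Unlike correlation, both metrics are expressed in the units of the target, so lower values mean predictions that are closer to the real prices.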
We get a correlation coefficient of 0.8254583. We can also try other models and see which performs better; in this case we will use an Artificial Neural Network. Specifically, you will see how to:
 Normalize data for meaningful analysis
 Predict using a neural network
 Test accuracy.
    # Neural networks require normalized data; we do this with the preProcess tool in caret
    preProcess_range_model <- preProcess(train_2, method = "range")
    train_2 <- predict(preProcess_range_model, newdata = train_2)

Normalization here means rescaling each numeric column so that its values lie between 0 and 1.
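To show what this rescaling does numerically, here is a minimal base R sketch of min-max scaling. The helper scale_with and the toy vectors are made up for illustration; this is not caret's internal code, only the same idea: the minimum and maximum are learned from the training data and then reused on the test data.

```r
# Hypothetical helper: rescale x using a given (min, max) range.
scale_with <- function(x, rng) (x - rng[1]) / (rng[2] - rng[1])

train_x <- c(10, 20, 30, 50)     # toy training values
test_x  <- c(15, 40)             # toy test values

rng <- range(train_x)            # min and max learned from the training data only

scale_with(train_x, rng)         # 0.00 0.25 0.50 1.00
scale_with(test_x, rng)          # 0.125 0.750
```

Reusing the training range on the test data keeps both sets on the same scale. A test value outside the training range would simply fall below 0 or above 1.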
    # Build the neural network model
    library(neuralnet)  # provides neuralnet() and compute(); not among the libraries loaded above
    NN <- neuralnet(price ~ ., data = train_2, linear.output = TRUE, hidden = c(3, 2))
    plot(NN)

    # Create predictions from the fitted model; the price column (column 1) is excluded
    # neuralnet::compute is called explicitly because dplyr also exports compute()
    pred_NN_train <- neuralnet::compute(NN, train_2[, 2:19])
    # Bind the actual prices from the training data to the predictions
    nn_values <- cbind.data.frame(actual = train_2$price, prediction = pred_NN_train$net.result)
    # The correlation is much higher than with the linear regression model
    cor(nn_values)  # we get a correlation of 0.938003
Having seen how the model performs on the training set, we examine its performance on the test data:

    # Normalize the test data using the preprocessing model fitted on the training data
    test_2 <- predict(preProcess_range_model, newdata = test_2)
    # Predictions for the test data
    pred_NN_test <- neuralnet::compute(NN, test_2[, 2:19])
    # Bind the results to the actual data
    nn_values_test <- cbind.data.frame(actual = test_2$price, prediction = pred_NN_test$net.result)
    # Examine the correlation
    cor(nn_values_test)  # 0.9347564

Conclusion
Comparing the two models shows a clear improvement: the neural network performs better than the simple linear model (a test correlation of 0.9347564 versus 0.8254583). There are ways to improve a model's performance. The neural network may improve with more hidden layers, although this requires more computation time. Other approaches include hyperparameter tuning, for example through grid search.
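As a rough sketch of what a grid search looks like, the example below tries several candidate values of one hyperparameter and keeps the one with the lowest validation error. To stay runnable with base R alone, it tunes the polynomial degree of a linear model on the built-in mtcars data; this is a stand-in chosen for illustration, not the tuning of the neural network above, and all variable names are assumptions.

```r
# Grid search over one hyperparameter: the polynomial degree of a linear model.
set.seed(1)
idx <- sample(seq_len(nrow(mtcars)), size = 24)
train_set <- mtcars[idx, ]            # data used to fit each candidate model
valid_set <- mtcars[-idx, ]           # held-out data used to compare candidates

degrees <- 1:4                        # the grid of candidate values

# Fit one model per candidate and record its validation RMSE.
valid_rmse <- sapply(degrees, function(d) {
  fit  <- lm(mpg ~ poly(hp, d), data = train_set)
  pred <- predict(fit, newdata = valid_set)
  sqrt(mean((valid_set$mpg - pred)^2))
})

results <- data.frame(degree = degrees, rmse = valid_rmse)
results

# Keep the candidate with the lowest validation error.
best_degree <- results$degree[which.min(results$rmse)]
best_degree
```

The same loop pattern could in principle be used over different hidden-layer layouts for a neural network, at a much higher computational cost per candidate.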
