Tagged: Data science, Machine Learning
- This topic has 0 replies, 1 voice, and was last updated 2 years, 4 months ago by
Simileoluwa.
- February 2, 2020 at 6:06 pm #85026 Spectator @simileoluwa
Machine Learning is a branch of Artificial Intelligence (AI) that uses algorithms to let computers learn by examining patterns in historical data, so that they can predict future events. AI involves building smart machines capable of performing tasks or making decisions that usually require human intelligence. Artificial Intelligence: A Modern Approach, by Stuart Russell and Peter Norvig, explores four approaches that have shaped the field of AI:
- Thinking humanly
- Thinking rationally
- Acting humanly
- Acting rationally.
The first two concern reasoning, while the last two concern execution or behavior. Machine learning models draw on these ideas by learning from historical data (input/reasoning) and predicting outcomes on new sets of data (action).
Applications cut across many fields, such as:
- Banking: preventing fraud and customer churn
- Government: analyzing sensor data in order to identify ways to increase efficiency and save money
- Communications: image (facial) recognition, fingerprint recognition, etc.
- Recommendations: movies on Netflix, applications on the Google Play Store.
Artificial intelligence is widely expected to shape our future more powerfully than any other innovation this century. Anyone who does not understand it may soon feel left behind, waking up in a world full of technology that feels more and more like magic.
Terminologies in Machine Learning
Certain terms must be understood before you can train machine learning models successfully. A few of them are described below.
Dataset: This is simply a collection of data. Data are observations or measurements (unprocessed or processed) represented as text, numbers, or multimedia.
Training Set: This is the data used to fit (build) a machine learning model.
Test set: Every machine learning model needs to be tested on data it has never seen before, to measure how robust its predictions would be in the real world. The test set is that unseen data.
Think of the training set as a soldier in training and the test set as a soldier at war.
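As a minimal sketch of the training/test idea, the base R snippet below splits a small made-up data frame into 75% training rows and 25% test rows. The data frame `toy` and its columns are hypothetical and only illustrate the mechanics; the tutorial itself uses the caTools package for this step later on.

```r
# Toy data: 100 rows with one predictor and one target (made-up values)
set.seed(42)  # fixed seed so the random split is reproducible
toy <- data.frame(x = 1:100, y = 2 * (1:100) + rnorm(100))

# Randomly pick 75% of the row indices for training
train_idx <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))

toy_train <- toy[train_idx, ]   # rows the model learns from
toy_test  <- toy[-train_idx, ]  # rows held back for evaluation

nrow(toy_train)  # 75
nrow(toy_test)   # 25
```

No row appears in both sets, which is what lets the test set stand in for data the model has never seen.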
Learning: Two of the most widely adopted machine learning methods are supervised learning, which trains algorithms on example input and output data labeled by humans, and unsupervised learning, which gives the algorithm no labeled data so that it can find structure within its input data on its own.
Regression: A supervised modeling technique that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. Regression models estimate continuous values (real-valued output).
Classification: A supervised modeling technique that attempts to predict which class an observation belongs to by examining patterns in the input data. Classification models output a discrete class (Boolean values or categories).
Both regression and classification are types of supervised learning. Some popular examples of supervised machine learning algorithms are:
- Linear regression for regression problems
- Random forest for classification and regression problems
- Support vector machines for classification problems
- Artificial Neural Networks for classification and regression problems
- Decision trees for classification and regression.
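To make the difference between regression and classification concrete, here is a small sketch on R's built-in mtcars dataset (not the housing data used in this tutorial). The classification half uses logistic regression via glm(), which is not in the list above and is chosen here only because it ships with base R.

```r
# Regression: predict a continuous value (fuel efficiency, mpg) from weight
reg_fit  <- lm(mpg ~ wt, data = mtcars)
reg_pred <- predict(reg_fit, newdata = data.frame(wt = 3))
reg_pred   # a real-valued estimate of mpg

# Classification: predict a discrete class (am: 0 = automatic, 1 = manual)
clf_fit   <- glm(am ~ wt, data = mtcars, family = binomial)
clf_prob  <- predict(clf_fit, newdata = data.frame(wt = 3), type = "response")
clf_class <- ifelse(clf_prob > 0.5, 1, 0)  # threshold the probability at 0.5
clf_class  # either 0 or 1
```

The regression model returns a number on a continuous scale, while the classifier returns a probability that is then turned into one of two classes.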
The most common unsupervised learning method is cluster analysis, which is used in exploratory data analysis to find hidden patterns or groupings in data. Other, more complex learning types include reinforcement learning and semi-supervised learning.
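As a rough illustration of cluster analysis, the sketch below runs k-means (one of several clustering algorithms, picked here as an assumption) on the built-in iris measurements. The species labels are never given to the algorithm; they are only used afterwards to see how the discovered groups line up.

```r
set.seed(1)  # k-means starts from random centers, so fix the seed

# Use only the four numeric measurements; the Species column is left out
features <- iris[, 1:4]

# Ask for 3 clusters and try 10 random starts, keeping the best one
km <- kmeans(features, centers = 3, nstart = 10)

# Compare the discovered clusters with the species labels the algorithm never saw
table(cluster = km$cluster, species = iris$Species)
```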
Implementing a Supervised Model in R
For this tutorial, we will learn how to implement a linear regression model and an Artificial Neural Network. Download the required Kaggle dataset, kc_house_data.csv.
Load the required libraries:
#Required libraries include
library(dplyr)
library(ggplot2)
library(caret)
library(caTools)
library(skimr)
library(reshape2)
library(neuralnet) #needed for neuralnet() and compute() used later
Set your working directory and load your dataset:
#read the housing data
housing <- read.csv("kc_house_data.csv")
#examine structure of data
str(housing)
The goal is to predict the price of houses using other attributes such as house size, view, etc. This is a typical regression problem.
We use a correlation matrix to visualize the strength of the pairwise relationships between variables:
#correlation matrix
housing %>%
  select(-date) %>%
  cor() %>% #Correlation between every pair of numeric columns
  melt() %>% #Reshape the correlation matrix to long format (Var1, Var2, value)
  ggplot(., aes(Var1, Var2, fill = value)) + #Plotting
  geom_tile() +
  scale_fill_gradient(low = "grey", high = "darkred") +
  geom_text(aes(Var1, Var2, label = round(value, 2)), size = 2) +
  labs(title = "Correlation Matrix", x = "Numeric column", y = "Numeric Column", fill = "Coefficient Range") +
  theme(axis.text.x = element_text(vjust = 0.5, angle = 45), plot.title = element_text(face = "bold", hjust = 0.5))
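To make the reshaping step concrete: cor() returns a square matrix, while a heatmap needs one row per pair of variables. The sketch below shows that long format on three columns of the built-in mtcars dataset, using only base R as a stand-in for melt(); the column names are renamed to match what melt() produces for a matrix.

```r
# Square correlation matrix for three numeric columns
cor_mat <- cor(mtcars[, c("mpg", "wt", "hp")])

# Long format: one row per (variable, variable) pair
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("Var1", "Var2", "value")

head(cor_long)
nrow(cor_long)  # 9, i.e. 3 x 3 pairs
```

Each row of `cor_long` corresponds to one tile of the heatmap.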
Split the dataset into training and test data using the caTools package, as described earlier:
#We will remove the date and id column since they are just identifiers
new_housing <- housing %>% select(-date, -id)
#Splitting data into test and train
#sample.split() expects a vector (here the target column) and returns one TRUE/FALSE flag per row
sample_2 <- sample.split(new_housing$price, SplitRatio = 0.75) # 75% - 25%
train_2 <- subset(new_housing, sample_2 == TRUE) #75% are placed here
test_2 <- subset(new_housing, sample_2 == FALSE) #rest 25% is kept here
Implementing a linear regression model:
fit2 <- lm(price~., data = train_2)
summary(fit2)
Having fitted the model, we will now use the predict() function to see how the model performs on the training set before moving on to the test data:
#create prediction from the model
train_pred <- predict(fit2, train_2)
#Bind the results of the prediction model with the actual price data
new_pred <- cbind.data.frame(actual = train_2$price, prediction = train_pred)
#We examine the correlation between the predicted prices and the actual prices
cor(new_pred)
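Correlation shows how closely predictions track the actual values, but not how large the errors are. As a minimal sketch, error-based metrics could be computed with two small helper functions. The names `rmse` and `mae` are hypothetical helpers defined here, not functions from the packages loaded above, and the prices are made-up numbers.

```r
# Root mean squared error: penalizes large errors more heavily
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute error: the average size of the error, in the target's units
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Made-up actual and predicted prices for three houses
actual    <- c(200000, 350000, 500000)
predicted <- c(210000, 330000, 520000)

rmse(actual, predicted)  # about 17320.51
mae(actual, predicted)   # about 16666.67
```

Both metrics are expressed in the same units as the target, so here they read as an error in price.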
Note: Correlation is only one way to measure performance; other metrics, such as the mean squared error, are also commonly used. We will now apply the previously fitted model to the test data created earlier:
#Get predictions using the test data
test_pred <- predict(fit2, test_2)
#Bind the predicted results from the test data with the actual test result
new_pred_test <- cbind.data.frame(actual = test_2$price, prediction = test_pred)
#Find the correlation between both columns of data.
cor(new_pred_test)
We have a correlation coefficient of 0.8254583. However, we can try other models and examine which one performs better. In this case we will use an Artificial Neural Network. Specifically, you will see how to:
- Normalize data for meaningful analysis
- Predict using a neural network
- Test accuracy.
#Neural networks require normalization of data and we will do that with the preProcess tool in caret
preProcess_range_model <- preProcess(train_2, method = 'range')
train_2 <- predict(preProcess_range_model, newdata = train_2)
Normalization here means rescaling each numeric column so that its values lie between 0 and 1. The scaling is learned from the training data only and then reused for the test data.
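The arithmetic behind range (min-max) normalization can be sketched in a few lines of base R. This is an illustration of the idea on made-up numbers, not the caret implementation; `min_max` and `sqft` are hypothetical names.

```r
# Min-max scaling: (x - min) / (max - min) maps the smallest value to 0 and the largest to 1
min_max <- function(x, lo = min(x), hi = max(x)) (x - lo) / (hi - lo)

# Made-up training values (e.g. house sizes)
sqft <- c(800, 1200, 2000, 3200)
min_max(sqft)  # 0.0000000 0.1666667 0.5000000 1.0000000

# New (test) values are scaled with the TRAINING min and max,
# so a value above the training maximum ends up slightly above 1
min_max(3600, lo = min(sqft), hi = max(sqft))  # 1.166667
```

Reusing the training minimum and maximum is why the test data is passed through the same fitted preprocessing object rather than being normalized on its own.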
#Building the neural network model
NN <- neuralnet(price~., data = train_2, linear.output = T, hidden = c(3,2))
plot(NN)
#We will create prediction set from the fitted model, notice that we did not include the price column in this case
pred_NN_train <- compute(NN, train_2[, 2:19])
#Bind the actual price data from the train data with the result from above line
nn_values <- cbind.data.frame(actual = train_2$price, prediction = pred_NN_train$net.result)
#We find a much higher correlation than with the linear regression model
cor(nn_values) #We get a correlation of 0.938003
Having seen how the model performs on the training set, we examine its performance on the test data:
#Normalizing the test data using the model from the train data
test_2 <- predict(preProcess_range_model, newdata = test_2)
#prediction for the test data
pred_NN_test <- compute(NN, test_2[, 2:19])
#Bind results with actual data
nn_values_test <- cbind.data.frame(actual = test_2$price, prediction = pred_NN_test$net.result)
#Examine the correlation
cor(nn_values_test) #0.9347564
Conclusion
A clear improvement can be seen when the results of the simple linear model are compared with those of the neural network: the latter performs better on this dataset, with a test-set correlation of about 0.93 versus about 0.83. There are ways of improving a model's performance. The neural net's performance can sometimes be improved by using more hidden layers, which, however, requires more computation time. Other, more complex methods include hyperparameter tuning, grid search, etc.
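The idea behind grid search can be sketched in base R: try each candidate value of a setting, score it on held-out data, and keep the best. The example below is an assumption-laden toy, not the tutorial's neural network. It tunes the degree of a polynomial regression on made-up data, where the degree plays the role that the number of hidden layers or neurons would play for a neural net.

```r
set.seed(7)

# Made-up nonlinear data: y follows sin(x) plus noise
x   <- runif(200, -3, 3)
toy <- data.frame(x = x, y = sin(x) + rnorm(200, sd = 0.2))

# Hold out 50 rows for validation
train_idx <- sample(nrow(toy), 150)
train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

# The "grid": candidate values of the hyperparameter
degrees <- 1:6

# Fit one model per candidate and record its validation RMSE
valid_rmse <- sapply(degrees, function(d) {
  fit  <- lm(y ~ poly(x, d), data = train)
  pred <- predict(fit, newdata = valid)
  sqrt(mean((valid$y - pred)^2))
})

# Keep the candidate with the lowest validation error
best_degree <- degrees[which.min(valid_rmse)]
data.frame(degree = degrees, rmse = valid_rmse)
best_degree
```

The same loop structure applies to a neural network, with the grid holding candidate layer sizes instead of polynomial degrees, at the cost of a full training run per candidate.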