Technical Notes Of
Ehi Kioya

Introduction to Machine Learning in R

Tagged: Data science, Machine Learning

  • This topic has 0 replies, 1 voice, and was last updated 2 years, 4 months ago by Simileoluwa.
  • February 2, 2020 at 6:06 pm #85026
    Spectator
    @simileoluwa

    Machine Learning is a branch of Artificial Intelligence (AI) that uses algorithms to let computers learn by examining patterns in historical data in order to predict future events. AI involves building smart machines capable of performing tasks or making decisions that usually require human intelligence. Artificial Intelligence: A Modern Approach, by Stuart Russell and Peter Norvig, explores four approaches that have continually shaped the field of AI:

    1. Thinking humanly
    2. Thinking rationally
    3. Acting humanly
    4. Acting rationally.

    The first two involve processes of reasoning, while the last two involve execution or behavior. Machine learning models touch both sides: they learn from historical data (input/reasoning) and predict outcomes on new sets of data (action).
    Applications cut across many fields, such as:

    1. Banking: preventing fraud and customer churn.
    2. Government: analyzing sensor data in order to identify ways to increase efficiency and save money.
    3. Communications: image (facial) recognition, fingerprint recognition, etc.
    4. Recommendations: movies on Netflix, applications from the Google Play Store.

    Artificial Intelligence is widely expected to shape our future more powerfully than any other innovation this century. Anyone who does not understand it may soon feel left behind, waking up in a world full of technology that feels more and more like magic.

    Terminologies in Machine Learning

    A few terms are essential to training machine learning models successfully, so we define them here.
    Dataset: A collection of data. Data are observations or measurements (unprocessed or processed) represented as text, numbers, or multimedia.
    Training set: The data used to fit (build) a machine learning model.
    Test set: Data the model has never seen before. Every machine learning model needs to be tested on such data to measure how robust its predictions are in the real world.
    Think of the training set as a soldier in training and the test set as a soldier at war.
    Learning: Two of the most widely adopted machine learning methods are supervised learning, which trains algorithms on example input and output data labeled by humans, and unsupervised learning, which gives the algorithm no labeled data so that it has to find structure within its input data on its own.

    Regression: A supervised modeling technique that investigates the relationship between a dependent variable (the target) and one or more independent variables (the predictors). Regression models estimate continuous values (real-valued output).
    Classification: A supervised modeling technique that predicts which class an observation belongs to by examining patterns in the input data. Classification models output a discrete value (a Boolean or a category).
    Both regression and classification are types of supervised learning. Some popular examples of supervised machine learning algorithms are:

    • Linear regression for regression problems
    • Random forest for classification and regression problems
    • Support vector machines for classification problems
    • Artificial Neural Networks for classification and Regression Problems
    • Decision trees for classification and regression.
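    The difference between the two kinds of output can be sketched in a few lines of base R. This is only an illustration under stated assumptions: it uses the built-in mtcars data rather than the housing data used later, and logistic regression (glm) stands in as a simple classifier that is not in the list above.

```r
# Illustrative only: mtcars ships with base R and is not the tutorial's housing data.
data(mtcars)

# Regression: predict a continuous value (miles per gallon) from weight and horsepower.
reg_fit  <- lm(mpg ~ wt + hp, data = mtcars)
reg_pred <- predict(reg_fit, newdata = mtcars)
head(reg_pred)  # real-valued outputs

# Classification: predict a discrete class (am: 0 = automatic, 1 = manual).
# Logistic regression is used here as a simple stand-in classifier.
clf_fit   <- glm(am ~ wt + hp, data = mtcars, family = binomial)
clf_prob  <- predict(clf_fit, newdata = mtcars, type = "response")
clf_class <- ifelse(clf_prob > 0.5, 1, 0)  # turn probabilities into class labels

# Cross-tabulate predicted against actual classes
table(predicted = clf_class, actual = mtcars$am)
```

    The regression model returns a number on a continuous scale for every row, while the classifier returns one of two labels.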

    The most common unsupervised learning method is cluster analysis, which is used in exploratory data analysis to find hidden patterns or groupings in data. Other, more complex learning types include reinforcement learning and semi-supervised learning.
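    As a minimal sketch of cluster analysis, the snippet below runs k-means from base R on the built-in iris measurements. The dataset, the choice of three clusters, and the seed are assumptions made for illustration; none of them come from the housing example.

```r
# Illustrative only: iris ships with base R. The species labels are dropped
# so the algorithm only sees unlabeled measurements.
data(iris)
measurements <- scale(iris[, 1:4])  # standardize so no column dominates the distance

set.seed(42)  # arbitrary seed; k-means starts from random centers
clusters <- kmeans(measurements, centers = 3, nstart = 10)

# Size of each discovered group
table(clusters$cluster)

# Compare with the held-back labels purely as a sanity check
table(cluster = clusters$cluster, species = iris$Species)
```

    No labels are used during fitting; the species column only appears afterwards to check whether the discovered groups are meaningful.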

    Implementing a Supervised Model in R

    In this tutorial, we will implement a linear regression model and an Artificial Neural Network. The required dataset (kc_house_data.csv) comes from Kaggle.

    Load the required libraries:

    #Required libraries include
    library(dplyr)
    library(ggplot2)
    library(caret)
    library(caTools)
    library(skimr)
    library(reshape2)
    library(neuralnet) #needed for the neural network model later on

    Set your working directory and load in your dataset;

     #read the housing data
    housing <- read.csv("kc_house_data.csv")
    #examine structure of data
    str(housing)
    

    The goal is to predict the price of houses using other features such as house size, views, etc. (a typical regression problem).

    We use a correlation matrix to visualize the strength of the pairwise relationships between variables:

    #correlation matrix
    housing %>%
      select(-date) %>% #Drop the non-numeric date column
      cor() %>% #Pairwise correlation of every remaining (numeric) column
      melt() %>% #Reshape the correlation matrix to long format (Var1, Var2, value)
      ggplot(., aes(Var1, Var2, fill = value)) + #Plotting
      geom_tile() +
      scale_fill_gradient(low = "grey", high = "darkred") +
      geom_text(aes(Var1, Var2, label = round(value, 2)), size = 2) +
      labs(title = "Correlation Matrix", x = "Numeric column", y = "Numeric column",
           fill = "Coefficient Range") +
      theme(axis.text.x = element_text(vjust = 0.5, angle = 45),
            plot.title = element_text(face = "bold", hjust = 0.5))

    Next, we split the dataset into training and test data using the caTools package, as described earlier:

    #We will remove the date and id columns since they are just identifiers
    new_housing <- housing %>%
      select(-date, -id)
    #Splitting data into train and test
    set.seed(123) #arbitrary seed so the split is reproducible
    #sample.split() expects a vector with one entry per row, so pass the target column
    sample_2 <- sample.split(new_housing$price, SplitRatio = 0.75) # 75% - 25%
    train_2 <- subset(new_housing, sample_2 == TRUE) #75% of the rows go here
    test_2 <- subset(new_housing, sample_2 == FALSE) #the remaining 25% is kept here
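    The important point in a train/test split is that it is drawn over rows, so every observation lands in exactly one of the two sets. A dependency-free sketch of the same idea, using base R's sample() on the built-in mtcars data (both are stand-ins chosen for illustration, as are the names train_demo and test_demo):

```r
# Illustrative only: mtcars stands in for the housing data.
data(mtcars)

set.seed(1)  # arbitrary seed so the split is reproducible
n <- nrow(mtcars)

# Draw 75% of the row indices at random for training
train_idx  <- sample(n, size = floor(0.75 * n))
train_demo <- mtcars[train_idx, ]
test_demo  <- mtcars[-train_idx, ]  # every row not chosen for training

# Check the sizes and confirm that no row appears in both sets
nrow(train_demo)
nrow(test_demo)
length(intersect(rownames(train_demo), rownames(test_demo)))
```

    Checking the row counts after any split is a cheap way to confirm that the proportions are what was intended.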

    Implementing a linear regression model;

    fit2 <- lm(price~., data = train_2)
    summary(fit2)

    Having fitted the model, we now use the predict() function to see how it performs on the training set before moving on to the test data:

    #create prediction from the model
    train_pred <- predict(fit2, train_2)
    
    #Bind the results of the prediction model with the actual price data
    new_pred <- cbind.data.frame(actual = train_2$price, prediction = train_pred)
    
    #Examine the correlation between the predicted prices and the actual prices
    cor(new_pred)

    Note: Correlation is only one way to measure performance; there are other metrics, such as the mean squared error. We will now apply the previously fitted model to the test data obtained earlier:

    #Get predictions using the test data
    test_pred <- predict(fit2, test_2)
    
    #Bind the predicted results from the test data with the actual test prices
    new_pred_test <- cbind.data.frame(actual = test_2$price, prediction = test_pred)
    
    #Find the correlation between both columns of data.
    cor(new_pred_test)
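    Correlation shows whether predictions move together with the actual prices, but not how far off they are. A minimal sketch of error-based metrics in base R follows. The prices are made-up values for illustration only, not output from the housing model, and the variable names are hypothetical.

```r
# Hypothetical prices, for illustration only (not results from the housing model).
actual     <- c(200000, 350000, 500000, 650000)
prediction <- c(210000, 330000, 520000, 640000)

errors <- prediction - actual

mse  <- mean(errors^2)     # mean squared error
rmse <- sqrt(mse)          # root mean squared error, in the same units as price
mae  <- mean(abs(errors))  # mean absolute error

c(mse = mse, rmse = rmse, mae = mae)
```

    The same three lines can be applied to the actual and prediction columns of a bound data frame such as new_pred_test.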


    The linear model gives a correlation coefficient of 0.8254583 on the test data. However, we can try other models and examine which performs better; in this case we will use an Artificial Neural Network. Specifically, you will see how to:

    • Normalize data for meaningful analysis
    • Predict using a neural network
    • Test accuracy.

    #Neural networks require normalized data; we do this with the preProcess() function in caret
    preProcess_range_model <- preProcess(train_2, method = 'range')
    train_2 <- predict(preProcess_range_model, newdata = train_2)

    Normalization, in the form used here (range or min-max scaling), rescales each numeric column so that its values lie between 0 and 1.
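    The arithmetic behind range scaling is (x - min) / (max - min). A hand-rolled sketch is shown below; min_max is a hypothetical helper written for illustration, not a caret function, and the square-footage values are made up.

```r
# Hypothetical helper: scale x using the minimum and maximum of a reference vector.
min_max <- function(x, ref = x) {
  (x - min(ref)) / (max(ref) - min(ref))
}

# Made-up values, for illustration only
train_sqft <- c(800, 1200, 2000, 3200)
test_sqft  <- c(1000, 3600)

# Training values are scaled with their own range, so they span 0 to 1
min_max(train_sqft)

# Test values reuse the training range, so they can fall outside 0 to 1
min_max(test_sqft, ref = train_sqft)
```

    Reusing the training minimum and maximum for the test data keeps both sets on the same scale, which is why the test set is later transformed with the model fitted on the training set.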

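    #Building the neural network model (neuralnet() comes from the neuralnet package)
    #hidden = c(3,2) means two hidden layers with 3 and 2 neurons
    NN <- neuralnet(price~., data = train_2, linear.output = TRUE, hidden = c(3,2))
    plot(NN)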

    #Create predictions from the fitted model; columns 2 to 19 are the predictors, so the price column (column 1) is left out
    pred_NN_train <- compute(NN, train_2[, 2:19])
    
    #Bind the actual price data from the train data with the predictions from the line above
    nn_values <- cbind.data.frame(actual = train_2$price, prediction = pred_NN_train$net.result)
    
    #The correlation is noticeably higher than with the linear regression model
    cor(nn_values) #We get a correlation of 0.938003


    Having seen how the model performs on this set, we examine its performance on the test data;

    #Normalizing the test data using the model from the train data
    test_2 <- predict(preProcess_range_model, newdata = test_2)
    
    #prediction for the test data
    pred_NN_test <- compute(NN, test_2[,2:19])
    
    #Bind the actual (normalized) prices with the predictions
    nn_values_test <- cbind.data.frame(actual = test_2$price, prediction = pred_NN_test$net.result)
    
    #Examine the correlation
    cor(nn_values_test) #0.9347564

    Conclusion

    A clear improvement can be seen when the results of the simple linear model are compared with those of the neural network, i.e. the latter performs better (a test correlation of 0.9347564 against 0.8254583). There are ways of improving a model's performance: the neural network can often be improved by adding more hidden layers, which, however, requires more computation time. Other, more involved methods include hyperparameter tuning, for example with a grid search.
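    To make the idea of a grid search concrete, here is a minimal base R sketch. It is an illustration under stated assumptions, not a tuning recipe for the housing model: the data are synthetic, and the hyperparameter being tuned is the degree of a polynomial regression rather than the layout of a neural network.

```r
# Synthetic data, for illustration only: a cubic signal plus noise.
set.seed(7)  # arbitrary seed
x <- runif(200, -2, 2)
y <- x^3 - x + rnorm(200, sd = 0.5)
demo <- data.frame(x = x, y = y)

# Hold out 50 rows for validation
train_rows <- sample(nrow(demo), 150)
train_demo <- demo[train_rows, ]
valid_demo <- demo[-train_rows, ]

# The grid: candidate values of the hyperparameter (polynomial degree)
grid <- 1:6

# Fit one model per candidate and score it on the validation rows
valid_rmse <- sapply(grid, function(degree) {
  fit  <- lm(y ~ poly(x, degree), data = train_demo)
  pred <- predict(fit, newdata = valid_demo)
  sqrt(mean((valid_demo$y - pred)^2))
})

# Keep the candidate with the lowest validation error
best_degree <- grid[which.min(valid_rmse)]
data.frame(degree = grid, rmse = valid_rmse)
best_degree
```

    The same loop structure applies to a neural network: replace the polynomial degree with candidate values for the hidden argument and keep the configuration that scores best on held-out data.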

© 2022   ·   Ehi Kioya   ·   All Rights Reserved