Web Scraping of HTML Content Using R

Tagged: R, Web Scraping

March 6, 2020 at 11:13 pm
    Participant
    @simileoluwa

    Web scraping has become one of the most effective ways of getting data online. Because it is robust and reliable, it gives stakeholders the ability to extract information from the virtually endless free resources on the internet. Before the advent of scraping tools and technologies, business owners found it difficult to get the data they wanted from the web. They typically had to outsource the job to data mining providers who extracted the required data manually, a tedious, ineffective, and time-consuming process. It could take weeks, sometimes months, before the required data was obtained, yet many business decisions need to happen in near real-time, so the results often arrived too late to be useful.

    Every 21st-century business enterprise, whether a startup or a large organization, has one thing in common: its most effective strategies and insights are data-driven. Whether you need to start a new project or develop a new strategy for an existing business, you constantly have to access and analyze huge amounts of data. This is where web scraping comes in.

    Web scraping is the process of collecting data from the World Wide Web and transforming it into a structured format. There are several web scraping techniques:

    1. Manual copy and paste: This is perhaps the most basic web scraping technique, where a human simply copies and pastes data from web pages. It works, but it quickly becomes very tedious.
    2. HTML parsing: Most web pages are generated dynamically from databases using similar templates and CSS selectors. Identifying those CSS selectors lets you mimic the structure of the underlying database tables. This is the technique used in this tutorial.
    3. Using an API/SDK (socket programming): Large websites often provide APIs that let you fetch their content programmatically. A typical example is how Twitter allows developers to obtain tweets by hashtag through the API available via its developer account (see the sketch after this list).
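
    To illustrate the API approach in R, here is a minimal sketch using the httr package. The endpoint URL, query parameters, and token below are placeholders for illustration only, not a real API.

    library(httr) #HTTP client for calling web APIs

    #Hypothetical endpoint and token, for illustration only
    token <- "YOUR_API_TOKEN"
    resp <- GET("https://api.example.com/v1/tweets",
                query = list(hashtag = "rstats", count = 100),
                add_headers(Authorization = paste("Bearer", token)))

    #Parse the JSON response body into an R list
    tweets <- content(resp, as = "parsed")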

    Web Scraping with R

    The focus of this tutorial is scraping HTML text. To follow along, you will need the Google Chrome browser with the SelectorGadget extension installed. A successful installation adds a lens-like icon to the top right corner of your browser.

    We will scrape some information off the IMDB feature film listing for 2016 and replicate some of the key movie features in a table. We start by loading the required libraries:

    #Scraping IMDB: load the required packages
    library(rvest)     #The web scraping package
    library(tidyverse) #For data wrangling
    library(stringr)   #For text wrangling

    We will scrape different parts of the IMDB listing "Feature Film, Released between 2016-01-01 and 2016-12-31 (Sorted by Popularity Ascending)". Using various scraping techniques, we can extract the desired movie information. The movies are ranked, so we will first extract the ranks:

    #Specifying the URL of the website to be scraped
    url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'
    #Reading the HTML code from the website
    webpage <- read_html(url)
    #Retrieve the ranks
    rank_html <- html_nodes(webpage, ".text-primary")
    rank <- html_text(rank_html)
    rank
    #Converting from text to numeric
    rank <- as.numeric(rank)

    Notice that we simply copied and pasted the URL of the website, then used the rvest package to extract the HTML content matching the supplied tag. The ".text-primary" selector for the ranks was obtained with the SelectorGadget extension, which makes it easy to visually pick out the right HTML tag.

    There are 100 movies in this listing. Next we will get the titles of all the featured movies, their genres, and their descriptions, and then bind all the results into a dataframe:

    #Obtain the titles
    title_html <- html_nodes(webpage, ".lister-item-header a")
    title <- html_text(title_html)
    title
     
    #Obtain the genres
    genre_html <- html_nodes(webpage, ".genre")
    genre <- html_text(genre_html)
    genre
     
    #Preprocess genre: remove \n and excess spaces
    genre <- gsub("\n", "", genre)
    genre <- gsub(" ", "", genre)
    genre
     
    #Get the descriptions
    #Using CSS selectors to scrape the description section
    description <- html_nodes(webpage, '.ratings-bar+ .text-muted')
     
    #Converting the description data to text
    description <- html_text(description)
     
    #Bind all results into a dataframe
    cbind.data.frame(rank, title, genre, description)
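
    To confirm that the scrape worked, it helps to store the combined result and inspect it. A quick sketch (the movies variable name is my own):

    #Store the combined result for inspection
    movies <- cbind.data.frame(rank, title, genre, description)
     
    #Quick sanity checks: expect 100 rows with the four columns above
    str(movies)
    head(movies)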


    The code block above extracts the desired information and combines it all into a data frame. If the website is updated, re-running the code will pick up the new information. The rvest package makes it very easy to extract data from the web using simple and understandable syntax. We can even go further and extract the gross earnings of each movie, with a little data wrangling:

    #Obtain the gross revenue
    gross <- html_nodes(webpage, '.ghost~ .text-muted+ span')
     
    #Extract the HTML text and examine the data
    gross <- html_text(gross)
    gross
     
    #Use gsub to remove the '$' and 'M' signs
    gross <- gsub("M", "", gross)
    gross <- gsub("[$]", "", gross)
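
    After stripping the symbols, the values can be converted to numbers. One caveat worth noting: not every movie lists a gross, so this vector may come back with fewer than 100 entries and will no longer line up with the other columns. Below is a sketch of a per-movie extraction that keeps missing values aligned (the blocks variable and the ".lister-item-content" selector are my assumptions about the page markup; verify them with SelectorGadget):

    #Convert the cleaned text to numeric (values are in millions of USD)
    gross <- as.numeric(gross)
     
    #Per-movie extraction keeps missing grosses aligned:
    #html_node() returns one (possibly missing) match per block,
    #and html_text() turns a missing match into NA
    blocks <- html_nodes(webpage, ".lister-item-content")
    gross_aligned <- html_text(html_node(blocks, ".ghost~ .text-muted+ span"))
    gross_aligned <- as.numeric(gsub("[$M]", "", gross_aligned))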

    There are other HTML contents available, such as the rating, year, and runtime. Scraping them follows the same procedure as the texts we already obtained: get the right HTML tag, then pass it into the HTML text-extraction functions available in R to fetch the desired result, as in the sketch below.
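
    For instance, runtime and rating can be pulled as follows. The two selectors are my assumptions based on the same page's markup; confirm them with SelectorGadget before relying on them.

    #Runtime, e.g. "123 min" -> 123 (selector assumed)
    runtime <- html_text(html_nodes(webpage, ".runtime"))
    runtime <- as.numeric(gsub(" min", "", runtime))
     
    #IMDB rating (selector assumed; may be missing for some movies)
    rating <- html_text(html_nodes(webpage, ".ratings-imdb-rating strong"))
    rating <- as.numeric(rating)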

    Conclusion

    Web scraping requires practice with a variety of websites: different websites have different architectures, so you will often need to adapt your code to fit your needs. Remember that you must obtain the right HTML tag for the desired content, otherwise you risk extracting incorrect information. Finally, your data wrangling skills matter, because data scraped off the web almost always comes out messy; the data is only as good as your ability to clean it for use.
