Web Scraping of HTML Tables Using R

Tagged: Data science, Web Scraping

Posted February 14, 2020 at 8:25 pm by @simileoluwa


    Data and information available online are growing exponentially, because new content is posted every second across the internet's many platforms: social media posts, images, videos, blog posts, newspaper articles, and more. This wealth of data opens new horizons of possibility for a data scientist. Almost any information you need is available somewhere in the vast platform called the Internet.

    Most of the data available on the web is unstructured, i.e. it does not reside in the ready-to-use row-and-column format needed for modeling. It is embedded in HTML and is not directly downloadable, so turning it into input for the models that drive decision making requires some knowledge and expertise. The more data you collect, the better your models; but what if the data you want resides on a website? Web scraping is a technique for extracting large amounts of data from websites and storing it in a database of your choice. It has quite a number of applications; to mention a few:

    1. Trend tracking for analytics: With web scraping, one can easily track events and analyze them as they occur. A typical example is extracting tweets and analyzing trends and opinions in real time.
    2. Price monitoring: It is possible to track, in real time, the prices of products on your own website as well as your competitors'. A pricing system powered by web scraping is an essential component of such monitoring.
    3. Service providers: Information such as complaints and customer reviews is crucial for improving services, and web scraping provides a way to extract these large chunks of data. For example, competitors' reviews can be scraped and analyzed to identify their weak points and selling points, facilitating an effective decision-making process.

    Web Scraping of HTML Tables Using R

    To follow this tutorial, you will need the Google Chrome browser with the SelectorGadget extension installed. Once both are set up, you should see a lens-like icon at the top right corner of your browser.

    Most of the web is built out of HTML, CSS, and JavaScript. When you visit a web page, your computer sends a request to a web server, which returns a bunch of text in the form of HTML. Your browser then renders that HTML into the page you see. You can view the underlying HTML code yourself (in Chrome: View > More Tools > Developer Tools).
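    The same request can be made from R. The snippet below is a minimal sketch (it assumes the rvest package, introduced shortly, is already installed) that fetches a page and prints the parsed HTML document:

    #Fetch a page and parse the HTML the server returns
    library(rvest)
    page <- read_html("https://en.wikipedia.org/wiki/Main_Page")
    page #Printing shows the parsed document's <head> and <body> nodes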
    When you click the SelectorGadget icon, you can move it over the HTML element you want to extract from the web page. In this case, we want to extract the List of English Districts by Population table available on Wikipedia. Carefully clicking on the table highlights it and reveals the CSS selector that matches it.


    The yellow highlighting marks the content we are interested in, and the selector SelectorGadget reports (.wikitable) tells us where the table lives in the page's HTML. We can now set up the R libraries we will use for this exercise:

    #Load libraries
    library(rvest)     #Scraping functions: read_html(), html_node(), html_table()
    library(tidyverse) #Data wrangling and visualization (dplyr, ggplot2, etc.)
    library(stringr)   #String manipulation helpers (also part of the tidyverse)

    After loading the libraries, extracting the HTML table can be performed as shown below, using the functions available in the rvest package:

    #Reading the table from the Wikipedia page
    table_01 <- read_html("https://en.wikipedia.org/w/index.php?title=List_of_English_districts_by_population&oldid=879013307") %>% #Fetch and parse the page
      html_node(".wikitable") %>% #Select the table node via its CSS selector
      html_table(fill = T)        #Parse the HTML table into a data frame
    
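    As a quick sanity check (the exact dimensions and column names depend on the revision of the Wikipedia page), we can inspect what came back:

    #Inspect the parsed table
    dim(table_01)  #Number of rows and columns
    head(table_01) #First few rows of the table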

    We now have the HTML table; notice the selector (.wikitable) we passed to html_node(), which is where the table is located on the page. To make this tutorial more elaborate, we will also scrape the cryptocurrency table on the TradingView website. Implementing the same procedure, we can obtain the desired table:

    #Extract the html table using the url
    crypt_table <- read_html("https://www.tradingview.com/markets/cryptocurrencies/prices-all/") %>%
      html_node("table") %>% #The first <table> element on the page
      html_table(fill = T) %>%
      as.data.frame()
    
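    Inspecting the raw result shows why cleaning is about to be necessary: the numeric columns arrive as text with symbols such as B, M, K, and % attached (the example values in the comment are illustrative; actual figures depend on when the page is scraped):

    #All columns come through as text, e.g. "128.96B" or "2.98%"
    str(crypt_table)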

    It is quite easy to scrape tables from websites; however, they rarely come out in a ready-to-use format. Analytical procedures such as descriptive statistics and data visualization cannot be performed yet, because most of the columns have special characters attached to them and so are not recognized as numerical values. One needs to be effective with data-wrangling techniques to transform the data into a usable format. The scraped cryptocurrency table needs to be cleaned, so we execute the following transformation procedure:

    #Remove the last column, which contains only NA's
    crypt_table <- crypt_table[, -9]

    #Create a vector of column names
    x <- c("Currency", "Mkt Cap", "Fd Mkt Cap", "Last", "Avail coins", "Total coins", "Traded Vol", "Chng%")

    #Add the column names to the dataframe
    names(crypt_table) <- x

    #Save the currency names before cleaning the numeric columns
    currency <- crypt_table$Currency

    #Clean the dataframe: strip unwanted symbols (B/M/K suffixes, percent
    #signs, commas, and spaces) from the numeric columns
    crypt_table <- apply(crypt_table[, 2:8], 2, gsub, pattern = "[BMK%, ]", replacement = "")

    #Convert the cleaned text to numbers and join the currency names back on
    crypt_table1 <- crypt_table %>%
      as_tibble() %>%
      mutate_if(is.character, as.numeric) %>%
      cbind.data.frame(currency)
    


    The scraped table, which was once unsuitable for analytical procedures, has now been transformed into a format that supports visualization and other exploratory processes. It is therefore paramount to understand data-wrangling concepts: while data can in most cases be easily scraped from various websites, the insights that can be drawn depend largely on the analyst's ability to transform the acquired data efficiently.
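    For instance, here is a quick exploratory plot, a minimal sketch using ggplot2 (loaded with the tidyverse). Note that stripping the B/M/K suffixes discards each figure's scale, so a serious analysis would first convert all values to a common unit:

    #Plot the ten largest rows by the cleaned "Mkt Cap" column
    crypt_table1 %>%
      arrange(desc(`Mkt Cap`)) %>%
      head(10) %>%
      ggplot(aes(x = reorder(currency, `Mkt Cap`), y = `Mkt Cap`)) +
      geom_col() +
      coord_flip() +
      labs(x = "Currency", y = "Market cap (unit as scraped)")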

    Conclusion

    In this article, we saw how to scrape tables off websites. Although we only attempted tables, free-form text, which is usually also unstructured and littered with unwanted characters, can likewise be scraped and wrangled to derive insights.

