February 14, 2020 at 8:25 pm | #85840 | Simileoluwa, Participant (@simileoluwa)
Data and information available online are growing rapidly, because new content is posted every second across internet platforms: social media posts, images, videos, blog posts, newspaper articles, and so on. This volume of data opens new possibilities for a Data Scientist. Almost any information you need is available somewhere on the Internet.
Most of the data available on the web is unstructured, i.e. it does not reside in the ready-to-use row-and-column format that modeling requires. It is embedded in HTML and usually cannot be downloaded directly, so it takes knowledge and expertise to turn it into the models used in decision-making. More data generally leads to better models, but what if the data you want resides on a website? You need techniques that extract it. Web Scraping is a technique for extracting large amounts of data from websites and storing it in a database of your choice. Web Scraping has many applications, to mention a few:
- Trend Tracking for Analytics: By applying Web Scraping techniques, one can track events and analyze them as they occur. A typical example is extracting tweets and analyzing trends and opinions in real time.
- Price Monitoring: It is possible to track, in real time, the prices of products on your own website and on competitors' websites. A pricing system powered by Web Scraping is an essential component of this.
- Service providers: Information such as complaints and customer reviews is crucial for service providers looking to improve their services. Web Scraping provides ways of extracting these large chunks of data. For example, reviews of competitors can be scraped and analyzed to identify their weak points and selling points, which supports effective decision-making.
Web Scraping of HTML Tables Using R
To follow this tutorial, you need the Google Chrome browser and the SelectorGadget extension for Chrome. Once both are installed, an icon resembling a magnifying glass appears at the top right corner of the browser.
Click the SelectorGadget icon, then move the pointer over the HTML element you want to extract from the web page. In this case we want to extract the List of English Districts by Population table available on Wikipedia. Carefully click on the table found on the page, and SelectorGadget highlights it.
The yellow highlighted text is what we are interested in: it gives the CSS selector that identifies where the table sits in the page's HTML. We can then load the R libraries used in this exercise:

    # Load libraries
    library(rvest)      # read_html(), html_node(), html_table()
    library(tidyverse)  # pipes and data wrangling (dplyr, tibble, ...)
    library(stringr)    # string helpers (also attached by tidyverse)
With the libraries loaded, the HTML table can be extracted using functions from the rvest package:

    # Read the table from the Wikipedia page
    table_01 <- read_html("https://en.wikipedia.org/w/index.php?title=List_of_English_districts_by_population&oldid=879013307") %>%  # download and parse the page
      html_node(".wikitable") %>%  # select the first element with class "wikitable"
      html_table(fill = TRUE)      # convert it to a data frame, padding short rows
    # Recent rvest versions always fill and may warn that `fill` is deprecated
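The same three steps (parse, select, convert) can be tried without a network connection. The sketch below is only an illustration: the HTML string, the district names, and the variable names `page` and `demo_table` are made up, and it assumes `read_html()` is given literal HTML rather than a URL.

```r
library(rvest)

# Hypothetical HTML standing in for a downloaded page (not real data)
page <- read_html('<html><body>
  <table class="other">
    <tr><th>x</th></tr>
    <tr><td>9</td></tr>
  </table>
  <table class="wikitable">
    <tr><th>Rank</th><th>District</th><th>Population</th></tr>
    <tr><td>1</td><td>Alpha</td><td>1,000</td></tr>
    <tr><td>2</td><td>Beta</td><td>500</td></tr>
  </table>
</body></html>')

# The class selector skips the first table and picks the one we want
demo_table <- page %>%
  html_node(".wikitable") %>%
  html_table()

print(demo_table)
```

The Population column contains a thousands separator, so it is a reasonable place to check how such values arrive before any cleaning is done.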
We now have the HTML table. Notice the selector .wikitable passed to html_node(): it identifies where the table is located in the page. To make this tutorial more elaborate, we will also scrape the cryptocurrency table on the TradingView website. Following the same procedure, we can obtain the desired table:

    # Extract the HTML table using the URL
    crypt_table <- read_html("https://www.tradingview.com/markets/cryptocurrencies/prices-all/") %>%
      html_node("table") %>%        # select the first <table> on the page
      html_table(fill = TRUE) %>%   # convert it to a data frame
      as.data.frame()
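Selecting with "table" returns only the first table on a page, which is not always the one you want. As a hedged sketch, the example below collects every table with `html_nodes()` and then picks one; the inline HTML and the rule of choosing the widest table are assumptions made for illustration, not part of the original tutorial.

```r
library(rvest)

# Hypothetical page with a small layout table and a data table
page <- read_html('<html><body>
  <table><tr><th>menu</th></tr><tr><td>home</td></tr></table>
  <table>
    <tr><th>Coin</th><th>Last</th></tr>
    <tr><td>AAA</td><td>1.5</td></tr>
    <tr><td>BBB</td><td>0.2</td></tr>
  </table>
</body></html>')

# html_table() on a set of nodes returns a list, one element per table
all_tables <- page %>%
  html_nodes("table") %>%
  html_table()

# Inspect how many tables were found and how wide each one is
n_cols <- sapply(all_tables, ncol)

# Illustrative rule: keep the table with the most columns
wanted <- all_tables[[which.max(n_cols)]]
print(wanted)
```

If the list comes back empty on a real site, the table may be built by JavaScript after the page loads, in which case the static HTML read by `read_html()` will not contain it.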
It is quite easy to scrape tables from websites; however, tables do not come out in a ready-to-use format. Analytical procedures such as descriptive statistics and data visualization cannot be performed, because most of the columns have special characters attached and so are not recognized as numerical values. Effective data wrangling is needed to transform the data into a usable format. The scraped cryptocurrency table needs cleaning, so we apply the following transformations:

    # Column names for the scraped table
    x <- c("Currency", "Mkt Cap", "Fd Mkt Cap", "Last", "Avail coins",
           "Total coins", "Traded Vol", "Chng%")

    # Remove the last column, which contains only NAs, then name the rest
    crypt_table <- crypt_table[, -9]
    names(crypt_table) <- x

    # Keep the currency names aside so they are not cleaned as numbers
    currency <- crypt_table$Currency

    # Strip the B, M, K suffixes, percent signs, and spaces from columns 2 to 8
    # (note: removing B/M/K also discards the magnitude those suffixes carry)
    cleaned <- apply(crypt_table[, 2:8], 2, gsub,
                     pattern = "[BMK% ]", replacement = "")

    # Convert the cleaned text to numbers and join the currency names back on
    crypt_table1 <- cleaned %>%
      as_tibble() %>%
      mutate_if(is.character, as.numeric) %>%
      cbind.data.frame(Currency = currency)
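Removing the B, M, and K suffixes makes the columns numeric, but it also puts 1.5B and 1.5K on the same scale. One possible alternative, sketched below, multiplies each value by the magnitude its suffix implies. The helper name `parse_abbrev` and the sample inputs are hypothetical, and it assumes K, M, and B mean thousand, million, and billion.

```r
library(stringr)

# Hypothetical helper: turn strings such as "1.5B" or "-2.3%" into numbers
parse_abbrev <- function(x) {
  x <- str_trim(x)
  # Multiplier implied by a trailing K, M, or B (1 when there is none)
  suffix <- str_extract(x, "[KMB]$")
  mult <- c(K = 1e3, M = 1e6, B = 1e9)[suffix]
  mult[is.na(mult)] <- 1
  # Keep only digits, the minus sign, and the decimal point
  value <- as.numeric(str_remove_all(x, "[^0-9.\\-]"))
  unname(value * mult)
}

parsed <- parse_abbrev(c("1.5B", "250M", "12K", "-2.3%", "7"))
print(parsed)
```

Under the same assumptions, the helper could be applied column by column, for example with `lapply()` over the columns that hold abbreviated values, while leaving the currency names untouched. Percentages are returned as percentage points, not fractions.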
The scraped table, once unsuitable for analysis, has now been transformed into a format that supports visualization and other exploratory work. It is therefore essential to understand data wrangling: while data can be scraped easily from most websites, the insights that can be drawn depend largely on the analyst's ability to transform the acquired data efficiently.
In this article, we saw how to scrape tables off websites. Although we only scraped tables, text, which is usually also unstructured and full of unwanted characters, can likewise be scraped and wrangled to derive insights.
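As a rough sketch of what scraping and wrangling text might look like, the example below pulls paragraphs out of an inline HTML string and normalizes them with stringr. The HTML, the "review" class name, and the cleaning rules are all assumptions chosen for illustration; real pages will need their own selectors and rules.

```r
library(rvest)
library(stringr)

# Hypothetical page fragment containing two reviews and one unrelated paragraph
page <- read_html('<html><body>
  <p class="review">  Great   product!!! &amp; fast delivery :) </p>
  <p class="review">Too expensive...   would NOT buy again.</p>
  <p class="ad">Buy now</p>
</body></html>')

# Select only the review paragraphs and extract their text
reviews <- page %>%
  html_nodes("p.review") %>%
  html_text()

# Lowercase, replace anything that is not a letter or space, collapse whitespace
clean_reviews <- reviews %>%
  str_to_lower() %>%
  str_replace_all("[^a-z ]", " ") %>%
  str_squish()

print(clean_reviews)
```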