Tagged: R, Web Scraping
- This topic has 0 replies, 1 voice, and was last updated 10 months, 1 week ago by
Simileoluwa.
- AuthorPosts
- March 6, 2020 at 11:13 pm #87232 Participant @simileoluwa
Web scraping has become one of the most effective ways of getting data online. Because it is robust and reliable, it lets stakeholders extract information from the vast free resources on the internet. Before scraping tools and technologies were available, business owners found it difficult to get the data they wanted from the web. They usually had to outsource the work to data mining providers who extracted the required data manually, which was tedious and time-consuming. It could take weeks, sometimes months, to obtain the data. Since many business decisions are made in real time, data that arrived this late was often of little use.
Every 21st-century business, whether a startup or a large organization, has one thing in common: its most effective strategies and insights are data-driven. Whether you are starting a new project or developing a new strategy for an existing business, you constantly need to access and analyze large amounts of data. This is where web scraping comes in.
Web scraping is the process of collecting data from the World Wide Web and transforming it into a structured format. There are several web scraping techniques:
- Manual copy and paste: This is perhaps the most basic technique, where a person copies data from a web page and pastes it elsewhere by hand. It works, but it quickly becomes tedious.
- HTML parsing: Most web pages are generated dynamically from databases using similar templates, so the same kind of content can be targeted with the same CSS selectors. Identifying those selectors lets you rebuild the structure of the underlying database tables.
- Using an API/SDK (socket programming): Large websites often provide an API that returns their content directly. A typical example is Twitter, which lets developers with a developer account retrieve tweets by hashtag through its API.
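To make the HTML parsing idea concrete, here is a minimal sketch in R using rvest. The page fragment and the class names `movie`, `name` and `year` are made up for illustration and do not come from any real site. The point is only that each CSS selector ends up playing the role of one table column.

```r
library(rvest)

# Hypothetical page fragment: every record is rendered from the same template
page <- read_html('
  <ul>
    <li class="movie"><span class="name">Film A</span><span class="year">2016</span></li>
    <li class="movie"><span class="name">Film B</span><span class="year">2015</span></li>
  </ul>')

# Each CSS selector supplies the values for one column
movies <- data.frame(
  name = html_text(html_nodes(page, ".movie .name")),
  year = as.numeric(html_text(html_nodes(page, ".movie .year"))),
  stringsAsFactors = FALSE
)
movies
```

The same pattern is used on the real page later in this post, only with selectors found through the browser instead of invented ones.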
Web Scraping with R
This tutorial focuses mainly on scraping HTML text. To follow along, you need the Google Chrome browser and the SelectorGadget extension for Chrome. After a successful installation, a lens-like icon appears at the top right corner of your browser, as shown below.
We will scrape some information from the IMDB feature film listing for 2016 and collect some of the key movie attributes in a table. First, load the required libraries:
```r
# Scraping IMDB
library(rvest)     # The web scraping package
library(tidyverse) # For data wrangling
library(stringr)   # Text wrangling
```

We will scrape different parts of the IMDB Feature Film listing, released between 2016-01-01 and 2016-12-31 (sorted by popularity, ascending). Using various scraping techniques we can extract the movie information we want. The movies are ranked, so we will first extract the ranks:
```r
# Specify the URL of the website to be scraped
url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'

# Read the HTML code from the website
webpage <- read_html(url)

# Retrieve the ranks
rank_html <- html_nodes(webpage, ".text-primary")
rank <- html_text(rank_html)
rank

# Convert the ranks from text to numeric
rank <- as.numeric(rank)
```

Notice that we copied and pasted the URL of the website, then used the rvest package to extract the contents of the HTML elements matched by the supplied selector. The image below shows how SelectorGadget identified that selector.
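As a side note on the conversion step, here is a small sketch of a slightly more defensive version. It assumes the ranks come back as text such as "1." and that larger listings might include a thousands separator. Both the fragment and those formats are assumptions for illustration, not something taken from the page itself.

```r
library(rvest)

# Hypothetical fragment imitating elements matched by ".text-primary"
fragment <- read_html('
  <div><span class="text-primary">1.</span></div>
  <div><span class="text-primary">2.</span></div>
  <div><span class="text-primary">1,001.</span></div>')

rank_text <- html_text(html_nodes(fragment, ".text-primary"))

# Keep digits only, so that "1." and "1,001." both convert without NA
rank_num <- as.numeric(gsub("[^0-9]", "", rank_text))
rank_num
```

For the 100-movie listing used here, plain `as.numeric()` is enough. Stripping non-digits first only matters once values like "1,001." appear.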
From the image above, SelectorGadget made it easy to obtain the selector for the ranks. This listing contains 100 movies. Next we will get the title, genre and description of every movie in the ranking, then bind all the results into a data frame:
```r
# Obtain the titles
title_html <- html_nodes(webpage, ".lister-item-header a")
title <- html_text(title_html)
title

# Obtain the genres
genre_html <- html_nodes(webpage, ".genre")
genre <- html_text(genre_html)
genre

# Preprocess genre: remove "\n" and excess spaces
genre <- gsub("\n", "", genre)
genre <- gsub(" ", "", genre)
genre

# Get the descriptions
# Use CSS selectors to scrape the description section
description <- html_nodes(webpage, '.ratings-bar+ .text-muted')

# Convert the description data to text
description <- html_text(description)

# Bind all results into a data frame
cbind.data.frame(rank, title, genre, description)
```
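One thing to watch for: `cbind.data.frame()` only works when every vector has the same length, so a movie that lacks one of the fields can break the bind or misalign the rows. A possible workaround, sketched below on a made-up fragment, is to select one node per movie first and then look up each field inside it with `html_node()`, which returns a missing value where nothing matches. The `.lister-item` container class and the fragment itself are assumptions for illustration.

```r
library(rvest)

# Hypothetical listing: the second movie has no genre element
listing <- read_html('
  <div class="lister-item">
    <h3 class="lister-item-header"><a>Film A</a></h3>
    <span class="genre">
Action, Drama            </span>
  </div>
  <div class="lister-item">
    <h3 class="lister-item-header"><a>Film B</a></h3>
  </div>')

# One node per movie, then one lookup per movie inside that node
items <- html_nodes(listing, ".lister-item")
title <- html_text(html_node(items, ".lister-item-header a"))
genre <- html_text(html_node(items, ".genre"))

# trimws() drops the newline and padding but keeps the space after each comma
movies <- data.frame(title = title, genre = trimws(genre),
                     stringsAsFactors = FALSE)
movies
```

Here the second row gets `NA` for the genre instead of shifting the remaining values up by one.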
The code block above extracts the desired information and combines it all into a data frame. If the website is updated, running the code again will pull the new information. The rvest package makes it easy to extract data from the web with simple, readable syntax. We can go further and extract the gross earnings of each movie, then do a little data wrangling.

```r
# Obtain the gross revenue
gross <- html_nodes(webpage, '.ghost~ .text-muted+ span')

# Extract the HTML text
gross <- html_text(gross)

# Examine the data
gross

# Use gsub to remove the '$' and 'M' signs
gross <- gsub("M", "", gross)
gross <- gsub("[$]", "", gross)
```

There is other content available, such as the rating, year and runtime. Scraping these follows the same procedure as the text obtained above: find the right CSS selector, then pass the matched nodes to the HTML text-extraction functions in R to get the result.
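After the `gsub()` calls the gross values are still text. A minimal sketch of the remaining wrangling step is shown below on sample strings. The formats "$532.18M" and "107 min" are assumptions about how the page renders these fields, and the values are made up.

```r
# Sample strings in the formats the scraped fields are assumed to have
gross   <- c("$532.18M", "$8.11M", NA)
runtime <- c("107 min", "95 min", "123 min")

# Strip "$" and "M" in one pass, then convert; NA stays NA
gross_millions <- as.numeric(gsub("[$M]", "", gross))

# Keep only the digits of the runtime and convert to minutes
runtime_min <- as.numeric(gsub("[^0-9]", "", runtime))

gross_millions
runtime_min
```

Once the columns are numeric they can be summarised or plotted like any other data frame column.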
Conclusion
Web scraping takes practice on a variety of websites. Different websites have different architectures, so you will often need to adapt your code to fit each one. Remember that you must find the right CSS selector for the content you want, otherwise you may extract the wrong information. Your data wrangling skills matter just as much, because data scraped off the web almost always comes out messy. The data is only as good as your ability to clean it for use.