February 1, 2020 at 1:40 am, by @simileoluwa
The Concept of Text Mining
Recent statistics suggest that 90% of the data currently available in the world was produced in the last two years. At our current pace, we generate about 2.5 quintillion bytes of data each day (a quintillion is a 1 followed by 18 zeros). In fact, it would take roughly 181 million years for one person to download all the data in the world.
So where does all this data come from? From Google searches, applications, social media, and much more. Every form of data, whether numbers or text, is being recorded in real time, which leads us to the concepts of text mining (or text analytics) and sentiment analysis.
Text mining or Analytics
Text mining covers a range of techniques for exploring unstructured text in search of keywords, topics, concepts, and so on. Unstructured data is simply data that is not stored in rows and columns: a tweet, an article, and a blog post are all examples. The process of understanding what such data contains is referred to as text mining or text analytics.
So what is sentiment analysis? Simply put, it is a subset of text mining that determines whether a piece of text is positive, negative, or neutral. For example, hundreds of thousands of reviews about a product can be analysed with this technique to answer business questions about attitudes towards the product, common complaints, and more.
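Before reaching for any packages, the core idea can be sketched in a few lines of base R. This is a toy illustration with made-up word lists, not a real sentiment lexicon: we count how many words of a text appear in a "positive" list versus a "negative" list.

```r
# A minimal sketch of lexicon-based sentiment scoring in base R.
# The word lists here are tiny illustrative stand-ins, not a real lexicon.
positive_words <- c("good", "great", "love", "excellent")
negative_words <- c("bad", "poor", "hate", "terrible")

score_sentiment <- function(text) {
  # Lower-case the text and split it into individual words
  words <- unlist(strsplit(tolower(text), "\\s+"))
  # Score = number of positive matches minus number of negative matches
  score <- sum(words %in% positive_words) - sum(words %in% negative_words)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

score_sentiment("I love this great product")  # "positive"
score_sentiment("bad quality, I hate it")     # "negative"
```

Real lexicons (such as the bing lexicon we use later) are far larger and curated, but the matching idea is the same.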
Performing a Sentiment Analysis Using R Programming
In this tutorial, we will work through a basic introduction to text mining and sentiment analysis using the R programming language. For this beginner's guide, I will be providing a dataset of tweets. Click here to download. It contains 1,000 tweets on the currently trending coronavirus.
```r
#Setting up your libraries
#Load in the required packages
library(tm)
library(lubridate)
library(ggplot2)
library(scales)
library(reshape2)
library(dplyr)
library(wordcloud)
library(wordcloud2)
library(tidytext)
library(stringr)
library(RColorBrewer)
```
Now you load in your data set.
```r
#Ensure your working directory is where your data is found
getwd()  #Check your current working directory
corona_tweets_df <- read.csv("corona_tweets.csv")
```
If you have loaded the dataset correctly, it should look like the above in your console. It is essential to clean your dataset, as that is the basis for all insights; we will do so using the tidytext package.
The tidytext package structures the data according to the rules of tidy data, which state that:
• Each variable must have its own column.
• Each observation must have its own row.
• Each value must have its own cell.
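As a concrete illustration of the tidy shape we are after, here is a base-R sketch with a made-up two-tweet data frame (not the tweet dataset itself). Going from one tweet per row to one word per row is what unnest_tokens() automates; strsplit() stands in for it here.

```r
# One tweet per row: the text column holds many words, so it is not yet
# tidy for word-level analysis
tweets <- data.frame(id = c(1, 2),
                     text = c("corona virus spreads", "stay safe"),
                     stringsAsFactors = FALSE)

# One word per row: the tidy shape that unnest_tokens() produces,
# sketched here by hand with strsplit()
word_list <- strsplit(tweets$text, " ")
tidy_words <- data.frame(id = rep(tweets$id, lengths(word_list)),
                         word = unlist(word_list),
                         stringsAsFactors = FALSE)
nrow(tidy_words)  # 5 rows: one per word
```

Once every word has its own row, counting, filtering, and joining against lexicons all become ordinary data-frame operations.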
We use the unnest_tokens() function to transform our dataset from one tweet per row to one word per row, as seen below.
```r
corona <- corona_tweets_df %>%
  mutate(text = removeNumbers(text)) %>%  #remove numbers in the text
  unnest_tokens(word, text) %>%           #text is unnested into a column called word
  select(word)
```
You will now see that each word has its own row: the clustered texts have been unnested. The next stage is to remove stop words (frequently used words such as "the", "and", and "of" that carry little meaning) and then count the remaining words.
```r
corona_1 <- corona %>%
  anti_join(stop_words) %>%  #Remove stop words
  count(word) %>%            #Count each remaining word
  arrange(desc(n))           #Arrange by count in descending order
```
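Conceptually, anti_join(stop_words) keeps only the words that do not appear in the stop-word list. A minimal base-R sketch of the same idea, using a hand-picked stop-word list rather than the full tidytext one:

```r
# Filtering out stop words "by hand" in base R
words <- c("the", "virus", "is", "spreading", "in", "china")
stop_list <- c("the", "is", "in", "a", "of", "and")  # a few common stop words
kept <- words[!words %in% stop_list]
kept  # "virus" "spreading" "china"
```

The tidytext stop_words table works the same way, just with about a thousand entries drawn from several standard lexicons.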
Let's move on to the visualizations available to us for more insights.
```r
#Let's create a wordcloud
wordcloud(corona_1$word, freq = corona_1$n,
          max.words = 200, min.freq = 1,
          scale = c(4, 0.5), random.order = FALSE,
          rot.per = 0.5, colors = brewer.pal(8, "Dark2"))
```
Next, we get the words with the highest counts of occurrence. This allows us to view the top words and how many times each occurred.
```r
#The top 20 words used
corona_1 %>%
  top_n(20) %>%
  ggplot(aes(reorder(word, n), n, fill = word)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  coord_flip() +
  xlab("Words") +
  ggtitle("Top 20 Words Used")
```
One of the core aspects of sentiment analysis is classifying whether a text is positive or negative. To fully understand this concept, please visit here for more information. We will use the bing lexicon, which classifies each word as positive or negative. So, combining the techniques we have learned with a few more lines of code, we can obtain the top 10 positive and negative words.
Combining all the procedures we learned earlier:
```r
#Getting the positive and negative words
corona_tweets_df %>%
  mutate(text = removeNumbers(text)) %>%
  unnest_tokens(word, text) %>%            #text is unnested into a column called word
  select(word) %>%                         #select the column containing the unnested words
  anti_join(stop_words) %>%                #remove stop words
  inner_join(get_sentiments("bing")) %>%   #classify each word as positive or negative
  count(sentiment, word) %>%               #count words grouped by sentiment
  arrange(desc(n)) %>%                     #arrange in descending order
  group_by(sentiment) %>%                  #group by the sentiment classification
  top_n(10) %>%                            #select the top 10 words for each sentiment
  ggplot(aes(reorder(word, n), n, fill = word)) +  #plot positive and negative words
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  coord_flip() +
  labs(y = "Count", x = "Words", title = "Positive vs Negative words")
```
From the plot above we can now see the most common negative and positive words used. Finally, we show these positive and negative words together in a single word cloud.
```r
#We can also create a wordcloud classifying positive and negative words
corona_tweets_df %>%
  unnest_tokens(word, text) %>%            #unnest words
  anti_join(stop_words) %>%                #remove stop words
  inner_join(get_sentiments("bing")) %>%   #classify words using the bing lexicon
  count(word, sentiment, sort = TRUE) %>%  #count words by sentiment
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%  #comparison.cloud() takes a matrix, hence acast()
  comparison.cloud(colors = c("#1b2a49", "#00909e"),
                   max.words = 50, random.order = FALSE,
                   scale = c(6, 0.3))
```
From the image above you can easily see whether each word is positive or negative, and which words are used most often: the larger the word, the higher its count. Since the tweets are about the coronavirus, you can easily see that "virus" is among the largest words.
This has been a gentle introduction to sentiment analysis. In subsequent posts, we will build on this basic understanding to cover other concepts such as topic modeling and bigrams.