Natural Language Processing (NLP) is a branch of Artificial Intelligence that aims to help computers understand human languages. NLP covers the interaction between humans and machines, with the ultimate objective of reading, deciphering, and making sense of human language in a way that is valuable. The connection of data science with human language through NLP is growing rapidly and scaling into many industries.
Today, NLP is booming thanks to huge improvements in access to data and increases in computational power. It is helping professionals achieve meaningful results in areas such as healthcare, media, finance, and human resources.
Applications of Natural Language Processing
Chatbots: AI-powered customer service uses NLP to process your requests and respond to queries in real time.
Email Assistant: Features in Email such as word suggestions, auto-completion and grammar checks are products of NLP. The spam filter in your Email system also uses NLP to determine which Emails you would like to keep in your inbox and which are likely spam and should be filtered out.
Virtual Assistant: These include products such as Apple's Siri, Microsoft's Cortana and Amazon's Alexa. All of these products use NLP for speech processing. When the user speaks to the assistant, it recognises the kind of instruction being given based on learned speech patterns; after translating the speech into text, it executes the user's instructions.
Data mining: NLP allows machines to understand and process human languages, enabling them to convert large volumes of information from e-books or websites into structured data before storing it in a database.
Basic Processes in Natural Language Processing
There are quite a number of processes involved in NLP. We will discuss them briefly here and see them applied in the later part of this post.
- Data Retrieval: This involves the processes used to extract your data from a database or any other source.
- Data Wrangling and Pre-processing: This refers to cleaning, restructuring and enriching the raw data into a more usable format. The process is very important as it largely determines the accuracy of any model built with the dataset; by far the hardest part of machine learning is the data itself: collecting, labeling, and cleaning it. There are a number of processes involved in wrangling text data (a short sketch illustrating them follows this list):
- Tokenization: Tokenizing means splitting your text into minimal meaningful units. It is a mandatory step before any kind of processing.
- Stopwords: Stopwords are words that do not provide any useful information for deciding in which category a text should be classified. This may be either because they don't carry much meaning (prepositions, conjunctions, etc.) or because they are too frequent in the classification context, e.g. a, an, the, at.
- Stemming: This is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. For example, consulting, consultant and consultants all come from the base form consult; stemming reduces these words to that basic form.
- Bag of words: A bag-of-words is a representation of text that describes the occurrence of words within documents. For example, suppose I have 1,000 messages and each message contains 10 distinct words. Then:
1,000 messages = 1,000 documents (rows)
1,000 * 10 = up to 10,000 distinct words (columns)
- Modeling: This involves feeding the already processed data into a machine learning model and training it so that it can make decisions independently, as a human would.
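Before moving on, here is a minimal sketch of what tokenization, stopword removal, stemming and a bag-of-words look like in R, using the tm and SnowballC packages we rely on later. The two sentences are made up purely for illustration; tokenization happens implicitly when DocumentTermMatrix splits each document into words.
#A minimal illustration on two made-up sentences
library(tm)
library(SnowballC)

docs <- c("Consultants are consulting the new clients",
          "The client consulted a consultant yesterday")

toy <- Corpus(VectorSource(docs))
toy <- tm_map(toy, content_transformer(tolower))       #convert to lower case
toy <- tm_map(toy, removeWords, stopwords("english"))  #drop stop words such as "are", "the", "a"
toy <- tm_map(toy, stemDocument)                       #reduce words to stems, e.g. "consulting" -> "consult"

#Bag of words: one row per document, one column per term, cells hold counts
toy_dtm <- DocumentTermMatrix(toy)
inspect(toy_dtm)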
NLP for Email Filtering in R
In this tutorial, we will use NLP techniques to build a model that classifies Emails as either Spam or Ham. The word "Spam" as applied to Email means "Unsolicited Bulk Email"; unsolicited means the recipient has not granted verifiable permission for the message to be sent. Email that is generally desired and is not considered Spam is called "Ham". We will obtain the dataset from Kaggle, downloadable here. It contains two columns: a text column containing the body of the Email and a type column indicating whether an Email is classified as Spam or Ham. The following procedures are essential to accomplishing this task:
#Set up your working directory and load in your dataset
spamraw <- read.csv("spamraw.csv")

#Save it in a new variable named spam
spam <- spamraw

#Examine the data structure
str(spam)
We can see that our dataset has 5,559 rows and 2 columns. Data wrangling, as discussed earlier, is a major procedure in NLP and includes removing punctuation and stopwords, converting all strings to lower case, and so on. We implement these procedures below:
library(tm)
library(tidyverse)
library(wordcloud)

#Convert the text column to UTF-8 encoding
corpus <- iconv(spam$text, to = "UTF-8")
head(corpus)

#Create a Corpus from the character vector above
corpus <- Corpus(VectorSource(corpus))
inspect(corpus[1:5])

#Clean the text
#Convert all text to lower case
corpus <- tm_map(corpus, tolower)
inspect(corpus[1:5])

#Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
inspect(corpus[1:5])

#Remove numbers
corpus <- tm_map(corpus, removeNumbers)
inspect(corpus[1:5])

#Remove stop words that do not contribute much to the analysis
cleanset <- tm_map(corpus, removeWords, stopwords('english'))
#Run stopwords('english') on its own to examine those stop words
inspect(cleanset[1:5])

#Remove URLs and special characters; first define the functions that execute this procedure
removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)                    #removes URLs
removeURL2 <- function(x) gsub("([[:alpha:]])(?=\\1)", "", x, perl = TRUE)  #collapses repeated consecutive letters
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)         #removes remaining non-letter characters

#Transform the corpus using the functions defined above
cleanset <- tm_map(cleanset, content_transformer(removeURL))
cleanset <- tm_map(cleanset, content_transformer(removeNumPunct))
cleanset <- tm_map(cleanset, content_transformer(removeURL2))
inspect(cleanset[1:5])

#Remove extra white space
cleanset <- tm_map(cleanset, stripWhitespace)
inspect(cleanset[1:5])
We have removed the parts of the text that are not useful for our analysis. As part of the data wrangling process, we now perform stemming on the document using the following code:
#Load the package that performs stemming
library(SnowballC)

#Stem the document
clean <- tm_map(cleanset, stemDocument)

#Create a wordcloud of the clean document
wordcloud(clean, min.freq = 30, random.order = FALSE)
This dataset contains Emails classified as Spam or Ham. By creating two new dataframes, one containing the Spam Emails and the other the Ham Emails, we can create a wordcloud for each:
#Let's visualize ham and spam messages
spam_1 <- subset(spam, type == "spam")
ham_1 <- subset(spam, type == "ham")

#Counts of ham and spam messages
table(spam$type)

#Visualize spam messages
wordcloud(spam_1$text, min.freq = 30, random.order = FALSE)
#Visualize ham messages
wordcloud(ham_1$text, min.freq = 50, random.order = FALSE,
          random.color = TRUE, max.words = 40)
We will now move on to creating a bag of words, as explained earlier, by building a document-term matrix:
#Create a document-term matrix (DTM)
clean_dtm <- DocumentTermMatrix(clean)

#Remove sparse terms
clean_dtm <- removeSparseTerms(clean_dtm, 0.999)
inspect(clean_dtm[40:50, 10:15])
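As an optional check (not part of the original walkthrough), you can compare the size of the matrix before and after dropping sparse terms and list the terms that occur frequently; the threshold of 100 below is an arbitrary choice:
#Compare dimensions before and after removeSparseTerms
dim(DocumentTermMatrix(clean))   #all terms
dim(clean_dtm)                   #after dropping sparse terms

#Terms appearing at least 100 times (threshold chosen arbitrarily)
findFreqTerms(clean_dtm, lowfreq = 100)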
From the result of the process above, we can see that we have a lot of zeros. We will create a function such that if a word's frequency is 0 the value is replaced with "No", and otherwise it is replaced with "Yes":
#The function executes the process described above
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c("No", "Yes"))
  y
}

#Apply convert_count column-wise (MARGIN = 2) to get the final Yes/No matrix
clean_dtm_count <- apply(clean_dtm, 2, convert_count)

#Convert the result to a dataframe; stringsAsFactors = TRUE keeps the Yes/No columns as factors on R >= 4.0
dataset <- as.data.frame(as.matrix(clean_dtm_count), stringsAsFactors = TRUE)
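To see what convert_count does, you can call it on a small made-up vector of counts; zeros become "No" and anything positive becomes "Yes":
#Quick check on a made-up vector of counts
convert_count(c(0, 3, 0, 1))
#returns the factor No Yes No Yes, with levels No and Yes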
From clean_dtm we can get the frequency of each word, i.e. the number of times it appears across the entire text column of the dataset. A bar plot can then be used to visualize the most frequent words, as executed below:
#Get the frequency with which each word appears
freq <- sort(colSums(as.matrix(clean_dtm)), decreasing = TRUE)
tail(freq, 10)

#Convert the frequencies to a dataframe and plot word frequency
word_frequency <- data.frame(word = names(freq), freq = freq)

#Plot the top 20 words by frequency
word_frequency %>%
  top_n(20) %>%   #keep the 20 most frequent words
  ggplot(aes(reorder(word, freq), freq, fill = word)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  coord_flip() +
  labs(y = "Frequency", x = "Words", title = "Word Frequency")
We can also create a wordcloud of word frequencies for the entire document collection, regardless of whether a text is classified as Spam or Ham, as seen below:
#Plot a wordcloud of the overall word frequencies
library(RColorBrewer)
wordcloud(words = word_frequency$word, freq = word_frequency$freq,
          min.freq = 1, max.words = 200, random.order = FALSE,
          rot.per = 0.35, colors = brewer.pal(8, "Dark2"),
          scale = c(5, 0.3))
The last stage of the outlined processes is modeling, which involves feeding the dataset into a Machine Learning model that is trained so that it can independently decide, from the text an Email contains, whether it is Spam or Ham:
#Add the class variable to the dataset
dataset$Class <- factor(spam$type)

#Create train and test datasets
library(caTools)  #for splitting data into train and test sets
library(caret)    #for building a confusion matrix

#Split on the class labels so the spam/ham ratio is preserved in both sets
set.seed(123)  #optional, for reproducibility
sample_ <- sample.split(dataset$Class, SplitRatio = 0.75)
train_set <- subset(dataset, sample_ == TRUE)
test_set <- subset(dataset, sample_ == FALSE)

library(randomForest)

#Build a random forest machine learning model
rf_classifier <- randomForest(x = train_set %>% select(-Class),
                              y = train_set$Class,
                              ntree = 300)
rf_classifier
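As an optional aside (not part of the original walkthrough), randomForest's varImpPlot() shows which terms the forest relies on most when separating Spam from Ham:
#Plot the 20 terms with the highest variable importance
varImpPlot(rf_classifier, n.var = 20, main = "Top 20 terms by importance")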
Now that we have built the model, we can apply it to the test_set to assess how well it performs:
#Predict on the test data
prediction <- predict(rf_classifier, test_set %>% select(-Class))

#Create a confusion matrix
confusionMatrix(test_set$Class, prediction)

#An accuracy of 0.9737 was obtained after applying the model to the test_set
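As a usage example, a brand-new message can be scored by running it through the same cleaning steps, restricting its document-term matrix to the terms the model was trained on, and converting the counts with convert_count. This is only a sketch: the message text below is made up, and it assumes the objects defined earlier (clean_dtm, convert_count, rf_classifier) are still in the workspace.
#Score a new, made-up message with the fitted model
new_text <- "Congratulations! You have won a free prize, call now to claim"

#Apply the same cleaning steps used for the training corpus
new_corpus <- Corpus(VectorSource(iconv(new_text, to = "UTF-8")))
new_corpus <- tm_map(new_corpus, tolower)
new_corpus <- tm_map(new_corpus, removePunctuation)
new_corpus <- tm_map(new_corpus, removeNumbers)
new_corpus <- tm_map(new_corpus, removeWords, stopwords('english'))
new_corpus <- tm_map(new_corpus, stripWhitespace)
new_corpus <- tm_map(new_corpus, stemDocument)

#Restrict the new DTM to the terms the model was trained on
new_dtm <- DocumentTermMatrix(new_corpus,
                              control = list(dictionary = Terms(clean_dtm)))

#Convert counts to the same Yes/No factors used during training
new_data <- as.data.frame(lapply(as.data.frame(as.matrix(new_dtm)), convert_count))

#Predicted class for the new message
predict(rf_classifier, new_data)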
Conclusion
We have briefly explored a simple application of NLP. The step-by-step application of NLP techniques allowed us to create a dataset suitable for machine learning. This is just one simple use case; there are quite a number of other, more complex applications of NLP.