Text Mining and NLP

Machine Learning
R Programming
Text Mining
NLP
Learn how to perform text mining and natural language processing (NLP) in R, including data preparation, model building, and evaluation. This lecture covers essential techniques for text analysis in R.
Author

TERE

Published

June 21, 2024

Introduction

Text mining and natural language processing (NLP) involve extracting meaningful information from text data. These techniques are widely used in various applications such as sentiment analysis, topic modeling, and text classification. In this lecture, we will learn how to perform text mining and NLP in R, including data preparation, model building, and evaluation.

Key Concepts

1. What is Text Mining and NLP?

Text mining is the process of extracting useful information from text. NLP involves the interaction between computers and human language, aiming to read, decipher, understand, and make sense of human languages in a valuable way.
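As a minimal illustration of what "extracting useful information from text" can mean, one of the most basic text-mining outputs, a word-frequency table, can be computed in base R without any extra packages:

```r
# A minimal sketch of text mining in base R: count word frequencies
text <- "Text mining extracts useful information from text"

# Lowercase, split on non-word characters, and tabulate
words <- unlist(strsplit(tolower(text), "\\W+"))
word_counts <- sort(table(words), decreasing = TRUE)
print(word_counts)  # "text" appears twice; every other word once
```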

2. Applications of Text Mining and NLP

  • Sentiment Analysis: Determining the sentiment expressed in text.

  • Topic Modeling: Identifying the main topics in a collection of documents.

  • Text Classification: Categorizing text into predefined categories.

  • Named Entity Recognition (NER): Identifying and classifying entities in text.
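To make the text-classification idea above concrete, here is a deliberately simple keyword-matching sketch. The categories and keywords are made up for illustration; real classifiers are usually trained on labeled data rather than hand-written rules:

```r
# Naive keyword-based text classification (illustrative only)
classify <- function(text) {
  text <- tolower(text)
  if (grepl("refund|price|invoice", text)) "billing"
  else if (grepl("error|crash|bug", text)) "technical"
  else "other"
}

classify("The app keeps showing an error on startup")  # "technical"
classify("I was charged twice, please refund me")      # "billing"
```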

Performing Text Mining and NLP in R

1. Installing Required Packages

We will use the tm, textclean, and tidytext packages for text mining and NLP.


# Installing required packages
install.packages("tm")
install.packages("textclean")
install.packages("tidytext")

2. Data Preparation

Preparing text data involves cleaning and preprocessing steps such as lowercasing, removing punctuation, numbers, and stop words, and stemming.


# Loading the required packages
library(tm)
library(textclean)
library(tidytext)

# Sample text data
texts <- c("This is the first document.", "This document is the second document.", "And this is the third one.")

# Creating a corpus
corpus <- VCorpus(VectorSource(texts))

# Cleaning the text data
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)

# Inspecting the cleaned text
inspect(corpus)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 14

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 24

[[3]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 9
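inspect() only summarizes each document; to see the cleaned, stemmed text itself, you can coerce each document to a character vector (continuing with the corpus from the code above):

```r
# Viewing the cleaned, stemmed content of each document
sapply(corpus, as.character)
```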

3. Creating a Document-Term Matrix

A document-term matrix is a matrix where rows represent documents and columns represent terms (words), with the values indicating the frequency of terms in the documents.


# Creating a document-term matrix
dtm <- DocumentTermMatrix(corpus)
print(dtm)
<<DocumentTermMatrix (documents: 3, terms: 5)>>
Non-/sparse entries: 6/9
Sparsity           : 60%
Maximal term length: 8
Weighting          : term frequency (tf)
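The printed summary above reports only dimensions, sparsity, and weighting. Converting the DTM to an ordinary matrix shows the actual term counts per document; this is fine for a toy corpus like ours but memory-hungry for large, sparse matrices:

```r
# Viewing the raw term counts (only advisable for small matrices)
as.matrix(dtm)
```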

4. Performing Text Analysis

You can perform various text analysis tasks such as sentiment analysis and topic modeling.

Sentiment Analysis


# Loading required packages
library(textclean)
library(tidytext)
library(dplyr)

# Sample text data
texts <- data.frame(text = c("I love this product!", "This is the worst experience ever.", "I am very happy with the service."))

# Cleaning the text data with textclean
texts$text <- tolower(texts$text)
texts$text <- replace_contraction(texts$text)
texts$text <- replace_symbol(texts$text)
texts$text <- replace_ordinal(texts$text)
texts$text <- replace_number(texts$text)
texts$text <- replace_internet_slang(texts$text)
texts$text <- replace_emoticon(texts$text)

# Tokenizing the text data
tokens <- texts %>%
  unnest_tokens(word, text)

# Performing sentiment analysis with the Bing lexicon
sentiment <- tokens %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)

print(sentiment)
   word sentiment n
1 happy  positive 1
2  love  positive 1
3 worst  negative 1
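The count above pools all documents together. To score each document separately, keep a document id through tokenization and take positive minus negative matches per document. A sketch continuing the objects above, assuming the tidyr package is installed for pivot_wider():

```r
# Net sentiment per document: positive matches minus negative matches
library(tidyr)

texts$doc_id <- seq_len(nrow(texts))
doc_sentiment <- texts %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(doc_id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

print(doc_sentiment)
```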

Topic Modeling


# Installing the topicmodels package
install.packages("topicmodels")
library(topicmodels)

# Performing topic modeling
lda_model <- LDA(dtm, k = 2, control = list(seed = 123))
topics <- tidy(lda_model, matrix = "beta")

# Displaying the top terms in each topic
top_terms <- topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)

print(top_terms)
# A tibble: 10 × 3
   topic term       beta
   <int> <chr>     <dbl>
 1     1 document 0.468 
 2     1 second   0.200 
 3     1 third    0.168 
 4     1 one      0.104 
 5     1 first    0.0604
 6     2 document 0.390 
 7     2 first    0.226 
 8     2 one      0.182 
 9     2 third    0.118 
10     2 second   0.0853
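Besides the per-topic word probabilities (beta) shown above, tidytext can also extract the per-document topic proportions (gamma), which show how strongly each document is associated with each topic:

```r
# Per-document topic proportions from the fitted LDA model
doc_topics <- tidy(lda_model, matrix = "gamma")
print(doc_topics)
```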

Example: Comprehensive Text Mining and NLP Analysis

Here’s a comprehensive example of performing text mining and NLP in R.


# Loading required packages
library(tm)
library(textclean)
library(tidytext)
library(dplyr)
library(topicmodels)

# Sample text data
texts <- c("I love this product!", "This is the worst experience ever.", "I am very happy with the service.")

# Creating a corpus
corpus <- VCorpus(VectorSource(texts))

# Cleaning the text data
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)

# Creating a document-term matrix
dtm <- DocumentTermMatrix(corpus)
print(dtm)

# Performing topic modeling
lda_model <- LDA(dtm, k = 2, control = list(seed = 123))
topics <- tidy(lda_model, matrix = "beta")

# Displaying the top terms in each topic
top_terms <- topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)

print(top_terms)

# Performing sentiment analysis
texts_df <- data.frame(text = texts)
tokens <- texts_df %>%
  unnest_tokens(word, text)

sentiment <- tokens %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)

print(sentiment)
<<DocumentTermMatrix (documents: 3, terms: 7)>>
Non-/sparse entries: 7/14
Sparsity           : 67%
Maximal term length: 7
Weighting          : term frequency (tf)
# A tibble: 14 × 3
   topic term      beta
   <int> <chr>    <dbl>
 1     1 ever    0.209 
 2     1 product 0.164 
 3     1 happi   0.150 
 4     1 experi  0.146 
 5     1 servic  0.136 
 6     1 love    0.0983
 7     1 worst   0.0957
 8     2 worst   0.190 
 9     2 love    0.187 
10     2 servic  0.150 
11     2 experi  0.139 
12     2 happi   0.135 
13     2 product 0.122 
14     2 ever    0.0767
   word sentiment n
1 happy  positive 1
2  love  positive 1
3 worst  negative 1

Summary

In this lecture, we covered how to perform text mining and NLP in R, including data preparation, model building, and evaluation. Text mining and NLP are powerful techniques for extracting meaningful information from text data and have a wide range of applications.

Further Reading

For more detailed information, consult the official documentation for the tm, textclean, tidytext, and topicmodels packages on CRAN.

Call to Action

If you found this lecture helpful, make sure to check out the other lectures in the ML R series. Happy coding!