NLP-R

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Fundamentals

Chapter 1 of Introduction to Natural Langauge Processing prepares you for running your first analysis on text. You will explore regular expressions and tokenization, two of the most common components of most analysis tasks. With regular expressions, you can search for any pattern you can think of, and with tokenization, you can prepare and clean text for more sophisticated analysis. This chapter is necessary for tackling the techniques we will learn in the remaining chapters of this course.

Regular Expressions (regex)

Sequence of characters or patterns used to search text

words <- c("DW-40", "Mike's Oil", "5w30", "Joe's Gas", "Unleaded", "Plus-89")

# Finding digits
grep("\\d", words, value = TRUE)

[1] "DW-40"   "5w30"    "Plus-89"

# Finding apostrophes
grep("\\'", words, value = TRUE)

[1] "Mike's Oil" "Joe's Gas"

Regex examples

“wildcards” extend search beyond a single character
- + allows us to find a word or digit of any length
Negate any expression using a capital letter
- \S finds any non-whitespace character
- \D finds any non-digit character
- \W finds any non-alphanumeric character

Pattern	Text Matches	R Example	Text Example
`\w`	Any alphanumeric	`gregexpr(pattern = '\w', <text>)`	a
`\d`	Any digit	`gregexpr(pattern = '\d', <text>)`	1
`\w+`	Any alphanumeric of any length	`gregexpr(pattern = '\w+', <text>)`	word
`\d+`	Any digit of any length	`gregexpr(pattern = '\d+', <text>)`	123
`\s`	Any whitespace	`gregexpr(pattern = '\s', <text>)`	” ”
`\S`	Any non-whitespace	`gregexpr(pattern = '\S', <text>)`	a

(base) R functions

Function	Description	Syntax
`grep()`	Search for a pattern in a vector	`grep(pattern, x, value = FALSE)`
`gsub()`	Replace a pattern in a vector	`gsub(pattern, replacement, x)`

Examples

text <- c("John's favorite number is 1111",
          "John lives at P Sherman, 42 Wallaby Way, Sydney",
          "He is 7 feet tall",
          "John has visited 30 countries",
          "He can speak 3 languages",
          "John can name 10 facts about himself")

# Print off each item that contained a numeric number
grep(pattern = "\\d", x = text, value = TRUE)

[1] "John's favorite number is 1111"                 
[2] "John lives at P Sherman, 42 Wallaby Way, Sydney"
[3] "He is 7 feet tall"                              
[4] "John has visited 30 countries"                  
[5] "He can speak 3 languages"                       
[6] "John can name 10 facts about himself"

# Find all items with a number followed by a space
grep(pattern = "\\d\\s", x = text)

[1] 2 3 4 5 6

# How many times did you write down 'favorite'?
length(grep(pattern = "favorite", x = text))

[1] 1

Exploring regular expression functions.

You have a vector of ten facts about your boss saved as a vector called text. In order to create a new ice-breaker for your team at work, you need to remove the name of your boss, John, from each fact that you have written down. This can easily be done using regular expressions (as well as other search/replace functions). Use regular expressions to correctly replace “John” from the facts you have written about him.

# Print off the text for every time you used your boss's name, John
grep("John", x = text, value = TRUE)

[1] "John's favorite number is 1111"                 
[2] "John lives at P Sherman, 42 Wallaby Way, Sydney"
[3] "John has visited 30 countries"                  
[4] "John can name 10 facts about himself"

# Try replacing all occurences of "John" with He
gsub(pattern = 'John', replacement = 'He', x = text)

[1] "He's favorite number is 1111"                 
[2] "He lives at P Sherman, 42 Wallaby Way, Sydney"
[3] "He is 7 feet tall"                            
[4] "He has visited 30 countries"                  
[5] "He can speak 3 languages"                     
[6] "He can name 10 facts about himself"

# Replace all occurences of "John " with 'He '. 
clean_text <- gsub(pattern = 'John\\s', replacement = 'He ', x = text)

clean_text

[1] "John's favorite number is 1111"               
[2] "He lives at P Sherman, 42 Wallaby Way, Sydney"
[3] "He is 7 feet tall"                            
[4] "He has visited 30 countries"                  
[5] "He can speak 3 languages"                     
[6] "He can name 10 facts about himself"

# Replace all occurences of "John's" with 'His'
gsub(pattern = "John\\'s", replacement = 'His', x = clean_text)

[1] "His favorite number is 1111"                  
[2] "He lives at P Sherman, 42 Wallaby Way, Sydney"
[3] "He is 7 feet tall"                            
[4] "He has visited 30 countries"                  
[5] "He can speak 3 languages"                     
[6] "He can name 10 facts about himself"

Tokenization

Fundamental part of text preprocessing
Tokenization: process of breaking text into individual small pieces (tokens)
- Tokens can be as small as individual characters, words, or the entire text
- Common: characters, words, sentences, documents, regex separations
tidytext package
- unnest_tokens() function, takes input tibble and extracts tokens from column specified as input, can specify tokens, and what output column should be labeled

library(tidytext)

Animal farm data:

animal_farm
# A tibble: 10 × 2
   chapter    text_column                                                       
   <chr>      <chr>                                                             
 1 Chapter 1  "Mr. Jones, of the Manor Farm, had locked the hen-houses for the …
 2 Chapter 2  "Three nights later old Major died peacefully in his sleep. His b…
 3 Chapter 3  "How they toiled and sweated to get the hay in! But their efforts…
 4 Chapter 4  "By the late summer the news of what had happened on Animal Farm …
 5 Chapter 5  "As winter drew on, Mollie became more and more troublesome. She …
 6 Chapter 6  "All that year the animals worked like slaves. But they were happ…
 7 Chapter 7  "It was a bitter winter. The stormy weather was followed by sleet…
 8 Chapter 8  "A few days later, when the terror caused by the executions had d…
 9 Chapter 9  "Boxer's split hoof was a long time in healing. They had started …
10 Chapter 10 "Years passed. The seasons came and went, the short animal lives

animal_farm %>%
  # tokenize
  unnest_tokens(output = "word",
                input = text_column,
                token = "words") %>%
  # count the top tokens
  count(word, sort = TRUE)

Find all mentions of a particular word and see what follows it:

animal_farm %>%
  filter(chapter == "Chapter 1") %>%
  # look for any mention of Boxer, capitalized or not
  unnest_tokens(output = "Boxer",
                input = text_column,
                token = "regex",
                pattern = "(?i)boxer") %>%
  # first token starts at beginning of text, use slice to skip first token
  slice(2:n())

Examples

# Split the text_column into sentences
animal_farm %>%
  unnest_tokens(output = "sentences", input = text_column, token = "sentences")

# A tibble: 1,523 × 2
   chapter   sentences                                                          
   <chr>     <chr>                                                              
 1 Chapter 1 mr.                                                                
 2 Chapter 1 jones, of the manor farm, had locked the hen-houses for the night,…
 3 Chapter 1 with the ring of light from his lantern dancing from side to side,…
 4 Chapter 1 jones was already snoring.as soon as the light in the bedroom went…
 5 Chapter 1 word had gone round during the day that old major, the prize middl…
 6 Chapter 1 it had been agreed that they should all meet in the big barn as so…
 7 Chapter 1 jones was safely out of the way.                                   
 8 Chapter 1 old major (so he was always called, though the name under which he…
 9 Chapter 1 he was twelve years old and had lately grown rather stout, but he …
10 Chapter 1 before long the other animals began to arrive and make themselves …
# … with 1,513 more rows
# ℹ Use `print(n = ...)` to see more rows

# Split the text_column into sentences
animal_farm %>%
  unnest_tokens(output = "sentences", input = text_column, token = "sentences") %>%
  # Count sentences using the chapter column
  count(chapter, sort = TRUE)

# A tibble: 10 × 2
   chapter        n
   <chr>      <int>
 1 Chapter 8    203
 2 Chapter 9    195
 3 Chapter 7    190
 4 Chapter 10   167
 5 Chapter 5    158
 6 Chapter 2    140
 7 Chapter 1    136
 8 Chapter 6    136
 9 Chapter 3    114
10 Chapter 4     84

# Split the text_column into sentences
animal_farm %>%
  unnest_tokens(output = "sentences", input = text_column, token = "sentences") %>%
  # Count sentences, per chapter
  count(chapter)

# Split the text_column using regular expressions
animal_farm %>%
  unnest_tokens(output = "sentences", input = text_column,
                token = "regex", pattern = "\\.") %>%
  count(chapter)

# A tibble: 10 × 2
   chapter        n
   <chr>      <int>
 1 Chapter 1    131
 2 Chapter 10   179
 3 Chapter 2    150
 4 Chapter 3    113
 5 Chapter 4     92
 6 Chapter 5    158
 7 Chapter 6    127
 8 Chapter 7    188
 9 Chapter 8    200
10 Chapter 9    174

Great job. Notice how the two methods produce slightly different results. You’ll notice that a lot when processing text. It’s all about the technique used to do the analysis.

Text Cleaning Basics

https://github.com/fivethirtyeight/russian-troll-tweets
remove stop words (e.g. “to” “the”) with anti_join(stop_words)
- anti_join() will remove a tibble of words from a column of text
  - entry is word you want to remove
  - lexicon is source for where the word came from

stop_words

# A tibble: 1,149 × 2
   word        lexicon
   <chr>       <chr>  
 1 a           SMART  
 2 a's         SMART  
 3 able        SMART  
 4 about       SMART  
 5 above       SMART  
 6 according   SMART  
 7 accordingly SMART  
 8 across      SMART  
 9 actually    SMART  
10 after       SMART  
# ℹ 1,139 more rows

russian_tweets %>%
  unnest_tokens(word, content) %>%
  count(word, sort = T)

# see that top words are mostly junk
# t.co, https, etc. 

# remove stop words
tidy_tweets <- russian_tweets %>%
  unnest_tokens(word, content) %>%
  anti_join(stop_words)

tidy_tweets %>%
  count(word, sort = T)
# still t.co, https, http, etc. but now blacklivesmatter and trump

add custom stop words

custom <- add_row(stop_words, word = "https", lexicon = "custom")
custom <- add_row(stop_words, word = "http", lexicon = "custom")
custom <- add_row(stop_words, word = "t.co", lexicon = "custom")

russian_tweets %>%
  unnest_tokens(word, content) %>%
  anti_join(custom) %>%
  count(word, sort = T)

stemming - transforming words into their roots - e.g. enlisted and enlisting -> enlist - use wordStem() from SnowballC package

library(SnowballC)
tidy_tweets <- russian_tweets %>%
  unnest_tokens(word, content) %>%
  anti_join(custom)

# stemming
stemmed_tweets <- tidy_tweets %>%
  mutate(stem = wordStem(word))

examples

Stop words are unavoidable in writing. However, to determine how similar two pieces of text are to each other are or when trying to find themes within text, stop words can make things difficult. In the book Animal Farm, the first chapter contains only 2,636 words, while almost 200 of them are the word “the”.

# Tokenize animal farm's text_column column
tidy_animal_farm <- animal_farm %>%
  unnest_tokens(word, text_column)

# Print the word frequencies - most frequent first!
tidy_animal_farm %>%
  count(word, sort = T)

# A tibble: 4,076 × 2
   word      n
   <chr> <int>
 1 the    2187
 2 and     966
 3 of      899
 4 to      814
 5 was     633
 6 a       620
 7 in      537
 8 had     529
 9 that    451
10 it      384
# … with 4,066 more rows
# ℹ Use `print(n = ...)` to see more rows

# Remove stop words, using stop_words from tidytext
tidy_animal_farm %>%
  anti_join(stop_words)

# A tibble: 10,579 × 2
   chapter   word    
   <chr>     <chr>   
 1 Chapter 1 jones   
 2 Chapter 1 manor   
 3 Chapter 1 farm    
 4 Chapter 1 locked  
 5 Chapter 1 hen     
 6 Chapter 1 houses  
 7 Chapter 1 night   
 8 Chapter 1 drunk   
 9 Chapter 1 remember
10 Chapter 1 shut    
# … with 10,569 more rows
# ℹ Use `print(n = ...)` to see more rows

Excellent. You should always consider removing stop words before performing text analysis. They muddy your results and can increase computation time for large analysis tasks.

The root of words are often more important than their endings, especially when it comes to text analysis. The book Animal Farm is obviously about animals. However, knowing that the book mentions animal’s 248 times, and animal 107 times might not be helpful for your analysis.

tidy_animal_farm contains a tibble of the words from Animal Farm, tokenized and without stop words. The next step is to stem the words and explore the results.

# Perform stemming on tidy_animal_farm
stemmed_animal_farm <- tidy_animal_farm %>%
  mutate(word = wordStem(word))

# Print the old word frequencies 
tidy_animal_farm %>%
  count(word, sort = T)

# A tibble: 3,611 × 2
   word         n
   <chr>    <int>
 1 animals    248
 2 farm       163
 3 napoleon   141
 4 animal     107
 5 snowball   106
 6 pigs        91
 7 boxer       76
 8 time        71
 9 windmill    68
10 squealer    61
# … with 3,601 more rows
# ℹ Use `print(n = ...)` to see more rows

# Print the new word frequencies
stemmed_animal_farm %>%
  count(word, sort = T)

# A tibble: 2,751 × 2
   word         n
   <chr>    <int>
 1 anim       363
 2 farm       173
 3 napoleon   141
 4 pig        114
 5 snowbal    106
 6 comrad      94
 7 dai         86
 8 time        83
 9 boxer       76
10 windmil     70
# … with 2,741 more rows
# ℹ Use `print(n = ...)` to see more rows

Nice job. There is a clear difference in word frequencies after we performed stemming. Comrade is used throughout Animal Farm but until you stemmed the words, it didn’t show up in the top 10! In Chapter 2 you will expand this analysis and start building your first text analysis models.

Representations of Text

In this chapter, you will learn the most common and studied ways to analyze text. You will look at creating a text corpus, expanding a bag-of-words representation into a TFIDF matrix, and use cosine-similarity metrics to determine how similar two pieces of text are to each other. You build on your foundations for practicing NLP before you dive into applications of NLP in chapters 3 and 4.

Corpus (collections of text)
- collections of documents containing natural language text
- from tm package as corpus
  - VCorpus (volatile corpus) most common representation, to host both text and metadata about the collection of text
  - example dataset acq (50 articles from Reuters)

library(tm)

Loading required package: NLP


Attaching package: 'NLP'

The following object is masked from 'package:ggplot2':

    annotate

data("acq")

# metadata of first article
acq[[1]]$meta

  author       : character(0)
  datetimestamp: 1987-02-26 15:18:06
  description  : 
  heading      : COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE
  id           : 10
  language     : en
  origin       : Reuters-21578 XML
  topics       : YES
  lewissplit   : TRAIN
  cgisplit     : TRAINING-SET
  oldid        : 5553
  places       : usa
  people       : character(0)
  orgs         : character(0)
  exchanges    : character(0)

# meta item of 1st article and then the char value for place where article originated (usa)
# note the nested object!
acq[[1]]$meta$places

[1] "usa"

# contet of first item
acq[[1]]$content

[1] "Computer Terminal Systems Inc said\nit has completed the sale of 200,000 shares of its common\nstock, and warrants to acquire an additional one mln shares, to\n<Sedio N.V.> of Lugano, Switzerland for 50,000 dlrs.\n    The company said the warrants are exercisable for five\nyears at a purchase price of .125 dlrs per share.\n    Computer Terminal said Sedio also has the right to buy\nadditional shares and increase its total holdings up to 40 pct\nof the Computer Terminal's outstanding common stock under\ncertain circumstances involving change of control at the\ncompany.\n    The company said if the conditions occur the warrants would\nbe exercisable at a price equal to 75 pct of its common stock's\nmarket price at the time, not to exceed 1.50 dlrs per share.\n    Computer Terminal also said it sold the technolgy rights to\nits Dot Matrix impact technology, including any future\nimprovements, to <Woodco Inc> of Houston, Tex. for 200,000\ndlrs. But, it said it would continue to be the exclusive\nworldwide licensee of the technology for Woodco.\n    The company said the moves were part of its reorganization\nplan and would help pay current operation costs and ensure\nproduct delivery.\n    Computer Terminal makes computer generated labels, forms,\ntags and ticket printers and terminals.\n Reuter"

# and second
acq[[2]]$content

[1] "Ohio Mattress Co said its first\nquarter, ending February 28, profits may be below the 2.4 mln\ndlrs, or 15 cts a share, earned in the first quarter of fiscal\n1986.\n    The company said any decline would be due to expenses\nrelated to the acquisitions in the middle of the current\nquarter of seven licensees of Sealy Inc, as well as 82 pct of\nthe outstanding capital stock of Sealy.\n    Because of these acquisitions, it said, first quarter sales\nwill be substantially higher than last year's 67.1 mln dlrs.\n    Noting that it typically reports first quarter results in\nlate march, said the report is likely to be issued in early\nApril this year.\n    It said the delay is due to administrative considerations,\nincluding conducting appraisals, in connection with the\nacquisitions.\n Reuter"

to get data into a table format for more tidier analysis, where each obs is represented by a row and each variable is a column, use tidy() function on corpus:

tidy_data <- tidy(acq)
tidy_data

# A tibble: 50 × 16
   author   datetimestamp       description heading id    language origin topics
   <chr>    <dttm>              <chr>       <chr>   <chr> <chr>    <chr>  <chr> 
 1 <NA>     1987-02-26 15:18:06 ""          COMPUT… 10    en       Reute… YES   
 2 <NA>     1987-02-26 15:19:15 ""          OHIO M… 12    en       Reute… YES   
 3 <NA>     1987-02-26 15:49:56 ""          MCLEAN… 44    en       Reute… YES   
 4 By Cal … 1987-02-26 15:51:17 ""          CHEMLA… 45    en       Reute… YES   
 5 <NA>     1987-02-26 16:08:33 ""          <COFAB… 68    en       Reute… YES   
 6 <NA>     1987-02-26 16:32:37 ""          INVEST… 96    en       Reute… YES   
 7 By Patt… 1987-02-26 16:43:13 ""          AMERIC… 110   en       Reute… YES   
 8 <NA>     1987-02-26 16:59:25 ""          HONG K… 125   en       Reute… YES   
 9 <NA>     1987-02-26 17:01:28 ""          LIEBER… 128   en       Reute… YES   
10 <NA>     1987-02-26 17:08:27 ""          GULF A… 134   en       Reute… YES   
# ℹ 40 more rows
# ℹ 8 more variables: lewissplit <chr>, cgisplit <chr>, oldid <chr>,
#   places <named list>, people <lgl>, orgs <lgl>, exchanges <lgl>, text <chr>

reverse, to get back to corpus format from a tibble, use VCorpus() function:

corpus <- VCorpus(VectorSource(tidy_data$text)) # this only captures the text

# add a column to metadata dataframe attached to corpus:
meta(corpus, "Author") <- tidy_data$author
meta(corpus, "oldid") <- tidy_data$oldid
head(meta(corpus))

examples

Explore an R corpus

One of your coworkers has prepared a corpus of 20 documents discussing crude oil, named crude. This is only a sample of several thousand articles you will receive next week. In order to get ready for running text analysis on these documents, you have decided to explore their content and metadata. Remember that in R, a VCorpus contains both meta and content regarding each text. In this lesson, you will explore these two objects.

# Print out the corpus
print(crude)

Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20

# Print the content of the 10th article
crude[[10]]$content

[1] "Saudi Arabian Oil Minister Hisham Nazer\nreiterated the kingdom's commitment to last December's OPEC\naccord to boost world oil prices and stabilise the market, the\nofficial Saudi Press Agency SPA said.\n    Asked by the agency about the recent fall in free market\noil prices, Nazer said Saudi Arabia \"is fully adhering by the\n... Accord and it will never sell its oil at prices below the\npronounced prices under any circumstance.\"\n    Nazer, quoted by SPA, said recent pressure on free market\nprices \"may be because of the end of the (northern hemisphere)\nwinter season and the glut in the market.\"\n    Saudi Arabia was a main architect of the December accord,\nunder which OPEC agreed to lower its total output ceiling by\n7.25 pct to 15.8 mln barrels per day (bpd) and return to fixed\nprices of around 18 dlrs a barrel.\n    The agreement followed a year of turmoil on oil markets,\nwhich saw prices slump briefly to under 10 dlrs a barrel in\nmid-1986 from about 30 dlrs in late 1985. Free market prices\nare currently just over 16 dlrs.\n    Nazer was quoted by the SPA as saying Saudi Arabia's\nadherence to the accord was shown clearly in the oil market.\n    He said contacts among members of OPEC showed they all\nwanted to stick to the accord.\n    In Jamaica, OPEC President Rilwanu Lukman, who is also\nNigerian Oil Minister, said the group planned to stick with the\npricing agreement.\n    \"We are aware of the negative forces trying to manipulate\nthe operations of the market, but we are satisfied that the\nfundamentals exist for stable market conditions,\" he said.\n    Kuwait's Oil Minister, Sheikh Ali al-Khalifa al-Sabah, said\nin remarks published in the emirate's daily Al-Qabas there were\nno plans for an emergency OPEC meeting to review prices.\n    Traders and analysts in international oil markets estimate\nOPEC is producing up to one mln bpd above the 15.8 mln ceiling.\n    They named Kuwait and the United Arab Emirates, along with\nthe much smaller producer Ecuador, among those producing above\nquota. Sheikh Ali denied that Kuwait was over-producing.\n REUTER"

# Find the first ID
crude[[1]]$meta$id

[1] "127"

# Make a vector of IDs
ids <- c()
for(i in c(1:20)){
  ids <- append(ids, crude[[i]]$meta$id)
}

ids

 [1] "127" "144" "191" "194" "211" "236" "237" "242" "246" "248" "273" "349"
[13] "352" "353" "368" "489" "502" "543" "704" "708"

Well done. You now understand the basics of an R corpus. However, creating the ID vector was a bit of work. Let’s use the tidy() function to help make this process easier.

Creating a tibble from a corpus

To further explore the corpus on crude oil data that you received from a coworker, you have decided to create a pipeline to clean the text contained in the documents. Instead of exploring how to do this with the tm package, you have decided to transform the corpus into a tibble so you can use the functions unnest_tokens(), count(), and anti_join() that you are already familiar with. The corpus crude contains both the metadata and the text of each document.

# Create a tibble & Review
crude_tibble <- tidy(crude)
names(crude_tibble)

crude_counts <- crude_tibble %>%
  # Tokenize by word 
  unnest_tokens(word, text) %>%
  # Count by word
  count(word, sort = TRUE) %>%
  # Remove stop words
  anti_join(stop_words)

crude_counts

# A tibble: 900 × 2
   word       n
   <chr>  <int>
 1 oil       86
 2 prices    48
 3 opec      44
 4 mln       31
 5 bpd       23
 6 dlrs      23
 7 crude     21
 8 market    20
 9 reuter    20
10 saudi     18
# … with 890 more rows
# ℹ Use `print(n = ...)` to see more rows

Creating a corpus

You have created a tibble called russian_tweets that contains around 20,000 tweets auto generated by bots during the 2016 U.S. election cycle so that you can perform text analysis. However, when searching through the available options for performing the analysis you have chosen to do, you believe that the tm package offers the easiest path forward. In order to conduct the analysis, you first must create a corpus and attach potentially useful metadata.

Be aware that this is real data from Twitter and as such there is always a risk that it may contain profanity or other offensive content (in this exercise, and any following exercises that also use real Twitter data).

# Create a corpus
tweet_corpus <- VCorpus(VectorSource(russian_tweets$content))

# Attach following and followers
meta(tweet_corpus, 'following') <- russian_tweets$following
meta(tweet_corpus, 'followers') <- russian_tweets$followers

# Review the meta data
head(meta(tweet_corpus))

  following followers
1      1052      9636
2      1054      9637
3      1054      9637
4      1062      9642
5      1050      9645
6      1050      9644

Bag-of-words representation

bag-of-words representation uses vectors to specify which words are in each text
- consider the following three texts
- find unique words and then convert this into vector representations
  - “few” only in text1
  - “all” only in text2
  - “most” only in text3
  - words, are, important in all three texts

text1 <- c("Few words are important.")
text2 <- c("All words are important.")
text3 <- c("Most words are important.")

First, create clean vector of the unique words used in all of the text

# lowercase, without stop words
# optional but good ideas: removing punctuation and stemming words
word_vector <- c("few", "all", "most", "words", "important")

# convert each text into binary representation of which words are in that text

# Representation for text1
text1 <- c("Few words are important.")
text1_vector <- c(1, 0, 0, 1, 1)

# Representation for text2
text2 <- c("All words are important.")
text2_vector <- c(0, 1, 0, 1, 1)

# Representation for text3
text3 <- c("Most words are important.")
text3_vector <- c(0, 0, 1, 1, 1)

could have used word counts instead of binary 1s and 0s
tidytext’s representation is different
- tibble or word count by chapter, sorted from most to least common
Sparse matrix
- consider the russian tweet dataset
  - 20,000 tweets (rows)
  - 43,000 (non stop words) words (columns)
  - Need 860 million elements in matrix, but only 177,000 non-0 entries (0.02%)
- tidytext and tm packages can handle this sparse matrix problem in an efficient manner

examples

BoW Example

In literature reviews, researchers read and summarize as many available texts about a subject as possible. Sometimes they end up reading duplicate articles, or summaries of articles they have already read. You have been given 20 articles about crude oil as an R object named crude_tibble. Instead of jumping straight to reading each article, you have decided to see what words are shared across these articles. To do so, you will start by building a bag-of-words representation of the text.

# Count occurrence by article_id and word
words <- crude_tibble %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  anti_join(stop_words) %>%
  count(article_id, word, sort=TRUE)

words

# A tibble: 1,498 × 3
   word    article_id     n
   <chr>        <int> <int>
   article_id word        n
        <int> <chr>   <int>
 1          2 opec       13
 2          2 oil        12
 3          6 kuwait     10
 4         10 oil         9
 5         10 prices      9
 6         11 mln         9
 7         19 futures     9
 8          6 opec        8
 9          7 report      8
10         10 market      8# … with 1,488 more rows
# ℹ Use `print(n = ...)` to see more rows

# Count occurrence by article_id and word
words <- crude_tibble %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  anti_join(stop_words) %>%
  count(word, article_id, sort=TRUE)

# How many different word/article combinations are there?
unique_combinations <- nrow(words)

# Filter to responses with the word "prices"
words_with_prices <- words %>%
  filter(word == "prices")

# How many articles had the word "prices"?
number_of_price_articles <- nrow(words_with_prices)

number_of_price_articles
15

Excellent job. BOW representations are one of the quickest ways to start analyzing text. Several more advanced techniques also start by simply looking at which words are used in each piece of text.

Sparse matrices

During the video lesson you learned about sparse matrices. Sparse matrices can become computational nightmares as the number of text documents and the number of unique words grow. Creating word representations with tweets can easily create sparse matrices because emojis, slang, acronyms, and other forms of language are used.

In this exercise you will walk through the steps to calculate how sparse the Russian tweet dataset is. Note that this is a small example of how quickly text analysis can become a major computational problem.

# Tokenize and remove stop words
tidy_tweets <- russian_tweets %>%
  unnest_tokens(word, content) %>%
  anti_join(stop_words)
# Count by word
unique_words <- tidy_tweets %>%
  count(word, sort = T)

unique_words
# A tibble: 43,666 × 2
   word                 n
   <chr>            <int>
 1 t.co             18121
 2 https            16003
 3 http              2135
 4 blacklivesmatter  1292
 5 trump             1004
 6 black              781
 7 enlist             764
 8 police             745
 9 people             723
10 cops               693
# … with 43,656 more rows
# ℹ Use `print(n = ...)` to see more rows

# Count by tweet (tweet_id) and word
unique_words_by_tweet <- tidy_tweets %>%
  count(tweet_id, word)

unique_words_by_tweet
# A tibble: 177,140 × 3
   tweet_id word           n
      <int> <chr>      <int>
 1        1 barely         1
 2        1 corruption     1
 3        1 democrat       1
 4        1 gh6g0d1oic     1
 5        1 heard          1
 6        1 https          1
 7        1 mainstream     1
 8        1 media          1
 9        1 nedryun        1
10        1 peep           1
# … with 177,130 more rows
# ℹ Use `p

# Find the size of matrix
size <- nrow(russian_tweets) * nrow(unique_words)

size
[1] 873320000

# Find percent of entries that would have a value
percent <- nrow(unique_words_by_tweet) / size

percent

[1] 0.0002028352

Well done! This percent is tiny - indicating that we are dealing with a very sparse matrix. Imagine if we looked at a million tweets instead of just 20,000.

Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF is a way to represent text data that is more informative than BoW
- represents word counts by considering two components
  - Term frequency (TF): proportion of words in a text that are that term
  - Inverse document frequency (IDF): how unique a word is across all documents
IDF equation $I D F = l o g \frac{N}{n_{t}}$
- $N$ : total number of documents in corpus
- $n_{t}$ : number of documents where term $t$ appears
TF-IDF: $T F - I D F = T F \times I D F$
In tidytext’s bind_tf_idf()

t1 <- c("My name is John. My best friend is Joe. We like tacos.")
t2 <- c("Two common best friend names are John and Joe.")
t3 <- c("Tacos are my favorite food. I eat them with my friend Joe.")

df <- data.frame('text' = c(t1, t2, t3),
                 'ID' = c(1,2,3),
                 stringsAsFactors = FALSE)

df %>%
  unnest_tokens(output = "word",
                token = "words",
                input = text) %>%
  anti_join(stop_words) %>%
  count(ID, word, sort = TRUE) %>%
  bind_tf_idf(word, # column with terms
              ID, # column with document ids
              n) # word count produced by count()

Joining with `by = join_by(word)`

   ID     word n        tf       idf     tf_idf
1   1   friend 1 0.2500000 0.0000000 0.00000000
2   1      joe 1 0.2500000 0.0000000 0.00000000
3   1     john 1 0.2500000 0.4054651 0.10136628
4   1    tacos 1 0.2500000 0.4054651 0.10136628
5   2   common 1 0.2000000 1.0986123 0.21972246
6   2   friend 1 0.2000000 0.0000000 0.00000000
7   2      joe 1 0.2000000 0.0000000 0.00000000
8   2     john 1 0.2000000 0.4054651 0.08109302
9   2    names 1 0.2000000 1.0986123 0.21972246
10  3      eat 1 0.1666667 1.0986123 0.18310205
11  3 favorite 1 0.1666667 1.0986123 0.18310205
12  3     food 1 0.1666667 1.0986123 0.18310205
13  3   friend 1 0.1666667 0.0000000 0.00000000
14  3      joe 1 0.1666667 0.0000000 0.00000000
15  3    tacos 1 0.1666667 0.4054651 0.06757752

Examples

TFIDF Practice

Earlier you looked at a bag-of-words representation of articles on crude oil. Calculating TFIDF values relies on this bag-of-words representation, but takes into account how often a word appears in an article, and how often that word appears in the collection of articles.

To determine how meaningful words would be when comparing different articles, calculate the TFIDF weights for the words in crude, a collection of 20 articles about crude oil.

# Create a tibble with TFIDF values
crude_weights <- crude_tibble %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  anti_join(stop_words) %>%
  count(article_id, word) %>%
  bind_tf_idf(word, article_id, n)

# Find the highest TFIDF values
crude_weights %>%
  arrange(desc(tf_idf))

# Find the lowest non-zero TFIDF values
crude_weights %>%
  filter(tf_idf != 0) %>%
  arrange(desc(tf_idf))

# Find the highest TFIDF values
crude_weights %>%
  arrange(desc(tf_idf))
# A tibble: 1,498 × 6
   article_id word         n     tf   idf tf_idf
        <int> <chr>    <int>  <dbl> <dbl>  <dbl>
 1         20 january      4 0.0930  2.30  0.214
 2         15 power        4 0.0690  3.00  0.207
 3         19 futures      9 0.0643  3.00  0.193
 4          8 8            6 0.0619  3.00  0.185
 5          3 canada       2 0.0526  3.00  0.158
 6          3 canadian     2 0.0526  3.00  0.158
 7         15 ship         3 0.0517  3.00  0.155
 8         19 nymex        7 0.05    3.00  0.150
 9         20 cubic        2 0.0465  3.00  0.139
10         20 fiscales     2 0.0465  3.00  0.139
# … with 1,488 more rows
# ℹ Use `print(n = ...)` to see more rows
# Find the lowest non-zero TFIDF values
crude_weights %>%
  filter(tf_idf != 0) %>%
  arrange(tf_idf)
# A tibble: 1,458 × 6
   article_id word          n      tf   idf  tf_idf
        <int> <chr>     <int>   <dbl> <dbl>   <dbl>
 1          7 prices        1 0.00452 0.288 0.00130
 2          9 prices        1 0.00513 0.288 0.00148
 3          7 dlrs          1 0.00452 0.598 0.00271
 4          7 opec          1 0.00452 0.693 0.00314
 5          9 opec          1 0.00513 0.693 0.00355
 6          7 mln           1 0.00452 0.799 0.00361
 7          7 petroleum     1 0.00452 0.799 0.00361
 8         11 petroleum     1 0.00455 0.799 0.00363
 9          6 barrels       1 0.00429 0.916 0.00393
10          6 industry      1 0.00429 0.916 0.00393
# … with 1,448 more rows
# ℹ Use `print(n = ...)` to see more rows```

Excellent. We see that ‘prices’ and ‘petroleum’ have very low values for some articles. This could be because they were mentioned just a few times in that article, or because they were used in too many articles.

Cosine Similarity

Assess how similar two documents are using cosine similarity
- a measure of similarity between two vectors (measured by the angle formed between them)
- can be found by taking the dot product of two vectors and dividing it by the product of their magnitudes:

$similarity = \cos (θ) = \frac{A \cdot B}{| | A | | \cdot | | B | |} = \frac{\sum_{i = 1}^{n} A_{i} \times B_{i}}{\sqrt{\sum_{i = 1}^{n} A_{i}^{2}} \times \sqrt{\sum_{i = 1}^{n} B_{i}^{2}}}$

$A$ and $B$ : vectors of word counts for each document
Can use pairwise_similarity() from widyr package:

pairwise_similarity(tbl, # tibble or table
                    item, # item to compare (articles, tweets, etc)
                    feature, # column with link between items e.g. words
                    value) # name of the column with comparison values e.g. n or tf_idf

e.g.

crude_weights %>%
  pairwise_similarity(X, word, tf_idf) %>%
  arrange(desc(similarity))

use cases for cosine similarity: - find duplicate/similar pieces of text - use for clustering and classification analysis

examples

An example of failing at text analysis

Early on, you discussed the power of removing stop words before conducting text analysis. In this most recent chapter, you reviewed using cosine similarity to identify texts that are similar to each other.

In this exercise, you will explore the very real possibility of failing to use text analysis properly. You will compute cosine similarities for the chapters in the book Animal Farm, without removing stop-words.

# Create word counts
animal_farm_counts <- animal_farm %>%
  unnest_tokens(word, text_column) %>%
  count(chapter, word)

# Calculate the cosine similarity by chapter, using words
comparisons <- animal_farm_counts %>%
  pairwise_similarity(chapter, word, n) %>%
  arrange(desc(similarity))

# A tibble: 90 × 3
   item1      item2      similarity
   <chr>      <chr>           <dbl>
 1 Chapter 9  Chapter 8       0.972
 2 Chapter 8  Chapter 9       0.972
 3 Chapter 8  Chapter 7       0.970
 4 Chapter 7  Chapter 8       0.970
 5 Chapter 8  Chapter 10      0.969
 6 Chapter 10 Chapter 8       0.969
 7 Chapter 9  Chapter 5       0.968
 8 Chapter 5  Chapter 9       0.968
 9 Chapter 9  Chapter 10      0.966
10 Chapter 10 Chapter 9       0.966
# … with 80 more rows
# ℹ Use `print(n = ...)` to see more rows

# Print the mean of the similarity values
comparisons %>%
  summarize(mean = mean(similarity))

# A tibble: 1 × 1
   mean
  <dbl>
1 0.949

Well done. Unfortunately, these results are useless. As every single chapter is highly similar to every other chapter. We need to remove stop words to see which chapters are more similar to each other.

Cosine similarity example

The plot of Animal Farm is pretty simple. In the beginning the animals are unhappy with following their human leaders. In the middle they overthrow those leaders, and in the end they become unhappy with the animals that eventually became their new leaders.

If done correctly, cosine similarity can help identify documents (chapters) that are similar to each other. In this exercise, you will identify similar chapters in Animal Farm. Odds are, chapter 1 (the beginning) and chapter 10 (the end) will be similar.

# Create word counts 
animal_farm_counts <- animal_farm %>%
  unnest_tokens(word, text_column) %>%
  anti_join(stop_words) %>%
  count(chapter, word) %>%
  bind_tf_idf(word, chapter, n)

# Calculate cosine similarity on word counts
animal_farm_counts %>%
  pairwise_similarity(chapter, word, n) %>%
  arrange(desc(similarity))

# A tibble: 90 × 3
   item1      item2      similarity
   <chr>      <chr>           <dbl>
 1 Chapter 8  Chapter 7       0.696
 2 Chapter 7  Chapter 8       0.696
 3 Chapter 7  Chapter 5       0.693
 4 Chapter 5  Chapter 7       0.693
 5 Chapter 8  Chapter 5       0.642
 6 Chapter 5  Chapter 8       0.642
 7 Chapter 7  Chapter 6       0.641
 8 Chapter 6  Chapter 7       0.641
 9 Chapter 6  Chapter 10      0.638
10 Chapter 10 Chapter 6       0.638
# … with 80 more rows
# ℹ Use `print(n = ...)` to see more rows

# Calculate cosine similarity using tf_idf values
animal_farm_counts %>%
  pairwise_similarity(chapter, word, tf_idf) %>%
  arrange(desc(similarity))

# A tibble: 90 × 3
   item1     item2     similarity
   <chr>     <chr>          <dbl>
 1 Chapter 8 Chapter 7      0.177
 2 Chapter 7 Chapter 8      0.177
 3 Chapter 7 Chapter 5      0.117
 4 Chapter 5 Chapter 7      0.117
 5 Chapter 7 Chapter 6      0.116
 6 Chapter 6 Chapter 7      0.116
 7 Chapter 9 Chapter 8      0.109
 8 Chapter 8 Chapter 9      0.109
 9 Chapter 8 Chapter 4      0.108
10 Chapter 4 Chapter 8      0.108
# … with 80 more rows
# ℹ Use `print(n = ...)` to see more rows

Excellent job. Cosine similarity scores can be calculated on word counts or TFIDF values. We see drastically different results for both. Animal Farm has a very low reading level, and most chapters share the same vocabulary. This was evident in the previous exercise. You’ll need to consider the context of the text you are analyzing when deciding on an approach.

Applications: Classification and Topic Modelling

Preparing text for modeling
For classification tasks:
1. clean/prepare data
2. split into training & testing datasets
3. train model on training dataset
4. evaluate model on testing dataset
Use classification modeling on the Animal Farm dataset to determine which sentences are discussing Napoleon or Boxer

# Make sentences
sentences <- animal_farm %>%
  unnest_tokens(output = "sentence",
                token = "sentences",
                input = text_column)

# label sentences by animal (so algorithm doesn't use these during training)
sentences$boxer <- grepl('boxer', sentences$sentence)
sentences$napoleon <- grepl('napoleon', sentences$sentence)

# Replace the animal name
sentences$sentence <- gsub("boxer", "animal X", sentences$sentence)
sentences$sentence <- gsub("napoleon", "animal X", sentences$sentence)
# filter to sentences that contain Boxer or Napoleon but not both
animal_sentences <- sentences[sentences$boxer + sentences$napoleon == 1,]

# add label to dataset
animal_sentences$Name <- as.factor(ifelse(animal_sentences$boxer, "boxer", "napoleon"))

# select 75 sentences for each
animal_sentences <-
  rbind(animal_sentences[animal_sentences$Name == "boxer", ][c(1:75), ],
        animal_sentences[animal_sentences$Name == "napoleon", ][c(1:75), ])

animal_sentences$sentence_id <- c(1:dim(animal_sentences)[1])

# next predict which sentences originally included each animal

library(tm)
library(tidytext)
library(dplyr)
library(SnowballC)

# create tokens
animal_tokens <- animal_sentences %>%
  unnest_tokens(output = "word",
                token = "words",
                input = sentence) %>%
  anti_join(stop_words) %>%
  mutate(word = wordStem(word))

# for classification, create a document term matrix with tfidf weights using cast_dtm() from tidytext
animal_matrix <- animal_tokens %>%
  # count words by sentence
  count(sentence_id, word) %>%
  # cast to a dtm (one row per document (sentence, here), and one column for each word)
  cast_dtm(document = sentence_id,
           term = word,
           value = n,
           weighting = tm::weightTfIdf)

animal_matrix

Non-/sparse entries: 1235/102865
Sparsity           : 99%
Maximal term length: 17
Weighting          : term frequency - inverse document frequency

Using large, sparse matrices can be computationally expensive. In this case, we have 150 sentences and 694 unique words. The matrix is 99% sparse, meaning that 99% of the cells are empty. This is a common issue when working with text data.

Remove sparse terms with removeSparseTerms()

How sparse is too sparse?

If we set maximum sparsity to 90%:

removeSparseTerms(animal_matrix, sparse = 0.90)

Non-/sparse entries: 207/393
Sparsity           : 66%

Would remove all words but four! Couldn’t classify sentences using only 4 words!

If we set maximum sparsity to 99%:

removeSparseTerms(animal_matrix, sparse = 0.99)

Non-/sparse entries: 713/25087
Sparsity           : 97%

here we’d have 172 terms (remember we started with 694).

Deciding on matrix sparsity depends on how many terms are in the matrix and how fast your computer is

Examples

Classification modeling example

You have previously prepared a set of Russian tweets for classification. Of the 20,000 tweets, you have filtered to tweets with an account_type of Left or Right, and selected the first 2000 tweets of each. You have already tokenized the tweets into words, removed stop words, and performed stemming. Furthermore, you converted word counts into a document-term matrix with TFIDF values for weights and saved this matrix as: left_right_matrix_small.

You will use this matrix to predict whether a tweet was generated from a left-leaning tweet bot, or a right-leaning tweet bot. The labels can be found in the vector, left_right_labels.

library(randomForest)

# Create train/test split
set.seed(1111)
sample_size <- floor(0.75 * nrow(left_right_matrix_small))
train_ind <- sample(nrow(left_right_matrix_small), size = sample_size)
train <- left_right_matrix_small[train_ind, ]
test <- left_right_matrix_small[-train_ind, ]

# Create a random forest classifier
rfc <- randomForest(x = as.data.frame(as.matrix(train)), 
                    y = left_right_labels[train_ind],
                    nTree = 50)
# Print the results
rfc


Call:
 randomForest(x = as.data.frame(as.matrix(train)), y = left_right_labels[train_ind],      nTree = 50) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 22.43%
Confusion matrix:
      Left Right class.error
Left   976   540  0.35620053
Right  133  1351  0.08962264

Excellent! Classification modeling with text follows the same principals as classification models built on continuous data. You can also use all kinds of fun machine learning algorithms and are not stuck using random forest models.

Confusion matrices

You have just finished creating a classification model. This model predicts whether tweets were created by a left-leaning (democrat) or right-leaning (republican) tweet bot. You have made predictions on the test data and have the following result:

Predictions	Left	Right
Left	350	157
Right	57	436

Use the confusion matrix above to answer questions about the models accuracy.

# Percentage correctly labeled "Left"
left <- (350) / (350 + 157)
left

# Percentage correctly labeled "Right"
right <- (436) / (57 + 436)
right

# Overall Accuracy:
accuracy <- (350 + 436) / (350 + 157 + 57 + 436)
accuracy

left
[1] 0.6903353

right
[1] 0.8843813

accuracy
[1] 0.786

Excellent. Although accuracy is only one of many metrics to determine if an algorithm is doing a good job, it is usually a good indicator of model performance!

left_right_tfidf
# A tibble: 38,821 × 6
       X word        n    tf   idf tf_idf
   <int> <chr>   <int> <dbl> <dbl>  <dbl>
 1  6028 ʷʰʸ        11 0.917  8.29  7.60 
 2    16 obama       3 0.231  3.99  0.921
 3    24 scout       3 0.333  7.20  2.40 
 4    96 peopl       3 0.333  3.53  1.18 
 5   141 hillari     3 0.15   4.53  0.680
 6   141 trump       3 0.15   2.16  0.323
 7  5732 door        3 0.214  6.21  1.33 
 8  5735 albino      3 0.214  8.29  1.78 
 9  5798 cop         3 0.214  2.27  0.487
10  6012 cop         3 0.176  2.27  0.401
# … with 38,811 more rows
# ℹ Use `print(n = ...)` to see more rows

Topic Modeling

Collection of texts is likely to be made up of a collection of topics (e.g. articles about sports, with topics like player gossip, scores, scouting, draft picks)
Algorithms can identify topics within a collection of text, one of the most common is Latent dirichlet allocation (LDA):
1. Each document is a mixture of topics
2. Topics are mixtures of words
e.g. a sports story on a player being traded:
- 70% on team news
  - words: trade, pitcher, move, new
- 30% player gossip
  - words: angry, change, money
to perform LDA, need a document-term matrix with term frequency weights

animal_farm_tokens <- animal_farm %>%
  unnest_tokens(output = "word", token = "words", input = text_column) %>%
  anti_join(stop_words) %>%
  mutate(word = wordStem(word))

# cast to DTM
animal_farm_matrix <- animal_farm_tokens %>%
  count(document, word) %>%
  cast_dtm(document = chapter,
           term = word, 
           value = n,
           weighting = tm::weightTf) # LDA requires term-frequency weighting

# Perform LDA
library(topicmodels)
animal_farm_lda <- 
  LDA(train,
      k = 4, # number of topics
      method = "Gibbs", # 
      control = list(seed = 111)) # seed

animal_farm_lda
# A LDA_Gibbs topic model with 4 topics.

# extract a tibble of results
animal_farm_betas <- 
  tidy(animal_farm_lda,
       matrix = "beta")

animal_farm_betas

# A tibble: 11,004 x 3
  topic term        beta
  <int> <chr>      <dbl>
...
5     1 abolish 0.000360
6     2 abolish 0.00129
7     3 abolish 0.000355
8     4 abolish 0.000381
...

beta is a per-topic word distribution, how related to each topic a word is - probability of a word given a topic - sum of these values should be equal to the topic number of topics

sum(animal_farm_betas$beta)
[1] 4

LDA in Practice

Need to select the number of topics
- LDA will create topics for you, but won’t tell you how many to choose
“Perplexity” can help us use the right number of topics
- measure of how well a probability model fits new data (lower is better)
- often used to compare models
- in LDA parameter tuning
- selecting number of topics

# first create train/test split
sample_size <- floor(0.90 * nrow(doc_term_matrix))
set.seed(1111)

train_ind <- sample(nrow(doc_term_matrix), size = sample_size)

train <- matrix[train_ind, ]
test <- matrix[-train_ind, ]

Must assess perplexity on the testing dataset to make sure topics are extendable to new data.

Next create LDA models for each number of topics and calculate perplexity for each model. - perplexity() function from topicmodels package

library(topicmodels)
values = c()

# for each K from 2 to 35, train a model and calculate perplexity
for(i in c(2:35)){
  lda_model <- LDA(train,
                   k = i,
                   method = "Gibbs",
                   control = list(iter = 25,
                                  seed = 1111))
  values <- c(values, perplexity(lda_model, newdata = test))
}

# plot these values with # of topics as X, perplexity as Y
plot(c(2:35), values, main = "Perplexity for Topics", xlab = "Number of Topics", ylab = "Perplexity")

A scree-plot (like for Kmeans) - find where the perplexity score is not improving (decreasing) much with the addition of more topics

LDA is often more about practical use than selecting the optimal number of topics based on perplexity - e.g. describing 10-15 topics to an audience might not be feasible - graphics with 5 topics are easier to view than graphics with 50 topics

Good rule of thumb: go with smaller number of topics, where each topic is represented by a large number of documents

Common for having a subject matter expert review the words of the topics and some of the articles aligned with each topic to provide a theme for each topic

betas <- tidy(lda_model, matrix = "beta")
betas %>%
  filter(topic == 1) %>%
  arrange(-beta) %>%
  select(term)

# A tibble: 2,000 × 1
   term       
   <chr>      
 1 athletic
 2 quick       
 3 strong       
 4 tough

Look like topic 1 is describing athletes in this example. Can also confirm by reviewing articles:

gammas <- tidy(lda_model, matrix = "gamma")
gammas %>%
  filter(topic == 1) %>%
  arrange(-gamma) %>%
  select(document)

How to summarize output: - count how many times each topic was the highest weighted topic:

gammas <- tidy(lda_model, matrix = "gamma")
gammas %>%
  group_by(document) %>%
  arrange(desc(gamma)) %>%
  slice(1) %>%
  group_by(topic) %>%
  tally(topic, sort = TRUE)

  topic    n
1    1  1326
2    5   1215
3    4   804

Topic 1 was the top topic for 1326 documents, etc.

View how strong a topic was when it was the top topic:

gammas %>%
  group_by(document) %>%
  arrange(desc(gamma)) %>%
  slice(1) %>%
  group_by(topic) %>%
  summarize(avg = mean(gamma)) %>%
  arrange(desc(avg))

  topic   avg
1    1 0.696
2    5 0.530
3    4 0.438

topic 1 had highest average weight when it was the top topic

examples

Testing perplexity

You have been given a dataset full of tweets that were sent by tweet bots during the 2016 US election. Your boss has identified two different account types of interest, Left and Right. Your boss has asked you to perform topic modeling on the tweets from Right tweet bots. Furthermore, your boss is hoping to summarize the content of these tweets with topic modeling. Perform topic modeling on 5, 15, and 50 topics to determine a general idea of how many topics are contained in the data.

library(topicmodels)
# Setup train and test data
sample_size <- floor(0.90 * nrow(right_matrix))
set.seed(1111)
train_ind <- sample(nrow(right_matrix), size = sample_size)
train <- right_matrix[train_ind, ]
test <- right_matrix[-train_ind, ]

# Peform topic modeling 
lda_model <- LDA(train, k = 5, method = "Gibbs",
                 control = list(seed = 1111))
# Train
perplexity(lda_model, newdata = train) 
# Test
perplexity(lda_model, newdata = test)

# Train
perplexity(lda_model, newdata = train) 
[1] 577.9461
# Test
perplexity(lda_model, newdata = test) 
[1] 792.8027

Now with 15 topics:

library(topicmodels)
# Setup train and test data
sample_size <- floor(0.90 * nrow(right_matrix))
set.seed(1111)
train_ind <- sample(nrow(right_matrix), size = sample_size)
train <- right_matrix[train_ind, ]
test <- right_matrix[-train_ind, ]

# Peform topic modeling 
lda_model <- LDA(train, k = 15, method = "Gibbs",
                 control = list(seed = 1111))
# Train
perplexity(lda_model, newdata = train) 
# Test
perplexity(lda_model, newdata = test)

perplexity(lda_model, newdata = train) 
[1] 595.5198
# Test
perplexity(lda_model, newdata = test) 
[1] 718.2236

Now with 50 topics:

library(topicmodels)
# Setup train and test data
sample_size <- floor(0.90 * nrow(right_matrix))
set.seed(1111)
train_ind <- sample(nrow(right_matrix), size = sample_size)
train <- right_matrix[train_ind, ]
test <- right_matrix[-train_ind, ]

# Peform topic modeling 
lda_model <- LDA(train, k = 50, method = "Gibbs",
                 control = list(seed = 1111))
# Train
perplexity(lda_model, newdata = train) 
# Test
perplexity(lda_model, newdata = test)

perplexity(lda_model, newdata = train) 
[1] 718.5356
# Test
perplexity(lda_model, newdata = test) 
[1] 800.6809

Excellent. 15 topics performs much better on this dataset. 5 topics was not enough, while 50 topics is probably way too many.

Reviewing LDA results

You have developed a topic model, napoleon_model, with 5 topics for the sentences from the book Animal Farm that reference the main character Napoleon. You have had 5 local authors review the top words and top sentences for each topic and they have provided you with themes for each topic.

To finalize your results, prepare some summary statistics about the topics. You will present these summary values along with the themes to your boss for review.

# Extract the gamma matrix 
gamma_values <- tidy(napoleon_model, matrix = "gamma")
# Create grouped gamma tibble
grouped_gammas <- gamma_values %>%
  group_by(document) %>%
  arrange(desc(gamma)) %>%
  slice(1) %>%
  arrange(topic)
# Count (tally) by topic
grouped_gammas %>% 
  tally(topic, sort=TRUE)

# A tibble: 5 × 2
  topic     n
  <int> <int>
1     4   116
2     5   110
3     2    80
4     3    72
5     1    41

# Average topic weight for top topic for each sentence
grouped_gammas %>% 
  summarize(avg=mean(gamma)) %>%
  arrange(desc(avg))

# A tibble: 5 × 2
  topic   avg
  <int> <dbl>
1     3 0.240
2     4 0.236
3     5 0.235
4     2 0.231
5     1 0.226

Well done. Topic 4 had the most sentences most similar to that topic. However, notice that the average weights were very similar for each topic.

Sentiment Analysis

Sentiment analysis assesses subjective information from text
Types:
- positive vs. negative
- words eliciting emotions
Start with dictionary of words that have a predefined value or score
- each word is given a meaning & (sometimes) score
  - abandon -> fear
  - accomplish -> joy

library(tidytext)
sentiments

# A tibble: 6,786 × 2
   word        sentiment
   <chr>       <chr>    
 1 2-faces     negative 
 2 abnormal    negative 
 3 abolish     negative 
 4 abominable  negative 
 5 abominably  negative 
 6 abominate   negative 
 7 abomination negative 
 8 abort       negative 
 9 aborted     negative 
10 aborts      negative 
# ℹ 6,776 more rows

Look at these 3 different dictionaries, or lexicons, in tidytext
- AFINN: scores words from -5 (extremely negative) to 5 (extremely positive)
- bing: binary positive or negative for all words
- nrc: labels words as categories fear, joy, anger, etc
access data using get_sentiments() with the name of the lexicon

library(tidytext)
library(textdata)
get_sentiments("afinn")

# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# ℹ 2,467 more rows

First we need to prepare our text

# Read data
animal_farm <- read.csv("animal_farm.csv", stringsAsFactors = FALSE)
animal_farm <- as_tibble(animal_farm)

# Tokenize and remove stop words
animal_farm_tokens <- animal_farm %>%
  unnest_tokens(output = "word", token = "words", input = text_column) %>%
  anti_join(stop_words)

Note: we did not perform stemming, as words might create different sentiments than just their stem!!

With tokens, can join words with their sentiment using inner_join()

# Join sentiment data
animal_farm_tokens %>%
  inner_join(get_sentiments("afinn"))

# A tibble: 1,175 x 3
  chapter   word  score
  <chr>     <chr> <int>
1 Chapter 1 drunk    -2
2 Chapter 1 strange  -1
3 Chapter 1 dream     1
4 Chapter 1 agreed    1
5 Chapter 1 safelt    1

Can group sentiments by chapter, summarize overall score

animal_farm_tokens %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(chapter) %>%
  summarize(sentiment = sum(score)) %>%
  arrange(sentiment)

# A tibble: 10 x 2
   chapter    sentiment
   <chr>          <int>
 1 Chapter 7       -166
 2 Chapter 8       -158
 3 Chapter 4        -84

bing lexicon: positive or negative sentiment - instead of summarizing scores, just need to count the words used

# find total words used by chapter
word_totals <- animal_farm_tokens %>%
  group_by(chapter) %>%
  count()

# count how many negative words were used
animal_farm_tokens %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(chapter) %>%
  count(sentiment) %>%
  filter(sentiment == "negative") %>%
  transform(p = n / word_totals$n) %>%
  arrange(desc(p))

# A tibble: 10 x 4
   chapter    sentiment     n     p
   <chr>      <chr>     <int> <dbl>
 1 Chapter 7 negative     154 0.11711027
 2 Chapter 6 negative     106 0.10750507
 3 Chapter 4 negative      68 0.10559006

chapter 7 contains highest proportion of negative words, almost 12%

nrc lexicon

as.data.frame(table(get_sentiments("nrc")$sentiment)) %>%
  arrange(desc(Freq))

       Var1 Freq
1     negative  3324
2     positive  2312
3        fear   1476
4       anger   1247
5       trust   1231
6     sadness   1191
...

use to see if certain emotions are in text:

# what words related to fear are in the text?
fear <- get_sentiments("nrc") %>%
  filter(sentiment == "fear")

animal_farm_tokens %>%
  inner_join(fear) %>%
  count(word, sort = TRUE)

# A tibble: 220 x 2
   word       n
   <chr>  <int>
1  rebellion   29
2  death       19
3  gun         19
4  terrible    15
5  bad         14
...

EXAMPLES

tidytext lexicons

Before you begin applying sentiment analysis to text, it is essential that you understand the lexicons being used to aid in your analysis. Each lexicon has advantages when used in the right context. Before running any analysis, you must decide which type of sentiment you are hoping to extract from the text available.

In this exercise, you will explore the three different lexicons offered by tidytext’s sentiments’ datasets.

# Print the lexicon
get_sentiments("bing")

        X                     word sentiment
1       1                  2-faced  negative
2       2                  2-faces  negative
3       3                       a+  positive
4       4                 abnormal  negative
5       5                  abolish  negative
6       6               abominable  negative
7       7               abominably  negative
8       8                abominate  negative
...

# Count the different sentiment types
get_sentiments("bing") %>%
  count(sentiment) %>%
  arrange(desc(n))

  sentiment    n
1  negative 4782
2  positive 2006

# Print the lexicon
get_sentiments("nrc")

          X              word    sentiment
1         1            abacus        trust
2         2           abandon         fear
3         3           abandon     negative
4         4           abandon      sadness
5         5         abandoned        anger
...

# Count the different sentiment types
get_sentiments("nrc") %>%
  count(sentiment) %>%
  arrange(desc(n))

      sentiment    n
1      negative 3324
2      positive 2312
3          fear 1476
4         anger 1247
5         trust 1231
6       sadness 1191
7       disgust 1058
8  anticipation  839
9           joy  689
10     surprise  534

# Print the lexicon
get_sentiments("afinn")

        X               word score
1       1            abandon    -2
2       2          abandoned    -2
3       3           abandons    -2
4       4           abducted    -2
5       5          abduction    -2
6       6         abductions    -2
...

# Count how many times each score was used
get_sentiments("afinn") %>%
  count(score) %>%
  arrange(desc(n))

   score   n
1     -2 965
2      2 448
3     -1 309
4     -3 264
5      1 208
6      3 172
7      4  45
8     -4  43
9     -5  16
10     5   5
11     0   1

Great job. Each lexicon serves its own purpose. These are not the only three sentiment dictionaries available but they are great examples of the type of dictionaries you can use.

Sentiment scores

In the book Animal Farm, three main pigs are responsible for the events of the book: Napoleon, Snowball, and Squealer. Throughout the book they are spreading thoughts of rebellion and encouraging the other animals to take over the farm from Mr. Jones - the owner of the farm.

Using the sentences that mention each pig, determine which character has the most negative sentiment associated with them. The sentences tibble contains a tibble of the sentences from the book Animal Farm.

# Print the overall sentiment associated with each pig's sentences
for(name in c("napoleon", "snowball", "squealer")) {
  # Filter to the sentences mentioning the pig
  pig_sentences <- sentences[grepl(name, sentences$sentence), ]
  # Tokenize the text
  napoleon_tokens <- pig_sentences %>%
    unnest_tokens(output = "word", token = "words", input = sentence) %>%
    anti_join(stop_words)
  # Use afinn to find the overall sentiment score
  result <- napoleon_tokens %>% 
    inner_join(get_sentiments("afinn")) %>%
    summarise(sentiment = sum(score))
  # Print the result
  print(paste0(name, ": ", result$sentiment))
}

[1] "napoleon: -45"
[1] "snowball: -77"
[1] "squealer: -30"

Excellent job. Although Napoleon is the main antagonist, the sentiment surrounding Snowball is extremely negative!

Sentiment and emotion

Within the sentiments dataset, the lexicon nrc contains a dictionary of words and an emotion associated with that word. Emotions such as joy, trust, anticipation, and others are found within this dataset.

In the Russian tweet bot dataset you have been exploring, you have looked at tweets sent out by both a left- and a right-leaning tweet bot. Explore the contents of the tweets sent by the left-leaning (democratic) tweet bot by using the nrc lexicon. The left tweets, left, have been tokenized into words, with stop-words removed.

left_tokens <- left %>%
  unnest_tokens(output = "word", token = "words", input = content) %>%
  anti_join(stop_words)
# Dictionaries 
anticipation <- get_sentiments("nrc") %>% 
  filter(sentiment == "anticipation")
joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")
# Print top words for Anticipation and Joy
left_tokens %>%
  inner_join(anticipation, by = "word") %>%
  count(word, sort = TRUE)

# A tibble: 391 × 2
   word      n
   <chr> <int>
 1 time    232
 2 god     185
 3 feat    126
 4 watch   123
 5 happy    98
 6 money    92
 7 vote     92
 8 death    85
 9 track    70
10 art      65
# … with 381 more rows
# ℹ Use `print(n = ...)` to see more rows

left_tokens %>%
  inner_join(joy, by = "word") %>%
  count(word, sort = TRUE)

# A tibble: 340 × 2
   word          n
   <chr>     <int>
 1 music       355
 2 love        273
 3 god         185
 4 feat        126
 5 happy        98
 6 money        92
 7 vote         92
 8 beautiful    89
 9 art          65
10 true         63
# … with 330 more rows
# ℹ Use `print(n = ...)` to see more rows

Excellent work. Tweets are supposed to stir feelings of joy, fear, and others. Especially tweets meant to turn the political left against the political right.

Word Embeddings

Flaw in word counts - consider two statements: 1. “Bob is the smartest person I know.” 2. “Bob is the most brilliant person I know.” - Statements say same thing, but consider with stop words removed: 1. Bob smartest person 2. Bob brilliant person - Smartest and brilliant aren’t identical words, so traditional similiarity metrics would not do well here

Word embeddings - instead of just counting how many times each word was used, but access to info on which words used in conjunction with those words, and their meaning - word2vec most popular word embeddings methods around - uses large vector space to represent words, words of similar meaning are close together - captures multiple similarities between words - words appearing often together are also closer together in vector space - e.g. pork, beef, chicken grouped together - implementation in R with h20 package

library(h20)
h2o.init()  # start h20 instance

# convert tibble into h2o object

h2o_object = as.h2o(animal_farm)

# using h2o methods:

# tokenize
words <- h2o.tokenize(h2o_object$text_column, "\\\\W+") # places an NA after last word in each chapter

# lowercase all letters
words <- h2o.tolower(words)

# remove stop words
words = words[is.na(words) || (!words %in% stop_words$word), ]

word2vec_model <- h2o.word2vec(words, 
                               min_word_freq = 5, # remove words used fewer than 5 times
                               epochs = 5) # number of training iterations to run (use larger for larger texts)

# find similar words, synonyms
h2o.findSynonyms(word2vec_model, "animal")

    synonym   score
1     drink   0.8209008
2       age   0.7952490
3    alcohol  0.7867004

“animal” is most related to wrods like “drink” “age” and “alcohol”

# find similar words, synonyms
h2o.findSynonyms(word2vec_model, "jones")

    synonym   score
1    battle   0.7996588
2  discovered   0.7944554
3    cowshed  0.7867004

“jones” the enemy of the animals in the book is most related to words like battle and enemies

Examples

h2o practice

There are several machine learning libraries available in R. However, the h2o library is easy to use and offers a word2vec implementation. h2o can also be used for several other machine learning tasks. In order to use the h2o library however, you need to take additional pre-processing steps with your data. You have a dataset called left_right which contains tweets that were auto-tweeted during the 2016 US election campaign.

Instead of preparing your data for other text analysis techniques, prepare this dataset for use with the h2o library.

left_right
# A tibble: 4,000 × 22
       X externa…¹ author content region langu…² publi…³ harve…⁴ follo…⁵ follo…⁶
   <int>     <dbl> <chr>  <chr>   <chr>  <chr>   <chr>   <chr>     <int>   <int>
 1     1   9.06e17 10_GOP "\"We … Unkno… English 10/1/2… 10/1/2…    1052    9636
 2     2   9.06e17 10_GOP "Marsh… Unkno… English 10/1/2… 10/1/2…    1054    9637
 3     3   9.06e17 10_GOP "Daugh… Unkno… English 10/1/2… 10/1/2…    1054    9637
 4     4   9.06e17 10_GOP "JUST … Unkno… English 10/1/2… 10/1/2…    1062    9642
 5     5   9.06e17 10_GOP "19,00… Unkno… English 10/1/2… 10/1/2…    1050    9645
 6     6   9.06e17 10_GOP "Dan B… Unkno… English 10/1/2… 10/1/2…    1050    9644
 7     7   9.06e17 10_GOP "🐝🐝…  Unkno… English 10/1/2… 10/1/2…    1050    9644
 8     8   9.06e17 10_GOP "'@Sen… Unkno… English 10/1/2… 10/1/2…    1050    9644
 9     9   9.06e17 10_GOP "As mu… Unkno… English 10/1/2… 10/1/2…    1050    9646
10    10   9.06e17 10_GOP "After… Unkno… English 10/1/2… 10/1/2…    1050    9646
# … with 3,990 more rows, 12 more variables: updates <int>, post_type <chr>,
#   account_type <chr>, retweet <int>, account_category <chr>,
#   new_june_2018 <int>, alt_external_id <dbl>, tweet_id <int>,
#   article_url <chr>, tco1_step1 <chr>, tco2_step1 <chr>, tco3_step1 <chr>,
#   and abbreviated variable names ¹external_author_id, ²language,
#   ³publish_date, ⁴harvested_date, ⁵following, ⁶followers
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

# Initialize an h2o session
library(h2o)
h2o.init()

# Create an h2o object for left_right
h2o_object = as.h2o(left_right)

# Tokenize the words from the column of text in left_right
tweet_words <- h2o.tokenize(h2o_object$content, "\\\\W+")

# Lowercase
tweet_words <- h2o.tolower(tweet_words)
# Remove stopwords from tweet_words
tweet_words <- tweet_words[is.na(tweet_words) || (!tweet_words %in% stop_words$word),]
tweet_words

          C1
1           
2    sitting
3   democrat
4    senator
5      trial
6 corruption

[43270 rows x 1 column]

Great job. The h2o library is easy to use and intuitive, making it a great candidate for machine learning tasks such as creating word2vec models.

word2vec

You have been web-scrapping a lot of job titles from the internet and are unsure if you need to scrap additional job titles for your analysis. So far, you have collected over 13,000 job titles in a dataset called job_titles. You have read that word2vec generally performs best if the model has enough data to properly train, and if words are not mentioned enough in your data, the model might not be useful.

In this exercise you will test how helpful additional data is by running your model 3 times; each run will use additional data.

job_titles
# A tibble: 13,845 × 2
   category  jobtitle                                            
   <chr>     <chr>                                               
 1 education After School Supervisor                             
 2 education *****TUTORS NEEDED - FOR ALL SUBJECTS, ALL AGES*****
 3 education Bay Area Family Recruiter                           
 4 education Adult Day Programs/Community Access/Job Coaches     
 5 education General Counselor - Non Tenure track                
 6 education Part-Time Summer Math Teachers/Tutors               
 7 education Preschool Teacher (temp-to-hire)                    
 8 education *****TUTORS NEEDED - FOR ALL SUBJECTS, ALL AGES*****
 9 education Private Teachers and Tutors Needed in the South Bay 
10 education Art Therapist at Esther B. Clark School             
# … with 13,835 more rows
# ℹ Use `print(n = ...)` to see more rows

use 33% of data:

library(h2o)
h2o.init()

set.seed(1111)
# Use 33% of the available data
sample_size <- floor(0.33 * nrow(job_titles))
sample_data <- sample(nrow(job_titles), size = sample_size)

h2o_object = as.h2o(job_titles[sample_data, ])
words <- h2o.tokenize(h2o_object$jobtitle, "\\\\W+")
words <- h2o.tolower(words)
words = words[is.na(words) || (!words %in% stop_words$word),]

word2vec_model <- h2o.word2vec(words, min_word_freq=5, epochs = 10)
# Find synonyms for the word "teacher"
h2o.findSynonyms(word2vec_model, "teacher", count=10)

      synonym     score
1    teaching 0.8506054
2   preschool 0.8186548
3    teachers 0.8076779
4   education 0.7821815
5     special 0.7817721
6   classroom 0.7800377
7  elementary 0.7718362
8     toddler 0.7705406
9      intern 0.7633918
10       aide 0.7567133

Now 66% of data

library(h2o)
h2o.init()

set.seed(1111)
# Use 66% of the available data
sample_size <- floor(0.66 * nrow(job_titles))
sample_data <- sample(nrow(job_titles), size = sample_size)

h2o_object = as.h2o(job_titles[sample_data, ])
words <- h2o.tokenize(h2o_object$jobtitle, "\\\\W+")
words <- h2o.tolower(words)
words = words[is.na(words) || (!words %in% stop_words$word),]

word2vec_model <- h2o.word2vec(words, min_word_freq=5, epochs = 10)
# Find synonyms for the word "teacher"
h2o.findSynonyms(word2vec_model, "teacher", count=10)

      synonym     score
1    teaching 0.8285403
2   preschool 0.8229606
3    teachers 0.7953759
4  elementary 0.7930176
5        aide 0.7833517
6      intern 0.7813422
7   childhood 0.7786321
8   education 0.7784756
9   classroom 0.7779680
10    toddler 0.7718694

now 100% of data:

library(h2o)
h2o.init()

set.seed(1111)
# Use all of the available data
sample_size <- floor(1 * nrow(job_titles))
sample_data <- sample(nrow(job_titles), size = sample_size)

h2o_object = as.h2o(job_titles[sample_data, ])
words <- h2o.tokenize(h2o_object$jobtitle, "\\\\W+")
words <- h2o.tolower(words)
words = words[is.na(words) || (!words %in% stop_words$word),]

word2vec_model <- h2o.word2vec(words, min_word_freq=5, epochs = 10)
# Find synonyms for the word "teacher"
h2o.findSynonyms(word2vec_model, "teacher", count=10)

        synonym     score
1     classroom 0.7697209
2          aide 0.7410879
3       floater 0.7261476
4        infant 0.7201972
5    elementary 0.7162822
6  kindergarten 0.6859407
7       toddler 0.6826063
8     preschool 0.6813027
9     christian 0.6713323
10     teachers 0.6661046

Well done. After adding additional data, the words most similar to teacher started to become more clear.