library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Fundamentals
Chapter 1 of Introduction to Natural Language Processing prepares you for running your first analysis on text. You will explore regular expressions and tokenization, two of the most common components of text analysis tasks. With regular expressions, you can search for any pattern you can think of, and with tokenization, you can prepare and clean text for more sophisticated analysis. This chapter is necessary for tackling the techniques we will learn in the remaining chapters of this course.
Regular Expressions (regex)
A sequence of characters or a pattern used to search text
words <-c("DW-40", "Mike's Oil", "5w30", "Joe's Gas", "Unleaded", "Plus-89")# Finding digitsgrep("\\d", words, value =TRUE)
[1] "DW-40" "5w30" "Plus-89"
# Finding apostrophes
grep("\\'", words, value = TRUE)
[1] "Mike's Oil" "Joe's Gas"
Regex examples
“wildcards” extend search beyond a single character
+ allows us to find a word or digit of any length
Negate an expression by capitalizing the letter (a short example follows this list):
\S finds any non-whitespace character
\D finds any non-digit character
\W finds any non-alphanumeric character
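To make the negation concrete, here is a small illustrative sketch (not from the course) using grepl() on a few made-up strings:

# Illustration of the negated character classes above
items <- c("Plus-89", "Mike's Oil", "   ")

grepl("\\D", items)  # TRUE where the string contains any non-digit character
grepl("\\S", items)  # TRUE where the string contains any non-whitespace character
grepl("\\W", items)  # TRUE where the string contains any non-alphanumeric character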
| Pattern | Text Matches | R Example | Text Example |
|---------|--------------|-----------|--------------|
| \w | Any alphanumeric character | gregexpr(pattern = '\\w', <text>) | a |
| \d | Any digit | gregexpr(pattern = '\\d', <text>) | 1 |
| \w+ | Any alphanumeric string of any length | gregexpr(pattern = '\\w+', <text>) | word |
| \d+ | Any digit sequence of any length | gregexpr(pattern = '\\d+', <text>) | 123 |
| \s | Any whitespace | gregexpr(pattern = '\\s', <text>) | " " |
| \S | Any non-whitespace | gregexpr(pattern = '\\S', <text>) | a |
(base) R functions
| Function | Description | Syntax |
|----------|-------------|--------|
| grep() | Search for a pattern in a vector | grep(pattern, x, value = FALSE) |
| gsub() | Replace a pattern in a vector | gsub(pattern, replacement, x) |
Examples
text <-c("John's favorite number is 1111","John lives at P Sherman, 42 Wallaby Way, Sydney","He is 7 feet tall","John has visited 30 countries","He can speak 3 languages","John can name 10 facts about himself")# Print off each item that contained a numeric numbergrep(pattern ="\\d", x = text, value =TRUE)
[1] "John's favorite number is 1111"
[2] "John lives at P Sherman, 42 Wallaby Way, Sydney"
[3] "He is 7 feet tall"
[4] "John has visited 30 countries"
[5] "He can speak 3 languages"
[6] "John can name 10 facts about himself"
# Find all items with a number followed by a space
grep(pattern = "\\d\\s", x = text)
[1] 2 3 4 5 6
# How many times did you write down 'favorite'?
length(grep(pattern = "favorite", x = text))
[1] 1
Exploring regular expression functions.
You have a vector of ten facts about your boss saved as a vector called text. In order to create a new ice-breaker for your team at work, you need to remove the name of your boss, John, from each fact that you have written down. This can easily be done using regular expressions (as well as other search/replace functions). Use regular expressions to replace “John” in the facts you have written about him.
# Print the text for every time you used your boss's name, John
grep("John", x = text, value = TRUE)
[1] "John's favorite number is 1111"
[2] "John lives at P Sherman, 42 Wallaby Way, Sydney"
[3] "John has visited 30 countries"
[4] "John can name 10 facts about himself"
# Try replacing all occurrences of "John" with "He"
gsub(pattern = 'John', replacement = 'He', x = text)
[1] "He's favorite number is 1111"
[2] "He lives at P Sherman, 42 Wallaby Way, Sydney"
[3] "He is 7 feet tall"
[4] "He has visited 30 countries"
[5] "He can speak 3 languages"
[6] "He can name 10 facts about himself"
# Replace all occurences of "John " with 'He '. clean_text <-gsub(pattern ='John\\s', replacement ='He ', x = text)clean_text
[1] "John's favorite number is 1111"
[2] "He lives at P Sherman, 42 Wallaby Way, Sydney"
[3] "He is 7 feet tall"
[4] "He has visited 30 countries"
[5] "He can speak 3 languages"
[6] "He can name 10 facts about himself"
# Replace all occurences of "John's" with 'His'gsub(pattern ="John\\'s", replacement ='His', x = clean_text)
[1] "His favorite number is 1111"
[2] "He lives at P Sherman, 42 Wallaby Way, Sydney"
[3] "He is 7 feet tall"
[4] "He has visited 30 countries"
[5] "He can speak 3 languages"
[6] "He can name 10 facts about himself"
Tokenization
Fundamental part of text preprocessing
Tokenization: process of breaking text into individual small pieces (tokens)
Tokens can be individual characters, words, sentences, or even the entire text
The unnest_tokens() function takes an input tibble, extracts tokens from the column named by input, lets you choose the token type with token, and names the output column with output
library(tidytext)
Animal farm data:
animal_farm

# A tibble: 10 × 2
   chapter    text_column
   <chr>      <chr>
 1 Chapter 1  "Mr. Jones, of the Manor Farm, had locked the hen-houses for the …
 2 Chapter 2  "Three nights later old Major died peacefully in his sleep. His b…
 3 Chapter 3  "How they toiled and sweated to get the hay in! But their efforts…
 4 Chapter 4  "By the late summer the news of what had happened on Animal Farm …
 5 Chapter 5  "As winter drew on, Mollie became more and more troublesome. She …
 6 Chapter 6  "All that year the animals worked like slaves. But they were happ…
 7 Chapter 7  "It was a bitter winter. The stormy weather was followed by sleet…
 8 Chapter 8  "A few days later, when the terror caused by the executions had d…
 9 Chapter 9  "Boxer's split hoof was a long time in healing. They had started …
10 Chapter 10 "Years passed. The seasons came and went, the short animal lives …
animal_farm %>%
  # tokenize
  unnest_tokens(output = "word", input = text_column, token = "words") %>%
  # count the top tokens
  count(word, sort = TRUE)
Find all mentions of a particular word and see what follows it:
animal_farm %>%
  filter(chapter == "Chapter 1") %>%
  # look for any mention of Boxer, capitalized or not
  unnest_tokens(output = "Boxer", input = text_column,
                token = "regex", pattern = "(?i)boxer") %>%
  # the first token starts at the beginning of the text, so use slice() to skip it
  slice(2:n())
Examples
# Split the text_column into sentences
animal_farm %>%
  unnest_tokens(output = "sentences", input = text_column, token = "sentences")
# A tibble: 1,523 × 2
   chapter   sentences
   <chr>     <chr>
 1 Chapter 1 mr.
 2 Chapter 1 jones, of the manor farm, had locked the hen-houses for the night,…
 3 Chapter 1 with the ring of light from his lantern dancing from side to side,…
 4 Chapter 1 jones was already snoring.as soon as the light in the bedroom went…
 5 Chapter 1 word had gone round during the day that old major, the prize middl…
 6 Chapter 1 it had been agreed that they should all meet in the big barn as so…
 7 Chapter 1 jones was safely out of the way.
 8 Chapter 1 old major (so he was always called, though the name under which he…
 9 Chapter 1 he was twelve years old and had lately grown rather stout, but he …
10 Chapter 1 before long the other animals began to arrive and make themselves …
# … with 1,513 more rows
# ℹ Use `print(n = ...)` to see more rows
# Split the text_column into sentences
animal_farm %>%
  unnest_tokens(output = "sentences", input = text_column, token = "sentences") %>%
  # Count sentences, using the chapter column
  count(chapter, sort = TRUE)
Great job. Notice how the two methods produce slightly different results. You’ll notice that a lot when processing text. It’s all about the technique used to do the analysis.
remove stop words (e.g. “to” “the”) with anti_join(stop_words)
anti_join(stop_words) removes the words in the stop_words tibble from a tokenized text column
the word column contains the word to remove
the lexicon column records which source lexicon the word came from
stop_words
# A tibble: 1,149 × 2
word lexicon
<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART
7 accordingly SMART
8 across SMART
9 actually SMART
10 after SMART
# ℹ 1,139 more rows
russian_tweets %>%
  unnest_tokens(word, content) %>%
  count(word, sort = TRUE)
# The top words are mostly junk: t.co, https, etc.

# Remove stop words
tidy_tweets <- russian_tweets %>%
  unnest_tokens(word, content) %>%
  anti_join(stop_words)

tidy_tweets %>%
  count(word, sort = TRUE)
# Still t.co, https, http, etc., but now also blacklivesmatter and trump
add custom stop words
custom <- add_row(stop_words, word = "https", lexicon = "custom")
custom <- add_row(custom, word = "http", lexicon = "custom")
custom <- add_row(custom, word = "t.co", lexicon = "custom")

russian_tweets %>%
  unnest_tokens(word, content) %>%
  anti_join(custom) %>%
  count(word, sort = TRUE)
Stemming: transforming words into their roots, e.g. "enlisted" and "enlisting" -> "enlist"; use wordStem() from the SnowballC package (see the sketch below)
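As a quick sketch of the idea, using the example words from the bullet above:

library(SnowballC)

# wordStem() reduces related words to a common root ("enlist" in this case)
wordStem(c("enlisted", "enlisting", "enlist"))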
Stop words are unavoidable in writing. However, when determining how similar two pieces of text are to each other, or when trying to find themes within text, stop words can make things difficult. In the book Animal Farm, the first chapter contains only 2,636 words, while almost 200 of them are the word "the".
# Tokenize animal_farm's text_column column
tidy_animal_farm <- animal_farm %>%
  unnest_tokens(word, text_column)

# Print the word frequencies - most frequent first!
tidy_animal_farm %>%
  count(word, sort = TRUE)
# A tibble: 4,076 × 2
   word      n
   <chr> <int>
 1 the    2187
 2 and     966
 3 of      899
 4 to      814
 5 was     633
 6 a       620
 7 in      537
 8 had     529
 9 that    451
10 it      384
# … with 4,066 more rows
# ℹ Use `print(n = ...)` to see more rows
# Remove stop words, using stop_words from tidytext
tidy_animal_farm %>%
  anti_join(stop_words)
# A tibble: 10,579 × 2
   chapter   word
   <chr>     <chr>
 1 Chapter 1 jones
 2 Chapter 1 manor
 3 Chapter 1 farm
 4 Chapter 1 locked
 5 Chapter 1 hen
 6 Chapter 1 houses
 7 Chapter 1 night
 8 Chapter 1 drunk
 9 Chapter 1 remember
10 Chapter 1 shut
# … with 10,569 more rows
# ℹ Use `print(n = ...)` to see more rows
Excellent. You should always consider removing stop words before performing text analysis. They muddy your results and can increase computation time for large analysis tasks.
The roots of words are often more important than their endings, especially when it comes to text analysis. The book Animal Farm is obviously about animals. However, knowing that the book mentions "animals" 248 times and "animal" 107 times might not be helpful for your analysis.
tidy_animal_farm contains a tibble of the words from Animal Farm, tokenized and without stop words. The next step is to stem the words and explore the results.
# Perform stemming on tidy_animal_farm
stemmed_animal_farm <- tidy_animal_farm %>%
  mutate(word = wordStem(word))

# Print the old word frequencies
tidy_animal_farm %>%
  count(word, sort = TRUE)
# A tibble: 3,611 × 2
   word         n
   <chr>    <int>
 1 animals    248
 2 farm       163
 3 napoleon   141
 4 animal     107
 5 snowball   106
 6 pigs        91
 7 boxer       76
 8 time        71
 9 windmill    68
10 squealer    61
# … with 3,601 more rows
# ℹ Use `print(n = ...)` to see more rows
# Print the new word frequencies
stemmed_animal_farm %>%
  count(word, sort = TRUE)
# A tibble: 2,751 × 2
   word         n
   <chr>    <int>
 1 anim       363
 2 farm       173
 3 napoleon   141
 4 pig        114
 5 snowbal    106
 6 comrad      94
 7 dai         86
 8 time        83
 9 boxer       76
10 windmil     70
# … with 2,741 more rows
# ℹ Use `print(n = ...)` to see more rows
Nice job. There is a clear difference in word frequencies after we performed stemming. Comrade is used throughout Animal Farm but until you stemmed the words, it didn’t show up in the top 10! In Chapter 2 you will expand this analysis and start building your first text analysis models.
Representations of Text
In this chapter, you will learn the most common and well-studied ways to analyze text. You will look at creating a text corpus, expanding a bag-of-words representation into a TFIDF matrix, and using cosine-similarity metrics to determine how similar two pieces of text are to each other. You will build on your NLP foundations before diving into applications of NLP in Chapters 3 and 4.
Corpus (collections of text)
collections of documents containing natural language text
provided by the tm package as a corpus object
VCorpus (volatile corpus) is the most common representation; it holds both the text and metadata about the collection of text
example dataset acq (50 articles from Reuters)
library(tm)
Loading required package: NLP
Attaching package: 'NLP'
The following object is masked from 'package:ggplot2':
annotate
data("acq")# metadata of first articleacq[[1]]$meta
author : character(0)
datetimestamp: 1987-02-26 15:18:06
description :
heading : COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE
id : 10
language : en
origin : Reuters-21578 XML
topics : YES
lewissplit : TRAIN
cgisplit : TRAINING-SET
oldid : 5553
places : usa
people : character(0)
orgs : character(0)
exchanges : character(0)
# The meta item of the 1st article, then the character value for the place
# where the article originated (usa) - note the nested object!
acq[[1]]$meta$places
[1] "usa"
# Content of the first item
acq[[1]]$content
[1] "Computer Terminal Systems Inc said\nit has completed the sale of 200,000 shares of its common\nstock, and warrants to acquire an additional one mln shares, to\n<Sedio N.V.> of Lugano, Switzerland for 50,000 dlrs.\n The company said the warrants are exercisable for five\nyears at a purchase price of .125 dlrs per share.\n Computer Terminal said Sedio also has the right to buy\nadditional shares and increase its total holdings up to 40 pct\nof the Computer Terminal's outstanding common stock under\ncertain circumstances involving change of control at the\ncompany.\n The company said if the conditions occur the warrants would\nbe exercisable at a price equal to 75 pct of its common stock's\nmarket price at the time, not to exceed 1.50 dlrs per share.\n Computer Terminal also said it sold the technolgy rights to\nits Dot Matrix impact technology, including any future\nimprovements, to <Woodco Inc> of Houston, Tex. for 200,000\ndlrs. But, it said it would continue to be the exclusive\nworldwide licensee of the technology for Woodco.\n The company said the moves were part of its reorganization\nplan and would help pay current operation costs and ensure\nproduct delivery.\n Computer Terminal makes computer generated labels, forms,\ntags and ticket printers and terminals.\n Reuter"
# And the second
acq[[2]]$content
[1] "Ohio Mattress Co said its first\nquarter, ending February 28, profits may be below the 2.4 mln\ndlrs, or 15 cts a share, earned in the first quarter of fiscal\n1986.\n The company said any decline would be due to expenses\nrelated to the acquisitions in the middle of the current\nquarter of seven licensees of Sealy Inc, as well as 82 pct of\nthe outstanding capital stock of Sealy.\n Because of these acquisitions, it said, first quarter sales\nwill be substantially higher than last year's 67.1 mln dlrs.\n Noting that it typically reports first quarter results in\nlate march, said the report is likely to be issued in early\nApril this year.\n It said the delay is due to administrative considerations,\nincluding conducting appraisals, in connection with the\nacquisitions.\n Reuter"
To get the data into a tidy table format, where each observation is represented by a row and each variable is a column, use the tidy() function on the corpus:
tidy_data <- tidy(acq)
tidy_data
# A tibble: 50 × 16
author datetimestamp description heading id language origin topics
<chr> <dttm> <chr> <chr> <chr> <chr> <chr> <chr>
1 <NA> 1987-02-26 15:18:06 "" COMPUT… 10 en Reute… YES
2 <NA> 1987-02-26 15:19:15 "" OHIO M… 12 en Reute… YES
3 <NA> 1987-02-26 15:49:56 "" MCLEAN… 44 en Reute… YES
4 By Cal … 1987-02-26 15:51:17 "" CHEMLA… 45 en Reute… YES
5 <NA> 1987-02-26 16:08:33 "" <COFAB… 68 en Reute… YES
6 <NA> 1987-02-26 16:32:37 "" INVEST… 96 en Reute… YES
7 By Patt… 1987-02-26 16:43:13 "" AMERIC… 110 en Reute… YES
8 <NA> 1987-02-26 16:59:25 "" HONG K… 125 en Reute… YES
9 <NA> 1987-02-26 17:01:28 "" LIEBER… 128 en Reute… YES
10 <NA> 1987-02-26 17:08:27 "" GULF A… 134 en Reute… YES
# ℹ 40 more rows
# ℹ 8 more variables: lewissplit <chr>, cgisplit <chr>, oldid <chr>,
# places <named list>, people <lgl>, orgs <lgl>, exchanges <lgl>, text <chr>
To reverse the process and get from a tibble back to corpus format, use the VCorpus() function:
# This only captures the text
corpus <- VCorpus(VectorSource(tidy_data$text))

# Add columns to the metadata data frame attached to the corpus:
meta(corpus, "Author") <- tidy_data$author
meta(corpus, "oldid") <- tidy_data$oldid
head(meta(corpus))
Examples
Explore an R corpus
One of your coworkers has prepared a corpus of 20 documents discussing crude oil, named crude. This is only a sample of several thousand articles you will receive next week. In order to get ready for running text analysis on these documents, you have decided to explore their content and metadata. Remember that in R, a VCorpus contains both meta and content regarding each text. In this lesson, you will explore these two objects.
# Print out the corpus
print(crude)
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20
# Print the content of the 10th article
crude[[10]]$content
[1] "Saudi Arabian Oil Minister Hisham Nazer\nreiterated the kingdom's commitment to last December's OPEC\naccord to boost world oil prices and stabilise the market, the\nofficial Saudi Press Agency SPA said.\n Asked by the agency about the recent fall in free market\noil prices, Nazer said Saudi Arabia \"is fully adhering by the\n... Accord and it will never sell its oil at prices below the\npronounced prices under any circumstance.\"\n Nazer, quoted by SPA, said recent pressure on free market\nprices \"may be because of the end of the (northern hemisphere)\nwinter season and the glut in the market.\"\n Saudi Arabia was a main architect of the December accord,\nunder which OPEC agreed to lower its total output ceiling by\n7.25 pct to 15.8 mln barrels per day (bpd) and return to fixed\nprices of around 18 dlrs a barrel.\n The agreement followed a year of turmoil on oil markets,\nwhich saw prices slump briefly to under 10 dlrs a barrel in\nmid-1986 from about 30 dlrs in late 1985. Free market prices\nare currently just over 16 dlrs.\n Nazer was quoted by the SPA as saying Saudi Arabia's\nadherence to the accord was shown clearly in the oil market.\n He said contacts among members of OPEC showed they all\nwanted to stick to the accord.\n In Jamaica, OPEC President Rilwanu Lukman, who is also\nNigerian Oil Minister, said the group planned to stick with the\npricing agreement.\n \"We are aware of the negative forces trying to manipulate\nthe operations of the market, but we are satisfied that the\nfundamentals exist for stable market conditions,\" he said.\n Kuwait's Oil Minister, Sheikh Ali al-Khalifa al-Sabah, said\nin remarks published in the emirate's daily Al-Qabas there were\nno plans for an emergency OPEC meeting to review prices.\n Traders and analysts in international oil markets estimate\nOPEC is producing up to one mln bpd above the 15.8 mln ceiling.\n They named Kuwait and the United Arab Emirates, along with\nthe much smaller producer Ecuador, among those producing above\nquota. Sheikh Ali denied that Kuwait was over-producing.\n REUTER"
# Find the first ID
crude[[1]]$meta$id
[1] "127"
# Make a vector of IDs
ids <- c()
for (i in c(1:20)) {
  ids <- append(ids, crude[[i]]$meta$id)
}
Well done. You now understand the basics of an R corpus. However, creating the ID vector was a bit of work. Let’s use the tidy() function to help make this process easier.
Creating a tibble from a corpus
To further explore the corpus on crude oil data that you received from a coworker, you have decided to create a pipeline to clean the text contained in the documents. Instead of exploring how to do this with the tm package, you have decided to transform the corpus into a tibble so you can use the functions unnest_tokens(), count(), and anti_join() that you are already familiar with. The corpus crude contains both the metadata and the text of each document.
# Create a tibble & review
crude_tibble <- tidy(crude)
names(crude_tibble)

crude_counts <- crude_tibble %>%
  # Tokenize by word
  unnest_tokens(word, text) %>%
  # Count by word
  count(word, sort = TRUE) %>%
  # Remove stop words
  anti_join(stop_words)

crude_counts
# A tibble: 900 × 2
   word       n
   <chr>  <int>
 1 oil       86
 2 prices    48
 3 opec      44
 4 mln       31
 5 bpd       23
 6 dlrs      23
 7 crude     21
 8 market    20
 9 reuter    20
10 saudi     18
# … with 890 more rows
# ℹ Use `print(n = ...)` to see more rows
Creating a corpus
You have created a tibble called russian_tweets that contains around 20,000 tweets auto-generated by bots during the 2016 U.S. election cycle so that you can perform text analysis. However, when searching through the available options for performing your chosen analysis, you believe that the tm package offers the easiest path forward. In order to conduct the analysis, you first must create a corpus and attach potentially useful metadata.
Be aware that this is real data from Twitter and as such there is always a risk that it may contain profanity or other offensive content (in this exercise, and any following exercises that also use real Twitter data).
# Create a corpus
tweet_corpus <- VCorpus(VectorSource(russian_tweets$content))

# Attach following and followers as metadata
meta(tweet_corpus, 'following') <- russian_tweets$following
meta(tweet_corpus, 'followers') <- russian_tweets$followers

# Review the metadata
head(meta(tweet_corpus))
bag-of-words representation uses vectors to specify which words are in each text
consider the following three texts
find unique words and then convert this into vector representations
“few” only in text1
“all” only in text2
“most” only in text3
"words", "are", and "important" appear in all three texts
text1 <-c("Few words are important.")text2 <-c("All words are important.")text3 <-c("Most words are important.")
First, create a clean vector of the unique words used across all of the texts
# Lowercase, without stop words
# Optional but good ideas: removing punctuation and stemming words
word_vector <- c("few", "all", "most", "words", "important")

# Convert each text into a binary representation of which words are in that text
# Representation for text1
text1 <- c("Few words are important.")
text1_vector <- c(1, 0, 0, 1, 1)

# Representation for text2
text2 <- c("All words are important.")
text2_vector <- c(0, 1, 0, 1, 1)

# Representation for text3
text3 <- c("Most words are important.")
text3_vector <- c(0, 0, 1, 1, 1)
could have used word counts instead of binary 1s and 0s
tidytext’s representation is different
a tibble of word counts by document (e.g. by chapter), sorted from most to least common (see the sketch below)
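A minimal sketch of that representation, reusing the animal_farm tibble from earlier: one row per (chapter, word) pair with its count, rather than a wide document-term matrix.

# tidytext's long-format representation of word counts per chapter
animal_farm %>%
  unnest_tokens(word, text_column) %>%
  count(chapter, word, sort = TRUE)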
Sparse matrix
consider the russian tweet dataset
20,000 tweets (rows)
43,000 unique (non-stop-word) words (columns)
Need 860 million elements in matrix, but only 177,000 non-0 entries (0.02%)
tidytext and tm packages can handle this sparse matrix problem in an efficient manner
Examples
BoW Example
In literature reviews, researchers read and summarize as many available texts about a subject as possible. Sometimes they end up reading duplicate articles, or summaries of articles they have already read. You have been given 20 articles about crude oil as an R object named crude_tibble. Instead of jumping straight to reading each article, you have decided to see what words are shared across these articles. To do so, you will start by building a bag-of-words representation of the text.
# Count occurrence by article_id and word
words <- crude_tibble %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  anti_join(stop_words) %>%
  count(article_id, word, sort = TRUE)
words
# A tibble: 1,498 × 3
   article_id word        n
        <int> <chr>   <int>
 1          2 opec       13
 2          2 oil        12
 3          6 kuwait     10
 4         10 oil         9
 5         10 prices      9
 6         11 mln         9
 7         19 futures     9
 8          6 opec        8
 9          7 report      8
10         10 market      8
# … with 1,488 more rows
# ℹ Use `print(n = ...)` to see more rows
# Count occurrence by article_id and word
words <- crude_tibble %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  anti_join(stop_words) %>%
  count(word, article_id, sort = TRUE)

# How many different word/article combinations are there?
unique_combinations <- nrow(words)

# Filter to responses with the word "prices"
words_with_prices <- words %>%
  filter(word == "prices")

# How many articles had the word "prices"?
number_of_price_articles <- nrow(words_with_prices)
number_of_price_articles
[1] 15
Excellent job. BOW representations are one of the quickest ways to start analyzing text. Several more advanced techniques also start by simply looking at which words are used in each piece of text.
Sparse matrices
During the video lesson you learned about sparse matrices. Sparse matrices can become computational nightmares as the number of text documents and the number of unique words grow. Creating word representations with tweets can easily create sparse matrices because emojis, slang, acronyms, and other forms of language are used.
In this exercise you will walk through the steps to calculate how sparse the Russian tweet dataset is. Note that this is a small example of how quickly text analysis can become a major computational problem.
# Tokenize and remove stop words
tidy_tweets <- russian_tweets %>%
  unnest_tokens(word, content) %>%
  anti_join(stop_words)

# Count by word
unique_words <- tidy_tweets %>%
  count(word, sort = TRUE)
unique_words

# A tibble: 43,666 × 2
   word                 n
   <chr>            <int>
 1 t.co             18121
 2 https            16003
 3 http              2135
 4 blacklivesmatter  1292
 5 trump             1004
 6 black              781
 7 enlist             764
 8 police             745
 9 people             723
10 cops               693
# … with 43,656 more rows
# ℹ Use `print(n = ...)` to see more rows
# Count by tweet (tweet_id) and word
unique_words_by_tweet <- tidy_tweets %>%
  count(tweet_id, word)
unique_words_by_tweet

# A tibble: 177,140 × 3
   tweet_id word           n
      <int> <chr>      <int>
 1        1 barely         1
 2        1 corruption     1
 3        1 democrat       1
 4        1 gh6g0d1oic     1
 5        1 heard          1
 6        1 https          1
 7        1 mainstream     1
 8        1 media          1
 9        1 nedryun        1
10        1 peep           1
# … with 177,130 more rows
# ℹ Use `print(n = ...)` to see more rows
# Find the size of the matrix
size <- nrow(russian_tweets) * nrow(unique_words)
size
[1] 873320000
# Find the percent of entries that would have a value
percent <- nrow(unique_words_by_tweet) / size
percent
[1] 0.0002028352
Well done! This percent is tiny - indicating that we are dealing with a very sparse matrix. Imagine if we looked at a million tweets instead of just 20,000.
Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF is a way to represent text data that is more informative than BoW
represents word counts by considering two components
Term frequency (TF): proportion of words in a text that are that term
Inverse document frequency (IDF): how unique a word is across all documents
IDF equation:

$$\mathrm{IDF}(t) = \ln\left(\frac{N}{n_t}\right)$$

$N$: total number of documents in the corpus

$n_t$: number of documents where term $t$ appears

TF-IDF:

$$\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$
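As a worked check, using numbers that appear later in this section: "prices" occurs in 15 of the 20 crude-oil articles, so its IDF works out to about 0.288, matching the idf column of the crude_weights output below.

# IDF for "prices", which appears in 15 of the 20 crude articles
log(20 / 15)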
In tidytext, these weights are computed with bind_tf_idf():
t1 <-c("My name is John. My best friend is Joe. We like tacos.")t2 <-c("Two common best friend names are John and Joe.")t3 <-c("Tacos are my favorite food. I eat them with my friend Joe.")df <-data.frame('text'=c(t1, t2, t3),'ID'=c(1,2,3),stringsAsFactors =FALSE)df %>%unnest_tokens(output ="word",token ="words",input = text) %>%anti_join(stop_words) %>%count(ID, word, sort =TRUE) %>%bind_tf_idf(word, # column with terms ID, # column with document ids n) # word count produced by count()
Earlier you looked at a bag-of-words representation of articles on crude oil. Calculating TFIDF values relies on this bag-of-words representation, but takes into account how often a word appears in an article, and how often that word appears in the collection of articles.
To determine how meaningful words would be when comparing different articles, calculate the TFIDF weights for the words in crude, a collection of 20 articles about crude oil.
# Create a tibble with TFIDF values
crude_weights <- crude_tibble %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  anti_join(stop_words) %>%
  count(article_id, word) %>%
  bind_tf_idf(word, article_id, n)

# Find the highest TFIDF values
crude_weights %>%
  arrange(desc(tf_idf))

# Find the lowest non-zero TFIDF values
crude_weights %>%
  filter(tf_idf != 0) %>%
  arrange(tf_idf)
# Find the highest TFIDF values
crude_weights %>%
  arrange(desc(tf_idf))

# A tibble: 1,498 × 6
   article_id word        n     tf   idf tf_idf
        <int> <chr>   <int>  <dbl> <dbl>  <dbl>
 1         20 january     4 0.0930  2.30  0.214
 2         15 power       4 0.0690  3.00  0.207
 3         19 futures     9 0.0643  3.00  0.193
 4          8 8           6 0.0619  3.00  0.185
 5          3 canada      2 0.0526  3.00  0.158
 6          3 canadian    2 0.0526  3.00  0.158
 7         15 ship        3 0.0517  3.00  0.155
 8         19 nymex       7 0.05    3.00  0.150
 9         20 cubic       2 0.0465  3.00  0.139
10         20 fiscales    2 0.0465  3.00  0.139
# … with 1,488 more rows
# ℹ Use `print(n = ...)` to see more rows

# Find the lowest non-zero TFIDF values
crude_weights %>%
  filter(tf_idf != 0) %>%
  arrange(tf_idf)

# A tibble: 1,458 × 6
   article_id word          n      tf   idf  tf_idf
        <int> <chr>     <int>   <dbl> <dbl>   <dbl>
 1          7 prices        1 0.00452 0.288 0.00130
 2          9 prices        1 0.00513 0.288 0.00148
 3          7 dlrs          1 0.00452 0.598 0.00271
 4          7 opec          1 0.00452 0.693 0.00314
 5          9 opec          1 0.00513 0.693 0.00355
 6          7 mln           1 0.00452 0.799 0.00361
 7          7 petroleum     1 0.00452 0.799 0.00361
 8         11 petroleum     1 0.00455 0.799 0.00363
 9          6 barrels       1 0.00429 0.916 0.00393
10          6 industry      1 0.00429 0.916 0.00393
# … with 1,448 more rows
# ℹ Use `print(n = ...)` to see more rows
Excellent. We see that ‘prices’ and ‘petroleum’ have very low values for some articles. This could be because they were mentioned just a few times in that article, or because they were used in too many articles.
Cosine Similarity
Assess how similar two documents are using cosine similarity
a measure of similarity between two vectors (measured by the angle formed between them)
can be found by taking the dot product of the two vectors and dividing it by the product of their magnitudes:

$$\mathrm{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

$A$ and $B$: vectors of word counts for each document
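To connect the formula to the earlier bag-of-words vectors, here is a small sketch computing the similarity of text1 and text2 by hand (the result is 2/3):

# Cosine similarity of text1 and text2 from the bag-of-words example:
# dot product divided by the product of the magnitudes
text1_vector <- c(1, 0, 0, 1, 1)
text2_vector <- c(0, 1, 0, 1, 1)

sum(text1_vector * text2_vector) /
  (sqrt(sum(text1_vector^2)) * sqrt(sum(text2_vector^2)))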
Can use pairwise_similarity() from widyr package:
pairwise_similarity(tbl,     # tibble or table
                    item,    # item to compare (articles, tweets, etc.)
                    feature, # column with the link between items, e.g. words
                    value)   # name of the column with comparison values, e.g. n or tf_idf
For example:

crude_weights %>%
  pairwise_similarity(article_id, word, tf_idf) %>%
  arrange(desc(similarity))
Use cases for cosine similarity: finding duplicate or similar pieces of text, and serving as input for clustering and classification analysis
Examples
An example of failing at text analysis
Early on, you discussed the power of removing stop words before conducting text analysis. In this most recent chapter, you reviewed using cosine similarity to identify texts that are similar to each other.
In this exercise, you will explore the very real possibility of failing to use text analysis properly. You will compute cosine similarities for the chapters in the book Animal Farm, without removing stop-words.
# Create word counts
animal_farm_counts <- animal_farm %>%
  unnest_tokens(word, text_column) %>%
  count(chapter, word)

# Calculate the cosine similarity by chapter, using words
comparisons <- animal_farm_counts %>%
  pairwise_similarity(chapter, word, n) %>%
  arrange(desc(similarity))
# Print the mean of the similarity values
comparisons %>%
  summarize(mean = mean(similarity))
# A tibble: 1 × 1
   mean
  <dbl>
1 0.949
Well done. Unfortunately, these results are useless, as every single chapter is highly similar to every other chapter. We need to remove stop words to see which chapters are more similar to each other.
Cosine similarity example
The plot of Animal Farm is pretty simple. In the beginning the animals are unhappy with following their human leaders. In the middle they overthrow those leaders, and in the end they become unhappy with the animals that eventually became their new leaders.
If done correctly, cosine similarity can help identify documents (chapters) that are similar to each other. In this exercise, you will identify similar chapters in Animal Farm. Odds are, chapter 1 (the beginning) and chapter 10 (the end) will be similar.
# Create word counts
animal_farm_counts <- animal_farm %>%
  unnest_tokens(word, text_column) %>%
  anti_join(stop_words) %>%
  count(chapter, word) %>%
  bind_tf_idf(word, chapter, n)

# Calculate cosine similarity on word counts
animal_farm_counts %>%
  pairwise_similarity(chapter, word, n) %>%
  arrange(desc(similarity))
Excellent job. Cosine similarity scores can be calculated on word counts or TFIDF values. We see drastically different results for both. Animal Farm has a very low reading level, and most chapters share the same vocabulary. This was evident in the previous exercise. You’ll need to consider the context of the text you are analyzing when deciding on an approach.
Applications: Classification and Topic Modelling
Preparing text for modeling
For classification tasks:
clean/prepare data
split into training & testing datasets
train model on training dataset
evaluate model on testing dataset
Use classification modeling on the Animal Farm dataset to determine which sentences are discussing Napoleon or Boxer
# Make sentences
sentences <- animal_farm %>%
  unnest_tokens(output = "sentence", token = "sentences", input = text_column)

# Label sentences by animal (so the algorithm doesn't use the names during training)
sentences$boxer <- grepl('boxer', sentences$sentence)
sentences$napoleon <- grepl('napoleon', sentences$sentence)

# Replace the animal names
sentences$sentence <- gsub("boxer", "animal X", sentences$sentence)
sentences$sentence <- gsub("napoleon", "animal X", sentences$sentence)

# Filter to sentences that contain Boxer or Napoleon, but not both
animal_sentences <- sentences[sentences$boxer + sentences$napoleon == 1, ]

# Add the label to the dataset
animal_sentences$Name <- as.factor(ifelse(animal_sentences$boxer, "boxer", "napoleon"))

# Select 75 sentences for each animal
animal_sentences <- rbind(animal_sentences[animal_sentences$Name == "boxer", ][c(1:75), ],
                          animal_sentences[animal_sentences$Name == "napoleon", ][c(1:75), ])
animal_sentences$sentence_id <- c(1:dim(animal_sentences)[1])

# Next, predict which sentences originally included each animal
library(tm)
library(tidytext)
library(dplyr)
library(SnowballC)

# Create tokens
animal_tokens <- animal_sentences %>%
  unnest_tokens(output = "word", token = "words", input = sentence) %>%
  anti_join(stop_words) %>%
  mutate(word = wordStem(word))

# For classification, create a document-term matrix with TFIDF weights using cast_dtm() from tidytext
animal_matrix <- animal_tokens %>%
  # Count words by sentence
  count(sentence_id, word) %>%
  # Cast to a DTM (one row per document - here, a sentence - and one column per word)
  cast_dtm(document = sentence_id,
           term = word,
           value = n,
           weighting = tm::weightTfIdf)

animal_matrix
Non-/sparse entries: 1235/102865
Sparsity           : 99%
Maximal term length: 17
Weighting          : term frequency - inverse document frequency
Using large, sparse matrices can be computationally expensive. In this case, we have 150 sentences and 694 unique words. The matrix is 99% sparse, meaning that 99% of the cells are empty. This is a common issue when working with text data.
Remove sparse terms with removeSparseTerms()
How sparse is too sparse?
If we set maximum sparsity to 90%:
removeSparseTerms(animal_matrix, sparse = 0.90)
Non-/sparse entries: 207/393
Sparsity           : 66%
This would remove all but four words! You couldn't classify sentences using only 4 words.
If we set maximum sparsity to 99%:
removeSparseTerms(animal_matrix, sparse = 0.99)
Non-/sparse entries: 713/25087
Sparsity           : 97%
Here we'd have 172 terms (remember we started with 694).
Deciding on matrix sparsity depends on how many terms are in the matrix and how fast your computer is
Examples
Classification modeling example
You have previously prepared a set of Russian tweets for classification. Of the 20,000 tweets, you have filtered to tweets with an account_type of Left or Right, and selected the first 2000 tweets of each. You have already tokenized the tweets into words, removed stop words, and performed stemming. Furthermore, you converted word counts into a document-term matrix with TFIDF values for weights and saved this matrix as: left_right_matrix_small.
You will use this matrix to predict whether a tweet was generated from a left-leaning tweet bot, or a right-leaning tweet bot. The labels can be found in the vector, left_right_labels.
library(randomForest)

# Create train/test split
set.seed(1111)
sample_size <- floor(0.75 * nrow(left_right_matrix_small))
train_ind <- sample(nrow(left_right_matrix_small), size = sample_size)
train <- left_right_matrix_small[train_ind, ]
test <- left_right_matrix_small[-train_ind, ]

# Create a random forest classifier
# (note: randomForest's argument is ntree; the misspelled nTree is ignored,
# which is why the output below reports the default of 500 trees)
rfc <- randomForest(x = as.data.frame(as.matrix(train)),
                    y = left_right_labels[train_ind],
                    nTree = 50)

# Print the results
rfc
Call:
 randomForest(x = as.data.frame(as.matrix(train)), y = left_right_labels[train_ind], nTree = 50)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 22.43%
Confusion matrix:
      Left Right class.error
Left   976   540  0.35620053
Right  133  1351  0.08962264
Excellent! Classification modeling with text follows the same principles as classification models built on continuous data. You can also use all kinds of fun machine learning algorithms and are not stuck using random forest models.
Confusion matrices
You have just finished creating a classification model. This model predicts whether tweets were created by a left-leaning (democrat) or right-leaning (republican) tweet bot. You have made predictions on the test data and have the following result:
|              | Predicted Left | Predicted Right |
|--------------|---------------:|----------------:|
| Actual Left  | 350            | 157             |
| Actual Right | 57             | 436             |
Use the confusion matrix above to answer questions about the model's accuracy.
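For reference, one way to read accuracy off this matrix, assuming rows are actual labels and columns are predictions:

# Accuracy = correct predictions / all predictions
(350 + 436) / (350 + 157 + 57 + 436)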
Excellent. Although accuracy is only one of many metrics to determine if an algorithm is doing a good job, it is usually a good indicator of model performance!
left_right_tfidf

# A tibble: 38,821 × 6
       X word        n    tf   idf tf_idf
   <int> <chr>   <int> <dbl> <dbl>  <dbl>
 1  6028 ʷʰʸ        11 0.917  8.29  7.60
 2    16 obama       3 0.231  3.99  0.921
 3    24 scout       3 0.333  7.20  2.40
 4    96 peopl       3 0.333  3.53  1.18
 5   141 hillari     3 0.15   4.53  0.680
 6   141 trump       3 0.15   2.16  0.323
 7  5732 door        3 0.214  6.21  1.33
 8  5735 albino      3 0.214  8.29  1.78
 9  5798 cop         3 0.214  2.27  0.487
10  6012 cop         3 0.176  2.27  0.401
# … with 38,811 more rows
# ℹ Use `print(n = ...)` to see more rows
Topic Modeling
Collection of texts is likely to be made up of a collection of topics (e.g. articles about sports, with topics like player gossip, scores, scouting, draft picks)
Algorithms can identify topics within a collection of text; one of the most common is Latent Dirichlet allocation (LDA):
Each document is a mixture of topics
Topics are mixtures of words
e.g. a sports story on a player being traded:
70% on team news
words: trade, pitcher, move, new
30% player gossip
words: angry, change, money
To perform LDA, you need a document-term matrix with term-frequency weights
animal_farm_tokens <- animal_farm %>%
  unnest_tokens(output = "word", token = "words", input = text_column) %>%
  anti_join(stop_words) %>%
  mutate(word = wordStem(word))

# Cast to a DTM
animal_farm_matrix <- animal_farm_tokens %>%
  count(chapter, word) %>%
  cast_dtm(document = chapter,
           term = word,
           value = n,
           weighting = tm::weightTf) # LDA requires term-frequency weighting

# Perform LDA
library(topicmodels)
animal_farm_lda <- LDA(animal_farm_matrix,
                       k = 4,                      # number of topics
                       method = "Gibbs",           # sampling method
                       control = list(seed = 111)) # seed
animal_farm_lda
# A LDA_Gibbs topic model with 4 topics.

# Extract a tibble of results
animal_farm_betas <- tidy(animal_farm_lda, matrix = "beta")
animal_farm_betas
# A tibble: 11,004 x 3
   topic term        beta
   <int> <chr>      <dbl>
...
 5     1 abolish 0.000360
 6     2 abolish 0.00129
 7     3 abolish 0.000355
 8     4 abolish 0.000381
...
beta is the per-topic word distribution: how related a word is to each topic, i.e. the probability of a word given a topic. The betas within each topic sum to 1, so summing over all topics gives the number of topics:
sum(animal_farm_betas$beta)
[1] 4
Top words per topic:
# Look at topic 1
animal_farm_betas %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  arrange(topic, -beta) %>%
  filter(topic == 1)
  topic term       beta
  <int> <chr>     <dbl>
1     1 napoleon 0.0339
2     1 anim     0.0317
3     1 windmill 0.0144
4     1 squealer 0.0119
# Look at topic 2
animal_farm_betas %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  arrange(topic, -beta) %>%
  filter(topic == 2)
  topic term       beta
  <int> <chr>     <dbl>
...
3     2 anim     0.0189
...
6     2 napoleon 0.0148
We see similar words in topic 2, which indicates we might need to remove some of the non-entity words, such as "animal", and re-run the analysis.
Labelling topics: now that we know which words correspond to each topic, use the words of each chapter to assign topics to chapters. To extract topic assignments, use tidy() again, but specify matrix = "gamma" (the document-topic distribution: how much of a chapter is made up of a single topic). A minimal sketch follows.
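A minimal sketch of that step, using the animal_farm_lda model fit above:

# Extract the gamma (document-topic) matrix and find each chapter's top topic
animal_farm_gammas <- tidy(animal_farm_lda, matrix = "gamma")

animal_farm_gammas %>%
  group_by(document) %>%
  slice_max(gamma, n = 1)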
You are interested in the common themes surrounding the character Napoleon in your favorite new book, Animal Farm. Napoleon is a Pig who convinces his fellow comrades to overthrow their human leaders. He also eventually becomes the new leader of Animal Farm.
You have extracted all of the sentences that mention Napoleon’s name, pig_sentences, and created tokenized version of these sentences with stop words removed and stemming completed, pig_tokens. Complete LDA on these sentences and review the top words associated with some of the topics.
pig_matrix

Non-/sparse entries: 1448/132400
Sparsity           : 99%
Maximal term length: 22
Weighting          : term frequency (tf)

pig_sentences

# A tibble: 157 × 4
   chapter   sentence                                            napol…¹ sente…²
   <chr>     <chr>                                               <lgl>     <int>
 1 Chapter 2 "pre-eminent among the pigs were two young boars n… TRUE          1
 2 Chapter 2 "napoleon was a large, rather fierce-looking berks… TRUE          2
 3 Chapter 2 "snowball was a more vivacious pig than napoleon, … TRUE          3
 4 Chapter 2 "napoleon then led them back to the store-shed and… TRUE          4
 5 Chapter 2 "after a moment, however, snowball and napoleon bu… TRUE          5
 6 Chapter 2 "all were agreed that no animal must ever live the… TRUE          6
 7 Chapter 2 "napoleon sent for pots of black and white paint a… TRUE          7
 8 Chapter 2 "after this they went back to the farm buildings, … TRUE          8
 9 Chapter 2 "cried napoleon, placing himself in front of the b… TRUE          9
10 Chapter 3 "snowball and napoleon were by far the most active… TRUE         10
# … with 147 more rows, and abbreviated variable names ¹napoleon, ²sentence_id
# ℹ Use `print(n = ...)` to see more rows

pig_tokens

# A tibble: 1,483 × 4
   chapter   napoleon sentence_id word
   <chr>     <lgl>          <int> <chr>
 1 Chapter 2 TRUE               1 pre
 2 Chapter 2 TRUE               1 emin
 3 Chapter 2 TRUE               1 pig
 4 Chapter 2 TRUE               1 boar
 5 Chapter 2 TRUE               1 name
 6 Chapter 2 TRUE               1 snowbal
 7 Chapter 2 TRUE               2 fierc
 8 Chapter 2 TRUE               2 berkshir
 9 Chapter 2 TRUE               2 boar
10 Chapter 2 TRUE               2 berkshir
# … with 1,473 more rows
# ℹ Use `print(n = ...)` to see more rows
library(topicmodels)

# Perform topic modeling
sentence_lda <- LDA(pig_matrix, k = 10, method = 'Gibbs', control = list(seed = 1111))

# Extract the beta matrix
sentence_betas <- tidy(sentence_lda, matrix = "beta")

# Topic #2
sentence_betas %>%
  filter(topic == "2") %>%
  arrange(-beta)
# A tibble: 858 × 3
   topic term         beta
   <int> <chr>       <dbl>
 1     2 comrad    0.0906
 2     2 announc   0.0434
 3     2 napoleon' 0.0348
 4     2 live      0.0262
 5     2 maxim     0.0133
 6     2 whymper   0.0133
 7     2 tabl      0.0133
 8     2 speech    0.00902
 9     2 dog       0.00902
10     2 stood     0.00902
# … with 848 more rows
# ℹ Use `print(n = ...)` to see more rows
# Topic #3
sentence_betas %>%
  filter(topic == "3") %>%
  arrange(-beta)

# A tibble: 858 × 3
   topic term         beta
   <int> <chr>       <dbl>
 1     3 comrad    0.0306
 2     3 snowball' 0.0220
 3     3 usual     0.0177
 4     3 boar      0.0134
 5     3 sheep     0.0134
 6     3 moment    0.00906
 7     3 walk      0.00906
 8     3 beast     0.00906
 9     3 complet   0.00906
10     3 bound     0.00906
# … with 848 more rows
# ℹ Use `print(n = ...)` to see more rows
Well done. Notice the differences in words for topic 2 and topic 3. Each topic should be made up of mostly different words, otherwise all topics would end up being the same. We will give meaning to these differences in the next lesson.
Assigning topics to documents
Creating LDA models are useless unless you can interpret and use the results. You have been given the results of running an LDA model, sentence_lda on a set of sentences, pig_sentences. You need to explore both the beta, top words by topic, and the gamma, top topics per document, matrices to fully understand the results of any LDA analysis.
Given what you know about these two matrices, extract the results for a specific topic and see if the output matches expectations.
# Extract the beta and gamma matrices
sentence_betas <- tidy(sentence_lda, matrix = "beta")
sentence_gammas <- tidy(sentence_lda, matrix = "gamma")

# Explore topic 5 betas
sentence_betas %>%
  filter(topic == "5") %>%
  arrange(-beta)
# A tibble: 858 × 3
   topic term         beta
   <int> <chr>       <dbl>
 1     5 dog       0.0373
 2     5 windmil   0.0291
 3     5 napoleon' 0.0168
 4     5 time      0.0127
 5     5 mind      0.0127
 6     5 feel      0.0127
 7     5 egg       0.0127
 8     5 act       0.0127
 9     5 emin      0.00861
10     5 snowbal   0.00861
# … with 848 more rows
# ℹ Use `print(n = ...)` to see more rows
Assess perplexity on a held-out testing dataset to make sure the topics generalize to new data.
Next, create an LDA model for each candidate number of topics and calculate perplexity for each model, using the perplexity() function from the topicmodels package.
library(topicmodels)
values = c()

# For each k from 2 to 35, train a model and calculate perplexity
for (i in c(2:35)) {
  lda_model <- LDA(train,
                   k = i,
                   method = "Gibbs",
                   control = list(iter = 25, seed = 1111))
  values <- c(values, perplexity(lda_model, newdata = test))
}

# Plot these values with the number of topics on X and perplexity on Y
plot(c(2:35), values,
     main = "Perplexity for Topics",
     xlab = "Number of Topics",
     ylab = "Perplexity")
This gives a scree plot (like for k-means): find where the perplexity score stops improving (decreasing) much with the addition of more topics.
LDA is often more about practical use than selecting the optimal number of topics based on perplexity. For example, describing 10-15 topics to an audience might not be feasible, and graphics with 5 topics are easier to view than graphics with 50 topics.
Good rule of thumb: go with smaller number of topics, where each topic is represented by a large number of documents
Common for having a subject matter expert review the words of the topics and some of the articles aligned with each topic to provide a theme for each topic
Topic 1 had the highest average weight when it was the top topic.
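A sketch of the kind of summary behind a statement like that, reusing the gamma tibble from the Animal Farm sketch earlier (names are assumptions carried over from that sketch):

# For each topic: how many documents have it as their top topic, and its
# average weight when it is the top topic
animal_farm_gammas %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  group_by(topic) %>%
  summarise(n_documents = n(), avg_weight = mean(gamma))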
Examples
Testing perplexity
You have been given a dataset full of tweets that were sent by tweet bots during the 2016 US election. Your boss has identified two different account types of interest, Left and Right. Your boss has asked you to perform topic modeling on the tweets from Right tweet bots. Furthermore, your boss is hoping to summarize the content of these tweets with topic modeling. Perform topic modeling on 5, 15, and 50 topics to determine a general idea of how many topics are contained in the data.
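The exercise code isn't reproduced here; below is a minimal sketch of one way to compare the three topic counts, assuming the Right tweets have already been cast to a document-term matrix and split into right_train and right_test (hypothetical names):

library(topicmodels)

# Fit an LDA model for each candidate number of topics and compare perplexity
for (k in c(5, 15, 50)) {
  lda_k <- LDA(right_train, k = k, method = "Gibbs",
               control = list(iter = 25, seed = 1111))
  # Lower perplexity on held-out tweets suggests a better-fitting topic count
  print(paste0("k = ", k, ": perplexity = ",
               round(perplexity(lda_k, newdata = right_test), 2)))
}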
Excellent. 15 topics performs much better on this dataset. 5 topics was not enough, while 50 topics is probably way too many.
Reviewing LDA results
You have developed a topic model, napoleon_model, with 5 topics for the sentences from the book Animal Farm that reference the main character Napoleon. You have had 5 local authors review the top words and top sentences for each topic and they have provided you with themes for each topic.
To finalize your results, prepare some summary statistics about the topics. You will present these summary values along with the themes to your boss for review.
Sentiment Analysis
The bing lexicon labels words as positive or negative; instead of summing scores, you just need to count the words used
# Find total words used by chapter
word_totals <- animal_farm_tokens %>%
  group_by(chapter) %>%
  count()

# Count how many negative words were used in each chapter
animal_farm_tokens %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(chapter) %>%
  count(sentiment) %>%
  filter(sentiment == "negative") %>%
  transform(p = n / word_totals$n) %>%
  arrange(desc(p))
# A tibble: 10 x 4
  chapter   sentiment     n          p
  <chr>     <chr>     <int>      <dbl>
1 Chapter 7 negative    154 0.11711027
2 Chapter 6 negative    106 0.10750507
3 Chapter 4 negative     68 0.10559006
Chapter 7 contains the highest proportion of negative words, almost 12%.
# What words related to fear are in the text?
fear <- get_sentiments("nrc") %>%
  filter(sentiment == "fear")

animal_farm_tokens %>%
  inner_join(fear) %>%
  count(word, sort = TRUE)
# A tibble: 220 x 2
  word          n
  <chr>     <int>
1 rebellion    29
2 death        19
3 gun          19
4 terrible     15
5 bad          14
...
Examples
tidytext lexicons
Before you begin applying sentiment analysis to text, it is essential that you understand the lexicons being used to aid in your analysis. Each lexicon has advantages when used in the right context. Before running any analysis, you must decide which type of sentiment you are hoping to extract from the text available.
In this exercise, you will explore the three different lexicons offered by tidytext's sentiment datasets.
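A quick way to peek at the three lexicons with tidytext (note: depending on your tidytext version, the afinn and nrc lexicons may need a one-time download via the textdata package):

library(tidytext)

get_sentiments("bing")   # positive/negative labels
get_sentiments("afinn")  # integer scores from -5 (very negative) to +5 (very positive)
get_sentiments("nrc")    # emotion tags such as joy, fear, trust, anticipation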
Great job. Each lexicon serves its own purpose. These are not the only three sentiment dictionaries available but they are great examples of the type of dictionaries you can use.
Sentiment scores
In the book Animal Farm, three main pigs are responsible for the events of the book: Napoleon, Snowball, and Squealer. Throughout the book they are spreading thoughts of rebellion and encouraging the other animals to take over the farm from Mr. Jones - the owner of the farm.
Using the sentences that mention each pig, determine which character has the most negative sentiment associated with them. The sentences tibble contains a tibble of the sentences from the book Animal Farm.
# Print the overall sentiment associated with each pig's sentences
for (name in c("napoleon", "snowball", "squealer")) {
  # Filter to the sentences mentioning the pig
  pig_sentences <- sentences[grepl(name, sentences$sentence), ]

  # Tokenize the text
  napoleon_tokens <- pig_sentences %>%
    unnest_tokens(output = "word", token = "words", input = sentence) %>%
    anti_join(stop_words)

  # Use afinn to find the overall sentiment score
  result <- napoleon_tokens %>%
    inner_join(get_sentiments("afinn")) %>%
    summarise(sentiment = sum(score))

  # Print the result
  print(paste0(name, ": ", result$sentiment))
}
Excellent job. Although Napoleon is the main antagonist, the sentiment surrounding Snowball is extremely negative!
Sentiment and emotion
Within the sentiments dataset, the lexicon nrc contains a dictionary of words and an emotion associated with that word. Emotions such as joy, trust, anticipation, and others are found within this dataset.
In the Russian tweet bot dataset you have been exploring, you have looked at tweets sent out by both a left- and a right-leaning tweet bot. Explore the contents of the tweets sent by the left-leaning (democratic) tweet bot by using the nrc lexicon. The left tweets, left, have been tokenized into words, with stop-words removed.
left_tokens <- left %>%
  unnest_tokens(output = "word", token = "words", input = content) %>%
  anti_join(stop_words)

# Dictionaries
anticipation <- get_sentiments("nrc") %>%
  filter(sentiment == "anticipation")
joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

# Print top words for Anticipation and Joy
left_tokens %>%
  inner_join(anticipation, by = "word") %>%
  count(word, sort = TRUE)
# A tibble: 391 × 2
   word      n
   <chr> <int>
 1 time    232
 2 god     185
 3 feat    126
 4 watch   123
 5 happy    98
 6 money    92
 7 vote     92
 8 death    85
 9 track    70
10 art      65
# … with 381 more rows
# ℹ Use `print(n = ...)` to see more rows
left_tokens %>%
  inner_join(joy, by = "word") %>%
  count(word, sort = TRUE)
# A tibble: 340 × 2
   word          n
   <chr>     <int>
 1 music       355
 2 love        273
 3 god         185
 4 feat        126
 5 happy        98
 6 money        92
 7 vote         92
 8 beautiful    89
 9 art          65
10 true         63
# … with 330 more rows
# ℹ Use `print(n = ...)` to see more rows
Excellent work. Tweets are supposed to stir feelings of joy, fear, and others. Especially tweets meant to turn the political left against the political right.
Word Embeddings
A flaw in word counts: consider two statements, "Bob is the smartest person I know." and "Bob is the most brilliant person I know." The statements say the same thing, but with stop words removed they become "Bob smartest person" and "Bob brilliant person". "Smartest" and "brilliant" aren't identical words, so traditional similarity metrics would not do well here.
Word embeddings go beyond counting how many times each word was used: they capture which words are used together and something about their meaning. word2vec is one of the most popular word embedding methods. It uses a large vector space to represent words, where words of similar meaning are close together, capturing multiple kinds of similarity; words that often appear together are also closer together in the vector space (e.g. pork, beef, and chicken are grouped together). An implementation is available in R through the h2o package.
library(h2o)
h2o.init() # start an h2o instance

# Convert the tibble into an h2o object
h2o_object = as.h2o(animal_farm)

# Using h2o methods:
# Tokenize (places an NA after the last word in each chapter)
words <- h2o.tokenize(h2o_object$text_column, "\\\\W+")

# Lowercase all letters
words <- h2o.tolower(words)

# Remove stop words
words = words[is.na(words) || (!words %in% stop_words$word), ]

word2vec_model <- h2o.word2vec(words,
                               min_word_freq = 5, # drop words used fewer than 5 times
                               epochs = 5)        # number of training iterations (use more for larger texts)
# find similar words, synonymsh2o.findSynonyms(word2vec_model, "animal")
  synonym     score
1   drink 0.8209008
2     age 0.7952490
3 alcohol 0.7867004
"animal" is most related to words like "drink", "age", and "alcohol"
# find similar words, synonymsh2o.findSynonyms(word2vec_model, "jones")
"jones", the enemy of the animals in the book, is most related to words like "battle" and "enemies"
Examples
h2o practice
There are several machine learning libraries available in R. However, the h2o library is easy to use and offers a word2vec implementation. h2o can also be used for several other machine learning tasks. In order to use the h2o library however, you need to take additional pre-processing steps with your data. You have a dataset called left_right which contains tweets that were auto-tweeted during the 2016 US election campaign.
Instead of preparing your data for other text analysis techniques, prepare this dataset for use with the h2o library.
left_right

# A tibble: 4,000 × 22
       X externa…¹ author content region langu…² publi…³ harve…⁴ follo…⁵ follo…⁶
   <int>     <dbl> <chr>  <chr>   <chr>  <chr>   <chr>   <chr>     <int>   <int>
 1     1   9.06e17 10_GOP "\"We … Unkno… English 10/1/2… 10/1/2…    1052    9636
 2     2   9.06e17 10_GOP "Marsh… Unkno… English 10/1/2… 10/1/2…    1054    9637
 3     3   9.06e17 10_GOP "Daugh… Unkno… English 10/1/2… 10/1/2…    1054    9637
 4     4   9.06e17 10_GOP "JUST … Unkno… English 10/1/2… 10/1/2…    1062    9642
 5     5   9.06e17 10_GOP "19,00… Unkno… English 10/1/2… 10/1/2…    1050    9645
 6     6   9.06e17 10_GOP "Dan B… Unkno… English 10/1/2… 10/1/2…    1050    9644
 7     7   9.06e17 10_GOP "🐝🐝…  Unkno… English 10/1/2… 10/1/2…    1050    9644
 8     8   9.06e17 10_GOP "'@Sen… Unkno… English 10/1/2… 10/1/2…    1050    9644
 9     9   9.06e17 10_GOP "As mu… Unkno… English 10/1/2… 10/1/2…    1050    9646
10    10   9.06e17 10_GOP "After… Unkno… English 10/1/2… 10/1/2…    1050    9646
# … with 3,990 more rows, 12 more variables: updates <int>, post_type <chr>,
#   account_type <chr>, retweet <int>, account_category <chr>,
#   new_june_2018 <int>, alt_external_id <dbl>, tweet_id <int>,
#   article_url <chr>, tco1_step1 <chr>, tco2_step1 <chr>, tco3_step1 <chr>,
#   and abbreviated variable names ¹external_author_id, ²language,
#   ³publish_date, ⁴harvested_date, ⁵following, ⁶followers
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
# Initialize an h2o session
library(h2o)
h2o.init()

# Create an h2o object for left_right
h2o_object = as.h2o(left_right)

# Tokenize the words from the column of text in left_right
tweet_words <- h2o.tokenize(h2o_object$content, "\\\\W+")

# Lowercase
tweet_words <- h2o.tolower(tweet_words)

# Remove stopwords from tweet_words
tweet_words <- tweet_words[is.na(tweet_words) || (!tweet_words %in% stop_words$word), ]

tweet_words
Great job. The h2o library is easy to use and intuitive, making it a great candidate for machine learning tasks such as creating word2vec models.
word2vec
You have been web-scraping a lot of job titles from the internet and are unsure if you need to scrape additional job titles for your analysis. So far, you have collected over 13,000 job titles in a dataset called job_titles. You have read that word2vec generally performs best if the model has enough data to properly train, and if words are not mentioned enough in your data, the model might not be useful.
In this exercise you will test how helpful additional data is by running your model 3 times; each run will use additional data.
job_titles

# A tibble: 13,845 × 2
   category  jobtitle
   <chr>     <chr>
 1 education After School Supervisor
 2 education *****TUTORS NEEDED - FOR ALL SUBJECTS, ALL AGES*****
 3 education Bay Area Family Recruiter
 4 education Adult Day Programs/Community Access/Job Coaches
 5 education General Counselor - Non Tenure track
 6 education Part-Time Summer Math Teachers/Tutors
 7 education Preschool Teacher (temp-to-hire)
 8 education *****TUTORS NEEDED - FOR ALL SUBJECTS, ALL AGES*****
 9 education Private Teachers and Tutors Needed in the South Bay
10 education Art Therapist at Esther B. Clark School
# … with 13,835 more rows
# ℹ Use `print(n = ...)` to see more rows
Using 33% of the available data:
library(h2o)
h2o.init()
set.seed(1111)

# Use 33% of the available data
sample_size <- floor(0.33 * nrow(job_titles))
sample_data <- sample(nrow(job_titles), size = sample_size)

h2o_object = as.h2o(job_titles[sample_data, ])
words <- h2o.tokenize(h2o_object$jobtitle, "\\\\W+")
words <- h2o.tolower(words)
words = words[is.na(words) || (!words %in% stop_words$word), ]

word2vec_model <- h2o.word2vec(words, min_word_freq = 5, epochs = 10)

# Find synonyms for the word "teacher"
h2o.findSynonyms(word2vec_model, "teacher", count = 10)
library(h2o)
h2o.init()
set.seed(1111)

# Use 66% of the available data
sample_size <- floor(0.66 * nrow(job_titles))
sample_data <- sample(nrow(job_titles), size = sample_size)

h2o_object = as.h2o(job_titles[sample_data, ])
words <- h2o.tokenize(h2o_object$jobtitle, "\\\\W+")
words <- h2o.tolower(words)
words = words[is.na(words) || (!words %in% stop_words$word), ]

word2vec_model <- h2o.word2vec(words, min_word_freq = 5, epochs = 10)

# Find synonyms for the word "teacher"
h2o.findSynonyms(word2vec_model, "teacher", count = 10)
library(h2o)
h2o.init()
set.seed(1111)

# Use all of the available data
sample_size <- floor(1 * nrow(job_titles))
sample_data <- sample(nrow(job_titles), size = sample_size)

h2o_object = as.h2o(job_titles[sample_data, ])
words <- h2o.tokenize(h2o_object$jobtitle, "\\\\W+")
words <- h2o.tolower(words)
words = words[is.na(words) || (!words %in% stop_words$word), ]

word2vec_model <- h2o.word2vec(words, min_word_freq = 5, epochs = 10)

# Find synonyms for the word "teacher"
h2o.findSynonyms(word2vec_model, "teacher", count = 10)