Natural Language Generation with R (sort of)

Introduction

“Loyal employees are your best brand ambassadors.”

This is the kind of message that gets re-posted and re-shared on LinkedIn. The only catch: a computer program generated this quote.

I got tired of seeing superficial posts on LinkedIn, so I decided to experiment with natural language generation to create my own flimsy wisdom.

Just as deep learning with images and videos has alarmed researchers and practitioners, Natural Language Generation (NLG) has people worried. The Wall Street Journal recently did a special on AI and presented a fake article created using the latest NLG algorithms.

Here's Jordan Peele acting in President Obama's deep fake video

Although the dangers of fake news existed even before the developments in natural language generation technologies, the scale at which such news can be produced and how human-like it reads is concerning.

NYT reporter in a deep fake of Adele:

But for this post, I will try to create something less sinister but perhaps more annoying, using R and Python.

A Quick Intro to Natural Language Generation

Natural Language Generation is exactly what it sounds like: computer-produced text similar to what a human would write. Although a template-based script can produce natural text (think: mail merges), NLG methods are considered a sub-domain of Artificial Intelligence (AI). AI systems learn from prior data and produce new knowledge. NLG methods typically complete these tasks:


Tasks in Natural Language Generation

As Gatt and Krahmer describe in their research paper titled “Survey of the state of the art in natural language generation”, there are two common ways to produce text:

  • Text-to-text: these techniques take existing text and either summarize or simplify it. An amusing example in this category, the paper explains, is of Philip Parker, a professor at INSEAD, who “wrote” more than 200,000 books. He compiled facts from various sources and published them in a book format. One such book is called: “The 2007-2012 Outlook for Tufted Washable Scatter Rugs, Bathmats and Sets That Measure 6-Feet by 9-Feet or Smaller in India.”
  • Data-to-text: these techniques summarize data, such as financial reports or play-by-play commentary, and turn these data into articles. This process is also called robo-journalism. Some common examples are quarterly earnings statements and fantasy football stories.
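To make the data-to-text idea concrete, here is a minimal sketch in base R that turns rows of numbers into one-sentence summaries. It is only a template filled in with `sprintf()`, closer to a mail merge than to a real robo-journalism system. The company names, the figures, and the `describe_earnings()` helper are all made up for illustration.

```r
# Made-up figures for illustration only; not real company data.
earnings <- data.frame(
  company = c("Acme Corp", "Globex"),
  quarter = c("Q2", "Q2"),
  revenue_musd = c(120.5, 80.0),
  prior_revenue_musd = c(100.0, 95.0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn one row of data into a one-sentence summary.
describe_earnings <- function(company, quarter, revenue_musd, prior_revenue_musd) {
  change <- (revenue_musd - prior_revenue_musd) / prior_revenue_musd * 100
  direction <- ifelse(change >= 0, "up", "down")
  sprintf(
    "%s reported %s revenue of $%.1f million, %s %.1f%% from the prior quarter.",
    company, quarter, revenue_musd, direction, abs(change)
  )
}

# sprintf() and ifelse() are vectorized, so this produces one sentence per row.
summaries <- with(
  earnings,
  describe_earnings(company, quarter, revenue_musd, prior_revenue_musd)
)
print(summaries)
```

Learned NLG systems replace the fixed template with a model that decides what to say and how to phrase it.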

In this rapidly evolving field, researchers introduce new techniques often. A recent paper by Sashank Santhanam and Samira Shaikh, titled “A Survey of Natural Language Generation Techniques with a Focus on Dialogue Systems - Past, Present and Future Directions”, lists more than 20 research papers published since 2015, covering different input text documents, NLG methods, and evaluation methods.

In this blog post, let’s look at Markov Chain methods and GPT-2, a state-of-the-art technique.

Text Generation Using Markov Chains

At the risk of oversimplifying how Markov Chains work, here’s my understanding of them: the computer program builds the chain of words by looking at what words have followed each other in the provided training data (aka corpus).

For example, if the term “each other” appears often, after seeing an input of “each”, the script will predict “other” as the next word. In the next round, the script will try to predict what word comes after “other.” For more information on Markov Chains, visit this visual explainer.
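The “each other” example can be sketched in a few lines of base R. This is a toy illustration, assuming a three-sentence made-up corpus and a hypothetical `predict_next()` helper. It always returns the most frequent follower, whereas a real Markov chain generator would usually pick the next word at random in proportion to these counts.

```r
# Tiny made-up corpus, for illustration only.
corpus <- c(
  "we help each other grow",
  "good teams trust each other",
  "each day is a new start"
)

# Split every sentence into lowercase words and pair each word with its follower.
words_by_sentence <- strsplit(tolower(corpus), "\\s+")
pairs <- do.call(rbind, lapply(words_by_sentence, function(w) {
  data.frame(current = head(w, -1), following = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Count how often each follower appears after each word.
transitions <- table(pairs$current, pairs$following)

# Hypothetical helper: return the most frequent follower of a word.
# It fails for words that never appear with a follower in the corpus.
predict_next <- function(word) {
  counts <- transitions[word, ]
  names(counts)[which.max(counts)]
}

predict_next("each")   # "other" follows "each" twice, "day" only once
predict_next("other")  # the next round starts from the predicted word
```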

Here’s an example from Josh Millard of using Markov Chains on Garfield comics:

Creating Garfield comic strip using Markov Chains by Josh Millard

Did you say something?

Text Generation Using Deep Learning Methods

I will not even pretend to understand all the things that go into creating pre-trained models that can generate very human-like text. That’s a job for the researchers at OpenAI and Google. There are two articles at floydhub.com that helped me understand a little bit more about these techniques. One is on transformers, and the other is on implementing GPT-2.

But let’s summarize the two deep learning methods presented in Gatt and Krahmer’s paper. Deep learning is achieved by chaining multiple layers of a neural network. The two models are:

  • Encoder-Decoder Architectures: in this architecture, the input data is encoded (i.e. transformed) into a vector form using a neural network. Another neural network then decodes this input vector to generate the output.
  • Conditioned Language Models: These models generate text by taking samples from a distribution – this distribution is learned from the input text data.
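The phrase “taking samples from a distribution” can be illustrated without any neural network. The sketch below is a deliberately tiny stand-in, not how GPT-2 works: the “learned distribution” is just the relative frequency of each follower word in one made-up sentence, and the `generate()` helper is hypothetical. Real conditioned language models learn a far richer distribution with deep networks, but the sampling step looks similar in spirit.

```r
set.seed(42)  # make the random draws repeatable

# Made-up training text, for illustration only.
text <- "people build the business and people build the culture and people lead the team"
words <- strsplit(text, " ")[[1]]

# Learn the distribution: for each word, the relative frequency of each follower.
followers <- split(tail(words, -1), head(words, -1))
next_word_probs <- lapply(followers, function(f) prop.table(table(f)))

# Hypothetical helper: generate text by repeatedly sampling the next word
# from the distribution learned for the current word.
generate <- function(start, n_words) {
  out <- start
  for (i in seq_len(n_words - 1)) {
    probs <- next_word_probs[[out[length(out)]]]
    if (is.null(probs)) break  # the current word never had a follower
    out <- c(out, sample(names(probs), size = 1, prob = as.numeric(probs)))
  }
  paste(out, collapse = " ")
}

next_word_probs[["people"]]  # "build" with probability 2/3, "lead" with 1/3
generate("people", 8)
```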

The researchers at OpenAI were very concerned that their models were so good that they could be misused to generate fake news. Hence, they initially released only a smaller set of their pre-trained models to the public.

Before we jump into creating our natural language generation models, here’s a video of a writer you can use on Huggingface’s site. I started by saying “There was once a lion.” The computer generated all the highlighted text!

Story writing with GPT-2 Huggingface writer

Enough of the background. Now, let’s write our text generators.

Get Data

Let’s get some data for our natural language generators. Since my focus was on “inspirational” or “business” language we see on LinkedIn, I collected some of those articles and quotes.

Get articles

Let’s get those LinkedIn articles:

Load our favorite libraries.

library(rvest) # to extract text from sites
library(stringr) # for easier string manipulation
library(readr) # to read text files
library(tidytext) # for natural language processing
library(dplyr) # for easier data manipulation
library(tidyr) # to make data wide and long
library(jsonlite) # to deal with json files

Extract the articles using XPath. To find the XPath or CSS selector for an element on a webpage, use the SelectorGadget Chrome extension.

html_page_base <-
  "https://linkedinarticle.wordpress.com/category/leadership/oleg-vishnepolsky/page/"

# Return the text of every element with the "entry-content" class on a page
get_article <- function(url) {
  read_html(url) %>%
    html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "entry-content", " " ))]') %>%
    html_text(trim = TRUE)
}

articles <- sapply(paste0(html_page_base, 1:12), get_article) # run the above function for 12 pages

articles_text <- unname(unlist(articles))



write(articles_text, file = "articles.txt", append = TRUE)
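The long XPath expression above is the standard XPath translation of the CSS class selector `.entry-content`. The sketch below checks that the two forms select the same nodes. It runs on a small inline HTML snippet invented for this example rather than the real site, so it needs no network access.

```r
library(rvest)

# Made-up HTML standing in for a downloaded page.
page <- read_html('
  <html><body>
    <div class="entry-content post">First post body</div>
    <div class="sidebar">Not an article</div>
    <p class="entry-content">Second post body</p>
  </body></html>
')

# Same XPath as in get_article().
by_xpath <- page %>%
  html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "entry-content", " " ))]') %>%
  html_text(trim = TRUE)

# Equivalent CSS selector.
by_css <- page %>%
  html_nodes(css = ".entry-content") %>%
  html_text(trim = TRUE)

identical(by_xpath, by_css)
```

Either form works with `html_nodes()`; the CSS version is simply easier to read.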

Get Quotes

Next, get inspirational quotes from goodreads.com. I again found the XPath syntax for the quotes and extracted the top quotes spread across 100 pages, about 3500 quotes.

html_page_base <-
  'https://www.goodreads.com/quotes/tag/inspirational?page='

# Return the cleaned text of every element with the "quoteText" class on a page
get_quotes <- function(url) {
  read_html(url) %>%
    html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "quoteText", " " ))]') %>%
    html_text(trim = TRUE) %>%
    str_replace_all(pattern = '“', replacement = "") %>% # strip opening curly quotes
    str_replace_all(pattern = '”', replacement = "") %>% # strip closing curly quotes
    str_replace_all(pattern = "\n.*", replacement = "") # drop everything after the first line break
}

quotes <- sapply(paste0(html_page_base, 1:100), get_quotes) # run the above function for 100 pages

quotes_text <- unname(unlist(quotes))



write(quotes_text, file = "quotes.txt", append = TRUE)
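The generator that follows relies on a `quotes_bigrams` data frame and a `return_third_word()` function, neither of which is defined in this post. Here is one minimal way they could look in base R; it is a guess at the idea, not the original code. The names `toy_quotes`, `make_ngrams()`, and `quotes_trigrams` are hypothetical, the three toy quotes stand in for the scraped `quotes_text`, and the fallback behaviour for unseen word pairs is an assumption.

```r
# Toy stand-in for quotes_text; swap in the scraped quotes to use real data.
toy_quotes <- c(
  "be true to yourself every day",
  "be true to your word",
  "stay true to yourself"
)

# Hypothetical helper: every run of n consecutive words, one row per run,
# in columns word1 ... wordn.
make_ngrams <- function(texts, n) {
  rows <- lapply(strsplit(tolower(texts), "\\s+"), function(w) {
    if (length(w) < n) return(NULL)  # quote too short for an n-gram
    starts <- seq_len(length(w) - n + 1)
    cols <- lapply(seq_len(n), function(k) w[starts + k - 1])
    names(cols) <- paste0("word", seq_len(n))
    as.data.frame(cols, stringsAsFactors = FALSE)
  })
  do.call(rbind, Filter(Negate(is.null), rows))
}

quotes_bigrams  <- make_ngrams(toy_quotes, 2)
quotes_trigrams <- make_ngrams(toy_quotes, 3)

# Assumed behaviour: pick a word that followed (w1, w2) in the corpus.
# If that pair was never seen, fall back to a word that followed w2 alone,
# and finally to any second word. Repeated rows make frequent words likelier.
return_third_word <- function(w1, w2) {
  candidates <- quotes_trigrams$word3[
    quotes_trigrams$word1 == w1 & quotes_trigrams$word2 == w2
  ]
  if (length(candidates) == 0) {
    candidates <- quotes_bigrams$word2[quotes_bigrams$word1 == w2]
  }
  if (length(candidates) == 0) {
    candidates <- quotes_bigrams$word2
  }
  candidates[sample.int(length(candidates), 1)]
}

return_third_word("true", "to")  # "yourself" or "your"
```

With real data, the n-gram tables could also be built with tidytext and tidyr, which are loaded above.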

Now the fun part: chain these words together until we get to the given length of words in a sentence.

generate_quote <- function(word1, word2, sentence_length = 10, debug = FALSE) {
  if (sentence_length < 3)
    stop("I need more to work with")
  # the two seed words already count toward the sentence length
  sentence_length <- sentence_length - 2

  sentence <- c(word1, word2)
  save_word1 <- word1
  save_word2 <- word2
  for (i in seq_len(sentence_length)) {
    if (debug) print(i)
    # return_third_word() (not shown in this post) supplies the next word
    # given the previous two; then the two-word window slides forward
    word <- return_third_word(save_word1, save_word2)
    sentence <- c(sentence, word)
    save_word1 <- save_word2
    save_word2 <- word
  }
  paste(sentence, collapse = " ")
}

Let’s test the quote generator by randomly selecting two sample words from our quotes.

generate_quote(sample(quotes_bigrams$word1, 1, replace = FALSE),
               sample(quotes_bigrams$word2, 1, replace = FALSE))
## [1] "we used don't your like shoulders a of big those"
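The call above draws the two seed words independently, so they are not guaranteed to be a pair that ever appeared together in the quotes, which may be part of why the output starts off so scrambled. A possible variation, sketched here on a toy data frame with the same `word1`/`word2` columns (the `toy_bigrams` and `seed` names are hypothetical), is to draw a single row so both seed words come from the same bigram.

```r
set.seed(1)  # make the random draw repeatable

# Toy stand-in for quotes_bigrams, for illustration only.
toy_bigrams <- data.frame(
  word1 = c("be", "true", "to", "stay"),
  word2 = c("true", "to", "yourself", "true"),
  stringsAsFactors = FALSE
)

# Draw one row, so word1 and word2 were actually seen next to each other.
seed <- toy_bigrams[sample.int(nrow(toy_bigrams), 1), ]
seed$word1
seed$word2

# With the real data this would become:
# generate_quote(seed$word1, seed$word2)
```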

Now, let’s create ten more:

for (i in 1:10) {
  print(
    generate_quote(sample(quotes_bigrams$word1, 1, replace = FALSE),
                   sample(quotes_bigrams$word2, 1, replace = FALSE))
  )
}
## [1] "??? loved ?????? and ?? not ??? need ????? is"
## [1] "entertainment you tell them to their highest fullest and best"
## [1] "trying and failing she liked being reminded of butterflies she"
## [1] "of half power a by product that is weaker than"
## [1] "nobody cultivate in the winter returning in the distance they"
## [1] "he's despite just your pick excuse a first time i’ve"
## [1] "be being true to yourself means refusing to abide in"
## [1] "reality elend of said our may tomorrows be she able"
## [1] "only creator way there to scare you they're there to"
## [1] "right core walk values stumble do every with man's all"

Fun!


Generate Quotes Using Markov Chains (libraries)

Why build your own Markov Chains when you can use some pre-built libraries? We will have to cheat here a little bit as we will need to use Python to run functions from the markovify Python library. Fortunately, Alex Bresler wrote a wrapper in R for us to use. Thanks, Alex!

You will need to install markovify as the instructions on markovifyR’s page say. You will, of course, need Python.

system("pip install markovify")
devtools::install_github("abresler/markovifyR")

Since I have different versions of Python installed on this machine, I am specifying which Python version to use.

reticulate::use_python('/usr/local/bin/python3', required = TRUE)

Let’s build a model using the defaults given in the manual.

library(markovifyR)
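# all_quotes is not defined in this post; it is assumed to be a data frame
# holding one quote per row in column X1
markov_model <- generate_markovify_model(
  input_text = all_quotes$X1,
  markov_state_size = 2L,
  max_overlap_total = 25,
  max_overlap_ratio = .85
)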

Now, let’s generate ten quotes from our model trained on the quotes data set.

markovify_text(
  markov_model = markov_model,
  maximum_sentence_length = NULL,
  output_column_name = 'quote',
  count = 10,
  tries = 100,
  only_distinct = TRUE,
  return_message = TRUE
)
## quote: The harder you fall, the heavier your heart, you cannot control, shift your energy to what you cannot do yet, in order that I had learned that everyone else is about decisions.
## quote: Most don't deserve your tears... and the grey, too, and He will bless you, --even--no, -especially--when your days and whirring air conditioners and bright plastic flip-flops from the splinters and turn to any rules.
## quote: The primary thing when you are using and have won many battles.
## quote: If through a hundred years of age, and with love.
## quote: If you love someone.
## quote: When things break, it's not the tongue.
## quote: Opinion is really there
## quote: So, whether you are doing here.
## quote: And all good things to dance,May your gravity by lightened by grace.Like the dignity of moonlight restoring the earth,May your thoughts create your belief in some degree helping each other or ever so well known to each and every day and live like it's heaven on earth.
## quote: My love is something inside me break.So that was that.

You will notice that the quotes generated by this library are more coherent and make more sense. More fun! Now I can use these quotes in my Twitter bot built in R to schedule original inspirational tweets. One of them has to go viral. :)


I thought that maybe these natural language generators are good with business language. So, I tried Huggingface’s GPT-2 model trained on the arXiv dataset. See what happens:

State of the Art Natural Language Generation Using GPT-2

As we discussed before, GPT-2 is a very powerful text generator that can create text that humans find convincing. Just to see the power of this tool, visit a GPT-2 powered subreddit that creates and responds to threads generated by GPT-2. Very freaky!


Reddit repeats itself

How do we put this to our use then?

Fortunately, the team at Hugging Face has created wrappers and scripts that help us in running these deep learning models.

You will need to install pytorch-transformers on your machine. Follow this excellent article on pytorch transformers.

pip install pytorch-transformers

Interactive prompt for GPT-2 text generation

Unfortunately, we need to run this next part in Python, because it is based on a response that a user enters on a prompt. When I find a script that can run and generate text without a prompt, I will update this article.

Once you have the libraries installed, you can run this command in the terminal/shell. It loads the weights from the pre-trained model, asks you for a writing prompt, and generates text of the desired length.

Here’s an example of using the GPT-2 model.

python pytorch-transformers/examples/run_generation.py \
--model_type=gpt2 \
--length=100 \
--model_name_or_path=gpt2

Here’s what GPT-2 wrote when I entered this prompt: “You don’t build a business. You build people, and people build the business.”

GPT-2 output for the “You build people” prompt (screenshot)

And here’s what GPT-2 wrote when I entered this prompt: “Employees don’t leave bad jobs, they leave bad bosses!”

GPT-2 output for the “bad bosses” prompt (screenshot)

Here’s an example of using the XLNet model.

python pytorch-transformers/examples/run_generation.py \
--model_type=xlnet \
--length=150 \
--model_name_or_path=xlnet-base-cased

This is what XLNet generated after I fed it “A truly Great Boss is hard to find, difficult to part with, and impossible to forget.”

XLNet output for the “Great Boss” prompt (screenshot)

Compared to XLNet, GPT-2 generates text that is more on topic and makes more sense.

Conclusion

Natural text generation has become better with modern computing power and techniques. It has many uses, and there are dangers of it falling into the wrong hands. But for now, I hope you can run your own experiments and see it for yourself.

I leave you with this gem generated, of course, by GPT-2:

“No amount of talking, no amount of negotiation, no amount of waiting will bring you to an agreement. You must take the initiative and move forward. Once you are ready to take action, your goal is to make yourself a better leader and to create the culture that allows others to follow. You must put in the time, you must learn the skills, and you must invest in yourself.”

Remember to invest in yourself!

What do you think? How can we put these text generators to use?

About the Author

A co-author of Data Science for Fundraising and an award-winning keynote speaker, Ashutosh R. Nandeshwar is one of the few analytics professionals in the higher education industry who has developed analytical solutions for all stages of the student life cycle (from recruitment to giving). He enjoys speaking about the power of data, as well as ranting about data professionals who chase after “interesting” things. He earned his PhD/MS from West Virginia University and his BEng from Nagpur University, all in industrial engineering. Currently, he is leading the data science, reporting, and prospect development efforts at the University of Southern California.
