Generating Text with Context

Sudeep Dasari, Hankun Zhao, William Zhao


Text is one of the primary means through which people communicate. Through text we tell stories, argue politics, and generally share our opinions with the world (looking at you, Kanye). A controllable generator that can create semantically meaningful text has diverse uses. For example, such a generator could read a large body of movie reviews and, on command, create a short positive or negative summary of the movie. In this way, the model could synthesize multiple nuanced views of a topic, rather than echo one specific idea continuously (e.g., “I like the graphics, but hate the plot”).


Image labeling and machine translation, two related tasks. Images borrowed from TowardsDataScience and TensorFlow NMT Tutorial

Two related tasks are machine translation and image captioning. In both tasks, an encoder network generates some form of conditioning from the source. The conditioning is then used to guide the decoder, or generator. Our proposed task is significantly more complex than either machine translation or image captioning. In both of those tasks, the dimensionality of the conditioning is equal to, or even greater than, the dimensionality of the desired output. In machine translation, the source and output sequences are typically of similar length. In image captioning, the input is a high-dimensional image, while the output is typically a sentence of the form “A …“.

However, in our case the network is given only a little context (“generate a 5 star review of Bongo Burger”) and asked to produce human-like output. Obviously, the task is extremely under-constrained. Additionally, the generator has to learn both a model of the world (a 5 star review of a burger joint should praise the patties/fries) and human-like grammar.

On a high level, this project investigates the problems facing semantically meaningful text generation and tries various word- and character-level approaches to deal with them. Specifically, we approach the problem using well-established techniques from the machine translation domain. An LSTM with Bahdanau attention serves as a useful baseline model. Additionally, we explore the WaveNet/ByteNet architecture for character-level prediction, and the Transformers architecture for word-level prediction. Both networks have shown good results on machine translation, and Transformers has additionally been used in Wikipedia summarization. The prior success of these models in related fields leads us to believe they are worth trying on this task as well.

We implement both models in TensorFlow and train locally and on Amazon GPU instances. In the process of refining the models, we expose several of the challenges in this domain, attempt to rectify them, and dive into the grimy details of neural text generation. In the end, we find that both models are able to generate text with meaningful phrases that relate to the conditioning.

Our code is available on Github.


Since our application of the WaveNet model is fairly novel, we decided to use a smaller test data-set to evaluate that model’s baseline proficiency at generating text. We chose the NLTK Shakespeare data-set for this purpose, which consists of six of Shakespeare’s plays. For this model we do not provide a conditioning vector, since we are only concerned with the network’s ability to generate coherent phrases.

For the actual task we chose a dataset released by Yelp for their data challenge. The Yelp dataset provides information about various businesses and contains a wealth of reviews with varying sentiment for us to train on. Due to the size and variability of the dataset, properly cleaning it proved both difficult and extremely important. We chose to condition on the business name and the review star rating, since they are guaranteed to be present in any review (other fields are less reliable, and we did not want to deal with imputation). Reviews are taken only from businesses with a threshold number of both positive and negative reviews. This ensures variation in our dataset and eliminates outlier businesses with only one or two reviews. All punctuation and capitalization are stripped. Periods are replaced with special <PERIOD> tokens that break sentences, and <START>/<END> tokens mark the beginning and end of a review respectively. Below is an example of a cleaned 3 star review:

<START> we went here on a saturday no wait it must be losing its flare <PERIOD> the vibe is ok i thought it would be more rustic <PERIOD> the drinks are divine <PERIOD> we had apps and were going to sit down when i realized that the apps were extremely salty and heavy <PERIOD> there are literally no vegetables on the menu after thanksgiving weekend i just needed something more green <END>
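The cleaning steps above can be sketched as a short function (a simplified, hypothetical version of our pipeline; the regex and function name are illustrative):

```python
import re

def clean_review(text):
    """Lowercase, strip punctuation, and mark sentence/sequence boundaries."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9.\s]", "", text)   # drop punctuation, keep periods for now
    text = text.replace(".", " <PERIOD> ")     # periods become explicit tokens
    return ["<START>"] + text.split() + ["<END>"]

print(" ".join(clean_review("The drinks are divine. The vibe is OK.")))
# → <START> the drinks are divine <PERIOD> the vibe is ok <PERIOD> <END>
```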

Due to the size of the Yelp dataset, these tokenized strings are processed further from a CSV representation into a TFRecord representation. TFRecord files load into TensorFlow much faster than naive approaches, and also allow for easier distributed/cloud training.

We select 500 random businesses and all their reviews for the validation set. These reviews are used to calculate the BLEU score and precisions. Since this is a generative model, the test set is simply a set of fictional business names that do not appear in the dataset.


As with any machine generation task, it is difficult to define a good metric to quantify the quality of the generations. In deciding what metrics to use, we notice that generated text needs to satisfy two criteria:

  1. Generated text should be fluent. Generated text should be grammatically correct and have reasonable syntax.
  2. Generated text should be meaningful. Generated text should have some meaning that relates to the conditioning. When we change the conditioning, the generated text should change meaningfully.

The first criterion is more straightforward to measure, since it is very similar to one of the metrics used in machine translation: the BLEU score. The Bilingual Evaluation Understudy Score, or BLEU score, evaluates the quality of a machine translation by measuring its unigram, bigram, trigram, etc. precisions against the ground truth target translation. Precision measures the proportion of phrases in the generated sequence which match phrases appearing in the ground truth references.

In calculating the BLEU score, the unigram precision measures the translation system’s adequacy at understanding vocabulary. Bigram, trigram, and higher precisions measure the system’s ability to string words together into grammatically correct phrases, and string phrases together into coherent sentences. In other words, higher precisions evaluate the fluency of the translation system.

We observe that these higher-order components of the BLEU score also apply to our task of text generation. We modify the calculation slightly. Since we do not have a clear “target sequence” for text generation, and since we want our generated text to have some variety, we calculate the BLEU score for each generated sequence against a large sample of text from our test data-set.
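The clipped n-gram precision underlying this calculation can be sketched as follows (a simplified version without the brevity penalty; `ngram_precision` is an illustrative helper, not our exact evaluation code):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, references, n):
    """Fraction of candidate n-grams appearing in the references (counts clipped)."""
    cand = Counter(ngrams(candidate, n))
    # clip each n-gram's count by its maximum count over any single reference
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

refs = [r.split() for r in ["the food was good", "service was great"]]
print(ngram_precision("the food was great".split(), refs, 1))  # → 1.0
```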

The second criterion is much trickier to quantify. We considered calculating the divergence between distributions of generated text to evaluate our networks’ ability to respond to the conditioning. However, we found no clear way to quantify this divergence when our networks operate on different language models. In other words, a distribution over output character probabilities cannot be compared with a distribution over words.

Another approach is to define a system to perform sentiment analysis on generated text. However, this is a wholly different problem in natural language processing. We choose instead to evaluate this second requirement qualitatively.

Baseline – LSTM Encoder-Decoder

Basic LSTM Encoder-Decoder w. Attention. Image borrowed from TensorFlow NMT Tutorial

For a baseline model we run a simple LSTM Encoder-Decoder with Bahdanau attention. The purpose of this network is to demonstrate that a generative network can respond to conditioning in a meaningful way. We show the results after very few iterations of training.

  • 1 star review for Teriyaki Madness – “I for a minutes for a food. The menu. The door. Was I order was going made. Order. Was got home order. And too spicy. Fried rice I rice and delicious and the to I sauce was so better the to order. It very very so busy fresh. I not only customer in there the place. . Me place. . I definitely come go.”
  • 5 star review for Toll House Cafe by Chip – “I was this place. I have not have what place. I I as I walked in I was like good. I was find between I up eating a a. The. I were all. I husband favorite was the chocolate chocolate chip. . I. I I been to. I have don’t say eating here I I – chip in cones are a die for.”

After only a few iterations of training, the results are promising. The network seems to have learned some relationships between the source conditioning and words in the output sequence. With the knowledge that deep networks can respond to global conditioning, we turn our attention to more complex networks.

Network 1: WaveNet for Words

We wanted to explore using WaveNet, a generative model built on dilated causal convolutions, for this task. There has been significant previous work using this model for similar tasks:

  1. WaveNet has been applied successfully to audio generation, conditioned on the genre of music, or the text to synthesize
  2. ByteNet, a WaveNet-inspired network, learns a character model to perform text translation

Our task, text generation, is a combination of these. Given the success of these two previous WaveNet models, we expected that WaveNet could also perform well on our task. Before diving in further, we wanted to verify that character-level text generation with WaveNet was feasible, using the Shakespeare dataset.

<LINE>Why, then the made her so than but as a sellay,</LINE>
... (6 lines) ...
<LINE>That they made the barmition and say shall as a man.</LINE>

The results were promising, so we explored WaveNet’s design more deeply.

Network Architecture

WaveNet is an auto-regressive deep model built on stacked dilated, causal convolutions over the input sequence with local and global conditioning. It is designed to recognize long-term patterns and dependencies in sequences.

  • The dilated convolutions increase the receptive field without increasing the computation. The receptive field of a size-5 convolution is the same as that of a size-3 convolution with dilation 2.
  • Causal convolutions cast each prediction as a function of only earlier elements of the input sequence. This lets us use WaveNet as a sequential generator.
Modified from WaveNet Blog Post
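The receptive-field arithmetic is easy to verify: a stack of causal convolutions with filter sizes k_i and dilations d_i sees 1 + Σ (k_i − 1)·d_i past timesteps. A quick sketch (the layer configurations here are illustrative, not our exact network):

```python
def receptive_field(filter_sizes, dilations):
    """Receptive field of stacked dilated causal convolutions."""
    return 1 + sum((k - 1) * d for k, d in zip(filter_sizes, dilations))

# ten size-2 layers with doubling dilations already cover 1024 timesteps
dilations = [2 ** i for i in range(10)]        # 1, 2, 4, ..., 512
print(receptive_field([2] * 10, dilations))    # → 1024

# a size-3 filter at dilation 2 sees the same span as a size-5 filter
assert receptive_field([3], [2]) == receptive_field([5], [1])
```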

We made two significant modifications to the base WaveNet design to make it more amenable for character-level text generation.

  • Wider convolution filters. The naive WaveNet implementation uses only size-2 convolution filters. Size-2 filters are very cheap to compute and, when coupled with dilation, still grow the receptive field significantly. However, we found that using wider (size-10) filters in the first layer significantly improves the performance of the network. This is likely because the wide filters can attend to common sequences and sub-sequences in words, e.g. “the”, “ough”, “est”, “ing”.
  • 1D convolutions. 1D convolutions increase the expressiveness of the network. We found that 1D convolutions especially improved the network’s ability to attend to words far-away in the input sequence.

We used WaveNet’s gated activation operator. In this operation, the hyperbolic tangent function applies a non-linearity, and the sigmoid function “gates” the activation. We find that this activation works well in the character domain.
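The gated activation can be sketched with plain matrix multiplies standing in for the convolutions (a minimal illustration, not our TensorFlow implementation):

```python
import numpy as np

def gated_activation(x, w_filter, w_gate):
    """z = tanh(W_f x) * sigmoid(W_g x): the tanh term supplies the
    non-linearity, and the sigmoid term gates how much passes through."""
    filt = np.tanh(w_filter @ x)
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ x)))
    return filt * gate

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
z = gated_activation(x, rng.standard_normal((16, 16)), rng.standard_normal((16, 16)))
print(z.shape)  # → (16,)
```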


Our final network has 21 layers and a receptive field of 285 characters. The hidden dimension is 1024. The size, receptive field, and dimensionality of this network are similar to ByteNet, which also operated on characters.

Global Conditioning

We would like to condition the outputs of WaveNet not only on previously generated tokens, but also on a global condition vector, which embeds global properties of the generated sequence. For our task, the global conditioning consists of a textual descriptor, the business name, and a number, the review star rating. We generate the conditioning vector by running a very simple RNN over the business name and concatenating the rating.

The global conditioning vector is transformed at every layer of the network and added to the activations.
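This per-layer injection can be sketched as below, where a per-layer matrix V projects the condition vector into the layer’s hidden dimension and broadcasts it across time (names and shapes are illustrative, not our exact code):

```python
import numpy as np

def condition_layer(activations, cond, v_layer):
    """Project the global condition vector and add it to every timestep."""
    # activations: (hidden, time); cond: (cond_dim,); v_layer: (hidden, cond_dim)
    return activations + (v_layer @ cond)[:, None]

hidden, cond_dim, time = 8, 4, 5
out = condition_layer(np.zeros((hidden, time)), np.ones(cond_dim), np.ones((hidden, cond_dim)))
print(out.shape)  # → (8, 5)
```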



We minimize the cross-entropy loss for the predicted next token over the entire sequence. Since a sequence of length N generates N output distributions, the sequence length acts as a batch size. This is similar to how the original WaveNet was trained. As expected, WaveNet was excruciatingly slow to train, since most reviews are far longer than the receptive field of the network.
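Treating the sequence dimension as a batch, the training loss is just the mean next-token cross-entropy over all positions, as in this minimal numpy sketch:

```python
import numpy as np

def sequence_cross_entropy(logits, targets):
    """Mean next-token cross-entropy; logits: (N, vocab), targets: (N,) token ids."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.zeros((4, 10))            # uniform predictions over a 10-token vocab
targets = np.array([1, 3, 5, 7])
print(sequence_cross_entropy(logits, targets))  # → ln(10) ≈ 2.3026
```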

For optimization, we used the ADAM optimizer. Given the slowness of training, we decided to adjust the learning rate manually instead of experimenting to find an optimal training schedule.

Feedback Training

We explored using feedback during training to improve the generalization of the network. In the normal training routine, the next character is predicted from the ground-truth characters seen so far. However, in evaluation, the next character is predicted from a sequence of generated probability distributions over the characters. This mismatch could lead to low quality generated text.

We attempted to rectify this issue by training on the output of the network for a few iterations, the opposite of standard teacher forcing (akin to scheduled sampling). During an initialization period, we strongly supervise training with ground-truth inputs to give the model some basic competence, and introduce feedback only after the network has trained to a point where it can very reliably predict the next character.

Our modified WaveNet model had ambiguous results on performance. We suspect that we did not have enough time and/or data to get the feedback-modified model near its best performance, given its more sophisticated architecture.

Generating Text

We initialize the WaveNet generator with the start token. At every timestep, we choose the next token from the output softmax probabilities, and then iterate. We experimented with several methods to choose the next token from this distribution.

  • Greedy search. This method simply picks the token with the maximum output probability as the next token. We found that this method tends to generate repetitive text. Ex: “The staff is friendly and the staff is friendly and the staff is …”
  • Beam search. This method considers a window of sequences with maximal probability. While this method roughly maximizes the joint probability, we found that it tends to make spelling errors. Ex: “I was hereafore and thereturea wa … “
  • Sampling. This method randomly samples the next token from the output distribution. We found that this method generates the most varied text, and surprisingly makes very few spelling errors, at the expense of grammatical correctness. Ex: “The staff was super the entire up early better eating”

Given these results, we choose to perform all text generation with WaveNet using the sampling method. A flaw with this method is that it considers only the conditional distribution of the next token, rather than the joint probability over all generated characters.
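The sampling loop itself is simple; a sketch (with a toy stand-in for the trained network):

```python
import numpy as np

def sample_decode(step_fn, start_id, end_id, max_len, rng):
    """Autoregressive sampling: draw each next token from the model's softmax."""
    seq = [start_id]
    for _ in range(max_len):
        probs = step_fn(seq)                  # model's distribution over the vocab
        nxt = int(rng.choice(len(probs), p=probs))
        seq.append(nxt)
        if nxt == end_id:
            break
    return seq

# toy "model": 90% of the mass on token 2, our stand-in <END> token
toy = lambda seq: np.array([0.05, 0.05, 0.90])
out = sample_decode(toy, start_id=0, end_id=2, max_len=50, rng=np.random.default_rng(0))
print(out[0])  # → 0 (the start token)
```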


Network 2: Transformers

The Transformers architecture has achieved state-of-the-art results in natural language tasks (namely neural machine translation). Our text generation task can be formulated in exactly the same Encoder-Decoder style that worked in those settings. Given the Transformer’s performance on the related Wikipedia summarization task, we expect it to be competitive on ours as well.

Network Architecture

The Transformers network works by passing inputs through a multi-headed attention cell, followed by a position-wise feed-forward layer (equivalent to a 1×1 convolution). The encoder cell handles conditioning the outputs on the business name/star inputs, and the decoder cell handles creating the outputs themselves. The general structure of the network is shown in the diagram below.

Transformer architecture. Sourced from paper

To summarize, a positional encoding is first added separately to the encoder and decoder inputs in order to give the network a sense of time. The encoder inputs are an embedding vector for the star rating (1-5 star reviews are present in the Yelp dataset), and an embedding vector for each token in the business name. During training, the decoder inputs are the shifted ground truth tokens; at test time, they are the network’s own generated tokens (just the <START> token at the first time-step).
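The sinusoidal encoding from the paper can be computed as below (dimensions chosen to match our embedding size of 256; the function name is ours):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(300, 256)   # one row per review token position
print(pe.shape)  # → (300, 256)
```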


Multi-head Attention Diagram. Sourced from original paper

The encoder inputs go through a multi-headed attention cell (shown above). The encoder inputs serve as the query, key, and value, and are thus linearly projected 3h times (h times each for query, key, and value). Scaled dot product attention is computed on each of these h input sets, and the individual outputs are then concatenated together to form the multi-headed attention output. We must be careful to ensure the input dimension matches the output dimension of this cell so that residual connections can be employed. The result of this step is normalized (layer-norm) and then fed into a residual feed-forward layer – two sets of 1×1 convolutions with a ReLU activation in between. The feed-forward output is fed into layer-norm before being sent into both the decoder layer at this stack and the encoder layer of the next stack.

The decoder inputs similarly go through a multi-headed attention/norm layer, but the attention is masked so that the network can’t compute attention weights that depend on later inputs (past predictions don’t depend on future outputs). After the first attention layer, the decoder cell uses a second multi-headed attention/norm layer where the encoder outputs for that stack serve as the key and value, and the decoder activations serve as the query. The results are fed into a feed-forward/norm layer, at which point they’re finally fed into either the next decoder stack or a linear layer (to act as prediction logits).
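The scaled dot product attention at the core of each head, including the causal mask used in the decoder’s first attention layer, can be sketched as follows (single head, no learned projections):

```python
import numpy as np

def scaled_dot_attention(q, k, v, causal=False):
    """softmax(QK^T / sqrt(d_k)) V, optionally masking future positions."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)               # (T_q, T_k)
    if causal:
        # position t may only attend to positions <= t
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

x = np.random.default_rng(0).standard_normal((5, 16))
out, w = scaled_dot_attention(x, x, x, causal=True)
print(np.triu(w, k=1).max())  # → 0.0 (no weight on future positions)
```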

Our experimentation found that using N = 4 stacks, h = 8 attention heads, a feed-forward dimension of 3096 (the number of filters in the first convolution), an embedding dimension of 256, sinusoidal position encoding (described in the paper), and dropout at rate 0.25 on the multi-head attention output worked really well. We take the 14000 most commonly used title (input) and text (output) tokens and replace the rest with <UNK> tokens. Note that the Transformers network has a large memory footprint, and while each iteration is fast, we had to take many iterations (~200k) to see good results.


We used the same maximum-likelihood cross-entropy loss described in the paper. To elaborate, we feed the shifted inputs and business name/star (for conditioning) into the network, and compute a cross-entropy loss on the softmax output vs. the ground truth words. Our experiments showed that dividing the loss by both time and batch size yielded inferior performance to dividing by batch size alone (essentially penalizing longer inputs). For optimization, we used the ADAM optimizer with the inverse square root learning schedule described in “Attention is All You Need”.
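The inverse square root schedule warms the learning rate up linearly, then decays it as step^(-0.5); a sketch with our embedding dimension plugged in (the warmup value here is the paper’s default, an assumption about our exact setting):

```python
def transformer_lr(step, d_model=256, warmup=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)                     # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# the learning rate peaks exactly at the warmup step, then decays
print(round(transformer_lr(4000), 6))  # → 0.000988
```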

Generating Text

At test time the business name and desired star rating of the review are passed into the encoder. The <START> token is given to the decoder as the first element of the output stream, and the decoder is run until it generates 300 tokens or an <END> token (whichever comes first). Beam search with a depth of 1 and width of 20 proved to have a substantial positive effect on the quality of outputs.


Loss Curves

We train both WaveNet and the transformer network extensively on the reviews and tags of the Yelp dataset. We calculate loss on the validation set every few iterations.

The WaveNet network trains well initially, with loss consistently decreasing. However, WaveNet is very slow to train compared to the Transformer. The Transformer’s loss was very noisy from iteration to iteration; after smoothing, we see that loss generally decreased in the first part of training.


Note that while the Transformer’s loss seems to generally plateau after roughly 100k iterations, the model does qualitatively improve in subsequent iterations. This effect can be seen in the generations below for a 5 star review of “Taco Bell”:

  • Transformers at 100k – “Very good food in the area and it is one of the best customer service. They also have lots of food options and the food is great too. They also have lots of options on the menu as well. They also have a lot of food and it is great too. I highly recommend this place to anyone looking for lunch and it is a great place to go. I highly recommend this place to anyone looking for lunch and it is worth it. I highly recommend this place to anyone looking for this place. Thank you for a great meal. Thank you. For you. Thank you. Thank you. Thank you. Thank you – – -. – – – – – – – – – – – – – – – – – – – – – – – – – – -. – – -. – – -. – – – . – – – – – – -. – – – – . – – . – – – – – – – – – — – – – – – – – – – – – – – – – – – – – – – – – -. – -. – – -. – – -. – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – -“
  • Transformers at 200k – “Very good food and the amount of food you get. The carne asada burrito is really good too. The – – – and it is so good and the service is fast and the food is very good too. I always get the carne asada burrito and it comes with a side of chips and salsa. I highly recommend this place to anyone looking for a quick bite for a quick bite and will definitely be coming back for a quick bite on the side of the day. I will definitely be coming back for my next trip to Las Vegas for a quick bite in the future. Thank you so much for a great meal. highly recommend this place. – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – -. “

Quantitative Evaluation

To quantitatively measure our models’ performance we calculate BLEU scores against reviews of the same rating. The scores were evaluated on a held out set of data, so the models have never seen these specific business names before during training/validation.

                   WaveNet   Transformer
1-Gram Precision    0.98      0.93
2-Gram Precision    0.83      0.77
3-Gram Precision    0.45      0.60
4-Gram Precision    0.09      0.40
5-Gram Precision    0.00      0.20
BLEU5 Score         0.46      0.51

We see that both models learn adequate fluency over the dataset, with the Transformer performing better on longer sequences than WaveNet. This makes sense: WaveNet learns a character model, which can learn words but isn’t as powerful as a word-embedding model for handling long sequences. WaveNet also seems to have very good unigram and bigram precision, but this actually reflects the network overfitting to the most common generic words, while the Transformer learns to use unusual words.

Qualitative Evaluation

The quantitative metrics are interesting benchmarks, but qualitatively human-like text is what we’re actually after. As such, we qualitatively evaluate both the responsiveness of our networks to the conditioning and their ability to create reasonable reviews. We choose several review descriptors which should generate very different reviews. For brevity and readability we denote the <UNK> token with “-”, clean out the <PERIOD>/<START>/<END> tokens, and capitalize appropriately.

  • Bob’s Bodacious BBQ, 5 stars
    • Transformer – “Very good food and the portions are huge and you get what you get. The pulled pork and comes with a side of fries and they are so good. They also have so many options too. They also have so many options too. it comes with fries and fries. they also have a variety of food and drinks too. They also have a good amount of food. They also have a great selection of food and service. I cant wait to go back and try some of the food. Cant wait to go back and try it again in the future for a quick lunch or dinner. Highly recommend this place. Highly recommended. Highly recommended. Highly recommended this place. 5 stars for sure. 5 stars for sure. 5 stars. – – – – – – – – – – -. – -. -.”
    • WaveNet – “A very love was your business great and make  I ended up in the staff was always good. 5. I paid the vacuum make sure what do have been to see the door complaints for a really the decor is pretty standard. Completely when I was so you back for what a place to pas but when I got in a kinda with done by the cheese not done. One for her been waiting in a place in the staff was mediocre but worth it for a start because of the prices.”
  • Pizza Palace – 3 star
    • Transformers – “Very good pizza and the pizza is good too. The pizza is good and the pizza is good too. The pizza was good and the crust was way too thin and soggy. The pizza is good too and the pizza is good too. There is plenty of toppings on the side and you can get what you get. The pizza on the side and fries are good and the pizza is good and the service is good and the service is good and the service is good and the service is always on par with the quality of the food and service is always on par with the quality of the food and service is very friendly and the food is good too. Highly recommended this place again.”
    • WaveNet – “What a traditional to get back and friendly for the market entree. We purchased the prices great understanding the counter service is always friendly good food for some needs. The staff was super the entire up early better eating. She said they are friendly good and don’t make the waitress is hard to be the area is because they took the price has been for a hard to get my favorite experience because I said I was a disappointed. I never know it was one bar weekend. I think it all was more super friendly.”
  • Grand Budapest Hotel – 1 star
    •  Transformers – “Very clean and a good price. The rooms are <UNK> and there is a lot of room and there is no room. There is a lot of room and there is plenty of room and there is plenty of room and there is plenty of room. The room is clean and the rooms are nice and there is plenty of room for a lot of room and there is plenty of room in the middle of the room. I would stay here again if i would like to stay in the near the near the front desk at the front desk at the front desk at the front desk at the front desk at the front desk at the front desk desk again.”
    • WaveNet – “Dirty food was hard to check out great and cheese has a horrible or great great and friendly for and it was on a Monday complaints and the food or great but I could give up and the food on the menu was cooked easy was great.”


While the BLEU/precision scores are somewhat ambiguous, the qualitative results are decisively in favor of Transformers. Specifically, Transformers deduces what to talk about from the business name, and roughly gets the sentiment right from the star input. WaveNet, on the other hand, is not able to condition on the business name inputs as well, and as a result talks about cheese in a hotel review!

Both models also have a problem where they repeat generic – often positive – phrases that likely show up in many reviews, a symptom of the models overfitting to match the distribution of training text (related to the exposure bias problem). This highlights a flaw in our approach: there is no adversarial loss to push the models away from probabilistically likely, yet obviously fake, sequences. For example, the models would frequently string together phrases like “the food was great and cheap and great and cheap …” for several iterations. An adversarial discriminator could easily penalize such a sequence.

Interestingly, our models also revealed a hidden bias of the Yelp dataset! Many generated reviews mentioned the business being “on the strip,” which seemed odd to us. We scrutinized the businesses in more detail, and found that indeed, Yelp sampled its public dataset from businesses located in Las Vegas.

Despite the flaws, both networks demonstrate a deep understanding of the structure and content of Yelp reviews. There’s obviously a long way to go, but for a first stab the results are promising!

Lessons Learned

Text generation from characters is very difficult. While ByteNet had success generating long sequences from characters, it was able to locally condition on an embedding vector from the source sentence. This substantially decreased the complexity the model had to learn, since it essentially received the input word as some hidden embedding. Without this local embedding, the problem of text generation becomes very difficult, since the model has to recognize the meaning of previous words, predict the next word, and determine the spelling of the next word. This pushes the capacity of WaveNet too far. In the future we may explore a word-model with WaveNet.

Attention is powerful. We realized that WaveNet’s dilated convolutions and gated activations crudely attempt to provide a form of attention over the input sequence. However, this corresponds to attention over broad chunks of the input sequence, rather than specific target words. In light of this, it seems apparent that Transformers are a better choice.

Attention Tuning is all you need. It’s really important to find a good set of hyper-parameters. Sometimes it’s crucial to have faith and train for a while even if the loss doesn’t look great. A lot of our initial results with Transformers were a far cry from the “state of the art” results we were expecting. Our group doesn’t have the hardware, data, or time that Google does, so we had to toy around until we found a bag of tricks that worked. Essentially, our team has the same complaint hundreds of other deep learning researchers do: it’s hard to know if the problem is a bug in your code/data, a bad hyper-parameter, or perhaps you didn’t train long enough.

Team Contributions

Hankun Zhao – 40%. Built initial skeleton and test datasets. Built WaveNet model and evaluated on Shakespeare and Yelp datasets. Tuned WaveNet model extensively. Evaluated metrics for WaveNet model. Setup WordPress site and generated graphics. Contributed to all sections of this blog post.

Sudeep Dasari – 40%. Cleaned the Yelp dataset and converted it to TFRecord format for speedy training. Built the code-base for Baseline-LSTM and Transformer networks. Trained and extensively tuned the Transformers network. Evaluated the metrics for Transformers/baseline model, and contributed to all sections of this blog post.

William Zhao – 20%. Tuned the WaveNet model and experimented with preprocessing input data. Created the WaveNet version which conditionally feeds in its own output as the input for the next timestep, after a certain initialization period of teacher forcing. Tuned model version’s performance and evaluated metrics for its output.

References
  • Abadi, Martín, et al. “TensorFlow: A System for Large-Scale Machine Learning.” OSDI. Vol. 16. 2016.
  • Kalchbrenner, Nal, et al. “Neural machine translation in linear time.” arXiv preprint arXiv:1610.10099 (2016).
  • Lu, Sidi, et al. “Neural Text Generation: Past, Present and Beyond.” arXiv preprint arXiv:1803.07133 (2018).
  • Papineni, Kishore, et al. “BLEU: a method for automatic evaluation of machine translation.” Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002.
  • Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” Advances in neural information processing systems. 2014.
  • Van Den Oord, Aaron, et al. “Wavenet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499 (2016).
  • Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017.