RL in NMT: The Good, the Bad and the Ugly


Let me introduce you to three popular practices for using reinforcement learning (RL) in neural machine translation (NMT): the Good, combining it with good old maximum likelihood estimation (MLE), the Ugly, combining it with “hacks”, and the Bad, applying it with ignorance of more evolved techniques. Those three are helping NMT researchers on the hunt for BLEU scores.

Western movies aside, the aim of this blogpost is to take a critical look at the recent trend to include RL-inspired objectives in NMT training. We’ll start with a recap of RL training in NMT, dive right into an empirical study by Wu et al. 2018, leading to the discussion of the three following questions:

  1. How do NMT and RL fit together?
  2. Why do we even get any benefits from an RL objective in supervised learning?
  3. Where can we find the real challenges?

tl;dr RL is a popular first-aid method to fix supervised NMT training, but maybe not the most suitable one. RL shines outside supervised learning; new challenges and opportunities are to be found there.

The Basics

Introducing RL to incorporate rewards.

Maximum Likelihood Estimation

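Standard auto-regressive NMT models, parametrized by a neural network with parameters $\theta$, are trained with maximum likelihood estimation on parallel data $D$ of source sentences $x$ and reference translations $y$, resulting in the popular cross-entropy objective, written here as a log-likelihood to be maximized:

$$\mathcal{L}_{\text{MLE}}(\theta) = \sum_{(x, y) \in D} \log p_\theta(y \mid x) = \sum_{(x, y) \in D} \sum_{t=1}^{|y|} \log p_\theta(y_t \mid x, y_{<t})$$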
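For a single sentence pair, the cross-entropy objective boils down to summing token-level log-probabilities of the reference. Here is a minimal sketch, assuming we already have the probabilities the model assigns to each reference token under teacher forcing. The function name `sequence_nll` and the numbers are made up for illustration, and the sketch returns the negative log-likelihood, i.e. the quantity that is minimized in practice.

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of one reference translation.

    token_probs[t] is the probability the model assigns to the t-th
    reference token, given the source and the previous reference tokens.
    """
    return -sum(math.log(p) for p in token_probs)

# Hypothetical model probabilities for a three-token reference.
probs = [0.5, 0.25, 0.8]
loss = sequence_nll(probs)
```

Summing log-probabilities over tokens is the same as taking the log of the sequence probability, here 0.5 * 0.25 * 0.8 = 0.1.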

Expected Reward

So how does RL come into play? The idea is to introduce rewards to encourage model outputs that would obtain a high reward, not only the perfect reference translation (=MLE). In practice, rewards can be simulated with, e.g., sentence-level BLEU scores, to reinforce samples that would obtain a high BLEU score if evaluated. You might ask yourself why this is even necessary. We’ll discuss that in a bit. Assuming such a scalar reward $R(\tilde{y})$ exists for every model output $\tilde{y}$, we can formulate an objective that aims to maximize the expected reward over all model outputs for a source sentence $x$:

$$\mathcal{L}_{\text{RL}}(\theta) = \mathbb{E}_{\tilde{y} \sim p_\theta(\cdot \mid x)}\left[ R(\tilde{y}) \right]$$

Policy Gradient

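In contrast to the MLE objective, we cannot simply backpropagate through the RL objective, because the reward is a non-differentiable function of the model’s discrete outputs. Luckily, with the help of the log-derivative trick, we can reformulate the gradient of this objective with respect to the model parameters $\theta$, also referred to as the policy gradient:

$$\nabla_\theta \mathcal{L}_{\text{RL}}(\theta) = \mathbb{E}_{\tilde{y} \sim p_\theta(\cdot \mid x)}\left[ R(\tilde{y}) \, \nabla_\theta \log p_\theta(\tilde{y} \mid x) \right]$$

where $R(\tilde{y})$ is the reward for a model output $\tilde{y}$ given the source $x$.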

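We can now empirically estimate the gradient with, e.g., Monte Carlo sampling and train our model with stochastic gradient ascent. This solution was introduced in the famous REINFORCE algorithm by Williams 1992. REINFORCE proposes to estimate the gradient with one sample for each input:

$$\nabla_\theta \mathcal{L}_{\text{RL}}(\theta) \approx R(\tilde{y}) \, \nabla_\theta \log p_\theta(\tilde{y} \mid x), \quad \tilde{y} \sim p_\theta(\cdot \mid x)$$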
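To make the single-sample estimate concrete, here is a small sketch on a toy problem. It assumes a softmax policy over three discrete actions instead of a full NMT model, with a made-up reward per action. Because the action space is tiny, we can also compute the exact gradient by enumeration and check that the REINFORCE estimates average out to it.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_estimate(logits, rewards, rng):
    """Single-sample REINFORCE estimate of the gradient w.r.t. the logits."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs)[0]
    # For a softmax policy: d/dlogit_k log p(action) = 1[k == action] - p_k
    return [rewards[action] * ((1.0 if k == action else 0.0) - p)
            for k, p in enumerate(probs)]

def exact_gradient(logits, rewards):
    """Exact gradient of the expected reward, by enumerating all actions."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]

rng = random.Random(0)
logits = [0.0, 0.5, -0.5]
rewards = [0.1, 0.9, 0.3]  # hypothetical reward for each action

# Average many single-sample estimates; each one alone is very noisy.
n = 20000
avg = [0.0] * len(logits)
for _ in range(n):
    g = reinforce_estimate(logits, rewards, rng)
    avg = [a + gi / n for a, gi in zip(avg, g)]
exact = exact_gradient(logits, rewards)
```

In NMT the enumeration in `exact_gradient` is out of reach because the space of translations is far too large, which is why sampling is needed in the first place.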

How does this bring us to RL? In RL, more precisely in policy search, $p_\theta$ is a policy that predicts actions $\tilde{y}$. The policy chooses one action and then receives a reward for this action from the environment. Importantly, it is not possible to go back and try other actions instead and compare their rewards. In a genuine RL setup, we are limited to single-sample estimates.


The current practice in NMT is to approximate the policy gradient either with multinomial sampling from the softmax-normalized outputs of the NMT model or with beam search. The two objectives, MLE and expected reward, are optimized either sequentially (e.g., supervised pre-training before reinforced fine-tuning, or alternating batches) or simultaneously (e.g., by linear interpolation).

If we care about BLEU, RL alone won’t help.

In the recent EMNLP paper “A Study of RL for NMT”, Wu et al. 2018 observe that RL-inspired training objectives have been shown to improve NMT quality, but usually come with tricks and are compared against rather weak baselines. Their question is now: When combining variants of these tricks with learning from monolingual data, does RL still shine as expected?

To spoil the suspense right away, the study finds that using RL leads to marginal improvements over well-tuned baselines, also in combination with MLE and monolingual data (the good). However, the largest portions of improvement come from leveraging additional monolingual data (old news) (the ugly). But the RL-inspired approaches evaluated here lack comparisons to more evolved techniques, and assume access to reference translations (the bad). Let’s take a closer look!

RL Tricks

Variance Reduction

The variance of the gradient estimator can be a problem for optimization, i.e., it can slow down or hinder convergence. The paper investigates the following solutions:

  • Average reward baseline: Instead of using the reward directly, subtract its empirical average from the reward obtained.
  • Learned baseline: Subtract a learned reward instead of the empirical average. The learned reward is the output of a regression model, e.g. another neural network.
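The effect of the average reward baseline can be checked exactly on a toy problem. The sketch below assumes a categorical softmax policy with three actions and made-up, BLEU-like positive rewards. It uses the expected reward under the policy as the baseline, which stands in for the empirical average that would be tracked during training. By enumerating all actions it computes the exact mean and total variance of the single-sample gradient estimator.

```python
def grad_moments(probs, rewards, baseline=0.0):
    """Exact mean and total variance of the single-sample estimator
    (R(a) - baseline) * d/dlogits log p(a) for a categorical softmax policy."""
    n = len(probs)
    mean = [0.0] * n
    second = [0.0] * n
    for a, p_a in enumerate(probs):
        for k in range(n):
            # d/dlogit_k log p(a) = 1[k == a] - p_k
            g = (rewards[a] - baseline) * ((1.0 if k == a else 0.0) - probs[k])
            mean[k] += p_a * g
            second[k] += p_a * g * g
    variance = sum(s - m * m for s, m in zip(second, mean))
    return mean, variance

probs = [1 / 3, 1 / 3, 1 / 3]
rewards = [0.6, 0.9, 0.7]  # hypothetical positive rewards
avg_reward = sum(p * r for p, r in zip(probs, rewards))

mean_plain, var_plain = grad_moments(probs, rewards)
mean_base, var_base = grad_moments(probs, rewards, baseline=avg_reward)
```

The expected gradient is identical in both cases, since the baseline term has zero expectation, while the variance drops considerably once the baseline is subtracted.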

The baseline was actually already proposed in the original REINFORCE paper and can be interpreted as an additive control variate (Ross 2013). Actor-critic (AC) approaches go a step further and replace the reward obtained from the environment with a reward given by a critic that is trained to imitate the original reward (applied to NMT by e.g. Bahdanau et al. 2017, Nguyen et al. 2017).

Despite the reported effectiveness in practice, Greensmith et al. 2004 showed that both above solutions are suboptimal and that one can actually learn an optimal baseline with minimal variance.

One important aspect that has been completely neglected in the present study is that the number of samples used for the Monte Carlo gradient estimate has an essential influence on the variance of the gradient. If rewards are simulated anyway, e.g., from references using sentence-level BLEU, why not sample multiple times and average the gradients over this subset? This may sound familiar, since this is exactly what is done in minimum risk training (proposed for NMT by Shen et al. 2016).
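As a rough sketch of the multi-sample idea, the function below computes an expected cost over a handful of sampled translations, with the model distribution renormalized over that subset. This follows the minimum risk training formulation as I understand it, assuming a sharpness hyperparameter `alpha` and costs defined as one minus sentence-level BLEU. The log-probabilities and costs are invented for illustration.

```python
import math

def mrt_risk(log_probs, costs, alpha=1.0):
    """Expected cost over a sampled subset of translations.

    The model probabilities are raised to the power alpha and renormalized
    over the subset, so the weights sum to one.
    """
    scaled = [alpha * lp for lp in log_probs]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    z = sum(weights)
    return sum(w / z * c for w, c in zip(weights, costs))

# Three hypothetical samples for one source sentence.
log_probs = [math.log(0.2), math.log(0.1), math.log(0.1)]
costs = [0.1, 0.5, 0.9]  # e.g. 1 - sentence-level BLEU
risk = mrt_risk(log_probs, costs)
```

With these numbers the renormalized weights are 0.5, 0.25 and 0.25, so the risk is 0.4. Differentiating this quantity with respect to the model parameters involves all samples at once rather than a single one.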

In Wu et al. 2018’s empirical study, no beneficial effect was observed when using the learned baseline. This contradicts the experience reported in, e.g., Bahdanau et al. 2017 and Kreutzer et al. 2017. The conclusion that reward baselines are not necessary from “the economic perspective” (Wu et al. 2018) might be a bit hasty.

Reward Shaping

If the reward is only obtained at the end of each sequence (here: translation), how does the model know where the errors are? The problem of credit assignment is often addressed by introducing methods for reward shaping (Ng et al. 1999). Wu et al. 2018 investigate the implementation by Bahdanau et al. 2017: For each element $\tilde{y}_t$ of the output, the individual reward is the difference between the BLEU score for the partial output including the element and the BLEU score for the partial output excluding it: $r_t = \text{BLEU}(\tilde{y}_{1:t}, y) - \text{BLEU}(\tilde{y}_{1:t-1}, y)$. Note that the BLEU scores are computed with respect to the full reference output $y$. Once again the references are exploited to simulate the rewards.
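The shaping scheme described here can be sketched in a few lines. To keep the example self-contained, it uses a hypothetical stand-in for sentence-level BLEU, `overlap_reward`, which counts clipped unigram matches relative to the reference length. Any sequence-level reward that can score partial outputs could be plugged in instead.

```python
from collections import Counter

def overlap_reward(hypothesis, reference):
    """Stand-in for sentence-level BLEU: clipped unigram matches
    divided by the reference length."""
    if not reference:
        return 0.0
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    matches = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    return matches / len(reference)

def shaped_rewards(hypothesis, reference, reward=overlap_reward):
    """Per-token reward: score of the prefix including the token
    minus the score of the prefix excluding it."""
    rewards = []
    previous = 0.0  # score of the empty prefix
    for t in range(1, len(hypothesis) + 1):
        current = reward(hypothesis[:t], reference)
        rewards.append(current - previous)
        previous = current
    return rewards

reference = "the cat sat on the mat".split()
hypothesis = "the dog sat on mat".split()
per_token = shaped_rewards(hypothesis, reference)
```

The wrong token "dog" receives a reward of zero, and the per-token rewards sum up to the reward of the full translation, so no reward is gained or lost by the decomposition.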

But does this even address the original problem of credit assignment? The problem arose because we had to wait for rewards from the environment until we completed a sequence of actions (in NMT: produced a complete translation). As soon as references are used, we are in principle not restricted to delayed rewards anymore. One could for example compare each word in the translation to the words in the reference translation and then come up with a token-based reward. Simple binary rewards were for example proposed in Petrushkov et al. 2018.
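One very simple way to derive such token-based rewards from a reference is sketched below. It is an illustration of the idea only, not the scheme of any of the cited papers: each output token gets a reward of 1 if it still has an unmatched occurrence in the reference, and 0 otherwise.

```python
from collections import Counter

def binary_token_rewards(hypothesis, reference):
    """1.0 for a hypothesis token that still has an unmatched occurrence
    in the reference, 0.0 otherwise."""
    remaining = Counter(reference)
    rewards = []
    for token in hypothesis:
        if remaining[token] > 0:
            remaining[token] -= 1  # each reference token can be matched once
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards

token_rewards = binary_token_rewards(
    "the the cat sleeps".split(), "the cat sat".split()
)
```

Here the repeated "the" and the unmatched "sleeps" get no reward, so the model receives feedback at the exact positions where it went wrong.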

As long as we simulate the rewards using references, we can cheat our way around the real problem. When references are not available and you simply cannot compute BLEU scores for any arbitrary, partial translation – what would you do?

To this end, Nguyen et al. 2017 adopt the advantage actor-critic (A2C) framework (Mnih et al. 2016). A critic network predicts the expected future reward for each element, although the reward from the environment (here: BLEU) is only obtained at the end of the sequence. Unfortunately, the latter study does not include a comparison to RL approaches without reward shaping.

The empirical gains from reward shaping reported in Wu et al. 2018’s study are vanishingly small, which leaves the question of the usefulness of this method unanswered.

Using Monolingual Data


Leveraging monolingual data for improving MT systems has become increasingly popular, since simple methods have been shown to be very effective for NMT. When target-side monolingual data is available, the trick of the trade is to use back-translation as demonstrated in Sennrich et al. 2016. The only burden here is that one has to train a system in the opposite translation direction. This system can then generate pseudo-sources for the available target data. The “hallucinated” parallel data can then be used for standard training, with simulated rewards or without.
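The data flow of back-translation is simple enough to sketch. In the example below, `toy_reverse_translate` is a hypothetical word-by-word lookup that merely stands in for a trained target-to-source NMT system. The point is only that the machine-generated text ends up on the source side, while the genuine monolingual text stays on the target side.

```python
def back_translate(target_sentences, reverse_translate):
    """Pair each genuine target sentence with a machine-generated pseudo-source."""
    return [(reverse_translate(target), target) for target in target_sentences]

# Hypothetical stand-in for a trained target-to-source system.
toy_lexicon = {"hallo": "hello", "welt": "world"}

def toy_reverse_translate(sentence):
    return " ".join(toy_lexicon.get(word, word) for word in sentence.split())

monolingual_targets = ["hallo welt", "hallo"]
pseudo_parallel = back_translate(monolingual_targets, toy_reverse_translate)
```

The resulting (pseudo-source, target) pairs can then be mixed into the genuine parallel data for training.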

But isn’t it problematic to feed the NMT with fake data? Apparently not, at least as long as the targets are intact. Edunov et al. 2018a investigate this question systematically and surprisingly find that models get even better when the pseudo-sources are of low quality (not for small data, though). They hypothesize that the noise introduced actually enriches the training data and helps learning as e.g., in denoising auto-encoders.


Wu et al. 2018 propose to leverage not only target-side monolingual data, but also source-side monolingual data. Evoking techniques developed in the context of self-training, the idea is to let the model generate pseudo-targets for its own training. We have to assume that it is able to generate targets that are “good enough”, in the sense that the model can bootstrap itself. In practice, this is addressed by using beam search decoding for generating translations that are likely to have higher quality than sampled or greedy decoded targets.

Does the quality of the pseudo-targets matter? When they are part of the RL objective, they are only used to simulate rewards for sampled translations, which perhaps can absorb some of the noise. In supervised MLE training Wu et al. 2018 add them to the much larger original parallel data – the small amount of extra noise might be negligible. However, this has not been investigated systematically.

NMT as an RL problem

We only (mis-)use a subset of RL methods in NMT.

The “Study of RL for NMT” is limited to a very specific scenario where policy gradient is used for fine-tuning of well-trained models. What about other RL algorithms? RL researchers have in fact dealt with reinforced objectives like the ones above for decades and have developed training algorithms that are more sophisticated than vanilla policy gradient (such as Trust Region Policy Optimization and Proximal Policy Optimization). But that’s to be discussed in another blog post. Nevertheless, so far only policy gradient and actor-critic have become really popular for structured prediction tasks. So what’s wrong, are we just slow in adopting their algorithms?

In fact, it is not trivial to cast NMT, or more generally structured prediction, as a standard (PO)MDP problem, which is the basis for most RL algorithms: What is the environment? What is the state? Where does the reward come from? Translation researchers don’t agree on it (compare, e.g., the definitions in Wu et al. 2018, Nguyen et al. 2017, Bahdanau et al. 2017). It is in fact often more suitable to cast it as a simpler contextual bandit problem, aka bandit structured prediction (e.g., Sokolov et al. 2016, Kreutzer et al. 2017, Daumé III et al. 2018), as Hal Daumé III discussed in his blogpost on structured prediction and RL. You may see it as a one-state MDP.

What we can agree on is that in NMT we’re dealing with large and structured action spaces, where actions are discrete and rewards are sparse (and most of the time delayed) and potentially noisy. This calls for algorithms that are particularly suited for those conditions, but neither REINFORCE nor AC addresses these issues in particular.

In fact, training NMT from scratch with pure RL objectives, i.e. cold-start RL, has so far not succeeded (despite Xia et al. 2016’s optimism).

RL to the rescue?

RL can improve NMT because it fixes problems of our standard objective.

What’s wrong with MLE training for NMT? Ranzato et al. 2016 elaborated on this when proposing the MIXER algorithm that mixes policy gradient-style updates with MLE. They identify the following problems:

  • Exposure bias: During training reference targets are fed to the model (=teacher forcing), while during inference the model has to produce outputs based on its own previous outputs.
  • Token-level objective (aka “loss-evaluation mismatch” in Wiseman and Rush 2016): In standard autoregressive NMT models, the sequence-level log-likelihood is decomposed as a sum over token-level log-likelihoods. Training hence optimizes for finding the perfect next output token given the perfect previous tokens. During inference, however, we’re measuring the model’s quality with metrics like BLEU that evaluate whole sequences of outputs.

Algorithms like scheduled sampling (Bengio et al. 2015), DAgger (Ross et al. 2011) and DAD (Venkatraman et al. 2015) have been designed to reduce the exposure bias by gradually exposing the model to its own outputs during training (imitation learning).
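The core mechanism of scheduled sampling can be sketched as a coin flip at every decoding step. The sketch below assumes a hypothetical `predict_next` function standing in for the model’s next-token prediction, and a probability `gold_prob` of feeding the reference token. Lowering `gold_prob` over the course of training is what gradually exposes the model to its own outputs.

```python
import random

def scheduled_inputs(reference, predict_next, gold_prob, rng):
    """Build the decoder history for one training sequence.

    At each step the history is extended with the reference token
    (probability gold_prob) or with the model's own prediction.
    """
    history = []
    for gold_token in reference:
        predicted = predict_next(history)
        use_gold = rng.random() < gold_prob
        history.append(gold_token if use_gold else predicted)
    return history

# Hypothetical stand-in for a model: always predicts the same token.
def predict_next(history):
    return "<model>"

reference = ["a", "b", "c"]
rng = random.Random(0)
teacher_forced = scheduled_inputs(reference, predict_next, 1.0, rng)
free_running = scheduled_inputs(reference, predict_next, 0.0, rng)
```

With `gold_prob` at 1.0 this reduces to teacher forcing, and at 0.0 the model conditions only on its own predictions, as it does during inference.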

The same effect is obtained when including some policy gradient in the training objective (e.g. in MIXER, MRT), since the gradient update is based on the log-likelihood of the model’s own output. In addition, the model is directly optimized towards a sentence-level reward that is closer to the corpus BLEU we’re evaluating our models with. Furthermore, policy gradient can help with non-differentiable objectives other than the expected reward, e.g., for adversarial training (Wu et al. 2017, Yang et al. 2017). Or you might just use it to teach the NMT system what you actually want from it (beyond generating translations close to the reference), e.g., copying certain words of the input (Pham et al. 2018).

Large gains using RL have been reported under domain shift, i.e., gains over baseline models that are not fine-tuned on the evaluation domain (e.g., Kreutzer et al. 2017, Petrushkov et al. 2018) or when combined with classic objectives (e.g., Wu et al. 2016, Ranzato et al. 2016). The paper discussed above demonstrates that without these factors, the expected improvements vanish.

Most commonly, RL is exploited as a first aid for obvious MLE problems, in a fully-supervised setting where references are available and rewards are simulated. Why not use (or at least compare against) other training strategies that may be better suited for NMT and fix the above problems equally, as proposed e.g., in Edunov et al. 2018b, Shen et al. 2016 and Norouzi et al. 2016?

Beyond supervised learning

The challenges in RL for NLP lie outside supervised learning.

So what about more realistic uses of RL, e.g., where rewards cannot simply be simulated, or reward signals are not given as well-defined functions, or not available in unlimited amounts? In NLP, the following scenarios are evident:

  • Gold standard structures may not be available because of the cost or the lack of expertise of human annotators. Weaker signals such as human judgments on the quality of output structures may be easier to obtain and may require less expertise. This is the case for example in semantic parsing (Lawrence et al. 2018) or in machine translation (Kreutzer et al. 2018b).
  • In genuinely interactive settings where a system directly interacts with a human, the human responses can be interpreted as a weak signal of how to further improve the system. A prime example is dialogue, where learning from human feedback has successfully been implemented to train systems e.g., for small-talk (Serban et al. 2017) and task-oriented dialogue (Su et al. 2016).
  • Systems may need to be heavily customized towards a user or domain. User preferences or ratings (that usually come for free) can be used to specifically adapt the system. In industrial settings, large-scale collections of feedback have been utilised in personalized news recommendation (Li et al. 2010) or e-commerce translation systems (Kreutzer et al. 2018a).

These scenarios bring challenges that can only partly be addressed by simulations and arise from the interaction with humans in real-life scenarios. The human factor entails several differences to the popular simulation scenarios of RL. Firstly, human rewards are not well-defined functions, but complex and inconsistent signals. Secondly, humans cannot be expected to provide feedback for unlimited amounts of outputs. Exciting challenges (“RL is hard”) like the collection of reliable feedback, building robustness against adversarial feedback, fair evaluation, and off-policy learning, are ahead of us!

So instead of asking the question “How to get high BLEU with RL-objectives?” let’s move to “How to learn from rewards with RL when we depend on them?”.

Acknowledgment: Thanks to Carolin Lawrence, Stefan Riezler and Joost Bastings for their valuable and much needed feedback for improving this post.

Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.

Comments, ideas and critical views are very welcome. We appreciate your feedback! If you want to cite this blogpost, use this bibfile.