<h1>Counterfactual Learning of Semantic Parsers When Even Gold Answers Are Unattainable</h1>
<p>StatNLP HD Blog (Blog of the Statistical NLP Group at Heidelberg University, Germany), 2019-01-14, <a href="https://statnlp.github.io/parsing-overview">https://statnlp.github.io/parsing-overview</a></p>
<p>In semantic parsing, natural language questions are mapped to semantic parses. A semantic parse can be executed against a database to obtain an answer. This answer can then be presented to a user.</p>
<p>Semantic parsers for question-answering can be employed in virtual personal assistants, which have been on the rise in recent years. Because such assistants are expected to help with an increasing number of tasks, we need to explore the best options for setting up a parser for a new domain efficiently and effectively, for adapting it to specific user needs, and for ensuring that it keeps improving.</p>
<p>However, obtaining labelled data can be challenging. In this post, we first consider the different possible supervision signals that can be used to train a semantic parser. This influences which objectives can be used for training, which we explore in the second part.</p>
<h2 id="supervision-signal">Supervision Signal</h2>
<h3 id="question-parse-pairs">Question-Parse Pairs</h3>
<p>To train a semantic parser, direct supervision means the collection of question-parse pairs. This can be difficult if the parse language is only understood by expert users. One option is to ensure that the parse language is as widely known as possible, e.g. by choosing SQL (<a href="http://aclweb.org/anthology/P17-1089">Iyer et al., 2017</a>). However, even in the case of SQL, experts are required for the annotation, which can quickly become very expensive.</p>
<h3 id="question-answer-pairs">Question-Answer Pairs</h3>
<p>An alternative option is to employ a weaker supervision signal. Collecting question-answer pairs is easier for many domains (<a href="http://www.aclweb.org/anthology/D/D13/D13-1160.pdf">Berant et al., 2013</a>; <a href="http://www.aclweb.org/anthology/D14-1070">Iyyer et al., 2014</a>; <a href="http://www.aclweb.org/anthology/D15-1237">Yang et al., 2015</a>; <em>inter alia</em>) and can typically be done by non-experts.</p>
<p>However, the weaker supervision signal from question-answer pairs presents a harder learning task. While the gold answer is known, it remains unclear which parse will lead to the gold answer. During training, the parser has to explore the output space to find a parse that executes to the correct gold answer. This search can be difficult as the output space is large. Furthermore, instead of finding a parse that represents the correct meaning of the question, one might find a <strong>spurious</strong> parse instead. Such a parse happens to execute to the gold answer, but conveys the wrong meaning. This hampers generalisation.</p>
<p>For example, assume we have the question “Are there any bars?” and instead of mapping “bar” to the logical form for “bar”, the parser maps it to the logical form for “restaurant”. If the answer to both “Are there any bars?” and “Are there any restaurants?” is “Yes”, then the wrong logical form “restaurant” for the question “Are there any bars?” will lead to the correct answer. The parser has now wrongly learnt to map “bar” to the logical form “restaurant”, and for other questions, such as “Where is the closest bar?”, it will return the closest restaurant instead.</p>
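<p>The bar/restaurant example can be reproduced with a toy executor. The sketch below is purely illustrative: the miniature database, the <code>execute</code> function and the tuple-based parse format are assumptions made for this example and do not correspond to any real parse language.</p>

```python
# Toy database: each entry is a point of interest with a type and a distance in metres.
DATABASE = [
    {"name": "Cafe Rossi", "type": "restaurant", "distance": 120},
    {"name": "The Old Pub", "type": "bar", "distance": 450},
]


def execute(parse):
    """Execute a hypothetical parse of the form (operation, poi_type)."""
    operation, poi_type = parse
    matches = [poi for poi in DATABASE if poi["type"] == poi_type]
    if operation == "exists":
        return "Yes" if matches else "No"
    if operation == "closest":
        return min(matches, key=lambda poi: poi["distance"])["name"]
    raise ValueError(f"unknown operation: {operation}")


# "Are there any bars?": the correct and the spurious parse yield the same answer,
# so supervision at the answer level cannot tell them apart.
correct_exists = execute(("exists", "bar"))
spurious_exists = execute(("exists", "restaurant"))

# "Where is the closest bar?": the wrong mapping now yields a wrong answer.
correct_closest = execute(("closest", "bar"))
spurious_closest = execute(("closest", "restaurant"))
```

<p>Both existence queries return the same answer, while the two “closest” queries disagree, which is exactly why a spurious parse can go unnoticed during training and still hurt at test time.</p>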
<h3 id="comparison-question-parse-vs-question-answer-pairs">Comparison: Question-Parse vs. Question-Answer Pairs</h3>
<p><a href="http://www.aclweb.org/anthology/P16-2033">Yih et al., 2016</a> investigated the cost and benefit of obtaining question-parse pairs compared to collecting question-answer pairs. For this, they used the WebQuestions corpus (<a href="http://www.aclweb.org/anthology/D/D13/D13-1160.pdf">Berant et al., 2013</a>), which is based on the <a href="https://developers.google.com/freebase/">Freebase database</a>. The corpus was originally collected with the help of non-expert crowd-source workers in the form of question-answer pairs. <a href="http://www.aclweb.org/anthology/P16-2033">Yih et al., 2016</a> annotated each question in the corpus with a corresponding gold parse. To ease the annotation, they designed a simple user interface and hired experts familiar with Freebase.</p>
<p>Next, they compared a system trained on question-parse pairs to a system trained on question-answer pairs. In their experiments, they were able to show three, in part surprising, results:</p>
<ol>
<li>
<p>The model trained on question-parse pairs outperforms the model trained on question-answer pairs by over 5 percentage points in answer accuracy.</p>
</li>
<li>
<p>Answer annotation by crowd-source workers is often incorrect: in their evaluation, it was incorrect 34% of the time.</p>
</li>
<li>
<p>With an easy-to-use interface, experts can write the correct semantic parse faster than they can retrieve the correct answer.</p>
</li>
</ol>
<p>Observation 1 does not come as a surprise, as question-parse pairs offer a stronger learning signal. But both 2 and 3 are surprising. However, as noted previously, hiring experts to annotate gold parses can be expensive.</p>
<p>A further problem arises for domains where it is not easy to collect gold answers, for example, when answers are open-ended lists, fuzzily defined or very large.</p>
<p>This is, for example, the case in the domain of geographical question-answering using the OpenStreetMap database. Here, the underlying parse language is only known to a few expert users, which makes the collection of gold parses particularly difficult. Furthermore, it is often impossible to collect gold answers because in many cases the gold answer set is too large or too fuzzily defined (e.g. when searching for objects “near” another one) to be obtained in a reasonable amount of time or without error.</p>
<h3 id="question-feedback-pairs">Question-Feedback Pairs</h3>
<p>In cases where the collection of both gold parses and gold answers is infeasible, we need to obtain a learning signal from other sources. One option is to obtain feedback from users while they are interacting with the system (<a href="http://aclweb.org/anthology/P18-1169">Lawrence &amp; Riezler, 2018</a>).</p>
<p>For this, a baseline semantic parser is trained on a small number of question-parse pairs. This parser can be used to parse further questions for which neither gold parses nor gold answers exist. The parse suggested by the baseline can then be automatically transformed into a set of human-understandable statements. Human users can easily judge each statement as correct or incorrect. This feedback can be used to further improve the parser.</p>
<p>For example, below is a question and the statements automatically generated from the corresponding parse.</p>
<p><img src="https://statnlp.github.io/pics/2018-11-14_parsing_overview.png" alt="" /></p>
<p>With the filled-in form, we know which parts of the parse are wrong.</p>
<p>This allows us to go further than just promoting correct parses. For each statement, we are able to map it back to the tokens in the parse that produced it. This allows us to learn from partially correct parses, where we only promote the tokens associated with correct statements.</p>
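<p>As a rough illustration of this mapping, the following sketch turns judged statements into token-level rewards. The parse tokens, the statement texts and the <code>token_rewards</code> helper are hypothetical and only serve to make the idea concrete; they are not the format used in the cited work.</p>

```python
def token_rewards(num_tokens, judged_statements):
    """Give reward 1.0 to every parse token that produced a statement judged correct."""
    rewards = [0.0] * num_tokens
    for statement in judged_statements:
        if statement["correct"]:
            for index in statement["token_indices"]:
                rewards[index] = 1.0
    return rewards


# Hypothetical parse with three tokens and the user's judgement of each statement.
parse_tokens = ["area=Heidelberg", "amenity=restaurant", "cuisine=greek"]
judged_statements = [
    {"text": "Search area: Heidelberg", "token_indices": [0], "correct": True},
    {"text": "Looking for: restaurants", "token_indices": [1], "correct": True},
    {"text": "Cuisine: Greek", "token_indices": [2], "correct": False},
]
rewards = token_rewards(len(parse_tokens), judged_statements)
```

<p>Only the first two tokens receive a positive reward here, so a partially correct parse still contributes a learning signal.</p>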
<h2 id="objectives">Objectives</h2>
<p>The collected data decides which objectives can be applied during training. Below we give an overview of various objectives, which data they require and what their advantages and disadvantages are.</p>
<p>First off, here is some general notation:</p>
<ul>
<li><script type="math/tex">\pi_w</script>: neural network with parameters <script type="math/tex">w</script></li>
<li><script type="math/tex">x = x_1, x_2, \dots x_{\mid x\mid }</script>: input question</li>
<li><script type="math/tex">y = y_1, y_2, \dots y_{\mid y\mid }</script>: output parse</li>
<li><script type="math/tex">\bar{y} = \bar{y}_1, \bar{y}_2, \dots \bar{y}_{\mid \bar{y}\mid }</script>: gold parse</li>
<li><script type="math/tex">\bar{a}</script>: gold answer</li>
</ul>
<p>We define an objective in terms of a loss function <script type="math/tex">\mathcal{L}</script>. For training, we differentiate it with respect to the model’s parameters <script type="math/tex">w</script> to make (stochastic) gradient descent updates, <script type="math/tex">w = w - \eta \nabla_w \mathcal{L}</script>, where <script type="math/tex">\eta</script> is a suitable learning rate.</p>
<h3 id="question-parse-pairs-maximum-likelihood-estimation-mle">Question-Parse Pairs: Maximum Likelihood Estimation (MLE)</h3>
<p>Neural networks are typically trained using MLE, where the probability of a gold parse <script type="math/tex">\bar{y}</script> is raised given a question <script type="math/tex">x</script> (e.g. <a href="http://www.aclweb.org/anthology/P/P16/P16-1004.pdf">Dong & Lapata, 2016</a> or <a href="http://www.aclweb.org/anthology/P16-1002">Jia & Liang, 2016</a>). The objective is defined as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\mathcal{L}_{MLE} = - \sum_{j=1}^{\mid \bar{y}\mid } \log \pi_w(\bar{y}_{j} \mid \bar{y}_{<j}, x), %]]></script>
<p>where <script type="math/tex">% <![CDATA[
\bar{y}_{<j} = \bar{y}_{1}, \bar{y}_{2}, \dots \bar{y}_{j-1}. %]]></script></p>
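<p>To make the MLE objective concrete, here is a minimal sketch in plain Python. It assumes the per-token probabilities that the model assigns to the gold parse are already given as a list; the numbers are made up for illustration.</p>

```python
import math


def mle_loss(gold_token_probabilities):
    """Negative log-likelihood of a gold parse.

    Each entry is the model probability of gold token j given the
    preceding gold tokens and the question.
    """
    return -sum(math.log(p) for p in gold_token_probabilities)


# Made-up probabilities for a gold parse of three tokens.
gold_token_probabilities = [0.9, 0.5, 0.8]
loss = mle_loss(gold_token_probabilities)

# The loss is the negative log of the probability of the whole gold parse.
sequence_probability = math.exp(-loss)
```

<p>Minimising this loss raises the probability of the annotated gold parse and of nothing else, which is the limitation discussed next.</p>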
<p>However, this approach is only possible if gold targets <script type="math/tex">\bar{y}</script> are available. As mentioned in the first section, obtaining these might be too expensive in practice and weaker supervision signals are the practical alternative.</p>
<p>There is a further reason for a different objective, even when question-parse pairs are available:</p>
<p>There might be other parses, not just the annotated gold parse, that lead to the correct answer. But these can never be discovered if the MLE objective is used. Discovering further valid parses could stabilise learning and help generalisation. Further, this allows the parser to find suitable parses in its own output space.</p>
<p>Next, we turn to objectives which assume the existence of gold answers, obtained either by executing gold parses or by direct annotation. For these objectives, a parser produces model outputs which are executed to obtain corresponding answers. The answers can be compared to the available gold answer and a reward can be assigned to the various model outputs.</p>
<h3 id="question-answer-pairs-reinforce-and-minimum-risk-training-mrt">Question-Answer Pairs: REINFORCE and Minimum Risk Training (MRT)</h3>
<p>Recently, there has been a surge of interest in applying reinforcement learning approaches, in particular the REINFORCE algorithm (<a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">Williams 1992</a>), to (weakly) supervised NLP tasks. The inherent issues that arise in this context are also explored with regard to neural machine translation in <a href="https://statnlp.github.io/rl4nmt">another blog post</a>.</p>
<p>We will first introduce the REINFORCE algorithm, then discuss potential issues.</p>
<p>In REINFORCE, given an input question <script type="math/tex">x</script>, <strong>one</strong> output <script type="math/tex">y</script> is sampled from the current model distribution (see Section 13.3 of <a href="https://drive.google.com/file/d/1opPSz5AZ_kVa1uWOdOiveNiBFiEOHjkG/view">Sutton & Barto, 2018</a>). Executing this sampled parse to obtain an answer <script type="math/tex">a</script>, the comparison with the gold answer <script type="math/tex">\bar{a}</script> provides us with a reward <script type="math/tex">\delta</script>. On the basis of this single reward, the model’s parameters are updated, i.e. we can define the following objective:</p>
<script type="math/tex; mode=display">\mathcal{L}_{REINFORCE} = - \delta \pi_w(y\mid x).</script>
<p>However, with just one sample, this objective can suffer from high variance. This can be combated by introducing control variates, which lower variance. The most popular choice is using a baseline, where we keep track of the average reward, which is subtracted from <script type="math/tex">\delta</script>.</p>
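<p>A minimal sketch of such a baseline is given below. It assumes that the baseline is the plain average of all rewards seen before the current one and that it is zero before any reward has been observed; the class name <code>AverageRewardBaseline</code> is made up for this example, and other variants (e.g. a moving average) are equally possible.</p>

```python
class AverageRewardBaseline:
    """Keeps track of the average reward and subtracts it from new rewards."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def adjust(self, reward):
        # Average of the rewards seen so far (assumed to be 0.0 before the first one).
        baseline = self.total / self.count if self.count else 0.0
        self.total += reward
        self.count += 1
        return reward - baseline


baseline = AverageRewardBaseline()
raw_rewards = [1.0, 0.0, 1.0, 1.0]
adjusted_rewards = [baseline.adjust(reward) for reward in raw_rewards]
```

<p>The adjusted reward then takes the place of the raw reward in the update. An output that is worse than average now receives a negative weight instead of merely a small positive one.</p>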
<p>But why only sample one output?</p>
<p><strong>We have the luxury of having gold answers available.</strong></p>
<p>This allows us to sample several outputs and obtain rewards for all of them. With this, an average can be computed and on the basis of this average updates to <script type="math/tex">w</script> are performed. First, this lowers the variance. Second, it allows us to try out several model outputs, which helps us to explore the output space and in turn increases our chance of finding a parse that leads to the correct answer.</p>
<p>Building an average based on several outputs obtained for one input is exactly the characteristic idea of Minimum Risk Training (MRT).</p>
<p>MRT was introduced in the context of log-linear models for dependency parsing and machine translation (<a href="https://people.cs.umass.edu/~dasmith/dtrain_acl_2006.pdf">Smith & Eisner, 2006</a>). It has also been tested for neural models in the context of machine translation (<a href="http://anthology.aclweb.org/P/P16/P16-1159.pdf">Shen et al., 2016</a>).</p>
<p>Sampling <script type="math/tex">S</script> outputs per input, we can define the following MRT objective:</p>
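<script type="math/tex; mode=display">\mathcal{L}_{MRT} = - \frac{1}{S} \sum_{s=1}^{S} \delta_s \pi_w(y_s\mid x).</script>
<p>Here, <script type="math/tex">\delta_s</script> denotes the reward obtained for the <script type="math/tex">s</script>-th sampled output <script type="math/tex">y_s</script>.</p>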
<p>This objective is for example employed in <a href="http://www.aclweb.org/anthology/P17-1003">Liang et al., 2017</a>. Although they use the term REINFORCE (“We apply REINFORCE”), their later objective is based upon <script type="math/tex">S</script> outputs (“Thus, in contrast with common practice of approximating the gradient by sampling from the
model, we use the top-<script type="math/tex">k</script> action sequences”), which is reminiscent of MRT. Similarly, <a href="http://aclweb.org/anthology/P17-1097">Guu et al., 2017</a> also calculate an average over several output samples for one input (see their Equation 9). <a href="http://proceedings.mlr.press/v70/mou17a/mou17a.pdf">Mou et al., 2017</a> also take advantage of the gold answers to sample and evaluate several parses for one input (“We adjust the reward by subtracting the mean reward, averaged over sampled actions for a certain data point.”).</p>
<p>Where it is applicable, MRT is preferable to REINFORCE because, by sampling several outputs per input, it exhibits lower variance. But it can only be applied if question-answer pairs are available. Additionally, it is more expensive to compute.</p>
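<p>The MRT objective for a single question can be sketched as follows. The sketch assumes that each sampled parse has already been executed and scored, so every sample is represented only by its reward and its model probability; the numbers are invented.</p>

```python
def mrt_loss(samples):
    """MRT loss for one question.

    samples: list of (reward, probability) pairs, one per sampled parse,
    where probability is the model probability of that parse.
    """
    return -sum(reward * probability for reward, probability in samples) / len(samples)


# Three sampled parses for one question: two execute to the gold answer, one does not.
samples = [(1.0, 0.5), (0.0, 0.3), (1.0, 0.2)]
loss = mrt_loss(samples)

# With a single sample, the same function reduces to the REINFORCE objective above.
single_sample_loss = mrt_loss([(1.0, 0.5)])
```

<p>Each additional sample requires executing one more parse against the database, which is where the extra computational cost comes from.</p>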
<p>For our final scenario from the previous section, where neither gold answers nor gold parses are available and we only have feedback collected for one model output, we are limited to only one sample and MRT cannot be applied.</p>
<p>Let’s see which objectives we can apply in such scenarios.</p>
<h3 id="question-feedback-pairs-reinforce-and-counterfactual-learning">Question-Feedback Pairs: REINFORCE and Counterfactual Learning</h3>
<p>A setup where only one output and its corresponding feedback are available is also called a bandit learning scenario. The name is inspired by choosing one among several slot machines (colloquially referred to as “one-armed bandits”), where we only observe the reward for the chosen machine (i.e. output) and it remains unknown what reward the other machines (or outputs) would have obtained.</p>
<p>This is a crucial contrast to learning from question-answer pairs. We illustrate this graphically in the figure below. The left side shows the scenario where question-answer pairs are available, whereas the right assumes question-feedback pairs where no gold answers are available.</p>
<p><img src="https://statnlp.github.io/pics/2018-11-14_QA.png" alt="qa" class="align-left" /></p>
<p>REINFORCE is still applicable in bandit learning scenarios. But if we collect feedback as users are using the system, it can be dangerous to update the parser’s parameters online.</p>
<p>The parser’s performance could deteriorate without notice, which can lead to user dissatisfaction and monetary loss. It also makes it impossible to explore different hyperparameter settings.</p>
<p>Instead, it is safer to first collect the feedback in a log of triples <script type="math/tex">\mathcal{D}_{log}=\{(x_m,y_m,\delta_m)\}_{m=1}^M</script>. Once enough feedback has been collected, the log can be used to further improve the parser offline. The resulting model can then be validated against additional test sets before it is deployed.</p>
<p>However, once we start learning, the outputs recorded in the log might no longer be the outputs the updated parser would choose; i.e. the log we collected is biased towards the parser that was deployed at the time. Learning from such a log leads to a counterfactual, or off-policy, learning setup.</p>
<p>The bias in the log can be corrected using <a href="https://en.wikipedia.org/wiki/Importance_sampling">importance sampling</a>, where we divide the probability that the new model <script type="math/tex">\pi_w</script> assigns to the logged output by the probability that the deployed parser <script type="math/tex">\mu</script> assigned to that output. This leads to the following Inverse Propensity Score (IPS) objective:</p>
<script type="math/tex; mode=display">\mathcal{L}_{IPS} = - \frac{1}{M} \sum_{m=1}^{M} \delta_m \frac{\pi_w(y_m\mid x_m)}{\mu(y_m\mid x_m)}.</script>
<p>However, because we ideally want to present only correct parses and answers to our users, we want to always show the most likely output under the currently deployed model. This results in a deterministic log because the probability of choosing the most likely output is always one. Consequently, we can no longer correct the data bias.</p>
<p>This leads to the Deterministic Propensity Matching (DPM) objective:</p>
<script type="math/tex; mode=display">\mathcal{L}_{DPM} = - \frac{1}{M} \sum_{m=1}^{M} \delta_m \pi_w(y_m\mid x_m).</script>
<p>Just like REINFORCE, both IPS and DPM suffer from high variance because only one output received a reward for each input. It is thus advisable to employ control variates, e.g. one-step-late reweighting (<a href="http://aclweb.org/anthology/P18-1169">Lawrence & Riezler, 2018</a>).</p>
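<p>The IPS and DPM objectives above can be sketched in a few lines. The sketch makes several assumptions for illustration: each log entry carries the logging probability as a fourth field (the log described above only holds triples), the new model is represented by a plain lookup function, and all numbers are invented. Control variates are left out.</p>

```python
def ips_loss(log, new_policy):
    """IPS loss; log entries are (question, parse, reward, logging_probability)."""
    total = sum(reward * new_policy(x, y) / mu for x, y, reward, mu in log)
    return -total / len(log)


def dpm_loss(log, new_policy):
    """DPM loss; the logging probability is ignored because the log is deterministic."""
    total = sum(reward * new_policy(x, y) for x, y, reward, _ in log)
    return -total / len(log)


# Hypothetical probabilities that the new model assigns to the logged parses.
NEW_MODEL = {("q1", "p1"): 0.8, ("q2", "p2"): 0.4}


def new_policy(question, parse):
    return NEW_MODEL[(question, parse)]


stochastic_log = [("q1", "p1", 1.0, 0.5), ("q2", "p2", 0.0, 0.25)]
deterministic_log = [("q1", "p1", 1.0, 1.0), ("q2", "p2", 0.0, 1.0)]

ips_value = ips_loss(stochastic_log, new_policy)
dpm_value = dpm_loss(deterministic_log, new_policy)
ips_on_deterministic_log = ips_loss(deterministic_log, new_policy)
```

<p>When every logging probability is one, the importance weight has no effect and IPS coincides with DPM, which mirrors the argument made above for deterministic logs.</p>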
<p>With these objectives it is possible to learn from question-feedback pairs where feedback was collected from users for one system output. This approach is useful for scenarios where neither the collection of gold parses nor the collection of gold answers is feasible.</p>
<p>Furthermore, this approach can also be applied to other tasks, such as machine translation (<a href="http://www.aclweb.org/anthology/D/D17/D17-1272.pdf">Lawrence et al., 2017</a>; <a href="https://arxiv.org/pdf/1804.05958.pdf">Kreutzer et al., 2018a</a>; <a href="https://arxiv.org/pdf/1805.10627.pdf">Kreutzer et al., 2018b</a>).</p>
<h2 id="summary">Summary</h2>
<ul>
<li>Semantic parsers are important modules in virtual personal assistants and with an increasing number of domains in which these assistants are used, we need to find efficient and effective methods to train parsers on new domains.</li>
<li>In general, the stronger the learning signal, the better the result. For each new domain, we should estimate the time and cost of the different approaches, while keeping in mind that:
<strong>question-parse pairs > question-answer pairs > question-feedback pairs.</strong></li>
<li>If it is too expensive to obtain gold parses or gold answers, then using counterfactual learning from question-feedback pairs is a viable alternative.</li>
</ul>
<p><strong>Acknowledgment: Thanks to Julia Kreutzer and Stefan Riezler for their valuable and much needed feedback for improving this post.</strong></p>
<p><strong>Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.</strong></p>
<p><strong>Comments, ideas and critical views are very welcome. In particular, feel free to let us know if you think there is an important paper that we should add to this overview! We appreciate your feedback! If you want to cite this blogpost, use this <a href="/2019-01-14_parsing_overview.bibtex">bibfile</a>.</strong></p>
<p>Carolin Lawrence</p>
<p>How can we train semantic parsers if neither question-parse nor question-answer pairs can be collected?</p>
<h1>RL in NMT: The Good, the Bad and the Ugly</h1>
<p>2018-11-15, <a href="https://statnlp.github.io/rl-nmt">https://statnlp.github.io/rl-nmt</a></p>
<p><img src="https://statnlp.github.io/pics/horse_small.png" alt="image-right" class="align-right" /> Let me introduce you to three popular practices for using reinforcement learning (RL) in neural machine translation (NMT): <strong>the Good</strong>, combining it with good old maximum likelihood estimation (MLE), <strong>the Ugly</strong>, combining it with “hacks”, and <strong>the Bad</strong>, applying it with ignorance of more evolved techniques. Those three are helping NMT researchers on the hunt for BLEU scores.</p>
<p>Western movies aside, the aim of this blogpost is to take a critical look at the recent trend to include RL-inspired objectives in NMT training. We’ll start with a <a href="#the-basics">recap</a> of RL training in NMT, dive right into an empirical study by <a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a>, leading to the discussion of the three following questions:</p>
<ol>
<li>How do <a href="#nmt-as-an-rl-problem">NMT and RL</a> fit together?</li>
<li>Why do we even get any benefits from an <a href="#rl-to-the-rescue">RL objective in supervised learning</a>?</li>
<li>Where can we find the <a href="#beyond-supervised-learning">real challenges</a>?</li>
</ol>
<p><strong>tl;dr</strong> RL is a popular first-aid method to fix supervised NMT training, but maybe not the most suitable one. RL shines outside supervised learning; new challenges and opportunities are to be found there.</p>
<h2 id="the-basics">The Basics</h2>
<p><strong>Introducing RL to incorporate rewards.</strong></p>
<h3 class="no_toc">Maximum Likelihood Estimation</h3>
<p>Standard auto-regressive NMT models, parametrized by a neural network with parameters <script type="math/tex">\theta</script>, are trained with <strong>maximum likelihood estimation</strong> on parallel data <script type="math/tex">(x, y) \in \mathcal{D}</script> resulting in the popular cross-entropy objective:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align} \text{MLE} &= \sum_{(x,y) \in D} \log p_{\theta}(y \mid x) \end{align} %]]></script>
<h3 class="no_toc">Expected Reward</h3>
<p>So how does RL come into play? The idea is to introduce rewards to encourage model outputs that would obtain a high reward, not only the perfect reference translation (=MLE). In practice, rewards can be simulated with e.g., sentence-level BLEU scores, to reinforce samples that – if evaluated – would obtain a high BLEU score. You might ask yourself, why is it even necessary? We’ll discuss that in <a href="#nmt-as-an-rl-problem">a bit</a>. Assuming the existence of such scalar rewards obtained from <script type="math/tex">R: \mathcal{Y} \to [0,1]</script> we can formulate an objective that aims to maximize the <strong>expected reward</strong> for all model outputs:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align} \text{RL} &= \mathbb{E}_{p_{\theta}(y \mid x)} \left[ R(y) \right] \end{align} %]]></script>
<h3 class="no_toc">Policy Gradient</h3>
<p>In contrast to the MLE objective, the RL objective cannot be optimized by straightforward backpropagation: the expectation ranges over all possible outputs, and the reward is a discrete function of the sampled outputs of the model, so no gradient flows through it. Luckily, with the help of the <a href="http://blog.shakirm.com/2015/11/machine-learning-trick-of-the-day-5-log-derivative-trick/"><strong>log-derivative trick</strong></a>, we can reformulate the gradient for this objective, also referred to as the <strong>policy gradient</strong>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align} \nabla_{\theta} \text{RL} &= \mathbb{E}_{p_{\theta}(y \mid x)} \left[ R(y) \nabla_{\theta} \log p_{\theta}(y \mid x)\right] \end{align} %]]></script>
<p>We can now empirically <strong>estimate</strong> the gradient with e.g. Monte Carlo sampling and train our model with stochastic gradient ascent. This solution was introduced in the famous <strong>REINFORCE</strong> algorithm by <a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">Williams 1992</a>. REINFORCE proposes to estimate the gradient with one sample for each input:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align} \tilde{\nabla}_{\theta} \text{RL} &= R(\tilde{y}) \nabla_{\theta} \log p_{\theta}(\tilde{y} \mid x),& \tilde{y} \sim p_{\theta}(y \mid x) \end{align} %]]></script>
<p>How does this bring us to RL? In RL, more precisely in policy search, <script type="math/tex">p_{\theta}</script> is a policy that predicts actions <script type="math/tex">y</script>. The policy chooses one action and then receives a reward for this action from the environment. Importantly, it is not possible to go back and try other actions instead and compare their rewards. In a genuine RL setup, we are limited to <strong>single-sample</strong> estimates.</p>
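<p>The policy gradient can be checked numerically on a toy problem. The sketch below assumes a policy that is a softmax over just three possible outputs with one parameter per output, which is of course far simpler than an NMT model; the scores and rewards are invented. It computes the single-sample REINFORCE estimate for every possible sample, averages these estimates under the policy, and compares the result to a finite-difference gradient of the expected reward.</p>

```python
import math


def softmax(logits):
    """Turn a list of scores into a probability distribution."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def expected_reward(logits, rewards):
    """The RL objective: expected reward under the policy, by enumeration."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def reinforce_estimate(logits, rewards, sampled):
    """Single-sample estimate R(y) * grad log p(y) for one sampled output index."""
    probs = softmax(logits)
    # For a softmax policy, d log p(y) / d theta_k = 1[y == k] - p_k.
    return [rewards[sampled] * ((1.0 if k == sampled else 0.0) - probs[k])
            for k in range(len(logits))]


def policy_gradient(logits, rewards):
    """Exact policy gradient: the expectation of the single-sample estimate."""
    probs = softmax(logits)
    gradient = [0.0] * len(logits)
    for sampled, p in enumerate(probs):
        estimate = reinforce_estimate(logits, rewards, sampled)
        gradient = [g + p * e for g, e in zip(gradient, estimate)]
    return gradient


def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Numerical gradient of the expected reward, used only as a sanity check."""
    gradient = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        difference = expected_reward(up, rewards) - expected_reward(down, rewards)
        gradient.append(difference / (2 * eps))
    return gradient


logits = [0.2, -0.5, 1.0]   # toy parameters theta, one score per output
rewards = [0.1, 0.9, 0.4]   # toy rewards R(y) in [0, 1]
analytic = policy_gradient(logits, rewards)
numeric = finite_difference_gradient(logits, rewards)
```

<p>The two gradients agree up to numerical error. A single call to <code>reinforce_estimate</code> with a sampled index is what REINFORCE uses in place of the full expectation.</p>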
<h3 class="no_toc">Training</h3>
<p>The current practice in NMT is to approximate the policy gradient with either multinomial sampling from the softmax-normalized outputs of the NMT model, or by beam search.
The two objectives, MLE and expected reward, are trained either sequentially (e.g., supervised pre-training before reinforced fine-tuning, or alternating batches) or simultaneously (e.g., by linear interpolation).</p>
<h2 id="discussing-recent-trends-of-rl-in-nmt">Discussing Recent Trends of RL in NMT</h2>
<p><strong>If we care about BLEU, RL alone won’t help.</strong></p>
<p>In the recent EMNLP paper “A Study of RL for NMT”, <a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a> observe that RL-inspired training objectives have been shown to improve NMT quality, but usually come with tricks and are compared against rather weak baselines. Their question is now: Combining various variants of these tricks with learning from monolingual data, does RL still shine as expected?</p>
<p>To spoil the suspense right away, the study finds that using RL leads to marginal improvements over well-tuned baselines, also in combination with MLE and monolingual data (<strong>the good</strong>). However, the largest portions of improvement come from leveraging additional monolingual data (old news) (<strong>the ugly</strong>). But the RL-inspired approaches evaluated here lack comparisons to more evolved techniques, and assume access to reference translations (<strong>the bad</strong>). Let’s take a closer look!</p>
<h3 class="no_toc">RL Tricks</h3>
<h4 class="no_toc">Variance Reduction</h4>
<p>The variance of the <a href="#policy-gradient">gradient estimator</a> <strong>can</strong> be a problem for optimization, i.e. slow down or hinder convergence. The paper investigates the following solutions:</p>
<ul>
<li>Average reward baseline: Instead of using the reward directly, subtract its empirical average from the reward obtained.</li>
<li>Learned baseline: Subtract a learned reward instead of the empirical average. The learned reward is the output of a regression model, e.g. another neural network.</li>
</ul>
<p>The baseline was actually already proposed in the original REINFORCE paper and can be interpreted as an additive control variate (<a href="https://www.elsevier.com/books/simulation/ross/978-0-12-415825-2">Ross 2013</a>).
Actor-critic (AC) approaches go a step further and replace the reward obtained from the environment with a reward given by a critic that is trained to imitate the original reward (applied to NMT by e.g. <a href="https://arxiv.org/abs/1607.07086">Bahdanau et al. 2017</a>, <a href="https://arxiv.org/abs/1707.07402">Nguyen et al. 2017</a>).</p>
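<p>The effect of an average reward baseline can be made visible on a toy policy. The sketch below assumes a softmax policy over three outputs and uses the exact expected reward as the baseline, whereas in practice it would be an empirical average. It computes the exact mean and the total variance of the single-sample gradient estimator with and without the baseline. All numbers are invented.</p>

```python
def estimator_moments(probs, rewards, baseline):
    """Exact mean and total variance of (R(y) - b) * grad log p(y) for a softmax policy."""
    size = len(probs)
    mean = [0.0] * size
    second_moment = 0.0
    for y in range(size):
        # Gradient of log p(y) with respect to the softmax scores: 1[y == k] - p_k.
        estimate = [(rewards[y] - baseline) * ((1.0 if k == y else 0.0) - probs[k])
                    for k in range(size)]
        mean = [m + probs[y] * e for m, e in zip(mean, estimate)]
        second_moment += probs[y] * sum(e * e for e in estimate)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


probs = [1 / 3, 1 / 3, 1 / 3]   # toy policy over three outputs
rewards = [0.9, 0.8, 0.7]       # toy rewards
average_reward = sum(p * r for p, r in zip(probs, rewards))

mean_plain, variance_plain = estimator_moments(probs, rewards, baseline=0.0)
mean_baseline, variance_baseline = estimator_moments(probs, rewards, baseline=average_reward)
```

<p>In this toy case the mean of the estimator is unchanged by the baseline while its variance drops considerably. The size of the reduction depends on the reward distribution, so this should be read as an illustration rather than a general guarantee.</p>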
<p>Despite the reported effectiveness in practice, <a href="http://jmlr.csail.mit.edu/papers/volume5/greensmith04a/greensmith04a.pdf">Greensmith et al. 2004</a> showed that both above solutions are suboptimal and that one can actually <strong>learn</strong> an optimal baseline with minimal variance.</p>
<p>One important aspect that has been completely neglected in the present study is that the number of samples used for the Monte Carlo gradient estimate has an essential influence on the variance of the gradient. If rewards are simulated anyway, e.g., from references using sentence-level BLEU, why not sample multiple times and average the gradients over this subset? This may sound familiar, since this is exactly what is done in <strong>minimum risk training</strong> (proposed for NMT by <a href="https://arxiv.org/abs/1512.02433">Shen et al. 2016</a>).</p>
<p>In <a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a>’s empirical study, no beneficial effect was observed when using the learned baseline. This contradicts the experience in e.g., <a href="https://arxiv.org/abs/1607.07086">Bahdanau et al. 2017</a> and <a href="https://arxiv.org/abs/1704.06497">Kreutzer et al. 2017</a>. The conclusion that reward baselines are not necessary from “the economic perspective” (<a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a>) might be a bit hasty.</p>
<h4 class="no_toc">Reward Shaping</h4>
<p>If the reward is only obtained at the end of each sequence (here: translation), how does the model know where the errors are? The problem of <a href="https://scholarworks.umass.edu/dissertations/AAI8410337/">credit assignment</a> is often addressed by introducing methods for reward shaping (<a href="https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf">Ng et al. 1999</a>).
<a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a> investigate the implementation by <a href="https://arxiv.org/abs/1607.07086">Bahdanau et al. 2017</a>: For each element of the output, the individual reward is the difference between the BLEU score for the partial output including and the BLEU score for the partial output excluding the element: <script type="math/tex">R(y_t) = R(y_{1:t}) - R(y_{1:t-1})</script>. Note that the BLEU scores are computed with respect to the full reference output. Once again the references are exploited to <strong>simulate</strong> the rewards.</p>
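<p>The difference-based shaping can be sketched as follows. Note the assumptions: instead of sentence-level BLEU, the sketch uses a much simpler stand-in reward (clipped unigram matches divided by the reference length), and the reward of the empty prefix is taken to be zero. The function names are made up for this example.</p>

```python
from collections import Counter


def overlap_reward(prefix, reference):
    """Stand-in for sentence-level BLEU: clipped unigram matches over reference length."""
    if not reference:
        return 0.0
    matches = Counter(prefix) & Counter(reference)
    return sum(matches.values()) / len(reference)


def shaped_rewards(output, reference):
    """Per-token reward: score of the prefix including the token minus the score without it."""
    rewards = []
    previous = 0.0  # assumed reward of the empty prefix
    for t in range(1, len(output) + 1):
        current = overlap_reward(output[:t], reference)
        rewards.append(current - previous)
        previous = current
    return rewards


reference = "the cat sat on the mat".split()
output = "the dog sat on the mat".split()
per_token = shaped_rewards(output, reference)
full_reward = overlap_reward(output, reference)
```

<p>The wrong token “dog” receives a reward of zero, and the per-token rewards sum up to the reward of the complete output, so the shaping redistributes the sequence-level reward without changing its total.</p>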
<p>But does this even address the original problem of credit assignment? The problem arose because we had to wait for rewards from the environment until we completed a sequence of actions (in NMT: produced a complete translation). As soon as <strong>references</strong> are used, we are in principle not restricted to delayed rewards anymore. One could for example compare each word in the translation to the words in the reference translation and then come up with a token-based reward. Simple binary rewards were for example proposed in <a href="https://arxiv.org/abs/1806.07169">Petrushkov et al. 2018</a>.</p>
<p>As long as we simulate the rewards using references, we can cheat our way around the real problem.
When references are not available and you simply cannot compute BLEU scores for any arbitrary, partial translation – what would you do?</p>
<p>To this end, <a href="https://arxiv.org/abs/1707.07402">Nguyen et al. 2017</a> adopt the advantage actor-critic (A2C) framework (<a href="http://proceedings.mlr.press/v48/mniha16.pdf">Mnih et al. 2016</a>). A critic network predicts the expected future reward for each element, although the reward from the environment (here: BLEU) is only obtained at the end of the sequence. Unfortunately, the latter study does not include a comparison to RL approaches without reward shaping.</p>
<p>The empirical gains from reward shaping reported in <a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a>’s study are vanishingly small, which leaves the question of the usefulness of this method unanswered.</p>
<h3 class="no_toc">Using Monolingual Data</h3>
<h4 class="no_toc">Target-side</h4>
<p>Leveraging monolingual data for improving MT systems has become increasingly popular, since simple methods have been shown to be very effective for NMT. When target-side monolingual data is available, the trick of the trade is to use back-translation as demonstrated in <a href="https://arxiv.org/abs/1511.06709">Sennrich et al. 2016</a>. The only burden is that one has to train a system in the opposite translation direction. This system can then generate <strong>pseudo-sources</strong> for the available target data. The <a href="https://arxiv.org/abs/1806.04402">“hallucinated”</a> parallel data can then be used for standard training, with simulated rewards or without.</p>
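<p>As a rough sketch of the data flow in back-translation: the word-by-word dictionary below is a hypothetical stand-in for a trained target-to-source system, and all names and sentences are made up for illustration.</p>

```python
def back_translate(target_sentences, reverse_translate):
    """Pair each genuine target with a machine-generated pseudo-source."""
    return [(reverse_translate(tgt), tgt) for tgt in target_sentences]


# Hypothetical stand-in for a trained German-to-English (target-to-source) model.
TOY_DE_EN = {"das": "the", "haus": "house", "ist": "is", "klein": "small"}


def toy_reverse_translate(sentence):
    return " ".join(TOY_DE_EN.get(tok, "UNK") for tok in sentence.split())


parallel = [("the house", "das haus")]  # genuine (source, target) pairs
monolingual_targets = ["das haus ist klein", "das haus ist alt"]

pseudo_parallel = back_translate(monolingual_targets, toy_reverse_translate)
training_data = parallel + pseudo_parallel
print(pseudo_parallel)
```

<p>The second pseudo-source contains an unknown word, but its target side is still a genuine sentence, which is the property the next paragraph relies on.</p>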
<p>But isn’t it problematic to feed the NMT with fake data? Apparently not, at least as long as the targets are intact. <a href="https://arxiv.org/pdf/1808.09381.pdf">Edunov et al. 2018a</a> investigate this question systematically and surprisingly find that models get even better when the pseudo-sources are of low quality (not for small data, though). They hypothesize that the noise introduced actually enriches the training data and helps learning as e.g., in <a href="https://dl.acm.org/citation.cfm?id=1390294">denoising auto-encoders</a>.</p>
<h4 class="no_toc">Source-side</h4>
<p><a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a> propose to leverage not only target-side monolingual data, but also source-side monolingual data.
Evoking techniques developed in the context of <a href="http://ruder.io/semi-supervised/index.html#selftraining">self-training</a>, the idea is to let the model generate <strong>pseudo-targets</strong> for its own training. We have to assume that it is able to generate targets that are “good enough”, in the sense that the model can bootstrap itself. In practice, this is addressed by using beam search decoding for generating translations that are likely to have higher quality than sampled or greedy decoded targets.</p>
<p>Does the quality of the pseudo-targets matter? When they are part of the RL objective, they are only used to simulate rewards for sampled translations, which perhaps can absorb some of the noise. In supervised MLE training <a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a> add them to the much larger original parallel data – the small amount of extra noise might be negligible. However, this has not been investigated systematically.</p>
<h2 id="nmt-as-an-rl-problem">NMT as an RL problem</h2>
<p><strong>We only (mis-)use a subset of RL methods in NMT.</strong></p>
<p>The “Study of RL in NMT” is limited to a very specific scenario where policy gradient is used for fine-tuning of well-trained models. What about other RL algorithms? RL researchers have in fact dealt with reinforced objectives as above for decades and have developed more sophisticated training algorithms (such as <a href="https://arxiv.org/abs/1502.05477">Trust Region Policy Optimization</a> and <a href="https://arxiv.org/abs/1707.06347">Proximal Policy Optimization</a>) than vanilla policy gradient. But that’s to be discussed in another blog post.
Nevertheless, so far only policy gradient and actor-critic have become really popular for structured prediction tasks. So what’s wrong, are we just slow in adopting their algorithms?</p>
<p>In fact, it is not trivial to cast NMT, or more generally structured prediction, as a standard (PO)MDP problem, which is the basis for most RL algorithms: What is the environment? What is the state? Where does the reward come from? Translation researchers don’t agree on it (compare e.g. the definitions in <a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a>, <a href="https://arxiv.org/abs/1707.07402">Nguyen et al. 2017</a>, <a href="https://arxiv.org/abs/1607.07086">Bahdanau et al. 2017</a>). It is in fact often more suitable to cast it as a simpler contextual bandit problem, aka <strong>bandit structured prediction</strong> (e.g., <a href="http://papers.nips.cc/paper/6133-stochastic-structured-prediction-under-bandit-feedback">Sokolov et al. 2016</a>, <a href="https://arxiv.org/abs/1704.06497">Kreutzer et al. 2017</a>, <a href="https://openreview.net/forum?id=HJNMYceCW">Daumé III et al. 2018</a>), as Hal Daumé III discussed in <a href="https://nlpers.blogspot.com/2017/04/structured-prediction-is-not-rl.html">his blogpost on structured prediction and RL</a> – you may see it as a one-state MDP.</p>
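<p>To make the contextual bandit view concrete, here is a minimal sketch under strong simplifying assumptions: each input has only three candidate outputs, the policy is a table of logits per input, and the update is a plain REINFORCE step. The inputs, reward function and hyperparameters are invented for illustration.</p>

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(inputs, reward_fn, n_outputs, steps=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = {x: [0.0] * n_outputs for x in inputs}
    for _ in range(steps):
        x = rng.choice(inputs)                               # a context arrives
        probs = softmax(logits[x])
        y = rng.choices(range(n_outputs), weights=probs)[0]  # the policy picks one output
        delta = reward_fn(x, y)                              # feedback for that output only
        for k in range(n_outputs):
            # d log pi(y | x) / d logit_k = 1[k == y] - pi(k | x)
            grad_log = (1.0 if k == y else 0.0) - probs[k]
            logits[x][k] += lr * delta * grad_log
    return {x: softmax(logits[x]) for x in inputs}


best_output = {"q1": 2, "q2": 0}  # hidden from the learner, used only to produce rewards
policy = train_bandit(
    list(best_output),
    lambda x, y: 1.0 if y == best_output[x] else 0.0,
    n_outputs=3,
)
print(policy)
```

<p>There is no state transition here: every round is one input, one output and one reward, which is the sense in which the problem is a one-state MDP.</p>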
<p>What we can agree on is that in NMT we’re dealing with large and structured action spaces, where actions are discrete and rewards are sparse (and most of the time delayed) and potentially noisy. This calls for algorithms that are particularly suited to these conditions, but neither REINFORCE nor AC addresses these issues in particular.</p>
<p>In fact, training NMT from scratch with pure RL objectives, i.e. <a href="https://arxiv.org/abs/1709.09346"><strong>cold-start</strong> RL</a>, has so far not succeeded (despite <a href="https://arxiv.org/abs/1611.00179">Xia et al. 2016</a>’s optimism).</p>
<h2 id="rl-to-the-rescue">RL to the rescue?</h2>
<p><strong>RL can improve NMT because it fixes problems of our standard objective.</strong></p>
<p>What’s wrong with MLE training for NMT? <a href="https://arxiv.org/abs/1511.06732">Ranzato et al. 2016</a> elaborated on this when proposing the MIXER algorithm that mixes policy gradient-style updates with MLE. They identify the following problems:</p>
<ul>
<li>Exposure bias: During training reference targets are fed to the model (=teacher forcing), while during inference the model has to produce outputs based on its own previous outputs.</li>
<li>Token-level objective (aka “loss-evaluation mismatch” in <a href="https://arxiv.org/pdf/1606.02960.pdf">Wiseman and Rush 2016</a>): In standard autoregressive NMT models, the sequence-level <a href="#maximum-likelihood-estimation">log-likelihood</a> is decomposed as a sum over token-level log-likelihoods. Training hence optimizes for predicting the next token correctly given the perfect (reference) previous tokens. During inference, however, we’re measuring the model’s quality with metrics like BLEU that evaluate whole sequences of outputs.</li>
</ul>
<p>Algorithms like scheduled sampling (<a href="https://arxiv.org/abs/1506.03099">Bengio et al. 2015</a>), DAgger (<a href="https://arxiv.org/abs/1011.0686">Ross et al. 2011</a>) and DAD (<a href="https://www.ri.cmu.edu/pub_files/2015/1/Venkatraman.pdf">Venkatraman et al. 2015</a>) have been designed to reduce the exposure bias by gradually exposing the model to its own outputs during training (<strong>imitation learning</strong>).</p>
<p>The same effect is obtained when including some policy gradient in the training objective (e.g. in MIXER, MRT), since the gradient update is based on the log-likelihood of the model’s own output. The model is directly optimized towards a sentence-level reward that is closer to the corpus BLEU we’re evaluating our models with. Furthermore, it can help with other non-differentiable objectives than the <a href="#expected-reward">expected reward</a>, e.g., for adversarial training (<a href="https://arxiv.org/pdf/1704.06933.pdf">Wu et al. 2017</a>, <a href="https://arxiv.org/abs/1703.04887">Yang et al. 2017</a>). Or you might just use it to teach the NMT system what you actually want from it (beyond generating translations close to the reference), e.g., copying certain words of the input (<a href="https://arxiv.org/pdf/1809.03182.pdf">Pham et al. 2018</a>).</p>
<p>Large gains using RL have been reported under domain shift, i.e., gains over baseline models that are not fine-tuned on the evaluation domain (e.g., <a href="https://arxiv.org/abs/1704.06497">Kreutzer et al. 2017</a>, <a href="https://arxiv.org/abs/1806.07169">Petrushkov et al. 2018</a>) or when combined with classic objectives (e.g., <a href="https://arxiv.org/abs/1609.08144">Wu et al. 2016</a>, <a href="https://arxiv.org/abs/1511.06732">Ranzato et al. 2016</a>). The above discussed paper demonstrates that without these factors, expected improvements vanish.</p>
<p>Most commonly, RL is exploited as a first aid for obvious MLE problems, in a fully-supervised setting where references are available and rewards are <strong>simulated</strong>.
Why not use (or at least compare against) other training strategies that may be better suited for NMT and fix the above problems equally, as proposed e.g., in <a href="https://arxiv.org/abs/1711.04956">Edunov et al. 2018b</a>, <a href="https://arxiv.org/abs/1512.02433">Shen et al. 2016</a> and <a href="https://arxiv.org/abs/1609.00150">Norouzi et al. 2016</a>?</p>
<h2 id="beyond-supervised-learning">Beyond supervised learning</h2>
<p><strong>The challenges in RL for NLP lie outside supervised learning.</strong></p>
<p>So what about more realistic uses of RL, e.g., where rewards cannot simply be simulated, or reward signals are not given as well-defined functions, or not available in unlimited amounts? In NLP, the following scenarios are evident:</p>
<ul>
<li>Gold standard structures may not be available because of the <strong>cost or the lack of expertise</strong> of human annotators. Weaker signals such as human judgments on the quality of output structures may be easier to obtain and may require less expertise. This is the case for example in semantic parsing (<a href="https://arxiv.org/abs/1805.01252">Lawrence et al. 2018</a>) or in machine translation (<a href="https://arxiv.org/abs/1805.10627">Kreutzer et al. 2018b</a>).</li>
<li>In genuinely <strong>interactive</strong> settings where a system directly interacts with a human, the human responses can be interpreted as a weak signal for how to further improve the system. A prime example is dialogue, where learning from human feedback has successfully been implemented to train systems e.g., for small-talk (<a href="https://arxiv.org/abs/1709.02349">Serban et al. 2017</a>) and task-oriented dialogue (<a href="https://arxiv.org/abs/1606.02689">Su et al. 2016</a>).</li>
<li>Systems that need to be heavily <strong>customized</strong> towards a user or domain. User preferences or ratings (that usually come for free) can be used to specifically adapt the system. In industrial settings, large-scale collections of feedback have been utilised in personalized news recommendation (<a href="https://dl.acm.org/citation.cfm?doid=1772690.1772758">Li et al. 2010</a>) or e-commerce translations systems (<a href="https://arxiv.org/abs/1804.05958">Kreutzer et al. 2018a</a>).</li>
</ul>
<p>These scenarios bring challenges that can only partly be addressed by simulations and arise from the interaction with humans in real-life scenarios.
The human factor entails several differences to the popular simulation scenarios of RL. Firstly, human rewards are not well-defined functions, but complex and inconsistent signals. Secondly, humans cannot be expected to provide feedback for unlimited amounts of outputs. Exciting challenges (<a href="https://www.alexirpan.com/2018/02/14/rl-hard.html">“RL is hard”</a>) like the collection of reliable feedback, building robustness against adversarial feedback, fair evaluation, and off-policy learning, are ahead of us!</p>
<p>So instead of asking the question “How to get high BLEU with RL-objectives?” let’s move to “How to learn from rewards with RL when we depend on them?”.</p>
<p><strong>Acknowledgment: Thanks to Carolin Lawrence, Stefan Riezler and Joost Bastings for their valuable and much needed feedback for improving this post.</strong></p>
<p><strong>Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.</strong></p>
<p><strong>Comments, ideas and critical views are very welcome. We appreciate your feedback! If you want to cite this blogpost, use this <a href="/bibtex/rl-nmt.bibtex">bibfile</a>.</strong></p>Julia KreutzerDiscussing good, bad and ugly practices of reinforcement learning in neural machine translation.Taming Wild Reward Functions: The Score Function Gradient Estimator Trick2018-11-12T00:00:00+00:002018-11-12T00:00:00+00:00https://statnlp.github.io/score-function<!--# Taming Wild Reward Functions: The Score Function Gradient Estimator Trick-->
<p>MLE is often not enough to train sequence-to-sequence neural networks in NLP. Instead we employ an external metric, which is a reward function that can help us judge model outputs. The parameters of the network are then updated on the basis of the model outputs and corresponding rewards.</p>
<p>For this update, it is necessary to obtain a derivative.</p>
<p>But how can we do this if the external function is unknown or cannot be differentiated?</p>
<p><strong>Enter:</strong> The score function gradient estimator trick.</p>
<p><img src="https://statnlp.github.io/pics/luke.jpg" alt="" /></p>
<h2 id="why-mle-is-not-enough">Why MLE is not Enough</h2>
<p>Traditionally, neural networks are trained using Maximum Likelihood Estimation (MLE): given an input sequence <script type="math/tex">x = x_1, x_2, \dots x_{ \mid x \mid }</script> and a corresponding gold target sequence <script type="math/tex">\bar{y} = \bar{y}_1, \bar{y}_2, \dots \bar{y}_{ \mid \bar{y} \mid }</script>, we want to increase the probability that the current model <script type="math/tex">\pi</script> with parameters <script type="math/tex">w</script> assigns to the pair <script type="math/tex">(x,\bar{y})</script>. This gives the following loss function:</p>
<script type="math/tex; mode=display">% <![CDATA[
\mathcal{L}_{MLE} = - \sum_{j=1}^{ \mid \bar{y} \mid } \log \pi_w(\bar{y}_{j} \mid \bar{y}_{<j}, x), %]]></script>
<p>where <script type="math/tex">% <![CDATA[
\bar{y}_{<j} = \bar{y}_1, \bar{y}_2, \dots \bar{y}_{j-1}. %]]></script></p>
<p>The parameters <script type="math/tex">w</script> of <script type="math/tex">\pi</script> are then updated using stochastic gradient descent,</p>
<script type="math/tex; mode=display">w = w - \eta \nabla_w \mathcal{L}_{MLE}.</script>
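<p>A small numeric sketch of the two formulas above. The per-token probabilities and the gradient values are made-up numbers, not outputs of a real model, and the helper names are only illustrative.</p>

```python
import math


def mle_loss(gold_token_probs):
    """L_MLE = - sum_j log pi_w(ybar_j | ybar_<j, x)."""
    return -sum(math.log(p) for p in gold_token_probs)


def sgd_step(w, grad, eta):
    """w = w - eta * grad, applied per parameter."""
    return [w_i - eta * g_i for w_i, g_i in zip(w, grad)]


# Probabilities the model assigns to each gold token, given the gold prefix.
gold_token_probs = [0.5, 0.25, 0.8]
loss = mle_loss(gold_token_probs)

# Minimising the loss is the same as maximising the sequence probability,
# because exp(-loss) is the product of the token probabilities.
sequence_prob = math.exp(-loss)

new_w = sgd_step([1.0, 2.0], grad=[0.5, -1.0], eta=0.1)
print(loss, sequence_prob, new_w)
```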
<p>But there are various issues with using MLE that have led researchers to explore alternative objectives. Let’s look at them next.</p>
<h3 id="1-gold-targets-bary-are-not-available">1. Gold targets <script type="math/tex">\bar{y}</script> are not Available</h3>
<p>This is most prominently the case in many domains of semantic parsing for question-answering, where questions <script type="math/tex">x</script> are mapped to a semantic parse <script type="math/tex">y</script>, which can be executed to obtain an answer <script type="math/tex">a</script> . For many domains, it is easier to collect question-answer pairs, rather than question-parse pairs (e.g. see <a href="http://www.aclweb.org/anthology/D/D13/D13-1160.pdf">Berant et al. 2013</a>). But with no gold parses available, MLE cannot be applied.</p>
<p>What can we do instead?</p>
<p>The current model produces a set of likely parses (e.g. by sampling from the model distribution or by employing beam search). Each parse is then executed to obtain an answer. Next, we compare the answer to the gold answer to get a reward <script type="math/tex">\delta</script> . Generally, we have <script type="math/tex">\delta=0</script> if there is no overlap between answer and gold answer and <script type="math/tex">\delta=1</script> if they match exactly. With this, we can update the model’s parameters.</p>
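<p>The procedure can be sketched as follows. Everything here is a toy assumption: a parse is a single (column, value) filter, the database is a list of dictionaries, and partial overlap is scored with the Jaccard index, which is only one possible choice for the cases between 0 and 1.</p>

```python
def execute(parse, database):
    """Toy executor: a parse is a (column, value) filter that returns matching names."""
    column, value = parse
    return {row["name"] for row in database if row[column] == value}


def answer_reward(answer, gold_answer):
    if answer == gold_answer:
        return 1.0                      # exact match
    if not answer & gold_answer:
        return 0.0                      # no overlap
    # Assumption: partial overlap is scored with the Jaccard index.
    return len(answer & gold_answer) / len(answer | gold_answer)


database = [
    {"name": "A", "city": "Heidelberg", "cuisine": "thai"},
    {"name": "B", "city": "Heidelberg", "cuisine": "italian"},
    {"name": "C", "city": "Berlin", "cuisine": "thai"},
]
gold_answer = {"A", "B"}  # the only supervision we have for this question

# Candidate parses, e.g. obtained by sampling or beam search.
candidates = [("city", "Heidelberg"), ("cuisine", "thai"), ("city", "Berlin")]
rewards = [answer_reward(execute(parse, database), gold_answer) for parse in candidates]
print(rewards)
```

<p>No gold parse is needed at any point: the parses are judged only through the answers they produce.</p>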
<h3 id="2-exposure-bias-ranzato-et-al-2016">2. Exposure Bias: <a href="https://arxiv.org/pdf/1511.06732.pdf">Ranzato et al. 2016</a></h3>
<p>During traditional MLE training the model is fed the perfect tokens from the available gold target <script type="math/tex">\bar{y}</script> , but at test time the output sequence is produced on the basis of the model distribution. This causes a distribution mismatch and inferior performance.</p>
<p>How can we reduce this mismatch?</p>
<p>Instead, we can feed model output sequences already at training time. Typically, once an entire output sequence has been produced, this sequence is judged by an external metric and the resulting reward <script type="math/tex">\delta</script> can be used as feedback to update the model’s parameters.</p>
<h3 id="3-loss-evaluation-mismatch-wiseman--rush-2016">3. Loss-Evaluation Mismatch: <a href="http://www.aclweb.org/anthology/D16-1137">Wiseman & Rush 2016</a></h3>
<p>MLE is agnostic to the final evaluation metric. Ideally we would like to have the final evaluation metric in the objective used at training time, so that the parameters of the model are specifically tuned to perform well on the intended task.</p>
<p>How can we do that?</p>
<p>Similar to problem (2.), we can feed model output sequences at training time. In this case the external metric is the final evaluation metric. For example, in the case of machine translation, typically a per-sentence approximation of the BLEU score is used.</p>
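<p>One possible per-sentence approximation is sketched below. It assumes add-one smoothing of the n-gram precisions, which is just one of several smoothing schemes in use, so the numbers will not match those of standard BLEU tools.</p>

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Smoothed sentence-level BLEU (add-one smoothing is an assumption here)."""
    if not hypothesis:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp = ngrams(hypothesis, n)
        ref = ngrams(reference, n)
        matches = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        # Add-one smoothing keeps the logarithm finite when nothing matches.
        log_precision_sum += math.log((matches + 1) / (total + 1))
    brevity_penalty = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "the cat sat on the mat".split()
perfect = sentence_bleu(reference, reference)
partial = sentence_bleu("the cat sat on a mat".split(), reference)
unrelated = sentence_bleu("a b c d e f".split(), reference)
print(perfect, partial, unrelated)
```

<p>Without smoothing, a single sentence with no matching 4-gram would receive a score of zero, which makes plain BLEU a poor per-sentence reward.</p>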
<h2 id="maximise-the-expected-reward-obtained-for-model-outputs">Maximise the Expected Reward Obtained for Model Outputs</h2>
<p>To solve all three problems, we can instead maximise the expected reward <script type="math/tex">\delta</script> or, equivalently, minimise the expected risk <script type="math/tex">-\delta</script> . This can be formulated as the following expectation:</p>
<script type="math/tex; mode=display">\mathcal{L}_\delta = \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [-\delta],</script>
<p>where <script type="math/tex">p(x)</script> is the probability distribution over inputs <script type="math/tex">x</script> and <script type="math/tex">\pi_w(y \mid x)</script> is the probability distribution over outputs <script type="math/tex">y</script> given <script type="math/tex">x</script> .</p>
<p>In practice, this expectation has to be approximated. For example, using Monte-Carlo sampling leads to the REINFORCE algorithm (<a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">Williams 1992</a>): we sample one output <script type="math/tex">y</script> from the model distribution <script type="math/tex">\pi_w(y \mid x)</script> (see also Chapter 13 of <a href="https://drive.google.com/file/d/1opPSz5AZ_kVa1uWOdOiveNiBFiEOHjkG/view">Sutton & Barto 2018</a>).
Approximating the expectation over <script type="math/tex">y</script>, the actual training objective becomes:</p>
<script type="math/tex; mode=display">\mathcal{L}_{REINFORCE} = - \delta \pi_w(y \mid x) \approx \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [-\delta].</script>
<p>The goal of this objective is to increase the probability of an output proportionally to its reward. The gradient of the REINFORCE objective is an unbiased estimate of the gradient of the <script type="math/tex">\mathcal{L}_\delta</script> objective.</p>
<p>Alternatively, we can use Minimum Risk Training (MRT) (<a href="http://aclweb.org/anthology/P06-2101">Smith & Eisner ‘06</a>, <a href="anthology.aclweb.org/P/P16/P16-1159.pdf">Shen et al. 2016</a>). Here, several outputs are sampled from the model distribution. This stabilises learning, but requires that more outputs are evaluated to get corresponding rewards. Assuming <script type="math/tex">S</script> sampled outputs, the objective then takes the following form:</p>
<script type="math/tex; mode=display">\mathcal{L}_{MRT} = - \frac{1}{S} \sum_{s=1}^{S} \delta_s \pi_w(y_s \mid x) \approx \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [-\delta].</script>
<p>Due to sampling, both approaches can suffer from high variance, which can be combatted using control variates (see for example Chapter 9 of <a href="https://www.elsevier.com/books/simulation/ross/978-0-12-415825-2">Ross 2013</a>).</p>
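<p>The effect of a control variate can be checked exactly on a toy example. The sketch below assumes a softmax policy over three outputs and uses the average reward as a constant baseline, which is a common but not necessarily optimal choice. It looks at the single-sample gradient estimate, the reward times the gradient of the log-probability, and enumerates all outputs instead of sampling.</p>

```python
def grad_log_softmax(probs, y):
    """d log pi(y) / d logit_k = 1[k == y] - pi(k) for a softmax policy."""
    return [(1.0 if k == y else 0.0) - p for k, p in enumerate(probs)]


def estimator_stats(probs, rewards, baseline):
    """Exact mean and total variance of the estimate -(delta - b) * grad log pi(y)."""
    dim = len(probs)
    estimates = [
        [-(rewards[y] - baseline) * g for g in grad_log_softmax(probs, y)]
        for y in range(dim)
    ]
    mean = [sum(probs[y] * estimates[y][k] for y in range(dim)) for k in range(dim)]
    variance = sum(
        probs[y] * (estimates[y][k] - mean[k]) ** 2
        for y in range(dim)
        for k in range(dim)
    )
    return mean, variance


probs = [0.5, 0.3, 0.2]    # pi_w(y | x) for three candidate outputs
rewards = [0.9, 0.8, 0.7]  # delta for each output

mean_plain, var_plain = estimator_stats(probs, rewards, baseline=0.0)
average_reward = sum(p * r for p, r in zip(probs, rewards))
mean_base, var_base = estimator_stats(probs, rewards, baseline=average_reward)
print(var_plain, var_base)
```

<p>The baseline leaves the expected gradient untouched, because the expected gradient of the log-probability is zero, but it shrinks the variance considerably when all rewards lie in a similar range.</p>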
<h2 id="the-problem-the-reward-function-cannot-be-derived">The Problem: The Reward Function cannot be Differentiated</h2>
<p>To minimize <script type="math/tex">\mathcal{L}_{\delta}</script> with stochastic gradient descent, it is necessary to calculate <script type="math/tex">\nabla_w \mathcal{L}_{\delta}</script> , also called the policy gradient in Reinforcement Learning (RL) terms.</p>
<p>But in practice, the rewards <script type="math/tex">\delta</script> typically either come from an unknown function (e.g. if rewards are collected from human users) or the underlying function is not differentiable (e.g. in the case of BLEU).</p>
<p>As such, it is not immediately clear how to differentiate <script type="math/tex">\mathcal{L}_{\delta}</script>, i.e. how to calculate
<script type="math/tex">\nabla_w\mathcal{L}_{\delta}.</script></p>
<h2 id="the-solution-score-function-gradient-estimator">The Solution: Score Function Gradient Estimator</h2>
<p>To be able to calculate <script type="math/tex">\nabla_w\mathcal{L}_{\delta}</script> , we use two tricks:</p>
<h3 id="1-the-log-derivative-trick">1. The <script type="math/tex">\log</script> Derivative Trick</h3>
<p>The derivative of the logarithm is:</p>
<script type="math/tex; mode=display">\nabla_w \log f = \frac{\nabla_w f}{f}.</script>
<h3 id="2-the-identity-trick">2. The Identity Trick</h3>
<script type="math/tex; mode=display">f = \frac{g}{g} f</script>
<p>Now we can formulate what is known as the score function gradient estimator (<a href="https://www.sciencedirect.com/science/article/abs/pii/S0927050706130194">Fu ‘06</a>):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align} \nabla_w \mathcal{L}_\delta &= \nabla_w \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [- \delta] & (1) \\
&= \nabla_w \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \pi_w(y \mid x)\textrm{d}y & (2) \\
&= \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \nabla_w\pi_w(y \mid x)\textrm{d}y & (3) \\
&= \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \frac{\nabla_w \pi_w(y \mid x)}{\pi_w(y \mid x)} \, \cdot \, \pi_w(y \mid x)\textrm{d}y & (4) \\
&= \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \nabla_w \log \pi_w(y \mid x) \, \cdot \, \pi_w(y \mid x)\textrm{d}y & (5) \\
&= \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [-\delta \nabla_w \log \pi_w(y \mid x)]. & (6) \end{align} %]]></script>
<p>Let’s investigate for each line what happened:</p>
<ul>
<li>(2): The expectation is expanded into two integrals. <script type="math/tex">\mathbb{E}_{p(x)}</script> becomes <script type="math/tex">\int_{x} \dots p(x)\textrm{d}x</script> and <script type="math/tex">\mathbb{E}_{\pi_w(y \mid x)}</script> turns into <script type="math/tex">\int_{y} \dots \pi_w(y \mid x)\textrm{d}y</script> .</li>
<li>(3): Integral and differentiation can be switched, so we move <script type="math/tex">\nabla_w</script> in front of <script type="math/tex">\pi_w(y \mid x)</script> because <script type="math/tex">\pi_w(y \mid x)</script> is the only term dependent on <script type="math/tex">w</script> .</li>
<li>(4): We use the identity trick with <script type="math/tex">g = \pi_w(y \mid x)</script> .</li>
<li>(5): We use the <script type="math/tex">\log</script> derivative trick.</li>
<li>(6): We still have <script type="math/tex">\pi_w(y \mid x)\textrm{d}y</script> available. With this, we can transform the expression back into an expectation. But in contrast to before, we now have <script type="math/tex">\nabla_w \log \pi_w(y \mid x)</script> and this derivative is simply scaled by <script type="math/tex">\delta</script> .</li>
</ul>
<p><strong><script type="math/tex">\rightarrow</script> We no longer need to know what the function that produces <script type="math/tex">\delta</script> looks like, nor do we need to differentiate it.</strong></p>
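<p>The identity can be verified numerically on a toy problem. The sketch assumes a single input, a softmax policy over three outputs whose logits are the parameters, and a hypothetical reward table that is only ever queried, never differentiated.</p>

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def black_box_reward(y):
    """Stand-in for a reward we can query but not differentiate."""
    return [0.0, 0.5, 1.0][y]


def expected_loss(w):
    """L_delta for one input: sum_y pi_w(y) * (-delta(y))."""
    return sum(p * -black_box_reward(y) for y, p in enumerate(softmax(w)))


def grad_log_pi(w, y):
    """d log pi_w(y) / d w_k = 1[k == y] - pi_w(k) for a softmax over logits."""
    return [(1.0 if k == y else 0.0) - p for k, p in enumerate(softmax(w))]


def score_function_gradient(w):
    """Line (6) with the expectation over y computed exactly."""
    grad = [0.0] * len(w)
    for y, p in enumerate(softmax(w)):
        for k, g in enumerate(grad_log_pi(w, y)):
            grad[k] += p * -black_box_reward(y) * g
    return grad


def sampled_gradient(w, n_samples, seed=0):
    """Line (6) with the expectation replaced by an average over samples."""
    rng = random.Random(seed)
    probs = softmax(w)
    grad = [0.0] * len(w)
    for _ in range(n_samples):
        y = rng.choices(range(len(w)), weights=probs)[0]
        for k, g in enumerate(grad_log_pi(w, y)):
            grad[k] += -black_box_reward(y) * g / n_samples
    return grad


def finite_difference_gradient(w, eps=1e-6):
    """Reference gradient of L_delta that treats the whole loss as a black box."""
    grad = []
    for k in range(len(w)):
        up, down = list(w), list(w)
        up[k] += eps
        down[k] -= eps
        grad.append((expected_loss(up) - expected_loss(down)) / (2 * eps))
    return grad


w = [0.2, -0.1, 0.4]
exact = score_function_gradient(w)
numeric = finite_difference_gradient(w)
sampled = sampled_gradient(w, n_samples=20000)
print(exact, numeric, sampled)
```

<p>Only the policy is differentiated. The reward enters as a plain number that scales each term.</p>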
<p>For an alternative view on the subject, also see <a href="http://blog.shakirm.com/2015/11/machine-learning-trick-of-the-day-5-log-derivative-trick/">this great blog post</a>.</p>
<h2 id="when-can-it-be-applied">When can it be applied?</h2>
<p>The score function gradient estimator can be applied independently of the underlying model, as long as the model is differentiable with respect to its parameters <script type="math/tex">w</script>.</p>
<p>E.g. if <script type="math/tex">\pi_w(y \mid x)</script> is a log-linear model with feature vectors <script type="math/tex">\phi(x,y)</script> ,</p>
<script type="math/tex; mode=display">\pi_w(y \mid x) = \frac{e^{ w \phi(x,y)}}{\sum_{y'\in \mathbf{Y}(x)} e^{ w \phi(x, y')}},</script>
<p>then the derivative would be</p>
<script type="math/tex; mode=display">\nabla_w \log \pi_w(y \mid x) = \phi(x,y) - \sum_{y'\in \mathbf{Y}(x)} \phi(x, y')\pi_w(y' \mid x).</script>
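<p>The log-linear gradient, the observed feature vector minus the expected feature vector, can be checked against finite differences. The feature vectors and weights below are arbitrary illustrative numbers.</p>

```python
import math


def log_linear_probs(w, features):
    """pi_w(y | x) for every candidate y, where features[y] = phi(x, y)."""
    scores = [sum(w_d * f_d for w_d, f_d in zip(w, phi)) for phi in features]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(w, features, y):
    """phi(x, y) minus the expected feature vector under pi_w."""
    probs = log_linear_probs(w, features)
    expected = [
        sum(p * phi[d] for p, phi in zip(probs, features)) for d in range(len(w))
    ]
    return [features[y][d] - expected[d] for d in range(len(w))]


def finite_difference(w, features, y, eps=1e-6):
    """Numerical gradient of log pi_w(y | x), used only as a check."""
    grad = []
    for d in range(len(w)):
        up, down = list(w), list(w)
        up[d] += eps
        down[d] -= eps
        diff = math.log(log_linear_probs(up, features)[y]) - math.log(
            log_linear_probs(down, features)[y]
        )
        grad.append(diff / (2 * eps))
    return grad


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # phi(x, y) for three candidates
w = [0.3, -0.2]
analytic = grad_log_prob(w, features, y=2)
numeric = finite_difference(w, features, y=2)
print(analytic, numeric)
```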
<p>In the case of neural networks, backpropagation is applied to compute <script type="math/tex">\nabla_w \pi_w(y \mid x)</script> (see for example Chapter 3 of <a href="https://arxiv.org/abs/1511.07916">Cho 2015</a>).</p>
<h2 id="lessons-learnt">Lessons Learnt</h2>
<ul>
<li>MLE sometimes cannot be applied, or it leads to inferior performance.</li>
<li>Instead, we leverage rewards from an external metric that evaluates the quality of our model outputs.</li>
<li>The metric might be unknown or not differentiable: (stochastic) gradient descent cannot be applied directly.</li>
<li>The score function gradient estimator helps us side-step this problem.</li>
</ul>
<p><strong>Acknowledgment: Thanks to Julia Kreutzer for her valuable and much needed feedback for improving this post.</strong></p>
<p><strong>Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.</strong></p>
<p><strong>Comments, ideas and critical views are very welcome. We appreciate your feedback! If you want to cite this blogpost, use this <a href="/bibtex/2018-11-12_score_function.bibtex">bibfile</a>.</strong></p>Carolin LawrenceThis post explains the need for the score function gradient estimator trick and how it works.