What to talk about and how? Selective Generation using LSTMs with Coarse-to-Fine Alignment

Hongyuan Mei, Mohit Bansal, Matthew R. Walter. Toyota Technological Institute at Chicago, Chicago, IL 60637. {hongyuan,mbansal,mwalter}@ttic.edu

Introduction

The task: given a structured database (a set of records), generate a natural-language description. There are two main phases: content selection (what to talk about) and surface realization (how to say it).

[figure: task illustration]

Previous work is either:

  • template-based
  • domain-dependent (relying on parsers, NER tools, hand-crafted features)
  • learning content selection and surface realization separately

A neural encoder-aligner-decoder

  • encoder is essential
  • coarse-to-fine aligner because selective generation requires identifying the small number of salient records among an over-determined database
  • beam search vs. greedy decoding

Model

LSTM-based encoder-decoder with attention:

[figure: encoder-aligner-decoder architecture]

Common probabilistic model
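The notes don't spell this out, but the heading presumably refers to the standard autoregressive factorization shared by encoder-decoder models (my reconstruction; $w$ is the output sentence, $r$ the set of input records):

$$P(w \mid r) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, r)$$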

LSTM Encoder

each output $h_j$ is the concatenation of the forward and backward LSTM outputs
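In symbols (standard bidirectional-LSTM notation, not copied from the paper):

$$h_j = [\overrightarrow{h}_j \,;\, \overleftarrow{h}_j]$$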

Coarse-to-fine Aligner

Only a small subset of salient records is relevant to the output sentence. The aligner selects them in four steps:

1. Concatenate each record's hidden representation $h_j$ with its original record embedding $r_j$.

2. Use a pre-selector to assign a probability $p_j$ to each record.

   The sum $\sum_j p_j$ can be regarded as an approximation of the total number of selected records (gamma).

3. Use a standard aligner to compute alignment probabilities at each timestep $t$.

4. The refiner produces the final selection decision (a sketch of all four steps follows this list).
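A minimal PyTorch sketch of how the four steps could be wired together. All module and parameter names (CoarseToFineAligner, W_h, W_s, v) are mine, and the exact scoring functions are assumptions rather than the paper's parameterization:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineAligner(nn.Module):
    def __init__(self, enc_dim, rec_dim, dec_dim, att_dim):
        super().__init__()
        # steps 1-2: pre-selector scores the concatenation [h_j ; r_j]
        self.pre_sel = nn.Sequential(
            nn.Linear(enc_dim + rec_dim, att_dim), nn.Tanh(),
            nn.Linear(att_dim, 1))
        # step 3: Bahdanau-style alignment scorer
        self.W_h = nn.Linear(enc_dim, att_dim, bias=False)
        self.W_s = nn.Linear(dec_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def pre_select(self, h, r):
        # h: (J, enc_dim) encoder outputs, r: (J, rec_dim) record embeddings
        p = torch.sigmoid(self.pre_sel(torch.cat([h, r], dim=-1)))
        # p.sum() approximates gamma, the expected number of selected records
        return p.squeeze(-1)  # (J,)

    def forward(self, h, p, s_prev):
        # s_prev: (dec_dim,) decoder state before emitting word t
        scores = self.v(torch.tanh(self.W_h(h) + self.W_s(s_prev))).squeeze(-1)
        beta = F.softmax(scores, dim=-1)         # step 3: coarse alignment
        alpha = p * beta / (p * beta).sum()      # step 4: refiner re-weights
        return alpha, alpha @ h                  # weights and context vector
```

The refiner's re-normalization is what makes the alignment coarse-to-fine: a record the pre-selector gives a low $p_j$ can never dominate the attention, no matter how high its alignment score $\beta_{t,j}$.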

Decoder

Training

Regularization terms:

  • the pre-selector probabilities should sum to a value close to gamma
  • at least one record should be selected (one plausible formalization is sketched below)
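The notes do not record the exact form of the penalty. A plausible reconstruction (my assumption, not the paper's equation): the first term ties the pre-selector mass to gamma, and the second blows up when every $p_j$ approaches zero, forcing at least one record to be selected:

$$\mathcal{R} = \left(\sum_j p_j - \gamma\right)^2 - \lambda \log \max_j p_j$$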

Experiments

Setup:

  • hidden units = 500 from {250, 500, 750} (WeatherGov)
  • gamma = 8.5 from {6.5, 7.5, 8.5, 10.5, 12.5} (WeatherGov)
  • gamma = 5.0 from {1.0, 2.0, …, 6.0} (RoboCup)
  • mini-batch = 100
  • Adam optimizer; trained to convergence (up to 30 epochs)

Metric:

  • F1 for content selection
  • sBLEU (standard BLEU, up to 4-grams) for surface realization
  • cBLEU (customized BLEU), which does not penalize numerical deviations within 5 (e.g., "low 58" counts as matching "low 60"; a toy version of this match is sketched below)
  • sBLEUg: sBLEU for the case where the ground-truth content selection is given
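For intuition, a toy version of the relaxed token match that cBLEU uses (my own sketch, not the authors' implementation):

```python
def tokens_match(cand: str, ref: str, tol: int = 5) -> bool:
    """Numeric tokens count as equal when within `tol` of each other,
    so "58" matches "60"; everything else requires an exact match."""
    try:
        return abs(int(cand) - int(ref)) <= tol
    except ValueError:
        return cand == ref

# tokens_match("58", "60") -> True; tokens_match("58", "65") -> False
```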

Primary results

[table: primary results]

Beam Filter with k-NN

Plain beam search performs worse than greedy decoding:

[table: beam search vs. greedy results]

k-NN filter steps:

  1. Run a standard beam search with beam size M.
  2. Find the K nearest neighbors, i.e., (database, description) pairs, in the training data for the given record set.
  3. Compute the BLEU score of each of the M candidates against these K neighbor descriptions.
  4. Choose the best-scoring candidate (unclear whether scores are summed over the K neighbors or only the best one counts); a sketch follows below.
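A minimal sketch of the filter. The helpers passed in (beam_search, find_knn, bleu) are hypothetical stand-ins for the real components:

```python
def knn_beam_filter(records, beam_search, find_knn, bleu, M=10, K=5):
    """Re-rank beam candidates by BLEU against the descriptions of the
    K training examples whose databases are closest to `records`."""
    candidates = beam_search(records, beam_size=M)     # step 1
    neighbor_descs = find_knn(records, k=K)            # step 2
    # steps 3-4: score each candidate against the K neighbor descriptions;
    # summing over K is shown here, but the notes are unclear on this point
    return max(candidates,
               key=lambda c: sum(bleu(c, ref) for ref in neighbor_descs))
```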

[table: k-NN beam filter results]

Ablation on WeatherGov

Aligner (compared with the attention mechanism from NMT; Bahdanau et al. 2014):

[table: aligner ablation results]

Encoder:

[table: encoder ablation results]

Qualitative analysis on WeatherGov

  • good matches: windDir with "southeast", temperature with "71"
  • bad matches: cloudy with the temperature/precipitation records

[figure: alignment visualization]

Embedding matrix

  • initialize with the pretrained embeddings and then fine-tune (works better)
  • concatenate the learned embedding matrix with the pretrained one
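A sketch of the two strategies in PyTorch; all names are mine and the dimensions are placeholders:

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 100
pretrained = torch.randn(vocab_size, dim)  # stand-in for real pretrained vectors

# (a) initialize with the pretrained matrix, then fine-tune during training
emb_finetune = nn.Embedding.from_pretrained(pretrained, freeze=False)

# (b) concatenate a freshly learned matrix with frozen pretrained vectors
emb_learned = nn.Embedding(vocab_size, dim)
emb_frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

def embed_concat(ids: torch.Tensor) -> torch.Tensor:
    # output dimension is 2 * dim
    return torch.cat([emb_learned(ids), emb_frozen(ids)], dim=-1)
```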

[table: embedding strategy results]