NEURAL PROGRAMMER: INDUCING LATENT PROGRAMS WITH GRADIENT DESCENT

  • Arvind Neelakantan University of Massachusetts Amherst, arvind@cs.umass.edu

  • Quoc V. Le Google Brain, qvl@google.com

  • Ilya Sutskever Google Brain, ilyasu@google.com

Introduction

Because:

  • Neural network models struggle with arithmetic and logic tasks; they even fail to add two numbers of fewer than 10 bits (Joulin and Mikolov, 2015)
  • Tasks like question answering require complex reasoning

Contributions:

  • Neural Programmer: a neural network augmented with arithmetic and logic operations
  • Learns from weak supervision; no annotated programs are needed
  • Fully differentiable, so it can be trained end-to-end with gradient descent

Synthetic Dataset (with simple question templates)

  • tables $\in \{\mathbb{R}, \text{text}\}^{M \times C}$ (each table has M rows and C columns; every entry is either a number or a text string)

Key idea: at each step, select an operation (and the data it applies to) and save the result, repeated for a total of T steps (T is chosen in advance); a minimal sketch follows.
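A minimal runnable sketch of this T-step loop, assuming a toy setup with two operations on a single numeric column; the soft (probability-weighted) selection is what keeps everything differentiable. Names like `run_toy_program` are illustrative, not from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy setup: 2 operations (sum, count) over a single numeric column.
# At each of T steps the model softly selects an operation and writes the
# blended result into a running scalar answer.
def run_toy_program(column, op_scores_per_step, T=2):
    scalar_answer = 0.0
    row_select = np.ones(len(column))          # soft row selection, init to 1
    ops = [lambda c, r: float((c * r).sum()),  # sum over selected rows
           lambda c, r: float(r.sum())]        # count of selected rows
    for t in range(T):
        alpha = softmax(op_scores_per_step[t])   # selector output at step t
        results = np.array([op(column, row_select) for op in ops])
        scalar_answer = float(alpha @ results)   # soft "save the result"
    return scalar_answer

# Step 2 strongly prefers op 0 (sum), so the answer is close to 3 + 5 + 2 = 10.
print(run_toy_program(np.array([3.0, 5.0, 2.0]),
                      op_scores_per_step=np.array([[0.1, 0.2], [5.0, -5.0]])))
```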

QQ20161227-0.png

This resembles an ALU, or the intraparietal sulcus (IPS) in the brain.

ALU_block.gif, 800px-Brain_diagram_ja.svg.png

Model

Four components:

  • Question RNN
  • Selector
  • Operators
  • History RNN

Question RNN

QQ20161227-1.png

init:

  • $z_0 = [0]^d$ (the initial hidden state is the zero vector)
  • numbers are removed from the question and saved in a separate list
  • a bi-directional RNN is used for long questions (details in the experiments); a sketch of the encoder follows this list
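A minimal sketch of the question encoder under these assumptions (a plain tanh recurrence and toy dimensions; the paper's exact cell may differ):

```python
import numpy as np

# Toy question RNN: embeds each word and runs a simple tanh recurrence.
# Numbers are stripped from the question beforehand and kept separately.
d = 8                                      # hidden/embedding size (toy value)
rng = np.random.default_rng(0)
W_x, W_h = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d))
vocab = {"what": 0, "is": 1, "the": 2, "sum": 3, "of": 4, "A": 5}
E = rng.normal(0, 0.1, (len(vocab), d))    # word embeddings

def encode_question(tokens):
    z = np.zeros(d)                        # z_0 = [0]^d
    states = []
    for w in tokens:
        z = np.tanh(W_x @ E[vocab[w]] + W_h @ z)
        states.append(z)
    return z, np.stack(states)             # q = last state; all states kept

q, Z = encode_question(["what", "is", "the", "sum", "of", "A"])
print(q.shape, Z.shape)                    # (8,) (6, 8)
```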

Selector

A standard softmax over the operations, conditioned on the question representation q and the history RNN's hidden state h; a hedged sketch follows.
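A hedged sketch of the operation selector: score each operation from q and h, then softmax. The exact parameterization (a tanh layer over the concatenation [q; h] dotted with per-operation vectors) is an assumption, not copied from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, n_ops = 8, 4
rng = np.random.default_rng(1)
W = rng.normal(0, 0.1, (d, 2 * d))        # projects the concatenation [q; h]
U_op = rng.normal(0, 0.1, (n_ops, d))     # one learned vector per operation

def select_op(q, h):
    hidden = np.tanh(W @ np.concatenate([q, h]))
    return softmax(U_op @ hidden)         # alpha_t^op: distribution over ops

print(select_op(rng.normal(size=d), rng.normal(size=d)))
```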

Operations

QQ20161227-2.png

Data Selection (over columns)

QQ20161227-3.png

Column Representation:

  • each column's representation is obtained by encoding its column name with the question RNN's parameters; a hedged formula for the resulting column softmax follows
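By analogy with the operation selector, the column (data) selection can be written as a softmax over column-name representations $P_j$; the exact form below is an assumption for illustration, not quoted from the paper:

$$
\alpha_t^{\text{col}}(j) \;=\; \frac{\exp\!\big(P_j^{\top} \tanh(W^{\text{col}}[q;\, h_t])\big)}{\sum_{k=1}^{C} \exp\!\big(P_k^{\top} \tanh(W^{\text{col}}[q;\, h_t])\big)}
$$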

Operations

There’re two types of outputs: scalar and item list

Maintain output variables, to which the operators all have access:

  • ${scalar_answer}_t\in R$, scalar answer of t step, init to 0
  • ${lookup_answer}_t\in [0, 1]^{M\times C}$, selection prob. of entry (i, j), init to 0
  • ${row_select}_t\in [0, 1]^M$, selection prob. of row i, init to 1
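A tiny sketch of these variables with toy sizes (M = 4 rows, C = 3 columns):

```python
import numpy as np

M, C = 4, 3
scalar_answer = 0.0                 # scalar_answer_t, initialized to 0
lookup_answer = np.zeros((M, C))    # lookup_answer_t, initialized to 0
row_select    = np.ones(M)          # row_select_t, initialized to 1 (all rows)
```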

Operations:

QQ20161227-4.png

For the pivot values used by the greater and lesser operators:

The matrix Z holds the representations of the n words to the left of each number in the question; this works because the generated questions follow fixed patterns, so the words just before a number indicate how it should be used. A hedged sketch follows.
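A hedged sketch of computing a pivot: attend over the representations of the words preceding each number and use the weights to mix the numbers that were stripped from the question. The scoring form and the names (`u_greater`, `pivot`) are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8
rng = np.random.default_rng(2)
u_greater = rng.normal(0, 0.1, d)      # per-operation attention vector (assumed)

def pivot(Z_left, numbers, u_op):
    # Z_left: one row per number, built from the words to that number's left;
    # numbers: the literals removed from the question during preprocessing.
    beta = softmax(Z_left @ u_op)      # weight each candidate number
    return float(beta @ np.array(numbers))

Z_left = rng.normal(size=(2, d))       # question contains two numbers
print(pivot(Z_left, [50.0, 20.0], u_greater))
```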

Update Variables

QQ20161227-5.png
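The update itself is a probability-weighted mixture of every operation's output on every column; a hedged reconstruction (not the paper's exact equations) for the scalar case:

```python
import numpy as np

# Soft update of the scalar answer at step t: apply every operation to every
# column and mix the results with the operation / column probabilities.
def update_scalar(alpha_op, alpha_col, op_results):
    # op_results[i][j] = scalar produced by operation i applied to column j
    return float(alpha_op @ (np.asarray(op_results) @ alpha_col))

alpha_op  = np.array([0.7, 0.3])       # e.g. P(sum), P(count)
alpha_col = np.array([0.9, 0.1])       # distribution over 2 columns
op_results = [[10.0, 4.0],             # sums of column 0 and column 1
              [3.0, 3.0]]              # counts (same for both columns)
print(update_scalar(alpha_op, alpha_col, op_results))  # 7.48
```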

Text Entries

Add a text match operation: for each text column, attend over the question's hidden states to build a column-specific question representation (simply averaging the hidden states to build the column representation failed). A sketch of the three steps follows the figure below.

Step 1: Column Embedding (M text entries and K text columns)

Step 2: Attention

Step 3: Text Match selection:

Update variables:

QQ20161227-6.png
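A hedged sketch of the three steps above with toy shapes; the attention and matching forms are assumptions rather than the paper's exact equations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
d, M, K, L = 8, 4, 2, 6                  # dims, rows, text columns, question length
Z = rng.normal(size=(L, d))              # question RNN hidden states
entries = rng.normal(size=(M, K, d))     # Step 1: embedded text entries

# Step 2: per-column attention over the question hidden states yields a
# column-specific question vector (a plain average reportedly did not work).
col_query = rng.normal(size=(K, d))      # assumed per-column query vectors
attn = softmax(col_query @ Z.T)          # (K, L) attention weights
col_question = attn @ Z                  # (K, d)

# Step 3: text-match selection: score each entry against its column's
# question vector and squash to [0, 1].
scores = np.einsum('mkd,kd->mk', entries, col_question)
text_match = 1.0 / (1.0 + np.exp(-scores))   # (M, K) soft match per entry
print(text_match.shape)                  # (4, 2)
```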

History RNN

QQ20161227-7.png
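A hedged sketch of the history RNN: its input at each step summarizes which operation and column were (softly) chosen, so later steps know what has already been executed. The embeddings and the tanh cell are assumptions.

```python
import numpy as np

d, n_ops, n_cols = 8, 4, 3
rng = np.random.default_rng(4)
U_op  = rng.normal(0, 0.1, (n_ops, d))   # operation embeddings (assumed)
P_col = rng.normal(0, 0.1, (n_cols, d))  # column embeddings (assumed)
W_in  = rng.normal(0, 0.1, (d, 2 * d))
W_h   = rng.normal(0, 0.1, (d, d))

def history_step(h_prev, alpha_op, alpha_col):
    # Probability-weighted summaries of the previous selections.
    summary = np.concatenate([alpha_op @ U_op, alpha_col @ P_col])
    return np.tanh(W_in @ summary + W_h @ h_prev)

h = history_step(np.zeros(d),
                 np.full(n_ops, 1 / n_ops),
                 np.full(n_cols, 1 / n_cols))
print(h.shape)                           # (8,)
```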

Training

hyper-parameters: $\lambda$ and $\delta$ (both enter the training loss; a hedged sketch follows)
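A hedged reconstruction of how $\delta$ and $\lambda$ might enter the loss: a Huber loss with threshold $\delta$ when the answer is a scalar, an entry-wise log loss on lookup_answer otherwise, and $\lambda$ trading the two answer types off against each other. Which term $\lambda$ weights, and how, is an assumption here, not quoted from the paper.

```python
import numpy as np

def huber(pred, target, delta):
    # Quadratic near the target, linear far away (robust to outliers).
    a = abs(pred - target)
    return 0.5 * a ** 2 if a <= delta else delta * a - 0.5 * delta ** 2

def lookup_loss(pred, target, eps=1e-9):
    # Entry-wise binary log loss over the M x C lookup_answer matrix.
    return float(-np.mean(target * np.log(pred + eps)
                          + (1 - target) * np.log(1 - pred + eps)))

def total_loss(is_scalar, scalar_pred, scalar_tgt,
               lookup_pred, lookup_tgt, delta=10.0, lam=50.0):
    # lam's placement is an assumption made for illustration only.
    if is_scalar:
        return huber(scalar_pred, scalar_tgt, delta)
    return lam * lookup_loss(lookup_pred, lookup_tgt)

print(total_loss(True, 12.0, 10.0, None, None))   # 2.0 (within delta: quadratic)
```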

Inference:

  • output whichever variable was updated at the final timestep T
  • replace the soft selections with a hardmax (pick the single most likely operation and column at each step)

Experiments

Synthetic dataset

Sampling:

  • table entries drawn from [-100, 100] for training and [-200, 200] for testing
  • number of rows drawn from [30, 100] for training, and fixed at 120 for testing

templates:

  • single-column templates
  • multi-column templates
  • variability: several different utterances for the same operator
  • text match: a small vocabulary of 10 words

QQ20161227-8.png

QQ20161227-9.png

QQ20161227-10.png

QQ20161227-11.png

QQ20161227-12.png

Results

Optimization:

  • Adam (Kingma and Ba, 2014) with default hyper-parameter values
  • parameters initialized uniformly from [-0.1, 0.1]
  • mini-batch size = 50, embedding dimension = 256

Gradient noise:

  • adding noise to the gradients helps training (Neelakantan et al., 2016)
  • it acts as a regularizer: Gaussian noise with mean 0 and variance curr_step^(-0.55) (Welling and Teh, 2011); a sketch follows this list
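A minimal sketch of the gradient-noise trick under these settings (where exactly the noise is injected is an assumption):

```python
import numpy as np

def add_gradient_noise(grad, step, rng=np.random.default_rng(5)):
    # Zero-mean Gaussian noise whose variance decays as step^(-0.55),
    # so std = step^(-0.275); added to the gradient before the Adam update.
    std = step ** (-0.55 / 2)
    return grad + rng.normal(0.0, std, size=grad.shape)

g = np.array([0.2, -1.3, 0.05])
print(add_gradient_noise(g, step=100))
```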

Gradient clipping and hyper-parameter search:

  • the gradient is rescaled when its norm exceeds a threshold chosen from {1, 5, 50} (Graves, 2013); a sketch follows this list
  • Adam epsilon searched over {1e-6, 1e-8}
  • $\delta$ searched over {10, 25, 50}
  • $\lambda$ searched over {25, 50, 75, 100}
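A minimal sketch of clipping by global norm, which is the usual reading of "scaling the gradient when it exceeds a threshold":

```python
import numpy as np

def clip_by_norm(grad, threshold):
    # Rescale the whole gradient vector so its norm never exceeds threshold.
    norm = np.linalg.norm(grad)
    return grad if norm <= threshold else grad * (threshold / norm)

print(clip_by_norm(np.array([3.0, 4.0]), threshold=1.0))  # norm 5 -> [0.6, 0.8]
```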

Others:

  • with 5 columns: use a bi-directional RNN

  • with 10 columns: use attention

Almost all errors occur on questions that require the difference operation.

QQ20161227-13.png

Comparison with LSTM

LSTM settings:

  • with and without attention
  • number of attention heads: 1, 5, or 10
  • the table is fed either before or after the question

Result:

  • dataset: single-column tables with only scalar answers
  • entries: [-10, 10] (80%), [-50, 50] (30%)
  • rows: [4, 7]

Only rough numbers are reported, with little detail.

Neural Programmer:

  • 100% accuracy
  • invariant to the scale of the numbers and to question length

Induced Program Example

QQ20161227-14.png

Background:

  • semantic parsing …
  • neural networks for context-free grammars, knowledge representation, and memory networks (not capable of complex reasoning)

Closely related work:

  • Neural Programmer-Interpreter (Reed and de Freitas, 2016): requires program supervision
  • Neural Enquirer (Yin et al., 2015): requires program supervision, also operates on tables
  • Dynamic Neural Module Networks (Andreas et al., 2016): assembles the network from a dependency tree