NEURAL PROGRAMMER: INDUCING LATENT PROGRAMS WITH GRADIENT DESCENT

• Arvind Neelakantan University of Massachusetts Amherst, arvind@cs.umass.edu

Introduction

Motivation:

• Neural network models fail at arithmetic and logic tasks; they even fail to add two numbers of fewer than 10 bits (Joulin and Mikolov, 2015)
• Tasks like QA require complex reasoning

Contributions:

• Neural Programmer: a neural network augmented with arithmetic and logic operations
• Learns from weak supervision (question–answer pairs); no annotated programs needed
• Fully differentiable, trained end-to-end with backpropagation

Synthetic Dataset (with simple question templates)

• tables $\in \{\mathbb{R}, \text{text}\}^{M\times C}$: M rows and C columns, each entry a number or a text string

Key: at each timestep, softly select an operation (and a column) and save the result, for a total of T timesteps (T chosen in advance); see the sketch below

The fixed bank of built-in operations resembles an ALU, or the intraparietal sulcus (IPS) in the brain
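A minimal runnable sketch of this T-step loop, assuming a numeric-only toy table and only two operations (count and sum); the random selector logits and all variable names are illustrative, not the paper's:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy table: M rows x C columns, numeric-only here for brevity
# (the paper's tables mix numbers and text).
table = np.array([[3., 7.],
                  [5., 1.],
                  [9., 4.]])            # M=3, C=2
M, C = table.shape

T = 4                                   # number of timesteps, fixed in advance
row_select = np.ones(M)                 # soft row selection, initialized to 1
scalar_answer = 0.0

rng = np.random.default_rng(0)
for t in range(T):
    # In the real model these logits come from the selector (question + history RNN);
    # random logits here only illustrate the mechanics of soft selection.
    alpha_op  = softmax(rng.normal(size=2))    # ops: [count, sum], two of the paper's ops
    alpha_col = softmax(rng.normal(size=C))    # soft column choice

    count_out = row_select.sum()                                 # "count" op
    sum_out   = (row_select[:, None] * table * alpha_col).sum()  # "sum" op on the soft column

    # Every output variable is overwritten by a mixture weighted by the op probabilities.
    scalar_answer = alpha_op[0] * count_out + alpha_op[1] * sum_out

print(scalar_answer)   # the answer is read off after step T
```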

Model

Four components:

• Question RNN
• Selector
• Operators
• History RNN

Question RNN

init:

• $z_0 = [0]^d$
• numbers are removed from the question and saved in a separate list (preprocessing sketch below)
• a bidirectional RNN is used for longer questions (details in the experiments)
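A tiny preprocessing sketch for the second point; the `<num>` placeholder and the helper name are my own choices, not from the paper:

```python
import re

def preprocess_question(question: str):
    """Pull numbers out of the question into a side list; here they are
    replaced with a <num> placeholder (an illustrative choice)."""
    numbers = [float(tok) for tok in re.findall(r"-?\d+\.?\d*", question)]
    words = re.sub(r"-?\d+\.?\d*", "<num>", question).lower().split()
    return words, numbers

print(preprocess_question("how many rows have points greater than 25?"))
# (['how', 'many', 'rows', 'have', 'points', 'greater', 'than', '<num>?'], [25.0])
```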

Selector

A standard softmax over the operations (and another over the columns), conditioned on the question representation q and the history RNN state h (sketch below)
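A sketch of what the operation selector could look like; the tanh hidden layer, the parameter shapes, and the number of ops are assumptions about the exact parameterization:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Operation selector: a softmax over the built-in ops, conditioned on
# the question vector q and the history state h.
d, n_ops = 8, 9                        # n_ops is just a placeholder value
rng = np.random.default_rng(0)
q = rng.normal(size=d)                 # question representation (question RNN)
h = rng.normal(size=d)                 # history RNN state at step t
W = rng.normal(size=(d, 2 * d))
U = rng.normal(size=(n_ops, d))

alpha_op = softmax(U @ np.tanh(W @ np.concatenate([q, h])))
print(alpha_op, alpha_op.sum())        # distribution over operations (sums to 1)
```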

Operations

Data Selection (soft selection over columns)

Column Representation:

• each column name is encoded by running the question RNN's parameters over its words (sketch below)
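A sketch of encoding column names with the question RNN's parameters and then softly selecting a column against $[q; h]$; the simple tanh recurrence and all shapes are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8
rng = np.random.default_rng(1)
vocab = {"year": 0, "points": 1, "scored": 2}
E  = rng.normal(size=(len(vocab), d))   # word embeddings shared with the question RNN
Wz = rng.normal(size=(d, 2 * d))        # recurrent params shared with the question RNN

def encode(words):
    """Run the (shared) recurrence over the words of a column name."""
    z = np.zeros(d)
    for w in words:
        z = np.tanh(Wz @ np.concatenate([E[vocab[w]], z]))
    return z

columns = [["year"], ["points", "scored"]]
P = np.stack([encode(name) for name in columns])   # one vector per column

q, h = rng.normal(size=d), rng.normal(size=d)
Wc = rng.normal(size=(d, 2 * d))
alpha_col = softmax(P @ np.tanh(Wc @ np.concatenate([q, h])))
print(alpha_col)    # soft data (column) selection
```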

Operations

There are two types of outputs: a scalar and a list of table entries (lookup).

Output variables are maintained, and every operation has access to them:

• $\text{scalar\_answer}_t \in \mathbb{R}$: the scalar answer at step t, initialized to 0
• $\text{lookup\_answer}_t \in [0, 1]^{M\times C}$: selection probability of entry (i, j), initialized to 0
• $\text{row\_select}_t \in [0, 1]^{M}$: selection probability of row i, initialized to 1

Operations:

For the pivot values used by the greater and lesser operations:

The matrix Z holds the question RNN representations of the words to the left of the numbers; since the generated questions follow fixed templates, the informative word sits just before the number (pivot sketch below).
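As I read it, the pivot is an attention-weighted mixture of the numbers extracted from the question, with attention computed over the rows of Z; the per-operation attention vector below is my guess at the parameterization:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Each number pulled out of the question is represented by the question-RNN
# state of the word(s) to its left (e.g. "greater than 25"). The pivot fed to
# the greater op is an attention-weighted mixture of the numbers themselves.
d = 8
rng = np.random.default_rng(2)
numbers = np.array([25.0, 3.0])           # numbers removed from the question
Z = rng.normal(size=(len(numbers), d))    # representations left of each number
u_greater = rng.normal(size=d)            # per-operation attention vector (assumption)

beta = softmax(Z @ u_greater)
pivot_greater = beta @ numbers            # scalar pivot used by the greater op
print(pivot_greater)
```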

Update Variables
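Schematically, each output variable is overwritten by a mixture of the operation outputs, weighted by the operation and column probabilities. The equation below is my shorthand for the scalar case only (column-independent ops such as count simply ignore $j$); the paper writes out analogous updates for row_select and lookup_answer:

$$\text{scalar\_answer}_t = \sum_{op} \alpha^{op}_t(op) \sum_{j=1}^{C} \alpha^{col}_t(j)\, \text{output}_{op}\big(\text{row\_select}_{t-1}, \text{column } j\big)$$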

Text Entries

A text match operation is added; for each text column, the model attends over the question's hidden states to build a column-specific question representation (simply averaging word vectors to represent a column did not work). A rough sketch follows the steps below.

Step 1: Column Embedding: (M text entries and K text columns)

Step 2: Attention

Step 3: Text Match selection:

Update variables:
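A rough sketch of the three steps for one text column; the exact scoring function (dot product plus sigmoid) is an assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Step 1: embed each text entry of the column.
# Step 2: build a column-specific question vector by attending over the
#         question RNN's hidden states.
# Step 3: score every entry against that vector to get row match probabilities.
d, M, n_words = 8, 4, 6
rng = np.random.default_rng(3)
entry_emb = rng.normal(size=(M, d))       # step 1: one vector per text entry
q_states = rng.normal(size=(n_words, d))  # question RNN hidden states
col_vec = rng.normal(size=d)              # this column's representation

attn = softmax(q_states @ col_vec)        # step 2: attention over question words
q_col = attn @ q_states                   # column-specific question summary

match = sigmoid(entry_emb @ q_col)        # step 3: per-row match probabilities
print(match)                              # feeds into the row_select update
```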

Training

Hyper-parameters: $\lambda$ (weights the scalar loss against the lookup loss) and $\delta$ (the Huber-loss threshold for the scalar answer); loss sketch below
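The Huber loss on the scalar answer, as I understand the role of $\delta$:

```python
def huber_loss(pred: float, target: float, delta: float) -> float:
    """Huber loss used for the scalar answer: quadratic near the target,
    linear far from it; delta is the hyper-parameter above."""
    a = abs(pred - target)
    if a <= delta:
        return 0.5 * a * a
    return delta * a - 0.5 * delta * delta

print(huber_loss(3.0, 10.0, delta=10.0))   # 24.5, still in the quadratic regime
```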

Inference:

• output whichever variable was updated at the final timestep T (scalar or lookup answer)
• replace the softmax selections with hardmax, i.e. one-hot choices (sketch below)
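Hardmax at test time just turns each soft selection into a one-hot choice, so the model executes a single discrete program:

```python
import numpy as np

def hardmax(logits: np.ndarray) -> np.ndarray:
    """One-hot version of the selector's choice, used at inference."""
    one_hot = np.zeros_like(logits)
    one_hot[np.argmax(logits)] = 1.0
    return one_hot

print(hardmax(np.array([0.2, 1.7, -0.3])))   # [0. 1. 0.]
```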

Experiments

synthetic dataset

sampling:

• table entries sampled from [-100, 100] for training and [-200, 200] for testing
• number of rows sampled from [30, 100] for training, fixed at 120 for testing (sampler sketch below)
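A hypothetical sampler matching these ranges; the exact generator is not given in these notes, but it shows how train and test differ in both value range and table size:

```python
import numpy as np

def sample_table(split: str, n_cols: int, rng: np.random.Generator) -> np.ndarray:
    """Sample a numeric table with the train/test ranges listed above."""
    if split == "train":
        n_rows = int(rng.integers(30, 101))
        lo, hi = -100, 100
    else:
        n_rows = 120
        lo, hi = -200, 200
    return rng.integers(lo, hi + 1, size=(n_rows, n_cols)).astype(float)

rng = np.random.default_rng(0)
print(sample_table("train", 5, rng).shape, sample_table("test", 5, rng).shape)
```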

templates:

• Single column templates
• multi-column templates
• variability: several utterances for the same operator
• text match: small vocabulary of 10 words

results

Optimize:

• Adam (Kingma and Ba, 2014) with its default hyper-parameters
• parameters initialized uniformly from [-0.1, 0.1]
• mini-batch size=50, embedding dim=256

• gradient noise is studied in more detail in Neelakantan et al. (2016)
• as a regularizer, Gaussian noise is added to the gradients: mean 0, variance = current_step^(-0.55) (Welling and Teh, 2011)

• the gradient is rescaled when its norm exceeds a threshold, chosen from {1, 5, 50} (Graves, 2013); see the sketch below
• $\delta$ chosen from {10, 25, 50}
• $\lambda$ chosen from {25, 50, 75, 100}
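A sketch combining the two gradient tricks, threshold rescaling and annealed Gaussian noise; the function name and the exact rescaling form are my own:

```python
import numpy as np

def adjust_gradient(grad: np.ndarray, step: int, threshold: float,
                    rng: np.random.Generator) -> np.ndarray:
    """Rescale the gradient when its norm exceeds a threshold (Graves, 2013)
    and add Gaussian noise with variance step**-0.55 (Welling and Teh, 2011;
    Neelakantan et al., 2016)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    noise = rng.normal(0.0, np.sqrt(step ** -0.55), size=grad.shape)
    return grad + noise

rng = np.random.default_rng(0)
g = np.array([30.0, -40.0])                 # norm 50, above the threshold
print(adjust_gradient(g, step=1000, threshold=5.0, rng=rng))
```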

Others:

• with 5 columns: bi-RNN over the question
• with 10 columns: attention

Almost all errors occur on questions that require the difference operation.

compare with LSTM

LSTM settings:

• with and without attention
• number of attention heads ∈ {1, 5, 10}
• the table fed as input either before or after the question

result:

• dataset: single-column tables with only scalar answers
• entries: [-10, 10] (80%), [-50, 50] (30%)
• rows: between 4 and 7

(only rough LSTM results, no detailed numbers recorded here)

Neural Programmer:

• 100% accuracy
• invariant to the number scale and question length

Induced Program Example

background:

• Semantic Parsing …
• neural networks for context-free grammars, knowledge representation, and memory networks (not capable of complex reasoning)

Closely related:

• Neural Programmer-Interpreter (Reed and de Freitas, 2016): requires program supervision
• Neural Enquirer (Yin et al., 2015): also operates on tables, with program supervision
• Dynamic Neural Module Networks (Andreas et al., 2016): relies on dependency trees