LSTM

Peephole LSTM

Full ‘Vanilla’ LSTM, used by Alex Graves for sequence learning.

The long-term cell state $c$ is mediated by the input gate $i$, output gate $o$, and forget gate $f$ to produce the short-term hidden state $h$.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + P_f \odot c_{t-1} + b_f) \\
i_t & = \sigma(W_i \cdot [h_{t-1}, x_t] + P_i \odot c_{t-1} + b_i) \\
o_t & = \sigma(W_o \cdot [h_{t-1}, x_t] + P_o \odot c_t + b_o) \\
c_t & = f_t \odot c_{t-1} + i_t \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = o_t \odot \tanh(c_t)
\end{aligned}
$$
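
For concreteness, here is a minimal NumPy sketch of one timestep (function and parameter names are my own; vectors are 1-D, each `W_*` acts on the concatenation `[h_{t-1}, x_t]`, and the peepholes `P_*` are elementwise):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def peephole_lstm_step(x_t, h_prev, c_prev,
                       W_f, P_f, b_f, W_i, P_i, b_i,
                       W_o, P_o, b_o, W_c, b_c):
    """One step of the peephole LSTM equations above."""
    hx = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f = sigmoid(W_f @ hx + P_f * c_prev + b_f)    # forget gate peeks at c_{t-1}
    i = sigmoid(W_i @ hx + P_i * c_prev + b_i)    # input gate peeks at c_{t-1}
    c = f * c_prev + i * np.tanh(W_c @ hx + b_c)  # new cell state
    o = sigmoid(W_o @ hx + P_o * c + b_o)         # output gate peeks at the *new* c_t
    h = o * np.tanh(c)
    return h, c
```

Note that `c` has to be computed before `o`, since the output gate's peephole looks at the new cell state rather than the old one.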

Basic LSTM

The gates don’t take the cell state into account when updating.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t & = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
o_t & = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
c_t & = f_t \odot c_{t-1} + i_t \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = o_t \odot \tanh(c_t)
\end{aligned}
$$

*The Jozefowicz et al. version uses a tanh nonlinearity on the output gate, making it capable of inverting the cell state, but they find little benefit to including an output gate at all.
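
A corresponding NumPy sketch of the basic step (names mine); it is the peephole version with the `P_*` terms dropped:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_o, b_o, W_c, b_c):
    """One step of the basic LSTM above."""
    hx = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ hx + b_f)
    i = sigmoid(W_i @ hx + b_i)
    o = sigmoid(W_o @ hx + b_o)                   # no longer depends on the cell state
    c = f * c_prev + i * np.tanh(W_c @ hx + b_c)
    h = o * np.tanh(c)
    return h, c
```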

LSTM-o

Jozefowicz and friends found the output gate to provide little benefit

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t & = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
c_t & = f_t \odot c_{t-1} + i_t \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = \tanh(c_t)
\end{aligned}
$$
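
A sketch of the same step with the output gate removed (names mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_o_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c):
    """Basic LSTM step without an output gate: h_t = tanh(c_t)."""
    hx = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ hx + b_f)
    i = sigmoid(W_i @ hx + b_i)
    c = f * c_prev + i * np.tanh(W_c @ hx + b_c)
    h = np.tanh(c)
    return h, c
```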

Tied Gates LSTM

“Out with the old” implies “in with the new”.

The input gate should be active whenever the forget gate is inactive.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
o_t & = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
c_t & = f_t \odot c_{t-1} + (1 - f_t) \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = o_t \odot \tanh(c_t)
\end{aligned}
$$
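
As a sketch (names mine), the tied version just reuses `1 - f` where the input gate used to be:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tied_gates_lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_o, b_o, W_c, b_c):
    """LSTM step with the input gate tied to (1 - forget gate)."""
    hx = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ hx + b_f)
    o = sigmoid(W_o @ hx + b_o)
    # whatever fraction of the old cell is forgotten is replaced by new content
    c = f * c_prev + (1.0 - f) * np.tanh(W_c @ hx + b_c)
    h = o * np.tanh(c)
    return h, c
```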

Minimal LSTM

Tied input and forget gates, and no output gate.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
c_t & = f_t \odot c_{t-1} + (1 - f_t) \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = \tanh(c_t)
\end{aligned}
$$
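
A sketch of the single-gate step (names mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def minimal_lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_c, b_c):
    """Tied input/forget gate and no output gate: one gate in total."""
    hx = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ hx + b_f)
    c = f * c_prev + (1.0 - f) * np.tanh(W_c @ hx + b_c)
    h = np.tanh(c)
    return h, c
```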

GRU

Fewer parameters and fewer tanh ops than LSTMs, with competitive performance.
Uses update gate $z$ and reset gate $r$.

$$
\Large
\begin{aligned}
z_t & = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z) \\
r_t & = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r) \\
h_t & = (1 - z_t) \odot h_{t-1} + z_t \odot \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b_h)
\end{aligned}
$$
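
A minimal NumPy sketch of one GRU step (names mine; note there is a single state vector, no separate cell):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, b_z, W_r, b_r, W_h, b_h):
    """One GRU step."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx + b_z)   # update gate: how much of the state to rewrite
    r = sigmoid(W_r @ hx + b_r)   # reset gate: how much history the candidate sees
    cand = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    return (1.0 - z) * h_prev + z * cand
```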

MUT1

Architecture search yielded this mutant recurrent cell.

$$
\Large
\begin{aligned}
z_t & = \sigma(W_z x_t + b_z) \\
r_t & = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r) \\
h_t & = (1 - z_t) \odot h_{t-1} + z_t \odot \tanh(W_h(r_t \odot h_{t-1}) + \tanh(x_t) + b_h)
\end{aligned}
$$
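
A sketch under the same notation (names mine); the quirks are that the update gate sees only `x_t`, and that `tanh(x_t)` is added straight into the candidate, which forces the hidden and input sizes to match:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mut1_step(x_t, h_prev, W_z, b_z, W_r, b_r, W_h, b_h):
    """One MUT1 step; requires len(x_t) == len(h_prev) because of tanh(x_t)."""
    z = sigmoid(W_z @ x_t + b_z)                             # gate computed from the input only
    r = sigmoid(W_r @ np.concatenate([h_prev, x_t]) + b_r)
    cand = np.tanh(W_h @ (r * h_prev) + np.tanh(x_t) + b_h)
    return (1.0 - z) * h_prev + z * cand
```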

MGU

Minimal Gated Unit.
Like a GRU with the $z$ and $r$ gates combined into one.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
h_t & = (1 - f_t) \odot h_{t-1} + f_t \odot \tanh(W_h \cdot [f_t \odot h_{t-1}, x_t] + b_h)
\end{aligned}
$$
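
A sketch of the MGU step (names mine); the single gate `f` plays both the update and reset roles:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mgu_step(x_t, h_prev, W_f, b_f, W_h, b_h):
    """One Minimal Gated Unit step."""
    f = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
    cand = np.tanh(W_h @ np.concatenate([f * h_prev, x_t]) + b_h)
    return (1.0 - f) * h_prev + f * cand
```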

SCRN

Structurally-Constrained Recurrent Network
Like a simple recurrent network, but with an added slow hidden state that keeps an exponential moving average over its inputs, using a fixed memory parameter $\alpha$ which might be set to 0.95.

$$
\Large
\begin{aligned}
s_t & = (1 - \alpha) W_s x_t + \alpha s_{t-1} \\
h_t & = \sigma(W_h \cdot [h_{t-1}, x_t, s_t])
\end{aligned}
$$

where the output $y_t$ is given by $\text{softmax}(W_y \cdot [h_t, s_t])$.
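
Putting the three equations together, a rough NumPy sketch (names and the `softmax` helper are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def scrn_step(x_t, h_prev, s_prev, W_s, W_h, W_y, alpha=0.95):
    """One SCRN step: a slow context state s_t plus a fast hidden state h_t."""
    s = (1.0 - alpha) * (W_s @ x_t) + alpha * s_prev   # exponential moving average of inputs
    h = sigmoid(W_h @ np.concatenate([h_prev, x_t, s]))
    y = softmax(W_y @ np.concatenate([h, s]))          # output distribution
    return h, s, y
```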

SRN

Simple Recurrent Network
An RNN with poor ability to retain information over many timesteps

$$
\Large
h_t = \tanh(W \cdot [h_{t-1}, x_t] + b)
$$
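
The whole cell is one line of NumPy (names mine):

```python
import numpy as np

def srn_step(x_t, h_prev, W, b):
    """One simple-RNN step: a single tanh over [h_{t-1}, x_t]."""
    return np.tanh(W @ np.concatenate([h_prev, x_t]) + b)
```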

Multiplicative Recurrent Units

MI-RNN-s

Simple version

$$
\Large
h_t = \tanh(W_x x_t \odot W_h h_{t-1} + b)
$$
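
A sketch (names mine); the only change from the SRN is that the two projections are combined by elementwise multiplication rather than addition:

```python
import numpy as np

def mi_rnn_s_step(x_t, h_prev, W_x, W_h, b):
    """One simple multiplicative-integration RNN step."""
    return np.tanh((W_x @ x_t) * (W_h @ h_prev) + b)
```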

MI-RNN-g

General version

$$
\Large
\begin{aligned}
h_t = \tanh(& v_{xh} \odot W_x x_t \odot W_h h_{t-1} \\
& + v_x \odot W_x x_t \\
& + v_h \odot W_h h_{t-1} + b)
\end{aligned}
$$
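A sketch (names mine); the general form keeps the additive terms as well and rescales each one with its own `v_*` vector:

```python
import numpy as np

def mi_rnn_g_step(x_t, h_prev, W_x, W_h, v_xh, v_x, v_h, b):
    """One general multiplicative-integration RNN step."""
    px, ph = W_x @ x_t, W_h @ h_prev
    return np.tanh(v_xh * px * ph + v_x * px + v_h * ph + b)
```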

MI-GRU

$$
\Large
\begin{aligned}
z_t = \sigma(& v_{zxh} \odot W_{zx} x_t \odot W_{zh} h_{t-1} \\
& + v_{zx} \odot W_{zx} x_t \\
& + v_{zh} \odot W_{zh} h_{t-1} + b_z)
\end{aligned}
$$

$$
\Large
\begin{aligned}
r_t = \sigma(& v_{rxh} \odot W_{rx} x_t \odot W_{rh} h_{t-1} \\
& + v_{rx} \odot W_{rx} x_t \\
& + v_{rh} \odot W_{rh} h_{t-1} + b_r)
\end{aligned}
$$

$$
\Large
\begin{aligned}
c_t = \tanh(& v_{cxh} \odot W_{cx} x_t \odot W_{ch} (r_t \odot h_{t-1}) \\
& + v_{cx} \odot W_{cx} x_t \\
& + v_{ch} \odot W_{ch} (r_t \odot h_{t-1}) + b_c)
\end{aligned}
$$

$$
\Large
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot c_t
$$
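
A sketch that packs the many per-gate parameters into a dict `p` (the packaging and names are mine); the multiplicative-integration combination is factored out as a helper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mi(px, ph, v_xh, v_x, v_h, b):
    """Multiplicative-integration combination of an input and a hidden projection."""
    return v_xh * px * ph + v_x * px + v_h * ph + b

def mi_gru_step(x_t, h_prev, p):
    """One MI-GRU step; p maps names like 'W_zx', 'v_zxh', 'b_z' to arrays."""
    z = sigmoid(mi(p['W_zx'] @ x_t, p['W_zh'] @ h_prev,
                   p['v_zxh'], p['v_zx'], p['v_zh'], p['b_z']))
    r = sigmoid(mi(p['W_rx'] @ x_t, p['W_rh'] @ h_prev,
                   p['v_rxh'], p['v_rx'], p['v_rh'], p['b_r']))
    c = np.tanh(mi(p['W_cx'] @ x_t, p['W_ch'] @ (r * h_prev),
                   p['v_cxh'], p['v_cx'], p['v_ch'], p['b_c']))
    return (1.0 - z) * h_prev + z * c
```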

MI-LSTM

(too much TeX for me, use your imagination…)

$$
\Huge :'(
$$

mLSTM

$$
\Large
\begin{aligned}
m_t & = W_{mx} x_t \odot W_{mh} h_{t-1} + b_m \\
f_t & = \sigma(W_f \cdot [m_t, x_t] + b_f) \\
i_t & = \sigma(W_i \cdot [m_t, x_t] + b_i) \\
o_t & = \sigma(W_o \cdot [m_t, x_t] + b_o) \\
c_t & = f_t \odot c_{t-1} + i_t \odot (W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = \tanh(o_t \odot c_t)
\end{aligned}
$$
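
A sketch of the step as written above (names mine); the multiplicative intermediate `m_t` stands in for `h_{t-1}` as input to the gates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlstm_step(x_t, h_prev, c_prev, W_mx, W_mh, b_m,
               W_f, b_f, W_i, b_i, W_o, b_o, W_c, b_c):
    """One mLSTM step following the equations above."""
    m = (W_mx @ x_t) * (W_mh @ h_prev) + b_m      # multiplicative intermediate state
    mx = np.concatenate([m, x_t])                 # gates read [m_t, x_t]
    f = sigmoid(W_f @ mx + b_f)
    i = sigmoid(W_i @ mx + b_i)
    o = sigmoid(W_o @ mx + b_o)
    c = f * c_prev + i * (W_c @ np.concatenate([h_prev, x_t]) + b_c)
    h = np.tanh(o * c)
    return h, c
```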

SRU

Simple Recurrent Unit
A parallelizable architecture designed for speed.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f x_t + v_f \odot c_{t-1} + b_f) \\
c_t & = f_t \odot c_{t-1} + (1 - f_t) \odot (W_c x_t) \\
r_t & = \sigma(W_r x_t + v_r \odot c_{t-1} + b_r) \\
h_t & = r_t \odot c_t + (1 - r_t) \odot x_t
\end{aligned}
$$
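
A sketch of one step (names mine). The point of the design is that every matrix multiply involves only `x_t`, so the `W @ x_t` products for a whole sequence can be batched up front; only the cheap elementwise recurrence on `c_t` is sequential:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sru_step(x_t, c_prev, W_f, v_f, b_f, W_c, W_r, v_r, b_r):
    """One SRU step; the highway term (1 - r) * x_t assumes matching sizes."""
    f = sigmoid(W_f @ x_t + v_f * c_prev + b_f)
    c = f * c_prev + (1.0 - f) * (W_c @ x_t)
    r = sigmoid(W_r @ x_t + v_r * c_prev + b_r)
    h = r * c + (1.0 - r) * x_t                   # highway-style skip from the input
    return h, c
```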