LSTM

Peephole LSTM

Full ‘Vanilla’ LSTM, used by Alex Graves for sequence learning.

The long-term cell state $c$ is mediated by the input gate $i$, output gate $o$, and forget gate $f$ to produce the short-term hidden state $h$.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + P_f \odot c_{t-1} + b_f) \\
i_t & = \sigma(W_i \cdot [h_{t-1}, x_t] + P_i \odot c_{t-1} + b_i) \\
o_t & = \sigma(W_o \cdot [h_{t-1}, x_t] + P_o \odot c_t + b_o) \\
c_t & = f_t \odot c_{t-1} + i_t \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = o_t \odot \tanh(c_t)
\end{aligned}
$$
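
For concreteness, here is a minimal NumPy sketch of one timestep (function and parameter names are my own; vectors are 1-D, each `W_*` acts on the concatenation `[h_{t-1}, x_t]`, and the peepholes `P_*` are elementwise):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def peephole_lstm_step(x_t, h_prev, c_prev,
                       W_f, P_f, b_f, W_i, P_i, b_i,
                       W_o, P_o, b_o, W_c, b_c):
    """One step of the peephole LSTM equations above."""
    hx = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f = sigmoid(W_f @ hx + P_f * c_prev + b_f)    # forget gate peeks at c_{t-1}
    i = sigmoid(W_i @ hx + P_i * c_prev + b_i)    # input gate peeks at c_{t-1}
    c = f * c_prev + i * np.tanh(W_c @ hx + b_c)  # new cell state
    o = sigmoid(W_o @ hx + P_o * c + b_o)         # output gate peeks at the *new* c_t
    h = o * np.tanh(c)
    return h, c
```

Note that `c` has to be computed before `o`, since the output gate's peephole looks at the new cell state rather than the old one.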

Basic LSTM

The gates don’t take the cell state into account when updating.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t & = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
o_t & = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
c_t & = f_t \odot c_{t-1} + i_t \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = o_t \odot \tanh(c_t)
\end{aligned}
$$

*The Jozefowicz et al. version uses a tanh nonlinearity on the output gate, making it capable of inverting the cell state, but they find little benefit to including an output gate at all.
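
A corresponding NumPy sketch of the basic step (names mine); it is the peephole version with the `P_*` terms dropped:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_o, b_o, W_c, b_c):
    """One step of the basic LSTM above."""
    hx = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ hx + b_f)
    i = sigmoid(W_i @ hx + b_i)
    o = sigmoid(W_o @ hx + b_o)                   # no longer depends on the cell state
    c = f * c_prev + i * np.tanh(W_c @ hx + b_c)
    h = o * np.tanh(c)
    return h, c
```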

LSTM-o

Jozefowicz and friends found the output gate to provide little benefit

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t & = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
c_t & = f_t \odot c_{t-1} + i_t \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = \tanh(c_t)
\end{aligned}
$$
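
A sketch of the same step with the output gate removed (names mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_o_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c):
    """Basic LSTM step without an output gate: h_t = tanh(c_t)."""
    hx = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ hx + b_f)
    i = sigmoid(W_i @ hx + b_i)
    c = f * c_prev + i * np.tanh(W_c @ hx + b_c)
    h = np.tanh(c)
    return h, c
```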

Tied Gates LSTM

“Out with the old” implies “in with the new”.

The input gate should be active whenever the forget gate is inactive.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
o_t & = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
c_t & = f_t \odot c_{t-1} + (1 - f_t) \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = o_t \odot \tanh(c_t)
\end{aligned}
$$
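
As a sketch (names mine), the tied version just reuses `1 - f` where the input gate used to be:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tied_gates_lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_o, b_o, W_c, b_c):
    """LSTM step with the input gate tied to (1 - forget gate)."""
    hx = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ hx + b_f)
    o = sigmoid(W_o @ hx + b_o)
    # whatever fraction of the old cell is forgotten is replaced by new content
    c = f * c_prev + (1.0 - f) * np.tanh(W_c @ hx + b_c)
    h = o * np.tanh(c)
    return h, c
```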

Minimal LSTM

Tied input and forget gates, and no output gate.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
c_t & = f_t \odot c_{t-1} + (1 - f_t) \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = \tanh(c_t)
\end{aligned}
$$
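
A sketch of the single-gate step (names mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def minimal_lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_c, b_c):
    """Tied input/forget gate and no output gate: one gate in total."""
    hx = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ hx + b_f)
    c = f * c_prev + (1.0 - f) * np.tanh(W_c @ hx + b_c)
    h = np.tanh(c)
    return h, c
```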

GRU

Fewer parameters and fewer tanh ops than LSTMs, with competitive performance.
Uses update gate $z$ and reset gate $r$.

$$
\Large
\begin{aligned}
z_t & = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z) \\
r_t & = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r) \\
h_t & = (1 - z_t) \odot h_{t-1} + z_t \odot \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b_h)
\end{aligned}
$$
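
A minimal NumPy sketch of one GRU step (names mine; note there is a single state vector, no separate cell):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, b_z, W_r, b_r, W_h, b_h):
    """One GRU step."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx + b_z)   # update gate: how much of the state to rewrite
    r = sigmoid(W_r @ hx + b_r)   # reset gate: how much history the candidate sees
    cand = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    return (1.0 - z) * h_prev + z * cand
```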

MUT1

Architecture search yielded this mutant recurrent cell.

$$
\Large
\begin{aligned}
z_t & = \sigma(W_z x_t + b_z) \\
r_t & = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r) \\
h_t & = (1 - z_t) \odot h_{t-1} + z_t \odot \tanh(W_h(r_t \odot h_{t-1}) + \tanh(x_t) + b_h)
\end{aligned}
$$
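
A sketch under the same notation (names mine); the quirks are that the update gate sees only `x_t`, and that `tanh(x_t)` is added straight into the candidate, which forces the hidden and input sizes to match:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mut1_step(x_t, h_prev, W_z, b_z, W_r, b_r, W_h, b_h):
    """One MUT1 step; requires len(x_t) == len(h_prev) because of tanh(x_t)."""
    z = sigmoid(W_z @ x_t + b_z)                             # gate computed from the input only
    r = sigmoid(W_r @ np.concatenate([h_prev, x_t]) + b_r)
    cand = np.tanh(W_h @ (r * h_prev) + np.tanh(x_t) + b_h)
    return (1.0 - z) * h_prev + z * cand
```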

MGU

Minimal Gated Unit.
Like a GRU with the $z$ and $r$ gates combined into one.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
h_t & = (1 - f_t) \odot h_{t-1} + f_t \odot \tanh(W_h \cdot [f_t \odot h_{t-1}, x_t] + b_h)
\end{aligned}
$$
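
A sketch of the MGU step (names mine); the single gate `f` plays both the update and reset roles:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mgu_step(x_t, h_prev, W_f, b_f, W_h, b_h):
    """One Minimal Gated Unit step."""
    f = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
    cand = np.tanh(W_h @ np.concatenate([f * h_prev, x_t]) + b_h)
    return (1.0 - f) * h_prev + f * cand
```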

SCRN

Structurally-Constrained Recurrent Network
Like a simple recurrent network, but with an added slow hidden state that keeps an exponential moving average over its inputs, using a fixed memory parameter $\alpha$ which might be set to 0.95.

$$
\Large
\begin{aligned}
s_t & = (1 - \alpha) W_s x_t + \alpha s_{t-1} \\
h_t & = \sigma(W_h \cdot [h_{t-1}, x_t, s_t])
\end{aligned}
$$

where the output $y_t$ is given by $\text{softmax}(W_y \cdot [h_t, s_t])$.
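
Putting the three equations together, a rough NumPy sketch (names and the `softmax` helper are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def scrn_step(x_t, h_prev, s_prev, W_s, W_h, W_y, alpha=0.95):
    """One SCRN step: a slow context state s_t plus a fast hidden state h_t."""
    s = (1.0 - alpha) * (W_s @ x_t) + alpha * s_prev   # exponential moving average of inputs
    h = sigmoid(W_h @ np.concatenate([h_prev, x_t, s]))
    y = softmax(W_y @ np.concatenate([h, s]))          # output distribution
    return h, s, y
```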

SRN

Simple Recurrent Network
An RNN with poor ability to retain information over many timesteps

$$
\Large
h_t = \tanh(W \cdot [h_{t-1}, x_t] + b)
$$
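
The whole cell is one line of NumPy (names mine):

```python
import numpy as np

def srn_step(x_t, h_prev, W, b):
    """One simple-RNN step: a single tanh over [h_{t-1}, x_t]."""
    return np.tanh(W @ np.concatenate([h_prev, x_t]) + b)
```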

Multiplicative Recurrent Units

MI-RNN-s

Simple version

$$
\Large
h_t = \tanh(W_x x_t \odot W_h h_{t-1} + b)
$$
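
A sketch (names mine); the only change from the SRN is that the two projections are combined by elementwise multiplication rather than addition:

```python
import numpy as np

def mi_rnn_s_step(x_t, h_prev, W_x, W_h, b):
    """One simple multiplicative-integration RNN step."""
    return np.tanh((W_x @ x_t) * (W_h @ h_prev) + b)
```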

MI-RNN-g

General version

$$
\Large
\begin{aligned}
h_t = \tanh(& v_{xh} \odot W_x x_t \odot W_h h_{t-1} \\
& + v_x \odot W_x x_t \\
& + v_h \odot W_h h_{t-1} + b)
\end{aligned}
$$
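A sketch (names mine); the general form keeps the additive terms as well and rescales each one with its own `v_*` vector:

```python
import numpy as np

def mi_rnn_g_step(x_t, h_prev, W_x, W_h, v_xh, v_x, v_h, b):
    """One general multiplicative-integration RNN step."""
    px, ph = W_x @ x_t, W_h @ h_prev
    return np.tanh(v_xh * px * ph + v_x * px + v_h * ph + b)
```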

MI-GRU

$$
\Large
\begin{aligned}
z_t = \sigma(& v_{zxh} \odot W_{zx} x_t \odot W_{zh} h_{t-1} \\
& + v_{zx} \odot W_{zx} x_t \\
& + v_{zh} \odot W_{zh} h_{t-1} + b_z)
\end{aligned}
$$

$$
\Large
\begin{aligned}
r_t = \sigma(& v_{rxh} \odot W_{rx} x_t \odot W_{rh} h_{t-1} \\
& + v_{rx} \odot W_{rx} x_t \\
& + v_{rh} \odot W_{rh} h_{t-1} + b_r)
\end{aligned}
$$

$$
\Large
\begin{aligned}
c_t = \tanh(& v_{cxh} \odot W_{cx} x_t \odot W_{ch} (r_t \odot h_{t-1}) \\
& + v_{cx} \odot W_{cx} x_t \\
& + v_{ch} \odot W_{ch} (r_t \odot h_{t-1}) + b_c)
\end{aligned}
$$

$$
\Large
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot c_t
$$
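
A sketch that packs the many per-gate parameters into a dict `p` (the packaging and names are mine); the multiplicative-integration combination is factored out as a helper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mi(px, ph, v_xh, v_x, v_h, b):
    """Multiplicative-integration combination of an input and a hidden projection."""
    return v_xh * px * ph + v_x * px + v_h * ph + b

def mi_gru_step(x_t, h_prev, p):
    """One MI-GRU step; p maps names like 'W_zx', 'v_zxh', 'b_z' to arrays."""
    z = sigmoid(mi(p['W_zx'] @ x_t, p['W_zh'] @ h_prev,
                   p['v_zxh'], p['v_zx'], p['v_zh'], p['b_z']))
    r = sigmoid(mi(p['W_rx'] @ x_t, p['W_rh'] @ h_prev,
                   p['v_rxh'], p['v_rx'], p['v_rh'], p['b_r']))
    c = np.tanh(mi(p['W_cx'] @ x_t, p['W_ch'] @ (r * h_prev),
                   p['v_cxh'], p['v_cx'], p['v_ch'], p['b_c']))
    return (1.0 - z) * h_prev + z * c
```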

MI-LSTM

(too much TeX for me, use your imagination…)

$$
\Huge :'(
$$

mLSTM

$$
\Large
\begin{aligned}
m_t & = W_{mx} x_t \odot W_{mh} h_{t-1} + b_m \\
f_t & = \sigma(W_f \cdot [m_t, x_t] + b_f) \\
i_t & = \sigma(W_i \cdot [m_t, x_t] + b_i) \\
o_t & = \sigma(W_o \cdot [m_t, x_t] + b_o) \\
c_t & = f_t \odot c_{t-1} + i_t \odot (W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = \tanh(o_t \odot c_t)
\end{aligned}
$$
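
A sketch of the step as written above (names mine); the multiplicative intermediate `m_t` stands in for `h_{t-1}` as input to the gates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlstm_step(x_t, h_prev, c_prev, W_mx, W_mh, b_m,
               W_f, b_f, W_i, b_i, W_o, b_o, W_c, b_c):
    """One mLSTM step following the equations above."""
    m = (W_mx @ x_t) * (W_mh @ h_prev) + b_m      # multiplicative intermediate state
    mx = np.concatenate([m, x_t])                 # gates read [m_t, x_t]
    f = sigmoid(W_f @ mx + b_f)
    i = sigmoid(W_i @ mx + b_i)
    o = sigmoid(W_o @ mx + b_o)
    c = f * c_prev + i * (W_c @ np.concatenate([h_prev, x_t]) + b_c)
    h = np.tanh(o * c)
    return h, c
```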

SRU

Simple Recurrent Unit
A parallelizable architecture designed for speed.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f x_t + v_f \odot c_{t-1} + b_f) \\
c_t & = f_t \odot c_{t-1} + (1 - f_t) \odot (W_c x_t) \\
r_t & = \sigma(W_r x_t + v_r \odot c_{t-1} + b_r) \\
h_t & = r_t \odot c_t + (1 - r_t) \odot x_t
\end{aligned}
$$
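
A sketch of one step (names mine). The point of the design is that every matrix multiply involves only `x_t`, so the `W @ x_t` products for a whole sequence can be batched up front; only the cheap elementwise recurrence on `c_t` is sequential:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sru_step(x_t, c_prev, W_f, v_f, b_f, W_c, W_r, v_r, b_r):
    """One SRU step; the highway term (1 - r) * x_t assumes matching sizes."""
    f = sigmoid(W_f @ x_t + v_f * c_prev + b_f)
    c = f * c_prev + (1.0 - f) * (W_c @ x_t)
    r = sigmoid(W_r @ x_t + v_r * c_prev + b_r)
    h = r * c + (1.0 - r) * x_t                   # highway-style skip from the input
    return h, c
```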