Peephole LSTM

Full ‘Vanilla’ LSTM, used by Alex Graves for sequence learning.

The long-term cell state $c$ is mediated by the input gate $i$, output gate $o$, and forget gate $f$ to produce the short-term hidden state $h$.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + P_f \odot c_{t-1} + b_f) \\
i_t & = \sigma(W_i \cdot [h_{t-1}, x_t] + P_i \odot c_{t-1} + b_i) \\
o_t & = \sigma(W_o \cdot [h_{t-1}, x_t] + P_o \odot c_t + b_o) \\
c_t & = f_t \odot c_{t-1} + i_t \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = o_t \odot \tanh(c_t)
\end{aligned}
$$
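To make the role of the peephole terms concrete, here is a minimal pure-Python sketch of one step of the equations above. The function name `peephole_lstm_step`, the parameter dictionary `p`, and its keys are assumptions made for this illustration. Plain lists are used instead of an array library only to keep the example dependency-free. The forget and input gates peek at $c_{t-1}$, while the output gate peeks at the freshly updated $c_t$, so the cell state has to be computed before the output gate.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, v, b):
    """Return W @ v + b, where W is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]

def peephole_lstm_step(x, h_prev, c_prev, p):
    """One peephole LSTM step; `p` is a hypothetical dict of parameters."""
    hx = h_prev + x  # list concatenation plays the role of [h_{t-1}, x_t]
    # Forget and input gates peek at the previous cell state c_{t-1}.
    f = [sigmoid(a + pf * c)
         for a, pf, c in zip(affine(p["Wf"], hx, p["bf"]), p["Pf"], c_prev)]
    i = [sigmoid(a + pi * c)
         for a, pi, c in zip(affine(p["Wi"], hx, p["bi"]), p["Pi"], c_prev)]
    g = [math.tanh(a) for a in affine(p["Wc"], hx, p["bc"])]  # candidate values
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # The output gate peeks at the updated cell state c_t.
    o = [sigmoid(a + po * ct)
         for a, po, ct in zip(affine(p["Wo"], hx, p["bo"]), p["Po"], c)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c
```

Setting all three peephole vectors to zero recovers the basic LSTM described next.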

Basic LSTM

The gates do not take the cell state into account when updating; that is, the peephole terms are dropped.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t & = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
o_t & = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
c_t & = f_t \odot c_{t-1} + i_t \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = o_t \odot \tanh(c_t)
\end{aligned}
$$

*The Jozefowicz et al. version uses a tanh nonlinearity on the output gate, so the gate can go negative and invert the sign of the cell state. However, they find little benefit in including an output gate at all.


Jozefowicz et al. found that the output gate provides little benefit, so this variant drops it and exposes $\tanh(c_t)$ directly as the hidden state.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t & = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
c_t & = f_t \odot c_{t-1} + i_t \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = \tanh(c_t)
\end{aligned}
$$

Tied Gates LSTM

“Out with the old” implies “in with the new”.

The input gate should be active whenever the forget gate is inactive.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
o_t & = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
c_t & = f_t \odot c_{t-1} + (1 - f_t) \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = o_t \odot \tanh(c_t)
\end{aligned}
$$

Minimal LSTM

Tied input and forget gates, and no output gate.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
c_t & = f_t \odot c_{t-1} + (1 - f_t) \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = \tanh(c_t)
\end{aligned}
$$
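The tied-gate update is easiest to see in code. The following is a small sketch of one Minimal LSTM step, assuming list-of-rows weight matrices; the function and argument names are illustrative, not taken from any library.

```python
import math

def affine(W, v, b):
    """Return W @ v + b, where W is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]

def minimal_lstm_step(x, h_prev, c_prev, Wf, bf, Wc, bc):
    """One step of the tied-gate, no-output-gate cell (names are illustrative)."""
    hx = h_prev + x  # [h_{t-1}, x_t]
    f = [1.0 / (1.0 + math.exp(-a)) for a in affine(Wf, hx, bf)]
    g = [math.tanh(a) for a in affine(Wc, hx, bc)]  # candidate values
    # Tied gates: the fraction (1 - f) that is forgotten is refilled by the candidate.
    c = [ft * cp + (1.0 - ft) * gt for ft, cp, gt in zip(f, c_prev, g)]
    h = [math.tanh(ct) for ct in c]  # no output gate
    return h, c
```

One consequence of tying the gates: $c_t$ is a convex combination of $c_{t-1}$ and a tanh output, so a cell state that starts inside $[-1, 1]$ stays there. In the untied variants the cell state can grow without such a bound.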


Gated Recurrent Unit (GRU)
Fewer parameters and fewer tanh operations than an LSTM, with competitive performance.
Uses an update gate $z$ and a reset gate $r$.

$$
\Large
\begin{aligned}
z_t & = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z) \\
r_t & = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r) \\
h_t & = (1 - z_t) \odot h_{t-1} + z_t \odot \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b_h)
\end{aligned}
$$
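As a sketch of how the two gates interact, the step above can be written in a few lines of plain Python. The function name `gru_step` and its argument layout are assumptions for this example.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, v, b):
    """Return W @ v + b, where W is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]

def gru_step(x, h_prev, Wz, bz, Wr, br, W, bh):
    """One step of the update-gate / reset-gate cell (names are illustrative)."""
    hx = h_prev + x  # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in affine(Wz, hx, bz)]
    r = [sigmoid(a) for a in affine(Wr, hx, br)]
    # The reset gate scales h_{t-1} before it enters the candidate.
    rhx = [rt * hp for rt, hp in zip(r, h_prev)] + x
    cand = [math.tanh(a) for a in affine(W, rhx, bh)]
    # The update gate interpolates between the old state and the candidate.
    return [(1.0 - zt) * hp + zt * ct for zt, hp, ct in zip(z, h_prev, cand)]
```

With a strongly negative update-gate bias, $z_t$ is close to 0 and the cell simply copies $h_{t-1}$ forward, which is how the state can persist over many steps.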


Architecture search yielded this mutant recurrent cell.

$$
\Large
\begin{aligned}
z_t & = \sigma(W_z x_t + b_z) \\
r_t & = \sigma(W_r \cdot [h_t, x_t] + b_r) \\
h_{t+1} & = (1 - z_t) \odot h_t + z_t \odot \tanh(W_h (r_t \odot h_t) + \tanh(x_t) + b_h)
\end{aligned}
$$


Minimal Gated Unit.
Like a GRU with the update gate $z$ and reset gate $r$ merged into a single gate $f$.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
h_t & = (1 - f_t) \odot h_{t-1} + f_t \odot \tanh(W_h \cdot [f_t \odot h_{t-1}, x_t] + b_h)
\end{aligned}
$$
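Merging gates directly reduces the parameter count. As a rough comparison, the sketch below counts weights and biases for several of the cells above, assuming every gate or candidate is one $W \cdot [h, x] + b$ block as written in the equations. The dictionary keys and function names are made up for this example.

```python
def block_params(n_hidden, n_input):
    """Parameters in one W . [h, x] + b block: an n x (n + m) matrix plus n biases."""
    return n_hidden * (n_hidden + n_input) + n_hidden

# Number of such blocks per cell, read off the equations in this section.
BLOCKS = {
    "basic_lstm": 4,    # f, i, o, candidate
    "tied_lstm": 3,     # f, o, candidate
    "minimal_lstm": 2,  # f, candidate
    "gru": 3,           # z, r, candidate
    "mgu": 2,           # f, candidate
}

def count_params(cell, n_hidden, n_input):
    return BLOCKS[cell] * block_params(n_hidden, n_input)

for cell in BLOCKS:
    print(cell, count_params(cell, n_hidden=128, n_input=64))
```

For 128 hidden units and 64 inputs, each block has 24,704 parameters, so the basic LSTM has 98,816, the GRU 74,112, and the MGU 49,408.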


Structurally-Constrained Recurrent Network
Like a simple recurrent network, but with an added slow hidden state $s$ that keeps an exponential moving average of its inputs, using a fixed memory parameter $\alpha$, which might be set to 0.95.

$$
\Large
\begin{aligned}
s_t & = (1 - \alpha) W_s x_t + \alpha s_{t-1} \\
h_t & = \sigma(W_h \cdot [h_{t-1}, x_t, s_t])
\end{aligned}
$$

where the output $y_t$ is given by $\text{softmax}(W_y \cdot [h_t, s_t])$.
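The slow state is an exponential moving average, which a short sketch makes tangible. The helper names `scrn_step`, `softmax`, and `scrn_output` are assumptions for this example. Bias terms are omitted because the equations above have none.

```python
import math

def scrn_step(x, h_prev, s_prev, Ws, Wh, alpha=0.95):
    """One SCRN step following the equations above (names are illustrative)."""
    # Slow units: exponential moving average of the projected input.
    proj = [sum(w * a for w, a in zip(row, x)) for row in Ws]
    s = [(1.0 - alpha) * pj + alpha * sp for pj, sp in zip(proj, s_prev)]
    # Fast units: an ordinary sigmoid layer over [h_{t-1}, x_t, s_t].
    hxs = h_prev + x + s
    h = [1.0 / (1.0 + math.exp(-sum(w * a for w, a in zip(row, hxs)))) for row in Wh]
    return h, s

def softmax(v):
    top = max(v)  # subtract the max for numerical stability
    e = [math.exp(a - top) for a in v]
    total = sum(e)
    return [a / total for a in e]

def scrn_output(h, s, Wy):
    """y_t = softmax(W_y . [h_t, s_t])."""
    return softmax([sum(w * a for w, a in zip(row, h + s)) for row in Wy])
```

If a constant input of 1 is fed through an identity $W_s$ starting from $s_0 = 0$, then $s_k = 1 - \alpha^k$. With $\alpha = 0.95$ the slow state takes about 14 steps to cover half the distance to its input, which is what lets it hold context longer than the fast units.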


Simple Recurrent Network
A plain RNN with no gating; it retains information poorly over many timesteps.

$$
\Large
h_t = \tanh(W \cdot [h_{t-1}, x_t] + b)
$$
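The poor retention can be illustrated with a toy scalar case. The sketch below gives the step function plus a hypothetical helper, `sensitivity`, that tracks how much the final state depends on the initial one when the input and bias are zero. Each step multiplies the derivative by $w_h (1 - h_t^2)$, whose magnitude is at most $|w_h|$, so for $|w_h| < 1$ the influence of early states shrinks at least geometrically.

```python
import math

def rnn_step(x, h_prev, W, b):
    """h_t = tanh(W . [h_{t-1}, x_t] + b), with W as a list of rows."""
    hx = h_prev + x
    return [math.tanh(sum(w * a for w, a in zip(row, hx)) + bi)
            for row, bi in zip(W, b)]

def sensitivity(w_h, h0, steps):
    """d h_T / d h_0 for a scalar RNN with zero input and zero bias (toy setup)."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w_h * h)
        grad *= w_h * (1.0 - h * h)  # chain rule through one tanh step
    return grad

for steps in (1, 10, 50):
    print(steps, sensitivity(w_h=0.9, h0=0.5, steps=steps))
```

The gated cells earlier in this list get around this by carrying state through an additive, gate-weighted path instead of squashing it through tanh at every step.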

Multiplicative Recurrent Units


Simple version

$$
\Large
h_t = \tanh(W_x x_t \odot W_h h_{t-1} + b)
$$
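A minimal sketch of the simple version follows, with the two projections multiplied elementwise rather than added. The name `mi_rnn_step` and the argument order are assumptions for this example.

```python
import math

def matvec(W, v):
    """Return W @ v, where W is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def mi_rnn_step(x, h_prev, Wx, Wh, b):
    """h_t = tanh(W_x x_t (.) W_h h_{t-1} + b); (.) is the elementwise product."""
    px, ph = matvec(Wx, x), matvec(Wh, h_prev)
    return [math.tanh(a * c + bi) for a, c, bi in zip(px, ph, b)]
```

One thing the sketch makes visible: if $h_{t-1}$ is all zeros, the product vanishes and $h_t = \tanh(b)$ no matter what the input is. The additive terms in the general version avoid this.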


General version

$$
\Large
\begin{aligned}
h_t = \tanh(& v_{xh} \odot W_x x_t \odot W_h h_{t-1} \\
& + v_x \odot W_x x_t \\
& + v_h \odot W_h h_{t-1} + b)
\end{aligned}
$$


$$
\Large
\begin{aligned}
z_t = \sigma(& v_{zxh} \odot W_{zx} x_t \odot W_{zh} h_{t-1} \\
& + v_{zx} \odot W_{zx} x_t \\
& + v_{zh} \odot W_{zh} h_{t-1} + b_z)
\end{aligned}
$$

$$
\Large
\begin{aligned}
r_t = \sigma(& v_{rxh} \odot W_{rx} x_t \odot W_{rh} h_{t-1} \\
& + v_{rx} \odot W_{rx} x_t \\
& + v_{rh} \odot W_{rh} h_{t-1} + b_r)
\end{aligned}
$$

$$
\Large
\begin{aligned}
c_t = \tanh(& v_{cxh} \odot W_{cx} x_t \odot W_{ch} (r_t \odot h_{t-1}) \\
& + v_{cx} \odot W_{cx} x_t \\
& + v_{ch} \odot W_{ch} (r_t \odot h_{t-1}) + b_c)
\end{aligned}
$$

$$
\Large
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot c_t
$$


(too much Tex for me, use your imagination…)

$$
\Huge :'(
$$


$$
\Large
\begin{aligned}
m_t & = W_{mx} x_t \odot W_{mh} h_{t-1} + b_m \\
f_t & = \sigma(W_f \cdot [m_t, x_t] + b_f) \\
i_t & = \sigma(W_i \cdot [m_t, x_t] + b_i) \\
o_t & = \sigma(W_o \cdot [m_t, x_t] + b_o) \\
c_t & = f_t \odot c_{t-1} + i_t \odot (W_c \cdot [h_{t-1}, x_t] + b_c) \\
h_t & = \tanh(o_t \odot c_t)
\end{aligned}
$$


Simple Recurrent Unit
A parallelizable architecture designed for speed.

$$
\Large
\begin{aligned}
f_t & = \sigma(W_f x_t + v_f \odot c_{t-1} + b_f) \\
c_t & = f_t \odot c_{t-1} + (1 - f_t) \odot (W_c x_t) \\
r_t & = \sigma(W_r x_t + v_r \odot c_{t-1} + b_r) \\
h_t & = r_t \odot c_t + (1 - r_t) \odot x_t
\end{aligned}
$$
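The speed comes from the fact that every matrix product depends only on $x_t$, never on $h_{t-1}$ or $c_{t-1}$. The sketch below splits the computation into two phases to show this. The function name `sru_layer` and the zero initial cell state are assumptions for this example, and the highway term $(1 - r_t) \odot x_t$ requires the input size to equal the hidden size.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    """Return W @ v, where W is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sru_layer(xs, Wf, vf, bf, Wc, Wr, vr, br):
    """Run the cell over a sequence xs; assumes c_0 = 0 and len(x_t) == hidden size."""
    # Phase 1: matrix products depend only on the inputs, so all timesteps
    # could be computed at once (here: a plain list comprehension).
    proj = [(matvec(Wf, x), matvec(Wc, x), matvec(Wr, x)) for x in xs]
    # Phase 2: the recurrence itself is purely elementwise.
    c = [0.0] * len(bf)
    hs = []
    for x, (pf, pc, pr) in zip(xs, proj):
        f = [sigmoid(a + v * cp + b) for a, v, cp, b in zip(pf, vf, c, bf)]
        r = [sigmoid(a + v * cp + b) for a, v, cp, b in zip(pr, vr, c, br)]  # uses c_{t-1}
        c = [ft * cp + (1.0 - ft) * pct for ft, cp, pct in zip(f, c, pc)]
        hs.append([rt * ct + (1.0 - rt) * xt for rt, ct, xt in zip(r, c, x)])
    return hs, c
```

In a real implementation, phase 1 would be one large batched matrix multiplication over the whole sequence, leaving only the cheap elementwise loop to run sequentially.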