A giant 540B-parameter transformer decoder is trained on a very large volume of text and reaches state-of-the-art results on a wide variety of language tasks. Its feedforward networks have weight matrices only, no bias terms.
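As a minimal sketch of what "weights only, no biases" means in practice, here is an assumed two-layer transformer MLP block in PyTorch; the class name, layer names, and the GELU activation are illustrative choices, not taken from the source.

```python
import torch
import torch.nn as nn


class BiasFreeFFN(nn.Module):
    """Position-wise feedforward block with weight matrices only (no biases)."""

    def __init__(self, d_model: int, d_ff: int) -> None:
        super().__init__()
        # bias=False: the block carries weights only, as described above.
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)
        self.act = nn.GELU()  # activation choice is illustrative, not from the source

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))
```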

118 layers, 48 attention heads with shared key-value projections (multi-query attention), sequence length of 256, model dimension (d_model) of 18432, feedforward hidden size of 73728 (4 × d_model)
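To show how these numbers could fit together, here is a hedged PyTorch sketch of shared key-value (multi-query) attention: each head gets its own query projection, while one key projection and one value projection are broadcast to all heads. The ModelConfig class, the layer names (q_proj, k_proj, v_proj, o_proj), the causal-masking details, and the assumption that head_dim equals d_model / n_heads are all illustrative choices for this example, not specifics from the source.

```python
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class ModelConfig:
    n_layers: int = 118
    n_heads: int = 48
    seq_len: int = 256
    d_model: int = 18432
    d_ff: int = 73728               # 4 * d_model
    head_dim: int = 18432 // 48     # assumed here to be d_model / n_heads


class MultiQueryAttention(nn.Module):
    """Causal self-attention where every head has its own query projection,
    but a single key projection and a single value projection are shared
    across all heads. All projections are bias-free."""

    def __init__(self, cfg: ModelConfig) -> None:
        super().__init__()
        self.n_heads = cfg.n_heads
        self.head_dim = cfg.head_dim
        self.q_proj = nn.Linear(cfg.d_model, cfg.n_heads * cfg.head_dim, bias=False)
        # One key/value projection, broadcast to every head in forward().
        self.k_proj = nn.Linear(cfg.d_model, cfg.head_dim, bias=False)
        self.v_proj = nn.Linear(cfg.d_model, cfg.head_dim, bias=False)
        self.o_proj = nn.Linear(cfg.n_heads * cfg.head_dim, cfg.d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)  # (b, h, t, d)
        k = self.k_proj(x).unsqueeze(1)  # (b, 1, t, d) -- shared by all heads
        v = self.v_proj(x).unsqueeze(1)  # (b, 1, t, d)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5    # (b, h, t, t)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))           # decoder-style masking
        out = torch.softmax(scores, dim=-1) @ v                      # (b, h, t, d)
        out = out.transpose(1, 2).reshape(b, t, self.n_heads * self.head_dim)
        return self.o_proj(out)
```

The usual motivation for sharing a single key/value head is that it shrinks the key-value cache and the memory bandwidth needed during autoregressive decoding, at little cost to quality.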