The transformer model is modified slightly: a convolution over the input word vectors is trained to separate part-of-speech tags, and standard multi-head attention is then performed over this intermediate output.
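A minimal sketch of that setup in PyTorch, under my own assumptions: the module and parameter names (`ConvPOSAttentionLayer`, `n_pos_tags`, the auxiliary loss) are illustrative, not taken from the paper, and the real model presumably stacks this inside a full transformer.

```python
import torch
import torch.nn as nn

class ConvPOSAttentionLayer(nn.Module):
    """Sketch: conv over word vectors with a POS-separation objective,
    followed by ordinary multi-head self-attention."""
    def __init__(self, d_model=512, n_heads=8, n_pos_tags=45, kernel_size=3):
        super().__init__()
        # Convolution applied along the sequence dimension of the word vectors.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
        # Auxiliary head whose loss pushes the conv output to separate POS tags.
        self.pos_head = nn.Linear(d_model, n_pos_tags)
        # Standard multi-head attention over the intermediate (convolved) output.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x, pos_labels=None):
        # x: (batch, seq_len, d_model); Conv1d expects (batch, channels, seq_len).
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        attn_out, _ = self.attn(h, h, h)
        aux_loss = None
        if pos_labels is not None:  # pos_labels: (batch, seq_len) tag indices
            logits = self.pos_head(h)
            aux_loss = nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)), pos_labels.reshape(-1))
        return attn_out, aux_loss
```

Presumably the auxiliary cross-entropy term is added to the translation loss with a small weight, which is exactly the kind of mild extra training signal the comment below attributes the gain to.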

This provides a moderate improvement in machine translation quality.
I expect the gain is more a result of the generalization benefit of the mildly multi-task training (the auxiliary part-of-speech objective) than of the particular architecture.