AlphaGo Zero
or: Mastering the Game of Go without Human Knowledge
AlphaGo Zero is a simplified version of AlphaGo that embeds the tree search and the prediction model directly in the policy-learning RL loop. It bootstraps itself from random play rather than training the prediction network on supervised examples of human play.
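One way to picture how the search sits inside the learning loop: in the paper, the network outputs a move distribution p and a value v for each position, and is trained toward the search's visit distribution pi and the eventual game result z, with an L2 penalty weighted by c. The sketch below computes that per-position loss in plain Python. The function name `alphago_zero_loss` and the flat `weights` argument are illustrative assumptions, not anything from these notes, and the toy numbers are made up.

```python
import math

def alphago_zero_loss(p, v, pi, z, weights=(), c=1e-4):
    """Per-position loss: (z - v)^2 - pi . log(p) + c * ||weights||^2.

    p       -- network move probabilities (hypothetical toy list)
    v       -- network value estimate for the position
    pi      -- search visit distribution over the same moves
    z       -- game outcome from the current player's view (+1 or -1)
    weights -- flat iterable of network parameters (assumed layout)
    """
    value_loss = (z - v) ** 2
    # Cross-entropy between the search distribution and the network policy;
    # moves with zero search probability contribute nothing.
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    l2_penalty = c * sum(w * w for w in weights)
    return value_loss + policy_loss + l2_penalty

# Toy three-move position: the search prefers move 0 more than the network does.
toy_loss = alphago_zero_loss(p=[0.5, 0.3, 0.2], v=0.2, pi=[0.7, 0.2, 0.1], z=1.0)
print(toy_loss)
```

The loss is zero only when the network already agrees with both the search and the outcome, which is why each round of self-play gives the network something better than itself to imitate.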
Training consisted of 29M games of self-play, with gradient descent run over 3.1M minibatches of 2048 positions each.
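To get a feel for those figures, the arithmetic below works out how many training positions were drawn in total and how that compares with the number of games. It uses only the three numbers quoted above; the variable names are just for illustration.

```python
games = 29_000_000        # self-play games
minibatches = 3_100_000   # gradient steps
batch_size = 2048         # training positions per minibatch

# Total training positions drawn across all gradient steps.
positions_sampled = minibatches * batch_size

# Average number of draws per self-play game.
samples_per_game = positions_sampled / games

print(f"{positions_sampled:,} positions sampled")
print(f"{samples_per_game:.0f} samples per game on average")
```

That comes to about 6.3 billion sampled positions, or roughly 219 per game. How often any single position gets reused depends on the average game length, which these notes do not give.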