Really Good Papers (part 1)

I’ve started a challenge to read a machine learning paper a day. Now after a few of months and a lot of catching up to do I’ve gone through a couple hundred papers. Stack of read papers.

These are my picks for the best ones I found:

The Shattered Gradients Problem: If resnets are the answer, then what is the question? David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, Brian McWilliams (Balduzzi et al., 2017)
Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning. Jakob N. Foerster, Francis Song, Edward Hughes, Neil Burch, Iain Dunning, Shimon Whiteson, Matthew Botvinick, Michael Bowling (Foerester et al., 2018)
Modular Networks: Learning to Decompose Neural Computation. Louis Kirsch, Julius Kunze, David Barber (Kirsch et al., 2018)
Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (Vaswani et al., 2017)
Intrinsic Social Motivation via Causal Influence in Multi-Agent RL. Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro A. Ortega, DJ Strouse, Joel Z. Leibo, Nando de Freitas (Jaques et al., 2018)
Episodic Curiosity through Reachability. Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, Sylvain Gelly (Savinov et al., 2018)
Optimizing Agent Behavior over Long Time Scales by Transporting Value. Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, Greg Wayne (Hung et al., 2018)
Uncertainty in Neural Networks: Bayesian Ensembling. Tim Pearce, Mohamed Zaki, Alexandra Brintrup, Nicolas Anastassacos, Andy Neely (Pearce et al., 2018)
The Laplacian in RL: Learning Representations with Efficient Approximations. Yifan Wu, George Tucker, Ofir Nachum (Wu et al., 2018)
How Does Batch Normalization Help Optimization? Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry (Santurkar et al., 2018)
Understanding disentangling in β-VAE. Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, Alexander Lerchner (Burgess et al., 2018)
Deep Residual Learning for Image Recognition. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (He et al., 2015)
Understanding the difficulty of training deep feedforward neural networks. Xavier Glorot, Yoshua Bengio (Glorot & Bengio, 2010)
Combined Reinforcement Learning via Abstract Representations. Vincent François-Lavet, Yoshua Bengio, Doina Precup, Joelle Pineau (François-Lavet et al., 2018)
MIT AGI: Building machines that see, learn, and think like people. Josh Tenenbaum (Tenenbaum @ MIT AGI, 2018)
Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, Thore Graepel (Jaderberg et al., 2018)
Scalable agent alignment via reward modeling: a research direction. Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg (Leike et al., 2018)
Curiosity-driven Exploration by Self-supervised Prediction. Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, Trevor Darrell (Pathak et al., 2017)
GAN Q-learning. Thang Doan, Bogdan Mazoure, Clare Lyle (Doan et al., 2018)
Dropout is a special case of the stochastic delta rule: faster and more accurate deep learning. Noah Frazier-Logue, Stephen José Hanson (Frazier-Logue & Hanson, 2018)

My interpretation

That’s a chunky list in the classic style of M. Brundage or his mechanical dopleganger. Here it is again with my interpretation of each paper and why I think they are really good:

1. Shattered Gradients

(Balduzzi et al., 2017) analyse the surprising performance of residual networks compared to plain feed forward nets, especially at high layer count. They show that feed forward gradients approach white noise exponentially with layers. On the other hand resnets gradients transition to white noise at sublinear rate with layers. They theorize that white noise gradients provide little information when training and design the Looks Linear initialization to alleviate the problem. They successfully train a 200 layer feed forward net in an empirical experiment, obtaining the same performance as a resnet on CIFAR-10.

Deeper networks have exponentially increasing representation power. While residual networks can be deep they are also much more constrained than plain old networks. This paper thus greatly increases the design space available.

2. Bayesian Action Decoder

(Foerester et al., 2018) describe a neat inductive bias for agents to learn what private information other collaborating agents have about the state. They condition a set of collaborating agents to deterministically choose policy based on the public information available. The agents then make their private observations and act according to the policy. Because the policy is based on the public state and known to all, other agents can then infer the private state based on the action taken. They successfully test it on Hanabi and get near optimal scores. Also Hanabi sounds like a fun game!

This method of inferring the private information of other agents is conceptually simple. It also avoids the intractable recurrence of modelling other agents’ beliefs, and beliefs about beliefs…

3. Modular Networks

(Kirsch et al., 2018) implement a principled modular neural net that succeeds in specializing sub-networks to different tasks, or different parts of a sequential task as in text processing. Without using an entropy regularization term they avoid the problem of “module collapse”, when a single module is selected for all tasks. Unlike mixture of experts models, where computation is still carried through for all experts, the modules sharply specialize for each task. The work shows a neat examples of a recurrent network using a specialised module for words following “the”. Unfortunately they report poor generalization when testing on CIFAR-10 images.

To move from small scale, focused, experiments to extremely large, generalist, networks I believe we need modules if we ever want them to run on commodity hardware. Similarly to how the brain saves on energy and chemical budgets by using 10% at a time.

4. Attention Is All You Need

A transformative 🐆 work by the Google team (Vaswani et al., 2017), achieving state of the art results in language tasks by doubling down on attention and foregoing recurrence and convolutions. The main intuition is that attention can connect distant parts of the source sequence through a constant number of neural net steps, $O(1)$. In contrast recurrent nets need to keep distant inputs in memory for each step of the sequence, $O(n)$. At an intermediate level sit convolutional nets which need $O(log_k(n))$ layers for $k$ wide filters. The downside is that attention requires $O(n^2)$ compute, but this is trivially parallelizable.

I believe most problems can be solved by analysing a few influential factors in a sea of clutter, in this sense, attention is the appropriate inductive bias if only we could bypass the poor computational scaling.

5. Causal Influence in Multi-Agent RL

(Jaques et al., 2018) develop an intrinsic reward to encourage cooperation between independently trained agents. Unlike related works the agents don’t rely on common training, they use no cheap communications channel, and do not share rewards. Instead each agent receives an intrinsic reward proportional to their empowerment on other agents, in less fancy words, how much they can influence the behaviour of others. Influence is measured by counterfactual reasoning in the learned model of other agents’ behaviour, where each agents checks which of its actions can greatly change the actions of others. In their toy collaborative example an agent learns to act as a scout by spinning around when there’s no high reward in sight, allowing others to keep collecting low rewards instead of exploring.

The paper presents empowerment as a proxy for coordination. Unlike other multi-agent works, both acting, rewards and training is independent for each agent.

6. Episodic Curiosity through Reachability

(Savinov et al., 2018) rephrase intrinsic curiosity by learning a model that orders states by how reachable they are from each other. A curiosity incentive is then given whenever a state is significantly different than any state in memory. With this method they overcome the noisy TV problem where an agent is drawn to naturally unpredictable states. With the reachability metric, a TV can be deemed close in action space and eventually considered explored. They learn the reachability metric by sampling 2 states from a trajectory and training a network to predict whether they are within $k$ time steps.

Consistent exploratory behaviour is essential in many situations, as random exploration is exponentially unlikely to find distant rewards. Curiosity has been proposed as a solution, but current approaches get stuck in naturally unpredictable states. The authors propose reachability as an effective solution.

7. Transporting Value over Long Time Scales

(Hung et al., 2018) address the problem of long term credit assignment by transporting value to past states based on attention over memory. To prove their method, the authors design a difficult task where meaningful actions are separated by a distractor phase. The agent must first collect a key, then collect high rewards over a long time frame of 500 frames, and finally open the corresponding door for a small amount of extra reward. They solve the task by encoding all observations in memory and using an attention mechanism. When observations pass an attention threshold they add the current value to the attended time step. They solve the task using $\gamma = .96$ even though classic RL cannot solve it for $\gamma = 1$ due to the variable distraction reward.

Solving problems over extremely long time scales is an important feat for intelligent behaviour. We take meaningful actions across days, or years, despite the various distractions we encounter through the day. The authors propose a method to connect rewards and distant actions, bypassing the discount formalism.

8. Bayesian Ensembling

(Pearce et al., 2018) make the key observation that the neural network prior is over a random parametrization, as opposed to 0 centered parameters. They note that an ensemble of networks regularized towards 0 tend to collapse to similar solutions and produce over-confident error bounds. They use an ensemble of networks with $L^2$ loss anchored at different random initializations to produce a better approximation of the true posterior. In a small empirical study they find their ensembling method to produce very similar results to Gaussian processes.

The prior for neural networks is random initialization, instead of 0 centered parameters. Using this change, the authors achieve very good error bounds by just using an ensemble of networks.

9. The Laplacian in RL

(Wu et al., 2018) learn an approximation of the state-graph’s Laplacian eigenvectors, and use them to define a better distance heuristic for reward shaping. The authors find an embedding for the smallest eigenvalues by minimizing the distance between neighboring points in state space. Using an example of a maze, the distance in the embedding space would follow the contours of the maze similarly to how water would fill it up. An agent can solve the maze faster if the reward is shaped according to the Laplacian distance to the goal, as opposed to naive Euclidean distance.

The Laplacian distance is a more informative measure of the differences between states. It can be used for better reward shaping, or an aide to exploration.

10. How Does Batch Normalization Help Optimization?

Contrary to the assumption that Batch Normalization reduces internal covariate shift, (Santurkar et al., 2018) show that the beneficial effect of BN is to smooth out the loss landscape. They define a couple of measures for internal covariate shift and empirically show that BN might even increase it. In fact even intentionally adding parameter noise to increase ICS, the accuracy is not affected. The authors investigate further to find that the beneficial effect is smoothing of the loss landscape and making gradients more predictable with respect to changes in parameters along the gradient direction. They theoretically prove that BN reduces the Lipschitz factor of the network’s loss and gradients.

Batch Normalization is a widely used technique in machine learning, but the reason it works is assumed to be reducing internal covariate shift. The authors empirically find the assumption wrong and alternatively propose, with proof, that BN helps by smoothing the loss landscape.

11. Understanding disentangling in β-VAE

(Burgess et al., 2018) explore information encoding in VAE and their successor β-VAE. They relate the objective to the information bottleneck principle, where the β-VAE objective is equivalent to maximizing mutual information between the latent encoding and the task, whilst minimizing information between the input and the encoding. To control the amount of information in the bottleneck they propose to set a target capacity C for the KL divergence and minimize the absolute difference from measured KL and C. They then increase the target capacity and show that the network starts encoding more information. For the dSprites dataset it first encodes only position as it is the most salient feature, then scale, orientation and shape.

β-VAE have produced interesting results, but the amount of information encoded in the latent bottleneck is hard to judge. I believe the biggest contribution of their paper is a method of quantifying and controlling the amount of information required to encode the various features of a dataset.

12. Deep Residual Learning for Image Recognition

(He et al., 2015) introduce residual networks for image recognition. Noticing that very deep networks generally fail to obtain good performance, they propose an alternative architecture comprised of residual blocks. By providing a pass through connection for the inputs, gradients and training are well-behaved at all layers. Residual blocks each add a small amount of processing towards the result. Notably, each residual block is composed of at least 2 layers with a non-linearity in-between to prevent the whole network from being a simple affine transformation.

Residual networks won the COCO 2015 competitions with state of the art results in object detection and image processing. The architecture has since been widely expanded upon as the default method of training very deep networks.

13. Understanding the difficulty of training deep feedforward neural networks

(Glorot & Bengio, 2010) analyse statistical properties of gradients across the layers of a neural network. They propose the normalized initialization to maintain the singular values for the layer Jacobians close to 1. With this modification and using an alternative to sigmoid activations they improve training and observe that gradients are distributed more uniformly across layers.

The paper highlights the importance of good initialization for neural networks. Their initialization scheme and advice of monitoring gradient distributions has been widely adopted across the field.

14. Combined Reinforcement Learning via Abstract Representations

(François-Lavet et al., 2018) integrate model-based and model-free reinforcement learning by using a shared abstract state. The authors split the neural network into an encoding network, an abstract state, and a model-free RL network built on the abstract state. At the same time they train a model-based network to predict rewards and transitions between abstract states. By using the abstract state they greatly reduce the state dimensionality for the model-based algorithm making it easier to train. Additionally it allows for fast planning and interpretation. They experiment with planning by simulating only the abstract state, with interpretability by adding a cosine similarity loss to state vector of interest, and with transfer learning by re-training the encoder on modified inputs.

The authors propose a very useful idea of training a shared embedding for both model-based and model-free reinforcement learning. By using the shared abstract state they improve training time, generalizability and allow for easy manipulation of the state representation.

15. MIT AGI: Building machines that see, learn, and think like people

(Tenenbaum @ MIT AGI, 2018) presents a very good overview of general intelligence. He talks about the key features that make humans intelligent and a rough plan for how we might engineer artificial intelligence. Our eye resolution is low, but we believe we can see the entire room well. Current AI has a lot of difficulty of understanding the world. Image description bots can give a reasonable description, but sometimes miss the point of the image. Humour is super hard to get. The most capable robots, of Boston Dynamics, have nothing to do with intelligence. Even young babies are really good at manipulating objects, and much better than robots. Humans as probabilistic physics simulators. Really good intuition oh how physical states advance and general description of the future.

A good overview of the feats that humans can achieve, current problems with AI, and a path and ideas to achieving general intelligence.

16. Human-level performance in multiplayer games with population-based deep reinforcement learning

The Deepmind team, lead by (Jaderberg et al., 2018), combine a variety of techniques to train a very strong game playing agent. They design a multiplayer game environment and use population based training 1 to obtain a diverse set of agents. Each agent also learn an internal reward function that is a dense proxy for the sparse win/lose reward at the end of a match. Further the agent architecture is comprised of a multi-timescale RNN. With a set of RNN cells updated slowly to encourage long-term learning, and another set updated frequently to fine tune the prior of the slow cells. Finally they visualize the behaviours of the agents by embedding their internal states in 2D using t-SNE 2, prescient of their work on AlphaStar 3.

Although many papers address a specific aspect of reinforcement learning, I believe it’s important to bring together the best techniques to ensure they work well when combined. With this experiment the Deepmind team showcase a proof of concept for their subsequent work on the Starcraft 2 agent 3.

17. Scalable agent alignment via reward modeling: a research direction

(Leike et al., 2018) discuss the challenges in designing intelligent agents that do what we really want them to do. Games are widely used in reinforcement learning research as they are convenient to simulate, while providing a very dynamic environment. However games have a very clear objective of winning or maximising a score, whereas the real world is much more ambiguous. The authors summarize their desiderata for agent alignment as scalable, economic/efficient, as well as pragmatic. The same solution should be applicable on small scale experiments as well as large operations, it should satisfy safety and trust concerns, but not reduce performance, and finally it should be realizable. The authors also analyse a common failure mode where the effective reward function is far from the intended reward, leading to degenerate behaviours in RL agents.

The authors present a philosophical outline of the requirements and challenges of bringing AI agents from the laboratory into the real world. In particular arguing that significant effort should be made to ensure that agents learn what we want them to, and develop mechanisms to correct their behaviour when agents diverge.

18. Curiosity-driven Exploration by Self-supervised Prediction

(Pathak et al., 2017) develop the baseline curiosity implementation (as of early 2019). Although encouraging exploration based on the prediction error of a learned model has been attempted before, results were poor as predicting a high-dimensional state is a very challenging task. The authors design an architecture for an agent to predict only the features that is has control of. Their Internal Curiosity Module (ICM) consists of an embedding network to obtain a low-dimensional state representation. The embedding network is an inverse model trained to predict the action taken between 2 states, thus learning only features the agent can control. They then train a forward model on the embedded vectors and use the prediction error as an additional reward. Because the embedded state only features salient to the agent’s policy, the learnt curiosity reward is robust to noise.

Structured exploration of the environment is a great challenge for reinforcement learning. In this paper the Deepmind team present one of the first solutions that is robust to features outside of the agent’s control.

19. GAN Q-learning

In an innovative paper, (Doan et al., 2018) combine generative adversarial networks and reinforcement learning. They pose the problem as generating optimal action-state pairs and discriminating between optimal and generated states. To approximate optimal, they use a single step of the Bellman operator on the states in the replay buffer. The generator eventually learns to sample actions near the equilibrium Q function, thus becoming an optimal policy. Although both GAN and RL are notoriously unstable, as the authors acknowledge, the combination succeeds in learning policies for CartPole and Acrobot that slightly improve on the DQN baseline.

The authors succeed in posing the reinforcement learning problem in a generator-discriminator framework. A really interesting combination of GAN and RL.

20. Dropout is a special case of the stochastic delta rule: faster and more accurate deep learning

(Frazier-Logue & Hanson, 2018) make the connection between dropout and the older stochastic delta rule. In SDR the weights are sampled from the normal distribution with learnable mean and standard deviation. The mean is updated by gradient descent, where the gradients are computed on the sampled weight values. σ is also updated the same way, but with a separate learning rate and exponential decay to eventually reduce to 0. By considering repeated dropout samples they show that it is equivalent to SDR. They compare the regularization methods on CIFAR-100 and obtain significant improvements in test error.

Dropout has shown several weaknesses in the past years as researchers have tried to analyse it. By re-framing it as the stochastic delta rule the author explains some of the factors that make it work and greatly improve its results.