Neural Architecture Search (NAS), which is closely related to my research work in self-evolving machine, has become increasingly important to online learning and lifelong learning. As data is continuously collected, retraining neural networks with predefined architecture gets more difficult, and it eventually becomes impossible to keep up with the underlying data distribution at some point. Here, a new neural network architecture can come to the rescue. So in the post, I have selected three papers to present the research on this topic. These papers are not the most recent, nonetheless they are important representative in the field. Here we go.

Zoph's paper: Learning Transferable Architectures for Scalable Image Recognition

Liu's paper: DARTS: Differentiable Architecture Search

Gaier's paper: Weight Agnostic Neural Networks

Zoph uses reinforcement learning (RL) to search the neural network architecture. In the paper, a neural network is a stack of "cells" in a predefined way. The RL agent (called RNN controller in the paper) is only responsible for generating the cells. Along reading the paper, I asked the following three questions:

As the above figure shows, RL agent takes actions to sample a cell (see Q3 for details), based on which an entire neural network is constructed (see Q2 for details). The neural network is trained on training data and evaluated on hold-out data. The evaluated performance is used as a reward for the RL agent to adjust its internal parameters (via policy gradient) to generate better cells.

The construction from cells to networks is pre-defined by human. The authors designed various ways of stacking the basic cells for CNNs or RNNs. The figure 2 of the paper (copied below) shows human choice of constructing CIFAR10 and ImageNet architectures using RL generated cells.

One thing worth noting is that each cell has **two inputs and one output**. The two inputs are supposedly coming from the outputs of the immediately previous two cells in the architecture. Each cell is constructed from B blocks. Each block consists of two inputs sampled from *input pool*, three operations, and one output. The *input pool* is initialized as a set of two inputs (the two inputs from the immediately previous two cells), and gets extended by adding the output of the newly generated block.

Essentially, the job of the RL agent, which is a RNN network, is generate **block by block sequentially** (see the figure 7 of the paper copied here). Each block is generated by **5 actions** (see the figure 3 of the paper copied here):

- sample 1st input from input pool,
- sample 2nd input from input pool,
- sample operation for the 1st input,
- sample operation for the 2nd input, and
- sample operation for combining the two inputs

Now Zoph's paper offers a good solution to search neural network architectures. One of the biggest problems is its demanding computation resource (2000 GPU days). Liu, the author of the 2nd paper DARTS, offers a much more efficient modification that significantly reduces the computation cost to a few GPU days. This modification is to relax the optimization space into a continuous one, so that the much faster method, gradient descent, can be applied. Similarly, at least two questions should be asked.

In Zoph's approach, each cell has B blocks and each block has two inputs and one output. In Liu's language, these inputs and outputs of the blocks within the cell are called nodes. Liu completely eliminates the concept of block, which I think makes the logic simpler, and simply say there's a fixed number (N) of nodes in the cell. The nodes are given ordered indexes (see the figure 1 of the paper copied below. In this demo case, there's 4 nodes and they are indexed as 0, 1, 2, 3.

Each node (except the first two nodes) is computed by the nodes whose indexes are smaller than itself.

$$x^{(j)} = \sum_{i<j} o^{(i,j)}(x^{(i)})$$

Take the example of node 2 in the above figure, it's computed as a sum of node 0 and node 1 after their taking corresponding operations (an operation is represented by an edge). To search the optimal cell architecture in this setup is essentially to find the best set of edge types for the computation graph - part (a) of the figure.

It's a discrete optimization by nature since edge type choice discrete. To make it continuous, each edge is represented as a weighted mixture of edge types. Therefore, the optimization problem is relaxed to finding the best set of mixture weights (denoted as \(\alpha\)) for each edge, which could be done via gradient based method (part c of the figure). Finally, the discrete solution can be induced (part d of the figure).

Solving for the mixture weights are like hyper-parameter tuning. For each \(\alpha\) we have to train the network weights to fit the training data. Eventually we'd like to pick the best \(\alpha)\) that maximizes the validation accuracy.

$$min_\alpha L^{val} (\alpha, w^{*}(\alpha))$$

$$s.t. w^{*}(\alpha) = argmin_w L^{train} (w, \alpha)$$

The above two papers limit themselves to searching for the best set of edges/connections with the fixed number of nodes. Gaier takes a step further towards flexible number of nodes. Of course, Gaier only chose simple and small networks to work with. The nodes here are also much smaller - they are just neurons while nodes are feature maps (a list/matrix of neurons) in above two papers.

The more detailed question is how to genetically search networks of varying numbers of neurons and connections. The steps are summarized below.

initialize a population of networks

rank population by fitness (a function of reward/performance)

create a population of next generation

- sort current population by fitness
- remove bottom % of old population and directly keep top % of old population
- select pairs of parents from old population to produce children (either copy higher-fitness parent or crossover)
- child goes mutation with some probability (mutation includes add node, add/remove connection)

Although the authors try to draw readers' attention to the debate over weights vs structure, I think high of this paper from the point of view of learning the NAS methodologies - it well covers the working of genetic algorithm and popularizes the NEAT by creating its python wrapper version prettyNEAT.