
Decoupled Greedy Learning of CNNs

Eugene Belilovsky 1 Michael Eickenberg 2 Edouard Oyallon 3

1 Mila, University of Montreal  2 University of California, Berkeley  3 CentraleSupelec and INRIA. Correspondence to: Eugene Belilovsky <eugene.belilovsky@umontreal.ca>, Michael Eickenberg <michael.eickenberg@berkeley.edu>, Edouard Oyallon <edouard.oyallon@centralesupelec.fr>.

Abstract

A commonly cited inefficiency of neural network training by back-propagation is the update locking problem: each layer must wait for the signal to propagate through the network before updating. We consider and analyze a training procedure, Decoupled Greedy Learning (DGL), that addresses this problem more effectively and at scales beyond those of previous solutions. It is based on a greedy relaxation of the joint training objective, recently shown to be effective in the context of Convolutional Neural Networks (CNNs) on large-scale image classification. We consider an optimization of this objective that permits us to decouple the layer training, allowing for layers or modules in networks to be trained with a potentially linear parallelization in layers. We show theoretically and empirically that this approach converges. In addition, we empirically find that it can lead to better generalization than sequential greedy optimization and even standard end-to-end back-propagation. We show that an extension of this approach to asynchronous settings, where modules can operate with large communication delays, is possible with the use of a replay buffer. We demonstrate the effectiveness of DGL against alternatives on the CIFAR-10 dataset and on the large-scale ImageNet dataset, where we are able to effectively train VGG and ResNet-152 models.

1. Introduction

Training all layers jointly using back-propagation is the standard method for learning neural networks, including the computationally intensive vision models based on Convolutional Neural Networks (CNNs) (Goyal et al., 2017). Due to the sequential nature of the gradient processing, standard back-propagation has several well-known inefficiencies that do not permit the computations of the different constituent modules to be parallelized. Jaderberg et al. (2017) characterize these, in order of severity, as the forward, update, and backward locking problems. Backward unlocking would permit updates of all modules once signals have propagated to all subsequent modules, update unlocking would permit updates of a module before a signal has reached all subsequent modules, and forward unlocking would permit a module to operate asynchronously from its predecessor and dependent modules.

Multiple methods have been proposed which can deal, up to a certain degree, with the backward unlocking challenge (Huo et al., 2018b; Choromanska et al., 2018; Nøkland, 2016). Jaderberg et al. (2017); Czarnecki et al. (2017) propose and analyze DNI, a method that addresses the more challenging update locking. The DNI approach uses an auxiliary network to predict the gradient of the backward pass directly from the input. This method is not shown to scale well computationally or in terms of accuracy, especially in the case of CNNs (Huo et al., 2018b;a). Indeed, auxiliary networks must predict a weight gradient that can be very large in dimensionality, which can be inaccurate and challenging to scale when intermediate representations are large, as is the case for larger models and input image sizes.

Recently, several authors have revisited the classic (Ivakhnenko & Lapa, 1965; Bengio et al., 2007) approach of supervised greedy layer-wise training of neural networks (Huang et al., 2018; Marquez et al., 2018). In Belilovsky et al. (2018) it is shown that such an approach, which relaxes the joint learning objective, can lead to high-performance deep CNNs on large-scale datasets. Some of these works also consider the use of auxiliary networks with hidden layers as part of the local auxiliary problems, which has some analogs to the auxiliary networks of DNI and target propagation (Lee et al., 2014). We will show that the greedy learning objective can be solved with an alternative optimization algorithm, which permits decoupling the computations and achieving update unlocking. This can also be augmented with replay buffers (Lin, 1992) to permit forward unlocking. This strategy can be shown to be a state-of-the-art baseline for parallelizing the training across modules of a neural network.

Our contributions in this work are as follows.
We (a) propose an optimization procedure for a decoupled greedy learning objective that solves the update locking problem. (b) Empirically, we show that it exhibits similar convergence rates and generalization as its non-decoupled counterpart. (c) We show that it can be extended to an asynchronous setting by use of a replay buffer, providing a step towards addressing the forward locking problem. (d) We motivate these observations theoretically, showing that the proposed optimization procedure converges and recovers standard rates of non-convex optimization. Experimentally, we (e) design an improved auxiliary network structure for greedy layer-wise training of CNNs that permits maintaining accuracy while having negligible cost for the auxiliary task. We (f) show that decoupled greedy learning can well outperform competing methods in terms of scalability to larger and deeper models and stability to optimization hyper-parameters, allowing it to be applied to large datasets. We then demonstrate on the ImageNet dataset that we can train the deep models VGG-19 and ResNet-152 with larger degrees of parallelism than other proposals and reduced memory consumption. Code for experiments will be made available at https://github.com/eugenium/DGL.

2. Related work

To the best of our knowledge, (Jaderberg et al., 2017) is the only work which directly addresses the update or forward locking problems in deep feed-forward networks. Other works (Huo et al., 2018a;b) consider the backward locking problem; furthermore, a number of back-propagation alternatives such as (Choromanska et al., 2018; Lee et al., 2014; Nøkland, 2016) are also able to address this problem. However, update locking is a more severe inefficiency. Consider the case where each layer's forward processing time is T_F and is equal across a network of L layers. Given that the backward pass of back-propagation is a constant multiple in time of the forward pass, in the most ideal case backward unlocking will still only scale as O(L·T_F) with L parallel nodes, while update unlocking could scale as O(T_F).

One class of alternatives to standard back-propagation aims to avoid its biologically implausible aspects, most notably the weight transport problem (Bartunov et al., 2018; Nøkland, 2016; Lillicrap et al., 2014; Lee et al., 2014). Some of these methods (Lee et al., 2014; Nøkland, 2016) can also achieve backward unlocking as they permit all parameters to be updated at the same time, but only once the signal has propagated to the top layer. None of them, however, solve the update locking or forward locking problems which we consider. Target propagation uses a local auxiliary network as in our approach, which is used to propagate backward the optimal activations computed from the layer above. Feedback alignment replaces the symmetric weights of the backward pass with random weights. Direct feedback alignment extends the idea of feedback alignment by passing errors from the top to all layers, potentially permitting a simultaneous update. These approaches have also not been shown to be scalable to large datasets (Bartunov et al., 2018), obtaining only 17.5% top-5 accuracy on ImageNet (for a reference model that achieves 59.8%). On the other hand, a greedy learning strategy has been shown to work well on the same task (Belilovsky et al., 2018).

Another line of related work, inspired by optimization methods such as the Alternating Direction Method of Multipliers (ADMM) (Taylor et al., 2016; Carreira-Perpinan & Wang, 2014; Choromanska et al., 2018), considers approaches that use auxiliary variables to break optimization into sub-problems. These approaches are fundamentally different from ours as they optimize the joint training objective, the auxiliary variables providing a link between a layer and its subsequent layers, whereas we consider a different objective where a layer has no dependence on its successors. None of these methods can achieve update or forward unlocking; however, some (Choromanska et al., 2018) are able to have simultaneous weight updates (backward unlocked). Another issue with these methods is that most of the existing approaches, except for Choromanska et al. (2018), require standard ("batch") gradient descent and are thus difficult to scale. They also often involve an inner minimization problem and have thus not been demonstrated to work on realistic large-scale datasets. Furthermore, none of these have been combined with CNNs.

Distributed optimization based on data parallelism is a popular area of research in machine learning beyond deep learning models and is often studied in the convex setting (Leblond et al., 2018). In deep network optimization the predominant method is distributed synchronous SGD (Goyal et al., 2017) and variants, as well as asynchronous variants (Zhang et al., 2015). Our approach, on the other hand, can be seen as closer to exploiting a type of model parallelism rather than data parallelism and can be easily combined with many of these methods, particularly distributed synchronous SGD.

3. Parallel Decoupled Greedy Learning

In this section we formally define the greedy objective and the parallel optimization which we study in both the synchronous and asynchronous setting. We mainly consider the online setting and assume a stream of samples or mini-batches denoted S ≜ {(x_0^t, y^t)}_{t≤T}, that can be run during T iterations.

3.1. Optimization for Greedy Objective

Let X_0 and Y be the data matrix and labels for the training data. Let X_j be the output representation for module j. We will denote the per-module objective function L̂(X_j, Y; θ_j, γ_j), where the parameters θ_j correspond to the module parameters (i.e., X_{j+1} = f_{θ_j}(X_j)) and γ_j correspond to auxiliary parameters used to compute the objective.
L̂ in our case will be the empirical risk with a cross-entropy loss. The greedy training objective is thus given recursively by defining P_j:

min_{θ_j, γ_j} L̂(X_j, Y; θ_j, γ_j)    (P_j)

where X_j = f_{θ*_{j−1}}(X_{j−1}) and θ*_{j−1} is the minimizer of Problem (P_{j−1}). A natural way to solve the optimization problem for J modules, P_J, is thus by sequentially solving the problems {P_j}_{j≤J} starting with j = 1. Here we consider an alternative procedure for optimizing this objective, given in Alg. 1: individual updates of each set of parameters are performed in sequence across the different layers. Each layer processes a sample or mini-batch, then passes it to the next layer. Note that at line 5 the subsequent layer can already begin computing line 4. Therefore, this algorithm achieves update unlocking. An explicit version of an equivalent multi-worker pseudo-code is included in Appendix B.

Algorithm 1: Decoupled Greedy Learning
Input: Stream S ≜ {(x_0^t, y^t)}_{t≤T} of samples or mini-batches
1  Initialize parameters {θ_j, γ_j}_{j≤J}
2  for (x_0^t, y^t) ∈ S do
3      for j ∈ 1, ..., J do
4          x_j^t ← f_{θ_{j−1}}(x_{j−1}^t)
5          Compute ∇_{(γ_j, θ_j)} L̂(y^t, x_j^t; γ_j, θ_j)
6          (θ_j, γ_j) ← Update parameters (θ_j, γ_j)
7      end
8  end

[Figure 1: per-module timelines for three mini-batches under standard back-propagation and under decoupled greedy learning.]

Figure 1. We illustrate the signal propagation for three mini-batches processed by standard back-propagation and with decoupled greedy learning. In each case a module can begin processing forward and backward passes as soon as possible. For illustration we assume the same speed for forward and backward passes, and discount the auxiliary network computation (negligible in our experiments).

Fig. 1 illustrates the decoupling compared to how samples are processed in standard back-propagation. We observe that once an x_j^t has been computed, processing by subsequent layers can begin. Sec. 3.2 will also consider a version of the algorithm that can be made asynchronous by introducing a replay buffer.
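To make the update-unlocked procedure concrete, below is a minimal PyTorch-style sketch of the inner loop of Alg. 1. The DGLTrainer class name, the choice of SGD with momentum, and the cross-entropy auxiliary heads are illustrative assumptions rather than the exact experimental configuration.

```python
import torch
import torch.nn as nn

class DGLTrainer:
    """Minimal sketch of Alg. 1 (synchronous DGL): each module f_j is paired
    with an auxiliary head g_j producing a local prediction, and is updated
    from its local loss only. The detach() call is what decouples modules."""
    def __init__(self, modules, aux_heads, lr=0.1):
        self.modules = nn.ModuleList(modules)      # f_{theta_j}
        self.aux_heads = nn.ModuleList(aux_heads)  # auxiliary nets (gamma_j)
        self.opts = [torch.optim.SGD(list(f.parameters()) + list(g.parameters()),
                                     lr=lr, momentum=0.9)
                     for f, g in zip(modules, aux_heads)]

    def step(self, x, y):
        losses = []
        for f, g, opt in zip(self.modules, self.aux_heads, self.opts):
            x = f(x)                                        # forward through module j
            loss = nn.functional.cross_entropy(g(x), y)     # local auxiliary loss
            opt.zero_grad()
            loss.backward()                                 # gradients stay within module j
            opt.step()
            x = x.detach()                                  # no gradient flows back to module j-1
            losses.append(loss.item())
        return losses
```

In a multi-worker realization of this loop (Appendix B), each module instead runs the inner body on its own worker, receiving the detached representation from the previous worker as soon as it is available.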
3.2. Asynchronous Decoupled Greedy Learning with Replay

We consider an extension of this framework that addresses forward unlocking (Jaderberg et al., 2017). Since the computations of the modules in DGL are only loosely dependent, we can attempt an extension that also removes some dependency of the computations on the previous modules, such that they can operate asynchronously. This is achieved by the use of a replay buffer that is shared between adjacent modules and allows modules to reuse older samples. It can be beneficial in scenarios with communication delays or substantial variations in speed between layers/modules.

We will evaluate one instance of such an algorithm, based on the use of a replay buffer of size M, shown in Alg. 2. Here each module maintains a buffer to which it writes its output representations and which is read by the module above.

Algorithm 2: Asynchronous DGL with Replay Buffer
Input: Stream S ≜ {(x_0^t, y^t)}_{t≤T}; distribution of the delay p = {p(j)}_j; buffer size M
1  Initialize: buffers {B_j}_j with size M; parameters {θ_j, γ_j}_j
2  while training do
3      Sample j in {1, ..., J} following p.
4      if j = 1 then (x_0, y) ← S else (x_{j−1}, y) ← B_{j−1}
5      x_j ← f_{θ_{j−1}}(x_{j−1})
6      Compute ∇_{(γ_j, θ_j)} L̂(y, x_j; γ_j, θ_j)
7      (θ_j, γ_j) ← Update parameters (θ_j, γ_j)
8      if j < J then B_j ← (x_j, y)
9  end

Our minimal distributed setting is as follows. Each worker j has a buffer that it writes to and that worker j + 1 can read from. The buffer uses a simple read and write protocol. A buffer B_j permits layer j to write new samples. When it reaches capacity it overwrites the oldest sample. Layer j + 1 requests samples from the buffer B_j. The sample is selected by a simple last-in-first-out (LIFO) rule, with precedence for the least-reused samples. The speed of each worker is constant, yet can be potentially different across workers and is also subject to small random fluctuations. Our algorithm does not require a shared buffer across all workers, but only across pairs of workers. Alg. 2 simulates potential delays in such a setup by the use of a probability mass function (pmf) p(j) over workers, analogous to typical asynchronous settings such as (Leblond et al., 2017). At each iteration a layer is chosen at random according to p(j) to perform computation. In our experiments we will limit ourselves to pmfs that are uniform over workers except for a single layer which is chosen to be selected less frequently on average. We note that even in the case of a uniform pmf, asynchronous behavior will naturally arise, requiring the reuse of samples.
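The following is a minimal sketch of the pair-wise buffer protocol described above. The class name and fields are illustrative; the LIFO-with-least-reused-precedence rule is implemented here with a simple reuse counter.

```python
from collections import deque

class ReplayBuffer:
    """Sketch of the buffer B_j shared between workers j and j+1. Writes
    overwrite the oldest entry once capacity M is reached; reads follow a
    LIFO rule with precedence for the least-reused samples."""
    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)   # oldest entry dropped automatically

    def write(self, x, y):
        self.items.append({"x": x, "y": y, "uses": 0})

    def read(self):
        # Assumes the buffer is non-empty. Among the least-reused entries,
        # take the most recently written one (LIFO).
        min_uses = min(item["uses"] for item in self.items)
        candidates = [item for item in self.items if item["uses"] == min_uses]
        item = candidates[-1]
        item["uses"] += 1
        return item["x"], item["y"]
```

Worker j + 1 then simply calls read() whenever it is scheduled, whether or not worker j has produced a new batch since the last read.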
Alg. 2 permits a controlled simulation of processing speed discrepancies and will be used over settings of p and M to demonstrate that training accuracy and testing accuracy remain robust in practical regimes. Appendix B also provides a more intuitive pseudo-code for how this buffer-based algorithm would be implemented in a parallel environment.

As will be demonstrated in our experiments, DGL can potentially be robust to substantial asynchronous behavior. Unlike common data-parallel asynchronous algorithms (Zhang et al., 2015), the asynchronous DGL does not rely on a master node and requires only local communication, similar to recent decentralized schemes (Lian et al., 2017). Unlike decentralized SGD algorithms, nodes only need to maintain and update the parameters of their local model, potentially supporting much larger models. Combining DGL with data-parallel methods is also natural. For example, a common issue of the popular distributed synchronous SGD in deep CNNs is the often limited maximum batch size (Goyal et al., 2017; Keskar et al., 2016). This suggests that DGL can be used in combination with data parallelism to add an additional dimension of parallelization. Potentially combining asynchronous DGL with distributed synchronous SGD for sub-problem optimization is a promising direction.

3.3. Auxiliary and Primary Network Design

The DNI method requires an auxiliary network to predict the gradient. The greedy layer-wise CNN training procedure of (Belilovsky et al., 2018), which we parallelize, similarly relies on an auxiliary network. This requires the design of an auxiliary network in addition to the CNN architecture design. Belilovsky et al. (2018) have shown that simple averaging operations can be used to construct a scalable auxiliary network. However, they did not directly consider the parallel training use case. Here care must be taken in the design, as will be discussed in the experimental section. The primary consideration in our case is the relative speed of the auxiliary network with respect to the associated module it is attached to in the primary network. We will primarily use FLOP count in our analysis and aim to restrict our auxiliary networks to be less than 5% of the primary network.

Although auxiliary network design might seem like an additional layer of complexity in CNN design and might potentially require slightly different architecture principles for the primary network than standard end-to-end trained deep CNNs, this is not inherently prohibitive, since architecture design is well known to be related to the training. As an example, consider the typical motivation for residual connections, which are originally motivated by optimization issues inherent to end-to-end backpropagation of deep networks.

We note that although we focus on the distributed learning context, the proposed optimization algorithm and associated theory for greedy objectives is generic and has other potential applications. Greedy objectives have recently been used in several applications in reinforcement learning (Haarnoja et al., 2018) and in ensemble methods like boosting (Huang et al., 2018). Even with a single worker the synchronous DGL has a gain in terms of memory. Moreover, it is easier to implement efficiently than sequential greedy training, since in the naive sequential training scheme a forward pass through old modules or caching of previous activations is needed for optimal performance.

4. Theoretical Analysis

In this section we analyze the convergence of Alg. 1 when the update steps are obtained from stochastic gradient methods. We show that under the DGL optimization scheme a critical point can be reached. In standard stochastic optimization schemes, the input distribution fed to a model is fixed (Bottou et al., 2018). With the decoupled training procedure the input distribution to each module is time-varying and dependent on the convergence of the previous module. At time step t, for simplicity we will denote all parameters of a module (including auxiliary) as Θ_j^t ≜ (θ_j^t, γ_j^t), and samples as Z_j^t ≜ (X_j^t, Y), which follow the density p_j^t(z). We aim to prove that each auxiliary problem of the DGL approach will converge to a critical point despite the time-varying inputs corresponding to sub-optimal outputs from prior modules. Proofs are given in Appendix A.

Let us fix a depth j, such that j > 1, and consider the target density of the previous layer, p*_{j−1}(z). We consider the following distance: c_{j−1}^t ≜ ∫ |p_{j−1}^t(z) − p*_{j−1}(z)| dz. Denoting ℓ the composition of the non-negative loss function and the network, we will study the expected risk L(Θ_j) ≜ E_{p*_{j−1}}[ℓ(Z_{j−1}; Θ_j)]. We will now state several standard assumptions we use.

Assumption 1 (L-smoothness). L is differentiable and its gradient is L-Lipschitz.

We consider the SGD scheme with learning rates {η_t}_t:

Θ_j^{t+1} = Θ_j^t − η_t ∇_{Θ_j} ℓ(Z_{j−1}^t; Θ_j^t), where Z_{j−1}^t ∼ p_{j−1}^t.    (1)

Assumption 2 (Robbins-Monro conditions). The step sizes satisfy Σ_t η_t = ∞ yet Σ_t η_t² < ∞.

We also assume bounded gradient moments:

Assumption 3 (Finite variance). There exists G > 0 such that for any t and Θ_j, E_{p_{j−1}^t}[‖∇ℓ(Z_{j−1}; Θ_j)‖²] ≤ G.

Assumptions 1, 2 and 3 are standard (Bottou et al., 2018; Huo et al., 2018a), and we show in the following that our proof of convergence leads to similar rates, up to a multiplicative constant. The following assumption is specific to our setting, where we consider a time-varying distribution:

Assumption 4 (Convergence of the previous layer). We assume that Σ_t c_{j−1}^t < ∞.
Assumption 3 can be extended to p*_{j−1}:

Lemma 4.1. Under Assumptions 3 and 4, one has: ∀Θ_j, E_{p*_{j−1}}[‖∇ℓ(Z_{j−1}; Θ_j)‖²] ≤ G.

We are now ready to prove the core statement for the convergence results in this setting:

Lemma 4.2. Under Assumptions 1, 3 and 4, we have:

E[L(Θ_j^{t+1})] ≤ E[L(Θ_j^t)] − η_t (E[‖∇L(Θ_j^t)‖²] − √(2G c_{j−1}^t)) + (LG/2) η_t².    (2)

The expectation is taken over each random variable. Also, note that without the temporal dependency (i.e., c_j^t = 0), this becomes analogous to Lemma 4.4 in (Bottou et al., 2018). Naturally it follows that

Proposition 4.1. Under Assumptions 1, 2, 3 and 4, each term of the following equation converges:

Σ_{t=0}^T η_t E[‖∇L(Θ_j^t)‖²] ≤ E[L(Θ_j^0)] + G Σ_{t=0}^T η_t (√(2 c_{j−1}^t) + L η_t / 2).    (3)

Thus the DGL scheme converges in the sense of (Bottou et al., 2018; Huo et al., 2018a). It is also possible to obtain the following rate:

Corollary 4.1. The sequence of expected gradient norms accumulates around 0 at the following rate:

inf_{t≤T} E[‖∇L(Θ_j^t)‖²] ≤ O( (Σ_{t=0}^T √(c_{j−1}^t) η_t) / (Σ_{t=0}^T η_t) ).    (4)

Thus, compared to the sequential case, the parallel setting adds a delay that is controlled by √(c_{j−1}^t). We now evaluate DGL empirically.

5. Experiments

We conducted several experiments that empirically show that DGL optimizes the greedy objective well. We compare the DGL method to others, showing it is a state-of-the-art solution for decoupling the training of deep network modules. We show that it can still work on a large-scale dataset (ImageNet) and that it can, in some cases, generalize better than standard back-propagation. We also demonstrate positive initial results for the asynchronous variant of the algorithm.

5.1. Other Approaches and Auxiliary Network Designs

This section presents experiments evaluating DGL on the CIFAR-10 dataset (Krizhevsky, 2009) with standard data augmentation. We first use a setup that permits us to compare against the DNI method and which also highlights the generality and scalability of DGL. We then consider the design of a more efficient auxiliary network which we will subsequently use to permit scaling to the ImageNet dataset. We will also show that DGL is effective at optimizing the greedy objective compared to a naive sequential algorithm.

Comparison to DNI  We reproduce the CIFAR-10 CNN experiment described in (Jaderberg et al., 2017), Appendix C.1. This experiment utilizes a 3 layer network with auxiliary networks of 2 hidden CNN layers. We compare our reproduction to the DGL approach. Instead of the final synthetic gradient prediction, for DGL we apply a final projection to the target prediction space. We follow the prescribed optimization procedure from (Jaderberg et al., 2017) in this comparison, using Adam with a learning rate of 3e−5. We run training for 1500 epochs and compare standard backpropagation, DNI, cDNI (Jaderberg et al., 2017) and DGL. Results are shown in Fig. 2. Further details are included in the Appendix. We find that the DGL method outperforms DNI and the context DNI by a substantial amount both in test accuracy and training loss. We also find in this setting that DGL can generalize better than standard backpropagation and obtains a very close final training loss. We also attempted DNI with the more commonly used optimization settings for CNNs (SGD with momentum and step decay), but found that DNI would diverge when larger learning rates were used, although DGL sub-problem optimization worked effectively with common CNN optimization strategies. We also note that the prescribed experiment uses a setting where the scalability of our method is not fully exploited. Each layer of the primary network of (Jaderberg et al., 2017) has a pooling operation, which permits the auxiliary network to be small for synthetic gradient prediction. This however severely restricts the architecture choices in the primary network to using a pooling operation at each layer. In DGL we can apply the pooling operations in the auxiliary network, thus permitting the auxiliary network to be negligible in cost even for layers without pooling (whereas for synthetic gradient prediction they often have to be as costly as the base network). Overall we find that the DGL approach is not only far more scalable and accurate but also more stable and robust to changes in optimization hyper-parameters than DNI.
[Figure 2: test accuracy (left) and training loss (right) on CIFAR-10 over 1500 epochs for Backprop, DGL, DNI, and cDNI.]

Figure 2. Comparison of DNI, context DNI (cDNI), and DGL in terms of training loss and test accuracy for the experiment from (Jaderberg et al., 2017). DGL converges better than cDNI and DNI with the same auxiliary net, and generalizes better than backprop for this case.

Auxiliary Network Design  We consider different auxiliary networks for CNNs. As a baseline we use convolutional auxiliary layers as in (Jaderberg et al., 2017) and (Belilovsky et al., 2018). For the distributed training application this approach is sub-optimal, as the auxiliary network can be substantial compared to the base network, leading to poorer parallelization gains. We note however that even in those cases (that we don't study here) where the auxiliary network computation is potentially on the order of the primary network, it can still give advantages for parallelization for very deep networks and many available workers.

The primary network architecture we use for these experiments is a simple CNN similar to VGG family models (Simonyan & Zisserman, 2014). It consists of 6 convolutional layers with 3 × 3 kernels, batchnorm and shape-preserving padding, with 2 × 2 max-pooling operations at layers 1 and 3. The channel width of the first layer is 128 and is doubled at each downsampling operation. The final layer does not have an auxiliary model; it is learned with a linear spatial averaging followed by a 2-hidden-layer constant-width fully connected network, for all experiments. Two alternatives to the CNN auxiliary of (Belilovsky et al., 2018) are explored, which exploit a spatial averaging operation. We re-iterate that this kind of approach, and even the simple network structure we consider, is not easily applicable in the case of DNI and synthetic gradient prediction. Optimization is done using a standard strategy for CIFAR CNN training. We apply SGD with momentum of 0.9 and weight decay 5e−4 (Zagoruyko & Komodakis, 2016) and decaying step sizes. For these experiments we use a short schedule of 50 epochs and a decay factor of 0.2 every 15 epochs (Belilovsky et al., 2018). Results of the comparisons are given in Table 1.

               FLOPS Aux./Module    Acc.
CNN-aux              200%           92.2
MLP-aux              0.7%           90.6
MLP-SR-aux           4.0%           91.2

Table 1. We compare auxiliary networks. CNN-aux, applied in previous work, is inefficient with respect to the primary module. We report the FLOP count of the largest module and the corresponding auxiliary network, and accuracy for CIFAR-10. MLP-aux and MLP-SR-aux, applied after spatial averaging operations, are far more effective with minimal accuracy loss.

The baseline auxiliary strategy based on (Belilovsky et al., 2018) and (Jaderberg et al., 2017) applies 2 CNN layers followed by a spatial averaging to 2 × 2 before a final projection. We denote this CNN-aux. The first alternative we explore is a direct application of the spatial averaging to a 2 × 2 output shape (regardless of the input resolution) followed by a 3 layer MLP (of constant width). This is denoted MLP-aux and drastically reduces the FLOP count with minimal accuracy loss compared to CNN-aux. Finally we consider applying a staged spatial resolution reduction, first reducing the spatial resolution by 4× (and total size 16×), then applying three 1 × 1 convolutions followed by a reduction to 2 × 2 and a 3 layer MLP. We denote this approach MLP-SR-aux. These latter two strategies that leverage the spatial averaging produce auxiliary networks that are less than 5% of the FLOP count of the primary network even for large spatial resolutions as in real-world image datasets. We will show that MLP-SR-aux is still effective even for the large-scale ImageNet dataset. A sketch of this auxiliary head is given below.
5e−4 (Zagoruyko & Komodakis, 2016) and decaying step
sizes. For these experiments we use a short schedule of 50 Sequential vs. Parallel Optimization of Greedy Objec-
epochs and decay factor of 0.2 every 15 epochs (Belilovsky tive We briefly compare the sequential optimization of
et al., 2018). Results of comparisons are given in Table 1. the greedy objective (Belilovsky et al., 2018; Bengio et al.,
2007) to the DGL (Alg. 1). We use a 4 layer CIFAR-10
The baseline auxiliary strategy based on (Belilovsky et al., network with an MLP-SR-aux auxiliary model and a final
2018) and (Jaderberg et al., 2017) applies 2 CNN layers fol- layer attached to a 2 layer MLP. We use the same optimiza-
lowed by a spatial averaging to 2×2 before a final projection. tion settings with 50 epochs as in the last experiment. In the
We denote this CNN-aux. The first alternative we explore sequential training we train each layer for 50 epochs before
is a direct application of the spatial averaging to 2 × 2 out- moving to the subsequent one. Thus the difference to DGL
put shape (regardless of the input resolution) followed by lies only in the input received at each layer (fully converged
The rest of the optimization settings are identical. Figure 3 shows comparisons of the learning curves for sequential training and DGL at layer 4 (layer 1 is the same for both, as the input representation is not varying over the training period). We observe that DGL quickly catches up with the sequential training scheme and appears to sometimes generalize better. We additionally visualize the dynamics of training per layer in Fig. 4, which demonstrates that after just a few epochs the individual layers build a dynamic of progressive improvement with depth. Additional visualizations are included in the supplementary materials.

[Figure 4: per-layer training loss (layers 1–4) over 50 epochs on the CIFAR-10 network.]

Figure 4. Loss per layer in the CIFAR-10 network; after a few epochs the layers build a dynamic of progressive improvement in depth.

Multi-Layer Modules  Although we have so far considered the setting of layer-wise decoupling, this approach can easily be applied to generic modules. Indeed, approaches such as DNI (Jaderberg et al., 2017) often consider decoupling entire multi-layer modules. Furthermore, the propositions for backward unlocking (Huo et al., 2018b;a) also rely on multi-layer modules and report that they can often only decouple 100-layer networks into 2 or 4 blocks before observing optimization issues or performance losses, and they require that the number of parallel modules is much lower than the network depth for their theoretical guarantees to hold. As in those cases, using multi-layer decoupled modules can improve performance and is natural in the case of deeper networks. We now use such a multi-layer approach to directly compare to the backward unlocking of (Huo et al., 2018b), and then subsequently we will apply this on deep networks for ImageNet. We will denote from here on the number of total modules a network is split into as K.

Comparison to DDG  (Huo et al., 2018b) proposes a solution to backward locking (less efficient than solving update locking, see discussion above). We show that even in this situation the DGL method can provide a strong baseline for work on backward unlocking. We take the example from (Huo et al., 2018b), which considers a ResNet-110 parallelized into K = 2 blocks. We use the auxiliary network MLP-SR-aux, which has less than 0.1% of the FLOP count of the primary network. We use the exact optimization and network split points as in (Huo et al., 2018b). To assess variance in the accuracy for CIFAR-10 we perform 3 trials. We observe in Tab. 2 that the accuracy is the same across the DDG method, backprop, and our approach. DDG achieves better parallelization because it also splits the forward pass.

Backprop: 93.53    DDG: 93.41    DGL: 93.5 ± 0.1

Table 2. ResNet-110 (K = 2) for Backprop and the DDG method as reported from (Huo et al., 2018b). DGL is run for 3 trials to compute variance. The approaches give the same accuracy, with DGL being update unlocked and DDG only backward unlocked. DNI is reported to not work in this setting (Huo et al., 2018b).

5.2. Large-scale Experiments

Existing methods considering update or backward locking have not been evaluated on large image datasets, as they are often unstable or already show large losses in accuracy on smaller datasets. Here we study the optimization of several well-known architectures, mainly the VGG family (Simonyan & Zisserman, 2014) and the ResNet (He et al., 2016). In all our experiments we use the MLP-SR-aux auxiliary network, which scales well from the smaller CIFAR-10 images to the larger ImageNet ones. The final module does not have an auxiliary model.

For all optimization of auxiliary problems and for end-to-end optimization of reference models we use the optimization schedule prescribed in (Xiao et al., 2019). It consists of training for 50 epochs with mini-batches of size 256, and uses SGD with momentum of 0.9, weight decay of 1e−4, and a learning rate of 0.1 reduced by a factor of 10 every 10 epochs. Results are shown in Tab. 3. For several of the models DGL can perform as well as, and sometimes better than, the end-to-end trained model, while permitting parallel training. For the VGG-13 architecture we also evaluate the case where the model is trained layer by layer (K = 10). Although performance is degraded by this split, we find its performance surprising considering that no backward communication is performed. We conjecture that improved auxiliary models and combinations with methods such as (Huo et al., 2018a; Jaderberg et al., 2017) that allow feedback on top of the local model may further improve performance. Also, as mentioned, for the settings with larger potential parallelization, slower but more performant auxiliary models could potentially be considered as well.
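For reference, a minimal sketch of this optimization schedule for one local (or end-to-end) problem, assuming PyTorch; the model and data loader are placeholders.

```python
import torch

def make_optimizer(model):
    # Schedule of (Xiao et al., 2019): SGD, momentum 0.9, weight decay 1e-4,
    # learning rate 0.1 divided by 10 every 10 epochs, for 50 epochs, batch size 256.
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.1)
    return opt, sched

# opt, sched = make_optimizer(model)
# for epoch in range(50):
#     train_one_epoch(model, loader_bs256, opt)   # placeholder training loop
#     sched.step()
```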
We also remark that the synchronous DGL has favorable memory usage compared to DDG and to the DNI method, with DNI requiring the storage of larger activations and DDG incurring additional memory overhead. Although not the focus of this work, even the single-worker version of DGL has favorable memory usage compared to standard end-to-end backpropagation training. For example, in the ResNet-152 DGL K = 2 setting we have considered, 38% more samples fit on a single 16GB GPU than with standard end-to-end training.
Model (training method)             Top-1    Top-5
VGG-13 (DGL per Layer, K = 10)      64.4     85.8
VGG-13 (DGL K = 4)                  67.8     88.0
VGG-13 (backprop)                   66.6     87.5
VGG-19 (DGL K = 4)                  69.2     89.0
VGG-19 (DGL K = 2)                  70.8     90.2
VGG-19 (backprop)                   69.7     89.7
ResNet-152 (DGL K = 2)              74.5     92.0
ResNet-152 (backprop)               74.4     92.1

Table 3. ImageNet results. We observe that DGL can be effective for VGG and ResNet models, obtaining similar or better accuracies while permitting parallelization and reduced memory.

                          Flops Net     Flops Aux
VGG-13 (DGL K = 4)        13 GFLOPs     0.2 GFLOP
VGG-19 (DGL K = 4)        20 GFLOPs     0.2 GFLOP
ResNet-152 (DGL K = 2)    11 GFLOP      0.02 GFLOP

Table 4. ImageNet comparison of FLOPs for the auxiliary model in the major models trained. Auxiliary networks are negligible.

5.3. Asynchronous DGL with Replay

We now study the stability of Alg. 2 w.r.t. the buffers. We use a 5 layer CIFAR-10 network with the MLP-aux and with all other architecture and optimization settings as in the auxiliary network experiments of Sec. 5.1. Each layer is equipped with a buffer of size M. At each iteration, a layer is chosen according to the pmf p(j), and a batch is selected from buffer B_{j−1}. One layer is slowed down by decreasing its selection probability in the pmf p(j) by a factor S (see the sketch below). We evaluate different slowdown factors (up to S = 2.0). Accuracy versus slowdown factor is shown in Fig. 5. For this experiment we use a buffer of size M = 50. We run separate experiments with the slowdown applied at each layer of the network, as well as 3 random seeds for each of these settings (for a total of 18 experiments per data point). We show the evaluations for 10 values of S. To ensure a fair comparison we also stop updating layers once they have completed the iterations for 50 epochs, thus assuring an identical number of gradient updates for all layers in all experiments compared. In practice one could continue updating until all layers have completed training.

[Figure 5: test accuracy on CIFAR-10 versus module slowdown factor (1.0–2.0) for DGL+Replay (Async) and the synchronous reference.]

Figure 5. Evaluation of Async DGL. A single layer of the network is slowed down on average relative to the others; we observe negligible losses of accuracy even at substantial delays.
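A small sketch of how the slowdown described above is simulated in Alg. 2: the selection probability of one module is divided by the factor S and the pmf is renormalized. Function and variable names are illustrative.

```python
import numpy as np

def slowdown_pmf(num_modules, slow_idx, S):
    # Uniform pmf over modules, with module `slow_idx` selected S times less often.
    p = np.ones(num_modules)
    p[slow_idx] /= S
    return p / p.sum()

p = slowdown_pmf(num_modules=5, slow_idx=2, S=1.2)
j = np.random.choice(len(p), p=p)   # module scheduled at this iteration (Alg. 2, line 3)
```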
In Fig. 5 we show the resulting accuracies and compare them to the synchronous case (standard DGL). We first observe that the accuracy of the synchronous algorithm is maintained in the setting where S = 1 and the pmf is uniform. Note that even this is a non-trivial case, as it means that layers inherently have random delays (as compared to the synchronous Alg. 1). We see that accuracy is maintained until approximately 1.2× and accuracy losses after that still remain small. Our maximum slowdown factor of 2× is somewhat drastic: it means that for the 50 epochs of training, the slowed-down layer is only on epoch 25 while those following it are at epoch 50.

In a second experiment we evaluate performance with respect to the buffer size. Results are shown in Fig. 6. For this experiment we fix the slowdown factor to 1.2×. We observe that even a very small buffer size yields only a slight loss in accuracy. Building on this demonstration, there are multiple directions to improve Async DGL with replay: for example, improving the efficiency of the buffer by including data augmentation in feature space (Verma et al., 2018), mixing samples in batches, and improved batch sampling, among other directions.

[Figure 6: test accuracy on CIFAR-10 versus buffer size (10–50) for Async DGL.]

Figure 6. Buffer size vs. accuracy for Async DGL. Smaller buffer sizes produce only a small loss in accuracy.

6. Conclusion

We have analyzed and introduced a simple and strong baseline for parallelizing per-layer and per-module computations in CNN training. Our approach is shown to match or exceed state-of-the-art approaches addressing these problems and is shown to scale to much larger datasets than others. Future work can develop improved auxiliary problem objectives and combinations with delayed feedback.
Acknowledgements

EO and EB acknowledge NVIDIA for its GPU donation. We would like to thank John Zarka, Louis Thiry, Georgios Exarchakis, Fabian Pedregosa, Maxim Berman, Amal Rannen, Aaron Courville, and Nicolas Pinto for helpful discussions.

References

Bartunov, S., Santoro, A., Richards, B., Marris, L., Hinton, G. E., and Lillicrap, T. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Advances in Neural Information Processing Systems, pp. 9389–9399, 2018.

Belilovsky, E., Eickenberg, M., and Oyallon, E. Greedy layerwise learning can scale to imagenet, 2018. URL http://arxiv.org/abs/1812.11446.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pp. 153–160, 2007.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

Carreira-Perpinan, M. and Wang, W. Distributed optimization of deeply nested systems. In Artificial Intelligence and Statistics, pp. 10–19, 2014.

Choromanska, A., Kumaravel, S., Luss, R., Rish, I., Kingsbury, B., Tejwani, R., and Bouneffouf, D. Beyond backprop: Alternating minimization with co-activation memory. arXiv preprint arXiv:1806.09077, 2018.

Czarnecki, W. M., Swirszcz, G., Jaderberg, M., Osindero, S., Vinyals, O., and Kavukcuoglu, K. Understanding synthetic gradients and decoupled neural interfaces. CoRR, abs/1703.00522, 2017. URL http://arxiv.org/abs/1703.00522.

Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017. URL http://arxiv.org/abs/1706.02677.

Haarnoja, T., Hartikainen, K., Abbeel, P., and Levine, S. Latent space policies for hierarchical reinforcement learning. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1851–1860, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/haarnoja18a.html.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Huang, F., Ash, J., Langford, J., and Schapire, R. Learning deep resnet blocks sequentially using boosting theory. International Conference on Machine Learning (ICML), 2018.

Huo, Z., Gu, B., and Huang, H. Training neural networks using features replay. Advances in Neural Information Processing Systems, 2018a.

Huo, Z., Gu, B., Yang, Q., and Huang, H. Decoupled parallel backpropagation with convergence guarantee. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2098–2106, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018b. PMLR. URL http://proceedings.mlr.press/v80/huo18a.html.

Ivakhnenko, A. G. and Lapa, V. G. Cybernetic Predicting Devices. CCM Information Corporation, 1965.

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. International Conference on Machine Learning, 2017.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Leblond, R., Pedregosa, F., and Lacoste-Julien, S. Asaga: Asynchronous parallel saga. In 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Leblond, R., Pedregosa, F., and Lacoste-Julien, S. Improved asynchronous parallel optimization analysis for stochastic incremental methods. Journal of Machine Learning Research, 19(81):1–68, 2018. URL http://jmlr.org/papers/v19/17-650.html.

Lee, D., Zhang, S., Biard, A., and Bengio, Y. Target propagation. CoRR, abs/1412.7525, 2014. URL http://arxiv.org/abs/1412.7525.

Lian, X., Zhang, W., Zhang, C., and Liu, J. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017.
Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. Random feedback weights support learning in deep neural networks. CoRR, abs/1411.0247, 2014. URL http://arxiv.org/abs/1411.0247.

Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

Marquez, E. S., Hare, J. S., and Niranjan, M. Deep cascade learning. IEEE Transactions on Neural Networks and Learning Systems, 29(11):5475–5485, Nov 2018. ISSN 2162-237X. doi: 10.1109/TNNLS.2018.2805098.

Nøkland, A. Direct feedback alignment provides learning in deep neural networks. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 1037–1045. 2016.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Taylor, G., Burmeister, R., Xu, Z., Singh, B., Patel, A., and Goldstein, T. Training neural networks without gradients: A scalable admm approach. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2722–2731, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/taylor16.html.

Verma, V., Lamb, A., Beckham, C., Najafi, A., Courville, A., Mitliagkas, I., and Bengio, Y. Manifold mixup: Learning better representations by interpolating hidden states. 2018.

Xiao, W., Chen, H., Liao, Q., and Poggio, T. A. Biologically-plausible learning algorithms can scale to large datasets. International Conference on Learning Representations, 2019.

Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Zhang, S., Choromanska, A. E., and LeCun, Y. Deep learning with elastic averaging sgd. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 685–693. Curran Associates, Inc., 2015.
A. Proofs

Lemma 4.1. Under Assumptions 3 and 4, one has: ∀Θ_j, E_{p*_{j−1}}[‖∇ℓ(Z_{j−1}; Θ_j)‖²] ≤ G.

Proof. First of all, observe that under Assumption 4 and via Fubini's theorem:

Σ_t c_{j−1}^t = Σ_t ∫ |p_{j−1}^t(z) − p*_{j−1}(z)| dz = ∫ Σ_t |p_{j−1}^t(z) − p*_{j−1}(z)| dz < ∞,    (5)

thus Σ_t |p_j^t − p*_j| is convergent a.s., and |p_j^t − p*_j| → 0 a.s. as well. From Fatou's lemma, one has:

∫ p*_{j−1}(z) ‖∇_{Θ_j} ℓ(z; Θ_j)‖² dz = ∫ lim inf_t p_{j−1}^t(z) ‖∇_{Θ_j} ℓ(z; Θ_j)‖² dz    (6)
    ≤ lim inf_t ∫ p_{j−1}^t(z) ‖∇_{Θ_j} ℓ(z; Θ_j)‖² dz ≤ G.    (7)

Lemma 4.2. Under Assumptions 1, 3 and 4, we have:

E[L(Θ_j^{t+1})] ≤ E[L(Θ_j^t)] − η_t (E[‖∇L(Θ_j^t)‖²] − √(2G c_{j−1}^t)) + (LG/2) η_t²,

where the expectation is taken over each random variable.

Proof. By L-smoothness:

L(Θ_j^{t+1}) ≤ L(Θ_j^t) + ∇L(Θ_j^t)^T (Θ_j^{t+1} − Θ_j^t) + (L/2) ‖Θ_j^{t+1} − Θ_j^t‖².    (8)

Substituting Θ_j^{t+1} − Θ_j^t on the right:

L(Θ_j^{t+1}) ≤ L(Θ_j^t) − η_t ∇L(Θ_j^t)^T ∇_{Θ_j} ℓ(Z_{j−1}^t; Θ_j^t) + (L η_t²/2) ‖∇_{Θ_j} ℓ(Z_{j−1}^t; Θ_j^t)‖².    (9)

Taking the expectation w.r.t. Z_{j−1}^t, which has density p_{j−1}^t, we get:

E_{p_{j−1}^t}[L(Θ_j^{t+1})] ≤ L(Θ_j^t) − η_t ∇L(Θ_j^t)^T E_{p_{j−1}^t}[∇_{Θ_j} ℓ(Z_{j−1}^t; Θ_j^t)] + (L η_t²/2) E_{p_{j−1}^t}[‖∇_{Θ_j} ℓ(Z_{j−1}^t; Θ_j^t)‖²].

From Assumption 3, we obtain that:

(L η_t²/2) E_{p_{j−1}^t}[‖∇_{Θ_j} ℓ(Z_{j−1}^t; Θ_j^t)‖²] ≤ (L η_t² G)/2.    (10)

Then, as a side computation, observe that:

‖E_{p_{j−1}^t}[∇_{Θ_j} ℓ(Z_{j−1}^t; Θ_j^t)] − ∇L(Θ_j^t)‖ = ‖∫ ∇ℓ(z; Θ_j^t) p_{j−1}^t(z) dz − ∫ ∇ℓ(z; Θ_j^t) p*_{j−1}(z) dz‖    (11)
    ≤ ∫ ‖∇ℓ(z; Θ_j^t)‖ |p_{j−1}^t(z) − p*_{j−1}(z)| dz    (12)
    = ∫ ‖∇ℓ(z; Θ_j^t)‖ √(|p_{j−1}^t(z) − p*_{j−1}(z)|) √(|p_{j−1}^t(z) − p*_{j−1}(z)|) dz.    (13)

Let us apply the Cauchy-Schwarz inequality; we obtain:

‖E_{p_{j−1}^t}[∇_{Θ_j} ℓ(Z_{j−1}^t; Θ_j^t)] − ∇L(Θ_j^t)‖ ≤ √(∫ ‖∇ℓ(z; Θ_j^t)‖² |p_{j−1}^t(z) − p*_{j−1}(z)| dz) · √(∫ |p_{j−1}^t(z) − p*_{j−1}(z)| dz)    (15)
    = √(∫ ‖∇ℓ(z; Θ_j^t)‖² |p_{j−1}^t(z) − p*_{j−1}(z)| dz) · √(c_j^t).    (16)

Then, observe that:

∫ ‖∇ℓ(z; Θ_j^t)‖² |p_{j−1}^t(z) − p*_{j−1}(z)| dz ≤ ∫ ‖∇ℓ(z; Θ_j^t)‖² (p_{j−1}^t(z) + p*_{j−1}(z)) dz    (17)
    = E_{p_{j−1}^t}[‖∇ℓ(Z_{j−1}; Θ_j^t)‖²] + E_{p*_{j−1}}[‖∇ℓ(Z_{j−1}; Θ_j^t)‖²]    (18)
    ≤ 2G.    (19)

The last inequality follows from Lemma 4.1 and Assumption 3.

Then, using again the Cauchy-Schwarz inequality:

|‖∇L(Θ_j^t)‖² − ∇L(Θ_j^t)^T E_{p_{j−1}^t}[∇_{Θ_j} ℓ(Z_{j−1}^t; Θ_j^t)]| = |∇L(Θ_j^t)^T (∇L(Θ_j^t) − E_{p_{j−1}^t}[∇_{Θ_j} ℓ(Z_{j−1}^t; Θ_j^t)])|    (20)
    ≤ ‖∇L(Θ_j^t)‖ · ‖E_{p_{j−1}^t}[∇_{Θ_j} ℓ(Z_{j−1}^t; Θ_j^t)] − ∇L(Θ_j^t)‖    (21)
    ≤ ‖∇L(Θ_j^t)‖ √(2G c_j^t).    (22)

Then, taking the expectation leads to

|E[‖∇L(Θ_j^t)‖² − ∇L(Θ_j^t)^T E_{p_{j−1}^t}[∇ℓ_{j,t}]]| ≤ E[|‖∇L(Θ_j^t)‖² − ∇L(Θ_j^t)^T E_{p_{j−1}^t}[∇ℓ_{j,t}]|]    (23)
    ≤ E[‖∇L(Θ_j^t)‖] √(2G c_j^t)    (24)
    ≤ √(E[‖∇L(Θ_j^t)‖²]) √(2G c_j^t).    (25)

However, observe that by Lemma 4.1 and Jensen's inequality:

‖∇L(Θ_j^t)‖² = ‖E_{p*_j}[∇ℓ(Z; Θ_j^t)]‖² ≤ E_{p*_j}[‖∇ℓ(Z; Θ_j^t)‖²] ≤ G.    (27)

Combining this inequality and Assumption 3, we get:

E[L(Θ_j^{t+1})] ≤ E[L(Θ_j^t)] − η_t (E[‖∇L(Θ_j^t)‖²] − √(2G c_{j−1}^t)) + (LG/2) η_t².

Proposition 4.1. Under Assumptions 1, 2, 3 and 4, each term of the following equation converges:

Σ_{t=0}^T η_t E[‖∇L(Θ_j^t)‖²] ≤ E[L(Θ_j^0)] + G Σ_{t=0}^T η_t (√(2 c_{j−1}^t) + L η_t / 2).    (28)
Proof. Applying Lemma 4.2 for t = 0, ..., T − 1, we obtain (observe the telescoping sum), for our non-negative loss:

Σ_{t=0}^T η_t E[‖∇L(Θ_j^t)‖²] ≤ E[L(Θ_j^0)] − E[L(Θ_j^{T+1})] + √(2G) Σ_{t=0}^T √(c_j^t) η_t + (LG/2) Σ_{t=0}^T η_t²    (29)
    ≤ E[L(Θ_j^0)] + √(2G) Σ_{t=0}^T √(c_j^t) η_t + (LG/2) Σ_{t=0}^T η_t².    (30)

Yet Σ_t √(c_j^t) η_t is convergent, as Σ_t c_j^t and Σ_t η_t² are convergent; thus the right-hand side is bounded.

B. Additional pseudo-code
To illustrate the parallel implementations of the algorithms, we show a different pseudocode formulation with the behavior of each worker made explicit. Algorithm 3 below is equivalent to Algorithm 1 in terms of output but directly illustrates a parallel implementation. Similarly, Algorithm 4 illustrates a parallel implementation of the algorithm described in Algorithm 2. The probabilities used in Algorithm 4 are not included here, as they are derived from communication and computation speed differences.

Algorithm 3: DGL Parallel Implementation
Input: Stream S ≜ {(x_0^t, y^t)}_{t≤T} of samples or mini-batches
1   Initialize parameters {θ_j, γ_j}_{j≤J}
2   Worker 0:
3   for x_0^t ∈ S do
4       x_1^t ← f_{θ_0^t}(x_0^t)
5       Send x_1^t to worker 1
6       Compute ∇_{(γ_0, θ_0)} L̂(y^t, x_1^t; γ_0^t, θ_0^t)
7       (θ_0^{t+1}, γ_0^{t+1}) ← Step of parameters (θ_0^t, γ_0^t)
8   end
9   Worker j:
10  for t ∈ 0, ..., T do
11      Wait until x_{j−1}^t is available
12      x_j^t ← f_{θ_{j−1}^t}(x_{j−1}^t)
13      Compute ∇_{(γ_j, θ_j)} L̂(y^t, x_j^t; γ_j^t, θ_j^t)
14      Send x_j^t to worker j + 1
15      (θ_j^{t+1}, γ_j^{t+1}) ← Step of parameters (θ_j^t, γ_j^t)
16  end

Algorithm 4: Asynchronous DGL Buffer Parallel Implementation
Input: Stream S ≜ {(x_0^t, y^t)}_{t≤T}; distribution of the delay p = {p_j}_j; buffer size M
1   Initialize: buffers {B_j}_j with size M; parameters {θ_j, γ_j}_j
2   Worker j:
3   while training do
4       if j = 1 then (x_0, y) ← S else (x_{j−1}, y) ← B_{j−1}
5       x_j ← f_{θ_{j−1}}(x_{j−1})
6       Compute ∇_{(γ_j, θ_j)} L̂(y, x_j; γ_j, θ_j)
7       (θ_j, γ_j) ← Step of parameters (θ_j, γ_j)
8       if j < J then B_j ← (x_j, y)
9   end
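A minimal sketch of how the worker loop of Algorithm 3 could be realized, with one queue per adjacent pair of workers. Python threads and in-memory queues stand in here for separate devices and actual inter-node communication, and all names are illustrative.

```python
import queue
import threading
import torch.nn as nn

def worker(module, aux, opt, in_q, out_q, num_steps):
    # One worker of Algorithm 3: read (x_{j-1}, y), forward, send x_j, update locally.
    for _ in range(num_steps):
        x, y = in_q.get()                        # wait until the input representation is available
        x = module(x)
        if out_q is not None:
            out_q.put((x.detach(), y))           # send x_j to the next worker immediately
        loss = nn.functional.cross_entropy(aux(x), y)
        opt.zero_grad()
        loss.backward()                          # gradients stay within this worker's module
        opt.step()

def run(modules, aux_heads, opts, stream, num_steps):
    queues = [queue.Queue(maxsize=1) for _ in modules]
    threads = []
    for j, (f, g, opt) in enumerate(zip(modules, aux_heads, opts)):
        out_q = queues[j + 1] if j + 1 < len(modules) else None
        t = threading.Thread(target=worker, args=(f, g, opt, queues[j], out_q, num_steps))
        t.start()
        threads.append(t)
    for _ in range(num_steps):                   # worker 0's input comes from the data stream
        queues[0].put(next(stream))
    for t in threads:
        t.join()
```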

C. Additional Descriptions of Experiments


Here we provide some additional details of the experiments. Code for experiments is provided along with the supplementary
materials.

Comparisons to DNI  The comparison to DNI attempts to directly replicate the experiment of Appendix C.1 in (Jaderberg et al., 2017). Although the baseline accuracies for backprop and cDNI are close to those reported in the original work, those of DNI are worse than those reported in (Jaderberg et al., 2017), which could be due to minor differences in the implementation. We utilize a popular publicly available PyTorch DNI implementation, and source code will be provided.

C.1. Auxiliary Network Sizes

We briefly illustrate the sizes of the auxiliary networks. Let us take as an example the ImageNet experiments for VGG-13. At the first layer the output is 224 × 224 × 64. The MLP-aux here would be applied after averaging to 2 × 2 × 64, and would consist of 3 fully connected layers of size 256 (2 × 2 × 64) followed by a projection to the 1000 image categories. The MLP-SR-aux network used would first reduce to 56 × 56 × 64 and then apply 3 layers of 1 × 1 convolutions of size 64. This is followed by a reduction to 2 × 2 and 3 FC layers as in the MLP-aux network.
