Decoupled Greedy Learning of CNNs

Eugene Belilovsky (Mila, University of Montreal), Michael Eickenberg (University of California, Berkeley), Edouard Oyallon (CentraleSupelec and INRIA). Correspondence to: Eugene Belilovsky <eugene.belilovsky@umontreal.ca>, Michael Eickenberg <michael.eickenberg@berkeley.edu>, Edouard Oyallon <edouard.oyallon@centralesupelec.fr>.

Abstract

A well-known inefficiency of training by back-propagation is the update locking problem: each layer must wait for the signal to propagate through the network before updating. We consider and analyze a training procedure, Decoupled Greedy Learning (DGL), that addresses this problem more effectively and at scales beyond those of previous solutions. It is based on a greedy relaxation of the joint training objective, recently shown to be effective in the context of Convolutional Neural Networks (CNNs) on large-scale image classification. We consider an optimization of this objective that permits us to decouple the layer training, allowing for layers or modules in networks to be trained with a potentially linear parallelization in layers. We show theoretically and empirically that this approach converges. In addition, we empirically find that it can lead to better generalization than sequential greedy optimization and even standard end-to-end back-propagation. We show that an extension of this approach to asynchronous settings, where modules can operate with large communication delays, is possible with the use of a replay buffer. We demonstrate the effectiveness of DGL against alternatives on the CIFAR-10 dataset and on the large-scale ImageNet dataset, where we are able to effectively train VGG and ResNet-152 models.

1. Introduction

Training all layers jointly using back-propagation is the standard method for learning neural networks, including the computationally intensive vision models based on Convolutional Neural Networks (CNNs) (Goyal et al., 2017). Due to the sequential nature of the gradient processing, standard back-propagation has several well-known inefficiencies, commonly described in terms of locking (Jaderberg et al., 2017): backward unlocking would permit updates of all modules once signals have propagated to all subsequent modules, update unlocking would permit updates of a module before a signal has reached all subsequent modules, and forward unlocking would permit a module to operate asynchronously from its predecessor and dependent modules.

Multiple methods have been proposed which can deal, to a certain degree, with the backward unlocking challenge (Huo et al., 2018b; Choromanska et al., 2018; Nøkland, 2016). Jaderberg et al. (2017) and Czarnecki et al. (2017) propose and analyze DNI, a method that addresses the more challenging update locking. The DNI approach uses an auxiliary network to predict the gradient of the backward pass directly from the input. This method has not been shown to scale well computationally or in terms of accuracy, especially in the case of CNNs (Huo et al., 2018b;a). Indeed, auxiliary networks must predict a weight gradient that can be very large in dimensionality, which can be inaccurate and challenging to scale when intermediate representations are large, as is the case for larger models and input image sizes.

Recently, several authors have revisited the classic (Ivakhnenko & Lapa, 1965; Bengio et al., 2007) approach of supervised greedy layer-wise training of neural networks (Huang et al., 2018; Marquez et al., 2018). Belilovsky et al. (2018) show that such an approach, which relaxes the joint learning objective, can lead to high-performance deep CNNs on large-scale datasets. Some of these works also consider the use of auxiliary networks with hidden layers as part of the local auxiliary problems, which has some analogs to the auxiliary networks of DNI and target propagation (Lee et al., 2014). We will show that the greedy learning objective can be solved with an alternative optimization algorithm, which permits decoupling the computations and achieving update unlocking. This can also be augmented with replay buffers (Lin, 1992) to permit forward unlocking. This strategy can be shown to be a state-of-the-art baseline for parallelizing the training across modules of a neural network.
Our contributions in this work are as follows. We (a) propose an optimization procedure for a decoupled greedy learning objective that solves the update locking problem. (b) Empirically, we show that it exhibits similar convergence rates and generalization as its non-decoupled counterpart. (c) We show that it can be extended to an asynchronous setting by use of a replay buffer, providing a step towards addressing the forward locking problem. (d) We motivate these observations theoretically, showing that the proposed optimization procedure converges and recovers standard rates of non-convex optimization. Experimentally, we (e) design an improved auxiliary network structure for greedy layer-wise training of CNNs that maintains accuracy while having negligible cost for the auxiliary task. We (f) show that decoupled greedy learning can outperform competing methods in terms of scalability to larger and deeper models and stability to optimization hyper-parameters, allowing it to be applied to large datasets. We then demonstrate on the ImageNet dataset that we can train the deep models VGG-19 and ResNet-152 with larger degrees of parallelism than other proposals and with reduced memory consumption. Code for experiments will be made available at https://github.com/eugenium/DGL.

2. Related work

To the best of our knowledge, (Jaderberg et al., 2017) is the only work which directly addresses the update or forward locking problems in deep feed-forward networks. Other works (Huo et al., 2018a;b) consider the backward locking problem; furthermore, a number of back-propagation alternatives such as (Choromanska et al., 2018; Lee et al., 2014; Nøkland, 2016) are also able to address this problem. However, update locking is a more severe inefficiency. Consider the case where each layer's forward processing time is $T_F$ and is equal across a network of $L$ layers. Given that the backward pass of back-propagation is a constant multiple in time of the forward pass, in the most ideal case backward unlocking will still only scale as $O(L T_F)$ with $L$ parallel nodes, while update unlocking could scale as $O(T_F)$.
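To make this scaling argument concrete, the following is a toy latency model (an illustrative assumption for the argument above, not a measurement of any real system); `beta` denotes the ratio of backward to forward time per module:

```python
def time_per_update(L, T_F, beta=1.0, scheme="backprop"):
    """Idealized wall-clock time between consecutive weight updates for a
    network of L equal modules with forward time T_F and backward time
    beta * T_F per module (a toy model for the scaling discussion above)."""
    if scheme == "backprop":
        # fully locked: a full forward and a full backward pass per update
        return L * T_F * (1 + beta)
    if scheme == "backward_unlocked":
        # all modules update simultaneously, but only once the forward
        # signal has reached the top: still O(L * T_F)
        return L * T_F + beta * T_F
    if scheme == "update_unlocked":
        # each module updates from its local signal; in steady state the
        # pipeline delivers one mini-batch per module every O(T_F)
        return T_F * (1 + beta)
    raise ValueError(f"unknown scheme: {scheme}")


print(time_per_update(L=100, T_F=1.0, scheme="backward_unlocked"))  # 101.0
print(time_per_update(L=100, T_F=1.0, scheme="update_unlocked"))    # 2.0
```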
One class of alternatives to standard back-propagation aims to avoid its biologically implausible aspects, most notably the weight transport problem (Bartunov et al., 2018; Nøkland, 2016; Lillicrap et al., 2014; Lee et al., 2014). Some of these methods (Lee et al., 2014; Nøkland, 2016) can also achieve backward unlocking as they permit all parameters to be updated at the same time, but only once the signal has propagated to the top layer. None of them, however, solve the update locking or forward locking problems which we consider. Target propagation uses a local auxiliary network as in our approach, which is used to propagate backward the optimal activations computed from the layer above. Feedback alignment replaces the symmetric weights of the backward pass with random weights. Direct feedback alignment extends the idea of feedback alignment by passing errors from the top to all layers, potentially permitting a simultaneous update. These approaches have also not been shown to be scalable to large datasets (Bartunov et al., 2018), obtaining only 17.5% top-5 accuracy on ImageNet (for a reference model that achieves 59.8%). On the other hand, a greedy learning strategy has been shown to work well on the same task (Belilovsky et al., 2018).

Another line of related work, inspired by optimization methods such as the Alternating Direction Method of Multipliers (ADMM) (Taylor et al., 2016; Carreira-Perpinan & Wang, 2014; Choromanska et al., 2018), considers approaches that use auxiliary variables to break optimization into sub-problems. These approaches are fundamentally different from ours as they optimize the joint training objective, the auxiliary variables providing a link between a layer and its successive layers, whereas we consider a different objective where a layer has no dependence on its successors. None of these methods can achieve update or forward unlocking; however, some (Choromanska et al., 2018) are able to perform simultaneous weight updates (backward unlocked). Another issue with these methods is that most of the existing approaches, except for Choromanska et al. (2018), require standard ("batch") gradient descent and are thus difficult to scale. They also often involve an inner minimization problem and have thus not been demonstrated to work on realistic large-scale datasets. Furthermore, none of these have been combined with CNNs.

Distributed optimization based on data parallelism is a popular area of research in machine learning beyond deep learning models and is often studied in the convex setting (Leblond et al., 2018). In deep network optimization the predominant method is distributed synchronous SGD (Goyal et al., 2017) and variants, as well as asynchronous variants (Zhang et al., 2015). Our approach can instead be seen as exploiting a form of model parallelism rather than data parallelism, and it can easily be combined with many of these methods, particularly distributed synchronous SGD.

3. Parallel Decoupled Greedy Learning

In this section we formally define the greedy objective and the parallel optimization which we study in both the synchronous and asynchronous settings. We mainly consider the online setting and assume a stream of samples or mini-batches denoted $S \triangleq \{(x_0^t, y^t)\}_{t \le T}$, that can be run during $T$ iterations.

3.1. Optimization for Greedy Objective

Let $X_0$ and $Y$ be the data matrix and labels for the training data. Let $X_j$ be the output representation for module $j$.
We will denote the per-module objective function $\hat{L}(X_j, Y; \theta_j, \gamma_j)$, where the parameters $\theta_j$ correspond to the module parameters (i.e., $X_{j+1} = f_{\theta_j}(X_j)$) and $\gamma_j$ correspond to auxiliary parameters used to compute the objective. $\hat{L}$ in our case will be the empirical risk with a cross-entropy loss. The greedy training objective is thus given recursively by defining $P_j$: [...]

Algorithm 1 Synchronous DGL (fragment)
    for $(x_0^t, y^t) \in S$ do
        for $j \in 1, \ldots, J$ do
            $x_j^t \leftarrow f_{\theta_{j-1}}(x_{j-1}^t)$
            [...]

Figure 1. We illustrate the signal propagation for three mini-batches processed by standard back-propagation and with decoupled greedy learning. In each case a module can begin processing forward and backward passes as soon as possible. For illustration we assume the same speed for forward and backward passes, and discount the auxiliary network computation (negligible in our experiments).

[...] We will evaluate one instance of such an algorithm based on the use of a replay buffer of size $M$, shown in Alg. 2. Here each module maintains a buffer to which it writes its output representations, which is read by the module above.
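As a concrete illustration of the decoupled update, here is a minimal PyTorch-style sketch of one synchronous step in the spirit of Alg. 1 (the organization into module/auxiliary-head/optimizer triples is an assumption for illustration, not the reference implementation; each optimizer is assumed to hold the parameters of its module and auxiliary head):

```python
import torch

def dgl_step(x0, y, modules, aux_heads, optimizers):
    """One synchronous DGL step: each module j is trained only against its
    local auxiliary loss and passes a detached representation upward, so
    no gradient ever crosses module boundaries (update unlocking)."""
    criterion = torch.nn.CrossEntropyLoss()
    h = x0
    losses = []
    for module, aux, opt in zip(modules, aux_heads, optimizers):
        h = module(h)                # forward through module j
        loss = criterion(aux(h), y)  # local objective L_hat(X_j, Y; theta_j, gamma_j)
        opt.zero_grad()
        loss.backward()              # gradients stay within module j and its head
        opt.step()
        h = h.detach()               # block gradient flow to module j+1
        losses.append(loss.item())
    return losses
```

In the asynchronous variant, each (module, auxiliary head) pair would instead read its input from, and write its detached output to, a bounded replay buffer of size $M$, as described above.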
[...] reuse of samples. Alg. 2 permits a controlled simulation of processing speed discrepancies and will be used over settings of $p$ and $M$ to demonstrate that training accuracy and testing accuracy remain robust in practical regimes. Appendix B also provides a more intuitive pseudo-code for how this buffer-based algorithm would be implemented in a parallel environment.

As will be demonstrated in our experiments, DGL can potentially be robust to substantial asynchronous behavior. Unlike common data-parallel asynchronous algorithms (Zhang et al., 2015), asynchronous DGL does not rely on a master node and requires only local communication, similar to recent decentralized schemes (Lian et al., 2017). Unlike decentralized SGD algorithms, nodes only need to maintain and update the parameters of their local model, potentially supporting much larger models. Combining DGL with data-parallel methods is also natural. For example, a common issue of the popular distributed synchronous SGD in deep CNNs is the often limited maximum batch size (Goyal et al., 2017; Keskar et al., 2016). This suggests that DGL can be used in combination with data parallelism to add an additional dimension of parallelization. Potentially combining asynchronous DGL with distributed synchronous SGD for sub-problem optimization is a promising direction.

3.3. Auxiliary and Primary Network Design

The DNI method requires an auxiliary network to predict the gradient. The greedy layer-wise CNN training procedure of (Belilovsky et al., 2018), which we parallelize, similarly relies on an auxiliary network. This requires the design of an auxiliary network in addition to the CNN architecture design. Belilovsky et al. (2018) have shown that simple averaging operations can be used to construct a scalable auxiliary network. However, they did not directly consider the parallel training use case. Here care must be taken in the design, as will be discussed in the experimental section. The primary consideration in our case is the relative speed of the auxiliary network with respect to the associated module it is attached to in the primary network. We will primarily use FLOP count in our analysis and aim to restrict our auxiliary networks to be 5% of the primary network.
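As a back-of-the-envelope illustration of this kind of FLOP budgeting (the exact accounting used in the experiments is not reproduced here), one can compare the cost of a primary convolution with that of a pooled MLP head:

```python
def conv_flops(h, w, c_in, c_out, k=3):
    """Approximate multiply-accumulate count of a shape-preserving k x k convolution."""
    return h * w * c_out * k * k * c_in

def mlp_aux_flops(c_in, hidden=256, n_classes=10, pooled=2):
    """Approximate cost of a 3-layer MLP head applied after average-pooling
    the representation down to pooled x pooled spatial positions."""
    d = pooled * pooled * c_in
    return d * hidden + hidden * hidden + hidden * n_classes

# Example: one 3x3, 128-channel convolution on a 32 x 32 CIFAR-10 feature map
primary = conv_flops(32, 32, 128, 128)   # ~151M multiply-accumulates
aux = mlp_aux_flops(128)                 # ~0.2M multiply-accumulates
print(aux / primary)                     # ~0.001, well under a 5% budget
```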
Although auxiliary network design might seem like an additional layer of complexity in CNN design and might potentially require slightly different architecture principles for the primary network than standard end-to-end trained deep CNNs, this is not inherently prohibitive, since architecture design is well known to be related to the training procedure. As an example, consider the typical motivation for residual connections, which were originally motivated by optimization issues inherent to end-to-end back-propagation of deep networks.

We note that although we focus on the distributed learning context, the proposed optimization algorithm and associated theory for greedy objectives is generic and has other potential applications. Greedy objectives have recently been used in several applications in reinforcement learning (Haarnoja et al., 2018) and in ensemble methods like Boosting (Huang et al., 2018). Even with a single worker, synchronous DGL has a gain in terms of memory. Moreover, it is easier to implement efficiently than sequential greedy training, since in the naive sequential training scheme a forward pass through old modules or caching of previous activations is needed for optimal performance.

4. Theoretical Analysis

In this section we analyze the convergence of Alg. 1 when the update steps are obtained from stochastic gradient methods. We show that under the DGL optimization scheme a critical point can be reached. In standard stochastic optimization schemes, the input distribution fed to a model is fixed (Bottou et al., 2018). With the decoupled training procedure the input distribution to each module is time-varying and dependent on the convergence of the previous module. At time step $t$, for simplicity we will denote all parameters of a module (including auxiliary) as $\Theta_j^t \triangleq (\theta_j^t, \gamma_j^t)$, and samples as $Z_j^t \triangleq (X_j^t, Y)$, which follow the density $p_j^t(z)$. We aim to prove that each auxiliary problem of the DGL approach will converge to a critical point despite the time-varying inputs corresponding to sub-optimal outputs from prior modules. Proofs are given in Appendix A.

Let us fix a depth $j$, such that $j > 1$, and consider the target density of the previous layer, $p^*_{j-1}(z)$. We consider the following distance: $c_{j-1}^t \triangleq \int |p_{j-1}^t(z) - p^*_{j-1}(z)|\,dz$. Denoting $\ell$ the composition of the non-negative loss function and the network, we will study the expected risk $\mathcal{L}(\Theta_j) \triangleq \mathbb{E}_{p^*_{j-1}}[\ell(Z_{j-1}; \Theta_j)]$. We will now state several standard assumptions we use.

Assumption 1 (L-smoothness). $\mathcal{L}$ is differentiable and its gradient is $L$-Lipschitz.

We consider the SGD scheme with learning rate $\{\eta_t\}_t$:
$$\Theta_j^{t+1} = \Theta_j^t - \eta_t \nabla_{\Theta_j} \ell(Z_{j-1}^t; \Theta_j^t), \quad \text{where } Z_{j-1}^t \sim p_{j-1}^t \quad (1)$$

Assumption 2 (Robbins-Monro conditions). The step sizes satisfy $\sum_t \eta_t = \infty$ yet $\sum_t \eta_t^2 < \infty$.

We also assume bounded gradient moments:

Assumption 3 (Finite variance). There exists $G > 0$ such that for any $t$ and $\Theta_j$, $\mathbb{E}_{p_{j-1}^t}\|\nabla \ell(Z_{j-1}; \Theta_j)\|^2 \le G$.

Assumptions 1, 2 and 3 are standard (Bottou et al., 2018; Huo et al., 2018a), and we show in the following that our proof of convergence leads to similar rates, up to a multiplicative constant. The following assumption is specific to our setting, where we consider a time-varying distribution:

Assumption 4 (Convergence of the previous layer). We assume that $\sum_t c_{j-1}^t < \infty$.
Figure 2. Comparison of DNI, context DNI (cDNI), and DGL in terms of training loss and test accuracy (as a function of epoch) for the experiment from (Jaderberg et al., 2017). DGL converges better than cDNI and DNI with the same auxiliary net, and generalizes better than backprop for this case.
[...] substantial compared to the base network, leading to poorer parallelization gains. We note however that even in those cases (which we do not study here) where the auxiliary network computation is potentially on the order of that of the primary network, it can still give advantages for parallelization for very deep networks and many available workers.

The primary network architecture we use for these experiments is a simple CNN similar to VGG family models (Simonyan & Zisserman, 2014). It consists of 6 convolutional layers with 3 × 3 kernels, batch normalization, and shape-preserving padding, with 2 × 2 max-pooling operations at layers 1 and 3. The channel width of the first layer is 128 and is doubled at each downsampling operation. The final layer does not have an auxiliary model; for all experiments it is learned with a linear spatial averaging followed by a fully connected network with 2 hidden layers of constant width. Two alternatives to the CNN auxiliary of (Belilovsky et al., 2018) are explored, which exploit a spatial averaging operation. We re-iterate that this kind of approach, and even the simple network structure we consider, is not easily applicable in the case of DNI and synthetic gradient prediction. Optimization is done using a standard strategy for CIFAR CNN training. We apply SGD with momentum of 0.9 and weight decay of 5e-4 (Zagoruyko & Komodakis, 2016) and decaying step sizes. For these experiments we use a short schedule of 50 epochs and a decay factor of 0.2 every 15 epochs (Belilovsky et al., 2018). Results of comparisons are given in Table 1.
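In PyTorch terms, the optimization settings above correspond roughly to the following (a sketch; the initial learning rate is an assumption not stated in this excerpt):

```python
import torch

def make_cifar_optimizer(parameters, lr=0.1):
    """SGD with momentum 0.9 and weight decay 5e-4, decayed by a factor of
    0.2 every 15 epochs over a 50-epoch schedule, as described in the text."""
    opt = torch.optim.SGD(parameters, lr=lr, momentum=0.9, weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=15, gamma=0.2)
    return opt, sched
```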
The baseline auxiliary strategy, based on (Belilovsky et al., 2018) and (Jaderberg et al., 2017), applies 2 CNN layers followed by a spatial averaging to 2 × 2 before a final projection. We denote this CNN-aux. The first alternative we explore is a direct application of the spatial averaging to a 2 × 2 output shape (regardless of the input resolution), followed by a 3-layer MLP (of constant width). This is denoted MLP-aux and drastically reduces the FLOP count with minimal accuracy loss compared to CNN-aux. Finally, we consider applying a staged spatial resolution reduction: first reducing the spatial resolution by 4× (and total size by 16×), then applying three 1 × 1 convolutions followed by a reduction to 2 × 2 and a 3-layer MLP. We denote this approach MLP-SR-aux. These latter two strategies that leverage the spatial averaging produce auxiliary networks that are less than 5% of the FLOP count of the primary network, even for large spatial resolutions as in real-world image datasets. We will show that MLP-SR-aux is still effective even for the large-scale ImageNet dataset.

Sequential vs. Parallel Optimization of Greedy Objective. We briefly compare the sequential optimization of the greedy objective (Belilovsky et al., 2018; Bengio et al., 2007) to DGL (Alg. 1). We use a 4-layer CIFAR-10 network with an MLP-SR-aux auxiliary model and a final layer attached to a 2-layer MLP. We use the same optimization settings with 50 epochs as in the last experiment. In the sequential training we train each layer for 50 epochs before moving to the subsequent one. Thus the difference to DGL lies only in the input received at each layer (fully converged [...]

Figure 3. Comparison of sequential and parallel training. We observe parallel training catches up rapidly to sequential.
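A minimal PyTorch sketch of the MLP-SR-aux head described above (channel and hidden widths are illustrative assumptions, not the exact configuration used in the experiments):

```python
import torch.nn as nn

def mlp_sr_aux(c_in, n_classes, hidden=256):
    """MLP-SR-aux head: 4x spatial downsampling, three 1x1 convolutions,
    average-pooling to 2x2, then a 3-layer MLP, following the description
    in the text (widths here are illustrative)."""
    return nn.Sequential(
        nn.AvgPool2d(4),                       # reduce spatial resolution by 4x (total size 16x)
        nn.Conv2d(c_in, c_in, 1), nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_in, 1), nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_in, 1), nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(2),               # pool to a 2 x 2 spatial map
        nn.Flatten(),
        nn.Linear(4 * c_in, hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, n_classes),
    )
```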
Model                      Top-1   Top-5
VGG-13 (backprop)          66.6    87.5
VGG-19 (DGL K = 4)         69.2    89.0
VGG-19 (DGL K = 2)         70.8    90.2
VGG-19 (backprop)          69.7    89.7
ResNet-152 (DGL K = 2)     74.5    92.0
ResNet-152 (backprop)      74.4    92.1

Table 3. ImageNet results. We observe that DGL can be effective for VGG and ResNet models, obtaining similar or better accuracies, while permitting parallelization and reduced memory.

Figure 5. Evaluation of Async DGL (accuracy vs. module slowdown factor; DGL+Replay (Async) compared against the synchronous reference). A single layer of a network is slowed down on average relative to the others; we observe negligible losses of accuracy at even substantial delays.

Figure 6. Buffer size vs. accuracy for Async DGL. Smaller buffer sizes produce only a small loss in accuracy.

[...] improved batch sampling, among other directions.

6. Conclusion

We have analyzed and introduced a simple and strong baseline for parallelizing per-layer and per-module computations in CNN training. Our approach is shown to match or exceed state-of-the-art approaches addressing these problems, and is shown to scale to much larger datasets than others. Future work can develop improved auxiliary problem objectives and combinations with delayed feedback.
Acknowledgements

EO and EB acknowledge NVIDIA for its GPU donation. We would like to thank John Zarka, Louis Thiry, Georgios Exarchakis, Fabian Pedregosa, Maxim Berman, Amal Rannen, Aaron Courville, and Nicolas Pinto for helpful discussions.

References

Bartunov, S., Santoro, A., Richards, B., Marris, L., Hinton, G. E., and Lillicrap, T. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Advances in Neural Information Processing Systems, pp. 9389–9399, 2018.

Belilovsky, E., Eickenberg, M., and Oyallon, E. Greedy layerwise learning can scale to ImageNet, 2018. URL http://arxiv.org/abs/1812.11446.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pp. 153–160, 2007.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

Carreira-Perpinan, M. and Wang, W. Distributed optimization of deeply nested systems. In Artificial Intelligence and Statistics, pp. 10–19, 2014.

Choromanska, A., Kumaravel, S., Luss, R., Rish, I., Kingsbury, B., Tejwani, R., and Bouneffouf, D. Beyond backprop: Alternating minimization with co-activation memory. arXiv preprint arXiv:1806.09077, 2018.

Czarnecki, W. M., Swirszcz, G., Jaderberg, M., Osindero, S., Vinyals, O., and Kavukcuoglu, K. Understanding synthetic gradients and decoupled neural interfaces. CoRR, abs/1703.00522, 2017. URL http://arxiv.org/abs/1703.00522.

Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017. URL http://arxiv.org/abs/1706.02677.

Haarnoja, T., Hartikainen, K., Abbeel, P., and Levine, S. Latent space policies for hierarchical reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1851–1860. PMLR, 2018. URL http://proceedings.mlr.press/v80/haarnoja18a.html.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Huang, F., Ash, J., Langford, J., and Schapire, R. Learning deep ResNet blocks sequentially using boosting theory. International Conference on Machine Learning (ICML), 2018.

Huo, Z., Gu, B., and Huang, H. Training neural networks using features replay. Advances in Neural Information Processing Systems, 2018a.

Huo, Z., Gu, B., Yang, Q., and Huang, H. Decoupled parallel backpropagation with convergence guarantee. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2098–2106. PMLR, 2018b. URL http://proceedings.mlr.press/v80/huo18a.html.

Ivakhnenko, A. G. and Lapa, V. G. Cybernetic Predicting Devices. CCM Information Corporation, 1965.

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. International Conference on Machine Learning, 2017.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Leblond, R., Pedregosa, F., and Lacoste-Julien, S. ASAGA: Asynchronous parallel SAGA. In 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Leblond, R., Pedregosa, F., and Lacoste-Julien, S. Improved asynchronous parallel optimization analysis for stochastic incremental methods. Journal of Machine Learning Research, 19(81):1–68, 2018. URL http://jmlr.org/papers/v19/17-650.html.

Lee, D., Zhang, S., Biard, A., and Bengio, Y. Target propagation. CoRR, abs/1412.7525, 2014. URL http://arxiv.org/abs/1412.7525.

Lian, X., Zhang, W., Zhang, C., and Liu, J. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017.
A. Proofs

Lemma 4.1. Under Assumptions 3 and 4, one has: $\forall \Theta_j,\ \mathbb{E}_{p^*_{j-1}}\|\nabla\ell(Z_{j-1};\Theta_j)\|^2 \le G$.

Proof. First of all, observe that under Assumption 4 and via Fubini's theorem:
$$\sum_t c_{j-1}^t = \sum_t \int |p_{j-1}^t(z) - p^*_{j-1}(z)|\,dz = \int \sum_t |p_{j-1}^t(z) - p^*_{j-1}(z)|\,dz < \infty \quad (5)$$
thus $\sum_t |p_{j-1}^t - p^*_{j-1}|$ is convergent a.s., and $|p_{j-1}^t - p^*_{j-1}| \to 0$ a.s. as well. From Fatou's lemma, one has:
$$\int p^*_{j-1}(z)\,\|\nabla_{\Theta_j}\ell(z;\Theta_j)\|^2\,dz = \int \liminf_t\, p_{j-1}^t(z)\,\|\nabla_{\Theta_j}\ell(z;\Theta_j)\|^2\,dz \quad (6)$$
$$\le \liminf_t \int p_{j-1}^t(z)\,\|\nabla_{\Theta_j}\ell(z;\Theta_j)\|^2\,dz \le G \quad (7)$$
Proof. By L-smoothness:
$$\mathcal{L}(\Theta_j^{t+1}) \le \mathcal{L}(\Theta_j^t) + \nabla\mathcal{L}(\Theta_j^t)^T(\Theta_j^{t+1} - \Theta_j^t) + \frac{L}{2}\|\Theta_j^{t+1} - \Theta_j^t\|^2 \quad (8)$$

Substituting $\Theta_j^{t+1} - \Theta_j^t$ on the right:
$$\mathcal{L}(\Theta_j^{t+1}) \le \mathcal{L}(\Theta_j^t) - \eta_t\,\nabla\mathcal{L}(\Theta_j^t)^T \nabla_{\Theta_j}\ell(Z_{j-1}^t;\Theta_j^t) + \frac{L\eta_t^2}{2}\|\nabla_{\Theta_j}\ell(Z_{j-1}^t;\Theta_j^t)\|^2 \quad (9)$$

Taking the expectation w.r.t. $Z_{j-1}^t$, which has density $p_{j-1}^t$, we get:
$$\mathbb{E}_{p_{j-1}^t}[\mathcal{L}(\Theta_j^{t+1})] \le \mathcal{L}(\Theta_j^t) - \eta_t\,\nabla\mathcal{L}(\Theta_j^t)^T\,\mathbb{E}_{p_{j-1}^t}[\nabla_{\Theta_j}\ell(Z_{j-1}^t;\Theta_j^t)] + \frac{L\eta_t^2}{2}\,\mathbb{E}_{p_{j-1}^t}\|\nabla_{\Theta_j}\ell(Z_{j-1}^t;\Theta_j^t)\|^2$$
where, by Assumption 3,
$$\frac{L\eta_t^2}{2}\,\mathbb{E}_{p_{j-1}^t}\|\nabla_{\Theta_j}\ell(Z_{j-1}^t;\Theta_j^t)\|^2 \le \frac{L\eta_t^2 G}{2} \quad (10)$$

Next,
$$\|\mathbb{E}_{p_{j-1}^t}[\nabla_{\Theta_j}\ell(Z_{j-1}^t;\Theta_j^t)] - \nabla\mathcal{L}(\Theta_j^t)\| = \Big\|\int \nabla\ell(z,\Theta_j^t)\,p_{j-1}^t(z)\,dz - \int \nabla\ell(z,\Theta_j^t)\,p^*_{j-1}(z)\,dz\Big\| \quad (11)$$
$$\le \int \|\nabla\ell(z,\Theta_j^t)\|\,|p_{j-1}^t(z) - p^*_{j-1}(z)|\,dz \quad (12)$$
$$= \int \|\nabla\ell(z,\Theta_j^t)\|\,\sqrt{|p_{j-1}^t(z) - p^*_{j-1}(z)|}\,\sqrt{|p_{j-1}^t(z) - p^*_{j-1}(z)|}\,dz \quad (13)$$

so that, by the Cauchy-Schwarz inequality,
$$\|\mathbb{E}_{p_{j-1}^t}[\nabla_{\Theta_j}\ell(Z_{j-1}^t;\Theta_j^t)] - \nabla\mathcal{L}(\Theta_j^t)\| \le \sqrt{\int \|\nabla\ell(z,\Theta_j^t)\|^2\,|p_{j-1}^t(z) - p^*_{j-1}(z)|\,dz}\;\sqrt{\int |p_{j-1}^t(z) - p^*_{j-1}(z)|\,dz} \quad (15)$$
$$= \sqrt{\int \|\nabla\ell(z,\Theta_j^t)\|^2\,|p_{j-1}^t(z) - p^*_{j-1}(z)|\,dz}\;\sqrt{c_{j-1}^t} \quad (16)$$
with
$$\int \|\nabla\ell(z,\Theta_j^t)\|^2\,|p_{j-1}^t(z) - p^*_{j-1}(z)|\,dz \le \int \|\nabla\ell(z,\Theta_j^t)\|^2\,\big(p_{j-1}^t(z) + p^*_{j-1}(z)\big)\,dz \quad (17)$$

It follows that
$$\mathbb{E}\big[\|\nabla\mathcal{L}(\Theta_j^t)\|^2 - \nabla\mathcal{L}(\Theta_j^t)^T\,\mathbb{E}_{p_{j-1}^t}[\nabla\ell_{j,t}]\big] = \mathbb{E}\big[\nabla\mathcal{L}(\Theta_j^t)^T\big(\nabla\mathcal{L}(\Theta_j^t) - \mathbb{E}_{p_{j-1}^t}[\nabla\ell_{j,t}]\big)\big] \quad (23)$$
$$\le \mathbb{E}\big[\|\nabla\mathcal{L}(\Theta_j^t)\|\big]\,\sqrt{2Gc_{j-1}^t} \quad (24)$$
$$\le \sqrt{\mathbb{E}\big[\|\nabla\mathcal{L}(\Theta_j^t)\|^2\big]}\,\sqrt{2Gc_{j-1}^t} \quad (25)$$
and
$$\|\nabla\mathcal{L}(\Theta_j^t)\|^2 = \|\mathbb{E}_{p^*_{j-1}}[\nabla\ell(Z,\Theta_j^t)]\|^2 \le \mathbb{E}_{p^*_{j-1}}\big[\|\nabla\ell(Z,\Theta_j^t)\|^2\big] \le G \quad (27)$$
Proposition 4.2. Under Assumptions 1, 2, 3 and 4, each term of the following equation converges:
$$\sum_{t=0}^{T}\eta_t\,\mathbb{E}[\|\nabla\mathcal{L}(\Theta_j^t)\|^2] \le \mathbb{E}[\mathcal{L}(\Theta_j^0)] + G\sum_{t=0}^{T}\eta_t\Big(\sqrt{2c_{j-1}^t} + \frac{L\eta_t}{2}\Big) \quad (28)$$
Proof. Applying Lemma 4.2 for $t = 0, \ldots, T-1$, we obtain (observe the telescoping sum), for our non-negative loss:
$$\sum_{t=0}^{T}\eta_t\,\mathbb{E}[\|\nabla\mathcal{L}(\Theta_j^t)\|^2] \le \mathbb{E}[\mathcal{L}(\Theta_j^0)] - \mathbb{E}[\mathcal{L}(\Theta_j^{T+1})] + \sqrt{2G}\sum_{t=0}^{T}\sqrt{c_{j-1}^t}\,\eta_t + \frac{LG}{2}\sum_{t=0}^{T}\eta_t^2 \quad (29)$$
$$\le \mathbb{E}[\mathcal{L}(\Theta_j^0)] + \sqrt{2G}\sum_{t=0}^{T}\sqrt{c_{j-1}^t}\,\eta_t + \frac{LG}{2}\sum_{t=0}^{T}\eta_t^2 \quad (30)$$
Yet, $\sum_t \sqrt{c_{j-1}^t}\,\eta_t$ is convergent, as $\sum_t c_{j-1}^t$ and $\sum_t \eta_t^2$ are convergent; thus the right-hand side is bounded.
B. Additional pseudo-code

To illustrate the parallel implementations of the algorithms, we show a different pseudo-code implementation with the behavior of each worker specified explicitly. The following Algorithm 3 is equivalent to Algorithm 1 in terms of output, but directly illustrates a parallel implementation. Similarly, Algorithm 4 illustrates a parallel implementation of the algorithm described in Algorithm 2. The probabilities used in Algorithm 4 are not included here as they are derived from communication and computation speed differences.
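As an illustration of this per-worker behavior (a sketch only; the listings of Algorithms 3 and 4 are not reproduced in this excerpt, and the queue-based organization shown here is an assumption):

```python
import torch

def module_worker(module, aux_head, optimizer, in_queue, out_queue):
    """Worker j: repeatedly read an (activation, label) pair produced by
    worker j-1, perform one local DGL update, and pass the detached
    activation to worker j+1. Only local parameters are ever updated."""
    criterion = torch.nn.CrossEntropyLoss()
    while True:
        item = in_queue.get()
        if item is None:              # sentinel: propagate shutdown and stop
            out_queue.put(None)
            break
        x, y = item
        h = module(x)
        loss = criterion(aux_head(h), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        out_queue.put((h.detach(), y))
```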
Comparisons to DNI. The comparison to DNI attempts to directly replicate the setup of Appendix C.1 of (Jaderberg et al., 2017). Although the baseline accuracies for backprop and cDNI are close to those reported in the original work, those of DNI are worse than those reported in (Jaderberg et al., 2017), which could be due to minor differences in the implementation. We utilize a popular publicly available PyTorch DNI implementation, and source code will be provided.
[...] would consist of 3 fully connected layers of size 256 (2 × 2 × 64), followed by a projection to 1000 image categories. The MLP-SR-aux network used would first reduce to 56 × 56 × 64 and then apply 3 layers of 1 × 1 convolutions of size 64. This is followed by a reduction to 2 × 2 and 3 FC layers, as in the MLP-aux network.