1 Introduction
Recently, deep learning Goodfellowetal2016Book has revolutionized artificial intelligence and led to impressive performance in various tasks such as computer vision krizhevsky2012imagenet, speech recognition hinton2012deep and machine translation bahdanau2014neural. Deep learning, thanks to backpropagation almeida1987learning; pineda1987generalization, is able to take advantage of the multilayer structure of the neural network to learn high-level features relevant for the task it is trained to perform. These features become more and more abstract in deeper layers of the network, and give rise to a high-level representation of the data. In the case of visual tasks, when using convolutional neural networks lecun1990handwritten, the high-level features learned by the backpropagation algorithm are similar to those experimentally observed in the visual cortex khaligh2014deep. However, it is still an open question how the brain is able to perform credit assignment in deep neural structures spanning multiple areas. The core algorithm used to train deep neural networks, backpropagation, has been regarded by the neuroscience community as biologically implausible because its deep learning implementation relies on assumptions that cannot be met in the brain bengio2015towards; neftci2017event: the need for symmetric weights and for separate circuits for feedforward and gradient computations; precise timing between the forward and backward paths, with fixed neuronal activity and knowledge of the derivative of the forward activation to correctly update the weights; and high-precision numbers to characterize forward activities and backpropagate errors, compared to the binary values occurring in the brain. Thanks to recent work lillicrap2016random; nokland2016direct and the feedback-alignment mechanism, the assumption that the feedback weights must be the exact transpose of the forward ones is no longer required for efficient credit assignment. courbariaux2016binarized showed that it was possible to train deep neural networks using backpropagation with binarized activation functions and binary weights.
scellier2017equilibrium showed how the same neurons can be used for feedforward and gradient computations thanks to the recurrence induced by feedback connections, and how nudging output units towards a lower-error configuration propagates, via feedback connections, error gradients to the inner layers of the circuit. Recent studies have shown that local contrastive Hebbian plasticity in an energy-based model can implement backpropagation in deep neural structures thanks to the recurrent dynamics bengio2017stdp; scellier2017equilibrium, with promising results when using leaky integrate-and-fire neurons mesnard2016towards. Others guerguiev2017towards were able to train deep networks by approximating backpropagation in systems with multi-compartment neurons. Finally, a recent study sacramento2018dendritic; NIPS2018_8089 used recurrent networks with inhibitory neurons to approximate backpropagation. Those inhibitory units aim to predict and cancel the feedback signal coming from the upper layers. When they cannot correctly predict the incoming feedback, weights are updated proportionally to this prediction error, which is closely related to the gradient that would have been obtained with classical backpropagation. This idea of predicting the incoming feedback activity in order to recover the correct gradient can be related to jaderberg2016decoupled, where side networks are introduced between each layer of the main network and learn, in a supervised manner, to predict the correct gradients based only on the feedforward inputs. A more thorough comparison with previous work on how backpropagation could be implemented in the brain is given in Section 4.
In this paper, we consider a recurrent network composed of pyramidal units (PU) that can be identified with the feedforward units of a multilayer perceptron, with a learning dynamic inspired by scellier2017equilibrium with two different phases. These cells integrate feedforward activity coming from the lower layers but also feedback activity coming from the upper layers. Moreover, in order to enable backpropagation of errors, we introduce a new type of interneuron, referred to as ghost units (GU). Their goal is to predict and cancel feedback from pyramidal units in the upper layer by integrating the same feedforward input, without having access to any feedback from the following layers. This property enables the network to converge quickly during the feedforward computation, in spite of the presence of recurrent connections, because the ghost units cancel the feedback coming from the upper layers. This cancellation effect also allows top-down corrective feedback to be correctly backpropagated when targets are provided in the weakly-clamped phase. This gives the network the capacity to perform credit assignment in a multilayer structure by simply following its dynamics and updating the weights according to local plasticity rules.
2 Backpropagation thanks to ghost units in a recurrent and dynamical neural network
2.1 Architecture
We consider a biologically plausible implementation for backpropagation in a directed acyclic graph of feedforward connections with network input . We consider that the network has layers. We will use for the first layer that represents the inputs and for the last layer which is the output of the network, see Figure 1 for a schematic representation of the architecture.
At each layer , a node of this graph is associated with one or multiple pyramidal units (PU)* whose activity is denoted by (the state of unit in layer ). Pyramidal units have an output nonlinearity which maps their activity to their firing rate . Technically, this transfer function must be Lipschitz continuous. is the set of pyramidal units in layer .
*Note that a single feedforward unit (called a pyramidal unit) could in reality be implemented by multiple pyramidal units with similar input and output connectivity, allowing the network to reduce the spiking noise, if integrate-and-fire neurons were used for example.
For a connectivity matrix , we define which corresponds to the synaptic weight from unit in the input layer to unit in the output layer.
Both feedforward and feedback connections are considered in this model. The main feedforward synaptic weights correspond to the influence of presynaptic unit (in layer ) on the postsynaptic unit (in layer ). The feedback weights encapsulate the effect of the pyramidal unit (of layer ) on pyramidal unit (layer ). The network output is defined by the firing rate of the output pyramidal units. This output is compared to target values, and the performance is measured through a scalar cost function comparing the network output and the target. In the simulations, the mean-squared error was used as the cost function: , but other losses could be implemented in the same way.
We also define the cost function which corresponds to the cost function of the associated (purely feedforward) multilayer perceptron (MLP), where the associated activity is defined by:
(1) 
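Since at the end of the free phase the network behaves like its associated MLP, it may help to make eq. 1 and the cost concrete. Below is a minimal NumPy sketch under assumed generic notation: the function names are illustrative, while the sigmoid transfer function and the mean-squared cost match the choices made later in the simulations.

```python
import numpy as np

def rho(s):
    """Transfer function mapping activities to firing rates.
    A sigmoid is assumed here, as in the paper's simulations;
    any Lipschitz-continuous nonlinearity would do."""
    return 1.0 / (1.0 + np.exp(-s))

def mlp_forward(x, weights):
    """Feedforward pass of the associated MLP (eq. 1): each layer's
    activity is a weighted sum of the previous layer's firing rates."""
    s = x
    activities = [s]
    for W in weights:            # weights[l] maps layer l to layer l+1
        s = W @ rho(s)
        activities.append(s)
    return activities

def mse_cost(output_activity, target):
    """Mean-squared error between output firing rates and targets."""
    return 0.5 * np.sum((rho(output_activity) - target) ** 2)
```

With this convention, `activities[-1]` plays the role of the output-layer activity whose firing rate is compared to the target.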
Training is decomposed into a free phase and a weakly-clamped phase, following scellier2017equilibrium. During the free phase, the network evolves thanks to its recurrent dynamics, with only inputs provided. During the weakly-clamped phase, both inputs and targets are presented to the network, and a top-down error signal pushes the output units towards a value corresponding to a smaller loss . corresponds to the free phase and to the weakly-clamped one.
In addition, we consider a lateral network of ghost units (GU), which could be implemented by inhibitory interneurons. These units are represented by a scalar variable for each unit in layer . A ghost unit in a layer is only connected to the pyramidal units of the same layer, through two matrices (for lateral connections from the pyramidal unit to ghost unit ) and (for lateral connections from the ghost unit to the pyramidal unit ). These units aim to reproduce the feedback activity from the pyramidal units of the next layer during the forward phase, and therefore enable the network to directly compute the gradient during the weakly-clamped phase. These units are considered inhibitory when projecting to the pyramidal neurons (expressed here as a minus sign in , although the synaptic weights can themselves be negative). These ghost units are only present at each hidden layer ( and ).
We will show in the following section that the combination of lateral recurrent and feedback connections propagates the error through the network in a way that closely approximates backpropagation, so long as some assumptions are satisfied, regarding the ability of feedback connections to mimic feedforward connections (approximate symmetry) and of lateral connections to learn to cancel the feedback connections when there is no nudging.
2.2 Notations
network input | time constant
PU: pyramidal unit | GU: ghost unit
activity of a PU in layer | cost function
activity of a PU in layer in the MLP | cost function of the MLP
neuronal transfer function | activity of a GU in layer
set of pyramidal units in layer
feedforward connection from a PU (layer ) to a PU (layer )
feedback connection from a PU (layer ) to a PU (layer )
lateral (recurrent) connection from a PU to a GU (layer )
lateral (recurrent) connection from a GU back to a PU (layer )
2.3 Dynamics of the neurons
Three different inputs are integrated by pyramidal units in layer :
- is the bottom-up input coming from the pyramidal units of layer .
- is the top-down feedback coming from pyramidal units of layer .
- is the lateral feedback coming from the ghost units of the same layer .
The pyramidal units evolve according to:
(2) 
where represents an error term whose expression depends on the layer.
For the hidden layers, is the difference between the top-down feedback (the local target ) and the canceling contribution from the inhibitory ghost units (, counted negatively because of the inhibitory nature of the ghost units). For the output layer, is the nudging term that indicates in which direction should move to reduce the output cost function , with the target output values ( in the free phase and in the weakly-clamped phase). In particular, at the end of the forward phase when , we have: . Because of perfect cancellation , the network behaves like a feedforward multilayer perceptron.
The ghost units of layer follow:
(3) 
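A minimal Euler-discretized sketch of these dynamics (eqs. 2 and 3) is given below. The exact symbols are those of the notation table; here `W_ff`/`W_fb` stand for the feedforward/feedback weights and `Q`/`P` for the lateral weights to and from the ghost units. The precise form of the error term (top-down feedback minus the inhibitory ghost-unit feedback, as described above) is an assumption of this sketch, not a verbatim transcription.

```python
import numpy as np

def rho(s):
    return 1.0 / (1.0 + np.exp(-s))  # sigmoid transfer function (assumed)

def step_dynamics(s, u, x, W_ff, W_fb, P, Q, dt=0.1, tau=1.0):
    """One Euler step of the leaky recurrent dynamics (sketch of eqs. 2-3).
    s : list of pyramidal activities per layer (s[0] is the input layer)
    u : dict of ghost-unit activities per hidden layer
    In the free phase the output error term is zero; nudging toward the
    target would be added at the output layer in the weakly-clamped phase."""
    s[0] = x                              # input layer clamped to the data
    L = len(s) - 1                        # index of the output layer
    for l in range(1, L + 1):
        bottom_up = W_ff[l - 1] @ rho(s[l - 1])
        if l < L:
            # e_l: top-down feedback minus the canceling ghost feedback
            err = W_fb[l] @ rho(s[l + 1]) - P[l] @ rho(u[l])
        else:
            err = 0.0                     # free phase: no output nudging
        s[l] = s[l] + (dt / tau) * (-s[l] + bottom_up + err)
    for l in range(1, L):
        # ghost units integrate the same bottom-up drive through Q (eq. 3)
        u[l] = u[l] + (dt / tau) * (-u[l] + Q[l] @ rho(s[l]))
    return s, u
```

Iterating this step with a fixed input converges to the free-phase equilibrium, at which point the feedback terms cancel and the network reproduces the associated MLP.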
3 Different architectures and learning procedures
3.1 Network with one-to-one correspondence between the pyramidal units and the ghost units (MA)
Model description
In this section, we consider that each pyramidal unit in layer has a corresponding ghost unit† in the previous layer , and that the ghost units aim to replicate the activity of their associated pyramidal units by integrating the same inputs. To ease reading in this part, we use the same bracketed indices for a ghost unit and its associated pyramidal unit. For example, the ghost unit of layer (with activity ) is associated with the pyramidal unit of layer (activity ). This architecture is shown in Figure 1.
†In biology there are more pyramidal neurons than inhibitory neurons. Yet, GU may also include subclasses of pyramidal neurons, so that the number of GU need not be smaller than the number of PU. Moreover, the code formed by pyramidal neurons may show some redundancy, so that it could be compressed to a smaller number of effective PU.
During the free phase, only the lateral connections between the pyramidal and ghost units are updated. The local learning rules for the synaptic weights and are defined as follows:
- acts like a target for the ghost units to learn :
(4) This minimizes , i.e., the inhibitory ghost unit learns to imitate its associated pyramidal unit. is the learning rate.
- The top-down feedback onto layer acts as a target for the weights forming the canceling feedback :
(5) This minimizes for each layer , with the same learning rate .
During the weakly-clamped phase, only the and are updated, through the following learning rules:
- The main weights (feedforward, from pyramidal units of layer to pyramidal units of layer ) are updated using a local learning rule:
(6) This approximates gradient descent on , see Theorem 1.
- The feedback weights are set equal to the transpose of the feedforward ones: .
This was implemented using Euler discretization, see Algorithm 1 for a more precise description of the algorithm.
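Putting the pieces together, one (MA) training step can be sketched as below. Here `relax` stands for a hypothetical helper that runs the dynamics of Section 2.3 to equilibrium and returns pyramidal activities `s` and ghost activities `u`; the delta-rule forms used for eqs. 4-6 are plausible instantiations rather than a verbatim copy of Algorithm 1.

```python
import numpy as np

def rho(s):
    return 1.0 / (1.0 + np.exp(-s))  # sigmoid transfer function (assumed)

def rho_prime(s):
    r = rho(s)
    return r * (1.0 - r)

def train_step_MA(x, target, W, P, Q, relax, lr=0.05, lr_w=0.1, beta=0.5):
    """One (MA) training step (a sketch of Algorithm 1).
    W : list of feedforward matrices; P, Q : dicts of lateral matrices
    over hidden layers; `relax` is a hypothetical equilibrium solver."""
    L = len(W)                               # number of weight matrices
    W_fb = [Wl.T for Wl in W]                # transpose-feedback (TF)
    # --- free phase (beta = 0): only the lateral weights Q, P learn ---
    s, u = relax(x, W, W_fb, P, Q, beta=0.0)
    for l in range(1, L):
        # eq. 4: the associated pyramidal unit is the ghost unit's target
        Q[l] += lr * np.outer(rho(s[l + 1]) - rho(u[l]), rho(s[l]))
        # eq. 5: the top-down feedback is the target for the ghost feedback
        fb = W_fb[l] @ rho(s[l + 1])
        P[l] += lr * np.outer(fb - P[l] @ rho(u[l]), rho(u[l]))
    # --- weakly-clamped phase (beta > 0): only the forward weights learn ---
    s, u = relax(x, W, W_fb, P, Q, beta=beta, target=target)
    for l in range(L):
        if l + 1 < L:   # hidden layer: residual top-down minus ghost feedback
            err = W_fb[l + 1] @ rho(s[l + 2]) - P[l + 1] @ rho(u[l + 1])
        else:           # output layer: nudging term toward the target
            err = beta * (target - rho(s[L]))
        # eq. 6: local, gradient-like update of the feedforward weights
        W[l] += lr_w * np.outer(err * rho_prime(s[l + 1]), rho(s[l]))
    return W, P, Q
```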
We also consider a variant (MA’) in which all the updates are performed during both phases. This version, with continuous weight updates, is closer to what can be expected to happen in the brain.
A good approximation of backpropagation
The combination of these update rules leads to the following theorem, where eq. 6 can be seen as a close estimate of backpropagation.
Theorem 1.
We set the backward weights equal to the (adapting) forward weights, , and assume that the ghost unit circuit (, ) has converged during the free phases, , for all hidden layers (MA). Then, for weak output nudging ( small, ), the errors converge to for each hidden layer . The forward weights () are thus updated according to the classical backpropagated error gradient.
Proof.
Let us introduce , the L2 norm if is a vector and the maximum singular value norm if is a matrix. We suppose that all the matrices remain bounded during the procedure, i.e., that there exists such that , , and . This could be ensured by clipping the weights between extremal values.
We recall the definition of , which corresponds to the cost function of the multilayer perceptron built from the feedforward graph (with no recurrent dynamics) and nodes .
Free phase:
First, we study the dynamics of the weights during the free phase, where both loss functions and are minimized thanks to the updates of eq. 4 and eq. 5. In the stationary limit (where ), we have and so, for all hidden layers :
So if both and are minimized for all in the stationary limit, then tends to . Considering that the space spanned by during learning is large enough (this amounts to having a large set of training data, which was true in the cases we tested), the mean-squared error of the associated linear regression also converges to , and therefore during the free phase. Second, still in the stationary limit,
because is Lipschitz.
So if both and are minimized for all , then . As before, if the space spanned by is large enough, the mean-squared error of the linear regression is also minimized, and thus during the free phase.
Weaklyclamped phase:
For the weakly-clamped phase, we prove the theorem by induction over the layers. We suppose that the learning of the free phase is complete, so that .
First, it is easy to see that for the output layer (index ), in the stationary limit and with small nudging, we have (at zeroth order in ) and so:
(7) 
We then just have to prove that this property holds for the last hidden layer, and the rest of the proof follows by induction.
Considering that layer is still at equilibrium and that the nudging is small, we have , and so in the stationary limit:
In contrast to the zeroth-order approximation just above, we need the first-order approximation of here (otherwise we would get in the following).
As we have and ,
Starting from the definition of , we substitute the definitions of and . Using (from the free phase) and , we have:
Then, assuming is small (because is small):
(8)  
(9) 
By the chain rule on the feedforward graph, we have:
(10) 
Using induction across layers, we have for every layer :
(12) 
∎
Due to the stacked nonlinearities, the approximation may degrade for deeper layers; this can be compensated by choosing a smaller .
Corollary 1.
Under the assumptions of Theorem 1, the weight change proposed in eq. 6
corresponds to approximate stochastic gradient descent, i.e.,
(13) 
Proof.
(14) 
Hence, if , we obtain and the corollary follows. ∎
3.2 Deep neural network with ghost units replicating online the feedback from the pyramidal units (MB)
Model description
We also developed a different class of models, introduced in this section. In this model (MB), we make no hypothesis on the numbers of pyramidal units and ghost units. We also consider that the lateral connections from the pyramidal units of layer to the ghost units of the same layer are fixed at a randomly initialized value, so that . evolves so that, example by example, the feedback coming from the ghost units of layer replicates the feedback coming from the pyramidal units of layer , see Figure 2. Thanks to this property, we have for a given example at the end of the free phase, after the efficient and rapid learning of . This enables the network to learn correctly in the weakly-clamped phase.
This highly modular and fast-changing plasticity could be implemented in real neural circuits by post-tetanic potentiation, a type of plasticity that evolves rapidly and only lasts on a time scale of seconds storozhuk2002post; xue2010post.
As just described, the top-down feedback onto layer acts as a target for the weights forming the canceling lateral feedback . Therefore, the weights are updated during the free phase as follows:
(15) 
which minimizes .
The main weights are updated at the end of the weaklyclamped phase through the same local rule as in (MA):
(16) 
We used different learning rates for each layer in the case of (MB).
For a more detailed description of the algorithm, see Algorithm 2.
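Per example, the free-phase plasticity of eq. 15 amounts to a delta rule that drives the ghost-unit feedback onto the top-down pyramidal feedback. The sketch below assumes the activities have already settled at equilibrium, and the function and variable names are illustrative.

```python
import numpy as np

def rho(s):
    return 1.0 / (1.0 + np.exp(-s))  # sigmoid transfer function (assumed)

def cancel_feedback_MB(fb_target, u, P, lr=0.2, n_iters=300):
    """(MB) free-phase fast plasticity (sketch of eq. 15): with Q fixed and
    random, P adapts for the current example until the ghost feedback
    P @ rho(u) cancels the top-down pyramidal feedback fb_target."""
    r_u = rho(u)
    for _ in range(n_iters):
        err = fb_target - P @ r_u        # residual feedback left to cancel
        P = P + lr * np.outer(err, r_u)  # delta rule minimizing ||err||^2
    return P
```

Because the fit is redone at every input presentation, this matches the fast, short-lived plasticity invoked above.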
A good approximation of backpropagation
These update rules lead to the following theorem, where eq. 16 can be again seen as close to backpropagation.
Theorem 2.
We set the backward weights equal to the (adapting) forward weights, , and assume that the ghost unit circuit () has converged during the free phase (at each presentation of an input), for each hidden layer (MB). Then, for weak output nudging ( small, ), the errors converge to for each hidden layer . The forward weights () are thus updated according to the classical backpropagated error gradient.
Proof.
Free phase:
After settling in the free phase (), we have, because :
In particular, we have for all units and, as :
(17) 
Weaklyclamped phase:
As in the previous proof, we use induction.
Clearly, in the output layer :
(18) 
We prove the property for the last hidden layer , and induction follows. Starting from the definition of , we substitute the definitions of and :
(19)  
(20) 
We consider that layer is still at equilibrium and that the nudging is small, so . By the same arguments as for (MA):
(21) 
3.3 Transpose-feedback (TF) versus feedback-alignment (FA)
The feedback weights are assumed to be equal to the transpose of the feedforward ones and are updated as such during training: , as in classical backpropagation. We refer to this hypothesis as transpose-feedback (TF). In practice, this property could be implemented using an additional reconstruction cost (between consecutive pyramidal layers), which has been shown to encourage symmetry of the weights VincentJMLR2010small. This assumption can also be relaxed thanks to lillicrap2016random and the feedback-alignment effect (FA). In this case, the feedback weights are fixed and randomly initialized; during learning, the feedforward matrix tends to align with the transpose of the feedback matrix. Both hypotheses (TF) and (FA) were tested here.
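The feedback-alignment effect of lillicrap2016random can be illustrated on a toy problem independent of the recurrent model above: a two-layer linear network whose backward pass uses a fixed random matrix `B` in place of `W2.T`. The loss nonetheless decreases, because the forward weights come to align with the feedback pathway. All names and hyperparameters below are illustrative choices for this demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: learn y = M x with a two-layer linear network.
n_in, n_hid, n_out, n = 10, 20, 5, 500
M = rng.standard_normal((n_out, n_in))
X = rng.standard_normal((n, n_in))
Y = X @ M.T

W1 = 0.1 * rng.standard_normal((n_hid, n_in))
W2 = 0.1 * rng.standard_normal((n_out, n_hid))
B = rng.standard_normal((n_out, n_hid))    # fixed random feedback weights

lr, losses = 0.002, []
for epoch in range(1000):
    H = X @ W1.T                  # hidden activity
    E = H @ W2.T - Y              # output error
    losses.append(float(np.mean(E ** 2)))
    W2 -= lr * E.T @ H / n        # local update at the output layer
    W1 -= lr * (E @ B).T @ X / n  # FA: fixed random B replaces W2.T
```

Under (FA), the same replacement applies in our models: the feedback amplitude stays fixed during training instead of tracking the growing forward weights.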
4 Related Work
Backpropagation in the brain has been a very active topic of research for the last few years and various models have been proposed.
Contrastive Hebbian learning Ackley85; Hinton+McClelland1988 introduced the idea of learning in two different phases: a free phase where the inputs are presented to the network, followed by a weakly-clamped phase with a target signal that nudges the output layer towards the right solution. scellier2017equilibrium drew the parallel between contrastive Hebbian learning and backpropagation by defining a framework for energy-based models, Equilibrium Propagation. The idea of using two different phases during training is kept in this work; however, contrary to these previous studies, we were also able to train the network while allowing synaptic updates during both phases.
Segregated dendrites and multi-compartment neurons were recently used guerguiev2017towards to implement backpropagation in a biologically plausible manner. This study gave a very interesting explanation of how neurons can store feedforward activity and how feedback connections can carry the backpropagated error without interfering with the feedforward activity. Training can then be performed without dealing with the recurrent activity caused by feedback connections, which makes the theory simpler and closer to deep learning methodology. This study achieved good results even though it used a spiking neural network with update rules computed from averages of the neuronal potential.
A recent study sacramento2018dendritic; NIPS2018_8089 introduced the idea of canceling the feedback from the next layer with inhibitory lateral feedback, so that only the backpropagated error remains in the feedback signal. They used recurrent networks of two-compartment neurons. Links can also be drawn with Lee:2015:DTP:3120485.3120521; jaderberg2016decoupled; DBLP:journals/corr/abs180301834, where local credit assignment is also performed.
As in [17, 18, 19], we effectively consider the dynamics of a single quantity per neuron, the somatic activity. However, we do not describe the input currents as coming from different compartments, and instead consider a more abstract single-compartment neuron (that, to implement plasticity, is able to represent two quantities: its target rate and its actual rate). This has the advantage of simplifying the terminology of the model, as we do not need to introduce dendritic quantities entering the representation of the errors. Moreover, this does not induce any scaling of the approximated error by dendritic attenuation factors as in sacramento2018dendritic; NIPS2018_8089 (the same scaling can be recovered by multiplying the learning rate by the inverse of the dendritic attenuation factor). As such, it is possible to approximate the backpropagated gradient without an exponential decay of its magnitude as it is propagated through several layers. We used a reduced system with only the required ingredients in order to obtain a working, biologically abstracted analogue of backpropagation. Model A implements in a simple and condensed way the principles from sacramento2018dendritic; NIPS2018_8089, where the ghost unit network copies the pyramidal one. Diverging from the ideas of Model A, we also postulate a short-term plasticity according to which the local circuit adapts to a single pattern, as in post-tetanic potentiation storozhuk2002post; xue2010post or in the FORCE algorithm sussilloabbott. We accordingly develop Model B, where the ghost units dynamically adapt their feedback to replicate, in an online manner, the feedback coming from the pyramidal units. This single-compartment model is sufficient to obtain the required credit assignment mechanism, and it also simplifies the mathematics to the bare necessities required to obtain the desired results.
5 Results
5.1 Credit assignment with replicating units
We consider a (784-500-10) network with one hidden layer and the MSE (mean-squared error) loss. No preconditioning of the inputs is used. The batch size is 100 for (MA) and 1 for (MB). The activation is a sigmoid. We train on the 55,000-example MNIST training set and test on the 10,000 examples of the test set. We initialized the weights randomly with a uniform distribution over (Tables 2 and 4).
(MA) dynamics
In (MA), learning is composed of two phases. During the free phase, only inputs are provided to the network. The weights push the ghost units to mimic their corresponding pyramidal units (eq. 4), while learning to minimize the mismatch between the feedback coming from the ghost units and that coming from the pyramidal units of the next layer (eq. 5). This pushes the matrix to reproduce and to copy . This can be seen at the bottom of Figure 3, which shows the Frobenius norms between these matrices during training.
This leads to the correct computation of the feedforward path, because the feedback terms cancel each other, despite this happening in a dynamical way.
During the weakly-clamped phase, the output units are nudged towards the correct values. This shift is backpropagated through the dynamics of the network, giving rise to an error term at each hidden layer thanks to the mismatch between the feedback coming from the pyramidal units and the corresponding ghost units. evolves in order to minimize this mismatch (eq. 6).
(MB) dynamics
We also studied learning in a neural network following the (MB) hypothesis. In the one-hidden-layer network, 5 ghost units were used, which aim to replicate the feedback signal from the 10 output pyramidal units. To test the feedforward network on the MNIST train and test sets, we ran the forward graph without the dynamical part to speed up simulations.
Different inputs are presented to the network sequentially (no batching in this setting). At each input presentation, the network goes through two different phases (see Figure 4). First, in the free phase (in blue), there is no nudging of the output layer. In particular, the weights adapt so that the feedback signal from the ghost units cancels the feedback from the pyramidal cells (), as can be seen in Figure 4 (bottom). This cancellation of the local error leads to a correct computation of the feedforward graph of the neural network; the output probabilities of the classification task can then be read from the output pyramidal cells, as seen in Figure 4 (top). Then, during the weakly-clamped phase (in green), the output neurons are nudged toward the right solution. This error is backpropagated through the network by its own dynamics. When equilibrium is reached, the feedforward weights are updated. Another input is then presented to the network, and this process enables learning of the classification task.
Learning can be studied through the responses of the output neurons at different epochs. At epoch 0 (Figure 4, left), the output neurons are mostly wrong. The backpropagated gradients have a large amplitude (as can be seen from the jumps in activity after the beginning of the weakly-clamped phase). In particular, we clearly see that the neuron representing the right class (in red) is nudged towards 1, whereas the others are nudged down to 0. After one epoch (Figure 4, middle), the output states are moving towards the right solution. However, it becomes harder to cancel the error in the free phase, because the weights grow bigger, making the ghost units work harder after a switch between two different inputs. Finally, after two epochs (Figure 4, right), the output neurons already start to saturate to 0 or 1. In conclusion, using the (MB) dynamics, the neural network is able to quickly learn a classification task.
On a longer scale of learning (50 epochs), the accuracy and mean-squared error are plotted in Figure 5 for both train and test sets. Learning proceeds well, with the training accuracy reaching 0.9976 (top). This is correlated with the mean-squared error, which decreases and tends to 0 (middle). The gradients computed through the ghost units are almost the same (up to 7% relative error with respect to the classical backpropagated gradient, bottom) as those computed with usual backpropagation and the chain rule. Generalization is also quite good, with the accuracy on the test set reaching 0.981 after only 50 epochs of training, which is quite quick and efficient considering that none of the usual tricks (Adam, RMSProp, ...) were used in this setting.
In conclusion, this biologically inspired neural network following the (MB) hypothesis is able to quickly and robustly learn the MNIST classification benchmark, with locally computed gradients that closely approximate backpropagation.
5.2 Classification on MNIST
Classification on MNIST using networks with one hidden layer. Test error on MNIST as a function of the number of epochs for one-hidden-layer networks with 100 (red), 300 (black) and 500 (blue) neurons. Both TF (solid lines) and FA (dotted lines) are shown. On the right, the mean values of test and train (in parentheses) accuracies are detailed. For each network, the standard deviation over 5 experiments is represented by the width of the area below the ticked mean values.
We have tested several types of networks on the MNIST dataset LeCun+98; see Table 1 for the results. We looked at both the (MA) and (MB) models while using either transpose-feedback (TF) or feedback-alignment (FA) (see Figure 5(a) for (MA) and Figure 5(b) for (MB)). We tested different numbers of units per hidden layer and different numbers of hidden layers. Simulations were run on the GPU cluster Cedar, Compute Canada (www.computecanada.ca).
We ran 5 experiments for each model and report the results in Table 1. The results approach state-of-the-art accuracies for multilayer perceptrons, with (MA) performing slightly better than (MB).
For both models, raising the number of neurons improved performance.
(MA) performance on the MNIST task
1-layer and 2-layer (MA) networks compete with the state of the art for a multilayer perceptron trained with backpropagation. Training 1-layer (MA) networks is stable and works well with both transpose-feedback and feedback-alignment. Increasing the number of units per layer improves performance, as shown in Figure 5(a). Note that we were able to use a relatively large during the weakly-clamped phase, which speeds up and stabilizes learning.
Training 2-layer (MA) networks is stable and also works well across a large range of hyperparameters when using feedback-alignment. However, with transpose-feedback, training is somewhat less stable and requires a more careful hyperparameter search to achieve good performance. Accuracies and hyperparameters are shown in Tables 1 and 2.
We also ran 1-layer experiments with all synaptic updates occurring at all times during both the free and the weakly-clamped phases (MA’) (Table 3). These networks were harder to train but still perform quite well considering all the assumptions that are made (around 3.5% test error). These results are comparable to those of guerguiev2017towards; sacramento2018dendritic. To get stable behavior, the updates for were clipped; otherwise they sometimes diverged at the beginning of learning. This hypothesis is biologically plausible, through some saturation mechanism.
(MB) performance on the MNIST task
For (MB) networks, 1-layer networks were associated with one layer of ghost units (to replicate the feedback from the pyramidal output units). For 2-layer networks, we used two layers of ghost units of respectively and neurons. The other parameters are gathered in Table 4.
1-layer (MB) networks perform very well compared to state-of-the-art results, both with transpose-feedback and with feedback-alignment. Using networks with more neurons (100, 300, 500) also helps reach higher performance.
For 2-layer (MB) networks with transpose-feedback weights, the networks were harder to train and sometimes unstable. This is easily explained by the fact that the error made when backpropagating the gradient through the biological backpropagation is proportional to the amplitude of the feedback weights. In 2-layer networks with transpose feedback, these grow at the same speed as . If they grow too large, they can induce unwanted errors and make the network unstable. These problems were entirely solved by the feedback-alignment version of (MB), where the amplitude of the feedback weights is fixed. In particular, the highest accuracies for (MB) are reached with the 2-layer version with feedback-alignment. Adding layers helps the network perform better, and the biological backpropagation remains efficient through several layers.
Table 1: Train and test accuracies (%) on MNIST, with transposed feedback (TF) and feedback alignment (FA).

Model                                    #units    TF Train  TF Test  FA Train  FA Test
(MA)                                     100       99.97     97.66    99.90     97.47
(MA)                                     300       100       98.21    99.99     97.97
(MA)                                     500       100       98.27    100       98.12
(MA)                                     500/500   99.67     97.86    99.93     98.05
(MA')                                    500       97.77     96.57    -         -
(MB)                                     100       99.31     97.22    98.93     97.39
(MB)                                     300       99.70     98.05    99.48     97.98
(MB)                                     500       99.76     98.13    99.56     98.01
(MB)                                     300/300   99.84     97.95    99.78     98.05
(MB)                                     500/500   99.91     98.13    99.85     98.21
Lillicrapetalnature2016                  1000      -         97.6     -         97.9
nokland2016direct                        800/800   -         98.33    -         98.18
guerguiev2017towards                     500       -         96.4     -         95.9
guerguiev2017towards                     500/100   -         -        -         96.8
sacramento2018dendritic ; NIPS2018_8089  500/500   -         98.04    -         -
Comparison to other works
We saw that both models (MA) and (MB) are competitive with state-of-the-art multilayer perceptrons on MNIST, as in Lillicrapetalnature2016 ; nokland2016direct where classical backpropagation is used. As Table 1 shows, we obtain higher accuracies than previous biologically plausible models guerguiev2017towards ; sacramento2018dendritic ; NIPS2018_8089 in our setting with different update rules during the two phases. Even when updating all weights during both phases (MA'), we obtain results similar to a model with segregated dendritic compartments guerguiev2017towards , where updates are also done in two different phases, but with a spiking neural network. We cannot, however, reach an accuracy as high as in sacramento2018dendritic ; NIPS2018_8089 when considering updates in both phases as they do; this may be due to their use of different compartments. In conclusion, both models presented in this work reach accuracies comparable to backpropagation in a simple, biologically plausible setting.
6 Conclusion
Deep learning has been the focus of intense study in the past decade and has become increasingly efficient at solving diverse and complex tasks, ranging from pattern recognition to image generation, natural language processing (NLP) and many others. Many ideas that succeeded in deep learning originally come from neuroscience, and making links between the two worlds has been the focus of many recent studies
Marblestoneetal2016 . In particular, different mechanisms have been developed to update the parameters of an artificial neural network. Backpropagation almeida1987learning has been the canonical and most widely used way of training a network, and many frameworks have been developed to enable efficient gradient computation via the chain rule 2016arXiv160502688short ; tensorflow2015whitepaper . However, backpropagation as commonly used has many properties that are not biologically plausible bengio2015towards ; neftci2017event . In this work we developed a class of neural network models that learn as backpropagation would, but with local learning rules. To the feedforward network of pyramidal units (a classical multilayer perceptron) we add a second network of inhibitory interneurons, which we call ghost units. Connections between all layers of both networks make the whole system a complex recurrent network. The dynamics of the models are separated into two phases, as in scellier2017equilibrium . During the free phase, the ghost-unit network learns to replicate the feedback from the pyramidal units exactly, and therefore cancels any feedback coming from the upper layers; at equilibrium, the network propagates the correct feedforward activity and can perform a classification task. During the weakly-clamped phase, the output pyramidal units are nudged in order to reduce the output cost function, and the error is backpropagated through the layers by the recurrent dynamics themselves. We considered two different models: in the first (MA), each pyramidal unit has an associated ghost unit in the previous layer, and the ghost-unit network learns throughout training to replicate the pyramidal one; in the second (MB), we consider fewer ghost units that dynamically learn to replicate the feedback from the pyramidal units at each pattern presentation.
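The two-phase principle can be illustrated on a deliberately minimal toy: a single linear unit with no ghost units, whose output is nudged toward the target with strength `beta` during the weakly-clamped phase (all names and values here are illustrative, not the full model):

```python
def two_phase_step(w, x, target, beta=0.1, lr=0.5):
    """One free-phase / weakly-clamped-phase update on a linear unit.

    Free phase: the unit settles to its feedforward drive w * x.
    Weakly-clamped phase: the output is nudged toward the target with
    strength beta. The contrast between the two equilibria yields a
    local update that approximates the gradient of the squared error.
    """
    y_free = w * x                                    # free-phase fixed point
    y_clamped = (w * x + beta * target) / (1 + beta)  # weakly-clamped fixed point
    dw = lr * x * (y_clamped - y_free) / beta         # local contrastive update
    return w + dw

w = 0.0
for _ in range(100):
    w = two_phase_step(w, x=1.0, target=2.0)
# w converges toward 2.0, so the free-phase output matches the target
```

For small beta, (y_clamped - y_free) / beta approaches (target - w * x) / (1 + beta), so this local contrastive update follows the gradient of the squared output error, in the spirit of scellier2017equilibrium .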
We prove for both models that, under some hypotheses, the locally defined learning rules approximate classical backpropagation, with condensed notations and proofs. Moreover, we tested both models on the MNIST classification task with different architectures (number of neurons, number of layers) and showed that these networks accomplish the task about as well as backpropagation. We also made links with feedback alignment Lillicrapetalnature2016 . Finally, we were able to loosen some of the assumptions made in our models, such as updating all synaptic weights at all times, while still obtaining a trained network that performs correctly on the pattern-recognition task. This single-compartment model only expresses the bare properties needed for a credit-assignment mechanism and, with simple notations and proofs, straightforwardly highlights a possible implementation of credit assignment in the brain.
The class of models presented in this work was built upon some properties of the mammalian cortex. In particular, the different layers of the pyramidal network represent the integration of the sensory stimulus (from other brain areas, for example the thalamus) through several layers of cortical neurons (through higher-order regions of the brain). As in sacramento2018dendritic ; NIPS2018_8089 , we also consider a population of inhibitory interneurons, which would be consistent with the neurophysiological properties and role of SST interneurons: their role would be to cancel top-down feedback from the pyramidal neurons. Recent experiments LEINWEBER20171204 showed that pyramidal neurons project back through top-down projections to the interneurons of the previous layer, which would be consistent with these interneurons replicating the activity of upper layers. Other work such as ENIKOLOPOV2018135 showed that synaptic plasticity can generate a negative image of the input in the electric fish, and illustrated the importance of this kind of signal for improvements in neural coding and detection of perturbations. The models presented in this work backpropagate the output-layer error through the different layers without needing gradient computations. The network learns through local learning rules, which have been shown to have biological relevance and links to STDP (spike-timing-dependent plasticity) bengio2017stdp ; Feldman2012 . We use single-compartment leaky-integrator neurons, which can be seen as a very simple approximation of biological neurons.
However, some properties of the presented network can hardly be seen as biologically plausible. Firstly, in (MA), we assumed a one-to-one correspondence between the pyramidal cells and the ghost units. This is hardly true in biological neuronal networks; however, each pyramidal cell in the model could represent the activity of several pyramidal neurons, which relaxes this hypothesis. We also developed (MB), where the number of ghost units can be set to an arbitrary (smaller) number, dropping the one-to-one correspondence. For this, we suppose that the network of ghost units is able to dynamically replicate the feedback from the pyramidal cells through some rapid plasticity mechanism (as seen with post-tetanic potentiation storozhuk2002post ; xue2010post ), which may not be biologically implemented directly without neurotransmitter modulation. However, as the two models rely on opposite hypotheses, it would be possible to compose a large class of models sharing properties of both extreme cases. We could then have, at the same time, an adaptive process that learns on a long time scale how to cancel feedback but that can also adapt to variations in the input, making it closer to biological observations. One possible implementation would be to make the ghost units of (MA) replicate a linear combination of the pyramidal units' activity, and thereby use fewer ghost units. Training 2-layer networks with transposed feedback weights was in some cases unstable, because the transmission of the backpropagated error is proportional to the amplitude of the feedback weights, and therefore of the forward ones. Feedback alignment solves this issue, as the scale of the feedback weights is fixed. We can imagine, however, that this problem could also be overcome in the brain by normalization mechanisms such as weight decay or weight regularization, which could be implemented by biologically plausible mechanisms.
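As a sketch of such a normalization mechanism, weight decay shrinks every weight multiplicatively at each update, keeping the feedback amplitude bounded (`decayed_update` is a hypothetical helper; the decay value is illustrative):

```python
import numpy as np

def decayed_update(W, dW, lr=0.01, decay=1e-3):
    """Weight update with multiplicative decay.

    The decay term pulls the weights toward zero at every step, which
    bounds the amplitude of the (transposed) feedback weights and thus
    the scale of the backpropagated error signal.
    """
    return (1 - decay) * W + lr * dW

# With no update signal, the weights shrink geometrically:
W = np.ones((3, 3))
for _ in range(1000):
    W = decayed_update(W, np.zeros_like(W))
# each entry is now (1 - 1e-3) ** 1000, roughly 0.37
```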
We could also implement integrate-and-fire neurons and make the link with spiking deep networks NIPS2011_4383 in order to move further towards biological networks. We focused on pattern recognition in this work, but ghost units could also help build biologically plausible networks for other deep-learning tasks, such as Generative Adversarial Networks GoodfellowetalNIPS2014small or deep reinforcement learning, by adding other features such as the influence of neurotransmitters on local learning rules.
This work is a step towards understanding the mechanics of learning and memory in the brain; at the same time, it raises interesting perspectives for the implementation of deep networks in neuromorphic hardware.
References
 [1] Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
 [2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [3] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 [4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 [5] Luis B Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings, 1st First International Conference on Neural Networks, volume 2, pages 609–618. IEEE, 1987.

 [6] Fernando J Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19):2229, 1987.
 [7] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.
 [8] Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS computational biology, 10(11):e1003915, 2014.
 [9] Yoshua Bengio, Dong-Hyun Lee, Jorg Bornschein, Thomas Mesnard, and Zhouhan Lin. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156, 2015.
 [10] Emre O Neftci, Charles Augustine, Somnath Paul, and Georgios Detorakis. Event-driven random backpropagation: Enabling neuromorphic deep learning machines. Frontiers in neuroscience, 11:324, 2017.
 [11] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature communications, 7:13276, 2016.
 [12] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, pages 1037–1045, 2016.
 [13] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830, 2016.
 [14] Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in computational neuroscience, 11:24, 2017.
 [15] Yoshua Bengio, Thomas Mesnard, Asja Fischer, Saizheng Zhang, and Yuhuai Wu. STDP-compatible approximation of backpropagation in an energy-based model. Neural computation, 29(3):555–577, 2017.
 [16] Thomas Mesnard, Wulfram Gerstner, and Johanni Brea. Towards deep learning with spiking neurons in energy-based models with contrastive Hebbian plasticity. arXiv preprint arXiv:1612.03214, 2016.
 [17] Jordan Guerguiev, Timothy P Lillicrap, and Blake A Richards. Towards deep learning with segregated dendrites. eLife, 6, 2017.
 [18] João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic error backpropagation in deep cortical microcircuits. arXiv preprint arXiv:1801.00062, 2018.
 [19] João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic cortical microcircuits approximate the backpropagation algorithm. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8735–8746. Curran Associates, Inc., 2018.
 [20] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.
 [21] Maksim V Storozhuk, Svetlana Y Ivanova, Tatyana A Pivneva, Igor V Melnick, Galina G Skibo, Pavel V Belan, and Platon G Kostyuk. Post-tetanic depression of GABAergic synaptic transmission in rat hippocampal cell cultures. Neuroscience letters, 323(1):5–8, 2002.
 [22] Lei Xue and Ling-Gang Wu. Post-tetanic potentiation is caused by two signalling mechanisms affecting quantal size and quantal content. The Journal of Physiology, 588(24):4987–4994, 2010.

 [23] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11, 2010.
 [24] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147–169, 1985.
 [25] Geoffrey E. Hinton and James L. McClelland. Learning representations by recirculation. In D. Z. Anderson, editor, Neural Information Processing Systems, pages 358–366. American Institute of Physics, 1988.
 [26] Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target propagation. In Proceedings of the 2015 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ECML-PKDD'15, pages 498–515, Switzerland, 2015. Springer.
 [27] Alexander G. Ororbia II, Ankur Mali, Daniel Kifer, and C. Lee Giles. Conducting credit assignment by aligning local representations. CoRR, abs/1803.01834, 2018.
 [28] David Sussillo and L F Abbott. Generating coherent patterns of activity from chaotic neural networks. Neuron, 63:544–57, 09 2009.
 [29] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
 [30] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature communications, 7, 2016.
 [31] Adam H Marblestone, Greg Wayne, and Konrad P Kording. Toward an integration of deep learning and neuroscience. Frontiers in Computational Neuroscience, 10, 2016.
 [32] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv eprints, abs/1605.02688, May 2016.
 [33] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [34] Marcus Leinweber, Daniel R. Ward, Jan M. Sobczak, Alexander Attinger, and Georg B. Keller. A sensorimotor circuit in mouse cortex for visual flow predictions. Neuron, 96(5):1204, 2017.
 [35] Armen G. Enikolopov, L.F. Abbott, and Nathaniel B. Sawtell. Internally generated predictions enhance neural and behavioral detection of sensory stimuli in an electric fish. Neuron, 99(1):135 – 146.e3, 2018.
 [36] Daniel E. Feldman. The spike timing dependence of plasticity. Neuron, 75(4):556–571, 2012.
 [37] Johanni Brea, Walter Senn, and Jean-Pascal Pfister. Sequence learning with hidden units in spiking neural networks. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 1422–1430. Curran Associates, Inc., 2011.
 [38] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In NIPS'2014, 2014.