Residual neural network


Canonical form of a residual neural network. A layer ℓ − 1 is skipped over activation from ℓ − 2.

A residual neural network (ResNet) is a kind of artificial neural network (ANN) that builds on constructs known from pyramidal cells in the cerebral cortex[citation needed]. Residual neural networks do this by utilizing skip connections, or short-cuts, to jump over some layers. In its simplest form, a ResNet skips over only a single layer.[1] With an additional weight matrix to learn the skip weights, the model is referred to as a HighwayNet.[2] With several parallel skips, it is referred to as a DenseNet.[3] In the context of residual neural networks, a non-residual neural network is described as a plain network.

A reconstruction of a pyramidal cell. Soma and dendrites are labeled in red, axon arbor in blue. (1) Soma, (2) Basal dendrite, (3) Apical dendrite, (4) Axon, (5) Collateral axon.

One motivation for skipping over layers is to avoid the problem of vanishing gradients by reusing activations from a previous layer until the adjacent layer learns its weights. During training, the weights adapt to mute the previous layer and amplify the adjacent one. In the simplest case, only the weights for the connection to the adjacent layer are adapted, with no explicit weights for the upstream previous layer. This usually works well when a single non-linear layer is stepped over, or when the intermediate layers are all linear. If not, then an explicit weight matrix should be learned for the skipped connection.

Skipping effectively compresses the network into fewer layers in the initial training stages, which speeds learning. The network then gradually restores the skipped layers as it learns the feature space. During later learning, when all layers are expanded, the network stays closer to the manifold and thus learns faster. A neural network without residual parts explores more of the feature space, which makes it more vulnerable to perturbations that cause it to leave the manifold and necessitates extra training data to recover.

Biological analog

The brain has structures similar to residual nets, as cortical layer VI neurons get input from layer I, skipping intermediary layers.[citation needed] In the figure this compares to signals from the apical dendrite (3) skipping over layers, while the basal dendrite (2) collects signals from the previous and/or same layer.[note 1][4] Similar structures exist for other layers.[5] How many layers in the cerebral cortex compare to layers in an artificial neural network is not clear, nor whether every area in the cerebral cortex exhibits the same structure, but over large areas they appear similar.

Forward propagation

For single skips, the layers may be indexed either as ℓ − 2 to ℓ or as ℓ to ℓ + 2. (Script ℓ is used for clarity, usually it is written as a simple l.) The two indexing systems are convenient when describing skips as going backward or forward. As the signal flows forward through the network it is easier to describe the skip as from a given layer, but as a learning rule (backpropagation) it is easier to describe which activation layer you reuse as ℓ − k, where k is the skip number.

Given a weight matrix W^{ℓ−1,ℓ} for connection weights from layer ℓ − 1 to ℓ, and a weight matrix W^{ℓ−2,ℓ} for connection weights from layer ℓ − 2 to ℓ, the forward propagation through the activation function would be (aka HighwayNets)

    a^ℓ := g(W^{ℓ−1,ℓ} · a^{ℓ−1} + W^{ℓ−2,ℓ} · a^{ℓ−2})

where

    a^ℓ the activations (outputs) of neurons in layer ℓ,
    g the activation function for layer ℓ,
    W^{ℓ−1,ℓ} the weight matrix for neurons between layer ℓ − 1 and ℓ, and
    W^{ℓ−2,ℓ} the weight matrix for the skipped connection between layer ℓ − 2 and ℓ.
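As an illustration, the following is a minimal sketch of this forward step in Python/NumPy; the ReLU activation, the fixed layer width n, and all variable names are choices made only for this example.

    import numpy as np

    def g(x):
        # Activation function for layer ℓ; ReLU is an arbitrary choice for this sketch.
        return np.maximum(x, 0.0)

    def forward_highway(a_prev, a_skip, W_prev, W_skip):
        # a^ℓ := g(W^{ℓ−1,ℓ} · a^{ℓ−1} + W^{ℓ−2,ℓ} · a^{ℓ−2})
        return g(W_prev @ a_prev + W_skip @ a_skip)

    rng = np.random.default_rng(0)
    n = 4                                # layer width (example value)
    a_lm1 = rng.normal(size=n)           # a^{ℓ−1}
    a_lm2 = rng.normal(size=n)           # a^{ℓ−2}
    W_prev = rng.normal(size=(n, n))     # W^{ℓ−1,ℓ}, normal path
    W_skip = rng.normal(size=(n, n))     # W^{ℓ−2,ℓ}, learned skip weights
    a_l = forward_highway(a_lm1, a_lm2, W_prev, W_skip)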

Absent an explicit matrix W^{ℓ−2,ℓ} (aka ResNets), forward propagation through the activation function simplifies to

    a^ℓ := g(W^{ℓ−1,ℓ} · a^{ℓ−1} + a^{ℓ−2}).

Another way to formulate this is to substitute an identity matrix for W^{ℓ−2,ℓ}, but that is only valid when the dimensions match. This is somewhat confusingly called an identity block, which means that the activations from layer ℓ − 2 are passed to layer ℓ without weighting.
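Reusing g and the variables from the sketch above, the ResNet simplification drops the explicit skip matrix and adds the earlier activation directly, which only works when a^{ℓ−2} has the same length as the output of layer ℓ (the identity-block case).

    def forward_resnet(a_prev, a_skip, W_prev):
        # a^ℓ := g(W^{ℓ−1,ℓ} · a^{ℓ−1} + a^{ℓ−2}); a_skip must match the output dimension.
        return g(W_prev @ a_prev + a_skip)

    a_l = forward_resnet(a_lm1, a_lm2, W_prev)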

In the cerebral cortex such forward skips are done for several layers. Usually all forward skips start from the same layer, and successively connect to later layers. In the general case this will be expressed as (aka DenseNets)

    a^ℓ := g(W^{ℓ−1,ℓ} · a^{ℓ−1} + Σ_{k=2}^{K} W^{ℓ−k,ℓ} · a^{ℓ−k}).
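A corresponding sketch of this general case, continuing the example above: the reused activations a^{ℓ−k} and their weight matrices W^{ℓ−k,ℓ} are passed as parallel lists (k = 1, …, K, with k = 1 being the normal path) and summed before the activation function. The function name and argument layout are illustrative assumptions.

    def forward_dense(a_list, W_list):
        # a^ℓ := g(Σ_k W^{ℓ−k,ℓ} · a^{ℓ−k}), where a_list[k-1] is a^{ℓ−k}
        # and W_list[k-1] is W^{ℓ−k,ℓ}; the k = 1 term is the normal path.
        return g(sum(W @ a for W, a in zip(W_list, a_list)))

    a_l = forward_dense([a_lm1, a_lm2], [W_prev, W_skip])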

Backward propagation

During backpropagation learning for the normal path

    w_{i,j}^{ℓ−1,ℓ} := w_{i,j}^{ℓ−1,ℓ} + η · δ_j^ℓ · a_i^{ℓ−1}

and for the skip paths (nearly identical)

    w_{i,j}^{ℓ−2,ℓ} := w_{i,j}^{ℓ−2,ℓ} + η · δ_j^ℓ · a_i^{ℓ−2}.

In both cases

    η a learning rate (η < 0),
    δ^ℓ the error signal of neurons at layer ℓ, and
    a_i^ℓ the activation of neurons at layer ℓ.

If the skip path has fixed weights (e.g. the identity matrix, as above), then they are not updated. If they can be updated, the rule is an ordinary backpropagation update rule.
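Continuing the NumPy sketches from the forward-propagation section, the two updates can be written as below. The error signal δ^ℓ is assumed to have been computed already by ordinary backpropagation, and the article's sign convention is kept (the term η · δ^ℓ · a is added, with a negative η); the default value of eta and the skip_fixed flag are example choices.

    def update_weights(W_prev, W_skip, a_lm1, a_lm2, delta_l, eta=-0.01, skip_fixed=False):
        # Normal path:  w_{i,j}^{ℓ−1,ℓ} := w_{i,j}^{ℓ−1,ℓ} + η · δ_j^ℓ · a_i^{ℓ−1}
        W_prev = W_prev + eta * np.outer(delta_l, a_lm1)
        # Skip path, same rule, unless the skip weights are fixed
        # (e.g. an identity skip, which is simply not updated):
        if not skip_fixed:
            W_skip = W_skip + eta * np.outer(delta_l, a_lm2)
        return W_prev, W_skip

    delta_l = rng.normal(size=n)         # δ^ℓ, assumed given by backpropagation
    W_prev, W_skip = update_weights(W_prev, W_skip, a_lm1, a_lm2, delta_l)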

In the general case there can be K − 1 skip path weight matrices, thus

    w_{i,j}^{ℓ−k,ℓ} := w_{i,j}^{ℓ−k,ℓ} + η · δ_j^ℓ · a_i^{ℓ−k}   for k = 2, …, K.

As the learning rules are similar, the weight matrices can be merged and learned in the same step.
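One way to realize that merge, as a sketch continuing the example above: horizontally stacking the weight matrices and concatenating the corresponding activations turns the sum of products into a single matrix-vector product, so all paths can be propagated and updated in one step. The helper name is an illustrative assumption.

    def forward_merged(a_list, W_list):
        # [W^{ℓ−1,ℓ} | W^{ℓ−2,ℓ} | ...] · [a^{ℓ−1}; a^{ℓ−2}; ...] equals
        # Σ_k W^{ℓ−k,ℓ} · a^{ℓ−k}, so a single merged matrix covers every path.
        W_merged = np.hstack(W_list)        # shape (n, K·n) when all widths equal n
        a_merged = np.concatenate(a_list)   # stacked activations, length K·n
        return g(W_merged @ a_merged)

    a_l = forward_merged([a_lm1, a_lm2], [W_prev, W_skip])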

Notes

  1. ^ Some research indicates that there are additional structures here, so this explanation is somewhat simplified.

References

  1. ^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-12-10). "Deep Residual Learning for Image Recognition". arXiv:1512.03385 [cs.CV].
  2. ^ Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (2015-05-02). "Highway Networks". arXiv:1505.00387 [cs.LG].
  3. ^ Huang, Gao; Liu, Zhuang; Weinberger, Kilian Q.; van der Maaten, Laurens (2016-08-24). "Densely Connected Convolutional Networks". arXiv:1608.06993 [cs.CV].
  4. ^ Winterer, Jochen; Maier, Nikolaus; Wozny, Christian; Beed, Prateep; Breustedt, Jörg; Evangelista, Roberta; Peng, Yangfan; D’Albis, Tiziano; Kempter, Richard (2017). "Excitatory Microcircuits within Superficial Layers of the Medial Entorhinal Cortex". Cell Reports. 19 (6): 1110–1116. doi:10.1016/j.celrep.2017.04.041. PMID 28494861.
  5. ^ Fitzpatrick, David (1996-05-01). "The Functional Organization of Local Circuits in Visual Cortex: Insights from the Study of Tree Shrew Striate Cortex". Cerebral Cortex. 6 (3): 329–341. doi:10.1093/cercor/6.3.329. ISSN 1047-3211.