{{short description|Gradient descent learning rule in machine learning}}
{{multiple issues|
{{Refimprove|date=November 2012}}
{{confusing|date=September 2012}}
}}


In [[machine learning]], the '''delta rule''' is a [[gradient descent]] learning rule for updating the weights of the inputs to [[artificial neurons]] in a [[Feedforward neural network#A threshold (e.g. activation function) added|single-layer neural network]].<ref>{{cite web|last=Russell |first=Ingrid |title=The Delta Rule |url=http://uhavax.hartford.edu/compsci/neural-networks-delta-rule.html |publisher=University of Hartford |accessdate=5 November 2012 |url-status=dead |archiveurl=https://web.archive.org/web/20160304032228/http://uhavax.hartford.edu/compsci/neural-networks-delta-rule.html |archivedate=4 March 2016 }}</ref> It can be derived as the [[backpropagation]] algorithm for a single-layer neural network with a mean-square error loss function.


For a neuron <math>j </math> with [[activation function]] <math>g(x) </math>, the delta rule for neuron <math>j </math>'s <math>i </math>-th weight <math>w_{ji} </math> is given by

<math display="block">\Delta w_{ji} = \alpha(t_j-y_j) g'(h_j) x_i , </math>


where
* <math>\alpha </math> is a small constant called ''[[learning rate]]''
* <math>g(x) </math> is the neuron's activation function
* <math>g'</math> is the [[derivative]] of <math>g</math>
* <math>t_j </math> is the target output
* <math>h_j </math> is the weighted sum of the neuron's inputs
* <math>y_j </math> is the actual output
* <math>x_i </math> is the <math>i </math>-th input.


It holds that <math display="inline">h_j = \sum_i x_i w_{ji} </math> and <math>y_j=g(h_j) </math>.
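
As an illustration, one delta-rule update might be implemented as in the following sketch, assuming [[NumPy]] and a logistic activation function (the function names and numbers here are illustrative, not prescribed by the rule itself):

<syntaxhighlight lang="python">
import numpy as np

def g(h):
    """Logistic activation function g(h) (an illustrative choice)."""
    return 1.0 / (1.0 + np.exp(-h))

def g_prime(h):
    """Derivative g'(h) of the logistic function."""
    return g(h) * (1.0 - g(h))

def delta_rule_update(w_j, x, t_j, alpha=0.1):
    """Return the update vector whose i-th entry is
    Delta w_ji = alpha * (t_j - y_j) * g'(h_j) * x_i."""
    h_j = np.dot(w_j, x)   # weighted sum of the neuron's inputs
    y_j = g(h_j)           # actual output y_j = g(h_j)
    return alpha * (t_j - y_j) * g_prime(h_j) * x

# One update step on made-up numbers:
w = np.array([0.5, -0.3])
x = np.array([1.0, 2.0])
w = w + delta_rule_update(w, x, t_j=1.0)
</syntaxhighlight>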


The delta rule is commonly stated in simplified form for a neuron with a linear activation function, in which case <math>g'(h_j) = 1</math>, as
<math display="block">\Delta w_{ji} = \alpha \left(t_j-y_j\right) x_i </math>


While the delta rule is similar to the [[perceptron]]'s update rule, the derivation is different. The perceptron uses the [[Heaviside step function]] as the activation function <math>g(h)</math>, which means that <math>g'(h)</math> does not exist at zero and is equal to zero elsewhere, making direct application of the delta rule impossible.
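
With a linear (identity) activation, the simplified update above is essentially the [[Least mean squares filter|least mean squares]] (Widrow–Hoff) rule. A toy training loop under that assumption (the data and learning rate are made up for illustration) might look like:

<syntaxhighlight lang="python">
import numpy as np

# Made-up linearly generated data: t = 2*x1 - x2
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
t = np.array([2.0, -1.0, 1.0, 3.0])

w = np.zeros(2)   # weights of the single linear neuron
alpha = 0.1       # learning rate (illustrative value)
for epoch in range(200):
    for x_n, t_n in zip(X, t):
        y = np.dot(w, x_n)            # linear activation: y = h
        w += alpha * (t_n - y) * x_n  # Delta w = alpha * (t - y) * x

print(w)  # converges toward [2, -1]
</syntaxhighlight>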


==Derivation of the delta rule==
The delta rule is derived by attempting to minimize the error in the output of the neural network through [[gradient descent]]. The error for a neural network with output neurons indexed by <math>j</math> can be measured as
<math display="block">E = \sum_{j} \tfrac{1}{2} \left(t_j-y_j\right)^2 .</math>


In this case, we wish to move through "weight space" of the neuron (the space of all possible values of all of the neuron's weights) in proportion to the gradient of the error function with respect to each weight. In order to do that, we calculate the [[partial derivative]] of the error with respect to each weight. For the <math>i </math>th weight, this derivative can be written as
<math display="block">\frac{\partial E}{ \partial w_{ji} } .</math>


Because we are only concerning ourselves with the <math>j </math>-th neuron, we can substitute the error formula above while omitting the summation:
<math display="block">\frac{\partial E}{ \partial w_{ji} } = \frac{ \partial }{ \partial w_{ji} } \left [\frac{1}{2} \left( t_j-y_j \right ) ^2 \right ] </math>



Next we use the [[chain rule]] to split this into two derivatives:
<math display="block">\frac{\partial E}{\partial w_{ji}} = \frac{ \partial \left ( \frac{1}{2} \left( t_j-y_j \right ) ^2 \right ) }{ \partial y_j } \frac{ \partial y_j }{ \partial w_{ji} } </math>


To find the left derivative, we simply apply the [[power rule]] and the chain rule:
<math display="block">\frac{\partial E}{\partial w_{ji}} = - \left ( t_j-y_j \right ) \frac{ \partial y_j }{ \partial w_{ji} } </math>


To find the right derivative, we again apply the chain rule, this time differentiating with respect to the total input to <math>j </math>, <math>h_j </math>:
<math display="block">\frac{\partial E}{\partial w_{ji}} = - \left ( t_j-y_j \right ) \frac{ \partial y_j }{ \partial h_j } \frac{ \partial h_j }{ \partial w_{ji} } </math>


Note that the output of the <math>j</math>th neuron, <math>y_j </math>, is just the neuron's activation function <math>g </math> applied to the neuron's input <math>h_j </math>. We can therefore write the derivative of <math>y_j </math> with respect to <math>h_j </math> simply as <math>g </math>'s first derivative:
<math display="block">\frac{\partial E}{\partial w_{ji}} = - \left ( t_j-y_j \right ) g'(h_j) \frac{ \partial h_j }{ \partial w_{ji} } </math>


Next we rewrite <math>h_j </math> in the last term as the sum over all <math>k </math> weights of each weight <math>w_{jk} </math> times its corresponding input <math>x_k </math>:
<math display="block">\frac{\partial E}{\partial w_{ji}} = - \left ( t_j-y_j \right ) g'(h_j) \; \frac{ \partial}{ \partial w_{ji} } \!\!\left[ \sum_{i} x_i w_{ji} \right] </math>



Because we are only concerned with the <math>i </math>th weight, the only term of the summation that is relevant is <math>x_i w_{ji} </math>. Clearly,
<math display="block">\frac{ \partial (x_i w_{ji}) }{ \partial w_{ji} } = x_i. </math>
giving us our final equation for the gradient:
<math display="block">\frac{\partial E}{ \partial w_{ji} } = - \left ( t_j-y_j \right ) g'(h_j) x_i </math>


As noted above, gradient descent tells us that our change for each weight should be proportional to the gradient. Choosing a proportionality constant <math>\alpha </math> and eliminating the minus sign to enable us to move the weight in the negative direction of the gradient to minimize error, we arrive at our target equation:
<math display="block">\Delta w_{ji}=\alpha(t_j-y_j) g'(h_j) x_i .</math>



==See also==
* [[Stochastic gradient descent]]
* [[Backpropagation]]
* [[Rescorla–Wagner model]] – the origin of the delta rule


==References==
{{Reflist}}


{{DEFAULTSORT:Delta Rule}}
[[Category:Artificial neural networks]]

Latest revision as of 04:45, 27 October 2023

In machine learning, the delta rule is a gradient descent learning rule for updating the weights of the inputs to artificial neurons in a single-layer neural network.[1] It can be derived as the backpropagation algorithm for a single-layer neural network with mean-square error loss function.

For a neuron with activation function , the delta rule for neuron 's -th weight is given by

where

  • is a small constant called learning rate
  • is the neuron's activation function
  • is the derivative of
  • is the target output
  • is the weighted sum of the neuron's inputs
  • is the actual output
  • is the -th input.

It holds that and .

The delta rule is commonly stated in simplified form for a neuron with a linear activation function as

While the delta rule is similar to the perceptron's update rule, the derivation is different. The perceptron uses the Heaviside step function as the activation function , and that means that does not exist at zero, and is equal to zero elsewhere, which makes the direct application of the delta rule impossible.

Derivation of the delta rule

[edit]

The delta rule is derived by attempting to minimize the error in the output of the neural network through gradient descent. The error for a neural network with outputs can be measured as

In this case, we wish to move through "weight space" of the neuron (the space of all possible values of all of the neuron's weights) in proportion to the gradient of the error function with respect to each weight. In order to do that, we calculate the partial derivative of the error with respect to each weight. For the th weight, this derivative can be written as

Because we are only concerning ourselves with the -th neuron, we can substitute the error formula above while omitting the summation:

Next we use the chain rule to split this into two derivatives:

To find the left derivative, we simply apply the power rule and the chain rule:

To find the right derivative, we again apply the chain rule, this time differentiating with respect to the total input to , :

Note that the output of the th neuron, , is just the neuron's activation function applied to the neuron's input . We can therefore write the derivative of with respect to simply as 's first derivative:

Next we rewrite in the last term as the sum over all weights of each weight times its corresponding input :

Because we are only concerned with the th weight, the only term of the summation that is relevant is . Clearly, giving us our final equation for the gradient:

As noted above, gradient descent tells us that our change for each weight should be proportional to the gradient. Choosing a proportionality constant and eliminating the minus sign to enable us to move the weight in the negative direction of the gradient to minimize error, we arrive at our target equation:

See also

[edit]

References

[edit]
  1. ^ Russell, Ingrid. "The Delta Rule". University of Hartford. Archived from the original on 4 March 2016. Retrieved 5 November 2012.