Ch 12
Neural Networks (NN)
Non-linear learning method - inspired by the connections between neurons in the animal brain
Diagrammatic sample App1, SS8
Qualities
Works reasonably well with noisy and erroneous data
Takes longer to process (train)
Relatively opaque - hard to interpret the criteria it uses to reach a decision
Input and Output Coding
Start off by standardizing all attributes
Min-max normalization for continuous attributes: X* = (X - min) / (max - min)
^ for continuous attributes the network is relatively robust to values slightly outside the min-max range - but if they are far outside, handle them ad hoc, e.g. clamp them to the min/max values or reject the record
For categorical attributes with few values - can use flag (indicator) variables
For classification
If just 2 categories - apply a threshold (e.g. if output > 0.6 then A)
If more than 2 - each category gets its own output node (1-of-n output encoding) - the node with the highest output is chosen - the difference between the highest and second-highest outputs also gives a measure of confidence
^ records classified with low confidence can then be examined further (see the coding sketch below)
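A minimal sketch of the coding steps above, assuming NumPy; the function names, attribute ranges, and example values are illustrative, not from the source. The two-class case would simply be `label = "A" if output > 0.6 else "B"`.

```python
import numpy as np

def min_max_normalize(x, x_min, x_max):
    """Scale a continuous attribute to [0, 1]; clamp values that fall
    outside the observed min/max range (one of the ad-hoc options)."""
    x = np.clip(x, x_min, x_max)            # clamp out-of-range values
    return (x - x_min) / (x_max - x_min)

def one_of_n_outputs(outputs, labels):
    """1-of-n output coding: pick the node with the highest output and
    report confidence as the gap to the second-highest output."""
    order = np.argsort(outputs)[::-1]       # indices, highest first
    best, second = order[0], order[1]
    return labels[best], outputs[best] - outputs[second]

# Example: income observed in [20_000, 90_000]; three class output nodes
income_scaled = min_max_normalize(np.array([35_000, 95_000]), 20_000, 90_000)
label, conf = one_of_n_outputs(np.array([0.72, 0.55, 0.10]), ["A", "B", "C"])
print(income_scaled, label, conf)           # low-confidence records can be reviewed
```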
Simple example of Neural Network
SS9, App1 for simple Neural Network
Has an input layer (size depends on the number and type of attributes), a hidden layer (usually 1, possibly 2 - we decide its size; more nodes increase fitting/complexity), and an output layer (size depends on the number of possible classifications)
The input layer just passes data on (so it does not share the detailed node structure of the other two layers) - flow is one-way (feedforward) and each neuron is connected to every neuron in the next layer
How it works
Input nodes simply forward data to hidden ones
Here weighting occurs: each forwarded input is multiplied by a connection weight
There is an extra input of 1 (by convention) with its own weight as well
The weighted values are summed, then this sum (x) is passed through an activation function (to mimic the non-linear firing of actual neurons)
This introduces non-linear behavior
Most common activation function is the "Sigmoid Function"
y = 1 / (1+ e^-x)
combines linear, curvilinear, and nearly constant behavior (graph in Appendix 1, SS10) - also called the squashing function
The result is forwarded to the output layer with its own weight - this is done for every neuron in the hidden layer
The output layer then does the same thing as the hidden layer (with the extra "1" and the activation function, etc.) and produces the output (see the forward-pass sketch after this list)
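A sketch of one forward pass through a tiny network, assuming NumPy; the layer sizes and weight values are invented for illustration. The appended 1.0 is the extra constant input described above.

```python
import numpy as np

def sigmoid(x):
    # "Squashing" activation: y = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def forward(record, w_hidden, w_output):
    """One forward pass: each layer appends the constant input 1,
    takes a weighted sum, and applies the sigmoid activation."""
    x = np.append(record, 1.0)              # inputs plus the extra "1"
    hidden = sigmoid(w_hidden @ x)          # hidden-layer outputs
    h = np.append(hidden, 1.0)              # hidden outputs plus "1"
    return sigmoid(w_output @ h)            # output-layer result

# 2 inputs -> 2 hidden nodes -> 1 output node (weights chosen arbitrarily)
w_hidden = np.array([[0.5, -0.3, 0.1],
                     [0.2,  0.8, -0.4]])    # shape (2 hidden, 2 inputs + 1)
w_output = np.array([[0.7, -0.6, 0.2]])     # shape (1 output, 2 hidden + 1)
print(forward(np.array([0.4, 0.9]), w_hidden, w_output))
```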
For estimation and prediction
Take the output value and transform it back to the original scale: Prediction = output x (data range) + minimum (sketch below)
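Since the sigmoid output lies in (0, 1), it is mapped back to the target's original scale; a minimal sketch, assuming the target's observed min and max are known:

```python
def denormalize(output, y_min, y_max):
    # Prediction = output * (data range) + minimum
    return output * (y_max - y_min) + y_min

print(denormalize(0.65, y_min=20_000, y_max=90_000))   # -> 65500.0
```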
Back Propagation
How does NN learn?
A supervised learning method - requires a large training set of complete records, including the target variable
The output value is compared with the actual value and the error is measured, typically as SSE = sum of (actual - output)^2
SSE is the usual choice, but other error functions can be used
The challenge is to find a set of weights that minimizes the SSE
Due to the non-linear nature of the sigmoid, no closed-form solution exists - therefore we must turn to optimization methods
Optimization methods
Gradient Descent Method
Uses the derivative to determine which direction moves closer to the lowest SSE (graph and equation in App1, SS11)
The gradient is multiplied by eta (the learning rate) to calculate the change in each weight: delta_w = -eta * dSSE/dw (see the sketch below)
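A sketch of the gradient-descent weight update, assuming NumPy; the SSE helper and the gradient values are placeholders for the quantities that back propagation (next section) actually supplies.

```python
import numpy as np

def sse(actual, output):
    # Sum of squared errors over records / output nodes
    return np.sum((actual - output) ** 2)

def gradient_descent_step(w, grad_sse, eta=0.1):
    """Move each weight a small step downhill:
    delta_w = -eta * dSSE/dw, so w_new = w + delta_w."""
    return w - eta * grad_sse

w = np.array([0.5, -0.3])
grad_sse = np.array([0.08, -0.02])          # placeholder gradient values
print(gradient_descent_step(w, grad_sse))   # slightly adjusted weights
```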
Back propagation rules
Takes the error and percolates it back through the network, assigning partitioned responsibility to the various connections
The weights on these connections are then adjusted using gradient descent - see App1, SS12 for the mathematical version
Error responsibility is computed using the partial derivative of the sigmoid function with respect to net_j (the weighted sum entering node j)
Uses stochastic (or online) back propagation, so weights are updated after every record
Works from the output node all the way upstream, making slight corrections to the weights to reduce prediction error (see the sketch after this list)
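A compact sketch of stochastic (online) back propagation for a one-hidden-layer network with sigmoid activations, assuming NumPy; the variable names, layer sizes, and toy data are invented. The error responsibilities (deltas) use the sigmoid derivative y(1 - y) with respect to net_j, and weights are updated after every record, as in the notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_online(records, targets, n_hidden=2, eta=0.1, epochs=100, seed=0):
    """Stochastic back propagation: forward pass, compute error
    responsibilities from the output node back upstream, then adjust
    weights by gradient descent after each record."""
    rng = np.random.default_rng(seed)
    n_in = records.shape[1]
    w_h = rng.uniform(-0.5, 0.5, (n_hidden, n_in + 1))   # +1 for the "1" input
    w_o = rng.uniform(-0.5, 0.5, (1, n_hidden + 1))

    for _ in range(epochs):
        for x, t in zip(records, targets):
            xb = np.append(x, 1.0)
            hidden = sigmoid(w_h @ xb)
            hb = np.append(hidden, 1.0)
            out = sigmoid(w_o @ hb)

            # Output-node error responsibility: y(1 - y) * (actual - output)
            delta_o = out * (1 - out) * (t - out)
            # Percolate responsibility back to the hidden nodes
            delta_h = hidden * (1 - hidden) * (w_o[:, :-1].T @ delta_o)

            # Slight gradient-descent corrections, per record
            w_o += eta * np.outer(delta_o, hb)
            w_h += eta * np.outer(delta_h, xb)
    return w_h, w_o

# Toy data: XOR-like targets scaled into (0, 1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.1, 0.9, 0.9, 0.1])
w_h, w_o = train_online(X, y, epochs=5000)
```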
Termination Criterion
The algorithm can pass through the data set many times to keep improving the weights
So, when should we stop?
If time is an issue - set a limit on the number of passes through the data set, or on real time - but expect reduced quality
Can set a low level of SSE as a threshold, but this may result in overfitting
Therefore, most adopt the following cross-validation technique
Retain part of the data as a holdout termination set
Use the weights learned on the training set to check the SSE on the holdout set
Retain the weight set with the lowest holdout SSE and keep training to look for a better one - when the rate of improvement becomes negligible, the job is done (though the global minimum SSE will not necessarily be reached; see the early-stopping sketch below)
To improve, try starting with different initial weights, or add a momentum term (discussed later)
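A sketch of the holdout-based termination idea, assuming two hypothetical helpers: train_one_pass (one pass of back propagation over the training data) and sse_on (SSE of a weight set on a data set); the patience and tolerance values are illustrative.

```python
import copy

def train_with_holdout(weights, train_set, holdout_set,
                       train_one_pass, sse_on,
                       max_passes=1000, min_improvement=1e-4, patience=10):
    """Keep the weight set with the lowest holdout SSE; stop once no
    meaningfully better set has appeared for `patience` passes."""
    best_weights = copy.deepcopy(weights)
    best_sse = sse_on(best_weights, holdout_set)
    stale = 0
    for _ in range(max_passes):
        weights = train_one_pass(weights, train_set)
        current = sse_on(weights, holdout_set)
        if current < best_sse - min_improvement:
            best_weights, best_sse, stale = copy.deepcopy(weights), current, 0
        else:
            stale += 1
            if stale >= patience:     # rate of improvement is negligible
                break
    return best_weights               # not necessarily the global minimum
```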
Setting eta (the learning rate)
The bigger it is, the bigger the weight adjustment each step - but keeping it small takes a lot of time
So the solution is to let the algorithm change it: initially large, then decreased once the weights are in the neighborhood of the minimum (see the sketch below)
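One simple way to let the algorithm adjust eta; the grow/shrink rule below is an assumed heuristic (not from the source) that keeps steps large while the SSE keeps falling and cuts them once the minimum's neighborhood is reached.

```python
def adjust_eta(eta, improved, grow=1.05, shrink=0.5):
    """Grow eta slightly while SSE keeps improving (big steps far from
    the minimum); shrink it sharply once an update makes SSE worse,
    i.e. the weights are already near the minimum."""
    return eta * grow if improved else eta * shrink
```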
Momentum term
Appendix 1, SS13 for eq
Gives previous weight changes an influence on the next one - like inertia
Helps reach the neighborhood of the lowest SSE faster
Choose carefully / experiment with different values of alpha (between 0 and 1), and eta as well - a poor choice might leave the weights stuck in the first trough (a local minimum) or overshoot the global minimum SSE (see the momentum sketch below)
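A sketch of the momentum-augmented update (equation in Appendix 1, SS13), reusing the placeholder gradient from the gradient-descent sketch; alpha is the momentum coefficient between 0 and 1.

```python
def momentum_step(w, grad_sse, prev_delta, eta=0.1, alpha=0.9):
    """delta_w = -eta * dSSE/dw + alpha * previous delta_w.
    The alpha term acts like inertia, carrying the weights through small
    troughs toward the neighborhood of the lowest SSE."""
    delta = -eta * grad_sse + alpha * prev_delta
    return w + delta, delta
```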
Sensitivity Analysis
Although the network is opaque, we can check whether some attributes impact the output more than others
Test a new record with the mean value of every attribute
Then, one by one (keeping all other attributes at their means), vary one attribute from its minimum to its maximum and observe how the output changes from the all-means case
This reveals which attributes impact the output most (see the sketch below)
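A sketch of the sensitivity analysis described above, assuming NumPy and a predict(record) function that wraps the trained network; the attribute arrays are illustrative.

```python
import numpy as np

def sensitivity(predict, means, mins, maxs):
    """Hold every attribute at its mean; vary one attribute at a time
    from its min to its max and record how far the output moves from
    the all-means baseline."""
    baseline = predict(means)
    impact = np.zeros(len(means))
    for j in range(len(means)):
        for value in (mins[j], maxs[j]):
            probe = means.copy()
            probe[j] = value
            impact[j] = max(impact[j], abs(predict(probe) - baseline))
    return impact        # larger value = attribute impacts the output more
```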
There is a bit more to developing a classification neural network, but this covers most of the work