3. One hidden layer Neural Network
3.2 Neural Network Representation
When we count layers in a NN, we don't count the input layer.
input layer = layer 0
The term "hidden layer" refers to the fact that in the training set the true values for the nodes in the middle are not observed.
3.3 Computing a Neural Network's Output
When we vectorize, the rule of thumb is to stack the different nodes in a layer vertically.
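A minimal NumPy sketch of this forward pass for one hidden layer (the shapes and variable names W1, b1, W2, b2 follow the usual course notation but are only illustrative):

```python
import numpy as np

def forward_pass(X, W1, b1, W2, b2):
    """One-hidden-layer forward pass.
    X  : (n_x, m) input, one example per column
    W1 : (n_h, n_x), b1 : (n_h, 1)
    W2 : (1, n_h),   b2 : (1, 1)
    """
    Z1 = W1 @ X + b1                 # hidden-node pre-activations stacked vertically
    A1 = np.tanh(Z1)                 # hidden-layer activation
    Z2 = W2 @ A1 + b2                # output-layer pre-activation
    A2 = 1.0 / (1.0 + np.exp(-Z2))   # sigmoid output for binary classification
    return A2
```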
3.6 Activation functions
Using a tanh instead of a sigmoid function kind of has the effect of centering the data, so that the mean of the data is close to 0 rather than 0.5, and this actually makes learning for the next layer a bit easier.
One of the downsides of both the sigmoid function and the tanh function is that if z is either very large or very small, then the gradient (the slope of the function) becomes very small (close to 0), and this can slow down gradient descent.
The tanh function goes between -1 and 1.
One exception is the output layer: using the sigmoid function is better when you are doing binary classification, in which the output is between 0 and 1.
An activation function that almost always works better than the sigmoid function is the tanh function, or the hyperbolic tangent function.
In practice, using ReLU often makes the NN learn faster than tanh or sigmoid, and the main reason is that there is less of this effect of the slope of the function going to 0.
ReLU: a = max(0, z). The derivative is 1 so long as z is positive, and the derivative (slope) is 0 when z is negative.
Except for the output layer, ReLU is increasingly the default choice of activation function.
Leaky ReLU: instead of being 0 when z is negative, it has a slight slope. It usually works better than ReLU, although it is not used as much in practice.
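For reference, a sketch of the activation functions mentioned above in NumPy (the 0.01 slope for Leaky ReLU is just a common illustrative choice, not a fixed constant):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # output in (0, 1)

def tanh(z):
    return np.tanh(z)                      # output in (-1, 1), roughly zero-centered

def relu(z):
    return np.maximum(0, z)                # a = max(0, z)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)   # small slope instead of 0 for z < 0
```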
3.7 Why do you need non-linear activation functions?
If we were to use a linear activation function (or the identity activation function), then the NN just outputs a linear function of the input.
Deep networks are NNs with many hidden layers. If you use a linear activation function, then no matter how many layers your NN has, it just computes a linear function of the input, so you might as well not have any hidden layers.
Identity activation function: Output whatever was input.
The one place where you might use a linear activation function is the output layer, when you are doing linear regression (predicting a real-valued output).
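A small sketch showing why composing linear layers collapses to a single linear function (the shapes here are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X  = rng.standard_normal((3, 5))    # 3 features, 5 examples
W1 = rng.standard_normal((4, 3)); b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((1, 4)); b2 = rng.standard_normal((1, 1))

# Two "layers" with the identity activation...
A2 = W2 @ (W1 @ X + b1) + b2

# ...equal one linear layer with W' = W2 W1 and b' = W2 b1 + b2.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
assert np.allclose(A2, W_prime @ X + b_prime)
```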
3.8 Derivatives of activation functions
When you implement back-propagation for your NN, you need to compute the slope or the derivative of the activation function.
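A sketch of these derivatives in NumPy, using the convenient fact that the sigmoid and tanh derivatives can be written in terms of the activation a itself:

```python
import numpy as np

def sigmoid_derivative(z):
    a = 1.0 / (1.0 + np.exp(-z))
    return a * (1 - a)                  # g'(z) = a(1 - a)

def tanh_derivative(z):
    a = np.tanh(z)
    return 1 - a ** 2                   # g'(z) = 1 - a^2

def relu_derivative(z):
    return (z > 0).astype(float)        # 1 for z > 0, 0 for z < 0 (take 0 at z = 0)

def leaky_relu_derivative(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)
```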
3.10 Random Initialization
When you train a NN, it is important to initialize the weights randomly: if all the weights start at zero, every hidden unit computes the same function, and they remain symmetric no matter how long you train.
We usually prefer to initialize the weights to very small random values. If the weights are too large, then when you compute the activation values, z will be either very large or very small, and in that case you are more likely to end up on the flat parts of the tanh or sigmoid function, where the slope of the gradient is very small.
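A minimal sketch of this initialization, assuming layer sizes n_x, n_h, n_y for a one-hidden-layer network (the 0.01 scale is one common choice of "very small"):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, seed=1):
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((n_h, n_x)) * 0.01  # small random values break symmetry
    b1 = np.zeros((n_h, 1))                      # biases can safely start at zero
    W2 = rng.standard_normal((n_y, n_h)) * 0.01
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```

Setting the biases to zero is fine because the random weights already break the symmetry between hidden units.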