tensorflow – Can relu be used at the last layer of a neural network?
You can use the ReLU function as the activation in the final layer. You can see this in the autoencoder example on the official TensorFlow site, here.
Use the sigmoid/softmax activation function in the final output layer when you are solving classification problems, where your labels are class values.
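For the multi-class case, softmax can be sketched in plain Python (the logit values below are made up purely for illustration):

```python
import math

def softmax(logits):
    # Exponentiate and normalize so the outputs sum to 1 -- a probability
    # distribution over classes, which is what you want from a
    # multi-class output layer.
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)        # roughly [0.66, 0.24, 0.10]
print(sum(probs))   # 1.0 (up to float rounding)
```

The class with the largest logit gets the largest probability, and every output stays strictly between 0 and 1.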
What you have asked raises another, very fundamental question. Ask yourself: what do you actually want the model to do? Predict a real value, or a value within a certain range? You will get your answer.
But before that, I feel I should give you a brief overview of what activation functions are all about and why we use them.
The main goal of activation functions is to introduce non-linearity into your model. A composition of linear functions is itself a linear function, so without activation functions a neural network is nothing but a giant linear function, and being a linear function itself it won't be able to learn any non-linear behavior at all. This is the primary purpose of using an activation function.
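That collapse of stacked linear layers into a single linear function can be seen with a tiny numeric sketch (the coefficients here are arbitrary, chosen only for the demonstration):

```python
# Two linear "layers" f(x) = a1*x + b1 and g(x) = a2*x + b2,
# stacked with no activation in between.
a1, b1 = 2.0, 1.0
a2, b2 = -3.0, 0.5

def f(x):
    return a1 * x + b1

def g(x):
    return a2 * x + b2

# Their composition g(f(x)) is again linear: a*x + b with
a = a2 * a1       # slope of the collapsed layer
b = a2 * b1 + b2  # intercept of the collapsed layer

for x in (-1.0, 0.0, 2.0):
    assert g(f(x)) == a * x + b  # identical to one linear layer
```

No matter how many such layers you stack, the result is equivalent to a single `a*x + b`, which is why a non-linear activation between layers is essential.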
Another purpose is to limit the range of a neuron's output. The following image shows the Sigmoid and ReLU activation functions (the image is collected from here).
These two graphs show exactly what kind of limits they impose on values passed through them. If you look at the Sigmoid function, it restricts the output to the range 0 to 1, so we can think of it as mapping an input value to a probability. Where can we use it? For binary classification: if we label the two classes 0 and 1 and use a Sigmoid function in the output layer, it gives us the probability that an example input belongs to a certain class.
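A minimal sigmoid sketch makes this squashing behavior concrete (the logit values are arbitrary examples):

```python
import math

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# A large positive logit maps close to 1 (class "1"),
# a large negative logit maps close to 0 (class "0"),
# and a zero logit is maximally uncertain.
print(sigmoid(4.0))   # ~0.982
print(sigmoid(-4.0))  # ~0.018
print(sigmoid(0.0))   # 0.5
```

However large the input gets in either direction, the output never leaves (0, 1), which is exactly the "probability mapping" described above.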
Now coming to ReLU. What does it do? It only lets non-negative values through. As you can see, all the negative values on the horizontal axis are mapped to 0 on the vertical axis, while the 45-degree straight line for positive values shows that they pass through unchanged. In short, ReLU clips negative values to 0 and leaves non-negative values as they are. Mathematically:
relu(value) = max(0, value).
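That one-line definition translates directly into code:

```python
def relu(value):
    # Negative inputs are clipped to 0; non-negative inputs pass
    # through unchanged.
    return max(0.0, value)

print(relu(-3.2))  # 0.0
print(relu(0.0))   # 0.0
print(relu(5.0))   # 5.0
```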
Now picture a situation: say you want to predict real values which can be positive, zero, or even negative! Will you use the ReLU activation function in the output layer just because it looks cool? Nope! Obviously not. If you do, the model will never be able to predict any negative values, because all the negative values are trimmed down to 0.
Now coming to your case: I believe this model should predict values which shouldn't be limited to the range 0 to 1. It should be a real-valued prediction.
Hence, when you use the sigmoid function, it forces the model to output values between 0 and 1, which is not a valid prediction in most cases, and so the model produces large MSE values: it is being forced to predict something nowhere near the actual correct output.
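A quick sketch shows why the MSE is doomed to be large in that situation (the target of 200 is a hypothetical value, chosen to resemble a pixel intensity):

```python
# A hypothetical target that lies well outside (0, 1),
# e.g. a pixel intensity of 200.
target = 200.0

# A sigmoid output is strictly less than 1.0 no matter what logit the
# model produces, so the squared error on this example can never drop
# below (target - 1)^2.
best_possible_squared_error = (target - 1.0) ** 2
print(best_possible_squared_error)  # 39601.0
```

No amount of training can fix this; the error floor is built into the choice of output activation.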
Again, when you use ReLU it performs better, because ReLU doesn't change any non-negative value. The model is free to predict any non-negative value, so there is no bound preventing it from predicting values close to the actual outputs.
But I think this model wants to predict intensity values, which are likely in the range 0 to 255. Hence, there are already no negative values coming from your model, so technically there is no need for a ReLU activation function in the last layer, as it will not even get any negative values to filter out (if I am not mistaken). But you can use it, as the official TensorFlow documentation does. It is only there for safety, so that no negative values can come out; for everything else the ReLU won't do anything to them.
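In other words, on intensity-like outputs ReLU acts as the identity, and only stray negative outputs would be affected (the sample values below are illustrative):

```python
def relu(value):
    return max(0.0, value)

# For outputs already in [0, 255], ReLU changes nothing -- it is the
# identity on non-negative inputs, serving purely as a safety net.
intensities = [0.0, 17.0, 128.0, 255.0]
assert [relu(v) for v in intensities] == intensities

# Only a stray negative output would be clipped.
assert relu(-0.7) == 0.0
```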