Let’s say you have a dataset of 100 elements. Which would you prefer: all 100 predictions are close to the target with an error of 0.1 each, or 99 of them hit the target exactly but one specific case has an error of 10?
In both cases the sum of absolute errors is the same:
Case A: 100*0.1 = 10
Case B: 99*0+10 = 10
In case A the model is reliable: an error of 0.1 is fine. In case B, the model fits 99/100 of the data but could kill a guy on that one extreme case. We prefer a model that is reliable overall to one that makes an extreme error in a single case. Now, if we square the errors, we get:
Case A: 100*0.1*0.1 = 1
Case B: 99*0*0+10*10 = 100.
We can see that with squared errors, case A now scores much better than case B: squaring penalizes the single extreme error far more heavily, which is exactly what we want. Hope it’s clearer :)
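
If it helps, here is a minimal NumPy sketch of the same arithmetic (the target values, prediction arrays, and variable names are just made up for illustration): it sums the absolute and squared errors for both cases, and the squared sum is the one that separates them.

```python
import numpy as np

# Hypothetical setup: 100 targets of 0, two sets of predictions.
targets = np.zeros(100)

# Case A: every prediction is off by 0.1
preds_a = np.full(100, 0.1)

# Case B: 99 perfect predictions, one prediction off by 10
preds_b = np.zeros(100)
preds_b[0] = 10.0

for name, preds in [("A", preds_a), ("B", preds_b)]:
    abs_err = np.abs(preds - targets)
    print(f"Case {name}: sum |error| = {abs_err.sum():.1f}, "
          f"sum error^2 = {(abs_err ** 2).sum():.1f}")

# Output:
# Case A: sum |error| = 10.0, sum error^2 = 1.0
# Case B: sum |error| = 10.0, sum error^2 = 100.0
```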