Can a model have a lower loss on the validation set than it shows during training?

resolved Mar 4

Yes

I created a new version of a machine learning model whose average loss on unseen validation data is slightly lower than the loss displayed during training. Its other metrics, like AUC, are also slightly better.

The model has Dropout, L2, LayerNormalization, and other similar techniques applied to every one of its 30 layers. Additionally, the model has so far been trained on over 120 million sequences.

My theory, which is supported by GPT-4, is that this extreme regularization is applied only during training. The loss displayed during training will be poor because so much of the data is lost by the time it gets through the model during the training process. During validation, Dropout and most of the other techniques are not applied, so the loss is slightly lower.

Is this theory correct?


I actually ran into this a few days ago, here’s a thorough explanation of several possible causes: https://twitter.com/aureliengeron/status/1110839223878184960

TL;DR is that it’s mostly because some forms of regularization (like dropout) are active during training but switched off during validation. There are also some smaller effects: for example, the training loss is calculated, on average, half an epoch before the validation loss.
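To make the dropout point concrete, here is a minimal sketch in plain Python. It is not the poster's model: the three-weight linear "model", the 0.2 dropout rate, and the helper names `dropout` and `mse` are all assumptions made up for illustration. The weights are fixed, so the only difference between the two losses is whether dropout is active.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: while training, zero each value with probability
    `rate` and scale the survivors by 1 / (1 - rate); otherwise pass through."""
    if not training:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def mse(weights, inputs, targets, rate, training, rng):
    """Mean squared error of a linear model with dropout applied to its inputs."""
    total = 0.0
    for x, y in zip(inputs, targets):
        x_used = dropout(x, rate, training, rng)
        pred = sum(w * xi for w, xi in zip(weights, x_used))
        total += (pred - y) ** 2
    return total / len(inputs)

rng = random.Random(0)
weights = [0.5, -1.0, 2.0]  # hypothetical weights that fit the data exactly
inputs = [[rng.uniform(-1, 1) for _ in weights] for _ in range(1000)]
targets = [sum(w * xi for w, xi in zip(weights, x)) for x in inputs]

# Same weights, same data: only the dropout switch differs.
train_mode_loss = mse(weights, inputs, targets, rate=0.2, training=True, rng=rng)
eval_mode_loss = mse(weights, inputs, targets, rate=0.2, training=False, rng=rng)

print("loss with dropout on :", train_mode_loss)
print("loss with dropout off:", eval_mode_loss)
```

With dropout on, the predictions are noisy even though the weights are perfect, so the measured loss is higher than with dropout off. The input data is not "lost" in any permanent sense; the noise simply inflates the loss that is reported while training.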

@TonyPepperoni This is interesting. In my case, I have so much data that I am limited by GPU capacity and can't process all of it, so "epochs" isn't really a meaningful term here. I wonder if that has something to do with this: Keras's displayed metrics might be an average that includes data from 50 million sequences ago or something. Then, when I run the test data, the average starts from the better state the model is in by that point.
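A rough sketch of that averaging effect, assuming the displayed training loss is a running mean over all batches seen so far in the current pass (which is how the Keras progress bar is commonly described). The per-batch losses below are invented numbers, not measurements.

```python
# Hypothetical per-batch losses for one pass over the data:
# the model improves steadily from 1.0 down to 0.505.
num_batches = 100
batch_losses = [1.0 - 0.005 * i for i in range(num_batches)]

# What a progress bar would show under the running-mean assumption:
# the average over every batch seen so far, including the early, worse ones.
reported_train_loss = sum(batch_losses) / len(batch_losses)

# What an evaluation run measures: the model as it stands after the last batch.
end_of_pass_loss = batch_losses[-1]

print("reported training loss:", reported_train_loss)
print("loss of the final model:", end_of_pass_loss)
```

Even with no regularization at all, the reported number lags behind the current state of the model, and the longer the averaging window, the larger the lag.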

The answer to "Can a model have a lower loss on the validation set than it shows during training?" is yes.

But the answer to "Is this theory correct?" is no. Data isn't "lost" due to the regularization.

The culprit is probably just that the distributions of your train and val sets are different, with the val set having a higher proportion of easier-to-predict stuff compared to the train set. This might indicate a bug in your splitting logic.

If the train and val sets are from the same distribution, then the val set might just be small and random variation is the culprit: your random split may have just happened to give you a val set that's slightly easier to predict compared to the train set. You can eliminate this as a possibility by retraining and reevaluating on multiple different train/val splits.
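The repeated-split check could be sketched like this in plain Python. The synthetic data and the trivial "predict the training mean" model are stand-ins chosen only to keep the example self-contained; the helper names `split`, `fit_mean`, and `mse` are hypothetical, and a real check would retrain the actual model on each split.

```python
import random
import statistics

def split(data, val_fraction, seed):
    """Shuffle a copy of the data with the given seed and cut off a val set."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def fit_mean(train):
    """Stand-in 'model': always predict the mean of the training targets."""
    return statistics.fmean(train)

def mse(prediction, values):
    return statistics.fmean([(v - prediction) ** 2 for v in values])

data_rng = random.Random(42)
data = [data_rng.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic targets

gaps = []
for seed in range(10):
    train, val = split(data, val_fraction=0.1, seed=seed)
    model = fit_mean(train)
    gaps.append(mse(model, val) - mse(model, train))  # negative: val looks easier

print("val minus train loss per split:", [round(g, 3) for g in gaps])
print("mean gap:", round(statistics.fmean(gaps), 3),
      "stdev:", round(statistics.stdev(gaps), 3))
```

If the gap flips sign from split to split and its mean is small relative to its spread, random variation is a plausible explanation. If validation loss is lower on every split, something systematic (regularization, averaging lag, or a difference between the sets) is more likely.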
