ai4all-sfu/NLP_2020

Test data missing labels

Closed this issue · 3 comments

atol commented

The data/test.csv file is missing labels so we cannot verify the model's test accuracy.

Hi Alice,
Once the trained model goes to production, you won't have labels to measure accuracy on the test data. The purpose of the test data here is a little different from the conventional pipeline, and I have already computed the metrics on the validation set.
So the original training set is divided into two smaller subsets: the real training set and a validation set (which acts as the test set for development).
The original test set serves as the real, production-level test data.

atol commented

Very good points! For this exercise, though, I think it may be useful for the students to measure the model's performance on unseen data, e.g. to check for overfitting.

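Yup, I think that would be easier for the students to digest. Can you split the training data into training and testing sets in an 80:20 ratio?
The commands are below. Calling df.sample twice would draw two independent samples that can overlap, so the testing data is taken as the rows left out of the training sample.
This is for training data:
train = df.sample(frac=0.8)
For testing data:
test = df.drop(train.index)
and then we can delete the original testing data from the repository.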
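
For reference, here is a minimal sketch of the 80:20 split in pandas. It takes the testing set as the rows that were not sampled for training, so the two sets cannot share rows. The helper name `split_train_test`, the seed, and the toy `text`/`label` columns are assumptions for illustration, not the repository's actual schema, and the sketch assumes the DataFrame index is unique.

```python
import pandas as pd


def split_train_test(df, train_frac=0.8, seed=42):
    """Split df into disjoint training and testing sets (hypothetical helper)."""
    # Randomly pick train_frac of the rows; random_state makes the split reproducible.
    train = df.sample(frac=train_frac, random_state=seed)
    # The testing set is every row not picked above (requires a unique index).
    test = df.drop(train.index)
    return train, test


# Toy stand-in for the labelled training data (assumed column names).
df = pd.DataFrame({
    "text": [f"sample {i}" for i in range(10)],
    "label": [i % 2 for i in range(10)],
})

train_df, test_df = split_train_test(df)
print(len(train_df), len(test_df))
```

Because both sets come from the labelled data, the held-out 20% keeps its labels and can be used to compare accuracy on seen and unseen examples.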