aws-samples/aws-machine-learning-university-accelerated-tab

Problem with final task dataset

agrigoriev opened this issue · 1 comment

After loading the final project data:

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
  
training_data = pd.read_csv('../data/final_project/training.csv')
test_data = pd.read_csv('../data/final_project/test_features.csv')
y_test = pd.read_csv('../data/final_project/y_test.csv')

print('The shape of the training dataset is:', training_data.shape)
print('The shape of the test dataset is:', test_data.shape)
print('The shape of the y_test is:', y_test.shape)

The shape of the training dataset is: (71538, 13)
The shape of the test dataset is: (23846, 12)
The shape of the y_test is: (23845, 1)

The number of samples in the test features (23846) differs from the number in y_test (23845).
Is this expected?

Hi @agrigoriev. Thanks for going over the final project.

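It looks like the first row of y_test.csv was read as a header, so it was dropped from the data. That file has no header row, and by default pd.read_csv treats the first row as column names. If you read y_test like this, the first row is kept as data: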
y_test = pd.read_csv('../data/final_project/y_test.csv', header=None)

With this change, y_test.shape is (23846, 1), which matches the number of rows in test_data.

This is not a problem for training.csv and test_features.csv, as both have a header row.
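
To see the effect without the project files, here is a minimal sketch that uses a small in-memory CSV as a stand-in for y_test.csv. The three-row content and the variable names are made up for illustration; only pd.read_csv and its header parameter come from the discussion above.

```python
import io

import pandas as pd

# Hypothetical stand-in for y_test.csv: three label rows and no header row.
csv_text = "1\n0\n1\n"

# Default behaviour (header='infer'): the first row becomes the column name,
# so one sample is missing from the data.
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None: every row is kept as data and the column is labelled 0.
with_header_none = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_default.shape)      # (2, 1)
print(with_header_none.shape)  # (3, 1)
```

The one-row difference between the two shapes is the same mismatch as 23845 versus 23846 in the original report.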