schuderer/mllaunchpad

Store/read dtypes together with CSV (FileDataSink/Source)

schuderer opened this issue · 8 comments

Right now (1.0.0rc1), direct control of the data dtypes of CSV DataSources/Sinks is somewhat lacking.

As it's using pandas.read_csv under the hood, it is a blessing and a curse that dtypes get automagically guessed for you. Weird data, e.g. strings that consist only of digits, will get interpreted as numbers (with leading zeros stripped out). This is particularly problematic if you need consistency between data from relational databases (which comes typed) and test data from CSVs.
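For illustration, here is the leading-zero effect in plain pandas, and the explicit dtype that avoids it (a minimal standalone sketch, not mllaunchpad code):

    import io
    import pandas as pd

    # Account codes that happen to consist only of digits
    csv_data = "account,amount\n00042,9.95\n00777,1.50\n"

    df = pd.read_csv(io.StringIO(csv_data))
    print(df["account"].tolist())   # [42, 777] -- int64, leading zeros gone

    df = pd.read_csv(io.StringIO(csv_data), dtype={"account": str})
    print(df["account"].tolist())   # ['00042', '00777'] -- preserved as strings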

While one can always configure data types with the DataSource, e.g. using `options: {dtype: {col1: str, col2: int64}}`, this means that for CSVs with 200 columns, you need to paste a long dictionary here (which you will have to keep in sync with any changes to the data).

In some cases, you might get away with specifying dtypes only for the misinterpreted columns. But the general problem persists.

It would be nice if FileDataSink/Source supported an automatic and consistent way to store and read a dataframe as a CSV file, together with its dtypes, without having to specify column-specific info in the options (a way to turn on/off this automatic dtype saving/loading would be nice, though).

People smarter than me have already thought about the types problem with CSV, see e.g. here: https://stackoverflow.com/questions/50047237/how-to-preserve-dtypes-of-dataframes-when-using-to-csv
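The gist of the approach discussed there is to write the dtype names as an extra first row of the csv and to read that row back before loading the data. The write side looks roughly like this (my sketch, not the exact SO code):

    import pandas as pd

    def to_csv_with_dtypes(df, path):
        # First row: the dtype name of each column; below it: the actual data
        dtypes_row = df.dtypes.astype(str).to_frame().T
        pd.concat([dtypes_row, df]).to_csv(path, index=False)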

Until this is tackled, you can pin down your dtypes either by returning them from training together with the model and applying them to the dataframes in testing/prediction, or by working around the issue by pickling your data using a `binary_file` instead of `csv`: #113 (comment)

The problem with this workaround is that you have to deal with datasource-specific code in your model code. That is, if you use this method to persist and load DataFrames, you'll have to change your code if you decide to use a database instead of a file, which goes against the idea of separating code from the environment.

I'm reading into the issue and will report back my approach.

Approach:
- Figure out the solution from SO separately.
- Integrate the working solution in a DataSource/Sink IO copy (in /examples).
- Write and run tests for the proposed solution.

Hi, thanks for posting back! Do you already have an idea on what that solution might look like?

I first looked at the read_csv solution with date preservation proposed on SO:

    # Read the dtypes from the first line of the csv
    dtypes = {key: value for (key, value) in
              pd.read_csv(path, nrows=1).iloc[0].to_dict().items()
              if 'date' not in value}
    parse_dates = [key for (key, value) in
                   pd.read_csv(path, nrows=1).iloc[0].to_dict().items()
                   if 'date' in value]
    # Read the rest of the lines with the dtypes from above
    return pd.read_csv(path, dtype=dtypes, parse_dates=parse_dates, skiprows=[1])

and wanted to incorporate it into the get_dataframe function of the DataSources class.
It actually breaks on all int types because of this dict comprehension:

    {key: value for (key, value) in
     pd.read_csv(path, nrows=1).iloc[0].to_dict().items()
     if 'date' not in value}

specifically the `if 'date' not in value` condition, because the `in` check cannot be applied to int64 and float64 values.
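In other words, the membership test only works when the value on the right is a string, which is also why the str() workaround mentioned further down helps (a tiny illustration, assuming the first row comes back as actual numbers rather than dtype strings):

    # The membership test requires a string (or other container) on the right:
    'date' not in 'int64'         # fine, evaluates to True
    try:
        'date' not in 42          # raises TypeError: argument of type 'int' is not iterable
    except TypeError as e:
        print(e)
    'date' not in str(42)         # the str() workaround is always safe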

I read further, and it is suggested that you can also just pass `parse_dates=True`, so that pandas parses and detects the dates automatically instead of them being provided manually as in the example above.

I also found this regarding preserving the dtypes:
pandas-dev/pandas#27749
It seems that the issue will not be tackled; parquet files are recommended instead, but I think that might be out of scope here.
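For reference, this is what the recommended parquet route looks like; dtypes round-trip without any extra bookkeeping, but it needs pyarrow or fastparquet installed:

    import pandas as pd

    df = pd.DataFrame({"account": ["00042", "00777"], "amount": [9.95, 1.5]})
    df.to_parquet("mock.parquet")              # dtypes are stored in the parquet schema
    restored = pd.read_parquet("mock.parquet")
    assert restored.dtypes.equals(df.dtypes)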

So, I think the solution might look like this, slightly modified.
It has to be tested, of course, with different csv's (from simple to elaborate):

    import pandas as pd

    def read_csv(path):
        # First line of the csv holds the stored dtype names
        dtypes = pd.read_csv(path, nrows=1).iloc[0].to_dict()
        # Skip the dtype row, apply the dtypes, and let pandas handle the dates
        return pd.read_csv(path, dtype=dtypes, parse_dates=True, skiprows=[1])

Thanks for sketching the approach, and sorry if I am missing anything in your comment (formatting seems to be off a bit in my browser). I am beginning to think that a dtypes-in-separate-file approach would be preferable to the one-line-with-types-in-the-csv approach. So instead of just the .csv file, an additional .dtypes file would be written/read at the same path. I am thinking of this because the extra line with dtypes would make the csvs incompatible with pretty much any other tool that reads csvs, whereas with the two-files approach, the dtypes file can simply be ignored by them.

Edit: you could e.g. just use a flat text file format for .dtypes, with one line per dtype in the order of the columns in the csv, which would be pretty straightforward.
PS: the `'date' in` thing/parse_dates looks like a red herring, as `'date' in value` only works when value is a string, so the fix for that approach would probably just be `'date' in str(value)`. But that might not be relevant for the separate-file approach.

Thanks for the feedback. What about the following?

If we go with that approach, we need to read the dtypes file first and convert it to a dict,
then pass it to the actual read_csv call in the DataSources class.
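Roughly like this (just a sketch; the helper names and the date filtering are placeholders, not the final implementation):

    import pandas as pd

    def write_dtypes_file(df, dtypes_path):
        # One dtype name per line, in the column order of the csv
        with open(dtypes_path, "w") as f:
            f.write("\n".join(str(dt) for dt in df.dtypes) + "\n")

    def read_csv_with_dtypes_file(csv_path, dtypes_path):
        with open(dtypes_path) as f:
            dtype_names = f.read().splitlines()
        columns = pd.read_csv(csv_path, nrows=0).columns
        # datetime columns cannot go through dtype=, hand them to parse_dates
        dtypes = {col: dt for col, dt in zip(columns, dtype_names) if "date" not in dt}
        parse_dates = [col for col, dt in zip(columns, dtype_names) if "date" in dt]
        return pd.read_csv(csv_path, dtype=dtypes, parse_dates=parse_dates)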

Also, I think we need to add the path to the dtypes file to the yml config:

  bogus:
    type: csv
    path: ./mock.csv
    dtypes: ./bogus.dtypes

Sounds good! 👍 I also like the idea of creating a `dtypes` property. It makes it very explicit what's going to happen (another file will be written, what it will be called, etc.).