twosigma/ngrid

ValueError when a column contains many ints followed by a float

adereth opened this issue · 1 comment

I'm using this as sample data: http://spatialkeydocs.s3.amazonaws.com/FL_insurance_sample.csv.zip

There's a column with a long run of 0 values, and then on line 902 it contains 7096.5. When I'm paging through the data using ngrid, everything is fine until it reaches that line. At that point it dies with:

Traceback (most recent call last):
  File "/usr/local/bin/ngrid", line 9, in <module>
    load_entry_point('ngrid==0.1.0', 'console_scripts', 'ngrid')()
  File "/usr/local/lib/python2.7/dist-packages/ngrid/main.py", line 124, in main
    grid.show_model(model, num_frozen=options.frozenCols)
  File "/usr/local/lib/python2.7/dist-packages/ngrid/grid.py", line 1000, in show_model
    view.show()
  File "/usr/local/lib/python2.7/dist-packages/ngrid/grid.py", line 686, in show
    self.__print()
  File "/usr/local/lib/python2.7/dist-packages/ngrid/grid.py", line 831, in __print
    if idx < self.__model.num_rows 
  File "/usr/local/lib/python2.7/dist-packages/ngrid/grid.py", line 378, in get_row
    row = [ c(v) for c, v in zip(self.converts, row) ]
ValueError: invalid literal for int() with base 10: '7096.5'
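
For reference, here is a minimal sketch of how this kind of failure can arise, assuming the per-column converters are guessed from an initial sample of rows. Only the name `converts` and the list comprehension come from the traceback above; `guess_convert`, `buffer_rows`, and `late_row` are hypothetical and are not taken from ngrid's source.

```python
# Hypothetical sketch: guess one converter per column from a small sample
# of rows, then apply those converters to a row outside the sample.

def guess_convert(values):
    """Return int if every sampled value parses as int, else float, else str."""
    for convert in (int, float):
        try:
            for v in values:
                convert(v)
            return convert
        except ValueError:
            continue
    return str

buffer_rows = [["a", "0"], ["b", "0"], ["c", "0"]]  # sampled rows: second column is all 0s
late_row = ["d", "7096.5"]                           # a later row that was not sampled

# One converter per column, guessed from the sample only.
converts = [guess_convert(col) for col in zip(*buffer_rows)]

error = None
try:
    # Same shape as the failing line in the traceback.
    row = [c(v) for c, v in zip(converts, late_row)]
except ValueError as exc:
    error = str(exc)
    print(error)
```

The second column is guessed as int because the sample holds only 0s, so the later float string cannot be converted.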

You can use the --buffer_size option to use a larger number of rows for guessing column types, or --dataframe to load the entire dataset into memory up front.

It's on the todo list to adjust types dynamically in cases like this, but it's somewhat tricky to implement.
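
As a rough illustration of what adjusting types dynamically could look like, here is a sketch that widens a column's converter from int to float to str when a value fails to convert. This is an assumption about one possible approach, not ngrid's implementation; `WIDENING_ORDER` and `convert_with_widening` are hypothetical names.

```python
# Hypothetical sketch, not ngrid's implementation: converters ordered from
# the narrowest type to the widest.
WIDENING_ORDER = [int, float, str]

def convert_with_widening(converts, row):
    """Convert a row, widening a column's converter in place when it fails."""
    out = []
    for i, value in enumerate(row):
        start = WIDENING_ORDER.index(converts[i])
        for convert in WIDENING_ORDER[start:]:
            try:
                out.append(convert(value))
                converts[i] = convert  # remember the wider type for later rows
                break
            except ValueError:
                continue  # try the next wider type; str never raises here
    return out

converts = [int]                       # column type guessed from early rows
rows = [["0"], ["7096.5"], ["0"]]
converted = [convert_with_widening(converts, r) for r in rows]
print(converted)
```

Rows converted before the widening keep their old type in this sketch, so a viewer that has already formatted them as ints would probably need to redo that work. That may be part of what makes this tricky in practice.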