dalenewman/JunkDrawer

CSV imports incorrect column when datetime is detected

mattpenner opened this issue · 3 comments

Imported the following data both as csv and as xlsx:
t1,t2,t3,t4
Monday,10,1.1,1/1/2014
Tuesday,11,2.2,2/1/2014
Wednesday,12,3.3,3/1/2014
Thursday,13,4.4,4/1/2014
Friday,14,5.5,5/1/2014
Saturday,15,6.6,6/1/2014

Relevant portion of the config file is:

  <junkdrawer>
    <types>
      <!--<add type="boolean" />-->
      <!--<add type="byte" />-->
      <!--<add type="int16" />-->
      <add type="int32" />
      <!--<add type="int64" />-->
      <!--<add type="single" />-->
      <!--<add type="double" />-->
      <add type="datetime" />
    </types>
  </junkdrawer>

When imported as an xlsx I get the following:
image

Here is the relevant portions of the log:

2014-03-27 12:00:41 | Info | JDT5042517360000 | Data................ | Using 10 character string for t1.
2014-03-27 12:00:41 | Info | JDT5042517360000 | Data................ | Using int32 for t2.
2014-03-27 12:00:41 | Info | JDT5042517360000 | Data................ | Using 19 character string for t3.
2014-03-27 12:00:41 | Info | JDT5042517360000 | Data................ | Using datetime for t4.

When imported as a csv I get the following. Note that column 4 was actually imported also as column 3:
image

Here is the relevant portions of the log:

2014-03-27 12:04:22 | Info | JDT5042517360000 | Data................ | Using 10 character string for t1.
2014-03-27 12:04:22 | Info | JDT5042517360000 | Data................ | Using int32 for t2.
2014-03-27 12:04:22 | Info | JDT5042517360000 | Data................ | Using datetime for t3.
2014-03-27 12:04:22 | Info | JDT5042517360000 | Data................ | Using datetime for t4.

Okay. Thanks for finding and detailing this Matt. I'll take a look at it pretty soon.

We have two different things happening here.

First, Excel is recognizing 1.1, 2.2, 3.3, etc. as numbers, but by the time I run type checking on them, some have the super precise zero thing happening (i.e. 2.200000....), and without a numeric data type to fall into, they get to be strings.

Second, the .csv file isn't adding any zeroes to the 1.1, 2.2, 3.3, etc. But, luck would have it, that .NET's DateTime.TryParse method (given your system's culture), will see 1.1, 2.2, 3.3, etc. as valid dates (see http://stackoverflow.com/questions/6511372/datetime-tryparse-converts-decimal-to-datetime). t3 and t4 are not the same dates.

So, I'm not exactly sure how to handle either case yet. Is there a good reason not to include a numeric data type check ahead of the datetime check?

Ah, definitely interesting.

I don't have any good reason to not include numerical type check before date. In fact, your strategy would have been my first thought since I would think it's easy to invalidate a numerical when common date type delimiters are in the field, such as /,-,:. In addition a number might be misinterpreted as a date more often than a date would be misinterpreted as a number.

Matt

On Mar 28, 2014, at 6:11 AM, "Dale Newman" notifications@github.com wrote:

We have two different things happening here.

First, Excel is recognizing 1.1, 2.2, 3.3, etc. as numbers, but by the time I run type checking on them, some have the super precise zero thing happening (i.e. 2.200000....), and without a numeric data type to fall into, they get to be strings.

Second, the .csv file isn't adding any zeroes to the 1.1, 2.2, 3.3, etc. But, luck would have it, that .NET's DateTime.TryParse method (given your system's culture), will see 1.1, 2.2, 3.3, etc. as valid dates (see http://stackoverflow.com/questions/6511372/datetime-tryparse-converts-decimal-to-datetime). t3 and t4 are not the same dates.

So, I'm not exactly sure how to handle either case yet. Is there a good reason not to include a numeric data type check ahead of the datetime check?


Reply to this email directly or view it on GitHub.