ELToulemonde/dataPreparation

Enhancements Issues while using FindandTransformDates

Closed this issue · 12 comments

The function FindandTransformDates doesn't consider

  1. single digits for day or month in date e.g. 1/12/2017 or 25/1/2017. Is it because of using the standard R date formats. This becomes a problem while reading dates from text files

  2. first letter of month in lowercase e.g. "01/jan/2017", "01/feb/2017", "01/mar/2017", again I think same reason.

Also any reason why it reads the dates as character instead of as factors, because if it reads as factors then aggregation by dates becomes easier post reading them which is not so easy if read as character

Hello,

Thanks for the remarks, i guess they could be implemented quite easly.

Could you provide more information on your last point concerning factor and characters? I'm not sure to get it

Thanks

Hi,

Both points 1 and 2 have been integrated in #27

Concerning your last point, i guess you wanted to be able to transform factors: it as been intergated. If it's not that, please specify your request.

Modifications will be available in next CRAN release.

Thanks for your remarks,

I close for now.

Hello,

Sorry to reopen this again, to continue on the same thread. Had few more Qs around the update

We are using dataPreparation is 0.3.2 and have created a dummy file with all possible date formats. Have attached the file for your reference only change i our file are in csv format rather than the attached xls format. Below are our observations

  1. Column 2 & 3 are in same format but column 2 has been converted to date format successively and column 3 didn’t
  2. Column 4 has been converted to date format but “24-Jan-2016” has been converted to “2024-01-16”
  3. Column 5 & 7 has not been converted. Difference between these two is first letter is caps lock in column 5.
  4. Column 8 has combination of \ and -

dates_test.xlsx

Could you pls revert with your findings as well. Thanks

Sorry it doesn't work yet.

I will look into it.

I reopen

Hi,

I started to look into it.

First of all, the dates_test.xlsx file you provided doesn't math with the description you gave, I see:

date1 date2 date3 date4 date5 date6 date7 date8
14/01/2016 01/01/2016 01/01/2016 24-janv-16 Jan 12,2016 01/05/2016 jan-12,2016 01/01/2016
14/02/2016 01/02/2016 01/02/2016 24-févr-16 Feb 15,2016 01/08/2016 feb-12,2016 18/01/2016
14/03/2016 01/03/2016 10/03/2016 24-mars-16 Mar 20,2016 1/16/2016 mar-12,2016 23/01/2016
14/04/2016 01/04/2016 14/04/2016 24-avr-16 Apr 13,2016 1/24/2016 apr-17,2016 02/01/2016
14/05/2016 01/05/2016 17/05/2016 24-mai-16 May 14,2016 02/02/2016 may-18,2016 14/02/2016
14/06/2016 01/06/2016 18/06/2016 24-juin-16 Jun 12,2016 2/15/2016 jun-18,2016 03/01/2016
14/07/2016 01/07/2016 23/07/2016 24-juil-16 Jul 23,2016 2/28/2016 jul-21,2016 17/03/2016
14/08/2016 01/08/2016 23/08/2016 24-août-16 Aug 18,2016 03/01/2016 aug-10,2016 23/03/2016
14/09/2016 01/09/2016 24/09/2016 24-sept-16 Sep 19,2016 3/19/2016 sep-5,2016 28/03/2016
14/10/2016 01/10/2016 11/10/2016 24-oct-16 Oct 11,2016 10/25/2016 oct-19,2016 03/03/2016
14/11/2016 01/11/2016 23/11/2016 24-nov-16 Nov 21,2016 10/30/2016 nov-5,2016 04/03/2016
14/12/2016 01/12/2016 19/12/2016 24-déc-16 Dec 29,2016 12/25/2016 dec-1-2016 14/04/2016
14/01/2017 01/01/2017 12/01/2017 24-janv-17 Jan 12,2017 01/10/2017 jan-5-2017 22/05/2016
14/02/2017 01/02/2017 23/02/2017 24-févr-17 Feb 15,2017 1/18/2017 feb-28-2017 24/06/2016

Also to make your examples reproductible could you provide me the results of:

Sys.getlocale("LC_TIME")
Sys.getenv("LC_TIME")

I can start to answer by:

  • Regarding the mixing of separator (aka not always the same separator in the column), it is not a case I implemented because I haven't met this example. Could you tell me where you encountered this types of problems?
  • Same remark on date6: you have some times 0 and some other times not 0, it is not an harmonious format, why would it happen?
  • Regarding the different types of separator in a data (aka: Jan-15, 2016) it is not a case I handled yet, but it would be a nice to have. I implemented it in v0.3.3 cf PR #32.

Sys.getlocale("LC_TIME")
[1] "English_United States.1252"

Sys.getenv("LC_TIME")
[1] ""

• Regarding the mixing of separator (aka not always the same separator in the column), it is not a case I implemented because I haven't met this example. Could you tell me where you encountered this types of problems?
Column8 is the only column where both separators / and - are used. E.g. 2/1/2016 and 17-03-2016.

• Same remark on date6: you have some times 0 and some other times not 0, it is not an harmonious format, why would it happen?
In my csv file Column 1 is considering all double digits even if we pass single digits but column 6 will consider single as single and double as double. So here in my system format is harmonious.

attached is the screen shot of how it is in my system
date_screenshot

Ok, it seems that I encounter some issues reading your .xlsx

Could you provide it as a csv file? (save as csv (macintosh)) in excel.

Thanks

github is not allowing to share csv file, can you share an email id where I can send. Thanks

Hi,

here it is:


Thanks,

So I kept investigeting:

  • Date1: Is detected
  • Date2: Is detected
  • Date3: Format is not harmonious: some time sthere is the 0 for month and some times not... In which case do you encounter that? (NB: I did had a piece of code to detect presence/absence of 0 for days and month)
  • Date4: Has an ambiguous format: It is not possible to guess if 24 is a day or a year... To solve this issue, either provide the format by hand, or provide more examples.
  • Date5: has a "vicious" white space here: "Jul 23,2016 ". I had a piece of code to supress leading and tailing white spaces: will be detected in V0.3.3
  • Date6: Is detected
  • Date7: Format is not harmonious: at the beginning format is Month-day,year at the end it is Month-day-year
  • Date8: Format is not harmonious: it is sometimes day/month/year and sometimes day-month-year. You should consider changoing all "/" into "-" before performing any casting algorithm.

Concerning not harmonious format, to you have examples of when it would happen?

Hope I answered your questions.

Thanks.

  • Date 3 -Strangely the excel in our place is misbehaving and converting few months to single and retaining others as double digit. I agree this format is very unusual
  • Date 4 - I agree, we knew it to be dd-Mon-YY and it would be tough for the function to know
  • Date 5 - Thanks for taking care of leading and trailing whitespaces
  • Date 7 - Only concern there was lower and upper case which we tried on 0.3.2 and corrected the format and made it harmonious like Month-day,year but have month in lower case. 0.3.2 doesnt detect as date. Pls confirm
  • Date 8 - I agree, we will change to uniform separators before running the function

Concerning date7: lower case is handled in V0.3.2.
But there is also a mix in separators: at the begining MMM-dd,YYYY and at the end MMM-dd-YYYY

Anyway, thanks for those remarks, it helped to handle new cases:

  • Lower case
  • Missing 0
  • Different format in same dates
  • Leading and trailing white spaces