JanMarvin/readspss

FR: allow `convert.times = TRUE` for `read.sav`

Closed this issue · 10 comments

Currently, the option convert.dates = TRUE drops the time part of what should be date-time data (with haven::read_sav I get POSIXct). I see that varmat = 22 for this variable in my dataset.

Hi @iago-pssjd, thanks for the report. A follow-up question:

  1. Do you mean vectors that contain only hours, minutes and seconds (hh:mm:ss), without day, month and year? Those are not converted because R does not provide a default vector type for this. There is the hms package, which you can use for manual conversion, but it relies on tidyverse dependencies and therefore isn't used in this package.
  2. Or do you have datetime vectors (yy-mm-dd hh:mm:ss) that are converted to dates? I'm not sure what exactly you mean. Maybe you could add a screenshot of what you are looking for?

Hi @JanMarvin , thanks for the answer.

I mean the second option, datetime vectors. Indeed, the str of the data read with haven is

POSIXct[1:648], format: "2021-06-10 13:18:47" "2021-06-11 11:51:03" "2021-06-11 16:47:28" "2021-06-11 21:03:23" "2021-06-12 12:10:05" ...

while for read.sav with convert.dates = TRUE it is

Date[1:648], format: "2021-06-10" "2021-06-11" "2021-06-11" "2021-06-11" "2021-06-12" "2021-06-13" "2021-06-14" "2021-06-14" ...

and with convert.dates = FALSE it is

num [1:648] 1.38e+10 1.38e+10 1.38e+10 1.38e+10 1.38e+10 ...
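For reference, a minimal sketch of how such a raw value maps to a date-time, assuming (as the origin used in readsav.R suggests) that the numbers are seconds since 1582-10-14. The helper name `spss_to_posixct`, the sample value and the choice of UTC are illustrative assumptions, not part of readspss:

```r
# Hypothetical helper: interpret raw SPSS seconds as POSIXct.
# Assumes seconds since 1582-10-14 and treats the result as UTC.
spss_to_posixct <- function(x, tz = "UTC") {
  as.POSIXct(round(x), origin = "1582-10-14", tz = tz)
}

# Illustrative raw value of the same magnitude as the 1.38e+10 shown above
x <- 13842710327

dt <- spss_to_posixct(x)
format(dt, "%Y-%m-%d %H:%M:%S")  # date and time are both kept
as.Date(dt)                      # wrapping in as.Date() drops the time
```

Wrapping the result in as.Date() is what discards the time of day.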

Oh sure, that should be added. It should be handled by convert.dates as well. I will have to look into which varmat values indicate datetime. I thought that it was already available, but maybe I was lazy because I never had to deal with datetimes until a few years ago (those were the days ...).

If you want to, please feel free to open a pull request.

FYI the relevant code is here:

readspss/R/readsav.R

Lines 300 to 320 in 17a9244

if (convert.dates) {
  nams <- names(data)
  isdate <- varmat[, 6] %in% c(20 , 22, 23, 24, 38, 39)
  istime <- varmat[, 6] %in% c(21, 25)
  if (any(isdate)) {
    for (nam in nams[isdate]) {
      data[[nam]] <- as.Date(as.POSIXct(
        round(data[[nam]]), origin = "1582-10-14"))
    }
  }
  if (any(istime)) {
    message("time format found for", nams[istime],
            "This is a 24 time and no date and thus not converted.")
    # for (nam in nams[istime]) {
    #   data[[nam]] <- as.POSIXlt(data[[nam]], origin="1582-10-14")
    # }
  }
}

Some or all of these varmat values can be datetime. If it's impossible to verify which type contains a time, additional checks could be added, e.g. whether the variable is of type integer (date) or numeric (datetime). Alternatively, additional options could be provided to always create datetime or date.
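One possible form of such an additional check, sketched under the assumption that date-only values are stored as whole multiples of 86400 seconds (midnight). The function names are hypothetical, and a datetime column whose values all happen to fall on midnight would be misclassified as a date:

```r
# Hypothetical heuristic: a column has a time component if any non-missing
# value is not a whole number of days (86400 seconds).
has_time_component <- function(x) {
  x <- round(x[!is.na(x)])
  length(x) > 0 && any(x %% 86400 != 0)
}

# Convert to POSIXct when a time component is present, otherwise to Date.
convert_spss_temporal <- function(x, tz = "UTC") {
  out <- as.POSIXct(round(x), origin = "1582-10-14", tz = tz)
  if (has_time_component(x)) out else as.Date(out)
}

d  <- convert_spss_temporal(c(13842662400, NA))  # midnight only
dt <- convert_spss_temporal(c(13842710327, NA))  # has hh:mm:ss
class(d)
class(dt)
```

An explicit argument that forces one type or the other would be more predictable than guessing from the data.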

Thanks! @JanMarvin .
Actually, I do not understand the meaning of the possible varmat[, 6] values. If I'm not mistaken, it seems that you take them from

unk43 = readbin(unk43, sav, swapit);

varmat(0,5) = unk43; // format type

and then from

readspss/src/spss.h

Lines 23 to 32 in 17a9244

template <typename T>
T readbin( T t , std::istream& sav, bool swapit)
{
  if (!sav.read ((char*)&t, sizeof(t)))
    Rcpp::stop("readbin: a binary read error occurred");
  if (swapit==0)
    return(t);
  else
    return(swap_endian(t));
}

but I cannot understand the values returned by swap_endian.
Beyond that, regarding

If it's impossible to verify which type contains time

how could other checks be done?

Hi @iago-pssjd, no need to look at the Rcpp code (the snippet you've picked simply reads the value from the sav binary and converts it to your system's endianness if required). We have to identify which of these varmat[, 6] %in% c(20 , 22, 23, 24, 38, 39) are datetime formats. SPSS stores a format code with each variable, and that code indicates how SPSS itself displays the values.
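To illustrate the endianness point in R terms, here is a small sketch using base R's readBin on a raw vector. The four bytes are made up for illustration and are not taken from a real sav file:

```r
# Four illustrative bytes encoding the integer 22 in big-endian order
bytes <- as.raw(c(0x00, 0x00, 0x00, 0x16))

# Read with the matching byte order: the intended value
as_big <- readBin(bytes, what = "integer", size = 4, endian = "big")

# Read with the wrong byte order: same bytes, different number
as_little <- readBin(bytes, what = "integer", size = 4, endian = "little")

as_big
as_little
```

Swapping the byte order only changes how the same bytes are interpreted; it does not add any meaning to the format code itself.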

How I would approach this: maybe start with some initial assumption. Maybe varmat 20 is just the year, 22 is year + month, 23 is year and abbreviated month, 24 is year month day, and 3x could be datetime. That's just something we have to try out. You mentioned that 22 looks like datetime, which is a first step. Ideally we would have a file with all the different formats and could open it in SPSS/PSPP to compare their dates to ours. SPSS access is a bit of a problem for me. I'd have to ask some university guys if they can run a sample file for me, but PSPP is available. Maybe it's already documented in the PSPP docs.
I guess one other issue was that SPSS had a datetime variable with a ymd format.

PSPP documentation has the following: Variable-Record

20 DATE
21 TIME
22 DATETIME
23 ADATE
24 JDATE
25 DTIME
26 WKDAY
27 MONTH
28 MOYR
29 QYR
30 WKYR
31 PCT
32 DOT
33 CCA
34 CCB
35 CCC
36 CCD
37 CCE
38 EDATE
39 SDATE
40 MTIME
41 YMDHMS

Hi!,

I have the same issue with SPSS. I should actually have SPSS at my workplace, but I couldn't get it to work with the last dataset I had to analyse. I will check these values with examples ASAP.

Thanks!

I came up with the sps file below using this as reference: https://libguides.library.kent.edu/SPSS/DatesTime

data list list / 
    d1 (DATE9) 
    d2 (DATE11) 
    a1 (ADATE8) 
    a2 (ADATE10)
    e1 (EDATE8)
    e2 (EDATE10)
    j1 (JDATE5)
    j2 (JDATE7)
    s1 (SDATE8)
    s2 (SDATE10)
    q1 (QYR6)
    q2 (QYR8)
    m1 (MOYR6)
    m2 (MOYR8)
    w1 (WKYR8)
    w2 (WKYR10)
    dt1 (DATETIME17)
    dt2 (DATETIME20)
    dt3 (DATETIME23.2)
    y1 (YMDHMS16)
    y2 (YMDHMS19)
    y3 (YMDHMS19) /* 19.2 .
    w3 (WKDAY3)
    w4 (WKDAY9)
    m3 (MONTH3)
    m4 (MONTH9).

begin data.
"31-JAN-13",  "31-JAN-2013", "01/31/13", "01/31/2013", "31.01.13", "31.01.2013", "13031", "2013031", "13/01/31", "2013/01/31", "1 Q 13", "1 Q 2013", "JAN 13", "JAN 2013", "5 WK 13", "5 WK 2013", "31-JAN-2013 01:02", "31-JAN-2013 01:02:33", "31-JAN-2013 01:02:33.72", "2013-01-31 1:02", "2013-01-31 1:02:33", "2013-01-31 1:02:33.72", "THU", "THURSDAY", "JAN", "JANUARY"
end data. 

save outfile = "/tmp/datetimes.sav" .

I've just seen your comment and PR. Great!
Thank you!