SAS Transport writer format parsing bug
gorcha opened this issue · 2 comments
Hi!
Just noticed there's a bug in the format parser when writing SAS xport files.
xport_write_variables
is using sscanf
to pull the format name, width and decimals:
ReadStat/src/sas/readstat_xport_write.c
Line 114 in f8731f8
Because there are no spaces between the name and the other bits the whole string is being consumed as the format name by %s
, and width and decimals are never set. In many cases this is still interpreted correctly from the name alone, but others have issues, for e.g.:
- COMMA10.3 - longer than 8 chars, so is cut off when storing the format name
- 10.3 - fails with an invalid name warning in SAS
It seems like a bit of a pain to parse - the format is <name><width>.<decimal>
, but width and decimal and the period between them are all optional. Name follows the SAS naming rules, with the additional requirement that it can't end in a number.
Thanks for the report. I think a format string like "%[A-Z]%d.%d"
might do the trick, but this will need testing.
Yeah that was my initial thought as well, but unfortunately it's a bit more irritating - the name can contain numbers as well so it probably needs a simple FSM.
The regex for the format string would be something silly like ($?[A-Za-z_]([A-Za-z_0-9]*[A-Za-z_])?)([0-9]*\.?)?([0-9]*)?
So numeric sequences need to take context into account. It seems easiest parsing from the end, so ignoring any validation of the format name itself something along these lines:
int num_seq_end;
for (int i = strlen(format) - 1; i >= 0; i--) {
switch(format[i]) {
case '.':
if (state == DECIMAL) { /* i + 1 to num_seq_end as decimal value */ }
state = PERIOD;
break;
case '0'...:
if (state == START) { state = DECIMAL; num_seq_end = i; }
if (state == PERIOD) { state = WIDTH; num_seq_end = i; }
break;
default:
if (state == DECIMAL) { /* i + 1 to num_seq_end as decimal value */ }
if (state == WIDTH) { /* i + 1 to num_seq_end as width value */ }
state = NAME;
}
if (state == NAME) {
/* start to i characters as format name */
break;
}
}