WizardMac/ReadStat

SAS Transport writer format parsing bug

gorcha opened this issue · 2 comments

Hi!

Just noticed there's a bug in the format parser when writing SAS xport files.

xport_write_variables is using sscanf to pull the format name, width and decimals:

sscanf(variable->format, "%s%d.%d", name, &width, &decimals);

Because there are no spaces between the name and the other bits the whole string is being consumed as the format name by %s, and width and decimals are never set. In many cases this is still interpreted correctly from the name alone, but others have issues, for e.g.:

  • COMMA10.3 - longer than 8 chars, so is cut off when storing the format name
  • 10.3 - fails with an invalid name warning in SAS

It seems like a bit of a pain to parse - the format is <name><width>.<decimal>, but width and decimal and the period between them are all optional. Name follows the SAS naming rules, with the additional requirement that it can't end in a number.

Thanks for the report. I think a format string like "%[A-Z]%d.%d" might do the trick, but this will need testing.

Yeah that was my initial thought as well, but unfortunately it's a bit more irritating - the name can contain numbers as well so it probably needs a simple FSM.

The regex for the format string would be something silly like ($?[A-Za-z_]([A-Za-z_0-9]*[A-Za-z_])?)([0-9]*\.?)?([0-9]*)?

So numeric sequences need to take context into account. It seems easiest parsing from the end, so ignoring any validation of the format name itself something along these lines:

int num_seq_end;
for (int i = strlen(format) - 1; i >= 0; i--) {
  switch(format[i]) {
    case '.':
      if (state == DECIMAL) { /* i + 1 to num_seq_end as decimal value */ }
      state = PERIOD;
      break;
    case '0'...:
      if (state == START) { state = DECIMAL; num_seq_end = i; }
      if (state == PERIOD) { state = WIDTH; num_seq_end = i; }
      break;
    default:
      if (state == DECIMAL) { /* i + 1 to num_seq_end as decimal value */ }
      if (state == WIDTH) { /* i + 1 to num_seq_end  as width value */ }
      state = NAME;
  }

  if (state == NAME) {
    /* start to i characters as format name */
    break;
  }
}