onetrueawk/awk

Unicode separated values (USV) don't work

bakul opened this issue · 7 comments

bakul commented

From https://github.com/sixarm/usv: in USV,
fields are separated by ␟ = U+241F = Symbol for Unit Separator, and
records are separated by ␞ = U+241E = Symbol for Record Separator.
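
For anyone who wants to check exactly what those two separators are, here is a minimal Python sketch that prints each one's code point and its UTF-8 encoding. The names `UNIT_SEP` and `RECORD_SEP` are mine, not from the USV project, and the sketch assumes the file is UTF-8 encoded.

```python
# Illustrative names only; not part of any USV library.
UNIT_SEP = "\u241f"    # ␟ SYMBOL FOR UNIT SEPARATOR (field separator in USV)
RECORD_SEP = "\u241e"  # ␞ SYMBOL FOR RECORD SEPARATOR (record separator in USV)

for role, ch in (("field", UNIT_SEP), ("record", RECORD_SEP)):
    # Each separator is one code point, but three bytes once encoded as UTF-8.
    print(role, f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" "))
```

Each separator is a single character but a three-byte sequence in UTF-8, so FS and RS here are multibyte values rather than single bytes.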

From https://news.ycombinator.com/item?id=31360327

  $ cat t.usv && echo
  id␟name␟age␞1␟Bob "Billy" Smith␟42␞2␟Jane
  Brown␟37
  $ goawk -F␟ -vRS=␞ -vOFS=, '{ print $1, $2, $3 }' t.usv
  id,name,age
  1,Bob "Billy" Smith,42
  2,Jane
  Brown,37

This works in goawk, gawk, and mawk, but not in awk. The USV separator characters are kind of hard to see.
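
To spell out the expected behaviour without relying on any particular awk, here is a rough Python sketch of what the command above is asked to do: split on ␞ into records, split each record on ␟ into fields, and join the first three fields with a comma. The helper `usv_to_lines` is hypothetical and only approximates awk's semantics for this one command. It is not how any awk implements it.

```python
# Hypothetical helper that mirrors:
#   awk -F␟ -vRS=␞ -vOFS=, '{ print $1, $2, $3 }'
def usv_to_lines(text, fs="\u241f", rs="\u241e", ofs=","):
    records = text.split(rs)
    # Like awk, do not produce an extra empty record when input ends with RS.
    if records and records[-1] == "":
        records.pop()
    lines = []
    for record in records:
        fields = record.split(fs)
        # Fields past the end of the record are empty strings, as in awk.
        fields += [""] * (3 - len(fields))
        lines.append(ofs.join(fields[:3]))
    return lines

# Same content as t.usv above, written with escapes so the separators are visible.
sample = (
    "id\u241fname\u241fage\u241e"
    '1\u241fBob "Billy" Smith\u241f42\u241e'
    "2\u241fJane\nBrown\u241f37"
)
print("\n".join(usv_to_lines(sample)))
```

On the sample data this prints the same four lines as the goawk run. The newline inside the last record's name field is kept because only ␞ ends a record.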

plan9 commented

indeed. USV is not supported in OTA.

bakul commented

Shouldn't the user be allowed to pick any regexp as a field separator and any char/string as a record separator? Now that awk is extended to Unicode, I don't see why the above shouldn't be possible.

I did some experimentation, and there's a general problem here with using Unicode characters as RS, and apparently as FS too. Something broke at some point, since I had done some (minimal) testing using Unicode as RS. The code is somewhat fragile, unfortunately. I am reopening this issue, but I don't know when it will be solved.

plan9 commented

this has been fixed - thank you @arnoldrobbins

@benhoyt Please update your forum post that this issue is fixed.

@arnoldrobbins Unfortunately one can't edit or even reply to HN comments after a certain amount of time, and for that forum post that time has elapsed. Thanks for the fix though!

bakul commented