elsasserlab/minute

Make tsv parsing robust to blank spaces

cnluzon opened this issue · 3 comments

Some code editors silently replace tabs with spaces and then running the pipeline crashes due to parsing. I am not sure of the possible ramifications of just replacing line.strip().split("\t") with line.strip().split() in read_tsv function. Is there some scenario where this would be an undesired behavior?

tsv stands for tab-separated values, so it might be misleading to include spaces, but I'm not sure if other tools or parsing platforms are flexible with this or not. On the other hand I would expect that "standard" tsv don't include comments either.

I think this makes sense. I also had the problem of editing the tsv files in a text editor. Allowing spaces as separators would make this easier.

Some downsides of using split() over split("\t") are:

  • Values can no longer be the empty string
  • Values can no longer contain spaces

If I see correctly, the only value which we allow to be empty at the moment is the barcode, but there we use a dot (.) instead, so it’s not a problem. And values with spaces in them (such as sample names) probably don’t work properly anyway at the moment because they end up in file names.

tsv stands for tab-separated values, so it might be misleading to include spaces

I think the files should no longer be named .tsv. Perhaps .tab is generic enough?

Yes, I only could think of those two possibilities and I'd rather not have spaces in names or filenames.

I am not very inclined to change extension of the input files, at least not yet, if there is the possibility of further re-structuring later on. Just to minimize the amount of changes in the input files for current users in my lab, since this is more of a semantics than a functionality change.

On the other hand, it could be a good time to do it, since now max_barcode_errors is a field in config.yaml, so previous configurations would crash with the latest version of the pipeline, anyway.

We could have some backwards compatibility code that attempts to .tab first and then falls back to .tsv.

so previous configurations would crash with the latest version of the pipeline, anyway.

Also here it would be nice to be able to read configuration files without that setting and just use a default value if the setting is missing.