Grep output doesn't match what's in the lesson in a way that breaks the example
JCSzamosi opened this issue · 7 comments
I'm trying to run this lesson with the data files downloaded from FigShare. In the Redirection lesson, the output of
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
wc -l bad_reads.txt
returns 537
rather than the expected 802.
This is a problem because 537 is not a multiple of 4. This is happening because some of the reads with the string NNNNNNNNNN are non-contiguous in the file, so grep is inserting a -- line between groups of contiguous results. I think the lesson as written will mislead learners about how they can use grep, since it doesn't mention this behaviour (which I have replicated on multiple machines, so it's not just a quirk of one system).
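A minimal sketch of the behaviour described above, assuming GNU grep. The three-record file sample.fastq is made up for illustration and is not the lesson data; the first and third records contain NNNNNNNNNN, so the two matched groups are non-contiguous.

```shell
# Hypothetical three-record FASTQ file (not the lesson data).
workdir=$(mktemp -d)
fastq="$workdir/sample.fastq"
printf '%s\n' \
  '@read1' 'ACGTNNNNNNNNNNAC' '+' '!!!!!!!!!!!!!!!!' \
  '@read2' 'ACGTACGTACGTACGT' '+' 'IIIIIIIIIIIIIIII' \
  '@read3' 'NNNNNNNNNNACGTAC' '+' '!!!!!!!!!!!!!!!!' > "$fastq"

# Same options as in the lesson: 1 line before and 2 lines after each match.
grep -B1 -A2 NNNNNNNNNN "$fastq" > "$workdir/bad_reads.txt"

# Count all lines, and separately the lines that are exactly "--".
total=$(wc -l < "$workdir/bad_reads.txt")
seps=$(grep -c -e '^--$' "$workdir/bad_reads.txt")
echo "lines: $total, separator lines: $seps"
```

With GNU grep this should report 9 lines, one of which is the -- separator between the two groups: two 4-line records plus the separator. A grep that does not emit the separator would report 8.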
Has anyone encountered this problem? What do you do about it?
I have identified that grep behaves in the expected way (no -- separator between results) on Mac OS. The problem is on Linux (and therefore probably WSL as well). So the lesson would work as long as everyone was on a Mac.
Further update: On the latest version of MacOS, the -- separator is inserted. On Debian Linux, there is a --no-group-separator flag for grep which removes it, but that flag does not exist on MacOS. This part of the lesson therefore no longer works on Mac OS, but can be made to work on Linux (and WSL if they have a recent Debian distro). I don't know about git-bash or other Windows options.
Okay, I was misreading the lesson. We don't need the number of lines in bad_reads.txt to be divisible by 4. But I'm still getting 537 instead of 802. And I don't really feel that asking novices to pipe grep to grep -v is reasonable.
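For reference, the two workarounds discussed in this thread can be sketched as follows. This is an illustration on a made-up file (sample.fastq is hypothetical, not the lesson data), and the exact pattern used in the lesson may differ. The first form pipes into an inverted grep and should work with any grep; the second relies on --no-group-separator, which is a GNU grep option and is guarded here in case the local grep rejects it.

```shell
# Hypothetical three-record FASTQ file (not the lesson data).
workdir=$(mktemp -d)
fastq="$workdir/sample.fastq"
printf '%s\n' \
  '@read1' 'ACGTNNNNNNNNNNAC' '+' '!!!!!!!!!!!!!!!!' \
  '@read2' 'ACGTACGTACGTACGT' '+' 'IIIIIIIIIIIIIIII' \
  '@read3' 'NNNNNNNNNNACGTAC' '+' '!!!!!!!!!!!!!!!!' > "$fastq"

# Portable workaround: drop lines that consist only of "--".
# Caveat: a record line that is exactly "--" would also be dropped.
grep -B1 -A2 NNNNNNNNNN "$fastq" | grep -v -e '^--$' > "$workdir/bad_reads.txt"
filtered=$(wc -l < "$workdir/bad_reads.txt")
echo "after grep -v: $filtered lines"

# GNU grep only: suppress the separator at the source.
if grep --no-group-separator -B1 -A2 NNNNNNNNNN "$fastq" \
     > "$workdir/bad_reads_gnu.txt" 2>/dev/null; then
  echo "--no-group-separator: $(wc -l < "$workdir/bad_reads_gnu.txt") lines"
else
  echo "this grep does not accept --no-group-separator"
fi
```

Where both forms run, they should leave the same 8 lines, that is, two complete 4-line records.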
I have encountered the same error. I am running on WSL2, and when I run the commands:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
wc bad_reads.txt
I get the output 537 1073 23217 bad_reads.txt
as opposed to the expected 802 1338 24012 bad_reads.txt
I checked that the data file I downloaded from Figshare was uploaded in 2019 and this is the same file used to create the lesson (Sept. 2020). Since the file has not been modified, we should expect the same output.
This needs to be corrected in the lesson
I also saw 537 vs 802 when I was running this on the Amazon instance. I do think the double grep with an inverted second grep is a bit hard for novices to understand, especially without seeing an inverted grep first. There does seem to be an option, --no-group-separator, but it isn't available on all operating systems. If this option is available on the instance, could we change to using it but leave a callout saying that if it isn't available on your system you can use the double grep instead?
If this leaves them without enough practice with the pipe, then we could add more practice to the data wrangling section. When I recently taught data wrangling, I showed learners my most common pipe combo, where I pipe ls into wc to check the number of input and output files.
Edit: Just checked and the Amazon Web Instance does have the --no-group-separator option.
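The ls into wc combo mentioned above might look like the following sketch. The directory and file names are placeholders created only for this example; in practice the directory would be wherever the input or output files live.

```shell
# Hypothetical directory with three placeholder files.
workdir=$(mktemp -d)
touch "$workdir/a.fastq" "$workdir/b.fastq" "$workdir/c.fastq"

# ls prints one name per line when its output is piped, so wc -l counts the files.
n_files=$(ls "$workdir" | wc -l)
echo "files: $n_files"
```

Running the same pipe on the input and output directories gives a quick check that every input file produced an output file.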
So the actual number of bad reads is 536 and not 802?