NaN in multiqc json files.
mohammedmsk opened this issue · 12 comments
Hello Michael,
Thanks for making this R package.
In my multiqc json files, some samples have NaN for some results. For example, a sample with very little STAR mapping was not carried forward in the analysis, so it has no rseqc metrics.
Currently the function cannot handle such NaNs, but it would be great if it could. Ideally these NaNs should be converted to NA in R, i.e. missing values.
Thanks,
-Mohammed.
Hi, it sounds like this is a duplicate of #6. NaN is not a valid JSON token, so it must be a bug in the STAR module for multiqc. If you can post an example of an input file for multiqc that triggers this behaviour then I'm happy to file it as an issue under the multiqc repo (or you can, if you'd like).
But if you want a quick fix, the sed command mentioned in that issue should work.
Hello,
So I tried the sed command and it still gives an error:
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid char in json text.
"mapped_failed_pct": NA, "paired in
(right here) ------^
I am attaching an original json file to replicate the error:
multiqc_data.zip
The multiqc report comes from nf-core nextflow pipeline. For now if I sed NaNs to 0 (Zero) then it works without any errors.
Perhaps even better than that would be sed -i 's/NaN/null/g' multiqc_data.json, which will set them to NULL, which won't get mixed up with a true 0 value.
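One caveat with a plain s/NaN/null/g is that it also rewrites any string that happens to contain "NaN", such as a sample name. A minimal sketch of a narrower substitution follows, assuming a bare NaN value always comes after ':', ',' or '[', or sits at the start of a line in pretty-printed JSON. The file contents and the sample name NaNo_1 are made up for illustration, and a string value that itself contains ": NaN" would still be touched.

```shell
# Hypothetical input, for illustration only: one string containing "NaN"
# and two bare NaN values (one inline, one on its own line).
tmp=$(mktemp)
printf '%s\n' '{"sample": "NaNo_1", "pct": NaN, "data": [' '  1.0,' '  NaN' ']}' > "$tmp"

# Replace NaN only where a JSON value can start: after ':', ',' or '[',
# or at the start of a line (ignoring whitespace). -E enables extended regex.
fixed=$(sed -E 's/(^|[[,:])([[:space:]]*)NaN/\1\2null/g' "$tmp")

echo "$fixed"
rm -f "$tmp"
```

In this sketch the sample name is left alone while both bare NaN values become null.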
Love what you've set up here, Michael. The tidy functions (specifically those couple {purrr} calls) have helped us immensely.
Only reason I'm not using {TidyMultiqc} directly as a dependency in https://github.com/umccr/dracarys/blob/6ec6be/R/multiqc.R is due to this NaN issue (which of course has nothing to do with your pkg).
In case it helps others who are looking for a pure R workaround, in the code linked above I use {RJSONIO} (as suggested in https://stackoverflow.com/questions/31955051/handling-nan-when-using-fromjson-in-r) instead of {jsonlite} to import the JSON, then call your processors in a somewhat modified manner (more specifically, handling outputs from our own workflows, handling annoying 'NA's etc.).
e.g. for a SnpEff issue we've been getting with the following element:
{
"name": "Non Canonical Start Codon",
"data": [
1.0,
NaN
]
}
we get the following by using {RJSONIO}:
> j <- "multiqc_data_w_NaN.json"
> p1 <- jsonlite::read_json(j) # errors out...
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid char in json text.
NaN ]
(right here) ------^
> p2 <- RJSONIO::fromJSON(j) # passes
> p2$report_plot_data$snpeff_effects$datasets[[1]][[30]]
$name
[1] "Non Canonical Start Codon"
$data
$data[[1]]
[1] 1
$data[[2]]
NULL
(We actually don't even use the report_plot_data downstream, btw).
Hope that helps, and thanks again for this pkg.
Thank you for TidyMultiqc. I am running into a similar error, and the sed -i 's/NaN/null/g' multiqc_data.json command does not seem to work.
When I run it on all *.json files with sed -i 's/NaN/null/g' *.json, I get the following error:
sed: 1: "multiqc_data_bamstats_b ...": invalid command code m
@pdiakumis thanks for your insight. To be honest if RJSONIO parses this acceptably then I'm okay to start using that instead. I feel like this is the biggest issue with the package at the moment and I'll happily accept any decent way to fix it, especially if it involves an existing, well supported package.
@loukesio are you on Mac? I wonder if this issue relates to the weird behaviour of Mac sed. If yes, can you try brew install gnu-sed and then use gsed instead?
- Re: RJSONIO, we've probably used it on > 2,000 JSON outputs now, without an issue. jsonlite is a more modern version of the former, with Jeroen basically rewriting it from scratch (I believe it used to be a fork of RJSONIO; there was a blog post about it I read some years ago). This specific NaN issue, well, it isn't even valid JSON so we can't really complain ;-)
  There will probably be differences in the structure of the parsed JSON, though I cannot tell how much TidyMultiqc's functions will be affected by this. NA/NaN etc. handling dominate the issues on those repos, and you'll probably get similar issues with RJSONIO given the number of different modules used in MultiQC. It's up to you, really. Happy to help any way I can with testing if you decide to go down that path.
- Re: sed, I can confirm that I get that same error when using the sed command on my problematic JSON (on a Mac), and that gsed via brew works fine (specifically gsed -i 's/NaN/null/g' multiqc_data.json).
Right, I vaguely recall choosing jsonlite for those kinds of reasons, but maybe RJSONIO could be a nice stop gap solution until this is fixed in multiqc (which I have a feeling might take a while). It's quite reassuring that you've had success with that much data, honestly. Many thanks for verifying the sed workaround as well.
Indeed it worked when I used gsed. Thank you a lot.
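The "invalid command code m" error is consistent with how BSD/macOS sed treats -i: it expects a backup suffix as the next argument, so 's/NaN/null/g' is taken as the suffix and the first file name (starting with "m") is parsed as the sed script. For anyone who would rather not install gsed, a minimal sketch of a form that both GNU sed and BSD sed accept is shown below; it attaches the suffix directly to -i. The temporary directory and file contents are made up for illustration.

```shell
# Hypothetical file, for illustration only.
workdir=$(mktemp -d)
json="$workdir/multiqc_data.json"
printf '%s\n' '{"mapped_failed_pct": NaN}' > "$json"

# The backup suffix is attached directly to -i (no space), which both
# GNU sed and BSD/macOS sed accept. The original is kept as "$json.bak".
sed -i.bak 's/NaN/null/g' "$json"

result=$(cat "$json")       # edited file
backup=$(cat "$json.bak")   # untouched copy of the original
echo "$result"
rm -rf "$workdir"
```

The .bak copy can be deleted once the edited file has been checked.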
This seems to have been fixed upstream: MultiQC/MultiQC#2432. However, the fix won't land in MultiQC until MultiQC 1.22. Once it's released I think I will just encourage everyone to upgrade and use that version to ensure good compatibility with TidyMultiqc.
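To decide whether a workaround is still needed, the installed MultiQC version can be compared against 1.22. The following is a rough sketch only: version_ge is a hypothetical helper, it assumes the local sort supports -V (version sort), and the version string is hard-coded here rather than read from multiqc --version.

```shell
# version_ge A B: succeeds if version A >= version B.
# Assumes `sort -V` (version sort) is available on this system.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical version string; in practice it could be taken from
# the output of `multiqc --version`.
installed="1.21"

if version_ge "$installed" "1.22"; then
  status="ok"        # the upstream NaN fix should be included
else
  status="upgrade"   # older release: keep using a NaN workaround
fi
echo "$status"
```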