SchulzLab/ORNA

Information on logarithm base value

minor7b5 opened this issue · 1 comments

Could I please ask for a bit more information on the function of the base value - what it does and what impact higher or lower values will have, and correspondingly how I should decide a setting based on the structure of my dataset?

For extra information, my RNAseq dataset comprises multiple tissues from various heterozygous individuals. I intend to conduct a multiple-k approach - should I run ORNA several times with different values of k on the dataset to generate a set of normalised reads for each respective assembly run, or should a single normalisation using the smallest k suffice?

Best wishes,
Reza

Dear Reza,
The base parameter sets the logarithm base that is used to determine the minimum number of times a k-mer must be retained in the reduced dataset. For example, consider a k-mer that occurs in 8 reads. With base=2, log_2(8)=3, so at least 3 of those reads must be kept. With base=10, log_10(8)≈0.9, so at least 1 read must be kept, because every value lower than 1 is set to 1.
Thus, the higher the base, the stronger the reduction.

If you have a very large dataset, as it sounds like you do, you can easily go higher with the base parameter. For example, with base=3, a k-mer that occurs in 1000 reads would be retained in at least 7 of them (log_3(1000)≈6.3, rounded up).
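
To make the arithmetic above concrete, here is a minimal Python sketch of the threshold calculation. It is not taken from the ORNA source code: the function name `min_reads_to_keep` is hypothetical, and the rounding-up behaviour is an assumption inferred from the three worked examples in this reply.

```python
def min_reads_to_keep(abundance: int, base: float) -> int:
    """Smallest integer n with base**n >= abundance, floored at 1.

    Hypothetical helper: this equals ceil(log_base(abundance)), with
    every value lower than 1 set to 1. The rounding rule is an
    assumption inferred from the examples, not read from ORNA itself.
    """
    if abundance < 1:
        raise ValueError("abundance must be at least 1")
    if base <= 1:
        raise ValueError("base must be greater than 1")

    threshold = 0
    power = 1.0
    # Multiply up instead of calling a log function, which avoids
    # floating-point surprises when abundance is an exact power of base.
    while power < abundance:
        power *= base
        threshold += 1
    return max(1, threshold)


if __name__ == "__main__":
    for abundance, base in [(8, 2), (8, 10), (1000, 3)]:
        print(abundance, base, min_reads_to_keep(abundance, base))
```

Running it for a few candidate base values against the typical k-mer abundances in your data gives a rough feel for how aggressive a given setting would be.
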
Concerning multi-k assemblies, I would recommend using the same normalised data for all of them. I guess your thinking is that redoing the normalisation with the k value used for each assembly would ensure that k-mer connectivity is preserved for each of them. But it is also true that larger k values lead to less reduction for the same base parameter: the higher the value of k, the more unique k-mers there are in the dataset, and thus the more reads are preserved. To speed things up, I would stick with the smallest k-mer value used in the multi-k assembly, assuming of course that this is a reasonable value for your data.
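
The effect of k on the number of unique k-mers can be illustrated with a small Python sketch. The reads are made-up toy sequences, the function name `count_distinct_kmers` is hypothetical, and reverse complements are ignored for simplicity, so treat this as an illustration of the trend rather than a description of how ORNA counts k-mers.

```python
def count_distinct_kmers(reads, k):
    """Count the distinct substrings of length k across all reads."""
    kmers = set()
    for read in reads:
        # Slide a window of width k along the read.
        for start in range(len(read) - k + 1):
            kmers.add(read[start:start + k])
    return len(kmers)


# Toy reads, invented for illustration only.
reads = ["ACGTTGCAAC", "GTTGCAACGG", "ACGTAGCAAC"]

if __name__ == "__main__":
    for k in (3, 5):
        print(k, count_distinct_kmers(reads, k))
```

On these toy reads the larger k gives more distinct k-mers, because a single base difference between reads changes more of the overlapping windows. With reads this short the effect is small, and on real data you would want to check it with a proper k-mer counter.
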

Hope that helps,
Marcel