jermp/fulgor

Consistent terminology

jermp opened this issue · 9 comments

Not urgent and not performance/correctness related, but rather stylistic: use consistent terminology between code and papers.

Examples: color_class --> color; the query colors() --> color(); k2u --> dictionary; u2c() --> unitig_id_to_color_id(); etc.

I find calling a color class just "color" confusing. It's also not in line with some of the most important papers on this topic. For example, the original colored DBG paper of Iqbal et al. [1] uses terminology where each sample is a distinct color, and each node is colored by a set of colors. I also use this convention in Themisto. If we'd like to work towards a unified colored DBG dump format, we should standardize the terminology as well. I would advocate using the convention of Iqbal. Thoughts?

[1] Iqbal, Zamin, et al. "De novo assembly and genotyping of variants using colored de Bruijn graphs." Nature genetics 44.2 (2012): 226-232.

Thanks for the feedback, Jarno. Yes, I agree and that's why I opened this issue. Shall we then go for "color" and "color sets" (the "color classes" being the distinct "color sets")?

That sounds perfect to me.

Other literature: Bifrost and VARI also use this convention. Metagraph avoids using the term color and prefers to call them annotations. Knut Reinert's lab likes to call them "user bins".

It does clash with your manuscript though (and also with Mantis), so up to you if you're willing to take that hit.

Well, to be fair, calling a document a "color" sounds strange from the very beginning :)

The most appropriate nomenclature stems from information retrieval (IR), where we speak of "documents" being indexed (hence "colors" -> "documents") and a unique integer identifier is assigned to each document (hence "color" -> "docid").
The lists of docids are called "posting lists" or "inverted lists".

That's right. But it also gets less clear once you start pooling multiple documents together into a single color, or coloring based on, say, individual marker k-mers. In that case, just calling them "annotations" is not so bad either. I think "color" is also fine since the colored DBG is a somewhat established concept already.

Yes, I agree that sticking to the most popular and widely adopted terminology pays off in the long term. The goal is to make people understand, after all.

Actually, we also find this inconsistency across papers extremely confusing. I think the intuitive interpretation of the word color is that we draw a de Bruijn graph and then color individual paths, like here:

[image: figure from https://arxiv.org/pdf/1505.02710]


I guess @kamimrcht, as a major unifier of the dBG nomenclature, might have a lot to say about this :).

Yup. Our next paper (CC @rob-p) will definitely use the standard nomenclature.

A small update: as of the latest commits (credits to: @Alessio-Campa), the entire Fulgor codebase is using the "standard" terminology ("colors" for num. indexed references; "color sets" for the sorted lists of reference identifiers or "colors").
This is also reflected in the extended RECOMB paper that we submitted on 30 June to JCB: https://jermp.github.io/assets/pdf/papers/JCB2024.pdf.
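
For readers skimming the thread, here is a minimal C++ sketch of how the agreed terms relate to each other. It is only an illustration: the names `color_t`, `color_set_t`, `color_index`, and `build` are hypothetical and are not Fulgor's actual API. A "color" is the integer id of one indexed reference, a "color set" is a sorted list of colors, and only the distinct color sets are stored, with each unitig pointing to one of them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical names for illustration only; not Fulgor's actual API.
using color_t = uint32_t;                  // a "color": id of one indexed reference
using color_set_t = std::vector<color_t>;  // a "color set": sorted, duplicate-free colors

struct color_index {
    std::vector<color_set_t> color_sets;        // the distinct color sets
    std::vector<uint32_t> unitig_to_color_set;  // unitig id -> color set id
};

// Build the index from one (possibly unsorted) list of colors per unitig.
// Identical color sets are stored once and shared by all unitigs that have them.
color_index build(std::vector<color_set_t> per_unitig) {
    color_index idx;
    std::map<color_set_t, uint32_t> seen;  // color set -> its id
    for (auto& cs : per_unitig) {
        std::sort(cs.begin(), cs.end());
        cs.erase(std::unique(cs.begin(), cs.end()), cs.end());
        auto next_id = static_cast<uint32_t>(idx.color_sets.size());
        auto [it, inserted] = seen.emplace(cs, next_id);
        if (inserted) idx.color_sets.push_back(cs);
        idx.unitig_to_color_set.push_back(it->second);
    }
    return idx;
}
```

With four unitigs whose colors are `{2, 0}`, `{1}`, `{0, 2}`, and `{1, 1}`, this sketch stores two distinct color sets, and unitigs 0 and 2 share the same color set id.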