Proposed contribution task for Outreachy applicants: Register NCBI assembly Dog10K_Boxer_Tasha
Closed this issue · 20 comments
Dog10K_Boxer_Tasha is a Dog assembly available at NCBI: https://www.ncbi.nlm.nih.gov/assembly/GCF_000002285.5/
Note that Dog10K_Boxer_Tasha is the assembly that canFam6, the latest UCSC genome for Dog, is based on. See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.
Also check out the "Genome Browser Gateway" page here. This is the main entrance to the "UCSC Genome Browser". Find Dog in the UCSC species tree on the left, click on it, then make sure to select the latest Dog Assembly (canFam6). This will display additional information about the canFam6 assembly. In particular, it will indicate what NCBI assembly this genome is based on. This information is in the Accession ID field, which is usually set to a GenBank (GCA_000*.*) or RefSeq (GCF_000*.*) accession number.
Note that many NCBI assemblies are already registered in the GenomeInfoDb package (223 as of October 2022!). The registered_NCBI_assemblies()
function in GenomeInfoDb returns the list of all the NCBI assemblies that are currently registered in the package. An important thing to be aware of is that getChromInfoFromNCBI()
still works on an unregistered assembly, but in "degraded" mode, that is:
- The name of the assembly is not recognized; only lookup by GenBank or RefSeq accession works.
- The returned circularity flags are not guaranteed to be accurate. This potential inaccuracy is communicated to the user by placing NAs instead of FALSEs in the circular column of the returned data.frame.
Registering an assembly fixes that. In other words, once an NCBI assembly is registered in GenomeInfoDb, getChromInfoFromNCBI()
will recognize its name and return accurate circularity flags.
See ?getChromInfoFromNCBI
(after loading GenomeInfoDb) for more information.
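A quick way to tell whether a result came back in degraded mode is to look for NAs in its circular column. The sketch below is only an illustration: looks_degraded() is a hypothetical helper (not part of GenomeInfoDb), and the two data frames are made-up stand-ins for what getChromInfoFromNCBI() returns.

```r
# Hypothetical helper, not a GenomeInfoDb function: TRUE if the 'circular'
# column contains at least one NA, which is how degraded mode shows up.
looks_degraded <- function(chrom_info) {
    stopifnot(is.data.frame(chrom_info),
              "circular" %in% colnames(chrom_info))
    anyNA(chrom_info$circular)
}

# Made-up stand-ins for getChromInfoFromNCBI() results.
unregistered <- data.frame(SequenceName=c("chr1", "MT"),
                           circular=c(NA, NA),
                           stringsAsFactors=FALSE)
registered <- data.frame(SequenceName=c("chr1", "MT"),
                         circular=c(FALSE, TRUE),
                         stringsAsFactors=FALSE)

looks_degraded(unregistered)  # TRUE
looks_degraded(registered)    # FALSE
```

With the package loaded and network access, the same helper could be applied to the result of getChromInfoFromNCBI("GCF_000002285.5").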
Registering a new NCBI assembly for an organism that is already supported is only a matter of editing the corresponding file in GenomeInfoDb/inst/registered/NCBI_assemblies/
.
IMPORTANT NOTES TO OUTREACHY APPLICANTS:
- Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
- Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
- To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
- Always test your changes before you commit them to your fork. This consists of installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build. R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
- Once your work is ready to be merged, please submit a PR (Pull Request).
- Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.
Hi @hpages, please I would like to be assigned this task but can't find the assign button. I have completed the preliminary task.
Hi @Simplecodez,
Were you able to install Linux on your machine? Do you have any question about the preliminary tasks? Don't hesitate to ask. You can ask me by email or in the #outreachy channel on the community-bioc Slack (don't ask questions about these Preliminary tasks here, in this issue, to stay on-topic).
Would you mind choosing the "Register NCBI assembly UCB_Xtro_10.0" issue instead? It's the same as this issue but for a different NCBI assembly. The reason I'm asking this is because another applicant is already working on the first group of tasks. See: https://github.com/Bioconductor/BSgenomeForge/wiki/List-of-contribution-tasks-for-the-Outreachy-application-period
Thanks,
H.
Good day sir @hpages, I don't mind. Please can you assign me to that task?
I am also done with the preliminary tasks; I was able to install Linux on my machine.
@hpages, please can you assign this project to me?
Done. There's currently very little information about how to register a new NCBI assembly, sorry. I'll need to improve this. In the meantime I expect that you'll have a lot of questions for me. I'm ready! 😉
@hpages sure! 😂.
One important link on the NCBI page for any assembly is the link to the "Full sequence report" on the right:
The "Full sequence report" is a tab-delimited file describing all the sequences in the assembly. This is the file that getChromInfoFromNCBI()
downloads and returns in a data frame. Note that because Dog10K_Boxer_Tasha
is not registered yet, you must pass a GenBank or RefSeq assembly accession to getChromInfoFromNCBI()
:
getChromInfoFromNCBI("Dog10K_Boxer_Tasha") # does not work at the moment
getChromInfoFromNCBI("GCA_000002285.4") # works (in degraded mode)
getChromInfoFromNCBI("GCF_000002285.5") # works (in degraded mode)
See ?getChromInfoFromNCBI
for more information.
We can register an NCBI assembly either with its GenBank or its RefSeq assembly accession, but not with both. So we need to choose. It's recommended to compare the two data frames returned by getChromInfoFromNCBI()
before we choose. Normally they are identical, but sometimes they are not (this is a rare situation):
- If they are identical, then choosing one or the other doesn't really matter. However, if a UCSC genome is based on this assembly (as is the case here), we should use whatever the Accession ID field says on the Genome Browser Gateway page for the UCSC genome.
- If they are not identical, then it's a more complicated situation. If this happens, we'll need to identify the differences and try to understand them. Then we'll be able to decide if they matter or not, and choose based on our assessment of the situation.
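The comparison described above can be sketched in base R. This is only an illustration under stated assumptions: differing_columns() is a hypothetical helper (not part of GenomeInfoDb), and the two data frames are made-up stand-ins for the results of calling getChromInfoFromNCBI() on the GenBank and RefSeq accessions.

```r
# Hypothetical helper: names of the columns that differ between two data
# frames with the same shape; character(0) if none differ.
differing_columns <- function(a, b) {
    stopifnot(is.data.frame(a), is.data.frame(b))
    if (!identical(dim(a), dim(b)) ||
        !identical(colnames(a), colnames(b)))
        stop("data frames have different shapes or column names")
    differs <- vapply(colnames(a),
                      function(col) !identical(a[[col]], b[[col]]),
                      logical(1))
    names(differs)[differs]
}

# Toy stand-ins for the two data frames (made-up values).
genbank <- data.frame(SequenceName=c("chr1", "chrM"),
                      SequenceLength=c(1000L, 16L),
                      circular=c(NA, NA),
                      stringsAsFactors=FALSE)
refseq <- genbank
refseq$SequenceName[2] <- "MT"

differing_columns(genbank, genbank)  # character(0)
differing_columns(genbank, refseq)   # "SequenceName"
```

An empty result corresponds to the "identical" case; any column names returned point to where the differences need to be understood before choosing an accession.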
Hi @hpages, as you said,
getChromInfoFromNCBI("GCA_000002285.4") # works (in degraded mode)
getChromInfoFromNCBI("GCF_000002285.5") # works (in degraded mode)
But after adding these lines of code to NCBI_assemblies/Canis_lupus_familiaris.R
    list(assembly="Dog10K_Boxer_Tasha",
         date="2020/10/06",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.5",  # canFam6
         circ_seqs="chrM")
I expected that getChromInfoFromNCBI("Dog10K_Boxer_Tasha")
would work also but it doesn't. Please what am I missing? How do I get it to download the information on the "Full sequence report" page?
P.S.: I didn't find any difference between the GenBank and RefSeq assembly accessions, so I used the Accession ID
Did you reinstall GenomeInfoDb after editing Canis_lupus_familiaris.R
in GenomeInfoDb/inst/registered/NCBI_assemblies/
?
Always reinstall the package and load it in a fresh R session to see the effects of your changes. In this particular case, before you even try getChromInfoFromNCBI()
, you should check that the data frame returned by registered_NCBI_assemblies()
has a new entry for Dog10K_Boxer_Tasha
. Check all the fields in the new entry: they should reflect what you've put in Canis_lupus_familiaris.R
for Dog10K_Boxer_Tasha
.
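The check described above can be sketched as follows. is_registered() is a hypothetical helper (not part of GenomeInfoDb), and the data frame is a made-up stand-in; it assumes the data frame returned by registered_NCBI_assemblies() has an assembly column.

```r
# Hypothetical helper: TRUE if 'assembly' appears in the 'assembly' column
# of a data frame shaped like the output of registered_NCBI_assemblies().
is_registered <- function(registered, assembly) {
    stopifnot(is.data.frame(registered),
              is.character(assembly), length(assembly) == 1L)
    assembly %in% registered$assembly
}

# Made-up stand-in with two assumed columns.
toy <- data.frame(assembly=c("CanFam3.1", "UU_Cfam_GSD_1.0"),
                  assembly_accession=c("GCF_000002285.3",
                                       "GCA_011100685.1"),
                  stringsAsFactors=FALSE)

is_registered(toy, "CanFam3.1")           # TRUE
is_registered(toy, "Dog10K_Boxer_Tasha")  # FALSE: entry not picked up

# In a fresh R session with the reinstalled package one would run:
# library(GenomeInfoDb)
# is_registered(registered_NCBI_assemblies("Canis lupus familiaris"),
#               "Dog10K_Boxer_Tasha")
```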
@hpages I reinstalled GenomeInfoDb after the edit and also loaded it in a new session, but registered_NCBI_assemblies() still didn't include Dog10K_Boxer_Tasha. I have done it a number of times and get the same result. I also tried the GenBank assembly accession; that didn't work either.
Where are you putting
    list(assembly="Dog10K_Boxer_Tasha",
         date="2020/10/06",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.5",  # canFam6
         circ_seqs="chrM")
exactly? This should be added to the ASSEMBLIES
list in Canis_lupus_familiaris.R
. Note that ASSEMBLIES
is a list of lists. Currently its length is 5. After you add the new entry for Dog10K_Boxer_Tasha
, it will have length 6.
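Because ASSEMBLIES is a list of lists, a quick structural check can catch a misplaced or malformed entry before reinstalling. The sketch below is hypothetical and not part of GenomeInfoDb; the set of required fields is an assumption inferred from the entries shown in this thread.

```r
# Hypothetical checker, not part of GenomeInfoDb. The set of required
# fields is an assumption inferred from the entries shown in this thread.
check_assembly_entries <- function(assemblies,
                                   required=c("assembly", "date",
                                              "assembly_accession",
                                              "circ_seqs")) {
    stopifnot(is.list(assemblies))
    problems <- character(0)
    for (i in seq_along(assemblies)) {
        entry <- assemblies[[i]]
        if (!is.list(entry)) {
            problems <- c(problems, sprintf("entry %d is not a list", i))
            next
        }
        absent <- setdiff(required, names(entry))
        if (length(absent) != 0L)
            problems <- c(problems,
                          sprintf("entry %d lacks: %s", i,
                                  paste(absent, collapse=", ")))
    }
    problems
}

good <- list(
    list(assembly="CanFam3.1", date="2011/11/02",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.3", circ_seqs="MT"),
    list(assembly="UU_Cfam_GSD_1.0", date="2020/03/10",
         extra_info=c(breed="German Shepherd"),
         assembly_accession="GCA_011100685.1", circ_seqs="chrM")
)
# Second element is a bare character vector, as happens when a new
# entry is written with c() instead of list().
bad <- list(good[[1]], c(assembly="Dog10K_Boxer_Tasha"))

check_assembly_entries(good)  # character(0)
check_assembly_entries(bad)   # "entry 2 is not a list"
```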
Yes, I added it in the Canis_lupus_familiaris.R
file
Here's the full content of the file
ORGANISM <- "Canis lupus familiaris"

### List of assemblies first by breed then by date.
### Yep, different genome assemblies can have the same name! (don't ask me why)
### Lookup by genome name will pick-up the first in the list.
ASSEMBLIES <- list(
    ## breed: boxer
    list(assembly="CanFam2.0",
         date="2005/07/12",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.1",
         circ_seqs=character(0)),
    list(assembly="CanFam2.0",
         date="2005/07/12",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.2",  # canFam2
         circ_seqs="MT"),
    list(assembly="CanFam3.1",
         date="2011/11/02",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.3",  # canFam3
         circ_seqs="MT"),
    list(assembly="UMICH_Zoey_3.1",
         date="2019/05/30",
         extra_info=c(breed="Great Dane"),
         assembly_accession="GCA_005444595.1",  # canFam5
         circ_seqs="chrM"),
    list(assembly="UU_Cfam_GSD_1.0",
         date="2020/03/10",
         extra_info=c(breed="German Shepherd"),
         assembly_accession="GCA_011100685.1",  # canFam4
         circ_seqs="chrM"),
    list(assembly="Dog10K_Boxer_Tasha",
         date="2020/10/06",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.5",  # canFam6
         circ_seqs="chrM")
)
I just copied what you show above in my own Canis_lupus_familiaris.R
file, reinstalled GenomeInfoDb, started a fresh R session, loaded GenomeInfoDb (with library(GenomeInfoDb)
), and I get:
> registered_NCBI_assemblies("Canis lupus familiaris")
organism assembly date extra_info
1 Canis lupus familiaris CanFam2.0 2005/07/12 breed:boxer
2 Canis lupus familiaris CanFam2.0 2005/07/12 breed:boxer
3 Canis lupus familiaris CanFam3.1 2011/11/02 breed:boxer
4 Canis lupus familiaris UMICH_Zoey_3.1 2019/05/30 breed:Great Dane
5 Canis lupus familiaris UU_Cfam_GSD_1.0 2020/03/10 breed:German Shepherd
6 Canis lupus familiaris Dog10K_Boxer_Tasha 2020/10/06 breed:boxer
assembly_accession circ_seqs
1 GCF_000002285.1
2 GCF_000002285.2 MT
3 GCF_000002285.3 MT
4 GCA_005444595.1 chrM
5 GCA_011100685.1 chrM
6 GCF_000002285.5 chrM
I don't understand why this doesn't work for you.
Can you commit and push your changes to your fork so I can look at this? Thanks
Then it must be from my end. I will keep trying it.
Here's my fork with the changes I made. https://github.com/Priceless-P/GenomeInfoDb/tree/Dog10K_Boxer_Tasha
Your fork works fine for me. Here is a transcript of what I did (I do everything in a terminal):
hpages@spectre:~/github/Priceless-P$ git clone https://github.com/Priceless-P/GenomeInfoDb.git
Cloning into 'GenomeInfoDb'...
remote: Enumerating objects: 3301, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 3301 (delta 22), reused 20 (delta 20), pack-reused 3271
Receiving objects: 100% (3301/3301), 84.67 MiB | 15.12 MiB/s, done.
Resolving deltas: 100% (2353/2353), done.
hpages@spectre:~/github/Priceless-P$ cd GenomeInfoDb
hpages@spectre:~/github/Priceless-P/GenomeInfoDb$ git checkout Dog10K_Boxer_Tasha
Branch 'Dog10K_Boxer_Tasha' set up to track remote branch 'Dog10K_Boxer_Tasha' from 'origin'.
Switched to a new branch 'Dog10K_Boxer_Tasha'
hpages@spectre:~/github/Priceless-P/GenomeInfoDb$ tail inst/registered/NCBI_assemblies/Canis_lupus_familiaris.R
extra_info=c(breed="German Shepherd"),
assembly_accession="GCA_011100685.1", # canFam4
circ_seqs="chrM"),
list(assembly="Dog10K_Boxer_Tasha",
date="2020/10/06",
extra_info=c(breed="boxer"),
assembly_accession="GCF_000002285.5", # canFam6
circ_seqs="chrM")
)
hpages@spectre:~/github/Priceless-P/GenomeInfoDb$ R CMD INSTALL .
* installing to library ‘/home/hpages/R/R-4.2.r82318/library’
* installing *source* package ‘GenomeInfoDb’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (GenomeInfoDb)
hpages@spectre:~/github/Priceless-P/GenomeInfoDb$ R
R version 4.2.0 Patched (2022-05-04 r82318) -- "Vigorous Calisthenics"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(GenomeInfoDb)
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: IRanges
> registered_NCBI_assemblies("Canis lupus familiaris")
organism assembly date extra_info
1 Canis lupus familiaris CanFam2.0 2005/07/12 breed:boxer
2 Canis lupus familiaris CanFam2.0 2005/07/12 breed:boxer
3 Canis lupus familiaris CanFam3.1 2011/11/02 breed:boxer
4 Canis lupus familiaris UMICH_Zoey_3.1 2019/05/30 breed:Great Dane
5 Canis lupus familiaris UU_Cfam_GSD_1.0 2020/03/10 breed:German Shepherd
6 Canis lupus familiaris Dog10K_Boxer_Tasha 2020/10/06 breed:boxer
assembly_accession circ_seqs
1 GCF_000002285.1
2 GCF_000002285.2 MT
3 GCF_000002285.3 MT
4 GCA_005444595.1 chrM
5 GCA_011100685.1 chrM
6 GCF_000002285.5 chrM
As you can see: no problem! Can you perform those exact commands in a terminal?
I noticed that circ_seqs should not be chrM, so I checked here and saw that it should be MT instead, so I corrected it.
Note that you can also see this by looking at the "Full sequence report" for Dog10K_Boxer_Tasha here. Mitochondrion is usually at the bottom of the file.
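A mismatch like chrM versus MT can also be caught mechanically by checking that every name in circ_seqs occurs among the assembly's sequence names. The sketch below is only an illustration: unknown_circ_seqs() is a hypothetical helper (not part of GenomeInfoDb) and the sequence names are made up.

```r
# Hypothetical helper: names listed in 'circ_seqs' that do not occur among
# the sequence names of the assembly; character(0) means all were found.
unknown_circ_seqs <- function(circ_seqs, sequence_names) {
    stopifnot(is.character(circ_seqs), is.character(sequence_names))
    setdiff(circ_seqs, sequence_names)
}

# Made-up sequence names standing in for those listed in a
# "Full sequence report" in which the mitochondrion is named "MT".
sequence_names <- c("chr1", "chr2", "chrX", "MT")

unknown_circ_seqs("chrM", sequence_names)  # "chrM"
unknown_circ_seqs("MT", sequence_names)    # character(0)
```

With real data, the sequence names could be taken from the data frame returned by getChromInfoFromNCBI(), assuming it has a SequenceName column.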
Okay. I have noted that. Thank you @hpages
PR #53 merged, thanks @Priceless-P !
Next task in your group is issue #45. Whenever you are ready, go there and ask me to assign you.