Indexing new data for release 13.0
Closed this issue · 5 comments
AntonPetrov commented
We need to add Genome as a fourth entity type in Rfam search in addition to Family, Clan, and Motif.
The Genome object should contain the following fields:
- uniprot_reference_proteome_id - empty if UPID not available
- rfam_genome_id - empty if UPID is available
- gca_accession - empty if GCA not available
- description
- length - in nucleotides
- taxonomy_lineage - string like Bacteria; Firmicutes; etc
- ncbi_taxonomy_id - cross-reference, see comment below
- num_rfam_hits - number of significant Rfam family hits
- num_rfam_families - number of distinct Rfam families with significant hits
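The empty-value rules in the list above can be sketched in Python as follows. This is only an illustration: the input dictionary keys (`upid` and so on), the `additional_fields` wrapper element, and the sample values are assumptions, not part of the Rfam export code. The ncbi_taxonomy_id cross-reference is left out here.

```python
import xml.etree.ElementTree as ET


def genome_fields(genome):
    """Return (name, value) pairs for a Genome entry, applying the empty-value rules."""
    upid = genome.get("upid") or ""
    return [
        ("uniprot_reference_proteome_id", upid),
        # rfam_genome_id is filled in only when no UPID is available
        ("rfam_genome_id", "" if upid else (genome.get("rfam_genome_id") or "")),
        ("gca_accession", genome.get("gca_accession") or ""),
        ("description", genome.get("description") or ""),
        ("length", str(genome.get("length", ""))),  # in nucleotides
        ("taxonomy_lineage", genome.get("taxonomy_lineage") or ""),
        ("num_rfam_hits", str(genome.get("num_rfam_hits", 0))),
        ("num_rfam_families", str(genome.get("num_rfam_families", 0))),
    ]


def genome_fields_xml(genome):
    """Serialise the fields as <field name="..."> elements under one parent."""
    parent = ET.Element("additional_fields")  # wrapper name is an assumption
    for name, value in genome_fields(genome):
        ET.SubElement(parent, "field", name=name).text = value
    return parent


# Made-up values, for illustration only
example = {"upid": "UP000005640", "rfam_genome_id": "RG000001", "length": 1000}
print(ET.tostring(genome_fields_xml(example), encoding="unicode"))
```

With a UPID present, rfam_genome_id comes out empty even though the input supplies one, which matches the rule as stated in the list.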
kalvari commented
We need to rename gca_accession to something else, as we also have GCF accessions for a small number of genomes I downloaded from NCBI.
AntonPetrov commented
Correction:
- ncbi_taxid should be a cross-reference like <ref dbkey="9606" dbname="ncbi_taxonomy_id"/>
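A cross-reference of that shape could be produced with Python's standard `xml.etree.ElementTree`. This is a minimal sketch; the helper name `taxid_ref` is hypothetical.

```python
import xml.etree.ElementTree as ET


def taxid_ref(ncbi_taxid):
    """Build a <ref> element pointing at one NCBI taxonomy ID."""
    # dbkey carries the taxonomy ID, dbname is fixed as in the example above
    return ET.Element("ref", dbkey=str(ncbi_taxid), dbname="ncbi_taxonomy_id")


ref = taxid_ref(9606)
print(ET.tostring(ref, encoding="unicode"))
```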
AntonPetrov commented
New field request:
- <field name="popular_species">9606</field> - similar to Family object
- common_name
- species - corresponds to scientific_name
- assembly_level
- assembly_name
AntonPetrov commented
Update:
- num_rfam_families should be num_families because this field is already used in clans objects
- num_rfam_hits, gca_accession, length should be set to retrievable in EBI search docs
- all new fields should have descriptions in EBI search docs
AntonPetrov commented
We also need to add Match as a fifth entry type:
- entry_type = Match
- rfamseq_acc
- seq_start
- seq_end
- cm_start
- cm_end
- evalue_score
- bit_score
- alignment_type - either full or seed. Should be a faceted field.
- truncated - one of 0, 5, 3, 53
- common_name
- scientific_name
- <ref> to Rfam family
- <ref> to ncbi_taxonomy_id
- <ref> to ENA accession
- <ref> to UPID
Only significant hits should be indexed.
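A rough Python sketch of how one Match entry could be assembled from a hit record, including the value checks and the significance filter. Several names here are assumptions rather than anything specified in this thread: the `is_significant` flag, the input keys `rfam_acc`, `ncbi_taxid`, `ena_acc` and `upid`, the `entry`, `additional_fields` and `cross_references` wrapper elements, and every `dbname` value other than ncbi_taxonomy_id.

```python
import xml.etree.ElementTree as ET

ALIGNMENT_TYPES = {"full", "seed"}
TRUNCATED_VALUES = {"0", "5", "3", "53"}
PLAIN_FIELDS = [
    "rfamseq_acc", "seq_start", "seq_end", "cm_start", "cm_end",
    "evalue_score", "bit_score", "alignment_type", "truncated",
    "common_name", "scientific_name",
]
# (input key, dbname); only ncbi_taxonomy_id comes from this thread,
# the other dbname values are placeholders
REFS = [
    ("rfam_acc", "rfam_family"),
    ("ncbi_taxid", "ncbi_taxonomy_id"),
    ("ena_acc", "ena"),
    ("upid", "upid"),
]


def match_entry(hit):
    """Return an <entry> element for a significant hit, or None otherwise."""
    if not hit.get("is_significant"):  # hypothetical flag on the hit record
        return None
    if hit["alignment_type"] not in ALIGNMENT_TYPES:
        raise ValueError("alignment_type must be full or seed")
    if str(hit["truncated"]) not in TRUNCATED_VALUES:
        raise ValueError("truncated must be one of 0, 5, 3, 53")

    entry = ET.Element("entry")
    fields = ET.SubElement(entry, "additional_fields")
    ET.SubElement(fields, "field", name="entry_type").text = "Match"
    for name in PLAIN_FIELDS:
        ET.SubElement(fields, "field", name=name).text = str(hit.get(name, ""))

    refs = ET.SubElement(entry, "cross_references")
    for key, dbname in REFS:
        if hit.get(key):  # skip references with no value, e.g. a missing UPID
            ET.SubElement(refs, "ref", dbkey=str(hit[key]), dbname=dbname)
    return entry
```

Returning None for non-significant hits keeps the filter in one place, so a caller can simply skip None results when writing the dump.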