Rfam/rfam-website

Indexing new data for release 13.0

We need to add Genome as a fourth entity type in Rfam search in addition to Family, Clan, and Motif.

The Genome object should contain the following fields:

  • uniprot_reference_proteome_id - empty if UPID not available
  • rfam_genome_id - empty if UPID is available
  • gca_accession - empty if GCA not available
  • description
  • length - in nucleotides
  • taxonomy_lineage - string like Bacteria; Firmicutes; etc
  • ncbi_taxonomy_id cross reference - see comment below
  • num_rfam_hits - number of significant Rfam family hits
  • num_rfam_families - number of distinct Rfam families with significant hits

We need to rename gca_accession to something else, as we also have GCF accessions for a small number of genomes that I downloaded from NCBI.

Correction:

  • ncbi_taxonomy_id should be a cross-reference rather than a field, e.g. <ref dbkey="9606" dbname="ncbi_taxonomy_id"/>

New field requests:

  • <field name="popular_species">9606</field> - similar to Family object
  • common_name
  • species - corresponds to scientific_name
  • assembly_level
  • assembly_name

Update:

  • num_rfam_families should be num_families because this field is already used in Clan objects
  • num_rfam_hits, gca_accession, length should be set to retrievable in EBI search docs
  • all new fields should have descriptions in EBI search docs
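
For illustration, a rough sketch of how a Genome entry could be assembled once the corrections above are applied. This is not the actual export code: the function name build_genome_entry, the input dict and its upid key, the entry / additional_fields / cross_references element layout, and the entry_type = Genome field are all assumptions modelled on the <field> and <ref> snippets in this issue. The gca_accession name is kept until a replacement is agreed.

```python
import xml.etree.ElementTree as ET


def build_genome_entry(genome):
    """Build one <entry> element for a Genome from a plain dict (hypothetical shape)."""
    # rfam_genome_id is only filled in when no UPID is available.
    upid = genome.get("upid") or ""
    rfam_genome_id = "" if upid else (genome.get("rfam_genome_id") or "")

    entry = ET.Element("entry", id=upid or rfam_genome_id)
    fields = ET.SubElement(entry, "additional_fields")

    def add_field(name, value):
        # Unavailable values become empty fields rather than being dropped.
        field = ET.SubElement(fields, "field", name=name)
        field.text = "" if value is None else str(value)

    add_field("entry_type", "Genome")  # assumed, by analogy with entry_type = Match
    add_field("uniprot_reference_proteome_id", upid)
    add_field("rfam_genome_id", rfam_genome_id)
    for name in (
        "gca_accession", "description", "length", "taxonomy_lineage",
        "num_rfam_hits", "num_families", "popular_species",
        "common_name", "species", "assembly_level", "assembly_name",
    ):
        add_field(name, genome.get(name))

    # The taxonomy ID is written as a cross-reference, not as a field.
    xrefs = ET.SubElement(entry, "cross_references")
    ET.SubElement(xrefs, "ref", dbkey=str(genome["ncbi_taxonomy_id"]),
                  dbname="ncbi_taxonomy_id")
    return entry
```

Calling ET.tostring(build_genome_entry(row), encoding="unicode") on a row with a UPID would give an entry whose rfam_genome_id field is empty, and the other way round for genomes without a UPID.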

We also need to add Match as a fifth entry type:

  • entry_type = Match
  • rfamseq_acc
  • seq_start
  • seq_end
  • cm_start
  • cm_end
  • evalue_score
  • bit_score
  • alignment_type - either full or seed. Should be a faceted field.
  • truncated - one of 0, 5, 3, 53
  • common_name
  • scientific_name
  • <ref> to Rfam family
  • <ref> to ncbi_taxonomy_id
  • <ref> to ENA accession
  • <ref> to UPID

Only significant hits should be indexed.
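
A possible shape for the Match export, again only as a sketch under assumptions. The issue does not say how significance is determined, so the code relies on a hypothetical is_significant flag on each hit. The entry id scheme, the input keys rfam_acc, ena_accession and upid, and the dbname values other than ncbi_taxonomy_id are placeholders, not confirmed names.

```python
import xml.etree.ElementTree as ET

ALIGNMENT_TYPES = {"full", "seed"}
TRUNCATED_VALUES = {"0", "5", "3", "53"}
FIELD_NAMES = (
    "rfamseq_acc", "seq_start", "seq_end", "cm_start", "cm_end",
    "evalue_score", "bit_score", "alignment_type", "truncated",
    "common_name", "scientific_name",
)


def build_match_entry(hit):
    """Return an <entry> for a significant hit, or None if the hit is skipped."""
    if not hit.get("is_significant"):  # hypothetical flag, see note above
        return None
    if hit["alignment_type"] not in ALIGNMENT_TYPES:
        raise ValueError("alignment_type must be full or seed")
    if str(hit["truncated"]) not in TRUNCATED_VALUES:
        raise ValueError("truncated must be one of 0, 5, 3, 53")

    # Placeholder id scheme: accession plus sequence coordinates.
    entry_id = "{rfamseq_acc}_{seq_start}_{seq_end}".format(**hit)
    entry = ET.Element("entry", id=entry_id)

    fields = ET.SubElement(entry, "additional_fields")
    ET.SubElement(fields, "field", name="entry_type").text = "Match"
    for name in FIELD_NAMES:
        ET.SubElement(fields, "field", name=name).text = str(hit.get(name, ""))

    xrefs = ET.SubElement(entry, "cross_references")
    for dbname, key in (
        ("RFAM", hit.get("rfam_acc")),                # placeholder dbname
        ("ncbi_taxonomy_id", hit.get("ncbi_taxonomy_id")),
        ("ENA", hit.get("ena_accession")),            # placeholder dbname
        ("UPID", hit.get("upid")),                    # placeholder dbname
    ):
        if key:  # skip cross-references that are not available
            ET.SubElement(xrefs, "ref", dbkey=str(key), dbname=dbname)
    return entry
```

Rejecting unexpected alignment_type and truncated values at export time keeps the alignment_type facet limited to full and seed.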