A small cross-tabulation of the public Fragalysis targets.
The target names may differ from the UniProt preferred name because:
- The target names in Fragalysis may end in an 'A'; this is an SGC/CMD thing and is how proteins are stored in Scarab
- The protein/gene goes by multiple names, which is normal
- The target may be a viral protein expressed as part of a single polyprotein chain that is cleaved by a protease
- The target may be known by the name of its domain
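The trailing 'A' case can be handled at lookup time by trying both forms of the name. A minimal sketch, assuming a single trailing capital 'A' is the only suffix to consider; name_candidates is a hypothetical helper (it is not in the notebook) and XYZ1A is a made-up name:

```python
from typing import List


def name_candidates(target_name: str) -> List[str]:
    """Return the name as given, plus the name without a trailing 'A'."""
    candidates = [target_name]
    # Only strip when something is left over; a bare 'A' is kept as-is.
    if len(target_name) > 1 and target_name.endswith('A'):
        candidates.append(target_name[:-1])
    return candidates


# 'XYZ1A' is a made-up target name used purely for illustration.
print(name_candidates('XYZ1A'))
```

Both forms are kept because some names genuinely end in 'A', so stripping blindly would break those.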
As of November 22 the targets in Fragalysis carry no protein metadata. This is an attempt at rectifying that.
Two caveats:
- this is uncurated data: there's a chimp protein in the final table
- there are many private targets; this covers only the public ones
Files herein:
- Fragalysis_targets.ipynb: the notebook
- targets_df.p (and its exports targets.csv and targets.md): a pickled pandas.DataFrame
- targets_blast.p: a Dict[str, Bio.Blast.Record.Alignment]
- targets.fasta: the seqs as they are in Fragalysis
- targets_mol.p: the mol data as in Fragalysis
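The .p files are ordinary pickles, so they can be read back with the standard library. A minimal sketch, assuming the files sit in the working directory; load_pickle is a hypothetical helper, and unpickling targets_df.p and targets_blast.p presumably needs pandas and Biopython installed, since the stored objects come from those packages:

```python
import pickle
from pathlib import Path
from typing import Any


def load_pickle(path: str) -> Any:
    """Unpickle a file; only do this with files you trust."""
    with Path(path).open('rb') as fh:
        return pickle.load(fh)


# Assumed usage, run from the folder holding the files:
# targets_df = load_pickle('targets_df.p')        # a pandas.DataFrame
# targets_blast = load_pickle('targets_blast.p')  # Dict[str, Alignment]
```

For the DataFrame, pandas.read_pickle('targets_df.p') should do the same job.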
Some entries lack sequences. This is because they predate Rachael adding the sequence to the API.
Taking the first hit in the BLAST search is a poor choice, hence the chimp protein.
The Bio.Blast.Record.Alignment instances in targets_blast.p have an attribute .hsps, which is a List[Bio.Blast.Record.HSP].
The first couple may have close scores...
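One way to spot such near-ties is to compare the two best HSP scores of an alignment. A minimal sketch, assuming each HSP exposes a numeric .score attribute as Biopython's parsed records do; top_hsps_are_close and the 5% tolerance are hypothetical choices, and the stand-in objects below replace real Alignment instances so that the snippet runs without Biopython:

```python
from types import SimpleNamespace


def top_hsps_are_close(alignment, tolerance: float = 0.05) -> bool:
    """True if the two best HSP scores are within `tolerance` (a fraction) of the best."""
    scores = sorted((hsp.score for hsp in alignment.hsps), reverse=True)
    if len(scores) < 2:
        return False  # nothing to compare against
    best, runner_up = scores[0], scores[1]
    return (best - runner_up) <= tolerance * best


# Stand-in for an Alignment: only the .hsps list and .score values are mimicked.
demo = SimpleNamespace(hsps=[
    SimpleNamespace(score=100.0),
    SimpleNamespace(score=98.0),
    SimpleNamespace(score=40.0),
])
print(top_hsps_are_close(demo))
```

Entries flagged this way would be the ones worth checking by hand.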
The UniProt XML schema is a byzantine thing. The JSON conversions are likewise confusing. I have missed a lot of juicy things.
I stored values as tuples in some cells of the pandas.DataFrame, which was a mistake.
I thought that .fillna(tuple()) was okay, but it is as unwelcome as .fillna(list()).
More importantly, these do not export well, although I am not a fan of pipe-separated fields either.
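One alternative to pipe-separated fields is to write each tuple cell as a JSON array string before exporting. A minimal sketch, not what the notebook does; encode_cell and decode_cell are hypothetical helpers meant only for columns that hold tuples, and anything that is not a tuple or list (a missing value, for instance) is written as an empty array:

```python
import json
from typing import Any, Tuple


def encode_cell(value: Any) -> str:
    """Turn a tuple into a JSON array string; anything else becomes '[]'."""
    if isinstance(value, (tuple, list)):
        return json.dumps(list(value))
    return json.dumps([])


def decode_cell(text: str) -> Tuple[Any, ...]:
    """Inverse of encode_cell; nested tuples come back as lists."""
    return tuple(json.loads(text))


# Assumed pandas usage, with 'names' as a made-up column name:
# df['names'] = df['names'].map(encode_cell)   # before df.to_csv(...)
# df['names'] = df['names'].map(decode_cell)   # after pd.read_csv(...)
print(decode_cell(encode_cell(('a', 'b'))))
```

The same .map approach also gets around fillna: something like df['names'].map(lambda v: v if isinstance(v, tuple) else ()) fills the gaps with empty tuples.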