ajsmith/cs747-project

Build labeled data set

Closed this issue · 2 comments

Build the labeled data set. It should have this format:

>>> labeled_data
      uid                                                seq    label
0  Q6GZX4  MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQV...  Viruses
1  Q6GZX3  MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQT...  Viruses
2  Q197F8  MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPGSIVPWK...  Viruses
3  Q197F7  MYQAINPCPQSWYGSPQLEREIVCKMSGAPHYPNYYPVHPNALGGA...  Viruses
4  Q6GZX2  MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVY...  Viruses
5  Q6GZX1  MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTS...  Viruses
6  Q197F5  MRYTVLIALQGALLLLLLIDDGQGQSPYPYPGMPCNSSRQCGLGTC...  Viruses
7  Q6GZX0  MQNPLPEVMSPEHDKRTTTPMSKEANKFIRELDKKPGDLAVVSDFV...  Viruses
8  Q91G88  MDSLNEVCYEQIKGTFYKGLFGDFPLIVDKKTGCFNATKLCVLGGK...  Viruses
9  Q6GZW9  MYKMYFLKDQKFSLSGTIRINDKTQSEYGSVWCPGLSITGLHHDAI...  Viruses
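
For reference, a minimal sketch of how a frame with this shape could be assembled. It assumes pandas (the printout above looks like a pandas `DataFrame`) and that the records are already available as `(uid, seq, label)` tuples. `build_labeled_data` and the toy records are hypothetical names and placeholders, not code or data from this repository.

```python
import pandas as pd

# Column order taken from the printed frame above.
COLUMNS = ["uid", "seq", "label"]


def build_labeled_data(records):
    """Return a DataFrame with one row per (uid, seq, label) record.

    Hypothetical helper: `records` is any iterable of 3-tuples.
    """
    return pd.DataFrame(list(records), columns=COLUMNS)


# Toy placeholder records, NOT real database entries.
toy_records = [
    ("UID1", "MAAA", "Viruses"),
    ("UID2", "MCCC", "Bacteria"),
]

labeled_data = build_labeled_data(toy_records)
print(labeled_data)
```

How the tuples are produced (parsing the source files, mapping taxonomy to a label) is left open here, since the issue does not specify it.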

Depends on:

Create ~17K entries for each segment of the dataset (3% of the total dataset):

3% Bacteria
3% Viruses
3% Archaea
Eukaryotes:

  • 3% Viridiplantae + any other plants
  • 3% Fungi
  • 3% Mammalia + other Vertebrata
  • 3% Insecta + Nematoda + others (non-vertebrates)
  • 3% single-celled eukaryotes
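
The per-segment sampling above could be sketched roughly like this. It uses only the standard library and assumes the records are `(uid, seq, label)` tuples, with the label identifying the segment. `sample_per_segment`, the default of 17,000, and the seed are illustrative assumptions, not the code actually used for this issue.

```python
import random
from collections import defaultdict


def sample_per_segment(records, n_per_segment=17_000, seed=0):
    """Draw up to n_per_segment records per label, without replacement.

    Hypothetical helper: each record is a (uid, seq, label) tuple.
    Segments with fewer than n_per_segment records are kept whole.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group the records by their label (third field).
    by_label = defaultdict(list)
    for record in records:
        by_label[record[2]].append(record)

    sampled = []
    for label in sorted(by_label):  # sorted for a deterministic order
        group = by_label[label]
        k = min(n_per_segment, len(group))
        sampled.extend(rng.sample(group, k))
    return sampled


# Toy placeholder records, NOT real database entries.
toy = [(f"UID{i}", "M", "Viruses" if i % 2 else "Bacteria") for i in range(10)]
subset = sample_per_segment(toy, n_per_segment=3)
```

Sampling a fixed count per segment, rather than a fixed fraction of each, keeps the classes balanced even when the source groups differ a lot in size.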

Done!