Build labeled data set
Closed this issue · 2 comments
ajsmith commented
Build the labeled data set. It should have this format:
>>> labeled_data
      uid                                                seq    label
0  Q6GZX4  MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQV...  Viruses
1  Q6GZX3  MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQT...  Viruses
2  Q197F8  MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPGSIVPWK...  Viruses
3  Q197F7  MYQAINPCPQSWYGSPQLEREIVCKMSGAPHYPNYYPVHPNALGGA...  Viruses
4  Q6GZX2  MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVY...  Viruses
5  Q6GZX1  MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTS...  Viruses
6  Q197F5  MRYTVLIALQGALLLLLLIDDGQGQSPYPYPGMPCNSSRQCGLGTC...  Viruses
7  Q6GZX0  MQNPLPEVMSPEHDKRTTTPMSKEANKFIRELDKKPGDLAVVSDFV...  Viruses
8  Q91G88  MDSLNEVCYEQIKGTFYKGLFGDFPLIVDKKTGCFNATKLCVLGGK...  Viruses
9  Q6GZW9  MYKMYFLKDQKFSLSGTIRINDKTQSEYGSVWCPGLSITGLHHDAI...  Viruses
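For illustration, a minimal sketch of how a frame with this shape could be assembled with pandas. The helper name `build_labeled_data` and the input format (an iterable of `(uid, seq, label)` tuples) are assumptions made for this example, not something specified in the issue, and the toy records below are made up rather than real entries.

```python
import pandas as pd


def build_labeled_data(records):
    """Build a DataFrame with the columns uid, seq, label.

    `records` is assumed to be an iterable of (uid, seq, label) tuples;
    how those tuples are obtained is outside the scope of this sketch.
    """
    df = pd.DataFrame(list(records), columns=["uid", "seq", "label"])
    # Keep one row per accession so that uid can serve as a unique key.
    return df.drop_duplicates(subset="uid").reset_index(drop=True)


# Toy records for demonstration only (not real sequences).
labeled_data = build_labeled_data([
    ("P00001", "MAFSAEDVLK", "Viruses"),
    ("P00002", "MSIIGATRLQ", "Viruses"),
    ("P00001", "MAFSAEDVLK", "Viruses"),  # duplicate uid, dropped
])
print(labeled_data)
```

Dropping duplicate `uid` values is a design choice of the sketch; it may not be needed if the source records are already unique.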
Depends on:
Kelvin-T-Lu commented
Create ~17K rows for each segment of the dataset (each segment is 3% of the total dataset):
3% Bacteria
3% Viruses
3% Archaea
Eukaryota:
- 3% Viridiplantae + any other plants
- 3% Fungi
- 3% Mammalia + other Vertebrata
- 3% Insecta + Nematoda + others (non-vertebrates)
- 3% single-celled eukaryotes
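One possible way to draw a fixed number of rows per segment, sketched with pandas. The function name `sample_per_group`, the use of the `label` column as the segment key, and the fixed seed are assumptions made for this example; the thread does not say how the sampling was actually done.

```python
import pandas as pd


def sample_per_group(df, group_col="label", n_per_group=17_000, seed=0):
    """Randomly draw up to n_per_group rows from each group in group_col.

    Groups with fewer than n_per_group rows are kept in full.
    """
    parts = [
        group.sample(n=min(len(group), n_per_group), random_state=seed)
        for _, group in df.groupby(group_col)
    ]
    return pd.concat(parts).reset_index(drop=True)


# Toy data for demonstration only: 6 Bacteria, 3 Viruses, 1 Archaea.
toy = pd.DataFrame({
    "uid": [f"X{i}" for i in range(10)],
    "seq": ["M"] * 10,
    "label": ["Bacteria"] * 6 + ["Viruses"] * 3 + ["Archaea"],
})
subset = sample_per_group(toy, n_per_group=2)
print(subset["label"].value_counts())
```

With the real data, `n_per_group` would stay at roughly 17,000 and `group_col` would name whichever column holds the eight segments listed above.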
ajsmith commented
Done!