songlab-cal/tape

Length of transformer's output does not match the input's

quailwwk opened this issue · 1 comment

Hi, I'm trying to apply the pre-trained "Transformer" model to some FASTA files, like so:
tape-embed transformer 1a0b_1_A.fasta 1a0b_1_A.npz bert-base --full_sequence_embed
where 1a0b_1_A.fasta contains a 117-residue sequence:

>1a0b_1_A
KSEALLDIPMLEQYLELVGPKLITDGLAVFEKMMPGYVSVLESNLTAQDKKGIVEEGHKIKGAAGSVGLRHLQQLGQQIQSPDLPAWEDNVGEWIEEMKEEWRHDVEVLKAWVAKAT

But the 'seq' array in the output npz file has shape (119, 768).

>>> np.load('1a0b_1_A.npz',allow_pickle=True)['1a0b_1_A'].item()['seq'].shape
(119, 768)

Is there a mistake in my usage, or is this the expected result?
If the latter, how can I map the result back to the input sequence? Thanks!

rmrao commented

TAPE adds a start and end token to each sequence, so you can remove the first and last position to get a per-residue embedding.
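
For the concrete mapping, here is a minimal sketch using the file names from the command above; the slicing is the point, so adjust the paths and key to your own files:

import numpy as np

# Load the embeddings written by tape-embed; the archive is keyed by FASTA header.
arrays = np.load('1a0b_1_A.npz', allow_pickle=True)
embedding = arrays['1a0b_1_A'].item()['seq']   # shape (119, 768)

# Drop the first and last rows (the added start/end tokens), so that
# row i of the result corresponds to residue i of the input sequence.
per_residue = embedding[1:-1]                  # shape (117, 768)

sequence = ('KSEALLDIPMLEQYLELVGPKLITDGLAVFEKMMPGYVSVLESNLTAQDKKGIVEEGHKI'
            'KGAAGSVGLRHLQQLGQQIQSPDLPAWEDNVGEWIEEMKEEWRHDVEVLKAWVAKAT')
assert per_residue.shape[0] == len(sequence)   # 117 rows <-> 117 residues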