danny305/StabilityOracle

Datasets' "chain", "position" and "wtAA" often not matching the PDB files as downloaded from RCSB

gvisani opened this issue · 2 comments

Hi,

I have noticed that the "chain", "position" and "wtAA" columns in the datasets' .csv files do not always match the raw PDB files as downloaded from RCSB.
I suspect some non-obvious shift is being applied to the residue positions along the chain, and it seems to occur most often in the cdna117K dataset.

This problem prevents effective use of the dataset. Could you please share how to map residue positions between the .csv files and the PDB files, perhaps by sharing your preprocessing pipeline?

Thanks!

Hi,
When processing the CDNA117K data, the experiments did not use the full amino acid sequence; only a portion of it was used. As a result, the position and sequence in the dataset.csv file do not match the PDB file.
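
For anyone hitting the same mismatch, here is a minimal sketch of one way to recover the mapping. It is not the preprocessing pipeline used for the datasets. It assumes (these are assumptions, not confirmed above) that the .csv position is a 1-based index into the experimental fragment, and that the fragment is an exact, contiguous substring of the residues present in the PDB chain. The function names and the toy chain are hypothetical.

```python
from typing import List, Tuple

# One residue of a PDB chain: (PDB residue number, one-letter amino acid).
Residue = Tuple[int, str]


def fragment_offset(chain: List[Residue], fragment: str) -> int:
    """Return the 0-based index in `chain` where `fragment` starts.

    Assumption: the fragment matches the chain sequence exactly and
    occurs only once; otherwise a ValueError is raised.
    """
    full = "".join(aa for _, aa in chain)
    start = full.find(fragment)
    if start == -1:
        raise ValueError("fragment not found in chain sequence")
    if full.find(fragment, start + 1) != -1:
        raise ValueError("fragment occurs more than once; mapping is ambiguous")
    return start


def csv_to_pdb_position(
    chain: List[Residue], fragment: str, csv_pos: int, wt_aa: str
) -> int:
    """Map a 1-based fragment position (assumed .csv convention) to the
    PDB residue number, checking that the wild-type residue agrees."""
    if not 1 <= csv_pos <= len(fragment):
        raise ValueError("csv_pos is outside the fragment")
    offset = fragment_offset(chain, fragment)
    resseq, aa = chain[offset + csv_pos - 1]
    if aa != wt_aa:
        raise ValueError(f"expected {wt_aa} at PDB residue {resseq}, found {aa}")
    return resseq


# Toy chain (made up for illustration): residues numbered 10..19 in the PDB file.
chain = list(zip(range(10, 20), "MKTAYIAKQR"))

# The experiment used only the fragment "AYIAK"; its position 2 is "Y".
print(csv_to_pdb_position(chain, "AYIAK", 2, "Y"))
```

Working from a list of (residue number, amino acid) pairs rather than a plain offset keeps the mapping correct when the PDB numbering does not start at 1 or contains gaps. If the fragment differs from the PDB sequence (for example, through mutations or missing residues), an exact substring search will fail and a sequence alignment would be needed instead.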

I see! I ended up changing them manually; it wasn't as time-intensive as I had thought.