What is the use of two similarity files in the data?
Hello, I read your dataset introduction and ran your code. First, I'd like to confirm my understanding: your method takes only the SMILES string and the protein sequence as input, so it does not use drug structure information or protein structure information, right?
In addition, I did not see kiba_drug_sim.txt and kiba_target_sim.txt used anywhere in the code. Are these two files redundant?
Maybe I don't fully understand your method, so any guidance would be appreciated. Thank you!
Hello @zhouhao-learning, yes, our method only uses sequence input.
Those similarity files were used in two additional experiments that are explained in the paper; please refer to Tables 3-4. There you'll see that while a CNN was used to build the representation for one component (e.g. SMILES), for the other we used the similarity matrix (e.g. kiba_target_sim). With this, we wanted to understand how much information the CNN contributes.
In the source code, these correspond to the build_single_prot and build_single_drug functions, but it seems I didn't include the code required to run them. I can update the code if that would be useful for you.
Best.
@hkmztrk
OK, thank you very much for your reply. Yes, please update this part of the code; I would like to try it.
In addition, do kiba_target_sim.txt and kiba_drug_sim.txt use structural information about proteins or drugs? If I have some SMILES strings and protein sequences, is there a way to convert them to a similarity matrix?
Any advice would be appreciated, thank you very much!
Best Wishes!
@zhouhao-learning For target similarity, the Smith-Waterman algorithm is used; for drug similarity, PubChem structure similarity is used. Please refer to the DeepDTA article for more detail.
I'll try to update the code in a few days.
Best!
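For readers who want to see what a Smith-Waterman based target similarity could look like, here is a minimal sketch in pure Python. It is not the code used to produce kiba_target_sim.txt. The scoring values (match, mismatch, gap), the normalization by self-alignment scores, and the function names are all assumptions made for illustration; a real pipeline would typically use a substitution matrix and an optimized aligner. The drug side (PubChem structure similarity) is not reproduced here.

```python
import math

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b.

    Uses a linear gap penalty; the default scores are illustrative
    assumptions, not the values behind kiba_target_sim.txt.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

def normalized_similarity(a, b):
    """Scale the score to [0, 1] using the self-alignment scores (assumed normalization)."""
    denom = math.sqrt(smith_waterman_score(a, a) * smith_waterman_score(b, b))
    return smith_waterman_score(a, b) / denom if denom else 0.0

def similarity_matrix(seqs):
    """Pairwise similarity matrix for a list of sequences."""
    return [[normalized_similarity(x, y) for y in seqs] for x in seqs]
```

Calling `similarity_matrix` on a list of protein sequences returns a square, symmetric matrix with ones on the diagonal, which is the general shape of a target similarity file.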
@hkmztrk
OK, thank you for your reply. I have one last question:
CHARPROTSET = { "A": 1, "C": 2, "B": 3, "E": 4, "D": 5, "G": 6,
"F": 7, "I": 8, "H": 9, "K": 10, "M": 11, "L": 12,
"O": 13, "N": 14, "Q": 15, "P": 16, "S": 17, "R": 18,
"U": 19, "T": 20, "W": 21,
"V": 22, "Y": 23, "X": 24,
"Z": 25 }
CHARCANSMISET = { "#": 1, "%": 2, ")": 3, "(": 4, "+": 5, "-": 6,
".": 7, "1": 8, "0": 9, "3": 10, "2": 11, "5": 12,
"4": 13, "7": 14, "6": 15, "9": 16, "8": 17, "=": 18,
"A": 19, "C": 20, "B": 21, "E": 22, "D": 23, "G": 24,
"F": 25, "I": 26, "H": 27, "K": 28, "M": 29, "L": 30,
"O": 31, "N": 32, "P": 33, "S": 34, "R": 35, "U": 36,
"T": 37, "W": 38, "V": 39, "Y": 40, "[": 41, "Z": 42,
"]": 43, "_": 44, "a": 45, "c": 46, "b": 47, "e": 48,
"d": 49, "g": 50, "f": 51, "i": 52, "h": 53, "m": 54,
"l": 55, "o": 56, "n": 57, "s": 58, "r": 59, "u": 60,
"t": 61, "y": 62}
CHARISOSMISET = {"#": 29, "%": 30, ")": 31, "(": 1, "+": 32, "-": 33, "/": 34, ".": 2,
"1": 35, "0": 3, "3": 36, "2": 4, "5": 37, "4": 5, "7": 38, "6": 6,
"9": 39, "8": 7, "=": 40, "A": 41, "@": 8, "C": 42, "B": 9, "E": 43,
"D": 10, "G": 44, "F": 11, "I": 45, "H": 12, "K": 46, "M": 47, "L": 13,
"O": 48, "N": 14, "P": 15, "S": 49, "R": 16, "U": 50, "T": 17, "W": 51,
"V": 18, "Y": 52, "[": 53, "Z": 19, "]": 54, "\\": 20, "a": 55, "c": 56,
"b": 21, "e": 57, "d": 22, "g": 58, "f": 23, "i": 59, "h": 24, "m": 60,
"l": 25, "o": 61, "n": 26, "s": 62, "r": 27, "u": 63, "t": 28, "y": 64}
How are the index values for these symbols defined? I see that they do not follow alphabetical order. Is there a rule behind them?
Hello @zhouhao-learning, sorry I couldn't find the time to update the code; the last few weeks have been hectic.
As for your question: no, actually. A numerical ID is assigned every time a new character is detected in the corpus.
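The assignment scheme described above can be sketched roughly as follows. This is an illustration only, not the script that generated the dictionaries in this thread: the function names, the `max_len` parameter, and the choice to reserve 0 for padding are assumptions (the dictionaries above do start at 1, which is consistent with that choice).

```python
def build_char_index(corpus):
    """Assign the next free integer ID to each character the first time it appears."""
    index = {}
    for seq in corpus:
        for ch in seq:
            if ch not in index:
                index[ch] = len(index) + 1  # IDs start at 1; 0 is kept for padding (assumption)
    return index

def label_sequence(seq, index, max_len):
    """Encode a sequence as integer IDs, truncated or zero-padded to max_len."""
    ids = [index[ch] for ch in seq[:max_len]]
    return ids + [0] * (max_len - len(ids))

# Small usage example with two SMILES strings.
idx = build_char_index(["CCO", "C=O"])
encoded = label_sequence("C=O", idx, max_len=5)
```

With a scheme like this the IDs depend only on the order in which characters are encountered, so they carry no chemical or alphabetical meaning; any consistent one-to-one mapping would work equally well as input to an embedding layer.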
@hkmztrk
How are the index values for these symbols defined? I see that they do not follow alphabetical order. Is there a rule behind them?
No, there is no meaning behind these numerical ID assignments; they are arbitrary.
@hkmztrk
OK! Thank You