Generate fake DMS signal using RNAstructure pairing probabilities, to train our models until we get the experimental data.
The function createFakeData
reads a fasta file and outputs a dict with reference, sequence and synthetic DMS signal. It is located in createFakeData.py
.
def createFakeData(fasta_file, rnastructure_path='', sequencer_noise=0.001):
"""
Reads a fasta file and outputs a dict with reference, sequence and synthetic DMS signal.
Args:
fasta_file (str): The path to the input fasta file.
rnastructure_path (str): The path to the RNAstructure executable. Default is '' (i.e. RNAstructure is in the PATH).
sequencer_noise (float): The amount of sequencer noise to add to the DMS signal. Default is 0.001.
Returns:
data (dict): A dictionary with reference as key and sequence / synthetic DMS signal as value.
Example:
>>> createFakeData('testData/refs.fasta')
Predicting RNA structures: 100%|███████████████████████████████████| 3/3 [00:01<00:00, 1.92seq/s]
{'3042-O-flank_1=hp1-DB':
{
'sequence': 'TTAAACCGGCCAACATACCGCATATGAGGATCACCCATATGCTCAAGATATTCGAAAGAATATCTTTCCACAGTCGAAAGACTGTGTCTCTCTCTTCCTTTTTCTCTTCCTCTTTCTCTTTCTCTTTCTCTTCTCTTCTGTATTACGAGTTCGCTACTCGTTCCTTTCGA',
'fakeDMS': [0.0008763192815798796, 0.003282160575810705, 0.019867824327687585, 0.02829582753478791, 0.031266714341378524, 0.010288700618344545, 0.22755049833180016, 0.5741991237121613, 0.5018175337184233, 0.004588974386334777, 0.0027875607249180545, 0.010719083819803172, 0.0311800339547883, 0.026734132953685468, 0.5171140931132598, 0.526320420155328, 0.5302629606137425, 0.5973646221995781, 0.06787639095534494, 0.996380706717648, 0.9988325172245213, 0.9986708144405785, 0.9984979885954683, 0.9984931638770012, 0.9984703346038281, 0.9986003515173016, 0.0005489978550856028,
},
'3043-CC-flank_1=hp1-DB_2':
...
"""