Usability

Question

kullrich opened this issue 2 years ago · 5 comments

Hi,

I cannot find a description of what data would be needed to apply this algorithm to a non-model organism.

Could you please indicate which data sets are needed and how they should be pre-processed to use the software? My feeling is that otherwise it will be hard to use, even for a trained bioinformatician.

Answer 1 · 2022-05-22T11:22:04.000Z

I appreciate your interest in our work. We have already trained the classification model on multiple datasets. If you wish to run experiments on one of the datasets mentioned in the paper, you can simply download the data and the trained model and run motif_extractor_example.ipynb.

However, if you wish to train your own model on a custom dataset, a simple way to use the software is to follow these steps:

Transform your sequence data into one-hot matrices; example data can be found here.
Modify the configuration file in experiments, and train your model with 5-fold cross-validation.
Run the motif extractor by following the tutorial in motif_extractor_example.ipynb.

Answer 2 · 2022-05-22T17:33:23.000Z

Thank you for your response.

Still, for me the following is missing:

A clear explanation of how to create such one-hot matrices (without my having to investigate individual example data sets myself; you are the experts and should explain how this needs to be done).
A clear explanation of the best practices for the configuration parameters, so that a "valid" model training can be run.
An example of how to apply the already trained model to an arbitrary "new" species, showing what input data is really necessary to get reasonable results.

Thank you in anticipation.

Answer 3 · 2022-05-26T14:01:18.000Z

Thank you for your interest in our work.

For one-hot encoding of the input DNA sequences, each base is encoded as a vector of zeros with a single one at a base-specific position: A is encoded as (1,0,0,0), C as (0,1,0,0), G as (0,0,1,0), and T as (0,0,0,1). The preprocessed hotspot sequences can be obtained from here.
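The encoding described above can be sketched in a few lines of Python. This is only an illustration, not the repository's own preprocessing code: the function name one_hot_encode, the (length x 4) row layout, the zero vector for unknown bases such as N, and the zero padding are all assumptions made for this example. Check the example data for the layout the model actually expects.

```python
# Base-to-vector mapping as described in the answer above.
BASE_TO_VECTOR = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}
# Assumption: any other symbol (e.g. N) becomes an all-zero vector.
UNKNOWN = (0, 0, 0, 0)


def one_hot_encode(seq, length=None):
    """Return a list of 4-element rows, one row per base.

    If `length` is given, the sequence is truncated or zero-padded
    to exactly `length` rows (hypothetical convenience parameter).
    """
    rows = [list(BASE_TO_VECTOR.get(base, UNKNOWN)) for base in seq.upper()]
    if length is not None:
        rows = rows[:length]
        rows += [list(UNKNOWN) for _ in range(length - len(rows))]
    return rows


encoded = one_hot_encode("ATCGN")
padded = one_hot_encode("ac", length=4)
```

The result is a plain nested list; converting it to an array type and transposing it, if the model needs a 4 x length layout, is left to the reader.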

You can also train on your own data. You can crop the recombination hotspot sequences from the genome assembly GRCh38 (hg38) according to a recombination map (also called a genetic map). Such recombination maps come in different resolutions and are available on Google Drive.

The example recombination map can be found here:

Genetic map computed from the nature 2020 crossovers.
The data columns are as follows:
Chr (chromosome).
Begin (start point position of interval in GRCh38 coordinates).
End (end point position of interval in GRCh38 coordinates).
cMperMb (recombination rate in interval).
cM (centiMorgan location of end point of interval)
num hot : 17300 num cold 1103150
hot_rate avg : 14.06 cold_rate 0.2243

Chr	Begin	End	cMperMb	cM
chr1	829059	879059	3.68742022408169	0.2694653240675081
chr1	879059	929059	2.907389022833642	0.4148347752091902
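A map in the format above could be read and split into hot and cold intervals roughly as follows. This is a minimal sketch under stated assumptions: the helper names read_genetic_map and split_hot_cold are hypothetical, and the hot_threshold value is a placeholder, since the thread does not say which cMperMb cutoff was used to label an interval as a hotspot.

```python
import csv
import io


def read_genetic_map(handle):
    """Parse tab-separated rows with columns Chr, Begin, End, cMperMb, cM."""
    rows = []
    for rec in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, description lines, and the header row.
        if len(rec) != 5 or rec[0] == "Chr":
            continue
        chrom, begin, end, rate, cm = rec
        rows.append((chrom, int(begin), int(end), float(rate), float(cm)))
    return rows


def split_hot_cold(rows, hot_threshold):
    """Split intervals by recombination rate (threshold is an assumption)."""
    hot = [r for r in rows if r[3] >= hot_threshold]
    cold = [r for r in rows if r[3] < hot_threshold]
    return hot, cold


# The two example rows from the map shown above.
example = (
    "Chr\tBegin\tEnd\tcMperMb\tcM\n"
    "chr1\t829059\t879059\t3.68742022408169\t0.2694653240675081\n"
    "chr1\t879059\t929059\t2.907389022833642\t0.4148347752091902\n"
)
rows = read_genetic_map(io.StringIO(example))
hot, cold = split_hot_cold(rows, hot_threshold=3.0)  # placeholder cutoff
```

The sequence for each selected interval would then be cut from the matching chromosome of the genome assembly using the Begin and End coordinates. Whether those coordinates are 0-based or 1-based is not stated in the thread and should be checked against the map's documentation.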

For model training, you can simply use the optimal parameters from the experiments. An example configuration is as follows:

{
  "dataset":        "human_science_2019",
  "num_data":       40000,
  "max_length":     1000,
  "min_length":     500,
  "model":          "SeqModel_Attention",
  "log_dir":        "logs",
  "output_dir":     "output",
  "output_prefix":  "SeqModel_Attention",

  "input_length":1000,
  "input_height":4,
  "data_augmentation":1,
  "hidden_dim":64,
  "attention":1,
  "equivariant":0,
  "get_distribution":0,
  "get_line_comparison":1,

  "dp_rate":0.1,

  "trails":4,
  "folds":5,
  "epochs":100,
  "batch_size":64,
  "lr":0.001,

  "init_weight":0,
  "reinforce_train":0,
  "loss":"binary_crossentropy"
}
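To illustrate what the "folds" setting means, the sketch below loads a few of the values from the configuration above and builds 5-fold cross-validation splits over sample indices. It is an illustration only: the repository's training code presumably performs its own splitting, and the kfold_indices helper and its seed parameter are hypothetical names introduced for this example.

```python
import json
import random

# A subset of the configuration shown above, inlined for the example.
config_text = '{"folds": 5, "num_data": 40000, "batch_size": 64, "epochs": 100}'
args = json.loads(config_text)


def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs for n_folds-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into n_folds roughly equal groups.
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val


splits = list(kfold_indices(args["num_data"], args["folds"]))
```

Each sample ends up in the validation set of exactly one fold, so with these values every fold trains on 32000 samples and validates on the remaining 8000.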

We have not yet provided a script in this repo for running the model directly on a completely new dataset. Such a script would indeed be useful for bioinformaticians. However, the implementation is straightforward: you can simply load the trained model and the one-hot matrices of your data and run the prediction.

Pseudocode can be as simple as the following:

# args: the configuration dictionary (see the example above); loss: the loss selected by args["loss"]
# your_data: your sequences, already converted to one-hot matrices
model = SeqModel(epochs=args["epochs"], args=args, loss=loss)
# load the trained weights into the model here (the loading call is not shown in this thread)
predictions = model.predict_mc(your_data, n_preds=4)

Answer 4 · 2023-05-20T12:46:42.000Z

OK, thank you very much. I'll try. If you have any questions, please come back.

Answer 5 · 2023-05-30T03:13:09.000Z

I still don't know how to predict recombination hotspots with this program using my data.