by Vadim Puller, Richard Neher, and Jan Albert, biorxiv, doi:10.1101/129387
Code associated with our manuscript on estimating the time of HIV infection from next generation sequencing data. This project uses whole genome deep sequencing data from Zanini et al 2015 to establish a method to estimate time of infection from viral diversity.
The directory Frequency_data contains the training data on which the analysis is based. These files are derived from the NGS data by mapping.
patient_tt.npy: time points for patient samples
patient_data.npy and patient_mask.npy: data and mask for the array of nucleotide frequencies for all time points
patient_viral_load.npy: viral load data
patient_dilutions.npy and patient_dilutions_mask.npy: dilutions data (see Zanini et al 2015)
annotations.txt: auxiliary gene annotation info
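As a rough illustration of how the data and mask files fit together, the sketch below loads one patient's time points and combines the frequency array with its mask using NumPy. The helper name `load_patient` is hypothetical and not part of this repository. The sketch assumes that `patient` in the file names stands for a patient identifier, that the mask has the same shape as the data, and that nonzero mask entries mark values to ignore; check these assumptions against the actual files before relying on them.

```python
import os
import numpy as np


def load_patient(data_dir, patient):
    """Hypothetical helper: load time points and masked frequencies for one patient."""
    # Assumption: files are named <patient>_tt.npy, <patient>_data.npy, <patient>_mask.npy
    tt = np.load(os.path.join(data_dir, f"{patient}_tt.npy"))
    data = np.load(os.path.join(data_dir, f"{patient}_data.npy"))
    mask = np.load(os.path.join(data_dir, f"{patient}_mask.npy"))
    # Assumption: nonzero mask entries flag frequencies that should be ignored
    freqs = np.ma.masked_array(data, mask=mask.astype(bool))
    return tt, freqs
```

A call such as `tt, freqs = load_patient("Frequency_data", "p1")` would then return the sampling times and a masked frequency array, where `"p1"` is only a placeholder identifier.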
The directory K31_data contains the validation data, consisting of two time points from 31 additional patients. The directory content is structured as in Frequency_data.
The scripts EDI_functions.py, EDI_plotting.py, and EDI_median_regression.py contain the source code to load the data, determine the regression coefficients, and generate the figures in the manuscript.
EDI_functions.py: functions for loading data, calculating sequence diversity, and linear fitting
EDI_plotting.py: functions for data analysis and generating plots
EDI_median_regression.py: script generating plots presented in the manuscript
K31_prediction.py: script generating figure 6 (validation of ETI inference on 31 additional patients)
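The general idea of relating viral diversity to time since infection can be sketched in a few lines. This is not the code in EDI_functions.py: the function names are hypothetical, the diversity measure (one minus the sum of squared nucleotide frequencies per site) is an assumed choice, and an ordinary least-squares fit via `np.polyfit` stands in for whatever regression the scripts actually perform.

```python
import numpy as np


def site_diversity(freqs):
    """Per-site diversity 1 - sum_a x_a^2 (assumed measure).

    freqs has shape (n_nucleotides, n_sites); each column sums to 1.
    """
    return 1.0 - np.sum(freqs ** 2, axis=0)


def fit_time_vs_diversity(times, diversities):
    """Least-squares line t = slope * D + intercept (stand-in for the real fit)."""
    slope, intercept = np.polyfit(diversities, times, 1)
    return slope, intercept


def predict_time(diversity, slope, intercept):
    """Estimate time since infection from a diversity value using the fitted line."""
    return slope * diversity + intercept
```

In use, one would average `site_diversity` over the sites of each sample, fit the line on the training patients, and apply `predict_time` to samples from the validation set.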