Missing DNA PCA algorithm for ancestry-masked or ancient genomic data.
Run the method from the command line:

    python3 covariance_matrix_method.py params.txt

params.txt is the parameters file that is passed as input to the method. The following parameters can be specified in the parameters file:
- ROOT_DIR (str): path to the directory of array folders containing input files.
- BEAGLE_VCF_FILE (str): name of the genetic data file without the file extension. It can be a Beagle / VCF file.
- IS_MASKED (bool): True if an ancestry file is passed for ancestry-specific masking, or False otherwise.
- VIT_FBK_TSV_FILE (str): name of the ancestry file without the file extension. It can be a VIT / FBK / TSV file.
- NUM_ANCESTRIES (int): the total number of ancestries in the ancestry file.
- ANCESTRY (int): number of the ancestry for which dimensionality reduction is to be performed. The ancestry counter starts at 0 if the ancestry file is a TSV file, and at 1 if it is a VIT or an FBK file.
- PROB_THRESH (float): minimum probability threshold for a SNP to belong to an ancestry, if the ancestry file is an FBK file or an FB TSV file.
- AVERAGE_PARENTS (bool): True if the DNAs from the two parents are to be combined (averaged) for each individual, or False otherwise.
- IS_WEIGHTED (bool): True if weights are provided in the labels file, or False otherwise.
- LABELS_FILE (str): path to the labels file. It should be a TSV file where the first column has header indID and contains the individual IDs, and the second column has header label and contains the labels / groups for all individuals. If IS_WEIGHTED is specified as True, then the file must have an additional column with header weight. The weight column must contain the weights for each individual. This file can also have 2 optional columns with headers combination and combination_weight, if sets of individuals are to be combined to form combined individuals. The combination column must contain the combining groups for each individual and the combination_weight column must contain the weights for those combinations.
  NOTE: Individuals with positive weights are weighted accordingly. Individuals with zero weight are removed. Negative weights are not acceptable. Provide a combination value of 0 for individuals not to be combined in any group, 1 for individuals of the 1st combined group, 2 for the 2nd combined group, and so on. Each set of individuals that is to be combined must have the same label and combination_weight (if provided). If the combination_weight column is not provided, the combinations are assigned a default weight of 1.
- GROUPS_TO_REMOVE (dict): dictionary specifying the groups that are to be removed for each array. The keys in the dictionary are the array numbers and the values are the lists of corresponding groups that are to be removed for that array. For example, {1: ['group1', 'group2'], 2: [], 3: ['group3']}
- MIN_PERCENT_SNPS (float): minimum percentage of SNPs that must be known in an individual for that individual to be included in the method. Individuals with a lower percentage of unmasked SNPs than this threshold are dropped.
- SAVE_MASKS (bool): True if the masked matrices generated by gen_tools are to be saved in a .npz file, or False otherwise.
- LOAD_MASKS (bool): True if the masked matrices are to be loaded directly from a .npz file, or False otherwise.
- MASKS_FILE (str): path to the .npz file. The masked matrices are saved to this file / loaded from this file.
- OUTPUT_FILE (str): path to the output file, to which the output of the run is written. It is a TSV file with 3 columns. The first column contains the individual IDs, and the second and third columns contain the ancestry-specific projections obtained after dimensionality reduction using PCA.
- SCATTERPLOT_FILE (str): path to the scatter plot file, with .html extension. The scatter plot of the individuals is saved in this file.
- SAVE_COVARIANCE_MATRIX (bool): True if the covariance matrix is to be saved as a binary file, or False otherwise.
- COVARIANCE_MATRIX_FILE (str): path to the covariance matrix file. The covariance matrix is saved in this file.
- NUM_DIMS (int): number of dimensions to save in output.csv.
- RSID_OR_CHROMPOS (int): 1 if the SNP ID format in the Beagle / VCF file is rsID, or 2 if it is Chromosome_position.
- METHOD (int): number of the method to be run, out of the 5 available options.
- PERCENT_VALS_MASKED (float): percent of values in the covariance matrix to be masked and then imputed, if METHOD is specified as 3 or 4.
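The LABELS_FILE rules above (required columns, zero and negative weights, consistent combinations) can be checked before a run. The following is only a minimal sketch of those rules as stated in this README; check_labels is a hypothetical helper and is not part of the method's code, and the sample IDs and labels are made up.

```python
import csv
import io


def check_labels(tsv_text, is_weighted):
    """Validate labels-file content and return the IDs that would be kept.

    Hypothetical helper: it mirrors the rules described in this README,
    not the method's actual loading code.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    required = ["indID", "label"] + (["weight"] if is_weighted else [])
    missing = [col for col in required if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    kept = []
    groups = {}  # combination number -> (label, combination_weight)
    for row in reader:
        weight = float(row["weight"]) if is_weighted else 1.0
        if weight < 0:
            raise ValueError(f"negative weight for {row['indID']}")
        if weight == 0:
            continue  # individuals with zero weight are removed
        combo = int(row.get("combination") or 0)  # 0 means "not combined"
        if combo != 0:
            # A missing combination_weight column defaults to a weight of 1.
            signature = (row["label"], float(row.get("combination_weight") or 1))
            if groups.setdefault(combo, signature) != signature:
                raise ValueError(f"combination {combo} mixes labels or weights")
        kept.append(row["indID"])
    return kept


sample = (
    "indID\tlabel\tweight\tcombination\n"
    "A\tpop1\t1\t1\n"
    "B\tpop1\t2\t1\n"
    "C\tpop2\t0\t0\n"
)
print(check_labels(sample, is_weighted=True))
```

Here individual C is dropped because its weight is zero, while A and B form combined group 1 and share the same label.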
NOTE: There are 2 acceptable formats for SNP indices in the VCF / Beagle file:
- rsid: rs followed by the id (int). For example, rs12345.
- position: chromosome number (int) followed by _, followed by the position (int). For example, 10_12345.
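The two SNP ID formats can be told apart with a pair of regular expressions. This is an illustrative sketch only: snp_id_format is a hypothetical helper, and its return values are chosen to match the RSID_OR_CHROMPOS codes (1 for rsID, 2 for Chromosome_position).

```python
import re

# rs followed by an integer id, e.g. rs12345
RSID_PATTERN = re.compile(r"rs\d+")
# chromosome number, underscore, position, e.g. 10_12345
CHROMPOS_PATTERN = re.compile(r"\d+_\d+")


def snp_id_format(snp_id):
    """Return 1 for an rsID, 2 for Chromosome_position, None otherwise."""
    if RSID_PATTERN.fullmatch(snp_id):
        return 1
    if CHROMPOS_PATTERN.fullmatch(snp_id):
        return 2
    return None


for example in ("rs12345", "10_12345", "chr10:12345"):
    print(example, snp_id_format(example))
```

An ID that matches neither pattern returns None, which a caller could treat as an input error.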
NOTE: The parameters file must contain all of the above parameters. Each line in the parameters file must have a parameter name, followed by =, followed by the value for that parameter. A parameter that is not relevant to the run can be given any value compatible with its type.
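To illustrate the name = value layout, here is a minimal sketch of how such a file could be read. It is not the method's own parser: parse_params is a hypothetical helper, the sample values are made up, and whether string values are quoted in a real parameters file is an assumption this sketch sidesteps by accepting both forms.

```python
import ast


def parse_params(text):
    """Parse 'NAME = value' lines into a dict (hypothetical helper)."""
    params = {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"line {line_number}: expected 'NAME = value'")
        value = value.strip()
        try:
            # Handles bool, int, float, dict and quoted strings.
            params[name.strip()] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # Fall back to the raw text, e.g. an unquoted path.
            params[name.strip()] = value
    return params


# Made-up values, shown only to illustrate the four value types.
sample_params = """ROOT_DIR = /data/arrays
IS_MASKED = True
PROB_THRESH = 0.9
GROUPS_TO_REMOVE = {1: ['group1', 'group2'], 2: []}
"""

params = parse_params(sample_params)
print(params["GROUPS_TO_REMOVE"])
```

Using ast.literal_eval rather than eval keeps the sketch from executing arbitrary code found in the file.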
The directory given by ROOT_DIR must contain a folder named array1. That folder must contain a genetic data file whose name is the value of BEAGLE_VCF_FILE, and may also contain an ancestry file whose name is the value of VIT_FBK_TSV_FILE.
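The expected folder layout can be verified before launching a run. The sketch below is an assumption-laden illustration: find_inputs is a hypothetical helper, and it matches files by comparing the name without its last extension to the parameter value, so a doubly suffixed file such as a compressed VCF would not be matched.

```python
from pathlib import Path


def find_inputs(root_dir, genetic_stem, ancestry_stem=None):
    """Locate input files under <root_dir>/array1 (hypothetical helper).

    Returns (genetic_files, ancestry_files) as sorted lists of paths.
    """
    array_dir = Path(root_dir) / "array1"
    if not array_dir.is_dir():
        raise FileNotFoundError(f"missing folder: {array_dir}")

    def matches(stem):
        # Compare the file name without its last extension to the stem.
        return sorted(
            p for p in array_dir.iterdir() if p.is_file() and p.stem == stem
        )

    genetic = matches(genetic_stem)
    if not genetic:
        raise FileNotFoundError(f"no genetic data file named {genetic_stem}.*")
    ancestry = matches(ancestry_stem) if ancestry_stem else []
    return genetic, ancestry
```

A call such as find_inputs(params["ROOT_DIR"], params["BEAGLE_VCF_FILE"], params["VIT_FBK_TSV_FILE"]) would then fail early if array1 or the genetic data file is missing, while an absent ancestry file simply yields an empty list.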