mdPCA

Missing DNA PCA algorithm for ancestry masked or ancient genomic data.

Usage

Run the method from the command line with the following command:

python3 covariance_matrix_method.py params.txt

params.txt is the parameters file that is passed as input to the method. The following parameters can be specified in the parameters file:

  • ROOT_DIR (str): path to the directory of array folders containing input files.
  • BEAGLE_VCF_FILE (str): name of the genetic data file without the file extension. It can be a Beagle / VCF file.
  • IS_MASKED (bool): True if an ancestry file is passed for ancestry-specific masking, or False otherwise.
  • VIT_FBK_TSV_FILE (str): name of the ancestry file without the file extension. It can be a VIT / FBK / TSV file.
  • NUM_ANCESTRIES (int): the total number of ancestries in the ancestry file.
  • ANCESTRY (int): index of the ancestry for which dimensionality reduction is to be performed. Ancestry numbering starts at 0 if the ancestry file is a TSV file, and at 1 if it is a VIT or FBK file.
  • PROB_THRESH (float): minimum probability threshold for a SNP to belong to an ancestry, if the ancestry file is an FBK file or an FB TSV file.
  • AVERAGE_PARENTS (bool): True if the DNAs from the two parents are to be combined (averaged) for each individual, or False otherwise.
  • IS_WEIGHTED (bool): True if weights are provided in the labels file, or False otherwise.
  • LABELS_FILE (str): path to the labels file. It should be a TSV file where the first column has header indID and contains the individual IDs, and the second column has header label and contains the labels / groups for all individuals. If IS_WEIGHTED is specified as True, then the file must have an additional column with header weight. The weight column must contain the weights for each individual. This file can also have 2 optional columns with headers combination and combination_weight, if sets of individuals are to be combined to form combined individuals. The combination column must contain the combining groups for each individual and the combination_weight column must contain the weights for those combinations.
    NOTE: Individuals with positive weights are weighted accordingly. Individuals with zero weight are removed. Negative weights are not acceptable. Provide a combination value of 0 for individuals not to be combined in any group, and 1 for individuals of 1st combined group, 2 for 2nd combined group and so on. Each set of individuals that is to be combined must have the same label and combination_weight (if provided). If combination_weight column is not provided, the combinations are assigned a default weight of 1.
  • GROUPS_TO_REMOVE (dict): dictionary specifying the groups that are to be removed for each array. The keys in the dictionary are the array numbers and the values are the list of corresponding groups that are to be removed for that array. For example, {1: ['group1', 'group2'], 2: [], 3: ['group3']}
  • MIN_PERCENT_SNPS (float): minimum percentage of SNPs that must be known (unmasked) in an individual for that individual to be included in the method. Individuals with a lower percentage of unmasked SNPs than this threshold are dropped.
  • SAVE_MASKS (bool): True if the masked matrices generated by gen_tools are to be saved in a .npz file, or False otherwise.
  • LOAD_MASKS (bool): True if the masked matrices are to be loaded directly from a .npz file, or False otherwise.
  • MASKS_FILE (str): path to the .npz file. The masked matrices are saved to this file / loaded from this file.
  • OUTPUT_FILE (str): path to the output file, to which the output of the run is written. It is a TSV file with 3 columns. The first column contains the individual IDs, and the second and third columns contain the ancestry-specific projections obtained after dimensionality reduction using PCA.
  • SCATTERPLOT_FILE (str): path to the scatter plot file with .html extension. The scatter plot of the individuals is saved in this file.
  • SAVE_COVARIANCE_MATRIX (bool): True if the covariance matrix is to be saved as a binary file, or False otherwise.
  • COVARIANCE_MATRIX_FILE (str): path to the covariance matrix file. The covariance matrix is saved in this file.
  • NUM_DIMS (int): number of dimensions to save in the output file (see OUTPUT_FILE).
  • RSID_OR_CHROMPOS (int): 1 if the SNP ID format in the Beagle / VCF file is rsID, or 2 if it is Chromosome_position.
  • METHOD (int): number of the method to be run, chosen from the 5 available options.
  • PERCENT_VALS_MASKED (float): percent of values in the covariance matrix to be masked and then imputed, if METHOD is specified as 3 or 4.
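
To illustrate the LABELS_FILE rules listed above, here is a minimal sketch that reads a labels file and applies the weight rules (zero weights removed, negative weights rejected). The helper name read_labels is hypothetical and not part of mdPCA, and the defaults used when a column is absent (weight 1 when IS_WEIGHTED is False, combination 0 when there is no combination column) are assumptions. The parsing in covariance_matrix_method.py may differ.

```python
import csv
import io


def read_labels(handle, is_weighted):
    """Hypothetical helper (not part of mdPCA): read a labels TSV and apply the weight rules."""
    reader = csv.DictReader(handle, delimiter="\t")
    required = ["indID", "label"] + (["weight"] if is_weighted else [])
    missing = [name for name in required if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"labels file is missing column(s): {missing}")

    rows = []
    for row in reader:
        # Assumption: every individual gets weight 1 when IS_WEIGHTED is False.
        weight = float(row["weight"]) if is_weighted else 1.0
        if weight < 0:
            raise ValueError(f"negative weight for individual {row['indID']}")
        if weight == 0:
            continue  # individuals with zero weight are removed
        rows.append({
            "indID": row["indID"],
            "label": row["label"],
            "weight": weight,
            # Assumption: 0 ("not combined") when the optional column is absent.
            "combination": int(row.get("combination") or 0),
            # Combinations get a default weight of 1 when the column is absent.
            "combination_weight": float(row.get("combination_weight") or 1),
        })
    return rows


# Illustrative data only; the IDs and labels are made up.
example = io.StringIO(
    "indID\tlabel\tweight\n"
    "ind1\tgroupA\t1.0\n"
    "ind2\tgroupA\t0\n"
    "ind3\tgroupB\t2.5\n"
)
labels = read_labels(example, is_weighted=True)
print([row["indID"] for row in labels])
```

In this example ind2 has zero weight, so only ind1 and ind3 are kept.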

NOTE: There are 2 acceptable formats for SNP indices in the VCF / Beagle file:

  1. rsid: rs followed by the id (int). For example, rs12345.
  2. position: chromosome number (int) followed by _, followed by the position (int). For example, 10_12345.
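
The two SNP ID formats above can be told apart with a pair of regular expressions. The sketch below is only an illustration: the function name snp_id_format is hypothetical, and it returns 1 or 2 to mirror the RSID_OR_CHROMPOS parameter rather than to reproduce how mdPCA itself reads the IDs.

```python
import re

# rs followed by an integer id, e.g. rs12345
RSID_PATTERN = re.compile(r"rs[0-9]+")
# integer chromosome number, underscore, integer position, e.g. 10_12345
CHROMPOS_PATTERN = re.compile(r"[0-9]+_[0-9]+")


def snp_id_format(snp_id):
    """Hypothetical helper: return 1 for rsID, 2 for Chromosome_position."""
    if RSID_PATTERN.fullmatch(snp_id):
        return 1
    if CHROMPOS_PATTERN.fullmatch(snp_id):
        return 2
    raise ValueError(f"unrecognised SNP ID format: {snp_id!r}")


print(snp_id_format("rs12345"))
print(snp_id_format("10_12345"))
```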

NOTE: The parameters file must contain all of the above parameters. Each line must consist of a parameter name, followed by =, followed by the value for that parameter. A parameter that is not used in the run can be given any value compatible with its type.
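
As a rough sketch of the NAME=value format described above, the snippet below parses a few example lines into a dictionary. It assumes that bool, int, float and dict values are written as Python literals and that anything else (such as a path) is a plain string. The function parse_params and all example values are illustrative and not taken from mdPCA, whose own parser may behave differently.

```python
import ast


def parse_params(text):
    """Hypothetical parser (not part of mdPCA): one NAME=value pair per line."""
    params = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"line {line_number}: expected NAME=value")
        value = value.strip()
        try:
            # Assumption: True/False, numbers and dicts are Python literals.
            params[name.strip()] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # Anything else (for example an unquoted path) is kept as a string.
            params[name.strip()] = value
    return params


# Illustrative values only; a real parameters file must list every parameter.
example = """\
ROOT_DIR=/path/to/data
IS_MASKED=True
NUM_ANCESTRIES=3
PROB_THRESH=0.9
GROUPS_TO_REMOVE={1: ['group1', 'group2'], 2: [], 3: ['group3']}
"""
params = parse_params(example)
print(params["GROUPS_TO_REMOVE"])
```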

Input Data Format

The directory given by the ROOT_DIR parameter must contain a folder named array1. That folder must contain a genetic data file whose name is the value of the BEAGLE_VCF_FILE parameter. It can also contain an ancestry file whose name is the value of the VIT_FBK_TSV_FILE parameter.
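
The layout above can be checked before a run with a short script. This is a sketch under assumptions: the helper find_input_files is hypothetical, and because the file names are given without their extensions, it simply accepts any extension found in array1. The file names in the demonstration are made up.

```python
import tempfile
from pathlib import Path


def find_input_files(root_dir, beagle_vcf_file, vit_fbk_tsv_file=None):
    """Hypothetical check (not part of mdPCA) of the ROOT_DIR/array1 layout."""
    array_dir = Path(root_dir) / "array1"
    if not array_dir.is_dir():
        raise FileNotFoundError(f"missing folder: {array_dir}")
    # Assumption: names are passed without extension, so match any extension.
    genetic = sorted(array_dir.glob(f"{beagle_vcf_file}.*"))
    if not genetic:
        raise FileNotFoundError(f"no genetic data file named {beagle_vcf_file} in {array_dir}")
    # The ancestry file is optional.
    ancestry = sorted(array_dir.glob(f"{vit_fbk_tsv_file}.*")) if vit_fbk_tsv_file else []
    return genetic, ancestry


# Build a throwaway directory with made-up file names to show the check.
with tempfile.TemporaryDirectory() as root:
    array_dir = Path(root) / "array1"
    array_dir.mkdir()
    (array_dir / "genotypes.vcf").touch()
    (array_dir / "ancestry.tsv").touch()
    genetic, ancestry = find_input_files(root, "genotypes", "ancestry")
    found = [path.name for path in genetic + ancestry]
print(found)
```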