armartin/ancestry_pipeline

Genetic Map Files (for RFMix)

cwarden45 opened this issue · 11 comments

Hi Alicia,

First, I thought it might be nice to point out that I was referred here from another site (for the chromosome painting portion of your scripts):

slowkoni/rfmix#9

Second, I noticed that you provide a link to download the genetic map files used by RFMix. I'm guessing those are within the .tgz files (which I am working on downloading).

Just out of curiosity, do you know of a direct link to those genetic mapping files? I was having some difficulty finding them within a subfolder of the FTP, but that doesn't mean they aren't there somewhere.

I would guess that you already know about it, but I did find this HapMap file (from a Biostars discussion):

https://ftp.hapmap.org/hapmap/recombination/2011-01_phaseII_B37/

However, I'm guessing what I can download from 1000 Genomes will cover more sites (but something like that smaller file is what I am looking for).

Best Wishes,
Charles

It is definitely an "unofficial" solution, but I have uploaded the 1000 Genomes genetic map files on Google Drive:

https://drive.google.com/open?id=1z1a961djGVH3zWaAkFithHLqHx8ZEnkt

So, this may save some time for other users.

Also, I will be tidy, and close this ticket :)

That said, if you follow the tutorial (rather than going directly to running SHAPEIT + rfmix on your samples), I see that you do need other files from the link that Alicia provides (which I'm not providing separately in the above link).

Hi Charles,
This is my first time working with genetic map files, and I saw the files for the different chromosomes on your drive. It would be a great help if you could give me an idea of how to get genetic map files for a particular population from the 1000 Genomes Project.

Hi @shikhaMeda - I don't do this on a regular basis, so I would have to refresh my memory about the specifics (and I may or may not be able to provide enough guidance to do that, exactly).

In other words, sometimes, you may need to use code to manually create a smaller reference file.
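As a rough illustration of what "using code to create a smaller reference file" can mean, here is a minimal Python sketch that picks the sample IDs for chosen populations out of a tab-delimited sample panel. The column names (`sample`, `pop`), the function name `select_samples`, and the toy rows are assumptions made up for illustration, so please check the header of whatever panel file you actually download.

```python
import csv
import io


def select_samples(panel_text, populations):
    """Return sample IDs whose population code is in `populations`.

    Assumes a tab-delimited panel with `sample` and `pop` columns
    (an assumption for this sketch; check your own file's header).
    """
    wanted = set(populations)
    reader = csv.DictReader(io.StringIO(panel_text), delimiter="\t")
    return [row["sample"] for row in reader if row["pop"] in wanted]


# Toy panel, invented for illustration (not real 1000 Genomes rows).
toy_panel = (
    "sample\tpop\tsuper_pop\n"
    "S1\tCEU\tEUR\n"
    "S2\tGBR\tEUR\n"
    "S3\tYRI\tAFR\n"
    "S4\tCEU\tEUR\n"
)

ceu_only = select_samples(toy_panel, ["CEU"])
ceu_gbr = select_samples(toy_panel, ["CEU", "GBR"])
print(ceu_only)
print(ceu_gbr)
```

The resulting list of IDs could then be written to a text file and passed to a tool that accepts a sample list (for example, I believe `bcftools view -S` can subset a VCF that way), but I have not tested that as part of this sketch.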

However, I think sometimes selecting specific subsets can be handled by some programs.

For example, when running STITCH, I think these are the relevant commands from the following programs:

run_STITCH-REF286.R:

		STITCH(
			bamlist = bam_list,
			sampleNames_file = name_list,
			outputdir = output_folder,
			method = "diploid",
			regenerateInput = TRUE,
			regionStart = STITCH_start,
			regionEnd = STITCH_end,
			buffer = buffer,
			niterations = 1,
			chr = chr,
			reference_populations = c("CEU", "GBR", "ACB"), # this is a reference set of 286 samples
			# reference_populations = c("CEU"), # this is a reference set of 99 samples
			reference_haplotype_file = human_reference_haplotype_file,
			reference_sample_file = human_reference_sample_file,
			reference_legend_file = human_reference_legend_file,
			posfile = human_posfile,
			shuffleHaplotypeIterations = NA,
			refillIterations = NA,
			K = human_K, tempdir = temp_folder, nCores = 1, nGen = human_nGen)

run_STITCH-REF99-DOWN2.R:

		STITCH(
			bamlist = bam_list,
			sampleNames_file = name_list,
			outputdir = output_folder,
			method = "diploid",
			regenerateInput = TRUE,
			regionStart = STITCH_start,
			regionEnd = STITCH_end,
			buffer = buffer,
			niterations = 1,
			chr = chr,
			downsampleFraction = 0.5,
			# reference_populations = c("CEU", "GBR", "ACB"), # this is a reference set of 286 samples
			reference_populations = c("CEU"), # this is a reference set of 99 samples
			reference_haplotype_file = human_reference_haplotype_file,
			reference_sample_file = human_reference_sample_file,
			reference_legend_file = human_reference_legend_file,
			posfile = human_posfile,
			shuffleHaplotypeIterations = NA,
			refillIterations = NA,
			K = human_K, tempdir = temp_folder, nCores = 1, nGen = human_nGen)

I think you could also change the reference set for GLIMPSE, but its computational requirements were lower, so I don't think I did that as often.

So, I am sorry that this is not precisely an answer to your question, but I hope that might help.

I don't think that I can guarantee being able to help you, but do you have a specific point in Alicia's scripts that you are having difficulty with? If she can't provide assistance, then I can see if that was something I also had a hard time with (and whether I found a solution, when I have a chance to sort through what I did).

You are welcome, @shikhaMeda.

However, to be clear, STITCH is used for imputation from low-coverage WGS (lcWGS) data. What I am trying to say is that your downstream program may have built-in ways to filter for certain samples (with commonly used file formats), if your eventual goal is to use something like STITCH.

I am not suggesting that you use STITCH to subset reference samples to use for a different program. I think that might not work, unless temporary files are saved (in a format that is usable for other programs).

Also, I see that Alicia has left a comment at the top of the README:

"While I hope these scripts continue to be helpful, I am no longer maintaining or developing them, so consider code as is."

So, she may not respond to your question. However, I think specifying a point in the scripts on this repository may be the best chance of getting assistance. If different than any other issues, you could try submitting a new issue (to see if the general public can respond).

If focusing on Alicia's scripts, did you see this set of demo files to download?

https://www.dropbox.com/sh/zbwka9u09f73gwo/AABc6FNl9fVBPjby8VQWzyeXa?dl=0

I initially overlooked that, but I think it was a helpful starting point.

Oh, OK. Thank you for the explanation, @shikhaMeda.

For maximal reproducibility, I think you probably should reference the larger file here:

https://ftp.hapmap.org/hapmap/recombination/2011-01_phaseII_B37/

However, Alicia's scripts (and, I think, other scripts) only require a subset of that large file. The original file is a larger download, but you do need to understand that it is where the smaller files come from (and how they were created).

For example, I did not create any of those files myself. That is important to make clear, but I don't think I can (or should) say too much more on the discussion about running the ancestry scripts from Alicia (if that is not what you are trying to run).
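To make "a subset of that large file" a bit more concrete, here is a minimal Python sketch of one common way to get genetic positions for only the sites you care about: linear interpolation between the rows of a larger map. This is not a description of how the files in this repository were created. The four-column layout (chromosome, position in bp, rate in cM/Mb, cumulative map position in cM), the function names, and the toy values are all assumptions for illustration.

```python
from bisect import bisect_left


def parse_map(text):
    """Parse (position in bp, cumulative cM) pairs, skipping the header line.

    Assumes whitespace-delimited columns: chromosome, position (bp),
    rate (cM/Mb), map (cM). This layout is an assumption for the sketch.
    """
    positions, cms = [], []
    for line in text.strip().splitlines()[1:]:
        fields = line.split()
        positions.append(int(fields[1]))
        cms.append(float(fields[3]))
    return positions, cms


def interpolate_cm(pos, positions, cms):
    """Linearly interpolate the cM value at `pos`; clamp outside the map."""
    if pos <= positions[0]:
        return cms[0]
    if pos >= positions[-1]:
        return cms[-1]
    i = bisect_left(positions, pos)
    if positions[i] == pos:
        return cms[i]
    x0, x1 = positions[i - 1], positions[i]
    y0, y1 = cms[i - 1], cms[i]
    return y0 + (y1 - y0) * (pos - x0) / (x1 - x0)


# Toy map, invented for illustration (not real HapMap values).
toy_map = """Chromosome Position(bp) Rate(cM/Mb) Map(cM)
chr1 1000 1.0 0.0
chr1 2000 2.0 0.001
chr1 4000 0.5 0.005
"""

positions, cms = parse_map(toy_map)
site_cm = [interpolate_cm(p, positions, cms) for p in (500, 1500, 3000, 4000)]
print(site_cm)
```

Sites outside the range of the map are clamped to the first or last cM value here, which is a simplification; you may prefer to drop such sites instead.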

I hope that helps!