chr1swallace/ibd-ets2-analysis

Query on input files (formats)


Hi Dr Wallace,

I read your recent paper (A disease-associated gene desert directs macrophage inflammation through ETS2) and am specifically interested in running the SuSiE analysis.

I have a couple of questions from reading through the code (this repo was linked in the paper).

The targets file reads in some data, specifically:

ibd_raw = read_raw("IBD_DeLange_28067908_1-hg38.tsv.gz")
psc_raw = read_raw( "PSC_Ji_27992413_1-hg38.tsv.gz")
as_raw = read_raw( "ANS_Cortes_23749187_1-hg38.tsv.gz")

It wasn't clear to me what these files are (what they contain and their format), but judging by the read_raw function, they look like some sort of variant summary files?
As a workaround, I downloaded data from EBI (assuming that the number in each file name is the PubMed ID):

https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST004001-GCST005000/GCST004131/harmonised/28067908-GCST004131-EFO_0003767.h.tsv.gz
https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST004001-GCST005000/GCST004030/harmonised/27992413-GCST004030-EFO_0004268.h.tsv.gz
https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST005001-GCST006000/GCST005529/harmonised/23749187-GCST005529-EFO_0003898.h.tsv.gz

Is this roughly correct?
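To make that assumption explicit, this is how I am reading the file names. It is only a minimal sketch: `parse_gwas_filename` is a hypothetical helper of mine, and the field meanings (trait, first author, PubMed ID, an index, genome build) are guesses from the names alone, not something documented in the repo.

```ruby
# Hypothetical helper: split a name such as "IBD_DeLange_28067908_1-hg38.tsv.gz"
# into its apparent parts. The field meanings are assumptions, not documented.
FILENAME_PATTERN =
  /\A(?<trait>[A-Za-z0-9]+)_(?<author>[A-Za-z]+)_(?<pmid>\d+)_(?<index>\d+)-(?<build>hg\d+)\.tsv\.gz\z/

def parse_gwas_filename(name)
  m = FILENAME_PATTERN.match(name)
  return nil unless m # name does not follow the assumed pattern

  { trait: m[:trait], author: m[:author], pmid: m[:pmid],
    index: m[:index], build: m[:build] }
end

# The three file names passed to read_raw in the targets file.
%w[
  IBD_DeLange_28067908_1-hg38.tsv.gz
  PSC_Ji_27992413_1-hg38.tsv.gz
  ANS_Cortes_23749187_1-hg38.tsv.gz
].each do |f|
  info = parse_gwas_filename(f)
  puts "#{f}: assumed PubMed ID #{info[:pmid]}, build #{info[:build]}"
end
```

The numbers extracted this way are the same ones that appear at the start of the EBI file names above, which is why I picked those three downloads.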

I also got a bit stuck on get_mafld(), which calls a Ruby script that uses vcftools:

vcffile="#{DIR}/reference/byblock/#{block}.vcf.gz"
puts "subsetting vcf file #{vcffile}, searching for #{lines.length} SNPs"
samplefile="/home/cew54/share/Data/reference/1000GP_Phase3/sparse_basis/EUR.sample" # for now, can make an option later
command = "zcat #{vcffile} | " +
"sed 's/^##fileformat=VCFv4.3/##fileformat=VCFv4.2/' | " +
"#{ENV['HOME']}/localc/bin/vcftools " +
" --gzvcf - " +
" --remove-indels --recode --remove-filtered-all --keep #{samplefile} " +
" --positions #{infile} --stdout > #{vcftemp} "

Just to double-check that I am interpreting this correctly:
You have a reference directory containing VCF files (derived from 1000 Genomes Phase 3?), split per chromosome and "block" (e.g. chr21_block16), plus separate sample files listing individuals at different levels (individual populations such as GBR, or superpopulations such as EUR), which you use to subset the VCF file?
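
Put differently, this is roughly what I think the snippet does, written as a standalone sketch. `build_subset_command` is a hypothetical helper and every path in the example call is a placeholder I made up; only the pipeline shape and the vcftools flags are taken from the Ruby code quoted above. It builds the command string without running it.

```ruby
require "shellwords"

# Hypothetical helper: assemble (but do not run) the per-block subsetting
# command. Directory layout and file names are assumptions on my side.
def build_subset_command(ref_dir:, block:, sample_file:, positions_file:,
                         out_file:, vcftools: "vcftools")
  vcf = File.join(ref_dir, "byblock", "#{block}.vcf.gz")
  pipeline = [
    "zcat #{Shellwords.escape(vcf)}",
    # Downgrade the header version, as in the quoted snippet.
    "sed 's/^##fileformat=VCFv4.3/##fileformat=VCFv4.2/'",
    "#{Shellwords.escape(vcftools)} --gzvcf - " \
      "--remove-indels --recode --remove-filtered-all " \
      "--keep #{Shellwords.escape(sample_file)} " \
      "--positions #{Shellwords.escape(positions_file)} --stdout"
  ]
  pipeline.join(" | ") + " > #{Shellwords.escape(out_file)}"
end

puts build_subset_command(
  ref_dir: "reference",          # placeholder for "#{DIR}/reference"
  block: "chr21_block16",        # one chromosome + block
  sample_file: "EUR.sample",     # individuals to keep (superpopulation)
  positions_file: "snps.txt",    # placeholder: positions of the SNPs to extract
  out_file: "subset.vcf"         # placeholder for the temporary VCF
)
```

If that matches your setup, swapping `EUR.sample` for a population-level file such as a GBR sample list would be the only change needed to subset to a different group.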