thierrygosselin/stackr

Some problem around colony input files

jblamyatifremer opened this issue · 7 comments

Dear Thierry,

I am using your package to input a large amount of data to Colony. Our package is the unique way to pass data from stack to colony.

1- First, i use a awk script to filtred out locus without polymorphisms
awk '$2 > 0 {print}' "$src_root"/14_reassignation/input/batch_2.haplotypes.tsv > "$src_root"/14_reassignation/input/TRIM.haplotypes.tsv

2- I use your R package to feed the haplo2colony fonction

res <- haplo2colony("/media/XXX/TRIM.haplotypes.tsv"
, blacklist.id = NULL, whitelist.loci = NULL,
sample.markers = 5, 1, 2, pop.select = "all",
allele.freq = FALSE, inbreeding = 0, mating.sys.males = 0,
mating.sys.females = 0, clone = 0, run.length = 2, analysis = 1,
allelic.dropout = 0, error.rate = 0.02, print.all.colony.opt = FALSE,
imputations = FALSE, imputations.group = "populations", num.tree = 100,
iteration.rf = 10, split.number = 100, verbose = TRUE,
parallel.core = 2, filename = "/home/XXX/colony/colony2_v1.dat")
3- I rename the colony2_v1.dat to colony2.dat into the colony directory
4- I got an error when using colony2s.ifort.out with the
jean-baptiste@ordi[colony] mv ./colony2_v1.dat ./colony2.dat [ 6:43]
jean-baptiste@ordi[colony] ./colony2s.ifort.out [ 6:43]

COLONY, Version 2.0.6.2, Build 20160825, Expire Date 20180825
Copyright (C) by Jinliang Wang, Institute of Zoology, Zoological Society of London
Email: jinliang.wang@ioz.ac.uk

Opening & reading data input file: colony2.dat
Marker 2 has the same ID, 169, as marker 1
Errors in DATA. Insufficient data or incorrect format.
Please check DATA and format and then re-run the program
Program stopped in subroutine StopOnDataError

5- After looking into the colony manual user, in the attached file (i modified the extension)
colony2.txt
line 23, the loci name (header) is duplicated... After deleting all duplicates by hand I got a new (and more severe error). :

jean-baptiste@ordi[jean-baptiste] cd ~/colony [ 6:36]
jean-baptiste@ordi[colony] ./colony2s.ifort.out [ 6:36]

COLONY, Version 2.0.6.2, Build 20160825, Expire Date 20180825
Copyright (C) by Jinliang Wang, Institute of Zoology, Zoological Society of London
Email: jinliang.wang@ioz.ac.uk

Opening & reading data input file: colony2.dat
Reading offspring genotype data...
forrtl: Is a directory
forrtl: severe (30): open failure, unit 10, file /home/jean-baptiste/colony/
Image PC Routine Line Source
colony2s.ifort.ou 0000000000633E04 Unknown Unknown Unknown
colony2s.ifort.ou 00000000006493AB Unknown Unknown Unknown
colony2s.ifort.ou 000000000042AE18 Unknown Unknown Unknown
colony2s.ifort.ou 0000000000423E26 Unknown Unknown Unknown
colony2s.ifort.ou 0000000000401EF6 Unknown Unknown Unknown
colony2s.ifort.ou 0000000000401E7E Unknown Unknown Unknown
colony2s.ifort.ou 00000000006E47A4 Unknown Unknown Unknown

Since colony2 inputs are quite plainfull to build-up, i will be very happy to have any inputs.

JB

Bonjour JB, je regarde le problème en fin de journée, heure du Québec.
merci!
Thierry

stackr v.0.4.6:
haplo2colony is now deprecated in favour of write_colony that uses more sophisticated codes that enables more input file formats. See the new version commit, or the function description for details.

try something simple first, with your batch_2.haplotypes.tsv directly (the function takes care of monomorphic markers. e.g.:

setwd("/media/XXX/")

test <- stackr::write_colony(data = "TRIM.haplotypes.tsv", strata = "you need a strata file here")

The strata file is described in the argument of the function write_colony, it's basically a STACKS population.map with headers.

Look at the new colony file in your working directory.
Try that one with COLONY if it works, then you can try out different argument parameters and imputations.

Cheers
Thierry

Dear Thierry,

I have tried your new stackR version, thanks for the quick response.

I got an error using your newest function write_colony (i provide my vcf and tsv file). I quickly look into your R code to find where the problem could come from... I did not find yet. I should spend more time on this (after 6 of december). Maybe you will be quicker.

I tried both type of input (.vcf and .tsv) with population map following the stack format. Below the outputs :

strataa <- read.table("/media/jean-baptiste/Passport0_5/002_PROJETS_CODES/RADSeq_HYSEA/14_reassignation/input/population_strata.csv", sep=",",stringsAsFactors = FALSE,header=TRUE)

write_colony("/media/jean-baptiste/Passport0_5/002_PROJETS_CODES/RADSeq_HYSEA/07_mendel_error_exploration/output/batch_2_3.vcf"

  •           , strata = strataa, pop.levels = NULL, pop.labels = NULL,
    
  •          blacklist.id = NULL, blacklist.genotype = NULL,
    
  •          whitelist.markers = NULL, monomorphic.out = TRUE, snp.ld = NULL,
    
  •          common.markers = TRUE, maf.thresholds = NULL, maf.pop.num.threshold = 1,
    
  •          maf.approach = "SNP", maf.operator = "OR", max.marker = NULL,
    
  •          sample.markers = NULL, pop.select = "all", allele.freq = "overall",
    
  •          inbreeding = 0, mating.sys.males = 0, mating.sys.females = 0,
    
  •          clone = 0, run.length = 1, analysis = 1, allelic.dropout = 0,
    
  •          error.rate = 0.02, print.all.colony.opt = FALSE,
    
  •          imputation.method = NULL, impute = "genotype",
    
  •          imputations.group = "populations", num.tree = 100, iteration.rf = 10,
    
  •          split.number = 100, verbose = TRUE,
    
  •          parallel.core = parallel::detectCores() - 1, filename = "/home/jean-baptiste/colony/colony2_v2.dat")
    

#######################################################################
######################## stackr::write_colony ########################
#######################################################################
File type: vcf.file
Importing data...
Error in if (biallelic > 4) { : missing value where TRUE/FALSE needed

write_colony("/media/jean-baptiste/Passport0_5/002_PROJETS_CODES/RADSeq_HYSEA/14_reassignation/input/TRIM.haplotypes.tsv"

  •           , strata = strataa, pop.levels = NULL, pop.labels = NULL,
    
  •          blacklist.id = NULL, blacklist.genotype = NULL,
    
  •          whitelist.markers = NULL, monomorphic.out = TRUE, snp.ld = NULL,
    
  •          common.markers = TRUE, maf.thresholds = NULL, maf.pop.num.threshold = 1,
    
  •          maf.approach = "SNP", maf.operator = "OR", max.marker = NULL,
    
  •          sample.markers = NULL, pop.select = "all", allele.freq = "overall",
    
  •          inbreeding = 0, mating.sys.males = 0, mating.sys.females = 0,
    
  •          clone = 0, run.length = 1, analysis = 1, allelic.dropout = 0,
    
  •          error.rate = 0.02, print.all.colony.opt = FALSE,
    
  •          imputation.method = NULL, impute = "genotype",
    
  •          imputations.group = "populations", num.tree = 100, iteration.rf = 10,
    
  •          split.number = 100, verbose = TRUE,
    
  •          parallel.core = parallel::detectCores() - 1, filename = "/home/jean-baptiste/colony/colony2_v2.dat")
    

#######################################################################
######################## stackr::write_colony ########################
#######################################################################
File type: haplo.file
Importing data...
Error in enc2utf8(col_names(col_labels, sep = sep)) :
argumemt is not a character vector

I juste change the extension from .tsv to .txt (the vcf file is to big) (Github only accepts conventional extensions

for attached files).

Cheers,
JB

PS : N'hesite pas à m'envoyer un MP si tu veux des precisions. Je peux t'envoyer le vcf sur filesender.
TRIM.haplotypes.txt

Ok send me the strata file please
I'll test your haplotypes file

Hi JB,

Try:

Test1: as mentioned above

test <- stackr::write_colony(data = "TRIM.haplotypes.tsv", strata = "you need a strata file here")

This works or not ?

Test2 with strata file already in the global environment:

strata <- readr::read_tsv(file = "strata.test.colony.tsv")
stackr::write_colony(data = "TRIM.haplotypes.txt", strata = strata)

This should also work

Comments

  • You don't have to specify the other arguments if your using the defaults.
  • Make sure you read carefully the argument of write_colony.
  • The argument pop.select:
    • the default value is pop.select = NULL
    • all is not an option.
    • you don't have any grouping, so leave it to the default so that write_colony takes care of it
    • however, if you need to specify a population use the save value as in your STRATA column.

Cheers
Thierry

Dear Thierry

Test1: as mentioned above
test <- stackr::write_colony(data = "TRIM.haplotypes.tsv", strata = "you need a strata file here")

My previous command was messy with non-existent option as you highlighted.
I was able to get colony input from the command above but Colony throw me an error (I do not have my linux station with me to reproduce the error message). Roughly, colony was complained about the format and the amount of data.

But i re-tried the same command with a ".vcf" from the same dataset (with the same strata file), No more error and Colony is still working on it.
That is good...

Tomorow, i tried again to get a colony input with stackr from the ".tsv" to figure out what is going on.

I also see that you are involved in Mapcomp... I will use it very soon. bravo for your work it is very usefull !

Test2 with strata file already in the global environment:
strata <- readr::read_tsv(file = "strata.test.colony.tsv")
stackr::write_colony(data = "TRIM.haplotypes.txt", strata = strata)