This project includes two scripts that are useful for processing OBITools output tables. The main MergeAndFilter.R R script allows for merging of different OBITools identifications (assuming its based on the same barcode sequences) and applying additional filtering to the results.
The R script requires additional input in the form of a sequence count table, which can be generated from the demultiplexed and collapsed .fasta files produced by OBITools with the ExtractSampleCountData.py Python script.
The MergeAndFilter.R script fulfils two important roles, merging different OBITools outputs generated by different reference databases and applying additional filtering to the results.
The script can be run from the command line, or manually in R. Regardless, it requires a number of input files:
-
At Least one OBITools output .TSV table, which can be generated with the
obitab
function. -
A sequence count .TSV table containing the "raw" demultiplexed sequences counts for each sample. This table can be generated with the ExtractSampleCountData.py script.
-
A blacklist for synthetic sequences used in the experiment. See the blacklist section for details regarding formatting.
-
A blacklist for contaminant sequences.
The full command is as followed:
MergeAndFilter.R [number of input files] [OBITools output x] [OBITools table x name] [count file] [synthetic blacklist file] [regional blacklist file] [base output name] [minimum identity (1)] [minimum number of reads per observation (3)] [minimum total reads (10)] [minimum total observations (3)]
The values within parentheses are default values.
The first argument is the number of input OBITools tables (minimum 1), followed by the path the table and the dataset name. When multiple tables are provided they will be grouped as path and table pairs. For three input files the first part of the command would look like:
MergeAndFilter.R 3 table1.tsv name1 table2.tsv name2 table3.tsv name3
The last four settings work on a per barcode basis and function as followed:
-
If the minimum identity value (0-1, default 1) is true for any OBITools input set, the barcode sequence is kept in.
-
For each observation, at least 3 (default) reads need to be present, lower counts are set to 0.
-
Each barcode needs at least 10 (default) reads across the entire dataset. Barcodes with a lower total are removed.
-
Each barcode needs at least 3 (default) observations across the entire dataset. Barcodes with lower counts are removed.
The MergeAndFilter.R script requires that the samples and repeats are strictly formatted in order to correctly identify which repeats belong to the same sample. It expects that the sample names are identical between the repeats and that the repeat number is always at the end of the name. The following sample names and repeats will be parsed correctly:
samplename_1
samplename_2
run1_sample2_repeat1
run1_sample2_repeat2
sample3-run2_1
sample3-run2_2
While these names and repeats will cause issues:
1_samplename
2_samplename
run1_repeat1_sample2
run1_repeat2_sample2
sample-three_run2_1
sample3_run2_2
The MergeAndFilter.R script produces four output files:
-
A
_filtered.tsv
output file. This contains the merged and filtered data. The format follows that of the input OBITools files, but excludes the obiclean columns. -
The
_summary.tsv
output file. Similar format as the_filtered.tsv
file, but for each sample the repeat data is summarized into 6 categories:-
totread
: contains the total reads for the sample. -
avgpread
: the average read count across the repeats. -
sdpread
: the read count standard deviation across the repeats. -
totrep
: the number of repeats for the sample that contain the barcode. -
proprep
: the above repeat count converted to a proportion. -
weightrep
: the weighted proportional repeat count. Weighting is based on the total filtered read abundance for each repeat. For example:
repeat 1 repeat 2 repeat 3 proprep weightrep total sum 250 500 250 proportion 0.25 0.5 0.25 barcode 1 present present present 1 1 barcode 2 present present absent 0.66 0.75 barcode 3 absent present absent 0.33 0.5 The weighted proportion prioritizes presences in repeats with greater sequencing depths.
-
-
The
_samplestat.tsv
file. This table contains a more generalized summary on a per sample and repeat basis. For each sample, the file contains the following:-
Raw reads per repeat and the average and standard deviation.
-
The proportion of reads that was filtered out by the technical filtering steps. (OBITools PCR error correction and the minimum read and repeat counts in the MergeAndFilter.R script.
-
The proportion of reads filtered out by the minimum identity criteria.
-
The proportion of reads filtered out according to the blacklists.
-
The proportion of reads that was retained after filtering.
-
The average replication of the 10 most read dominant taxa both prior
avg_rep
and postavg_filt_rep
filtering, along with the standard deviation. These two numbers provide insight on how replicable the data is. Theoverlap
column contains the number of barcodes that appear in both categories (max 10). -
The average barcode lengths and standard deviations.
avg_seq_length
counts each barcode in a repeat as a separate sequence, while theavg_single_seq_length
values count a barcode once across all repeats for a sample.
-
-
The
_lengths.tsv
table. This file contains the barcode lengths for each sample and is summarized in the_samplestat.tsv
file. The barcode lengths are comma seperated and can be used for more detailed visualization of the lenght distributions.
The expected blacklists are in a .tsv format and look like:
sequence description
Both the synthetic and contaminant / regional blacklist filtering can be disabled by providing a file with nonsense sequences / descriptions.
The ExtractSampleCountData.py script parses through a collapsed OBITools .fasta file and the OBITools tag lookup table and produces a .tsv file that lists the number of "raw" sequences after demultiplexing. This file is used by the MergeAndFilter.R script in order to calculate what percentage of reads is retained or removed during the various filtering steps.
The collapsed OBITools .fasta file is produced by the OBITools obiuniq
tool.
The script can be ran with the following command:
ExtractSampleCountData.py [input collapsed fasta file] [tag lookup table] > [output table]
The output is a .tsv file containing the sample name and read counts. For example:
Sample name Read count
Sample1 1000
Sample2 500
Sample3 700