MergeAndFilter scripts

This project includes two scripts that are useful for processing OBITools output tables. The main MergeAndFilter.R R script allows for merging of different OBITools identifications (assuming its based on the same barcode sequences) and applying additional filtering to the results.

The R script requires additional input in the form of a sequence count table, which can be generated from the demultiplexed and collapsed .fasta files produced by OBITools with the ExtractSampleCountData.py Python script.

MergeAndFilter.R

The MergeAndFilter.R script fulfils two important roles, merging different OBITools outputs generated by different reference databases and applying additional filtering to the results.

Command

The script can be run from the command line, or manually in R. Regardless, it requires a number of input files:

At Least one OBITools output .TSV table, which can be generated with the obitab function.
A sequence count .TSV table containing the "raw" demultiplexed sequences counts for each sample. This table can be generated with the ExtractSampleCountData.py script.
A blacklist for synthetic sequences used in the experiment. See the blacklist section for details regarding formatting.
A blacklist for contaminant sequences.

The full command is as followed:

MergeAndFilter.R [number of input files] [OBITools output x] [OBITools table x name] [count file] [synthetic blacklist file] [regional blacklist file] [base output name] [minimum identity (1)] [minimum number of reads per observation (3)] [minimum total reads (10)] [minimum total observations (3)]

The values within parentheses are default values.

The first argument is the number of input OBITools tables (minimum 1), followed by the path the table and the dataset name. When multiple tables are provided they will be grouped as path and table pairs. For three input files the first part of the command would look like:

MergeAndFilter.R 3 table1.tsv name1 table2.tsv name2 table3.tsv name3

The last four settings work on a per barcode basis and function as followed:

If the minimum identity value (0-1, default 1) is true for any OBITools input set, the barcode sequence is kept in.
For each observation, at least 3 (default) reads need to be present, lower counts are set to 0.
Each barcode needs at least 10 (default) reads across the entire dataset. Barcodes with a lower total are removed.
Each barcode needs at least 3 (default) observations across the entire dataset. Barcodes with lower counts are removed.

Sample and repeat names

The MergeAndFilter.R script requires that the samples and repeats are strictly formatted in order to correctly identify which repeats belong to the same sample. It expects that the sample names are identical between the repeats and that the repeat number is always at the end of the name. The following sample names and repeats will be parsed correctly:

samplename_1
samplename_2

run1_sample2_repeat1
run1_sample2_repeat2

sample3-run2_1
sample3-run2_2

While these names and repeats will cause issues:

1_samplename
2_samplename

run1_repeat1_sample2
run1_repeat2_sample2

sample-three_run2_1
sample3_run2_2

Output

The MergeAndFilter.R script produces four output files:

A _filtered.tsv output file. This contains the merged and filtered data. The format follows that of the input OBITools files, but excludes the obiclean columns.

The _summary.tsv output file. Similar format as the _filtered.tsv file, but for each sample the repeat data is summarized into 6 categories:

totread: contains the total reads for the sample.
avgpread: the average read count across the repeats.
sdpread: the read count standard deviation across the repeats.
totrep: the number of repeats for the sample that contain the barcode.
proprep: the above repeat count converted to a proportion.
weightrep: the weighted proportional repeat count. Weighting is based on the total filtered read abundance for each repeat. For example:

	repeat 1	repeat 2	repeat 3	proprep	weightrep
total sum	250	500	250
proportion	0.25	0.5	0.25
barcode 1	present	present	present	1	1
barcode 2	present	present	absent	0.66	0.75
barcode 3	absent	present	absent	0.33	0.5

The weighted proportion prioritizes presences in repeats with greater sequencing depths.

The _samplestat.tsv file. This table contains a more generalized summary on a per sample and repeat basis. For each sample, the file contains the following:
- Raw reads per repeat and the average and standard deviation.
- The proportion of reads that was filtered out by the technical filtering steps. (OBITools PCR error correction and the minimum read and repeat counts in the MergeAndFilter.R script.
- The proportion of reads filtered out by the minimum identity criteria.
- The proportion of reads filtered out according to the blacklists.
- The proportion of reads that was retained after filtering.
- The average replication of the 10 most read dominant taxa both prior avg_rep and post avg_filt_rep filtering, along with the standard deviation. These two numbers provide insight on how replicable the data is. The overlap column contains the number of barcodes that appear in both categories (max 10).
- The average barcode lengths and standard deviations. avg_seq_length counts each barcode in a repeat as a separate sequence, while the avg_single_seq_length values count a barcode once across all repeats for a sample.
The _lengths.tsv table. This file contains the barcode lengths for each sample and is summarized in the _samplestat.tsv file. The barcode lengths are comma seperated and can be used for more detailed visualization of the lenght distributions.

blacklists

The expected blacklists are in a .tsv format and look like:

sequence description

Both the synthetic and contaminant / regional blacklist filtering can be disabled by providing a file with nonsense sequences / descriptions.

ExtractSampleCountData.py

The ExtractSampleCountData.py script parses through a collapsed OBITools .fasta file and the OBITools tag lookup table and produces a .tsv file that lists the number of "raw" sequences after demultiplexing. This file is used by the MergeAndFilter.R script in order to calculate what percentage of reads is retained or removed during the various filtering steps.

The collapsed OBITools .fasta file is produced by the OBITools obiuniq tool.

Command

The script can be ran with the following command:

ExtractSampleCountData.py [input collapsed fasta file] [tag lookup table] > [output table]

Output

The output is a .tsv file containing the sample name and read counts. For example:

Sample name    Read count
Sample1    1000
Sample2    500
Sample3    700

Y-Lammers/MergeAndFilter

MergeAndFilter scripts

MergeAndFilter.R

Command

Sample and repeat names

Output

blacklists

ExtractSampleCountData.py

Command

Output