We attempt to replicate the results of Baron et al., processing single-cell RNA-seq data to identify common and uncommon cell types in the human pancreas.
Daniel Gealow, Nikita Tomar, and David Lenci
Project 4 Report.pdf: Our final report.
count_barcodes.qsub: A shell script (to be submitted to the qsub queue)
that runs three instances of zcat in parallel, each streaming the
contents of one of the sample barcode fastq files to an instance of
count_barcodes.py.
count_barcodes.py: Receives a fastq file as piped input and counts the
number of occurrences of each barcode (the first 19 bp of the sequence)
in a Counter() dictionary, which is then saved to a pickle file.
plot_bc_counts.py: Plots the distribution of barcode counts in each
pickle file in two figures to help determine an appropriate filter
cutoff.
create_whitelist.py: Determines the set of barcodes that appear
at least 10^4.5 (about 31,623) times in any of the pickled counters, and
writes them to whitelist files. (The combined whitelist.txt file is the
one that we actually use in our further analysis.)
generate_index.qsub: Runs salmon index to generate an index from
the GENCODE v40 human reference transcriptome.
run_alevin.qsub: Runs salmon alevin to generate the UMI count matrix.
Requires the barcode and read 2 files for each of the three samples,
the whitelist.txt file, a transcript-to-gene mapping file (t2g_map.tsv),
and the index created by generate_index.qsub.
Programmer.R: Processes the alevin output, filters out low-quality genes,
reduces dimensionality, and performs clustering.
analyst_main.R: Contains the code for identifying potential cell markers,
labeling clusters, and then generating the clustered heatmap.
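
The barcode counting and whitelisting steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in count_barcodes.py or create_whitelist.py: the function names `count_barcodes` and `build_whitelist` are hypothetical, and it assumes standard four-line fastq records with the barcode in the first 19 bp of each sequence line.

```python
from collections import Counter

BARCODE_LEN = 19          # barcode length stated in this README
MIN_COUNT = 10 ** 4.5     # whitelist cutoff stated in this README (~31,623)


def count_barcodes(fastq_lines, barcode_len=BARCODE_LEN):
    """Count barcodes from an iterable of fastq lines (hypothetical helper).

    Assumes four lines per record, with the sequence on the second line.
    """
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # sequence line of each record
            counts[line.strip()[:barcode_len]] += 1
    return counts


def build_whitelist(counters, min_count=MIN_COUNT):
    """Return barcodes seen at least min_count times in any one counter."""
    whitelist = set()
    for counts in counters:
        whitelist.update(bc for bc, n in counts.items() if n >= min_count)
    return sorted(whitelist)
```

In a pipeline like the one described here, `count_barcodes` could be fed `sys.stdin` when the fastq file is piped in from zcat, and each resulting Counter could be saved with `pickle.dump` before the whitelist step loads the counters back.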