/mas-seq-paper-data

Data and additional information from the initial MAS-ISO-seq study, "High-throughput RNA isoform sequencing using programmable cDNA concatenation"

High-throughput RNA isoform sequencing using programmable cDNA concatenation

Abstract

Alternative splicing is a core biological process that enables profound and essential diversification of gene function. Short-read RNA sequencing approaches fail to resolve RNA isoforms and therefore primarily enable gene expression measurements, an isoform-unaware representation of the transcriptome. Conversely, full-length RNA sequencing using long-read technologies is able to capture complete transcript isoforms, but its utility is deeply constrained by throughput limitations. Here, we introduce MAS-ISO-seq, a technique for programmably concatenating cDNAs into single molecules optimal for long-read sequencing, boosting the throughput >15-fold to nearly 40 million cDNA reads per run on the Sequel IIe sequencer. We validated unambiguous isoform assignment with MAS-ISO-seq using a synthetic RNA isoform library and applied this approach to single-cell RNA sequencing of tumor-infiltrating T cells. Results demonstrated a >30-fold boost in the discovery of differentially spliced genes and robust cell clustering, as well as canonical PTPRC splicing patterns across T cell subpopulations and the concerted expression of the associated hnRNPLL splicing factor. Methods such as MAS-ISO-seq will drive discovery of novel isoforms and the transition from gene expression to transcript isoform expression analyses.

Authors

Aziz M. Al’Khafaji1*†, Jonathan T. Smith1*, Kiran V Garimella1*†, Mehrtash Babadi1*†, Moshe Sade-Feldman1,2, Michael Gatzen1, Siranush Sarkizova1, Marc A. Schwartz1,3,4, Victoria Popic1, Emily M. Blaum1,2, Allyson Day1, Maura Costello1, Tera Bowers1, Stacey Gabriel1, Eric Banks1, Anthony A. Philippakis1, Genevieve M. Boland5, Paul C. Blainey1,6,8,†, Nir Hacohen1,7,10,11,†

  1. Broad Institute of Harvard and MIT, Cambridge, MA, USA
  2. Department of Medicine, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
  3. Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, USA
  4. Division of Hematology/Oncology, Boston Children's Hospital, Boston, Massachusetts, USA
  5. Department of Pediatric Oncology, Dana Farber Cancer Institute, Boston, Massachusetts, USA
  6. Division of Surgical Oncology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
  7. Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
  8. Center for Cancer Research, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
  9. Koch Institute for Integrative Cancer Research at Massachusetts Institute of Technology, Cambridge, MA, USA
  10. Harvard Medical School, Boston, MA, USA
  11. Center for Immunology and Inflammatory Diseases, Massachusetts General Hospital, Charlestown, MA, USA

* - These authors contributed equally
† - Corresponding authors

Data

  • All data from this study are available online (or are in the process of being uploaded).

This study produced two datasets:

| Dataset | Number of Samples | Location |
| --- | --- | --- |
| Human tumor-infiltrating CD8+ T cells | 2 | Release in progress... |
| Spike-in RNA Variant Control Mix data (SIRVs set 4, Lexogen) | 2 | Terra*, FTP: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/MasSeqNatBiotech2021 |

* - The SIRV samples were prepared with two library preparation techniques: a length-10 MAS-ISO-seq array and a length-15 MAS-ISO-seq array. The two libraries were multiplexed into a single pooled sample and sequenced in a single run on a PacBio Sequel IIe. Our software package, Longbow, was then used to demultiplex the pooled sample into two outputs, one for the length-15 array and one for the length-10 array. These demultiplexed files are currently available in the Terra workspace.
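
For readers who want to browse the FTP location of the SIRV data programmatically, the following is a minimal Python sketch using only the standard library. The FTP URL is copied from the dataset table; the helper names `split_ftp_url` and `list_remote_files` are hypothetical and not part of this repository or of Longbow. The sketch assumes the server accepts the user name embedded in the URL with an empty password, which has not been verified here.

```python
from ftplib import FTP
from urllib.parse import urlparse

# FTP location copied from the dataset table in this README.
SIRV_FTP_URL = "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/MasSeqNatBiotech2021"


def split_ftp_url(url):
    """Split an ftp:// URL into (user, host, remote_dir)."""
    parts = urlparse(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    return parts.username, parts.hostname, parts.path or "/"


def list_remote_files(url, timeout=30):
    """Return the entry names in the remote directory (needs network access)."""
    user, host, remote_dir = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        # Assumption: the account in the URL accepts an empty password.
        ftp.login(user=user or "anonymous", passwd="")
        ftp.cwd(remote_dir)
        return ftp.nlst()


# Example call (commented out because it contacts the remote server):
# for name in list_remote_files(SIRV_FTP_URL):
#     print(name)
```

Listing the directory first makes it easier to see which files are present before downloading anything. The file names on the server are not documented here, so the sketch does not assume any.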

Terra Workspace Example

A Terra workspace with an example of how to process MAS-ISO-seq data can be found here:

This workspace is an example of how to segment and align MAS-ISO-seq data.

The data in this workspace are the same Spike-in RNA Variant Control Mix (SIRVs set 4, Lexogen) samples that we used as controls in the paper.

Pre-print of the Paper

A preprint of the paper can be found on bioRxiv here: High-throughput RNA isoform sequencing using programmable cDNA concatenation

Additional Analysis Scripts

Additional scripts and Jupyter notebooks used for the analyses and figures in the paper are located in the scripts directory.