This repository contains a series of pipelines that can be used for processing, analysing and exploratory analysis of single cell RNA sequencing transcriptomics data using a server running on a Grid Engine. The scripts can also be used on personal machines and/or in interactive mode or by adapting the shel scripts to be used on other cluster softwares.
These pipelines were developed for the study Decoding the development of the blood and immune systems during human fetal liver haematopoiesis and due to their wider application they are being used and further extended for other projects.
The bundle includes:
- tools for building data sets from multiple CellRanger count tables
- data annotation aids
- training machine learning cell type classifiers
- doublet removal
- computing data reduction coordinates like tSNE, UMAP, force directed graph and diffusion map
- trajectory analysis
- interactive plots as html pages
- batch correction
- signature genes
- cell type comparison metrics and plots
- creating animated force directed graphs
- handling big data sets as Seurat objects
To create mode advanced data exploration apps and web portal(s) visit https://github.com/DoruMP/Fast-data-portals-for-scRNAseq-data. This repository will be referred in this documentation as Fast Portals.
To run the pipelines you must download the entire bundle and transfer to a server/personal computer. The folder structure must be kept as it is.
- create the seurat object (seurat_from_count_tables.sh)
- train a doublet detector (train_doublets_SVM.sh)
- update doublet assignment in data using
apply_doublet_classifier.R
which uses the Apply_Classifier_On_Seurat_Object() function. Alternatively, run again the seurat object creation process with seurat_from_count_tables.sh and set the argument_identify.doublets = TRUE_
) - subset the seurat object into singlets and doublets (split_seurat_by_category.sh). It is advised to keep the doublets data (even if you have not further use for them) for future reference.
- compute dimensionality reduction coordinates (add_dr.sh)
- make an annotation template and annotate (make_cell_annotation_template.sh)
- optional: annotation can be made a lot easier by plotting annotation assisting word clouds (wordclouds.sh) and making an interactive heat map (interactive heat map tool at Fast Portals interactive_heatmap_dotplot).
- important notice: it is highly advisable to use only alphanumeric characters in naming cell population. Some characters (e.g. "\", "/") can create problems and raise errors with some pipelines. While many of these issues are solved for, it is still advisable as good practice to avoid fancy characters in naming. This is because it is imposible to predict all possible issues created by non-alphanumeric characters and even when they do trigger errors, the error messages are particularly vague in such situations. Here alpha-numeric is defined as the collection of Latin letters and Arabic digits.
- update the annotation (update_annotation.sh)
- make all the apps that allow easy data exploration:
- plot_dr.sh for interactive UMAP, FDG, tSNE and AGA;
- Fast Portals interactive_heatmap_dotplot interactive heatmap but this time with the labels, not clusters;
- Fast Portals web_portal web portal tool for gene expression (you must have access to web server or alternatively set an Apache server on your local machine)
- Fast Portals gene_grouping for gene expression patterns
- super_markers.sh gets cell types signatures. You will understand the power of these signature when you input them in the interactive heat map (if you make one). This will be very useful for annotating new data, for supporting data annotation, for exploring expression patterns in new data sets or for designing new flow panels
- optional: you could run again the word clouds because this also show DEGs as word clouds
- optional: you could train a cell type classifier for fast integration of new data sets. However it is highly recommended that any published conclusions should be made on whole data annotation, not on classifiers results from this bundle (or any other type of machine learning classification/label propagations/projections made with tools that are not part of this bundle)
- it is recommended that you treat portals, doublet detectors, cell type classifiers and gene signatures as resources not as results. You should only share such resources with relevant people otherwise you might risk leaking results to others before publication.
- next steps are project specific
Python version 3.6
pandas 0.22.0
pptx 0.6.9
patsy 0.5.0
scanpy 1.2.2
sklearn 0.19.1
numpy 1.14.2
scipy 1.0.0
umap 0.2.3
cv2 3.3.1
R version 3.4.2
Seurat 2.3.4
dplyr 0.7.6
reshape 0.8.8
plyr 1.8.4
RColorBrewer 1.1.2
BiocParallel 1.12.0
gridExtra 2.3
grid 3.4.2
sva 3.26.0
destiny 2.6.2
ggplot2 3.0.0
monocle 2.6.4
harmony 0.1.0
methods 3.4.2
utils 3.4.2
wordcloud 2.6
The main folder is called single_cell_data_analysis_bundle and must contain:
- data folder where all seurat object as kept as RDS file and scanpy objects as h5ad files
- output folder where jobs save their output. This is where the user can get the results of running a pipeline
- pipelines folder contain a folder for each pipeline
- resources folder containing sample key, colour keys, cell type classifier, doublet detectors, options file etc.
- tools folder - there is no reason why the user should be concern with this folder. It contains programs for running force-directed graph, AGA, UMAP, classifier prediction, doublet detection, pseudotime 3D viewer app builder and a file with lots of utilities. These are never required to be called directly by the user.
- the tools folder includes the file bunddle_utils.R
- you must set the variable
python.addr
ain the script tools/bunddle_utils.R and set the R path in each shel script - if you want to use the portal tools and fast gene expression explorer you must have an additional folder named portal_tool where these tools are stored and can be used as pipelines
Color keys compatible with the single cell analysis bundle can be generated using the interactive tool color_management.html. Instructions can be found inside the interactive tool if opened in a browser (recommended: Chrome, Firefox; to avoid if possible: all versions of Internet Explorer and all versions of Microsoft Edge).
The functions in the bunddle_utils.R can be used in new pipelines or in customized scripts. If this is required check the parameter description for the required function and ensure access to required scripts that it need to call.
The bunddle_utils.R:
- declares the tool.addr and python.addr variables. Change the python.addr if you want to use a different python version for your work but make sure first that the version you are trying to use has installed all the required packages.
- the functions runFDG, RunUMAP are used to compute force-directed graph and UMAP coordinates for a seurat object. Although currently Seurat package has a function to compute UMAP which goes by the same name, the function in this bundle was created before Seurat published its umap computing function. Both the in-house and the Seurat RunUMAP functions do the same thing but because the bundle was build before Seurat had the ability to compute UMAP it is recomended to use the RunUMAP from the bunddle_utils.R script with the current bundle for compatibility reasons.
- runFDG(pca.df, snn, iterations = 600, tool_addr, python.addr)
- computes force directed coordinates on a seurat object. This function requires time and computation resources for big data sets.
- @param
pca.df
the input data frame with variables as columns. In the pipeline this is used on the pca coordinates of a seurat object which can be retrieved atseurat.obj@dr$pca@cell.embeddings
. The function is very flexible due to this input and can used on many types data formats not limited to a seurat object. Further flexibility of this function comes from the fact that other embeddings can be used besides pca (e.g. batch corrected pca stored atseurat.obj@dr$harmony@cell.embeddings
). - @param
snn
shared nearest neighbor graph. In a seurat object this is available atseurat.obj@snn
. If you want to use this function outside a pipeline you must make sure that the snn has been computed on the seurat object or if you are not using a seurat object you must computed using other available tools. You must pay attention to seurat object subsetting which as of writing this documentation does not recompute the snn. - @param
tool_addr
folder name were the bundle tools are stored. this function uses force_abstract_graph/make_fdg.py for the actual computations. If you need to use this function outside the pipelines you must make sure you have this script and set the tool.addr argument properly. - @param
python.addr
python address. This is pre-set in all pipelines but having this as an argument allows the user to re-use the function in other scripts and choose the python version
- RunUMAP(pca.df, tool_addr, python.addr)
- computes umap coordinates. Currently Seurat packages has published a function with same name that computs UMAP coordinates on a seurat object. However the RunUMAP in this bundle is more flexible and can be used not just on a seurat object. This function is the fastest dimension reduction method (compared to PCA, tSNE and FDG).
- @param
pca.df
the input data frame with variables as columns. In the pipeline this is used on the pca coordinates of a seurat object which can be retrieved atseurat.obj@dr$pca@cell.embeddings
. The function is very flexible due to this input and can used on many types data formats not limited to a seurat object. Further flexibility of this function comes from the fact that other embeddings can be used besides pca (e.g. batch corrected pca stored atseurat.obj@dr$harmony@cell.embeddings
). - @param
tool_addr
folder name were the bundle tools are stored. this function uses umap/umap_compute.py stored in the tools folder to compute the umap coordinates. - @param
python.addr
python address. This is pre-set in all pipelines but having this as an argument allows the user to re-use the function in other scripts and choose the python version
- Apply_Classifier_On_Seurat_Object(seurat.obj, classifier.fname, tool_addr, python.addr)
- this function applies an SVM classifier to the seurat objects and outputs predictions as a character vector;
- the predictions can be added to the meta-data of the seurat objects
- can be used to predict cell types or doublets. Make sure that the cell type classifier or doublet detector is relevant for the data you want to predict (e.g. do not use a thymus cell type classifier on spleen)
- @param
seurat.obj
name of seurat object - data must be normalized before applying the function - @param
classifier.fname
folder name were the classifier is stored. cell type classifier are trained with the train_classifier.sh pipeline and doublet detectors are obtained with train_doublets_classifier.sh pipeline; - @param
tool_addr
folder name were the bundle tools are stored. This function calls the predict_by_classifier/predict.py script to load the classifier and must be present in the tool.addr folder. Inside the pipelines this argument is already set but if you want to use this function in other scripts you must set the proper tool.addr - @param
python.addr
python address. This is pre-set in all pipelines but having this as an argument allows the user to re-use the function in other scripts and choose the python version
- make_3D_interactive_page(data_frame_3D, tool_addr, python.addr, save.to)
- outdated since the web portal was created (see Fast Portals)
- this function was used to create a 3D visualising html page.
- this function is the ancestor of the pseudotime web portal
- this function is used by the plot_dr.sh pipeline to make a interactive tool for visualising diffusion map coordinates.
- It can also used outside the pipeline to visualize other types of three-dimensional data sets (3D UMAP, 3D FDG, 3D tSNE) but this will be to be pre-computed
- @param
data_frame_3D
data frame with the 3 dimensions in the first 3 columns, colours as hexdecimals in the forth columns and cell labels in 5th columns which must be named "Labels". The names of the other columns are not important. Make sue you do not have non-alphanumeric character in the labels (e.g. /, , @ etc.) which can cause issues with the output. - @param
tool_addr
folder name were the bundle tools are stored. This function uses interactive_3D_viewer/html_WebGL_3D_viewer.py script which must be found in the tools folder. The function can be used outside the pipelines by setting the tool_addr argument. - @param
python.addr
python address. This is pre-set in all pipelines but having this as an argument allows the user to re-use the function in other scripts and choose the python version - @param
save.to
address to save the resulted interactive page
- make_2D_interactive_page(data_frame_2D, tool_addr, python.addr, save.to="./")
- this functions is similar to make_3D_interactive_page and it produces and interactive html page for vizualizing 2D data (UMAP, tSNE, FDG)
- @param
data_frame_2D
has similar formating with the parameter data_frame_3D in function make_3D_interactive_page the difference being that it features only two dimensions - @param
tool_addr
folder name were the bundle tools are stored. This function uses interactive_2D_viewer/html_WebGL_2D_viewer.py script which be found in the tools folder. - @param
python.addr
python address. This is pre-set in all pipelines but having this as an argument allows the user to re-use the function in other scripts and choose the python version - @param
save.to
address to save the resulted interactive page
- create_gene_expression_viewer_apps(seurat.obj, dim.type = 'umap', save.to, tool_addr, python.addr, categories.colours=NA)
- deprecated
- this has been replaced by the web portal creating tool (see Fast Portals)
- this can still be used by the pipeline seurat_to_interactive_gene_expression.R
- this creates html interactive pages that allow data exploration for dimensional reduction plots and gene expression. However the data is embedded in the page so the results is heavy and will require time to load in a browser. It is not recommended to used the output on a web server. The output is useful for internal distribution of data. For other situation I recommend you use the web portal building tool.
- plot.indexed.legend(label.vector, color.vector, ncols = 2, left.limit = 3.4, symbol.size = 8, text.size = 10, padH = 1, padV = 1, padRight = 0)
- function that creates a legend pdf file to be used with dimension reduction plots
- this function is called in plot_dr.sh
- @param
label.vector
a character vector with the cell labels - @param
color.vector
a character vector with the colors for the labels written in hexdecimal format. hexdecimal colour keys can be created using the interactive tool color_management.html - @param
ncols
number of columns to arrange the labels in the legend - @param
left.limit
left padding - @param
symbol.size
symbol size - @param
text.size
text size - @param
padH
horizontal padding - @param
padV
vertical padding - @param
padRight
right padding
- dr.plot(point.labels, dr1, dr2, dr1.name, dr2.name, no.legend = F, plt.lb.sz = 5, txt.lb.size = 3, pt.size = .2, random_state = 2, use.cols = NULL, use.labels = NULL, limits = NULL, annotate.plot = T, index.map = NA)
- function used to plot dimensionality reduced coordinates (e.g. tSNE, UMAP)
- @param
point.labels
character vector of all labels in the data set - @param
dr1
numeric vector of first dimension coordinates - @param
dr2
numeric vector of second dimension coordinates - @param
no.legend
boolean indicating if to plot the legend inside the final plot. Sometimes it is desirable to plot the legend separatly to allow for more flexibility in arranging figure panels. In this case set this argument toFALSE
and use plot.indexed.legend to create a separate legend figure - @param
plt.lb.sz
label size - @param
txt.lb.size
text label size - @param
pt.size
point size - @param
random_state
used for generating repeatable random colours when colours are not available - @param
use.cols
if null, then generate colours randomly - @param
use.labels
vector of labels - @param
limits
plot limits - @param
annotate.plot
boolean to indicate if plot should be annotated by with indices - @param
index.map
either a NA type or a numeric vector to set the indices of each label
Each pipeline is run with the command qsub name_of_pipeline.sh 'arguments inside single quotes'
All pipelines must be run from their home directory
Inside the single quotes each argument must be take a value. These value should be in double quotes if strings or without otherwise (basically following R standard syntax);
A few pipelines do not take external arguments (split_seurat_by_category.sh, subset_seurat.sh and compute_DEG.sh). In these cases the arguments must be changed inside the R script.
The standard assumptions of all pipelines are:
- all the data is in the data folder
- resource files are found in the resources folder
- any processed data resulted from the job will be saved to the data folder
- any other type of result will be saved in the output folder
- all pipelines (exceptions split_seurat_by_category.sh, merge_seurat_objects.sh and subset_seurat.sh, compute_DEG.sh, gene_discriminatory_power_analysis.sh) when run create a dedicated folder inside the output folder. This carries the name of the pipeline + name of inputed data + unique time. Results other than processed data will be saved to this folder. This allows the user to map jobs with output. Inside the pipeline output folder there is a temporary folder created to stored transient processed data. This will be deleted most of the time when the job ends. In same case the transient processed data might be important for further work so the user should comment out the line
unlink(output_folder_material, ...)
An example of running a pipeline:
qsub add_dr.sh 'seurat.addr = "fliv_lymphoid_Stage_1.RDS"; do.normalize = T; add.PCA = T; add.TSNE = T; add.UMAP = T; add.FDG = T; save.dr = F'
- the above job will load the file "fliv_lymphoid_Stage_1.RDS" from the data folder. If this file is not inside the data folder an error will occur and the job will be stoped;
- then the job will do data normalisation followed by the computation of PCA, tSNE, UMAP and FDG coordinates;
- lastly it will save the resulting Seurat object overwriting the file that was initially loaded (in this case "fliv_lymphoid_Stage_1.RDS").
Jobs can be killed but take notice that if you terminate a process while it is writing to disk, the corresponding data will be lost.
For smaller data sets the scripts can be run on a local station. In such cases one cannot submit jobs using the
qsub
command. Instead the R and/or Python scripts can be run directly in interactive environments but the global parameters either pe passed directly to the script or set inside the scripts.
This is the template for building new pipelines compatible with the current bundle. It contains boiler plate code and can be used as the starting point for making new pipelines. It can handle any number of arguments, just place them in the arguments.list one on a line and ending with ".arg".
- used to compile a seurat object from Cellranger output
- an example of runing this pipeline:
qsub seurat_from_count_tables.sh 'organ="liver"; ProjectName="Liver10x"; save.at="liver_all.RDS"; sequencing.types="normal"; annotate.cells = T; identify.doublets = T; cell.type.SVM = "classifier_svm_cell_type_ftlliv"; doublet.svm = "classifier_svm_doublets_ftlliv"'
- reads the file key.csv in the resources folder. From this it selects the data based on organ name and sequencing type
- the seurat object is saved having the file name set by the argument save.at. The file will be saved in data folder
- The arguments:
- organ: string, name of the organ. Must exist in the key.csv file (e.g. liver, thymus)
- ProjectName: string, passed to the project argument when creating the seurat object
- save.at: string, file name for the seurat object to be save to. File extension must be RDS. The file will be saved in the data folder
- sequencing.types: strings, can be 'normal' for 3' data or 5GEX for 5' data
- annotate.cells: boolean, use a trained SVM to automatically annotated the data. Make sure you have a trained SVM before asking for automatic annotation
- identify.doublets: boolean, use a doublet detector to flag doublets in the dataset
- cell.type.SVM: string, name of the cell type classifier. The classifier must exit in the resources folder and should have been created with the pipeline train_classifier
- doublet.svm: string, name of the doublet detector. The doublet detector must exist in the resource folder and should have been created with the pipeline train_doublets_classifier
- makes an annotation template for data stored in a seurat object.
- it first clusters the data than computes DEGs and arranges the results in a form than can readily be used for data annotation
- the data file is overwritten at the end to include the new clustering.
- After clustering it is recommended to run the pipeline interactive_heatmap_dotplot (see Fast Portals) on the clustered data because the interactive heatmap is highly useful for the annotation. But before doing that you should pre-append a string to the cluster index because pure cluster indices are not handled well in the interactive heat map (e.g. use Cluster_110 instead of 110).
- an example of runing this pipeline:
qsub make_annotation_template.sh 'seurat.addr = "spleen_all.RDS"; clustering.res=30; DE.downsample=T'
- The arguments:
- seurat.addr: string, name of data file. Must be RDS format and contain a seurat object
- clustering.res: integer, Louvain clustering resolution. Use higher values for bigger data sets. Always aim at over clustering for the purpose of data annotation. It is better to merge clusters that will be assigned the same labels than to have rarer population diluted inside bugger clusters and never being detected.
- DE.downsample: boolean, set to
TRUE
for bigger data sets. This downsamples cell number in bigger clusters decreasing time significantly for DEG computation. However, pay attention that downsampling will use the bpl package that might throw an error about 'problem too large'
- splits a seurat object in several subsets by the levels of a column in the meta data (e.g. splits by gender will create 2 smaller seurat objects, one for each gender)
- this pipeline does not use the output folder and the resulting data is saved in the pipeline home folder. I made this choice to allow investigating the resulting subsets of data before I transfer to data folder to make sure I am not overwriting any thing.
- this pipeline does not accept external arguments. Arguments must be changed inside the R script split_seurat_by_category.R
- it is recommended that the results subsets are run through dimensionality reduction or batch correction before they are used for any downstream work.
- note: if any of the levels in the column of the meta data contain a "/", the resultant .RDS file will read this as a path and cause errors with saving the RDS. Avoid this character or make folders (e.g., if name of level is yolk_sac_progenitor/MPP you will have to make a folder in pipeline directory called "yolk_sac_progenitor". The rds file will be names MPP.RDS and will be saved within this folder)
- an example of runing this pipeline:
qsub split_seurat_by_category.sh
- The arguments:
- sort.by column meta data by which the data should be splitted
- seurat.addrs full or relative path for the RDS file storing the Seurat object
- merges all the seurat object from a list fo file names
- an example of runing this pipeline:
qsub merge_seurat_objects.sh 'seurat.addrs = c("data1.RDS", "data2.RDS"); append_tag = T; tags_to_append = c("tag1", "tag2"); append_tags_at = "sample.ids"; save.at = "merged_data.RDS"'
- the arguments:
- seurat.addrs character vector of RDS file names (must be at least 2) containing the seurat objects. Must include only the file name, not the full path. The assumption is that the datasets are found in the data folder inside the bundle
- append_tag boolean flag to append a tag to the meta data to help keep track of the merged data sets. This has proved very useful for many downstream work so it is recommended to add the tags
- tags_to_append character vector containing the tags. Must be the same length as seurat.addrs. If append_tag is set to
FALSE
this argument will be ignored but should not be omitted from the list of arguments and can be set toNULL
orNA
- append_tags_at meta data column where to append to tags
- save.at RDS file name where to save the merged seurat object. Must contain only the file name, not the path. It will be save to the data folder. Make sure the file names does not exist before running the pipeline to avoid over-writing of data.
- used to subset a part of the data set and save to a new RDS file as a seurat object
- this pipeline does not use the output folder
- this pipeline does not take external arguments. Arguments must be written inside the R script subset_seurat.R
- an example of runing this pipeline:
qsub subset_seurat.sh
- the arguments:
- seurat.obj.addr full or relative path of the input RDS file
- save.at full or relative path of the output RDS file
- process boolean flag to run common data processing (normalisation, scaling, variable genes computation, PCA). It is recommended to set this to
TRUE
. However there are cases when data processing might not be required so in this case time can be saved by setting this argument toFALSE
. - add.dr boolean flag to compute tSNE, UMAP and FDG. It is recommended to set this to
TRUE
. IfTRUE
the previous argument also be set toTRUE
otherwise an error will raised. There are times when these computations are not required so set this argument toFALSE
. Not processing and not adding dimensionally reduction also ensures light-weight data sets which are easy to transfer over the web. - filter.args list of named vectors indicating fields on which to subset the data
- computes tSNE, UMAP and FDG on a seurat object
- also includes a script called add_dr_COMBAT.R which can be used to run batch correction using COMBAT implemented in Python. However this is very slow on data sets with more than 50k cells. COMBAT correction changes gene expression
- the default of this pipeline is not to use COMBAT batch correction. If this is required edit the add_dr.sh file by replacing add_dr.R with add_dr_COMBAT.R
- an example of runing this pipeline:
qsub add_dr.sh 'seurat.addr = "data_scseq.RDS"; do.normalize = T; add.PCA = T; add.TSNE = T; add.UMAP = T; add.FDG = T; save.dr = F'
- the arguments are:
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder
- do.normalize boolean to normalize data. This must be set to
TRUE
if the input data has not yet been normalized. If this is set toFALSE
but the data has not been previously normalised and error will occur and the job will be killed - add.PCA boolean to compute PCA. Same principles and warnings as for the previous argument
- add.TSNE boolean to compute tSNE
- add.UMAP boolean to compute UMAP
- add.FDG boolean to compute FDG. note, this takes 2 extra hours (for dataset of ~100,000 cells)
- save.dr boolean to save the dimensionality reduction coordinates as a data frame in a csv file in the pipeline output folder. This is particularly useful for bigger data sets which either take long time to load or are not manageable on personal computers at all. In those case having the coordinates and meta data in csv files will save time
- computes differential expressed genes (DEGs) on a seurat object
- this is different from make_cell_annotation_template.sh which computes DEGs only on clusters
- it allows computation of DEGs on any meta data category and also can be used post-annotation to get cell type DEGs
- this pipeline does not take external arguments. Arguments must be set inside the R script compute_DEG.R
- an example of runing this pipeline:
qsub compute_DEG.sh
- the arguments:
- seurat.addrs full or relative path of the RDS file containing the Seurat object.
- save.to file name where to save markers genes in csv format
- DE.downsample boolean to downsample data if to big. Set this to
TRUE
for big data sets. - category meta data column by which DEGs are computed (e.g. cell.labels, stages)
- makes violin plots using set genes from a seurat object
- Seurat package already has ability to construct violin plots. However this pipeline is advantageous for big data sets that are difficult or impossible to manage on personal computers or interactive sessions
- an example of runing this pipeline:
qsub violin_plots.sh 'seurat.addr = "data_scseq.RDS"; set.ident = "cell.annotations"' type.to.colours = "data_color_key.csv"; cell.labels = "fname_with_labels"; plot.width = 10; plot.height = 10; features.file = "fname_with_genenames"'
- the arguments:
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder.
- set.ident meta data column to set the identity of cells
- type.to.colours name of csv file that contains the colour key (mapping between categories in the set.ident and colours). Must contain only the file name not the full or relative path because the assumption is that this is a resource file that is placed in resource folder. Color keys compatible with the single cell analysis bundle can be generated using the interactive tool color_management.html
- cell.labels indicates what categories to include in the violin plot. If set to
"all"
it will use all the categories. If a subset of categories is desired you must pass the file name that exits in the resource folder and contain one category name per line - plot.width numeric plot width
- plot.height numeric plot height
- features.file name of file that must be found in the resource folder and to contain a gene name per line
- and example of cell.labels file:
MEP_kidney
Mast cell_kidney
Megakaryocyte_kidney
Late Erythroid_skin
Mid Erythroid_skin
MEP_skin
Megakaryocyte_skin
Mast cell_skin
- an example of features.file file:
UNG
MSH6
CD19
TNFRSF13C
TRNT1
PIK3CD
MOGS
INO80
TNFRSF13B
TNFSF12
- plots a heat map and a dot plot using selected gene names and cell types from a seurat object
- an example of runing this pipeline:
qsub gene_heatmap_and_spotplot.sh 'seurat.addr = "data_scseq.RDS"; set.ident = "cell.labels"; genes.to.plot = "fname_genes"; cell.types = "fname_with_labels"; cluster.genes = T; diagonalize = F; plot.dims = c(10, 10, 10, 10); save.gene.order = T;'
- the arguments:
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder.
- set.ident meta data column to set the identity of cells
- genes.to.plot name of file that must be found in resource folder and to contain one gene name per line
- cell.types indicate what categories to include in the violin plot. If set to
"all"
it will use all the categories. If a subset of categories is desired you must pass the file name that exits in resource folder and contain one category name per line - cluster.gene boolean if cluster genes
- diagonalize boolean if diagnolize genes (i.e. placing highest values in each row closer to the diagonal to make for better vizualisation)
- plot.dims numeric vector containing the widths and heights of the heat map and dot plot respectively
- save.gene.order this is useful if the ordering of genes from diagonalization and/or clustering has created good visualisation and the user needs to store the ordered genes for future plots using the same gene set
- an example of genes.to.plot file:
TRAV2
TRAV3
TRAV4
TRAV5
TRAV6
TRAV7
- before considering using this you should know that the web portal tool might be a better option (see Fast Portals)
- this pipeline creates an interactive app that allows data exploration in a browser. Part of its limitations is that it shows only the variable genes and that expression data is embedded in the interactive page making it heavy and requiring time to load in a browser. The web portal tool has a further advantage in that it can build portals from both Seurat and Scanpy objects. However if all you need is to share a data exploration app with your team members or collaborators without disclosing data before publication this might be the pipeline for you
- an example of runing this pipeline:
qsub seurat_to_interactive_gene_expression.sh 'seurat.addr = "data_scseq.RDS"; dim.type = "umap";'
- the arguments:
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder.
- dim.type dimensionality reduction type to be used in the resulted app (e.g. tsne, umap, fdg). Must exist in the
@dr
slot of the seurat objects, otherwise it will raise error and kill the process. You can use the add_dr.sh pipeline prior to this if you are not sure
- pipeline used to plot all dimensionally coordinates and colour the points by any column in the meta data
- there is also a plot_dr_numerical pipeline which should be used to plot by numerical categories, e.g., louvain clustering. This will not give a seperate legend - it will plot cluster numbers directly on the plot
- if you would like larger point size for dots on FDG/UMAP/TSNE, please add 'pt.size=1' argument to plot.tsne/plot.umap/plot.fdg functions in ~line 190.
- additional it can also compute diffusion map and AGA graph
- warning about diffusion maps: most of them will make no sense if the data does not contain a true lineage. To ensure the diffusion map will make sense care must be taken in up stream work flow and must ensure removal of doublets, removal of outliers if possible and most importantly that the cell types in the data are part of a true biological lineage and only one lineage
- warning about AGA: some times AGA results are difficult to interpret especially when running on data sets that contain too many cell types. Establishing what too many means is up to trial and error and experience
- this pipeline also creates interactive apps (as html pages) that allow exploration of the dimensionality reduction coordinates and AGA structure. Furthermore diffusion maps can be visualised in a 3D interactive enviroment
- warning about SS2 data - your raw.data slot in seurat object may be a matrix. We expect a sparse matrix (as is the norm with 10X data. As such, to avoid an error please do not set DiffMap or AGA to true when running plot_dr on SS2 data).
- The 2D interactive app are build for tSNE, UMAP, FDG but take notice that these should be computed prior to running this pipeline (see add_dr.sh or batch_correction.sh).
- The AGA interactive app includes instructions when opened on a browser. See general description of single cell analysis pipeline for browser compatibility.
- an example of runing this pipeline:
qsub plot_dr.sh 'seurat.addr = "keratinocytes.RDS"; plot.by = "annotations"; type.to.colours = NA; runDiffusionMap=F; runAGA = T, overlay.data="data_type"; overlay.data.ordered=c("10X", "ss2")'
- the arguments:
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder.
- plot.by indicates the meta data column(s) to be used in colouring the plots. Can be one string if only one column is used or a character vector if more columns are required
- type.to.colours indicates colours for all categories for all meta data columns chosen to plot. Can be one string if only on column is chosen or a character vector if more columns are chosen. Each value can be
NA
if random colours are required or a color key file in csv format found in resource folder. See the color_management.html tool for generating color keys compatible with the single cell analysis bundle - note: if you use this colour management html you need to add "Celltypes,Colours" to the top line of the .csv (without apostophes). - runDiffusionMap boolean to run diffusion map. Set this to
TRUE
only after considering the warnings about diffusion maps. - runAGA boolean to run AGA
- overlay.data indicates seurat object metadata column containing two categorical variables. This will plot dimensional reduction with different shapes for each of these categories - can alternatively be set as NA.
- overlay.data.ordered is a list of the categorical variables within overlay.data. The first value in the list will be plotted as a circle and the second value will be plotted as a triangle - can alternatively be set as NA (if overlay.data is also set as NA).
- this pipeline trains a random forest for classifying cell labels in a seurat object using a set of gene names
- it was created to assess discriminatory power of gene sets using a random forest classification report
- the random forest was chosen for this purpose due the partial similarities between it classifying mechanism and flow sorting
- this pipeline may take a while (15 hours for Popescu et al, 2019, data). Accordingly, a downsampling step has been added to only use 1000 of each celltype according to cell.labels.
- this is a not a pipeline for training classifiers. if that's what you need check train_classifier.sh pipeline
- an example of runing this pipeline:
qsub gene_discriminatory_power_analysis.sh 'seurat.addr = "data_scseq.RDS"; feature_genes = "fname_with_gene_names"; cell.types = "fname_with_cell_types"; save.to.dir = "gene_power"; ident.set = "validated.cell.labels"; type.to.colours = "colorkey_fname";'
- the arguments:
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder.
- feature_genes name of file that must be found in resource folder and to contain one gene name per line
- cell.types name of file found in resource folder and contains one cell type per line
- save.to.dir name of folder to save results
- ident.set name of column in meta data to used for partitioning the data
- type.to.colours name of csv file found in resource folder that contains the cell type to colour mapping. To generate colour key compatibly with the single cell analysis bundle check the interactive tool color_management.html
- see violin_plots.sh and the resource folder for examples of feature genes and cell types file formats
- used for trajectory analysis
- uses diffusion map. please check plot_dr.sh pipeline for warnings about diffusion maps
- computes pseudotime and (optionaly) genes that vary with pseudotime
- it outputs plots of normalized and non normalized gene expression by pseudotime
- it also produces an interactive page used for visualising some of the top variable genes. Check the pseudotime portal creation tool (see Fast Portals) for a better alternative richer in functionalities and showing a higher number of genes
- an example of runing this pipeline:
qsub pseudotime.sh 'seurat.addr = "data_scseq.RDS";set.ident = "cell.labels"; cell.types = c("HSC", "Pre@@pro@@B@@cell", "pro-B@@cell", "pre-B@@cell", "B@@cell"); root_cell_type = "HSC"; var.genes = NA; type.to.colours = "type_colours.csv"'
- the arguments:
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in the data folder.
- ident.set name of column in meta data to used for partitioning the data
- cell.types character vector containing a list of cell types to be used in the trajectory. You must replace every white space in the names (" ") with double sign ("@@"). Check commands example for reference. This the most important parameter for this pipeline - check the warnings about diffusion map
- root_cell_type name of root of trajectory. must exist in cell.types. Must have the same formating (replacing " " with "@@")
- var.genes set this to NA to flag the computation of all variable genes. If instead f computing variable genes one needs to analyse expression pattern of certain genes this argument must be the name of file placed in resource folder and containing a gene name per line
- type.to.colours name of csv file found in resource folder that contains the cell type to colour mapping. To generate colour key compatibly with the single cell analysis bundle check the interactive tool color_management.html
- Note that this pipeline is very similar to 92_pseudotime_webportal, but does not have the argument
lineage.name=
, and requires the argumentvar.genes=
. - This pipeline returns the top 100 genes per cell type, unlike 92_pseudotime_webportal, which returns adjusted p-value < 0.001. If you plot pdt genes of this pipeline, check that all plotted genes are significant by comparing with the output of 92_pseudotime_webportal.
- this pipeline is used for the fist step in creating an animated force direct graph
- see here and example of animated force directed graph
- the pipeline name is write_input.sh
- this processes and saves material to be used in the animation creation (a legend figure, a csv file mapping cell types to colours, PCA data and the shared nearest neighbour graph in sparse matrix format)
- this pipeline does not take external arguments
- the entire workflow for animation creation continues with the tools in the folder force_abstract_graph_2Danimation. It is recomended to run this in a interactive environment (on a personal computer not on a server). Inside this folder there must be an empty folder called input. This is where the material created by write_input.sh must be transfered. Then start make_fdg_animation.py. Inside this script there is the line
subprocess.call(["Rscript", "make_plots.R"], shell = True)
. This might fail on some platforms. The solution in this case is run the script make_plots.R independently then carry on with make_fdg_animation.py from where you left - the tools in force_abstract_graph_2Danimation need installation of an in-house modified version of fa2 package. The modified version can be found at force_abstract_graph_2Danimation/iterative_fa2/. To install this go to force_abstract_graph_2Danimation/iterative_fa2/ and run
python3.6 setup.py install
(change according to your python version) - you must also have open computer vision Python package opencv2 pre-installed. Check the online documentation on how to install this on your platform
- the write_input.sh has only 2 arguments:
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder.
- cell.type.to.colour name of csv file found in the resource folder that contains the cell type to colour mapping. To generate colour key compatibly with the single cell analysis bundle check the interactive tool color_management.html
- used for training cell type classifiers - may take days to run
- classifier type SVM using PCA input. The classifier is saved as 3 files: SVM (model.pickle), PCA projection (pca.pickle), and a list of feature genes (feature_genes.RDS)
- it also saves classification reports (classification_report.txt, confusion_matrix.csv and confusion_matrix.pdf). These files are important for assessing the perfomance metrics of the resulted classifier
- the classifier files and report are saved in a user-defined folder placed in the resource folder
- see resources/classifier_svm_cell_type_liver for an example of a trained classifier
- classifiers can be applied using the function Apply_Classifier_On_Seurat_Object (see the bunddle_utils.R script)
- IMPORTANT NOTICE: classifiers must be used only on the same type of data that it was trained on e.g. a classifier for cells in liver should only be applied on liver data. If for example you will apply the liver cell types classifier on thymus the classifier will only "see" the cell types it was trained to see and your results will be wrong. Furthermore most of the time the use of classifiers in a cross-tissue manner is highly unprofessional and may indicate severe incompetence and a potential need for staff replacement. There are however a few (very few) exceptions where a classifier could be used on a different tissue from its training (e.g. the origin of the unlabelled data is not known; used as a pseudo-metric for cross-tissue cell type similarities).
- DEGs must be computed before training a classifier (see compute_DEG pipeline)
- an example of runing this pipeline:
qsub train_classifier.sh 'seurat.addr = "spleen_all.RDS"; marker.genes="spleen_all_DEGs.csv"; save.at="classifier_svm_cell_type_spleen"; classifier="svm";'
- the arguments:
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder.
- marker.genes name of csv file storing DEGs by cell population. The top 20 DEGs for each cell type are used as feature genes. The file must be found in the folder resource/marker_genes. See compute_DEG.sh on how to obtain DEGs for a data set.
- save.at folder name were classifier files and report are saved. The folder will be created in the resource folder because classifiers are consider resources.
- classifier name of classifier script. Currently the SVM is the only one implemented so this argument should always be set to "svm". However this argument allows extending of this pipeline to work with other classifiers. During development other classifiers were logistic regression, random forest, ada boost and multi-layer perceptron. However the SVM showed the best accuracy and recall over all so only the svm was kept.
- pipeline for training doublet detector
- each doublet detector should only be used on the data set it was trained. While this sounds conter-intuitive for the machine learning users, all methods for doublet detection use this approach. The detector is actually trained on original data merged with dummy doublet data but it is applied only on real data. The idea behind ML-based doublet detectors is that a trained ML will find the optimum separation plane between dummy doublet and real data and because overfitting is avoided by regularisation, the plane will also separate a big part of the real doublets. It is intrinsically difficult to validate the identified doublets but the doublet removed by the the detectors created by this pipeline a) have a higher UMI and gene number counts b) have detected no doublet in plates data at least so far c) has improved downstream analysis, especially diffusion maps and UMAPs.
- It is usually better to first train a doublet detector, then remove doublets and only then do annotation (manual with the annotation template or automatically with trained classifiers)
- an example of runing this pipeline:
qsub train_doublets_SVM.sh 'seurat.addr = "kidney_all.RDS"; save.to = "classifier_svm_doublets_kidney"'
- the arguments:
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder
- save.to folder name were classifier files and report are saved. The folder will be created in resource folder because doublet detectors are consider resources
- pipeline for applying doublet detector
- Example:
qsub apply_doublet_classifier.sh 'seurat.addr = "dummydata.RDS"; save.at = "dummydata.svm.RDS"; doublet.svm = "test_classifier_svm_doublets_dummy"'
- the arguments:
- seurat.addr filename of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder
- save.at filename of updated Seurat object.
- doublet.svm_ dir name where
train_doublets_SVM.sh
output is saved (i.e. save.to folder parameter)
- used for saving to disk parts of a data set. This is especially useful for bigger data sets or when needing to run statistics for specific questions that do not require the entire data and allows working in an interactive session
- the arguments:
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder.
- genes.file name of file found in the resource folder that list required genes to subset having one gene name per line. If this is NA than all genes will used
- set.ident name of column in meta data to used for partitioning the data
- cell.types name of file found in the resource folder that lists the cell type for partitioning the data with one cell type per lane. If set to "all" all the cell types will be used.
- save.meta boolean to save meta data. Having the meta data in separate file it is very useful when working with big data sets and will save a lot of time
- save.raw.data boolean to save the raw counts data
- save.data boolean to save normalized gene expression data
- save.dr boolean to save all dimensionally reduction coordinates
- compares Louvain clustering with 2 other types of clustering (agglomerative clustering and Gaussian mixture)
- metrics are rand index and adjusted mutual information
- the arguments
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder.
- set.ident name of column in meta data to used for partitioning the data
- n_clusters number of clusters
- type.to.colour name of csv file that contains the colour key (mapping between categories in the set.ident and colours). Must contain only the file name not the full or relative path because the assumption is that this is a resource file that is placed in resource folder. Color keys compatible with the single cell analysis bundle can be generated using the interactive tool color_management.html
- perform batch correction at the level of principal components (also called data integration) using harmony R package
- the dimensionality reduction coordinates are computed based on harmony principal components
- the arguments
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder.
- correct.by meta data column to be correct by. Usually this should indicate sample assignment
- save.at name of file to save the batch corrected data as Seurat object in RDS format. The file will be saved in the folder data
- creates word clouds of cell types and DEGs for each cell population or other categories in the data set
- cell type word clouds are generated based on tag mentions weighted by gene expression
- the arguments:
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder.
- set.ident name of column in meta data to used for partitioning the data
- see the image for an example of word cloud output:
- used to update the annotations in a seurat object following manual annotation
- you need to add update_template.csv and annotation_markers.csv to the pipeline directory to run this pipeline. Both these csv's are outputs from the make_annotation_template pipeline. The update_template.csv file is required to map new values, and consists of louvain clusters in column 1 and celltype annotations in column 2. The annotation_markers.csv file is only required if make_app (boolean) is set to true.
- annotation should be kept in csv file. The empty template is produced by the make_cell_annotation_template.sh which requires only filling in the cluster assignment
- the arguments:
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder.
- make.app boolean to make interactive page for visualising the annotated data. Usually not necessary as better alternatives have been made available since this pipelines was made (check plot_dr.sh and the web portal tool). Make sure dimensionality reduction has be computed before setting this argument to
TRUE
- update.file file name of the update template to be used for re-annotation. Should be saved in the pipeline directory.
- makes 1-to-1 comparison between cell types with the same data sets or between 2 data sets
- comparisons include correlation plots, AGA score plots and DEGs for each comparison
- the arguments:
- seurat.query.addr query data (same format as the argument seurat.addr in other pipelines)
- seurat.ref.addr reference data (same format as the argument seurat.addr in other pipelines). This can have the same value as seurat.query.addr if the reference cell types and query cell types come from the same data set
- set.ident.query query set identification meta data column (same format as the argument set.ident in other pipeliens)
- set.ident.ref reference set identification meta data columns (same format as the argument set.ident in other pipeliens)
- cell.types.query query cell types (same format as cell.types in other pipelines)
- cell.types.ref reference cell types (same format as cell.types in other pipelines)
- dims.plot width and height in inches for the correlation and AGA plots
- compute.DEGs boolean to compute DEG and plot them as jitter plots
- selects and order DEGs most specific for each cell population in the data set
- the result are cell type signatures i.e. list of genes highly relevant for cell annotations
- the arguments:
- seurat.addr file name of the RDS object containing the input data as a seurat object. Must include only the file name not the path because the assumption is that data files are kept in data folder.
- set.ident meta data column to set the identity of cells
- see the image for an example of plotted truncated signature (for B cells) obtained using this pipeline:
- converts a scanpy object to a seurat object
- the arguments:
- scanpy.addr file name storing a scanpy object. Must be present in data folder
- save.to file name for saving the resulted Seurat object. Will be saved in data folder
- computed AGA graphs and force-directed graph on multiple subsets of cell labels from one or several data sets
- useful for pre-trajectory work by helping to explore putative trajectories in a data set or across many data sets
- this pipeline take one argument which is an option file
- an example of the option file is included with the pipeline
This repository contains tools for making web portals and interactive tools used for exploring and visualising single cell RNA data.
- see an example here
- the pipeline can be called with web_portal.sh
- this take as input an option file
- option files should be placed in the options folder
- inside this folder there are two examples: one option file for a seurat object (liver_web_portal_options.txt) and on option file for scanpy objects (thymus_web_portal_options.txt)
- the format of the option file:
- line 1 indicates data file: "file_name: ../../data/mydata.RDS"
- line 2 indicates output folder: "output_folder: output_here/my_dear_web_portal"
- line 3 indicates what dimenssionality reduction types to be included: "dr_coordonates: UMAP->X_umap; tSNE->X_tsne; FDG->X_draw_graph_fa;". Each field must end with ";". The dr fields must be in the
@dr
slot for seurat objects or indata.obsm[dr_coor]
attribute for scanpy objects. - line 4 indicates meta data fields to include in the portal: "Cell Labels->Annotation->null;Flow gate->sort->null;Sample->fetus->null;Gender->gender->null;". Each field must end with ";". Each field indicate the name of the meta data column followed by "->" followed by name of the field to appear in the data portal (for example you might have a meta data column called "sort.ids" but on the web portal you want it to appear as "Sorting gates") followed by "->" followed by the name color key (csv file found in the resource folder) or set to "null" ir such a color key does not exist for a particular meta data and random colours must be generated instead.
- the output is tar.gz file. Unziping this will generate a folder with lots of contents (the web portal folder)
- if you require password protection go to the folder templates_password_protection and run the Python script generate_password.py. This will insert a new password to the template. The password is written to the file password.txt. Keep this for reference.
- copy paste to web portal folder the 4 files from the folder templates_password_protection (if you need password protection) or from folder folder templates_no_password_protection (if you do not need password protection)
- these files are: fetch_category.php, fetch_dr_coordinates.php, fetch_gene_expression.php, index.php
- upload to web server
-
see an example here
-
this is used to generated an interactive heatmap/dot plot from a seurat object or scanpy object
-
takes one argument which is the path of an option files
-
there are two examples of option files in the folder options: options_fetal_liver.txt for a seurat object and options_fetal_thymus.txt for a scanpy object
-
format of the option file:
- line 1: data file path
- line 2: meta data column by which to assign the identity of cells
- line 3: name of output folder
- line 4: name of interactive html page
- line 5: a short line describing the data which will be included in the interactive page
-
IMPORTANT NOTICE: if the vector partitioning the data (i.e. meta data column) is using integer indices (e.g. Louvain clustering which assigns integer identifies to clusters) it is highly recommended to pre-append the tag "Cluster_" to all indices (e.g. "1" and "103" becomes "Cluster_1" and "Cluster_103" respectively). Failure to do so will not raise any errors, but the resulting interactive heatmap/dot plot will have glitches.
-
A version of this script (91b) creates a html file from an csv file containing expression levels (columns: genes, rows: cell types; first datacell is empty). For this, the option file format is:
- line 1: expressions csv data filepath
- line 2: name of output folder
- line 3: name of interactive html page
- line 4: a short line describing the data
- see an example here
- this creates a web portal useful for exploring the results of a trajectory analysis
- this only works on seurat objects, and not on scanpy objects
- seurat.addr full or relative path of the file storing the seurat object or scanpy object
- set.ident meta data column used for partion the data
- cell.types an R character vector listing the cell types to used in the trajectory. White spaces in the names must be replaced with double at sign ("@@").
- root_cell_type name of the cell type to use as root
- type.to.colours full or relative path to the color key (a csv file mapping cell types to colours). To generate your own colour key use color_management.html (see here)
- lineage.name lineage name that appears in the portal. White spaces must be replaced with double at sign ("@@" instead of " ")
- Note that this pipeline is very similar to 13_pseudotime, but does not have the argument
var.genes=
, and requires the argumentlineage.name=
; and by default does not exclude cell cycle genes. - This pipeline returns genes with adjusted p-value < 0.001, unlike 13_pseudotime, which returns the top 100 genes per cell type.
- An alternative script,
pseudotime_webportal_nocycle.sh
excludes selected cell cycle genes from the analysis (as in 13_pseudotime). - If you wish to add the pseudotime histogram plot to the website, then copy
plotly.min.js
into the trajectory_directory. Available here: https://cdn.plot.ly/plotly-latest.min.js?_ga=2.76101205.1155538419.1562253068-1682084107.1555083282
- creates a graphical interface for fast data exploration of gene expression.
- the resulting program also groups genes by expression patterns and allows the user to manually change group assignment
- seurat.addr full or relative path of the file storing the seurat object
- no_clusters number of gene clusters required
- the output is a folder
- copy the gene_viewer.py script to the outputted folder and then you can run the GUI from a python shell