zji90/TSCAN

Changes to get TSCAN in the OSCA book (part 1?)

Closed this issue · 17 comments

LTLA commented

I'm finally getting around to writing the trajectory chapter in the OSCA book (see here for its current state). Given that TSCAN is one of the Bioconductor packages for trajectory analysis, I was planning to demonstrate its use on some example data. To this end, it would help if exprmclust() was able to take an arbitrary clusterid. For example, one could set clusterid=NULL by default to achieve the existing behavior, but I could also run the function with my precomputed clusters to ensure that TSCAN is consistent with the rest of my analysis.

I have some other requests/suggestions, but this seems like a low-hanging fruit to improve useability. I am also willing to put in a PR if that is of interest.

zji90 commented

Dear Aaron,
Thanks a lot for your interest in TSCAN! I would very much like to see it appearing in your book.
It's really a good suggestion to allow arbitrary clustering. Actually I have implemented it some time ago for Github version (https://github.com/zji90/TSCAN/blob/master/R/exprmclust.R). There is now a 'cluster' option that allows users to supply arbitrary clustering. However, I haven't updated the Bioconductor version for quite a while. I will consider doing that in the near future.

LTLA commented

Thanks @zji90; I look forward to seeing that functionality in the Bioconductor version. (I should add that, to me, the Bioconductor version is main version of your package). If I may, I will also mention the other requests/suggestions that I alluded to in my previous post.

  • Is there a function to draw lines between cluster centroids in an arbitrary embedding? It would be nice to be able to see the TSCAN-determined connections between clusters on, say, a t-SNE plot. I already wrote some code for this to use in the book, but it seems it would be generally useful. For reference, this is what I did for the book (mst is the graph object containing the minimum spanning tree, coords is the matrix of cluster centroids, and sce.nest is a SingleCellExperiment):

    # TODO: stuff this into a function somewhere.
    pairs <- Matrix::which(mst[] > 0, arr.ind=TRUE)
    coords <- reducedDim(by.cluster, "TSNE")
    group <- rep(seq_len(nrow(pairs)), 2)
    stuff <- data.frame(rbind(coords[pairs[,1],], coords[pairs[,2],]), group)
    
    plotTSNE(sce.nest, colour_by="cluster") +
        geom_line(data=stuff, mapping=aes(x=X1, y=X2, group=group))
  • Have you considered a more refined pseudo-time calculation beyond consecutive integers from 1 to the number of cells? It seems like you are already mapping cells onto the closest edge of the MST; why not report the coordinate of each cell along the edge? This would be more helpful for downstream analyses.

    On a related note, multiple paths through the MST (from some root node) could be handled with multiple pseudo-times; this seems like it would be more useful than a single vector of pseudo-times, which doesn't really give much insight into how branches work. This is, for example, how slingshot handles branching events.

  • Perhaps consider easing up on the set.seed() calls you have scattered throughout your code. (See here for a discussion.)

LTLA commented

To continue this discussion: this section of the book attempts to discuss how to use TSCAN for trajectory analysis. For the time being, I have effectively reimplemented the algorithm to work with the SingleCellExperiment data structures, but it would be much nicer to call TSCAN functions directly.

I'm more than willing to make a PR to add my desired functionality to TSCAN, if that would be of interest. In particular, you may find the projected pseudotimes in Figure 17.4 to be particularly useful.

I will re-iterate that these improvements would be very welcome. Slingshot is excellent, but is extremely slow for "large" (approaching 1e6) datasets. As the centroid/cluster <-> MST approach is far more scalable TSCAN could be the default choice for big scRNA trajectory analysis.

LTLA commented

@zji90 I am willing to assume responsibility for maintenance and further development of this package, if you like.

@davemcg In the meantime, I wrote my own implementation of the TSCAN approach and dumped it in scran as ?createClusterMST. Only in BioC-devel right now, I'm afraid; I was hoping to move it here at some point.

Thanks @LTLA I've installed BioC-devel scran and I'll give it a go. Happy with slingshot but it is hard to tweak when it literally takes a week to run!

zji90 commented
LTLA commented

There are several ways to move forward, depending on how much you are willing to trust me :)

The simplest approach would be for me to put in PRs for each change. This will require you to review each PR, merge them and manage the upload to Bioconductor; more work on your end, but you will have a clear idea of what is changing.

If there's a bit more trust, you might add me as a collaborator to this GitHub repository. Then I could just go ahead and add the functions, do various bits of maintenance and improvements, etc. without requiring much work from you. Of course, this also means that if I were evil, I could just change things that you didn't want to change.

If there's even more trust, you could give me write permissions to TSCAN's Bioconductor git repository, in which case I could also manage the upload of any changes to Bioconductor. So even less work required from you, but also more opportunity for me to be evil, because now any changes would appear in the official Bioconductor version of the package.

So, it's really up to you. I'd like to say that I'm not evil, but that's exactly what an evil person would say... I'll just mention that I use all three models for different packages, depending on who I'm working with and how confident I am that they'll do a good job. I've only had one case where I gave someone too much permission and they did something I wasn't happy with.

zji90 commented
LTLA commented

Great, thanks @zji90. I've cloned the repo from Bioconductor and implemented all my changes at https://github.com/LTLA/TSCAN; I've added you as a collaborator as well.

(As a side note: I didn't realize how old this repository actually was. It precedes the SVN-to-Git transition on the BioC servers that happened in several years ago, which means that the Git history in this repo is not compatible with the Git history in Bioconductor's "official" repositories. So, I just decided to work off the Bioconductor version rather than attempting to reconcile the two histories.)

If you check out my latest commits in that repo, several interesting points are:

  • The createClusterMST() function creates a cluster-based MST, from several possible inputs. I've added the idea of an OMEGA group (called outgroup here) from slingshot to avoid joining different trajectories.
  • The reportEdges() function reports the edge coordinates for easy plotting.
  • The mapCellsToEdges() function maps cells to the closest edge in the MST. This is a cool function because it works on cells that weren't used to construct the MST, so you can "project" other datasets onto an existing MST.
  • The orderCells() function computes a pseudotime, considering all branched paths through the MST from a starting node. The output format was again inspired by what slingshot also produces.
  • All functions have additional methods to smoothly operate with Bioconductor's standard SE and SCE classes, while also remaining operational for people who just have a numeric matrix (e.g., of PC coordinates).
  • I have added a GitHub action to check that everything is working before any merge to master is allowed.
  • I took the liberty of cleaning up the DESCRIPTION.

As for getting write access to the BioC Git servers: can you write an email to bioc-devel@r-project.org, and just say that you want to give me write access to TSCAN? It would be most straightforward if you did sent this email using the JHU address from your DESCRIPTION, but if you don't have access to that anymore, that's okay; they just need a way to make sure that it's actually you, so I'm sure we can figure something out.

LTLA commented

Any thoughts on giving me access to the BioC repos? I can start the conversation on the Bioc-devel mailing list if it helps.

zji90 commented
LTLA commented

Excellent, thanks. I'll do some prodding on my end as well.

zji90 commented
LTLA commented

Yes, I also poked them as well. I'm running through some final checks now and will push it up shortly.

LTLA commented

And it's done! I'll keep an eye on the builds... looking forward to working with you on this!