CalabreseLab/seekr

seekr_graph with cross species classification

kmothera opened this issue · 11 comments

Hi Jessime,

I'm trying to use seekr_graph to find communities of lncRNAs within our transcriptome assembly of a non-model organism and compare it across species to GENCODE mouse ncRNAs. I extracted the lncRNAs, normalized against the total StringTie assembly and repeated the process with GENCODE mouse to perform the pearson correlation.

Running the seekr_graph module consistently errors out though. seekr_graph with self vs self correlations functions perfectly but not using cross-species correlations.

Traceback (most recent call last): File "/seekr/virtualenv/bin/seekr_graph", line 10, in <module> sys.exit(console_graph()) File "/seekr/virtualenv/lib/python3.7/site-packages/seekr/console_scripts.py", line 500, in console_graph float(args.resolution), int(args.n_comms), args.seed) File "/seekr/virtualenv/lib/python3.7/site-packages/seekr/console_scripts.py", line 475, in _run_graph maker.make_gml_csv_files() File "/seekr/virtualenv/lib/python3.7/site-packages/seekr/graph.py", line 175, in make_gml_csv_files self.build() File "/seekr/virtualenv/lib/python3.7/site-packages/seekr/graph.py", line 166, in build self.graph = networkx.from_pandas_adjacency(self.adj) File "/seekr/virtualenv/lib/python3.7/site-packages/networkx/convert_matrix.py", line 185, in from_pandas_adjacency raise nx.NetworkXError("Columns must match Indices.", msg % missing)

Thank you!

Hey, this is a great question. The answer is going to depend a on what experiment you're trying to perform, exactly.

seekr_graph expects a square adjacency matrix where the names of the rows match the names of the columns. Lets say that your species has n transcripts, and mouse has m transcripts. You would need to pass the whole, (n+m)^2 matrix, which would be asking the question, "What communities form if we throw all transcripts together and run community detection?"

Is this what you want to do? There isn't a neat tool to do it from the command line at this point, but I can write up something for you in Python.

Another approach would be to do the self vs. self communities for both, and then compare the communities afterwards, using something like normalized information. We've successfully used this approach in the past.

These are just two possibilities; maybe there's another thing you're trying to accomplish?

Hi Jessime,

We performed an RNA-Seq experiment for our non-model species and and are interested in the population of lncRNAs in our transcriptome assembly that are differentially expressed in two different conditions. I'd like to use seekr to identify non-coding genes that are in both our assembly and a well-annotated assembly like GENCODE human or mouse as a potential clue for their function in our organism.

I suppose the question being asked is, given a subset of lncRNAs that are DE: n(DE), which ones are similar to the total population of mouse lncRNAs, m. Is there an easy way to compare across species? I suppose it makes sense to perform self vs. self for both transcriptomes than compare between them. Would you recommend this strategy or something else entirely?

Thanks!

I still think there are several ways you could go about this. I'll show you how I would directly answer:

given a subset of lncRNAs that are DE: n(DE), which ones are similar to the total population of mouse lncRNAs, m?

as an example. Let's assume you've just finished running:

$ seekr_pearson DE_counts.csv mouse_counts.csv -o adj.csv

You could open python and run:

import pandas as pd

df = pd.read_csv('adj.csv', index_col=0)
mean_similarity = df.mean(axis=1).sort_values(ascending=False)
mean_similarity

This will give you a sorted list of how similar each DE transcript is to the mouse transcriptome, on average.

Alternatively, if the number of DE transcripts is fairly small (hundreds?) it might also make sense to see which DE transcripts fall into which mouse communities.

I don't think the communities are directly relevant to our work. We are more interested in identifying specifically conserved lncRNA in our assembly in relation to a well annotated species like mouse i.e. the likelihood that transcript n is Xist based on pearson correlation. Does seekr support something like a one to one comparison between two distinct lncRNAs? Would it be feasible to find the best cross-species pair that have the best pearson correlation or am I mistaken?

It sounds like all the information you want is just in adj.csv, right? It's just a matter of querying it to answer your question. Right now, the command line tools don't have a way to do this, but it's straightforward in python.

Using the example:

the likelihood that transcript n is Xist based on pearson correlation.

import pandas as pd

df = pd.read_csv('adj.csv', index_col=0)
most_xist_like = df['Xist-201'].sort_values(ascending=False)

Assuming you've labeled one of the mouse rows as Xist-201 (you can replace Xist-201 with the whole fasta header, another lncRNA, or whatever row you need), you'll have a sorted list of how similar each of your non-model transcripts are to Xist.

Hi Jessime!

Thanks for this. It was a great help. One thing I noticed was that there was a low degree of correlation between lncRNAs between species; even those that were evolutionarily conserved. This seems counterintuitive to me: wouldn't conserved lncRNAs have a high correlation due to having similar k-mer composition? Or am I just missing something?

Great, glad it's working out!

When you use 6mers, correlations are always pretty low, because of the large number of kmers in each profile. The math works out that as the length of two arrays approaches infinity, the correlation value between the arrays approaches 0, no matter the contents of the arrays. But even with the low values, we found that we could detect a significantly higher set of correlations between known homologs than other random comparisons.

I should say though that if your sole goal is to find evolutionary homologs, there are better tools than SEEKR, even for lncRNAs. SEEKR is probably better for detecting functional similarity.

hello.
I have a similar question. I want to find cross-species homologs of lncRNA and protein-coding genes. I just put all of them in a same fasta file. Can it work?
It seem like in your manuscript, humans and mouse are separately run seekr_kmer_counts. Then using seekr_compare compare them similarity.
which one is more suitable? Or they are the different routes to the same goal?

Hey @huanananan,

The difference between putting the species in the same file versus different files isn't going to be huge, but it might be significant. The difference between the two is the baseline normalization will change. If you have the time, I'd recommend trying both and seeing which works better for you.

If you only want to use one approach, I'd suggest keeping the files separate. This does two things:

  1. The normalization for each species is then relative only to itself, which I think makes more sense.
  2. The output file size is going to be much smaller this way.

Thank you. Very helpful.

Of course. I'm going to go ahead and close this issue since it's pretty old. But feel free to open new ones if you run into other questions/issues with seekr!