aertslab/arboreto

error when running grnboost

Closed this issue · 2 comments

Hello,

I am running pySCENIC and ran into a problem with grnboost. I followed the instructions and wrote code similar to this:
//
import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # load the data
    ex_matrix = pd.read_csv(<ex_path>, sep='\t')
    tf_names = load_tf_names(<tf_path>)

    # infer the gene regulatory network
    network = grnboost2(expression_data=ex_matrix, tf_names=tf_names)
//
pySCENIC works fine with a small data set of 250 genes; however, with the bigger data set that I am testing (~2000 genes or more), I get this warning:

UserWarning: Large object of size 1.17 MB detected in task graph:
(["('from-delayed-7f2fea60c7dfbbfb0ec7f83dc75b83af ... af', 19972)"],)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers

future = client.submit(func, big_data)    # bad

big_future = client.scatter(big_data)     # good
future = client.submit(func, big_future)  # good

% (format_bytes(len(b)), s))

The program got stuck at this point and never finished when I ran it on a MacBook Pro (2.6 GHz i7). I also tried the command-line version, pyscenic grnboost -o OUTPUT @grn_args.txt, where grn_args.txt contains the names of the expression matrix file and the known TF file; the expression matrix has cell IDs as rows and genes as columns.
What do you think is the issue here?
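
For reference, a minimal sketch of loading a matrix laid out as described (cell IDs as rows, genes as columns). It assumes the cell IDs sit in the first column of the tab-separated file, which the post does not confirm; the helper name load_expression_matrix and the demo data are hypothetical. Without index_col=0, the cell IDs would be read as an extra non-numeric "gene" column.

```python
import io

import pandas as pd


def load_expression_matrix(path_or_buffer):
    """Read a tab-separated matrix: first column = cell IDs, header = gene names."""
    # index_col=0 moves the cell IDs into the index so that every
    # remaining column is a gene with numeric expression values.
    ex_matrix = pd.read_csv(path_or_buffer, sep='\t', index_col=0)

    # Any non-numeric column left over suggests the file is laid out differently.
    non_numeric = ex_matrix.select_dtypes(exclude='number').columns.tolist()
    if non_numeric:
        raise ValueError(f"non-numeric columns found: {non_numeric}")
    return ex_matrix


# Tiny in-memory stand-in for a real file (2 cells x 2 genes).
demo = io.StringIO("cell\tGeneA\tGeneB\nc1\t1.0\t0.0\nc2\t2.0\t3.0\n")
matrix = load_expression_matrix(demo)
print(matrix.shape)
```

If the file has no cell ID column at all, only a header of gene names, the plain pd.read_csv(<ex_path>, sep='\t') call from the post is the right one.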

Thank you,
Diep

Hello,

Note that GRN inference is a very computationally intensive step that can take hours to days on a laptop. GRNBoost2 was designed to run on one or more big machines (e.g. dual 12-core Xeon CPUs, 128 GB RAM); on a laptop you might run into memory problems and very long execution times.

In some cases, increasing the worker memory limit helps:

client = Client(LocalCluster(memory_limit=8e9))
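
A minimal sketch of how such a client could be handed to grnboost2 through its client_or_address parameter, so that inference runs on the cluster with the raised limit rather than on the default local one. The helper name run_grnboost2_on_local_cluster is hypothetical, and 8e9 is only the example value from above; memory_limit applies per worker, so the right value depends on the machine.

```python
def run_grnboost2_on_local_cluster(ex_matrix, tf_names, memory_limit=8e9):
    """Run grnboost2 on a LocalCluster with an explicit per-worker memory limit.

    Hypothetical helper; memory_limit is in bytes per worker (8e9 = 8 GB).
    """
    # Imported here so that defining the helper does not pull in the
    # heavy dependencies until it is actually called.
    from arboreto.algo import grnboost2
    from distributed import Client, LocalCluster

    cluster = LocalCluster(memory_limit=memory_limit)
    client = Client(cluster)
    try:
        # client_or_address makes grnboost2 use this client instead of
        # creating its own default local cluster.
        return grnboost2(expression_data=ex_matrix,
                         tf_names=tf_names,
                         client_or_address=client)
    finally:
        # Release the workers even if inference fails or is interrupted.
        client.close()
        cluster.close()
```

The call itself, e.g. network = run_grnboost2_on_local_cluster(ex_matrix, tf_names), should stay inside the if __name__ == '__main__': guard, since the local cluster starts worker processes.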

On a Mac you can use Activity Monitor to see what is happening. On Linux we typically use htop.

kind regards,
Thomas

Hi Thomas,

I was able to get the results from grnboost. The output file has 3 columns. What can I do if I want to import this result back into R and run the SCENIC pipeline? I could not find the code for this in the tutorial.

Best,
Peng