srlearn/rnlp

Parallelism for makeIdentifiers

hayesall opened this issue · 1 comment

A large share of the running time is spent in parse.makeIdentifiers(), which is essentially a triple-nested for loop over blocks, sentences, and words.
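For reference, the structure is roughly the following. This is only a sketch to show where the time goes; the variable names are illustrative and the real method does more work per word (building identifier strings and predicate facts), but the nesting is the same.

def makeIdentifiers(blocks):
    # Sketch only: three nested loops over blocks, sentences, and words.
    facts = []
    blockID = 0
    for block in blocks:                               # blocks
        for sentenceID, sentence in enumerate(block):  # sentences
            for wordID, word in enumerate(sentence):   # words
                facts.append((blockID, sentenceID, wordID, word))
        blockID += 1
    return facts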

Previously this was "resolved" by wrapping the outer loop with tqdm to estimate how long the process would take. That did not actually speed anything up, but it probably made the wait feel more tolerable.
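In other words, roughly this (blocks being the list produced by the earlier parsing step):

from tqdm import tqdm

# Progress bar around the outer loop only; the blocks are still
# processed serially, one at a time.
for block in tqdm(blocks):
    ...  # sentence and word loops, unchanged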


joblib may be a viable way to execute the outer loop in parallel:

from joblib import Parallel, delayed
from tqdm import tqdm

def foo(block, blockID):
    """
    :param block: The current block to be processed (list of lists).
    :param blockID: Index of the current block (int).
    """
    return [blockID]

Blocks = list(range(5000))
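# n_jobs=-1 uses every available CPU core; one task is dispatched per block.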
facts = Parallel(n_jobs=-1)(delayed(foo)(Blocks[i], i) for i in tqdm(range(len(Blocks))))

In the short example above, Blocks stands in for the list of blocks generated earlier. foo(block, blockID) would be something similar to the current parse.makeIdentifiers() method, except that blockID is passed as a parameter rather than being an integer that increments at the end of the outer loop.
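Putting those pieces together, a parallel version could be organized roughly like this. It is a sketch under the assumptions above: process_block is a hypothetical stand-in for the per-block body of parse.makeIdentifiers(), blocks is the list generated earlier, and the flattening step assumes each call returns a list of facts.

from itertools import chain

from joblib import Parallel, delayed
from tqdm import tqdm

def process_block(block, blockID):
    """Hypothetical per-block worker: the per-block portion of
    parse.makeIdentifiers(), with blockID passed in instead of incremented."""
    facts = []
    for sentenceID, sentence in enumerate(block):
        for wordID, word in enumerate(sentence):
            facts.append((blockID, sentenceID, wordID, word))
    return facts

# One task per block; results come back as a list of per-block lists.
per_block = Parallel(n_jobs=-1)(
    delayed(process_block)(block, i) for i, block in enumerate(tqdm(blocks))
)
facts = list(chain.from_iterable(per_block))

One thing to keep in mind is that joblib's default backend runs worker processes, so each block has to be pickled and sent to a worker; very small blockSize values may spend more time on serialization than on the parsing itself.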

Current progress is on batflyer/rnlp (the parallel branch). I ran a short round of testing to estimate the performance gains we might expect; the results are graphed below.

Both runs used the same corpus and were performed on my local machine.

  • Top graph: blockSize=1
  • Bottom graph: blockSize=2
  • x-axis: number of cores
  • y-axis: time (in seconds) to process the blocks

[Figure: time_vs_cores — processing time (seconds) vs. number of cores, for blockSize=1 (top) and blockSize=2 (bottom)]
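A minimal way to reproduce this kind of comparison (hypothetical harness, reusing the process_block and blocks stand-ins from the sketch above) is to time the parallel call at each core count:

import time

from joblib import Parallel, delayed

# Time the parallel run at several core counts and print the results.
for n_jobs in (1, 2, 4, 8):
    start = time.time()
    Parallel(n_jobs=n_jobs)(
        delayed(process_block)(block, i) for i, block in enumerate(blocks)
    )
    print(n_jobs, "cores:", round(time.time() - start, 2), "seconds")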