vanheeringen-lab/genomepy

caching

siebrenf opened this issue · 4 comments

Hmm, but it would be a bit too much to use two different caching libraries. However, joblib.Memory doesn't have an expiry-like option, it caches "forever". BucketCache on the other hand doesn't recognize the input arguments as the same, so it essentially doesn't cache.

Originally posted by @simonvh in #162 (comment)
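To illustrate the limitation quoted above, here is a minimal joblib.Memory sketch (the path and function are hypothetical):

import joblib

# joblib.Memory caches to disk, keyed on the function's arguments,
# but has no expiry-like option: entries persist until cleared manually
memory = joblib.Memory("/tmp/joblib-cache-test", verbose=0)

@memory.cache
def expensive(n):
    return sum(range(n))

expensive(10**7)  # computed once, written to disk
expensive(10**7)  # loaded from disk, in this session and every later one
memory.clear()    # manual eviction; nothing expires on its own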

We need a caching method with short term (session) and long term (several days) options. For the long term storage, an expiry system is required. Furthermore, it'd be nice if we could cache methods, functions and objects. Bonus if the caching method(s) are thread safe. Some potentially suitable packages are listed in the linked thread, but none seem to cover all three core requirements. Likely this means creating a wrapper around two separate methods.

in pseudocode:

  1. if the called output is present in the short term cache, return it
  2. if the called output is present in the long term cache, and is outdated, delete it
  3. if the called output is present in the long term cache, load it in the short term cache and return it
  4. else run the required function/method, store the output in both long and short term caches, and return it

in code for a function call (here termed call):

import os
import time
import pickle

CACHE_DIR = "..."
memory = dict()


def cache(func):
    """wrapper with long & short term storage, and long term storage expiry"""

    def check_caches(*args):
        # the call is identified by the function name plus its arguments
        call = (func.__name__,) + args

        # short term caching (in-memory):
        if call in memory:
            return memory[call]

        call_pickle = os.path.join(CACHE_DIR, "...")  # "..." = a filename derived from call

        # long term cache expiry (here: 7 days after creation)
        if os.path.exists(call_pickle):
            current_time = time.time()
            expiry_time = os.path.getctime(call_pickle) + 7 * 24 * 60 * 60
            if current_time > expiry_time:
                os.remove(call_pickle)

        # long term caching (e.g. pickling)
        if os.path.exists(call_pickle):
            with open(call_pickle, "rb") as f:
                memory[call] = pickle.load(f)
        else:
            memory[call] = func(*args)
            with open(call_pickle, "wb") as f:
                pickle.dump(memory[call], f)

        return memory[call]

    return check_caches
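Used as a decorator, a function would then route through both caches (hypothetical list_genomes shown for illustration):

@cache
def list_genomes(provider):
    ...  # some expensive lookup

list_genomes("ncbi")  # computed, stored in the pickle and in memory
list_genomes("ncbi")  # returned from the in-memory cache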

oh, extra bonus points if the short term cache remains intact for ~10 minutes (since the last call to it) when running genomepy from the command line! (a sliding-expiry sketch follows the example below)

This is a wrapper inspired by this example that implements both short- and long-term caching using diskcache and joblib.

from diskcache import Cache
import timeit
from functools import wraps
import joblib
import os
import time

cache = Cache(directory="/tmp/disk-cache-test")


def cached(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        # Define short/long term cache expiration in seconds
        exp_short = 5  # type: int
        exp_long = 120  # type: int
        # Generate the cache key from the function's arguments.
        key_parts = map(str, [func.__name__] + list(args))
        key = "-".join(key_parts)
        # Path of the pickled (long-term) object
        ext = ".joblib"
        pickle_obj = os.path.join(cache.directory, f"{key}{ext}")
        # Check if the value exists in the short-term (in-memory) cache
        value = cache.get(key)
        if value is None:
            # Check if the value exists as a pickle, and if it is expired
            if os.path.exists(pickle_obj):
                current_time = time.time()
                expiry_time = os.path.getctime(pickle_obj) + exp_long
                # Remove the pickled object (long-term) if expired
                if current_time > expiry_time:
                    print(f"Removing {key} from long-term storage since it is expired")
                    os.remove(pickle_obj)
                    # We need to re-compute and store the result in memory and as pickle
                    print(f"re-computing {key} and caching value")
                    value = func(*args, **kwargs)
                    cache.close()
                    with Cache(cache.directory) as reference:
                        reference.set(key, value, exp_short)
                    # Store the function result as pickle
                    joblib.dump(value, pickle_obj)
                else:
                    # Load the existing pickle into memory and skip computation
                    print(f"Loading {key} from pickle into cache")
                    cache.close()
                    with Cache(cache.directory) as reference:
                        reference.set(key, joblib.load(pickle_obj), exp_short)
                    value = cache.get(key)
            else:
                # Run the function and cache the result for next time.
                print(f"computing {key} and storing value")
                value = func(*args, **kwargs)
                cache.close()
                with Cache(cache.directory) as reference:
                    reference.set(key, value, exp_short)
                # Store the function result as pickle
                joblib.dump(value, pickle_obj)
        else:
            # Skip the function entirely and use the cached value instead.
            print(f"Loading {key} from short-term memory")
        return value

    return wrapper


@cached
def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    return fibonacci(n - 1) + fibonacci(n - 2)

print(timeit.timeit('fibonacci(45)', globals=globals(), number=1))
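For the ~10 minute CLI bonus mentioned above, diskcache can emulate a sliding expiry: touch resets an entry's countdown on every hit, so it expires 10 minutes after the last call rather than the first. A minimal sketch, reusing the cache object from the example (the helper names are made up):

from diskcache import Cache

cache = Cache(directory="/tmp/disk-cache-test")
SLIDING_EXPIRY = 600  # seconds, ~10 minutes

def get_sliding(key):
    value = cache.get(key)
    if value is not None:
        # push the expiry 10 minutes past *this* access
        cache.touch(key, expire=SLIDING_EXPIRY)
    return value

def set_sliding(key, value):
    cache.set(key, value, expire=SLIDING_EXPIRY)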

This looks good!

Could you implement it in the caching module and try it out? Here are two different heavy functions to try it on:

This function takes a list of genes as input, which may well be a list of 40k strings, and that might be an issue: it gave me warnings with the current joblib implementation (see the hashed-key sketch after the snippet).

import genomepy

# if you haven't got the genome + annotation already
genomepy.install_genome("hg38", annotation=True)

a = genomepy.Annotation("hg38")
a.map_genes(gene_field="ensembl.gene")  # or another gene_field
a.map_genes(gene_field="ensembl.gene")  # repeated call: should hit the cache
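A 40k-string argument is a problem for the key generation in the wrapper above: '-'.join over the arguments would put the whole gene list in the key and in the pickle filename. Hashing the arguments into a fixed-length digest sidesteps this (a sketch; make_key is a hypothetical replacement for the key-building lines in cached):

import hashlib

def make_key(func_name, *args):
    # hash the stringified arguments so huge inputs (e.g. 40k gene names)
    # collapse into a fixed-length, filename-safe digest
    raw = "-".join(map(str, args))
    digest = hashlib.sha256(raw.encode()).hexdigest()
    return f"{func_name}-{digest}"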

Another heavy function is getting the genomes dictionary. There's one for each provider, but the biggest is the one for NCBI.

import genomepy
genomepy.clean()
n = genomepy.Provider.create("ncbi")
n = genomepy.Provider.create("ncbi")  # repeated call: should hit the cache
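To check that the second call is actually served from the cache, one could time a decorated helper (a sketch, assuming the cached decorator above is importable and that providers expose their genomes dictionary as .genomes):

import timeit
import genomepy

@cached  # the diskcache/joblib wrapper from above
def ncbi_genomes():
    return genomepy.Provider.create("ncbi").genomes  # assumed attribute

print(timeit.timeit(ncbi_genomes, number=1))  # first call: computed and cached
print(timeit.timeit(ncbi_genomes, number=1))  # second call: served from the cache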