hacsoc/love

Import task runs out of memory for large databases


Caution: long ticket incoming!

When you take students, faculty, staff, and alumni together, CWRU has a lot more people than Yelp. Since this app is built around having a database containing every "employee", we have to import them all. Some operations that probably worked fine at the scale of Yelp's employee database are no longer feasible for us, and the current pain point is importing.

Employee import (from CSV or JSON) runs in a taskqueue setting, and each "request" in a taskqueue gets 10 minutes and a memory limit of 128MB. When requests fail, they are retried (I believe a maximum of 10 times) with exponential backoff. The import process used to work as follows:

  1. Open the import source (CSV/JSON) and begin reading. For CSVs, a DictReader is used, which creates an iterator and does not load the whole file into memory.
  2. Iterate over the import source, and either add (if the employee is new) or update (if the employee already exists) each entry. Each employee entry is added to one (or possibly two) lists.
  3. Scan the employee table (which is loaded into memory all at once) for employees which are not present in the import. Mark them as terminated.
  4. Finally, rebuild the autocomplete index. This works by iterating over each employee in the table (again, all in memory), listing every possible substring of their first/last name and username, and creating an index entry for each one (see the sketch just after this list).
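To make step 4 concrete, the substring expansion looks roughly like the sketch below; the function names are placeholders of mine, not the app's actual code:

```python
def all_substrings(text):
    """Every contiguous substring of `text`, lowercased."""
    text = text.lower()
    return {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}


def autocomplete_tokens(first_name, last_name, username):
    """The set of tokens that would get indexed for one employee."""
    tokens = set()
    for field in (first_name, last_name, username):
        tokens |= all_substrings(field)
    return tokens


# A single 10-character field alone yields 10 * 11 / 2 = 55 substrings,
# so every employee fans out into dozens of index entries.
print(len(autocomplete_tokens("Ada", "Lovelace", "alovelace")))
```

That quadratic fan-out per field is worth keeping in mind for the memory discussion below.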

Initially, I was getting timeouts. That was because my database dump had over 100,000 employees, and there wasn't enough time to process all of them. I worked around it by chunking the import into smaller files, which is inconvenient but helped. Since then I have improved my dump process (see #1 and brenns10/ldump), and my import file became much smaller anyway. I've found that 15,000-20,000 employees is a manageable amount for a single import, and potentially even more could work.

Next, I ran into memory issues. Initially these came from step 3 (loading the whole Employee table into memory) and from step 2 (keeping every imported employee in memory). I removed that logic (in 084507a), so CSV importing is now a purely additive process, and I now write back to the database every 500 employees so that we never hold the whole import list in memory (in 0a0bd57).
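For reference, here is a minimal sketch of that batched write-back, assuming an ndb-style model; `employee_from_row` is a hypothetical callback standing in for whatever builds or updates an entity from a CSV row, and the actual code in 0a0bd57 differs in the details:

```python
import csv

from google.appengine.ext import ndb

BATCH_SIZE = 500  # flush to the datastore every 500 rows


def import_employees(csv_file, employee_from_row):
    """Stream a CSV and write entities in fixed-size batches.

    `employee_from_row` is a stand-in for whatever builds (or updates) an
    Employee entity from a single CSV row; the batching is the point here.
    """
    batch = []
    for row in csv.DictReader(csv_file):
        batch.append(employee_from_row(row))
        if len(batch) >= BATCH_SIZE:
            ndb.put_multi(batch)  # one datastore call for the whole batch
            batch = []            # drop references so memory can be reclaimed
    if batch:
        ndb.put_multi(batch)
```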

Now I run into memory issues when rebuilding the index. This is a lesser problem: autocomplete isn't required for the app to work, and the import itself succeeds. But it bothers me, because autocomplete is incredibly useful, and we definitely shouldn't need to load the whole user database into memory to rebuild the search index! In 0a0bd57 I rewrote the Employee query so that it returns results in chunks of 200 (the chunk size the index logic was already using). However, I'm still getting memory issues, and I really don't know why, since the table is no longer loaded into memory all at once. Maybe it has to do with the size/number of the substrings. Here are my next strategies for addressing these issues:

  1. CSV import does not have to happen in the same task as reindexing. I could separate them out, which will give us more time to do each task.
  2. If the memory size of the substrings is the problem, I can lower the chunk size for indexing, so that we're keeping fewer substrings in memory. This might make the runtime of the indexing process slower, but that can be managed.
  3. If indexing is too slow, I can take advantage of cursors to split the indexing process into multiple tasks! By making the indexing process take a database cursor and track how much time has elapsed during the request, it can stop itself, queue a new task starting at its current cursor location, and terminate successfully (a rough sketch of this follows the list).
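Here is a rough sketch of what strategy 3 could look like, using ndb query cursors and the deferred library; the `Employee` model, `index_employee` helper, chunk size, and time budget are stand-ins, not the app's actual code:

```python
import time

from google.appengine.datastore.datastore_query import Cursor
from google.appengine.ext import deferred, ndb

CHUNK_SIZE = 200       # employees fetched per page
TIME_BUDGET = 8 * 60   # seconds; stop well before the 10-minute deadline


class Employee(ndb.Model):
    """Minimal stand-in for the app's real Employee model."""
    first_name = ndb.StringProperty()
    last_name = ndb.StringProperty()
    username = ndb.StringProperty()


def index_employee(employee):
    """Stand-in for the real per-employee indexing logic."""


def rebuild_index(cursor_urlsafe=None):
    """Reindex one page at a time, chaining a new task when time runs low."""
    start = time.time()
    cursor = Cursor(urlsafe=cursor_urlsafe) if cursor_urlsafe else None
    more = True
    while more:
        employees, cursor, more = Employee.query().fetch_page(
            CHUNK_SIZE, start_cursor=cursor)
        for employee in employees:
            index_employee(employee)
        if more and time.time() - start > TIME_BUDGET:
            # Out of time: queue a continuation task that resumes at the
            # current cursor, then terminate this task successfully.
            deferred.defer(rebuild_index, cursor_urlsafe=cursor.urlsafe())
            return
```

Conveniently, the same `deferred.defer` call is all strategy 1 would need too: the import task could hand reindexing off to a separate task instead of doing it inline.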

So long as the memory issue can be fixed, I think we can be in the clear.

Doc links I'm leaving for later work: