ENCODE-DCC/encoded

Indexing assays with a large number of associated files

Parul-Kudtarkar opened this issue · 3 comments

Hi developers,

Appreciate feedback/suggestions:

For single cell assays, each experiment has several biosamples and hundreds of associated files - https://www.t2depigenome.org/search/?type=Experiment&assay_term_name=single+cell+isolation+followed+by+RNA-seq
When spinning up a new server, indexing is the bottleneck.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "bin/es-index-data", line 75, in
sys.exit(encoded.commands.es_index_data.main())
File "/srv/encoded/src/encoded/commands/es_index_data.py", line 47, in main
return run(app, args.item_type, args.record)
File "/srv/encoded/src/encoded/commands/es_index_data.py", line 19, in run
'recovery': True
File "/srv/encoded/eggs/WebTest-2.0.20-py3.4.egg/webtest/utils.py", line 37, in wrapper
return self._gen_request(method, url, **kw)
File "/srv/encoded/eggs/WebTest-2.0.20-py3.4.egg/webtest/app.py", line 736, in _gen_request
expect_errors=expect_errors)
File "/srv/encoded/eggs/WebTest-2.0.20-py3.4.egg/webtest/app.py", line 606, in do_request
res = req.get_response(app, catch_exc_info=True)
File "/srv/encoded/eggs/WebOb-1.6.0-py3.4.egg/webob/request.py", line 1295, in send
application, catch_exc_info=True)
File "/srv/encoded/eggs/WebOb-1.6.0-py3.4.egg/webob/request.py", line 1263, in call_application
app_iter = application(self.environ, start_response)
File "/srv/encoded/eggs/WebTest-2.0.20-py3.4.egg/webtest/lint.py", line 198, in lint_app
iterator = application(environ, start_response_wrapper)
File "/srv/encoded/eggs/pyramid-1.6a2-py3.4.egg/pyramid/router.py", line 223, in call
response = self.invoke_subrequest(request, use_tweens=True)
File "/srv/encoded/eggs/pyramid-1.6a2-py3.4.egg/pyramid/router.py", line 198, in invoke_subrequest
response = handle_request(request)
File "/srv/encoded/develop/snovault/src/snovault/stats.py", line 63, in stats_tween
response = handler(request)
File "/srv/encoded/src/encoded/renderers.py", line 75, in fix_request_method_tween
return handler(request)
File "/srv/encoded/src/encoded/renderers.py", line 137, in normalize_cookie_tween
return handler(request)
File "/srv/encoded/eggs/subprocess_middleware-0.3-py3.4.egg/subprocess_middleware/tween.py", line 31, in subprocess_tween
response = handler(request)
File "/srv/encoded/src/encoded/renderers.py", line 162, in set_x_request_url_tween
response = handler(request)
File "/srv/encoded/eggs/pyramid-1.6a2-py3.4.egg/pyramid/tweens.py", line 20, in excview_tween
response = handler(request)
File "/srv/encoded/eggs/pyramid_tm-0.12.1-py3.4.egg/pyramid_tm/init.py", line 101, in tm_tween
reraise(*exc_info)
File "/srv/encoded/eggs/pyramid_tm-0.12.1-py3.4.egg/pyramid_tm/compat.py", line 15, in reraise
raise value
File "/srv/encoded/eggs/pyramid_tm-0.12.1-py3.4.egg/pyramid_tm/init.py", line 83, in tm_tween
response = handler(request)
File "/srv/encoded/src/encoded/renderers.py", line 118, in security_tween
return handler(request)
File "/srv/encoded/eggs/pyramid-1.6a2-py3.4.egg/pyramid/router.py", line 145, in handle_request
view_name
File "/srv/encoded/eggs/pyramid-1.6a2-py3.4.egg/pyramid/view.py", line 541, in _call_view
response = view_callable(context, request)
File "/srv/encoded/eggs/pyramid-1.6a2-py3.4.egg/pyramid/config/views.py", line 327, in attr_view
return view(context, request)
File "/srv/encoded/eggs/pyramid-1.6a2-py3.4.egg/pyramid/config/views.py", line 303, in predicate_wrapper
return view(context, request)
File "/srv/encoded/eggs/pyramid-1.6a2-py3.4.egg/pyramid/config/views.py", line 243, in _secured_view
return view(context, request)
File "/srv/encoded/eggs/pyramid-1.6a2-py3.4.egg/pyramid/config/views.py", line 352, in rendered_view
result = view(context, request)
File "/srv/encoded/eggs/pyramid-1.6a2-py3.4.egg/pyramid/config/views.py", line 506, in _requestonly_view
response = view(request)
File "/srv/encoded/develop/snovault/src/snovault/elasticsearch/indexer.py", line 431, in index
errors = indexer.update_objects(request, invalidated, xmin, snapshot_id, restart)
File "/srv/encoded/develop/snovault/src/snovault/elasticsearch/mpindexer.py", line 152, in update_objects
update_object_in_snapshot, tasks, chunkiness)):
File "/usr/lib/python3.4/multiprocessing/pool.py", line 314, in
return (item for chunk in result for item in chunk)
File "/usr/lib/python3.4/multiprocessing/pool.py", line 689, in next
raise value
sqlalchemy.exc.DatabaseError: (psycopg2.DatabaseError) server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[SQL: 'SELECT keys.name AS keys_name, keys.value AS keys_value, keys.rid AS keys_rid, propsheets_1.sid AS propsheets_1_sid, propsheets_1.rid AS propsheets_1_rid, propsheets_1.name AS propsheets_1_name, propsheets_1.properties AS propsheets_1_properties, propsheets_1.tid AS propsheets_1_tid, current_propsheets_1.rid AS current_propsheets_1_rid, current_propsheets_1.name AS current_propsheets_1_name, current_propsheets_1.sid AS current_propsheets_1_sid, resources_1.rid AS resources_1_rid, resources_1.item_type AS resources_1_item_type \nFROM keys JOIN resources AS resources_1 ON resources_1.rid = keys.rid JOIN current_propsheets AS current_propsheets_1 ON resources_1.rid = current_propsheets_1.rid JOIN propsheets AS propsheets_1 ON current_propsheets_1.sid = propsheets_1.sid \nWHERE keys.name = %(name)s AND keys.value = %(value)s'] [parameters: {'value': 'TSTFF059661', 'name': 'accession'}]

hitz commented

That is a new one. What is in the postgres logs?

@hitz On restarting the cluster there was no issue, and the postgres logs showed no errors with logging_commands = all.
The indexing was slow and failed because the elasticsearch queue capacity was exceeded. The workaround I used was to increase the hardware (cluster size) and expand the indexing queue capacity (from 200 to 1000):
{
  "persistent" : {
    "threadpool" : {
      "index" : {
        "queue_size" : "1000"
      },
      "search" : {
        "queue_size" : "400"
      }
    }
  },
  "transient" : { }
}
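
For reference, a setting like this can be applied at runtime through the cluster settings API. Below is a minimal sketch in Python, assuming the cluster is reachable at localhost:9200 and runs an Elasticsearch version (1.x/2.x) that still accepts the threadpool.*.queue_size keys; adjust the host and keys for your deployment.

    # Minimal sketch: bump the Elasticsearch indexing/search queue sizes at runtime.
    # Assumes the cluster answers on localhost:9200 (an assumption, not a given).
    import json
    import urllib.request

    settings = {
        "persistent": {
            "threadpool": {
                "index": {"queue_size": "1000"},
                "search": {"queue_size": "400"},
            }
        }
    }

    req = urllib.request.Request(
        "http://localhost:9200/_cluster/settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        # The response echoes the settings the cluster accepted.
        print(resp.read().decode("utf-8"))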

On that note, for assays that have several hundred associated files, would the following be practically plausible to implement?
a) Write a script that runs on an EC2 instance (m4.xlarge; this should not take more than 1-2 days), downloads the single cell files, tars/merges them, and uploads the result back to s3 (see the sketch after this list). My immediate thought is to merge them by file type, which would leave a minimal number of files per assay, although this breaks the derived from association between files.
b) Clear all the postgres file transactions/links associated with single cell and link the newly merged files to their respective assays.
https://www.t2depigenome.org/search/?type=Experiment&assay_term_name=single+cell+isolation+followed+by+RNA-seq
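
For part (a), here is a rough sketch of the download/merge/upload step, assuming boto3 is available and credentials are configured; the bucket names, keys, and file list below are hypothetical placeholders, and the real list of (key, file type) pairs would come from a search over the experiment's files.

    # Rough sketch of part (a): pull a set of single cell files from S3, bundle
    # them into one tar archive per file type, and push the archives back up.
    # Bucket names, keys, and the file list are hypothetical placeholders.
    import os
    import tarfile
    import boto3

    s3 = boto3.client("s3")
    SOURCE_BUCKET = "example-encoded-files"    # placeholder
    DEST_BUCKET = "example-encoded-merged"     # placeholder

    # (key, file_type) pairs; in practice these come from the experiment's file list.
    files = [
        ("2017/01/TSTFF000001.fastq.gz", "fastq"),
        ("2017/01/TSTFF000002.fastq.gz", "fastq"),
    ]

    # Download each file and group the local copies by file type.
    by_type = {}
    for key, file_type in files:
        local = os.path.basename(key)
        s3.download_file(SOURCE_BUCKET, key, local)
        by_type.setdefault(file_type, []).append(local)

    # Build one tar.gz per file type and upload it to the destination bucket.
    for file_type, paths in by_type.items():
        archive = "merged_{}.tar.gz".format(file_type)
        with tarfile.open(archive, "w:gz") as tar:
            for path in paths:
                tar.add(path)
        s3.upload_file(archive, DEST_BUCKET, archive)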

Thanks a lot!