ENCODE-DCC/encoded

Indexing of bed files for region search

Parul-Kudtarkar opened this issue · 4 comments

Hi,

The indexing (peak indexing) of bed files slows down significantly as more experimental datasets are added. This might not be evident on servers that have been running for a long time, since I believe re-indexing is done only for newer datasets. However, every time a new server is launched, a complete indexing is run, which is slow because of the larger dataset (peak files). Is there a workaround for this issue?

Thank you!
Parul Kudtarkar

Bek commented

You can run a separate EC2 instance with Elasticsearch installed on it and edit the security group rules to allow your instances to talk to it over the 9200-9300 port range. The peak indexer then uses the remote machine as specified here: https://github.com/ENCODE-DCC/encoded/blob/master/buildout.cfg#L89. New instances can then connect to the machine that already has the indexed data.
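
Before pointing the peak indexer at the remote machine, it may help to confirm that the security group actually lets the new instance through. The following is only a minimal sketch using the Python standard library, not part of encoded: the host address is a placeholder, the helper names `port_is_open` and `check_ports` are made up for illustration, and it assumes Elasticsearch is listening on its default ports (9200 for HTTP, 9300 for transport), both inside the 9200-9300 range mentioned above.

```python
import socket

# Hypothetical placeholder: private address of the remote Elasticsearch machine.
ES_HOST = "10.0.0.12"
# Assumed Elasticsearch defaults: 9200 (HTTP) and 9300 (transport).
ES_PORTS = (9200, 9300)


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_is_open(host, port, timeout) for port in ports}


# Example usage, run from the instance that will do the indexing:
#   for port, ok in check_ports(ES_HOST, ES_PORTS).items():
#       print(f"{ES_HOST}:{port}", "reachable" if ok else "NOT reachable")
```

A port reported as not reachable usually points to a missing security group rule or to Elasticsearch being bound only to localhost on the remote machine. This check only tests TCP connectivity; it does not verify that the index data is present.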

Thank you so much @Bek for the quick response.

Great! This works

@Bek a quick question: the peak_indexer.py and region_search.py scripts would be local to the EC2 instance running them, and not to the machine holding the indexed data, right?

Thank you!
Parul Kudtarkar