
Centillion is the Data Commons search engine. One centillion is 3.03 log-times better than a googol.


centillion


centillion: a document search engine that searches across GitHub issues, GitHub pull requests, GitHub files, Google Drive documents, and Disqus comment threads.

a centillion: a very large number consisting of a 1 with 303 zeros after it.

One centillion is 3.03 log-times better than a googol.
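The 3.03 figure can be checked with a few lines of Python. This is a minimal sketch that assumes "log-times" means the ratio of the base-10 logarithms, that is, the ratio of the zero counts, and that a googol is a 1 followed by 100 zeros. The helper name zeros_after_one is made up for this illustration.

```python
# Assumption: "log-times better" is read as log10(centillion) / log10(googol).
centillion = 10 ** 303  # a 1 with 303 zeros after it
googol = 10 ** 100      # a 1 with 100 zeros after it


def zeros_after_one(n):
    """Return the number of zeros in a number written as 1 followed by zeros."""
    digits = str(n)
    if digits[0] != "1" or set(digits[1:]) - {"0"}:
        raise ValueError("expected a 1 followed only by zeros")
    return len(digits) - 1


# For an exact power of ten, the zero count equals its base-10 logarithm.
ratio = zeros_after_one(centillion) / zeros_after_one(googol)
print(ratio)  # 3.03
```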

Screenshot: centillion search

What is centillion

centillion is a search engine that can index different kinds of document collections: Google Documents (.docx files), Google Drive files, GitHub issues, GitHub files, GitHub Markdown files, and Disqus comment threads.

How centillion works

The backend of centillion defines how documents are obtained and how the search index is constructed. centillion uses each service's API to fetch the latest versions of documents, then builds or updates the search index to match.
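The fetch-then-update cycle described above can be illustrated with a toy in-memory index. This is only a sketch of the general idea, not centillion's actual backend: the Document fields, the ToyIndex class, and the rule of re-indexing when the modified value increases are all assumptions made for this example, and the list of documents stands in for whatever an API call would return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str    # hypothetical unique id, e.g. an issue URL
    modified: int  # hypothetical last-modified timestamp
    text: str


class ToyIndex:
    """Toy inverted index (term -> set of document ids); illustration only."""

    def __init__(self):
        self.postings = {}  # term -> set of doc ids containing that term
        self.stored = {}    # doc_id -> Document version currently indexed

    def _remove(self, doc_id):
        old = self.stored.pop(doc_id)
        for term in set(old.text.lower().split()):
            ids = self.postings[term]
            ids.discard(doc_id)
            if not ids:
                del self.postings[term]

    def _add(self, doc):
        self.stored[doc.doc_id] = doc
        for term in set(doc.text.lower().split()):
            self.postings.setdefault(term, set()).add(doc.doc_id)

    def update(self, latest):
        """Sync the index with the latest documents fetched from a source."""
        latest_by_id = {doc.doc_id: doc for doc in latest}
        # Drop documents that no longer exist upstream.
        for doc_id in list(self.stored):
            if doc_id not in latest_by_id:
                self._remove(doc_id)
        # Add new documents and re-index the ones that changed.
        for doc in latest_by_id.values():
            current = self.stored.get(doc.doc_id)
            if current is None:
                self._add(doc)
            elif doc.modified > current.modified:
                self._remove(doc.doc_id)
                self._add(doc)

    def search(self, term):
        """Return the sorted ids of documents containing the term."""
        return sorted(self.postings.get(term.lower(), set()))
```

Calling update() a second time with a newer version of a document replaces its old terms, and documents missing from the latest fetch are dropped from the index.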

The centillion frontend provides a web interface for running queries against the search index.

How to configure centillion

To configure centillion, read the full documentation at http://nih-data-commons.us/centillion/. Some example configuration files are in the examples/ directory. Before configuring centillion to search your organization's file systems, we recommend following the Quickstart.

Quickstart

This quickstart will get you started with a centillion instance that is populated with fake documents (avoiding the need to make real API calls). This will allow you to try out centillion before you enable any APIs.

Clone:

Start by cloning a copy of the repo:

cd
git clone https://github.com/dcppc/centillion
cd ~/centillion/

Virtual Environment:

(This step is optional but recommended. If you do not have virtualenv installed, install it before continuing.)

Start by setting up a virtual environment, where centillion will be installed:

virtualenv vp
source vp/bin/activate
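If virtualenv is not available, Python 3's standard-library venv module can create a comparable environment. The sketch below is an alternative to the commands above, not part of centillion: create_env is a hypothetical helper, and the directory name vp is reused only to match this quickstart.

```python
import venv
from pathlib import Path


def create_env(env_dir="vp", with_pip=True):
    """Create a virtual environment with the stdlib venv module.

    Returns the path to the environment's pyvenv.cfg marker file.
    """
    # with_pip=True asks venv to bootstrap pip into the new environment,
    # which is needed for the pip install step later in the quickstart.
    venv.create(env_dir, with_pip=with_pip)
    return Path(env_dir) / "pyvenv.cfg"
```

After calling create_env(), the environment is activated the same way as before, with source vp/bin/activate.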

Install:

To install centillion, first install the required packages:

pip install -r requirements.txt

Now install centillion:

python setup.py build install

Test that your centillion installation went okay:

python -m centillion

If you see no output, that means centillion has been successfully installed. If you see an error message, check that you have activated your virtual environment (source vp/bin/activate).
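The same check can be made from inside Python, which also confirms that the active interpreter is the one centillion was installed into. This is a minimal sketch using only the standard library; is_installed is a hypothetical helper name.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found by this interpreter."""
    # find_spec returns None when no top-level module of that name is found.
    return importlib.util.find_spec(package_name) is not None
```

If is_installed("centillion") returns False inside the activated virtual environment, the install step probably ran in a different environment.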

Run:

Create a temporary working directory:

mkdir -p /tmp/my-centillion-instance && cd /tmp/my-centillion-instance

Now create a minimal centillion instance with the following Python program:

run_centillion.py:

import centillion

# Build the Flask app using the settings in config.py
app = centillion.webapp.get_flask_app(config_file='config.py')
# Start Flask's built-in development server
app.run()

The config.py file can be copied verbatim from the example configuration file in the repository:

cp ~/centillion/config/config_centillion.example.py config.py

Now run the centillion instance by running the script:

python run_centillion.py

This will run the webapp on port 5000, so navigate to http://localhost:5000 in the browser.

centillion does not populate the search index automatically, so the first time you run it, the search index will contain no documents.

Screenshot: centillion no docs indexed

Before you can use centillion, you must manually populate the search index.

Populate the Search Index:

To populate the search index, visit the control panel:

http://localhost:5000/control_panel

From here you can re-index the search engine. The example configuration file uses fake documents instead of real API calls, so re-indexing will work even without a network connection. To return to the main page, click the centillion banner.

Visit the Master List:

The master list shows every document indexed by centillion. Visit the master list:

http://localhost:5000/master_list

Try Searching:

Visit the help page for more information about running searches:

http://localhost:5000/help

Try searching for the following terms to see search results:

  • barley
  • masked figure
  • bananas
  • bacteria
  • microscope
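The pages visited in this quickstart can also be smoke-tested from a script while the instance is running. The sketch below uses only the standard library and only the URLs listed above. The check_pages helper and the QUICKSTART_PATHS name are hypothetical, and the status code each page returns is not documented here, so the function simply reports whatever it receives.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

# Paths taken from this quickstart: home page, control panel, master list, help.
QUICKSTART_PATHS = ["/", "/control_panel", "/master_list", "/help"]


def check_pages(base_url="http://localhost:5000", paths=QUICKSTART_PATHS, timeout=5):
    """Return a dict mapping each path to its HTTP status, or None if unreachable."""
    results = {}
    for path in paths:
        url = urljoin(base_url, path)
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                results[path] = response.status
        except urllib.error.HTTPError as err:
            # The server answered, but with an error status such as 404.
            results[path] = err.code
        except (urllib.error.URLError, OSError):
            # Nothing is listening, or the connection failed or timed out.
            results[path] = None
    return results
```

A value of None for every path usually means the instance started with python run_centillion.py is not running, or is listening on a different port.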

Resources for centillion

centillion on GitHub: https://github.com/dcppc/centillion

centillion documentation: http://dcppc.github.io/centillion