Materials database

Welcome to the webpage of the materials_db project. materials_db is a light-weight open-source platform for storing and searching chemical compounds and their properties. This platform can help analyze the properties of compounds and find relationships between various properties.

Introduction

Materials_db is an open-source web app with functionalities for adding and searching chemical compounds. The project is in its early development stage, and currently has only the backend. The frontend provides access to search/add APIs, as well as gives an option to populate the database through csv file upload.

Usage

The platform may be accessed either through the front page https://materials-db.herokuapp.com, which provides access to the search/add APIs, or by interacting with the APIs directly, using tools such as Postman or curl. The frontend contains two text fields - for adding and searching materials in the database, and a file upload field for uploading a csv file. Both add/search APIs accept requests formatted as JSON, and there are checks to make sure that the request is correctly formatted and is used for correct purposes. The rest of this section describes the usage of the APIs.

API for adding materials

The add APIs is available at the https://materials-db.herokuapp.com/data/add. It accepts POST requests containing JSON arrays formatted as follows:

[
  {
    "compound": "PbS",
    "properties": [
      {
        "propertyName": "Band gap",
        "propertyValue": "0.41"
      },
      {
        "propertyName": "Color",
        "propertyValue": "Black"
      }
    ]
  }
]

Several compounds can be added at once - just do not forget to separate individual JSON objects by commas:

[
  {
    "compound": "...",
    "properties": "..."
  },
  {
    "compound": "...",
    "properties": "..."
  }
]

There are several rules that must be followed when adding new compounds:

Both compound and properties keys must be present for each of the materials to be added.
The compound must be entered as a proper chemical formula. The platform relies on pyEQL package to parse chemical formulas, so all the rules that apply to that package also apply to materials_db. For example, you should add "H2O" but not "h2o" or "H2o". Also, made-up chemical formulas such as "HeLLoU" won't parse correctly and the platform will complain.
All numerical values such as the value "0.41" for the band gap of PbS above must be entered as strings. The platform will figure out if these strings contain valid floats or integers and will process them appropriately.
Note that the /data/add API checks the input JSON against a schemas['add'] schema, and will complain if the request does not conform to it (or if it is not JSON).

API for searching materials

The API for searching materials in the database is available at https://materials-db.herokuapp.com/data/search. It accepts POST requests containing search query JSON:

{
  "search": "element:(Pb AND Se)",
  "properties" : [
    {
      "name" : "gap",
      "value" : "0.2",
      "logic" : "lt"
    },
    {
      "name" : "color",
      "value" : "gray",
      "logic" : "imatch"
    }
  ]
}

The platform is capable of doing full-text Lucene search in the database, using the elasticsearch engine through haystack interface, and it includes additional filters for the compound properties through Django interface. As such, the search consists of two parts - a general full-text search query is associated with the search key, and (optional) one or more filters for the compound properties are associated with the properties key. These two mechanisms of finding the compounds operate differently, so I'll discuss them separately.

Full-text `search` query

Most searches can be done using just one search field:

{
  "search" : "PbS"
}

For instance, if you'd like to find the PbS material that we've just added, you could search for it the way shown above. Since some compounds in the database could be entered as "Pb1S1", it is useful to be able to search over constituent elements:

{
  "search" : "element:Pb AND element:S"
}

This could be written more concisely as element:(Pb AND S).

You can learn more about elasticsearch flavor of Lucene syntax on the Elastic official webpage. It supports all the standard features such as AND, OR, NOT keywords, groupings, wildcards, and regular expressions.

Searching without keywords queries the index made of the compound name, the elements it contains, names of the properties and their values (similar to the format of csv file that can be used to populate the database, but with the addition of element names).

You can (and should) use keywords to make the search more precise. Keywords that can be used are compound, element, group, and period. This also means you can search, for example, for all the A^IVB^VI compounds by entering group:(14 AND 16) into the search field. (The group numbers are stored in the newer CAS format, not IUPAC format.) Groups and periods are determined automatically when a new compound is saved in the database, with the help of pyEQL library.

Additional `properties` filters

After carrying out a full-text search of the compounds in the database, you can narrow down search results by including one or more property filters:

{
  "search": "",
  "properties" : [
    {
      "name" : "density",
      "value" : "1",
      "logic" : ">"
    }
  ]
}

For instance, the search query shown above will return all the materials that have the property called "density" with the value of greater than 1. Again, there are several rules for writing property filter queries:

The properties search query relies on standard Django database API, which does not have Lucene goodies such as AND/OR/NOT logical words, groupings, wildcards, etc.
There must be three fields defined for each property: name, value, and logic. All three should be strings, no floats/integers.
The name is the name of the property. It is used to filter compounds using the icontains QuerySet keyword:

Material.objects.filter(properties__propertyName__icontains='density')

Be careful with the property names - the request shown above would match both "Density" and "Electron density" fields - use more complete property names if needed.

Some properties may be numerical, such as the density or the band gap of the material, while other may be purely textual, for example, the color ("Red") or smell ("Rotten eggs"). Still, the value keyword must be entered as strings - the app will figure out if the property is numerical and will convert it to a float as needed.
The field logic refers to the logical opertor which will be used to query the database. Some standard Django operators - exact, iexact, contains, icontains, gt, lt, gte, lte are supported, as well as their synonyms such as ==, >, >=, <, <=. The app checks if the correct logical operator is used for each data type, and complains if it's not. (Try filtering by color greater than "red"). The full list of the operators, their effect on the QuerySet, as well as the data type they are appropriate for, is given in the Table below:

Logical operators	Django QuerySet keyword	Data types
`eq`, `==`, `=`, `match`, `matches`, `exact`	`exact`	text, number
`ieq`, `imatch`, `imatches`, `iexact`	`iexact`	text
`contain`, `contains`, `contained`	`contains`	text
`icontain`, `icontains`, `icontained`, `in`	`icontains`	text
`>`, `gt`, `more`, `greater`	`gt`	number
`>=`, `gte`, `ge`	`gte`	number
`<`, `lt`, `less`	`lt`	number
`<=`, `=<`, `lte`, `le`	`lte`	number

Similar to the /data/add API, /data/search checks the input JSON against a schemas['search'] schema, and will complain if the request does not conform to it (or if it is not JSON).

Installing locally

You are welcome to access the app at its current web address but you don't have to! You can install the app locally and play with it. To install, you will need to have necessary dependencies, as well as set up and configure the PostgreSQL database and elasticsearch engine. Earlier versions of this app used easier-to-setup filesystem-based sqlite3 database and whoosh search engine, so if you'd like to start with them, you can restore the code from earlier commits. (But be aware that parts of this readme will not work for the older version.) The installation instructions given below are for Ubuntu 16.04 system. First, download the zip file containing this distribution from https://github.com/agaiduk/materials-db/archive/master.zip, and unpack it on your local computer.

Install dependencies

The dependencies are managed using pipenv utility and are listed in the Pipfile supplied with this distribution. Note that this Pipfile is configured for use with the Heroku platform - if you are installing the app locally, you can delete these Heroku-specific dependencies:

urllib3 = "==1.22"
gunicorn = "*"
django-heroku = "*"
dj_database_url = "*"

Install the pipenv utility if you don't have it yet:

$ pip install pipenv

Then, while in the materials_db root directory, create a virtual environment and install dependencies:

$ pipenv install

Set up a database

The app is using PostgreSQL database. You can follow the rest of this tutorial (and setup the same database) but you don't have to - since all requests are made through haystack or Django APIs, you can switch to any databases they support. The rest of this subsection is based on this great tutorial for Ubuntu system.

Using apt command, install the database and helper packages:

$ sudo apt-get update
$ sudo apt-get install python-dev libpq-dev postgresql postgresql-contrib

$ sudo su - postgres

Then open a Postgres session:

$ psql

Now, let's create a database for our app. I will name it materials_db:

CREATE DATABASE materials_db;

Then, create a user:

CREATE USER <user> WITH PASSWORD '<password>';

Replace <user> and <password> with the username/password of your choice. Note that the password must be in quotes. Now set up user defaults:

ALTER ROLE <user> SET client_encoding TO 'utf8';
ALTER ROLE <user> SET default_transaction_isolation TO 'read committed';
ALTER ROLE <user> SET timezone TO 'UTC';

Give all materials_db database privileges to the user we've just created:

GRANT ALL PRIVILEGES ON DATABASE materials_db TO <user>;

Exit the SQL prompt to get back to the postgres user session:

\q

And exit the postgres session to go back to your shell session:

$ exit

Set up a search engine

The materials_db platform uses the elasticsearch search engine (v.2.4.6). (Newer v.5 and v.6 releases cannot be used with the django-haystack search interface.) You can download the deb package here. Install it on your system as follows:

$ sudo apt-get update
$ sudo dpkg -i elasticsearch-2.4.6.deb
$ sudo apt-get -f install

Now, start the search engine:

$ sudo /etc/init.d/elasticsearch start

You can check the status of the search engine if you wish:

$ curl "localhost:9200/_nodes?pretty=true&settings=true"

We are almost done!

Configure and run the app

We need to configure the app to let it know where the database and search engine are. Open the materials_db/settings.py file and delete the following blocks of code:

import django_heroku
import dj_database_url
from urllib.parse import urlparse

es = urlparse(os.environ.get('SEARCHBOX_URL') or 'http://127.0.0.1:9200/')
port = es.port or 80

# Haystack configuration with Elasticsearch

HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch2_backend.Elasticsearch2SearchEngine',
        'URL': es.scheme + '://' + es.hostname + ':' + str(port),
        'INDEX_NAME': 'materials_db',
    },
}

if es.username:
    HAYSTACK_CONNECTIONS['default']['KWARGS'] = {"http_auth": es.username + ':' + es.password}

django_heroku.settings(locals())
DATABASES['default'] = dj_database_url.config(conn_max_age=600, ssl_require=True)

Add the database configuration:

# Database
# https://docs.djangoproject.com/en/2.0/ref/settings/#databases
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'materials_db',
        'USER': '<user>',
        'PASSWORD': '<password>',
        'HOST': 'localhost',
        'PORT': '',
    }
}

Do not forget to replace <user> and <password>! Add haystack configuration with elasticsearch binding:

# Haystack configuration with Elasticsearch
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch2_backend.Elasticsearch2SearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'materials_db',
    },
}

Activate app's virtual environment by typing:

$ pipenv shell

This will also install all necessary dependencies. Now make migrations for the database:

$ python manage.py makemigrations data

Assuming you didn't do any changes to the models, this will output No changes detected in app 'data'. Apply the migrations to the database:

$ python manage.py migrate

We're ready to run our app:

$ python manage.py runserver

This should print something like

Performing system checks...

System check identified no issues (0 silenced).
March 12, 2018 - 22:55:17
Django version 2.0.3, using settings 'materials_db.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.

Open the webpage http://127.0.0.1:8000/ in your browser and try some simple search, for example:

{
  "search" : ""
}

This will return an empty array [] because there is no data in our database yet. We can populate it quickly by uploading the csv file supplied with the package. Click on the "Choose file" at the bottom of the http://127.0.0.1:8000/ webpage and select the file; then click on "Upload". If everything is OK, it will reload the materials_db webpage with a message "103 of 103 materials added to the database". That's it! Haystack's RealtimeSignalProcessor updates the elasticsearch index every time something happens to the database, so there is no need to update it manually. You can start using the materials_db platform!

Future work

materials_db is an early-stage project! It can and will be developed further. Among the things I will work on next are the front-end (which is pretty much missing currently), to make it more user-friendly. Also, I will extend the full-text functionality to the properties of the compounds. Eventually, there will be only two fields to filter the materials, compound and properties:

{
  "compounds" : "element:(Pb AND Se)",
  "properties" : "(name:gap AND value:>0.2)"
}

I also plan to explore NoSQL databases such as mongoDB to provide more uniform search capabilities, potentially moving to a single search field.

agaiduk/materials-db