Lucene tokenizer for indexing and searching chemical structures

MolecularLucene

Lucene is exceptionally good at text search. Other kinds of queries (date and number ranges, geospatial, etc.) are also supported. This project is an attempt to bring chemical structure search into the Lucene world.

This project introduces a special kind of Lucene analyzer for indexing and searching chemical structures.

To be indexed or searched by MolecularLucene, chemical structures must be provided in a text representation (SMILES is the only supported format for now; InChI support is planned).

This makes it possible to combine full-text search and chemical structure similarity search in a single query.

For example, suppose a Lucene index contains documents with the fields "description" and "smiles". The "description" field is a free-text description of a chemical compound, and the "smiles" field contains chemical structure information. A query against the index looks like this:

description:"amino acid" AND smiles:c1ccc2c\(c1\)cc\[nH\]2

Note that the characters (, ), [ and ] are escaped because they have special meaning in Lucene query syntax.

Literally, this means: show me compounds that have the phrase "amino acid" in the description and a chemical structure similar to indole (smiles:c1ccc2c(c1)cc[nH]2).
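The escaping step in the query above can be sketched as a small helper. This is a minimal illustration, not part of MolecularLucene: the class name `SmilesQueryEscaper` and its methods are hypothetical, and the list of special characters is an assumption based on the classic Lucene query syntax. If the Lucene query parser is on the classpath, its static `QueryParser.escape` method serves the same purpose.

```java
// Minimal sketch: build a combined text + structure query string.
// SmilesQueryEscaper is a hypothetical helper, not a MolecularLucene class.
final class SmilesQueryEscaper {

    // Characters assumed to be special in the classic Lucene query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Prefix every special character with a backslash; leave the rest as-is.
    static String escape(String smiles) {
        StringBuilder sb = new StringBuilder(smiles.length() * 2);
        for (int i = 0; i < smiles.length(); i++) {
            char c = smiles.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Combine a phrase on "description" with a structure on "smiles".
    // Assumes the phrase itself contains no double quotes.
    static String buildQuery(String phrase, String smiles) {
        return "description:\"" + phrase + "\" AND smiles:" + escape(smiles);
    }

    public static void main(String[] args) {
        // Indole, written as unescaped SMILES.
        System.out.println(buildQuery("amino acid", "c1ccc2c(c1)cc[nH]2"));
    }
}
```

Escaping matters more for SMILES than for ordinary text, since ring closures, branches, charges, and stereo bonds use characters such as `(`, `[`, `-`, `+`, `/` and `\` that a query parser would otherwise interpret as operators.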

See the autotests source code for a basic usage example.

References

Post about this project at habrahabr.ru (in Russian). Demo: ChWiSe.Net, a chemical Wikipedia search project. Source code is available on GitHub.