This repo is meant to help you figure whether a compound of interest is similar to a compound that is already known. Given a target composition, this code will look through a library of compounds (that you define) and see whether the target is similar to any in that library. The similarity is judged by fuzzy matching techniques, which is the (intended) specialty of this code - although keep in mind that these are just estimates. The exact formula does not need to match - we are looking for similar formulas. The greater the similarity, the higher score of the match. The convention is that an exact match to the formula has a score of 100.
For complete beginners, the easiest way to run this code is:
- Download the repo (there is a link provided by Github to Clone/Download the repo).
- cd to the directory containing
compoundsearch.py
. - Run the file by typing
python compoundsearch.py Pb2TeSe
into your shell, wherePb2TeSe
is an example of the formula you want to look up.
If you know how to use Python, you can do something like this instead:
cs = CompoundSearch()
cs.search("Pb2TeSe")
which will return something like this (first couple of output lines):
formula_pretty score matches line_no tag1 tag2 tag3
FeCuO2 15 ['anonymized formula'] 5 Gonçalves, A. P. & Godart, C. New promising bulk thermoelectrics: Intermetallics, pnictides and chalcogenides. Eur. Phys. J. B 87, 42 (2014). p2
BiSb 15 ['fuzzy formula'] 22 Gonçalves, A. P. & Godart, C. New promising bulk thermoelectrics: Intermetallics, pnictides and chalcogenides. Eur. Phys. J. B 87, 42 (2014). p7
AgSbTe2 15 ['anonymized formula'] 40 Gonçalves, A. P. & Godart, C. New promising bulk thermoelectrics: Intermetallics, pnictides and chalcogenides. Eur. Phys. J. B 87, 42 (2014). p9
GaAgTe2 15 ['anonymized formula'] 41 Gonçalves, A. P. & Godart, C. New promising bulk thermoelectrics: Intermetallics, pnictides and chalcogenides. Eur. Phys. J. B 87, 42 (2014). p9
GaCuTe2 15 ['anonymized formula'] 42 Gonçalves, A. P. & Godart, C. New promising bulk thermoelectrics: Intermetallics, pnictides and chalcogenides. Eur. Phys. J. B 87, 42 (2014). p9
AgSbTe2 15 ['anonymized formula'] 47 Gonçalves, A. P. & Godart, C. New promising bulk thermoelectrics: Intermetallics, pnictides and chalcogenides. Eur. Phys. J. B 87, 42 (2014). p10
GeTe 30 ['group formula'] 48 Gonçalves, A. P. & Godart, C. New promising bulk thermoelectrics: Intermetallics, pnictides and chalcogenides. Eur. Phys. J. B 87, 42 (2014). p10
SnTe 45 ['group formula', 'fuzzy formula'] 51 Gonçalves, A. P. & Godart, C. New promising bulk thermoelectrics: Intermetallics, pnictides and chalcogenides. Eur. Phys. J. B 87, 42 (2014). p10
TePb 45 ['group formula', 'fuzzy formula'] 52 Gonçalves, A. P. & Godart, C. New promising bulk thermoelectrics: Intermetallics, pnictides and chalcogenides. Eur. Phys. J. B 87, 42 (2014). p10
The highest score is for SnTe
and TePb
since they have a score of 45 (maximum amongst the rows listed). The GeTe
is close in terms of a match to the target, which (if you recall) was Pb2TeSe
in the above code snippet.
This code uses the default thermoelectrics library file found in compounds_list/thermoelectrics.txt. This library file contains some known thermoelectrics compositions; the code is searching to see which of the known compositions best matches the query "Pb2TeSe". There are no exact matches but a few formulas that are similar enough to be reported as of interest.
You can modify or create a new library file as desired. The library file format is simple. Each line contains a formula that is searched. If there is a line that begins with "#", "##", or "###", the line used to tag the entries below with a 1st level, 2nd-level, or 3rd-level tag (respectively). The tags will be reported along with the compound match and can be used to help track down where you found the compound. For example, you can use the 1st level tag to give a paper reference, a 2nd level tag to give the page number, and a 3rd level tag for a note. Note that if you set a level 1 tag, it will clear out the level 2 and level 3 tags, i.e., the tags are hierarchical.
There are a few different types of matches we are looking for:
- exact formula matches
- same chemical system
- same anonymized formula
- formula match if elements are replaced by their group number
- formula match if elements by placeholders such as "A" for cation, "Tm" for transition metal, "X" for anion, etc.
Each type of match has a different score that is added if the match is confirmed. Suggestions for algorithm improvements are welcome.
This is common - you basically need to decide for yourself which formulas are truly similar and which ones are not from the list provided. The code is only there to feed you ideas for what known compounds might be similar to your composition. It is not a complete solution by any means.
Can you improve the interface? e.g., allow me to sort the results by score? Or give me a command line interface?
If there is enough interest in this code, these things are possible.