# The source for Nutrimatic

To build the code, follow the scripted steps below. (If this doesn't work for you, see the manual steps afterward.)
- You'll need a working C++ build system; for Debian/Ubuntu:
  ```
  sudo apt install build-essential curl
  ```
- Install mise-en-place as a tool installer:
  ```
  curl https://mise.run | sh
  ```
  (or see other install methods)
- Run `./dev_setup.py`, which will install various dependencies locally
- Then run `conan build .`, which will leave binaries in `build/`
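If the build worked, the tools used in the indexing steps below should show up under `build/`. A quick check (just a sketch; the full listing will also contain other build artifacts):

```
ls build/
# should include, among other things, find-expr, make-index, and merge-indexes,
# which are the binaries used in the indexing steps later in this README
```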
The scripted path above is easier! But maybe that's too magical, or you don't like mise... In that case, here are the manual steps:
- As above, you'll need C++ build tools
- Use Python 3.10 (this avoids a wikiextractor bug exposed by a behavior change in Python 3.11)
- You probably want to set up a Python venv
- Install Conan, CMake, etc:
  ```
  pip install -r dev_requirements.txt
  ```
- Configure Conan to build on your machine (if you haven't already):
  ```
  conan profile detect
  conan profile path default  # note the path this outputs
  ```
  Edit the file listed by `conan profile path default` to set `compiler.cppstd=17` (or `gnu17`)
- Install C++ dependencies:
  ```
  conan install . --build=missing
  ```
- Then run `conan build .`, which will leave binaries in `build/`
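Put together, the manual path looks roughly like this (a sketch assuming bash or a similar shell, with `python3.10` on your PATH; adjust to taste):

```
python3.10 -m venv .venv && source .venv/bin/activate  # optional but recommended venv
pip install -r dev_requirements.txt                    # Conan, CMake, etc.
conan profile detect                                   # skip if you already have a profile
"${EDITOR:-vi}" "$(conan profile path default)"        # set compiler.cppstd=17 (or gnu17)
conan install . --build=missing                        # C++ dependencies
conan build .                                          # binaries land in build/
```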
To actually use Nutrimatic, you will need to build an index from Wikipedia.
- Download the latest Wikipedia database dump (this is a ~20GB file!):
  ```
  wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
  ```
  (You can also look for a mirror closer to you.)
- Extract the text from the articles using Wikipedia Extractor (this generates ~12GB, and takes hours!):
  ```
  pip install wikiextractor  # installs into the local virtualenv
  wikiextractor enwiki-latest-pages-articles.xml.bz2
  ```
  (There are probably better extractors these days!)
  This will write many files named `text/??/wiki_??`.
- Index the text (this generates ~100GB of data, and also takes hours!):
  ```
  find text -type f | xargs cat | build/make-index wikipedia
  ```
  This will write many files named `wikipedia.?????.index`. (You can break this up by running `make-index` with different chunks of input data, replacing "wikipedia" with unique names each time; see the sketch after this list.)
- Merge the indexes; I normally do this in two stages:
  ```
  for x in 0 1 2 3 4 5 6 7 8 9
  do build/merge-indexes 2 wikipedia.????$x.index wiki-merged.$x.index
  done
  ```
  followed by
  ```
  build/merge-indexes 5 wiki-merged.*.index wiki-merged.index
  ```
  There's nothing magical about this 10-batch approach, you can use any strategy you like. The 2 and 5 numbers are phrase frequency cutoffs (how many times a string must occur to be included).
- Enjoy your new index:
  ```
  build/find-expr wiki-merged.index '<aciimnrttu>'
  ```
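As an example of the "unique names" option mentioned in the indexing step, the chunked approach might look like this. This is illustrative only: the `text/[A-M]*` / `text/[N-Z]*` split and the `wiki-partA` / `wiki-partB` names are invented, and you should adjust the globs to match your extractor output and whatever names you choose.

```
# index the extracted text in two separately named batches...
find text/[A-M]* -type f | xargs cat | build/make-index wiki-partA
find text/[N-Z]* -type f | xargs cat | build/make-index wiki-partB
# ...then merge all the pieces (2 is the phrase frequency cutoff, as above)
build/merge-indexes 2 wiki-partA.*.index wiki-partB.*.index wiki-merged.index
```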
If you want to run the nutrimatic.org style interface, point a web server at the `web_static/` directory, and for root requests have it launch `cgi_scripts/cgi-search.py` with `$NUTRIMATIC_FIND_EXPR` set to the `find-expr` binary and `$NUTRIMATIC_INDEX` set to the index you built.

(You might want to use `install_to_dir.py`, which will copy executables, CGI scripts, and static content to the directory of your choice.)
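For instance, a hypothetical invocation, assuming `install_to_dir.py` takes the target directory as its argument (check the script itself for its real interface), using the same path as the nginx example below:

```
./install_to_dir.py /home/me/nutrimatic_install
```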
For example, you could adapt this nginx config:
```
location /my-nutrimatic/ {
  # Serve static files (change /home/me/nutrimatic_install to your install dir)
  alias /home/me/nutrimatic_install/web_static/;

  # For root requests, run the CGI script
  location = /my-nutrimatic/ {
    fastcgi_pass unix:/var/run/fcgiwrap.socket;
    fastcgi_buffering off;  # send results as soon as we find them
    include /etc/nginx/fastcgi_params;
    gzip off;  # gzip compression also causes buffering
    # (change /home/me/nutrimatic_install to your install dir)
    fastcgi_param SCRIPT_FILENAME /home/me/nutrimatic_install/cgi_scripts/cgi-search.py;
    fastcgi_param NUTRIMATIC_FIND_EXPR /home/me/nutrimatic_install/bin/find-expr;
    # (change to wherever you put your index file)
    fastcgi_param NUTRIMATIC_INDEX /home/me/nutrimatic_install/wiki-merged.index;
  }
}
```
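Before wiring up a web server, you can sanity-check the pieces from a shell. This sketch assumes `cgi-search.py` follows standard CGI conventions and reads a query parameter named `q` from `QUERY_STRING`; both are assumptions, not documented here.

```
NUTRIMATIC_FIND_EXPR=build/find-expr \
NUTRIMATIC_INDEX=wiki-merged.index \
REQUEST_METHOD=GET \
QUERY_STRING='q=<aciimnrttu>' \
python3 cgi_scripts/cgi-search.py
# if everything is wired up, expect a Content-Type header followed by HTML results
```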
If you want to reproduce historical results from the website, you need to build an index from the corresponding Wikipedia data dump using compatible index building and searching logic:
- nutrimatic.org/2016 (Dec 2016 - Feb 2024): historical code with enwiki-20161101 (discussion)
- nutrimatic.org/2024 (Feb 2024 - current): main branch with enwiki-20231201