The source for Nutrimatic (https://nutrimatic.org/).
To build Nutrimatic, the scripted setup is the easiest path (if it doesn't work for you, see the manual steps below):

- You'll need a working C++ build system (Debian/Ubuntu: `sudo apt install build-essential`).
- Install mise-en-place as a tool installer: `curl https://mise.run | sh` (or see its other install methods).
- Run `./dev_setup.py`, which will install various dependencies locally.
- Then run `conan build .`, which will leave binaries in `build/`.
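For reference, the whole scripted path as one shell session (assuming Debian/Ubuntu and a fresh checkout) looks roughly like this:

```
sudo apt install build-essential   # C++ toolchain
curl https://mise.run | sh         # install mise-en-place
./dev_setup.py                     # installs various dependencies locally
conan build .                      # binaries land in build/
```

(After installing mise you may need to restart your shell or follow its activation instructions before `./dev_setup.py` can pick it up.)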
(The scripted path above is easier! But maybe that's too magical, or you don't like mise...) To set things up manually instead:

- As above, you'll need C++ build tools (Debian/Ubuntu: `sudo apt install build-essential`).
- Use Python 3.10 (this avoids a wikiextractor bug exposed by a change in Python 3.11).
- You probably want to set up a Python venv.
- Install Conan, CMake, etc.: `pip install -r dev_requirements.txt`
- Configure Conan to build on your machine (if you haven't already):

  ```
  conan profile detect
  conan profile path default  # note the path this outputs
  ```

  Edit the file listed by `conan profile path default` to set `compiler.cppstd=17` (or `gnu17`).
- Install C++ dependencies: `conan install . --build=missing`
- Then run `conan build .`, which will leave binaries in `build/`.
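For orientation, here is the manual path sketched as one shell session. This is only a sketch: the `.venv` name and the `python3.10` invocation are examples, and the commands themselves are the ones listed above.

```
# Assumes Debian/Ubuntu with python3.10 available; run from the repo root.
sudo apt install build-essential
python3.10 -m venv .venv && source .venv/bin/activate  # optional virtualenv
pip install -r dev_requirements.txt                    # Conan, CMake, etc.
conan profile detect                                   # only needed once per machine
conan profile path default                             # edit this file: compiler.cppstd=17 (or gnu17)
conan install . --build=missing                        # C++ dependencies
conan build .                                          # binaries land in build/
```

The default profile that `conan profile detect` writes contains a `[settings]` block; add or adjust the `compiler.cppstd` line there. The compiler name and version below are illustrative, not what your machine will necessarily report:

```
[settings]
os=Linux
arch=x86_64
build_type=Release
compiler=gcc
compiler.version=13
compiler.libcxx=libstdc++11
# the line this build needs:
compiler.cppstd=gnu17
```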
To actually use Nutrimatic, you will need to build an index from Wikipedia.
- Download the latest Wikipedia database dump (this is a ~20GB file!):

  ```
  wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
  ```

  (You can also look for a mirror closer to you.)

- Extract the text from the articles using Wikipedia Extractor (this generates ~12GB, and takes hours!):

  ```
  pip install wikiextractor  # installs into the local virtualenv
  wikiextractor enwiki-latest-pages-articles.xml.bz2
  ```

  (There are probably better extractors these days!) This will write many files named `text/??/wiki_??`.

- Index the text (this generates ~100GB of data, and also takes hours!):

  ```
  find text -type f | xargs cat | build/make-index wikipedia
  ```

  This will write many files named `wikipedia.?????.index`. (You can break this up by running `make-index` with different chunks of input data, replacing "wikipedia" with a unique name each time; see the sketch after this list.)

- Merge the indexes; I normally do this in two stages:

  ```
  for x in 0 1 2 3 4 5 6 7 8 9; do
    build/merge-indexes 2 wikipedia.????$x.index wiki-merged.$x.index
  done
  ```

  followed by

  ```
  build/merge-indexes 5 wiki-merged.*.index wiki-merged.index
  ```

  There's nothing magical about this 10-batch approach; you can use any strategy you like. The 2 and 5 numbers are phrase frequency cutoffs (how many times a string must occur to be included).

- Enjoy your new index:

  ```
  build/find-expr wiki-merged.index '<aciimnrttu>'
  ```
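As noted in the indexing step above, you can split `make-index` over chunks of input and give each run its own name. One way to do that is to index each extracted `text/??/` subdirectory separately; the `chunk-*` naming here is just an example, and the resulting indexes are merged with `build/merge-indexes` exactly as shown above.

```
# One make-index run per extracted subdirectory; merge the results afterwards.
for dir in text/*/; do
  name=$(basename "$dir")
  cat "$dir"wiki_* | build/make-index "chunk-$name"
done
```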
If you want to run the nutrimatic.org style interface, point a web server at the `web_static/` directory, and for root requests have it launch `cgi_scripts/cgi-search.py` with `$NUTRIMATIC_FIND_EXPR` set to the `find-expr` binary and `$NUTRIMATIC_INDEX` set to the index you built.

(You might want to use `install_to_dir.py`, which will copy executables, CGI scripts, and static content to the directory of your choice.)
For example, you could adapt this nginx config:
```
location /my-nutrimatic/ {
    # Serve static files (change /home/me/nutrimatic_install to your install dir)
    alias /home/me/nutrimatic_install/web_static/;

    # For root requests, run the CGI script
    location = /my-nutrimatic/ {
        fastcgi_pass unix:/var/run/fcgiwrap.socket;
        fastcgi_buffering off;  # send results as soon as we find them
        include /etc/nginx/fastcgi_params;
        gzip off;  # gzip compression also causes buffering
        # (change /home/me/nutrimatic_install to your install dir)
        fastcgi_param SCRIPT_FILENAME /home/me/nutrimatic_install/cgi_scripts/cgi-search.py;
        fastcgi_param NUTRIMATIC_FIND_EXPR /home/me/nutrimatic_install/bin/find-expr;
        # (change to wherever you put your index file)
        fastcgi_param NUTRIMATIC_INDEX /home/me/nutrimatic_install/wiki-merged.index;
    }
}
```
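You can also smoke-test the CGI script outside any web server. This is only a sketch: it assumes `cgi-search.py` reads the standard CGI environment variables, and the query parameter name (`q` here) is an assumption; adjust the paths to wherever you installed things.

```
export NUTRIMATIC_FIND_EXPR=/home/me/nutrimatic_install/bin/find-expr
export NUTRIMATIC_INDEX=/home/me/nutrimatic_install/wiki-merged.index
# Simulate the GET request a server like fcgiwrap would pass along.
REQUEST_METHOD=GET QUERY_STRING='q=<aciimnrttu>' \
  python3 /home/me/nutrimatic_install/cgi_scripts/cgi-search.py
```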
Have fun!
If you need a version of the Nutrimatic website as it was served historically, you will need to rebuild it using the instructions above. For that you need the codebase and website from that time, as well as the Wikipedia data dump from the right month (link to all historic data dumps).

- Nutrimatic.org (Dec 2016 - Feb 2024): Codebase URL. Data dump = enwiki 1 Nov 2016 (see #14)