Example: Text Search in CockroachDB

This is a simple example of full text indexing and search over a set of HTML documents. What's shown here is based on Ubuntu 18.04 (running in a locally deployed VM), and assumes a CockroachDB cluster is reachable on the network, running in insecure mode. My experimental setup is a 3-node CockroachDB cluster with HAProxy listening on port 5432 (the traditional PostgreSQL port), all running locally on a MacBook Pro.

Prerequisites

  • The Pip Python package installer:
$ sudo apt install python-pip
  • The Psycopg PostgreSQL driver for Python:
$ pip install --user -U psycopg2
  • Beautiful Soup, for parsing the HTML documents:
$ sudo apt-get install python3-bs4
  • NLTK, the Natural Language Toolkit, used here for stemming and stopwords:
$ pip install --user -U nltk
  • The supporting stopwords data file for NLTK:
$ python -c 'import nltk; nltk.download("stopwords");'

  • Data: a set of HTML documents to index. If you'd like to try the ones used in my demo, you can grab them like so (assuming wget is installed):
$ wget -O - https://github.com/mgoddard/crdb-text-search/raw/master/crdb-v20.2-docs.tar.gz | tar xzvf -
  • A SQL client to connect to the CockroachDB cluster. I did all my SQL using DbVisualizer, a Java-based GUI SQL client, running on my MacBook Pro.
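  • Optional: a quick sanity check that the Python pieces are importable (a minimal sketch; the names match the install commands above):
$ python3
>>> import psycopg2   # PostgreSQL driver; CockroachDB speaks the same wire protocol
>>> import bs4        # Beautiful Soup, for HTML parsing
>>> import nltk
>>> from nltk.corpus import stopwords   # downloaded via nltk.download("stopwords")
>>> nltk.stem.SnowballStemmer("english").stem("searching")
'search'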

The Demo

  • Using your SQL client, build the tables and indexes by running the provided DDL in ./crdb_text_search.sql (a scripted equivalent is sketched below)
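The authoritative DDL ships in ./crdb_text_search.sql; the column types and the inverted index below are inferred from the example queries further down, so treat this Python equivalent as a sketch only:

#!/usr/bin/env python3
# Sketch only: create tables shaped like the ones the example queries expect.
# Assumes PGHOST/PGPORT are exported as shown in a later step; user/dbname
# are assumptions for an insecure-mode cluster.
import psycopg2

conn = psycopg2.connect(user="root", dbname="defaultdb")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            idx_name STRING,
            uri      STRING,
            n_words  INT,
            content  STRING[],  -- index name + stemmed terms, queried via @>
            PRIMARY KEY (idx_name, uri),
            INVERTED INDEX content_idx (content)  -- serves the @> containment filter
        )""")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS words (
            idx_name STRING,
            uri      STRING,
            word     STRING,
            cnt      INT,
            PRIMARY KEY (idx_name, uri, word)
        )""")
conn.close()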

  • The tar xzvf - step above should have created a crdb-v20.2-docs directory. Change into that directory:

$ cd crdb-v20.2-docs
  • Set the two environment variables required to connect, substituting the appropriate values. If you aren't using HAProxy, the PGPORT value is probably going to be 26257:
$ export PGHOST=192.168.1.4
$ export PGPORT=5432
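These are standard libpq variables (see References), and psycopg2 wraps libpq, so they are picked up automatically when no explicit host or port is passed. A quick connectivity check (the user and database names are assumptions for an insecure-mode cluster):

# Connectivity check: psycopg2 wraps libpq, so PGHOST/PGPORT are read from
# the environment; user/dbname are assumptions for an insecure-mode cluster.
import psycopg2

conn = psycopg2.connect(user="root", dbname="defaultdb")
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])   # the version string should mention CockroachDB
conn.close()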
  • Run the indexing step:
mgoddard@ubuntu:~/crdb-v20.2-docs$ ../crdb-text-search/html_indexer.py crdb_docs *.html
Indexing file 404.html now ...
Indexing file add-column.html now ...
Indexing file add-constraint.html now ...
Indexing file admin-ui-access-and-navigate.html now ...
Indexing file admin-ui-cdc-dashboard.html now ...
Indexing file admin-ui-cluster-overview-page.html now ...
Indexing file admin-ui-custom-chart-debug-page.html now ...
Indexing file admin-ui-databases-page.html now ...
[...]
Indexing file views.html now ...
Indexing file view-version-details.html now ...
Indexing file window-functions.html now ...
Total time: 107.10261821746826 s
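For orientation, the pipeline html_indexer.py runs boils down to: strip the HTML, tokenize, drop stopwords, stem, then store each document's term array plus per-term counts. A condensed sketch (the real logic is in the script; the details here are approximations):

import collections
import re
import bs4
import nltk
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))
STEM = nltk.stem.SnowballStemmer("english")

def terms_for(html):
    # Visible text only, lowercased, stopwords removed, stemmed
    text = bs4.BeautifulSoup(html, "html.parser").get_text(" ")
    words = re.findall(r"[a-z]+", text.lower())
    return [STEM.stem(w) for w in words if w not in STOP]

def index_doc(cur, idx_name, uri, html):
    terms = terms_for(html)
    counts = collections.Counter(terms)
    # The index name is stored in the content array as well, which is why
    # the example queries include it in the @> containment literal.
    cur.execute(
        "UPSERT INTO docs (idx_name, uri, n_words, content) VALUES (%s, %s, %s, %s)",
        (idx_name, uri, len(terms), [idx_name] + sorted(counts)))
    for word, cnt in counts.items():
        cur.execute(
            "UPSERT INTO words (idx_name, uri, word, cnt) VALUES (%s, %s, %s, %s)",
            (idx_name, uri, word, cnt))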
  • Using your SQL client, try the example query below. The d CTE uses the inverted index on content to find documents containing all of the (stemmed) terms, the w CTE sums those terms' per-document counts, and the score is that sum as a percentage of the document's total word count:
WITH d AS (
  SELECT idx_name, uri, n_words
  FROM docs
  WHERE content @> '{crdb_docs, follow, read}'
), w AS (
  SELECT idx_name, uri, SUM(cnt) n
  FROM words
  WHERE idx_name = 'crdb_docs' AND word IN ('follow', 'read')
  GROUP BY (idx_name, uri)
)
SELECT w.uri, (100.0 * n/n_words)::NUMERIC(9, 3) score FROM w
JOIN d ON d.idx_name = w.idx_name AND d.uri = w.uri
ORDER BY score DESC
LIMIT 12;
  • Try some different queries. Since each indexed word was transformed by stemming, you'll want to apply the same stemming to your query terms; e.g.
mgoddard@ubuntu:~/crdb-v20.2-docs$ python
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> sno = nltk.stem.SnowballStemmer("english")
>>> sno.stem("installing")
'instal'
>>> sno.stem("insecure")
'insecur'
  • Use these stemmed terms in your revised SQL query:
WITH d AS (
  SELECT idx_name, uri, n_words
  FROM docs
  WHERE content @> '{crdb_docs, instal, insecur}'
), w AS (
  SELECT idx_name, uri, SUM(cnt) n
  FROM words
  WHERE idx_name = 'crdb_docs' AND word IN ('instal', 'insecur')
  GROUP BY (idx_name, uri)
)
SELECT w.uri, (100.0 * n/n_words)::NUMERIC(9, 3) score FROM w
JOIN d ON d.idx_name = w.idx_name AND d.uri = w.uri
ORDER BY score DESC
LIMIT 12;
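The same flow can be wrapped up in a small client that stems the raw terms and binds them as query parameters; the repo's ./search_client.py (used in the Docker section below) does this end to end. A minimal sketch:

#!/usr/bin/env python3
# Usage sketch: ./search.py <index_name> <term> [<term> ...]
# Stems the terms, then runs the scoring query with bound parameters.
import sys
import nltk
import psycopg2

stem = nltk.stem.SnowballStemmer("english").stem
idx_name, terms = sys.argv[1], [stem(t.lower()) for t in sys.argv[2:]]

q = """
WITH d AS (
  SELECT idx_name, uri, n_words FROM docs
  WHERE content @> %(arr)s
), w AS (
  SELECT idx_name, uri, SUM(cnt) n FROM words
  WHERE idx_name = %(idx)s AND word = ANY (%(terms)s)
  GROUP BY (idx_name, uri)
)
SELECT w.uri, (100.0 * n/n_words)::NUMERIC(9, 3) score FROM w
JOIN d ON d.idx_name = w.idx_name AND d.uri = w.uri
ORDER BY score DESC LIMIT 12;
"""

conn = psycopg2.connect(user="root", dbname="defaultdb")  # PGHOST/PGPORT via libpq
with conn.cursor() as cur:
    # psycopg2 adapts Python lists to SQL arrays, so the containment
    # array ([idx_name] + terms) mirrors the literal in the query above.
    cur.execute(q, {"arr": [idx_name] + terms, "idx": idx_name, "terms": terms})
    for uri, score in cur.fetchall():
        print(score, uri)
conn.close()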

Try the Docker image

The Docker image will run against your locally deployed CockroachDB instance.

  1. Ensure you've run the SQL commands in ./crdb_text_search.sql to build the tables and indexes.
  2. Pull the Docker image: $ ./docker_pull_image.sh
  3. Edit ./docker_run_image.sh, changing PGHOST, PGPORT, and any other values to suit your setup.
  4. Run the image: $ ./docker_run_image.sh
  5. Edit ./add_doc_to_index.sh if necessary (the host, port, and index_name values may need to be changed)
  6. Use that script to index URLs, one at a time; e.g. $ ./add_doc_to_index.sh https://www.cockroachlabs.com/docs/stable/architecture/overview.html
  7. Edit ./search_client.py (again, host and port may need to be changed)
  8. Run a search: $ ./search_client.py -i crdb_docs architecture

Ideas for what to do next

  • Improve relevance calculations
  • N-grams? (see the sketch below)
  • Synonyms?
  • Store the text in the docs table to support phrase matching? How does this impact the size of the data being stored?
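For the N-grams idea, NLTK already ships a helper, and word pairs could be stored in the words table alongside single terms. A small illustration over already-stemmed tokens:

>>> from nltk.util import ngrams
>>> terms = ["follow", "read", "replica"]   # already stemmed
>>> [" ".join(g) for g in ngrams(terms, 2)]
['follow read', 'read replica']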

References

Set up a local CockroachDB cluster

HAProxy setup

PostgreSQL libpq environment variables

Building the container