
A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open-source alternative to Algolia).

Typesense DocSearch scraper

Start things up

  1. Start Docusaurus
cd my-website
yarn build
yarn serve
  2. Start Typesense server
docker compose up -d
  3. Start crawler
docker run \
    -it --env-file=./.env \
    -e "CONFIG=$(cat config.json | jq -r tostring)" \
    --add-host=host.docker.internal:host-gateway  \

Dan's notes:

  • Docusaurus site

npx was not working when I set this up, so I copied in an old Docusaurus demo to work with. It is in the directory my-website.

I had to edit the docusaurus.config.js file to set the url field to 'http://host.docker.internal:3000', because this value is used when generating the sitemap that the crawler reads.
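
A quick way to confirm that the generated sitemap really points at the host the crawler can reach is to list the hosts in its loc entries. This is a minimal sketch, assuming the sitemap uses the standard sitemaps.org namespace; the function name sitemap_hosts and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Standard sitemap namespace (assumed to be what the site generator emits).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_hosts(xml_text):
    """Return the set of host:port values found in the sitemap's <loc> entries."""
    root = ET.fromstring(xml_text)
    hosts = set()
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text:
            hosts.add(urlparse(loc.text.strip()).netloc)
    return hosts


# Hypothetical sitemap content, shaped like what the build would produce.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://host.docker.internal:3000/docs/intro</loc></url>
  <url><loc>http://host.docker.internal:3000/blog</loc></url>
</urlset>"""

print(sitemap_hosts(sample))
```

If the set contains anything other than host.docker.internal:3000, the crawler running in a container would likely be sent to URLs it cannot resolve.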

Build the Docusaurus site:

cd my-website
yarn build

Host it on port 3000:

yarn serve
  • Crawler config

The crawler docs provide a link to a Docusaurus 2.x crawler config and one line to update. Since I am running Docusaurus on localhost and the crawler in a container, I had to specify the hostname as host.docker.internal instead of localhost. Here is the crawler config:
  "index_name": "docusaurus-2",
  "sitemap_urls": [
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    "lvl1": "article h1, header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
    "attributesToRetrieve": [
  "conversation_id": [
  "nb_hits": 46250
  • Compose file for the Typesense server
version: '3.4'
    image: typesense/typesense:0.24.0
    restart: on-failure
      - "8108:8108"
      - ./typesense-data:/data
    command: '--data-dir /data --api-key=xyz --enable-cors'
  • Run the Typesense server
docker compose up -d
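
Before crawling, it can help to check that the server actually came up. The sketch below polls the Typesense /health endpoint from the host machine, assuming the port mapping 8108:8108 from the compose file; the function names are made up for illustration.

```python
import json
import urllib.request


def health_url(host="localhost", port=8108, protocol="http"):
    """Build the URL of the Typesense health endpoint."""
    return f"{protocol}://{host}:{port}/health"


def parse_health(body):
    """Return True only if the response body is a JSON object with ok == true."""
    try:
        return json.loads(body).get("ok") is True
    except (ValueError, AttributeError):
        return False


def typesense_is_up(url, timeout=3.0):
    """Return True if the server answers the health check, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_health(resp.read().decode("utf-8"))
    except OSError:
        return False


if __name__ == "__main__":
    print(typesense_is_up(health_url()))
```

From inside another container, the host argument would need to be host.docker.internal rather than localhost.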
  • env file for the crawler

The file is named .env. Since both the Typesense server and the crawler run in Docker containers, TYPESENSE_HOST is host.docker.internal. I should change this so that both are services in the Docker Compose file.
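
An easy mistake here is leaving TYPESENSE_HOST set to localhost, which inside the crawler container refers to the container itself. The following sketch parses simple KEY=VALUE lines and flags that case; it only looks at TYPESENSE_HOST, the sample content is hypothetical, and the helper names are made up for illustration.

```python
def parse_env_file(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value
    return env


def host_is_reachable_from_container(env):
    """False if TYPESENSE_HOST is missing or is a loopback name."""
    host = env.get("TYPESENSE_HOST", "")
    return host not in ("", "localhost", "127.0.0.1")


# Hypothetical .env content; the real file holds more settings than this.
sample_env = """# crawler settings
TYPESENSE_HOST=host.docker.internal
"""

print(host_is_reachable_from_container(parse_env_file(sample_env)))
```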
  • Run a crawl
docker run \
  -it --env-file=./.env \
  -e "CONFIG=$(cat config.json | jq -r tostring)" \
  --add-host=host.docker.internal:host-gateway  \
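
The CONFIG variable in the docker run command above carries the whole of config.json as one string: jq parses the file and tostring re-serializes it on a single line. A rough Python equivalent is sketched below to show what the crawler receives; it is an illustration, not part of the scraper, and number formatting may differ slightly from jq's output.

```python
import json


def compact_config(text):
    """Parse JSON text and re-serialize it on one line without extra whitespace."""
    return json.dumps(json.loads(text), separators=(",", ":"), ensure_ascii=False)


# A small excerpt in the shape of config.json.
example = """
{
  "index_name": "docusaurus-2",
  "strip_chars": " .,;:#"
}
"""

print(compact_config(example))
# {"index_name":"docusaurus-2","strip_chars":" .,;:#"}
```

A side benefit of going through a JSON parser is that a malformed config.json fails immediately on the host instead of inside the container.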

Note: The docker run command above uses a local build of the image. If you want to run the Typesense build from Docker Hub, replace the last line of that command with:

  • Add search widget to Docusaurus site

This requires adding a package and editing the docusaurus.config.js file.
yarn add docusaurus-theme-search-typesense@next

Docusaurus config:

I added the themes line just before themeConfig.

I added the Typesense theme config as the first entry in themeConfig. You can see it below; it ends just before the navbar configuration.

        theme: {
          customCss: require.resolve('./src/css/custom.css'),

  themes: ['docusaurus-theme-search-typesense'],

    /** @type {import('@docusaurus/preset-classic').ThemeConfig} */
      typesense: {
      typesenseCollectionName: 'docusaurus-2', // Replace with your own doc site's name. Should match the collection name in the scraper settings.

      typesenseServerConfig: {
        nodes: [
            host: 'localhost',
            port: 8108,
            protocol: 'http',
        apiKey: 'xyz',

      // Optional: Typesense search parameters: https://typesense.org/docs/0.21.0/api/documents.md#search-parameters
      typesenseSearchParameters: {},

      // Optional
      contextualSearch: true,

      navbar: {
        title: 'My Site',
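
The comment in the theme config notes that typesenseCollectionName should match the collection name in the scraper settings. A small check along these lines can catch a mismatch early; it is a sketch that assumes the scraper's name is the index_name field of config.json and that the theme value is a plain quoted string, and the function names are made up for illustration.

```python
import json
import re

# Matches: typesenseCollectionName: 'name'  or  typesenseCollectionName: "name"
COLLECTION_RE = re.compile(r"typesenseCollectionName\s*:\s*['\"]([^'\"]+)['\"]")


def collection_name_from_js(js_text):
    """Extract the quoted typesenseCollectionName value, or None if absent."""
    match = COLLECTION_RE.search(js_text)
    return match.group(1) if match else None


def names_match(config_json_text, docusaurus_config_text):
    """True if config.json's index_name equals the theme's collection name."""
    index_name = json.loads(config_json_text).get("index_name")
    return index_name is not None and index_name == collection_name_from_js(docusaurus_config_text)


# Minimal stand-ins for config.json and docusaurus.config.js.
scraper_config = '{"index_name": "docusaurus-2"}'
theme_config = "typesense: { typesenseCollectionName: 'docusaurus-2', }"

print(names_match(scraper_config, theme_config))
```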

Editing the code

Because I plan to customize the Docker image (update to the latest Scrapy, etc.), I had to update the Dockerfile, as some of its packages are obsolete (for example, the Google Chrome build used by Selenium, I think; I don't actually use Selenium myself).

To build a new image, run:

# Run this from the root of the repo
./docsearch docker:build

Updating Pipfile.lock

The ./docsearch docker:build command copies in the existing Pipfile.lock, so updating the Pipfile and then building is not sufficient. To generate a new lock file, run the container with an overridden entrypoint:

docker run --entrypoint /bin/bash \
  -it --env-file=./.env \
  -e "CONFIG=$(cat config.json | jq -r tostring)" \

And then in the container run:

pipenv lock

Copy out the new Pipfile.lock file:

docker cp <container name>:/home/seleuser/Pipfile.lock .