
Docs Scraper

A scraper for your documentation website that indexes the scraped content into a MeiliSearch instance.

MeiliSearch is a powerful, fast, open-source search engine that is easy to use and deploy. Both searching and indexing are highly customizable. Features such as typo-tolerance, filters, and synonyms are provided out-of-the-box.

This scraper is used in production and runs on the MeiliSearch documentation on each deployment.

Installation and Usage

Run your MeiliSearch Instance

First of all, you need to run your own MeiliSearch instance. This scraper will scrape your website and automatically index its content in MeiliSearch.
MeiliSearch is open-source and can run on your own server! 😄

Without running a MeiliSearch instance, the scraper will not work.

Here is the documentation to install and run MeiliSearch.
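For a quick local try-out, one option is to run MeiliSearch with Docker (a sketch; the master key below is a placeholder you choose yourself):

$ docker run -it --rm \
    -p 7700:7700 \
    -e MEILI_MASTER_KEY=<your-master-key> \
    getmeili/meilisearch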

A tutorial about how to run MeiliSearch in production is coming...

The variables MEILISEARCH_HOST_URL and MEILISEARCH_API_KEY you will set in the next steps are the credentials of this MeiliSearch instance.

From Source Code

This project supports Python 3.6+.

The pipenv command must be installed.

Set both environment variables MEILISEARCH_HOST_URL and MEILISEARCH_API_KEY.
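For example, in a Unix shell (both values are placeholders for your own instance's credentials):

$ export MEILISEARCH_HOST_URL=<your-meilisearch-host-url>
$ export MEILISEARCH_API_KEY=<your-meilisearch-api-key>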

Then, run:

$ pipenv install
$ pipenv run ./docs_scraper <path-to-your-config-file>

With Docker

$ docker run -t --rm \
    -e MEILISEARCH_HOST_URL=<your-meilisearch-host-url> \
    -e MEILISEARCH_API_KEY=<your-meilisearch-api-key> \
    -v <absolute-path-to-your-config-file>:/docs-scraper/config.json \
    getmeili/docs-scraper:v0.9.0 pipenv run ./docs_scraper config.json

In a GitHub Action

To run after your deployment job:

run-scraper:
    needs: <your-deployment-job>
    runs-on: ubuntu-18.04
    steps:
    - uses: actions/checkout@master
    - name: Run scraper
      env:
        HOST_URL: ${{ secrets.MEILISEARCH_HOST_URL }}
        API_KEY: ${{ secrets.MEILISEARCH_API_KEY }}
        CONFIG_FILE_PATH: <path-to-your-config-file>
      run: |
        docker run -t --rm \
          -e MEILISEARCH_HOST_URL=$HOST_URL \
          -e MEILISEARCH_API_KEY=$API_KEY \
          -v $CONFIG_FILE_PATH:/docs-scraper/config.json \
          getmeili/docs-scraper:v0.9.0 pipenv run ./docs_scraper config.json

Here is the GitHub Action file we use in production for the MeiliSearch documentation.

About the API Key

The API key you provide as an environment variable must have permission to add documents to your MeiliSearch instance.

Thus, you need to provide the private key or the master key.

More about MeiliSearch authentication.
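To check which keys your instance exposes, you can query the keys endpoint (a sketch, assuming a MeiliSearch 0.x instance listening on the default http://localhost:7700; the master key is required to authenticate):

$ curl \
    -H "X-Meili-API-Key: <your-master-key>" \
    http://localhost:7700/keys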

Configuration file

A generic configuration file:

{
  "index_uid": "docs",
  "start_urls": ["https://www.example.com/doc/"],
  "sitemap_urls": ["https://www.example.com/sitemap.xml"],
  "stop_urls": [],
  "selectors": {
    "lvl0": {
      "selector": ".docs-lvl0",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": {
      "selector": ".docs-lvl1",
      "global": true,
      "default_value": "Chapter"
    },
    "lvl2": ".docs-content .docs-lvl2",
    "lvl3": ".docs-content .docs-lvl3",
    "lvl4": ".docs-content .docs-lvl4",
    "lvl5": ".docs-content .docs-lvl5",
    "lvl6": ".docs-content .docs-lvl6",
    "text": ".docs-content p, .docs-content li"
  }
}

The scraper builds a hierarchy from the content matched by these selectors: lvl0 through lvl6 capture headings from the most general to the most specific, and text captures the body content to index.

Here is the configuration file we use for the MeiliSearch documentation.

And for the search bar?

After having scraped your documentation, you might need a search bar to improve your user experience!


For the front end, check out the docs-searchbar.js repository, which provides a search bar adapted for documentation.

Authentication

WARNING: Please be aware that the scraper will send authentication headers to every scraped site, so use allowed_domains to adjust the scope accordingly!

Basic HTTP

Basic HTTP authentication is supported by setting these environment variables:

  • DOCS_SCRAPER_BASICAUTH_USERNAME
  • DOCS_SCRAPER_BASICAUTH_PASSWORD
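
For example, export both before launching the scraper (the values are placeholders for your own credentials):

$ export DOCS_SCRAPER_BASICAUTH_USERNAME=<your-username>
$ export DOCS_SCRAPER_BASICAUTH_PASSWORD=<your-password>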

Cloudflare Access: Identity and Access Management

If you need to scrape sites protected by Cloudflare Access, you have to set the appropriate HTTP headers.

Values for these headers are taken from the environment variables CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET.
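
For example (the placeholders stand for the service token credentials generated in Cloudflare):

$ export CF_ACCESS_CLIENT_ID=<your-client-id>
$ export CF_ACCESS_CLIENT_SECRET=<your-client-secret>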

In the case of Google Cloud Identity-Aware Proxy, the corresponding environment variables must be set in the same way.

Installing Chrome Headless

Websites that need JavaScript for rendering are scraped through ChromeDriver.
Download the version suited to your OS, then set the environment variable CHROMEDRIVER_PATH to its location.
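
For example, if you placed the binary in /usr/local/bin (an illustrative path; adjust it to wherever you downloaded ChromeDriver):

$ export CHROMEDRIVER_PATH=/usr/local/bin/chromedriver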

Development Workflow

Install and Launch

The pipenv command must be installed.

Set both environment variables MEILISEARCH_HOST_URL and MEILISEARCH_API_KEY.

Then, run:

$ pipenv install
$ pipenv run ./docs_scraper run <path-to-your-config-file>

Linter and Tests

$ pipenv install --dev
# Linter
$ pipenv run pylint scraper
# Tests
$ pipenv run pytest ./scraper/src -k "not _browser"

Release

Once the changes are merged into master, check out the master branch in your terminal and push a new tag with the right version:

$ git checkout master
$ git pull origin master
$ git tag vX.X.X
$ git push --tag origin master

A GitHub Action will be triggered, pushing the latest and vX.X.X versions of the Docker image to DockerHub.

Credits

Based on Algolia's docsearch scraper repository as of this commit.
Because this repository will diverge significantly from the original one, we don't maintain it as an official fork.


MeiliSearch provides and maintains many SDKs and Integration tools like this one. We want to provide everyone with an amazing search experience for any kind of project. If you want to contribute, make suggestions, or just know what's going on right now, visit us in the integration-guides repository.