A scraper for your documentation website that indexes the scraped content into a MeiliSearch instance.
MeiliSearch is a powerful, fast, open-source search engine that is easy to use and deploy. Both searching and indexing are highly customizable. Features such as typo-tolerance, filters, and synonyms are provided out-of-the-box.
This scraper is used in production and runs on the MeiliSearch documentation on each deployment.
- Table of Contents
- Installation and Usage
- Configuration file
- And for the search bar?
- Authentication
- Installing Chrome Headless
- Development Workflow
- Credits
First of all, you need to run your own MeiliSearch instance. This scraper will scrape your website and automatically index its content in MeiliSearch.
MeiliSearch is open-source and can run on your own server! 😄
Without running a MeiliSearch instance, the scraper will not work.
Here is the documentation to install and run MeiliSearch.
A tutorial about how to run MeiliSearch in production is coming...
The variables MEILISEARCH_HOST_URL and MEILISEARCH_API_KEY you will set in the next steps are the credentials of this MeiliSearch instance.
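As a quick sanity check, you can verify from Python that both variables are set before launching the scraper. This is a minimal sketch (the helper name is illustrative, but the variable names match the ones used above):

```python
import os

def get_meilisearch_credentials():
    """Read the MeiliSearch credentials from the environment.

    Raises a RuntimeError if either variable is missing, so a
    misconfigured run fails early instead of during indexing.
    """
    host_url = os.environ.get("MEILISEARCH_HOST_URL")
    api_key = os.environ.get("MEILISEARCH_API_KEY")
    if not host_url or not api_key:
        raise RuntimeError(
            "Both MEILISEARCH_HOST_URL and MEILISEARCH_API_KEY must be set"
        )
    return host_url, api_key
```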
This project supports Python 3.6+.
The pipenv command must be installed.
Set both environment variables MEILISEARCH_HOST_URL and MEILISEARCH_API_KEY.
Then, run:
$ pipenv install
$ pipenv run ./docs_scraper <path-to-your-config-file>
$ docker run -t --rm \
-e MEILISEARCH_HOST_URL=<your-meilisearch-host-url> \
-e MEILISEARCH_API_KEY=<your-meilisearch-api-key> \
-v <absolute-path-to-your-config-file>:/docs-scraper/config.json \
getmeili/docs-scraper:v0.9.0 pipenv run ./docs_scraper config.json
To run the scraper in a GitHub Action after your deployment job:
run-scraper:
  needs: <your-deployment-job>
  runs-on: ubuntu-18.04
  steps:
    - uses: actions/checkout@master
    - name: Run scraper
      env:
        HOST_URL: ${{ secrets.MEILISEARCH_HOST_URL }}
        API_KEY: ${{ secrets.MEILISEARCH_API_KEY }}
        CONFIG_FILE_PATH: <path-to-your-config-file>
      run: |
        docker run -t --rm \
          -e MEILISEARCH_HOST_URL=$HOST_URL \
          -e MEILISEARCH_API_KEY=$API_KEY \
          -v $CONFIG_FILE_PATH:/docs-scraper/config.json \
          getmeili/docs-scraper:v0.9.0 pipenv run ./docs_scraper config.json
Here is the GitHub Action file we use in production for the MeiliSearch documentation.
The API key you provide as an environment variable must have permission to add documents to your MeiliSearch instance.
Thus, you need to provide the private key or the master key.
More about MeiliSearch authentication.
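For illustration, here is how such a key is typically attached to an indexing request. The X-Meili-API-Key header name matches MeiliSearch versions contemporary with this scraper (v0.x); newer MeiliSearch versions use an Authorization: Bearer header instead, so treat this as a sketch rather than the scraper's exact internals:

```python
import json
import urllib.request

def build_add_documents_request(host_url, api_key, index_uid, documents):
    """Build (but do not send) a request that adds documents to an index."""
    payload = json.dumps(documents).encode("utf-8")
    req = urllib.request.Request(
        f"{host_url}/indexes/{index_uid}/documents",
        data=payload,
        method="POST",
    )
    req.add_header("Content-Type", "application/json")
    req.add_header("X-Meili-API-Key", api_key)  # private or master key
    return req
```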
A generic configuration file:
{
  "index_uid": "docs",
  "start_urls": ["https://www.example.com/doc/"],
  "sitemap_urls": ["https://www.example.com/sitemap.xml"],
  "stop_urls": [],
  "selectors": {
    "lvl0": {
      "selector": ".docs-lvl0",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": {
      "selector": ".docs-lvl1",
      "global": true,
      "default_value": "Chapter"
    },
    "lvl2": ".docs-content .docs-lvl2",
    "lvl3": ".docs-content .docs-lvl3",
    "lvl4": ".docs-content .docs-lvl4",
    "lvl5": ".docs-content .docs-lvl5",
    "lvl6": ".docs-content .docs-lvl6",
    "text": ".docs-content p, .docs-content li"
  }
}
Depending on your selectors, the scraper extracts the matching content from each page and builds a hierarchical record from it.
Here is the configuration file we use for the MeiliSearch documentation.
After having scraped your documentation, you might need a search bar to improve your user experience!
For the front-end part, check out the docs-searchbar.js repository, which provides a front-end search bar adapted for documentation.
WARNING: Please be aware that the scraper will send authentication headers to every scraped site, so use allowed_domains to adjust the scope accordingly!
Basic HTTP authentication is supported by setting these environment variables:
DOCS_SCRAPER_BASICAUTH_USERNAME
DOCS_SCRAPER_BASICAUTH_PASSWORD
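Under the hood, Basic HTTP authentication boils down to a base64-encoded Authorization header built from those two values. A minimal sketch of how such a header is constructed (the helper name is illustrative; the variable names match the ones above):

```python
import base64
import os

def basic_auth_header():
    """Build the Authorization header value for Basic HTTP authentication."""
    username = os.environ["DOCS_SCRAPER_BASICAUTH_USERNAME"]
    password = os.environ["DOCS_SCRAPER_BASICAUTH_PASSWORD"]
    # Basic auth encodes "username:password" in base64.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"
```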
To scrape sites protected by Cloudflare Access, you have to set the appropriate HTTP headers.
The values for these headers are taken from the environment variables CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET.
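A sketch of how those two variables translate into request headers (CF-Access-Client-Id and CF-Access-Client-Secret are the service-token header names documented by Cloudflare; the helper name is illustrative):

```python
import os

def cloudflare_access_headers():
    """Build the service-token headers required by Cloudflare Access."""
    return {
        "CF-Access-Client-Id": os.environ["CF_ACCESS_CLIENT_ID"],
        "CF-Access-Client-Secret": os.environ["CF_ACCESS_CLIENT_SECRET"],
    }
```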
In the case of Google Cloud Identity-Aware Proxy, please specify these environment variables:
- IAP_AUTH_CLIENT_ID: pick the client ID of the application you are connecting to
- IAP_AUTH_SERVICE_ACCOUNT_JSON: generate in Actions -> Create key -> JSON
Websites that need JavaScript for rendering are passed through ChromeDriver.
Download the version suited to your OS and then set the environment variable CHROMEDRIVER_PATH.
The pipenv command must be installed.
Set both environment variables MEILISEARCH_HOST_URL and MEILISEARCH_API_KEY.
Then, run:
$ pipenv install
$ pipenv run ./docs_scraper run <path-to-your-config-file>
$ pipenv install --dev
# Linter
$ pipenv run pylint scraper
# Tests
$ pipenv run pytest ./scraper/src -k "not _browser"
Once the changes are merged into master, check out the master branch in your terminal and push a new tag with the right version:
$ git checkout master
$ git pull origin master
$ git tag vX.X.X
$ git push --tags origin master
A GitHub Action will be triggered and will push the latest and vX.X.X versions of the Docker image to DockerHub.
Based on Algolia's docsearch scraper repository from this commit.
Due to the many changes made to this repository compared to the original one, we don't maintain it as an official fork.
MeiliSearch provides and maintains many SDKs and Integration tools like this one. We want to provide everyone with an amazing search experience for any kind of project. If you want to contribute, make suggestions, or just know what's going on right now, visit us in the integration-guides repository.