
Vectara Web Crawler

This is a lightweight web crawler and indexer for getting web content into Vectara.

About

The web crawler currently has 4 modes of operation:

  1. Single URL
  2. Sitemap
  3. RSS
  4. Recursive

For single-page mode, provide the crawler with a URL and it will ingest that page into Vectara. For sitemap mode, provide the crawler with a root page, and it will retrieve the sitemap(s) and index all links from the sitemap. RSS mode works similarly for feeds, indexing the pages linked from a feed, and recursive mode discovers links from the starting page and follows them up to the configured crawl depth.

Dependencies

This crawler has a minimal set of Python dependencies, as outlined in requirements.txt.

Install these requirements by running:

pip3 install -r requirements.txt

Setup and Requirements

The crawler generates a PDF for each page and uploads it to Vectara's file upload API. It relies on headless browsers both to extract links and to generate these PDFs, which allows for realistic text rendering, even of JavaScript-heavy websites. Chrome/Chromium is required for link extraction, and there are currently two supported headless browsers for PDF generation, each with their own tradeoffs:

  1. pyhtml2pdf, which in turn uses headless Chrome for rendering. You will either need to install Chrome locally or keep a copy of chromedriver in your PATH.
  2. wkhtmltopdf, which uses Qt WebKit for rendering. It is highly recommended that you download a precompiled wkhtmltopdf binary and add it to your PATH (as opposed to trying to install wkhtmltopdf via a package manager).
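As an illustrative sketch of the two rendering paths (not the crawler's actual internals), the snippet below renders the same page with each driver; pdfkit is assumed here purely as a convenient Python wrapper around the wkhtmltopdf binary:

from pyhtml2pdf import converter  # drives headless Chrome (needs Chrome/chromedriver)
import pdfkit                     # shells out to the wkhtmltopdf binary on your PATH

url = "https://example.com/some-page"

# Path 1: render with headless Chrome via pyhtml2pdf
converter.convert(url, "page-chrome.pdf")

# Path 2: render with Qt WebKit via wkhtmltopdf
pdfkit.from_url(url, "page-wkhtmltopdf.pdf")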

Unfortunately no website PDF rendering system is perfect, though for the purposes of neural search, it generally doesn't need to be: you just need to make sure the right text is rendered in roughly the right order.

wkhtmltopdf tends to do a pretty good job of this task but doesn't handle URL fragments (the part after # in a URL), so crawls using wkhtmltopdf will strip any URL fragment from the document ID when submitting to Vectara. wkhtmltopdf can also be insecure, so either keep the process sandboxed or run it only on sites that you trust.
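As a minimal sketch of that fragment stripping (the crawler's exact implementation may differ), Python's standard library separates the fragment directly:

from urllib.parse import urldefrag

# The "#installation" fragment is dropped, so fragment variants of a page
# share a single document ID.
url = "https://example.com/docs/page#installation"
doc_id, fragment = urldefrag(url)
print(doc_id)  # https://example.com/docs/page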

pyhtml2pdf (and Chrome) generally produce more accurate colors and positioning than wkhtmltopdf, though for the purposes of neural text search these generally do not matter. Unfortunately, that visual accuracy can sometimes come at the cost of programmatic accuracy, with some text blocks ending up in the wrong place in the PDF.

In general, if you have full access to the content and/or the ability to do more bespoke content extraction, that will yield better results than a generic web crawler, and Vectara also maintains a full text/metadata indexing API for those users.

Usage

python3 crawler.py [parameters]

Parameters are:

| Parameter | Required? | Description | Default |
| --- | --- | --- | --- |
| url | Yes | The starting URL, domain, or homepage | N/A |
| crawl-type | No | One of single-page, rss, sitemap, or recursive | single-page |
| pdf-driver | No | Which driver converts pages to PDFs: chrome or wkhtmltopdf | chrome |
| (no-)install-chrome-driver | No | Whether or not to install the Chrome driver used to extract links | --install-chrome-driver |
| depth | No | Maximum depth to discover and crawl links | 3 |
| crawl-pattern | No | Optional regular expression restricting the crawl to matching URLs | .* (all URLs) |
| customer-id | Yes | Your Vectara customer ID | N/A |
| corpus-id | Yes | Your Vectara corpus ID | N/A |
| appclient-id | Yes | OAuth 2.0 client ID used to index content | N/A |
| appclient-secret | Yes | OAuth 2.0 client secret used to index content | N/A |
| auth-url | No | OAuth2 authentication URL | Defined by your account |
| indexing-endpoint | No | The endpoint used to index content | api.vectara.com |
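For example, a sitemap crawl might be launched like this (parameter names come from the table above, passed as flags; the IDs and credentials are placeholders):

python3 crawler.py \
  --url https://example.com \
  --crawl-type sitemap \
  --pdf-driver wkhtmltopdf \
  --customer-id 1234567890 \
  --corpus-id 1 \
  --appclient-id "<OAUTH_CLIENT_ID>" \
  --appclient-secret "<OAUTH_CLIENT_SECRET>"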

License

This code is licensed under Apache 2.0. For more details, see the license file.