mendableai/data-connectors

Error: Invalid character in entity name

Closed this issue · 2 comments

Code like this:

import { createDataConnector } from "@mendable/data-connectors";

const webDataConnector = createDataConnector({
  provider: "web-scraper",
})

webDataConnector.setOptions({
  urls: ["http://localhost:3000"],
  mode: "sitemap",
})

const documents = await webDataConnector.getDocuments();

produces this error:

Scraping data from http://localhost:3000
(node:53598) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
Error processing http://localhost:3000: Error: Invalid character in entity name
Line: 5
Column: 2516
Char: &
Found 0 urls in sitemap
[]

However, http://localhost:3000/sitemap.xml is not empty: it contains URLs, and it does not contain an ampersand (&) character.

It is unclear to me what line 5, column 2516 refers to; I don't see an ampersand there.
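
One possible explanation, which is an assumption and not confirmed in this thread, is that the reported line and column refer to the HTML page served at http://localhost:3000 itself, which is being parsed as XML, rather than to sitemap.xml. The sketch below shows one way to locate bare ampersands in any fetched body. `findBareAmpersands` is a hypothetical helper for debugging, not part of data-connectors, and the parser's own line and column conventions may differ slightly from the 1-based ones used here.

```typescript
// Hypothetical debugging helper, not part of data-connectors: reports where
// a text contains an "&" that does not start a well-formed entity reference
// such as &amp;, &#38; or &#x26;.
interface Position {
  line: number;   // 1-based
  column: number; // 1-based
}

function findBareAmpersands(text: string): Position[] {
  const positions: Position[] = [];
  // An entity reference is "&name;", "&#digits;" or "&#xhex;".
  const entity = /^&(?:[A-Za-z_][\w.-]*|#\d+|#x[0-9A-Fa-f]+);/;
  let line = 1;
  let column = 1;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    // Only a short window is inspected; entity names longer than that
    // are treated as bare ampersands.
    if (ch === "&" && !entity.test(text.slice(i, i + 64))) {
      positions.push({ line, column });
    }
    if (ch === "\n") {
      line++;
      column = 1;
    } else {
      column++;
    }
  }
  return positions;
}
```

Running this on the body returned by the URL passed in urls, instead of on sitemap.xml, would show whether an unescaped & (for example in a query string inside an href) sits near the reported position.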

Changing the urls value from http://localhost:3000 to http://localhost:3000/sitemap.xml solved the problem. (I had assumed that /sitemap.xml would be appended to the URL automatically when mode is "sitemap".)

That's a good idea, @awhitford. Will add a fallback that tries to find the sitemap when sitemap mode is on, even if the URL provided does not point to it.
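
A minimal sketch of what such a fallback could look like, assuming the sitemap lives at the conventional /sitemap.xml path on the same origin. `sitemapCandidates` and `looksLikeSitemap` are hypothetical helpers for illustration only; they are not the library's implementation.

```typescript
// Hypothetical helpers, not the library's actual implementation.

// Returns the URLs to try, in order: the URL as given, then the
// conventional /sitemap.xml at the same origin (an assumed location).
function sitemapCandidates(input: string): string[] {
  const given = new URL(input);
  const conventional = new URL("/sitemap.xml", given.origin);
  const candidates = [given.href, conventional.href];
  // Drop the duplicate when the input already points at /sitemap.xml.
  return Array.from(new Set(candidates));
}

// Cheap check that a response body is a sitemap rather than an HTML page:
// sitemap files have <urlset> or <sitemapindex> as their root element.
function looksLikeSitemap(body: string): boolean {
  return /<(?:urlset|sitemapindex)[\s>]/.test(body);
}
```

A caller could fetch each candidate in turn and parse the first body for which `looksLikeSitemap` returns true, so that both http://localhost:3000 and http://localhost:3000/sitemap.xml would work as input.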