algolia/docsearch-scraper

Algolia index records reduction after an undefined amount of time

FilippoRezzonico opened this issue · 6 comments

Situation

The docsearch-scraper Docker image scrapes our company documentation website every time we deploy a new version of it and correctly updates our indexes. Using the search bar then returns the correct records.
The problem is that, after an undefined amount of time (it can be weeks or even months), our Algolia search bar stops returning any records. Is it possible that indexes get deleted after a certain amount of time, or that docsearch-scraper deletes them or returns no records when some condition occurs (for example, being launched multiple times within a short period)?

Result

Our documentation search through Algolia does not return a single record until we run the docsearch-scraper Docker image again.

Workaround

Whenever we notice that our Algolia search returns no results, we run the docsearch-scraper Docker image again.

Hey @FilippoRezzonico,

Can you confirm you are using the latest version of the docsearch-scraper Docker image?

The only case in which an index can be unavailable is at the end of a successful crawl: while the crawler runs, it stores records in a _tmp index and renames that index to the production name at the end of the crawl.
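Roughly, that pattern looks like the following with the Algolia Python client (a minimal sketch of the pattern, not the scraper's actual code; the credentials and records are placeholders):

from algoliasearch.search_client import SearchClient

# Placeholder credentials; the real scraper reads them from its environment.
client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")

# Illustrative records; the real ones come from the crawl.
records = [{"url": "https://docs.mia-platform.eu/", "content": "..."}]

# 1. Store all freshly crawled records in a temporary index.
tmp_index = client.init_index("mia-platform-docs_tmp")
tmp_index.save_objects(records, {"autoGenerateObjectIDIfNotExist": True})

# 2. Atomically rename the temporary index to the production name.
#    Searches keep hitting the old records until the move completes.
client.move_index("mia-platform-docs_tmp", "mia-platform-docs")

So a successful crawl never leaves the production index empty; it can only end up smaller if the crawl itself produced fewer records.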

Hi @shortcuts,
Yes, we are currently using the latest version of the Docker image.
We only run the scraper when we release a new version of our documentation (while this problem seems to happen randomly). Moreover, releasing our documentation (which launches the docsearch-scraper image) seems to restore the correct functioning of our search bar.
Today, about 3 days after the last occurrence, our search bar is not working again. We noticed the following anomalies:

  • The number of records suddenly decreased from 23.8K to 20.4K (screenshot: monitoring).
  • Some records from our top searches seem to have been removed (screenshot: template_record).
  • Some of our top searches also appeared among the searches with no results, as if they had been filtered out before being returned (screenshot: advanced).

Do you think these sudden changes can give us some hints about the problem we are facing?

Today, about 3 days after the last occurrence, our search bar is not working again.

Could you confirm the index is deleted when the search stops working? If that's the case, I suggest you contact support@algolia.com (and include a link to this issue so you don't have to re-explain it); they will be able to tell you why your index is being deleted.

With our scraper, we always keep the production index up and don't perform delete operations.

We noticed the following anomalies:

If the index is not deleted but the search simply stops working, it might be related to some inconsistencies during the crawl. Could you please provide a gist with your config file so I can try it?

  • Do you have client-side rendered content? Make sure to use the js_render option, and add some delay if needed using js_wait (see the snippet after this list)
  • If you don't have a sitemap yet, make sure to check our tips for a good search section
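For example, a config enabling both options could look like this (the js_wait value, in seconds, is illustrative):

{
  "index_name": "mia-platform-docs",
  "start_urls": [
    "https://docs.mia-platform.eu"
  ],
  "js_render": true,
  "js_wait": 2
}

js_render makes the crawler render pages in a headless browser, and js_wait gives client-side content some extra time to appear before the page is scraped.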

Hope this gives you hints :D

Actually, our indexes are not deleted, but their number of records seems to shrink after some time. So I think that, as you said, it could be caused by some issue during the crawl.
Here is the config file that we are currently using:

{
  "index_name": "mia-platform-docs",
  "start_urls": [
    "https://docs.mia-platform.eu"
  ],
  "stop_urls": [
    "/$"
  ],
  "selectors": {
    "text": "article p, article li",
    "lvl1": "header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5",
    "lvl6": "article h6",
    "lvl0": {
      "selector": ".menu__link--sublist.menu__link--active",
      "global": true,
      "default_value": "Documentation"
    }
  },
  "sitemap_urls": [
    "https://docs.mia-platform.eu/sitemap.xml"
  ],
  "sitemap_alternate_links": true,
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "min_indexed_level": 0,
  "conversation_id": [
    "1280385092"
  ],
  "nb_hits": 12708
}

Do you see any possible problem with it?

  • Both the js_render and js_wait links you provided in your previous message return a 404 page; have they been moved elsewhere?
  • We wait exactly 1 minute after releasing our documentation before launching the Algolia scraper; do you think we should increase this delay?
  • We currently have a sitemap.xml file in our documentation project, but I am pretty sure it is not up to date and may contain pages that no longer exist. Do you think this could cause the issue?

We currently have a sitemap.xml file in our documentation project, but I am pretty sure it is not up to date and may contain pages that no longer exist. Do you think this could cause the issue?

As the issue mostly comes down to inconsistencies between crawls, it might have an impact, yes.

Both the js_render and js_wait links you provided in your previous message return a 404 page; have they been moved elsewhere?

The new docs have been deployed since then; the links are now at js_render and js_wait, sorry! :D

On my side, I had 33740 hits without client-side rendering, and 32574 with it.

We wait exactly 1 minute after releasing our documentation before launching the Algolia scraper; do you think we should increase this delay?

There's no caching on our side so it shouldn't have an impact.

For now, I will try to update our sitemap.xml file and see if this solves the problem once and for all.
Since this problem is random in nature, I think we will need a few months without it recurring to be sure that it has been fixed.
I will let you know if the issue happens again after I have updated the sitemap.xml file.
Thanks a lot for your support :)