SylvainLapoix/search4eu_competition

Crawl paginated results from "Case search"


On a case details page, we currently follow the links to policy areas. Each link leads to a page showing only the first 50 results for that policy area.
Reaching further pages requires clicking the "Next" button, which triggers a ColdFusion script.
Memorious does not do dynamic scraping, so we should complement it with a Selenium script that retrieves the additional links and either:

  • stores the HTML pages for case details (but then we'd need to apply the same cleaning as in Memorious) for ingestion in Aleph, or
  • feeds them back to Memorious as seeds (how could we best do that?).
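
To make the Selenium part concrete, here is a rough sketch of the pagination loop, not a tested scraper for the actual site. The function name `collect_case_links`, the `link_selector` argument and the "Next" link text are assumptions; the real selectors would have to be read off the "Case search" result pages. The locator strategies are passed as plain strings, which are the values behind Selenium's `By.CSS_SELECTOR` and `By.LINK_TEXT`.

```python
from urllib.parse import urljoin

# Selenium locator strategies are plain strings; these are the values
# of By.CSS_SELECTOR and By.LINK_TEXT.
CSS_SELECTOR = "css selector"
LINK_TEXT = "link text"


def collect_case_links(driver, start_url, link_selector,
                       next_text="Next", max_pages=100):
    """Walk a paginated result list and return the case detail URLs.

    `driver` is a Selenium WebDriver (or anything with the same
    get/find_elements interface). `link_selector` and `next_text` are
    hypothetical and must be adapted to the real page markup.
    """
    driver.get(start_url)
    seen = set()
    links = []
    for _ in range(max_pages):  # hard upper bound against endless loops
        new_on_page = 0
        for anchor in driver.find_elements(CSS_SELECTOR, link_selector):
            href = anchor.get_attribute("href")
            if not href:
                continue
            url = urljoin(start_url, href)  # resolve relative links
            if url not in seen:
                seen.add(url)
                links.append(url)
                new_on_page += 1
        if new_on_page == 0:
            break  # nothing new: probably stuck on the last page
        next_buttons = driver.find_elements(LINK_TEXT, next_text)
        if not next_buttons:
            break  # no "Next" link left
        # Assumption: the click loads the next page before we read it;
        # a real run would likely need an explicit wait here.
        next_buttons[0].click()
    return links
```

With a real browser, `driver` would be something like `selenium.webdriver.Firefox()`. The returned list could then be written to a file, or passed on to whichever of the two options above we pick.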