biglocalnews/court-scraper

OK refactor

Closed this issue · 0 comments

Refactor the OK (Oklahoma) scraper as detailed below.

Tasks

  • Add oscn.Site.search to support case number-based search for detailed pages
  • Add oscn.Site.search_by_date to support case discovery by date
  • Remove defunct code:
    • oscn/pages/search.py
    • oscn/pages/url.py
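To make the two new entry points concrete, here is a rough sketch of how they might fit together. Only the names `oscn.Site.search` and `oscn.Site.search_by_date` come from the task list; the constructor arguments, the `CaseInfo` container, the injected fetch callables, and all sample values are hypothetical placeholders, not the project's actual API. The fetchers stand in for real HTTP requests so the example runs without touching the network.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, List, Optional


@dataclass
class CaseInfo:
    """Hypothetical container for a single search result."""
    place_id: str
    number: str
    filing_date: Optional[str] = None


class Site:
    """Hypothetical shape for oscn.Site.

    fetch_case and fetch_day are injected stand-ins for the real
    HTTP lookups (an assumption made to keep the sketch offline).
    """

    def __init__(
        self,
        place_id: str,
        fetch_case: Callable[[str, str], CaseInfo],
        fetch_day: Callable[[str, str], List[CaseInfo]],
    ) -> None:
        self.place_id = place_id
        self._fetch_case = fetch_case
        self._fetch_day = fetch_day

    def search(self, case_numbers: List[str]) -> List[CaseInfo]:
        """Look up detail pages for specific case numbers."""
        return [self._fetch_case(self.place_id, n) for n in case_numbers]

    def search_by_date(self, start_date: date, end_date: date) -> List[CaseInfo]:
        """Discover cases by walking the date range one day at a time."""
        results: List[CaseInfo] = []
        current = start_date
        while current <= end_date:
            day = current.strftime("%m/%d/%y")
            results.extend(self._fetch_day(self.place_id, day))
            current += timedelta(days=1)
        return results


# Fake fetchers returning placeholder data, for illustration only.
def fake_fetch_case(place_id: str, number: str) -> CaseInfo:
    return CaseInfo(place_id, number)


def fake_fetch_day(place_id: str, day: str) -> List[CaseInfo]:
    return [CaseInfo(place_id, f"PLACEHOLDER-{day}", filing_date=day)]


site = Site("ok_example", fake_fetch_case, fake_fetch_day)
by_number = site.search(["PLACEHOLDER-1"])
by_date = site.search_by_date(date(2021, 7, 15), date(2021, 7, 16))
```

Injecting the fetchers is just one way to keep the search logic testable without network access; the real implementation may well wire requests in directly.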

Background

The Oklahoma State Courts Network (OSCN) site offers a variety of search features. The most relevant to us appear to be:

  • Daily Filings by County, which can be used to compile a case index by scraping day by day using the %m/%d/%y format (e.g. 07/15/21). This system can help us simplify backfilling of data.
  • The Case Number Lookup system, which allows search by specific case numbers across individual or all counties. This system is more useful for ongoing updates of open cases.
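As a small illustration of the day-by-day approach, the following sketch generates the date strings a backfill would step through, using the %m/%d/%y format mentioned above. The function name `daily_filing_dates` is made up for this example, and how each string is passed to the Daily Filings search is left out.

```python
from datetime import date, timedelta


def daily_filing_dates(start, end):
    """Yield one %m/%d/%y string per day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current.strftime("%m/%d/%y")
        current += timedelta(days=1)


# Three-day window around the July 15, 2021 example.
dates = list(daily_filing_dates(date(2021, 7, 14), date(2021, 7, 16)))
print(dates)  # ['07/14/21', '07/15/21', '07/16/21']
```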

The Daily Filings system provides access to case detail pages that appear to contain the same data as the Case Number Lookup system. However, the HTML returned by Case Number Lookup contains more semantic hooks (e.g. section headers) that would be useful for downstream parsing.

For example, this daily filings search for July 15, 2021, includes this case detail page (among others). The equivalent Case Number Lookup detail page is here.

We'll need both of these search features: Daily Filings to support backfilling of data, and Case Number Lookup to support ongoing updates for specific case numbers. Search by case number is also needed for an initial, minimal integration with the CLI. In the future, we could investigate other search features and potentially add support for search by name or wildcard (assuming these are supported). But for a v1, we should start with the indexing and search capabilities described above.