biglocalnews/court-scraper

AttributeError: 'NoneType' object has no attribute 'find_all'

ryanelittle opened this issue

I have been using Court Scraper to scrape OSCN. For counties that do not use DailyFilings, searching for all case numbers in a given year (start_date = 20TK-1-1, end_date = 20TK-12-31) does not return a list of case numbers.

---> 16         self.results = self.site.search_by_date(
     17             start_date=self.start_date,
     18             end_date=self.end_date

c:\users\rlitt\code\my-packages\court-scraper\court_scraper\platforms\oscn\site.py in search_by_date(self, start_date, end_date, case_details)
     80         if not start_date:
     81             start_date, end_date = self.current_day, self.current_day
---> 82         results = search_obj.search(start_date, end_date, case_details=case_details)
     83         return results

c:\users\rlitt\code\my-packages\court-scraper\court_scraper\platforms\oscn\pages\search.py in search(self, start_date, end_date, extra_params, case_details)
     44             # Merge any additional query parameters
     45             search_params.update(extra_params)
---> 46             html, basic_case_data = self._run_search(search_params)
     47             # Skip if there were no results for date
     48             if not basic_case_data:

c:\users\rlitt\code\my-packages\court-scraper\court_scraper\platforms\oscn\pages\search.py in _run_search(self, search_params)
     76         html = response.text
     77         page = SearchResultsPage(self.place_id, html)
---> 78         return html, page.results
     79 
     80     @property

c:\users\rlitt\code\my-packages\court-scraper\court_scraper\platforms\oscn\pages\search_results.py in results(self)
     22         results = {}
     23         # Only grab result rows (i.e. skip header)
---> 24         for row in self.soup.table.find_all('tr', class_='resultTableRow'):
     25             case_id_cell, filing_date, case_name, found_party = row.find_all('td')
     26             case_id = case_id_cell.a.text.strip()

AttributeError: 'NoneType' object has no attribute 'find_all'

Looking in the code, I found this note: "Always limit query to a single filing date, to minimize chances of truncate results." I did not expect this behavior based on the documentation. Could the code be changed to behave the same way as DailyFilings, i.e., when provided a date range, Search queries each date individually and returns the aggregated results for the full range?
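Something like this per-day loop is what I have in mind (just a sketch; search_single_day is a hypothetical stand-in for the existing single-date query, not a function in court-scraper):

    from datetime import date, timedelta

    def search_date_range(start_date, end_date, search_single_day):
        # Query one filing date at a time -- respecting the single-date
        # limit noted in the code -- and aggregate everything returned.
        results = []
        day = start_date
        while day <= end_date:
            results.extend(search_single_day(day))
            day += timedelta(days=1)
        return results

    # e.g. search_date_range(date(2020, 1, 1), date(2020, 12, 31), fn)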

The error persists even when supplying single dates.

@ryanelittle Can you share the code or CLI command that is triggering the error?

I am using Site.search_by_date in a custom class. This is my function:

    def get_case_numbers(self, county, start_date, end_date):
        self.county = county
        self.start_date = start_date
        self.end_date = end_date
        self.site = Site(self.county)
        self.results = self.site.search_by_date(
            start_date=self.start_date,
            end_date=self.end_date
        )
        # Collect the case number from each search result
        self.case_numbers = []
        for result in self.results:
            self.case_numbers.append(result.number)

@ryanelittle Great. Can you also provide the date ranges you're using? Sounds like it may generally be broken, but I wouldn't mind trying to test with the exact parameters you've tried so far.

@ryanelittle Oh, also, if you could supply the value stored in self.county, that'll let me replicate your test.

I tried a few; none of them worked. I just tried 'ok_atoka', '2020-03-01', '2020-03-01', and it did not work.

@ryanelittle The bug appears to be due to the OSCN site now rejecting web requests with the default Python User-Agent supplied by the requests library. This must be new(ish) behavior, since the code was working a few months back when we created it. Anyhow, the site now treats such requests as unauthorized and returns a 403 error page, which does not contain the expected elements and therefore triggers the error we're seeing at the BeautifulSoup layer.

Providing a realistic User-Agent header appears to fix the problem, so updating the code in search.py to send a User-Agent that mimics a real browser should resolve the issue.

In the short term, if you need to press forward on your project, I would just fork and hard-code a User-Agent.
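Roughly what that workaround looks like (a sketch only; the URL and query parameters here are illustrative placeholders, not the exact ones search.py builds):

    import requests

    # Any browser-like string works; the point is to avoid requests'
    # default "python-requests/x.y" User-Agent, which OSCN now rejects.
    HEADERS = {
        'User-Agent': (
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
            'AppleWebKit/537.36 (KHTML, like Gecko) '
            'Chrome/96.0.4664.110 Safari/537.36'
        )
    }

    # Placeholder URL and params, for illustration only
    url = 'https://www.oscn.net/dockets/Search.aspx'
    search_params = {'db': 'atoka', 'FiledDateL': '03/01/2020', 'FiledDateH': '03/01/2020'}

    response = requests.get(url, params=search_params, headers=HEADERS)
    response.raise_for_status()  # a 403 here means the header is still being rejected
    html = response.text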

Thank you for the fix @zstumgoren.

@ryanelittle Sure thing. We'll try to ship a proper release to PyPI containing the bug fix in the near future. We'll leave this ticket open until then. Meantime, thanks for bringing it to our attention!

@zstumgoren I've used fake-useragent (https://pypi.org/project/fake-useragent/) to randomize my user agents in the past. It might be a good solution, so court-scraper doesn't send the same header for everyone who uses it.
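For example (a sketch; UserAgent().random returns a different real-world browser string on each access, and the URL is a placeholder):

    import requests
    from fake_useragent import UserAgent

    ua = UserAgent()  # backed by a database of real browser User-Agent strings

    # Each request goes out with a freshly randomized User-Agent
    response = requests.get(
        'https://www.oscn.net/dockets/Search.aspx',
        headers={'User-Agent': ua.random},
    )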