open-austin/indigent-defense-stats

scraper: replace request_page_with_retry with a retry library



I attempted to replace the request_page_with_retry method with this retry library but hit a wall. The custom retry procedure seems to handle errors (or wait for the request to come back?) differently from the retry library, and I can't sort out how. The original code takes a lot longer to run and waits for the pages to load, but the retry-library implementation runs super fast and doesn't wait as long. It's unclear to me why.

@newswim Any ideas? Otherwise, I'm going to abandon this ticket because I've sunk some time into it and it's neither high priority nor necessary for production.

If nobody has ideas after a while, I'll go ahead and close the ticket.

I can take a look at it. I'm unfamiliar with it, but I agree it's definitely just a nice-to-have.

Cc @Matt343 just in case there's an obvious thing we're doing wrong.

So the issue is that you're calling write_debug_and_quit, which in turn calls sys.exit(), which is a non-retryable situation. To prevent that from happening, just print the error info and then re-raise the error with raise e instead.
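A minimal sketch of why that would matter, assuming the retry library only catches ordinary exceptions. The `retry` decorator below is a hypothetical stand-in written for illustration, not the library discussed in this issue, and the two fetch functions only simulate a failing request:

```python
import sys


def retry(attempts):
    """Hypothetical stand-in for a retry library: retries on Exception only."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    # SystemExit derives from BaseException, not Exception,
                    # so it is never caught here and never retried.
                    last_error = e
            raise last_error
        return wrapper
    return decorator


# Counts how many times each simulated request was attempted.
calls = {"quit": 0, "reraise": 0}


@retry(attempts=3)
def fetch_then_quit():
    calls["quit"] += 1
    try:
        raise ConnectionError("simulated failure")
    except ConnectionError:
        # Mirrors what write_debug_and_quit is described as doing.
        sys.exit(1)


@retry(attempts=3)
def fetch_then_reraise():
    calls["reraise"] += 1
    try:
        raise ConnectionError("simulated failure")
    except ConnectionError as e:
        print(f"request failed: {e}")
        raise e  # an ordinary exception, so the wrapper can retry
```

Under these assumptions, `fetch_then_quit` runs once and exits, while `fetch_then_reraise` is attempted three times before the last error propagates.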

Hmm. I'm not so sure. If write_debug_and_quit were triggering within request_page, then you'd see "[verification text, which I think is "Record Count"] couldn't be found in page" and the HTML of the failed page saved to the logging folder. Instead, the search is returning a results page with 0 results (I logged the HTML below; it is the proper results page, just empty), and no HTML of a failed page is saved to the logging folder. So the code isn't exactly failing. The results page comes back with a count of 0, so no new HTML is written, which is not what the unit test expects, and the test fails on an assertion.


It makes me think that the earlier scrape_search_page method within the scraper isn't working correctly because it's running too fast, which then messes up the actual search in an odd way that returns 0 on the results page, when it should be ~81 cases for this judicial officer on this date (what happens when you run the unit test for main).

HTML example of what we're getting for the results page, with the correct verification text (so it's not really throwing an error to activate write_debug_and_quit):

<html>
  <head>
    <link rel="stylesheet" type="text/css" href="CSS/PublicAccess.css">
  </head>
  
  <body>
    <form id="SearchParameters" action="CourtCalendarSearchResults.aspx" method="post" style="display:none;">
      <input id="SearchType" name="SearchType" type="hidden" value='JUDOFFC'/>   <!-- The type of search this is: by case or by party -->
      <input id="SearchMode" name="SearchMode" type="hidden" value='JUDOFFC'/>   <!-- The specific search mode of SearchType - like Search Case by CaseNumber -->
      <input id="HearingTypeIDs" name="HearingTypeIDs" type="hidden" value=''/>
      <input id="SearchBy" name="SearchBy" type="hidden" value='3'/>
      <input id="NameTypeKy" name="NameTypeKy" type="hidden" value=''/>
      <input id="CaseCategories" name="CaseCategories" type="hidden" value='CR'/>
      <input id="CourtCaseSearchValue" name="CourtCaseSearchValue" value=''/>
      <input id="UseSoundex" name="UseSoundex" type="hidden" value=''/>
      <input id="LastName" name="LastName" type="hidden" value=''/>
      <input id="FirstName" name="FirstName" type="hidden" value=''/>
      <input id="MiddleName" name="MiddleName" type="hidden" value=''/>
      <input id="cboJudOffc" name="cboJudOffc" type="hidden" value=''/>
      <input id="cboMagist" name="cboMagist" type="hidden" value=''/>
      <input id="DateSettingOnAfter" name="DateSettingOnAfter" type="hidden" value='07/01/2024'/>
      <input id="DateSettingOnBefore" name="DateSettingOnBefore" type="hidden" value='07/01/2024'/>
      <input id="SearchParams" name="SearchParams" type="hidden" value=''/>
      <input id="SortType" name="SortType" type="hidden" value=''/>
    </form>    
    <?xml version="1.0" encoding="utf-8"?><table cellspacing="0" cellpadding="0" width="100%" border="0" style="table-layout: fixed; margin:0px; padding:0px;" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:PublicAccessUser="urn:PublicAccessUser"><tr><td class="ssHeaderTitleBanner">Court Calendar Results</td></tr></table><table cellspacing="0" cellpadding="0" width="100%" border="0" style="table-layout: fixed; margin:0px; padding:0px;" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:PublicAccessUser="urn:PublicAccessUser"><tr><td bgcolor="#000000" height="20px"><table cellspacing="0" cellpadding="0" width="100%" border="0"><tr><td align="left" style="padding-left: 5px"><font size="1"><a class="ssBlackNavBarHyperlink" href="#MainContent">Skip to Main Content</a>&nbsp;<a class="ssBlackNavBarHyperlink" href="logout.aspx">Logout</a>&nbsp;<a class="ssBlackNavBarHyperlink" href="MyAccount.aspx?ReturnURL=default.aspx">My Account</a>&nbsp;<a class="ssBlackNavBarHyperlink" href="default.aspx">Search Menu</a>&nbsp;<a class="ssBlackNavBarHyperlink" href="Search.aspx?ID=900">New Calendar Search</a>&nbsp;<a class="ssBlackNavBarHyperlink" href="Search.aspx?ID=900&amp;RefineSearch=1">Refine Search</a>&nbsp;</font></td><td align="center" class="ssBlackNavBarLocation"></td><td align="right" style="padding-right: 10px"><table cellspacing="0" cellpadding="0" border="0"><tr><td class="ssBlackNavBarLocation">
                          Location : All Courts</td><td><font size="1"><a class="ssBlackNavBarHyperlink" target="_blank" href="help.aspx">Help</a></font></td></tr></table></td></tr></table></td></tr></table><a id="MainContent" name="MainContent" tabindex="-1" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:user="http://www.tylertechnologies.com"></a><table border="0" cellpadding="0" cellspacing="0" width="100%" style="table-layout: fixed; font-size: 8pt; font-family: arial" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:user="http://www.tylertechnologies.com"><tr><td style="width:85px;"><b>Record Count: </b></td><td style="text-align:left;"><b>0</b></td></tr><tr><td id="SearchParamList" colspan="2"></td></tr></table><table border="0" cellpadding="1" cellspacing="0" width="100%" style="table-layout: fixed; font-size: 8pt; font-family: arial" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:user="http://www.tylertechnologies.com"><col width="25%" /><col width="35%" /><col width="20%" /><col width="20%" /><tr><td colspan="4"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="table-layout: fixed; font-size: 8pt; font-family: arial"><col width="25%" /><col width="35%" /><col width="20%" /><col width="20%" /><tr style="padding-top:5px;"><td colspan="4"><label for="SortBy" style="font-weight:bold;">Sort By  </label><select id="SortBy" name="SortBy" onChange="SwitchCourtSort(this.value)"><option value="CN" selected="true">
            Case Number
          </option><option value="DT">
            Date and Time
          </option><option value="DN">
            Defendant Name
          </option><option value="HT">
            Hearing Type
          </option><option value="JN">
            Judicial Officer Name
          </option><option value="PN">
            Plaintiff Name
          </option></select></td></tr><tr><td nowrap="true"> </td><td nowrap="true"> </td><td nowrap="true"> </td><td nowrap="true"><b>Date</b></td></tr><tr><td nowrap="true"><b>Case Number</b></td><td nowrap="true"></td><td nowrap="true"><b>Judicial Officer</b></td><td nowrap="true"><b>Time</b></td></tr><tr><th class="ssSearchResultHeaderBottom" nowrap="true">Type</th><th class="ssSearchResultHeaderBottom" nowrap="true">Style</th><th class="ssSearchResultHeaderBottom" nowrap="true">Physical Location</th><th class="ssSearchResultHeaderBottom" nowrap="true">Hearing Type</th></tr></table></td></tr><tr height="100"><td colspan="4" align="center"><b>No cases matched your search criteria.</b></td></tr></table>
  </body>
  
  <script language="javascript">
    function SwitchCourtSort(sSortType)
    {
      // Set sort type to form element
      document.getElementById("SortType").value = sSortType;
      
      SearchParameters.submit();
      return true;
    }

  </script>
</html>
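Since the page above contains the verification text but reports a count of 0, one way to spot this case while debugging might be to pull the count out of the HTML directly. This is only a sketch: `parse_record_count` is a hypothetical helper, not something in the scraper, and the regex assumes the markup keeps the `Record Count:` cell layout shown in the dump above.

```python
import re

# Matches the "Record Count:" cell and captures the number in the next cell.
# Assumes the two-cell layout seen in the logged results page.
RECORD_COUNT_RE = re.compile(
    r"Record Count:\s*</b>\s*</td>\s*<td[^>]*>\s*<b>\s*(\d+)\s*</b>",
    re.IGNORECASE,
)


def parse_record_count(html):
    """Return the record count as an int, or None if the cell is missing."""
    match = RECORD_COUNT_RE.search(html)
    return int(match.group(1)) if match else None
```

A caller could then log or raise when the count is 0 for a search that is expected to have results, instead of silently writing no new HTML.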

Oh interesting. I wonder if it's related to removing this sleep(ms_wait / 1000 * (i + 1)) from the beginning of the request function? That would trigger a sleep on the first try, but now we don't do that. That's the only other change I can see that should have any impact, unless I'm missing something obvious :)

I tried that and had it sleep for 5 whole seconds before making each request and still no dice. Haunted, perhaps. 👻
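For reference, a rough sketch of what a retry loop with the sleep(ms_wait / 1000 * (i + 1)) call at the top of each attempt might look like. The function name, the default values, and the injectable `sleep` parameter are assumptions made for illustration; this is not the original request_page_with_retry.

```python
import time


def request_with_linear_wait(fetch, ms_wait=200, max_retries=5, sleep=time.sleep):
    """Call fetch() up to max_retries times, sleeping before every attempt.

    The wait grows linearly: ms_wait, 2 * ms_wait, 3 * ms_wait, ...
    (in milliseconds), and it applies to the first try as well.
    """
    last_error = None
    for i in range(max_retries):
        sleep(ms_wait / 1000 * (i + 1))  # runs before the first attempt too
        try:
            return fetch()
        except Exception as e:
            last_error = e
    raise last_error
```

With ms_wait=500, the waits before the first three attempts would be 0.5, 1.0, and 1.5 seconds, so even a request that succeeds immediately is delayed by half a second.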