GSA/srt-fbo-scraper

BUG: DLA urls sometimes redirect to an archive search


Expected Behavior

The URL for each notice should take you directly to that notice document's page. For example:
https://www.fbo.gov/index?s=opportunity&mode=form&id=97e71b45acfd5d42782a5cf9ab6b229e&tab=core&_cview=0

From there we can scrape the attachments.

Current Behavior

Some Defense Logistics Agency (DLA) URLs fail to redirect you to the notice's FBO page, where the notice's attachments can be found. Instead, they redirect you to an archive search page. For example, clicking here takes you to https://www.fbo.gov/index?s=opportunity&mode=list&tab=archives, whereas clicking here takes you to the notice's page.

Possible Solution

If you first make a HEAD request to one of those DLA URLs, the Location header in the response shows where the redirect leads. If it's going to redirect you to the archive search, the header is '/index?s=opportunity&mode=list&tab=archives'. If it's going to take you to the FBO notice, it'll look like: /index?s=opportunity&mode=form&id=e475caeca4747588413256b51dc26648&tab=core&_cview=1.

A possible solution is to make a HEAD request to every URL and check for that bad Location header. If it's present, go to the archive search and scrape the correct URL from the search results page. Then make another request to get the actual notice page and its attachments.
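
The header check described above could be sketched roughly as follows. This is only an illustration using the Python standard library, not the scraper's actual code: the function names `is_archive_search` and `head_location` are hypothetical, and the sketch assumes the DLA URLs are served over HTTPS. Scraping the correct URL out of the archive search results is left out, since it depends on that page's markup.

```python
from http.client import HTTPSConnection
from urllib.parse import parse_qs, urlsplit


def is_archive_search(location):
    """Return True if a Location header value points at the archive search.

    Hypothetical helper: the archive search is recognised by the query
    parameters mode=list and tab=archives, as in the example above.
    """
    if not location:
        return False
    query = parse_qs(urlsplit(location).query)
    return query.get("mode") == ["list"] and query.get("tab") == ["archives"]


def head_location(url, timeout=10):
    """Send a HEAD request without following redirects.

    Hypothetical helper: returns the Location header value, or None if the
    response carries no such header. Assumes an https:// URL.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn = HTTPSConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().getheader("Location")
    finally:
        conn.close()
```

A caller would pass each DLA URL to `head_location` and fall back to the archive search only when `is_archive_search` returns True for the result. Parsing the query string rather than comparing the whole header means the check still works if the parameter order changes or the header contains an absolute URL.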

Context

This issue is somewhat of a priority since the scraper misses a large swath of attachments. Of the 11,801 ICT notices posted between 5/10/18 and 10/10/18, 1,558 (~13%) have DLA URLs.

closed by #124