internetarchive/openlibrary

Replace all urllib / urllib2 imports and uses with `requests`

Opened this issue · 15 comments

hornc commented

urllib2 is old and a bit clunky -- I've recently found it can't do some of the things I needed it to do, it causes problems for moving to Python 3, and it generally makes the code more complex.

Instead of reusing it and adding new features on top of the older mechanisms, can we consider refactoring to use requests instead?

My PR where I stripped out urllib2 and StringIO etc to enable HEAD requests: #2838
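
For context on the HEAD pain point, a minimal sketch (the URL is a placeholder, not code from that PR): urllib2 only issues GET or POST unless you override get_method, whereas requests supports HEAD directly.

# urllib2 needs a subclass just to send a HEAD request:
import urllib2

class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

response = urllib2.urlopen(HeadRequest("https://archive.org/metadata/example"))

# requests has it built in:
import requests

response = requests.head("https://archive.org/metadata/example")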

A similar PR that switched to requests to avoid old bugs: #2779

By using requests (rather than adding imports from six) we get a better tool for making requests while reducing code complexity and the number of imports.

Relates to #2308, since a common use of simplejson is parsing urllib2 responses; requests has a built-in json() method, so the extra import is not required.
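
A minimal before/after sketch of that simplejson point (the endpoint and query are just examples, not taken from the codebase):

# Before: urllib2 plus simplejson
import urllib2
import simplejson

data = simplejson.loads(urllib2.urlopen("https://openlibrary.org/search.json?q=tolkien").read())

# After: requests parses the JSON body itself
import requests

data = requests.get("https://openlibrary.org/search.json", params={"q": "tolkien"}).json()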


I'm strongly supportive of switching to requests since it has built-in (or easy to add) support for things like retries, timeouts, caching, etc.

Having said that, there's so much critical work pending that I'm not sure we can afford to be doing "nice to have" work on existing code that works. Refactoring modules as they get worked on seems fine though.
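
On the retries/timeouts point, a minimal sketch of what a shared helper could look like, using urllib3's Retry mounted on a requests Session (the retry settings and URL are illustrative assumptions, not an agreed standard):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient 5xx responses with exponential backoff.
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

response = session.get("https://archive.org/metadata/example", timeout=10)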

hornc commented

Refactoring modules as they get worked on seems fine though.

Agree.
Refactoring is why I wanted to publicise this as a common solution: when devs work on a module and find themselves banging their heads against urllib2 for any reason, their first thought should be to remove it and use requests instead. This just happened to me a couple of times in a row, so I wanted to share it as a common time-saving way forward; perhaps we can also standardise some of the retry mechanisms etc.

hornc commented

Possibly unused urllib + urllib2 imports in openlibrary/plugins/upstream/addbook.py?

PyCharm identifies them as unused (amongst other things).

#2894 shows some code smells which need to be avoided when doing any upgrade of this type.

If you still have an import urllib2, it's probably wrong.

% grep urlopen **/*.py

  • openlibrary/admin/numbers.py: sqlite_contents = urllib.request.urlopen(url).read()
  • openlibrary/catalog/amazon/add_covers.py: ret = simplejson.load(urlopen(url))
  • openlibrary/catalog/amazon/other_editions.py: return urllib.request.urlopen(url).read()
  • openlibrary/catalog/utils/edit.py: prev = unmarshal(json.load(urllib.request.urlopen(url)))
  • openlibrary/core/lending.py: content = urllib.request.urlopen(url=url, timeout=config_http_request_timeout).read()
  • openlibrary/core/lending.py: content = urllib.request.urlopen(request, timeout=config_http_request_timeout).read()
  • openlibrary/core/lending.py: content = urllib.request.urlopen(url=url, timeout=config_http_request_timeout).read()
  • openlibrary/core/lending.py: response = simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/core/lending.py: return simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/core/lending.py: jsontext = urllib.request.urlopen(config_ia_loan_api_url, payload,
  • openlibrary/core/models.py: d = simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/coverstore/code.py: text = urlopen("https://archive.org/metadata/" + item).read()
  • openlibrary/coverstore/code.py: jsontext = urlopen(url).read()
  • openlibrary/plugins/admin/code.py: response = urllib.request.urlopen(s).read()
  • openlibrary/plugins/admin/code.py: json = urllib.request.urlopen("http://www.archive.org/download/stats/numUniqueIPsOL.json").read()
  • openlibrary/plugins/admin/services.py: self.data = BeautifulSoup(urllib.request.urlopen(url).read(), "lxml")
  • openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
  • openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
  • openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
  • openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
  • openlibrary/plugins/openlibrary/code.py: return urllib.request.urlopen(url).read()
  • openlibrary/plugins/upstream/borrow.py: response = simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/plugins/upstream/borrow.py: response = simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/plugins/upstream/covers.py: response = urllib.request.urlopen(upload_url, urllib.parse.urlencode(params))
  • openlibrary/plugins/upstream/data.py: return urllib.request.urlopen(url).read()
  • openlibrary/plugins/upstream/models.py: data = urllib.request.urlopen(url).read()
  • openlibrary/plugins/worksearch/subjects.py: response = json.load(urllib.request.urlopen(solr_url))['response']
  • openlibrary/plugins/worksearch/subjects.py: response = json.load(urllib.request.urlopen(solr_url))['response']
  • openlibrary/solr/find_modified_works.py: ret.append(extract_works(json.load(urllib.request.urlopen(url))))
  • openlibrary/solr/find_modified_works.py: changes = list(json.load(urllib.request.urlopen(url)))
  • openlibrary/utils/solr.py: data = urllib.request.urlopen(url, timeout=10).read()
  • openlibrary/utils/solr.py: data = urllib.request.urlopen(request, timeout=10).read()
  • openlibrary/accounts/model.py: f = urllib.request.urlopen(req)
  • openlibrary/api.py: return urllib.request.urlopen(req)
  • openlibrary/catalog/add_book/__init__.py: res = urllib.request.urlopen(upload_url, urllib.parse.urlencode(params))
  • openlibrary/catalog/amazon/crawl.py: #page = urlopen('http://amazon.com/dp/' + asin).read()
  • openlibrary/catalog/amazon/load_merge.py: result = urllib.request.urlopen(ureq).read(100000)
  • openlibrary/catalog/get_ia.py:def urlopen_keep_trying(url):
  • openlibrary/catalog/get_ia.py: f = urllib.request.urlopen(url)
  • openlibrary/catalog/get_ia.py: return '<!--' in urlopen_keep_trying(IA_DOWNLOAD_URL + loc).read()
  • openlibrary/catalog/get_ia.py: data = urlopen_keep_trying(item_base + marc_xml_filename).read()
  • openlibrary/catalog/get_ia.py: data = urlopen_keep_trying(item_base + marc_bin_filename).read()
  • openlibrary/catalog/get_ia.py: tree = etree.parse(urlopen_keep_trying(url))
  • openlibrary/catalog/get_ia.py: tree = etree.parse(urlopen_keep_trying(url))
  • openlibrary/catalog/get_ia.py: f = urlopen_keep_trying(ureq)
  • openlibrary/catalog/get_ia.py: f = urlopen_keep_trying(url)
  • openlibrary/catalog/get_ia.py: f = urlopen_keep_trying(url)
  • openlibrary/catalog/importer/db_read.py: data = urlopen(url).read()
  • openlibrary/catalog/marc/download.py: f = urllib.request.urlopen("http://archive.org/download/bpl_marc/" + file)
  • openlibrary/catalog/marc/download.py: f = urllib.request.urlopen(ureq)
  • openlibrary/catalog/marc/marc_subject.py: f = urlopen_keep_trying(url)
  • openlibrary/catalog/marc/marc_subject.py: f = urlopen_keep_trying(url)
  • openlibrary/catalog/utils/query.py:def urlopen(url, data=None):
  • openlibrary/catalog/utils/query.py: return urllib.request.urlopen(req)
  • openlibrary/catalog/utils/query.py: return json.load(urlopen(url))
  • openlibrary/catalog/utils/query.py: return urlopen(url).read()
  • openlibrary/core/fulltext.py: json_data = urllib.request.urlopen(search_select, timeout=30).read()
  • openlibrary/core/loanstats.py: logger.info("urlopen %s", url)
  • openlibrary/core/loanstats.py: response = urllib.request.urlopen(url).read()
  • openlibrary/coverstore/utils.py: r = urlopen(req)
  • openlibrary/plugins/akismet/akismet.py: h = urllib.request.urlopen(req)
  • openlibrary/plugins/ol_infobase.py: response = urllib.request.urlopen(url, json).read()
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: def urlopen(self, path, data=None, method=None, headers={}):
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: self.urlopen("/account/login", data=urllib.parse.urlencode(data), method="POST")
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: response = self.urlopen(
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: data = self.urlopen("/people/" + self.username + "/lists.json").read()
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: data = self.urlopen(key + ".json").read()
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: data = self.urlopen(key + "/seeds.json").read()
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: response = self.urlopen(key + "/seeds.json", json)
  • openlibrary/plugins/openlibrary/tests/test_ratingsapi.py: def urlopen(self, path, data=None, method=None, headers={}):
  • openlibrary/plugins/openlibrary/tests/test_ratingsapi.py: self.urlopen("/account/login", data=urllib.parse.urlencode(data), method="POST")
  • openlibrary/plugins/openlibrary/tests/test_ratingsapi.py: r = self.urlopen(
  • openlibrary/plugins/search/solr_client.py: ru = urlopen(query_url)
  • openlibrary/plugins/search/solr_client.py: ru = urlopen(query_url)
  • openlibrary/plugins/upstream/adapter.py: response = urllib.request.urlopen(req)
  • openlibrary/plugins/upstream/utils.py: tree = etree.parse(urllib.request.urlopen(url))
  • openlibrary/plugins/worksearch/code.py: solr_result = urllib.request.urlopen(url, timeout=10)
  • openlibrary/solr/update_work.py:def urlopen(url, data=None):
  • openlibrary/solr/update_work.py: return urllib.request.urlopen(req)
  • openlibrary/solr/update_work.py: result = json.load(urlopen(url))
  • openlibrary/solr/update_work.py: logger.info("urlopen %s", url)
  • openlibrary/solr/update_work.py: reply = json.load(urlopen(url))
  • openlibrary/solr/update_work.py: reply = json.load(urlopen(url))
  • openlibrary/tests/catalog/test_get_ia.py: monkeypatch.setattr(get_ia, 'urlopen_keep_trying', return_test_marc_xml)
  • openlibrary/tests/catalog/test_get_ia.py: monkeypatch.setattr(get_ia, 'urlopen_keep_trying', return_test_marc_bin)
  • openlibrary/tests/catalog/test_get_ia.py: monkeypatch.setattr(get_ia, 'urlopen_keep_trying', return_test_marc_bin)
  • openlibrary/tests/solr/test_update_work.py: monkeypatch.setattr(update_work, 'urlopen', lambda url: StringIO(solr_response))
  • openlibrary/utils/httpserver.py: >>> response = urllib.request.urlopen('http://0.0.0.0:8090/hello/world')
  • openlibrary/utils/httpserver.py: >>> urllib.request.urlopen('http://0.0.0.0:8090/hello/world')
  • openlibrary/views/showmarc.py: data = urllib.request.urlopen(url).read()
  • openlibrary/views/showmarc.py: data = urllib.request.urlopen(url).read()
  • openlibrary/views/showmarc.py: result = urllib.request.urlopen(ureq).read(100000)
imskr commented

Thanks, @cclauss, for the guidance. I'm on it!

  • Refactor to use requests json #3709 Approved by @cclauss
  • Requests refactor only query.py #3666
  • refactor update_work and test_update_work to use requests #3696 Approved by @cclauss
  • Refactor .read() to use requests #3710
  • converted only simple GETs to requests #3694
  • use requests instead of stdlib #3658
  • Fix portability and dependency issue #3559

This is going to need SEVERAL PRs before it is finally closed.

@hornc Would you be willing to work with @aasifkhan7 to remove urllib from openlibrary/catalog/get_ia.py in favor of requests?
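
For reference, a sketch of what a requests-based urlopen_keep_trying could look like (the retry count and sleep interval are guesses; the real PR should preserve the existing function's exact semantics):

import time

import requests

def urlopen_keep_trying(url, tries=3):
    # Retry transient failures before giving up, as the urllib2 version does.
    for attempt in range(tries):
        try:
            response = requests.get(url)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt + 1 == tries:
                raise
            time.sleep(2)

Note that callers doing .read() on the old file-like return value would switch to response.content (bytes) or response.text.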

% grep urlopen **/*.py > files.txt

with open("files.txt") as in_file:
    # Keep only the filename before the first ":" of each grep match,
    # deduplicating via a set.
    files = {line.split(":")[0] for line in in_file}

print("\n".join(sorted(files)))

(A shell-only equivalent is sketched after the file list below.)
  • openlibrary/catalog/amazon/load_merge.py
  • openlibrary/catalog/importer/db_read.py #4539
    --
  • openlibrary/catalog/marc/download.py
  • openlibrary/views/showmarc.py #4527
  • scripts/oclc_to_marc.py #4515
  • scripts/z3950_view.py 4a6be85
    --
  • openlibrary/plugins/akismet/akismet.py See #4572
    --
  • openlibrary/plugins/upstream/adapter.py
  • openlibrary/plugins/upstream/utils.py
    --
  • openlibrary/plugins/search/solr_client.py #4593 #4729
  • openlibrary/plugins/worksearch/code.py #4604
  • scripts/new-solr-updater.py #4605

I don't think scripts/z3950_view.py works any more, because PyZ3950 is no longer in the requirements and I can't find PyZ3950 on pypi.org. I also cannot find any reference to z3950_view in the codebase. Given this, should we remove z3950_view.py?

Last released in 2004: https://pypi.org/project/PyZ3950 --> http://www.panix.com/~asl2/software/PyZ3950 --> https://github.com/asl2/PyZ3950 (see the PRs), which is far from Python 3 compatible.

I don't think openlibrary/catalog/marc/download.py is used any more: it tries to use web.run, which is no longer in the installed version of web.py, and it tries to call http://wiki-beta.us.archive.org:9090, which is no longer accessible.

Should openlibrary/catalog/marc/download.py be deleted?

I went ahead and did a little grepping to list all the places urllib is used (excluding infogami).

I'm not sure which of these uses we should still try to migrate away from. At least some of the urlencode calls could be changed to just passing the dict to requests (see the sketch after the table below).
@cclauss what do you think?

Filename | Count | urllib functions used
./plugins/upstream/utils.py | 8 | urlencode quote parse_qs urlparse urlunparse quote_plus
./plugins/upstream/borrow.py | 8 | urlencode quote
./coverstore/utils.py | 8 | urlencode urlsplit unquote urlunsplit unquote_plus parse_qsl
./utils/solr.py | 6 | urlencode urlsplit
./plugins/openlibrary/code.py | 8 | urlencode parse_qs urlparse urlunparse
./plugins/openlibrary/tests/test_listapi.py | 8 | urlencode Request build_opener HTTPCookieProcessor
./plugins/openlibrary/tests/test_ratingsapi.py | 7 | urlencode Request build_opener HTTPCookieProcessor
./plugins/upstream/adapter.py | 4 | urlencode Request urlopen
./core/fulltext.py | 2 | urlencode
./core/ia.py | 2 | urlencode
./core/models.py | 2 | urlencode
./core/sponsorships.py | 2 | urlencode
./coverstore/tests/test_webapp.py | 2 | urlencode
./plugins/admin/graphs.py | 2 | urlencode
./plugins/importapi/code.py | 2 | urlencode
./plugins/worksearch/code.py | 3 | urlencode
./plugins/upstream/addbook.py | 4 | quote urlparse
./core/processors/readableurls.py | 3 | quote quote_plus
./catalog/utils/query.py | 3 | quote
./plugins/books/code.py | 3 | urlsplit unquote
./core/helpers.py | 2 | urlsplit
./plugins/openlibrary/lists.py | 1 | parse_qs

By function, the number of files using each: urlencode 16, quote 5, urlsplit 4, parse_qs 3, urlparse 3, Request 3, quote_plus 2, urlunparse 2, unquote 2, build_opener 2, HTTPCookieProcessor 2, urlunsplit 1, unquote_plus 1, parse_qsl 1, urlopen 1.
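
As a minimal sketch of the urlencode point above (the Solr host and query parameters are hypothetical):

# Before: encode the query string by hand
import urllib.parse
import urllib.request

params = {"q": "tolkien", "wt": "json"}
url = "http://solr:8983/solr/select?" + urllib.parse.urlencode(params)
raw = urllib.request.urlopen(url, timeout=10).read()

# After: requests encodes the dict and parses the JSON reply
import requests

response = requests.get("http://solr:8983/solr/select", params=params, timeout=10)
reply = response.json()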