Replace all urllib / urllib2 imports and uses with `requests`
`urllib2` is old and a bit clunky. I've found recently that it can't do some of the things I needed it to do, it causes problems for moving to Python 3, and it generally seems to make the code more complex. Instead of re-using it, and adding new features that use the older mechanisms, can we consider refactoring to use `requests` instead?
My PR where I stripped out `urllib2` and `StringIO` etc. to enable `HEAD` requests: #2838. A similar PR that went with using `requests` to avoid old bugs: #2779.
By using `requests` (rather than adding imports from `six`) we get a better tool for making requests while reducing code complexity and the number of imports.

Relates to #2308, since a common use of `simplejson` is to parse `urllib2` responses. `requests` has a builtin `json()` method, so the extra import is not required.
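A minimal before/after sketch of that point (the URL is illustrative, not from the codebase):

```python
import requests

# Before (Python 2):
#   import urllib2, simplejson
#   data = simplejson.loads(urllib2.urlopen(url).read())

# After: one import, and JSON parsing is built in.
url = "https://openlibrary.org/search.json?q=war+and+peace"  # illustrative URL
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
data = response.json()
```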
I'm strongly supportive of switching to `requests`, since it has built-in (or easy to add) support for things like retries, timeouts, caching, etc.

Having said that, there's so much critical work pending that I'm not sure we can afford to be doing "nice to have" work on existing code that works. Refactoring modules as they get worked on seems fine though.
> Refactoring modules as they get worked on seems fine though.

Agree.
Refactoring as we go is exactly why I wanted to publicise this as a common solution. When devs work on a module, if they find themselves banging their heads against `urllib2` for any reason, their first thought should be to remove it and use `requests` instead. I just had this happen a couple of times in a row and wanted to share it as a common time-saving way forward. Perhaps we can also standardise some of the retry mechanisms etc.; see the sketch below.
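If we do standardise, one plausible shape is a shared `requests.Session` with urllib3 retries mounted on it. This is a sketch only; the helper name, retry counts, and status list are assumptions, not an agreed convention:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries=3, backoff_factor=0.5):
    """Hypothetical helper: a Session that retries transient failures."""
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=backoff_factor,
        status_forcelist=(500, 502, 503, 504),  # retry typical transient server errors
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

# Usage: session = make_session(); session.get(url, timeout=10)
```

Note the explicit timeout: `requests` has no default timeout, so without one a call can block indefinitely.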
Possibly unused `urllib` + `urllib2` in openlibrary/plugins/upstream/addbook.py?
#2894 shows some code smells which need to be avoided when doing any upgrade of this type.
If you still have an `import urllib2`, it's probably wrong.
% grep urlopen **/*.py
- openlibrary/admin/numbers.py: sqlite_contents = urllib.request.urlopen(url).read()
- openlibrary/catalog/amazon/add_covers.py: ret = simplejson.load(urlopen(url))
- openlibrary/catalog/amazon/other_editions.py: return urllib.request.urlopen(url).read()
- openlibrary/catalog/utils/edit.py: prev = unmarshal(json.load(urllib.request.urlopen(url)))
- openlibrary/core/lending.py: content = urllib.request.urlopen(url=url, timeout=config_http_request_timeout).read()
- openlibrary/core/lending.py: content = urllib.request.urlopen(request, timeout=config_http_request_timeout).read()
- openlibrary/core/lending.py: content = urllib.request.urlopen(url=url, timeout=config_http_request_timeout).read()
- openlibrary/core/lending.py: response = simplejson.loads(urllib.request.urlopen(url).read())
- openlibrary/core/lending.py: return simplejson.loads(urllib.request.urlopen(url).read())
- openlibrary/core/lending.py: jsontext = urllib.request.urlopen(config_ia_loan_api_url, payload,
- openlibrary/core/models.py: d = simplejson.loads(urllib.request.urlopen(url).read())
- openlibrary/coverstore/code.py: text = urlopen("https://archive.org/metadata/" + item).read()
- openlibrary/coverstore/code.py: jsontext = urlopen(url).read()
- openlibrary/plugins/admin/code.py: response = urllib.request.urlopen(s).read()
- openlibrary/plugins/admin/code.py: json = urllib.request.urlopen("http://www.archive.org/download/stats/numUniqueIPsOL.json").read()
- openlibrary/plugins/admin/services.py: self.data = BeautifulSoup(urllib.request.urlopen(url).read(), "lxml")
- openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
- openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
- openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
- openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
- openlibrary/plugins/openlibrary/code.py: return urllib.request.urlopen(url).read()
- openlibrary/plugins/upstream/borrow.py: response = simplejson.loads(urllib.request.urlopen(url).read())
- openlibrary/plugins/upstream/borrow.py: response = simplejson.loads(urllib.request.urlopen(url).read())
- openlibrary/plugins/upstream/covers.py: response = urllib.request.urlopen(upload_url, urllib.parse.urlencode(params))
- openlibrary/plugins/upstream/data.py: return urllib.request.urlopen(url).read()
- openlibrary/plugins/upstream/models.py: data = urllib.request.urlopen(url).read()
- openlibrary/plugins/worksearch/subjects.py: response = json.load(urllib.request.urlopen(solr_url))['response']
- openlibrary/plugins/worksearch/subjects.py: response = json.load(urllib.request.urlopen(solr_url))['response']
- openlibrary/solr/find_modified_works.py: ret.append(extract_works(json.load(urllib.request.urlopen(url))))
- openlibrary/solr/find_modified_works.py: changes = list(json.load(urllib.request.urlopen(url)))
- openlibrary/utils/solr.py: data = urllib.request.urlopen(url, timeout=10).read()
- openlibrary/utils/solr.py: data = urllib.request.urlopen(request, timeout=10).read()
- openlibrary/accounts/model.py: f = urllib.request.urlopen(req)
- openlibrary/api.py: return urllib.request.urlopen(req)
- openlibrary/catalog/add_book/init.py: res = urllib.request.urlopen(upload_url, urllib.parse.urlencode(params))
- openlibrary/catalog/amazon/crawl.py: #page = urlopen('http://amazon.com/dp/' + asin).read()
- openlibrary/catalog/amazon/load_merge.py: result = urllib.request.urlopen(ureq).read(100000)
- openlibrary/catalog/get_ia.py:def urlopen_keep_trying(url):
- openlibrary/catalog/get_ia.py: f = urllib.request.urlopen(url)
- openlibrary/catalog/get_ia.py: return '<!--' in urlopen_keep_trying(IA_DOWNLOAD_URL + loc).read()
- openlibrary/catalog/get_ia.py: data = urlopen_keep_trying(item_base + marc_xml_filename).read()
- openlibrary/catalog/get_ia.py: data = urlopen_keep_trying(item_base + marc_bin_filename).read()
- openlibrary/catalog/get_ia.py: tree = etree.parse(urlopen_keep_trying(url))
- openlibrary/catalog/get_ia.py: tree = etree.parse(urlopen_keep_trying(url))
- openlibrary/catalog/get_ia.py: f = urlopen_keep_trying(ureq)
- openlibrary/catalog/get_ia.py: f = urlopen_keep_trying(url)
- openlibrary/catalog/get_ia.py: f = urlopen_keep_trying(url)
- openlibrary/catalog/importer/db_read.py: data = urlopen(url).read()
- openlibrary/catalog/marc/download.py: f = urllib.request.urlopen("http://archive.org/download/bpl_marc/" + file)
- openlibrary/catalog/marc/download.py: f = urllib.request.urlopen(ureq)
- openlibrary/catalog/marc/marc_subject.py: f = urlopen_keep_trying(url)
- openlibrary/catalog/marc/marc_subject.py: f = urlopen_keep_trying(url)
- openlibrary/catalog/utils/query.py:def urlopen(url, data=None):
- openlibrary/catalog/utils/query.py: return urllib.request.urlopen(req)
- openlibrary/catalog/utils/query.py: return json.load(urlopen(url))
- openlibrary/catalog/utils/query.py: return urlopen(url).read()
- openlibrary/core/fulltext.py: json_data = urllib.request.urlopen(search_select, timeout=30).read()
- openlibrary/core/loanstats.py: logger.info("urlopen %s", url)
- openlibrary/core/loanstats.py: response = urllib.request.urlopen(url).read()
- openlibrary/coverstore/utils.py: r = urlopen(req)
- openlibrary/plugins/akismet/akismet.py: h = urllib.request.urlopen(req)
- openlibrary/plugins/ol_infobase.py: response = urllib.request.urlopen(url, json).read()
- openlibrary/plugins/openlibrary/tests/test_listapi.py: def urlopen(self, path, data=None, method=None, headers={}):
- openlibrary/plugins/openlibrary/tests/test_listapi.py: self.urlopen("/account/login", data=urllib.parse.urlencode(data), method="POST")
- openlibrary/plugins/openlibrary/tests/test_listapi.py: response = self.urlopen(
- openlibrary/plugins/openlibrary/tests/test_listapi.py: data = self.urlopen("/people/" + self.username + "/lists.json").read()
- openlibrary/plugins/openlibrary/tests/test_listapi.py: data = self.urlopen(key + ".json").read()
- openlibrary/plugins/openlibrary/tests/test_listapi.py: data = self.urlopen(key + "/seeds.json").read()
- openlibrary/plugins/openlibrary/tests/test_listapi.py: response = self.urlopen(key + "/seeds.json", json)
- openlibrary/plugins/openlibrary/tests/test_ratingsapi.py: def urlopen(self, path, data=None, method=None, headers={}):
- openlibrary/plugins/openlibrary/tests/test_ratingsapi.py: self.urlopen("/account/login", data=urllib.parse.urlencode(data), method="POST")
- openlibrary/plugins/openlibrary/tests/test_ratingsapi.py: r = self.urlopen(
- openlibrary/plugins/search/solr_client.py: ru = urlopen(query_url)
- openlibrary/plugins/search/solr_client.py: ru = urlopen(query_url)
- openlibrary/plugins/upstream/adapter.py: response = urllib.request.urlopen(req)
- openlibrary/plugins/upstream/utils.py: tree = etree.parse(urllib.request.urlopen(url))
- openlibrary/plugins/worksearch/code.py: solr_result = urllib.request.urlopen(url, timeout=10)
- openlibrary/solr/update_work.py:def urlopen(url, data=None):
- openlibrary/solr/update_work.py: return urllib.request.urlopen(req)
- openlibrary/solr/update_work.py: result = json.load(urlopen(url))
- openlibrary/solr/update_work.py: logger.info("urlopen %s", url)
- openlibrary/solr/update_work.py: reply = json.load(urlopen(url))
- openlibrary/solr/update_work.py: reply = json.load(urlopen(url))
- openlibrary/tests/catalog/test_get_ia.py: monkeypatch.setattr(get_ia, 'urlopen_keep_trying', return_test_marc_xml)
- openlibrary/tests/catalog/test_get_ia.py: monkeypatch.setattr(get_ia, 'urlopen_keep_trying', return_test_marc_bin)
- openlibrary/tests/catalog/test_get_ia.py: monkeypatch.setattr(get_ia, 'urlopen_keep_trying', return_test_marc_bin)
- openlibrary/tests/solr/test_update_work.py: monkeypatch.setattr(update_work, 'urlopen', lambda url: StringIO(solr_response))
- openlibrary/utils/httpserver.py: >>> response = urllib.request.urlopen('http://0.0.0.0:8090/hello/world')
- openlibrary/utils/httpserver.py: >>> urllib.request.urlopen('http://0.0.0.0:8090/hello/world')
- openlibrary/views/showmarc.py: data = urllib.request.urlopen(url).read()
- openlibrary/views/showmarc.py: data = urllib.request.urlopen(url).read()
- openlibrary/views/showmarc.py: result = urllib.request.urlopen(ureq).read(100000)
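Most of the hits above reduce to a handful of patterns, so a few representative conversions cover the bulk of the work. A sketch (the URL, `api_url`, and `params` are stand-ins for the real values in lending.py and friends):

```python
import requests

config_http_request_timeout = 30  # stand-in for the configured timeout
url = "https://archive.org/metadata/example"  # illustrative URL

# Raw bytes, replacing urllib.request.urlopen(url, timeout=...).read():
content = requests.get(url, timeout=config_http_request_timeout).content

# JSON, replacing simplejson.loads(urllib.request.urlopen(url).read()):
data = requests.get(url, timeout=config_http_request_timeout).json()

# POST with a form payload, replacing
# urllib.request.urlopen(api_url, urllib.parse.urlencode(params)).
# requests urlencodes the dict itself when it is passed via data=.
api_url = "https://example.org/api"  # hypothetical endpoint
params = {"identifier": "example"}
text = requests.post(api_url, data=params, timeout=config_http_request_timeout).text
```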
- Refactor to use requests json #3709 Approved by @cclauss
- Requests refactor only query.py #3666
- refactor update_work and test_update_work to use requests #3696 Approved by @cclauss
- Refactor .read() to use requests #3710
- converted only simple GETs to requests #3694
- use requests instead of stdlib #3658
- Fix portability and dependency issue #3559
This is going to need SEVERAL PRs before it is finally closed.

@hornc Would you be willing to work with @aasifkhan7 to remove urllib from openlibrary/catalog/get_ia.py in favor of `requests`?
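For get_ia.py, the main piece is `urlopen_keep_trying()` (visible in the grep output above). A requests-based sketch, assuming the current behaviour is simply "retry the GET until it succeeds"; the attempt count and delay here are assumptions:

```python
import time
import requests

def get_keep_trying(url, attempts=5, delay=2, timeout=30):
    """Hypothetical requests-based replacement for urlopen_keep_trying()."""
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # out of retries; let the caller see the failure
            time.sleep(delay)
```

Callers would then use `response.content` instead of `f.read()` (and, e.g., `etree.fromstring(response.content)` in place of `etree.parse(f)`).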
```python
# shell: grep urlopen **/*.py > files.txt
with open("files.txt") as in_file:
    files = {line.split(":")[0] for line in in_file}  # keep just the filename
print("\n".join(sorted(files)))
```
- openlibrary/catalog/amazon/load_merge.py
- openlibrary/catalog/importer/db_read.py #4539
- openlibrary/catalog/marc/download.py
- openlibrary/views/showmarc.py #4527
- scripts/oclc_to_marc.py #4515
- scripts/z3950_view.py 4a6be85
- openlibrary/plugins/akismet/akismet.py See #4572
- openlibrary/plugins/upstream/adapter.py
- openlibrary/plugins/upstream/utils.py
- openlibrary/plugins/search/solr_client.py
- openlibrary/plugins/worksearch/code.py #4593 #4729 #4604
- scripts/new-solr-updater.py #4605
I don't think scripts/z3950_view.py works any more, because PyZ3950 is no longer in the requirements. I can't find PyZ3950 on pypi.org, and I cannot find any reference to z3950_view in the codebase. Given this, should we remove z3950_view.py?

PyZ3950 was last released in 2004: https://pypi.org/project/PyZ3950 --> http://www.panix.com/~asl2/software/PyZ3950 --> https://github.com/asl2/PyZ3950 (see the PRs), and it is far from Python 3 compatible.
I don't think openlibrary/catalog/marc/download.py is used any more: it tries to use web.run, which is no longer in the installed version of web.py, and it tries to call http://wiki-beta.us.archive.org:9090, which is no longer reachable. Should openlibrary/catalog/marc/download.py be deleted?
I went ahead and did a little grepping to see all the places urllib is used (excluding infogami). I'm not sure which of these we should still try to migrate away from. At least some of the urlencode calls could be changed to just passing the dict to requests; see the sketch after the table. @cclauss, what do you think?
Filename | Count | urllib functions used
---|---|---
./plugins/upstream/utils.py | 8 | urlencode, quote, parse_qs, urlparse, urlunparse, quote_plus
./plugins/upstream/borrow.py | 8 | urlencode, quote
./coverstore/utils.py | 8 | urlencode, urlsplit, unquote, urlunsplit, unquote_plus, parse_qsl
./utils/solr.py | 6 | urlencode, urlsplit
./plugins/openlibrary/code.py | 8 | urlencode, parse_qs, urlparse, urlunparse
./plugins/openlibrary/tests/test_listapi.py | 8 | urlencode, Request, build_opener, HTTPCookieProcessor
./plugins/openlibrary/tests/test_ratingsapi.py | 7 | urlencode, Request, build_opener, HTTPCookieProcessor
./plugins/upstream/adapter.py | 4 | urlencode, Request, urlopen
./core/fulltext.py | 2 | urlencode
./core/ia.py | 2 | urlencode
./core/models.py | 2 | urlencode
./core/sponsorships.py | 2 | urlencode
./coverstore/tests/test_webapp.py | 2 | urlencode
./plugins/admin/graphs.py | 2 | urlencode
./plugins/importapi/code.py | 2 | urlencode
./plugins/worksearch/code.py | 3 | urlencode
./plugins/upstream/addbook.py | 4 | quote, urlparse
./core/processors/readableurls.py | 3 | quote, quote_plus
./catalog/utils/query.py | 3 | quote
./plugins/books/code.py | 3 | urlsplit, unquote
./core/helpers.py | 2 | urlsplit
./plugins/openlibrary/lists.py | 1 | parse_qs

By number of files using each function: urlencode 16, quote 5, urlsplit 4, parse_qs / urlparse / Request 3 each, urlunparse / quote_plus / unquote / build_opener / HTTPCookieProcessor 2 each, urlopen / urlunsplit / unquote_plus / parse_qsl 1 each.
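Note that not everything in this table needs to move to `requests`: the `urllib.parse` helpers (quote, urlparse, parse_qs, urlsplit, etc.) are pure URL manipulation with no requests equivalent, and they are fine under Python 3. The pattern that does simplify is urlencode-for-a-request. A sketch, with an illustrative endpoint and parameters:

```python
import requests

solr_select = "http://solr:8983/solr/select"  # hypothetical endpoint
params = {"q": "title:example", "wt": "json", "rows": 10}

# Before:
#   url = solr_select + "?" + urllib.parse.urlencode(params)
#   json_data = urllib.request.urlopen(url).read()

# After: requests urlencodes the dict for us via params=.
response = requests.get(solr_select, params=params, timeout=10)
results = response.json()
```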