internetarchive/openlibrary

Replace all urllib / urllib2 imports and uses with `requests`

Opened this issue · 15 comments

hornc commented

urllib2 is old and a bit clunky -- I've recently found it can't do some of the things I needed it to do, it causes problems for moving to Python 3, and it generally makes the code more complex.

Instead of reusing it and adding new features on top of the older mechanisms, can we consider refactoring to use requests instead?

My PR where I stripped out urllib2 and StringIO etc to enable HEAD requests: #2838
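
For context on the HEAD pain point, a minimal sketch (the URL is a placeholder, not code from that PR): urllib2 only issues GET or POST unless you override get_method, whereas requests supports HEAD directly.

# urllib2 needs a subclass just to send a HEAD request:
import urllib2

class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

response = urllib2.urlopen(HeadRequest("https://archive.org/metadata/example"))

# requests has it built in:
import requests

response = requests.head("https://archive.org/metadata/example")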

A similar PR that switched to requests to avoid old bugs: #2779

By using requests (rather than adding imports from six) we get a better tool for making requests while reducing code complexity and the number of imports.

Relates to #2308, since a common use of simplejson is parsing urllib2 responses; requests has a built-in json() method, so the extra import is not required.
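
A minimal before/after sketch of that simplejson point (the endpoint and query are just examples, not taken from the codebase):

# Before: urllib2 plus simplejson
import urllib2
import simplejson

data = simplejson.loads(urllib2.urlopen("https://openlibrary.org/search.json?q=tolkien").read())

# After: requests parses the JSON body itself
import requests

data = requests.get("https://openlibrary.org/search.json", params={"q": "tolkien"}).json()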


I'm strongly supportive of switching to requests since it has built-in (or easy to add) support for things like retries, timeouts, caching, etc.

Having said that, there's so much critical work pending that I'm not sure we can afford to be doing "nice to have" work on existing code that works. Refactoring modules as they get worked on seems fine though.
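
On the retries/timeouts point, a minimal sketch of what a shared helper could look like, using urllib3's Retry mounted on a requests Session (the retry settings and URL are illustrative assumptions, not an agreed standard):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient 5xx responses with exponential backoff.
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

response = session.get("https://archive.org/metadata/example", timeout=10)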

hornc commented

Refactoring modules as they get worked on seems fine though.

Agree.
Refactoring is why I wanted to publicise this as a common solution: when devs work on a module and find themselves banging their heads against urllib2 for any reason, their first thought should be to remove it and use requests instead. This just happened to me a couple of times in a row, so I wanted to share it as a common time-saving way forward; perhaps we can also standardise some of the retry mechanisms etc.

hornc commented

Possibly unused urllib + urllib2 imports in openlibrary/plugins/upstream/addbook.py?

PyCharm identifies them as unused (amongst other things).

#2894 shows some code smells which need to be avoided when doing any upgrade of this type.

If you still have an import urllib2, it's probably wrong.

% grep urlopen **/*.py

  • openlibrary/admin/numbers.py: sqlite_contents = urllib.request.urlopen(url).read()
  • openlibrary/catalog/amazon/add_covers.py: ret = simplejson.load(urlopen(url))
  • openlibrary/catalog/amazon/other_editions.py: return urllib.request.urlopen(url).read()
  • openlibrary/catalog/utils/edit.py: prev = unmarshal(json.load(urllib.request.urlopen(url)))
  • openlibrary/core/lending.py: content = urllib.request.urlopen(url=url, timeout=config_http_request_timeout).read()
  • openlibrary/core/lending.py: content = urllib.request.urlopen(request, timeout=config_http_request_timeout).read()
  • openlibrary/core/lending.py: content = urllib.request.urlopen(url=url, timeout=config_http_request_timeout).read()
  • openlibrary/core/lending.py: response = simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/core/lending.py: return simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/core/lending.py: jsontext = urllib.request.urlopen(config_ia_loan_api_url, payload,
  • openlibrary/core/models.py: d = simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/coverstore/code.py: text = urlopen("https://archive.org/metadata/" + item).read()
  • openlibrary/coverstore/code.py: jsontext = urlopen(url).read()
  • openlibrary/plugins/admin/code.py: response = urllib.request.urlopen(s).read()
  • openlibrary/plugins/admin/code.py: json = urllib.request.urlopen("http://www.archive.org/download/stats/numUniqueIPsOL.json").read()
  • openlibrary/plugins/admin/services.py: self.data = BeautifulSoup(urllib.request.urlopen(url).read(), "lxml")
  • openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
  • openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
  • openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
  • openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
  • openlibrary/plugins/openlibrary/code.py: return urllib.request.urlopen(url).read()
  • openlibrary/plugins/upstream/borrow.py: response = simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/plugins/upstream/borrow.py: response = simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/plugins/upstream/covers.py: response = urllib.request.urlopen(upload_url, urllib.parse.urlencode(params))
  • openlibrary/plugins/upstream/data.py: return urllib.request.urlopen(url).read()
  • openlibrary/plugins/upstream/models.py: data = urllib.request.urlopen(url).read()
  • openlibrary/plugins/worksearch/subjects.py: response = json.load(urllib.request.urlopen(solr_url))['response']
  • openlibrary/plugins/worksearch/subjects.py: response = json.load(urllib.request.urlopen(solr_url))['response']
  • openlibrary/solr/find_modified_works.py: ret.append(extract_works(json.load(urllib.request.urlopen(url))))
  • openlibrary/solr/find_modified_works.py: changes = list(json.load(urllib.request.urlopen(url)))
  • openlibrary/utils/solr.py: data = urllib.request.urlopen(url, timeout=10).read()
  • openlibrary/utils/solr.py: data = urllib.request.urlopen(request, timeout=10).read()
  • openlibrary/accounts/model.py: f = urllib.request.urlopen(req)
  • openlibrary/api.py: return urllib.request.urlopen(req)
  • openlibrary/catalog/add_book/__init__.py: res = urllib.request.urlopen(upload_url, urllib.parse.urlencode(params))
  • openlibrary/catalog/amazon/crawl.py: #page = urlopen('http://amazon.com/dp/' + asin).read()
  • openlibrary/catalog/amazon/load_merge.py: result = urllib.request.urlopen(ureq).read(100000)
  • openlibrary/catalog/get_ia.py:def urlopen_keep_trying(url):
  • openlibrary/catalog/get_ia.py: f = urllib.request.urlopen(url)
  • openlibrary/catalog/get_ia.py: return '<!--' in urlopen_keep_trying(IA_DOWNLOAD_URL + loc).read()
  • openlibrary/catalog/get_ia.py: data = urlopen_keep_trying(item_base + marc_xml_filename).read()
  • openlibrary/catalog/get_ia.py: data = urlopen_keep_trying(item_base + marc_bin_filename).read()
  • openlibrary/catalog/get_ia.py: tree = etree.parse(urlopen_keep_trying(url))
  • openlibrary/catalog/get_ia.py: tree = etree.parse(urlopen_keep_trying(url))
  • openlibrary/catalog/get_ia.py: f = urlopen_keep_trying(ureq)
  • openlibrary/catalog/get_ia.py: f = urlopen_keep_trying(url)
  • openlibrary/catalog/get_ia.py: f = urlopen_keep_trying(url)
  • openlibrary/catalog/importer/db_read.py: data = urlopen(url).read()
  • openlibrary/catalog/marc/download.py: f = urllib.request.urlopen("http://archive.org/download/bpl_marc/" + file)
  • openlibrary/catalog/marc/download.py: f = urllib.request.urlopen(ureq)
  • openlibrary/catalog/marc/marc_subject.py: f = urlopen_keep_trying(url)
  • openlibrary/catalog/marc/marc_subject.py: f = urlopen_keep_trying(url)
  • openlibrary/catalog/utils/query.py:def urlopen(url, data=None):
  • openlibrary/catalog/utils/query.py: return urllib.request.urlopen(req)
  • openlibrary/catalog/utils/query.py: return json.load(urlopen(url))
  • openlibrary/catalog/utils/query.py: return urlopen(url).read()
  • openlibrary/core/fulltext.py: json_data = urllib.request.urlopen(search_select, timeout=30).read()
  • openlibrary/core/loanstats.py: logger.info("urlopen %s", url)
  • openlibrary/core/loanstats.py: response = urllib.request.urlopen(url).read()
  • openlibrary/coverstore/utils.py: r = urlopen(req)
  • openlibrary/plugins/akismet/akismet.py: h = urllib.request.urlopen(req)
  • openlibrary/plugins/ol_infobase.py: response = urllib.request.urlopen(url, json).read()
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: def urlopen(self, path, data=None, method=None, headers={}):
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: self.urlopen("/account/login", data=urllib.parse.urlencode(data), method="POST")
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: response = self.urlopen(
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: data = self.urlopen("/people/" + self.username + "/lists.json").read()
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: data = self.urlopen(key + ".json").read()
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: data = self.urlopen(key + "/seeds.json").read()
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: response = self.urlopen(key + "/seeds.json", json)
  • openlibrary/plugins/openlibrary/tests/test_ratingsapi.py: def urlopen(self, path, data=None, method=None, headers={}):
  • openlibrary/plugins/openlibrary/tests/test_ratingsapi.py: self.urlopen("/account/login", data=urllib.parse.urlencode(data), method="POST")
  • openlibrary/plugins/openlibrary/tests/test_ratingsapi.py: r = self.urlopen(
  • openlibrary/plugins/search/solr_client.py: ru = urlopen(query_url)
  • openlibrary/plugins/search/solr_client.py: ru = urlopen(query_url)
  • openlibrary/plugins/upstream/adapter.py: response = urllib.request.urlopen(req)
  • openlibrary/plugins/upstream/utils.py: tree = etree.parse(urllib.request.urlopen(url))
  • openlibrary/plugins/worksearch/code.py: solr_result = urllib.request.urlopen(url, timeout=10)
  • openlibrary/solr/update_work.py:def urlopen(url, data=None):
  • openlibrary/solr/update_work.py: return urllib.request.urlopen(req)
  • openlibrary/solr/update_work.py: result = json.load(urlopen(url))
  • openlibrary/solr/update_work.py: logger.info("urlopen %s", url)
  • openlibrary/solr/update_work.py: reply = json.load(urlopen(url))
  • openlibrary/solr/update_work.py: reply = json.load(urlopen(url))
  • openlibrary/tests/catalog/test_get_ia.py: monkeypatch.setattr(get_ia, 'urlopen_keep_trying', return_test_marc_xml)
  • openlibrary/tests/catalog/test_get_ia.py: monkeypatch.setattr(get_ia, 'urlopen_keep_trying', return_test_marc_bin)
  • openlibrary/tests/catalog/test_get_ia.py: monkeypatch.setattr(get_ia, 'urlopen_keep_trying', return_test_marc_bin)
  • openlibrary/tests/solr/test_update_work.py: monkeypatch.setattr(update_work, 'urlopen', lambda url: StringIO(solr_response))
  • openlibrary/utils/httpserver.py: >>> response = urllib.request.urlopen('http://0.0.0.0:8090/hello/world')
  • openlibrary/utils/httpserver.py: >>> urllib.request.urlopen('http://0.0.0.0:8090/hello/world')
  • openlibrary/views/showmarc.py: data = urllib.request.urlopen(url).read()
  • openlibrary/views/showmarc.py: data = urllib.request.urlopen(url).read()
  • openlibrary/views/showmarc.py: result = urllib.request.urlopen(ureq).read(100000)
imskr commented

Thanks, @cclauss, for the guidance. I'm on it!

  • Refactor to use requests json #3709 Approved by @cclauss
  • Requests refactor only query.py #3666
  • refactor update_work and test_update_work to use requests #3696 Approved by @cclauss
  • Refactor .read() to use requests #3710
  • converted only simple GETs to requests #3694
  • use requests instead of stdlib #3658
  • Fix portability and dependency issue #3559

This is going to need SEVERAL PRs before it is finally closed.

@hornc Would you be willing to work with @aasifkhan7 to remove urllib from openlibrary/catalog/get_ia.py in favor of requests?
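
For reference, a sketch of what a requests-based urlopen_keep_trying could look like (the retry count and sleep interval are guesses; the real PR should preserve the existing function's exact semantics):

import time

import requests

def urlopen_keep_trying(url, tries=3):
    # Retry transient failures before giving up, as the urllib2 version does.
    for attempt in range(tries):
        try:
            response = requests.get(url)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt + 1 == tries:
                raise
            time.sleep(2)

Note that callers doing .read() on the old file-like return value would switch to response.content (bytes) or response.text.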

% grep urlopen **/*.py > files.txt

with open("files.txt") as in_file:
    # Keep only the filename before the first ":" of each grep match,
    # deduplicating via a set.
    files = {line.split(":")[0] for line in in_file}

print("\n".join(sorted(files)))

(A shell-only equivalent is sketched after the file list below.)
  • openlibrary/catalog/amazon/load_merge.py
  • openlibrary/catalog/importer/db_read.py #4539
    --
  • openlibrary/catalog/marc/download.py
  • openlibrary/views/showmarc.py #4527
  • scripts/oclc_to_marc.py #4515
  • scripts/z3950_view.py 4a6be85
    --
  • openlibrary/plugins/akismet/akismet.py See #4572
    --
  • openlibrary/plugins/upstream/adapter.py
  • openlibrary/plugins/upstream/utils.py
    --
  • openlibrary/plugins/search/solr_client.py #4593 #4729
  • openlibrary/plugins/worksearch/code.py #4604
  • scripts/new-solr-updater.py #4605

I don't think scripts/z3950_view.py works any more, because PyZ3950 is no longer in the requirements and I can't find PyZ3950 on pypi.org. I also cannot find any reference to z3950_view in the codebase. Given this, should we remove z3950_view.py?

Last released in 2004: https://pypi.org/project/PyZ3950 --> http://www.panix.com/~asl2/software/PyZ3950 --> https://github.com/asl2/PyZ3950 (see the PRs), which is far from Python 3 compatible.

I don't think openlibrary/catalog/marc/download.py is used any more: it tries to use web.run, which is no longer in the installed version of web.py, and it tries to call http://wiki-beta.us.archive.org:9090, which is no longer accessible.

Should openlibrary/catalog/marc/download.py be deleted?

I went ahead and did a little grepping to list all the places urllib is used (excluding infogami).

I'm not sure which of these uses we should still try to migrate away from. At least some of the urlencode calls could be changed to just passing the dict to requests (see the sketch after the table below).
@cclauss what do you think?

Filename | Count | urllib functions used
./plugins/upstream/utils.py | 8 | urlencode quote parse_qs urlparse urlunparse quote_plus
./plugins/upstream/borrow.py | 8 | urlencode quote
./coverstore/utils.py | 8 | urlencode urlsplit unquote urlunsplit unquote_plus parse_qsl
./utils/solr.py | 6 | urlencode urlsplit
./plugins/openlibrary/code.py | 8 | urlencode parse_qs urlparse urlunparse
./plugins/openlibrary/tests/test_listapi.py | 8 | urlencode Request build_opener HTTPCookieProcessor
./plugins/openlibrary/tests/test_ratingsapi.py | 7 | urlencode Request build_opener HTTPCookieProcessor
./plugins/upstream/adapter.py | 4 | urlencode Request urlopen
./core/fulltext.py | 2 | urlencode
./core/ia.py | 2 | urlencode
./core/models.py | 2 | urlencode
./core/sponsorships.py | 2 | urlencode
./coverstore/tests/test_webapp.py | 2 | urlencode
./plugins/admin/graphs.py | 2 | urlencode
./plugins/importapi/code.py | 2 | urlencode
./plugins/worksearch/code.py | 3 | urlencode
./plugins/upstream/addbook.py | 4 | quote urlparse
./core/processors/readableurls.py | 3 | quote quote_plus
./catalog/utils/query.py | 3 | quote
./plugins/books/code.py | 3 | urlsplit unquote
./core/helpers.py | 2 | urlsplit
./plugins/openlibrary/lists.py | 1 | parse_qs

By function, the number of files using each: urlencode 16, quote 5, urlsplit 4, parse_qs 3, urlparse 3, Request 3, quote_plus 2, urlunparse 2, unquote 2, build_opener 2, HTTPCookieProcessor 2, urlunsplit 1, unquote_plus 1, parse_qsl 1, urlopen 1.
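
As a minimal sketch of the urlencode point above (the Solr host and query parameters are hypothetical):

# Before: encode the query string by hand
import urllib.parse
import urllib.request

params = {"q": "tolkien", "wt": "json"}
url = "http://solr:8983/solr/select?" + urllib.parse.urlencode(params)
raw = urllib.request.urlopen(url, timeout=10).read()

# After: requests encodes the dict and parses the JSON reply
import requests

response = requests.get("http://solr:8983/solr/select", params=params, timeout=10)
reply = response.json()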