Improve printing when piping
Gallaecio opened this issue · 1 comment
Gallaecio commented
When you pipe the output of curl, its stderr output changes, and the stdout output is guaranteed to come after any stderr output.
Currently, we make no such guarantee, and our stderr output can (always?) appear in the middle of the stdout output.
For example:
$ echo '{"url": "https://toscrape.com", "httpResponseBody": true}' > urls.jl
$ python -m zyte_api urls.jl --intype jl | jq -r .httpResponseBody | base64 --decode
INFO:zyte_api:Loaded 1 urls from urls.jl; shuffled: False
INFO:zyte_api:Running Zyte Data API (connections: 20)
/home/adrian/.local/share/venv/docs.zyte.com/lib/python3.10/site-packages/zyte_api/__main__.py:126: DeprecationWarning: There is no current event loop
loop = asyncio.get_event_loop()
0%| | 0/1 [00:00<?, ?url/s, conn:0.00s, resp:0.00s, throttle:0.0%, err:0+0(0.0%) | success:0/0(0.0%)]<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Scraping Sandbox</title>
<link href="./css/bootstrap.min.css" rel="stylesheet">
<link href="./css/main.css" rel="stylesheet">
</head>
<body>
<div class="container">
<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10 well">
<img class="logo" src="img/zyte.png" width="200px">
<h1 class="text-right">Web Scraping Sandbox</h1>
</div>
</div>
<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10">
<h2>Books</h2>
<p>A <a href="http://books.toscrape.com">fictional bookstore</a> that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at: <a href="http://books.toscrape.com">books.toscrape.com</a></p>
<div class="col-md-6">
<a href="http://books.toscrape.com"><img src="./img/books.png" class="img-thumbnail"></a>
</div>
<div class="col-md-6">
<table class="table table-hover">
<tr><th colspan="2">Details</th></tr>
<tr><td>Amount of items </td><td>1000</td></tr>
<tr><td>Pagination </td><td>✔</td></tr>
<tr><td>Items per page </td><td>max 20</td></tr>
<tr><td>Requires JavaScript </td><td>✘</td></tr>
</table>
</div>
</div>
</div>
<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10">
<h2>Quotes</h2>
<p><a href="http://quotes.toscrape.com/">A website</a> that lists quotes from famous people. It has many endpoints showing the quotes in many different ways, each of them including new scraping challenges for you, as described below.</p>
<div class="col-md-6">
<a href="http://quotes.toscrape.com"><img src="./img/quotes.png" class="img-thumbnail"></a>
</div>
<div class="col-md-6">
<table class="table table-hover">
<tr><th colspan="2">Endpoints</th></tr>
<tr><td><a href="http://quotes.toscrape.com/">Default</a></td><td>Microdata and pagination</td></tr>
<tr><td><a href="http://quotes.toscrape.com/scroll">Scroll</a> </td><td>infinite scrolling pagination</td></tr>
<tr><td><a href="http://quotes.toscrape.com/js">JavaScript</a> </td><td>JavaScript generated content</td></tr>
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.27s/url, conn:5.27s, resp:5.27s, throttle:0.0%, err:0+0(0.0%) | success:1/1(100.0%)]
INFO:zyte_api:
Summary
-------
Mean connection time: 5.27
Mean response time: 5.27
Throttle ratio: 0.0%
Attempts: 1
Errors: 0.0%, fatal: 0, non fatal: 0
Successful URLs: 1 of 1
Success ratio: 100.0%
INFO:zyte_api:
API error types:
[]
INFO:zyte_api:
Status codes:
[(200, 1)]
INFO:zyte_api:
Exception types:
[]
<tr><td><a href="http://quotes.toscrape.com/js-delayed">Delayed</a> </td><td>Same as JavaScript but with a delay (?delay=10000)</td></tr>
<tr><td><a href="http://quotes.toscrape.com/tableful">Tableful</a> </td><td>a table based messed-up layout</td></tr>
<tr><td><a href="http://quotes.toscrape.com/login">Login</a> </td><td>login with CSRF token (any user/passwd works)</td></tr>
<tr><td><a href="http://quotes.toscrape.com/search.aspx">ViewState</a> </td><td>an AJAX based filter form with ViewStates</td></tr>
<tr><td><a href="http://quotes.toscrape.com/random">Random</a> </td><td>a single random quote</td></tr>
</table>
</div>
</div>
</div>
</div>
</body>
</html>
Gallaecio commented
After some experiments and thinking, I believe the issue is not in python-zyte-api, but in running jq before the command finishes. If python-zyte-api is executed on its own, this does not happen.
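For illustration only, here is a minimal Python sketch of one way a command-line tool could guarantee that everything it writes to stdout comes after everything it writes to stderr, even when a downstream command such as jq prints data as soon as it receives it. This is not how python-zyte-api works; run_deferred and its parameters are hypothetical names invented for this example.

```python
import io
import sys


def run_deferred(results, out=None, err=None):
    """Write progress to err while buffering results, then emit the results.

    Hypothetical helper: nothing reaches `out` until every diagnostic
    message has been written to `err` and flushed.
    """
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err

    buffered = io.StringIO()  # holds result lines until the run is over
    count = 0
    for line in results:
        count += 1
        err.write(f"processed {count}\n")  # progress goes to stderr right away
        buffered.write(line + "\n")        # results are held back for now

    err.write(f"done: {count} result(s)\n")
    err.flush()

    # Only now does any data reach stdout, so a downstream consumer
    # cannot print anything before the diagnostics are complete.
    out.write(buffered.getvalue())
    out.flush()


if __name__ == "__main__":
    run_deferred(['{"example": 1}', '{"example": 2}'])
```

The trade-off is that all results are held in memory and the downstream command receives nothing until the run is over, which may not be acceptable for large inputs.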