zytedata/python-zyte-api

Improve printing when piping

Gallaecio opened this issue · 1 comment

When you pipe the output of curl, its stderr output changes, and its stdout output is guaranteed to come after any stderr output.

Currently, we make no such guarantee, and our stderr output can (always?) appear in the middle of the stdout output.

For example:

$ echo '{"url": "https://toscrape.com", "httpResponseBody": true}' > urls.jl
$ python -m zyte_api urls.jl  --intype jl | jq -r .httpResponseBody | base64 --decode
INFO:zyte_api:Loaded 1 urls from urls.jl; shuffled: False
INFO:zyte_api:Running Zyte Data API (connections: 20)
/home/adrian/.local/share/venv/docs.zyte.com/lib/python3.10/site-packages/zyte_api/__main__.py:126: DeprecationWarning: There is no current event loop
  loop = asyncio.get_event_loop()
  0%|                                                                                                              | 0/1 [00:00<?, ?url/s, conn:0.00s, resp:0.00s, throttle:0.0%, err:0+0(0.0%) | success:0/0(0.0%)]<!DOCTYPE html>
<html lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Scraping Sandbox</title>
        <link href="./css/bootstrap.min.css" rel="stylesheet">
        <link href="./css/main.css" rel="stylesheet">
    </head>
    <body>
        <div class="container">
            <div class="row">
                <div class="col-md-1"></div>
                <div class="col-md-10 well">
                    <img class="logo" src="img/zyte.png" width="200px">
                    <h1 class="text-right">Web Scraping Sandbox</h1>
                </div>
            </div>

            <div class="row">
                <div class="col-md-1"></div>
                <div class="col-md-10">
                    <h2>Books</h2>
                    <p>A <a href="http://books.toscrape.com">fictional bookstore</a> that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at: <a href="http://books.toscrape.com">books.toscrape.com</a></p>
                    <div class="col-md-6">
                        <a href="http://books.toscrape.com"><img src="./img/books.png" class="img-thumbnail"></a>
                    </div>
                    <div class="col-md-6">
                        <table class="table table-hover">
                            <tr><th colspan="2">Details</th></tr>
                            <tr><td>Amount of items </td><td>1000</td></tr>
                            <tr><td>Pagination </td><td>&#10004;</td></tr>
                            <tr><td>Items per page </td><td>max 20</td></tr>
                            <tr><td>Requires JavaScript </td><td>&#10008;</td></tr>
                        </table>
                    </div>
                </div>
            </div>

            <div class="row">
                <div class="col-md-1"></div>
                <div class="col-md-10">
                    <h2>Quotes</h2>
                    <p><a href="http://quotes.toscrape.com/">A website</a> that lists quotes from famous people. It has many endpoints showing the quotes in many different ways, each of them including new scraping challenges for you, as described below.</p>
                    <div class="col-md-6">
                        <a href="http://quotes.toscrape.com"><img src="./img/quotes.png" class="img-thumbnail"></a>
                    </div>
                    <div class="col-md-6">
                        <table class="table table-hover">
                            <tr><th colspan="2">Endpoints</th></tr>
                            <tr><td><a href="http://quotes.toscrape.com/">Default</a></td><td>Microdata and pagination</td></tr>
                            <tr><td><a href="http://quotes.toscrape.com/scroll">Scroll</a> </td><td>infinite scrolling pagination</td></tr>
                            <tr><td><a href="http://quotes.toscrape.com/js">JavaScript</a> </td><td>JavaScript generated content</td></tr>
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.27s/url, conn:5.27s, resp:5.27s, throttle:0.0%, err:0+0(0.0%) | success:1/1(100.0%)]
INFO:zyte_api:
Summary
-------
Mean connection time:     5.27
Mean response time:       5.27
Throttle ratio:           0.0%
Attempts:                 1
Errors:                   0.0%, fatal: 0, non fatal: 0
Successful URLs:          1 of 1
Success ratio:            100.0%

INFO:zyte_api:
API error types:
[]
INFO:zyte_api:
Status codes:
[(200, 1)]
INFO:zyte_api:
Exception types:
[]
 <tr><td><a href="http://quotes.toscrape.com/js-delayed">Delayed</a> </td><td>Same as JavaScript but with a delay (?delay=10000)</td></tr>
                            <tr><td><a href="http://quotes.toscrape.com/tableful">Tableful</a> </td><td>a table based messed-up layout</td></tr>
                            <tr><td><a href="http://quotes.toscrape.com/login">Login</a> </td><td>login with CSRF token (any user/passwd works)</td></tr>
                            <tr><td><a href="http://quotes.toscrape.com/search.aspx">ViewState</a> </td><td>an AJAX based filter form with ViewStates</td></tr>
                            <tr><td><a href="http://quotes.toscrape.com/random">Random</a> </td><td>a single random quote</td></tr>
                        </table>
                    </div>
                </div>
            </div>
        </div>
    </body>
</html>
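
One possible direction for the curl-like behavior described above is to check whether stdout is attached to a terminal and to reduce stderr output when it is not. The following is only a minimal sketch under that assumption: `stdout_is_piped` and `report` are hypothetical helpers, not part of python-zyte-api, and in-memory buffers stand in for the real streams.

```python
import io
import sys


def stdout_is_piped(stream=None):
    """Return True when the stream is not attached to a terminal."""
    stream = sys.stdout if stream is None else stream
    return not stream.isatty()


def report(message, *, quiet, err=None):
    """Write a status message to stderr unless quiet mode is on."""
    err = sys.stderr if err is None else err
    if not quiet:
        err.write(message + "\n")
        err.flush()


# Simulate a piped stdout with an in-memory buffer (its isatty() is False).
fake_stdout = io.StringIO()
fake_stderr = io.StringIO()

quiet = stdout_is_piped(fake_stdout)
report("Loaded 1 urls from urls.jl", quiet=quiet, err=fake_stderr)

# The result data still goes to stdout, with nothing written to stderr.
fake_stdout.write('{"httpResponseBody": "..."}\n')
```

Whether silencing status output by default is desirable is a separate design question; a flag to force it on or off would probably be needed as well.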

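After some experimenting and thinking, I believe the issue is not in python-zyte-api but in the pipeline itself: jq and base64 run concurrently with python-zyte-api, so the decoded body can reach the terminal before the command has finished writing its stderr output. If python-zyte-api is executed on its own, this does not happen.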
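
To illustrate why the data itself is not corrupted, here is a minimal sketch that captures the two streams separately. The child process is a hypothetical stand-in for `python -m zyte_api` (it is not the real CLI), and `run_separated` is an illustrative helper name. The point is that stdout and stderr are independent streams, so the mixing seen above happens only where both end up on the same terminal.

```python
import subprocess
import sys


def run_separated(argv):
    """Run argv and return (stdout_bytes, stderr_bytes), captured separately."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return proc.stdout, proc.stderr


# Hypothetical child: status on stderr before and after the payload on stdout.
child_code = (
    "import sys\n"
    "sys.stderr.write('progress\\n')\n"
    "sys.stdout.write('payload\\n')\n"
    "sys.stderr.write('summary\\n')\n"
)

out, err = run_separated([sys.executable, "-c", child_code])

# stdout holds only the payload; the status lines stayed on stderr.
print(out.split())
print(err.split())
```

In a shell pipeline, the same separation can be seen by redirecting stderr to a file (for example with `2> log.txt`), which leaves only the decoded body on the terminal.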