TeamHG-Memex/aquarium

How to disable HAProxy authentication

onurakman opened this issue · 3 comments

How can I disable HAProxy authentication (by manually editing haproxy.cfg)? HAProxy passes its authentication info on to the site, and the site returns a 401.

What I did was comment out the following in the haproxy.cfg file:

...
# Splash Cluster configuration
frontend http-in
    bind *:8050
#
#    http basic auth
#    acl auth_ok http_auth
#    http-request auth realm Splash if !auth_ok
#    http-request allow if auth_ok
#    http-request deny
#
    # don't apply the same limits for non-render endpoints
    acl staticfiles path_beg /_harviewer/
    acl misc path / /info /_debug /debug

    use_backend splash-cluster # if auth_ok !staticfiles !misc
    use_backend splash-misc # if auth_ok staticfiles
    #  use_backend splash-misc # if auth_ok misc
...

This is incredibly annoying... These credentials are meant to be private, so one could limit access to a Splash instance facing the internet. Instead, these credentials get forwarded to every website you crawl. Unless you additionally use a proxy, you're effectively letting everyone know where your Splash instance is and what its credentials are... Why is it even done this way? It clearly looks like a bug to me.

P.S. To test this, you can just crawl https://httpbin.org/headers and check the response body (it simply mirrors the headers). You'll see your Splash credentials in the Authorization header.
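
For reference, a minimal sketch of that test (spider name and credentials are hypothetical, and it assumes scrapy-splash is configured with SPLASH_URL as usual). Putting the Authorization header on the Scrapy request itself, which is what HttpAuthMiddleware does, is what causes it to be forwarded to the crawled site:

import scrapy
from scrapy_splash import SplashRequest
from w3lib.http import basic_auth_header

class LeakCheckSpider(scrapy.Spider):
    # Hypothetical spider name and credentials, for illustration only.
    name = "leak_check"

    def start_requests(self):
        yield SplashRequest(
            "https://httpbin.org/headers",
            callback=self.parse,
            # Credentials placed in the regular request headers (as
            # HttpAuthMiddleware does) get passed along by scrapy-splash
            # with the other request headers for the rendered page.
            headers={"Authorization": basic_auth_header("user", "password")},
        )

    def parse(self, response):
        # httpbin mirrors the headers it received; the Authorization header
        # with the Splash credentials shows up in the response body.
        self.logger.info(response.text)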

Is there a workaround to still use HTTP Basic Auth for Splash, but not pass these credentials on to the website you crawl?

UPDATE: Ok, I solved this by setting the Authorization header in request.meta['splash']['splash_headers'] instead of directly in the request headers, as HttpAuthMiddleware does. I believe the advice to use HttpAuthMiddleware is very dangerous and should be removed from the documentation / README. The correct way is clearly to set these credentials via splash_headers.
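
To illustrate, here is a minimal sketch of that approach using SplashRequest (spider name and credentials are hypothetical). The splash_headers argument ends up in request.meta['splash']['splash_headers'] and is only sent to the Splash/HAProxy endpoint, not to the site being rendered:

import scrapy
from scrapy_splash import SplashRequest
from w3lib.http import basic_auth_header

class HeadersCheckSpider(scrapy.Spider):
    # Hypothetical spider name and credentials, for illustration only.
    name = "headers_check"

    def start_requests(self):
        yield SplashRequest(
            "https://httpbin.org/headers",
            callback=self.parse,
            # Sent to the Splash instance (behind HAProxy) itself,
            # not forwarded to the crawled site.
            splash_headers={
                "Authorization": basic_auth_header("user", "password"),
            },
        )

    def parse(self, response):
        # The mirrored headers from httpbin should no longer contain the
        # Splash credentials.
        self.logger.info(response.text)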

@nirvana-msu setting those got me past the 401 code, THANK YOU. Although it looks like it still does not go through the scraper api proxy :(