my8100/scrapydweb

Scrapyd running on a remote machine causes UI links to be broken.


Describe the bug
If the Scrapyd server(s) are running on a remote host (on the same VPN) and ScrapydWeb is running on a separate node, then the links to Logs and Items are broken by design.

Example:
The Scrapyd server is running at 189.09.09.90:6800 on an AWS EC2 instance, and ScrapydWeb is running at 89.09.09.80:80. In the config I will provide 189.09.09.90:6800 as the Scrapyd server location, which causes the links to be rendered as 189.09.09.90:6800/logs/proj/job/2019-09-30T11_01_23.log. That URL is inaccessible from the browser. However, 189.09.09.90:6800 can be exposed via a reverse proxy at abc.domain.com, and then abc.domain.com/logs/proj/job/2019-09-30T11_01_23.log becomes accessible.

A possible solution would be to allow an alias for each server, which would be used when generating the links.

What about adding abc.domain.com:80 as the Scrapyd server?

@my8100
Sorry for the late reply.

Adding abc.domain.com:80 would work, but then every request would go through the public web, even though the server could have been reached locally.

Let me explain my setup:

  1. We have a cluster of Scrapyd servers.
  2. We have a ScrapydWeb instance connected to all of the servers locally.
  3. Everything is running locally, and for security, only ScrapydWeb is exposed to the internet.

Ideally, ScrapydWeb should fetch and display all URLs from Scrapyd itself, regardless of the type of data.

So instead of directly opening scrapyd-server-001.local:6800/logs/proj/job/2019-09-30T11_01_23.log, it would open scrapydweb.domain.com/scrapyd-server-001.local/logs/proj/job/2019-09-30T11_01_23.log.

Basically, ScrapydWeb would internally forward the requests to the Scrapyd servers.

https://imgur.com/a/2i7Jf37
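To make the proposal concrete, here is a minimal sketch of the link rewriting it implies. This is not ScrapydWeb code; PUBLIC_BASE, SERVER_PREFIXES, and rewrite_link are hypothetical names used only for illustration.

# Hypothetical sketch of the proposed link rewriting; not part of ScrapydWeb.
PUBLIC_BASE = "https://scrapydweb.domain.com"

# Map each internal Scrapyd server to a path prefix on the public ScrapydWeb host.
SERVER_PREFIXES = {
    "scrapyd-server-001.local:6800": "/scrapyd-server-001.local",
}

def rewrite_link(scrapyd_url: str) -> str:
    """Rewrite an internal Scrapyd URL into a public ScrapydWeb URL."""
    for server, prefix in SERVER_PREFIXES.items():
        marker = "http://" + server + "/"
        if scrapyd_url.startswith(marker):
            # Keep the original path (/logs/..., /items/...) under the prefix.
            return PUBLIC_BASE + prefix + "/" + scrapyd_url[len(marker):]
    return scrapyd_url  # unknown server: leave the link unchanged

print(rewrite_link("http://scrapyd-server-001.local:6800/logs/proj/job/2019-09-30T11_01_23.log"))
# https://scrapydweb.domain.com/scrapyd-server-001.local/logs/proj/job/2019-09-30T11_01_23.log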

OK, it may be supported in a future release.

But for now, you can view the first 100 and last 100 lines of the log in the Stats page.

Hi, I would like to push for this feature as well. It would also help to redirect links in cloud-based setups.

@Rohithzr @sergiigladchuk
The requested feature is supported in PR #128
Please give it a try and offer your feedback, thanks.

  1. Stop ScrapydWeb.
  2. Execute pip install --upgrade git+https://github.com/my8100/scrapydweb.git to get the latest code.
  3. Add the content below to the existing file scrapydweb_settings_v10.py.
    # The default is None; only set it up when you need to visit Scrapyd servers via a reverse proxy.
    # Make sure that SCRAPYD_SERVERS_PUBLIC_URLS has the same length as SCRAPYD_SERVERS above.
    # e.g.
    # SCRAPYD_SERVERS_PUBLIC_URLS = [
    # 'https://a.b.com', # visit the first Scrapyd server via reverse proxy.
    # '', # visit the second Scrapyd server without reverse proxy.
    # ]
    # See https://github.com/my8100/scrapydweb/issues/94 for more info.
    SCRAPYD_SERVERS_PUBLIC_URLS = None
  4. Update the option SCRAPYD_SERVERS_PUBLIC_URLS accordingly (see the filled-in example after these steps).
  5. Restart ScrapydWeb.
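For example, with two Scrapyd servers (the addresses below are hypothetical), the two options might be paired like this in scrapydweb_settings_v10.py:

SCRAPYD_SERVERS = [
    '189.09.09.90:6800',  # reachable only inside the VPN
    '127.0.0.1:6800',     # reachable directly
]

# Must have the same length as SCRAPYD_SERVERS above.
SCRAPYD_SERVERS_PUBLIC_URLS = [
    'https://abc.domain.com',  # links for the first server go through the reverse proxy
    '',                        # links for the second server use its address as-is
]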

For anyone trying to make this work with nginx, a "subfolder" config (mydomain.com/scrapy) didn't work for me for some reason.

I had success with a subdomain config like this (scrapy.mydomain.com):

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;

    # Match the scrapy subdomain, e.g. scrapy.mydomain.com.
    server_name scrapy.*;

    include /config/nginx/ssl.conf;

    # Disable the request body size limit.
    client_max_body_size 0;

    location / {
        include /config/nginx/proxy.conf;
        include /config/nginx/resolver.conf;
        # Forward everything to the ScrapydWeb container on its default port 5000.
        set $upstream_app scrapydweb;
        set $upstream_port 5000;
        set $upstream_proto http;
        proxy_pass $upstream_proto://$upstream_app:$upstream_port;
    }
}
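The /config/nginx/*.conf includes above appear to come from the linuxserver.io nginx image. If you are not using that image, a minimal self-contained equivalent might look like the sketch below; the certificate paths and the upstream address 127.0.0.1:5000 (ScrapydWeb's default port) are assumptions to adjust for your own setup.

server {
    listen 443 ssl;
    server_name scrapy.mydomain.com;

    # Placeholder certificate paths; point these at your own certificates.
    ssl_certificate     /etc/ssl/certs/scrapy.mydomain.com.pem;
    ssl_certificate_key /etc/ssl/private/scrapy.mydomain.com.key;

    location / {
        # Forward everything to ScrapydWeb (assumed to listen on 127.0.0.1:5000).
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}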

It's not working for me. I do not have a custom domain, hence I am using the server's IPv4 address.
I have deployed Scrapyd and ScrapydWeb on the same server and am running the web UI with no reverse proxy, i.e. the port that ScrapydWeb runs on is exposed directly.
When accessing the URL (server_ipv4:port) from another computer, everything seems to work fine; however, when I go to the Items section, I get an "Oops! Something went wrong." error. Also, the "pip install logparser" hint is still showing even after configuring logparser to run with ScrapydWeb.

@ritikkumarsahu
Have you set up items_dir for Scrapyd first?
It is disabled by default.
You may need to check the Scrapyd web UI, as it is returning a 404 status code for the items page.

https://scrapyd.readthedocs.io/en/stable/config.html#items-dir
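For reference, enabling it is a one-line change in Scrapyd's config file; the directory name items below is just an example.

[scrapyd]
# items_dir is empty by default, which disables on-disk item storage
# and the /items/ page. Set it to a directory to enable both.
items_dir = items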

I have used MongoDB for the items pipeline. I am not sure how to set up MongoDB for ScrapydWeb.

You only need to update the config of Scrapyd if you want to visit the items page.
Can you visit these URLs directly?
http://your-scrapyd-server:port/
http://your-scrapyd-server:port/items/

Maybe the items page is available only when the items are stored as simple JSON files instead of in a database?

I can visit http://your-scrapyd-server:port/ but not the http://your-scrapyd-server:port/items/ page.
Can this be a feature request then?

ScrapydWeb can show the job stats efficiently when working with logparser.
For inspecting the items themselves, it makes more sense to use a modern database client with a GUI.
BTW, PRs are welcome.

https://github.com/my8100/logparser?tab=readme-ov-file#to-work-with-scrapydweb-for-visualization