EU-EDPS/website-evidence-collector

Increasing RAM usage and tool never finishes.

vincentcox opened this issue · 14 comments

Steps to reproduce:

Spawn a fresh Ubuntu 20.04 server (no GUI) VPS, install all the tools:

sudo apt update
sudo apt install nodejs -y 
sudo apt install npm -y
sudo apt install jq -y
sudo apt install chromium-browser -y
export PUPPETEER_EXECUTABLE_PATH="/usr/bin/chromium-browser" # Fix the "browser not installed" bug, "stolen" from the Dockerfile
npm install --global https://github.com/EU-EDPS/website-evidence-collector/tarball/master 
mkdir output_dir
website-evidence-collector --output output_dir/vincentcox.com --json --max 3 https://vincentcox.com --overwrite -- --no-sandbox # Fix the chrome sandbox issue, found somewhere in the issue tracker

It keeps running and it keeps eating resources:
(rip memory)
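
While debugging, one way to keep a hanging run from exhausting the VPS is to put a deadline on it. The following is a minimal sketch, assuming GNU coreutils `timeout` is installed (it ships with Ubuntu 20.04). `run_with_deadline` is a hypothetical helper name, not part of website-evidence-collector.

```shell
# Run a command, but give up after a fixed number of seconds.
# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
# (hypothetical helper, not part of website-evidence-collector)
run_with_deadline() {
  deadline_seconds="$1"
  shift
  # `timeout` sends TERM at the deadline and KILL 5 seconds later
  # if the process is still alive.
  timeout --kill-after=5 "$deadline_seconds" "$@"
  deadline_status=$?
  # 124 is the status `timeout` reports when TERM ended the command;
  # if KILL was needed, the status is 137 instead.
  if [ "$deadline_status" -eq 124 ] || [ "$deadline_status" -eq 137 ]; then
    echo "gave up after ${deadline_seconds}s: $*" >&2
  fi
  return "$deadline_status"
}
```

For example, `run_with_deadline 120 website-evidence-collector --output output_dir/vincentcox.com --json --max 1 https://vincentcox.com --overwrite -- --no-sandbox` would stop the run after two minutes. The 120-second limit is an arbitrary choice.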

Note that I am using the latest version from GitHub, so something in the GitHub version might have broken it. But as explained in issue #41, I cannot access the official download link of the stable version.

Do you have the same behavior when using chromium bundled with the puppeteer node package?

How can I use the puppeteer node package? (Sorry, I have little experience with Node.js.)

I installed the latest stable version (mentioned in your reply in my previous issue), it's the same issue.

I have the same in docker, which is using the puppeteer node package.

I removed the versions in the Dockerfile to get it working:

RUN apk add --no-cache \
      chromium \
      nss \
      freetype \
      freetype-dev \
      harfbuzz \
      ca-certificates \
      ttf-freefont \
      nodejs \
      yarn \

It works for you now? Could you prepare a pull request then to help others?

Could you find out which versions you are using instead? I think I decided to pin the version numbers to get a more reproducible setup, which is important for auditing.

Sorry, my previous answer was not clear. I am trying things out, but they all break when I test them on my website (and on a client's site, which I don't want to share; my own website is a good "test" example). What I meant is that I tried Docker (using a modified Dockerfile to get it building), but got the same bug.

In your example you have included --max 3, so you also scan some other random pages of the same website. Can you please check whether you still see the same behaviour with only one page? I would then try to reproduce your problem.

It's unfortunately the same (when using the installed version in my initial post but with --max 1). I gave up on docker because I get this error:

error An unexpected error occurred: "EACCES: permission denied, scandir '/opt/website-evidence-collector/output/browser-profile'".

@rriemann-eu if you need more info to debug let me know!

So when I execute the following two commands, I do not get any error.

website-evidence-collector --output output_dir/vincentcox.com --json --max 1 https://vincentcox.com

website-evidence-collector --output output_dir/vincentcox.com2 --json --max 1 https://vincentcox.com -- --no-sandbox

I am using the latest version from master on opensuse. From the inspection.yml:

script:
  host: mars.fritz.box
  version:
    npm: 0.4.0
    commit: v0.4.0-70-ga956e2d
  cmd_args: '--output output_dir/vincentcox.com --json --max 1 https://vincentcox.com'
  environment: {}
  node_version: v10.22.1
browser:
  name: Chromium
  version: HeadlessChrome/80.0.3987.0
  user_agent: >-
    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)
    Chrome/72.0.3617.0 Safari/537.36
  platform:
    name: Linux
    version: 5.8.14-1-default
  extra_headers: {}
  preset_cookies: {}
start_time: 2020-11-24T11:30:47.957Z
end_time: 2020-11-24T11:31:00.650Z

Does your problem occur with all websites?

Hmmm, it might be something with my installation then. I'll go with Docker to avoid further mistakes and debugging time on your side. However, the Dockerfile in the repo doesn't work anymore.

If I want to build this I get this error:

root@client-testvm:~/test/website-evidence-collector#  docker build -t website-evidence-collector .
Sending build context to Docker daemon  3.995MB
Step 1/16 : FROM alpine:edge
 ---> 003bcf045729
Step 2/16 : LABEL maintainer="Robert Riemann <robert.riemann@edps.europa.eu>"
 ---> Using cache
 ---> f5d20c7a4860
Step 3/16 : LABEL org.label-schema.description="Website Evidence Collector running in a tiny Alpine Docker container"       org.label-schema.name="website-evidence-collector"       org.label-schema.usage="https://github.com/EU-EDPS/website-evidence-collector/blob/master/README.md"       org.label-schema.vcs-url="https://github.com/EU-EDPS/website-evidence-collector"       org.label-schema.vendor="European Data Protection Supervisor (EDPS)"       org.label-schema.license="EUPL-1.2"
 ---> Using cache
 ---> 16ece18d66c6
Step 4/16 : RUN apk add --no-cache       chromium~=80.0.3987       nss       freetype       freetype-dev       harfbuzz       ca-certificates       ttf-freefont       nodejs       yarn~=1.22.4       bash procps drill coreutils libidn curl       parallel jq grep aha
 ---> Running in 5ca2fe0d3cde
fetch https://dl-cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/edge/community/x86_64/APKINDEX.tar.gz
ERROR: unsatisfiable constraints:
  chromium-86.0.4240.111-r0:
    breaks: world[chromium~80.0.3987]
  yarn-1.22.10-r0:
    breaks: world[yarn~1.22.4]
The command '/bin/sh -c apk add --no-cache       chromium~=80.0.3987       nss       freetype       freetype-dev       harfbuzz       ca-certificates       ttf-freefont       nodejs       yarn~=1.22.4       bash procps drill coreutils libidn curl       parallel jq grep aha' returned a non-zero code: 2

I think this error is explained by this answer: https://superuser.com/a/1486407/1039133

Unfortunately, Alpine-Linux Package Management drops older packages when there are newer versions available. This makes it hard to use Alpine Linux with docker since you want a reproducible image with exact versions.
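
To see which pins have gone stale, the `apk` error output can be reduced to pairs of requested constraint and currently offered package. This is a minimal sketch that assumes the error format is exactly what the build log above shows (a package line ending in `:` followed by a `breaks: world[...]` line). `apk_conflicts` is a hypothetical helper name.

```shell
# Read `apk add` error output on stdin and print, for each broken pin,
# the requested constraint and the package the repository now offers.
# (hypothetical helper; assumes the error layout from the log above)
apk_conflicts() {
  awk '
    # An indented line such as "  chromium-86.0.4240.111-r0:" names
    # the package version the repository currently provides.
    /^[ \t]+[^ \t]+:$/ { offered = $1; sub(/:$/, "", offered); next }
    # The line after it, "breaks: world[chromium~80.0.3987]", holds
    # the pin that can no longer be satisfied.
    /breaks: world\[/ {
      pin = $0
      sub(/.*world\[/, "", pin)
      sub(/\].*/, "", pin)
      print "requested " pin ", repository offers " offered
    }
  '
}
```

It could be fed from a saved build log, for instance `apk_conflicts < build.log`, where `build.log` is an assumed file name.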

OK, so I will close this one until we know how to reproduce your problem on other systems. I will open a new issue on the docker problem, which deserves a solution.

Good idea, feel free to tag me in this!

I can confirm this on docker:

It takes a lot of time and keeps using more and more RAM.

docker run --rm -it --cap-add=SYS_ADMIN -v $(pwd)/output:/output website-evidence-collector https://vincentcox.com --overwrite

top:

top - 14:06:18 up 24 days,  1:59,  2 users,  load average: 2.61, 1.76, 0.79
Tasks: 121 total,   1 running, 119 sleeping,   0 stopped,   1 zombie
%Cpu(s): 60.1 us, 30.6 sy,  0.0 ni,  7.7 id,  0.0 wa,  0.0 hi,  0.0 si,  1.7 st
MiB Mem :   1994.0 total,    109.3 free,   1602.3 used,    282.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    169.5 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                        
 4006 ubuntu    20   0  445648  43844  28176 S  95.3   2.1   4:29.16 chrome                                                                                         
 4052 ubuntu    20   0 5504516 892576  50024 S  78.7  43.7   4:05.24 chrome                                                                                         
 4047 ubuntu    20   0  358264  52200  26516 S   8.3   2.6   0:27.17 chrome                                                                                         
  316 root      20   0   14804   4364   1408 S   0.3   0.2 114:20.20 docker-gen                                                                                     
  411 root      20   0   10988   3396   2880 R   0.3   0.2   0:00.37 top                                                                                            
    1 root      20   0  169324  10212   5544 S   0.0   0.5   1:44.17 systemd     
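
To follow the growth over time rather than from a single snapshot, the `RES` column of the Chromium processes can be summed. This is a minimal sketch that assumes the `top` column layout shown above (`RES` in the sixth column in KiB, command name in the last column). `sum_rss_kib` is a hypothetical helper name.

```shell
# Read `top` process rows on stdin and print the total resident memory
# (KiB) of all rows whose command name equals the first argument.
# (hypothetical helper; assumes RES is column 6 and is a plain number)
sum_rss_kib() {
  awk -v name="$1" '
    $NF == name { total += $6 }   # column 6 is RES in the layout above
    END { print total + 0 }       # "+ 0" prints 0 when nothing matched
  '
}
```

A possible use is `top -b -n 1 | sum_rss_kib chrome`, repeated every few seconds. Note that `top` may abbreviate large values with a unit suffix, which this sketch does not handle.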

As I do not have this problem on my local computer without docker, I can imagine that it somehow depends on the Chromium version that is used. Maybe newer Chromium versions behave differently than the version HeadlessChrome/80.0.3987.0 I use on my local system.

Yeah, the thing is: if it happened only on my machine and not in Docker, it would be something on my side. But even in Docker I get the same issue.

With chromium 77.0.3865 (as used in this working dockerfile), it works for me.

Maybe this issue is not even in the scope of this project, but a Chromium issue. For me it's okay if you close it, but keep in mind that other people might face the same issue (in Docker or with a regular installation). Maybe my website is quite heavy to parse, but it's a standard WordPress website, so I think chances are high that other people will run into the same situation.