simonw/shot-scraper

Ability to run `shot-scraper javascript` against several URLs at once


I found myself wanting to use the Readability trick against multiple URLs, without having to pay the startup cost of launching a new Chromium instance for each one.

Idea: a way to run shot-scraper javascript against more than one URL, returning an array of results.

Challenge: the current UI for that command is:

shot-scraper javascript $URL $JAVASCRIPT

How would passing multiple URLs work? It would be easier if the JavaScript came first, since you could then append multiple URLs as positional arguments, but that doesn't fit the current design.

Some options:

  • A new command, javascript-multi - similar to how shot-scraper multi works in taking multiple screenshots at once
  • Add a repeatable -m/--multi option to the javascript command, so that it runs against those URLs as well as the first one
    • Could have a special case here where shot-scraper javascript $JAVASCRIPT -m $URL1 -m $URL2 works - because it treats that first argument as the JavaScript in the case where there is only one positional argument and at least one -m option
  • shot-scraper javascript $JAVASCRIPT --urls $FILENAME, which takes URLs from a file (or - for standard input) rather than expecting them to be passed as -m options
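
As a rough illustration of the third option, here is a minimal sketch of how a list of URLs might be read from a file or from standard input. The names read_urls and open_urls_source are hypothetical and not part of shot-scraper, and skipping blank lines and # comment lines is an assumption rather than anything the proposal specifies.

```python
import io
import sys


def read_urls(source):
    """Return the URLs found in a file-like object, one per line.

    Hypothetical helper: blank lines and lines starting with '#'
    are skipped (an assumption, not part of the original proposal).
    """
    urls = []
    for line in source:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls


def open_urls_source(filename):
    """Open FILENAME for reading, treating '-' as standard input."""
    if filename == "-":
        return sys.stdin
    return open(filename, encoding="utf-8")
```

The resulting list could then be fed into the same per-URL loop that the -m option would use, so both options could share one code path.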

I built a prototype of that second option:

diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 3f1245e..86fc7b4 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -653,6 +653,13 @@ def accessibility(
     is_flag=True,
     help="Output JSON strings as raw text",
 )
+@click.option(
+    "multis",
+    "-m",
+    "--multi",
+    help="Run same JavaScript against multiple pages",
+    multiple=True,
+)
 @browser_option
 @browser_args_option
 @user_agent_option
@@ -668,6 +675,7 @@ def javascript(
     auth,
     output,
     raw,
+    multis,
     browser,
     browser_args,
     user_agent,
@@ -704,9 +712,26 @@ def javascript(
 
     If a JavaScript error occurs an exit code of 1 will be returned.
     """
+    # Special case for --multi - if multis are provided but JavaScript
+    # positional option was not set, assume the first argument is JS
+    if multis and not javascript:
+        javascript = url
+        url = None
+
+    # If they didn't provide JavaScript, assume it's being piped in
     if not javascript:
         javascript = input.read()
-    url = url_or_file_path(url, _check_and_absolutize)
+
+    to_process = []
+    if url:
+        to_process.append(url_or_file_path(url, _check_and_absolutize))
+    to_process.extend(url_or_file_path(multi, _check_and_absolutize) for multi in multis)
+
+    results = []
+
+    if len(to_process) > 1 and not raw:
+        output.write("[\n")
+
     with sync_playwright() as p:
         context, browser_obj = _browser_context(
             p,
@@ -719,18 +744,28 @@ def javascript(
             auth_username=auth_username,
             auth_password=auth_password,
         )
-        page = context.new_page()
-        if log_console:
-            page.on("console", console_log)
-        response = page.goto(url)
-        skip_or_fail(response, skip, fail)
-        result = _evaluate_js(page, javascript)
+        for i, url in enumerate(to_process):
+            is_last = i == len(to_process) - 1
+            page = context.new_page()
+            if log_console:
+                page.on("console", console_log)
+            response = page.goto(url)
+            skip_or_fail(response, skip, fail)
+            result = _evaluate_js(page, javascript)
+            if raw:
+                output.write(str(result) + "\n")
+            else:
+                output.write(
+                    json.dumps(result, indent=4, default=str) + ("\n" if is_last else ",\n")
+                )
+
         browser_obj.close()
-    if raw:
-        output.write(str(result))
-        return
-    output.write(json.dumps(result, indent=4, default=str))
-    output.write("\n")
+
+    if len(to_process) > 1 and not raw:
+        output.write("]\n")
+
+    if len(results) == 1:
+        results = results[0]
 
 
 @cli.command()

Then used like this:

shot-scraper javascript "
async () => {
  const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
  return (new readability.Readability(document)).parse();
}" \
-m https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/ \
-m https://simonwillison.net/2024/Mar/26/llm-cmd/ \
-m https://simonwillison.net/2024/Mar/23/building-c-extensions-for-sqlite-with-chatgpt-code-interpreter/ \
-m https://simonwillison.net/2024/Mar/22/claude-and-chatgpt-case-study/ \
-m https://simonwillison.net/2024/Mar/16/weeknotes-the-aftermath-of-nicar/ | tee /tmp/all.json

It worked, but I'm not sure if the design is right - in particular it feels inconsistent with how shot-scraper multi works.
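
For reference, the output handling in the prototype can be sketched in isolation. This is a simplified stand-in, assuming the number of pages is known up front (as it is in the diff); write_results is a hypothetical name, not a function in shot-scraper.

```python
import io
import json


def write_results(results, total, output, raw=False):
    """Write results one at a time, as each page is evaluated.

    With more than one page and raw=False, the results are wrapped in
    a JSON array; a single result is written on its own, as before.
    """
    wrap = total > 1 and not raw
    if wrap:
        output.write("[\n")
    for i, result in enumerate(results):
        is_last = i == total - 1
        if raw:
            # Raw mode: one plain-text result per line, no JSON wrapping
            output.write(str(result) + "\n")
        else:
            separator = "\n" if is_last else ",\n"
            output.write(json.dumps(result, indent=4, default=str) + separator)
    if wrap:
        output.write("]\n")
```

One consequence of this design is that the shape of the output depends on how many URLs were passed: a single URL produces a bare value, while several produce an array.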

Here are some ideas I have come across in other scraping tools:

url: https://example.com
urls: [https://example.com/page/{},1,243] # range through pages 1 to 243
urls: [...range(https://example.com/page/{},1,243)] # with an explicit range, which needs some function
urls: ['https://example.com/', 'https://google.com', 'https://bing.com']

import urls from "./example_page_links.txt"
urls: urls.split("\n"),
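
To make the range idea concrete, here is a minimal Python sketch of how a template such as https://example.com/page/{} might be expanded. expand_url_range is a hypothetical helper; it assumes the range is inclusive at both ends and that the template contains no braces other than the single {} placeholder.

```python
def expand_url_range(template, start, stop):
    """Fill the '{}' placeholder with each page number from start to stop.

    Hypothetical helper: both ends of the range are inclusive, matching
    the "pages 1 to 243" reading of the example above.
    """
    return [template.format(page) for page in range(start, stop + 1)]


urls = expand_url_range("https://example.com/page/{}", 1, 243)
```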

Side note: having gone through the research notes in the issues, it might be worth letting shot-scraper use a config file. That way, every argument you can pass on the command line could be kept neatly in one file.