philippta/flyscrape

File downloads

Closed · 3 comments

File downloads should be supported as a built-in JavaScript function.

Proposed example:

import { download } from "flyscrape";

export default function ({ doc }) {
    const url = doc.find(".download-link").attr("href");

    download(url, "./downloads/file.bin") 
    // or
    download(url, "./downloads/") // File name is inferred from URL or Content-Disposition header.
}

TBD:

  • How to specify the number of parallel downloads?
  • Should download be part of the http object from "flyscrape/http" instead? (see the sketch below)
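For comparison, a minimal sketch of that second option, assuming download were exposed on the existing http module (neither the function nor its signature exists yet):

import http from "flyscrape/http";

export default function ({ doc }) {
    const url = doc.find(".download-link").attr("href");

    // Hypothetical: download as a method on the http module.
    http.download(url, "./downloads/");
}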

Ref:

Being able to have logic that puts different file types in different paths, or generates unique names, would be very nice.
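For illustration, a rough sketch of that kind of routing on top of the proposed download(url, path) function; the extension check and naming scheme below are made up for the example and not part of any existing flyscrape API:

import { download } from "flyscrape";

export default function ({ doc }) {
    const url = doc.find(".download-link").attr("href");

    // Hypothetical routing: pick a directory from the file extension
    // and prefix the name with a timestamp to keep it unique.
    const name = url.split("/").pop();
    const ext = name.split(".").pop().toLowerCase();
    const dir = ext === "pdf" ? "./downloads/pdf/" : "./downloads/other/";

    download(url, dir + Date.now() + "-" + name);
}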

As noted, parallel downloads are also important. This is the Python pattern I use (extra janky because it is running in Jupyter):

import multiprocess  # third-party fork of the stdlib multiprocessing module

def initializer():
    # Imports happen inside each worker process and are stashed in globals
    # so that fetch_url can use them without re-importing.
    import pathlib as plib
    import requests as reqs

    global pathlib
    pathlib = plib

    global requests
    requests = reqs
    
def fetch_url(target):
    url, path = target
    if not path.exists():
        resp = requests.get(url)
        if resp.status_code == 200:
            path.write_bytes(resp.content)
        else:
            return ('FAILED', path)
    else:
        return ('SKIPPED', path)
            
    return ('OK', path)

# targets = ...  # iterable of (url, pathlib.Path) pairs, as unpacked in fetch_url
with multiprocess.Pool(16, initializer) as p:
    out = p.map(fetch_url, targets)

It would be so convenient to do something like flyscrape --download url-list.txt --workers 16.

import { download } from "flyscrape";

export const config = {
    url: "https://news.ycombinator.com/",
    "download-workers": 8,
    "scrape-workers": 8,
}

export default function ({ doc }) {
    const url = doc.find(".download-link").attr("href");

    download(url, "./downloads/file.bin") 
    // or
    download(url, "./downloads/") // File name is inferred from URL or Content-Disposition header.
}

> It would be so convenient to do something like flyscrape --download url-list.txt --workers 16.

I'm not quite sure I understand what you are trying to accomplish.
Does the url-list.txt in your hypothetical example contain URLs to files?

If so, I'm sure this could be a job for wget -i url-list.txt. Otherwise, do you mind elaborating on this?

I don't believe wget has parallelization built in. I download (and scrape) using 8 or 16 threads in parallel. There are 409 1 GB files for the LAION dataset's image embeddings. It is a pain to download those serially.