PeskyPotato/archive-chan

Keep checking threads until they're either archived or 404

cardoso-neto opened this issue · 10 comments

From what I could see, archive-chan currently only downloads snapshots of the threads instead of "watching" them for new posts until completion.
I'm thinking we could add a --watch-threads flag or something like that.
I would gladly implement this. Your archiver is the most complete I've found so far.
I would just like to discuss this with you as I'm not sure how to do this yet.

The flag seems good. I was thinking about setting up a timer in feeder() to call archive() on the thread URL repeatedly. We can then check whether the image has already been written to the thread directory and skip it if it has, to save bandwidth. As for the text, I was thinking of just overwriting the HTML file on each iteration.

Not sure if there's a better way of doing it, if you have any idea let me know.
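Roughly something like this is what I have in mind (just a sketch, not actual archive-chan code: archive() stands in for the existing snapshot logic, and the thread endpoint and archived flag are taken from 4chan's public API docs):

import time

import requests


def watch_thread(thread_url, board, thread_id, interval=60):
    """Poll a thread until it 404s or 4chan marks it archived."""
    api_url = f"https://a.4cdn.org/{board}/thread/{thread_id}.json"
    while True:
        response = requests.get(api_url, timeout=10)
        if response.status_code == 404:
            break  # thread fell off the board entirely
        archive(thread_url)  # hypothetical: the existing snapshot logic
        op = response.json()["posts"][0]
        if op.get("archived"):
            break  # 4chan moved the thread to its archive
        time.sleep(interval)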

Sounds pretty good.
Might I recommend checking if the media file is already downloaded with:

from pathlib import Path

media_file_path = Path(f"{reply['filename']}{reply['ext']}")
if params.preserve and not media_file_path.is_file():
    # download

As for the text, I was thinking of just overwriting the HTML file on each iteration.

What if a mod deleted one of the posts? Would we now lose that post? That would be undesirable.
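One way around that (just a sketch, assuming posts carry 4chan's numeric no field as their id) would be to merge each new snapshot into the previous one instead of overwriting:

def merge_posts(old_posts, new_posts):
    # Keep the union of both snapshots, keyed by post number, so a
    # post deleted between snapshots still survives in the archive.
    merged = {post["no"]: post for post in old_posts}
    for post in new_posts:
        merged[post["no"]] = post  # prefer the newer copy of a post
    return [merged[no] for no in sorted(merged)]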

I thought a bit more about my suggested way of checking whether a file has already been downloaded.
Since it only checks that the file path exists, we could end up keeping corrupt or incomplete files.
I'm thinking we need a way to check that the whole file has been downloaded.
Comparing file sizes is the first thing that comes to mind.

So here is what I did to the Extractor.download method:

# module-level imports the method relies on
import sys
from pathlib import Path

import requests


def download(self, path, name, params, retries=0):
    """
    Download the file at `path` to `name`.

    If it fails, retry until the total number of retries is reached.
    """
    file_path = Path(params.path_to_download) / name
    try:
        if file_path.is_file():
            # Compare local and remote sizes to catch corrupt or
            # partially downloaded files; content-length is a string,
            # so cast it before comparing against st_size.
            response = requests.head(path, timeout=10)
            size_on_the_server = int(response.headers.get("content-length", 0))
            if file_path.stat().st_size == size_on_the_server:
                return
        if params.verbose:
            print("Downloading image:", path, name)
        response = requests.get(path, timeout=240)
        with open(file_path, "wb") as output:
            output.write(response.content)
    except Exception as e:
        if params.total_retries > retries:
            print(e, file=sys.stderr)
            retries += 1
            print(f"Retry #{retries}")
            self.download(path, name, params, retries)

I'm still not sure this is completely safe, though. For one, the server might not send a Content-Length header at all (e.g. with chunked transfer encoding), in which case the size comparison always fails and we'd redownload every time.

So, huge bug found: because we were saving media files under their original file names, any time a thread had more than one media file with the same name, only the most recent one would survive.
This is easily solvable.

We could go for one of the following (or maybe make it configurable?):

  • a) use the 4chan ID as a filename (1605033830110.jpg)
  • b) prepend the 4chan ID to the original filename (1605033830110:Praise boobas.jpg)
  • c) content-addressing, i.e., use a hash of the image as filename (SHA256E--9da32ce23c2fce3f6b456359327e20ff94e41e89f40334ffd2132d0defc370ce.jpg)

What do you think?
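For illustration, option b could be as small as this (a sketch; I'm assuming the per-image 4chan id is exposed as reply['tim'], with filename and ext being the fields already used above):

def media_filename(reply):
    # Option b: prepend the 4chan image id so that duplicate original
    # filenames within a thread can no longer collide.
    # Using "-" rather than ":" since ":" is illegal in Windows paths.
    return f"{reply['tim']}-{reply['filename']}{reply['ext']}"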

I think the hash would probably be the best option since we get it from the 4chan API and it's included in the Reply model. That way we can keep the original file name and just compare the hash returned by the API with the file that exists in the folder. What do you think @cardoso-neto?

I think the hash would probably be the best option

Here it looks like you want option c.

we can keep the original file name

But here it looks like you want option b.

just compare the hash returned by the API with the file that exists in the folder

And here it looks like you want option c again.
😕

If there's any chance we can reproduce those md5 checksums, then option c is definitely the best, because it'd solve the redownload problem perfectly. This is what they look like: MJoiDDK2ehvXP3fvM1wdAw==. That kinda looks like base64 encoding to me.
If we can't, then going with option b seems like a good enough choice, since it'd be the best of both worlds (uniqueness as well as readability).
I'll experiment a bit with it and get back to you.
I already have a fork where I'm working on stuff, btw: cardoso-neto/archive-chan

So, I managed to reproduce 4chan's base64-encoded binary md5 hash with openssl md5 -binary $filename | openssl base64.
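For reference, the equivalent in Python:

import base64
import hashlib

def b64_md5(file_path):
    # Same output as: openssl md5 -binary $filename | openssl base64
    with open(file_path, "rb") as f:
        digest = hashlib.md5(f.read()).digest()
    return base64.b64encode(digest).decode("ascii")

That output can be compared directly against the md5 the API reports for each reply.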
Next step is choosing how to deal with the filenames.
I'm thinking we could just use 4chan's standard post id and save the original .json file from 4chan's API, so we'd also have the original filenames and hashes.
Maybe something like this for the folder structure:

board/
└── thread_id/
    ├── media/
    │   └── post-id.png
    ├── index.html
    └── thread.json
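Deriving those paths with pathlib would look roughly like this (a sketch; board and thread_id are placeholders for values the archiver already has, and media files would go under the returned media dir as f"{post_id}{ext}"):

from pathlib import Path

def thread_paths(board, thread_id):
    # Sketch of the proposed layout; returns the HTML, JSON, and media paths.
    thread_dir = Path(board) / str(thread_id)
    media_dir = thread_dir / "media"
    media_dir.mkdir(parents=True, exist_ok=True)
    return thread_dir / "index.html", thread_dir / "thread.json", media_dir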

Somewhat related to this issue, I created two new command-line switches: --archived and --archived_only.
They're to be used when supplying a board name (like /mlp/), so you can download the threads in /mlp/archive/ as well.
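In argparse terms the two switches amount to roughly this (a sketch; the real wiring is in the commit below):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--archived", action="store_true",
                    help="also download the threads in the board's archive")
parser.add_argument("--archived_only", action="store_true",
                    help="download only archived threads, skipping live ones")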
I couldn't branch off of master because it didn't have my retrying requests session (which handles timed-out requests), so I branched off of my own branch.
This is the commit cardoso-neto@ad94c40 if you feel like doing a code review.

Thank you @cardoso-neto, I will take a look at the commit this week. I appreciate the time you've put into this.

🙂