Cameron-IPFSPodcasting/podcastnode-Python

New Feature - "Turbo" mode


Allow the node to configure a "turbo mode" that skips the standard 10-minute delay between tasks if the previous task was successful.

If there's no work to do, or there was an error downloading/pinning, the node will wait the standard 10 minutes before requesting again (even with turbo mode enabled).

i.e. as long as the node successfully processes a task, it will make another request immediately. If there is no work (or there's an error), it will wait 10 minutes before making another request.
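A minimal sketch of that logic, assuming a hypothetical `process_next_task()` wrapper around one request/download/pin/respond cycle (the name and loop here are illustrative, not the actual client code):

```python
import time

WAIT_SECONDS = 10 * 60  # standard delay between work requests


def process_next_task() -> bool:
    """Placeholder for one request/download/pin/respond cycle.

    Returns True only when the server handed out work and it completed successfully.
    """
    return False


def run(turbo_enabled: bool) -> None:
    while True:
        if process_next_task() and turbo_enabled:
            continue              # success in turbo mode: request the next task immediately
        time.sleep(WAIT_SECONDS)  # no work, an error, or turbo disabled: wait 10 minutes
```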

Cons: A node that favorites several feeds (& thousands of episodes) could experience bandwidth issues while downloading continuously.

Will there be server-side support for this?
The description is clear, but could the client receive a batch of N jobs from the server to process, instead of pinging the server 10 times? That would reduce the load on the server as well.

I need to add a response to tell the client whether the hash was verified (against other nodes) or whether it was a "failure". Currently the client doesn't know when a consensus failure happens. I'll try to work on sending a pass/fail state today, so the client knows to wait (on failure) or continue (on success).

I doubt I could batch jobs. Each request is based on the current "status" of the network/episode. Whether you download from the source, pin from other nodes, or have to delete a hash could change between requests. If I send you N tasks, something could be different by the time you get to the Nth task.

Just realized a turbo mode could be impossible in this Python/cron version. 🤦 The Python script only runs once; cron runs it every 10 minutes. Umbrel/Start9 run a main loop that sleeps for 10 minutes.

The point of using cron was to have it "restart" if something crashes. I don't think turbo mode will work with cron.

Any ideas?

I don't see why it needs to be managed externally.

My plan was to do a little more refactoring so that the script could run multiple loops until it fails or has no work, and then let external job management systems run it. That way turbo mode could be handled within the client.

But my general opinion is that none of the time management needs to be done externally; it could all be done internally in Python. I see pros and cons in both approaches, and since I don't see a need to change it, I wasn't going to touch it.

Good idea (thanks)! Looping until there's a failure or no more work should work fine.

I liked using cron since, after any crash, the script would start again on the next run (10 minutes later). And using flock in the cron entry should prevent multiple instances.
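For reference, a cron entry along these lines gives both properties (the paths and script name are illustrative, not the repo's actual install layout): flock -n skips a run if the previous one is still going, and cron starts the script again at the next 10-minute mark if it crashed:

```
*/10 * * * * /usr/bin/flock -n /tmp/podcastnode.lock /usr/bin/python3 /path/to/podcastnode.py
```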

It seemed like the best way (at the time) to make sure it would restart if something went wrong; I was worried about the script dying without restarting. The Umbrel/Start9 Docker versions will restart if the main task exits/fails.

Updated the server to send a response to the client's report. i.e. when the client reports its hash, there will be a "Success" or "Fail" status to indicate whether the hash was accepted/validated by the server.

I added a few lines at the end of the script in the develop branch to demonstrate the response data (responsedata). responsedata['status'] will be "Success" or "Fail" from the server, or "Error" if the POST fails.

Use this to determine if you should continue in turbo mode, or exit.
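Roughly, the turbo decision could branch on that status field like this (only `responsedata['status']` and its three values come from the develop branch; the surrounding function is a sketch):

```python
def should_continue_turbo(responsedata: dict) -> bool:
    """Return True if it's safe to request the next task immediately."""
    if responsedata.get('status') == 'Success':
        return True   # hash accepted/validated by the server
    # 'Fail' (consensus failure) or 'Error' (the POST itself failed):
    # fall back to the standard 10-minute wait
    return False
```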

No rush. Whenever you have time.

Thanks for the help.

Good to know.

I noticed an error in my logs and wonder if you could comment on it. It's mildly related to turbo.
The gist is that when sending the response back, I got a timeout error, so the server was never notified of my work status.
Does that matter? I would have imagined that in that particular scenario we should retry sending the response. But I'd also imagine that if it mattered, you would have already done that.

My surprise is just that the next item of work was what seemed to be naturally next in the queue.

On the server, if I don't get a response, I send the same task on the next request. This was also a counter-measure against malicious nodes making multiple requests (without responding). I keep sending the same task until I get a valid response.

Unfortunately, nodes can get stuck in a loop when it's impossible to send a valid response. Mostly with DAI & 404 errors, when I never get what I'm expecting. That's the main reason for writing a new server algorithm that gives out the tasks.

If you received a new task (next in the queue), you must have sent a valid response for the previous one. If it was a free/48-hour episode, other nodes may have satisfied the requirements (so I didn't need to re-send to your node).

I have the general structure done, but a few things came up.

I did some refactoring so that:

  1. It's easier to follow the different parts of the process.
  2. There are type annotations, so I can rely on my IDE more and also have a single place to look up what each part expects (see the sketch below).
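For example (a sketch only; the class, field, and function names are illustrative, not the actual client API):

```python
from typing import Optional, TypedDict


class WorkResponse(TypedDict, total=False):
    """Shape of the work handed back by the server (illustrative field names)."""
    download: str   # source URL to download an episode from
    pin: str        # hash to pin from other nodes
    delete: str     # hash to unpin/delete


def request_work(payload: dict) -> Optional[WorkResponse]:
    """Ask the server for work; None would mean nothing to do right now."""
    ...
```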

Issues I found:

  1. Should the peer count be a string or an int? It has always been a string in the Python client, and I was wondering whether changing it to an int would break the server side.
  2. Should the delete work request report an error if it fails to delete?

Peer count should be fine as an int. I don't see how it would break the server; it's just a POST variable on the server (which is stored as an int).
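So on the client side, something like this should be safe (the URL and payload key are assumptions; the point is only that the peer count can be sent as an int):

```python
import requests  # assuming the client posts with requests


def report_peers(url: str, peer_count: int) -> requests.Response:
    # Sending an int is fine: it arrives as an ordinary POST variable,
    # and the server stores it as an int anyway.
    return requests.post(url, data={'peers': peer_count}, timeout=120)
```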

The delete work request was kind of informational. I had no way to know if it was successful, so the client just echoed back the hash. Plus, delete was the only simultaneous action: I could tell the client to download/pin and delete something at the same time. The "error" payload was for download/pinning (99, 98, 97); I didn't care about delete errors.

I haven't really thought much about better error reporting, but with new features we may need something better to handle simultaneous tasks. With future garbage collection & adding peers, it might help to report errors per task.