daviddavo/pocket2omnivore

publishedAt not working


When I go to my inbox after running the script, all posts appear as recently added. The original date from Pocket is not preserved.

Tasks:

  • Use thread pooling instead of batching
  • Save the status in the DB -> we can assume that just after launching, the status is PROCESSING; by default it should be NULL (a schema sketch follows this list)
  • Avoid re-populating articles that have already been populated: only launch the request for articles whose status in the DB is NULL (not PROCESSING or SUCCEEDED)
  • Query the articleSavingRequest status to update the DB
  • Change the metadata for articles that succeeded. Perhaps add another field in the DB to track whether this step has been done.
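
As a rough illustration of the DB part, the status could live in a small SQLite table along these lines (only a sketch; the table and column names here are made up, not the script's actual schema):

import sqlite3

conn = sqlite3.connect("pocket2omnivore.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        url              TEXT PRIMARY KEY,
        request_id       TEXT,               -- id returned by the save mutation
        status           TEXT DEFAULT NULL,  -- NULL / PROCESSING / SUCCEEDED / FAILED
        metadata_updated INTEGER DEFAULT 0   -- 1 once the date has been fixed
    )
""")
conn.commit()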

omnivore-app/omnivore#2316

If you want to fix this, basically: upload everything, wait for it all to become stable, then run the "update time" bit again.

Perhaps we should check if it's populated before continuing with the articleSavingRequest query. Instead of going in batches, we could use a heterogeneous queue of events: first try to populate some articles, then run queries to check whether they are populated, and once an article is populated, archive it and change the date.

In a kind of pseudo-code:

pq = ProcessQueue()

def saveArticle(url):
    r = gql(...)  # saveUrl mutation
    pq.push(checkArticle, r['id'])

def checkArticle(id):
    r = gql(...)  # articleSavingRequest query
    if is_populated(r):
        pq.push(archiveArticle, id)
    else:
        # do some kind of sleep or backoff here, then re-check
        pq.push(checkArticle, id)

def archiveArticle(id):
    gql(...)  # archive mutation + date update
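
For reference, one way to back that ProcessQueue with the standard library would be a plain queue.Queue drained by a few daemon worker threads. This is only a sketch of the idea, not the script's actual code:

import queue
import threading

class ProcessQueue:
    def __init__(self, n_workers=4):
        self._q = queue.Queue()
        for _ in range(n_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def push(self, fn, *args):
        self._q.put((fn, args))

    def _worker(self):
        while True:
            fn, args = self._q.get()
            try:
                fn(*args)  # a task may push follow-up tasks onto the same queue
            finally:
                self._q.task_done()

    def join(self):
        # blocks until every task (including follow-ups) has finished
        self._q.join()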

Yeah, that sounds like a good approach. The batching was to try to speed things up - but given the processing time it doesn’t seem to help all that much.

Before I realised what was happening with the processing, I tried to batch 10 and then update each one. My only worry (I haven’t checked the code for this) is that some articles never seem to leave the processing state - so we’d need some escape hatch if one gets fully stuck.

We can assume that they eventually leave, perhaps after a few days. We should use the DB to check whether we already tried to archive an article and avoid re-populating it (basically, saving the articleSavingRequest status in the DB).

Then, after we have tried to populate every article, query each one's status, and:

  • if the status is SUCCEEDED, launch the archive and date-update mutations
  • otherwise, relaunch the query after a few seconds (using @backoff; a sketch follows this list)
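
With the backoff package, that retry could look something like this sketch (query_status is a hypothetical helper wrapping the articleSavingRequest query, and the exact status strings are an assumption):

import backoff

class StillProcessing(Exception):
    """The article has not finished populating yet."""

@backoff.on_exception(backoff.expo, StillProcessing, max_time=600)
def waitUntilDone(request_id):
    status = query_status(request_id)  # hypothetical articleSavingRequest wrapper
    if status in (None, "PROCESSING"):
        raise StillProcessing(request_id)
    return status  # e.g. SUCCEEDED or FAILED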

Doing it in two stages is better because it's "resumable": you could always stop the notebook and run it again in a few days, and as long as the database is the same it should avoid re-populating all the articles.

I tried using a thread pool, but the main problem is that I don't know how to "enqueue" the follow-up tasks that check and update each article.

Both ThreadPool and ThreadPoolExecutor are designed for submitting a lot of homogeneous tasks (with something like map), processing those results, and then issuing another batch of homogeneous tasks. I don't think what I want is possible with that method.

All the example code looks like this:

from concurrent.futures import wait

futures = ...

done, not_done = wait(futures)  # blocks until ALL the futures have finished
for completed_future in done:
    ...

This is not suitable for our case, because we don't want to wait for ALL ARTICLES to be processed before we start updating the info...

The only solution I can think of is a loop like this:

<save all articles and get the request ids>

remaining = [ ... ]
while remaining:
    <asynchronously check and update the remaining list>

I'd like it to be able to interleave the tasks, but I guess there's no other solution.
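
For what it's worth, a minimal runnable version of that two-stage loop could look like this. It assumes urls is the list of Pocket URLs, and hypothetical helpers where saveArticle returns the request id and checkArticle returns True once the article is populated (slightly different signatures than the queue version above):

import time
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=8) as pool:
    # Stage 1: save every article, collecting the request ids
    remaining = list(pool.map(saveArticle, urls))

    # Stage 2: sweep over whatever is still processing until nothing is left
    while remaining:
        populated = list(pool.map(checkArticle, remaining))
        done = [rid for rid, ok in zip(remaining, populated) if ok]
        for rid in done:
            pool.submit(archiveArticle, rid)  # archive + fix publishedAt
        remaining = [rid for rid in remaining if rid not in done]
        if remaining:
            time.sleep(5)  # let the backend catch up before the next sweep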

In the branch associated with this commit you can find a working version that checks when the article is processed and retries if the data has changed (not resumable, as I haven't done the DB connection yet).