zooniverse/panoptes-python-client

Connection issues when trying to upload subjects / looping over subject sets

Closed this issue · 2 comments

Hi @adammcmaster

A colleague and I have recently been experiencing connection issues when uploading subjects with the panoptes-python-client. Uploads fail regularly: they appear to stall indefinitely at random points during the upload process, which has left our subject sets incomplete. We've refactored our upload script so that we can "resume" uploading into the same set using these steps:
1. Iterate over the specified subject set to find all subjects that have already been uploaded
2. Create the missing subjects in batches of 500 and add each batch to the set; repeat until finished
3. On connection failure, start over (manually) from step 1
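
The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the script linked in this issue: `create_subject` and `link_batch` are hypothetical callables standing in for whatever creates one subject and adds a list of subjects to the set, and using `filename` as the deduplication key is also an assumption.

```python
from itertools import islice


def batched(items, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def resume_upload(manifest, already_uploaded, create_subject, link_batch,
                  batch_size=500):
    """Create and link only the manifest entries not yet in the set.

    manifest: iterable of dicts, each with a 'filename' key (assumed key).
    already_uploaded: set of filenames found by iterating the subject set.
    create_subject, link_batch: hypothetical callables supplied by the caller.
    Returns the number of newly created subjects.
    """
    pending = (row for row in manifest
               if row['filename'] not in already_uploaded)
    created = 0
    for batch in batched(pending, batch_size):
        subjects = [create_subject(row) for row in batch]
        link_batch(subjects)  # one call per batch rather than per subject
        created += len(subjects)
    return created
```

Because step 1 rebuilds `already_uploaded` from the set itself, rerunning after a failure skips everything that was linked before the connection dropped.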

Recently, we've had difficulty even getting past step 1, iterating over the subject set. Essentially, what we are doing for step 1 is this:

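from panoptes_client import Project, SubjectSet

my_project = Project.find(args['project_id'])
my_set = SubjectSet.find(args['subject_set_id'])
# Iterating over the set's subjects queries the API as it goes
for i, subject in enumerate(my_set.subjects):
    ...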

This is the error I just got from looping over the set:
PanoptesAPIException: Received HTTP status code 504 from API
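
One way to soften transient errors like this is to retry with a growing delay. The helper below is only a sketch: `with_retries` and its parameters are hypothetical names and not part of the client, and which exception types are worth retrying is left to the caller (with the client it would presumably be PanoptesAPIException).

```python
import time


def with_retries(operation, retry_on=(Exception,), attempts=5, base_delay=1.0,
                 sleep=time.sleep):
    """Call operation() and retry with exponential backoff on failure.

    retry_on: exception types that should trigger another attempt.
    sleep: injectable so the delay can be skipped in tests.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1x, 2x, 4x, ... base_delay
```

For the loop above, `operation` would need to collect the whole listing in one call, for example `lambda: [s.id for s in my_set.subjects]`, since a failure partway through means starting the iteration again from the beginning.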

We've been working on the "Cedar Creek" project (5880), which has attracted a lot of volunteers and is about to run out of data much sooner than expected. When we uploaded the first batch of data before Christmas, we had the same issue for several days until it suddenly worked.

We've both been working from MSI (the UMN Supercomputing Institute), but I've also experienced the issues from my home ISP.

  • Do you see something we are doing wrong?
  • Is there a best practice for "resuming" an upload into a subject set while avoiding duplicates?
  • Or is there a better way to handle this? For example: create all subjects first, saving the subject_ids of successfully created subjects to disk; on connection failure, retry while skipping the subjects already created; and once everything has been created, link all of them to the set in one go using the saved subject_ids?
  • By the way, is there a way to find unlinked subjects? Are they garbage collected at some point?
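
To make the "save subject_ids to disk" idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: the JSON checkpoint format, the `filename` key, and `create_subject`, a hypothetical callable that creates one subject and returns its ID.

```python
import json
import os


def load_checkpoint(path):
    """Return {filename: subject_id} saved by earlier runs, or {} if none."""
    if not os.path.exists(path):
        return {}
    with open(path) as handle:
        return json.load(handle)


def create_with_checkpoint(manifest, path, create_subject):
    """Create subjects not yet recorded in the checkpoint file.

    create_subject is a hypothetical callable returning the new subject's ID.
    The checkpoint is rewritten after every creation, so an interrupted run
    loses at most the subject that was in flight.
    """
    created = load_checkpoint(path)
    for row in manifest:
        if row['filename'] in created:
            continue  # already created in an earlier run
        created[row['filename']] = create_subject(row)
        tmp_path = path + '.tmp'
        with open(tmp_path, 'w') as handle:
            json.dump(created, handle)
        os.replace(tmp_path, path)  # swap in the new file in one step
    return created
```

Once this returns without error, the saved IDs could be linked to the set in a single pass. I believe `SubjectSet.add` accepts a list for that, but I haven't checked how it behaves for very large lists.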

We've been using Python 3.5 / 3.6, with panoptes-python-client 1.0.1 and 1.0.3.

(full code if you're interested: Link)

This is a duplicate of known issues that are fixed in master but not yet released, specifically #189 and #191. The solution is to install the latest code from GitHub rather than the released version: https://github.com/zooniverse/panoptes-python-client#installation
pip install -U git+https://github.com/zooniverse/panoptes-python-client.git

I'll let @adammcmaster speak more to best practice for 'resumption' and finding unlinked subjects. For reference, I recently worked on something similar to resume uploads: https://github.com/camallen/PRN-scripts/blob/f571de5a087320bde27047440765b74a7eb131f8/upload_manifest.py#L57

Thanks Cam! I've changed my "resuming" functionality to follow your example, very neat. It requires fewer requests and should therefore mitigate the connection issues. We've also had far fewer connection issues today (with the updated client) and were able to process a new chunk of data. Feel free to close the issue.