markovmodel/pyemma_tutorials

Regular problems with MDShare timeouts

Closed this issue · 19 comments

I was trying to run through the notebooks again this morning from a fresh repo clone. I am having problems with md-share timeouts; I have tried repeatedly over the last 30 minutes. I mentioned a similar problem some weeks back to @cwehmeyer when I first started looking at the notebooks. Are we confident that this solution will be robust in the wild?

TimeoutError: [Errno 60] Operation timed out

It is as robust as our IT department's servers are, so this is at least questionable. We could host the data elsewhere, but then we should decide on a provider. GitHub LFS seems costly; Amazon S3 is probably best suited for this. But we can also be brave and believe that issues of this kind are only temporary and rare.

We can also try a ZeDaT-hosted userpage for this.

I bet this is the same webserver.

No it is not. I'm good with changing that (at the end of the day, when more important stuff has been done).

It's a bit less convenient since we don't have a group account...

I do not know of a solution to this without some compromises, such as moving the data to a proprietary platform, e.g. GitHub or figshare. Maybe Dryad is an option? https://datadryad.org/ (open, pay to play)

In my opinion we could for now just rely on our server infrastructure. Servers in general are never up 100% - never.

Still, we should make mdshare more stable. I know that no server is available all the time, but I experienced problems on no fewer than three days this week alone.

This seriously needs to be resolved. The current solution is NOT OK in my opinion. There will be countless emails coming our way if we do not do something about this now. FAO @franknoe

So is the issue the server availability or are there issues with mdshare itself?

How much data are we talking about? Any reason not to put this on a robust location like figshare and pulling things with mdshare from there? I believe the Pande people have also used figshare for some time.

Improving the IT situation is high up on my list, but this is not going to be a fast solution.

The fastest fix would be to

  1. hard-code some content, i.e., the tutorial file catalogue, into mdshare. That way, we can bypass the online (filename pattern) lookup once the data has already been downloaded.
  2. mirror the files on another ZeDaT server under a private user account.
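
To make point 1 concrete, here is a minimal sketch of what a hard-coded catalogue lookup could look like. It is not mdshare's actual code: the file names in `TUTORIAL_CATALOGUE` and the helper `resolve_locally` are hypothetical and only illustrate how a filename pattern could be resolved against local files without any network request.

```python
import fnmatch
from pathlib import Path

# Hypothetical catalogue; the real tutorial file names would go here.
TUTORIAL_CATALOGUE = (
    "example-traj-0.xtc",
    "example-traj-1.xtc",
    "example-topology.pdb",
)


def resolve_locally(pattern, working_directory, catalogue=TUTORIAL_CATALOGUE):
    """Resolve a filename pattern against the hard-coded catalogue.

    Returns the local paths if every matching file is already present in
    working_directory, otherwise None (the caller would then go online).
    """
    names = fnmatch.filter(catalogue, pattern)
    if not names:
        return None  # pattern is not covered by the catalogue
    paths = [Path(working_directory) / name for name in names]
    if all(path.is_file() for path in paths):
        return paths  # everything is on disk; no server lookup needed
    return None
```

With something like this, a second run of the notebooks would only contact the server when a file is actually missing.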

I suggest having two locations: the default location (e.g. FU) and the fallback location (e.g. figshare). But we should then check the availability of the first choice quickly (i.e. without a very long timeout).
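
As a rough sketch of the two-location idea, assuming plain HTTP downloads: the mirror URLs below are placeholders, and `fetch_with_fallback` and `open_url` are hypothetical helpers, not part of mdshare. Each location is tried in order with a short timeout, so an unreachable first choice costs only a few seconds before the fallback is used.

```python
import urllib.request

# Placeholder URLs for illustration only; neither is a real endpoint.
MIRRORS = (
    "https://primary.example.org/data/",
    "https://fallback.example.org/data/",
)


def open_url(url, timeout):
    """Download url and return its content as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_fallback(filename, mirrors=MIRRORS, timeout=5.0, opener=open_url):
    """Try each mirror in order and return the first successful download."""
    errors = []
    for base in mirrors:
        url = base + filename
        try:
            return opener(url, timeout)
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            errors.append((url, exc))
    raise ConnectionError(f"all mirrors failed for {filename!r}: {errors}")
```

The `opener` argument is only there so the download step can be swapped out, e.g. in tests that should not touch the network.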

If we need to buy space somewhere, tell me the place and the cost. A few hundred MB can't be very expensive.

I think there are many options. I like the idea of Dryad: non-profit, research-focused, with a citable DOI for the data. The only problem as I see it is that the data is static once uploaded, and additional data has to be submitted and paid for separately (roughly 100 USD per submission, as far as I can see). The most important aspect would be to have a platform that works with the MDShare library as-is, so that no major code changes are needed. Does LiveCommsJ allow for supplementary information of this size? This could also be an option.

We can log in with our ZeDaT account.

Publishing data is allowed for FU members, but it seems to come with a bit of overhead. I also have not yet explored how we can access parts of a dataset, and there are no published datasets yet (only manuscripts) that I could use as an example.

What about a github repository for data only?