Create torrents for bulk data
Right now, I'm using the fdsys script to scrape all bill texts for every session of Congress that has data. This takes a long, long time, so having the data hosted somewhere makes sense. After all, bills from previous congressional sessions aren't going to be modified. However, it is about a gigabyte of data per session, so a conventional host wouldn't make much sense. On the other hand, this is a great use case for torrents. The main issue is that you would most likely end up stuck with every available format in one torrent, but that's okay for me. Thoughts on this?
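For anyone curious what building a torrent for one session's data would involve, here is a minimal sketch using only the Python standard library. It assumes the session's bill texts have already been packed into a single archive file. The file name and tracker URL in the usage comment are hypothetical placeholders, not anything this project provides. A real torrent client or tool would normally do this for you.

```python
import hashlib
import os


def bencode(value):
    """Encode ints, bytes/str, lists, and dicts using BitTorrent's bencoding."""
    if isinstance(value, bool):
        raise TypeError("bencoding has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys must be byte strings, emitted in sorted order.
        items = [
            (key.encode("utf-8") if isinstance(key, str) else key, item)
            for key, item in value.items()
        ]
        items.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def build_metainfo(path, announce_url, piece_length=256 * 1024):
    """Return the bytes of a single-file .torrent for the file at `path`."""
    piece_hashes = []
    with open(path, "rb") as source:
        while True:
            piece = source.read(piece_length)
            if not piece:
                break
            # Each piece is identified by its 20-byte SHA-1 digest.
            piece_hashes.append(hashlib.sha1(piece).digest())
    info = {
        "name": os.path.basename(path),
        "length": os.path.getsize(path),
        "piece length": piece_length,
        "pieces": b"".join(piece_hashes),
    }
    return bencode({"announce": announce_url, "info": info})


# Hypothetical usage (archive name and tracker URL are placeholders):
# data = build_metainfo("bills-113.zip", "http://tracker.example.org/announce")
# with open("bills-113.zip.torrent", "wb") as out:
#     out.write(data)
```

Since bills from past sessions don't change, a file like this would only need to be generated once per session.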
Sunlight used to host these on S3, but doesn't do that anymore.
It is a pretty decent use case for torrents, though I don't know if any of the organizers here have (or are familiar with) torrent management software, or want to take on the maintenance.
No, I don't remember anymore...not even the order of magnitude. If it was hugely expensive I'd probably remember, but we also didn't promote them very well -- they are just linked to on the wiki.
And actually, they still are:
https://github.com/unitedstates/congress/wiki
And the Sunlight downloads...still work. They're just not updated anymore. And are delivered over plain HTTP (gross).
I don't recall the S3 costs associated with these but I'd be shocked if they were significant. Speaking as a former crazed BitTorrent evangelist, I kind of doubt you'll wind up with enough use to keep a healthy swarm going. Still, if you want to go this route, S3 offers torrent capability. In practice that will probably leave AWS as the single seed, with no real difference in costs (it actually might be a bit higher, since I think you end up paying for more API ops for individual chunks even though the bandwidth costs are the same. Still, we're probably talking about pocket change).
What might make more sense is just configuring a requester-pays bucket. This will introduce some hassle for devs who aren't in the AWS ecosystem, but it is a pretty clean solution and protects against unexpected bills caused by devs who pull this data on an hourly cron. Unfortunately, requester-pays buckets do not support BitTorrent.
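To make the requester-pays idea concrete, here is a rough sketch of both sides, assuming the caller passes in a boto3 S3 client (for example from `boto3.client("s3")`). The bucket and key names in the usage comment are hypothetical. The helper names are mine, but `put_bucket_request_payment` and the `RequestPayer` argument to `get_object` are existing boto3 client calls.

```python
import shutil


def enable_requester_pays(s3_client, bucket):
    """Bucket owner side: charge request and transfer costs to the downloader."""
    s3_client.put_bucket_request_payment(
        Bucket=bucket,
        RequestPaymentConfiguration={"Payer": "Requester"},
    )


def download_as_requester(s3_client, bucket, key, dest_path):
    """Downloader side: explicitly acknowledge that the requester pays."""
    response = s3_client.get_object(
        Bucket=bucket,
        Key=key,
        RequestPayer="requester",
    )
    # Stream the body to disk instead of holding the whole file in memory.
    with open(dest_path, "wb") as out:
        shutil.copyfileobj(response["Body"], out)


# Hypothetical usage (bucket and key are placeholders):
# import boto3
# s3 = boto3.client("s3")
# enable_requester_pays(s3, "example-congress-bulk-data")
# download_as_requester(s3, "example-congress-bulk-data",
#                       "113/bills.zip", "bills-113.zip")
```

The downloader needs AWS credentials of their own for this to work, which is the hassle mentioned above for people outside the AWS ecosystem.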