are there any plans to update the archive?
Closed this issue · 1 comment
ckoshka commented
I'm planning on training a GPT-2 instance on the entire SCP Wiki, and this is the only archive I've been able to find. Are there any plans to update it?
sandsmark commented
yes, I updated https://github.com/sandsmark/wdotcrawl a bit and started running it again.
edit: it's probably going to take a long time (a couple of weeks). There are thousands of pages, the pages have a ton of revisions, and I throttle heavily (Wikidot didn't seem to mind a 200 ms delay between requests, but I started getting 500 errors, so I now delay at least 1 s between requests, with exponential backoff on errors).
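
For anyone curious what that kind of throttling can look like, here is a minimal Python sketch. It is not the actual wdotcrawl code: the names `fetch_with_backoff` and `http_get`, the 300 s cap, and the retry limit are all assumptions made for illustration. Only the 1 s minimum delay and the idea of backing off exponentially on 5xx errors come from the comment above.

```python
import time
import urllib.error
import urllib.request


def http_get(url):
    """Hypothetical helper: return (status_code, body_bytes) for a GET request."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code of the failed response.
        return err.code, err.read()


def fetch_with_backoff(url, fetch=http_get, min_delay=1.0, max_delay=300.0,
                       max_retries=6, sleep=time.sleep):
    """Fetch `url`, always waiting at least `min_delay` seconds first.

    On a 5xx response the wait doubles before each retry, capped at
    `max_delay` (an assumed cap, not taken from the thread).
    Any non-5xx response is returned as-is, including 4xx.
    """
    delay = min_delay
    for _ in range(max_retries + 1):
        sleep(delay)                 # throttle every request, even the first
        status, body = fetch(url)
        if status < 500:
            return status, body
        delay = min(delay * 2, max_delay)  # exponential backoff on server errors
    raise RuntimeError(f"giving up on {url} after {max_retries + 1} attempts")


# Example usage (performs a real request, so it is left commented out):
# status, body = fetch_with_backoff("https://example.com/")
```

Passing `fetch` and `sleep` in as parameters is only a design choice to make the sketch easy to try without hitting a real server or actually waiting.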