python/psf-salt

Generating a static archive of bugs.python.org and bugs.jython.org

ewdurbin opened this issue · 2 comments

bugs.python.org and bugs.jython.org have both been deprecated in favor of GitHub Issues.

The messages/files/urls need to remain online in perpetuity as a reference (we don't want to break old links!).

We should investigate methods of creating a static archive of the sites for this purpose, to avoid the need to maintain the installations forever.

cc @rouilj

@JacobCoffee, @rouilj and I had a conversation on IRC a while back and John was tracking a request from another user regarding static archives of roundup instances.

@ewdurbin, I have not had any luck tracking down the user from the prior discussion.

However, some things to consider:

How much of the msgs/files/URLs do you want to keep? Do user### URLs matter? Do any other class URLs matter?

As a first pass, consider scraping all the issue### URLs. The issue pages include the sender, date and
body of the msgs. Do you need to support the msg2345 URL if issue123 displays msg2345?

If not, this becomes easier. If you do need to keep the msgs (e.g. for the recipient list, exact
formatting, or because you expect links to msgs to be used), you could scrape all the msg####
URLs as well.
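A minimal sketch of enumerating the pages to scrape, assuming items are addressed as class name plus numeric id (issue123, msg2345) as in the examples above. The function name, the id ranges and the `.html` output filenames are hypothetical, and the actual fetching is left out:

```python
from urllib.parse import urljoin

BASE_URL = "https://bugs.python.org/"  # site named in this issue


def page_urls(class_name, first_id, last_id, base_url=BASE_URL):
    """Yield (url, local_filename) pairs for ids first_id..last_id inclusive.

    Hypothetical helper: the real id ranges would come from the tracker.
    """
    for item_id in range(first_id, last_id + 1):
        designator = f"{class_name}{item_id}"
        yield urljoin(base_url, designator), f"{designator}.html"


# Example ranges only; not the real number of issues or messages.
issue_pages = list(page_urls("issue", 1, 3))
msg_pages = list(page_urls("msg", 10, 11))
```

Scraping msgs as well as issues is then just a second call with a different class name, so the decision can be deferred until we know whether msg links are used in the wild.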

For files, how much of the metadata do you need? Scraping the files#### URL and then placing the actual
files in a subdirectory of files####/filename might work. I don't remember the exact structure of the
download links that we would have to replicate on disk.

One issue might be setting the mime type for the attached files. If you can live
with all files having the application/octet-stream mime type we could get away without having to
munge the download links on the generated HTML pages (files##, msgs##, issue##) to include a
type (.pdf, .jpg ...) extension.
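To illustrate the trade-off: a static web server typically derives the Content-Type from the file extension. The sketch below uses Python's standard `mimetypes` module to show what an extension-based lookup would give, falling back to application/octet-stream when the name has no recognizable extension. The function name and the filenames are made up for illustration:

```python
import mimetypes


def content_type_for(filename):
    """Guess a Content-Type from the filename extension only.

    Falls back to application/octet-stream, which is what attachments
    without a usable extension would be served as.
    """
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


# Hypothetical attachment names.
pdf_type = content_type_for("report.pdf")
unknown_type = content_type_for("attachment_without_extension")
```

Anything that lands in the fallback case would be offered as a download rather than displayed in the browser, which may be acceptable for an archive.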

To preserve internal links (e.g. issue123 references issue456), we would need to make the URL
b.p.o/issue23 resolve to issue23.html. This will make the web server serve the page up with the
correct mime type. I think rewrite rules in either Apache or nginx would be able to resolve the
right file on the back end.

If this isn't possible, we would need to automate munging the HTML in the scraped files, changing
href="/issue23" to href="/issue23.html".
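A rough sketch of that munging step using a regular expression. The list of class names is an assumption and would need to match whatever classes actually get scraped. Links with a query string or fragment (e.g. /issue23#msg5) are deliberately left alone here and would need extra handling:

```python
import re

# Matches href="/issue23", href="msg2345", etc. The class names listed
# here are an assumption; extend the list to cover the scraped classes.
ITEM_HREF = re.compile(r'href="(/?(?:issue|msg|file|user)\d+)"')


def add_html_suffix(html):
    """Append .html to internal item links in a scraped page."""
    return ITEM_HREF.sub(r'href="\1.html"', html)


before = '<a href="/issue23">issue23</a>'
after = add_html_suffix(before)
```

Because the pattern requires the closing quote right after the digits, running it twice over the same file does not produce issue23.html.html.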

Also, I don't see a reasonable way to generate an index page. How useful would a series of pages
such as /issue-1:1000.html, listing the issues numbered 1-1000, be for finding an issue? People could jump to a range easily enough by specifying issue-5001-6000. But would this be useful?
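For reference, generating such range pages is cheap. A minimal sketch, assuming a page size of 1000 and a bare list of links; the function names and the HTML layout are hypothetical, and it does not check whether a given issue number actually exists:

```python
def index_ranges(max_issue, page_size=1000):
    """Yield (start, end) issue-number ranges covering 1..max_issue."""
    for start in range(1, max_issue + 1, page_size):
        yield start, min(start + page_size - 1, max_issue)


def render_index(start, end):
    """Render one index page linking to every issue number in the range."""
    links = "\n".join(
        f'<li><a href="/issue{n}">issue{n}</a></li>' for n in range(start, end + 1)
    )
    return f"<html><body><h1>Issues {start}-{end}</h1>\n<ul>\n{links}\n</ul></body></html>"


# 2500 is an example value, not the real highest issue number.
ranges = list(index_ranges(2500))
page = render_index(1, 3)
```

Without titles next to the numbers a page like this is not much help for discovery, so it would probably need the title pulled from each scraped issue page to be worth having.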

This ties in with searching the site. Roundup provides faceted searching (status, message text, title,
assignedto...). I am not sure if faceted searching is needed when it is a static site.
If you expect this to be used by direct link (somebody on the internet references b.p.o/issue2346) and
a standard Google index of the static site is sufficient, we can dispense with this issue. If you need to
retain the ability to find all issues where Ee is assigned, and "assigned NEAR edurbin" (I think that still
works in Google) isn't sufficient, we may need something else. For example, Elasticsearch or something
based on SQLite FTS5 search (with different facets in different columns).
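A small sketch of the FTS5 idea, with one column per facet so that a query can be restricted to a column. It assumes the SQLite library behind Python's `sqlite3` module was built with FTS5 enabled. The table layout, helper names and sample rows are invented for illustration:

```python
import sqlite3


def build_index(rows):
    """Create an in-memory FTS5 table with one column per facet."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE VIRTUAL TABLE issues USING fts5("
        "issue_id UNINDEXED, title, status, assignedto, body)"
    )
    db.executemany("INSERT INTO issues VALUES (?, ?, ?, ?, ?)", rows)
    return db


def search(db, query):
    """Return matching issue ids, best match first."""
    cur = db.execute(
        "SELECT issue_id FROM issues WHERE issues MATCH ? ORDER BY rank", (query,)
    )
    return [row[0] for row in cur]


# Made-up sample data, not real tracker content.
db = build_index([
    ("issue1", "Crash in parser", "open", "alice", "segfault when parsing input"),
    ("issue2", "Docs typo", "closed", "alice", "typo in the tutorial"),
    ("issue3", "Slow import", "open", "bob", "import takes a long time"),
])
open_for_alice = search(db, "assignedto:alice AND status:open")
```

The `column:term` filter is what gives the faceted behavior. A static site would still need something small on the server side to run the query, so this is not purely static.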

That's a few things to consider off the top of my head.