Between about 2007 and 2015, a team of dedicated hobbyists did significant reverse engineering work on several early microprocessors, particularly the venerable 6502. Their work can be found at visual6502.org.
The team maintained a wiki built on MediaWiki. The wiki is no longer functional. This repository represents my personal effort to restore the wiki's content and make it available on the internet again.
This site, however, is not a wiki; it is historical documentation. It is a static, markdown-based website hosted on GitHub Pages. The process used to build the site is described below.
The original Wiki states:
Content is available under Attribution-NonCommercial-ShareAlike 3.0 Unported license.
To honor this license, I must list the authors.
This information is found in `./wiki/rdf`.
- The original content was downloaded from the Wayback Machine. This copy is never modified. It is located in `websites`.
- Additions and various cleanup changes are made. This copy is located in `work`.
- A static markdown representation is built. This copy is located in `wiki`.
- GitHub Pages uses Jekyll to render the markdown content as a static HTML site. The rendered site is at https://gmofishsauce.github.io/v6502demo/wiki.
The Wayback Machine Downloader (WMD) was used to download the Wiki.
Additional processing was done with a custom tool written in Go, `mkmd`. The source code for this tool is found in the `tools` subdirectory. I am the sole author of the `mkmd` tool. It is GPLv3 licensed.
The WMD is a well-documented Ruby Gem. Because the Wiki contains many internal links and the Wayback Machine blindly chases all of them, the WMD by default downloads tens of thousands of pages. Specific exclusions reduced this to less than 1000 pages (including all images) for the entire V6502 Wiki:
```
wayback_machine_downloader http://visual6502.org --only wiki \
    --exclude "/\&[A-Za-z]+|Special:/" > all_files.wmd 2>stderr.wmd
```
The command above can be checked without doing an actual download by adding the option `--list` to the `wayback_machine_downloader` command line. The list of files to be downloaded is written to the standard output (e.g. `all_files.wmd` in the example above).
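For reference, the dry-run form is just the download command with `--list` added; the output file name here matches the file I committed (see below), but the redirection is otherwise illustrative:

```
wayback_machine_downloader http://visual6502.org --only wiki \
    --exclude "/\&[A-Za-z]+|Special:/" --list > ALL_FILES.json 2>stderr.wmd
```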
After my initial download, I found that I could never download the entire site again (which turned out to be required because I did not treat the download as pristine). It seemed like a throttling issue. The Wayback Machine Downloader does not have any options for adding delays between the individual file downloads.
In response I wrote a shell script, `./tools/dl.sh`. To use it, run the WMD command given above with the `--list` option. This produces a JSON file, which I committed as `ALL_FILES.json`. Make a copy of the file (e.g. `lines.json`) and remove the first line '[' and the last line ']' from the file; this leaves a file in "JSON lines" format (see jsonlines.org). Run `./tools/dl.sh lines.json` to download all the files. The script incrementally removes lines from the file as it successfully downloads each file; this allows the script to be killed and restarted if issues occur. The script delays 5 to 15 seconds between each pair of file downloads. I found this to be sufficient and did not investigate further.
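The real script is `./tools/dl.sh`; what follows is only a minimal sketch of the approach. The JSON field names and the Wayback fetch URL form are assumptions, and the sketch downloads into the current directory instead of recreating the `websites/` layout.

```
#!/bin/bash
# Minimal sketch of an incremental, restartable downloader (not the real dl.sh).
LINES="$1"              # file in JSON-lines format, e.g. lines.json

while [ -s "$LINES" ]; do
    line=$(head -n 1 "$LINES")

    # Assumed field names in the wayback_machine_downloader --list output.
    url=$(sed -n 's/.*"file_url": *"\([^"]*\)".*/\1/p' <<< "$line")
    ts=$(sed -n 's/.*"timestamp": *"\{0,1\}\([0-9][0-9]*\).*/\1/p' <<< "$line")

    # One common way to fetch the raw capture of a page from the Wayback Machine.
    if curl -fsS -o "$(basename "$url")" "https://web.archive.org/web/${ts}id_/${url}"; then
        # Only on success, drop the first line so a restart resumes where it left off.
        tail -n +2 "$LINES" > "$LINES.tmp" && mv "$LINES.tmp" "$LINES"
    else
        echo "failed: $url" >&2
        exit 1          # leave the line in place; rerun the script to retry
    fi

    # Delay 5 to 15 seconds between downloads to stay under the throttle.
    sleep $(( (RANDOM % 11) + 5 ))
done
```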
The WMD downloads into `websites/hostname`, `websites/visual6502.org` in this case. But I found that this second download pulled in two variations on the name "visual6502.org", each containing a couple of files. These can be seen in `websites/`. These fake names complicate processing, so I decided to defer dealing with them. I created a directory `work/` and copied `./websites/visual6502.org` recursively to `work`. From here until the cleanup at the end of this process, "the files" (or "all the files") refers to the content of `./work` only. The `websites` directory is kept pristine, and the two side directories are addressed at the end of the process.
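The working copy can be created with something along these lines (illustrative; not necessarily the exact commands I used):

```
mkdir -p work
cp -R ./websites/visual6502.org/. work/
```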
All processing in this part of the "pipeline" must be done with extreme care because of the file naming. The downloaded files were named from their wiki page titles, which can include any character. Three of the pages in this wiki have single quotes in their titles, which become single quotes in the downloaded file names, and a few other files contain other shell metacharacters in their names. Fixing these is not trivial, because links to these files contain URL-encoded representations of these metacharacters, which must also be fixed. This is done later, and details are given there.
Some non-image files in the wiki are gzipped. Image files are sensibly not gzipped, as they don't compress much. I wrote another shell script, `./tools/gunz.sh`, to find all the gzipped files and unzip them. The script is idempotent (it can be safely rerun if gzipped files are ever added to the repo). The script uses tricks like `find -print0 ... | xargs -0 ...` because of the file naming issue, which is resolved in a later step.
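The real script is `./tools/gunz.sh`; this is just a sketch of the idea, using the gzip magic number so the pass is safe to rerun:

```
#!/bin/bash
# Sketch of an idempotent gunzip pass over the working copy (not the real gunz.sh).
# Null-terminated names survive the quotes and other metacharacters in the file names.
find work -type f -print0 |
while IFS= read -r -d '' f; do
    # Only touch files that start with the gzip magic number (1f 8b), which
    # also makes the pass safe to rerun after new files are added.
    if [ "$(head -c 2 "$f" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]; then
        # The gzipped files keep their original names, so decompress to a
        # temporary file and move it over the original.
        gunzip -c "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    fi
done
```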
The license on the wiki requires listing the authors. I found that all authorship information was present in the Wayback Machine as a set of XML (RDF) files, one per content file. The RDF files are referenced by `<link>` tags in each HTML document.
Code in the mkmd tool can be used to identify and download the most recent version of each RDF file in the Wiki. These files were downloaded to `./wiki/rdf`. This is done one file at a time, using multiple invocations of the command

```
mkmd -r -o wiki/rdf html_file.ext
```
I wrote `tools/dl_auth.sh` to download all authorship information into the `work` directory. This script runs `mkmd -r` for each file. Note that many of the HTML files have file extensions `.png` or `.jpg`.
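`tools/dl_auth.sh` is in the repo; a minimal sketch of the loop it performs looks like this (the output directory follows the `mkmd -r` command shown above):

```
#!/bin/bash
# Sketch of the per-file authorship download loop (not the real dl_auth.sh).
# No filtering by extension: many wiki HTML pages have .png or .jpg names.
find work -type f -print0 |
while IFS= read -r -d '' f; do
    ./tools/mkmd -r -o wiki/rdf "$f"
done
```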
I build `mkmd` in the `./tools` directory like this: `go build -o mkmd`. The actual URL of each of these `.rdf` files within the Wayback Machine is peculiar; the Go code knows the rule for constructing it.
The site will end up in `v6502demo/wiki`, so the images need to be there. Rather than copy the ones I downloaded, which would double the size of the repo, I moved the downloaded `images/` directory from `websites` to `work` and then from `work` to `images`. There is a shell script, `tools/bld_img.sh`, that does this. It's a one-liner that runs rsync.
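A sketch of that kind of rsync one-liner, with illustrative paths (the real `tools/bld_img.sh` may use different source and destination directories):

```
# Move the downloaded images into the directory served as part of the site.
# --remove-source-files makes this a move rather than a copy; empty source
# directories are left behind and can be cleaned up separately.
rsync -a --remove-source-files work/images/ wiki/images/
```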
I renamed all files according to a rule. This is done using `./tools/mkmd -u` on each file. The `work` directory contains the renamed files. This breaks all the links to the renamed documents. All URLs are rewritten in later steps by applying the same character remapping rule. Unlike the file names, the URLs that reference them are usually URL-encoded. The URLs must be URL-decoded to restore the original illegal URL characters, then fixed by applying the rule.
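Illustratively, the renaming pass amounts to running `mkmd -u` over every file in the working copy (the exact arguments the tool takes may differ):

```
# Rename every file in work according to the remapping rule; null-terminated
# names again protect against quotes and other metacharacters.
find work -type f -print0 | xargs -0 -n 1 ./tools/mkmd -u
```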
Interestingly, none of the image files seems to have any illegal URL characters in their names. This suggests that MediaWiki has a similar set of remapping rules that it applies to uploaded images.
I enabled GitHub Pages for the entire repo. Much of the repo is not accessible through the Pages site, because Pages only renders markdown. I created some README.md files, which in effect become index.html files.
Several weeks were spent creating an HTML-to-markdown translator specific to the MediaWiki pages in the visual6502 wiki. The details are out of scope for this README; see the source in `./tools`.