WaybackFullTextSearch++

Work-in-progress software for indexing entire archives of sites (or sections of sites) from web.archive.org using Manticore Search.

Usage

A very quick overview; fuller documentation will be added later.

Step 1: Create a sqlite DB file of url+timestamp combinations to scrape from archive.org

./WaybackGetUrls -o dbfilename.sqlite -d example.com

Usage:
  WaybackGetUrls [OPTION...]

  -o, --output_file arg  Sqlite3 DB name
  -d, --domain arg       Search query
      --tor              Route requests through TOR
      --tor-port arg     Port to run TOR proxy on (default: 9051)
  -h, --help             Print usage
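To give a rough idea of what Step 1 involves, here is a minimal Python sketch of querying the Wayback CDX API for a domain and storing the url+timestamp rows in sqlite. It is an illustration only, not how WaybackGetUrls is implemented: the function names, the `urls` table and its columns are all hypothetical, and the chosen CDX parameters are one reasonable selection rather than the ones the tool actually sends.

```python
import json
import sqlite3
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain):
    """Build a CDX query URL listing captures for a whole domain."""
    params = {
        "url": domain,
        "matchType": "domain",  # include subdomains as well
        "output": "json",
        "fl": "timestamp,original,mimetype",  # only the fields we store
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def store_rows(conn, cdx_json):
    """Store CDX JSON rows in a (hypothetical) urls table.

    Returns the number of data rows parsed; duplicates are silently
    skipped by the UNIQUE constraint.
    """
    rows = json.loads(cdx_json)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        "url TEXT, timestamp TEXT, mimetype TEXT, "
        "UNIQUE(url, timestamp))"
    )
    data = rows[1:]  # with output=json the first row is the header
    conn.executemany(
        "INSERT OR IGNORE INTO urls (url, timestamp, mimetype) "
        "VALUES (?, ?, ?)",
        [(original, timestamp, mimetype)
         for timestamp, original, mimetype in data],
    )
    conn.commit()
    return len(data)
```

Fetching `build_cdx_url("example.com")` with any HTTP client and passing the response body to `store_rows` would give a file similar in spirit to dbfilename.sqlite.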

Step 2: Scrape pages from archive.org

This requires an instance of Manticore Search running on your machine.

./WaybackScrapePages -f dbfilename.sqlite -t manticoreTableName

Usage:
  WaybackScrapePages [OPTION...]

  -t, --table arg     Manticore table name
  -f, --db-file arg   File containing URLs to scrape
      --tor           Route requests through TOR
      --tor-port arg  Port to run TOR proxy on (default: 9051)
  -h, --help          Print usage
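As a rough illustration of what Step 2 does for each url+timestamp pair, the Python sketch below builds the archive.org snapshot address and a JSON body for Manticore's HTTP `/insert` endpoint. It is an assumption-laden sketch rather than the tool's real code: the function names and the document fields (`url`, `timestamp`, `content`) are hypothetical, and depending on the Manticore version the top-level key may be `table` instead of `index`.

```python
import json


def snapshot_url(timestamp, original_url):
    """Address of one archived capture.

    The "id_" suffix asks the Wayback Machine for the page as originally
    captured, without the archive toolbar or rewritten links.
    """
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"


def manticore_insert_body(table, url, timestamp, text):
    """JSON body for POST http://<ip>:<port>/insert (field names hypothetical)."""
    return json.dumps({
        "index": table,
        "doc": {
            "url": url,
            "timestamp": timestamp,
            "content": text,  # readable text extracted from the page
        },
    })
```

The scraper would download `snapshot_url(...)`, extract the readable text, and POST the resulting body to the Manticore instance.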

Step 3: Search scraped and parsed data

./WaybackSearchTable -t manticoreTableName -q searchQuery

Usage:
  WaybackSearchTable [OPTION...]

  -t, --table arg       Manticore table name
  -q, --query arg       Search query
  -n, --results arg     Number of results (default: 10)
  -p, --page arg        Page number, starts at 0 (default: 0)
  -s, --server-url arg  Manticore database URL (<ip>:<port>) (default:
                        127.0.0.1:9308)
  -h, --help            Print usage
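The `-n` and `-p` options map naturally onto a limit and an offset. The Python sketch below shows one way the request to Manticore's HTTP `/search` endpoint could be put together; it is not taken from WaybackSearchTable, the function names are hypothetical, and newer Manticore versions may prefer `table` over `index` as the key.

```python
import json
import urllib.request


def build_search_body(table, query, results=10, page=0):
    """Translate (results, page) into Manticore's limit/offset."""
    return {
        "index": table,
        "query": {"query_string": query},  # full-text query syntax
        "limit": results,
        "offset": page * results,  # page numbers start at 0
    }


def search(server, table, query, results=10, page=0):
    """POST the query to http://<ip>:<port>/search and return the parsed JSON."""
    body = json.dumps(build_search_body(table, query, results, page))
    request = urllib.request.Request(
        f"http://{server}/search",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `search("127.0.0.1:9308", "manticoreTableName", "searchQuery", results=10, page=2)` would request results 20 to 29.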

TODO

  • Download available page URLs from wayback API and store in sqlite file
    • URL sqlite file handling class
    • cURL class
    • Parse downloaded API results into db file
  • Download pages from sqlite url DB into manticore instance
    • cURL class
    • Add functionality to sqlite class to get filtered results from db file
    • Manticore connection class
      • Add required functions to cURL helper class
    • Page parsing (get readable text for parsing)
    • Insert parsed page data into manticore DB
    • Batch operation page scraping
    • Multithreaded page scraping
      • Improve multithreading
    • Stop relying on the sqlite3 db during scraping; move the data into manticore for more reliable operation
  • [~] Search functionality
    • Expand manticore class to have search features
    • Output results nicely
    • Web
      • Display results in web browser
      • Fully interactive search through web page
      • Perfect UI for ease/speed of use
      • Experiment with returning saved HTML when result link clicked instead of directing to archive.org
    • Advanced search features
    • Figure out how to combine proper pagination and grouping together multiple timestamps of the same page nicely
  • Option to route through TOR
  • Write good documentation
  • Fix all the function calls where returned errors aren't handled
  • Detection and exclusion of binary files (the wayback mimetype is sometimes wrong)
    • Or alternatively, handle various binary file types properly (parse them)
  • Better command line options/handling
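
For the binary file detection item above, one possible approach is a content sniff that ignores the reported mimetype. The Python sketch below is a common heuristic, not something the project currently does: the function name, sample size and threshold are all assumptions, and it will misclassify some inputs (UTF-16 text contains NUL bytes, for instance).

```python
def looks_binary(data, sample_size=4096, threshold=0.30):
    """Guess whether a downloaded body is binary rather than text.

    Heuristic: a NUL byte in the first sample_size bytes, or a high share
    of control characters, suggests binary content.
    """
    sample = data[:sample_size]
    if not sample:
        return False  # treat an empty body as text
    if b"\x00" in sample:
        return True
    # Bytes that plausibly appear in text: common control characters,
    # printable ASCII, and high bytes (UTF-8 or legacy encodings).
    text_bytes = bytes(
        {7, 8, 9, 10, 12, 13, 27}
        | set(range(0x20, 0x7F))
        | set(range(0x80, 0x100))
    )
    # Delete every text-like byte; whatever remains is suspicious.
    non_text = sample.translate(None, text_bytes)
    return len(non_text) / len(sample) > threshold
```

A scraper could call this on each response body and skip, or hand off to a dedicated parser, anything for which it returns True.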