s-rah/onionscan

Parsing onionscan's crawls dat_0 file


Dev @s-rah was kind enough to direct me to the location of the crawl data, but I've spent most of the morning trying to work out how to parse it into something usable with Python. In #129, @s-rah suggested I use the Go onionscan/crawldb package, but I know very little about Go and was hoping to access the data directly and head back into familiar territory.

I've looked at the code to try to see how the files were produced, and saw that OnionScan uses Tiedot for its database, which describes itself as a NoSQL JSON database, but this hasn't got me very far.

Does anyone know whether the dat_0 file is in a particular format, or is it just raw data? Opening it in macOS TextEdit I can see JSON structures, but Sublime Text renders it differently.

(Screenshots: dat_0 opened in TextEdit and in Sublime Text.)

Ideally I want to parse dat_0 into a JSON structure in Python. However, when I tried json.dumps I ended up with a data rate error.

import json

# The attempt that triggered the data rate error below
f = open('crawls/dat_0', 'rb')
data = json.dumps(f.read())

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Since I'm unsure precisely what dat_0 is, I don't know whether json.dumps() is working correctly but failing because the file is too large, or whether it is misinterpreting the file entirely, leading to the data rate error.
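
(Side note on the json module: json.dumps serialises Python objects to JSON text, while json.loads is the parsing direction, and neither will handle dat_0 whole, since the file is not a single JSON document. A small sketch for peeking at the raw bytes without flooding the notebook output; the 512-byte read size is arbitrary:

import json

# Peek at the start of the file instead of dumping the whole thing to output.
with open('crawls/dat_0', 'rb') as fh:
    head = fh.read(512)

print(head[:200])

# dat_0 is not one well-formed JSON document, so parsing it whole fails.
try:
    json.loads(head.decode('utf-8', errors='replace'))
except ValueError as exc:
    print('not a single JSON document:', exc)
)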

If anyone knows whether dat_0 is a standardized output, and, as a bonus, whether it can be parsed with a Python library, it would be really helpful.

s-rah commented

The Tiedot API documentation specifies the format of the data structures used in the database file: https://github.com/HouzuoGuo/tiedot/wiki/Data-structures. You will need to write a Python parser for these data structures in order to process the file in Python.
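
(A minimal hand-rolled reader along those lines might look like the sketch below. It assumes each record is a 1-byte validity flag, an 8-byte little-endian allocated size, and a document terminated by a 0x01 byte, which is the layout the Kaitai spec later in this thread also uses; the exact field widths should be checked against the Tiedot data-structures page.

import json
import struct

def iter_records(path):
    # Assumed record layout: 1-byte validity flag, 8-byte little-endian
    # allocated size, then the document bytes up to a 0x01 terminator.
    with open(path, 'rb') as fh:
        blob = fh.read()
    pos = 0
    while pos + 9 <= len(blob):
        validity, allocated = struct.unpack_from('<bq', blob, pos)
        pos += 9
        end = blob.find(b'\x01', pos)
        if end == -1:
            end = len(blob)  # final record may lack a terminator
        yield validity, allocated, blob[pos:end].decode('utf-8', errors='replace')
        pos = end + 1

for validity, allocated, doc in iter_records('crawls/dat_0'):
    try:
        print(json.loads(doc))
    except ValueError:
        pass  # padding or a deleted entry
)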

Thanks very much. It's pointed me in the right direction, which is much more than I had beforehand.

For anyone interested in something similar, the following spec, compiled with Kaitai Struct, worked for parsing the main dat_0 file. Kaitai Struct supports many languages, not just Python.

meta:
  id: onions
seq:
  - id: records
    type: record
    repeat: eos
types:
  record:
    seq:
    - id: validity
      type: s1
    - id: allocated
      type: s8le
    - id: document
      type: str
      encoding: utf-8
      terminator: 1
      eos-error: false
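
As a usage sketch: assuming the spec above is saved as onions.ksy, compiled with kaitai-struct-compiler --target python, and the kaitaistruct Python runtime is installed, the compiler generates an Onions class whose records attribute holds the parsed entries, and each document field can then be handed to json.loads. The check for validity == 1 marking a live document is an assumption.

import json

from onions import Onions  # module generated by kaitai-struct-compiler

target = Onions.from_file('crawls/dat_0')

documents = []
for record in target.records:
    # Assumption: validity == 1 marks a live document; deleted or padded
    # entries may not contain valid JSON, so parse failures are skipped.
    if record.validity != 1:
        continue
    try:
        documents.append(json.loads(record.document))
    except ValueError:
        pass

print(len(documents), 'documents parsed from dat_0')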