Parsing onionscan's crawls dat_0 file
Closed this issue · 3 comments
Dev @s-rah was kind enough to direct me to the location of the crawl data; however, I've spent most of the morning trying to work out how to parse it into something usable with Python. In #129, @s-rah suggested I use the Golang onionscan/crawldb package, but I know very little about Go and was hoping to access the data directly and head back into familiar territory.
I've looked at the code to try to see how the files were produced, and saw that onionscan uses Tiedot for its database, which describes itself as a NoSQL JSON document store, but this hasn't got me very far.
Does anyone know whether the dat_0 file is in a particular format, or is it just raw data? Opening it in macOS TextEdit I can see JSON structures, but Sublime Text parses it differently.
Ideally I want to parse dat_0 into a JSON structure in Python. However, my attempt with json.dumps ended in a data rate error:
import json

with open('crawls/dat_0', 'rb') as f:
    data = json.dumps(f.read())
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.
Currently, as I'm unsure precisely what dat_0 is, I don't know whether json.dumps() is working correctly but failing because the file is too large, or whether it is misinterpreting the file entirely, leading to the data rate error.
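(Rereading the json docs, json.dumps actually goes the other way round: it serializes a Python object into a JSON string, so calling it on the raw file contents may simply be echoing the whole file back as one enormous escaped string, which would explain the notebook's output limit tripping. json.loads is the parsing direction:)

```python
import json

# json.loads parses JSON text into Python objects;
# json.dumps serializes Python objects back into JSON text.
record = json.loads('{"hiddenService": "example.onion"}')
print(record["hiddenService"])        # ordinary dict access after parsing

roundtrip = json.dumps(record)        # back to a JSON string
```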
If anyone knows whether dat_0 is a standardized format and, as a bonus, whether it can be parsed with a Python library, that would be really helpful.
The Tiedot wiki documents the data structures used in the database files: https://github.com/HouzuoGuo/tiedot/wiki/Data-structures. You will need to write a Python parser for these structures in order to process the file in Python.
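As a starting point, here is a rough pure-Python sketch, assuming the record layout described on that wiki page (a 1-byte validity flag, an 8-byte little-endian allocated size, then the document bytes padded out to the allocated size). The field names and padding handling are my own guesses, so verify them against the wiki:

```python
import json
import struct

def iter_tiedot_docs(path):
    """Yield parsed JSON documents from a Tiedot collection file.

    Assumed record layout (per the Tiedot wiki): a 1-byte validity
    flag (1 = valid, 0 = deleted), an 8-byte little-endian allocated
    size, then the document bytes padded out to the allocated size.
    """
    with open(path, 'rb') as f:
        while True:
            header = f.read(9)
            if len(header) < 9:
                break                      # end of file
            validity = header[0]
            (allocated,) = struct.unpack('<q', header[1:])
            if allocated <= 0:
                break                      # hit pre-allocated empty space
            body = f.read(allocated)
            if validity != 1:
                continue                   # skip deleted documents
            # strip padding bytes before decoding the JSON payload
            text = body.rstrip(b'\x00 ').decode('utf-8', errors='replace')
            if text:
                yield json.loads(text)
```

Usage would then be something like `for doc in iter_tiedot_docs('crawls/dat_0'): print(doc)`.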
Thanks very much. It's pointed me in the right direction, which is much more than I had beforehand.
For anyone attempting something similar, the following Kaitai Struct definition worked for parsing the main dat_0 file, and Kaitai Struct supports many target languages, not just Python.
meta:
  id: onions
seq:
  - id: records
    type: record
    repeat: eos
types:
  record:
    seq:
      - id: validity
        type: s1
      - id: allocated
        type: s8le
      - id: document
        type: str
        encoding: utf-8
        terminator: 1
        eos-error: false