Bitcoin blockchain parsing in Python 3.6 (and 2.7). Still in development so expect bugs.
Blockchain data is loaded from binary .dat
files downloaded by the Bitcoin Core wallet. These files contain out-of-order serialized blocks. Either retrieve the files downloaded by the wallet, or extract the sample .dat
file from the .rar
located in Blocks/.
The Examples/ directory contains the methods for importing the binary blocks into Python, and decoding the data. These examples form the basis of the classes contained in the py2 and py3 modules.
- Python 3.6 or 2.7
- base58
- tqdm (optional)
- Set up a Python 2 or 3 environment as desired and install base58 and tqdm:
pip install base58
pip install tqdm
- For now, either manually download and extract the repo, or if git is available:
git clone https://github.com/garethjns/PyBC.git
cd PyBc
- Some example
.dat
are included in Blocks/. Unzip these if required.
-
Set .../PyBC/ as the working directory.
-
Everything is designed to be run from the top level directory. To use modules, import with, eg.
from py[version].Block import Chain
Set the working directory to .../PyBc/
and import required classes from the submodueles.
See read_dat.py.
from py3.Block import Dat
# Specify .dat to load
path = 'Blocks/'
f = 'blk00003.dat'
dat = Dat(path, f,
verb=5)
# Create Dat object
dat = Dat(f,
verb=5)
# Read the first block
dat.read_next_block()
# Read the next block
dat.read_next_block()
# Print the first blocks details
dat.blocks[0]._print()
# Print the second block's first transaction
dat.blocks[1].trans[0]._print()
See read_chain.py.
from py3.Chain import Chain
# Create a Chain object
c = Chain(verb=4,
outputPath="ExportedBlocks/")
# Read the next .dat
c.read_next_Dat()
from py3.Chain import Chain
# Create chain object
# Specifying (or not) which .dat to start from, and
# how many to load
c = Chain(verb=5,
datStart=0,
datn=1,
outputPath="ExportedBlocks/")
# Read
c.read_all()
The examples directory contains a number of scripts outlining the steps required to extract and decode different bits (literally) of information from the serialized blockchain.
These examples include
- ReadBlocks
- Reading binary data from disk
- HashBlock
- Compile the relevant information in a block header
- Hash it to verify it's valid
- HashTransaction
- Compile the relevant transaction data
- Hash it to verify it's valid
- DecodeOutputScripts
- Process transaction output script to list of OP_CODES and data
- GetOuputAddress
- Convert data in output script to a bitcoin address
- BlockChainInfoAPI
- How to query Blockchain.info's api
- And use it to verify transactions and blocks
- Export
- Explorting blocks to other formats indlucing dicts and Pandas DataFrames.
See Examples/readme.md for more info.
Classes are split in to two modules py2
for Python 2 code, py3
for Python 3 code and pyx
for generic code.
The Python 3 code is developed first, and the Python 2 code converted later.
There are two main types of Class - "loaders" and "mappers". Loaders hold the binary data read from disk in ._[name]
attributes. This data can be accessed using a .[name]
property that handles converting the data to a more usable format. Holding data in the created objects is convenient, but obviously increases memory usage. Mappers work exactly the same as loaders but avoid holding data. Instead, the objects just hold the index to the location of the data in the .dat file. These have the attributes ._[name]_i
which hold the index, then two sets of properties: ._[name]
which get and return the data from disk, and .[name]
that return the convenient versions.
Chain, Dat, Block, Trans, TxIn, and TxOut classes deal with parsing the blockchain components have the same following effective hierarchy:
Chain. dats[x]. blocks[x]. trans[x]. txIn[x] and .txOut[x]
ie. Chains hold multiple Dats, Dats hold multiple Blocks, Blocks hold multiple transactions, Trans hold multiple TxIns and TxOuts.
The py3.Common class holds reading methods and cursor tracking which are used by most of the other classes.
Object and methods to handle whole chain. At the moment .dat
files are ordered, but blocks aren't re-ordered by timestamp. Order in .dat
depends on download order.
Each child object in chain is stored in appropriate field of parent object (.dats
, .blocks
, .trans
). These fields contain dictionaries keyed by {object number (int, counting from 0) : Object}:
- Chain.dats ->
- {Dat objects}.blocks ->
- {Block objects}.trans ->
- {Trans objects}.TxIn/TxOut ->
- {TxIn objects}
- {TxOut objects}
- {Trans objects}.TxIn/TxOut ->
- {Block objects}.trans ->
- {Dat objects}.blocks ->
Read .dat
files 000000 - 000002.dat
from Blocks import Chain
c = Chain(verb=5,
datStart=0,
datn=3)
c.read_all()
Print example first transaction in second block in first .dat
file imported.
c.dats[0].blocks[1].trans[0]._print()
verb
: Import verbosity (int)
- 0 = printing off
- 1 = Use a TQDM waitbar, if available
- 2 = print
.dat
filename on import - 3 = print block level information on import
- 4 = print transaction level information on import
- 5 = print TxIn and TxOut level info
- 6 = print above and API validation checks
datStart
: First .dat
file to load (int)
datn
: Number of .dat
files to load (int)
datPath
: Relative or absolute path to folder containing .dat
files
.readDat()
: Read specified file
.read_next_Dat()
: Read next file
.read_all()
: Read all .dat
files (within specified range)
Some batch export methods would be useful.
Object and methods to handle .dat files downloaded by Core wallet. Uses mmap to map .dat
file to memory and read byte by byte. Keeps track of how far through a file has been read (.cursor).
Load a single block
from Blocks import Dat
path = 'Blocks/'
f = 'blk00003.dat'
dat = Dat(path, f,
verb=5)
dat.read_next_block()
path : path to folder containg .dat
s
f : filename of .dat
file (string).
.cursor
: Current position in file (int).
.blocks
: Blocks extracted (dict).
.mmap
: Mutable string object to read binary data from .dat
file.
.reset()
: Reopen file, create new .mmap and return .cursor to 0.
.read_next_block()
: Read the next block and store in .blocks. Remember final .cursor position.
.read_all()
: Read all blocks in .dat
.
.to_dict()
: Return attributes in a dict
.blocks_to_pandas()
: All blocks as rows of pandas DataFrame. Doesn't include individual transaction information.
.trans_to_pandas()
: Return all transactions as rows of pandas data frame. Drops all but first input and output of each transaction.
.to_pic()
: Pickles the block to disk after removing all the mmap objects.
Object and methods to handle individual blocks.
General
.mmap
: Redundant mmap object from .dat (mmap) [remove?].
.start
: Starting cursor position in .dat (int).
.cursor
: Current cursor position (int).
.end
: End cursor position in .dat (int).
.trans
: Dict storing transactions in block (dict).
Header info (each has ._ property)
.magic
: Magic number (4 bytes)
.blockSize
: Block size (4 bytes)
.version
: Version (4 bytes)
.prevHash
: Previous hash (32 bytes)
.merkleRootHash
: Merkle root hash (32 bytes)
.timestamp
: Timestamp (4 bytes)
.nBits
: Block size (4 bytes)
.nonce
: Nonce (4 bytes)
.nTransactions
: Number of transactions in block (var bytes)
Useful properties
.time
: Human readable time (dt)
.read_header()
: Read the header from the binary file and convert to hex. Store in relevant attributes.
.read_trans()
: Loop over .nTransactions and read each transaction. Store in .trans.
.verify()
: Check block size matches cursor distance traveled.
._print()
: Print block header info.
.prep_header()
: Using the data stored in relevant header attributes, recombine and decode to binary ready for hashing.
.api_verify()
: Get the block information from the Blockchain.info API (using the hash). Verify it matches on a few fields.
.to_dict()
: Return attributes in a dict
.to_pandas()
: Return as a single, index DataFrame row.
.to_csv()
: Save DataFrame as .csv (not especially useful here - use export methods to Dat export with blocks-as-rows or transactions-as-rows).
- The
.read_var()
method fails for large values. In blocks where this occurs, it'll cause the cursor to increment too far and break everything.
Object to store transaction information.
.mmap
: Redundant mmap object from .dat (mmap) [remove?].
.start
: Starting cursor position in .dat (int).
.cursor
: Current cursor position (int).
.end
: End cursor position in .dat (int).
Transaction info (each has ._ property)
.version
: Version (4 bytes).
.nInputs
: Number of transaction inputs (variable bytes).
.txIn
: Holds TxIn object for each input.
.txOut
: Holds TxOut object for each output.
.lockTime
: Locktime (4 bytes).
Useful properties
.hash
: Return hash of transaction.
.get_transaction()
: Read the binary transaction data, including the input and output components.
.prep_header()
: Returned concatenated bytes from transaction header to use for hashing.
._print()
: Print transaction info.
.api_verify()
: Get the transaction information from the Blockchain.info API (using the hash). Verify it matches on a few fields.
.to_dict()
: Return attributes in a dict
.to_pandas()
: Return as a single, index DataFrame row.
.to_csv()
: Save DataFrame as .csv (not especially useful here - use export methods to Dat export with blocks-as-rows or transactions-as-rows).
.pre_header()
only gets the first input and first output the moment. This means the .hash
is calculated incorrectly for transactions with multiple inputs or outputs.
Holds inputs for transaction.
General
.cursor
: Current cursor position (int).
Transaction inputs
.prevOutput
: Previous output (32 bytes).
._prevIndex
: self.read_next(4).
.scriptLength
: Script length (variable bytes).
.scriptSig
: ScriptSig (variable bytes).
.sequence
: Sequence (4 bytes).
.read_in()
: Read TxIn bytes in order.
._print()
: Print TxIn info.
Holds outputs for transaction and methods to decode.
General
.cursor
: Current cursor position (int)
Transaction outputs
.output
: Transaction outputs (1 byte).
.value
: Value in Satoshis (8 bytes).
.pkScriptLen
= pkScriptLen (variable bytes).
.pkScript
: pkScript - contains output address (variable bytes).
Useful properties
.parsed_pkScript
: Return .pkScript as list of OP_CODES and data.
.outputAddr
: Return bitcoin address for this output.
read_out()
: Read TxOut bytes in order.
.split_script()
(static) : Split the output scrip (.pkScript) in to a list of OP_CODES and data to push to the stack.
.P2PKH()
(static) : Get the output address for this object.
.get_P2PKH()
: Convert (old?) public key to bitcoin address.
.PK2Addr()
(static) : Convert public key to bitcoin address.
.get_PK2Addr()
: Get the output address for this object.
._print()
: Print TxOut info.
Anything used in more than one class.
- Tracks cursor position in current file
- Reading and mapping methods.
- General export methods.
Handles API calls to blockchain.info's API.
Some unit tests are included for the Python 3 version in .../py3/
, and can be run from top level directory:
python -m py3.tests