
Blockscrape is a utility program that scrapes data from a blockchain and exports it to CSV format.

Primary language: JavaScript. License: MIT.

Blockscrape

Discontinued -- please contact me at che.fisher@gmail.com if you are interested in maintaining this project

Blockscrape is a utility program that scrapes a blockchain for required information and exports it to a CSV file.


Why Blockscrape?

Whether you're a data scientist, a quality assurance engineer, or someone who keeps needing the same set of blocks or transactions, Blockscrape saves you from requesting the same information over and over. Saving the data to disk reduces strain on your network and makes the information easier to share. Blockscrape is a utility program for blockchain analysis bundled with some nifty features which make it:

  • Fast: uses all available CPU cores via Node workers to get the job done in parallel
  • Smart: uses a built-in, customizable LRU cache for fee calculation to avoid making the same request twice
  • Reliable: saves incomplete/failed blocks to disk and restarts dead workers just in case things do go wrong
  • Remote-able: connect to and scrape remote nodes not on your local network
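To illustrate the LRU cache idea from the list above, here is a minimal sketch built on a plain `Map`. It is not Blockscrape's actual implementation: the `LruCache` class and its method names are hypothetical and only show how a bounded cache can evict the least recently used transaction once it is full.

```javascript
// Minimal LRU cache sketch. Class and method names are hypothetical,
// not taken from Blockscrape's source.
class LruCache {
  constructor(maxSize) {
    this.maxSize = maxSize;
    this.entries = new Map(); // a Map iterates in insertion order
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Re-insert so this key becomes the most recently used
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxSize) {
      // The first key in iteration order is the least recently used
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }
}

// Usage: cache looked-up transactions by txid
const txCache = new LruCache(2);
txCache.set('txid-a', { amount: 1 });
txCache.set('txid-b', { amount: 2 });
txCache.get('txid-a');                 // touch 'txid-a'
txCache.set('txid-c', { amount: 3 });  // evicts 'txid-b'
```

A larger `maxSize` trades memory for fewer repeated lookups, which is presumably the role of the cache size setting.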

Coming Soon

  • Extendable: allows for adding other blockchains with relative ease
  • Customizable: specify required attributes instead of the default height, amount, fee, time, and txid
  • Benchmarks: proof that it works fast
  • Tests: proof that it works well

Installation

Prerequisites

  • Requires Node Dubnium (v10). I'd recommend installing it using Node Version Manager.
  • A fully indexed, locally running blockchain node such as Litecoin or Bitcoin. A local chain is optional but recommended; otherwise you'll need to scrape a remote blockchain.

Instructions

  1. git clone the repository into wherever you keep these things
  2. cd into the Blockscrape root directory
  3. npm install to get required packages
  4. npm link to get that fancy symlink (ooooh shiny!)

This will clone the repository, install required packages, and create a Blockscrape binary.

Now before I tell you the magic command you need to know a few things...

Environment Variables And A Few Things

To take advantage of memoization, the scraper goes in reverse. No matter which two blocks you pass, Blockscrape will begin at the highest block and end at the lowest.
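The ordering rule can be sketched as a small generator that always walks from the higher height down to the lower one, whichever order the two arguments arrive in. The function name `blockHeights` is hypothetical and is not part of Blockscrape.

```javascript
// Sketch only: yields block heights from the higher argument down to
// the lower one (both inclusive). blockHeights is a hypothetical name.
function* blockHeights(a, b) {
  const highest = Math.max(a, b);
  const lowest = Math.min(a, b);
  for (let height = highest; height >= lowest; height -= 1) {
    yield height;
  }
}

// Both calls visit the same heights in the same descending order
const forward = Array.from(blockHeights(10, 13));
const backward = Array.from(blockHeights(13, 10));
```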

The scraper does have some persistence although it's pretty basic: Blockscrape saves the last written block to a file (last-written-block.save) and will begin from the next block down the chain, so you can safely restart it with, say, a cron job in case the master process dies.
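As a rough illustration of that resume logic, the sketch below picks the starting block from either an explicit value or the save file. It assumes the save file holds a single block height as plain text, which this README does not confirm, and `resolveStartBlock` is a hypothetical helper rather than a function from the project.

```javascript
const fs = require('fs');

// Hypothetical helper. Assumes the save file contains one block height
// as plain text; the real file format may differ.
function resolveStartBlock(explicitFrom, saveFile) {
  if (explicitFrom !== undefined && explicitFrom !== '') {
    return Number.parseInt(explicitFrom, 10);
  }
  if (fs.existsSync(saveFile)) {
    const contents = fs.readFileSync(saveFile, 'utf8').trim();
    const lastWritten = Number.parseInt(contents, 10);
    if (Number.isInteger(lastWritten)) {
      // Scraping runs downward, so resume one block below the last written one
      return lastWritten - 1;
    }
  }
  return undefined; // nothing to resume from
}
```

With a save file containing 109300, this sketch would resume at block 109299.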

The save files (you might also notice a failed-blocks.save appear in case a worker dies while scraping) are ignored by Git and thus shouldn't be checked into version control.

The data dumps are saved in the dumps folder and reference the first and final (last written) blocks in the data dump, for example blocks-109330-109300.csv.
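The naming scheme and the default attributes (height, amount, fee, time, txid) can be sketched as follows. The helper names and the column order are assumptions made for illustration; the actual dump layout may differ.

```javascript
const path = require('path');

// Hypothetical helpers, not taken from Blockscrape's source.
function dumpFileName(firstBlock, lastWrittenBlock) {
  return `blocks-${firstBlock}-${lastWrittenBlock}.csv`;
}

function dumpPath(firstBlock, lastWrittenBlock) {
  return path.join('dumps', dumpFileName(firstBlock, lastWrittenBlock));
}

// Assumed column order: height, amount, fee, time, txid
function toCsvRow(tx) {
  return [tx.height, tx.amount, tx.fee, tx.time, tx.txid].join(',');
}

const exampleName = dumpFileName(109330, 109300); // 'blocks-109330-109300.csv'
```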

  • BLOCKSCRAPECACHESIZE: the maximum number of transactions that can be stored in the LRU cache, defaults to 100000
  • BLOCKSCRAPECLI: the name of the CLI of your local blockchain, defaults to litecoin-cli if undefined
  • BLOCKSCRAPEFROM: the first block (inclusive) to scrape; if undefined, Blockscrape attempts to read it from the last-written-block.save file
  • BLOCKSCRAPETO: the final block (inclusive) to scrape, defaults to 0 if undefined
  • BLOCKSCRAPELIMIT: the maximum number of blocks to write before shutting the process down, defaults to 10000
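For reference, the variables and defaults listed above could be read in Node roughly like this. It is a sketch, not Blockscrape's own configuration code: `readConfig` and the property names on the returned object are hypothetical.

```javascript
// Sketch of reading the documented variables with their documented
// defaults. readConfig and the returned property names are hypothetical.
function readConfig(env) {
  const toInt = (value, fallback) => {
    const parsed = Number.parseInt(value, 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  return {
    cacheSize: toInt(env.BLOCKSCRAPECACHESIZE, 100000),
    cli: env.BLOCKSCRAPECLI || 'litecoin-cli',
    // undefined here means "fall back to the last-written-block save file"
    from: toInt(env.BLOCKSCRAPEFROM, undefined),
    to: toInt(env.BLOCKSCRAPETO, 0),
    limit: toInt(env.BLOCKSCRAPELIMIT, 10000),
  };
}

const config = readConfig(process.env);
```

Passing the environment in as an argument, rather than reading `process.env` inside the function, keeps the sketch easy to try with different values.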

Running Blockscrape

Now that you know what the environment variables do you could, for example, scrape block 30000 to block 10 by doing:

  • BLOCKSCRAPECLI=litecoin-cli BLOCKSCRAPEFROM=30000 BLOCKSCRAPETO=10 blockscrape

Typing out those hefty environment variables every time would be tedious, and I figure you probably don't want to sit around staring at your screen to make sure Blockscrape is alive and well while scraping large amounts of data.

In that case consider starting (and potentially restarting) Blockscrape with a script like so:

#!/bin/bash
# restartBlockscrape.sh
# Starts Blockscrape if it is not already running; safe to call repeatedly.
source "$HOME/.bashrc"

NODE="$(which node)"
PROCESS="$NODE /home/grayedfox/github/blockscrape/main.js"
LOGFILE="/tmp/log.out"

export BLOCKSCRAPECLI="$(which litecoin-cli)"

# pgrep -f matches against the full command line of running processes
if pgrep -f "$PROCESS" > /dev/null; then
  echo "Blockscrape is doing its thing - moving on..." >> "$LOGFILE"
else
  echo "Blockscrape not running! Starting again..." >> "$LOGFILE"
  echo "Process: $PROCESS" >> "$LOGFILE"
  echo "Node: $NODE" >> "$LOGFILE"
  # Left unquoted on purpose so the string splits into command and argument
  $PROCESS >> "$LOGFILE"
fi

Now to monitor progress you could tail -f /tmp/log.out if using the above example and watch the blocks roll by.

You could also turn this into a cron job using crontab -e (or your scheduler of choice) to execute that script every N minutes/hours/unicorns.

Supported Blockchains

  • Litecoin
  • Bitcoin (in theory)
  • Remote blockchains available on Blockcypher

Contributing

Please follow the GitFlow branching model. Feature branches will require code reviews and branches merging into develop should be squashed. I have a linting style I like and I'd prefer you stick to it - Travis will fail pull requests that don't conform (sorry!). Captain's orders. All else is up for discussion!

Oh and feel free to report bugs, feedback, and the like - it's all much appreciated.