TapEth (Technologies for Advanced Programming - ETHereum) is a university project. Its main goal is to statistically analyze the pending transactions on an Ethereum network (e.g. the main network) and to predict, for each transaction, the estimated waiting time before it is mined in a block.
N.B. Some large files, like the Kafka and Spark packages, are hosted on GitHub using git-lfs.
The data pipeline is composed of the following steps:
Step | Technology used |
---|---|
Data ingestion | Apache Kafka Connect |
Data streaming | Apache Kafka / Apache Spark Streaming |
Data processing | Apache Spark / Apache Spark MLlib |
Data indexing | ElasticSearch |
Data visualization | Kibana |
The estimation is based on the gas price of the pending transaction. The gas price represents how much the user is willing to pay for the transaction, so transactions with a higher gas price are more attractive to miners, who will tend to mine them sooner.
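As a rough illustration of the idea, the sketch below fits a straight line between gas price and waiting time with ordinary least squares. It is not the model used by this project (the processing is done with Spark MLlib, and the actual algorithm is described in the Spark section); the function names `fit_line` and `predict_wait` are hypothetical, and the sample values are synthetic, chosen only to show that a higher gas price maps to a shorter predicted wait.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Synthetic, made-up samples (NOT measured data):
# gas price in gwei and waiting time in seconds.
gas_prices = [10.0, 20.0, 30.0, 40.0]
wait_times = [100.0, 80.0, 60.0, 40.0]

slope, intercept = fit_line(gas_prices, wait_times)


def predict_wait(gas_price_gwei):
    """Predicted waiting time (seconds) for a given gas price (gwei)."""
    return slope * gas_price_gwei + intercept
```

With these toy samples the fitted slope is negative, which is the relationship the estimation relies on: paying more gas shortens the expected wait.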
TapEth uses Infura, a service that provides access to a geth node exposing the Ethereum JSON-RPC API. Through the pub/sub pattern, using the JSON-RPC API over WebSockets, it is possible to subscribe to events such as newly mined blocks or incoming pending transactions; for more information, visit the pub/sub documentation of geth.
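The messages exchanged during such a subscription can be sketched as follows. This is a minimal example, not the code of this project's Kafka connector: it only builds the `eth_subscribe` request for `newPendingTransactions` and extracts the transaction hash from a notification, following the message shapes in the geth pub/sub documentation. The helper names are hypothetical, the sample hash and subscription id are placeholders, and opening the WebSocket connection to the node is left out.

```python
import json


def build_subscribe_request(request_id=1):
    """JSON-RPC request asking the node to push pending transaction hashes."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_subscribe",
        "params": ["newPendingTransactions"],
    })


def extract_tx_hash(raw_message):
    """Return the transaction hash carried by a subscription notification.

    Returns None for any other message, such as the reply that confirms
    the subscription and carries the subscription id.
    """
    message = json.loads(raw_message)
    if message.get("method") != "eth_subscription":
        return None
    return message["params"]["result"]


# Placeholder notification, shaped like the ones sent by the node.
sample_notification = json.dumps({
    "jsonrpc": "2.0",
    "method": "eth_subscription",
    "params": {"subscription": "0x1", "result": "0xabc"},
})
# Placeholder reply confirming the subscription.
sample_confirmation = json.dumps({"jsonrpc": "2.0", "id": 1, "result": "0x1"})
```

In a real client, the string returned by `build_subscribe_request` would be sent once over the WebSocket, and every message received afterwards would be passed to `extract_tx_hash`.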
In brief:
- Pending transactions are acquired by a Kafka connector and written to a Kafka topic. For more information, visit the Kafka section of this project.
- Spark Streaming reads from the Kafka topic, then Spark MLlib processes the incoming data using machine learning and finally sends the results to ElasticSearch. For more information, visit the Spark section of this project.
- Finally, Kibana reads the indices from ElasticSearch and exposes a beautiful graphical interface.
You can start the whole pipeline easily with docker-compose by running `docker-compose up` from the root of this project folder.