In-Network-Hadoop-NDN-CacheSimulator

Simulates data transfers to explore the caching potential of network nodes when running Hadoop over NDN (Named Data Networking) rather than traditional TCP/IP.

Primary language: Java

This is a caching simulator developed to evaluate the performance of caching when using the Hadoop Distributed File System (HDFS) over the Named Data Networking (NDN) information-centric networking (ICN) architecture. It replays traces of read operations collected from HDFS clusters and evaluates them in a fat-tree network.

Our results were published in the paper "On the Power of In-Network Caching in the Hadoop Distributed File System" by E. Newberry and B. Zhang, which appeared in ACM ICN 2019.

Repository Structure

  • src contains the source code of the simulator.
  • topologies contains various topologies, although currently only the fat-tree topology (k=8, 128 nodes) works.
  • traces contains various traces, although many of the ones in the "old" subdirectory may not work with the current simulator because they omit critical information (e.g., the starting offset of the read).

Requirements

This simulator should require no prerequisites apart from a modern version of Java and the libraries included in the standard distribution.

Compiling

The simulator can be compiled by running make. After modifications to the code, make clean should be run before running make again. Please note that this will erase the generated "build" directory and all of its contents.

Running

NOTE: Currently, this simulator is hardcoded to use 128 end hosts and a fat tree k factor of 8. However, these values should be trivial to modify in src/CacheSim.java.

The simulator must be run from within the "build" directory generated by running make. It can be invoked in the following way:

java CacheSim [topologyFile] [traceFile] [edgeCachePolicy] [aggrCachePolicy] [coreCachePolicy] [cacheBlockSize] [cacheMaxBlocks] (randomSeed) (nameNode)

The simulator will assign hosts to random positions in the cluster, based upon the provided seed (or, if none is provided, the seed provided by the Random() constructor). The NameNode argument was added as a workaround to fit a cluster with 128 DataNodes and 1 NameNode onto a fat-tree with 128 end hosts -- if specified, any hosts in the trace with this ID will be added as an additional node in the first pod of the fat-tree.
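The seeded placement step could be sketched as follows (the class and method names here are illustrative, not the simulator's actual API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: shuffle host IDs into fat-tree end-host slots using an
// optional seed, mirroring the behavior described above.
public class HostPlacement {
    public static List<Integer> assignPositions(int numHosts, Long seed) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < numHosts; i++) {
            positions.add(i);
        }
        // Use the provided seed if given; otherwise fall back to Random()'s
        // default (time/entropy-based) seeding.
        Random rng = (seed != null) ? new Random(seed) : new Random();
        Collections.shuffle(positions, rng);
        return positions; // positions.get(hostId) is that host's slot in the fat-tree
    }
}
```

With the same seed, the placement is reproducible across runs, which is what makes the optional randomSeed argument useful for repeatable experiments.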

As explained in our paper, the simulator segments HDFS blocks into chunks ("cache blocks") of the size specified on the command line, starting from the beginning of the block. Read operations with the same source and destination host will be ignored and not included in the results, since they do not generate network traffic for the HDFS data they transfer. Moreover, the end location of the last cache block in an HDFS block will be rounded up, even if there is no data past that point, since there is no way to determine the actual size of the block from the input traces.
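The segmentation rule above can be illustrated with a short sketch (the names are ours, not the simulator's): a read of size bytes starting at offset touches every fixed-size cache block it overlaps, and the last cache block always extends to a full cache-block boundary.

```java
// Hypothetical sketch of the segmentation rule described above.
public class CacheBlockSpan {
    // Returns the indices of the first and last cache blocks within an HDFS
    // block that are touched by a read of `size` bytes starting at `offset`.
    public static long[] touchedCacheBlocks(long offset, long size, long cacheBlockSize) {
        long first = offset / cacheBlockSize;
        long last = (offset + size - 1) / cacheBlockSize;
        return new long[] { first, last };
    }

    public static void main(String[] args) {
        // A 128 KB read at a 256 KB offset with 64 KB cache blocks touches
        // cache blocks 4 and 5.
        long[] span = touchedCacheBlocks(262144, 131072, 65536);
        System.out.println(span[0] + ".." + span[1]); // 4..5
    }
}
```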

The simulator will output results for each router in the topology in the following format after it completes its evaluations:

cacheBlockAccesses,cacheBlockHits,cacheBlockHitRatio,byteAccesses,byteHits,byteHitRatio

However, if there was no network read activity for that router, the simulator will output "No Activity" instead of these counts. In the results, routers are separated by layer of the fat-tree. Additionally, the total network traffic in bytes (with traffic counted once for each link it traverses) will be output at the end.
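A per-router output line of the form above could be produced by something like the following sketch (a minimal illustration, not the simulator's actual code):

```java
// Hypothetical sketch: format one router's counters into the CSV-style
// results line, or "No Activity" when the router saw no read traffic.
public class RouterStats {
    public static String formatLine(long cbAccesses, long cbHits,
                                    long byteAccesses, long byteHits) {
        if (cbAccesses == 0) {
            return "No Activity";
        }
        double cbRatio = (double) cbHits / cbAccesses;
        double byteRatio = (double) byteHits / byteAccesses;
        return cbAccesses + "," + cbHits + "," + cbRatio + ","
             + byteAccesses + "," + byteHits + "," + byteRatio;
    }
}
```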

Trace Format

Input traces to this simulator have one operation per line, with the following information written for each operation (separated by spaces):

  • Timestamp - We used this value to sort the operations chronologically before using them as input to the simulator, as the simulator will process operations in the order they appear in the file. As such, this can be any value if the exact timestamps are unavailable.
  • Operation type (READ or WRITE) - This is used to separate out read and write operations, since we are only interested in read operations.
  • Block ID (integer) - Unique HDFS block ID
  • Size (integer) - Size of the read operation (in bytes)
  • Offset (integer) - Starting offset of the read operation within the HDFS block (in bytes)
  • Source (integer) - Unique identifier of the reading client
  • Destination (integer) - Unique identifier of the DataNode storing the read block

For example, a read operation for a 128KB read starting at a 256KB offset in block 987654321 being read by host 123 from the DataNode running on host 456 could have the following format:

2019-09-19-11-44-00-00 READ 987654321 131072 262144 123 456
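Parsing one such line into its seven space-separated fields could look like this (a sketch with names of our choosing, not the simulator's internal representation):

```java
// Hypothetical sketch: parse one trace line in the format described above.
public class TraceRecord {
    public final String timestamp;
    public final boolean isRead;      // true for READ, false for WRITE
    public final long blockId;
    public final long size;           // bytes read
    public final long offset;         // starting offset within the HDFS block
    public final int source;          // reading client's host ID
    public final int destination;     // DataNode storing the block

    public TraceRecord(String line) {
        String[] f = line.split(" ");
        timestamp = f[0];
        isRead = f[1].equals("READ");
        blockId = Long.parseLong(f[2]);
        size = Long.parseLong(f[3]);
        offset = Long.parseLong(f[4]);
        source = Integer.parseInt(f[5]);
        destination = Integer.parseInt(f[6]);
    }
}
```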

Please note that our simulator does not take version ("generation stamp") into account. This is because none of our traces featured operations with multiple generation stamp values associated with a single block ID.

Topology Format

Originally, the topology file was used to allow the simulator to support multiple topologies. However, the latest version of the simulator has been hardcoded for the fat-tree topology. Instead, this file is used to validate the fat-tree generated at runtime to ensure that it is correct. The file we used to evaluate a fat-tree with k=8 can be found in topologies/fattree-128node-k8.txt.

Topology files have the following format:

  • Line 1: number of nodes (vertices) in the topology
  • Line 2: number of links (edges) in the topology
  • Lines 3+: source and destination pairs of nodes indexed from zero (separated by a space), one per line - These links are bidirectional, so one does not need to specify both "4 5" and "5 4".
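A reader for this file format could be sketched as follows (an illustration under the format described above; the class name and fields are ours):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: read a topology file in the format described above.
public class TopologyFile {
    public final int numNodes;
    public final List<int[]> links = new ArrayList<>();

    public TopologyFile(Reader in) throws IOException {
        BufferedReader br = new BufferedReader(in);
        numNodes = Integer.parseInt(br.readLine().trim());
        int numLinks = Integer.parseInt(br.readLine().trim());
        for (int i = 0; i < numLinks; i++) {
            String[] parts = br.readLine().trim().split(" ");
            // Each line lists one bidirectional link, so "5 4" need not
            // also appear as "4 5".
            links.add(new int[] { Integer.parseInt(parts[0]),
                                  Integer.parseInt(parts[1]) });
        }
    }
}
```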