ldbc/ldbc_graphalytics

Implementing Driver

niusj opened this issue · 3 comments

niusj commented

hi,

I have developed a graph-processing platform in C/C++ and want to benchmark it with Graphalytics. I have some questions after reading the Manual: Implementing Driver page.

  1. After generating a directory named graphalytics-platforms-xgraph, should I just edit platform.properties and then run compile-benchmark.sh, load-graph.sh, execute-job.sh, unload-graph.sh, and terminate-job.sh one by one?

  2. Do I need to convert my original input graph to a format like example-directed.v and example-directed.e? If so, should I implement a converter like "$rootdir/bin/exe/genCSR" in load-graph.sh? And how do I generate the corresponding validation graph?

  3. In my platform, graph loading, partitioning, and computation form one coherent process. I'm not sure how to measure only the loading time with Graphalytics. Should I report it after that stage, or something else?

Hi @niusj,

  • (1.1) The config-template/platform.properties file serves only as a template that defines the properties a user can set (and their default values). You will likely not need to change this. You can create a distributable for your driver with mvn package; you will get a .tar.gz file that includes, among other things, Graphalytics, your driver, and the config-template. You can unpack that archive wherever you want to run the benchmark, create a copy of config-template called config, and edit platform.properties and the other property files there to set properties specific to your environment (e.g., the location of your platform binaries, of the graph datasets, etc.). A hypothetical example of such a config is sketched below, after this list.

  • (1.2) You should not manually run any scripts in bin/ other than run-benchmark.sh. The generated platform driver runs the other scripts at appropriate times in the execution of the benchmark process. You do need to edit execute-job.sh, load-graph.sh, terminate-job.sh and unload-graph.sh to perform their respective functions. If you have specific questions about any of these scripts, please let me know.

  • (2.1) If you want to add your own dataset to Graphalytics, you need to convert it to the generic vertex-list + edge-list format used by Graphalytics. During its execution, Graphalytics calls the platform-specific load-graph.sh script for every graph. The vertex and edge file paths are passed as input, and an output directory is specified where you can store the graph in a format specific to your platform. In your case, you would edit the load-graph.sh script to convert the vertex and edge files to a format that fits your platform; a sketch of such a conversion step is included after this list.

  • (2.2) The validation datasets we provide were generated by simply running all algorithms on a graph with several platforms that we know to have correct implementations. I would suggest validating your implementation against the datasets we provide, and disabling validation if you want to use Graphalytics on graphs that you add yourself.

  • (3) Graphalytics currently only distinguishes between the makespan and the processing time. The makespan, i.e., the total duration of running an algorithm including loading the graph and storing the results, is measured by the Graphalytics core. The processing time, as described in the Graphalytics specification, excludes the graph loading, partitioning, and result-storing time; it must be reported by the platform. A straightforward solution is to print the time (with millisecond precision) at the start and end of your computation phase to standard output. Graphalytics stores the output of each job (run with the execute-job.sh script) in a log file and calls the XgraphCollector::collectProcessingTime method provided by your driver to extract the processing time. Please see the (commented) example implementation of that method and implement it to parse the output of your system and retrieve the processing time; a sketch of such a parser is included after this list.
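To illustrate point (1.1), here is a rough sketch of what config/platform.properties for a hypothetical "xgraph" driver might look like after unpacking the distributable. The property keys shown here (platform.xgraph.home, platform.xgraph.intermediate-dir, platform.xgraph.num-workers) are placeholders; the actual keys are whatever your driver's config-template defines.

    # config/platform.properties (hypothetical sketch; use the keys from your own config-template)

    # Location of the xgraph platform binaries on the benchmark machine
    platform.xgraph.home = /opt/xgraph

    # Directory where load-graph.sh may store the platform-specific
    # representation of each graph
    platform.xgraph.intermediate-dir = /data/xgraph/graphs

    # Example of a platform-specific tunable exposed through the template
    platform.xgraph.num-workers = 4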
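For point (2.1), this is a minimal Java sketch of the kind of conversion step that load-graph.sh could invoke, assuming the vertex file has one vertex ID per line and the edge file has "source destination [weight]" per line. The class name and the output written here are only illustrative placeholders for whatever format your platform actually expects.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Hypothetical converter that load-graph.sh could invoke:
    //   java GraphConverter <input.v> <input.e> <output-file>
    public class GraphConverter {

        public static void main(String[] args) throws IOException {
            if (args.length != 3) {
                System.err.println("usage: GraphConverter <input.v> <input.e> <output-file>");
                System.exit(1);
            }

            // Count vertices; one vertex ID per line in the .v file.
            long numVertices = 0;
            try (BufferedReader vertices = Files.newBufferedReader(Paths.get(args[0]))) {
                while (vertices.readLine() != null) {
                    numVertices++;
                }
            }

            try (BufferedReader edges = Files.newBufferedReader(Paths.get(args[1]));
                 BufferedWriter out = Files.newBufferedWriter(Paths.get(args[2]))) {
                // Illustrative header; replace with whatever the platform needs.
                out.write("# vertices: " + numVertices);
                out.newLine();
                String line;
                while ((line = edges.readLine()) != null) {
                    String[] tokens = line.trim().split("\\s+");
                    if (tokens.length < 2) {
                        continue; // skip malformed or empty lines
                    }
                    // Keep source, destination, and (if present) the edge weight.
                    out.write(String.join(" ", tokens));
                    out.newLine();
                }
            }
        }
    }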
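For point (3), the following is a minimal sketch of the log-parsing logic, assuming your platform prints marker lines such as "Processing starts at: <epoch-millis>" and "Processing ends at: <epoch-millis>" to standard output. The marker strings and the method signature are assumptions for the sake of the example; check the generated XgraphCollector skeleton for the exact signature expected by the driver.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Sketch of the log-parsing logic behind collectProcessingTime. It assumes
    // the platform prints two marker lines to standard output, e.g.
    //   Processing starts at: 1509436800123
    //   Processing ends at: 1509436805456
    // where the numbers are epoch timestamps in milliseconds.
    public final class ProcessingTimeParser {

        private static final String START_MARKER = "Processing starts at:";
        private static final String END_MARKER = "Processing ends at:";

        public static long collectProcessingTime(Path logFile) throws IOException {
            long start = -1;
            long end = -1;
            List<String> lines = Files.readAllLines(logFile);
            for (String line : lines) {
                if (line.contains(START_MARKER)) {
                    start = parseTimestamp(line, START_MARKER);
                } else if (line.contains(END_MARKER)) {
                    end = parseTimestamp(line, END_MARKER);
                }
            }
            if (start < 0 || end < 0) {
                throw new IOException("processing time markers not found in " + logFile);
            }
            return end - start; // processing time in milliseconds
        }

        private static long parseTimestamp(String line, String marker) {
            String value = line.substring(line.indexOf(marker) + marker.length()).trim();
            return Long.parseLong(value);
        }
    }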

Hope this helps. Let me know if you have any other questions.

niusj commented

Thank you so much!

The detailed explanation about implementing a driver above helps a lot, and I have successfully implemented the Graphalytics driver for my own platform. I have also run the benchmark on PowerGraph and GraphX successfully. Unfortunately, I have run into some problems with Giraph.

Here are some logs from running the Graphalytics test benchmark on Giraph:

[driver.logs-graphalytics from Graphalytics]
2017-10-31 16:05:22,277 [runner-service-akka.actor.default-dispatcher-3] INFO [GiraphJob (run(270))] Waiting for resources... Job will start only when it gets all 2 mappers
2017-10-31 16:05:44,929 [runner-service-akka.actor.default-dispatcher-3] INFO [HaltApplicationUtils$DefaultHaltInstructionsWriter (writeHaltInstructions(79))] writeHaltInstructions: To halt after next superstep execute: 'bin/halt-application --zkServer cherry11:2181 --zkNode /_hadoopBsp/job_1509436473213_0001/_haltComputation'

[syslog from Hadoop]
Container 01:
2017-10-31 16:05:34,330 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1509436473213_0001_m_000000 Task Transitioned from SCHEDULED to RUNNING
2017-10-31 16:05:34,964 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1509436473213_0001: ask=1 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:7168, vCores:1> knownNMs=1
2017-10-31 16:05:36,623 INFO [Socket Reader #1 for port 49124] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1509436473213_0001 (auth:SIMPLE)
2017-10-31 16:05:36,659 INFO [IPC Server handler 0 on 49124] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID : jvm_1509436473213_0001_m_000002 asked for a task
2017-10-31 16:05:36,659 INFO [IPC Server handler 0 on 49124] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: jvm_1509436473213_0001_m_000002 given task: attempt_1509436473213_0001_m_000000_0
2017-10-31 16:05:44,759 INFO [IPC Server handler 1 on 49124] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1509436473213_0001_m_000000_0 is : 1.0
2017-10-31 16:05:53,908 INFO [IPC Server handler 5 on 49124] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1509436473213_0001_m_000000_0 is : 1.0
2017-10-31 16:06:06,047 INFO [IPC Server handler 4 on 49124] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1509436473213_0001_m_000000_0 is : 1.0
2017-10-31 16:06:15,183 INFO [IPC Server handler 15 on 49124] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1509436473213_0001_m_000000_0 is : 1.0
2017-10-31 16:06:24,312 INFO [IPC Server handler 16 on 49124] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1509436473213_0001_m_000000_0 is : 1.0
2017-10-31 16:06:33,436 INFO [IPC Server handler 7 on 49124] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1509436473213_0001_m_000000_0 is : 1.0
2017-10-31 16:06:45,558 INFO [IPC Server handler 24 on 49124] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1509436473213_0001_m_000000_0 is : 1.0
2017-10-31 16:06:54,668 INFO [IPC Server handler 0 on 49124] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1509436473213_0001_m_000000_0 is : 1.0
2017-10-31 16:07:03,772 INFO [IPC Server handler 3 on 49124] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1509436473213_0001_m_000000_0 is : 1.0
...
Container 02:
INFO 2017-10-31 16:05:39,954 [org.apache.giraph.master.MasterThread] org.apache.giraph.comm.flow_control.StaticFlowControl - StaticFlowControl: Limit number of open requests to 100 and proceed when <= 80
INFO 2017-10-31 16:05:39,957 [org.apache.giraph.master.MasterThread] org.apache.giraph.comm.netty.NettyClient - NettyClient: Using execution handler with 8 threads after request-encoder.
INFO 2017-10-31 16:05:39,962 [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster - becomeMaster: I am now the master!
INFO 2017-10-31 16:05:39,993 [main-EventThread] org.apache.giraph.bsp.BspService - process: applicationAttemptChanged signaled
WARN 2017-10-31 16:05:40,022 [main-EventThread] org.apache.giraph.bsp.BspService - process: Unknown and unprocessed event (path=/_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0/_superstepDir, type=NodeChildrenChanged, state=SyncConnected)
INFO 2017-10-31 16:06:10,042 [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster - checkWorkers: Only found 0 responses of 1 needed to start superstep -1. Reporting every 30000 msecs, 569920 more msecs left before giving up.
INFO 2017-10-31 16:06:10,042 [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster - logMissingWorkersOnSuperstep: No response from partition 1 (could be master)
INFO 2017-10-31 16:06:40,073 [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster - checkWorkers: Only found 0 responses of 1 needed to start superstep -1. Reporting every 30000 msecs, 539889 more msecs left before giving up.
INFO 2017-10-31 16:06:40,073 [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster - logMissingWorkersOnSuperstep: No response from partition 1 (could be master)
INFO 2017-10-31 16:07:10,103 [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster - checkWorkers: Only found 0 responses of 1 needed to start superstep -1. Reporting every 30000 msecs, 509859 more msecs left before giving up.
INFO 2017-10-31 16:07:10,103 [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster - logMissingWorkersOnSuperstep: No response from partition 1 (could be master)
INFO 2017-10-31 16:07:40,140 [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster - checkWorkers: Only found 0 responses of 1 needed to start superstep -1. Reporting every 30000 msecs, 479822 more msecs left before giving up.
INFO 2017-10-31 16:07:40,140 [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster - logMissingWorkersOnSuperstep: No response from partition 1 (could be master)
INFO 2017-10-31 16:08:10,170 [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster - checkWorkers: Only found 0 responses of 1 needed to start superstep -1. Reporting every 30000 msecs, 449792 more msecs left before giving up.

[zookeeper.out from Zookeeper]
2017-10-31 16:04:35,915 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /tmp/zookeeper/version-2 snapdir /tmp/zookeeper/version-2
2017-10-31 16:04:35,916 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@371] - LEADING - LEADER ELECTION TOOK - 228
2017-10-31 16:04:35,973 [myid:2] - INFO [LearnerHandler-/10.12.0.91:58531:LearnerHandler@346] - Follower sid: 1 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@522359b7
2017-10-31 16:04:36,024 [myid:2] - INFO [LearnerHandler-/10.12.0.91:58531:LearnerHandler@401] - Synchronizing with Follower sid: 1 maxCommittedLog=0x0 minCommittedLog=0x0 peerLastZxid=0x0
2017-10-31 16:04:36,024 [myid:2] - INFO [LearnerHandler-/10.12.0.91:58531:LearnerHandler@410] - leader and follower are in sync, zxid=0x0
2017-10-31 16:04:36,024 [myid:2] - INFO [LearnerHandler-/10.12.0.91:58531:LearnerHandler@475] - Sending DIFF
2017-10-31 16:04:36,057 [myid:2] - INFO [LearnerHandler-/10.12.0.91:58531:LearnerHandler@535] - Received NEWLEADER-ACK message from 1
2017-10-31 16:04:36,057 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@961] - Have quorum of supporters, sids: [ 1,2 ]; starting up and setting last processed zxid: 0x100000000
2017-10-31 16:04:54,800 [myid:2] - INFO [cherry12/10.12.0.92:3888:QuorumCnxManager$Listener@746] - Received connection request /10.12.0.93:47478
2017-10-31 16:04:54,801 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 3 (n.sid), 0x0 (n.peerEpoch) LEADING (my state)
2017-10-31 16:04:54,819 [myid:2] - INFO [LearnerHandler-/10.12.0.93:56706:LearnerHandler@346] - Follower sid: 3 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@784e2bf2
2017-10-31 16:04:54,846 [myid:2] - INFO [LearnerHandler-/10.12.0.93:56706:LearnerHandler@401] - Synchronizing with Follower sid: 3 maxCommittedLog=0x0 minCommittedLog=0x0 peerLastZxid=0x0
2017-10-31 16:04:54,846 [myid:2] - INFO [LearnerHandler-/10.12.0.93:56706:LearnerHandler@475] - Sending SNAP
2017-10-31 16:04:54,847 [myid:2] - INFO [LearnerHandler-/10.12.0.93:56706:LearnerHandler@499] - Sending snapshot last zxid of peer is 0x0 zxid of leader is 0x100000000sent zxid of db as 0x100000000
2017-10-31 16:04:54,872 [myid:2] - INFO [LearnerHandler-/10.12.0.93:56706:LearnerHandler@535] - Received NEWLEADER-ACK message from 3
2017-10-31 16:05:39,403 [myid:2] - INFO [SyncThread:2:FileTxnLog@203] - Creating new log file: log.100000001
2017-10-31 16:05:39,495 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x15f7175f23a0000 type:create cxid:0x1 zxid:0x100000002 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1509436473213_0001/_masterElectionDir Error:KeeperErrorCode = NoNode for /_hadoopBsp/job_1509436473213_0001/_masterElectionDir
2017-10-31 16:05:39,966 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x15f7175f23a0000 type:create cxid:0xe zxid:0x100000009 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0 Error:KeeperErrorCode = NoNode for /_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0
2017-10-31 16:05:40,003 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x15f7175f23a0000 type:create cxid:0x16 zxid:0x10000000c txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0/_superstepDir/-1 Error:KeeperErrorCode = NoNode for /_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0/_superstepDir/-1
2017-10-31 16:06:10,046 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x15f7175f23a0000 type:create cxid:0x22 zxid:0x100000010 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir
2017-10-31 16:06:10,065 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x15f7175f23a0000 type:create cxid:0x23 zxid:0x100000011 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir
2017-10-31 16:06:40,077 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x15f7175f23a0000 type:create cxid:0x26 zxid:0x100000012 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir
2017-10-31 16:06:40,094 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x15f7175f23a0000 type:create cxid:0x27 zxid:0x100000013 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir
2017-10-31 16:07:10,106 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x15f7175f23a0000 type:create cxid:0x2a zxid:0x100000014 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1509436473213_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir
...

I'm wondering whether the error comes from my ZooKeeper configuration. I can run Giraph on its own successfully, without configuring ZooKeeper myself. To run the Graphalytics benchmark on Giraph, I configured ZooKeeper and set platform.giraph.zoo-keeper-address in platform.properties. I have tried both a standalone and a replicated ZooKeeper configuration (3 replicas on 3 machines with the same zoo.cfg), and ZooKeeper starts successfully in both cases. When I run ./bin/sh/run-benchmark.sh from the Graphalytics driver for Giraph, it gets stuck and later reports a timeout error. Is there anything wrong with my ZooKeeper configuration, or with some other part of my setup? A sketch of the replicated setup I am describing is shown below.
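For reference, this is roughly what the replicated configuration looks like (hostnames are placeholders, not my actual machines; the same zoo.cfg is deployed on all three machines, each with its own myid file in dataDir), together with the property I set for the Giraph driver:

    # zoo.cfg (sketch with placeholder hostnames)
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/tmp/zookeeper
    clientPort=2181
    server.1=host1:2888:3888
    server.2=host2:2888:3888
    server.3=host3:2888:3888

    # platform.properties (pointing the Giraph driver at one of the replicas)
    platform.giraph.zoo-keeper-address = host1:2181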

Looking forward to your reply.

niusj commented

I can now run the benchmark on Giraph successfully.