uclasystem/dorylus

Execution is stuck at Epoch 1

CodingYuanLiu opened this issue · 8 comments

Dear maintainers,
I'm running dorylus on Lambda right now. The master graph server gets stuck at Epoch 1 and then returns:
"[ Node 0 ] [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:31.864Z 7caf9b09-14b0-4cc9-9ca9-edd041fea72a Task timed out after 600.02 seconds".
My Lambda prints no logs, so I have no idea what's going on. Could you help me solve the problem?
Thank you very much!

More details are presented below.

My Cluster

  • graphserver
    2 EC2 c5n.2xlarge us-east-1f

  • weightserver
    1 EC2 t2.medium us-east-1c

  • lambda
    Named "gcn". Memory 192MB, timeout 10min.

    It is configured to be in the same VPC as the graphservers and weightserver (but the GSs and WS are not in the same subnet, as shown above).

My Input Dataset and Configs

I don't have much machine learning background, so I use the simple example graph from the tutorial as the input dataset.

  • small.graph

    0 1
    0 2
    1 3
    2 4
    3 5
    
  • feature

    The feature dimension argument passed to ./prepare is 4.

    0.3, 0.2, 0.5, 0.7
    0.7, 0.5, 0.5, 0.3
    0.1, 0.2, 0.8, 0.7
    0.3, 0.4, 0.5, 0.1
    0.3, 0.4, 0.2, 0.1
    0.3, 0.6, 0.5, 0.8
    
  • labels

    I set the argument of ./prepare to 1

    0
    3
    2
    1
    2
    3
    
  • layerconfig (renamed to simplegraph.config later)

    4
    8
    10
    
  • Other configs
    All other configs (e.g., gserverport, nodeport) use the default value.

Commands and Full Logs

  • Command:

    • graphserver
      ./run/run-onnode graph simplegraph --l 5 --e 5
    • weightserver
      ./run/run-onnode weight simplegraph
  • Full Logs

    • graphserver

      [ Node 404 ]  Engine starts initialization...
      [ Node 404 ]  Parsed configuration: dThreads = 2, cThreads = 8, datasetDir = /filepool/simplegraph/parts_2/, featuresFile = /filepool/simplegraph/features.bsnap, dshMachinesFile = /home/ubuntu/dshmachines, myPrIpFile = /home/ubuntu/myprip, undirected = false, data port set -> 5000, control port set -> 7000, node port set -> 6000
      [ Node 404 ]  NodeManager starts initialization...
      [ Node   1 ]  Private IP: 172.31.73.108
      [ Node 404 ]  Engine starts initialization...
      [ Node 404 ]  Parsed configuration: dThreads = 2, cThreads = 8, datasetDir = /filepool/simplegraph/parts_2/, featuresFile = /filepool/simplegraph/features.bsnap, dshMachinesFile = /home/ubuntu/dshmachines, myPrIpFile = /home/ubuntu/myprip, undirected = false, data port set -> 5000, control port set -> 7000, node port set -> 6000
      [ Node 404 ]  NodeManager starts initialization...
      [ Node   0 ]  Private IP: 172.31.79.216
      [ Node   0 ]  NodeManager initialization complete.
      [ Node   0 ]  CommManager starts initialization...
      [ Node   1 ]  NodeManager initialization complete.
      [ Node   0 ]  CommManager starts initialization...
      [ Node   0 ]  CommManager initialization complete.
      [ Node   1 ]  CommManager initialization complete.
      [ Node   0 ]  Preprocessing... Output to /filepool/simplegraph/parts_2/graph.0.bin
      [ Node   1 ]  Preprocessing... Output to /filepool/simplegraph/parts_2/graph.1.bin
      Cannot open output file:/filepool/simplegraph/parts_2/graph.0.bin, [Reason: Permission denied]
      Cannot open input file: /filepool/simplegraph/parts_2/graph.0.bin, [Reason: No such file or directory]
      [ Node   0 ]  Finish preprocessing!
      [ Node   0 ]  <GM>: 0 global vertices, 0 global edges,
                      0 local vertices, 0 local in edges, 0 local out edges
                      0 out ghost vertices, 0 in ghost vertices
      [ Node   0 ]  No feature cache, loading raw data...
      [ Node   0 ]  Cannot open output cache file: /filepool/simplegraph/parts_2/feats4.0.bin [Reason: Permission denied]
      Cannot open output file:/filepool/simplegraph/parts_2/graph.1.bin, [Reason: Permission denied]
      Cannot open input file: /filepool/simplegraph/parts_2/graph.1.bin, [Reason: No such file or directory]
      [ Node   1 ]  Finish preprocessing!
      [ Node   1 ]  <GM>: 0 global vertices, 0 global edges,
                      0 local vertices, 0 local in edges, 0 local out edges
                      0 out ghost vertices, 0 in ghost vertices
                      [ Node   1 ]  No feature cache, loading raw data...
      [ Node   1 ]  Cannot open output cache file: /filepool/simplegraph/parts_2/feats4.1.bin [Reason: Permission denied]
      [ Node   1 ]  Engine initialization complete.
      [ Node   0 ]  Engine initialization complete.
      [ Node   0 ]  Number of epochs: 5, validation frequency: 1
      [ Node   0 ]  Sync Epoch 1 starts...
      [ Node   1 ]  Sync Epoch 1 starts...
      [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:31.864Z 7caf9b09-14b0-4cc9-9ca9-edd041fea72a Task timed out after 600.02 seconds
      [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:31.938Z 548bbee5-c454-479f-956c-baeb0e8e239b Task timed out after 600.02 seconds
      [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:31.993Z 5d6f2b65-2386-4169-9868-e43f0c5a0361 Task timed out after 600.10 seconds
      [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.001Z 402b364b-4579-40f4-94a5-03f5efde62c0 Task timed out after 600.10 seconds
      [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.004Z 285502c8-06e2-4585-8a81-983110962a8e Task timed out after 600.10 seconds
      [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.020Z a93af084-45b9-4a86-b0a9-6b3bcc37cb06 Task timed out after 600.10 seconds
      [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.032Z fe90565a-f474-4c29-84c9-1cae963f3d63 Task timed out after 600.09 seconds
      [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.053Z e40b2790-0c6a-4c28-9bec-dc2075b0a5ce Task timed out after 600.02 seconds
      [ Node   1 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.018Z f85d186c-56f3-4cc8-87bf-36aa28034f23 Task timed out after 600.10 seconds
      [ Node   0 ]  [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:32.075Z 627871c8-fa47-4812-af2e-11b9cc7da5fd Task timed out after 600.02 seconds
      
    • weightserver

      Killing existing 'weightserver' processes... 
      weightserver: no process found
      Running WEIGHT servers with: [ MARK # 12 ]...
      ./build/weightserver /home/ubuntu/dshmachines /home/ubuntu/myprip /home/ubuntu/gserverip 5000 65433 55431 /home/ubuntu/simplegraph.config /home/ubuntu/tmpfiles 1 1 0 GCN 0.01 0.02
      Binding weight server to tcp://*:65433...
      [ WS   0 ] Initializing nodes...
      [ WS   0 ] All weight servers connected.
      [ WS   0 ] Layer 0 - Weights: (4, 8)
      [ WS   0 ] Layer 1 - Weights: (8, 10)
      [ WS   0 ] All nodes up to date.
      [  INFO  ] Number of lambdas set to 10.
      
    • Lambda

      No logs appear in CloudWatch.

Hi,

I think there are two problems with the execution:

  1. For some of our experiments we hard-coded the region into the Lambda client, which you can see here. Change this to the region you are using; otherwise it will try to launch lambdas in the wrong region.

  2. It also seems that there is some problem with the graph server finding your graph files, as you can see from the output. Make sure that the directory structure is exactly as described in section 3 of the wiki, since the path is currently hard-coded. Also make sure that permissions on the directory are set correctly. Hopefully in a future update we can make the path less strict.

Hope this helps, and good luck!

Thank you very much for your reply! But there are still some problems.

  1. I had already changed the hard-coded lambda region, so that is probably not the cause.
  2. I changed the hard-coded graph file path in run/run-onnode, and it no longer reports the "Cannot open file..." errors.
    However, it now stops shortly after logging "Sync Epoch 1 starts...", reporting "Execution fails (at least on this node), check the log file".
    The log file does not record any errors. The output file is empty (although the file is created on the weight server).
    Lambda still does not produce any logs in CloudWatch.

Full logs

  • graphserver
Running GRAPH servers with: [ MARK # 35 ]...
./build/graphserver --datasetdir /home/ubuntu/filepool/simplegraph/parts_2/ --featuresfile /home/ubuntu/filepool/simplegraph/features.bsnap --labelsfile /home/ubuntu/filepool/simplegraph/labels.bsnap --dshmachinesfile /home/ubuntu/dshmachines --layerfile /home/ubuntu/simplegraph.config --pripfile /home/ubuntu/myprip --dataserverport 55431 --weightserverport 65433 --wserveripfile /home/ubuntu/wserverip --undirected 0 --tmpdir=/home/ubuntu/tmpfiles --cthreads 8 --dthreads 2 --dataport 5000 --ctrlport 7000 --nodeport 6000 --numlambdas 1 --numEpoch 10 --validationFrequency 1 --MODE 2 --pipeline 0 --staleness 4294967295 --gnn GCN --preprocess 0 --timeout_ratio 5
[ Node 404 ]  Engine starts initialization...
[ Node 404 ]  Parsed configuration: dThreads = 2, cThreads = 8, datasetDir = /home/ubuntu/filepool/simplegraph/parts_2/, featuresFile = /home/ubuntu/filepool/simplegraph/features.bsnap, dshMachinesFile = /home/ubuntu/dshmachines, myPrIpFile = /home/ubuntu/myprip, undirected = false, data port set -> 5000, control port set -> 7000, node port set -> 6000
[ Node 404 ]  NodeManager starts initialization...
[ Node   1 ]  Private IP: 172.31.73.108
[ Node   0 ]  NodeManager initialization complete.                                                                  
[ Node   0 ]  CommManager starts initialization...
[ Node   1 ]  NodeManager initialization complete.
[ Node   0 ]  CommManager starts initialization...
[ Node   0 ]  CommManager initialization complete.
[ Node   1 ]  CommManager initialization complete.
[ Node   0 ]  <GM>: 6 global vertices, 5 global edges,
                3 local vertices, 2 local in edges, 3 local out edges
                0 out ghost vertices, 1 in ghost vertices
[ Node   0 ]  Loading feature cache from /home/ubuntu/filepool/simplegraph/parts_2/feats4.0.bin...
[ Node   1 ]  <GM>: 6 global vertices, 5 global edges,
                3 local vertices, 3 local in edges, 2 local out edges
                1 out ghost vertices, 0 in ghost vertices
[ Node   1 ]  Loading feature cache from /home/ubuntu/filepool/simplegraph/parts_2/feats4.1.bin...
[ Node   1 ]  Engine initialization complete.
[ Node   0 ]  Engine initialization complete.                                                                       
[ Node   0 ]  Number of epochs: 10, validation frequency: 1
[ Node   0 ]  Sync Epoch 1 starts...
[ Node   1 ]  Sync Epoch 1 starts...
scp: /home/ubuntu/tmpfiles/output_*: No such file or directory
scp: /home/ubuntu/tmpfiles/output_*: No such file or directory
Check the output results in "~/outfiles/" folder.
Check the running log under "~/logfiles/" folder.
Execution fails (at least on this node), check the log file!
Killing existing 'graphserver' processes...
graphserver: no process found
graphserver: no process found
Killing existing 'weightserver' processes...
  • weightserver
Running WEIGHT servers with: [ MARK # 17 ]...
./build/weightserver /home/ubuntu/dshmachines /home/ubuntu/myprip /home/ubuntu/gserverip 5000 65433 55431 /home/ubuntu/simplegraph.config /home/ubuntu/tmpfiles 1 1 0 GCN 0.01 0.02
[ WS   0 ] Initializing nodes...
[ WS   0 ] All weight servers connected.
Binding weight server to tcp://*:65433...
[ WS   0 ] Layer 0 - Weights: (4, 8)
[ WS   0 ] Layer 1 - Weights: (8, 10)
[ WS   0 ] All nodes up to date.
[  INFO  ] Number of lambdas set to 10.
Check the output results in "~/outfiles/" folder.

By the way, if I use the ./run/run-onnode graph simplegraph cpu command, the output log is just the same; it returns the same error.

Fixing stuck at epoch 1

Hi, thanks for the detailed information.

So first, a question about running the CPU version. Just making sure: when you built Dorylus, did you build with the CPU option? The build compiles only the specific ResComm you select at compile time and unfortunately cannot use any others unless you recompile with different options first. If you have done this, then my other thought is that the graph might be too small. We use that graph as an example for data formatting but never actually ran on a graph of that size, so it could be causing weird memory issues.

I recommend trying the reddit-small dataset, as it is widely used and we have already used it with our system. You can find the dataset here.

If this doesn't work, let us know. We can try recreating some of the issues you are experiencing to better assist you.

Other things

There are just a few other things I want to point out in case they are problems:

  1. Make sure the instances you use have a security group that allows them to communicate with all other instances in the VPC. By default all ports are blocked.
  2. To get CloudWatch logs from Lambda you might have to attach a role ARN to the lambdas that has permission to write to CloudWatch, which will then generate a CloudWatch log group under /aws/lambda/<function-name> (see the sketch below).
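
If it helps, here is a minimal boto3 sketch of one way to do this. It assumes the function is named gcn, that it already has an execution role, and that your AWS credentials and region are configured; adjust as needed:

    # Sketch: attach the AWS-managed basic execution policy to the "gcn"
    # function's execution role so the Lambda can write to CloudWatch Logs.
    import boto3

    lam = boto3.client("lambda")
    iam = boto3.client("iam")

    # Look up the execution role currently attached to the function.
    role_arn = lam.get_function_configuration(FunctionName="gcn")["Role"]
    role_name = role_arn.split("/")[-1]

    # Grant CloudWatch Logs write permission via the managed policy.
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
    )
    print(f"Attached CloudWatch logging permissions to role {role_name}")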

Thank you very much for your reply! I downloaded the reddit dataset here and found that it consists of several files. However, these files' formats do not conform to the required format described in the dorylus wiki, 3. Prepare Input Dataset. Also, I cannot tell the reddit-large and reddit-small datasets apart in these files. Could you tell me how I can generate the required graph, feature, and label files from the dataset? Thank you.

More information

  1. After recompiling the graphserver with CPU enabled, the execution succeeds, but the output file is empty. The batch Acc and Loss values in the log are -nan, so there must be some problem in the execution. Maybe this can be solved later by using the reddit-small dataset?
    Part of the logs:
[ Node   0 ]  Sync Epoch 5 starts...
[ Node   1 ]  Time for epoch 4: 10.88ms
[ Node   1 ]  Sync Epoch 5 starts...
[ Node   1 ]  batch Acc: -nan, Loss: -nan
[ Node   0 ]  batch Acc: -nan, Loss: -nan
[ Node   1 ]  Time for epoch 5: 10.85ms
[ Node   1 ]  Sync Epoch 6 starts...
[ Node   0 ]  Time for epoch 5: 10.90ms
[ Node   0 ]  Sync Epoch 6 starts...
[ Node   0 ]  batch Acc: -nan, Loss: -nan
[ Node   1 ]  batch Acc: -nan, Loss: -nan
  1. I have set the security groups of the EC2 instances to open all ports, so this is not the problem.
  2. I re-ran the lambda version of the graphserver today and its error message remains the same. However, this time the gcn lambda is not even invoked, according to its log. I manually added a printLog(xxx) at the beginning of the function LambdaComm::invokeLambda in commmanager/lambda_comm.cpp, but it prints nothing.

Hi,

Glad to hear that you have set up the CPU version successfully. Regarding your questions,

0.1. You can refer to the DGL example (https://docs.dgl.ai/en/0.6.x/_modules/dgl/data/reddit.html) for the GNN background and how to prepare the dataset. I would suggest starting with their preprocessed dataset, which shares a similar (though still different) format with ours.
0.2. Unfortunately, the original reddit dataset has a different format from dorylus's. You have to write some scripts to convert the data into our format (see the sketch after this list). Basically, the graph file is simply the edge list of the graph, where each line represents a directed edge. The feature file consists of the vertex features, where each line is the feature vector of one vertex. The label file specifies which class each vertex belongs to, where each line is the class of the corresponding vertex. Our scripts should handle the remaining tasks and convert these text files into binary files.
0.3. The dataset here corresponds to reddit-small in the paper only. Reddit-large was constructed by ourselves from the raw reddit dataset.
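
As a rough illustration of the conversion in 0.2, a sketch like the following dumps DGL's RedditDataset into the text layout shown earlier in this thread (edge list, comma-separated features, one class index per line). The output file names and exact delimiters are assumptions on my side, so adjust them to whatever ./prepare expects:

    # Sketch: convert DGL's preprocessed Reddit dataset into plain-text
    # graph / feature / label files. File names and delimiters are assumptions.
    from dgl.data import RedditDataset

    g = RedditDataset()[0]            # one DGLGraph with features and labels attached
    src, dst = g.edges()

    with open("reddit.graph", "w") as f:
        for u, v in zip(src.tolist(), dst.tolist()):
            f.write(f"{u} {v}\n")     # one directed edge per line

    with open("features", "w") as f:
        for row in g.ndata["feat"].tolist():
            f.write(", ".join(f"{x:.6f}" for x in row) + "\n")

    with open("labels", "w") as f:
        for y in g.ndata["label"].tolist():
            f.write(f"{y}\n")         # one class index per line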

1.1. It's ok if your output file is empty. You should be able to find all essential information in the logs printed in the terminal.
1.2. The accuracy and loss are not reliable due to the synthetic dataset you are using. The simple graph in the doc is merely an example to show the format of the dataset files, not a real-world graph dataset. According to your log, you have run the CPU version successfully. So yes, reddit-small will give you reasonable accuracy numbers.

  1. There must be something wrong if your added debug messages didn't show up. I would suggest deleting everything on all servers (e.g., by running ./gnnman/send-source --force) and re-building dorylus when you switch between different backends. Given that you had run the dorylus lambda version and invoked lambdas before, I think this should at least help get your lambdas invoked.

Nits:

  1. I noticed that in your first post you ran the dorylus graph server with ./run/run-onnode graph simplegraph --l 5 --e 5. We typically pass the arguments with = in our experiments, like ./run/run-onnode graph simplegraph --l=5 --e=5. I am not sure whether this leads to any problem, but you can give it a try.
  2. With regard to the CloudWatch issue, I cannot think of any specific reason for now. Have you tried running a simple hello-world lambda function that prints something to CloudWatch, to check whether the setup is good?
  3. For your layerconfig, I would suggest setting the last layer dimension to the number of label classes, for example 4 for your simplegraph dataset.

Thank you very much for your help!

I used DGL's RedditDataset (i.e., the reddit-small dataset) as you recommended, and it now runs successfully in both CPU mode and Lambda mode (according to the master graphserver's log).

However, the outputs seem a little weird. Could you help me check whether the execution and outputs are correct?

Points that make me feel strange:

  1. Loss values in the log are all "-nan"
  2. Output files on the weightserver are empty
  3. Output files on the graphservers show that the time per stage is 0.000 ms for every stage.
  4. The Acc values in the logs remain unchanged after epoch 2. Does it converge that fast? Or does it cache the previous feature results?
  5. By the way, what are the expected outputs of Dorylus when the input is the reddit-small dataset?

Details

CPU version Dorylus

First, I prepared the reddit dataset and ran dorylus in CPU mode. Here are the relevant logs:

  • CPU version - log on graph servers
 [ Node   0 ]  Preprocessing... Output to /filepool/reddit/parts_2/graph.0.bin
[ Node   1 ]  CommManager initialization complete.
[ Node   1 ]  Preprocessing... Output to /filepool/reddit/parts_2/graph.1.bin
[ Node   1 ]  Finish preprocessing!
[ Node   1 ]  <GM>: 232965 global vertices, 114615892 global edges,
                113084 local vertices, 46862283 local in edges, 46862283 local out edges
                66256 out ghost vertices, 66256 in ghost vertices
[ Node   1 ]  No feature cache, loading raw data...
[ Node   1 ]  Engine initialization complete.
[ Node   0 ]  Finish preprocessing!
[ Node   0 ]  <GM>: 232965 global vertices, 114615892 global edges,
                119881 local vertices, 67753609 local in edges, 67753609 local out edges
                76601 out ghost vertices, 76601 in ghost vertices
[ Node   0 ]  No feature cache, loading raw data...
[ Node   0 ]  Engine initialization complete.
[ Node   0 ]  Number of epochs: 10, validation frequency: 1
[ Node   0 ]  Sync Epoch 1 starts...
[ Node   1 ]  Sync Epoch 1 starts...
[ Node   1 ]  batch Acc: 0.096304, Loss: -nan
[ Node   0 ]  batch Acc: 0.002169, Loss: nan
[ Node   0 ]  Time for epoch 1: 9040.82ms
[ Node   0 ]  Sync Epoch 2 starts...
[ Node   1 ]  batch Acc: 0.096480, Loss: -nan
[ Node   0 ]  batch Acc: 0.000584, Loss: nan
[ Node   0 ]  Time for epoch 2: 9126.88ms
[ Node   0 ]  Sync Epoch 3 starts...
[ Node   1 ]  Time for epoch 2: 9126.90ms
[ Node   1 ]  Sync Epoch 3 starts...
[ Node   1 ]  batch Acc: 0.096480, Loss: -nan
[ Node   0 ]  batch Acc: 0.000584, Loss: -nan
[ Node   0 ]  Time for epoch 3: 8823.63ms
[ Node   0 ]  Sync Epoch 4 starts...
[ Node   1 ]  Time for epoch 3: 8823.63ms
[ Node   1 ]  Sync Epoch 4 starts...
[ Node   1 ]  batch Acc: 0.096480, Loss: -nan
[ Node   0 ]  batch Acc: 0.000584, Loss: -nan
[ Node   0 ]  Time for epoch 4: 8680.90ms
[ Node   0 ]  Sync Epoch 5 starts...
[ Node   1 ]  Time for epoch 4: 8680.90ms
[ Node   1 ]  Sync Epoch 5 starts...
[ Node   1 ]  batch Acc: 0.096480, Loss: -nan
[ Node   0 ]  batch Acc: 0.000584, Loss: -nan
[ Node   0 ]  Time for epoch 5: 8451.58ms
[ Node   0 ]  Sync Epoch 6 starts...
[ Node   1 ]  Time for epoch 5: 8451.56ms
[ Node   1 ]  Sync Epoch 6 starts...
[ Node   1 ]  batch Acc: 0.096480, Loss: -nan
[ Node   0 ]  batch Acc: 0.000584, Loss: -nan
[ Node   0 ]  Time for epoch 6: 8459.82ms
[ Node   0 ]  Sync Epoch 7 starts...
[ Node   1 ]  Time for epoch 6: 8459.83ms
[ Node   1 ]  Sync Epoch 7 starts...
[ Node   1 ]  batch Acc: 0.096480, Loss: -nan
[ Node   0 ]  batch Acc: 0.000584, Loss: -nan
[ Node   0 ]  Time for epoch 7: 8452.07ms
[ Node   0 ]  Sync Epoch 8 starts...
[ Node   1 ]  Time for epoch 7: 8452.08ms
[ Node   1 ]  Sync Epoch 8 starts...
[ Node   1 ]  batch Acc: 0.096480, Loss: -nan
[ Node   0 ]  batch Acc: 0.000584, Loss: -nan
[ Node   0 ]  Time for epoch 8: 8504.51ms
[ Node   0 ]  Sync Epoch 9 starts...
[ Node   1 ]  Time for epoch 8: 8504.50ms
[ Node   1 ]  Sync Epoch 9 starts...
[ Node   1 ]  batch Acc: 0.096480, Loss: -nan
[ Node   0 ]  batch Acc: 0.000584, Loss: -nan
[ Node   0 ]  Time for epoch 9: 8764.25ms
[ Node   0 ]  Sync Epoch 10 starts...
[ Node   1 ]  Time for epoch 9: 8764.25ms
[ Node   1 ]  Sync Epoch 10 starts...
[ Node   1 ]  batch Acc: 0.096480, Loss: -nan
[ Node   0 ]  batch Acc: 0.000584, Loss: -nan
[ Node   1 ]  Time for epoch 10: 7847.84ms
[ Node   0 ]  Time for epoch 10: 8753.58ms
[ Node   0 ]  <EM>: Run start time: Tue Aug 10 12:22:58 2021
[ Node   0 ]  <EM>: Run end time: Tue Aug 10 12:24:26 2021
[ Node   0 ]  <EM>: Backend CPU
[ Node   0 ]  <EM>: Dataset: /filepool/reddit/parts_2/ (602, 128, 41)
[ Node   0 ]  <EM>: staleness: 4294967295
[ Node   0 ]  <EM>: 10 sync epochs and 0 async epochs
[ Node   0 ]  <EM>: Using 1 lambdas
[ Node   0 ]  <EM>: Initialization takes 134452.368 ms
[ Node   0 ]  Relaunched Lambda Cnt: 0
[ Node   0 ]  <EM>: Average  sync epoch time 8705.804 ms
[ Node   1 ]  <EM>: Average  sync epoch time 8615.241 ms
[ Node   1 ]  <EM>: Average async epoch time 0.000 ms
  • CPU version - outputs on graphservers
 <EM>: Run start time: Tue Aug 10 12:22:58 2021
<EM>: Run end time: Tue Aug 10 12:24:26 2021
<EM>: Using 1 lambdas
<EM>: Initialization takes 134452.368 ms
<EM>: Forward:  Time per stage:
<EM>    Aggregation    0  0.000 ms
<EM>    ApplyVertex    0  0.000 ms
<EM>    Scatter        0  0.000 ms
<EM>    ApplyEdge      0  0.000 ms
<EM>    Aggregation    1  0.000 ms
<EM>    ApplyVertex    1  0.000 ms
<EM>    Scatter        1  0.000 ms
<EM>    ApplyEdge      1  0.000 ms
<EM>: Total forward-prop time 0.000 ms
<EM>: Backward: Time per stage:
<EM>    Aggregation    2  0.000 ms
<EM>    ApplyVertex    2  0.000 ms
<EM>    Scatter        2  0.000 ms
<EM>    ApplyEdge      2  0.000 ms
<EM>    Aggregation    3  0.000 ms
<EM>    ApplyVertex    3  0.000 ms
<EM>    Scatter        3  0.000 ms
<EM>    ApplyEdge      3  0.000 ms
<EM>: Total backward-prop time 0.000 ms
<EM>: Average epoch time 8705.804 ms
<EM>: Final accuracy 0.000
Relaunched Lambda Cnt: 0
  • Outputs on the weight server are empty.

Lambda version Dorylus

After running the CPU version of Dorylus, I recompiled the graphservers with the Lambda backend, restarted the weightserver, and ran ./run/run-onnode graph reddit --l=5 --e=5 to run the Lambda version of Dorylus.

  • Lambda version - logs on graph server:
    Here is the end of the log:
...
[ Node   1 ]  [ FUNC ERROR ] Unhandled, RequestId: 530d9c9b-4527-4bc6-805d-d3ac25e08c72 Error: Runtime exited with error: signal: killed
[ Node   0 ]  curr 19631, timed 19235, chunk 5:B:0:3/3: vtx 1
[ Node   0 ]  Relaunch lambda 5:B:0:3/3: vtx 1 for layer 0...
[ Node   1 ]  curr 20335, timed 20045, chunk 5:B:0:3/8: vtx 1
[ Node   1 ]  Relaunch lambda 5:B:0:3/8: vtx 1 for layer 0...
[ Node   1 ]  curr 20349, timed 20045, chunk 5:B:0:4/9: vtx 1
[ Node   1 ]  Relaunch lambda 5:B:0:4/9: vtx 1 for layer 0...
[ Node   1 ]  Time for epoch 5: 40582.43ms
[ Node   0 ]  Time for epoch 5: 40789.55ms
[ Node   0 ]  <EM>: Run start time: Tue Aug 10 12:38:16 2021
[ Node   0 ]  <EM>: Run end time: Tue Aug 10 12:42:37 2021
[ Node   0 ]  <EM>: Backend LAMBDA:gcn
[ Node   0 ]  <EM>: Dataset: /filepool/reddit/parts_2/ (602, 128, 41)
[ Node   0 ]  <EM>: staleness: 4294967295
[ Node   0 ]  <EM>: 5 sync epochs and 0 async epochs
[ Node   0 ]  <EM>: Using 5 lambdas
[ Node   0 ]  <EM>: Initialization takes 2939.644 ms
[ Node   0 ]  Relaunched Lambda Cnt: 8
[ Node   0 ]  <EM>: Average  sync epoch time 52276.335 ms
[ Node   1 ]  <EM>: Average  sync epoch time 52234.906 ms
[ Node   0 ]  <EM>: Average async epoch time 0.000 ms
[ Node   1 ]  <EM>: Average async epoch time 0.000 ms
  • Lambda version - logs on weight servers
./build/weightserver /home/ubuntu/dshmachines /home/ubuntu/myprip /home/ubuntu/gserverip 5000 65433 55431 /home/ubuntu/reddit.config /home/ubuntu/tmpfiles 1 1 0 GCN 0.01 0.02
[ WS   0 ] Initializing nodes...
[ WS   0 ] All weight servers connected.
[ WS   0 ] Layer 0 - Weights: (602, 128)
[ WS   0 ] Layer 1 - Weights: (128, 41)
[ WS   0 ] All nodes up to date.
Binding weight server to tcp://*:65433...
[  INFO  ] Number of lambdas set to 10.
[ WS   0 ] Epoch 1, acc: 0.0530, loss: -nan
[ WS   0 ] w Local Layer 1 Current version: 1, active weight version cnt: 1 0:10 0:0 -> (1:0, )
[ WS   0 ] w Local Layer 0 Current version: 1, active weight version cnt: 1 0:10 0:0 -> (1:0, )
[ WS   0 ] Epoch 2, acc: 0.0548, loss: -nan
[ WS   0 ] w Local Layer 1 Current version: 2, active weight version cnt: 1 0:10 0:0 -> (2:0, )
[ WS   0 ] w Local Layer 0 Current version: 2, active weight version cnt: 1 0:10 0:0 -> (2:0, )
[ WS   0 ] Epoch 3, acc: 0.0548, loss: -nan
[ WS   0 ] w Local Layer 1 Current version: 3, active weight version cnt: 1 0:10 0:0 -> (3:0, )
[ WS   0 ] w Local Layer 0 Current version: 3, active weight version cnt: 1 0:10 0:0 -> (3:0, )
[ WS   0 ] Epoch 4, acc: 0.0548, loss: -nan
[ WS   0 ] w Local Layer 1 Current version: 4, active weight version cnt: 1 0:10 0:0 -> (4:0, )
[ WS   0 ] w Local Layer 0 Current version: 4, active weight version cnt: 1 0:10 0:0 -> (4:0, )
[ WS   0 ] Epoch 5, acc: 0.0548, loss: -nan
[ WS   0 ] w Local Layer 1 Current version: 5, active weight version cnt: 1 0:10 0:0 -> (5:0, )
[SHUTDOWN] Deleting workers
[SHUTDOWN] Server shutting down...
We are terminating the weight server
[SHUTDOWN] Closing ZMQ
Assertion failed: pfd.revents & POLLIN (src/signaler.cpp:261)
Check the output results in "~/outfiles/" folder.
  • Lambda version - outputs on graph servers
<EM>: Run start time: Tue Aug 10 12:38:16 2021
<EM>: Run end time: Tue Aug 10 12:42:37 2021
<EM>: Using 5 lambdas
<EM>: Initialization takes 2939.644 ms
<EM>: Forward:  Time per stage:
<EM>    Aggregation    0  0.000 ms
<EM>    ApplyVertex    0  0.000 ms
<EM>    Scatter        0  0.000 ms
<EM>    ApplyEdge      0  0.000 ms
<EM>    Aggregation    1  0.000 ms
<EM>    ApplyVertex    1  0.000 ms
<EM>    Scatter        1  0.000 ms
<EM>    ApplyEdge      1  0.000 ms
<EM>: Total forward-prop time 0.000 ms
<EM>: Backward: Time per stage:
<EM>    Aggregation    2  0.000 ms
<EM>    ApplyVertex    2  0.000 ms
<EM>    Scatter        2  0.000 ms
<EM>    ApplyEdge      2  0.000 ms
<EM>    Aggregation    3  0.000 ms
<EM>    ApplyVertex    3  0.000 ms
<EM>    Scatter        3  0.000 ms
<EM>    ApplyEdge      3  0.000 ms
<EM>: Total backward-prop time 0.000 ms
<EM>: Average epoch time 52276.335 ms
<EM>: Final accuracy 0.000
Relaunched Lambda Cnt: 8

Great, the system is now working. But you are right, it is not learning according to the posted output.

Questions 2, 3, & 5: sorry about the confusion. The output files generated are useless after the code was refactored for the artifact. You should only pay attention to the logs printed in the terminal:

  • graphserver: you only need the last few lines starting with "[ Node 0 ] <EM>". The information here mainly indicates the training time.
  • weightserver: the master weight server prints accuracy and loss per epoch, like "[ WS 0 ] Epoch <n>, acc: <accuracy>, loss: <loss>". That data should be enough to re-generate Figures 5 and 6 in the paper (a small parsing sketch follows this list). Refer to this wiki page for details.
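
For illustration, something like the sketch below can pull those numbers back out of the log; the log file path is a placeholder and the line format is assumed to match the example above:

    # Sketch: extract per-epoch accuracy/loss from the master weight server log.
    import re

    pattern = re.compile(r"Epoch (\d+), acc: ([^,]+), loss: (\S+)")

    with open("weightserver.log") as f:          # placeholder path
        for line in f:
            m = pattern.search(line)
            if m:
                epoch, acc, loss = m.groups()
                print(f"epoch {epoch}: acc={acc} loss={loss}")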

Questions 1 & 4: no, the loss shouldn't be nan, and the accuracy should increase per epoch if it is learning correctly. As Figure 5 in the paper shows, the accuracy should reach 0.90 quickly (within 20 epochs), while the loss decreases correspondingly.

A few suggestions:

  1. Use more lambdas for reddit-small (e.g., 80 lambdas per graph server). 5 lambdas might be too few to hold the data chunk and they kept timing out according to your logs.
  2. Double-check that the dataset is correct: e.g., that the feature file, label file, and graph file have matching vertex IDs, that features are loaded correctly, etc. (see the consistency-check sketch after this list).
  3. DGL also provides other small datasets like cora and pubmed, which might be good for debugging and checking the configuration.
  4. Refer to the scripts in benchmark/ to set the staleness and learning rate correctly.
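
For suggestion 2, a minimal consistency check over the text files could look like the sketch below. The file names are placeholders, and it only compares line counts against the vertex IDs referenced by the edge list:

    # Sketch: sanity-check that feature/label line counts agree and that every
    # edge endpoint has a corresponding feature row. File names are placeholders.
    def count_lines(path):
        with open(path) as f:
            return sum(1 for line in f if line.strip())

    num_feats = count_lines("features")
    num_labels = count_lines("labels")

    max_vid = -1
    with open("reddit.graph") as f:
        for line in f:
            if line.strip():
                u, v = map(int, line.split())
                max_vid = max(max_vid, u, v)

    assert num_feats == num_labels, "feature/label line counts differ"
    assert max_vid < num_feats, "edge list references a vertex with no feature row"
    print(f"{num_feats} vertices, max edge endpoint {max_vid}: looks consistent")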

Thanks for your help! I tried two more datasets from DGL: KarateClubDataset and cora.
KarateClub only runs successfully on the CPU version; the loss values are still nan when I use it with the lambda version.
But cora now runs on both the CPU and Lambda versions, and the Acc and Loss values are not nan. It is awesome!
Thank you for your help again! I can start my journey with Dorylus now.

Other information

I can share some observations from my debugging (although you might have found them already):
The nan values come from the softmax(z) function. For example, in the CPU version it is in src/graph-server/commmanager/CPU_comm.cpp; some vecSrc values are nan in the 6th line of the function.
The nan vecSrc values themselves come from the dot operation, i.e., Matrix z = feats.dot(weight);, in the 3rd line of CPUComm::vtxNNForwardGCN. I have no idea why this dot operation generates nan values (a general illustration of how this can happen is sketched below).
Since a working dataset is available, I don't need to dig into why the reddit and karateclub datasets cause nan "Loss" and "Acc" values anymore, but I'm worried that this problem will trouble me later.
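
For illustration only (this is numpy, not dorylus's code), here is a tiny sketch of how very large logits, e.g., from an exploding feats.dot(weight), turn into nan once they pass through a naive softmax:

    # Sketch: naive softmax overflows on large logits and yields nan,
    # while the standard max-shifted version stays finite.
    import numpy as np

    z = np.array([1000.0, 999.0, 998.0])                      # very large logits

    naive = np.exp(z) / np.exp(z).sum()                       # exp -> inf, inf/inf -> nan
    stable = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # subtract the max first

    print(naive)    # [nan nan nan]
    print(stable)   # [0.665... 0.244... 0.090...]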

Dataset open source

Could you tell me why Dorylus does not open-source a ready-to-use dataset? Maybe it's because this task is easy for most GNN developers? If needed, I'm glad to offer my scripts for generating a usable dataset, i.e., cora.