/sheep

A scalable distributed graph partitioner. Ongoing research.

Primary LanguageC++OtherNOASSERTION

THIS IS STUDENT-QUALITY CODE, SO SERIOUSLY, READ THE MANUAL.

0. DEPENDENCIES
  cd ..
  git clone https://github.com/goatdb/llama.git
  # You also need some form of MPI library. I use openmpi.
  # This code depends on MMAP and so is non-portable;
  # we have a plan to change this if there is interest (contact dmargo).

1. QUICK START
  make -j4
  ./scripts/dist-partition.sh

2. USAGE: ./scripts/dist-partition.sh [options... -o $OUTPUT_FILE] $GRAPH $NUM_PARTITIONS
  $GRAPH may be a .net (SNAP) or a .dat (XSS/Graph500 binary) file.
  There is a snap2xss conversion utility in llama/utils
  By default, $GRAPH = test/hep-th.dat and $NUM_PARTITIONS = 2
  If $NUM_PARTITIONS = 0, then we skip the partitioning phase.

3. OPTIONS
  -h (home)     sets the sheep directory. By default, this is your current working directory.
  -w (workers)  sets the number of distributed workers. By default, this is 2.
  -t (trials)   sets the number of trials to run. By default, this is 1.
  -s (sequence) processes the graph in the given vertex order. By default, this is degree sort order.
  -o (output)   output the partitioned edges into appropriately-numbered files.
  -k (keep)     retains the partitioner's intermediate data. This is useful for debugging.
  -v (verbose)  gives verbose messages.

  -l (sLurm) sets up an appropriate environment for the Harvard Odyssey SLURM controller.
  -i uses MPI to sort the graph. This is faster, but will not work with out-of-memory graphs or -cs.
  -r uses MPI to reduce the graph. This is faster, but will not work with OOM graphs or -c.
  -c restricts the number of cores. This is useful for OOM graphs, but will not work with -ir.
  -a uses an affinity-based work assignment scheme. This will not work with -irc.
    This option is deprecated, but not yet retired.

4. EXAMPLES:
  To partition an in-memory graph using 6 cores:
    ./scripts/dist-partition.sh -w 6 -ir $GRAPH $NUM_PARTITIONS

  To partition an OOM graph in 10 parts using 2 cores:
    ./scripts/dist-partition.sh -w 10 -c 2 $GRAPH $NUM_PARTITIONS

  To partition a large graph using 16 cores on one Harvard Odyssey cluster machine:
    sbatch -p general -t 60 -N 1 -n 16 -c 1 --mem-per-cpu 15000 \
      ./scripts/dist-partition.sh -w 16 -sir $GRAPH $NUM_PARTITIONS

  To partition a large graph using 4x16 cores on several cluster machines:
    sbatch -p general -t 60 -N 4 -n 16 -c 1 --mem-per-cpu 15000 \
      ./scripts/dist-partition.sh -w 64 -sir $GRAPH $NUM_PARTITIONS

5. READING THE OUTPUT:
  If you write out partitions, you will not get an evaluation.
  This is because it is MUCH more expensive to evaluate the partitions than it is just to write them out.
  In most real applications you will not want to bother with the evaluator unless your graph is small.

  By default, the evaluator reports ~EVERY POSSIBLE EVALUATION METRIC.~
  This includes metrics that are not relevant to your particular partitioning problem.
  If you are running scripts/dist-partition.sh, then the metrics you care about are ECV(down) and its balance.

  The evaluator reports vertex-balanced partitioning metrics, edge-balanced partitioning metrics via hashing,
  and edge-balanced partitioning metrics via downward elimination tree assignment.
  However, if the partitioner was not trying to solve e.g. the vertex-balanced partitioning ~PROBLEM~,
  then of course the evaluator will report a bad vertex-balanced partitioning result.
  By default the partitioner solves the downward assignment problem (as described in our paper)
  because this consistently produces the best partition quality.
  If you want to solve a different partitioning problem, then you're going to need to pass the correct
  flags to partition_tree, which means you won't be able to use scripts/dist-partition.sh

  If there's interest then I'll add more flags to let you do this via dist-partitions.sh

6. DETAILS: BASIC BINARY USAGE AND MPI
  For normal users these details should be unnecessary. Just use scripts/dist-partition.sh

  scripts/dist-partition.sh merely strings together a few binaries in the manner expected by MPI, SLURM, etc.
  The important binaries are graph2tree and partition_tree,
  and to a lesser extent degree_sequence and merge_trees.
  These binaries merely string together a few Sheep library calls; all of the real code is in lib/.
  However, the easiest way to understand lib/ is to see how the code is used by these binaries.

  usage: graph2tree $GRAPH [-s $SEQUENCE] [-o $OUTPUT_FILE] [-p $NUM_PARTITIONS] [options...]
    graph2tree converts a $GRAPH into a tree in order $SEQUENCE.
    By default, $SEQUENCE is the degree sequence of $GRAPH.
    If given, then $SEQUENCE must be an ascii file with one vertex identity per line.
    A useful option is -f: this will tell you many parameters of the resulting tree.

    graph2tree may optionally write the tree to $OUTPUT_FILE
    If $NUM_PARTITIONS is given, then graph2tree will write the partitions to $OUTPUT_FILE instead.
    `mpirun -n $WORKERS ./graph2tree -ri $GRAPH -p $NUM_PARTITIONS -o $OUTPUT_FILE` is by far
    the fastest way to generate partitions and write them out for sufficiently large graphs.
    if you are looking to use Sheep for a high-performance application then this is what you need.

    By default graph2tree is serial, but it supports MPI through the -ir flags.
    Use -r for an MPI reduce. This is your bread-and-butter parallelism.
    ***You must give an input sequence, or else the result of the distributed reduce will be incoherent garbage.***
    ***You can use the -i option or the degree_sequence binary to obtain a whole graph degree sequence.***
    Example: mpiexec -np 6 ./graph2tree $GRAPH -s $INPUT_SEQUENCE -o $OUTPUT_TREE -r

    Use -i for an MPI degree sort. If $SEQUENCE is also given, then $SEQUENCE becomes an output file.
    This gives you a fast, distributed method to compute the graph's degree sequence.
    Example: mpiexec -np 6 ./graph2tree $GRAPH -s $OUTPUT_SEQUENCE -o $OUTPUT_TREE -ir

  usage: partition_tree [options...] [-g $GRAPH] $SEQUENCE $TREE $NUM_PARTITIONS
    partition_tree partitions $TREE into $NUM_PARTITIONS parts.
    If $GRAPH is given then these partitions are then evaluated on $GRAPH.
    ***THIS TAKES TIME; THE EVALUATION CODE IS INTENDED TO BE EXHAUSTIVE, NOT EFFICIENT.***

    partition_tree requires the $SEQUENCE of the $TREE.
    If $GRAPH is given and $SEQUENCE is "-" then $SEQUENCE will be the degree sequence of $GRAPH.
    If you used a different $SEQUENCE to create the tree, then make sure to give the same sequence here.
    This is why ./graph2tree -i writes an output sequence; it's so you can reuse the same sequence here.

6.1. DETAILS: OUT-OF-MEMORY GRAPHS
  graph2tree also supports partial graph loading. This is used to process out-of-memory graphs.
    -l n/k will load the n-th partial of k total partials (n is 1-indexed).
    ***You must give an input sequence, or else if you later merge_trees it will be incoherent garbage.
    ***You can use the degree_sequence binary to obtain a whole graph degree sequence.***
    Example:
      for PARTIAL in $(seq 6); do
        ./graph2tree $GRAPH -s $SEQUENCE -o "${OUTPUT_TREE}.${PARTIAL}" -l "${PARTIAL}/6"
      done
    See degree_sequence and merge_trees for more details.

  usage: degree_sequence [options..] $INPUT_GRAPH $OUTPUT_SEQUENCE
    This produces a degree sequence for $INPUT_GRAPH and writes it to $OUTPUT_SEQUENCE.
    This is generally unnecessary; for in-memory graphs either let graph2tree compute its own sequence
    or use the -i option, described above. This is for out-of-memory graphs.

  usage: merge_trees [options...] [-o $OUTPUT_TREE] $FIRST_TREE $SECOND_TREE
    Merges $FIRST_TREE and $SECOND_TREE, and optionally writes the result to $OUTPUT_TREE
    $FIRST_TREE and $SECOND_TREE must be produces of the same sequence, or the result is garbage.
    If you want to merge more than 2 trees you will need to call ./merge_trees repeatedly.
    Obviously this is annoying, which is why scripts/dist-partition.sh exists to do it for you.