Implementation of the HULA algorithm.
- bmv2
- mininet
- P4_16 compiler
- PI
- Generate a topology using the `topology-generation/fattree.py` script, or use the default one.
- Run `make` and wait for mininet to start up.
- Run `./controller.py` to configure the data plane.
The switches are now configured; however, at this point, the hosts don't have
routes to hosts outside their rack. To establish a route to a host `hn`:
- Run `xterm hn` from the mininet CLI.
- In the terminal, run `./test-scripts/probe.py`. This script sends probes from the current host every 1s.
This will send a probe from `hn` to every other node and establish a path to
`hn` from every other node. It will also establish a path to every other host
connected to the same ToR as `hn`. Repeat this for all nodes to establish
connections to them.
Use the `./test-scripts/receive.py` and `./test-scripts/send.py` scripts to view
incoming packets and to send packets to other hosts, respectively.
Since Fat tree topologies are commonly used in datacenters, the paper uses them for evaluation. However, HULA doesn't require any specific topology to work correctly. For this project, any topology with a notion of upstream and downstream links will be sufficient.
The project assumes the following naming scheme to distinguish upstream and
downstream links: a switch in the *n*th tier has the form `sm` where
`100 * n < m < 100 * (n + 1)`.
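For illustration, a tiny helper (the name is mine) shows how a switch's tier can be recovered from its name under this scheme:

```python
def switch_tier(name):
    """Return the tier of a switch named under the scheme above.

    A tier-n switch is called "sm" with 100 * n < m < 100 * (n + 1),
    so the tier is simply m // 100 (e.g. "s104" -> 1, "s211" -> 2).
    """
    return int(name.lstrip("s")) // 100
```

A neighbor in a higher tier can then be treated as the far end of an upstream link, and a neighbor in a lower tier (or a host) as a downstream one.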
The paper implements a smart form of multicast for probe replication that uses a notion of upstream and downstream links. The smart multicast implementation uses the aforementioned naming convention.
At the p4 level, the implementation uses `simple_switch_CLI` to program each
switch by generating configuration commands for each switch in `config/`. The
pseudocode for the smart multicast is:

```
if packet from downstream link:
    replicate on all links except ingress
else:
    replicate only on downstream links
```
This is compiled down into simple multicast groups: the controller creates a
mapping from `ingress_port` to a multicast group, and each multicast group
replicates the probe to the appropriate set of ports.
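As a sketch of that mapping (not the actual `controller.py` logic), per-ingress-port groups could be emitted as `simple_switch_CLI` commands roughly as follows; the helper and its arguments are illustrative:

```python
def multicast_commands(ports, downstream_ports):
    """Sketch: one multicast group per ingress port, following the
    replication pseudocode above.

    `ports` is every port on the switch, `downstream_ports` the subset
    facing downstream links/hosts. The group id is chosen to equal the
    ingress port (assumed nonzero), and node handles are assumed to be
    allocated sequentially from 0 by a freshly started bmv2 switch.
    """
    cmds = []
    node_handle = 0
    for ingress in sorted(ports):
        if ingress in downstream_ports:
            # Probe came from downstream: flood on every port except ingress.
            out_ports = [p for p in ports if p != ingress]
        else:
            # Probe came from upstream: replicate only on downstream ports.
            out_ports = sorted(downstream_ports)
        cmds.append("mc_mgrp_create %d" % ingress)
        cmds.append("mc_node_create %d %s"
                    % (node_handle, " ".join(str(p) for p in out_ports)))
        cmds.append("mc_node_associate %d %d" % (ingress, node_handle))
        node_handle += 1
    return cmds
```

Precomputing one group per ingress port keeps the data plane simple: the probe handler can select the multicast group directly from the ingress port.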
Note: The implementation of multicast groups here does not leverage the probe optimization from the paper (Section 4.5).
Note: There is also code in `util` to send the correct messages to the
software switch; however, bmv2 doesn't completely implement messages to the
Packet Replication Engine (PRE) (see this).
Probe packets are the main component of the HULA protocol. A probe packet originates from a source ToR switch and gets replicated throughout the network. The path utilization and the hop information from the probe are used to configure the best hop to get to the ToR where the probe originated.
The probe packet contains two headers: `dst_tor` and `path_util`. The parsing
and deparsing parts of the code are trivial for the new headers.
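For a rough idea of what a probe looks like on the wire, here is a Scapy sketch; the field widths, the EtherType, and the layering over Ethernet are assumptions rather than the exact format used by `probe.py`:

```python
from scapy.all import BitField, Ether, Packet, bind_layers, sendp

class HulaProbe(Packet):
    """Sketch of the probe header: the originating ToR id and the best
    path utilization seen so far. The 32-bit field widths are assumptions."""
    name = "HulaProbe"
    fields_desc = [
        BitField("dst_tor", 0, 32),
        BitField("path_util", 0, 32),
    ]

# Hypothetical EtherType for probes; the real value is fixed by the P4 parser.
bind_layers(Ether, HulaProbe, type=0x2345)

if __name__ == "__main__":
    # Broadcast a single probe announcing ToR 104 with zero initial utilization.
    probe = Ether(dst="ff:ff:ff:ff:ff:ff", type=0x2345) / HulaProbe(dst_tor=104, path_util=0)
    sendp(probe, iface="eth0")  # interface name is just an example
```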
The forwarding tables need to know the ToR ID associated with each switch in order to implement forwarding. Namely, the forwarding logic makes two decisions based on this information:
- If a data packet has reached its required ToR, forward to the rack switch.
- If not, use the destination ToR to find the next best hop.
In the implementation, this information is stored in two metadata fields:
`self_id` and `dst_tor`. The control plane configures the `get_dst_tor` table
to correctly extract this information for all switches.
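As a loose illustration (the action name, match key, and parameter order below are hypothetical, not taken from the P4 program), the controller could emit `table_add` commands along these lines:

```python
def get_dst_tor_commands(self_id, prefix_to_tor):
    """Sketch of simple_switch_CLI commands for a get_dst_tor-style table.

    `self_id` is this switch's own ToR id and `prefix_to_tor` maps a
    destination IPv4 prefix (e.g. "10.0.104.0/24") to the id of the ToR
    serving it. The action name `set_dst_tor`, the lpm key, and the
    parameter order are all assumptions for illustration.
    """
    return [
        "table_add get_dst_tor set_dst_tor %s => %d %d" % (prefix, self_id, dst_tor)
        for prefix, dst_tor in prefix_to_tor.items()
    ]
```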
HULA requires a notion of link utilization associated with each incoming port
on a switch. The link utilization is computed by the `update_ingress_statistics`
action using the formula `u = pk_l + u * (1 - dt/t)`, where `pk_l` is the size
of the packet, `dt` is the time since the last update, and `t` is a constant.
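In other words, the previous estimate decays toward zero as time passes, and each arriving packet adds its size on top. A minimal Python sketch of the same update (variable names are mine):

```python
def update_util(util, pkt_len, now, last_update, tau):
    """Decaying utilization estimate: u = pk_l + u * (1 - dt / t).

    `util` is the previous estimate, `pkt_len` the size of the arriving
    packet, `now` and `last_update` are timestamps, and `tau` is the
    constant `t` from the formula above.
    """
    dt = now - last_update
    return pkt_len + util * (1.0 - dt / tau)
```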
We'll be working with `h1`, `h10`, and the switches that connect them. If you'd like to
use a different pair of hosts, take a look at `./topology-generation/foo.png` to
figure out which switches connect them.
- Run `make` and wait for mininet to start up.
- Run `./controller.py` to configure the multicast groups and tables.
- Run `link s104 s211 down` and `link s104 s210 down` in the mininet CLI. This removes all paths from `h1` to `h10`.
- Run `xterm h1 h10 h10` in the mininet CLI to create one terminal for `h1` and two for `h10`.
- In the first `h10` terminal, run `./test-scripts/probe.py` to start sending probes from `h10`.
- In the second `h10` terminal, run `./test-scripts/receive.py` to display incoming packets.
- In the `h1` terminal, run the following to continuously send packets to `h10`. With the current setup, none of the packets will reach `h10`:

  ```
  ./test-scripts/send.py 10.0.104.10 "h1" -1
  ```

- Now run `link s104 s211 up` in the mininet CLI to bring the link back up. There is now a path from `h1` to `h10`.
- After a short delay, note that packets from `h1` have started arriving at `h10`. The HULA probes have detected and coped with the link failure!
To evaluate the port utilization of each switch under load, I wrote the
`benchmark.py` script, which snapshots the state of various registers on each
switch at runtime.
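For reference, a single register can be snapshotted by piping a `register_read` command into `simple_switch_CLI`; the sketch below (not the actual `benchmark.py` code) shows the idea, and the output parsing is an assumption about the CLI's print format:

```python
import subprocess

def read_register(register, thrift_port):
    """Read all cells of a bmv2 register by piping a `register_read`
    command into simple_switch_CLI.

    With no index, `register_read` dumps the whole array; the output
    format parsed here ("name= v0, v1, ...") is an assumption and may
    need adjusting for other bmv2 versions.
    """
    out = subprocess.run(
        ["simple_switch_CLI", "--thrift-port", str(thrift_port)],
        input="register_read %s\n" % register,
        capture_output=True, text=True,
    ).stdout
    for line in out.splitlines():
        if line.startswith("%s=" % register):
            return [int(v) for v in line.split("=", 1)[1].split(",")]
    return []
```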
For this evaluation I ended up creating a synthetic load to test the port utilization. To reproduce, follow these steps:
- Run `make` followed by `./controller.py` to set up the network.
- On hosts h1, h3, h5, h7, h9, h11, h13, and h15, run `./test-scripts/probe.py &` to start sending probes.
- Run `./test-scripts/send.py 10.0.104.9 'message'` on hosts h1, h3, h5, h7, and h11.
A more realistic load balancing test could be created using one of the datacenter TCP dumps. However, due to time constraints, I was not able to look into this.
For benchmarking, I collected 60 snapshots of the network state at an interval
of 0.5 seconds. The collected data is in `data/data.json`. Each snapshot contains
the current state of the `best_hops` register and the `port_util` register.
The HULA paper generated multiple graphs to compare the load balancing characteristics of the algorithm with the ECMP and CONGA schemes. For my evaluation, I generated a visualization of the best path from two hosts (h1 and h5) to h9. As the visualization shows, the load balancing scheme switches to paths with lower link utilization under load.
- The semantics of which variables are shared between the controls are not always clear from the program. For example, if I define an array in the ingress control, will it carry state across different invocations of the ingress? What makes registers unique in allowing them to carry state across invocations? This is what Andy Fingerhut from the p4 Slack had to say:

  > P4_16 header stack syntax looks like an array, and in many ways is, but only for arrays of values of some header type, and they do not preserve their value after finishing processing one packet. Like other header types, their value disappears after a packet is finished processing.
- No conditionals are allowed in table actions. I needed to manually turn the code into straight-line code. Painful.
- Built upon the skeleton code from homework03 for CS6114 fall 18, by Nate Foster, Cornell University.
- Fat tree topology generation code taken from the Frenetic repository.