anet: A Python repository from iweinbau

ANet
A platform to distribute commands over the computers in the KUL department of computer science (building A) through SSH (using public key authentication without passphrases).

Made by Wouter Baert
with some ssh/rsync/nc-commands from Maarten Baert.

========================================

INSTALLATION

I did not design ANet to be installed by others and did most of the installation process almost a year before writing this. Therefore this information may be incomplete.

1) Requirements: Linux for the epoll interface, Python 3, ssh, rsync, I think that's about it.
2) Since I felt uncomfortable security-wise to handle passwords, I decided to authenticate using public-key authentication. This means you need to add your own public SSH key (not necessarily the one from your department account) to the authorized_keys file in the ssh-folder on both the middle machine (at st.cs.kuleuven.be) and the end machines (the one you use at the department, one is enough since they share the same file system).
3) Add the following lines to your ssh config file (simply an alias used in several commands):
Host cs
HostName st.cs.kuleuven.be
User r0597343
4) Change the student_id constant at the top of both anet.py and anet_dependencies/anet_dispatcher.py to your student ID.
5) Add all the end machines to known_hosts. I don't have a good script for this, maybe you can run "anet --killall ueieubfuiwb" to just connect to every machine without actually killing anything. You will then probably be prompted to accept the fingerprint of each machine, which means you still have to accept manually but you all the rest is automated.
6) Add the path /home/rXXXXXXX/anet/src/ to the end machine, the next step will copy files into this folder.
7) Update the anet_dependencies on the end machines. The easiest way is to add the following script to your bin folder (I called it anet-update) and run it:
#!/bin/sh
rsync -e "ssh -o 'ProxyCommand ssh -q cs nc aalst.cs.kotnet.kuleuven.be 22' -l rXXXXXXX" --update --delete --archive /local_path_to_anet_dependencies/ cs-aalst:/home/rXXXXXXX/anet/src
This installs the dispatcher on the end machine.
8) Add the path /home/rXXXXXXX/anet/dependencies/ to the end machine, this is where your own dependencies will be copied (for example the program that needs to be executed).
9) Optional: you can add the following script to your bin folder to allow easy access to anet from anywhere on your machine:
#!/bin/sh
python3 path_to_anet.py "$@"

========================================

USAGE

anet <batch-file> [dependencies]
batch-file: Path to the text file containing the commands to be distributed. The batch file consists of one command per line.
dependencies: Optional. Path containing all dependencies necessary for execution. If this is a folder, should end with /. If this argument is present, the dependencies on the end machines will be updated, although this causes a bit of overhead, so it's advised to leave this out unless necessary.
The output is returned through stdout by returning a response every time a job successfully finishes. Each response is formatted as follows:
1) int: request_id
2) int: length of response
3) bytes: response
Both ints are 32-bit little-endian integers. Check example.py and example.java to see some example implementations of client processes.

anet --find <match-string>
Goes through all end machines and prints any processes it finds matching the match-string. Not tested yet.

anet --killall <match-string>
Goes through all end machines and prints and kills any processes it finds matching the match-string. Limited testing, but at least not too much can go wrong due to permissions I suppose. :)

========================================

FEATURES

I wrote ANet with a few things in mind:
- All remote communication is done over SSH, so no one can change the commands that will be executed or change/read their output. Other people with access to the end machines can however see what commands are being executed, so don't put sensitive information in there.
- There is only one SSH connection and dispatcher process per end machine, not per end core.
- Fault tolerance:
	- If a job crashes, it will be marked as unfinished and instead will be restarted (possibly somewhere else).
	- If a job runs really slowly, when most other jobs are done other cores will start working on the same job. Whichever core finishes first will decide the output, after which all other processes handling the same job are killed.
	- If a dispatcher crashes, it is restarted and its jobs are marked as unfinished. (I think).
	- If an SSH connection fails, it will be restarted with a bit of delay. There seems to be some DDOS protection or something so sending too many SSH requests in a short time gives bad results.
- Both the master process and the dispatcher processes each run in a single thread.
- Both the master process and the dispatcher processes don't use busy waits, thanks to the epoll interface. (However, in hindsight this point and the one above seems to inevitably lead to a theoretical deadlock, so possible improvements for more details).

========================================

POSSIBLE IMPROVEMENTS

In case anyone wants to improve the project, here are some ideas:
- Made the installation process more user-friendly.
- Create a built-in way to pause and resume batches. Currently the master process needs to be running continuously with a decent Internet connection (so no IP changes) while the batch is being processed.
- Create a detached mode where there isn't a master process anymore. Instead, after submitting the batch the end machines process jobs by themselves. This could be difficult due to fault tolerance, however maybe using the file system as a central mediator for storing job information can help (since it already has some fault tolerance built-in, e.g. when your process crashes data on the file system persists).
- For better fault tolerance: what if a dispatcher/SSH connection crashes while sending its output? The master process will end up in a part of the code where it expects a full response.
- Avoid deadlock, which can now technically occur if the master process wants to write something to a dispatcher while it's input buffer is full and the dispatcher wants to write something to the master process while its input is full. Luckily this is very unlikely at the moment since the dispatcher input buffer is mostly filled with commands (which don't take much space) so these will normally be processed well in time.
iweinbau/anet