# Running Instructions

Running `make` inside the directory will build executables named `dnsserver`
and `httpserver`, which can then be run as follows:

```
./dnsserver -p port -n name
./httpserver -p port -o origin
```

where `port` is the port the DNS server should bind to, `name` is the domain it
should resolve, and `origin` is the address of the origin server from which the
`httpserver` should fetch its content.

# Briefly describe the design decisions you are making in your DNS server and 
HTTP server.

## DNS Server

The DNS server, for the scope of this project, is deliberately simple: it
resolves a single domain and returns a fixed IP. The server is written in
Python 3 and binds to the port it is given.

The server only resolves the domain provided to it via the `name` parameter;
for any other domain it responds with an empty answer section. It also only
supports questions of type A and class IN.
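
A rough sketch of this behaviour, assuming the `dnslib` library is used for
packet parsing (the actual server may build the DNS packets by hand, and the
name, port, and IP below are placeholder values):

```python
import socket
from dnslib import DNSRecord, RR, QTYPE, CLASS, A

NAME = "cs5700cdn.example.com."   # domain passed via -n (placeholder)
FIXED_IP = "203.0.113.10"         # hard-coded answer (placeholder)
PORT = 40015                      # port passed via -p (placeholder)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", PORT))

while True:
    data, addr = sock.recvfrom(512)
    request = DNSRecord.parse(data)
    reply = request.reply()
    q = request.q
    # Answer only A/IN questions for the configured name; for anything else
    # the answer section stays empty.
    if q.qtype == QTYPE.A and q.qclass == CLASS.IN and str(q.qname) == NAME:
        reply.add_answer(RR(q.qname, QTYPE.A, rdata=A(FIXED_IP), ttl=60))
    sock.sendto(reply.pack(), addr)
```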

## HTTP Server

The HTTP server is also written in Python 3.

It only supports requests for `/test.html`; all other paths receive a
404 Not Found. The caching strategy for the `httpserver` is straightforward:
the file is fetched from the origin and cached on the first request, and every
subsequent request is served from the cache.
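
A rough sketch of that request path, using only the standard-library
`http.server` and `urllib` modules (the origin URL and port below are
placeholder values, not the project's actual configuration):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

ORIGIN = "http://cs5700cdnorigin.example.com:8080"  # value of -o (placeholder)
PORT = 40015                                        # value of -p (placeholder)
cache = {}                                          # path -> response body

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/test.html":
            self.send_error(404, "Not Found")
            return
        if self.path not in cache:
            # First request: fetch the file from the origin and cache it.
            with urlopen(ORIGIN + self.path) as resp:
                cache[self.path] = resp.read()
        body = cache[self.path]
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", PORT), Handler).serve_forever()
```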

# How would you change your DNS server to dynamically return an IP, instead of 
a hard coded IP?

I would run another "measurement" server alongside the DNS server on the same
host; the DNS server would forward it the client IP from the request, and the
measurement server would respond with the IP it determines to be closest to
that client. Since the measurement server runs on the same host as the DNS
server, latency between the two is negligible. Ideally, the resolved response
would contain more than one IP, but for the scope of this project it is safe
to assume that the closest server will always be available.
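
As a sketch of how the DNS server might consult such a service, assuming a
hypothetical line-based protocol on a local port (the address, port, and
helper name are made up for illustration):

```python
import socket

MEASUREMENT_ADDR = ("127.0.0.1", 9090)  # hypothetical local measurement service

def resolve_dynamic(client_ip: str, fallback_ip: str) -> str:
    """Ask the local measurement service for the best CDN IP for this client."""
    try:
        with socket.create_connection(MEASUREMENT_ADDR, timeout=0.05) as conn:
            conn.sendall((client_ip + "\n").encode())
            best_ip = conn.recv(64).decode().strip()
            return best_ip or fallback_ip
    except OSError:
        # If the measurement service is unreachable, fall back to the fixed IP.
        return fallback_ip
```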

# Explain how you would implement the mapping of incoming client requests to 
CDN servers, if you had several servers to pick from. Notice the CDN servers 
are geographically distributed, and so are clients. Be specific about what 
kind of measurement system you would implement, where exactly the data would be
collected, and how you would then decide which server is the best option for a 
particular client. 

As mentioned in the previous answer, there will be a measurement server running
on the same host as the DNS and HTTP server. The measurement servers together
form a peer-to-peer distributed system with shared state; each one knows of the
existence and locations of the other `httpserver`s and measurement servers. It
will also split the CDN servers into zones based on their geolocation.

The measurement server will perform a geolocation lookup on the given client IP
(the "client IP" here is the requesting DNS resolver's IP; however, the project
states that the client and the name lookup will be on the same host) using one
of the many free services available online (depending on the space available on
the machine, it could also maintain a local GeoIP database). The server will
then match the client to the closest CDN server based on its zone; if there are
multiple CDN servers in the same (or closest) zone, one of them will be picked
at random. This resolved response will have a short TTL.
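
A sketch of the zone-matching step, assuming a `geolocate(ip)` helper backed by
whichever GeoIP source is used; the server list, coordinates, and the 100 km
zone slack below are placeholder values:

```python
import math
import random

# Hypothetical CDN server list: IP -> (latitude, longitude).
CDN_SERVERS = {
    "203.0.113.10": (40.71, -74.00),   # e.g. New York
    "203.0.113.20": (51.51, -0.13),    # e.g. London
    "203.0.113.30": (1.35, 103.82),    # e.g. Singapore
}

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 \
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(h))

def pick_server(client_ip, geolocate):
    client_loc = geolocate(client_ip)
    # Distance from the client to every CDN server.
    dists = {ip: haversine_km(client_loc, loc) for ip, loc in CDN_SERVERS.items()}
    best = min(dists.values())
    # Break ties among servers in the same (closest) zone at random.
    candidates = [ip for ip, d in dists.items() if d - best < 100]
    return random.choice(candidates)
```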

Subsequent requests from the same client (identified by IP) will be optimized:
each CDN server will perform active measurements by measuring its RTT to the
client (using ping). This can be done periodically throughout the day to
maintain an exponential moving average of the RTTs. The measurement servers
can track these reports and route the same client to the server with the
lowest RTT on subsequent requests.
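
The smoothing itself is just an exponential moving average; a minimal sketch,
with the smoothing factor as a value to be tuned:

```python
ALPHA = 0.2  # smoothing factor, to be tuned experimentally

# client IP -> smoothed RTT in milliseconds
ema_rtt = {}

def record_rtt(client_ip: str, rtt_ms: float) -> None:
    """Fold a new ping measurement into the client's moving average."""
    prev = ema_rtt.get(client_ip)
    ema_rtt[client_ip] = (
        rtt_ms if prev is None else ALPHA * rtt_ms + (1 - ALPHA) * prev
    )
```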

Finally, if time permitted, an additional metric I would use is server load:
the system could steer clients toward servers with lower load and comparable
RTTs. Each measurement server can analyse the traffic received by its
respective `httpserver` and share that state with its peers to determine load.

# Explain how you would implement caching for your HTTP server, if we were to 
send a range of requests to your HTTP server. What cache replacement strategy 
would you implement if content popularity followed a Zipfian distribution. 
How would your HTTP server respond to a request for a content that is not 
currently in the cache?

Since content popularity follows a Zipfian distribution, an LFU cache
replacement strategy would be used: the least frequently used content is
evicted from the cache first.

An optimization to this (depending on the amount of space available on the
host, and pending some experimentation on my end) is to use LFRU (Least
Frequent Recently Used), where the cache is split into two segments:
privileged and unprivileged. The most frequently used content lives in the
privileged segment; content is demoted from the privileged to the unprivileged
segment based on LRU, and content is evicted from the unprivileged segment
(and hence the cache) based on LFU. This way the most frequently and recently
accessed content stays in the privileged segment.
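
A compact sketch of that two-segment idea (the segment sizes are placeholders,
and a fuller version would also promote unprivileged entries back into the
privileged segment on re-access):

```python
from collections import OrderedDict, Counter

class LFRUCache:
    """Two segments: LRU demotion from 'privileged', LFU eviction from 'unprivileged'."""

    def __init__(self, privileged_size=8, unprivileged_size=32):
        self.privileged = OrderedDict()   # path -> body, ordered by recency
        self.unprivileged = {}            # path -> body
        self.freq = Counter()             # path -> access count
        self.p_size = privileged_size
        self.u_size = unprivileged_size

    def get(self, path):
        self.freq[path] += 1
        if path in self.privileged:
            self.privileged.move_to_end(path)   # refresh recency
            return self.privileged[path]
        return self.unprivileged.get(path)      # None on a miss

    def put(self, path, body):
        self.freq[path] += 1
        self.unprivileged.pop(path, None)       # avoid duplicates across segments
        self.privileged[path] = body
        self.privileged.move_to_end(path)
        if len(self.privileged) > self.p_size:
            # Demote the least recently used privileged entry.
            demoted, dbody = self.privileged.popitem(last=False)
            self.unprivileged[demoted] = dbody
        if len(self.unprivileged) > self.u_size:
            # Evict the least frequently used unprivileged entry.
            victim = min(self.unprivileged, key=lambda k: self.freq[k])
            del self.unprivileged[victim]
```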

For content not in the cache, the HTTP server will first fetch the content
from the origin, cache it based on the strategy mentioned above, and then
serve it to the client.

Again, an optimization here is to have the measurement servers also track the
contents of each server's cache; if another peer is closer than the origin,
the content can be fetched from that peer instead.
One other strategy I wanted to explore is optimistic caching, where the
measurement servers sync up periodically to share data about each host's cache
and ensure that all hosts hold the most frequently accessed content of the
entire group.