################ Directory Layout ################ Problem A Solution .─────────────────────────────────────────────── ├── Apptio Infra.pdf # Coding problem handout ├── Dockerfile # Container installation for problem A ├── README ├── extra_thoughts.txt # Ideas/thoughts that didn't fit in README │ ├── configs # Go source files for configs pkg │ ├── configs.go # Parses json conf files │ └── configs_test.go │ ├── install # Scripts for installing if Docker is unavailable │ ├── configure.sh # Sets up environment on host machine │ ├── install.sh # Installs logserver on all host machines │ └── host_info.json # Details of all host machines to install logserver on │ ├── logserver # Go source files for main pkg │ ├── logserver.go # main() location │ ├── logserver_test.go │ └── conf.json # An example conf file used by main() │ ├── test # Automated testing scripts │ ├── small_test.sh # Runs 'go test' in each project Go pkg │ ├── medium_test.sh # Compiles & execs logserver, checking log retrieval │ └── test_conf.json # Conf file given to logserver on exec in medium_test └──────────────────────────────────────────────── ####################### DESIGN & IMPLEMENTATION ####################### For this assignment, I completed "Problem A - Self Service Data Access". There were a lot of vectors available for solving this problem, and after some brainstorming I decided to write an HTTP server in GoLang. This "logserver" is designed to execute on the main app's host machine, where it listens on a port for GET requests from the developers. In addition to the logserver, I wrote GoLang unit tests for the server, a bash script that checks the logserver's basic operability, some bash scripts for installation and configuration, and a Dockerfile. The logserver parses the main app log file and prints the contents in plain text for the developer. While another design could simply proxy the log file without any parsing, my design makes future additions to the logserver easier to implement. For example, limits can be placed on the number of log entries to return at any one time, preventing the logserver from hogging the network and resources of the host machine from clients of the main app. Also, caching of log entries can be easily added, again to prevent clients from seeing slowdowns due to the logserver hogging resources. For modularity, I designed the logserver to take a conf.json file that holds settings for the logserver that can be changed by the admin of the infrastructure. Because the main app is in production, the development time of the logserver must be short and the API used must be simple. Development time of the logserver was quick, thanks to GoLang's great HTTP library. We also don't need to worry about spending time teaching devs how to access the logserver, because any dev can use a regular web browser to read the log file. Security of the logserver depends greatly on the infrastructure of the organization. The key concern is that clients of the main app should be prevented from accessing the logserver. In a scenario where the devs and the main app host machine are all on the same internal network, the logserver can listen on a socket with a private IP4 address and a port number. WAN access to this socket can be prevented with port forwarding/NAT or an ACL residing on either on the gateway router or on the host machine itself. If the devs are not sharing a network with the machine, however, we would need authentication and authorization with either a PKI or Kerberos. The programe kubectl has the same security issue as my logserver - insider access for devs, but none for the clientele. Thus, if the main app in the problem is cloud-based, I'd have to choose one of these solutions: a) install keys for each dev b) write a client-side proxy server that operates in the organization's internal network, and install a key for that server in the worker nodes for the main app. c) have some kind of ticket authentication/authorization service like kerberos configured for the log server. Whatever the choice made, it is imperative that my logserver should only responds to GET requests originating from inside the organization. There were a few key problems with my logserver that arose in development, namely the potential size of the log file and the datetime format in the main app log file. ################### The Datetime format ################### I don't know how the <datetime logtime> is formatted. Go provides a time.Parse function that turns strings into Time values, but it needs either an example string or one of the constants in the time library. I thought about having a control flow switch if no example string is provided, meaning it would keep the datetime entry as a string, but since the other case would return a value of Time, I'd have to have two different structs for log entries. Go doesn't have unions/one-of types, so I could do something clever with interfaces here to have the code work in both cases. My current solution splits the log entries up by "," and assumes that the first comma is the first byte after the datetime string. If the datetime format used in the main app log file has commas in it, this will break the output format and any searchability. In the real world, this problem wouldn't be an issue because the format would be known ahead of runtime, and the logserver could refuse to run without a proper datetime format specified for parsing the log file. I can't allow a default example string format, because it's an assumption that could cause a lot problems if it were accidentally used. ################ Filesystem Reads ################ Presently, every request to the logserver reads the entire log file in from disk (or, more likely, from the kernel's file cache buffer). This will not scale if the log file to read gets to be terrabytes in size. While I didn't come up with a solution for this in time for submitting this project, I do have a few ideas for future work. #### Log file fits in memory We could load & parse the entire file at runtime & save it as a global array of log entries. At the same time, save the byte offset where we last encountered an EOF when parsing the log file (or the size of the file itself). Then whenever a request comes in to read the log file, we'd only need to check if the log file is bigger now than it was during the last parse, and if it was, lock the global array of log entries and update it. With this approach, a search by date of log entries wouldn't need to check for updates if the range of dates being searched is contained within the array of logentries. However, if searching for a word in a log message, the log file would still need to be checked for updates. - This introduces a few more problems, however. 1. If an attacker can't change the values in the log file itself, he/she could change the values in the global array of log entries in the logserver to cover his/her tracks. 2. a lock on the global array of logentries creates a new host of issues. If there are a ton of updates since the last read to parse and the CPU resources given to the logserver cgroup is low, devs accessing the logserver could spend a while waiting for a response. #### Log file too big for memory For every request, the logserver would have to read in chunks at a time of the log file, parse it, send it, delte from memory, and repeat. This would eliminate the possibility of concurrent requests, since a single client would be using all the memory available to the cgroup. In this situation, having a client-side proxy server would be almost mandatory. The proxy would have more RAM and CPU power available, giving it a better caching power and thus a better ability to serve concurrent requests. ##################### Further ideas/issues: ##################### - have the server take a conf file so the admin doesn't need a long string of flags for each startup of the program - big read requests of the log file could slow down the main app Possible solutions: - add a limit to how much a user can access at one time - only run the search when the CPU isn't under heavy load - before any requests, cache as much of the log file into memory as possible - cache results (both client side and server side) - depending on the OS, change the 'nice' level of the log server process - the server side app will need read access on the log file and any archived log files. - network traffic of sending log search results could impact performance for main app - upgrades to the server side program API shouldn't break clients that assume an older API - if using a proxy, could batch requests & send as one to the log server - filtering the log file could be done with functions passed in by devs - instead of passing a conf file location to main in argv[], could write a few CLI prompts that, when the server is first executed, ask the user for either a config file location AND/OR values to use instead. - add a way to restart the logserver if it gets a sigsegfault or sigkill or any other signal that terminates the process. Back in C you could use setjmp and sigsetjmp to save the process's environment early on in main, and jump back to that spot with a sig handler function that called siglongjmp. I'm curious if there's a way to do that in Go, or if such a solution is even allowed in a container. Definitely a question I'll be bringing up during my on-site interview. ############ ALTERNATIVES ############ - log.Fatal(http.ListenAndServe(":8080", http.FileServer(http.Dir("/usr/share/doc")))) If the log file were used in that http.Dir call, the logserver program would be ~10 lines of code. -problems: pretty brittle. If the log file gets large or archived, updates to the logserver would be needed to accomodate changes. However, it's the simplest & quickest solution to the problem we're trying to solve. - Give every dev root access to the main app host machine - Not an option - too many security issues, and could lead to devs installing unneeded tools/programs on the machine. - Modify the kernel by writing a new syscall that takes an inode and tells the kernel to never evict that inode's disk blocks from the kernel's internal page cache. However, it's doubtful that the log file would ever be evicted anyway, since the main app will never close the file descriptor pointing to that inode. - problems: - Downtime; such a patch would likely require a restart - If the kernel is open-source, non-proprietary, we'd have to fork from the mainline and have continuous maintenance on our new kernel, like how Google maintains Android. However, we could try to merge our code into the mainline if there - It's possible this is already an option, so we'd need to do some research to make sure we aren't reinventing the wheel - Complexity, time, resources required, and adding kernel code can introduce a great many bugs. - Write a caching client proxy server that forwards dev requests. If all devs go through the proxy, there will be a large reduction in CPU, RAM, disk I/O, and network resources used by the log server on the main app host machine. - problem: It's a little beyond the scope of the problem we're trying to solve, and introduces a security vulnerability. Without more information, it's impossible to know if caching log files external to the server is allowed. Also, a malicious actor could cover his/her tracks by changing values saved in the cache. - Add rsyslog to the main app host machine & copy the log files to an external database -problem: Not enough information is known about the situation. Rsyslog is more resource intensive than my small log server implementation - maybe smaller is better. - Add Kerberos to the network & check tickets whenever log files are requested. Adding authentication/authorization to the log server would allow for a much more dynamic API. Server admin specific commands could be added, e.g. change configs without having to restart the server. -problems: - increases complexity of the system as a whole - Even though main app customers wouldn't need to go through kerb, they might still see slowdowns. For instance, a dev might write a script that grabs log files every 10 seconds, and overhead of decrypting/encrypting those messages by the server app on the host machine could bog down the system. - This is beyond the scope of the main problem - getting the devs access to the log files. - Chmod the log files to a group on the host OS, give that group read privileges, give devs an account on the server they can SSH into, and add that account to the group. a cron job could be added to the host to add log archive files to that group with read privileges. -problems: - assumes the security policy is ok with devs having ssh access to the machine. - would be hard to track which dev is using the account and at what time - Add in a sandbox program to let devs pass in their own functions to filter the log -problem: complexity. Something like this would take a long time to build, test, and deploy. ############### DEPLOYMENT PLAN ############### Deployment of the logserver is depends upon what software is on the main app host machine. Because the main app is deployed and active, the main goal of deployment is keeping the app online for clients. Thus, a deployment plan relying on Docker images would not work if Docker wasn't already installed, since installing docker while the main app is deployed could introduce new bugs, security vulnerabilities, and resource utilization that wasn't planned for. However, Docker is quite popular, so I wrote a simple Dockerfile that can create an image for deployment. I also wrote bash scripts for automating testing and installation, in case Docker was unavailable. The deployment also depends upon the infrastructure of the system. For example, if the system is fault-tolerant with more than one machine running the main app, deployment needs to be run on each machine. Thus, my installation script is designed to read a host_info.json file containing the specifications of the targets. Without Docker, a deployment plan would need to take these steps: 0. Run unit tests to make sure everything still works 1. Compile a logserver binary that is arch/OS compatible with the main app host machine - save a hash value of the binary so we can periodically check the binary in deployment to see if it's been modified 2. scp the binary & other relevant files into the same namespace as the log files 3. use ssh [command] to run a script on the main host machine completing these steps: a) chmod/chown the binary - make sure the server binary isn't root - create a group that only the binary & the log files are part of - give the group read only privileges to the log files b) write either a cron job or a systemd service that adds any new or archived log files to the group c) have cron or systemd start/restart the server when needed d) create a chroot environment for the logserver binary inside the main app log directory e) add the logserver to a cgroup & limit it's CPU/RAM resource availability f) On the gateway machine/router connected to the WAN, modify NAT/port forwarding to prevent external logserver access, and/or add a rule to the ACL/iptables preventing access from WAN IP's to the port used by the server. - note: this would assume devs access the logserver from internal IPs, or there is a VPN tunnel setup inside the organization to grant devs internal IP addresses i) Make sure current user doesn't have root privileges, and execute the server binary with "chroot . /logserver" With Docker, deployment could consist of building an image with a Dockerfile, pushing that image to dockerhub, and pulling/running it on the main app host machine. If there are multiple machines, a script could be written to automate deployment. ########################### MONITORING/REPORTING/ALERTS ########################### Things to log: a) the config file used at runtime - the entire file itself, or just a few parameters if too long b) request IP origins for each interaction, possibly what they request as well c) any errors, of course d) could log user agents of the connecting devices, to focus future development on a specific hardware/software. Random, but got the idea from a CMU lecture. Alerts: - Send an alert if one IP address is making a ton of requests - having the main app be DDOSed by a log server wouldn't be fun to explain to clients of the main app. Monitoring: - use heartbeats from an external source to make sure the server is still running. In cloud architectures, I think that's what "circuit-breakers" are for, but as of 4/12/18 I still don't have a full view. In an old-school deployment, I could set up a cron job that uses netcat or some other net suite program to send a HEAD request to the log server. If the server didn't respond for a few requests, sound the alarms! Could send a text to a sysadmin. Go's logging design makes it easy to have a program's log file be located on a seperate machine. Output can be written to a socket, since log.SetOutput takes an io.Writer interface as the parameter. To ensure fault tolerance, we could do a dup system call on the logging file descriptor, causing writes to occur to both a local file and to the socket file. ############# CONFIGURATION ############# One area of parameters to set are cgroup settings: maximum memory allowed, max cpu resources, stuff like that, which would be set in the Dockerfile. There's also monitoring alert settings - where to send alerts, where and how often to send out heartbeat messages, and trigger levels for auto banning an IP to prevent DOS attacks. More program-specific configurations: a) port to bind on b) if caching is enabled, maximum cache size to be used c) location of the main app log file in file system for the log server d) location of the logserver config file e) which RPC methods to enable/disable f) Datetime format in the main app log: <datetime logtime> - dd/mm/yy, mm/dd/yy, etc. - the time pkg in Go has a time.Parse() that needs either the RFC number specifying string format, or an example string format. Either of these could be specified in configuration. Since there's no authentication/authorization of the incoming requests, every configuration value must be set on execution. If anyone could change log server settings via RPC without a/a, the settings would be impossible to keep stable. ####### TESTING ####### SMALL TESTS: For unit testing the Go code, I followed testing conventions reccomended by "The Go P.L.". Table-driven testing makes it easy to check corner cases and prevents the need to duplicate assertion logic [pg 306]. Running Go unit tests is also very easy - the "go test" command is much simpler than maintaining a Makefile for testing C code, and whenever "go test" has a failed test, the nonzero exit code will halt any bash script. Code coverage and benchmark testing are also simple to do with Go tools. MEDIUM TESTS: I wrote a bash script that checks the behavior of the logserver during runtime. The script: 1. compiles and runs the logserver 2. uses curl to make a call to /read on the logserver 3. displays the log file the server was supposed to read, and the results of the call to read. Because my logserver parses the log file, I couldn't use the diff tool to compare the webpage and the To make sure the API worked properly, I ran the logserver on localhost and in a local Docker container, and used a web browser to access the proper address:port/uri. LARGE TESTS: Prior to deployment, we need to run the logserver alongside the main app. We must: 1) Confirm that the two apps do not conflict and the logserver has proper access to the main app log files 2) Ensure the logserver does not slow down the main app when under heavy load, and tweak the cgroup settings as needed 3) Quantify the logserver's impact on network throughput - I've been reading that limiting network bandwidth of a cgroup/container is a tricky issue that hasn't yet been solved. We should still test the impact to get an idea of what might happen if we deployed the logserver. FURTHER TESTS: - When deployed, use noisy nmap scans from outside the organization to make sure the log server port is inaccessible from the WAN. - If the logserver takes queries, could use a fuzzer to try and find query strings that cause the program to crash ######### RESOURCES ######### For this assignment, I mostly stuck to examples in the Go pkg documentation at golang.org/pkg/ and examples in "The Go Programming Language". Staying away from code frameworks that I don't understand yet like Gorrilla Mux helped minimize my tech debt. Code sources: - Donovan, Alan A. A., and Brian W. Kernighan. "The Go Programming Language". New York: Addison-Wesley, 2016. "The Go Programming Language" - Example code from various packages at golang.org/pkg/* Scripting help: - https://stackoverflow.com/questions/878600/how-to-create-a-cron-job-using-bash-automatically-without-the-interactive-editor? - Barrett, Daniel J. "Linux Pocket Guide". O'Reilly, 2016. - Frisch, Aileen. "Essential System Administration". O'Reilly, 2002. - https://github.com/bbright024/proxy-web-server/blob/master/driver.sh Simple script for testing web servers, written by Dave O'Halloran of CMU Idea sources: - MIT 6.858 Security - Lecture 4: Privelege Seperation in Web Apps https://www.youtube.com/watch?v=XnBJc3-N2BU - "How to handle configuration in Go" https://stackoverflow.com/questions/16465705/how-to-handle-configuration-in-go? - Kubernetes Devs AMA https://www.reddit.com/r/kubernetes/comments/8b7f0x/we_are_kubernetes_developers_ask_us_anything/ - CMU's 15-418 Parallel Computer Architecture & Programming 2016 - Lecture 14 - Scaling a Website https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=3bb2f332-fbdb-4434-9f3b-0c2b3f9668c8