/apptio_challenge

Infrastructure Engineer Coding Problem solution

Primary LanguageGo

################
Directory Layout
################

Problem A Solution
.───────────────────────────────────────────────
├── Apptio Infra.pdf         # Coding problem handout
├── Dockerfile                 # Container installation for problem A
├── README                         
├── extra_thoughts.txt       # Ideas/thoughts that didn't fit in README
│
├── configs                  # Go source files for configs pkg
│   ├── configs.go              # Parses json conf files     
│   └── configs_test.go           
│
├── install                  # Scripts for installing if Docker is unavailable
│   ├── configure.sh            # Sets up environment on host machine
│   ├── install.sh              # Installs logserver on all host machines
│   └── host_info.json          # Details of all host machines to install logserver on
│
├── logserver                  # Go source files for main pkg
│   ├── logserver.go            # main() location
│   ├── logserver_test.go         
│   └── conf.json               # An example conf file used by main()
│  
├── test                     # Automated testing scripts     
│   ├── small_test.sh           # Runs 'go test' in each project Go pkg
│   ├── medium_test.sh          # Compiles & execs logserver, checking log retrieval
│   └── test_conf.json          # Conf file given to logserver on exec in medium_test
└────────────────────────────────────────────────

#######################
DESIGN & IMPLEMENTATION 
#######################

    For this assignment, I completed "Problem A - Self Service Data Access".
There were a lot of vectors available for solving this problem, and after some
brainstorming I decided to write an HTTP server in GoLang.  This "logserver" is
designed to execute on the main app's host machine, where it listens on a port
for GET requests from the developers.  In addition to the logserver, I wrote
GoLang unit tests for the server, a bash script that checks the logserver's
basic operability, some bash scripts for installation and configuration, and a
Dockerfile.    

    The logserver parses the main app log file and prints the contents in plain
text for the developer.  While another design could simply proxy the log file
without any parsing, my design makes future additions to the logserver easier
to implement.  For example, limits can be placed on the number of log entries
to return at any one time, preventing the logserver from hogging the network
and resources of the host machine from clients of the main app.  Also, caching
of log entries can be easily added, again to prevent clients from seeing
slowdowns due to the logserver hogging resources.  For modularity, I designed
the logserver to take a conf.json file that holds settings for the logserver
that can be changed by the admin of the infrastructure.

     Because the main app is in production, the development time of the
logserver must be short and the API used must be simple.  Development time of
the logserver was quick, thanks to GoLang's great HTTP library.  We also don't
need to worry about spending time teaching devs how to access the logserver,
because any dev can use a regular web browser to read the log file.

    Security of the logserver depends greatly on the infrastructure of the
organization.  The key concern is that clients of the main app should be
prevented from accessing the logserver. In a scenario where the devs and the
main app host machine are all on the same internal network, the logserver can
listen on a socket with a private IP4 address and a port number.  WAN access to
this socket can be prevented with port forwarding/NAT or an ACL residing on
either on the gateway router or on the host machine itself.  If the devs are
not sharing a network with the machine, however, we would need authentication
and authorization with either a PKI or Kerberos.  

    The programe kubectl has the same security issue as my logserver - insider
access for devs, but none for the clientele.  Thus, if the main app in the
problem is cloud-based, I'd have to choose one of these solutions:
    a) install keys for each dev
    b) write a client-side proxy server that operates in the organization's
       internal network, and install a key for that server in the worker nodes
       for the main app.
    c) have some kind of ticket authentication/authorization service like kerberos
       configured for the log server.

    Whatever the choice made, it is imperative that my logserver should only
responds to GET requests originating from inside the organization.

    There were a few key problems with my logserver that arose in development,
namely the potential size of the log file and the datetime format in the main app
log file.

###################
The Datetime format
###################
       I don't know how the <datetime logtime> is formatted.  Go provides a
   time.Parse function that turns strings into Time values, but it needs either
   an example string or one of the constants in the time library.  I thought
   about having a control flow switch if no example string is provided, meaning
   it would keep the datetime entry as a string, but since the other case would
   return a value of Time, I'd have to have two different structs for log
   entries.  Go doesn't have unions/one-of types, so I could do something clever
   with interfaces here to have the code work in both cases.
       My current solution splits the log entries up by "," and assumes that the
   first comma is the first byte after the datetime string.  If the datetime format
   used in the main app log file has commas in it, this will break the output
   format and any searchability.
       In the real world, this problem wouldn't be an issue because the format
   would be known ahead of runtime, and the logserver could refuse to run without
   a proper datetime format specified for parsing the log file.  I can't allow a
   default example string format, because it's an assumption that could cause a lot
   problems if it were accidentally used. 

################
Filesystem Reads
################

    Presently, every request to the logserver reads the entire log file in from
disk (or, more likely, from the kernel's file cache buffer).  This will not
scale if the log file to read gets to be terrabytes in size.  While I didn't
come up with a solution for this in time for submitting this project, I do have a 
few ideas for future work.  

    #### Log file fits in memory

       We could load & parse the entire file at runtime & save it as a global
    array of log entries.  At the same time, save the byte offset where we last
    encountered an EOF when parsing the log file (or the size of the file
    itself).  Then whenever a request comes in to read the log file, we'd only
    need to check if the log file is bigger now than it was during the last
    parse, and if it was, lock the global array of log entries and update it.

         With this approach, a search by date of log entries wouldn't need to
    check for updates if the range of dates being searched is contained within
    the array of logentries.  However, if searching for a word in a log
    message, the log file would still need to be checked for updates.
      - This introduces a few more problems, however.
	   1. If an attacker can't change the values in the log file itself,
              he/she could change the values in the global array of log entries
	      in the logserver to cover his/her tracks.
	   2. a lock on the global array of logentries creates a new host of
	      issues.  If there are a ton of updates since the last read to
	      parse and the CPU resources given to the logserver cgroup is low,
	      devs accessing the logserver could spend a while waiting for a
	      response.  

    #### Log file too big for memory 

            For every request, the logserver would have to read in chunks at a
        time of the log file, parse it, send it, delte from memory, and repeat.
        This would eliminate the possibility of concurrent requests, since a
        single client would be using all the memory available to the cgroup.
        In this situation, having a client-side proxy server would be almost
        mandatory. The proxy would have more RAM and CPU power available,
        giving it a better caching power and thus a better ability to serve
        concurrent requests.

#####################
Further ideas/issues:
#####################

     - have the server take a conf file so the admin doesn't need a long string of flags
        for each startup of the program
     - big read requests of the log file could slow down the main app
         Possible solutions:
           - add a limit to how much a user can access at one time
	   - only run the search when the CPU isn't under heavy load
	   - before any requests, cache as much of the log file into memory as possible
	   - cache results (both client side and server side)
	   - depending on the OS, change the 'nice' level of the log server process
     - the server side app will need read access on the log file and any
        archived log files.
     - network traffic of sending log search results could impact performance for main app
     - upgrades to the server side program API shouldn't break clients that assume
         an older API
     - if using a proxy, could batch requests & send as one to the log server
     - filtering the log file could be done with functions passed in by devs
     - instead of passing a conf file location to main in argv[], could write a few CLI prompts
       that, when the server is first executed, ask the user for either a config file location
       AND/OR values to use instead.
     - add a way to restart the logserver if it gets a sigsegfault or sigkill or any other
       signal that terminates the process.  Back in C you could use setjmp and sigsetjmp
       to save the process's environment early on in main, and jump back to that spot with
       a sig handler function that called siglongjmp.  I'm curious if there's a way to do
       that in Go, or if such a solution is even allowed in a container.  Definitely a
       question I'll be bringing up during my on-site interview. 
       

############
ALTERNATIVES
############

 - log.Fatal(http.ListenAndServe(":8080", http.FileServer(http.Dir("/usr/share/doc"))))
     If the log file were used in that http.Dir call, the logserver program would be
     ~10 lines of code.
        -problems:
           pretty brittle.  If the log file gets large or archived, updates to 
           the logserver would be needed to accomodate changes. However, it's the
           simplest & quickest solution to the problem we're trying to solve.
               
- Give every dev root access to the main app host machine
    - Not an option - too many security issues, and could lead to devs installing
      unneeded tools/programs on the machine.

- Modify the kernel by writing a new syscall that takes an inode and tells the kernel
    to never evict that inode's disk blocks from the kernel's internal page cache.
    However, it's doubtful that the log file would ever be evicted anyway, since the
    main app will never close the file descriptor pointing to that inode.
      - problems:
         - Downtime; such a patch would likely require a restart 
         - If the kernel is open-source, non-proprietary, we'd have to fork from
           the mainline and have continuous maintenance on our new kernel, like
           how Google maintains Android.  However, we could try to merge our code
           into the mainline if there
         - It's possible this is already an option, so we'd need to do some research
           to make sure we aren't reinventing the wheel
         - Complexity, time, resources required, and adding kernel code can introduce
           a great many bugs. 

- Write a caching client proxy server that forwards dev requests.  If all devs go through
   the proxy, there will be a large reduction in CPU, RAM, disk I/O, and network resources
   used by the log server on the main app host machine. 
   - problem:
        It's a little beyond the scope of the problem we're trying to solve,
        and introduces a security vulnerability. Without more information, it's
        impossible to know if caching log files external to the server is
        allowed. Also, a malicious actor could cover his/her tracks by changing
        values saved in the cache.
      
- Add rsyslog to the main app host machine & copy the log files to an external database
     -problem: 
        Not enough information is known about the situation.  Rsyslog is more resource
        intensive than my small log server implementation - maybe smaller is better. 

- Add Kerberos to the network & check tickets whenever log files are requested.
   Adding authentication/authorization to the log server would allow for a much
   more dynamic API.  Server admin specific commands could be added,
   e.g. change configs without having to restart the server.
     -problems:
        - increases complexity of the system as a whole
        - Even though main app customers wouldn't need to go through kerb, they
          might still see slowdowns.  For instance, a dev might write a script
          that grabs log files every 10 seconds, and overhead of
          decrypting/encrypting those messages by the server app on the host
          machine could bog down the system.

        - This is beyond the scope of the main problem - getting the devs
          access to the log files.
    
- Chmod the log files to a group on the host OS, give that group read
   privileges, give devs an account on the server they can SSH into, and add
   that account to the group. a cron job could be added to the host to add log
   archive files to that group with read privileges.
    -problems:
     - assumes the security policy is ok with devs having ssh access to the machine.
     - would be hard to track which dev is using the account and at what time

- Add in a sandbox program to let devs pass in their own functions to filter the log
     -problem:
        complexity.  Something like this would take a long time to build, test, and deploy.
        

###############
DEPLOYMENT PLAN
###############


     Deployment of the logserver is depends upon what software is on the main
app host machine.  Because the main app is deployed and active, the main goal
of deployment is keeping the app online for clients.  Thus, a deployment plan
relying on Docker images would not work if Docker wasn't already installed,
since installing docker while the main app is deployed could introduce new
bugs, security vulnerabilities, and resource utilization that wasn't planned
for.  However, Docker is quite popular, so I wrote a simple Dockerfile that
can create an image for deployment.  I also wrote bash scripts for automating
testing and installation, in case Docker was unavailable.  

     The deployment also depends upon the infrastructure of the system.  For
example, if the system is fault-tolerant with more than one machine running the
main app, deployment needs to be run on each machine.  Thus, my installation
script is designed to read a host_info.json file containing the specifications
of the targets.


Without Docker, a deployment plan would need to take these steps:

0. Run unit tests to make sure everything still works
1. Compile a logserver binary that is arch/OS compatible with the main app host machine
   - save a hash value of the binary so we can periodically check the binary in deployment
      to see if it's been modified
2. scp the binary & other relevant files into the same namespace as the log files
3. use ssh [command] to run a script on the main host machine completing these steps:
  a) chmod/chown the binary
    - make sure the server binary isn't root
    - create a group that only the binary & the log files are part of
    - give the group read only privileges to the log files
  b) write either a cron job or a systemd service that adds any new or archived
     log files to the group
  c) have cron or systemd start/restart the server when needed
  d) create a chroot environment for the logserver binary inside the main app log
     directory
  e) add the logserver to a cgroup & limit it's CPU/RAM resource availability
  f) On the gateway machine/router connected to the WAN, modify NAT/port
      forwarding to prevent external logserver access, and/or add a rule to the
      ACL/iptables preventing access from WAN IP's to the port used by the server.
        - note: this would assume devs access the logserver from internal IPs, or
          there is a VPN tunnel setup inside the organization to grant devs internal
          IP addresses
  i) Make sure current user doesn't have root privileges, and execute the server binary
     with "chroot . /logserver"


With Docker, deployment could consist of building an image with a Dockerfile,
pushing that image to dockerhub, and pulling/running it on the main app host
machine.  If there are multiple machines, a script could be written to automate
deployment.


###########################
MONITORING/REPORTING/ALERTS
###########################

Things to log:
  a) the config file used at runtime
       - the entire file itself, or just a few parameters if too long
  b) request IP origins for each interaction, possibly what they request as well
  c) any errors, of course
  d) could log user agents of the connecting devices, to focus future development
     on a specific hardware/software.  Random, but got the idea from a CMU lecture.
     
Alerts:
  - Send an alert if one IP address is making a ton of requests -
     having the main app be DDOSed by a log server wouldn't be fun to explain to
     clients of the main app.

Monitoring:
  - use heartbeats from an external source to make sure the server is still running.
      In cloud architectures, I think that's what "circuit-breakers" are for, but
      as of 4/12/18 I still don't have a full view. 
      In an old-school deployment, I could set up a cron job that uses netcat or
      some other net suite program to send a HEAD request to the log server.
      If the server didn't respond for a few requests, sound the alarms!  Could
      send a text to a sysadmin.

    Go's logging design makes it easy to have a program's log file be located
on a seperate machine.  Output can be written to a socket, since log.SetOutput
takes an io.Writer interface as the parameter.  To ensure fault tolerance, we
could do a dup system call on the logging file descriptor, causing writes to
occur to both a local file and to the socket file.


#############
CONFIGURATION
#############

One area of parameters to set are cgroup settings:
     maximum memory
allowed, max cpu resources, stuff like that, which would be set in the
Dockerfile.

    There's also monitoring alert settings - where to send alerts, where and how often
to send out heartbeat messages, and trigger levels for auto banning an IP to prevent
DOS attacks.

More program-specific configurations:
 a) port to bind on
 b) if caching is enabled, maximum cache size to be used
 c) location of the main app log file in file system for the log server
 d) location of the logserver config file 
 e) which RPC methods to enable/disable
 f) Datetime format in the main app log: <datetime logtime> - dd/mm/yy, mm/dd/yy, etc.
    - the time pkg in Go has a time.Parse() that needs either the
      RFC number specifying string format, or an example string format.
      Either of these could be specified in configuration.

    Since there's no authentication/authorization of the incoming requests, every
configuration value must be set on execution.  If anyone could change log
server settings via RPC without a/a, the settings would be impossible to keep stable.

#######
TESTING
#######

SMALL TESTS:
    For unit testing the Go code, I followed testing conventions reccomended by
"The Go P.L.".  Table-driven testing makes it easy to check corner cases and
prevents the need to duplicate assertion logic [pg 306].  Running Go unit tests
is also very easy - the "go test" command is much simpler than maintaining a
Makefile for testing C code, and whenever "go test" has a failed test, the
nonzero exit code will halt any bash script.  Code coverage and benchmark
testing are also simple to do with Go tools.

MEDIUM TESTS:
I wrote a bash script that checks the behavior of the logserver during runtime.
The script: 
  1. compiles and runs the logserver
  2. uses curl to make a call to /read on the logserver
  3. displays the log file the server was supposed to read,
      and the results of the call to read.
      
Because my logserver parses the log file, I couldn't use the diff tool to
compare the webpage and the 


To make sure the API worked properly, I ran the logserver on localhost
and in a local Docker container, and used a web browser to access the proper
address:port/uri.  

LARGE TESTS:
Prior to deployment, we need to run the logserver alongside the main app.
We must:
   1)  Confirm that the two apps do not conflict and the logserver has proper
       access to the main app log files
   2)  Ensure the logserver does not slow down the main app when under heavy
       load, and tweak the cgroup settings as needed
   3)  Quantify the logserver's impact on network throughput
         - I've been reading that limiting network bandwidth of a
           cgroup/container is a tricky issue that hasn't yet been solved.
           We should still test the impact to get an idea of what might happen
           if we deployed the logserver.
           

FURTHER TESTS:
 -  When deployed, use noisy nmap scans from outside the organization to make
      sure the log server port is inaccessible from the WAN.
 -  If the logserver takes queries, could use a fuzzer to try and find query
      strings that cause the program to crash

#########
RESOURCES
#########

For this assignment, I mostly stuck to examples in the Go pkg
documentation at golang.org/pkg/ and examples in "The Go Programming
Language".  Staying away from code frameworks that I don't understand
yet like Gorrilla Mux helped minimize my tech debt.  

Code sources:

 -  Donovan, Alan A. A., and Brian W. Kernighan. "The Go Programming Language".
    New York: Addison-Wesley, 2016.  "The Go Programming Language"

 -  Example code from various packages at golang.org/pkg/*

Scripting help:

 -  https://stackoverflow.com/questions/878600/how-to-create-a-cron-job-using-bash-automatically-without-the-interactive-editor?

 -  Barrett, Daniel J. "Linux Pocket Guide". O'Reilly, 2016.

 -  Frisch, Aileen. "Essential System Administration". O'Reilly, 2002. 

 -  https://github.com/bbright024/proxy-web-server/blob/master/driver.sh
     Simple script for testing web servers, written by Dave O'Halloran of CMU

Idea sources:

 -  MIT 6.858 Security - Lecture 4: Privelege Seperation in Web Apps
     https://www.youtube.com/watch?v=XnBJc3-N2BU

 -  "How to handle configuration in Go"
     https://stackoverflow.com/questions/16465705/how-to-handle-configuration-in-go?

 -  Kubernetes Devs AMA
     https://www.reddit.com/r/kubernetes/comments/8b7f0x/we_are_kubernetes_developers_ask_us_anything/

 -  CMU's 15-418 Parallel Computer Architecture & Programming 2016 - Lecture 14 - Scaling a Website
     https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=3bb2f332-fbdb-4434-9f3b-0c2b3f9668c8