
Standardize the docker pipeline


Let's finalize a few things:

  • the norms each service needs to follow to ensure it can be swiftly deployed to our infrastructure
  • the cloud platform we are deploying to (Elastic Beanstalk, EC2, Fargate, etc.)

Open to suggestions on what all we can add to this.

Framework evaluation - Kubernetes

  • Guarantee availability of specified no. of "replica" pods
  • Streamline rolling updates into production without downtime (a minimal Deployment sketch follows this list)
  • Horizontal auto-scaling based on resource consumption
  • Web UI dashboard to manage deployment, manage resources, & perform troubleshooting
  • Somewhat steep learning curve, but can be abstracted for most devs/users
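
As a rough illustration of the replica and rolling-update points above, a minimal Deployment sketch; the service name, image, and port are hypothetical placeholders, not anything from our repos. Horizontal auto-scaling would sit on top of this as a HorizontalPodAutoscaler targeting the Deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service            # hypothetical name, for illustration only
spec:
  replicas: 2                      # k8s keeps this many pods available at all times
  selector:
    matchLabels:
      app: example-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0            # never drop below the desired replica count during an update
      maxSurge: 1                  # bring up one extra pod while updating
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
      - name: example-service
        image: example/service:1.0.0   # placeholder image
        ports:
        - containerPort: 8000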

Sprint 1 Objectives

  • Primary: Deploy Kubernetes/Fargate for the Services Dev/Live server

Framework evaluation - Fargate

  • Manages EC2 instances so no manual provisioning required
  • EC2 instances run in dedicated VPC, not visible to users
  • Integrates with both ECS and EKS through a common CLI
  • Container lifecycle management completely under Fargate
  • Requires ECS Task Definitions to be stateless

Decision Points

  • Kubernetes: the most scalable, platform-agnostic framework for container orchestration, but with a conceptual and procedural learning curve. The medium-term plan includes scaling use cases, so going ahead with Kubernetes. Verdict - Greenlight
  • Fargate: the most hands-off framework, integrating all AWS scaling options into a one-stop serverless compute platform, but EC2 visibility is completely removed. Since EC2 literacy is not a problem for the team, that abstraction isn't needed, so not going ahead with Fargate for now. Verdict - Shelved

Sprint 1 Objectives

  • Primary: Deploy Kubernetes for Services Dev/Live server
  • Secondary: Evaluate options for cron job management and deploy sharechat scraper if possible (#2)

Deployed the Tattle services containers using minikube on the k8s-dev instance for the k8s PoC; testing of service health is in progress.

Important Evaluations for k8s:

  1. Handling of environment variables
  2. Production mgmt tool (currently front-runner is kops)
  3. API for EC2 auto-scaling
  4. Access restriction for service endpoints - "internal" vs external (k8s Ingress?)
  5. Single redis queue to be accessible from multiple instances of archive
  6. Fine-grained resource monitoring for each container when deployed in k8s

Kubernetes Deployment Status:
FCS and TGM services deployed and containers running. Currently implementing k8s Services to enable external access to Tattle service endpoints.
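
For reference, exposing a deployment externally boils down to a k8s Service selecting the pods by label. A minimal sketch (service name, labels, and ports are placeholders; on a minikube setup a NodePort type may be needed instead of LoadBalancer):

apiVersion: v1
kind: Service
metadata:
  name: fcs-service              # hypothetical name
spec:
  type: LoadBalancer             # or NodePort when no cloud load balancer is available
  selector:
    app: fcs                     # must match the labels on the FCS pods
  ports:
  - port: 80                     # port exposed to callers
    targetPort: 8000             # port the container actually listens on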

CICD Improvement - Redeployment Auto-trigger
As discussed with @dennyabrain, looking at increased automation of the current workflow, where new changes in services auto-trigger redeployment. Current workflow given below:

  • Service is pushed to Docker repo with appropriate tag
  • For new services, image name is updated in Dockerrun.aws.json
  • Github issue comments used to trigger auto-deployment via EB (doesn't work across repos)

Redeployment Auto-trigger Evaluations:

  • Check GitHub Actions and alternatives
  • Test Kubernetes rolling upgrade and rollback capabilities

Miscellaneous Evaluations:

  • Running multiple application components (API server, Web UI, etc.) each with its own Nginx server instance, and considerations when scaling to 10/20 components
  • Frameworks for breaking down data pipelines to standardized building blocks (for example, https://github.com/pditommaso/awesome-pipeline)

Sprint 2 Objectives:

  • Primary: Enabling external access to deployments via k8s Services
  • Secondary: Deploy Khoj services on k8s
  • Secondary: Evaluate redeployment auto-trigger options
    (additional Sprint 2 Objectives in #2)

Miscellaneous Evaluations:

  • Enabling check-in of environment variables and auto-apply options for environments

Sprint Objectives:

  • Primary: Deploy one service (currently SCS) as-is on k8s cluster on AWS using kops
  • Secondary: Determine ways for check-in of env. variables to get applied automatically during deployment
  • Secondary: Connect k8s to automated CICD pipeline
  • Secondary: Expose containers using services
  • Tertiary: Update labels and annotations in k8s deployment
  • Tertiary: Deploy the remaining ones of FCS, TGM & SCS on the k8s cluster

k8s Deployment Status Update:

  • Sharechat scraper is now deployed as-is on AWS k8s cluster using kops

Sprint Objectives:

  • Primary: Expose containers using services
  • Secondary: Test rolling updates for pods
  • Secondary: Connect k8s to automated CICD pipeline
  • Secondary: Determine ways for check-in of env. variables to get applied automatically during deployment
  • Tertiary: Update labels and annotations in k8s deployment
  • Tertiary: Deploy the remaining ones of FCS, TGM & WAS on the k8s cluster

k8s Deployment Status Update:

  • AWS k8s cluster was brought down and re-created using instances within free-tier
  • A k8s Service for exposing the container and an Ingress controller were created
  • Currently facing a port-mapping issue while attempting to access the exposed service

Sprint Objectives:

  • Primary: Expose containers using services
  • Secondary: Connect k8s to automated CICD pipeline
  • Tertiary: Perform rolling update of pods
  • Tertiary: Deploy one of KHJ/WAS/FCS/TGM on k8s cluster
  • Tertiary: Determine ways for check-in of env. variables to get applied automatically during deployment
  • Tertiary: Update labels and annotations in k8s deployment

I was taking stock of our current progress and deadlines. I want to propose some hard deadlines for this week (Jul 12) and check with you if you think it's practical. The main requirement is the ability to auto-deploy the Khoj API and the Sharechat scraper onto our infrastructure.
I guess the following will be the essential requirements for that:

  • Expose containers using services
  • Connect k8s to automated CICD pipeline
  • Determine ways for check-in of env. variables to get applied automatically during deployment

how do you think we are doing on this?

@dennyabrain plan for the week looks good (slight modification in the 3rd point: we will maintain the environment variables in deployment-specific YAML files on the deployment server, rather than checking them in to GitHub; a sketch of this approach is below)
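
As a sketch of that approach (names and values are placeholders, not our actual variables): the env file lives only on the deployment server as a ConfigMap (or a Secret for sensitive values), and the service's Deployment references it from its pod template.

# deployment-specific env file, kept only on the deployment server (not in GitHub)
apiVersion: v1
kind: ConfigMap
metadata:
  name: scs-env                    # hypothetical name
data:
  LOG_LEVEL: "info"                # placeholder values
  QUEUE_NAME: "scs-posts"
---
# the Deployment then pulls these in via its pod template, e.g.:
#   envFrom:
#   - configMapRef:
#       name: scs-env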

k8s Deployment Status:

  • Was able to expose containers using Services; this is currently working for SCS

Week Objectives:

  • Primary: Connect k8s to automated CICD pipeline (including specification of YAML files with env. vars.)
  • Primary: Auto-deploy KHJ API server onto k8s, in addition to SCS

Current Evaluations:

  • Does k8s support rolling deployments with a static image-name:version tag, or does it require a unique tag per release? (see the note after this list)
  • A GitHub Action satisfying the CD requirement (preferably k8s/SSH-based, or a custom one in the worst case)
  • How much downtime would there be if deleting and re-applying the deployment turns out to be the easiest option for now?
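
On the static-tag question: with a fixed image:tag the pod spec doesn't change, so re-applying the same manifest won't by itself roll the pods. The usual workarounds are a unique tag per build, bumping an annotation in the pod template, or (on newer kubectl versions) kubectl rollout restart. A fragment of the annotation approach, with a hypothetical annotation key:

# fragment of a Deployment's pod template: bumping this value changes the
# template, forcing a rolling update even when image:tag stays the same
spec:
  template:
    metadata:
      annotations:
        redeploy-timestamp: "20200713-1030"   # hypothetical key; set per deploy from CI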

k8s Deployment Status (20200713):

  • k8s was configured to trigger from Github actions
  • SCS now gets built, uploaded, and deployed to 2 k8s replica pods on commit
  • A basic PoC of end-to-end CICD is in place (an illustrative workflow sketch is below)
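
A rough outline of the kind of GitHub Actions workflow behind this PoC. The image name, deployment name, container name, and secrets are assumptions for illustration, and how the runner authenticates to the cluster is not recorded in this thread.

# .github/workflows/deploy.yml -- illustrative only
name: build-and-deploy
on:
  push:
    branches: [ master ]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Build and push image
      run: |
        echo "${{ secrets.DOCKER_PASSWORD }}" | docker login -u "${{ secrets.DOCKER_USERNAME }}" --password-stdin
        docker build -t tattle/scs:${GITHUB_SHA} .
        docker push tattle/scs:${GITHUB_SHA}
    - name: Roll out to the cluster
      # assumes a kubeconfig has been made available to the runner (e.g. via a secret);
      # the real pipeline may instead trigger a script on the deployment server
      run: |
        kubectl set image deployment/scs-rest scs=tattle/scs:${GITHUB_SHA}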

Next Steps:

  • Standardize the existing k8s pipeline to deploy multiple services (YAML structure, naming conventions)
  • Implement k8s CICD for next set of services - WAS, KAS
  • Explore multiple nodes, different size instances, etc. as ways of scaling specific services only
  • Explore options for health monitoring of servers, starting with k8s Web UI Dashboard (also part of #3)
  • Explore logs monitoring options (also part of #3)
  • Upgrade CICD to handle cron job deployment (will be done as part of #2)

Additional Considerations:

  • Specifying EBS provisioning during deployment
  • k8s container disk usage analysis and improvement
  • Mapping k8s services to Tattle URLs as a REST API
  • Enabling HTTPS on k8s-based Tattle URLs
  • k8s Labels and Annotations standardization
  • k8s deployments with persistent volumes (if reqd)
  • Check-in of environment variables into Github (if reqd)

(Contd. from the previous comment)

Next Steps for k8s Deployment:

  • Standardize the existing k8s pipeline to deploy multiple services (YAML structure, naming conventions)
  • Deployment of SCS Cron Job (a minimal CronJob sketch is given after this list)
  • Deployment of Khoj
  • Deployment of Archive Server + Redis Queue
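
For the SCS cron job item above, a minimal k8s CronJob sketch; the schedule, image, names, and command are placeholders, not the actual SCS configuration.

apiVersion: batch/v1beta1            # CronJob was still beta on 2020-era clusters
kind: CronJob
metadata:
  name: scs-cron                     # hypothetical name
spec:
  schedule: "0 */6 * * *"            # placeholder: every six hours
  concurrencyPolicy: Forbid          # don't start a new run while one is still active
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: scs-cron
            image: tattle/scs:latest            # placeholder image
            args: ["python", "scraper.py"]      # placeholder command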

Future Considerations:

  • Specifying EBS provisioning during deployment
  • Multiple nodes, different size instances, etc. as ways of scaling specific services only
  • Check auto-scaling nodes
  • k8s container disk usage analysis and improvement
  • Mapping k8s services to Tattle URLs as a REST API
  • Enabling HTTPS on k8s-based Tattle URLs
  • k8s Labels and Annotations standardization
  • k8s deployments with persistent volumes (if reqd)
  • Check-in of environment variables into Github (if reqd)

k8s Deployment Status:

  • SCS cron job is deployed and tested (awaiting testing for re-deployment with static Docker image tag)
  • SCS REST server CICD on k8s is implemented and tested
  • Khoj API is deployed and CICD on k8s is implemented and tested

Next Steps:

  • Deployment of Archive server on k8s, and CICD integration, including evaluation of redis queue deployment
  • Deployment of SCS Luigi and CICD integration

Future Considerations (contd. from previous comment):

  • Creating separate cluster for Production deployments as using current as Dev deployment
  • Streamlining k8s and CICD pipelines for multiple versions (dev, prod) of multiple services
  • Streamlined solution for user access control on any UI screens to be exposed publicly

A k8s cluster node failed on Friday due to insufficient CPU. The cluster was taken down; however, re-creation initially failed due to a kops/kubectl version mismatch and master-node incompatibility with the AZ. The cluster was later recreated successfully and the CICD pipelines were re-linked with the new naming conventions.

k8s Deployment Status:

  • Cluster has been recreated with better load handling
  • CICD on k8s for SCS cron job, SCS REST server, and Khoj API is implemented and tested

Next Steps:

  • Deployment of Archive server on k8s, and CICD integration, including evaluation of redis queue deployment
  • Deployment of SCS Luigi and CICD integration

Future Considerations:

  • Specifying EBS provisioning during deployment
  • Multiple nodes, different size instances, etc. as ways of scaling specific services only
  • Check auto-scaling nodes
  • k8s container disk usage analysis and improvement
  • Mapping k8s services to Tattle URLs as a REST API
  • Enabling HTTPS on k8s-based Tattle URLs
  • k8s Labels and Annotations standardization
  • k8s deployments with persistent volumes (if reqd)
  • Check-in of environment variables into Github (if reqd)
  • Creating separate cluster for Production deployments as using current as Dev deployment
  • Streamlining k8s and CICD pipelines for multiple versions (dev, prod) of multiple services
  • Message/job queue native k8s solution
  • Streamlined solution for user access control on any UI screens to be exposed publicly

k8s Deployment Status:

  • Archive server ReplicaSet has been deployed successfully on the k8s dev cluster, with a single redis pod
  • With this, all the primary PoCs for Kubernetes are completed, and the basic streamlined deployment pipeline is in place

Next Steps:

  • Cost analysis of cloud deployments (includes resource utilization analysis, if required)
  • Mapping k8s services to Tattle URLs as a REST API (HTTPS-enabled)
  • Create Production cluster after mid-August, based on resource utilization and costing

Future Considerations - if reqd. based on Costing Analysis:

  • k8s container disk usage analysis and improvement
  • Specifying EBS provisioning during deployment

Future Considerations - if reqd. based on Resource Utilization:

  • Check auto-scaling nodes
  • Multiple nodes, different size instances, etc. as ways of scaling specific services only

Future Considerations - Medium Priority:

  • Message/job queue native k8s solution (KubeMQ, RabbitMQ)
  • k8s deployments with persistent volumes
  • Streamlining of environment variables specification for deployments (only if quick solution exists)
  • Streamlined solution for user access control on any UI screens to be exposed publicly (if required)
  • k8s Labels, Annotations, and selector-based deployments (only for tightly-coupled services?)

k8s Deployment Status:

  • Archive server started throwing a lot of errors (given below), most likely due to multiple instances talking to the same redis pod
  • Archive server was redeployed as a single instance to an empty node, and the redis server was redeployed to the same node

Archive Server Error Log:

BRPOPLPUSH { ReplyError: READONLY You can't write against a read only replica.  
    at parseError (/home/node/app/node_modules/redis-parser/lib/parser.js:179:12)
    at parseType (/home/node/app/node_modules/redis-parser/lib/parser.js:302:14)
  command:
   { name: 'brpoplpush',
     args:
      [ 'bull:Whatsapp Post Index Queue:wait',
        'bull:Whatsapp Post Index Queue:active',
        '5' ] } }

Cluster Creation Overview:

  • Development - This would be a minimal cluster of 2-3 nodes, running all the services that are in active development. Ideally there would be just 1 pod per service here, since availability and downtime are not concerns. If a repo sees a lot of commits during such a phase, a better CICD pipeline might be to delete the corresponding k8s Deployment and re-trigger it; this would ensure that only 1 tag has to be configured. This pipeline would need to be tested, though.
  • Production - All services that are being used by external stakeholders, and all services that are part of official internal pipelines. Here, services with zero-downtime requirements can have 2 pods, while internal services might be able to make do with just 1.
  • Zero-downtime Services - Archive Server (multiple services), Factcheck Scraper (Khoj App), Khoj API Server (Khoj App)

General Best Practices Analysis (Before Cluster Re-creation):

  • Check vertical and horizontal scaling of both Pods and Nodes; if Nodes can be added later, we can start with lean clusters
  • Check auto-scaling and load-balancing in response to high loads
  • Check reducing default EBS provisioning during deployment
  • Check Node/Pod Affinity/Anti-Affinity rules for more fine-grained control on pod scheduling, as a replacement for nodeSelector (a fragment is sketched after this list)
  • Configure k8s to use Application/Network Load Balancer instead of Classic
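
On the affinity point above, a typical anti-affinity fragment that spreads replicas of a service across nodes; the label is a placeholder and these are not necessarily the exact rules adopted here.

# fragment of a pod template: prefer not to schedule two pods with app=archive-server
# on the same node, so replicas spread across the cluster
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: archive-server        # placeholder label
          topologyKey: kubernetes.io/hostname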

Timelines:

  • Resolve archive server error given in previous comment by around Aug 10
  • Finish important/critical best practices analysis by Aug 16
  • Deploy Dev and Prod servers in the week of Aug 17

k8s Deployment Status (20200812):

  • The archive server issue was resolved by adding redis as a second container in the archive-server pod itself, and updating REDIS_HOST accordingly (a sketch of the fix is below)
  • Testing was done to add and remove nodes from the cluster, and it was found to be largely working as expected
  • PROD and DEV cluster configuration creation and costing estimation for the months of Aug and Sep were done
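
A sketch of the shape of that fix (image names and ports are illustrative): redis runs as a second container in the archive-server pod, so the server reaches it over localhost.

# pod template fragment of the archive-server Deployment (illustrative only)
spec:
  containers:
  - name: archive-server
    image: tattle/archive-server:latest   # placeholder image
    env:
    - name: REDIS_HOST
      value: "localhost"                  # redis now lives in the same pod
    - name: REDIS_PORT
      value: "6379"
  - name: redis
    image: redis:5-alpine
    ports:
    - containerPort: 6379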

Next Steps (in order of priority):

  • Check reducing default EBS provisioning during deployment
  • Check option of k8s to use Application/Network Load Balancer instead of Classic
  • Check auto-scaling and load-balancing in response to high loads
  • k8s container disk usage analysis and improvement

k8s Deployment Status (20200817):

  • k8s was tested and found to be working with a Network Load Balancer (an Application LB is not possible); the standard annotation for this is sketched after this list
  • Auto-scaling of pods is somewhat straightforward, and for nodes it is slightly less so, but both seem possible
  • Reducing EBS provisioning seems to be either not straightforward or not possible
  • k8s disk usage analysis also seems non-trivial, and some resources suggest the high disk usage might be related to Docker rather than k8s
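
For reference, the usual way to request a Network Load Balancer from the in-tree AWS provider is an annotation on the Service; whether this exact annotation was used here isn't recorded, so treat it as an assumption. Pod auto-scaling, if taken up, would most likely be a standard HorizontalPodAutoscaler targeting the Deployments.

# Service metadata fragment: request a Network Load Balancer instead of a Classic ELB
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb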

Next Steps:

  • Check mapping k8s services to Tattle URLs as a REST API with HTTPS-enabled
  • Start PROD cluster deployment

Evaluation Status:

  • HTTPS-enabled Tattle URLs for k8s Services were tested successfully (one possible annotation pattern is sketched after this list)
  • The Pod Affinity framework in k8s was tested successfully
  • k8s Pod and Node auto-scaling frameworks were identified (though not tested)
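
The thread doesn't record how HTTPS was enabled; one common AWS pattern (an assumption here, not confirmed) is terminating TLS at the load balancer with an ACM certificate via Service annotations, with the Route53 record pointing at the LB hostname.

# Service annotations fragment: TLS terminated at the AWS load balancer (assumed pattern)
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:<region>:<account>:certificate/<id>   # placeholder ARN
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http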

k8s Deployment Status (PROD Cluster):

  • New PROD cluster was created using kops from newly provisioned deployment machine
  • All containers and cronjobs have been deployed to the cluster (Pod Affinity was used for load distribution)
  • All auto-deploy scripts have been created and pushed to the server
  • All Services have been configured as Tattle URLs with HTTPS enabled
  • New Sematext apps and dashboards were created for PROD Logs and Infra monitoring
  • Sharechat Scraper REST server deployment artifacts were created and pushed to the server

Next Steps:

  • Start DEV cluster deployment

k8s Deployment Status:

  • DEV cluster was created using kops from newly provisioned deployment machine
  • Khoj has been deployed to the DEV cluster
  • Setting of Labels for Pods created by Cronjobs was figured out (the relevant CronJob fragment is sketched after this list)
  • All Cronjobs of the PROD cluster have been updated with Labels and Affinity (to be monitored over the next two days)
  • A consolidated list of commands for cluster creation and container monitoring was compiled
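
For reference, the labels (and affinity) for cron-created pods go on the pod template nested inside the CronJob's jobTemplate; a fragment with placeholder values:

# CronJob spec fragment: labels/affinity belong to the pod template inside jobTemplate
spec:
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: scs-cron                  # placeholder label, picked up by affinity rules
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      app: scs-cron
                  topologyKey: kubernetes.io/hostname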

Next Steps:

  • Deprecate EB archive server and services deployments and redirect to k8s deployments

k8s Deployment Status:

  • Cronjobs of PROD cluster working as intended after Labels and Affinity update (parallel jobs scheduled on different nodes)
  • Factcheck Service was modified to be available only within cluster and Route53 records were deleted
  • Main archive server and services URLs were updated to point to k8s PROD endpoints in Route53
  • Testing was done for Archive server, Redis queue, Factcheck Service, and Telegram bot; all were found to be working as intended with k8s endpoints
  • EB deployments of archive server and services deployment are scheduled to be taken down after a week or so once k8s deployment has been validated with real-time load usage
  • The DEV cluster was deleted and recreated with larger machines to test the new Elasticsearch deployment
  • ECK and an Elasticsearch cluster were deployed on the DEV cluster (a minimal Elasticsearch resource is sketched below)
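
For the ECK item, deployment amounts to installing the ECK operator and then declaring an Elasticsearch resource; a minimal sketch with placeholder name, version, and sizing (not necessarily the configuration used on the DEV cluster):

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: tattle-es                    # hypothetical name
spec:
  version: 7.8.0                     # placeholder version
  nodeSets:
  - name: default
    count: 1                         # single-node cluster for the DEV environment
    config:
      node.store.allow_mmap: false   # common minimal-setup tweak from the ECK quickstart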

All the main objectives, viz. creation of the CICD pipeline and building confidence in the infra, have been achieved, hence closing this issue.