
Standardize the docker pipeline


Let's finalize a few things:

  • the norms each service needs to follow to ensure it can be swiftly deployed to our infrastructure
  • the cloud platform we are deploying to (Elastic Beanstalk, EC2, Fargate, etc.)

Open to suggestions on what all we can add to this.

Framework evaluation - Kubernetes

  • Guarantee availability of specified no. of "replica" pods
  • Streamline rolling updates into production without downtime (a minimal Deployment sketch follows this list)
  • Horizontal auto-scaling based on resource consumption
  • Web UI dashboard to manage deployment, manage resources, & perform troubleshooting
  • Somewhat steep learning curve, but can be abstracted for most devs/users
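
As a rough illustration of the replica and rolling-update points above, a minimal Deployment sketch; the service name, image, and port are hypothetical placeholders, not anything from our repos. Horizontal auto-scaling would sit on top of this as a HorizontalPodAutoscaler targeting the Deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service            # hypothetical name, for illustration only
spec:
  replicas: 2                      # k8s keeps this many pods available at all times
  selector:
    matchLabels:
      app: example-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0            # never drop below the desired replica count during an update
      maxSurge: 1                  # bring up one extra pod while updating
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
      - name: example-service
        image: example/service:1.0.0   # placeholder image
        ports:
        - containerPort: 8000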

Sprint 1 Objectives

  • Primary: Deploy Kubernetes/Fargate for the Services Dev/Live server

Framework evaluation - Fargate

  • Manages EC2 instances so no manual provisioning required
  • EC2 instances run in dedicated VPC, not visible to users
  • Integrates with both ECS and EKS through a common CLI
  • Container lifecycle management completely under Fargate
  • Requires ECS Task Definitions to be stateless

Decision Points

  • Kubernetes: the most scalable, platform-agnostic framework for container orchestration, but with a conceptual and procedural learning curve. The medium-term plan includes scaling use cases, so going ahead with Kubernetes. Verdict - Greenlight
  • Fargate: the most hands-off framework, integrating all AWS scaling options into a one-stop serverless compute platform, but EC2 visibility is completely removed. Since EC2 literacy is not a problem for the team, that abstraction isn't needed, so not going ahead with Fargate for now. Verdict - Shelved

Sprint 1 Objectives

  • Primary: Deploy Kubernetes for Services Dev/Live server
  • Secondary: Evaluate options for cron job management and deploy sharechat scraper if possible (#2)

Deployed the Tattle services containers using minikube on the k8s-dev instance for the k8s PoC; testing of service health is in progress.

Important Evaluations for k8s:

  1. Handling of environment variables
  2. Production mgmt tool (currently front-runner is kops)
  3. API for EC2 auto-scaling
  4. Access restriction for service endpoints - "internal" vs external (k8s Ingress?)
  5. Single redis queue to be accessible from multiple instances of archive
  6. Fine-grained resource monitoring for each container when deployed in k8s

Kubernetes Deployment Status:
FCS and TGM services deployed and containers running. Currently implementing k8s Services to enable external access to Tattle service endpoints.
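
For reference, exposing a deployment externally boils down to a k8s Service selecting the pods by label. A minimal sketch (service name, labels, and ports are placeholders; on a minikube setup a NodePort type may be needed instead of LoadBalancer):

apiVersion: v1
kind: Service
metadata:
  name: fcs-service              # hypothetical name
spec:
  type: LoadBalancer             # or NodePort when no cloud load balancer is available
  selector:
    app: fcs                     # must match the labels on the FCS pods
  ports:
  - port: 80                     # port exposed to callers
    targetPort: 8000             # port the container actually listens on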

CICD Improvement - Redeployment Auto-trigger
As discussed with @dennyabrain, looking at increased automation of the current workflow, where new changes in services auto-trigger redeployment. Current workflow given below:

  • Service is pushed to Docker repo with appropriate tag
  • For new services, image name is updated in Dockerrun.aws.json
  • Github issue comments used to trigger auto-deployment via EB (doesn't work across repos)

Redeployment Auto-trigger Evaluations:

  • Check GitHub Actions and alternatives
  • Test Kubernetes rolling upgrade and rollback capabilities

Miscellaneous Evaluations:

  • Running multiple application components (API server, Web UI, etc.) each with its own Nginx server instance, and considerations when scaling to 10/20 components
  • Frameworks for breaking down data pipelines to standardized building blocks (for example, https://github.com/pditommaso/awesome-pipeline)

Sprint 2 Objectives:

  • Primary: Enabling external access to deployments via k8s Services
  • Secondary: Deploy Khoj services on k8s
  • Secondary: Evaluate redeployment auto-trigger options
    (additional Sprint 2 Objectives in #2)

Miscellaneous Evaluations:

  • Enabling check-in of environment variables and auto-apply options for environments

Sprint Objectives:

  • Primary: Deploy one service (currently SCS) as-is on k8s cluster on AWS using kops
  • Secondary: Determine ways for check-in of env. variables to get applied automatically during deployment
  • Secondary: Connect k8s to automated CICD pipeline
  • Secondary: Expose containers using services
  • Tertiary: Update labels and annotations in k8s deployment
  • Tertiary: Deploy the remaining ones of FCS, TGM & SCS on the k8s cluster

k8s Deployment Status Update:

  • Sharechat scraper is now deployed as-is on AWS k8s cluster using kops

Sprint Objectives:

  • Primary: Expose containers using services
  • Secondary: Test rolling updates for pods
  • Secondary: Connect k8s to automated CICD pipeline
  • Secondary: Determine ways for check-in of env. variables to get applied automatically during deployment
  • Tertiary: Update labels and annotations in k8s deployment
  • Tertiary: Deploy the remaining ones of FCS, TGM & WAS on the k8s cluster

k8s Deployment Status Update:

  • AWS k8s cluster was brought down and re-created using instances within free-tier
  • A k8s Service for exposing the container and an Ingress controller were created
  • Currently facing a port-mapping issue while attempting to access the exposed service

Sprint Objectives:

  • Primary: Expose containers using services
  • Secondary: Connect k8s to automated CICD pipeline
  • Tertiary: Perform rolling update of pods
  • Tertiary: Deploy one of KHJ/WAS/FCS/TGM on k8s cluster
  • Tertiary: Determine ways for check-in of env. variables to get applied automatically during deployment
  • Tertiary: Update labels and annotations in k8s deployment

I was taking stock of our current progress and deadlines. I want to propose some hard deadlines for this week (Jul 12) and check with you if you think it's practical. The main requirement is the ability to auto-deploy the Khoj API and the Sharechat scraper onto our infrastructure.
I guess the following will be the essential requirements for that:

  • Expose containers using services
  • Connect k8s to automated CICD pipeline
  • Determine ways for check-in of env. variables to get applied automatically during deployment

how do you think we are doing on this?

@dennyabrain plan for the week looks good (slight modification in the 3rd point: we will maintain the environment variables in deployment-specific YAML files on the deployment server, rather than checking them in to GitHub; a sketch of this approach is below)
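
As a sketch of that approach (names and values are placeholders, not our actual variables): the env file lives only on the deployment server as a ConfigMap (or a Secret for sensitive values), and the service's Deployment references it from its pod template.

# deployment-specific env file, kept only on the deployment server (not in GitHub)
apiVersion: v1
kind: ConfigMap
metadata:
  name: scs-env                    # hypothetical name
data:
  LOG_LEVEL: "info"                # placeholder values
  QUEUE_NAME: "scs-posts"
---
# the Deployment then pulls these in via its pod template, e.g.:
#   envFrom:
#   - configMapRef:
#       name: scs-env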

k8s Deployment Status:

  • Was able to expose containers using Services; this is currently working for SCS

Week Objectives:

  • Primary: Connect k8s to automated CICD pipeline (including specification of YAML files with env. vars.)
  • Primary: Auto-deploy KHJ API server onto k8s, in addition to SCS

Current Evaluations:

  • Does k8s support rolling deployments with a static image-name:version tag, or does it require a unique tag per release? (see the note after this list)
  • A GitHub Action satisfying the CD requirement (preferably k8s/SSH-based, or a custom one in the worst case)
  • How much downtime would there be if deleting and re-applying the deployment turns out to be the easiest option for now?
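
On the static-tag question: with a fixed image:tag the pod spec doesn't change, so re-applying the same manifest won't by itself roll the pods. The usual workarounds are a unique tag per build, bumping an annotation in the pod template, or (on newer kubectl versions) kubectl rollout restart. A fragment of the annotation approach, with a hypothetical annotation key:

# fragment of a Deployment's pod template: bumping this value changes the
# template, forcing a rolling update even when image:tag stays the same
spec:
  template:
    metadata:
      annotations:
        redeploy-timestamp: "20200713-1030"   # hypothetical key; set per deploy from CI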

k8s Deployment Status (20200713):

  • k8s was configured to trigger from Github actions
  • SCS now gets built, uploaded, and deployed to 2 k8s replica pods on commit
  • A basic PoC of end-to-end CICD is in place (an illustrative workflow sketch is below)
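
A rough outline of the kind of GitHub Actions workflow behind this PoC. The image name, deployment name, container name, and secrets are assumptions for illustration, and how the runner authenticates to the cluster is not recorded in this thread.

# .github/workflows/deploy.yml -- illustrative only
name: build-and-deploy
on:
  push:
    branches: [ master ]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Build and push image
      run: |
        echo "${{ secrets.DOCKER_PASSWORD }}" | docker login -u "${{ secrets.DOCKER_USERNAME }}" --password-stdin
        docker build -t tattle/scs:${GITHUB_SHA} .
        docker push tattle/scs:${GITHUB_SHA}
    - name: Roll out to the cluster
      # assumes a kubeconfig has been made available to the runner (e.g. via a secret);
      # the real pipeline may instead trigger a script on the deployment server
      run: |
        kubectl set image deployment/scs-rest scs=tattle/scs:${GITHUB_SHA}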

Next Steps:

  • Standardize the existing k8s pipeline to deploy multiple services (YAML structure, naming conventions)
  • Implement k8s CICD for next set of services - WAS, KAS
  • Explore multiple nodes, different size instances, etc. as ways of scaling specific services only
  • Explore options for health monitoring of servers, starting with k8s Web UI Dashboard (also part of #3)
  • Explore logs monitoring options (also part of #3)
  • Upgrade CICD to handle cron job deployment (will be done as part of #2)

Additional Considerations:

  • Specifying EBS provisioning during deployment
  • k8s container disk usage analysis and improvement
  • Mapping k8s services to Tattle URLs as a REST API
  • Enabling HTTPS on k8s-based Tattle URLs
  • k8s Labels and Annotations standardization
  • k8s deployments with persistent volumes (if reqd)
  • Check-in of environment variables into Github (if reqd)

(Contd. from the previous comment)

Next Steps for k8s Deployment:

  • Standardize the existing k8s pipeline to deploy multiple services (YAML structure, naming conventions)
  • Deployment of SCS Cron Job (a minimal CronJob sketch is given after this list)
  • Deployment of Khoj
  • Deployment of Archive Server + Redis Queue
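
For the SCS cron job item above, a minimal k8s CronJob sketch; the schedule, image, names, and command are placeholders, not the actual SCS configuration.

apiVersion: batch/v1beta1            # CronJob was still beta on 2020-era clusters
kind: CronJob
metadata:
  name: scs-cron                     # hypothetical name
spec:
  schedule: "0 */6 * * *"            # placeholder: every six hours
  concurrencyPolicy: Forbid          # don't start a new run while one is still active
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: scs-cron
            image: tattle/scs:latest            # placeholder image
            args: ["python", "scraper.py"]      # placeholder command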

Future Considerations:

  • Specifying EBS provisioning during deployment
  • Multiple nodes, different size instances, etc. as ways of scaling specific services only
  • Check auto-scaling nodes
  • k8s container disk usage analysis and improvement
  • Mapping k8s services to Tattle URLs as a REST API
  • Enabling HTTPS on k8s-based Tattle URLs
  • k8s Labels and Annotations standardization
  • k8s deployments with persistent volumes (if reqd)
  • Check-in of environment variables into Github (if reqd)

k8s Deployment Status:

  • SCS cron job is deployed and tested (awaiting testing for re-deployment with static Docker image tag)
  • SCS REST server CICD on k8s is implemented and tested
  • Khoj API is deployed and CICD on k8s is implemented and tested

Next Steps:

  • Deployment of Archive server on k8s, and CICD integration, including evaluation of redis queue deployment
  • Deployment of SCS Luigi and CICD integration

Future Considerations (contd. from previous comment):

  • Creating separate cluster for Production deployments as using current as Dev deployment
  • Streamlining k8s and CICD pipelines for multiple versions (dev, prod) of multiple services
  • Streamlined solution for user access control on any UI screens to be exposed publicly

A k8s cluster node failed on Friday due to insufficient CPU. The cluster was taken down; however, re-creation initially failed due to a kops/kubectl version mismatch and master-node incompatibility with the AZ. The cluster was later recreated successfully and the CICD pipelines were re-linked with the new naming conventions.

k8s Deployment Status:

  • Cluster has been recreated with better load handling
  • CICD on k8s for SCS cron job, SCS REST server, and Khoj API is implemented and tested

Next Steps:

  • Deployment of Archive server on k8s, and CICD integration, including evaluation of redis queue deployment
  • Deployment of SCS Luigi and CICD integration

Future Considerations:

  • Specifying EBS provisioning during deployment
  • Multiple nodes, different size instances, etc. as ways of scaling specific services only
  • Check auto-scaling nodes
  • k8s container disk usage analysis and improvement
  • Mapping k8s services to Tattle URLs as a REST API
  • Enabling HTTPS on k8s-based Tattle URLs
  • k8s Labels and Annotations standardization
  • k8s deployments with persistent volumes (if reqd)
  • Check-in of environment variables into Github (if reqd)
  • Creating separate cluster for Production deployments as using current as Dev deployment
  • Streamlining k8s and CICD pipelines for multiple versions (dev, prod) of multiple services
  • Message/job queue native k8s solution
  • Streamlined solution for user access control on any UI screens to be exposed publicly

k8s Deployment Status:

  • Archive server ReplicaSet has been deployed successfully on the k8s dev cluster, with a single redis pod
  • With this, all the primary PoCs for Kubernetes are completed, and the basic streamlined deployment pipeline is in place

Next Steps:

  • Cost analysis of cloud deployments (includes resource utilization analysis, if required)
  • Mapping k8s services to Tattle URLs as a REST API (HTTPS-enabled)
  • Create Production cluster after mid-August, based on resource utilization and costing

Future Considerations - if reqd. based on Costing Analysis:

  • k8s container disk usage analysis and improvement
  • Specifying EBS provisioning during deployment

Future Considerations - if reqd. based on Resource Utilization:

  • Check auto-scaling nodes
  • Multiple nodes, different size instances, etc. as ways of scaling specific services only

Future Considerations - Medium Priority:

  • Message/job queue native k8s solution (KubeMQ, RabbitMQ)
  • k8s deployments with persistent volumes
  • Streamlining of environment variables specification for deployments (only if quick solution exists)
  • Streamlined solution for user access control on any UI screens to be exposed publicly (if required)
  • k8s Labels, Annotations, and selector-based deployments (only for tightly-coupled services?)

k8s Deployment Status:

  • Archive server started throwing a lot of errors (given below), most likely due to multiple instances talking to the same redis pod
  • Archive server was redeployed as a single instance to an empty node, and the redis server was redeployed to the same node

Archive Server Error Log:

BRPOPLPUSH { ReplyError: READONLY You can't write against a read only replica.  
    at parseError (/home/node/app/node_modules/redis-parser/lib/parser.js:179:12)
    at parseType (/home/node/app/node_modules/redis-parser/lib/parser.js:302:14)
  command:
   { name: 'brpoplpush',
     args:
      [ 'bull:Whatsapp Post Index Queue:wait',
        'bull:Whatsapp Post Index Queue:active',
        '5' ] } }

Cluster Creation Overview:

  • Development - This would be a minimal cluster of 2-3 nodes, running all the services that are in active development. Ideally there would be just 1 pod per service here, since availability and downtime are not concerns. If a repo sees a lot of commits during such a phase, a better CICD pipeline might be to delete the corresponding k8s Deployment and re-trigger it; this would ensure that only 1 tag has to be configured. This pipeline would need to be tested, though.
  • Production - All services that are being used by external stakeholders, and all services that are part of official internal pipelines. Here, services with zero-downtime requirements can have 2 pods, while internal services might be able to make do with just 1.
  • Zero-downtime Services - Archive Server (multiple services), Factcheck Scraper (Khoj App), Khoj API Server (Khoj App)

General Best Practices Analysis (Before Cluster Re-creation):

  • Check vertical and horizontal scaling of both Pods and Nodes; if Nodes can be added later, we can start with lean clusters
  • Check auto-scaling and load-balancing in response to high loads
  • Check reducing default EBS provisioning during deployment
  • Check Node/Pod Affinity/Anti-Affinity rules for more fine-grained control on pod scheduling, as a replacement for nodeSelector (a fragment is sketched after this list)
  • Configure k8s to use Application/Network Load Balancer instead of Classic
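
On the affinity point above, a typical anti-affinity fragment that spreads replicas of a service across nodes; the label is a placeholder and these are not necessarily the exact rules adopted here.

# fragment of a pod template: prefer not to schedule two pods with app=archive-server
# on the same node, so replicas spread across the cluster
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: archive-server        # placeholder label
          topologyKey: kubernetes.io/hostname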

Timelines:

  • Resolve archive server error given in previous comment by around Aug 10
  • Finish important/critical best practices analysis by Aug 16
  • Deploy Dev and Prod servers in the week of Aug 17

k8s Deployment Status (20200812):

  • The archive server issue was resolved by adding redis as a second container in the archive-server pod itself, and updating REDIS_HOST accordingly (a sketch of the fix is below)
  • Testing was done to add and remove nodes from the cluster, and it was found to be largely working as expected
  • PROD and DEV cluster configuration creation and costing estimation for the months of Aug and Sep were done
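
A sketch of the shape of that fix (image names and ports are illustrative): redis runs as a second container in the archive-server pod, so the server reaches it over localhost.

# pod template fragment of the archive-server Deployment (illustrative only)
spec:
  containers:
  - name: archive-server
    image: tattle/archive-server:latest   # placeholder image
    env:
    - name: REDIS_HOST
      value: "localhost"                  # redis now lives in the same pod
    - name: REDIS_PORT
      value: "6379"
  - name: redis
    image: redis:5-alpine
    ports:
    - containerPort: 6379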

Next Steps (in order of priority):

  • Check reducing default EBS provisioning during deployment
  • Check option of k8s to use Application/Network Load Balancer instead of Classic
  • Check auto-scaling and load-balancing in response to high loads
  • k8s container disk usage analysis and improvement

k8s Deployment Status (20200817):

  • k8s was tested and found to be working with a Network Load Balancer (an Application LB is not possible); the standard annotation for this is sketched after this list
  • Auto-scaling of pods is somewhat straightforward, and for nodes it is slightly less so, but both seem possible
  • Reducing EBS provisioning seems to be either not straightforward or not possible
  • k8s disk usage analysis also seems non-trivial, and some resources suggest the high disk usage might be related to Docker rather than k8s
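
For reference, the usual way to request a Network Load Balancer from the in-tree AWS provider is an annotation on the Service; whether this exact annotation was used here isn't recorded, so treat it as an assumption. Pod auto-scaling, if taken up, would most likely be a standard HorizontalPodAutoscaler targeting the Deployments.

# Service metadata fragment: request a Network Load Balancer instead of a Classic ELB
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb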

Next Steps:

  • Check mapping k8s services to Tattle URLs as a REST API with HTTPS-enabled
  • Start PROD cluster deployment

Evaluation Status:

  • HTTPS-enabled Tattle URLs for k8s Services were tested successfully (one possible annotation pattern is sketched after this list)
  • The Pod Affinity framework in k8s was tested successfully
  • k8s Pod and Node auto-scaling frameworks were identified (though not tested)
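
The thread doesn't record how HTTPS was enabled; one common AWS pattern (an assumption here, not confirmed) is terminating TLS at the load balancer with an ACM certificate via Service annotations, with the Route53 record pointing at the LB hostname.

# Service annotations fragment: TLS terminated at the AWS load balancer (assumed pattern)
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:<region>:<account>:certificate/<id>   # placeholder ARN
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http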

k8s Deployment Status (PROD Cluster):

  • New PROD cluster was created using kops from newly provisioned deployment machine
  • All containers and cronjobs have been deployed to the cluster (Pod Affinity was used for load distribution)
  • All auto-deploy scripts have been created and pushed to the server
  • All Services have been configured as Tattle URLs with HTTPS enabled
  • New Sematext apps and dashboards were created for PROD Logs and Infra monitoring
  • Sharechat Scraper REST server deployment artifacts were created and pushed to the server

Next Steps:

  • Start DEV cluster deployment

k8s Deployment Status:

  • DEV cluster was created using kops from newly provisioned deployment machine
  • Khoj has been deployed to the DEV cluster
  • Setting of Labels for Pods created by Cronjobs was figured out (the relevant CronJob fragment is sketched after this list)
  • All Cronjobs of the PROD cluster have been updated with Labels and Affinity (to be monitored over the next two days)
  • A consolidated list of commands for cluster creation and container monitoring was compiled
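
For reference, the labels (and affinity) for cron-created pods go on the pod template nested inside the CronJob's jobTemplate; a fragment with placeholder values:

# CronJob spec fragment: labels/affinity belong to the pod template inside jobTemplate
spec:
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: scs-cron                  # placeholder label, picked up by affinity rules
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      app: scs-cron
                  topologyKey: kubernetes.io/hostname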

Next Steps:

  • Deprecate EB archive server and services deployments and redirect to k8s deployments

k8s Deployment Status:

  • Cronjobs of PROD cluster working as intended after Labels and Affinity update (parallel jobs scheduled on different nodes)
  • Factcheck Service was modified to be available only within cluster and Route53 records were deleted
  • Main archive server and services URLs were updated to point to k8s PROD endpoints in Route53
  • Testing was done for Archive server, Redis queue, Factcheck Service, and Telegram bot; all were found to be working as intended with k8s endpoints
  • EB deployments of archive server and services deployment are scheduled to be taken down after a week or so once k8s deployment has been validated with real-time load usage
  • The DEV cluster was deleted and recreated with larger machines to test the new Elasticsearch deployment
  • ECK and an Elasticsearch cluster were deployed on the DEV cluster (a minimal Elasticsearch resource is sketched below)
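
For the ECK item, deployment amounts to installing the ECK operator and then declaring an Elasticsearch resource; a minimal sketch with placeholder name, version, and sizing (not necessarily the configuration used on the DEV cluster):

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: tattle-es                    # hypothetical name
spec:
  version: 7.8.0                     # placeholder version
  nodeSets:
  - name: default
    count: 1                         # single-node cluster for the DEV environment
    config:
      node.store.allow_mmap: false   # common minimal-setup tweak from the ECK quickstart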

All the main objectives, viz. creation of the CICD pipeline and building confidence in the infra, have been achieved, hence closing this issue.