cunningham-lab/neurocaas

Integrate all neurocaas usage into a tag-based workflow.


Developer usage has now been switched to a tag-based workflow. Here is the current layout:

"Soft cap" protections:

  • test-ec2-killer
    • Kills every non-exempt EC2 instance after 180 minutes of activity.
  • ec2-rogue-killer
    • Kills, after 15 minutes of activity, every EC2 instance that is neither on SSM nor explicitly provided with a timeout.
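
For reference, the decision these two soft caps make can be sketched roughly as follows. This is only an illustration of the rules listed above, not the deployed lambda code: the instance dict layout and the tag names `Timeout` and `Exempt` are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Limits quoted in the list above, in minutes.
TEST_KILLER_LIMIT = 180
ROGUE_KILLER_LIMIT = 15


def minutes_active(launch_time, now):
    """Minutes elapsed since the instance was launched."""
    return (now - launch_time).total_seconds() / 60


def soft_cap_verdict(instance, now):
    """Return the name of the soft cap that would kill this instance, or None.

    `instance` is a plain dict with hypothetical keys:
    launch_time (timezone-aware datetime), tags (dict), on_ssm (bool).
    """
    age = minutes_active(instance["launch_time"], now)
    tags = instance.get("tags", {})

    # ec2-rogue-killer: neither on SSM nor tagged with a timeout.
    rogue = not instance.get("on_ssm", False) and "Timeout" not in tags
    if rogue and age > ROGUE_KILLER_LIMIT:
        return "ec2-rogue-killer"

    # test-ec2-killer: anything not marked exempt (hypothetical "Exempt" tag).
    exempt = tags.get("Exempt", "").lower() == "true"
    if not exempt and age > TEST_KILLER_LIMIT:
        return "test-ec2-killer"

    return None
```

Under these assumptions, an untagged instance that is not on SSM is caught by the 15 minute rule long before the 180 minute rule applies, which matches the ordering of the two checks.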

"Hard cap" functions on total usage:

  • neurocaas-guardduty-develop
    • Stops all EC2 instances that have the developer security group after 2880 minutes (2 days) of activity.
  • neurocaas-guardduty-deploy
    • Stops all EC2 instances that have the deploy security group after 120 minutes of activity.
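
The hard caps can be pictured as a lookup from security group to a time limit. The sketch below is a rough illustration only: the group names `developer` and `deploy` are placeholders for the real security group identifiers, and applying the strictest cap when an instance carries both groups is an assumption, not something the functions above are known to do.

```python
# Hard cap limits from the list above, in minutes, keyed by a
# hypothetical security group name.
HARD_CAPS_MINUTES = {
    "developer": 2880,  # neurocaas-guardduty-develop (2 days)
    "deploy": 120,      # neurocaas-guardduty-deploy
}


def hard_cap_for(security_groups):
    """Return the strictest cap that applies, or None if no group matches."""
    caps = [HARD_CAPS_MINUTES[g] for g in security_groups if g in HARD_CAPS_MINUTES]
    return min(caps) if caps else None


def should_stop(security_groups, minutes_active):
    """True if the instance has reached the hard cap for one of its groups."""
    cap = hard_cap_for(security_groups)
    return cap is not None and minutes_active >= cap
```

For example, `should_stop(["deploy"], 121)` is true while `should_stop(["developer"], 121)` is false, since the developer cap only applies after two full days.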

These functions provide a solid layer of protection against unexpected usage in all cases except an SSM job that keeps running unnecessarily.

Here are the next steps:

  • Test these permissions with John Luoyu, then move over all dev permissions to this model and announce.
  • Build tags into the deployment lambda pipeline.
    • Use the timeout tag in SSM timeouts as well, for redundancy.
    • Build the corresponding monitoring lambda function guard and add as an additional soft cap to those above.
    • Reference ownership tags in the lambda startup script to find currently active instances when calculating budget.
    • Include the current request load of instances with given timeout when calculating budget.
    • Generate messages back to the user if we have to kill a deployment instance with a clear description of what went wrong.
    • Use the PR workflow to vet instances and bring them into the tag-based workflow.
  • Better messaging: separate out topic arns for different use cases.
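
To make the two budget items above concrete, here is one possible way the startup script could account for active instances and the incoming request. It is a sketch under several assumptions: the tag names `Owner` and `Timeout`, the budget expressed in instance-minutes, and charging every active instance its full timeout (a worst-case estimate) are all choices made for the example, not the planned implementation.

```python
# Fallback when an instance has no timeout tag; 20 minutes is the default
# mentioned for stack lambdas in this thread.
DEFAULT_TIMEOUT_MINUTES = 20


def committed_minutes(active_instances, owner):
    """Sum the timeouts of active instances whose Owner tag matches `owner`.

    Each instance is a dict with a "tags" dict (hypothetical layout).
    """
    return sum(
        int(inst["tags"].get("Timeout", DEFAULT_TIMEOUT_MINUTES))
        for inst in active_instances
        if inst["tags"].get("Owner") == owner
    )


def request_fits_budget(active_instances, owner, n_requested,
                        timeout_minutes, budget_minutes):
    """Check whether a new request stays within the owner's budget.

    The request load is counted as n_requested * timeout_minutes, on top of
    what the owner's currently active instances may still consume.
    """
    already = committed_minutes(active_instances, owner)
    requested = n_requested * timeout_minutes
    return already + requested <= budget_minutes
```

Counting the full timeout for running instances overestimates usage, which errs on the side of rejecting a request rather than overspending.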

In the linked pull request, we have changed the stack lambdas to launch instances with a timeout tag. If a timeout is given in the neurocaas config file, the tag takes that value; otherwise it defaults to 20 minutes.
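
As a rough sketch of that behavior (not the code in the pull request), the timeout lookup and the resulting tag could look like this. It assumes the config is JSON and that the field is called `timeout`, and it uses `Timeout` as the tag key; all three names are placeholders. The returned structure follows the `TagSpecifications` shape that boto3's EC2 `run_instances` accepts.

```python
import json

DEFAULT_TIMEOUT_MINUTES = 20  # default stated above


def timeout_from_config(config_text):
    """Read the timeout in minutes from config text, or fall back to the default.

    Assumes a JSON config with an optional, hypothetical "timeout" field.
    """
    config = json.loads(config_text)
    return int(config.get("timeout", DEFAULT_TIMEOUT_MINUTES))


def timeout_tag_specifications(timeout_minutes):
    """Build a TagSpecifications list that tags the launched instance."""
    return [
        {
            "ResourceType": "instance",
            "Tags": [{"Key": "Timeout", "Value": str(timeout_minutes)}],
        }
    ]
```

A launch call would then pass `TagSpecifications=timeout_tag_specifications(timeout_from_config(text))` alongside its usual arguments.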

Additionally, I will lower the neurocaas-guardduty-deploy hard cap to 20 minutes for jobs that do not come with their own time limit. This has to wait until all stacks have been migrated through the PR workflow.

We still have to take care of the items included in the main issue text above.