concourse/hush-house

Task inheriting parent cluster GCP account info


The results of our task can be seen here: https://hush-house.pivotal.io/teams/PE/pipelines/kibosh/jobs/delete-gke-cluster-and-registry-images/builds/1. Our pipeline creates and deletes a GKE cluster using the service account key we provided. On our delete step, we forgot to add the --project parameter. As a result, the task tried to delete a GKE cluster in the cf-concourse-production project.

We have fixed our pipeline to always reference the GCP project, but still wanted to let the group know.
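One way to keep a step from silently falling back to an ambient project is to build the gcloud invocation through a small helper that refuses to run without an explicit project. The following is only a sketch: the helper name `gcloud_delete_cluster_cmd` and the cluster, project, and zone values are hypothetical and not taken from the pipeline above. Only the `--project`, `--zone`, and `--quiet` flags are real gcloud flags.

```python
import shlex
from typing import List


def gcloud_delete_cluster_cmd(cluster: str, project: str, zone: str) -> List[str]:
    """Build a `gcloud container clusters delete` command that always names the project."""
    if not project:
        # Fail loudly instead of letting gcloud pick up a project from its environment.
        raise ValueError("refusing to build a gcloud command without an explicit project")
    return [
        "gcloud", "container", "clusters", "delete", cluster,
        "--project", project,
        "--zone", zone,
        "--quiet",
    ]


# Hypothetical values, for illustration only.
cmd = gcloud_delete_cluster_cmd("example-cluster", "example-project", "us-central1-a")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` in a task script.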

Thanks for letting us know @jkjell !

Assuming that this is coming through the metadata server (I might be wrong): since we can configure Garden properties through environment variables, we can block Google's internal metadata server from the containers that run as steps/checks on our machines:

worker:
  replicas: 1
  env:
    - name: CONCOURSE_GARDEN_DENY_NETWORK
      value: "169.254.169.254/32"

Regarding the permissions granted to those VMs, we can't fully remove all of the current permissions, as we need them to provision new disks and perform other administrative functions. We could perhaps reduce them, but not down to a full "no permissions granted".

I'll try to validate that soon and follow up with a PR 👍

thx!

Update: we added the deny rule to the workers. We now need to verify that it actually blocks the traffic and behaves as expected (see

- name: CONCOURSE_GARDEN_DENY_NETWORK
  value: "169.254.169.254/32"
).

We confirmed that the deny rule blocks the containers from being able to access the metadata server that the underlying host can access. That rule is applied now to all shared GCP workers on Hush House so this shouldn't happen anymore!
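A check like this can also be scripted from inside a task container. The sketch below assumes the standard GCE metadata endpoint and its required `Metadata-Flavor: Google` header. The function names are hypothetical, and the 2-second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request

# Standard GCE metadata endpoint for the project ID.
METADATA_URL = "http://169.254.169.254/computeMetadata/v1/project/project-id"


def build_metadata_request(url: str = METADATA_URL) -> urllib.request.Request:
    """Build a request carrying the header the GCE metadata server requires."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def metadata_reachable(timeout: float = 2.0) -> bool:
    """Return True if the metadata server answers at all, False if the connection fails."""
    try:
        with urllib.request.urlopen(build_metadata_request(), timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # An HTTP error status still means the server was reached.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, unreachable network, or timeout.
        return False
```

Calling `metadata_reachable()` from a task should return `False` when the deny rule is effective and `True` when the container can still reach the host's metadata server.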

We provisioned a new worker without the rule turned on and reproduced this behaviour by running gcloud info. This explains why, without the --project flag, the gcloud CLI was falling back to the service account credentials used to provision the GKE cluster:

Account: [secret-account-id@developer.gserviceaccount.com]
Project: [cf-concourse-production]

Current Properties:
  [core]
    project: [cf-concourse-production]
    account: [secret-account-id@developer.gserviceaccount.com]
    disable_usage_reporting: [True]

After applying the rule, the gcloud info output is much more locked down:

Account: [None]
Project: [None]

Current Properties:
  [core]
    disable_usage_reporting: [True]

As of now, running fly execute with

---
platform: linux

image_resource:
  type: registry-image
  source:
    repository: platforminsightsteam/base-ci-image

run:
  path: gcloud
  args: ["info"]

shows Project: [cf-concourse-production]. We intercepted a similar container and were able to curl 169.254.169.254 with no errors. We inspected /proc/$(pgrep gdn)/cmdline on multiple worker pods, and saw that --deny-network169.254.169.254/32 does indeed appear. Something fishy is going on here.
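Arguments in /proc/<pid>/cmdline are separated by NUL bytes, which is likely why the flag and its value appear run together when the file is printed directly. A minimal sketch of splitting the file and extracting the deny-network values follows. The helper names and the sample bytes are hypothetical, and the real gdn command line contains many more arguments.

```python
from typing import List


def parse_cmdline(raw: bytes) -> List[str]:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into arguments.

    Empty fields (including the one after the trailing NUL) are dropped.
    """
    return [arg.decode("utf-8", errors="replace") for arg in raw.split(b"\x00") if arg]


def denied_networks(args: List[str], flag: str = "--deny-network") -> List[str]:
    """Collect values passed as `--flag value` or `--flag=value`."""
    values = []
    for i, arg in enumerate(args):
        if arg == flag and i + 1 < len(args):
            values.append(args[i + 1])
        elif arg.startswith(flag + "="):
            values.append(arg[len(flag) + 1:])
    return values


# Hypothetical sample; on a worker, read open(f"/proc/{pid}/cmdline", "rb").read() instead.
sample = b"gdn\x00server\x00--deny-network\x00169.254.169.254/32\x00"
print(denied_networks(parse_cmdline(sample)))  # ['169.254.169.254/32']
```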

We have confirmed, as @cirocosta suspected, that concourse/concourse#5159 is the culprit here. We used the following docker-compose.yml:

version: '3'

services:
  concourse-db:
    image: postgres
    environment:
      POSTGRES_DB: concourse
      POSTGRES_PASSWORD: concourse_pass
      POSTGRES_USER: concourse_user
      PGDATA: /database

  concourse:
    # digest:
    # before PR sha256:488638b0651e1e6cc884876499970a181ef63f1b2b02b6b9718ca1383c51a0b4
    # (https://ci.concourse-ci.org/teams/main/pipelines/concourse/jobs/build-rc-image/builds/86)
    # after PR sha256:49837094a16050e64a02e8f100a1992084f89505fdddd98e48aae8aa5355b4b4
    # (https://ci.concourse-ci.org/teams/main/pipelines/concourse/jobs/build-rc-image/builds/87)
    image: concourse/concourse-rc@<digest>
    command: quickstart
    privileged: true
    depends_on: [concourse-db]
    ports: ["8080:8080"]
    environment:
      CONCOURSE_POSTGRES_HOST: concourse-db
      CONCOURSE_POSTGRES_USER: concourse_user
      CONCOURSE_POSTGRES_PASSWORD: concourse_pass
      CONCOURSE_POSTGRES_DATABASE: concourse
      CONCOURSE_EXTERNAL_URL: http://localhost:8080
      CONCOURSE_ADD_LOCAL_USER: test:test
      CONCOURSE_MAIN_TEAM_LOCAL_USER: test
      CONCOURSE_WORKER_BAGGAGECLAIM_DRIVER: overlay
      CONCOURSE_GARDEN_DENY_NETWORK: 172.217.1.174/32 # google.com

and ran fly execute against both versions using this task.yml:

---
platform: linux

image_resource:
  type: registry-image
  source:
    repository: appropriate/curl

run:
  path: curl
  args: ["google.com"]

Before the PR, we got curl: (7) Failed to connect to google.com port 80: Connection refused, but after the PR we got

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

Given that the change in concourse/concourse#5159 causes Garden (via kawasaki) to prepend iptables rules, while setting --deny-network appends them, we feel a bit discouraged when deciding how to address this "leaking GCP metadata" use case. Admittedly, neither @pivotal-jamie-klassen nor I know a whole lot about iptables, so maybe there is some place where we can configure Garden to definitely block traffic to GCP's metadata server while still allowing outbound traffic from containers running in greenhouse (Windows containers).
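The ordering problem can be illustrated with a toy model of first-match rule evaluation, which is how iptables walks a chain. This is a simplified sketch, not Garden's or kawasaki's actual chain layout: the rule contents, the `verdict` helper, and the broad "allow everything" rule are assumptions made for illustration.

```python
import ipaddress
from typing import List, Optional, Tuple

# (destination CIDR, or None for "any destination", verdict)
Rule = Tuple[Optional[str], str]


def verdict(chain: List[Rule], dest_ip: str, default: str = "ACCEPT") -> str:
    """Return the verdict of the first rule whose destination matches dest_ip."""
    addr = ipaddress.ip_address(dest_ip)
    for cidr, action in chain:
        if cidr is None or addr in ipaddress.ip_network(cidr):
            return action
    return default


deny_metadata: Rule = ("169.254.169.254/32", "REJECT")
allow_all: Rule = (None, "ACCEPT")  # hypothetical broad allow rule

# Deny rule ahead of the broad allow rule: the metadata server is blocked.
deny_first = [deny_metadata, allow_all]
# Broad allow rule prepended ahead of the deny rule: the deny rule is never reached.
allow_first = [allow_all, deny_metadata]

print(verdict(deny_first, "169.254.169.254"))   # REJECT
print(verdict(allow_first, "169.254.169.254"))  # ACCEPT
```

In this model a deny rule only takes effect if it sits ahead of any broader allow rule, which matches the difference we saw between the two image digests.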