kubernetes/git-sync

Sync AWS CodeCommit + SparkApplication

lucasmsmedeiros opened this issue · 4 comments

Hello everyone!

I'm running spark-operator on k8s and I need to sync my AWS CodeCommit repository directly, so that I can import my Python modules without having to bake them into the images.
I've already used git-sync with GitHub by deploying an SSH key to the namespace. Now I am trying to sync with AWS credentials, using the YAML below:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: teste-sync-{{ macros.datetime.now().strftime("%Y-%m-%d-%H-%M-%S") }}
  namespace: processing
spec:
  volumes:
    - name: ivy
      emptyDir: {}
  sparkConf:
    extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
    spark.jars.packages: "org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-avro_2.12:3.0.1"
    spark.driver.extraJavaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp"
    spark.kubernetes.allocation.batch.size: "10"
    spark.sql.debug.maxToStringFields: "2000"
  hadoopConf:
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "fs.s3a.path.style.access": "True"
    "fs.s3a.connection.ssl.enabled": "True"
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: url_spark_image
  imagePullPolicy: Always
  mainApplicationFile: teste-sync.py
  sparkVersion: "3.1.2"
  restartPolicy:
    type: Never
  volumes:
    - name: ivy
      emptyDir: {}
    - name: scripts
      emptyDir: {}
  driver:
    volumeMounts:
      - name: scripts
        mountPath: /git-sync
    initContainers:
      - name: git-sync
        image: "k8s.gcr.io/git-sync/git-sync:v3.6.1"
        imagePullPolicy: IfNotPresent
        volumeMounts:
          - name: scripts
            mountPath: /scripts
        env:
          - name: GIT_SYNC_REPO
            value: "https://git-codecommit.MY_REGION.amazonaws.com/v1/repos/MY_REPO"
          - name: GIT_SYNC_BRANCH
            value: "master"   
          - name: GIT_SYNC_ROOT
            value: /dags
          - name: GIT_SYNC_DEST
            value: "main"
          - name: GIT_SYNC_ONE_TIME
            value: "true"
          - name: GIT_SYNC_SSH
            value: "false"
          - name: GIT_SYNC_AUTH
            value: "basic"   
          - name: AWS_ACCESS_KEY_ID
            valueFrom:
              secretKeyRef:
                name: aws-credentials
                key: aws_access_key_id
          - name: AWS_SECRET_ACCESS_KEY
            valueFrom:
              secretKeyRef:
                name: aws-credentials
                key: aws_secret_access_key           
    env:
      - name: PYTHONPATH
        value: "$PYTHONPATH:/git-sync/main/scripts"              
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: aws-credentials
        key: aws_access_key_id
      AWS_SECRET_ACCESS_KEY:
        name: aws-credentials
        key: aws_secret_access_key
    cores: 1
    coreLimit: "1200m"
    memory: "2g"
    labels:
      version: 3.1.2
    serviceAccount: spark
    volumeMounts:
      - name: ivy
        mountPath: /tmp
  executor:
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: aws-credentials
        key: aws_access_key_id
      AWS_SECRET_ACCESS_KEY:
        name: aws-credentials
        key: aws_secret_access_key
    cores: 1
    instances: 2
    memory: "3g"
    labels:
      version: 3.1.2
    volumeMounts:
      - name: ivy
        mountPath: /tmp

In the tests I ran, this does not work. Can anyone help me? Is there a problem with the YAML, or is this type of authentication unsupported, meaning I will have to deploy SSH?

A few things:

  1. GIT_SYNC_AUTH is not a thing

  2. Since your REPO is "https://", I assume you want to pass GIT_SYNC_USERNAME and either GIT_SYNC_PASSWORD or GIT_SYNC_PASSWORD_FILE. It looks like you have those values, but in variable names that git-sync has no way to know about.

  3. If you look at logs I bet you will see something indicating an auth failure.

I'm going to close this for now. Let me know if you still can't make it work. The logs will show you the flags it used - if the username and password are not known, it can't pass them to basicauth.

Hi, @thockin!

I made the changes you proposed and I still can't make it work.

My YAML now:

apiVersion: v1
kind: Pod
metadata:
  name: "{{APP_NAME}}"
  namespace: orchestrator
spec:
  containers:
    - name: python-container
      image: "{{PYTHON_IMAGE}}"
      imagePullPolicy: IfNotPresent
      securityContext:
        allowPrivilegeEscalation: false
        runAsUser: 0
      command:
        - "python"
        - "/opt/app/{{API_FILE_PATH}}"
      volumeMounts:
        - name: dags
          mountPath: /git-sync        
  initContainers:
  - name: git-sync
    image: "k8s.gcr.io/git-sync/git-sync:v3.6.1"
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - name: dags
        mountPath: /dags
    env:
      - name: GIT_SYNC_REPO
        value: "https://git-codecommit.<my_region>.amazonaws.com/v1/repos/<my_repo>"
      - name: GIT_SYNC_BRANCH
        value: "master"   
      - name: GIT_SYNC_ROOT
        value: /dags
      - name: GIT_SYNC_DEST
        value: "master"
      - name: GIT_SYNC_ONE_TIME
        value: "true"
      - name: GIT_SYNC_USERNAME
        valueFrom:
          secretKeyRef:
            name: aws-credentials
            key: aws_access_key_id
      - name: GIT_SYNC_PASSWORD
        valueFrom:
          secretKeyRef:
            name: aws-credentials
            key: aws_secret_access_key
  volumes:
    - name: dags
      emptyDir: {}

The error:

INFO: detected pid 1, running init handler
I0922 14:20:38.033790 11 main.go:389] "level"=0 "msg"="starting up" "pid"=11 "args"=["/git-sync"]
I0922 14:20:38.044278 11 main.go:934] "level"=0 "msg"="cloning repo" "origin"="https://git-codecommit.<my_region>.amazonaws.com/v1/repos/<my_repo>" "path"="/dags"
E0922 14:20:38.136591 11 main.go:535] "msg"="too many failures, aborting" "error"="Run(git clone -v --no-checkout -b master https://git-codecommit.<my_region>.amazonaws.com/v1/repos/<my_repo> /dags): exit status 128: { stdout: "", stderr: "Cloning into '/dags'...\nfatal: unable to access 'https://git-codecommit.<my_region>.amazonaws.com/v1/repos/<my_repo>/': The requested URL returned error: 403" }" "failCount"=

The permissions policies attached to the user:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SecretsManagerFullAccess",
            "Effect": "Allow",
            "Action": "secretsmanager:*",
            "Resource": "*"
        },
        {
            "Sid": "ECRAccess",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:DescribeRepositories",
                "ecr:ListImages",
                "ecr:DescribeImages",
                "ecr:GetRepositoryPolicy",
                "ecr:ListTagsForResource",
                "ecr:DescribeImageScanFindings"
            ],
            "Resource": "*"
        },
        {
            "Sid": "CodeCommitFullAccess",
            "Effect": "Allow",
            "Action": "codecommit:*",
            "Resource": "*"
        }
    ]
}

A few things to do:

  1. Can you manually prove that the username and password are correct (no trailing newline or anything) by doing git clone https://user:pass@server... ?
  2. Run git-sync with -v 6 and see exactly which git commands it is running.
  3. If that looks right, consider trying git-sync v4.0.0 and -v 9 which will log more useful info about flags and the md5sums of credentials.
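
For step 1, a minimal Python sketch of how the test URL could be built. The helper name `build_clone_url` and the example host are hypothetical and not part of git-sync. It assumes the credentials may contain characters such as "/", "+" or "@", which have to be percent-encoded before they are embedded in a URL, and it rejects values with stray whitespace such as a trailing newline.

```python
from urllib.parse import quote


def build_clone_url(repo_url: str, username: str, password: str) -> str:
    """Return repo_url with percent-encoded credentials embedded (hypothetical helper)."""
    # A trailing newline (for example from a secret created with `echo`) is easy to miss.
    for label, value in (("username", username), ("password", password)):
        if value != value.strip():
            raise ValueError(f"{label} has leading or trailing whitespace")

    scheme, sep, rest = repo_url.partition("://")
    if not sep or scheme != "https":
        raise ValueError("expected an https:// repository URL")

    # safe="" also encodes "/", "+" and "@", which would otherwise corrupt the URL.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{rest}"


if __name__ == "__main__":
    # Placeholder values only; substitute the real repository URL and credentials.
    print(build_clone_url("https://example.com/v1/repos/demo", "user", "p/a+s@s"))
```

The resulting string can be passed to `git clone` by hand. It contains the password in clear text, so it should not be pasted into logs or issue comments.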