Sync AWS CodeCommit + SparkApplication
lucasmsmedeiros opened this issue · 4 comments
Hello everyone!
I'm running spark-operator on k8s and I need to sync my AWS CodeCommit repository directly, so that I can import my Python modules without having to build images with the modules baked in.
I've already used sync with GitHub by deploying SSH to the namespace. However, I am now trying to sync with AWS credentials, according to the YAML below:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: teste-sync-{{ macros.datetime.now().strftime("%Y-%m-%d-%H-%M-%S") }}
  namespace: processing
spec:
  volumes:
    - name: ivy
      emptyDir: {}
  sparkConf:
    extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
    spark.jars.packages: "org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-avro_2.12:3.0.1"
    spark.driver.extraJavaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp"
    spark.kubernetes.allocation.batch.size: "10"
    spark.sql.debug.maxToStringFields: "2000"
  hadoopConf:
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "fs.s3a.path.style.access": "True"
    "fs.s3a.connection.ssl.enabled": "True"
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: url_spark_image
  imagePullPolicy: Always
  mainApplicationFile: teste-sync.py
  sparkVersion: "3.1.2"
  restartPolicy:
    type: Never
  volumes:
    - name: ivy
      emptyDir: {}
    - name: scripts
      emptyDir: {}
  driver:
    volumeMounts:
      - name: scripts
        mountPath: /git-sync
    initContainers:
      - name: git-sync
        image: "k8s.gcr.io/git-sync/git-sync:v3.6.1"
        imagePullPolicy: IfNotPresent
        volumeMounts:
          - name: scripts
            mountPath: /scripts
        env:
          - name: GIT_SYNC_REPO
            value: "https://git-codecommit.MY_REGION.amazonaws.com/v1/repos/MY_REPO"
          - name: GIT_SYNC_BRANCH
            value: "master"
          - name: GIT_SYNC_ROOT
            value: /dags
          - name: GIT_SYNC_DEST
            value: "main"
          - name: GIT_SYNC_ONE_TIME
            value: "true"
          - name: GIT_SYNC_SSH
            value: "false"
          - name: GIT_SYNC_AUTH
            value: "basic"
          - name: AWS_ACCESS_KEY_ID
            valueFrom:
              secretKeyRef:
                name: aws-credentials
                key: aws_access_key_id
          - name: AWS_SECRET_ACCESS_KEY
            valueFrom:
              secretKeyRef:
                name: aws-credentials
                key: aws_secret_access_key
    env:
      - name: PYTHONPATH
        value: "$PYTHONPATH:/git-sync/main/scripts"
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: aws-credentials
        key: aws_access_key_id
      AWS_SECRET_ACCESS_KEY:
        name: aws-credentials
        key: aws_secret_access_key
    cores: 1
    coreLimit: "1200m"
    memory: "2g"
    labels:
      version: 3.1.2
    serviceAccount: spark
    volumeMounts:
      - name: ivy
        mountPath: /tmp
  executor:
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: aws-credentials
        key: aws_access_key_id
      AWS_SECRET_ACCESS_KEY:
        name: aws-credentials
        key: aws_secret_access_key
    cores: 1
    instances: 2
    memory: "3g"
    labels:
      version: 3.1.2
    volumeMounts:
      - name: ivy
        mountPath: /tmp
In the tests I ran, this is not working. Can anyone help me? Is there a problem with the YAML, or does this type of authentication simply not work, so that I will have to deploy SSH?
A few things:
- GIT_SYNC_AUTH is not a thing.
- Since your REPO is "https://", I assume you want to pass GIT_SYNC_USERNAME and either GIT_SYNC_PASSWORD or GIT_SYNC_PASSWORD_FILE. It looks like you have those, but in variable names that git-sync would have no way to know about.
- If you look at the logs, I bet you will see something indicating an auth failure.
I'm going to close this for now. Let me know if you still can't make it work. The logs will show you the flags it used. If the username and password are not known, it can't pass them to basic auth.
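The point about variable names can be checked before deploying. Below is a minimal sketch of such a pre-flight check, assuming the repository requires authentication. The function name and the dictionary layout are hypothetical and are not part of git-sync; only the variable names come from the discussion above.

```python
def missing_basic_auth_vars(env):
    """Return the credential variables an https:// repo still needs.

    `env` maps variable names to values, mirroring the init container's
    env list. This is an illustrative helper, not a git-sync API.
    """
    repo = env.get("GIT_SYNC_REPO", "")
    if not repo.startswith("https://"):
        return []  # SSH and other transports are configured differently
    missing = []
    if not env.get("GIT_SYNC_USERNAME"):
        missing.append("GIT_SYNC_USERNAME")
    # Either the password itself or a file containing it is enough.
    if not (env.get("GIT_SYNC_PASSWORD") or env.get("GIT_SYNC_PASSWORD_FILE")):
        missing.append("GIT_SYNC_PASSWORD or GIT_SYNC_PASSWORD_FILE")
    return missing


# The env from the original manifest, with placeholder secret values.
original_env = {
    "GIT_SYNC_REPO": "https://git-codecommit.MY_REGION.amazonaws.com/v1/repos/MY_REPO",
    "GIT_SYNC_AUTH": "basic",
    "AWS_ACCESS_KEY_ID": "placeholder",
    "AWS_SECRET_ACCESS_KEY": "placeholder",
}
print(missing_basic_auth_vars(original_env))
```

With the original manifest, both credential variables are reported as missing, because the values are present only under the AWS_* names.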
Hi, @thockin!
I made the changes you proposed and I still can't make it work.
My YAML now:
apiVersion: v1
kind: Pod
metadata:
  name: "{{APP_NAME}}"
  namespace: orchestrator
spec:
  containers:
    - name: python-container
      image: "{{PYTHON_IMAGE}}"
      imagePullPolicy: IfNotPresent
      securityContext:
        allowPrivilegeEscalation: false
        runAsUser: 0
      command:
        - "python"
        - "/opt/app/{{API_FILE_PATH}}"
      volumeMounts:
        - name: dags
          mountPath: /git-sync
  initContainers:
    - name: git-sync
      image: "k8s.gcr.io/git-sync/git-sync:v3.6.1"
      imagePullPolicy: IfNotPresent
      volumeMounts:
        - name: dags
          mountPath: /dags
      env:
        - name: GIT_SYNC_REPO
          value: "https://git-codecommit.<my_region>.amazonaws.com/v1/repos/<my_repo>"
        - name: GIT_SYNC_BRANCH
          value: "master"
        - name: GIT_SYNC_ROOT
          value: /dags
        - name: GIT_SYNC_DEST
          value: "master"
        - name: GIT_SYNC_ONE_TIME
          value: "true"
        - name: GIT_SYNC_USERNAME
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: aws_access_key_id
        - name: GIT_SYNC_PASSWORD
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: aws_secret_access_key
  volumes:
    - name: dags
      emptyDir: {}
The error:
INFO: detected pid 1, running init handler
I0922 14:20:38.033790 11 main.go:389] "level"=0 "msg"="starting up" "pid"=11 "args"=["/git-sync"]
I0922 14:20:38.044278 11 main.go:934] "level"=0 "msg"="cloning repo" "origin"="https://git-codecommit.<my_region>.amazonaws.com/v1/repos/<my_repo>" "path"="/dags"
E0922 14:20:38.136591 11 main.go:535] "msg"="too many failures, aborting" "error"="Run(git clone -v --no-checkout -b master https://git-codecommit.<my_region>.amazonaws.com/v1/repos/<my_repo> /dags): exit status 128: { stdout: "", stderr: "Cloning into '/dags'...\nfatal: unable to access 'https://git-codecommit.<my_region>.amazonaws.com/v1/repos/<my_repo>/': The requested URL returned error: 403" }" "failCount"=
The Permissions policies of the user:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SecretsManagerFullAccess",
      "Effect": "Allow",
      "Action": "secretsmanager:*",
      "Resource": "*"
    },
    {
      "Sid": "ECRAccess",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:DescribeRepositories",
        "ecr:ListImages",
        "ecr:DescribeImages",
        "ecr:GetRepositoryPolicy",
        "ecr:ListTagsForResource",
        "ecr:DescribeImageScanFindings"
      ],
      "Resource": "*"
    },
    {
      "Sid": "CodeCommitFullAccess",
      "Effect": "Allow",
      "Action": "codecommit:*",
      "Resource": "*"
    }
  ]
}
A few things to do:
- Can you manually prove that the username and password are correct (no trailing newline or anything) by doing git clone https://user:pass@server... ?
- Run git-sync with -v 6 and see exactly which git commands it is running.
- If that looks right, consider trying git-sync v4.0.0 and -v 9, which will log more useful info about flags and the md5sums of credentials.
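The first check can be partly scripted. The sketch below is a hypothetical helper, not part of git-sync: it reports whether a credential has leading or trailing whitespace, gives its MD5 digest so two copies can be compared without printing the secret, and builds a clone URL with percent-encoded credentials for the manual test. Whether the digest matches the exact form git-sync logs is an assumption that has not been verified here.

```python
import hashlib
from urllib.parse import quote


def credential_report(name, value):
    """Describe a credential without printing its value."""
    return {
        "name": name,
        "length": len(value),
        # True if the value starts or ends with whitespace, e.g. a newline
        "has_edge_whitespace": value != value.strip(),
        "md5": hashlib.md5(value.encode("utf-8")).hexdigest(),
    }


def clone_url(host_and_path, username, password):
    """Build https://user:pass@server/... with percent-encoded credentials."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return "https://{}:{}@{}".format(user, secret, host_and_path)


# Placeholder values only; real secrets would come from the Kubernetes Secret.
raw_password = "p/ss=w+rd\n"
print(credential_report("password", raw_password)["has_edge_whitespace"])
print(clone_url("example.com/v1/repos/demo", "user", raw_password.strip()))
```

Percent-encoding matters because characters such as "/", "=", and "+" in a password would otherwise be misread as part of the URL when it is passed to git clone.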