awsdocs/aws-glue-developer-guide

Starting job with trigger causes error

tardif54 opened this issue · 5 comments

I created a job that uses two --extra-py-files entries; one of them is a library archived in a zip file, following the AWS guidelines.
When the job is started through the AWS Glue console, everything works fine. Whenever I use a trigger or the command line (start-job-run) to start the exact same job, I get the following error:

Resource Setup Error: Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR s3://bucket/path/to/my/zip/file.zip with URI s3. Please specify a class through --class.

I have tried using non-overridable parameters and specifying the extra-py-files in my command line; nothing seems to work.

From the error it looks like your files are being supplied in --extra-jars; from the command line you would need to use --extra-py-files, like below.

aws glue start-job-run --job-name "mysql-rds-parallel-read" --arguments='--scriptLocation="s3://my_glue/libraries/test_lib.py",--extra-py-files="hive_metastore_migration.py"'

It will return a JobRunId if submitted successfully, like below:
{
"JobRunId": "jr_8313019d2c9e3db824d4681d6f1e43a2e54ed35707fca5bf2e6c8c764719448b"
}
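For anyone driving this from Python instead of the CLI, the boto3 equivalent would be something like the sketch below; it reuses the same job name and file as the CLI example above, and simply passes the run arguments as a plain dict.

import boto3

glue = boto3.client('glue')

# Sketch: boto3 equivalent of the start-job-run CLI call above.
response = glue.start_job_run(
    JobName='mysql-rds-parallel-read',
    Arguments={
        '--extra-py-files': 'hive_metastore_migration.py',
    },
)
print(response['JobRunId'])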


I have tried explicitly adding --extra-py-files to my AWS CLI command. Here's part of the log for a failed job run; you can see that both files are there in --extra-py-files. I don't understand what the difference is between starting from the console and starting with a trigger or the command line.

--extra-py-files s3://bucket/path/to/connection.py, s3://bucket/path/to/optical_services_results/osr_transformations.zip --JOB_ID j_12cf0bc0f9428c8c6a83ed8830575cdf3dc47da498a22a35a3ba1b822b27ff6d --JOB_RUN_ID jr_6ab99c591ef78af230da38aea7f2274805fcba58ce0b3f9cd2d5f25535aa3786 --enable-glue-datacatalog --job-bookmark-option job-bookmark-disable --scriptLocation s3://bucket/path/to/optical_services_results/optical_services_results.py --job-language python --TempDir s3://bucket/path/to/temp/ --JOB_NAME optical-services-results-job
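To compare what a console-started run and a trigger-started run actually received, you can pull the run details back with boto3's get_job_run; a small sketch, using the job name and run ID from the log above:

import boto3

glue = boto3.client('glue')

# Sketch: inspect the arguments a specific run actually received.
# The run ID is the one from the failed run's log above.
run = glue.get_job_run(
    JobName='optical-services-results-job',
    RunId='jr_6ab99c591ef78af230da38aea7f2274805fcba58ce0b3f9cd2d5f25535aa3786',
)['JobRun']
print(run['Arguments'])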

While researching, I found a discussion on the AWS developer forums:
https://forums.aws.amazon.com/thread.jspa?threadID=308042


I have tried explicitly supplying the job arguments... it doesn't work.

Here's the solution to my problem; I used a Boto3 script:

import boto3

client = boto3.client('glue')


def add_trigger():
    # Create a scheduled trigger that starts the job with explicit
    # run arguments, including --extra-py-files.
    client.create_trigger(
        Name='test2schedule',
        Type='SCHEDULED',
        Schedule='cron(07 19 * * ? *)',  # every day at 19:07 UTC
        Actions=[
            {
                'JobName': 'optical-services-results-job',
                'Arguments': {
                    '--scriptLocation': 's3://my/bucket/script.py',
                    '--extra-py-files': 's3://my/bucket/connection.py,s3://my/bucket/pythonlibrary.zip'
                },
            },
        ],
        StartOnCreation=True  # activate the schedule immediately
    )


def main():
    add_trigger()


if __name__ == "__main__":
    main()
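Passing Arguments on the trigger's action is what made the difference for me; they are handed to every run the trigger starts. Since StartOnCreation=True should activate the schedule right away, you can confirm the trigger's state with get_trigger; a small sketch:

import boto3

client = boto3.client('glue')

# Sketch: confirm the trigger exists and its schedule is active.
trigger = client.get_trigger(Name='test2schedule')['Trigger']
print(trigger['Name'], trigger['State'])  # expecting State == 'ACTIVATED'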