aws-samples/aws-etl-orchestrator

Error in Join Marketing and Sales Data

johnsontroye1 opened this issue · 7 comments

I have pulled down this repo and have it working until the last step (Join Marketing and Sales Data). I have tried to get past this unsuccessfully. Here's the error logged in Gluerunner CloudWatch logs:

[ERROR] 2018-07-18T15:17:26.792Z 88fb4fc4-8a9d-11e8-bec7-f7119107e998 Glue job "JoinMarketingAndSalesData" run with Run Id "jr_bebcc..." failed. Last state: FAILED. Error message: AnalysisException: u'Path does not exist: hdfs://ip-172-31-74-135.ec2.internal:8020/user/root/aa.etl-output-path/tmp/sales;'

Yes, I cleaned out everything several times and reran to the same point of error. The only differences I see in the logs are the run ID and the IP address of the EC2 instance. Can you please tell me where to open a support case for this? Thank you.

There were 5 .json files in the repo that needed config changes.

  • cloudformation/gluerunner-lambda-params.json
  • lambda/s3-deployment-descriptor.json
  • cloudformation/glue-resources-params.json
  • lambda/gluerunner/gluerunner-config.json
  • cloudformation/step-functions-resources-params.json

Would you mind sending me your .json files so I can compare them against what I have? Maybe I did mess up a configuration.

troy.johnson@changepoint.com

Thank you very much,

Troy

The cause could be a wrong parameter value in glue-resources-params.json:

{
    "ParameterKey": "S3ETLOutputPath",
    "ParameterValue": "<NO-DEFAULT>"
}

Please make sure ParameterValue is set to a full S3 path, like:

s3://<bucket_name>/output

Not simply:

output

The latter is not an S3 URI, so the result is actually written to the cluster's local HDFS. That's why the Join Marketing and Sales Data step couldn't find the file.
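
A quick way to catch this before deploying is to check the parameter value up front. The following is only a minimal sketch in Python, assuming the params file is a JSON list of ParameterKey/ParameterValue objects like the snippet above. The function name check_s3_output_path is hypothetical and not part of the repo:

```python
import json


def check_s3_output_path(params, key="S3ETLOutputPath"):
    """Return a list of problems found for `key` in a CloudFormation-style
    parameter list (hypothetical helper, not part of the repo)."""
    values = [
        str(p.get("ParameterValue", ""))
        for p in params
        if p.get("ParameterKey") == key
    ]
    if not values:
        return [f"{key} is not defined"]

    problems = []
    for value in values:
        if value == "<NO-DEFAULT>":
            # Placeholder from the repo was never replaced.
            problems.append(f"{key} still has the placeholder value")
        elif not value.startswith("s3://") or value == "s3://":
            # A bare value such as "output" is not an S3 URI.
            problems.append(f"{key}={value!r} is not an S3 URI")
    return problems


# Inline sample instead of reading the real file; to check the real one,
# load cloudformation/glue-resources-params.json with json.load().
sample = json.loads(
    '[{"ParameterKey": "S3ETLOutputPath", "ParameterValue": "output"}]'
)
for problem in check_s3_output_path(sample):
    print(problem)
```

An empty list means the value at least looks like an S3 URI. It does not verify that the bucket exists or that the Glue job role can write to it.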

Config parameters and docs were updated to simplify the configuration process and make it less error prone.