aws-samples/aws-etl-orchestrator

Join Marketing And Sales Data reports "Unable to infer schema for Parquet. It must be specified manually.;"

liangruibupt opened this issue · 2 comments

The Join Marketing And Sales Data Glue job reports the error below:

```
Traceback (most recent call last):
  File "script_2019-12-26-08-44-52.py", line 42, in <module>
    .load(s3_marketing_data_path, format="parquet")
  File "/mnt/yarn/usercache/root/appcache/application_1577349699059_0001/container_1577349699059_0001_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 159, in load
  File "/mnt/yarn/usercache/root/appcache/application_1577349699059_0001/container_1577349699059_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/mnt/yarn/usercache/root/appcache/application_1577349699059_0001/container_1577349699059_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
End of LogType:stdout
```
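
For context, the failure comes from a plain `DataFrameReader.load` call, visible at line 42 of the traceback. A minimal sketch of the failing pattern (the bucket path here is hypothetical; only the variable name and the call itself come from the traceback):

```python
# Minimal reproduction sketch; the S3 path is a made-up placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed to point at the Parquet output of an upstream Glue job.
s3_marketing_data_path = "s3://example-bucket/marketing-parquet/"

# If the prefix is empty or missing, Spark finds no Parquet footers to
# read a schema from and raises:
#   AnalysisException: u'Unable to infer schema for Parquet.
#   It must be specified manually.;'
df = spark.read.load(s3_marketing_data_path, format="parquet")
```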

I found the guide below:

https://aws.amazon.com/premiumsupport/knowledge-center/glue-unable-to-infer-schema/

> How do I resolve the "Unable to infer schema" exception in AWS Glue?
> Last updated: 2019-06-12
>
> My AWS Glue job fails with one of the following exceptions:
>
> "AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'"
> "AnalysisException: u'Unable to infer schema for ORC. It must be specified manually.;'"

But after double-checking the ProcessMarketingData job, I found some useful clues.

ProcessMarketingData cannot find the source data it is supposed to convert to Parquet:
```
19/12/26 09:00:39 INFO GlueContext: getCatalogSource: catalogId: null, nameSpace: marketingandsales_qs, tableName: marketing_qs, isRegisteredWithLF: false
19/12/26 09:00:39 INFO GlueContext: classification csv
19/12/26 09:00:39 INFO GlueContext: location s3://aws-etl-orchestrator-demo-raw-data/marketing/
19/12/26 09:00:42 INFO HadoopDataSource: nonSplittable: false, disableSplitting: false, catalogCompressionNotSplittable: false, groupFilesTapeOption: none, format: csv
19/12/26 09:00:42 WARN HadoopDataSource: Skipping Partition {} as no new files detected @ s3://aws-etl-orchestrator-demo-raw-data/marketing/ / or path does not exist
19/12/26 09:00:42 INFO SparkContext: Starting job: count at DynamicFrame.scala:1144
```
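
Those `getCatalogSource` log lines correspond to a Glue Data Catalog read along these lines (a sketch based only on the log output; the actual job script in this repo may differ, and it only runs inside a Glue job environment):

```python
# Sketch of the catalog read implied by the getCatalogSource log lines.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# The catalog table's location is s3://aws-etl-orchestrator-demo-raw-data/marketing/.
marketing = glue_context.create_dynamic_frame.from_catalog(
    database="marketingandsales_qs",
    table_name="marketing_qs",
)

# When the S3 prefix is empty, the DynamicFrame has zero records, so no
# Parquet output is ever written for the join job to read later.
print(marketing.count())
```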

So this is similar to issue #4.
You should upload the sales sample data to s3://aws-etl-orchestrator-demo-raw-data/sales/ and the marketing sample data to s3://aws-etl-orchestrator-demo-raw-data/marketing/.
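
If you prefer scripting the upload, here is a boto3 sketch (the bucket name and key prefixes come from this demo; the local file paths are assumptions about where you saved the samples):

```python
# Upload the two sample CSVs to the prefixes the catalog tables point at.
import boto3

s3 = boto3.client("s3")
bucket = "aws-etl-orchestrator-demo-raw-data"

s3.upload_file("MarketingData_QuickSightSample.csv", bucket,
               "marketing/MarketingData_QuickSightSample.csv")
s3.upload_file("SalesPipeline_QuickSightSample.csv", bucket,
               "sales/SalesPipeline_QuickSightSample.csv")
```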

After uploading, the bucket should look like this:
```
aws s3 ls s3://aws-etl-orchestrator-demo-raw-data --region ap-northeast-1 --profile us-east-1 --recursive
2019-12-26 17:39:42          0 marketing/
2019-12-26 17:43:36     151746 marketing/MarketingData_QuickSightSample.csv
2019-12-26 17:42:55          0 sales/
2019-12-26 17:43:51    2002910 sales/SalesPipeline_QuickSightSample.csv
```

Thank you @liangruibupt. In the latest commit, I added instructions in the "Putting it all together" section to make the upload process clear and simple.