awsdocs/aws-glue-developer-guide

Add .md files for python shell jobs

BharatKJain opened this issue · 6 comments

There's no documentation for:

  1. Adding a database connection in Python shell jobs.
  2. How to use job parameters in a Python shell job.
  3. Whether we can create a Glue job context in a Python shell job.
  4. Using crawlers in Python shell jobs.

Hi @BharatKJain. My apologies for the lateness of this response.

  1. Is this what you are looking for? https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect-samples.html If you want to create a connection in the Glue Data Catalog, would this help? https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-connections.html

  2. I think this section (https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#create-job-python-properties) should be redone to refer to Python, not the console. I will create a ticket for that. In the meantime, this might help: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html#aws-glue-programming-python-calling-parameters

  3. Not sure what you mean by "glue job context". Can you provide more information?

  4. The doc has a reference for the crawler API calls here: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-crawling.html Is that what you are looking for?
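For point 4, here is a minimal sketch of driving a crawler from Python, assuming the job uses the boto3 Glue client and its `start_crawler` and `get_crawler` calls. The helper name `run_crawler_and_wait`, the crawler name, and the polling values are hypothetical and only for illustration. The client is passed in so the helper can be tried without AWS credentials.

```python
import time


def run_crawler_and_wait(glue_client, crawler_name, poll_seconds=15, timeout_seconds=900):
    """Start a crawler, poll until it is READY again, and return the last crawl status.

    glue_client is expected to behave like boto3.client("glue").
    The helper name and the default timings are assumptions, not part of the Glue docs.
    """
    glue_client.start_crawler(Name=crawler_name)
    deadline = time.monotonic() + timeout_seconds

    while time.monotonic() < deadline:
        crawler = glue_client.get_crawler(Name=crawler_name)["Crawler"]
        # A crawler reports RUNNING, then STOPPING, then READY when it has finished.
        if crawler["State"] == "READY":
            return crawler.get("LastCrawl", {}).get("Status")
        time.sleep(poll_seconds)

    raise TimeoutError(f"Crawler {crawler_name!r} did not finish in {timeout_seconds}s")


# Usage inside a job (requires boto3 and IAM permissions on the crawler):
#   import boto3
#   status = run_crawler_and_wait(boto3.client("glue"), "my-example-crawler")
```

The job's IAM role would need permission to start and read the crawler for this to work.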

If your request is to have this content available in .md format, we are working on refreshing the Developer Guide on GitHub.

  1. I was talking about Python shell jobs, but the link you mentioned is for Spark jobs (the Spark samples). Also, when we add a database connection to a Python shell job, the job can no longer access the internet. Because of this, I am currently splitting the work into two jobs: (1) fetch data from the API and store it in S3, and (2) copy the data from S3 to Redshift. I think there is some internal issue in Python shell Glue jobs that limits internet connectivity.

  2. Glue job parameters can be fetched in Python shell jobs using awsglue.utils, but it took a while to figure out because of the lack of documentation, so I am hoping the docs get updated.

  3. Spark jobs use the Glue context, which is how we fetched the job parameters there. Anyway, that is resolved by point 2.

  4. OK, thanks. I will look into it.
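For anyone landing here for point 2, this is a minimal sketch of reading job parameters in a Python shell job, assuming the `getResolvedOptions` helper from `awsglue.utils`. The parameter names `api_url` and `target_table` are hypothetical. The `except ImportError` branch is only a local stand-in (an assumption, not Glue code) so the script can be tried outside the Glue runtime.

```python
import argparse

try:
    # Available inside the AWS Glue job runtime.
    from awsglue.utils import getResolvedOptions
except ImportError:
    # Local stand-in for testing outside Glue; an assumption, not the real helper.
    def getResolvedOptions(argv, options):
        parser = argparse.ArgumentParser(allow_abbrev=False)
        for name in options:
            parser.add_argument("--" + name, required=True)
        # Ignore any extra arguments that are not in the requested list.
        parsed, _unknown = parser.parse_known_args(argv[1:])
        return vars(parsed)


def read_job_parameters(argv):
    """Return the job parameters as a dict keyed by name (without the leading dashes)."""
    # Hypothetical parameter names; replace with the ones defined on your job.
    return getResolvedOptions(argv, ["api_url", "target_table"])


# Usage inside the job script:
#   import sys
#   params = read_job_parameters(sys.argv)
#   print(params["api_url"], params["target_table"])
```

The parameters would be defined on the job as `--api_url` and `--target_table` and read in the script without the dashes.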

Main issue:

Please check on your end why adding a database connection limits internet connectivity.

I will ask around about the internet connectivity issue.

One question first: Are you using a VPC? Does this section help explain what is going on? https://docs.aws.amazon.com/en_us/glue/latest/dg/infrastructure-security.html
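To help narrow this down, a small diagnostic like the sketch below could be run at the start of the job, once with the connection attached and once without. It only checks whether an outbound TCP connection can be opened and does not explain the cause. The function name `can_reach` and the hosts in the usage comments are hypothetical.

```python
import socket


def can_reach(host, port, timeout_seconds=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False


# Usage inside the job (hosts are placeholders):
#   print("public internet:", can_reach("example.com", 443))
#   print("database:", can_reach("my-cluster.example.internal", 5439))
```

If the public check fails only when the connection is attached, that would point toward the network configuration rather than the job code.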

Closing this issue or pull request in advance of archiving this repo. For more information about the decision to archive this repo (and others in the "awsdocs" org), see the announcement on the AWS News Blog.