Running Python jobs from the Dataflow console
1byxero opened this issue · 7 comments
Not sure if this is the right place to ask, but how do I run Python Dataflow jobs written with Apache Beam from the Google Cloud Dataflow console? Or does the console only support Java for now?
Hi @1byxero, I am assuming you are asking about running jobs from Dataflow templates in the UI (https://cloud.google.com/dataflow/docs/templates/executing-templates). If that is the case, you can run both Python and Java jobs there. Does this answer your question?
cc: @bjchambers
I made a Python script using Apache Beam.
The script loads a few entities from Google Cloud Datastore, applies some transformations, and then pushes them into a different kind in Datastore.
The script runs fine when I run it on a Compute Engine instance, but when I try to run it from the Dataflow dashboard, it fails.
The steps I follow when running from the Dataflow dashboard are:
- Store the script as a file in a Google Cloud Storage bucket
- Select "Custom template" when creating a job
- Give the path of the file saved in the storage bucket
- Execute it
After these steps, the dashboard shows the error 'unable to parse the file'.
But when I run the word count example template, it runs well and returns the correct output. The execution information shows, however, that the template that ran was a Java template.
I hope I made my question clear. Can you tell me where I went wrong? What did I miss?
Is your script a proper Dataflow pipeline? Dataflow has its own APIs for jobs that run on its service.
See [1] for creating templates with the Dataflow Python SDK, and the quick start guide [2] for a general idea of using the Dataflow Python SDK.
[1] https://cloud.google.com/dataflow/docs/templates/creating-templates
[2] https://beam.apache.org/get-started/quickstart-py/
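To give a rough idea of what [1] covers: a classic template is not the raw .py file itself, but a pipeline description that the SDK stages to GCS when you run the script with the template options set. A minimal sketch of a templatable pipeline might look like the following (all names here, such as MyOptions and --output, are hypothetical, not from this thread):

```python
# A minimal sketch of a templatable Beam Python pipeline; names are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # add_value_provider_argument makes --output a runtime parameter,
        # so it can be supplied when the template is launched from the UI
        # rather than baked in when the template is staged.
        parser.add_value_provider_argument('--output', type=str,
                                           help='Output path prefix')


def run():
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (p
         | 'Create' >> beam.Create(['hello', 'world'])
         | 'Upper' >> beam.Map(str.upper)
         | 'Write' >> beam.io.WriteToText(options.view_as(MyOptions).output))


if __name__ == '__main__':
    run()
```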
I followed the example at link [1], and it runs well when I run it on Compute Engine.
Then I made a few changes and tried running it on Compute Engine again, and it worked well. But when I tried it through the dashboard, it didn't run.
Yes, the pipeline I have written is correct, as it gives the desired results when run on Compute Engine.
Basically, when I run it on Compute Engine it works, but when I try to use the Dataflow service as the runner it fails to even parse the code.
Can you try the instructions from here: https://cloud.google.com/dataflow/docs/templates/creating-templates#creating-and-staging-templates
There will be a template file created at the location you specified with the --template_location flag. Use that file as your template.
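As a concrete sketch (the project and bucket names below are hypothetical), staging means running your script once with the Dataflow runner and a template location, which writes the template file instead of launching a job:

```python
# A sketch of staging a classic template (project/bucket names hypothetical).
# With --template_location set, the DataflowRunner writes the template file
# to that GCS path instead of launching a job.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                      # hypothetical project ID
    '--temp_location=gs://my-bucket/temp',       # hypothetical bucket
    '--staging_location=gs://my-bucket/staging',
    '--template_location=gs://my-bucket/templates/my_template',
])
# Pass these options to beam.Pipeline(options=options) and run the script
# once; then, in the Console, point the custom-template field at
# gs://my-bucket/templates/my_template -- not at the .py source file.
```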
I followed those instructions for using the web interface and still the same issue occurred. The issue is as follows:
Error Running Dataflow Job
Unable to parse template file 'gs://bucketname/scriptname.py'.
I was missing the --runner parameter.
It's working fine now.
Thank you!
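For anyone else hitting this: without --runner=DataflowRunner, Beam falls back to the DirectRunner, which just runs the pipeline locally and (as far as I can tell) does not stage anything at --template_location, so pointing the Console at the raw .py file produces the 'unable to parse template file' error. A quick sanity check (the template path below is hypothetical):

```python
# A quick sanity check of which runner your options resolve to.
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions(['--template_location=gs://my-bucket/templates/my_template'])
print(options.view_as(StandardOptions).runner)  # None -> Beam defaults to DirectRunner

options.view_as(StandardOptions).runner = 'DataflowRunner'  # the missing piece
print(options.view_as(StandardOptions).runner)  # DataflowRunner
```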