GoogleCloudPlatform/DataflowPythonSDK

Jobs not starting on Google Dataflow service

1byxero opened this issue · 5 comments

I have Dataflow pipelines that work perfectly fine on LocalRunner, but they do not start on the Dataflow service.

To elaborate: the first phase of the pipeline reads data from Google Datastore, and Beam automatically divides that phase into six steps, namely UserQuery, SplitQuery, GroupByKey, Values, Flatten, and Read.
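For context, here is a minimal sketch of the kind of pipeline described above, using the Datastore source as it existed in the 2.0/2.1 Python SDK. The project ID, kind name, and bucket paths are placeholders, not values taken from this issue.

```python
# Hedged sketch of a Datastore-reading pipeline (Beam 2.0/2.1-era API).
# 'my-project', 'MyKind' and the GCS paths below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from google.cloud.proto.datastore.v1 import query_pb2

# Build a Datastore query protobuf for a single entity kind.
query = query_pb2.Query()
query.kind.add().name = 'MyKind'  # placeholder entity kind

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder project id
    temp_location='gs://my-bucket/tmp',  # placeholder bucket
)

p = beam.Pipeline(options=options)
entities = (p
            | 'ReadFromDatastore' >> ReadFromDatastore(project='my-project',
                                                       query=query))
# ... downstream transforms would go here ...
result = p.run()
result.wait_until_finish()
```

On the Dataflow UI, that single `ReadFromDatastore` transform is what expands into the UserQuery/SplitQuery/GroupByKey/Values/Flatten/Read steps mentioned above.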

The Dataflow service UI shows the UserQuery and SplitQuery steps as running, and GroupByKey as partially running.

I have run these pipelines before, and they used to take about 5 minutes on the Dataflow service. The amount of data they process now is the same, so I would expect roughly 5-6 minutes, but the pipelines never finish; they are just stuck at those three steps inside the ReadFromDatastore transform.

What could the possible reasons for this be, so that I can debug it?

I tried this with both the 2.0.0 and 2.1.0 SDKs.

Hi @1byxero, could you please email dataflow-feedback@google.com and provide your job ID? Thanks.

Hello, I have figured out the issue. It is addressed by this question on Stack Overflow.

Thank you for the help, will close this issue now!

Thank you. Please also note that version 2.1.1 of the SDK fixes the underlying issue.

Actually, my pipelines have an additional requirement on the Python package google-cloud-datastore, which installs google-cloud-core. The version it pulls in is 0.25/0.26, which has this issue, so I would rather stay on my current working environment and check again once the next big stable release of apache-beam is out.
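For anyone hitting the same dependency conflict, one hedged option is to pin the client-library versions that are known to work locally and ship them to the workers with `--requirements_file`. The file name and any versions in it below are placeholders, not a recommendation from this thread.

```python
# Sketch only: install a pinned set of dependencies on the Dataflow workers.
# requirements.txt might contain, for example:
#   google-cloud-datastore==<known-good version>
#   google-cloud-core==<known-good version>
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                   # placeholder project id
    temp_location='gs://my-bucket/tmp',     # placeholder bucket
    requirements_file='requirements.txt',   # pinned dependencies installed on workers
)
```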