GoogleCloudPlatform/DataflowPythonSDK

Install package for workers

dovy opened this issue · 4 comments

dovy commented

I'm running everything locally just as I want (using the new Apache Beam SDK). I'm using sqlalchemy, a popular Python ORM.

As I deploy to Dataflow using the DataflowPipelineRunner, I get the following:

No module named sqlalchemy.sql

I've tried running Dataflow both inside and outside a pyenv, and I've also set a requirements.txt. Nothing else is broken except for this; all other packages seem to work as expected.

Any ideas?

Full error:
An exception was raised when trying to execute the work item <BatchWorkItem s01 steps=Writing to RDS/WriteImpl/initialize_write-in0/Read+Writing to RDS/WriteImpl/initialize_write+Writing to RDS/WriteImpl/initialize_write-out0/Write 166444193180234443> :

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 474, in do_work
    work_executor.execute()
  File "dataflow_worker/executor.py", line 909, in dataflow_worker.executor.MapTaskExecutor.execute (dataflow_worker/executor.c:24411)
    op.start()
  File "dataflow_worker/executor.py", line 473, in dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:14273)
    def start(self):
  File "dataflow_worker/executor.py", line 478, in dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:13554)
    pickler.loads(self.spec.serialized_fn))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 198, in loads
    return dill.loads(base64.b64decode(s))
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 260, in loads
    return load(file)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 250, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1133, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 726, in _import_module
    return getattr(__import__(module, None, None, [obj]), obj)
ImportError: No module named sqlalchemy.sql
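For context, the Beam Python SDK stages third-party dependencies to workers via the `--requirements_file` pipeline option. A minimal sketch of the flags involved (the project and bucket names are placeholders, not values from this thread):

```python
# Sketch: command-line flags that tell the Dataflow runner to install the
# packages listed in requirements.txt on every worker at startup.
# These would normally be forwarded to PipelineOptions(pipeline_args).
pipeline_args = [
    '--runner=DataflowRunner',
    '--project=my-project',                       # placeholder
    '--staging_location=gs://my-bucket/staging',  # placeholder
    '--temp_location=gs://my-bucket/tmp',         # placeholder
    '--requirements_file=requirements.txt',
]
```

If a package is listed here but the import still fails on the worker, the worker startup logs usually show whether the pip install succeeded, which is what the next reply asks about.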

Could you look at the worker logs? Do you see your package being installed as expected?

dovy commented

@aaltay I sent you an email. It seems that now that I have everything properly structured, we have a crash on our pip install command. I would much prefer to run pip install -r requirements.txt rather than specify packages manually in REQUIRED_PACKAGES within my setup.py.

However, if this is not possible, how do I specify versions within REQUIRED_PACKAGES? Using == doesn't seem to work.
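For reference, a sketch of the setup.py pattern being discussed, passed to the pipeline with `--setup_file ./setup.py`. Version pins use the standard pip `==` specifier inside each requirement string; the package name, version numbers, and the `my-dataflow-job` name below are placeholders, not values from this thread:

```python
# Sketch of a setup.py that stages dependencies to Dataflow workers.
# Each entry in REQUIRED_PACKAGES is a pip requirement string, so pins
# go inside the quotes: 'sqlalchemy==1.0.15', not 'sqlalchemy' == '1.0.15'.
import setuptools

REQUIRED_PACKAGES = [
    'sqlalchemy==1.0.15',  # placeholder version; pin to what you tested with
]

setuptools.setup(
    name='my-dataflow-job',  # placeholder
    version='0.0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
)
```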

Oh how I wish the local runner behaved the same way. Waiting for Dataflow to spin up slows down this debugging process something fierce.

@dovy was your idea to use sqlalchemy together with Cloud SQL? If so, did you get it up and running?

I remember @dovy's issue was resolved. Closing this. Please re-open if there are additional questions on the original issue.