Install package for workers
dovy opened this issue · 4 comments
Everything runs locally exactly as I want (using the new Apache Beam SDK). I'm using sqlalchemy, a popular ORM.
As I deploy to Dataflow using the DataflowPipelineRunner, I get the following:
No module named sqlalchemy.sql
I've tried running Dataflow both inside and outside a pyenv, and I've also set a requirements.txt. Nothing else is broken; all other packages seem to work as expected.
Any ideas?
Full error:
```
An exception was raised when trying to execute the work item <BatchWorkItem s01 steps=Writing to RDS/WriteImpl/initialize_write-in0/Read+Writing to RDS/WriteImpl/initialize_write+Writing to RDS/WriteImpl/initialize_write-out0/Write 166444193180234443> :

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 474, in do_work
    work_executor.execute()
  File "dataflow_worker/executor.py", line 909, in dataflow_worker.executor.MapTaskExecutor.execute (dataflow_worker/executor.c:24411)
    op.start()
  File "dataflow_worker/executor.py", line 473, in dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:14273)
    def start(self):
  File "dataflow_worker/executor.py", line 478, in dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:13554)
    pickler.loads(self.spec.serialized_fn))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 198, in loads
    return dill.loads(base64.b64decode(s))
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 260, in loads
    return load(file)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 250, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1133, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 726, in _import_module
    return getattr(__import__(module, None, None, [obj]), obj)
ImportError: No module named sqlalchemy.sql
```
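For reference, since the report mentions setting a requirements.txt: the file is handed to the service through the --requirements_file setup option, which makes Dataflow pip-install it on each worker. A minimal sketch, assuming that option; the project, bucket names, and pipeline body are placeholders:

```python
# Minimal sketch: wiring a requirements file into the job so Dataflow
# pip-installs it on every worker. Project/bucket values are placeholders.
import apache_beam as beam

argv = [
    '--runner', 'DataflowPipelineRunner',
    '--project', 'my-project',                        # placeholder
    '--staging_location', 'gs://my-bucket/staging',   # placeholder
    '--temp_location', 'gs://my-bucket/temp',         # placeholder
    '--requirements_file', 'requirements.txt',        # installed on workers
]

p = beam.Pipeline(argv=argv)
# ... pipeline transforms go here ...
p.run()
```

If the option is set and the import still fails, the worker startup logs should show whether pip actually installed the package, which is what the next comment asks about.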
Could you look at the worker logs? Do you see your package being installed as expected?
@aaltay I sent you an email. It seems that now that I have everything properly structured, the pip install command itself is crashing. I would much prefer to run pip install -r requirements.txt than to specify packages manually in REQUIRED_PACKAGES within my setup.py.
However, if this is not possible, how do I specify the versions within REQUIRED_PACKAGES? == doesn't seem to be working.
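For what it's worth, REQUIRED_PACKAGES in the example setup.py files is just a list handed to setuptools' install_requires, so entries take standard pip version specifiers. A minimal sketch; the package name and pinned version are illustrative:

```python
# Minimal sketch of a setup.py in the style of the Dataflow/Beam examples.
# install_requires accepts standard pip specifiers, so an exact pin is
# written as 'name==version'. The pin below is illustrative.
import setuptools

REQUIRED_PACKAGES = [
    'sqlalchemy==1.1.4',   # exact pin; ranges like 'sqlalchemy>=1.1,<1.2' also work
]

setuptools.setup(
    name='my-dataflow-job',   # placeholder
    version='0.0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
)
```

The file is then passed to the pipeline via the --setup_file option so it runs on the workers.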
Oh, how I wish the local runner behaved the same. Waiting for a Dataflow spin-up slows this debugging process down something fierce.
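One way to shorten that loop is to smoke-test the same pipeline on the local runner and only switch the runner flag for real runs. Note, though, that the local runner does not provision workers at all, which is exactly why package-installation problems like this one never reproduce locally. A sketch, using the runner names from the SDK generation referenced above:

```python
# Sketch: build the pipeline once, pick the runner via flags. The local
# runner skips worker provisioning entirely, so it cannot reproduce
# missing-package errors; it only validates the pipeline logic.
import apache_beam as beam

def build_pipeline(argv):
    p = beam.Pipeline(argv=argv)
    # ... the same transforms in both cases ...
    return p

# Fast local iteration:
build_pipeline(['--runner', 'DirectPipelineRunner']).run()

# Real run on the service (staging/temp/requirements flags as sketched earlier):
# build_pipeline(['--runner', 'DataflowPipelineRunner', ...]).run()
```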