data-apis/python-record-api

Failure on pandas commands

datapythonista opened this issue · 7 comments

I've got this script:

import pandas

df = pandas.DataFrame({'col': ['foo bar']})
df['col'].map(lambda x: len(x.split(' ')))

When I run it with the Python interpreter, it works without problems.

But when I run it with PYTHON_RECORD_API_TO_MODULES="pandas" python -m record_api, I get the following error:

Traceback (most recent call last):
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/__main__.py", line 12, in <module>
    tracer.calls_from_modules[0], run_name="__main__", alter_sys=True
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/runpy.py", line 205, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/mgarcia/quansight/dataframe_tools/kaggle/mutable/scripts/9996822.py", line 4, in <module>
    df['col'].map(lambda x: len(x.split(' ')))
  File "/home/mgarcia/quansight/dataframe_tools/kaggle/mutable/scripts/9996822.py", line 4, in <module>
    df['col'].map(lambda x: len(x.split(' ')))
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/core.py", line 564, in __call__
    Stack(self, frame)()
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/core.py", line 372, in __call__
    getattr(self, method_name)()
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/core.py", line 477, in op_CALL_METHOD
    self.process((function,), function, args)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/core.py", line 354, in process
    log_call(f"{filename}:{line}", fn, *args, **kwargs)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/core.py", line 262, in log_call
    bound = Bound.create(fn, args, kwargs)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/core.py", line 239, in create
    sig = signature(fn)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/pandas/core/generic.py", line 1799, in __hash__
    f"{repr(type(self).__name__)} objects are mutable, "
TypeError: 'Series' objects are mutable, thus they cannot be hashed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/__main__.py", line 15, in <module>
    raise Exception(f"Error running {tracer.calls_from_modules}")
Exception: Error running ['9996822']

I'm not sure what the exact pattern is, but I get an error like this in almost every script that uses pandas. Let me know if you need more information; I can find other examples, but I guess it will be obvious to you what's wrong.

Can you try with the new version? This works for me locally:

$ cat tmp.py
import pandas

df = pandas.DataFrame({'col': ['foo bar']})
df['col'].map(lambda x: len(x.split(' ')))
$ PYTHON_RECORD_API_TO_MODULES="pandas" PYTHON_RECORD_API_OUTPUT_FILE=tmp.jsonl PYTHON_RECORD_API_FROM_MODULES=tmp python -m record_api
$ cat tmp.jsonl
{"location":"/Users/saul/p/python-record-api/tmp.py:3","function":{"t":"type","v":{"module":"pandas.core.frame","name":"DataFrame"}},"bound_params":{"pos_or_kw":[["data",{"t":"dict","v":[["col",["foo bar"]]]}]]}}
{"location":"/Users/saul/p/python-record-api/tmp.py:4","function":{"t":"builtin_function_or_method","v":{"module":"_operator","name":"getitem"}},"bound_params":{"pos_only":[["a",{"t":{"module":"pandas.core.frame","name":"DataFrame"}}],["b","col"]]}}

That's weird, I'm on the latest version... Here is the script I ran, followed by its output, which includes the rest of the relevant versions:

#!/bin/sh
python --version
python -c "import pandas; print('pandas', pandas.__version__)"
python -c "import record_api; print('record_api', record_api.__version__)"
export PYTHON_RECORD_API_TO_MODULES="pandas"
export PYTHON_RECORD_API_FROM_MODULES=mutable_error
export PYTHON_RECORD_API_OUTPUT_FILE=output.jsonl
python -m record_api
Python 3.7.6
pandas 1.0.2
record_api 1.1.0
Traceback (most recent call last):
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/__main__.py", line 12, in <module>
    tracer.calls_from_modules[0], run_name="__main__", alter_sys=True
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/runpy.py", line 205, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/mgarcia/quansight/dataframe_tools/kaggle/mutable/scripts/mutable_error.py", line 4, in <module>
    df['col'].map(lambda x: len(x.split(' ')))
  File "/home/mgarcia/quansight/dataframe_tools/kaggle/mutable/scripts/mutable_error.py", line 4, in <module>
    df['col'].map(lambda x: len(x.split(' ')))
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/core.py", line 564, in __call__
    Stack(self, frame)()
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/core.py", line 372, in __call__
    getattr(self, method_name)()
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/core.py", line 477, in op_CALL_METHOD
    self.process((function,), function, args)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/core.py", line 354, in process
    log_call(f"{filename}:{line}", fn, *args, **kwargs)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/core.py", line 262, in log_call
    bound = Bound.create(fn, args, kwargs)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/core.py", line 239, in create
    sig = signature(fn)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/pandas/core/generic.py", line 1799, in __hash__
    f"{repr(type(self).__name__)} objects are mutable, "
TypeError: 'Series' objects are mutable, thus they cannot be hashed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/mgarcia/miniconda3/envs/pydata/lib/python3.7/site-packages/record_api/__main__.py", line 15, in <module>
    raise Exception(f"Error running {tracer.calls_from_modules}")
Exception: Error running ['mutable_error']

It works for me in a clean environment, but with newer versions of the other deps:

$ conda create -n tmp -c conda-forge python=3.8 pandas
$ conda activate tmp
$ pip install python-record-api
$ python --version
Python 3.8.2
$ python -c "import pandas; print('pandas', pandas.__version__)"
pandas 1.1.0.dev0+1519.gd09f20e29
$ python -c "import record_api; print('record_api', record_api.__version__)"
record_api 1.1.0

I will try with your versions.

I see the issue though: I am caching calls to signature to speed things up, and I guess the Series object cannot be hashed...

I will add a fix so that when something cannot be hashed, the signature is computed without the cache.

I just published 1.1.1, could you try that?

Yes, all good now, thanks!