BUG: Assignment of Timestamp scalar uses microsecond precision, Series uses nanosecond
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
import pandas as pd
ts = pd.Timestamp.now()
df = pd.DataFrame({"a": [1]})
df["direct_assignment"] = ts
df["series_assignment"] = pd.Series(ts)
df.dtypes
```
yields
```python
a int64
direct_assignment datetime64[us]
series_assignment datetime64[ns]
dtype: object
```
Issue Description
I was surprised to see the dtype mismatch here.
Expected Behavior
At least for backwards compatibility, we might want to make the scalar assignment still yield nanosecond resolution.
Installed Versions
```
INSTALLED VERSIONS
commit : c2cd90a
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 6.2.0-33-generic
Version : #33-Ubuntu SMP PREEMPT_DYNAMIC Tue Sep 5 14:49:19 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.0dev0+341.gc2cd90ac54
numpy : 1.24.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : 0.29.33
pytest : 7.4.2
hypothesis : 6.87.1
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : 3.1.6
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.7
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat: None
fastparquet : 2023.8.0
fsspec : 2023.9.2
gcsfs : 2023.9.2
matplotlib : 3.7.3
numba : 0.57.1
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : 1.2.3
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2023.9.2
scipy : 1.11.3
sqlalchemy : 2.0.21
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.9.0
xlrd : 2.0.1
zstandard : 0.21.0
tzdata : 2023.3
qtpy : None
pyqt5 : None
```
I just ran into a similar issue:
```python
df1 = pd.DataFrame(data={"x": [1], "d": [pd.Timestamp("2020-01-01")]})
df2 = pd.DataFrame(data={"x": [1], "d": pd.Timestamp("2020-01-01")})
```
Question: which resulting dtypes do df1 and df2 have?
Answer:
```python
>>> df1.dtypes
x             int64
d    datetime64[ns]
dtype: object
>>> df2.dtypes
x             int64
d     datetime64[s]
dtype: object
```
...and the resulting DataFrames are incompatible and cannot be concatenated because of the mismatched dtypes!
I think it boils down to `pd.Timestamp("2020-01-01")` deciding on an internal granularity automatically. The `unit` argument does nothing here (it is used for interpreting the input value, not for the resulting internal dtype), and there seems to be no parameter to switch the automatic inference off. So I think `Timestamp` should always be "ns" unless you explicitly request something like `Timestamp(..., resolution="s")`. Otherwise we get different, incompatible dtypes depending on the input string (which might come from external sources). The only current workaround seems to be `Timestamp("2020-01-01").as_unit("ns")`; then my example from above works.
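For reference, a minimal sketch of that workaround (assuming pandas 2.x semantics as described in this thread; the expected dtypes come from the examples above and are not re-verified here):

```python
import pandas as pd

# Force nanosecond resolution explicitly, as suggested above.
ts = pd.Timestamp("2020-01-01").as_unit("ns")

df1 = pd.DataFrame(data={"x": [1], "d": [ts]})
df2 = pd.DataFrame(data={"x": [1], "d": ts})

# Both "d" columns should now be datetime64[ns], so the frames
# can be concatenated without a dtype mismatch.
print(df1.dtypes)
print(df2.dtypes)
print(pd.concat([df1, df2], ignore_index=True).dtypes)
```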
Hello @WillAyd,
I would love to work on this.
I have found that the direct assignment passes through the `infer_dtype_from_scalar()` function in the `cast.py` file. Inside this function, the following cast is what gives the `us` precision: `val = val.to_datetime64()`.
To recap, the direct assignment follows this call trace to the problem: `DataFrame.__setitem__()` -> `DataFrame._set_item()` -> `DataFrame._sanitize_column()` -> `construction.sanitize_array()` -> `dtypes.cast.construct_1d_arraylike_from_scalar()` -> `dtypes.cast.infer_dtype_from_scalar()`. Inside this last function, these lines of code are the source of the problem:
```python
elif isinstance(val, (np.datetime64, dt.datetime)):
    ...
    if val is NaT or val.tz is None:
        val = val.to_datetime64()
        dtype = val.dtype
```
I am just getting started with this kind of open-source work, so please do not hesitate to give me any kind of guidance. Also, I would be happy to keep working on the problem if you have any specific guidelines.
Thank you
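For what it's worth, a small diagnostic sketch (assuming pandas 2.x, where `Timestamp` exposes a `unit` attribute) illustrating where the microsecond resolution would come from on this call path:

```python
import pandas as pd

ts = pd.Timestamp.now()

# Timestamp.now() is derived from a stdlib datetime, so its unit is
# expected to be "us" (assumption based on the discussion above).
print(ts.unit)

# to_datetime64() preserves that resolution; infer_dtype_from_scalar()
# then adopts the resulting numpy dtype, e.g. datetime64[us].
print(ts.to_datetime64().dtype)
```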
Dupe of ⬇️ ?
@davetapley How is `pd.Timestamp.now()` a Python datetime?
@ValueRaider I'm not sure I follow? It is literally a `datetime` in the sense that:
pandas/pandas/_libs/tslibs/timestamps.pyi (line 34 in dc37a6d)
i.e.:
```python
>>> import pandas as pd
>>> from datetime import datetime
>>> isinstance(pd.Timestamp.now(), datetime)
True
```
Re: my specific linking of #55014 as a possible dupe: the issues are linked because they both have the same symptom, as identified in #55014 (comment):
- for scalars, the resolution is preserved (so for stdlib datetime, it becomes 'us', because that's the resolution of the python stdlib)
- for a list, the resolution is 'ns' by default
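For concreteness, a small sketch of the two cases described in those bullet points (expected dtypes per the linked comment, not re-verified here):

```python
import datetime as dt
import pandas as pd

df = pd.DataFrame({"a": [1]})

# Scalar assignment: the scalar's resolution is preserved, so a stdlib
# datetime is expected to yield datetime64[us].
df["scalar"] = dt.datetime(2020, 1, 1)

# List assignment: the default resolution is nanoseconds, so this is
# expected to yield datetime64[ns].
df["from_list"] = [dt.datetime(2020, 1, 1)]

print(df.dtypes)
```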
@davetapley It does appear similar, but my concern is that that thread treats the bug as a low-priority edge case. I think a conversation is needed regarding the expected behaviour in pandas 2 when instantiating a DataFrame with columns of type `dt.datetime`. That this happens using the pure pandas API should raise the urgency.