pandas-dev/pandas

Forthcoming dateutil code reorganization

pganssle opened this issue · 28 comments

I know that pandas makes use of some of the private interfaces in dateutil (and I'm hoping to start adding features so that this is no longer necessary), so I thought I'd give you a heads up about a proposed changed to the organization of the parser module that will change the location of some of the private interfaces.

Basically parser.py will move to parser/_parser.py, and only the public parts will be imported into parser/__init__.py, as part of the introduction of a dedicated ISO-8601 parser. See PR dateutil/dateutil#489 for the current discussion.

Thanks for the heads up! We test against dateutil master in one of our jobs (here), so we should be able to catch these and patch them in pandas quickly.

Do you have a rough plan for the next dateutil release?

@TomAugspurger I'm trying to finalize some things, I want to get isoparse in there and possibly a few improvements to the time zones. I am hoping a new feature release before the end of the year. I usually finalize a release on master and let it sit for a week or so, I can ping you when that happens so you'll have some notice.

@pganssle I'm -1 on rushing the isoparse. The best-case scenario is that the dateutil implementation can drop-in replace the _string_to_dts pandas currently uses for parsing these formats. Since the current pandas implementation isn't fully spec compliant (as you mentioned in an issue somewhere), it isn't obvious how easy (or feasible) this will be. So yah, advocating patience and thorough in-house testing/profiling.

As for pandas accessing dateutil internals, there are a couple of workarounds pandas uses that we can probably make unnecessary: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslibs/parsing.pyx#L207, https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslibs/parsing.pyx#L315

Note the tail end of tslibs.parsing.dateutil_parse is almost identical to _build_tzinfo from dateutil#493.

I'm not really planning on "rushing" the isoparse implementation, but at the end of the day because it's spec-driven, a lot of the behavior questions are actually made ahead of time (assuming we stick to the spec). The main question then comes down to the public API, and decisions about that can always be pushed off to later by releasing a highly constrained version (e.g. the only supported interface is isoparse(dt_str) - which we know we want to support anyway).

In any case, I'd like some form of it to be in dateutil 2.7.0.

@jbrockmendel I would also guess that a lot of this stuff will have to be in conditional import logic or something, unfortunately. I suspect it could cause a decent number of problems if pandas bumped their dependency to the latest version of dateutil, since that could cause inconsistencies in pinned versions.

@pganssle if you are moving things to private methods please deprecate first; simply moving will break lots of outstanding installs.

@jreback I'm not moving things to private methods, I'm moving private functions/classes (i.e. in the physical layout of the code). I've pulled out the reorganization portion of the isoparser PR into dateutil/dateutil#501 so it's easier to see what's happening.

Basically, if you're calling dateutil.parser._tzparser or something, it will have moved, but even stuff that is not in dateutil.parser.__all__ will all still be importable from dateutil.parser if it doesn't start with _

what i am saying is if you are changing your publicly exposed functions you should deprecate

@jreback pandas should not be affected by the changes pganssle is discussing, with the exception of the private import of _timelex in tslibs.parsing. I'll make sure to patch that try/except import when appropriate.

I'm not really planning on "rushing" the isoparse implementation

@pganssle it probably only seems rushed to me because I haven't made time to look it over closely. Will do so this evening.

@pganssle we should patch this on the pandas side
but the issue is that the upstream needs to deprecate or completely break existing installations which is not nice

I'm sure some people will break if they are relying on the private interface, that's a consequence of Python's philosophy of "nothing is really private", but I think there's a better than average chance that anyone who notices / pays attention to DeprecationWarning is also likely to not use private methods in the first place, and if they do use private methods, they're also likely to test against the master of their dependencies (as you guys are doing, presumably at least partially because you've been burned before).

Not saying things won't break out there, just saying that deprecation warnings (which aren't even on by default) probably won't mitigate this by much. In the short term people can pin their dependencies until they fix their fragile code. Worst case scenario, the library is BSD licensed, so there are no viral concerns from vendoring it.

That said, for some of the stuff where I'm actually planning on exposing it publicly anyway (_tzparser, _timelex), it might not be the worst thing in the world to go through a deprecation cycle. Still, the message is weird: "This private method has moved, you shouldn't be using it in the first place, but if you must, pick it up from somewhere else now."

@pganssle that is exactly what we did for lots of things in pandas over the last year
the entire code base was public not its much more private
my point is that you can do what you want but i would test against already released versions to see if there is breakage (then u can decide what to do)

expecting users to upgrade another package just to fix a misuse is not great

note that something is wrong in dateutil master already
see https://travis-ci.org/pandas-dev/pandas/jobs/298549012

we are probably using a private module and have been for quite some time

@jreback Well, it's not great in general that people count on private interfaces. I understand why it's done in this case (and in fact I'm working to remedy that by exposing some of these in a way that I'm comfortable supporting), but at the end of the day any such private interface calls will necessarily be fragile. They're private because they're implementation details, not to be relied upon.

In any case, my point was mostly that the situation is what it is. DeprecationWarning won't prevent you from needing to update pandas if you want the latest dateutil, and most people won't see them anyway, so it seems like a lot of work do deprecate things that people shouldn't be using anyway (for this very reason).

That said, I'll try to go the deprecation route with some of the bigger things.

@pganssle it’s not that i care about deprecation per se
more that it will just work for existing installations
iow what happens is people have a version of pandas say 0.21.0 and then they upgrade dateutil (or vice versa); would like for it just to work in a back compact way (iow we can easily try/except to achieve this); but if u can preserve the interface (private or not; it has been this way for a long time) for a version or 2 would be really helpful

further i would suggest bumping the major version number in any event

@pganssle

your current setup on master is not installing dateutil.parser, e.g.

#!/usr/bin/python
from os.path import isfile
import os

from setuptools import setup

from dateutil._version import VERSION

if isfile("MANIFEST"):
    os.unlink("MANIFEST")

setup(name="python-dateutil",
      version=VERSION,
      description="Extensions to the standard Python datetime module",
      author="Paul Ganssle",
      author_email="dateutil@python.org",
      url="https://dateutil.readthedocs.io",
      license="BSD 3-Clause",
      long_description="""
The dateutil module provides powerful extensions to the
datetime module available in the Python standard library.
""",
      packages=["dateutil", "dateutil.zoneinfo", "dateutil.tz"],
      package_data={"dateutil.zoneinfo": ["dateutil-zoneinfo.tar.gz"]},
      zip_safe=True,
      requires=["six"],
      install_requires=["six >=1.5"],  # XXX fix when packaging is sane again
      classifiers=[
          'Development Status :: 5 - Production/Stable',
          'Intended Audience :: Developers',
          'License :: OSI Approved :: BSD License',
          'Programming Language :: Python',
          'Programming Language :: Python :: 2',
          'Programming Language :: Python :: 2.7',
          'Programming Language :: Python :: 3',
          'Programming Language :: Python :: 3.2',
          'Programming Language :: Python :: 3.3',
          'Programming Language :: Python :: 3.4',
          'Programming Language :: Python :: 3.5',
          'Programming Language :: Python :: 3.6',
          'Topic :: Software Development :: Libraries',
      ],
      test_suite="dateutil.test"
      )
(

needs to be added here (or can use find_packages)

      packages=["dateutil", "dateutil.zoneinfo", "dateutil.tz"],

Thanks, good catch. I should probably set up the tests so that they install and run the installed version rather than running straight from the repo, to avoid this sort of problem.

so I just merged #18172 which restores our build where we use dateutil from master (fixed by @pganssle in dateutil/dateutil#507)

We have a couple of failures. These must be some changes in dateutil as we pass vs 2.6.1. These generally look ok, and we should simply update our tests conditional on the dateutil version.

https://travis-ci.org/pandas-dev/pandas/jobs/299303619

_________ TestGuessDatetimeFormat.test_guess_datetime_format_for_array _________
[gw0] linux -- Python 3.6.3 /home/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.indexes.datetimes.test_tools.TestGuessDatetimeFormat object at 0x7f81e01c6240>
    def test_guess_datetime_format_for_array(self):
        tm._skip_if_not_us_locale()
        expected_format = '%Y-%m-%d %H:%M:%S.%f'
        dt_string = datetime(2011, 12, 30, 0, 0, 0).strftime(expected_format)
    
        test_arrays = [
            np.array([dt_string, dt_string, dt_string], dtype='O'),
            np.array([np.nan, np.nan, dt_string], dtype='O'),
            np.array([dt_string, 'random_string'], dtype='O'),
        ]
    
        for test_array in test_arrays:
>           assert tools._guess_datetime_format_for_array(
                test_array) == expected_format
E           AssertionError: assert None == '%Y-%m-%d %H:%M:%S.%f'
E            +  where None = <function _guess_datetime_format_for_array at 0x7f81919e3ea0>(array(['2011-12-30 00:00:00.000000', '2011-12-30 00:00:00.000000',\n       '2011-12-30 00:00:00.000000'], dtype=object))
E            +    where <function _guess_datetime_format_for_array at 0x7f81919e3ea0> = tools._guess_datetime_format_for_array
pandas/tests/indexes/datetimes/test_tools.py:912: AssertionError
__ TestGuessDatetimeFormat.test_guess_datetime_format_with_parseable_formats ___
[gw0] linux -- Python 3.6.3 /home/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.scalar.test_parsing.TestGuessDatetimeFormat object at 0x7f81895c9470>
    def test_guess_datetime_format_with_parseable_formats(self):
        tm._skip_if_not_us_locale()
        dt_string_to_format = (('20111230', '%Y%m%d'),
                               ('2011-12-30', '%Y-%m-%d'),
                               ('30-12-2011', '%d-%m-%Y'),
                               ('2011-12-30 00:00:00', '%Y-%m-%d %H:%M:%S'),
                               ('2011-12-30T00:00:00', '%Y-%m-%dT%H:%M:%S'),
                               ('2011-12-30 00:00:00.000000',
                                '%Y-%m-%d %H:%M:%S.%f'), )
    
        for dt_string, dt_format in dt_string_to_format:
>           assert parsing._guess_datetime_format(dt_string) == dt_format
E           AssertionError: assert None == '%Y%m%d'
E            +  where None = <built-in function _guess_datetime_format>('20111230')
E            +    where <built-in function _guess_datetime_format> = parsing._guess_datetime_format
pandas/tests/scalar/test_parsing.py:81: AssertionError
_______ TestGuessDatetimeFormat.test_guess_datetime_format_with_dayfirst _______
[gw0] linux -- Python 3.6.3 /home/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.scalar.test_parsing.TestGuessDatetimeFormat object at 0x7f8188b1a940>
    def test_guess_datetime_format_with_dayfirst(self):
        ambiguous_string = '01/01/2011'
>       assert parsing._guess_datetime_format(
            ambiguous_string, dayfirst=True) == '%d/%m/%Y'
E       AssertionError: assert None == '%d/%m/%Y'
E        +  where None = <built-in function _guess_datetime_format>('01/01/2011', dayfirst=True)
E        +    where <built-in function _guess_datetime_format> = parsing._guess_datetime_format
pandas/tests/scalar/test_parsing.py:85: AssertionError
_________ TestGuessDatetimeFormat.test_guess_datetime_format_nopadding _________
[gw0] linux -- Python 3.6.3 /home/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.scalar.test_parsing.TestGuessDatetimeFormat object at 0x7f81dae48828>
    def test_guess_datetime_format_nopadding(self):
        # GH 11142
        dt_string_to_format = (('2011-1-1', '%Y-%m-%d'),
                               ('30-1-2011', '%d-%m-%Y'),
                               ('1/1/2011', '%m/%d/%Y'),
                               ('2011-1-1 00:00:00', '%Y-%m-%d %H:%M:%S'),
                               ('2011-1-1 0:0:0', '%Y-%m-%d %H:%M:%S'),
                               ('2011-1-3T00:00:0', '%Y-%m-%dT%H:%M:%S'))
    
        for dt_string, dt_format in dt_string_to_format:
>           assert parsing._guess_datetime_format(dt_string) == dt_format
E           AssertionError: assert None == '%Y-%m-%d'
E            +  where None = <built-in function _guess_datetime_format>('2011-1-1')
E            +    where <built-in function _guess_datetime_format> = parsing._guess_datetime_format

https://travis-ci.org/pandas-dev/pandas/jobs/302975078

looks like more changes in master on dateutil, including deprecation of _timelex.

@pganssle is there a direct replacement for _timelex, IOW taking a string and giving me back the tokens (rather than a datetime)? trying to move off this deprecated function (and its ok if its not back-compat).

@jreback It's not that it's deprecated so much as it was always private. It is an implementation detail of the public interface, which is the auto-magical parse function, so there is no supported replacement for it.

Before the end of the year, I'd like to make a pretty significant refactor to parser.parse (end of the year being the arbitrary time at which I'm going to switch to an LTS branch and a 3.x-only branch). I and @jbrockmendel both are very keen on exposing some of the actual details of parse's auto-magical action for people not content to just consume anything that looks like a date, and part of that effort will probably be a public lexer/tokenizer.

Until then, I think the best thing to do would be to just vendor _timelex somewhere in your source tree. Presumably if you put it in a .pyx file, that would also make the lexing part a bit faster. Once the parser refactor is done you can start using the public interface for that if it suits your needs better - I'll ping pandas' issue tracker when that happens (though I'm sure @jbrockmendel will be on it since he's very keen on all this parser stuff and also on pandas).

@pganssle is there still upcominig reorganization we need to watch out for here, or can this be closed?

@jbrockmendel I think this happened some time ago already (the parser reorg), plus pandas tests against dateutil master now, so as long as pandas has done everything it needs to do to adapt to the new code organization, feel free to close this.