apache/datasketches-cpp

[python packaging] Top-level name problems

thatch opened this issue · 4 comments

Hi, looking at the 3.4.0 wheel for datasketches as released, the *.dist-info/top_level.txt file contains:

datasketches
src
tests

The top-level dir "src" should not be included at all; the one file it contains is not useful.
The top-level dir "tests" should be called something more unique (like "datasketches_tests") if you intend to package it in the wheel. Alternatively just keep this in the sdist?

Both of these are likely to cause conflicts with other projects.

Thanks for this, @thatch. We're not really python people so such suggestions are quite helpful.

We do expect to have some native python included at some point (trying to allow the use of generic python objects with container sketches, like kll or sampling, and that'll require user-specified classes for serialization or even comparators). I assume we'd want need to ensure they get included under datasketches?

I assume we'd want need to ensure they get included under datasketches?

Yep, the normal thing would be have this structure:

datasketches/__init__.py
datasketches/_datasketches.so
datasketches/kll.py

Right now you have datasketches.so and it is quite difficult to have both python and native parts of the same name. You can have the __init__.py do from ._datasketches import * so all the names are available as a way of combining them.

I'd experimented with the library part a bit a few months ago but we didn't have anything concrete to do with it then.

Right now we're trying to figure out why the Windows CI job started failing. Once that's done we'll be able to merge the linked PR (or maybe before if we determine it's sufficiently unrelated -- which really should be the case).

This is now merged, along with a few other simplifications when revisiting the files. Thanks for pointing it out!