karlicoss/HPI

merge rexport, pushshift and gdpr reddit data

purarue opened this issue ยท 25 comments

Was able to get pushshift as mentioned in the README working to export old comments. Thought I'd mention it here.

It's possible to use pushshift to get data further back, but I'm not sure if it should be part of this project, since some of the older comments don't have the same JSON structure, I can only assume pushshift is getting data from multiple sources the further it goes back. It requires some normalization, like here and here.

The only data I was missing due to the 1000 limit on queries from using rexport were comments. It exported the last 1000 but I have about 5000 on reddit in total.

Regarding HPI:

Wrote a simple package to request/save that data, with a dal (whose PComment NamedTuple has similar @property attributes to rexports DAL), and a merge function, and now:

In [15]: from my.reddit import comments, _dal, pushshift_comments

In [16]: len(list(_dal().comments())) # from dal.py in rexport
Out[16]: 999

In [17]: len(list(pushshift_comments())) # from pushshift
Out[17]: 4891

In [18]: len(list(comments())) # merged data, using utc_time to remove duplicates
Out[18]: 4893

In [19]: comments
Out[19]: <function my.reddit.comments() -> Iterator[Union[rexport.dal.Comment, pushshift_comment_export.dal.PComment]]>

Its possible that one could write enough @property wrappers to handle the differences in the JSON representations of old pushshift data, unsure if thats something you want to pursue here.

Nice job! Just checked it out too (although I don't have as many comments to need it), and it works!

Wondering, how do you approach Python entry points/executable scripts while developing?

E.g. one could either:

  • make a symlink __main__.py -> __init__.py, then you can run via python3 -m pushshift_comment_export

  • make __init__.py executable and run pushshift_comment_export/__init__.py. Looks kinda odd, but nothing wrong with it in theory, although I tend not to name the script __init__.py in such cases

These have a benefit of not messsing with installed packages at all, which I feel is nice, but maybe not that big of a deal?
There is a third option, which I believe you're using:

  • install in 'editable' mode (pip3 install --user .). The only annoying thing, imo, is that it messes with ~/.local/bin.

    Ultimately though, I guess installing (whether editable or normal) is preferable. I kind of regret messing with dynamic imports like this, it only adds complexity, so in the future I'll probably get rid of them in favor of the users installing packages properly. Perhaps in a dedicated HPI virtualenv to avoid cluttering user packages.

Regarding merge function: looks good!
It would be cool to merge into upstream too, but my only concern with the current way (keeping everything in reddit.py) would be rexport and pushshift_comment_export imports. If someone only uses one of these, the whole data provider would fail.

I guess ideally I see it organized similarly to other multisource data providers, e.g. something like:

  • my.reddit.all -- top level comments, submission, the stuff that would be actually exposed to the data consumers
    Maybe this actually can live in my/reddit/__init__.py? But need to be careful about circular dependencies.
  • my.reddit.common -- merge functions.
    I guess for multisource data providers it's also inevitable to have some code duplication and add schemas here? It's not necessary in runtime, but Union[rexport.dal.Comment, pushshift_comment_export.dal.PComment] won't typecheck if someone only has rexport or pushift_comment_export installed. But can probably be worked around and not that big of a deal for most users.
  • my.reddit.rexport -- basically this stuff
  • my.reddit.pushshift -- what you added
  • my.reddit.gdpr -- potentially could be added in the future if/when Reddit automates GDPR takeouts properly

The main goal in this structure IMO is to keep my.reddit.all as small as possible, so it can be easily overridden by other people.
If you have any interesting ideas about it, feel free to open a separate issue in HPI so we can discuss there!

Wondering, how do you approach Python entry points/executable scripts while developing?

Recently new thought process, but; If I'm developing something that I'd run once and it exports to JSON/some nice format, and I don't have to do any sort of normalization when loading in, (for example my recent forum_parser and steam_scraper projects), then I just use individual python scripts with no packaging, which require you to clone and run directly.

If its something that requires a DAL, or it would be nice to have library usage to be able to merge exports together (so, like ffexport), I prefer to just use setuptools to install the package globally. That way its relatively easy to use as a library. I do get the appeal of this (dal_helper), but its not something I wanted to wrap my head around.

I would agree with what you say for pushshift_comment_export, making it __main__.py makes more sense here. If I expected more wider/periodic usage, and there was no dal.py, you could just have the one script, I would just use the py_modules setuptools flag. Being able to do python3 -m pushshift_comment_export is nice as well. I will probably edit that to be part of __main__.py instead.

If something is installed with setuptools:

  • I have two cookiecutter templates I use, one is relatively old and uses click/has a __main__.py wrapper, the other is a somewhat minimal setuptools module. The latter is what I used for pushshift_comments_export

  • I will typically install it as an editable package with developing, even though this causes annoyances. It just gives me a faster feedback loop than uninstalling/reinstalling. That puts a line in ~/.local/lib/python/*/easy-install.pth that doesn't get removed when you uninstall the editable package, so after Im done with development I run:

    • pip uninstall -y package_name
    • rm -i ~/.local/lib/python*/site-packages/package_name*
    • rm -v $(which executable_name)
    • pip-editable-installs: aliased to e "$(find ~/.local/lib/python* -name easy-install.pth)"

It is quite annoying that this isn't a supported use case for python development. Once I'm done, I'll reinstall using pip3 install --user .

  • I typically run an entr one-liner in another terminal, like one of these:
  99990  find harden_frontmatter | entr -c sh -c "pandoc ./post/server_setup/README.md -t json | ./harden_frontmatter"
  102146  fd -E .git | entr -c ./build
  77806  while true; do git ls-files | entr -cdp sh -c 'make uninstall && python3 setup.py install && find -type f -name *.py -exec pyflakes {} ..
> 120547  while true; do; find ./parse.py | entr python3 ./parse.py --from-file ./data.json --to-file ./parsed.json; sleep 1; done
  124024  find facebook.py | entr python3 -c 'from my.facebook import events; print(list(events()))'

That re-runs a script/file I currently have installed/runs the python3 setup.py and then a script, whenever I make any changes to the file I'm currently writing, so I have a faster feedback loop.

so in the future I'll probably get rid of them in favor of the users installing packages properly. Perhaps in a dedicated HPI virtualenv to avoid cluttering user packages

I don't really think of it as cluttering user packages. I agree the way pip handles installing into a bin directory isn't ideal. For newer users they might not want to mess with the $PATH, so making scripts python -m compliant is a good idea.

But, as long as the package name isn't parser, or something that could cause name conflicts, no ones creating something called pushshift_comment_export or ffexport. setuptools already exists, it allows users to install packages I dont push to pypi using the git+https pattern, and it allows you to import the package from any directory to use a library. Writing a setup.py (by which I mean using my cookiecutter-lib template) has too many benefits if I'm going to use something more than once.

If cluttering the namespace is that much of an issue, a virtualenv could potentially solve that, but I think virtualenvs themselves are another layer of complexity that people might be turned off from.

Deciding whether or not to push a setuptools package to pypi isn't really that much of a concern of mine, I do it if I feel the project might have breaking changes/I want it to be more stable.

Regarding dynamic imports, I think specifying a directory for the dal in HPI for individual projects isn't a great workflow, paths may change and it feels prone to errors. However, for something like configuration; my.config, which more conforms to some sort of xdg standard and it makes sense to split the configuration as not part of the project, dynamic imports seem pretty nice there. A couple other projects, like ipython and thefuck do things similarly.

If someone only uses one of these, the whole data provider would fail.

Right. It would probably be a breaking change anyways, to convert rexport to a setuptools package. One could potentially keep it like:

.
|-rexport
  |- __init__.py
  |-- dal.py
  |-- export.py
dal.py
setup.py

The upper dal.py references the rexport package in the same directory, so you're still able to import by doing from rexport.dal import DAL (I believe?), so this pattern could be kept backwards compatible.

Converting to a package has other benefits, like someone can also do something like:

from rexport.export import Exporter
json_values = Exporter.export(**some_params)

I think allowing people to import from rexport using setuptools in cases like this is pretty nice, because it means one can do pip install git+https... and its possibly easier to subclass/monkey-patch things in your own scripts. Otherwise you have to git clone and edit the files yourself.

If the namespace package names are unique enough, it could potentially let you handle multiple providers in a nicer way, like:

def events() -> Iterable[Type1, Type2]:
    yield from _merge_events(*provider_one(), *provider_two())

and then for each provider, you could have:

try:
	from rexport.dal import events as provider_one
except ImportError:
    # warn the user?
    provider_one = lambda: ()  # returns nothing, so this provider doesnt work
.... (another one)

_merge_events would have to know how to handle the different formats, but that complexity has to live somewhere.

If you have any interesting ideas about it, feel free to open a separate issue in HPI so we can discuss there!

Sounds good. I'm still currently trying to figure out my HPI process, been doing this for less than 2 weeks. Once I'm a bit more comfortable with this and I have things I'd like to merge back in, I'll patch changes into this fork and make some PRs. I think my own fork has diverged so much that its not possible to do so anymore.

then I just use individual python scripts with no packaging, which require you to clone and run directly.

Yep, agree, that's mostly what I've been doing so far.

I typically run an entr one-liner in another terminal, like one of these

Oh yeah, it's really nice! I should remember to use entr more often too.

But, as long as the package name isn't parser, or something that could cause name conflicts

Yeah agree. I had wtfs maybe a couple of times when something was used as an editable, and not cleaned up properly (I think?), but apart from that no real issues.

but I think virtualenvs themselves are another layer of complexity that people might be turned off from.

Yep! It would certainly be optional, just thinking that potentially might be useful if the number of data providers gets out of hand. I guess another similarity to Emacs packages, having them all in a single directory (even potentially under a massive git repository) helps with quick experiments and rollback.

Regarding dynamic imports, I think specifying a directory for the dal in HPI for individual projects isn't a great workflow, paths may change and it feels prone to errors
Converting to a package has other benefits, like someone can also do something like:

Yep, that's the plan. At first I had this urge to impose as minimal structure as possible, but it seems that conforming to setuptools is worth it. The only price to pay is adding a minimal setup.py and a subdirectory for the code (but perhaps this is a good habit anyway).

because it means one can do pip install git+https

Yep, this as well, certainly easier than carefully cloning at the correct location or copy pasting the path into the config.

and then for each provider, you could have:

That way could work too, yeah. I guess one downside is that it feels too defensive, e.g. if something is misconfigured, you might fallback onto empty data provider without realizing it. But on the other hand breaking things is more annoying.

Maybe it could benefit from different 'strategies' for handing missing data providers, e.g.

  • by default, warn in logs, and also instead of fallback onto an empty list, return a suitable Result value, e.g. Exception("Couldn't use rexport data because of ImportError ...."). Ideally the downstream tools would handle and propagate error values anyway so the user would notice (while also having the rest of the data sources work as intended)

  • allow 'disabling' data providers if they don't want to use it.
    If it's disabled, we can simply fallback onto empty data and only keep a debug/info log message.

    Actually, for disabling, it might as well reuse the same mechanism we discussed here!

  • and there could be a global policy for the default behaviour -- e.g. during active development I would prefer for things to crash immediately instead of returning error values, so would be nice to have a toggle for that. (related to #17)

I think my own fork has diverged so much that its not possible to do so anymore.

Yeah no problem, having fun with the code is more important :) Always can just copy-paste the code onto a fresh fork later.

I've added the link to readme and moving the issue to HPI (most of the discussion is relevant to it), hope you don't mind!

total misclick there, sounds good.

Once I've merged changes from rexport and #83 into my fork...

Would be possible to merge some of this (pushishift). Would probably be optional, like:

@dataclass
class reddit(uconfig)
	...
    # path[s]/glob to one or more comment exports
    # https://github.com/seanbreckenridge/pushshift_comment_export
    pushshift_export_path: Optional[Paths]

	
....
# if user has set pushshift comment export path...
	from pushshift_comment_export.dal import PComment, read_file

I can probably work on a PR at some point if thats something you're interested in.

Just merged #83 ! Let me know if you have any issues :)

Absolutely, would be great to merge!

  1. Configs: IMO it's best to separate the configs, e.g. class reddit_pushshift, class reddit_rexport. Or even better?:

    class reddit:
      class pushshift:
          export_path: Paths
    
      class rexport:
          export_path: Paths
    

    That way:

    • the naming for export_path is consistent
    • if there are some extra config attributes, they wouldn't mix up, and less potential for clashes
  2. Optional[Paths]: I thought that Optional[Paths] would look awkward (suggests it's sort of expected to be None). Considering that setting it to None would be fairly rare (you usually either use the module and set it to something reasonable, or don't have the config at all), I though that it would make sense to reserve empty string for that purpose:

    assert get_files('') == ()

  3. Code structure: I think ideally it's best to separate the different data providers into different files, extract the schemas/common stuff to something like common.py and have a minimal all.py file that merges them together, e.g.:

    The idea is that if the user adds their own data source, they can benefit from common.py, and supply their own all.py (that's why it should be as minimal as possible). I think it will also make backwards compatibility easier. What do you think?

P.S. Just recalled, I also had some experiments/notes about possible ways of overriding/extending modules, will dig them out and post somewhere for the discussion.

any issues

all good on my end, been merging commits as you've been merging them into master; Though, I'm only using ghexport and rexport currently.

nested classes for config

Would that work with my.core.cfg.make_config? If so looks good. Would need to add some checks for backward compatibility. The avoiding clashes point is good as well, for the possibility of sharing secret files for data sources.

I though that it would make sense to reserve empty string for that purpose

Feels sort of wrong from a typing perspective, but simple enough. Though, if we're doing nested classes, just don't include the pushshift class if its not used.

all.py/common.py

Yeah, this is better organized, I'll follow how its done in github. I suppose following that pattern, in common.py, Instead of doing a try/catch to stop an ImportError for pushshift_comment_export.dal in the merge function, would be easiest to catch that in the data-source specific (my.reddit.pushshift.py) file, and return an empty list to the merge function if the user didn't provide config/export path.

Would that work with my.core.cfg.make_config?

I think so? As far as I understand, nested classes in Python are merely a syntactic thing (e.g. even if you want to access an attribute of the enclosing class, you'd have to specify a fully qualified name)

Would need to add some checks for backward compatibility

Yeah. Regarding backward compatibility, also thought that my/reddit/__init__.py would be handy then, so old imports of my.reddit continue working. However then my.reddit wouldn't be a namespace package anymore, and that means that overrides won't work. I guess if the individual modules were properly versioned, that would be easier to handle, could be a major version change, or something similar. I'll think more about it...

would be easiest to catch that in the data-source specific (my.reddit.pushshift.py) file, and return an empty list to the merge function if the user didn't provide config/export path.

Ah yeah. Ideally, if the config has empty path, the list of inputs is empty, and the module just emits nothing and everything just works. The issue now is that the modules (e.g. import ghexport) are usually imported on the top level, both because it's more convenient, and for type annotation purposes. That means that if the user never wanted to use it in the first place, it still crashes on import, so they either have to install the dependencies, or override all.py and comment out the modules they aren't using.

Something I can think on spot: reorganize all.py to something like:

from my.core import X
from .common import merge_events, Results

def gdpr():
    from . import gdpr as source
    return source

def ghexport():
    from . import ghexport as source
    return source

def events() -> Results:
    yield from merge_events(
        X(gdpr()    .events(), []),
        X(ghexport().events(), []),
    )

X is a function which adds some defensive behaviour, something like:

def X(factory, default):
    try:
        res = factory()
        return res
    except ImportError: # presumably means that the user hasn't installed the module. other exceptions would be legitimate errors?
        warn(f"Module {factory.__name__} is not configured blah blah blah")
        return default

, it could perhaps be somehow configurable, for example, respect disabled_modules and in that case it won't emit the warning?

  • a bit boilerplaty. Really wish python allowed lambdas with multiple statements... maybe alternatively could use importlib
  • this is not mypy transparent (although all.py is meant to be super short so maybe it's not a biggie)

I'll think about it more.. maybe I'll come up with a way of lazily importing module dependencies so it's mypy friendly.

X is a function which adds some defensive behavior, something like:

Recently created an all.py for my.location, leaving it here as reference

To keep this more mypy-friendly, I ended up creating a top level 'Location' NamedTuple, even though each data source has its on 'Location' NamedTuple, though one could probably get around that by standardizing the interface/duck typing. Each source looks like:

def _google_locations() -> Iterator[Location]:
    yield from map(
        lambda gl: Location(lng=gl.lng, lat=gl.lat, dt=gl.at),
        filter(
            lambda g: isinstance(g, GoogleLocation),
            google_events(),
        ),
    )

... filtering, and then mapping it onto the common NamedTuple representation.

(imports are at the top since I don't expect anyone else to be using this, but they could be put in each function)

Otherwise, I could've defined some type like:

Location = Union[my.google.models.Location, my.apple.Location, my.location.ip.Location]

...but I was already writing @property wrappers on the underlying NamedTuple to standardize an interface, so they could all be used in my.time.tz.via_location in _iter_local_dates. Using the Union is nice because it means additional fields on the NamedTuple from particular provider are still there. In this case, there wasn't much extra data for each Location.


If someone was using this for the first time, I think the 'X' function which adds the fallback/warning message benefits outweigh the downsides of it being biolerplatey; If a user wanted to customize all.py to include their own sources, they could choose to exclude the check.

I can't think of any reason the 'X' function couldn't be abstracted away into core.

Really wish python allowed lambdas with multiple statements

yeah.... I've done some weird things before to try and get around this. elixir has spoiled me there

https://github.com/seanbreckenridge/mint/blob/master/budget/budget/cleandata/transactions/__init__.py#L8-L16
https://github.com/seanbreckenridge/mint/blob/master/budget/budget/cleandata/transactions/transform.py#L28-L59
https://github.com/seanbreckenridge/mint/blob/master/budget/budget/cleandata/transactions/transform.py#L227-L242

though one could probably get around that by standardizing the interface/duck typing

Yep, or Protocol

If someone was using this for the first time, I think the 'X' function which adds the fallback/warning message benefits outweigh the downsides of it being biolerplatey

Good point! I guess having a helper in the core would also allow making it configurable, e.g. defensive + warning by default and error/throw in 'developer mode.

yeah.... I've done some weird things before to try and get around this
Huh, clever!

Just as an update, Will probably attempt to merge a PR using an all.py file to merge ghexport and pushshift comments w/ some core.warnings if either arent configured unless you've decided against another style here.

Yeah, I guess it makes sense -- people are more likely to start with just one data source, so would be nicer to make this easier. And then I or generally more paranoid people could have an explicit policy to always throw instead of handling defensively.

Also there is @warn_if_empty annotation which can be used to make it easier to catch if it's too defensive (guess later it can also respect some error handling policy to be stricter, e.g. throw).

Hmm, regarding

class reddit:
  class pushshift:
      export_path: Paths

  class rexport:
      export_path: Paths

make_config doesnt work well with nested classes, am sort of stuck here

Was trying to localize the rexport_config to the rexport.py file, instead of having all the configuration in the common.py, because that means its less composable -- if a user wanted to add an additional datasource, mypy may complain because the types on the dataclass config object in common doesnt match their computed config (maybe thats fine, and they just #type: ignore[attr-defined] the import?

However, it being more composable means its more magical, and mypy might not be able to type it statically?

This code doesnt work, its just me trying to get something to work, trying subclassing/default_factory on the dataclass methods

diff --git a/my/core/cfg.py b/my/core/cfg.py
index b23fa86..7e6190c 100644
--- a/my/core/cfg.py
+++ b/my/core/cfg.py
@@ -1,25 +1,41 @@
-from typing import TypeVar, Type, Callable, Dict, Any
+from typing import TypeVar, Type, Callable, Dict, Any, Set
+
 
 Attrs = Dict[str, Any]
 
 C = TypeVar('C')
 
+# extract the names of nested classes from a dataclass object
+# (see reddit config for example)
+def _nested_classes(cls: Type[C]) -> Set[str]:
+    class_names: Set[str] = set()
+    import inspect
+    for name, _ in inspect.getmembers(cls, inspect.isclass):
+        if name == "__class__":  # ignore own class type
+            continue
+        class_names.add(name)
+    return class_names
+
+
 # todo not sure about it, could be overthinking...
 # but short enough to change later
 # TODO document why it's necessary?
 def make_config(cls: Type[C], migration: Callable[[Attrs], Attrs]=lambda x: x) -> C:
     user_config = cls.__base__
     old_props = {
-        # NOTE: deliberately use gettatr to 'force' lcass properties here
+        # NOTE: deliberately use gettatr to 'force' class properties here
         k: getattr(user_config, k) for k in vars(user_config)
     }
     new_props = migration(old_props)
     from dataclasses import fields
+    allowed_keys = {f.name for f in fields(cls)}.union(_nested_classes(cls))
     params = {
         k: v
         for k, v in new_props.items()
-        if k in {f.name for f in fields(cls)}
+        if k in allowed_keys
     }
+    print(params)
+    import pdb; pdb.set_trace()
     # todo maybe return type here?
     return cls(**params) # type: ignore[call-arg]
 
diff --git a/my/reddit/__init__.py b/my/reddit/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/my/reddit/all.py b/my/reddit/all.py
new file mode 100644
index 0000000..e69de29
diff --git a/my/reddit/common.py b/my/reddit/common.py
new file mode 100644
index 0000000..9071b59
--- /dev/null
+++ b/my/reddit/common.py
@@ -0,0 +1,59 @@
+from abc import ABCMeta
+from dataclasses import dataclass, field
+
+from my.core.common import Paths
+from my.config import reddit as uconfig
+
+
+# TODO: move to core? create a shared 'datasource' superclass?
+# need to think more about how to handle nested classes with make_config
+
+@dataclass
+class datasource(metaclass=ABCMeta):
+    export_path: Paths
+
+
+# empty strings for Paths objects resolve to empty data
+@dataclass
+class reddit(uconfig):
+    rexport: datasource = field(default_factory=lambda: datasource(''))
+    pushshift: datasource = field(default_factory=lambda: datasource(''))
+
+
+from my.core.cfg import make_config, Attrs
+# hmm, also nice thing about this is that migration is possible to test without the rest of the config?
+def migration(attrs: Attrs) -> Attrs:
+    # check for legacy path names
+    for deprecated_attr in ['export_path', 'export_dir']:
+        if deprecated_attr in attrs:
+            # assume the user is trying to use this attribute to set rexport Paths
+            @dataclass
+            class rexport:
+                export_path: Paths = attrs[deprecated_attr]
+
+            @dataclass
+            class pushshift:
+                export_path: Paths = ''
+
+            attrs['rexport'] = rexport
+            attrs['pushshift'] = pushshift
+
+            from my.core.warnings import high
+            high(f'''"{deprecated_attr}" is deprecated! Please update your reddit configuration to look like:
+
+class reddit:
+
+    class rexport:
+        # path[s]/glob[s] to data
+        export_path: Paths = {repr(attrs[deprecated_attr])}
+
+    class pushshift:
+        # path[s]/glob[s] to data
+        export_path: Paths
+
+''')
+            break
+    return attrs
+
+config = make_config(reddit, migration=migration)
+
diff --git a/my/reddit/pushshift.py b/my/reddit/pushshift.py
new file mode 100644
index 0000000..84d1af9
--- /dev/null
+++ b/my/reddit/pushshift.py
@@ -0,0 +1,9 @@
+
+@dataclass
+class pushshift_config(datasource):
+    '''
+    Uses [[https://github.com/seanbreckenridge/pushshift_comment_export][pushshift_comment_export]] output.
+    '''
+    # path[s]/glob to the exported JSON data
+    export_path: Paths
+
diff --git a/my/reddit.py b/my/reddit/rexport.py
old mode 100755
new mode 100644
similarity index 89%
rename from my/reddit.py
rename to my/reddit/rexport.py
index bbafe92..de28944
--- a/my/reddit.py
+++ b/my/reddit/rexport.py
@@ -1,26 +1,37 @@
 """
 Reddit data: saved items/comments/upvotes/etc.
 """
+
 REQUIRES = [
     'git+https://github.com/karlicoss/rexport',
 ]
 
-from .core.common import Paths
 
-from my.config import reddit as uconfig
+
+from .common import datasource
 from dataclasses import dataclass
 
+from my.core.common import Paths
+
+rexport_config: datasource
+try:
+    from my.config import reddit as user_config
+    rexport_config = user_config.rexport  # type: ignore[attr-defined]
+except (ImportError, AttributeError):
+    from my.config import reddit as user_config
+    rexport_config = user_config
+
+# localize possible migrations for rexport to this module
 @dataclass
-class reddit(uconfig):
+class rexport(metaclass=rexport_config):
     '''
     Uses [[https://github.com/karlicoss/rexport][rexport]] output.
     '''
-
     # path[s]/glob to the exported JSON data
     export_path: Paths
 
 
-from .core.cfg import make_config, Attrs
+from my.core.cfg import make_config, Attrs
 # hmm, also nice thing about this is that migration is possible to test without the rest of the config?
 def migration(attrs: Attrs) -> Attrs:
     export_dir = 'export_dir'
@@ -29,34 +40,31 @@ def migration(attrs: Attrs) -> Attrs:
         from .core.warnings import high
         high(f'"{export_dir}" is deprecated! Please use "export_path" instead."')
     return attrs
-config = make_config(reddit, migration=migration)
 
-###
-# TODO not sure about the laziness...
+config = make_config(rexport, migration=migration)
+
 
 try:
     from rexport import dal
 except ModuleNotFoundError as e:
-    from .core.compat import pre_pip_dal_handler
+    from my.core.compat import pre_pip_dal_handler
     dal = pre_pip_dal_handler('rexport', e, config, requires=REQUIRES)
 # TODO ugh. this would import too early
 # but on the other hand we do want to bring the objects into the scope for easier imports, etc. ugh!
 # ok, fair enough I suppose. It makes sense to configure something before using it. can always figure it out later..
 # maybe, the config could dynamically detect change and reimport itself? dunno.
-###
 
-############################
 
+from pathlib import Path
 from typing import List, Sequence, Mapping, Iterator
-from .core.common import mcachew, get_files, LazyLogger, make_dict
+from my.core.common import mcachew, get_files, LazyLogger, make_dict
 
 
 logger = LazyLogger(__name__, level='debug')
 
 
-from pathlib import Path
 def inputs() -> Sequence[Path]:
-    return get_files(config.export_path)
+    return get_files(config.rexport.export_path)
 
 
 Sid        = dal.Sid
@@ -223,9 +231,6 @@ def stats():
     }
 
 
-##
-
-
 def main() -> None:
     for e in events(parallel=False):
         print(e)

Will try again this weekend, just posting incase you had any ideas

similar problem described in the location issue I raised a while back -- essentially 'where does the configuration for each module live?'

I may be repeating points already said above or in other discussions, just wanted to get this down somewhere.

Ideally common.py shouldn't include imports to other relative modules -- it should just describe the common schema for some datasource, and have a function to merge data from each of them (and perhaps other shared code, but definitely shouldn't include provider (rexport, pushshift) specific configuration or code)

I'm leaning towards the following, as it means there is less problems overlaying/adding additional modules, and its slightly more mypy friendly -- as were not defining some expected nested configuration object (which the user may change if they add an additional source), with the two data sources here -- the only interaction all.py has with each module is trying to import the events/comments functions.

all.py
common.py
rexport.py
pushshift.py

common.py just includes the merge and shared NamedTuple model

rexport.py manages config like:

try:
    from my.config import reddit_rexport as user_config
except:
    from my.config import reddit as user_config
    # send deprecation warning
 

class rexport(user_config):
    export_path: Path

# handle migration/make_config and possible deprecation for the rexport configuration block

... so that any deprecation warnings/configuration issues specific to my.reddit.rexport are handled in that file, instead of all.py (since we'd want to keep all.py as minimal as possible, so there aren't issues with overlaying)

This does mean the configuration is a bit more verbose (reddit_rexport) as opposed to something which may inspect my.config.core.{enabled,disabled}_modules or do importlib magic to expose config/event functions, I think the tradeoff is worth it

agh. seems like pushshift is going through a bit of a restructure to improve their API/has been down a bit recently (i.e. now when I'm testing)

going to remind myself to go look back at here incase its still having issues in a while; may need to update pushshift_comment_export etc. etc. bit up in the air right now

Hmm. Yeah -- agree it would be nice to split up. The only thing to keep in mind is that backwards compatbility (with import my.reddit) would need to be implemented on call sites, because my.reddit can't have __init__.py? Unless you have some ideas how to work around it?

re: dataclasses -- right, yeah nestedness seems like it would be hard to handle.
Although I had a different structure in mind, which shouldn't lead to such issues?

  • user config
    Kind of like I suggested here: #89 (comment)

    class reddit: 
        class rexport:
            export_path = ...
        class pushshift:
            export_path = ...
        class gdpr:  # no GDPR atm, but just for the sake of demo
            export_path = ...
    
  • my.reddit.rexport: will have most of the code from current reddit.py? but instead of uconfig, inherit uconfig.reddit (with some compat code for older config format)

  • my.reddit.pushshift: inherit uconfig.pushshift

  • my.reddit.common

    Ideally common.py shouldn't include imports to other relative modules -- it should just describe the common schema for some datasource, and have a function to merge data from each of them (and perhaps other shared code, but definitely shouldn't include provider (rexport, pushshift) specific configuration or code)

    yep!

  • my.reddit.all

    Kind of what I described here #89 (comment)

    from . import rexport
    from . import pushshift
    from . import gdpr
    
    from .common import merge_events
    
    comments = merge_events(rexport.comments, pushshift.comments, gdpr.comments)
    upvotes  = merge_events(rexport.upvots  , pushshift.upvotes , gdpr.upvotes )
    ...
    

    The exact details are debatable -- e.g. as we discussed about making it more defensive

    So the user would override all.py if they want to mixin a new data source. A bit of copy pasting, but works?

my.reddit.rexport: will have most of the code from current reddit.py? but instead of uconfig, inherit uconfig.reddit (with some compat code for older config format)

ahh... for some reason I had it in my head that we needed to have the nested config as a dataclass somewhere in the module code and call make_config on it to handle migrations, but that isnt really necessary - that can just be handled in rexport.py. Even beter if we don't define some 'combined' configuration in the my.reddit module, as that leaves it more extendible for the user

my.reddit can't have init.py? Unless you have some ideas how to work around it?

yeah, don't think there is a workaround that allows overlaying code over my.reddit and something that is fully backwards compatible. Perhaps best way to handle this is to add an __init__.py file with a deprecation warning to update the import to my.reddit.all... and then actually deprecate it in a few months to convert it to a namespace package. If someone is very bothered by that they can always do the manual workaround of installing this repo as editable and removing the __init__.py file themselves in the mean time (though, if they'd know that was the error, I dunno)

override all.py to mixin a new data souce

yeah, that sounds good

I think all the patterns here work for the my.location issue I opened a few days ago as well

Cool! Guess would be nice to compile a proper doc eventually so we remember our own past discussions xD

Btw, even if either of class reddit.rexport and class reddit_rexport work now -- do you think there any clear potential pros and cons? The former is a bit nicer in terms of matching the hierarchy... although the latter is not that much more confusing. Not sure.

The nested config reminds me a bit of a nested JSON/YAML like object, which is typically used for configuration anyways, so it works well in that sense.

Also, another random thought I had recently -- if the user is using the deprecated version of the config, we could probably use the repr and print something by doing:

for deprecated_attr in ["export_path", "export_dir"]...
...
high(f"""DEPRECATED! Please modify your config to look like:

class reddit:
    class rexport:
        export_path: Paths = {repr(attrs[deprecated_attr])}
""")

would probably be a good idea to update the docs some, yeah... we've had some long discussions in PRs before. could try doing some PRs, would give me practice in org mode

am currently writing up a readme that expands a bit on the problems with editable/namespace packages (and their usage in HPI as a plugin system, by extension) as part of a naive script to re-order my easy-install.pth, can probably link to that in SETUP.org when discussing namespace packages

Also worth mentioning that this seems to exist now:

https://www.reddit.com/settings/data-request

seems pretty complete, may be able to use that instead of pushshift, unsure on the amount of data in each, will try and parse the GDPR export and see.

Would probably be combined like the github module, into reddit.gdpr?

Been trying to come up with common protocol across rexport and the GDPR export, but the data is quite different:

For example, saves look something like this in a CSV file in the GDPR export:

id,permalink
54hyj2,https://www.reddit.com/r/QuantifiedSelf/comments/54hyj2/quantified_self_personal_informatics_course/

No title, datetime, or text, unlike rexport, so is a bit difficult to merge the data, unlike when trying to do pushshift and rexport (those have enough raw metadata that it makes sense)

I don't really want to remove a bunch of attributes on the shared classes just so that its consistent across rexport/gdpr, but I also don't really want to make the attributes nullable, as that propogates a bunch of breaking changes across promnesia and anyone else who is using this

Currently looks a bit like:

"""
This defines Protocol classes, which make sure that each different
type of Comment/Save have a standard interface
"""

from typing import Dict, Any, TYPE_CHECKING
from datetime import datetime

Json = Dict[str, Any]

if TYPE_CHECKING:
    try:
        from typing import Protocol
    except ImportError:
        # hmm -- does this need to be installed on 3.6 or is it already here?
        from typing_extensions import Protocol  # type: ignore[misc]

    class Save(Protocol):
        created: datetime  # gdpr exports dont have
        title: str  # gdpr exports dont have
        raw: Json  # dont have

        @property
        def sid(self) -> str: ...
        @property
        def url(self) -> str: ...
        @property
        def text(self) -> str: ...  # Note: gdpr exports don't have text here
        @property
        def subreddit(self) -> str: ...

but since half of those attributes aren't present in the GDPR exports, Im almost leaning towards not including Saves from the GDPR in the combined my.reddit.all Save results, and letting the user use rexport an the main source in rexport.all, and if they really are missing data/want something else, they can import from my.reddit.gdpr

To sort of illustrate, what I mean is, when the fields don't match, the data in each my.reddit.rexport, my.reddit.gdpr and my.reddit.pushshift would be usable as their own modules, like usual, but my.reddit.all, would combine like:

  • comments: rexport, pushishift and gdpr (gdpr export actually does have enough info for comments)
  • posts/submissions: rexport, gdpr (has enough info here to merge)
  • upvote: rexport (no gdpr, since that has only id/permalink/direction)
  • saves: rexport (no gdpr)

If you wanted upvote/save information from the gdpr export, you'd have to import those functions from my.reddit.gdpr, it wouldn't be included in the merged functions

Then, lots of additional functions that gdpr provides, which has no additional complications;

$ /bin/ls -1 | head
account_gender.csv
approved_submitter_subreddits.csv
chat_history.csv
checkfile.csv
comment_headers.csv
comments.csv
comment_votes.csv
drafts.csv
friends.csv
gilded_comments.csv

Let me know if you can think of something better here

I've totally missed your message from April about official reddit GDPR, thanks that's exciting! Requested mine as well.

Yeah I see the problem with missing attributes.. I guess makes sense to discuss here, since it's kind of a generic discussion about combining data sources.

I see also options

  • throw on trying to access missing attributes (e.g. a Save from gdpr source would throw on trying to access title
    This is type safe but might be unexpected and annoying to handle downstream. E.g. hpi query/hpi stat might start failing while trying to order by timestamp
  • return default value for missing attributes
    • we could still return None (even despite mypy annotation), but I guess this could also be annoying to handle downstream
    • for many fields could return some type-safe default value... but it might end up pretty confusing

Considering neither of these options is particularly great, it could be configurable (fits nicely into allowing different 'defensiveness policies' for different modules). But I guess it makes it pretty complicated too, and unclear if there are any real benefits... So what you suggested with combining different sources for different data makes sense for now, and we can change it later if we come up to something better.

Will leave this open for now, since my.reddit.gdpr is yet to be written, but a good chunk of this has been added in #179