rasbt/watermark

Allow storing watermark info in metadata?

bollwyvl opened this issue · 7 comments

Watermark looks great for reproducibility.

It would be nice to have an option to (also) store this data in the notebook metadata:

{
"metadata": {
  "watermark": {
    "date": "2015-06-17T15:04:35",
    "CPython": "3.4.3",
    "IPython": "3.1.0",
    "compiler": "GCC 4.2.1 (Apple Inc. build 5577)",
    "system" : "Darwin",
    "release" : "14.3.0",
    "machine": "x86_64",
    "processor" : "i386",
    "CPU cores": "4",
    "interpreter": "64bit"
  }
}
}

Maybe some more hierarchy in there as well...
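As a rough sketch of what "more hierarchy" could look like, the snippet below gathers the same fields with the standard library and groups them. The function name collect_watermark and the "python" / "system" / "hardware" grouping are illustrative assumptions, not watermark's actual output format.

```python
import json
import multiprocessing
import platform
from datetime import datetime


def collect_watermark():
    """Gather environment details into a nested, JSON-serializable dict.

    The grouping below is a hypothetical layout, not watermark's own format.
    """
    return {
        # ISO 8601 timestamp, truncated to whole seconds
        "date": datetime.now().replace(microsecond=0).isoformat(),
        "python": {
            "implementation": platform.python_implementation(),
            "version": platform.python_version(),
            "compiler": platform.python_compiler(),
            "interpreter": platform.architecture()[0],
        },
        "system": {
            "name": platform.system(),
            "release": platform.release(),
            "machine": platform.machine(),
            "processor": platform.processor(),
        },
        "hardware": {"cpu_cores": multiprocessing.cpu_count()},
    }


if __name__ == "__main__":
    # Wrap the data the same way the flat example above does.
    print(json.dumps({"metadata": {"watermark": collect_watermark()}}, indent=2))
```

Keeping numbers as numbers (for example cpu_cores) rather than strings would also make the values easier to filter on later.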

Since the kernel doesn't have any idea what's going on w/r/t notebooks, it would probably have to be done with a display.Javascript:

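import json
from IPython.display import Javascript, display

if as_metadata:
    # Format the JSON into the script text first, then wrap it in Javascript.
    display(
        Javascript(
            "IPython.notebook.metadata.watermark = {}".format(json.dumps(the_data))
        )
    )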

Happy to help with a PR, if you would think there is a place for this!

rasbt commented

Thanks for the suggestion, this sounds interesting. API-wise, I would think of an additional (optional) flag that would maybe write the produced output into the meta-tag.

Just wondering, what application and use-case would you have in mind? Right now, for example, I'd use this plugin to conveniently show the time-stamp of the last update to users. Or to show Python versions and packages that were used to create those results. I am just wondering how the "meta" tag could be additionally used to improve reproducibility.

Thanks for the response. Yeah, -m is already taken, but something to that
effect.

I think the big win is that metadata in standard formats (iso, etc) is more
unambiguously parseable by downstream consumers and UI than inline text.
Instead of writing some regular expressions, one can just read
json.load(f)["metadata"]["watermark"]. For example, on nbviewer, we show the
kernel that was used to create the notebook.

So if one has a big stack of documentation notebooks in a repo, one can
check for when they were actually executed, not when they were checked out,
etc.
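A minimal sketch of that downstream use, assuming the watermark lives under metadata["watermark"] with a "date" key as in the example at the top of this issue. The helper names read_watermark and execution_dates are made up for illustration.

```python
import json
from pathlib import Path


def read_watermark(notebook_path):
    """Return the notebook's metadata["watermark"] dict, or None if absent."""
    with open(notebook_path, encoding="utf-8") as f:
        notebook = json.load(f)
    return notebook.get("metadata", {}).get("watermark")


def execution_dates(repo_dir):
    """Map each .ipynb file under repo_dir to its recorded watermark date."""
    dates = {}
    for path in sorted(Path(repo_dir).rglob("*.ipynb")):
        watermark = read_watermark(path)
        if watermark is not None:  # skip notebooks without a watermark
            dates[str(path)] = watermark.get("date")
    return dates
```

Calling execution_dates("docs/") on a checkout would then list when each notebook was last executed, independent of file modification times.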

When we get better search, either in JupyterHub or in custom deployments,
metadata fields will just be ready to go as facets. An organization that
has watermark as part of their "standard distribution" could gain a lot of
insight, about a snapshot or over time.

rasbt commented

metadata in standard formats (iso, etc) is more
unambiguously parseable by downstream consumers and UI than inline text.

Good point, I agree. In this context, I could also imagine a small optional add-on that writes the package specifications of the current Python environment into the metadata, similar to pip freeze > requirements.txt.
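One way such an add-on might collect the package list, sketched with the standard library's importlib.metadata (Python 3.8+). This only approximates pip freeze (it does not handle editable or VCS installs specially), and the function names are hypothetical.

```python
import json
from importlib import metadata


def installed_packages():
    """Return {distribution name: version} for the current environment."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    # Sort case-insensitively so the output is stable across runs.
    return dict(sorted(packages.items(), key=lambda item: item[0].lower()))


def as_requirements(packages):
    """Render the mapping as pip-style 'name==version' lines."""
    return "\n".join(f"{name}=={version}" for name, version in packages.items())


if __name__ == "__main__":
    # The dict form is what would go into the notebook metadata.
    print(json.dumps({"packages": installed_packages()}, indent=2))
```

Storing the dict rather than the requirements text keeps the metadata easy to query, and as_requirements can still regenerate a requirements.txt when needed.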

Btw. something like

-s      --save_meta
-g      --generate_meta

seems to be okay! However, I would suggest not using the one-letter short form here and going with --generate_meta, to make it clear to a "user" of this notebook that the watermark will change the notebook's metadata in some way upon re-execution.
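To illustrate the long-form-only idea, here is a sketch that uses argparse as a stand-in for the extension's own argument handling. The --generate_meta name comes from the suggestion above; the parser setup and help texts are assumptions for illustration only.

```python
import argparse


def build_parser():
    """Build a stand-in parser; the real extension parses its own magic arguments."""
    parser = argparse.ArgumentParser(prog="watermark", add_help=False)
    # Existing short flag that is already taken (-m).
    parser.add_argument("-m", "--machine", action="store_true",
                        help="print system and machine info")
    # Long form only: no one-letter alias, so a side effect on the
    # notebook file is always spelled out in the cell.
    parser.add_argument("--generate_meta", action="store_true",
                        help="also write the watermark into the notebook metadata")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-m", "--generate_meta"])
    print(args.machine, args.generate_meta)
```

Because the flag defaults to off, existing notebooks would keep their current behavior unless the option is added explicitly.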

Would you be interested in implementing such a feature?

Sorry I didn't get back to you sooner: traveling!

I'd love to take a whack at this. Hopefully I can get a PoC up quickly.

Addons are great, but likely outside the scope of this particular request!

But, since we're off topic... I highly recommend building them with entry_points rather than namespace tomfoolery or magic module/function names.

In addition to pip, I'd consider being able to serialize the state of:

  • python
    • conda
  • "native" managers:
    • apt
    • dnf / yum
    • brew
  • other vcs
    • hg

rasbt commented

No need to apologize, and I am sorry, too. It was a pretty hectic week. I am currently in the final stage of finishing up my new book, which is coming out in 1-2 weeks, and there is a lot of stuff to be done :).

So, I think writing to the meta-tags as an option would be great. And I will open separate issues for the other suggestions. I like the idea of considering other "managers"/"environments".

Cheers,
Sebastian

Worth reheating this discussion? I think it would be cool to have the information inside the metadata of the notebook. Then follow up with a PR for conda-tools/conda-execute#3, which might make the notebook a "shareable unit". Right now, for sharing notebooks, you need to make a repository with a requirements.txt or some such.

rasbt commented

I don't really know much about the formatting recommendations/guidelines for Jupyter notebooks, or whether there's a difference between Jupyter Notebook and JupyterLab in terms of what gets written to .ipynb files. However, I noticed that in the JupyterLab UI, there's a metadata field, which would probably be equivalent to what @bollwyvl mentioned with

{
"metadata": {
  "watermark": {
    "date": "2015-06-17T15:04:35",
    "CPython": "3.4.3",
    "IPython": "3.1.0",
    "compiler": "GCC 4.2.1 (Apple Inc. build 5577)",
    "system" : "Darwin",
    "release" : "14.3.0",
    "machine": "x86_64",
    "processor" : "i386",
    "CPU cores": "4",
    "interpreter": "64bit"
  }
}
}

[Screenshot, 2018-09-24 at 10:47:59 AM]

In any case, if you or @bollwyvl or someone else would like to implement this (a way to optionally write metadata), I'd be very open to it and happy to merge it (there was good work in progress over at #7).

This could be either via a

  • magic command
  • decorator, or
  • --metadata flag.