An argument that Jupyter Notebooks are conceptually flawed just as spreadsheets are - the (python) world needs a successor.
To try and convince you of that I've built a quick demo of on alternative.
But before I show you that, I need to convince you notebooks have fundamental problems...
True, that's very annoying. Both .xlsx
and .ipynb
show as blobs, it's impossible to clearly see what's changed
in either between versions.
But it's not the real problem, try again.
Another good try. It's virtually impossible to have clearly visible logic in either a notebook or an excel sheet which can also be used in production. You end up copying the logic out, rewriting it and adding it to your production stack.
Problems come when you want to modify the logic and share it with someone again, now you have two (somewhat different) implementations to keep track of and keep identical.
But still not the fundamental common problem, try again...
Remember ctrl + alt + shift + F9
in Excel? Go for coffee and wait for it to update, and hope nothing crashes,
search through all you sheets to see if anything has gone wrong.
Notebooks are no better - sections don't automatically
update when an earlier section is modified, so you end up running "Run All Cells" the whole time.
But even that's not the whole problem, because notebooks reuse a single python process, you can have more subtle
bugs: declare a function in one cell, then use it in the next - all works well, now delete the function from the
first cell, but the function object still exists in globals, so the notebook continues to work.
Now you send that sheet to someone else and of course everything fails!
But that's not all: both excel and notebooks don't make it obvious when an error has happened, you could have an exception in a cell that's offscreen (or a sheet you're not looking at) and you wouldn't be aware of it.
That's a big problem, but really it's just an implementation mistake, not the root problem, keep trying...
True, just as you can't reuse logic in a production environment, you can't easily access logic outside the main document to write unit tests.
Mad, but still not THE problem, keep trying...
So annoying, anyone who writes code for more than a few hours a month becomes very at-home in their editor of choice. Having to leave it to use either excel or notebooks is really painful, plus both lack many of the advanced features of a modern IDE.
Still not the answer I'm looking for, have another try...
Excel is awful at displaying complex results, particularly with a narrative. Notebooks are better at describing a narrative, but they're still ugly - you have to show lots of code that the casual reader doesn't care about (like imports and utility functions), there's also no really pretty way of displaying a notebook that I know of.
Still not the right answer, have one more guess...
YES! That the fundamental problem, and has led to many very serious (heisen)bugs. I personally have fallen into this bear pit more than once.
Three conceptually very different things:
- input data (sometimes, with both it's theoretically possible to read input data from external sources)
- logic to calculate results (and parameters for those calculations)
- the results
Are stored in the same file.
Imagine if python automatically appended the output of a script to that script every time you ran it!
Of course this didn't seem like a big problems when spreadsheets were conceived in the 60s (the word "spreadsheet" or "spread-sheet" comes from two facing pieces of paper used as a leger). They were designed as just a clever table, just as you had the inputs and output on the same piece of paper when you did manual accounts on lined paper; it seemed sensible to keep everything in one file when building the computerised equivalent. But why on earth did anyone think this was still a good idea when inventing notebooks?
You would never think of storing your customer data in the same file as the logic to generate their invoices (unless you're still using excel) - so why would you store the results of your machine learning model along side its definition?
This is the fundamental problem with both excel and notebooks, it's the root cause of many of the issues described above and the reason most experience develops eschew both.
Here my MVP is called "notbook", but if anyone actually wants to use it, it should be renamed to avoid confusion with Jupyter Notebooks.
A program that executes a python script, and renders an HTML document with:
- sections of the code, smartly rendered
- the output of print commands shown beneath the associated print statement
- comments rendered as markdown for the body of the document, again inserted in the right place in the document
- rich objects e.g. plots and table rendered properly in the document
- (not yet built) allow HTML inputs (text inputs, number inputs, range slides etc.) in the HTML to change the inputs to calculations
- (not yet built) link to imported python modules which can be rendered recursively as more hybrid documents or just as highlighted python
That document can be built either using:
notbook build my-logic.py
- where the HTML document is built once and the process exists, if execution raises an exception, no document is built and the processes exits with code1
.- or,
notbook watch my-logic.py
- where the file is watched and a web-server is started showing the document, when the file changes the HTML document is updated and the page automatically updates giving instant feedback
The python script(s) containing all logic:
- only contain valid python syntax
- be executable via the standard
python
CLI - be importable into other code and be able to import other code
This might not sound like much (it's basically just another static site generator which works on python files, not markdown etc.), but I think it could dramatically improve the workflow for data scientists and anyone python-literate currently using notebooks or excel.
- It fixes all the issues described in the "quiz" above
- it's simpler - because the process here is dramatically simpler than notebooks, the source code is much smaller. Therefore extending it, fixing bugs and customising usage should be much easier.
- perhaps in some scenarios slightly slower than running code in an existing python process
There's much more this could do:
- the two things marked as "not yet built" above
- rendering tables from pandas and similar
- currently there's basic support for bokeh plots but other plotting libraries should be supported
- stage caching so slow steps in calculations could be cached between executions
- richer printing: currently
devtools.debug
is used to print complex objects (e.g. notstr
,int
,float
), this should be replaced with an interactive tree-view like chrome - server based mode - instead of running the python script locally and showing it on localhost, the python source is posted to a server which runs the script and renders the results at some URL for the develop (or anyone else with permissions) to view
- a "zip mode" and renderer of "zip mode": zips all the assets need to render a document, you can then send that binary file to someone to view, they can display the zipped document without having to mess about with the command line to get the document extracted and ready to view