dataproofer/Dataproofer

Make it easier to get datasets into dataproofer

Opened this issue · 9 comments

ejfox commented

This is a sort of meta-issue that's more conceptual than technical.

There will probably be additional issues with more specific tasks, but I want to have the overall discussion here.

With Dataproofer out for the past few months, I've gotten a better chance to use it naturally and see other people use it in the wild. After this, my number one piece of insight is that it needs to be much easier to get data into Dataproofer.

Using Dataproofer should not be any additional overhead on a user's workflow. Unfortunately the current process looks like this:

  1. Be working on a dataset, be curious about the makeup of that dataset
  2. Open up dataproofer
  3. Click button to load data into dataproofer
  4. Navigate to location of data you were just looking at
  5. Click "load" and load dataset into dataproofer
  6. Get insights back on your dataset

I'd really love it to look more like this:

  1. Be working on a dataset, get curious about the makeup of that dataset
  2. Press a button or type a command to load dataset in dataproofer
  3. Get insights back on your dataset

There are a couple of ways to approach this, I think:

Chrome extension / bookmarklet

Create a Chrome extension or bookmarklet that makes it easy (one click) to get from Google Sheets or a CSV in the browser to a report on that dataset in Dataproofer.

Polish command-line usage

Make it extremely easy to launch the Dataproofer GUI from the command line (à la Sublime), and make command-line reporting mirror all the utility of the GUI app.

Better workflow documentation

I think we've done a pretty good job of documenting how to use Dataproofer itself. We haven't yet given examples of how Dataproofer fits into an overall workflow with other applications. I think giving people a better idea of where in their process Dataproofer is most useful is important.

Automatically show data in your Google Drive in start-up screen

So if you're working on a Google Sheet, you switch to Dataproofer and your most recent spreadsheets are shown in a list, and you simply click the one you're interested in to run dataproofer.

I'd love to discuss this for a bit and try to come to a conclusion that we can move forward on by October 1st. I'd like to make a "Fall Dataproofer Update" milestone that includes this. If we find it's a particularly large endeavor, this goal may be the entirety of the Fall 2016 Update.

cc @geraldarthur @enjalot @floodfish

ejfox commented

@geraldarthur @enjalot @markhamnolan

For me personally, that six-step list is how I generally expect to use apps and I'm fine with that. But I certainly don't object to other methods as long as we feel it's realistic to maintain them, and that there are no potential security risks.

ejfox commented

@floodfish appreciate your input here. When I think of dataproofer through the "spellcheck for data" lens, I think that if you had to copy your message into a spellcheck application to see which words were spelled wrong, only the die-hard copy editor nerds would use it.

I think our tool works well for die-hard data nerds right now, but I want it to be more ubiquitous and more useful than that.

I'm personally not seeing as much pickup of the tool right now as I expected, even in my own life. That could be a few things, but I think one glaring reason is that it's hard to see how it can easily integrate into your existing workflow and give you benefit without any extra effort (and ideally without any effort at all).

That's just my perspective, though, and would love to discuss more here with @geraldarthur and @enjalot.

Also, tangentially, I've been exploring Electron for a personal project and noticed this little tidbit:

[screenshot: 2016-09-23 11 03 11]

Apparently one of the new things they've added is the ability to manage windows. This in turn lets you create invisible windows to do some hacked threading. @enjalot probably knows best if this would help solve our problem, but I thought it was interesting.

They've also added the ability to get system information like available memory, so instead of limiting data to a magic number, we could intelligently use the memory that's available.
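One way to read the memory idea is as a row budget derived from free memory rather than a fixed cap. The sketch below is hypothetical: `rowBudget`, its parameters, and the 25% default are illustrative assumptions, not Dataproofer code. The free-memory figure would have to come from the runtime (for example Node's `os.freemem()`), and the average row size would need to be estimated from the file.

```typescript
// Hypothetical helper: derive a row limit from free memory instead of a magic number.
// freeBytes is supplied by the caller (for example from Node's os.freemem()).
function rowBudget(freeBytes: number, avgRowBytes: number, fraction: number = 0.25): number {
  if (freeBytes <= 0 || avgRowBytes <= 0 || fraction <= 0) return 0;
  // Never plan to use more than the free memory itself.
  const usableBytes = freeBytes * Math.min(fraction, 1);
  return Math.floor(usableBytes / avgRowBytes);
}
```

The fraction is there so the app leaves headroom for itself and for other programs; the right value is a judgment call that would need testing on real files.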

@enjalot @geraldarthur @markhamnolan @floodfish - If possible, I'd love to get input from you guys on what you think the best path forward is by the end of the month.

  • Improving the command line?
  • Improving large dataset handling?
  • Attempting both at once?
  • Something else?

From there I'd like to clean up and organize our issues a bit so that next steps are clear both to us and to potential contributors. Part of this is selfish: I want to have bite-size tasks I can jump on to continue improving Dataproofer over the fall. But we're also looking into working with new collaborators soon, so I'd like the people who have put a ton of work in so far to help decide the best way for new folks to continue.

Re. wider pickup of the tool as a factor of workflow integration: My gut says that building integrations/extensions would help with user retention more than acquisition. I see installing an integration/extension as a bigger ask/commitment than using a standalone app. Or maybe it can happen all at once: if a whole newsroom installs the extension and does a training together, then they'll use it more than if they had a standalone tool to forget about. (That aside, the fact that @ejfox wants and would use it is a pretty compelling reason to do it! Always good to make the tools you most want to use.)

Re. path forward, command line vs. large datasets, etc.: Is there a userbase that we can interview/poll? That could help a lot with prioritization.

Hey guys, this is a really great tool! How about trying to provide an integration with KNIME or other data analysis frameworks? This would help people check their datasets while they're analyzing them anyway.

ejfox commented

@imagejan Appreciate the input! I'm not familiar with KNIME and looking into it now.

I think discussing other integrations is probably useful. I've been using csvkit a lot in the past two weeks, and I think that if we could work ourselves into the workflow of that suite of tools, that would be some really good progress.

ejfox commented

The more I think about it, the more I think it would be great to basically be a super-powered version of csvkit's csvstat.

large datasets

I think improving large dataset handling is something fundamental that should be addressed.
We set it up to be convenient to use one way, but it imposed a greater cost than I foresaw, so we need to reevaluate the tradeoff.

I think one key way to address that is to redesign the tests to be row-based, with special hooks to run code before and after the rows have been processed. This means a more complicated interface for writing tests, but it would enable streaming processing (and perhaps even parallel processing). Streaming means we can show progress on large files, avoid crashing when files are way too big, and potentially run the tests in web workers.
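To make the row-based idea concrete, here is a minimal sketch of what such a test interface might look like. Everything in it is an assumption for illustration: `RowTest`, `before`, `onRow`, `after`, and `createRunner` are hypothetical names, not Dataproofer's existing test API. The point is only that a test keeps a small running state instead of holding the whole dataset.

```typescript
type Row = { [column: string]: string };

// Hypothetical interface for a row-based test; none of these names come from Dataproofer.
interface RowTest<State, Result> {
  name: string;
  before(columns: string[]): State;                   // hook: runs once before any row
  onRow(state: State, row: Row, index: number): void; // runs for each row as it arrives
  after(state: State, rowCount: number): Result;      // hook: runs once after the last row
}

// Example test: count empty cells per column without keeping rows in memory.
const emptyCells: RowTest<{ [column: string]: number }, { [column: string]: number }> = {
  name: "empty cells",
  before: (columns) => {
    const counts: { [column: string]: number } = {};
    columns.forEach((c) => { counts[c] = 0; });
    return counts;
  },
  onRow: (counts, row) => {
    Object.keys(counts).forEach((c) => {
      if ((row[c] || "").trim() === "") counts[c] = (counts[c] || 0) + 1;
    });
  },
  after: (counts) => counts,
};

// Rows are fed one at a time, so a streaming parser could call push() per parsed row.
function createRunner<State, Result>(test: RowTest<State, Result>, columns: string[]) {
  const state = test.before(columns);
  let rowCount = 0;
  return {
    push(row: Row): void {
      test.onRow(state, row, rowCount);
      rowCount += 1;
    },
    rowsSeen(): number { return rowCount; }, // enough to drive a progress indicator
    finish(): Result { return test.after(state, rowCount); },
  };
}

const runner = createRunner(emptyCells, ["name", "age"]);
runner.push({ name: "Ada", age: "36" });
runner.push({ name: "", age: " " });
const emptyCounts = runner.finish();
```

Because the runner only ever sees one row at a time, memory use depends on the test's state rather than on the file size, which is the tradeoff described above: tests get more awkward to write, but large files stop being a special case.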

bringing the tool to you

I think this could be key: go to the places where people live instead of asking them to leave the house.

What about hooking into context menus? At least in OS X, I think this isn't too hard to do. I'm thinking of a workflow where you're in Excel and you click Apple -> Services -> "Send sheet to Dataproofer".
Similarly, I like the idea of a Chrome extension, but perhaps a Google Sheets app of some sort would be better? I'm not sure how many other places besides Sheets you'd want to proof (files on GitHub? links to CSV/Excel files?). I suppose with a Chrome extension you could add "check with dataproofer" to the right-click context menu.

Picking things up here re large datasets, since in my opinion that's the biggest blocker keeping the CLI and the desktop app from being closer to 💯. It's the biggest recurring issue, but there's an easier solution that keeps coming up, suggested in #127 and by @riordan: sampling.

We take a sample of the rows, run the tests, and prompt the user to run them again over another sample of different rows. We can max out the sample size given the size of the dataset. This would require better workflow documentation, but it's definitely something achievable before Mozfest. Plus, we can incorporate any UI design into @amccartney's styles.
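A rough sketch of how "another sample of different rows" could work: shuffle the row indices once, then hand out successive slices so no row is tested twice. The names here (`createSampler`, `next`, `remaining`) are hypothetical, and this assumes the row count is known up front, which may not hold for a streamed file.

```typescript
// Hypothetical sampler: shuffles row indices once, then serves disjoint samples,
// so each "run another sample" covers rows that have not been checked yet.
function createSampler(rowCount: number, sampleSize: number, random: () => number = Math.random) {
  const order: number[] = [];
  for (let i = 0; i < rowCount; i++) order.push(i);
  // Fisher-Yates shuffle of the row indices.
  for (let i = order.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = order[i];
    order[i] = order[j];
    order[j] = tmp;
  }
  // Cap the sample size at the dataset size (and keep it at least 1).
  const size = Math.max(1, Math.min(sampleSize, rowCount));
  let offset = 0;
  return {
    // Returns the next sample of row indices, or [] once every row has been served.
    next(): number[] {
      const sample = order.slice(offset, offset + size);
      offset += sample.length;
      return sample;
    },
    remaining(): number { return order.length - offset; },
  };
}

const sampler = createSampler(10, 4);
const firstSample = sampler.next();  // 4 row indices
const secondSample = sampler.next(); // 4 different row indices
const thirdSample = sampler.next();  // the remaining 2
```

`remaining()` gives the UI something to show ("2 rows not yet checked"), and once it reaches zero every row has been covered exactly once.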