Y-Label

Let anyone search and label text with active learning and no servers

Watch The Intro Video

Problem it solves

Typical usage scenario is company X has some domain experts (Doctor, trader, lawyer) and lots of text data and they want the expert to label the data. Often their is a lot of data and lots of it isn't relevant so it's an inefficient process that doesn't pan out. You can solve this with either guided learning or active learning, but both of these typically need some engineering work (e.g. client server +ML/search) and that's a lot of overhead for an adhoc project.

Solution

Y-Label tried to solve the above by

Being completely in browser, so there is no engineering work
Providing full text search / trigram indexing
Implementing a deep active learning module that helps select the most relevant data
Use IndexDB to give the user persistence without relying on a server

Goals

Support any language without any assumptions about vocabulary or tokenization
Work in the Browser with no communication to any server
Be Dead easy for non-technical people to use
Make exploring and labeling text datasets fast and easy.

Status

What this does now

Upload data as a JSON array of objects and choose one field to label
Stores data in IndexDB
Creates a Trigram index on the data + Uses the index for regular search
Regex search (but with no indexing)
Add classes and classify each document

What it will do soon (By Mid July 2019)

Export the labels as JSON
Allow CSV upload
Active Learning for document classification on index DB

Roadmap / Plan

We are now at V0.0.0 Until say V0.0.X (X<5), we're doing 3 things

Adding all the cool features we can think about
Add support for span annotations
Getting a feel for usability and settle on a look and feel

Between V0.0.X and V1 we'll be

Figuring out stable API
Rewriting this with clean code, tests and conformity to the APIs above.

Backends

The crucial thing for the second phase of V0 is figuring out APIs that will support swappable backends, so that users can for instance

Add there own storage and data sources
Implement their own search logic
Bring in their own active learning models

The idea being that what the user can express should be confined, but what happens to those expressions should be configurable. However The goal for NOW is to figure out what this is, so we are writing dirty non-extensible code so that we can move fast.

Theoretical Inspiration

YLabel has a few inspiring papers and ideas that are worth looking at

[Why Label when you can Search? Alternatives to Active Learning for Applying Human Resources to Build Classification Models Under Extreme Class Imbalance] (http://pages.stern.nyu.edu/~fprovost/Papers/guidedlearning-kdd2010.pdf). This paper talks about the value of searching when we label data.
Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances talks about searching and "feature" labeling along side active learning, with an old implementation in Java
Active Learning Literature Survey The defintive guide on Active Learning
Deep Active Learning for Named Entity Recognition A paper that shows that Deep Active Learning on text (For NER) is feasible and effective
Neural Machine Translation in Linear Time A deep learning model for text that works at the charachter level and is (maybe) fast enough to work in the browser.

Who is Behind this.

This is Tal Perry's "side-project". By day, I'm the founder and CEO of LightTag - The Text Annotation Tool for Teams, and this is where I get to try out new ideas, because they won't let me touch the code anymore.

LightTag/ylabel