This is the GitHub repository for the ODSC-APAC 2021 tutorial session "How to do NLP When You Don’t Have a Labeled Dataset?"
An overview of this topic can be found in an August 2021 post on the ODSC blog.
Abstract:
Lack of a readily available dataset is a common scenario in industry projects involving NLP. It is also a situation that researchers venturing into new problems or new languages often encounter. However, traditional textbooks, tutorials, and workshops focus primarily on building and deploying models. In this workshop, I will introduce some strategies to create labeled datasets for a new task and build your first models with that data. By the end of the session, participants should have some ideas for solving the data bottleneck in their organization. The target audience is data scientists as well as those involved in requirements gathering for a given NLP problem.
What is what in this repo:
- code/ has all the notebooks and Python files required.
- resources/ contains the word lists used in developing labeling functions.
- slides/ contains the Markdown and Rpres files for the slides, which are published on RPubs at: [https://rpubs.com/vbsowmya/odsc2021](https://rpubs.com/vbsowmya/odsc2021)
- files/ contains the input/output files used in the code files.
- LICENSE: CC0-1.0 License.
- README.md: this file.
- requirements.txt: generated using pipreqs.
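
For readers new to the idea, here is a minimal sketch of what a word-list-based labeling function can look like. It is an illustration only, not the code from the notebooks: the word lists, the label values, and the names `tokenize` and `lf_word_list` are all hypothetical, and the actual lists in resources/ may be organized differently.

```python
import re

# Hypothetical word lists for illustration; the real lists live in resources/.
POSITIVE_WORDS = {"good", "great", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "terrible"}

# Hypothetical label values; ABSTAIN means "this function cannot decide".
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lf_word_list(text):
    """Label a text by counting distinct matches against each word list."""
    tokens = set(tokenize(text))
    pos_hits = len(tokens & POSITIVE_WORDS)
    neg_hits = len(tokens & NEGATIVE_WORDS)
    if pos_hits > neg_hits:
        return POSITIVE
    if neg_hits > pos_hits:
        return NEGATIVE
    # No matches, or a tie between the two lists.
    return ABSTAIN


if __name__ == "__main__":
    for sentence in ["The food was great", "Terrible service", "It was a Tuesday"]:
        print(sentence, "->", lf_word_list(sentence))
```

A function like this is noisy by design: it abstains when it has no evidence, and it is meant to be combined with other weak signals rather than used as a classifier on its own.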