This is the code repository for the LREC 2022 paper Development of a Benchmark Corpus to Support Entity Recognition in Job Descriptions. The authors are Thomas AF Green, Diana Maynard, and Chenghua Lin.
A public, human-labelled dataset of salient entities in job descriptions.
Salient entities include:
- Skill
- Qualification
- Experience
- Occupation
- Domain
Full details regarding annotation schema can be found in schema/Combined_Annotation_Instructions.pdf
.
Labels were collected from Amazon Mechanical Turk. Workers were required to achieve >70% accuracy in a qualification task before contributing to this dataset.
Data formatting follows 2003 CONLL NER dataset conventions. Individual Worker responses, Worker ID and associated accuracies on the qualification task have been retained.
Original job description data can be found at Kaggle. Credit to user airiddha for the development of the original dataset.
Folder | Filename | Description |
---|---|---|
raw_data | answers.txt | Worker responses from the annotation task on AMT. |
raw_data | testset.txt | Gold standard dataset for model evaluation. |
raw_data | worker_accuracies.csv | Dataframe of the accuracies of each Worker (identified by their index in answers.txt) on the evaluation task (>70% required for participation). |
preprocessing | data_aggregation.py | Python script for aggregating worker answers into single label pandas dataframe. |
preprocessing | extra_preprocessing.py | Python script for simple data preprocessing. Used in baseline model. |
baseline | train_baseline_crf.py | Python script that trains, saves, and evaluates the baseline CRF model for entity recognition. |
baseline | load_baseline_crf.py | Python script that loads the saved baseline model and offers an interactive prompt for classification. train_baseline_crf must be run first. |
schema | Combined_Annotation_Instructions.pdf | Schema document used during data collection on Amazon Mechanical Turk. |
@inproceedings{Green2022,
author = {Thomas AF Green and Diana Maynard and Chenghua Lin},
journal = {Proceedings of the 13th Conference on Language Resources and Evaluation},
keywords = {corpus development,entity recognition,job descriptions,natural language processing},
pages = {1201-1208},
title = {Development of a Benchmark Corpus to Support Entity Recognition in Job Descriptions},
url = {http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.128.pdf},
year = {2022}
}
If you have any questions regarding this dataset, please contact the lead author at tafgreen1@sheffield.ac.uk.
Creative Commons Attribution 4.0 International