/skill-extraction-dataset

Public dataset for the extraction of salient entities from job descriptions.

Primary LanguagePythonCreative Commons Attribution 4.0 InternationalCC-BY-4.0

Benchmark Corpus to Support Entity Recognition in Job Descriptions

This is the code repository for the LREC 2022 paper Development of a Benchmark Corpus to Support Entity Recognition in Job Descriptions. The authors are Thomas AF Green, Diana Maynard, and Chenghua Lin.

Overview

A public, human-labelled dataset of salient entities in job descriptions.

Salient entities include:

  • Skill
  • Qualification
  • Experience
  • Occupation
  • Domain

Full details regarding annotation schema can be found in schema/Combined_Annotation_Instructions.pdf.

Labels were collected from Amazon Mechanical Turk. Workers were required to achieve >70% accuracy in a qualification task before contributing to this dataset.

Data formatting follows 2003 CONLL NER dataset conventions. Individual Worker responses, Worker ID and associated accuracies on the qualification task have been retained.

Original job description data can be found at Kaggle. Credit to user airiddha for the development of the original dataset.

File Descriptions

Folder Filename Description
raw_data answers.txt Worker responses from the annotation task on AMT.
raw_data testset.txt Gold standard dataset for model evaluation.
raw_data worker_accuracies.csv Dataframe of the accuracies of each Worker (identified by their index in answers.txt) on the evaluation task (>70% required for participation).
preprocessing data_aggregation.py Python script for aggregating worker answers into single label pandas dataframe.
preprocessing extra_preprocessing.py Python script for simple data preprocessing. Used in baseline model.
baseline train_baseline_crf.py Python script that trains, saves, and evaluates the baseline CRF model for entity recognition.
baseline load_baseline_crf.py Python script that loads the saved baseline model and offers an interactive prompt for classification. train_baseline_crf must be run first.
schema Combined_Annotation_Instructions.pdf Schema document used during data collection on Amazon Mechanical Turk.

Citation

@inproceedings{Green2022,
   author = {Thomas AF Green and Diana Maynard and Chenghua Lin},
   journal = {Proceedings of the 13th Conference on Language Resources and Evaluation},
   keywords = {corpus development,entity recognition,job descriptions,natural language processing},
   pages = {1201-1208},
   title = {Development of a Benchmark Corpus to Support Entity Recognition in Job Descriptions},
   url = {http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.128.pdf},
   year = {2022}
}

Contact

If you have any questions regarding this dataset, please contact the lead author at tafgreen1@sheffield.ac.uk.

License

Creative Commons Attribution 4.0 International