This script extracts a list of junior researchers in NLP from the ACL Anthology. The repo is curated and maintained by Zhijing Jin (an enthusiastic PhD student in NLP).
We use a tentative filter to extract authors who
- have not published *ACL papers as first authors,
- BUT have published at non-*ACL venues that are also recorded in the ACL Anthology (e.g., workshops, LREC), OR as non-first authors on *ACL papers,
- AND have at least 1 ACL Anthology entry within the last 3 years,
- AND whose earliest publication is within the last 3 years, AND whose total number of papers is <= 3.
Feel free to make pull requests if you have suggestions to improve the code.
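The filter criteria above can be sketched roughly as follows. This is a minimal illustration, not the code in `extract_junior_authors.py`: the `Paper` record, its `is_star_acl` flag, the `is_junior` function, and the reading of "within 3 years" as `current_year - year <= 3` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Paper:
    # Hypothetical record; the real script reads the Anthology data instead.
    year: int
    is_star_acl: bool   # assumed flag: True for main *ACL venues
    authors: List[str]  # ordered, first author at index 0


def is_junior(name, papers, current_year, window=3, max_papers=3):
    """Return True if `name` passes the tentative junior-author filter."""
    mine = [p for p in papers if name in p.authors]
    if not mine:
        return False
    # Exclude anyone with a first-author paper at a *ACL venue.
    if any(p.is_star_acl and p.authors[0] == name for p in mine):
        return False
    # Non-*ACL Anthology entry, or non-first author on a *ACL paper.
    has_other = any((not p.is_star_acl) or p.authors[0] != name for p in mine)
    # At least one entry inside the window.
    recent = any(current_year - p.year <= window for p in mine)
    # Earliest publication inside the window, and a small paper count.
    new_author = current_year - min(p.year for p in mine) <= window
    return has_other and recent and new_author and len(mine) <= max_papers


papers = [
    Paper(2020, True, ["A", "B"]),
    Paper(2019, False, ["B", "C"]),
    Paper(2010, False, ["C"]),
]
for author in ["A", "B", "C"]:
    print(author, is_junior(author, papers, current_year=2021))
```

In this toy data, `A` is excluded for having a first-author *ACL paper and `C` for having an early publication, while `B` passes.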
git clone https://github.com/zhijing-jin/acl_mentorship.git
cd acl_mentorship
git clone https://github.com/acl-org/acl-anthology
pip install -r acl-anthology/bin/requirements.txt
mv extract_junior_authors.py acl-anthology/bin/extract_junior_authors.py
python acl-anthology/bin/extract_junior_authors.py
# output files: junior_authors.txt, junior_authors_n_papers.csv
We manually inspected the quality of a random sample. Among our extracted researcher names:
- Total number (as of Mar 2021): 11,021 authors
- 45%: Students / Recent graduates
- 30%: Non-academia scientists beginning to publish
- 25%: Interdisciplinary/Other senior researchers beginning to publish at NLP venues
python extract_email_from_paper_pdf.py
# input file: junior_authors_n_papers.csv (generated by Step 1's `extract_junior_authors.py`)
# output file: junior_authors_n_email.csv
About 5,946 valid emails can be extracted. The remaining emails are too difficult to parse from the PDFs or would require more complicated rules.
We welcome pull requests to improve this function.
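As an illustration of the kind of rule that may help, the sketch below handles the grouped form `{user1, user2}@domain` that often appears in paper headers, in addition to plain addresses. It is not the logic in `extract_email_from_paper_pdf.py`; the function name `extract_emails` and both patterns are assumptions, and it operates on text that has already been extracted from a PDF.

```python
import re

# Plain addresses such as bob@example.org
SIMPLE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
# Grouped addresses such as {jane, john.smith}@cs.example.edu
GROUPED = re.compile(r"\{([^{}]+)\}\s*@\s*([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)")


def extract_emails(text):
    """Return unique, lowercased emails found in already-extracted text."""
    emails = []
    # Expand each {a, b}@domain group into one address per user name.
    for users, domain in GROUPED.findall(text):
        for user in re.split(r"[,;|\s]+", users.strip()):
            if user:
                emails.append(f"{user}@{domain}")
    # Drop the grouped spans so the plain pattern does not re-match them.
    remainder = GROUPED.sub(" ", text)
    emails.extend(SIMPLE.findall(remainder))
    # Deduplicate while preserving first-seen order.
    return list(dict.fromkeys(e.lower() for e in emails))


sample = "Jane Doe {jane, john.smith}@cs.example.edu and bob@example.org."
print(extract_emails(sample))
```

Real PDFs tend to introduce further problems (line breaks inside addresses, obfuscations such as "at" and "dot"), which this sketch does not attempt to handle.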