acl_mentorship: A Python repository from zhijing-jin

This script extracts a list of junior NLP researchers from the ACL Anthology. The repo is curated and maintained by Zhijing Jin (an enthusiastic PhD student in NLP).

We use a tentative filter to extract authors who

have not published *ACL papers as first authors,
BUT have published at non-*ACL venues that are also recorded in the ACL Anthology (e.g., workshops, LREC), OR as non-first authors on *ACL papers,
AND have at least 1 ACL Anthology entry within the last 3 years,
AND whose earliest publication date is within the last 3 years, AND whose total number of papers is <= 3.
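
The filter above can be read as a single predicate over one author's paper list. The following is only a minimal sketch of that logic, not the code in `extract_junior_authors.py`: the `Paper` class, its field names, the `is_junior` function, and the reading of "within 3 years" as `year >= current_year - 3` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Paper:
    """Hypothetical record for one Anthology entry of an author."""
    year: int
    is_star_acl: bool      # True if published at a *ACL venue
    is_first_author: bool  # True if the author is first author on it


def is_junior(papers: List[Paper], current_year: int,
              window: int = 3, max_papers: int = 3) -> bool:
    """Apply the tentative junior-author filter to one author's papers."""
    if not papers:
        return False
    cutoff = current_year - window  # assumed meaning of "within 3 years"

    # No *ACL paper as first author.
    if any(p.is_star_acl and p.is_first_author for p in papers):
        return False

    # Published at a non-*ACL venue, or as non-first author at a *ACL venue.
    has_other = any((not p.is_star_acl) or (not p.is_first_author)
                    for p in papers)
    # At least one entry inside the window.
    has_recent = any(p.year >= cutoff for p in papers)
    # Earliest publication is also inside the window.
    started_recently = min(p.year for p in papers) >= cutoff

    return (has_other and has_recent and started_recently
            and len(papers) <= max_papers)
```

For example, under these assumptions `is_junior([Paper(2020, False, True)], current_year=2021)` is `True`, while an author with a first-author *ACL paper is rejected regardless of the other conditions.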

Feel free to make pull requests if you have suggestions to improve the code.

Configure the Environment

git clone https://github.com/zhijing-jin/acl_mentorship.git
cd acl_mentorship

git clone https://github.com/acl-org/acl-anthology
pip install -r acl-anthology/bin/requirements.txt
mv extract_junior_authors.py acl-anthology/bin/extract_junior_authors.py

Step 1: Extract junior authors' names and papers

How to Run

python acl-anthology/bin/extract_junior_authors.py
# output files: junior_authors.txt, junior_authors_n_papers.csv

Quality Check

We manually inspected the quality of a random sample. Among our extracted researcher names:

Total number (as of Mar 2021): 11,021 authors
45%: Students / Recent graduates
30%: Non-academia scientists beginning to publish
25%: Interdisciplinary/Other senior researchers beginning to publish at NLP venues

Step 2: Extract author emails from papers

How to Run

python extract_email_from_paper_pdf.py
# input file: junior_authors_n_papers.csv (generated by Step 1's `extract_junior_authors.py`)
# output file: junior_authors_n_email.csv

Quality Check

About 5,946 valid emails can be extracted. The remaining emails are too difficult to parse from the PDFs or would need more complicated rules.

We welcome pull requests to improve this function.
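
As one illustration of the kind of rule that helps here, papers often print addresses in a grouped form such as `{jane, john.roe}@example.edu`, which a plain email regex misses. The sketch below is not the logic in `extract_email_from_paper_pdf.py`; it assumes the PDF text has already been extracted to a string, and the function name `extract_emails` and both patterns are hypothetical.

```python
import re
from typing import List

# Ordinary addresses such as alice@uni.ac.uk
PLAIN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
# Grouped addresses such as {jane, john.roe}@example.edu
GROUPED = re.compile(
    r"\{([^{}@]+)\}\s*@\s*([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
)


def extract_emails(text: str) -> List[str]:
    """Return the emails found in already-extracted PDF text, in order."""
    emails = []

    # Expand each grouped block into one address per listed name.
    for names, domain in GROUPED.findall(text):
        for name in re.split(r"[,;|]", names):
            name = name.strip()
            if name:
                emails.append(f"{name}@{domain}")

    # Drop the grouped spans, then collect ordinary addresses from the rest.
    remainder = GROUPED.sub(" ", text)
    emails.extend(PLAIN.findall(remainder))

    # Deduplicate case-insensitively while keeping first-seen order.
    seen = set()
    unique = []
    for email in emails:
        key = email.lower()
        if key not in seen:
            seen.add(key)
            unique.append(email)
    return unique
```

Note that grouped addresses are emitted before plain ones, so the result order may differ from the order in the text. Obfuscated forms (for example "name at domain dot edu") and addresses split across PDF line breaks are not handled by this sketch.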

Credits

Many thanks to Prof. Matt Post (Johns Hopkins University) for his Gist code.