/gen-ack-auth

Developing scripts for a project studying the relationship between authorship, acknowledgement, and gender in scholarly communication.

Primary language: Python

gen-ack-auth

Simple scripts for a project studying the relationship between authorship, acknowledgement, and gender in scholarly communication.

AuthorAckExtract.py : this script:

  • crawls a directory containing PMC files;
  • mines the JATS markup for different parts of the journal article (author lists and acknowledgement statements);
  • uses Stanford NER to identify people in the acknowledgement statements;
  • identifies the gender of each author and acknowledgee;
  • outputs this into a series of files for later analysis (review for precision and recall of the acknowledgement extraction and gender ID).
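
The crawl-and-extract steps above can be sketched roughly as follows. This is not the code in AuthorAckExtract.py, only a minimal illustration using the standard library. The function names, the `*.nxml` file pattern, and the sample article are assumptions made for the example; the element names (`contrib`, `name`, `surname`, `given-names`, `ack`) are standard JATS.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_from_article(root):
    """Return (authors, acknowledgement paragraphs) for one parsed JATS article."""
    authors = []
    # JATS marks each author as <contrib contrib-type="author"> with a <name> child.
    for contrib in root.iter("contrib"):
        if contrib.get("contrib-type") != "author":
            continue
        name = contrib.find("name")
        if name is None:
            continue  # e.g. a group author with no personal name
        given = (name.findtext("given-names") or "").strip()
        surname = (name.findtext("surname") or "").strip()
        authors.append((given, surname))
    acks = []
    # Acknowledgements sit in <ack>; itertext() flattens inline markup such as <italic>.
    for ack in root.iter("ack"):
        for para in ack.iter("p"):
            acks.append(" ".join("".join(para.itertext()).split()))
    return authors, acks


def crawl(directory, pattern="*.nxml"):
    """Yield (path, authors, acks) for each matching file (the pattern is an assumption)."""
    for path in sorted(Path(directory).rglob(pattern)):
        root = ET.parse(str(path)).getroot()
        authors, acks = extract_from_article(root)
        yield path, authors, acks


# Hypothetical, heavily trimmed article used only to exercise the function.
SAMPLE = """<article>
  <front><article-meta><contrib-group>
    <contrib contrib-type="author"><name><surname>Smith</surname><given-names>Jane</given-names></name></contrib>
    <contrib contrib-type="editor"><name><surname>Jones</surname><given-names>Bob</given-names></name></contrib>
  </contrib-group></article-meta></front>
  <back><ack><title>Acknowledgements</title><p>We thank <italic>Maria Lopez</italic> for comments.</p></ack></back>
</article>"""

authors, acks = extract_from_article(ET.fromstring(SAMPLE))
print(authors)
print(acks)
```

Only `author` contributors are kept, so the editor in the sample is skipped, and the `<title>` inside `<ack>` is left out of the extracted text.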

nameGender.txt contains the names and genders*

NERExtractor.py : after forNER.csv has been run through Stanford NER (currently via the GUI, for testing), this script pulls the entities out and writes them to a separate file
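
As an illustration of the entity-pulling step, here is a small sketch that assumes the tagged text is in Stanford NER's slash-tag style (`word/TAG` tokens separated by spaces). The format actually saved from the GUI in this project may differ, and `extract_entities` is a hypothetical name, not a function from NERExtractor.py.

```python
from itertools import groupby


def extract_entities(tagged_text, wanted=("PERSON",)):
    """Group consecutive tokens that share a wanted tag into (entity, tag) pairs."""
    pairs = []
    for token in tagged_text.split():
        # rpartition splits on the LAST slash, so words containing "/" survive.
        word, sep, tag = token.rpartition("/")
        if sep:  # ignore tokens that carry no tag at all
            pairs.append((word, tag))
    entities = []
    for tag, group in groupby(pairs, key=lambda pair: pair[1]):
        if tag in wanted:
            entities.append((" ".join(word for word, _ in group), tag))
    return entities


# Hypothetical tagged sentence, used only for the example.
tagged = "We/O thank/O Maria/PERSON Lopez/PERSON and/O John/PERSON Doe/PERSON ./O"
people = extract_entities(tagged)
print(people)
```

One limitation of this grouping: two different people tagged back to back with no token between them would be merged into a single entity.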

parsenames.py : identifies the gender of authors and acknowledged entities

gender identification

*Gender ID scripts forked from [@ptigas], who doesn't seem to have this in a repo. The original pulls names from the Social Security "most popular US baby names" lists into a CSV: http://ptigas.com/blog/2012/01/21/name2gender-in-python/

The resulting HTML was processed in Excel: I made a pivot table counting how many years each name appeared as a male or a female name. I assigned a gender to a name if it was male or female more than 55% of the time; names within the 45-55% range I left ambiguous.
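
The same rule can be written as a small function, shown here only to make the thresholds concrete. The actual assignment was done in the Excel pivot table, and the function name and the "unknown" label for names with no data are assumptions.

```python
def assign_gender(male_years, female_years, threshold=0.55):
    """Label a name from the number of years it appeared as male vs. female."""
    total = male_years + female_years
    if total == 0:
        return "unknown"  # name never appeared in the lists (assumed label)
    if male_years / total > threshold:
        return "male"
    if female_years / total > threshold:
        return "female"
    return "ambiguous"  # neither share exceeds the threshold (the 45-55% band)


print(assign_gender(60, 40))
print(assign_gender(10, 90))
print(assign_gender(50, 50))
```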

  • We could possibly reduce the error if we constrain the temporal range of names a bit more; if the earliest paper in our corpus was published in 2000, then it's reasonable to say that a publishing author wouldn't be older than, say, 80, or younger than 20 -- meaning we could pull names from 1920-2000 and see if that makes gender more precise. Might be way overthinking it though
  • Could also be more conservative with the ambiguous names by widening the ambiguous band beyond the current 45-55%.
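
Both ideas could be tried by making the year window and the cutoff parameters, as in the sketch below. It assumes the name data is available as one `(year, name, sex)` record per year in which a name appears on the male or female list; that layout, the function name, and the sample records are hypothetical.

```python
from collections import Counter


def gender_table(records, first_year=1920, last_year=2000, threshold=0.55):
    """Map each name to male/female/ambiguous using only years inside the window."""
    male, female = Counter(), Counter()
    for year, name, sex in records:
        if not first_year <= year <= last_year:
            continue  # outside the plausible birth years of publishing authors
        if sex == "M":
            male[name] += 1
        else:
            female[name] += 1
    table = {}
    for name in set(male) | set(female):
        total = male[name] + female[name]
        if male[name] / total > threshold:
            table[name] = "male"
        elif female[name] / total > threshold:
            table[name] = "female"
        else:
            table[name] = "ambiguous"
    return table


# Hypothetical records: a name that shifts from male to mostly female over time.
records = [
    (1900, "Leslie", "M"), (1910, "Leslie", "M"), (1950, "Leslie", "F"),
    (1960, "Leslie", "F"), (1970, "Leslie", "F"), (1980, "Leslie", "M"),
]
all_years = gender_table(records, first_year=1880, last_year=2020)
windowed = gender_table(records)
strict = gender_table(records, threshold=0.8)
print(all_years, windowed, strict)
```

Comparing the three tables against a hand-checked sample of names would show whether the narrower window or the stricter cutoff actually improves precision, or whether it only increases the number of ambiguous names.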