Parse out relevant information from co-op student resume package PDFs provided by the University of Waterloo.
NOTE: This program is not perfect, and the results (email address, etc.) will need to be manually inspected for errors and fixed.
Ensure that Go is installed and set up with a working $GOPATH.
Installation and sanity check:
$ go get github.com/curvegrid/parse-uw-coop-package
$ parse-uw-coop-package -h
If running on a Mac, use Homebrew to install the Poppler package, which provides the pdftotext utility (brew install poppler). parse-uw-coop-package will fall back to ps2ascii, but it tends to be less reliable.
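The preference for pdftotext with a fallback to ps2ascii can be sketched as below. This is an illustration only, not the utility's actual code: the function name pickConverter is hypothetical, and the lookup function is passed in as a parameter so the selection logic does not depend on what is installed locally.

```go
package main

import (
	"fmt"
	"os/exec"
)

// pickConverter (hypothetical helper) returns the first candidate for which
// lookPath succeeds. Candidates are tried in order, so list the preferred
// converter first.
func pickConverter(candidates []string, lookPath func(string) (string, error)) (string, error) {
	for _, name := range candidates {
		if _, err := lookPath(name); err == nil {
			return name, nil
		}
	}
	return "", fmt.Errorf("none of %v found on PATH", candidates)
}

func main() {
	// Prefer pdftotext (from Poppler), then fall back to ps2ascii.
	name, err := pickConverter([]string{"pdftotext", "ps2ascii"}, exec.LookPath)
	if err != nil {
		fmt.Println("no converter available:", err)
		return
	}
	fmt.Println("would use:", name)
}
```

Whichever converter is found can then be passed to the utility through its -pdftoascii option.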
This assumes you are an employer of University of Waterloo co-operative education (co-op, interns) students and have a valid Employer login on WaterlooWorks.
- Post a job on WaterlooWorks and wait for student applications to become available.
- Log in to WaterlooWorks, navigate to the applications list, and click the blue 'Application Options' button near the top of the page to create a custom application bundle with each application as a separate PDF.
- Download and unzip the consolidated package.
- Install this utility, parse-uw-coop-package.
- From the directory where you unzipped the consolidated package of PDFs, run parse-uw-coop-package and pipe the output to a CSV file (e.g., parse-uw-coop-package > applicants.csv). You can tweak the options (try parse-uw-coop-package -h) as required.
- Import into your spreadsheet of choice. As noted above, manual cleanup will be required.
By default, the utility searches the current directory for all PDFs whose filenames match a regular expression (-fileregex) and parses the text within for fields specific to UW co-op.
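As a rough sketch of how the default -fileregex pattern behaves, the snippet below applies it to a couple of filenames and pulls out the three capture groups. The function name parseFilename and the sample filenames are made up for illustration, and treating the first two groups as first and last name is an assumption based on the sample output, not something confirmed from the source code.

```go
package main

import (
	"fmt"
	"regexp"
)

// Default -fileregex value, written as a raw string so the backslashes
// appear once instead of doubled.
var fileRegex = regexp.MustCompile(`([A-Za-z-]+) ([A-Za-z-]+) \(([0-9]+)\).pdf`)

// parseFilename (hypothetical helper) returns the two name groups and the
// numeric ID captured from a filename; ok is false when nothing matches.
func parseFilename(name string) (first, second, id string, ok bool) {
	m := fileRegex.FindStringSubmatch(name)
	if m == nil {
		return "", "", "", false
	}
	return m[1], m[2], m[3], true
}

func main() {
	// Sample filenames are invented for this example.
	for _, name := range []string{"Able Baker (123456).pdf", "notes.pdf"} {
		first, second, id, ok := parseFilename(name)
		fmt.Printf("%q: first=%q second=%q id=%q matched=%v\n", name, first, second, id, ok)
	}
}
```

Files whose names do not fit the pattern are simply not matched, so a differently named bundle needs a custom -fileregex value.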
Usage of ./parse-uw-coop-package:
  -averagesRegex string
        Regex for averages (default "Term Average:\\s*([0-9]{2}\\.*[0-9]*)")
  -concurrency int
        Number of PDF parsing threads to run in parallel (default 4)
  -coverLetterRegex string
        Regex for cover letter yes/no (default "[Ss]incerely|[Hh]iring [Mm]anager")
  -emailRegex string
        Regex for email address (default "[A-Za-z0-9_.-]+\\@[A-Za-z0-9.-]+\\.[A-Za-z0-9]+")
  -fileregex string
        Regex filter for filenames (default "([A-Za-z-]+) ([A-Za-z-]+) \\(([0-9]+)\\).pdf")
  -githubRegex string
        Regex for Github (default "github.com/[A-Za-z0-9_.-]+")
  -linkedInRegex string
        Regex for LinkedIn (default "linkedin.com/in/[A-Za-z0-9_.-]+")
  -pdftoascii string
        PDF to ASCII converter (default "ps2ascii")
  -worktermEvalRegex string
        Regex for work term evaluations (default "UNSATISFACTORY|MARGINAL|SATISFACTORY|VERY GOOD|EXCELLENT|OUTSTANDING")
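To get a feel for what the default patterns in the usage text capture, here is a minimal sketch that runs three of them over a short piece of text. The helper names (firstEmail, termAverages, evaluations) and the sample text are invented for this example, and whether the utility keeps the first email match or all of them is an assumption; check the source before relying on it.

```go
package main

import (
	"fmt"
	"regexp"
)

// Default patterns from the usage text, written as raw strings.
var (
	emailRegex    = regexp.MustCompile(`[A-Za-z0-9_.-]+\@[A-Za-z0-9.-]+\.[A-Za-z0-9]+`)
	averagesRegex = regexp.MustCompile(`Term Average:\s*([0-9]{2}\.*[0-9]*)`)
	evalRegex     = regexp.MustCompile(`UNSATISFACTORY|MARGINAL|SATISFACTORY|VERY GOOD|EXCELLENT|OUTSTANDING`)
)

// firstEmail returns the first email-like match, or "" if there is none.
func firstEmail(text string) string {
	return emailRegex.FindString(text)
}

// termAverages returns the captured number from every "Term Average:" match.
func termAverages(text string) []string {
	var out []string
	for _, m := range averagesRegex.FindAllStringSubmatch(text, -1) {
		out = append(out, m[1])
	}
	return out
}

// evaluations returns every work term evaluation keyword found in the text.
func evaluations(text string) []string {
	return evalRegex.FindAllString(text, -1)
}

func main() {
	// Invented sample standing in for converted PDF text.
	text := "Contact: able@example.com\n" +
		"Term Average: 72\nTerm Average: 84.5\n" +
		"Evaluation: VERY GOOD\nEvaluation: UNSATISFACTORY\n"

	fmt.Println("email:", firstEmail(text))
	fmt.Println("averages:", termAverages(text))
	fmt.Println("evaluations:", evaluations(text))
}
```

Trying a custom pattern this way before passing it on the command line is a quick check that it captures what you expect.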
$ parse-uw-coop-package
ID,First name,Last name,Email,Email with name,LinkedIn,Github,Included a cover letter,Work term evaluations,Term averages,Overall average
123456,Able,Baker,able@example.com,Able Baker <able@example.com>,,,Yes,"OUTSTANDING,OUTSTANDING,OUTSTANDING,GOOD,OUTSTANDING","72,81,84.5,72,78",73.4
...
- This has only been tested on macOS.
- The PDF-to-text converter defaults to ps2ascii or pdftotext, which may not be available on your system. See the command line options to adjust.
- The PDF-to-text process is not perfect, especially with formatted PDFs. Email addresses seem to be especially problematic, with many of them mangled. For example, we've seen jeff@example.com turn into .com example je ef@ with ps2ascii, even in what seems like a fairly "standard" formatted PDF. Manual cleanup will be required.
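Since manual cleanup is expected, one way to find the rows that need attention is to scan the generated CSV for empty fields. The sketch below is a hypothetical post-processing helper, not part of the utility: rowsMissing and the sample data are invented, and it assumes the header contains an ID column plus the column being checked, as in the sample output.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// rowsMissing (hypothetical helper) returns the ID of every data row whose
// named column is empty or whitespace only.
func rowsMissing(csvText, column string) ([]string, error) {
	records, err := csv.NewReader(strings.NewReader(csvText)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, fmt.Errorf("no header row")
	}
	// Locate the ID column and the column to check by header name.
	idCol, target := -1, -1
	for i, h := range records[0] {
		if h == "ID" {
			idCol = i
		}
		if h == column {
			target = i
		}
	}
	if idCol < 0 || target < 0 {
		return nil, fmt.Errorf("header must contain ID and %q", column)
	}
	var ids []string
	for _, rec := range records[1:] {
		if strings.TrimSpace(rec[target]) == "" {
			ids = append(ids, rec[idCol])
		}
	}
	return ids, nil
}

func main() {
	// Invented sample; real output has more columns.
	sample := "ID,First name,Last name,Email\n" +
		"123456,Able,Baker,able@example.com\n" +
		"234567,Charlie,Dog,\n"
	ids, err := rowsMissing(sample, "Email")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("rows missing an email:", ids)
}
```

An empty field only catches addresses the regex missed entirely; addresses that were captured but mangled still need a visual check.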
- DRY up the whole program
- Switch from ps2ascii to a native Go PDF-to-text solution
- Improve the parsing accuracy: better regexes, etc.
- Direct package download, and integration with tabular info, from WaterlooWorks
- Keyword extraction
Assuming parse-uw-coop-package was installed per the previous step, change to the directory where go get downloaded the source:
$ cd $GOPATH/src/github.com/curvegrid/parse-uw-coop-package
$ go build parse-uw-coop-package.go
$ ./parse-uw-coop-package
Note that you will now have two copies of the parse-uw-coop-package binary on your system: the one in $GOPATH/bin via go install, and the one just built in $GOPATH/src/github.com/curvegrid/parse-uw-coop-package via go build.
Licensed under the MIT License. See the LICENSE file for details. Copyright 2018 by Curvegrid Inc.