The main goal of these scripts is to build a page that displays the accepted papers for CVPR 2019 in a way that is easier for humans to parse (see https://mattdeitke.github.io/CVPR-2019 ). Below is an example of what this repository displays, followed by what CVPR open access currently displays.
In particular, there is functionality to cluster papers by latent Dirichlet allocation (LDA) topics, create thumbnail images from the first 8 pages of each PDF, find the abstracts, copy a BibTeX entry, view the paper and supplementary material, and more. Feel free to use the scripts as they are: they are up to date with Python 3.7 and should work for any past CVPR (unless CVPR changes its HTML). They can also be modified to adapt to another conference.
- Clone this repository: git clone https://github.com/mattdeitke/CVPR2019
- Save the HTML of the page where the accepted papers are displayed. For CVPR, this year, that would be http://openaccess.thecvf.com/CVPR2019.py
- Install ImageMagick, if it is not already installed. This can be done using sudo apt-get install imagemagick or using another supported method such as brew install imagemagick
- Run pdftowordcloud.py to generate the top words for each paper. The output is saved in topwords.p as a pickle.
- Run pdftothumbs.py to generate tiny thumbnails for all papers. The outputs are saved in the thumbs/ folder.
- Run scrape.py to generate the paper ID, title, and author list by scraping the saved .html page.
- Run makecorpus.py to create the allpapers.txt file, which has all papers, one per row.
- Run python lda.py -f allpapers.txt -k 7 --alpha=0.5 --beta=0.5 -i 100 to generate a pickle file called ldaphi.p, which contains the LDA word distribution matrix. Thanks to this nice LDA code by @shuyo! It requires the nltk and numpy libraries. This example uses 7 categories; you would need to change the nipsnice_template.html file a bit if you wanted to try a different number of categories.
- Generate the abstract files inside the abstracts/ folder using getabstracts.py
- Finally, run generatenicelda.py to create the index.html page.
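The script steps above can also be chained from a single driver. The following is a minimal sketch, not part of this repository: the helper names build_commands and run_pipeline are hypothetical, and it assumes every script is run from the repository root with no arguments other than the LDA ones shown above. By default it only prints the commands.

```python
import subprocess
import sys

# Script order taken from the steps above; the LDA arguments are the ones
# shown there. Anything else about how the scripts behave is an assumption.
PIPELINE = [
    ["pdftowordcloud.py"],
    ["pdftothumbs.py"],
    ["scrape.py"],
    ["makecorpus.py"],
    ["lda.py", "-f", "allpapers.txt", "-k", "7",
     "--alpha=0.5", "--beta=0.5", "-i", "100"],
    ["getabstracts.py"],
    ["generatenicelda.py"],
]


def build_commands(python=sys.executable):
    """Return the full command line for every step, in order."""
    return [[python, *step] for step in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each command; also execute it when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


# Dry run: only lists the commands that would be executed.
run_pipeline(dry_run=True)
```

Calling run_pipeline(dry_run=False) would execute the steps for real. Stopping at the first failure is a deliberate choice here, since later steps read files written by earlier ones (for example, lda.py reads allpapers.txt).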
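To illustrate what an LDA word distribution matrix holds, here is a small sketch that lists the most probable words for each topic. It runs on a made-up toy matrix rather than on ldaphi.p, because the exact layout of that pickle is not described here; the function name top_words_per_topic, the assumption that each row is one topic, and the separate vocabulary list are all illustrative.

```python
import numpy as np


def top_words_per_topic(phi, vocab, n=3):
    """Return the n highest-probability words for each topic (row of phi)."""
    phi = np.asarray(phi)
    # Sorting the negated matrix ascending gives descending probability order.
    order = np.argsort(-phi, axis=1)[:, :n]
    return [[vocab[j] for j in row] for row in order]


# Toy example: 2 topics over a 4-word vocabulary (made-up numbers).
vocab = ["image", "network", "pose", "depth"]
phi = np.array([
    [0.50, 0.30, 0.15, 0.05],
    [0.05, 0.10, 0.40, 0.45],
])

print(top_words_per_topic(phi, vocab, n=2))
# → [['image', 'network'], ['depth', 'pose']]
```

With 7 categories, as in the lda.py command above, such a matrix would have 7 rows, one per topic.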
Big thanks to @karpathy for his NeurIPS preview and Arxiv Sanity Preserver, which this repository is built on top of! Thanks also to @tholman for creating a more modern GitHub Corners and to @shuyo for the LDA code! Finally, thanks to the people at CVPR for openly publishing all of their accepted papers!
WTFPL licence