Hello world 🌏! Are you an economist, an economics student, or just a random person like me who is interested in economics? Do you want to write a paper or a thesis, or just ramble about something, but don't have any fresh ideas for a topic? Worry no more, because this repository is for you!
Before jumping in, consider this:
As of this writing, there are more than 20,000 working papers on NBER. If getting one paper takes around 30 seconds (including the time interval required by NBER's crawler policy), the program needs about a week or more to finish.
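The estimate above can be checked with a little arithmetic. This is only a back-of-the-envelope sketch: the function name `estimated_days` is made up for illustration, and the inputs are the round numbers quoted above (20,000 papers at 30 seconds each), not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimated_days(papers, seconds_per_paper):
    """Return the total run time in days for a sequential crawl."""
    return papers * seconds_per_paper / SECONDS_PER_DAY


# Round numbers from the text: 20,000 papers, about 30 seconds per paper.
days = estimated_days(20_000, 30)
print(f"{days:.1f} days")  # 6.9 days
```

Since there are more than 20,000 papers, the real figure lands at a week or beyond.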
If you are okay with the caution above, you may want to use Heroku or another cloud service so that the program doesn't run on your local machine. If so, you may need to spend $9/month to maintain the PostgreSQL database, because its free tier only covers up to 10,000 rows. I won't get into setting up a database on Heroku, since in this repository I only run the program on my local machine.
If you think this is going to be useful for your purpose, don't hesitate to clone this repository:
cd ~
git clone https://github.com/ledwindra/nber.git
cd nber
To get the latest updates, run git pull origin master
Check NBER's robots.txt
The following is a snippet:
User-agent: *
Crawl-delay: 10
User-agent: *
Disallow: /fda/
Disallow: /contact/
Disallow: /confer/
Disallow: /~confer/
Disallow: /conf_papers/
Disallow: /c/
Disallow: /wpsubmit/
Disallow: /custom
Disallow: /confsubmit/
Disallow: /family/
Disallow: /1050/
Disallow: /cal/
Disallow: /cgi-bin/
Disallow: /nberhistory/historicalarchives/
Disallow: /xming*
Disallow: /taxex/
Disallow: /papers/mail
Disallow: /tmp/
Disallow: /server-status/
Disallow: /mrtg/
Disallow: /bb/
Disallow: /img/
Disallow: /pics
Disallow: /*.ris$
Disallow: /*.marc$
Disallow: /*.bib$
Disallow: /*palm*$
Disallow: /taxsim-calc*/
Disallow: /medicare/
Disallow: /*.pl/
Disallow: /arfpub/
Disallow: /pscp*
Disallow: /jobs/stateforms/
Disallow: /hcris/
Nobody is disallowed from fetching pages under /papers/ (only /papers/mail is blocked). However, please scrape ethically by waiting 10 seconds between requests (see Crawl-delay: 10).
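The rules above can also be checked in code before any request is sent. The following is a minimal sketch using the standard library's `urllib.robotparser`, not the code this repository actually runs. The names `build_parser` and `polite_fetch` are hypothetical, and `ROBOTS_SNIPPET` is an abbreviated copy of the rules above, merged into a single `User-agent: *` group for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

# Abbreviated rules, merged into one group (an assumption for this sketch).
ROBOTS_SNIPPET = """\
User-agent: *
Crawl-delay: 10
Disallow: /fda/
Disallow: /papers/mail
"""


def build_parser(robots_text):
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_fetch(urls, fetch, parser, sleep=time.sleep, user_agent="*"):
    """Fetch only allowed URLs, pausing for the crawl delay after each one.

    `fetch` is any callable that takes a URL and returns a result,
    for example a thin wrapper around requests.get.
    """
    delay = parser.crawl_delay(user_agent) or 0
    results = []
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # skip anything robots.txt disallows
        results.append(fetch(url))
        sleep(delay)
    return results


parser = build_parser(ROBOTS_SNIPPET)
print(parser.can_fetch("*", "https://www.nber.org/papers/w0001"))
print(parser.crawl_delay("*"))
```

Passing `sleep` as a parameter keeps the function easy to test: a test can record the delays instead of actually waiting.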
It's not mandatory to use a virtual environment, but I think it's better to use one. This project depends on third-party modules such as pandas and requests. Some of these modules may be updated in ways that affect the project, or you may have another project that uses the same modules at different versions. A virtual environment avoids both issues. Just run the following in a terminal:
python3 -m venv .venv-nber
source .venv-nber/bin/activate
As mentioned above, this project depends on third-party modules. Run the following in a terminal to install them:
pip3 install --upgrade pip
pip3 install -r requirements.txt
Install PostgreSQL if it is not already on your machine.
Connect to localhost:
psql -h localhost -p 5432 -d postgres
Then input your password when it prompts something like this:
Password for user postgres:
The host, port, database, and user may differ if your configuration is different.
To create the table, run the SQL query in util/get_paper.sql
Run the Python script on terminal:
python3 src/get_paper.py
It will prompt you with the following questions:
Your PostgreSQL username: [YOUR POSTGRESQL USERNAME]
Your PostgreSQL password: [YOUR POSTGRESQL PASSWORD]
Your PostgreSQL host: [YOUR POSTGRESQL HOST]
Your PostgreSQL port: [YOUR POSTGRESQL PORT]
Your PostgreSQL database: [YOUR POSTGRESQL DATABASE]
Input initial ID: [ENTER 0 IF YOU WANT TO START FROM https://www.nber.org/papers/w0000]
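The initial ID determines which paper URL the script starts from. Here is a rough sketch of how an ID could map to a URL, based only on the example above (ID 0 corresponds to w0000). The helper names `paper_url` and `urls_from` are hypothetical, and the four-digit zero padding for larger IDs is an assumption, not something taken from src/get_paper.py.

```python
BASE_URL = "https://www.nber.org/papers/"


def paper_url(paper_id):
    """Build a working paper URL, zero-padding the ID to four digits."""
    return f"{BASE_URL}w{paper_id:04d}"


def urls_from(initial_id, count):
    """Return `count` consecutive paper URLs starting at `initial_id`."""
    return [paper_url(i) for i in range(initial_id, initial_id + count)]


print(paper_url(0))  # https://www.nber.org/papers/w0000
print(urls_from(1, 3))
```

Restarting an interrupted run is then a matter of entering the last ID that was stored as the new initial ID.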
Run the unit tests to check the data quality:
pytest test -s
It will then prompt you to input your PostgreSQL password:
collecting ... Your PostgreSQL password:
column_name | data_type | description |
---|---|---|
id | int4 | A unique ID for each paper |
citation_title | varchar | Paper title |
citation_author | varchar | Paper author(s). There can be more than one, so it is stored as an array |
citation_date | date | Date of the paper (its exact meaning is not clear) |
citation_publication_date | date | Date the paper was published |
citation_technical_report_institution | varchar | Institution which publishes the report |
citation_technical_report_number | varchar | Paper ID as in the website |
citation_journal_title | varchar | Journal title |
citation_journal_issn | varchar | ISSN number |
citation_pdf_url | varchar | Paper URL for PDF version |
topics | varchar | Paper topic(s). There can be more than one, so it is stored as an array |
abstract | varchar | Paper abstract |
also_downloaded | varchar | URLs of items that users who downloaded this paper also downloaded. There can be more than one, so it is stored as an array. I store the URL because anything else would only capture working papers. Users may also download something from /chapters/, which may be useful for analysis |
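Several column names above look like the `citation_*` meta tags commonly found in the HTML head of academic pages. Assuming that is where the values come from (the source does not confirm it), here is a minimal sketch of collecting them with the standard library's `html.parser`. The class `CitationMetaParser`, the function `parse_citation_meta`, and the sample HTML are all made up for illustration and are not taken from src/get_paper.py.

```python
from html.parser import HTMLParser

# Fields that can repeat on a page and are therefore kept as lists.
ARRAY_FIELDS = {"citation_author"}


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if not name or content is None or not name.startswith("citation_"):
            return
        if name in ARRAY_FIELDS:
            self.record.setdefault(name, []).append(content)
        else:
            self.record[name] = content


def parse_citation_meta(html):
    """Return a dict of citation fields found in the given HTML text."""
    parser = CitationMetaParser()
    parser.feed(html)
    return parser.record


# Placeholder page, not real NBER content.
SAMPLE_HTML = """
<html><head>
<meta name="citation_title" content="A Placeholder Title">
<meta name="citation_author" content="First Author">
<meta name="citation_author" content="Second Author">
<meta name="citation_technical_report_number" content="w0000">
<meta name="description" content="ignored">
</head></html>
"""

record = parse_citation_meta(SAMPLE_HTML)
print(record["citation_author"])
```

Keeping repeated fields as lists matches the table above, where authors and topics are stored as arrays.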
Play around with the SQL queries inside util/explore.sql, for example:
SELECT topics, COUNT(*)
FROM (
SELECT UNNEST(topics) AS topics
FROM paper
) AS paper
WHERE topics != ''
GROUP BY topics
ORDER BY topics ASC;
...and the output will be:
topics | count
--------------------------------------------------------+-------
Asset Pricing Program | 2417
Corporate Finance Program | 2049
Development Economics Program | 1017
Economic Fluctuations and Growth Program | 5218
Economics of Education Program | 1429
Environment and Energy Program | 1101
Health Care Program | 1472
Health Economics Program | 2067
Industrial Organization Program | 1538
International Finance and Macroeconomics Program | 3841
International Trade and Investment Program | 3323
Labor Studies Program | 5288
Law and Economics Program | 1030
Monetary Economics Program | 3004
Political Economy Program | 1138
Productivity, Innovation, and Entrepreneurship Program | 2191
Program on Children | 1522
Program on the Development of the American Economy | 1410
Program on the Economics of Aging | 1505
Public Economics Program | 5273
Technical Working Papers | 246
(21 rows)
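If the rows are already loaded into Python, the same count can be reproduced without SQL. The sketch below mirrors the query above (flatten the topic arrays, drop empty strings, group, and sort by topic) using only the standard library. The function name `count_topics` is hypothetical, and the sample rows are placeholders rather than real database records.

```python
from collections import Counter


def count_topics(papers):
    """Count papers per topic, like UNNEST + GROUP BY + ORDER BY topics."""
    counts = Counter(
        topic
        for paper in papers
        for topic in paper["topics"]  # flatten each topics array
        if topic != ""                # same filter as WHERE topics != ''
    )
    return sorted(counts.items())     # ascending by topic name


# Placeholder rows for illustration only.
sample = [
    {"id": 1, "topics": ["Labor Studies Program", "Public Economics Program"]},
    {"id": 2, "topics": ["Labor Studies Program", ""]},
    {"id": 3, "topics": []},
]

for topic, count in count_topics(sample):
    print(f"{topic} | {count}")
```

A paper with several topics is counted once under each of them, which is why the counts in the output above add up to more than the number of papers.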
If you'd like to contribute, here's what you can do:
- Clone the repo
- Create a new branch by running git checkout -b [YOUR-BRANCH-NAME]
- Always pull from master before you push
- Make a pull request
Alternatively, just contact me personally. I'm just a regular person.
If you have read up to this line, thank you for bearing with me. Hope this is useful for your purpose! 😎