Cornell Financial Data Collection uses Python, Selenium, and NLP to aggregate and analyze financial data on Cornell's corporate donors.

Cornell Innovation and Entrepreneurship - Data Analysis Platform

Centralized data analysis platform for the Cornell Innovation and Entrepreneurship Lab. This repository contains scripts for data collection, data cleaning, and data analysis.

Getting Started

Prerequisites

  • Python 3.9
  • pip
  • virtualenv
  • Cornell Email

Installation

  1. Clone the repository
git clone
  2. Create a virtual environment
virtualenv venv
  3. Activate the virtual environment
source venv/bin/activate
  4. Change into the server directory
cd server
  5. Install the dependencies
pip install -r requirements.txt
  6. Create a .env file in the server directory
touch .env
  7. Add the following environment variables to the .env file (no spaces around the = sign, or the shell will reject the assignment)
export CORNELL_NETID="your_cornell_netid"
export CORNELL_PASSWORD="your_cornell_password"
export CAPITAL_IQ_USERNAME="your_capital_iq_username"
export CAPITAL_IQ_PASSWORD="your_capital_iq_password"
  8. Source the .env file
source .env
  9. Run the server
python app.py
  10. Open a new terminal window and change into the client directory
cd cornell-data
  11. Install the dependencies
npm install
  12. Run the client
npm start
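
If the server fails to log in, a quick check that the four variables from the .env step actually reached the process can help. The following is a minimal sketch and is not part of the repository: the helper name `missing_credentials` is hypothetical, and only the variable names come from the steps above.

```python
import os

# Variable names taken from the .env step above.
REQUIRED_VARS = (
    "CORNELL_NETID",
    "CORNELL_PASSWORD",
    "CAPITAL_IQ_USERNAME",
    "CAPITAL_IQ_PASSWORD",
)


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to the
    environment of the current process.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Report which variables, if any, are still missing.
print("Missing variables:", missing_credentials())
```

Running this in the same terminal where `source .env` was executed should print an empty list; any name that appears was not exported.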

Usage

The platform can collect company data in the following ways:

  1. Collect data for a list of companies from the Capital IQ, Mergent Intellect, or Guidestar websites, individually.
cd scraping
python index.py --source
  2. Collect data for a list of companies from the Capital IQ, Mergent Intellect, or Guidestar websites, in bulk.
python index.py
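
One possible reading of the two commands above is that `--source` restricts a run to a single site, while omitting it covers all three. The sketch below shows how such a flag could be parsed with `argparse`. It is an assumption, not the actual `index.py`: the source identifiers and the function name `select_sources` are hypothetical.

```python
import argparse

# Hypothetical identifiers for the three sites named above;
# the real script may expect different values.
SOURCES = ("capital_iq", "mergent_intellect", "guidestar")


def select_sources(argv=None):
    """Return the list of sources to scrape for the given arguments."""
    parser = argparse.ArgumentParser(description="Collect company data.")
    parser.add_argument(
        "--source",
        choices=SOURCES,
        default=None,
        help="scrape a single source; omit to scrape all sources",
    )
    args = parser.parse_args(argv)
    # A single source when the flag is given, otherwise every source.
    return [args.source] if args.source else list(SOURCES)


print(select_sources(["--source", "guidestar"]))
print(select_sources([]))
```

Restricting `--source` with `choices` makes the parser reject a misspelled site name up front instead of failing partway through a scraping run.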