Project Manual

This manual provides detailed instruction to reproduce the analysis of this project.

Programming Environment

Python version: Python 2.7.11
- Anaconda distribution for Python 2.7 is highly recommended since it has many of the following libraries preinstalled
Python Package Dependencies (older versions are not tested)
1. pandas (0.17.1)
2. numpy (1.10.4)
3. matplotlib (1.5.1)
4. statsmodels (0.6.1)
5. csv (1.0)
6. patsy (0.4.1)
7. scipy (0.17.0)
8. tablib (0.11.2)

Data Collection

TRACE and Datastream data

Clone this repository to a local directory.
Download sample data of Enhanced TRACE from the WRDS database to sample cusip IDs. (A registered and subscribed account is required to download TRACE data)
Choose Date range to be the 1st month of interested period, select "Search the entire database" and select only two columns : Bond Symbol and CUSIP;
Change the Output Format to comma-delimited text**(*.csv)**, and keep everything else default;
Click Submit Query to acquire datasets. Wait for 10 sec to a few minutes, depending on size of the dataset, until a link to the .csv file appears below the query form;
Save the data file with name "cusip_id_start_raw.csv".
Repeat 1~4 with Date range set to last month of interested period. Save the data file with name "cusip_id_end_raw.csv".
Download Datastream data from a Datastream Terminal.

Download list of bonds with fields ISIN, SIC and MOODY'S RATING as file "bond_list_datastream.csv".
Open "bond_list_datastream.csv" with Excel, create a new column named "cusip_id" with each cell containing formula =MID(ISIN_CELL, 3, 9).
Rename "MOODY'S RATING" to "MOODY_RATING" for easier Python usage.

Put the above data files in ./data/ folder of the repository.
In command prompt or Anaconda prompt, run \MarketMakingProfitability> python cusip.py to generate a list of cusip IDs

A master list of cusip ID "cusip_id_list_TRACE.txt" is generated, as well as txt files for cusip ID lists of each market segment. The following instruction will use the master list as example.

Go to Enhanced TRACE again to download raw data to be used for analysis
Choose Date range to be the full length of the interested period;
Choose "CUSIP" as search option and choose to upload a file. Upload "cusip_id_list_TRACE.txt" generated by step 5;
Select ALL variables and Output Format as (.csv);
Submit the query. Download the file to _.\data_ folder.

Fama/French factors and other public index data

Go to Kenneth R. French's data library
Download the Fama/French 3 Factors in csv format;
Open the file with Excel, remove excess information (only keep data and column names "Date", "MKT", "SMB", "HML" and "RF");
Change the format of the Date colume to yyyymm;
Rename the file to "FF_monthly_data.csv"
Go to the Fed historical data page.
Download the monthly data of 20-year Nominal Treasury constant maturities;
Copy the data in the yield column to replace "RF" column in "FF_monthlly_data.csv" (make sure the dates are matched).
Go to the FRED Economic Data website to download the following three datasets
BofA Merrill Lynch US Corporate A Effective Yield: Download Monthly data (aggregated by End of Period), copy the data column to "FF_monthlly_data.csv" as a new column named "IGBonds";
10-year US Government Bond Yields benchmark: Download Monthly data, copy the data column to "FF_monthlly_data.csv" as a new column named "Gbond";
4-week Treasury Bill: Download Monthly data, copy the data column to "FF_monthlly_data.csv" as a new column named "Tbill";
Open "FF_monthlly_data.csv" with Excel, create a new column "DEF" by =IGBonds_cell - Gbond_cell and a new column "TERM" by =Gbond_cell - Tbill_cell. Save "FF_monthlly_data.csv" in _.\data\ folder.
Download BofA Merrill Lynch US Corporate A Effective Yield again in Daily, Close data, save to _.\data\ as "BofA_Corporate_Bond_Index.csv".

Profitability Analysis

Now with all data ready in data folder, we can run the entire analysis procedure automatically with proper parameters set.

Adjust configuration parameters by editing "MMP_config.py" file

With raw datafile, set readDaily = False and readRaw = True.
Set the names of the raw files to parameter RAW_DATA_NAMES as a list. Note that all given raw files will be merged to one large raw file to run analysis as one portfolio.
Set doOutput = False to speed up data processing. If doOutput = True, during the data cleaning, all transactions of each bond will be output to one csv file in folder _.\data\output_.
Set dt_start and dt_end to the corresponded dates in yyyymmdd format.
Set CLEAN_DATA_FILE to be the desired prefix for analysis outputs.

In command prompt, run MarketMakingProfitability> python main.py to start processing.

Uncomment print statements in main function of "main.py" to output progress logs in console.
Depending the size of the raw dataset, this process will take from 5 mins to 1 hour.

After first run of the script, a "_<CLEAN_DATA_FILE>daily.csv" file is generated. In "MMP_config.py" file, set readDaily = True and readRaw = False to speed up future analysis time.

Analysis Output

The Python script will produce a series of output csv files as intermediate and final results of the analysis. All the files will have prefix CLEAN_DATA_FILE set in "MMP_config.py". Use CLEAN_DATA_FILE = "list_clean_UTILITIES" as an example.

list_clean_UTILITIES.csv: Fully processed clean transaction data
list_clean_UTILITIES_daily.csv: transaction data aggregated into daily yield
list_clean_UTILITIES_monthly_total_bond.csv: number of bonds traded each month
list_clean_UTILITIES_monthly_volume.csv: aggregated dollar volume traded each month
list_clean_UTILITIES_liquidity.csv: monthly liquidity measure and liquidity risk measure
list_clean_UTILITIES_Betas.csv: Betas of each bond resulted from Fama French regression
list_clean_UTILITIES_return_diff.csv: actual and expected yield from Fama French model, and regression error term
list_clean_UTILITIES_stats_summary.csv: adjusted R-squared of Fama French regression, and t-stats and p-stats of the Betas of each bond