
Federal Reserve Economic Data (FRED2) Big Data Import Utility

Overview

The Federal Reserve Economic Data (FRED2) database contains ~240,000 discrete series of data ( http://research.stlouisfed.org/fred2/ ). Each series is stored in a separate CSV file, with each row containing a Date and Value. The data is discrete and not categorized. Data Visualization tools, such as QlikView, generally view data by categories, providing drill-downs into subcategories. These tools are very adept at aggregating time-series data from multiple sources. The objectives of the FRED Import Utility are:

  • Parse the raw FRED data into date, value, categories and other criteria (country, city, state, county, sex, FRB district) – see the sketch after this list
    • The data can be viewed in a drill-down manner via Data Visualization tools
    • The data is available for other processes such as R, MapReduce, machine learning algorithms, etc.
  • Store the FRED data in common Big Data columnar databases, in this case Cloudera Impala and Amazon Redshift.
    • Provide the Business Analyst with a means to combine FRED data with other Big Data sources, such as financial, weather, or health data.
    • Note that the FRED Import Utility could be easily modified to store the FRED data in any relational database.
  • Provide documentation generated by the utility to assist the Business Analyst in finding the appropriate subsets of data and configuring QlikView to select those data sets
    • A spreadsheet is generated that contains a summary of all the distinct groups of categories (see samples/FredDistinctCategories.xls). This can be rapidly searched and/or filtered to locate specific sets of FRED data.
    • For every distinct category, a block of QlikView script is generated. The Business Analyst can copy/paste this block into the QlikView Load script to select the data for that distinct category (see samples/t4232_Consumer Price Index.qvs)
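
The exact parsing rules live in the FredImport source; the following is only a minimal sketch of the idea – splitting a series title into category levels – and every class and method name here is hypothetical:

```java
// Illustrative only: the real title-parsing rules are in the FredImport source.
// This sketch splits a FRED series title into category levels on ":".
import java.util.ArrayList;
import java.util.List;

public class TitleParseSketch {
    public static List<String> parseCategories(String title) {
        List<String> categories = new ArrayList<>();
        for (String part : title.split(":")) {
            categories.add(part.trim());
        }
        return categories;
    }

    public static void main(String[] args) {
        // prints [Consumer Price Index for All Urban Consumers, All Items]
        System.out.println(parseCategories(
                "Consumer Price Index for All Urban Consumers: All Items"));
    }
}
```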

High Level Process

  • Parse the README_SERIES_ID_SORT.txt to get the path/name of each CSV file, as well as its title, frequency, seasonal adjustment and last update date
  • Parse each title into Categories. If applicable, also parse to find the City, State, County, Country, Sex and FRB District.
  • Read each CSV file (which contains only Date and Value) and create a line in a TSV containing all of the data required to produce a row in the target database (Date, Value, Categories, Country, etc.); a sketch of this step follows the list
  • Create a spreadsheet containing all of the Distinct Category combinations
  • For each Distinct Category, create a file containing a block of QlikView Load/Select script
  • At this point, the Business Analyst can create a QlikView dashboard, copying/pasting Load/Select scripts to load the relevant data.
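
As a rough illustration of the CSV-to-TSV step, here is a sketch that assumes each series CSV holds a DATE,VALUE header followed by one observation per row. The class and method names and the exact output columns are hypothetical; ProcessReadmeSeriesId contains the real logic:

```java
// A minimal sketch of the CSV-to-TSV step, assuming a DATE,VALUE header row.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public class CsvToTsvSketch {
    // Append one series' observations to the table-ready TSV, carrying along
    // attributes (title, frequency, ...) parsed earlier from the README.
    public static void appendSeries(String csvPath, String title,
            String frequency, PrintWriter tsvOut) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(csvPath))) {
            String line = in.readLine(); // skip the DATE,VALUE header
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(",", 2);
                if (fields.length < 2) continue; // skip malformed rows
                // date, value, then the attributes derived from the title
                tsvOut.println(fields[0] + "\t" + fields[1] + "\t"
                        + title + "\t" + frequency);
            }
        }
    }
}
```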

Installation

  • Prerequisites:

    • Java 7 or later is required
  • If Target Database is Impala:

    • Note that the data and code must be on a node in a Hadoop cluster; you must have a valid login on that cluster and access to Impala on that cluster
    • Download the FredImport and FredImportCdh projects from GitHub
    • Run Maven Update, Maven Clean and Maven Install on the projects
    • SSH to a node on the Hadoop cluster and create the target directory for code and data. Note that this may need to be created on another mount due to disk space requirements – you might not have enough disk space in your home directory. For example, let's assume “data1/mnt”.
    • Bring up the impala-shell and create the database “dbfred” via
      • create database dbfred;
      • exit;
    • Upload fredimportcdh-1.0.jar, properties/fredcdh.properties and the contents of FredImport/reference to the target directory
    • Download and unzip the FRED2 CSV zip file: http://research.stlouisfed.org/fred2/downloaddata
    • Modify the paths and JDBC connection information in /FredImportCdh/properties/fredcdh.properties
    • Run the process to convert raw FRED data to table-ready TSVs
      • java -Xms1G -Xmx4G -cp fredimportcdh-1.0.jar com.ipcglobal.fredimport.process.ProcessReadmeSeriesId properties/fredcdh.properties
    • Run the process to import the table-ready TSVs into Impala (a sketch of one possible load pattern appears after this installation list)
      • Create directories fred/ and fred/properties/ in HDFS
      • Copy the fredcdh.properties file into HDFS, i.e. into /user//fred/properties/fredcdh.properties
      • hadoop jar fredimportcdh-1.0.jar com.ipcglobal.fredimportcdh.TsvsToImpala /user//fred/properties/fredcdh.properties
  • If Target Database is Redshift:

    • Assumptions:
      • You have a .aws/credentials file containing your AWS access keys
      • You have Write access to an existing S3 bucket
      • You have an existing Redshift instance and permission to create databases and tables
    • Connect to Redshift and create the database:
      • create database dbfred;
    • The Redshift version of the utility can be run entirely from within Eclipse. However, because the table-ready TSVs are transferred to S3, it will run faster from an EC2 instance.
    • Download the FredImport and FredImportAws projects from GitHub
    • Run Maven Update, Maven Clean and Maven Install on the projects
    • Upload fredimportaws-1.0.jar, properties/fredaws.properties and the contents of FredImport/reference
    • Download and unzip the FRED2 CSV zip file: http://research.stlouisfed.org/fred2/downloaddata
    • Modify the AWS, paths and JDBC connection information in /FredImportAws/properties/fredaws.properties
    • Run the process to convert raw FRED data to table-ready TSVs
      • java -Xms1G -Xmx4G -cp fredimportaws-1.0.jar com.ipcglobal.fredimport.process.ProcessReadmeSeriesId properties/fredaws.properties
    • Run the process to import the table-ready TSVs into Redshift (note – this will run for about 5 minutes; a sketch of the underlying S3 COPY pattern appears after this installation list)
      • java -Xms1G -Xmx4G -cp fredimportaws-1.0.jar com.ipcglobal.fredimportaws.TsvsToRedshift properties/fredaws.properties
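
For the Impala import, a common way to expose TSVs already sitting in HDFS is an external text table over the TSV directory. The sketch below is an assumption about that general pattern, not TsvsToImpala's actual code; the JDBC URL, column list and HDFS location are placeholders for your environment:

```java
// A sketch of one common Impala load pattern: an external table over HDFS TSVs.
// This is an assumption, not necessarily what TsvsToImpala actually does.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class ImpalaExternalTableSketch {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:impala://your-impala-host:21050/dbfred"; // placeholder
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS tbfred ("
                + " obs_date STRING, obs_value DOUBLE, category1 STRING)" // placeholder columns
                + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'"
                + " STORED AS TEXTFILE"
                + " LOCATION '/user/yourname/fred/tsv'"); // placeholder HDFS path
        }
    }
}
```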
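
For the Redshift import, the standard bulk-load pattern is to stage the files in S3 and issue a COPY from there. The sketch below assumes that pattern rather than reproducing TsvsToRedshift's actual code; the cluster endpoint, credentials, bucket and IAM role are all placeholders:

```java
// A sketch of the standard Redshift bulk-load pattern: COPY from staged S3 TSVs.
// Not the utility's actual code; all connection details are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class RedshiftCopySketch {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:redshift://your-cluster:5439/dbfred"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "COPY tbfred FROM 's3://your-bucket/fred/tsv/'"    // placeholder bucket
                + " IAM_ROLE 'arn:aws:iam::123456789012:role/YourRole'" // placeholder role
                + " DELIMITER '\\t'");
        }
    }
}
```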

Access the data

  • If necessary, install the Impala or Redshift JDBC drivers – these are available on the Cloudera and Amazon web sites
  • Configure your SQL tool of choice (e.g. SQuirreL or TOAD) to access Impala or Redshift
  • Verify the data exists: select count(*) from dbfred.tbfred;
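
Here is a minimal JDBC sketch of that verification query, assuming a Redshift target (substitute the Impala JDBC URL if that is your database); the host and credentials are placeholders:

```java
// A minimal JDBC row-count check against dbfred.tbfred.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class VerifyRowCount {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:redshift://your-cluster:5439/dbfred"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM dbfred.tbfred")) {
            if (rs.next()) System.out.println("rows: " + rs.getLong(1));
        }
    }
}
```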

View the data in QlikView

  • Note that if the import utility was run remotely (e.g. on a Hadoop cluster or an EC2 instance), a tar.gz file was created that contains the FredDistinctCategories.xls and the QlikView Load/Select scripts. Download and extract this file to your local computer and/or network share.
  • Open QlikView, create a new dashboard, create a connection to Impala or Redshift
  • Review the fred_rpt_sql/FredDistinctCategories.xls
    • One row per distinct set of Categories
    • Use Search and/or Filter to help determine which Distinct Categories you would like to view in QlikView.
    • The Row Number is used to locate the Load/Select script file.
    • For example, suppose you are interested in: Consumer Price Index, All Items
      • Search the spreadsheet, find a match on row 4234
      • Look at the subdirs in fred_rpt_sql – the names of the subdirs are “tnnn”, where nnn is the first row number in that subdir's range. For example, if you see two subdirs “t100” and “t200”, then subdir “t100” contains the QlikView Load/Select scripts for rows 100-199. (A sketch of this lookup appears at the end of this section.)
      • In this case, 4234 falls between the subdirs t3951 and t5585, so the script file will be in subdir t3951
      • The script file names are based on the row number of the first unique Category1, in this case “Consumer Price Index”. Notice the files “t4232_Consumer Price Index.qvs” and “t5485_Consumer Revolving Credit...qvs”. Our row number is 4234, so the script we are looking for will be in t4232_Consumer Price Index.qvs. Open that file and you will see all of the script Load/Select statements where Category1=“Consumer Price Index”
      • Find the block for t4234, copy/paste into QlikView, remove the comments and do a QlikView Load.
      • At this point you could also copy/paste other Load/Select blocks or start adding objects to the dashboard.
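
The subdir lookup described above amounts to “find the greatest tnnn that does not exceed the target row number”. A small sketch of that rule, using just the example values from this section:

```java
// Pick the subdir whose starting row number is the greatest one <= target row.
import java.util.Map;
import java.util.TreeMap;

public class ScriptDirLookup {
    public static void main(String[] args) {
        TreeMap<Integer, String> subdirs = new TreeMap<>();
        subdirs.put(3951, "t3951"); // example subdir names from this section
        subdirs.put(5585, "t5585");
        int row = 4234;
        Map.Entry<Integer, String> match = subdirs.floorEntry(row);
        System.out.println(match.getValue()); // prints t3951
    }
}
```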