theOGognf/finagg

Static SQL table columns, better column names, and better feature refinement logic

theOGognf opened this issue · 2 comments

Repurposing this issue to focus on general SQL improvements. Most of the SQL tables follow a key-value/ETA model that is fine for low row counts, but has huge slowdowns for larger row count tables. As an example, the SEC tables have about the same query performance, but the Yahoo! Finance tables are about 10x SLOWER using the key-value/ETA model vs a typical column-per-attribute model

In addition, I'd like to rename some feature columns so the naming convention is a bit more consistent across subpackages. If a column is just the result of another column being passed to the function, then I'd like to use a naming convention similar to other SQL engines: FUNC(COL) where FUNC is the function and COL is the argument to the function. This'll change names like AssetsCurrent_pct_change to PCT_CHANGE(AssetsCurrent). Although it's a bit more verbose, this makes it easier to describe columns, especially when since some columns will be processed by multiple functions (e.g.,, percent change columns normalized by industry for SEC features would be NORM(PCT_CHANGE(COL))

The pivoted table change and the column name change brings up another point - features have a column attribute that lists the names of all the columns returned by the feature, but, since we're swapping to a pivoted table representation and since the column names will have function names in them, it may be a bit more straightforward to remove the column attribute and just use columns on the SQL table definitions

To summarize the above and more:

  • Switch to "pivoted" table view for all refined tables
  • Use FUNC(COL) naming convention for refined columns
  • Don't use raw columns in refined tables (people can just join dataframes if they want raw columns with refined columns anyways)
  • Use SQL table definitions to create refined columns from raw columns rather than depending on a feature column attribute
  • Use log change instead of percent change for all features that're guaranteed to be > 0 across all of time
  • Keep raw column names as close to the API sources as possible (no pointless renames like EarningsPerShare instead of EarningsPerShareBasic)
  • Update docs to reflect these changes

This is covered by #41

Merged