/sql_analytics

Tf-idf, cosine similarity, weighted regressions in SQL

Primary LanguageTSQL

SQL Analytics

This is a repository to show that analytical things can be done in SQL :P All are written as user-defined functions (UDFs) compatible with snowflake.

RBF Kernel smoothing UDFs:

  • Calculates the gaussian kernel (1D) - used in anomaly detection
  • Can choose the level of smoothing to apply (includes 'rule-of-thumb' measure)
  • Can proportionally decrease smoothness at edges by decreasing sigma by euclidean similarity




Weighted Least Squares UDFs:

  • Calculates the alpha and beta required for a univariate weighted least squares regression
  • UDFs include partition windows so multiple models can be trained on different cuts of the same table (and tested on other cuts!)
  • Added example comments to show how to extend to multivariate regressions if required (note in the example below, the x1 value has been log transformed for the beta calc)
  • Also includes MAE and MAPE calculations to assess performance on training data




TF-IDF, Cosine Similarity UDFs between sets of strings:

  • Calculates the cosine similarity between every combination of two tables of strings.
  • It fits TF-IDF for tokenisation (unigram, bigram, trigram)
  • It then takes advantage of L2-normalisation as a cheeky math hack to use a linear kernel instead to calculate cosine similarity.
  • So it essentially just creates a sum product on the left-join to get a dot product :D