GuruFocus.com provides a paid API for stock fundamentals data, but it is limited to 2,000 calls per month. I monitor more than 5,000 stocks, so I need to know which fundamental data have recently been uploaded to GuruFocus and spend my limited API calls only on new data.

The site provides free access to the most recent 4 quarters of data. I web-scrape the free pages to get the most recent available date for each of my stocks, then compare that date to the data I have already downloaded. An API call is made only when the site shows data newer than what I already have.
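A minimal sketch of that date check, assuming requests and BeautifulSoup are used; the URL pattern and the HTML attribute the date is read from are hypothetical and would need to match the real page markup:

```python
import requests
from bs4 import BeautifulSoup

def latest_date_on_site(ticker: str) -> str:
    """Scrape the free GuruFocus page and return the most recent
    reporting date shown for the ticker (selector is hypothetical)."""
    url = f"https://www.gurufocus.com/stock/{ticker}/summary"  # assumed URL pattern
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Hypothetical: assume the page carries the latest quarter date in this attribute.
    return soup.find(attrs={"data-latest-quarter": True})["data-latest-quarter"]

def needs_refresh(ticker: str, downloaded_dates: dict[str, str]) -> bool:
    """Spend an API call only if the site shows a newer date than we already have.
    ISO-formatted dates compare correctly as strings."""
    return latest_date_on_site(ticker) > downloaded_dates.get(ticker, "")
```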


The data (the most recent available dates in a CSV file and the fundamental data in JSON files) are uploaded to an S3 bucket.
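For illustration, the upload step could use boto3 like this; the bucket name and key layout are assumptions, not the project's actual ones:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "gurufocus-fundamentals"  # assumed bucket name

# CSV with the most recent available date per ticker.
s3.upload_file("recent_dates.csv", BUCKET, "dates/recent_dates.csv")

# One JSON file per ticker returned by the GuruFocus API.
s3.upload_file("AAPL.json", BUCKET, "fundamentals/AAPL.json")
```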

Snowflake external tables are configured to expose the staging data exactly as it is stored in the S3 files.
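The external table definition is not shown here; a minimal sketch of what it might look like, executed through the Snowflake Python connector (stage name, table name, and credentials are placeholders):

```python
import snowflake.connector

# Assumed names; the real DDL and stage are defined in the project.
DDL = """
create or replace external table fundamental_stg
    location = @json_data_stage/fundamentals/
    file_format = (type = json)
    auto_refresh = false
"""

with snowflake.connector.connect(
    account="my_account", user="my_user", password="***",  # placeholder credentials
    database="ANALYTICS", schema="JSON_DATA",
) as conn:
    conn.cursor().execute(DDL)
```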

A dbt macro is used to parse the JSON records in the staging table, and each individual part is stored in its own table in Snowflake.

The macro gets the keys from the first JSON record in the staging table:

```sql
with stg as (
    select * from {{ source('Json_Data', 'fundamental_stg') }} limit 1
)
select distinct json.key as column_name
from stg, lateral flatten(stg.value:financials:{{ key1 }}:{{ key2 }}) json
```

starting from the specific key for each part (balance sheet, income statement, cash flow, etc.) and, based on the result, builds a SELECT statement to create the corresponding table.
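The statement itself is generated inside the dbt macro with Jinja; purely to illustrate the idea, a Python function turning the discovered keys into such a SELECT could look like this (key names, the cast type, and the table reference are placeholders):

```python
def build_select(keys: list[str], key1: str, key2: str) -> str:
    """Turn the key list returned by the flatten query into a SELECT
    that materializes one column per JSON key (illustration only --
    the real project does this inside a dbt macro)."""
    cols = ",\n    ".join(
        f'value:financials:{key1}:{key2}:"{k}"::string as "{k}"' for k in keys
    )
    return f"select\n    {cols}\nfrom fundamental_stg"  # placeholder table reference

print(build_select(["Revenue", "Net Income"], "annuals", "income_statement"))
```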

The Redshift version of the get-keys SQL:

```sql
SELECT key as column_name
FROM (select * from {{ source('Json_Data', 'fundamental_stg') }} limit 1) AS stg,
     UNPIVOT stg.value.financials.{{ key1 }}.{{ key2 }} AS value AT key;
```


The steps are orchestrated in an Airflow DAG.
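The DAG itself is not reproduced here; a skeleton of how the steps could be wired, with the DAG id, schedule, task ids, and dbt invocation all assumed for illustration:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="gurufocus_fundamentals",   # assumed DAG id
    start_date=datetime(2023, 1, 1),
    schedule="@daily",                 # assumed schedule
    catchup=False,
) as dag:
    scrape_dates = PythonOperator(
        task_id="scrape_recent_dates",
        python_callable=lambda: None,  # placeholder for the scraping step
    )
    fetch_api = PythonOperator(
        task_id="fetch_new_fundamentals",
        python_callable=lambda: None,  # placeholder for the API download + S3 upload
    )
    run_dbt = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt",  # assumed dbt project path
    )

    scrape_dates >> fetch_api >> run_dbt
```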


The project is a POC rather than a production system and requires improvements: incremental load into the tables and clean-up of the S3 bucket.