polygon-io/client-python

polygon_client.list_tickers() returns duplicate tickers

evanvolgas opened this issue · 5 comments

Describe the bug

A few days ago, my ETL audits started reporting duplicate records in my all_tickers BigQuery table, which is built from polygon_client.list_tickers(). I dug into the 102 duplicate records, and they appear to be caused by the same ticker appearing twice with a different last_updated_utc and/or a different CIK.

For 12 of the 51 affected tickers, the same ticker has two different CIKs listed by polygon_client.list_tickers().

For the other 39 of the 51 affected tickers, the same ticker has the same CIK listed by polygon_client.list_tickers(), and the only difference is the last_updated_utc.

To Reproduce

import os 

from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

import pandas as pd

from polygon import RESTClient

WORKERS = 50
polygon_secret = os.getenv("POLYGON_API_KEY")
polygon_client = RESTClient(polygon_secret, retries=10, trace=False)

# Fetch Ticker Info
tickers_active = polygon_client.list_tickers()
dfs = []
dfs_lock = Lock()

def fetch_ticker_info(ticker):
    ticker_info = {
        "active": ticker.active,
        "cik": ticker.cik,
        "composite_figi": ticker.composite_figi,
        "currency_symbol": ticker.currency_symbol,
        "base_currency_symbol": ticker.base_currency_symbol,
        "currency_name": ticker.currency_name,
        "base_currency_name": ticker.base_currency_name,
        "delisted_utc": ticker.delisted_utc,
        "locale": ticker.locale,
        "title": ticker.name,
        "primary_exchange": ticker.primary_exchange,
        "share_class_figi": ticker.share_class_figi,
        "ticker": ticker.ticker,
        "type": ticker.type,
        "source_feed": ticker.source_feed,
    }
    with dfs_lock:
        dfs.append(ticker_info)
    return ticker_info


# Use ThreadPoolExecutor to process the ticker stream concurrently
print("Scraping web data")
with ThreadPoolExecutor(max_workers=WORKERS) as executor:
    executor.map(fetch_ticker_info, tickers_active)

all_tickers_df = pd.DataFrame(dfs)
all_tickers_df

# Show the affected records; 
# there are 51 unique tickers and 102 records that appear to be duplicated
all_tickers_df[all_tickers_df.groupby('ticker')['ticker'].transform('count') > 1]

# Consider the duplicate records above;
# count the number of tickers with more than one CIK (True)
# vs the number of tickers with exactly one CIK (False)
Counter(all_tickers_df[all_tickers_df.groupby('ticker')['ticker'].transform('count') > 1].groupby("ticker")["cik"].transform('nunique') > 1)
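
To make the two classes described above explicit, here is a minimal follow-up sketch (the dup_df and cik_counts variable names are just for illustration; it assumes the all_tickers_df built above, including the last_updated_utc column):

dup_df = all_tickers_df[all_tickers_df.groupby('ticker')['ticker'].transform('count') > 1]
cik_counts = dup_df.groupby('ticker')['cik'].nunique()

# Tickers listed with two different CIKs (12 in my data)
print(cik_counts[cik_counts > 1].index.tolist())

# Tickers with one CIK where only last_updated_utc differs (39 in my data)
print(cik_counts[cik_counts == 1].index.tolist())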

Expected behavior
all_tickers_df[all_tickers_df.groupby('ticker')['ticker'].transform('count') > 1] should return an empty DataFrame. Instead, it returns 51 unique tickers with two rows each: 12 tickers where the same ticker has two different CIKs, and 39 tickers where the CIK is the same but the last_updated_utc differs.
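
In other words, the audit boils down to a uniqueness check along these lines (a sketch, not my actual BigQuery audit):

# Passes only if list_tickers() returns each ticker exactly once
assert not all_tickers_df['ticker'].duplicated().any(), "duplicate tickers found"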

Additional context
I join against this table on the assumption that tickers are unique, and I have six downstream tables affected by this. I can deduplicate the all-tickers results on my end (sketched below), but it feels as if the endpoint shouldn't be returning what appear to be duplicate records in the first place.
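
A client-side deduplication along these lines works as a stopgap, keeping the most recently updated row per ticker (a sketch, assuming last_updated_utc is collected as in the script above; for tickers with two CIKs it keeps whichever record was updated last):

# Keep the row with the latest last_updated_utc for each ticker
deduped_df = (
    all_tickers_df
    .sort_values('last_updated_utc')
    .drop_duplicates(subset='ticker', keep='last')
)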

FWIW, nothing changed on my end, and the audits I've put in place have been running without issue for over four months. According to my audit records, these duplicates first appeared on March 24th. I only download these data weekly, so the problem was most likely introduced some time between March 17th and March 24th.

It seems that duplicated data are still being added to the data source on your end. I'm now seeing 46 near-exact duplicates where the only difference is the last_updated_utc timestamp, and 11 partial duplicates where one ticker has two different CIKs but the records are otherwise identical.

Hi @evanvolgas, we're checking this out now. We created an internal ticket earlier and are bumping it up in priority. Thank you for reporting.

Thank you!

Hey @evanvolgas, this has been fixed. We made some major improvements to reference data overall. I tested this on my end, but I'll leave this open until you confirm on yours.

Hey, I'm going to close this since I've been able to verify the fix on my end and with your script. If you see this again, please comment here or reach out to support, but we should be good now. Thanks again for reporting this, and sorry for the inconvenience.