NielsenReader defines the classes RetailReader and PanelReader to facilitate easy processing of the Kilts Center's Nielsen IQ Data.
- RetailReader processes Retail Scanner Data
- PanelReader processes Consumer Panel Data
The main advantages of this package are:
- Speed. this is meant to be at least 10x faster (and can be more than 100x faster) than pd.read_csv() or pd.read_table()
- Supports partial reading (you can read in specific products, specific cities, stores, etc.) without having to load the entire file
- It understands the Kilts/Nielsen directory structure -- so you can just download from Kilts and go.
- It saves highly compressed (and fast) .parquet files
- Optional support for various fixes/updates issued by Kilts.
These classes are built on pyarrow 5.0.1
Apache Arrow v.5 which you will need to install.
To install with pip simply type:
pip install git+https://github.com/chrisconlon/kiltsnielsen
Make sure your python installation has the latest pyarrow (as well as pandas and NumPy)
Using pip:
pip install pyarrow
pip install --upgrade pyarrow
Using pip (requirements.txt):
python -m pip install -r requirements.txt
Using conda:
conda intall pyarrow
conda update pyarrow
from kiltsreader import RetailReader, PanelReader
- Locate your Nielsen Retail Scanner and Consumer Panel data separately
- Open
Example.py
- Replace
dir_retail
anddir_panel
with the locations of your Retail Scanner and Consumer Panel Data respectively - Replace the
KEEP_GROUPS
,KEEP_MODULES
,DROP_YEARS
,KEEP_YEARS
,KEEP_STATES
, andKEEP_CHANNEL
with your relevant selection - Run
Example.py
to verify the code works.
Information about the data can be found at the Kilts Center's Website for the Nielsen Dataset.
Check with your institution to gain access to the data. Once you have gained access, download files as follows:
- Construct file extracts using the Kilts File Selection System. Note you must separately gain access to the File Selection System after applying for the data.
- The data are available in .tgz files. Data can be downloaded by group, module, and/or year
- Unzip the .tgz files
- Panelist data can be downloaded directly from Globus. The data are small enough for a typical work machine. The data are available in .tgz files
- Unzip the .tgz files.
Importantly, make sure all files are unzipped and preserved in the original Nielsen structure before using the methods provided here. (Do not rearrange the directory structure.)
class NielsenReader.RetailReader(dir_read=path.Path.cwd(), verbose=True)
RetailReader defines the class object used to read in the Nielsen Retail Scanner Data (see above)
- dir_read(pathlib Path object, optional): points to the location of the Retail Scanner Files. Should be named
nielsen_extract
or something similar, containing the subfolderRMS
. Default is the current working directory. - verbose(bool): if
True
, prints updates after processing files. Displays size of files processed. Default is True.
df_products
(pandas DataFrame): default empty, stores products data after processingdf_sales
(pandas DataFrame): default empty, stores sales data after processingdf_stores
(pandas DataFrame): default empty, stores store data after processingdf_rms
(pandas DataFrame): default empty, stores RMS versions data after processingdf_extra
(pandas DataFrame): default empty, stores extra product data after processingall_years
(list): list of years for which data will be processedfiles_product
(pathlib Path object): stores name of product characteristic filefiles_rms
(list of pathlib Path objects): stores names of annual rms_versions filesfiles_stores
(list of pathlib Path objects): stores names of annual stores filesfiles_extra
(list of pathlib Path objects): stores names of annual extra filesfiles_sales
(list of pathlib Path objects)stores names of annual sales filesall_years
(list): list of years included in data. Updates with filtering
Functions also described with docstrings in the NielsenReader.py
file
RetailReader.filter_years(keep=None, drop=None):
Selects years for which to process annual sales files. Used in pre-processing to limit required memory if desired; otherwise later functions will process all available data. Populates RetailReader.all_years
- keep(list of integers, optional): list of years to keep, e.g. range(2004, 2013). Can only include years that are already present in the data, e.g. specifying `keep=[1999]' will result in an empty set of years
- drop(list of integers, optional): list of years to remove, e.g. [2006, 2009, 2013]
RetailReader.filter_sales(keep_groups=None, drop_groups=None, keep_modules=None, drop_modules=None): Selects product groups (outer category) and product modules (inner category) for which to process annual sales files. Used in pre-processing to limit required memory if desired; otherwise later functions will process all available data.
- keep_groups(list of integers, optional): list of product groups to keep, e.g.
keep_groups=[1508, 1048]
- drop_groups(list of integers, optional): list of product groups to exclude, e.g.
drop_groups = [1046]
. Takes precedence if there is any overlap with keep_groups - keep_groups(list of integers, optional): list of product modules to keep, e.g.
keep_modules=[1481, 1482]
- drop_groups(list of integers, optional): list of product modules to exclude, e.g.
drop_groups = [1483]
. Takes precedence if there is any overlap with keep_module
RetailReader.read_rms():
Populates RetailReader.df_rms
Processes the annual RMS versions files, which map reused UPCs to the appropriate version based on year
RetailReader.read_products(upc_list=None, keep_groups=None, drop_groups=None, keep_modules=None, drop_modules=None):
Populates RetailReader.df_products
.
Reads in the set of product characteristics, typically located in Master_Files/Latest/products.tsv
. Optionally elects product groups (outer category) and product modules (inner category) for which to process annual sales files.
Note that the function does NOT carry over the filtered set of groups and modules from RetailReader.filter_sales()
. The RetailReader.read_products()
function is redundant with the PanelReader.read_products()
function, and therefore allows the user to read in the full set of products and their characteristics even while reading only a subset of sales.
- upc_list(list of integers, optional): list of Universal Product Codes to keep, e.g.
upc_list=[002111039080, 009017445929]
(leading zeros not required) - keep_groups(list of integers, optional): list of product groups to keep, e.g.
keep_groups=[1508, 1048]
- drop_groups(list of integers, optional): list of product groups to exclude, e.g.
drop_groups = [1046]
. Takes precedence if there is any overlap with keep_groups - keep_groups(list of integers, optional): list of product modules to keep, e.g.
keep_modules=[1481, 1482]
- drop_groups(list of integers, optional): list of product modules to exclude, e.g.
drop_groups = [1483]
. Takes precedence if there is any overlap with keep_module
RetailReader.read_extra(years=None, upc_list=None)
Populates RetailReader.df_extra
Selects annual Products Extra files for all post-filtering years. Redundant with PanelReader.read_extra()
Product group and module filtering are not possible.
Note that UPCs may be repeated for different years. Use RMS versions to select appropriate years. Differences between multiple years for a single UPC may be due not to actual product changes but rather due to Nielsen filling in previously missing data. See Nielsen documentation for an in-depth description
- upc_list(list of integers, optional): list of Universal Product Codes to keep, e.g.
upc_list=[002111039080, 009017445929]
(leading zeros not required) - years(list of integers, optional): selects years for which to. Note that previous year-filtering of the RetailReader object will carry over unless a new set of years is specified.
RetailReader.read_stores():
Populates RetailReader.df_stores
Stores files are common to all product groups and modules, so processing will be unaffected by RetailReader.filter_years
or RetailReader.filter_sales
RetailReader.filter_stores(keep_dma=None, drop_dma=None, keep_states=None, drop_states=None, keep_channel=None, drop_dma=None):
Updates RetailReader.df_stores
- keep_dmas(list of integers, optional): list of DMAs (Designated Market Areas) to keep, e.g.
keep_dma=[801, 503]
- drop_dmas(list of integers, optional): list of DMAs (Designated Market Areas) to exclude, e.g.
drop_dma=[602]
. Takes precedence if there is any overlap with keep_dma. - keep_states(list of strings, optional): list of states to keep, with list in two-character format, e.g.
keep_states=['TX', 'CA']
- drop_states(list of integers, optional): list of states to exclude, with list in two-character format, e.g.
drop_states=['NJ', 'NY']
Takes precedence if there is any overlap with keep_states. - keep_channels(list of characters, optional): list of channels (store types) to keep, e.g.
keep_channels=['F', 'G']
. See Nielsen documentation for explanation of channels and list of options. - drop_channels(list of characters, optional): list of channels (store types) to exclude, e.g.
drop_channels = ['C']
. Takes precedence if there is any overlap with keep_channels. See Nielsen documentation for explanation of channels and list of options.
Note: Must be run AFTER RetailReader.read_stores()
. Pre-filtering is not possible.
RetailReader.read_sales(incl_promo = True)
Populates RetailReader.df_sales
Note this function call takes the longest time.
Reads in the weekly, store x upc level sales data, post-filter if any have been applied. Uses pyarrow methods to filter and read the data to minimize memory and time use. May still require large amounts of memory/CPU.
RetailReader.write_data(dir_write = path.Path.cwd(), _stub = 'out'), compr = 'brotli', as_table = False, separator = 'panel_year') Writes the pandas DataFrames of the RetailReader class to parquet format (see class description abvove).
- dir_write(pathlib Path object, optional): folder within which to write the parquets. Defalt is current working directory.
- stub(str): initial string to name all files. Files will be named 'stub'_'[file type].parquet', e.g. 'out_stores.parquet'
- compr(str): type of compression used for generating parquets. Default is brotli
- as_table(bool): whether to write as pyarrow separated row-tables. See
Example.py
for instance of how to write and read row-groups. Requires a separator to generate rows for the row-tables. Useful if you seek to preserve space when reading in files by using only one row-group at a time. - separator(column name): variable on which to separate row-groups when saving as a pyarrow table. Note that [for now] the separator must be common to all files. Default is
panel_year
. If separator is not present in the file, it cannot be saved as a pyarrow table. If you are looking to save just a single file with a specific separator, modify the following snippet:
RetailReader.read_{FILE_TYPE}()
RR.write_data(dir_write, separator = {VARIABLE_NAME})
RetailReader.get_module(file_sales=): returns product module (inner category) corresponding to particular sales file. Auxiliary, not commonly used.
- file_sales(pathlib Path object): Retail Scanner sales file, e.g. 1046_2006.tsv
RetailReader.get_group(file_sales=): returns product group (outer category) corresponding to particular sales file. Auxiliary, not commonly used.
- file_sales(pathlib Path object): Retail Scanner sales file, e.g. 1046_2006.tsv
class NielsenReader.PanelReader(dir_read=path.Path.cwd(), verbose=True) PanelReader defines the class object used to read in the Nielsen Consumer Panel data (see above)
- dir_read(pathlib Path object, optional): points to the location of the Consumer Panel data files. Likely named
Panel
or something similar. Subfolders should be years. Default is the current working directory. - verbose(bool): if
True
, prints updates after processing files. Displayes size of files processed. Default is true.
df_products
(pandas DataFrame): default empty, stores products data after from Master Files processingdf_variations
(pandas DataFrame): default empty, stores brand_variations data from Master Files after processingdf_retailers
(pandas DataFrame): default empty, stores retailers data from Master Files after processingdf_trips
(pandas DataFrame): default empty, stores annual trips data after processingdf_panelists
(pandas DataFrame): default empty, stores annual panelists data after processingdf_purchases
(pandas DataFrame): default empty, stores annual purchases after processingdf_extra
(pandas DataFrame): default empty, stores annual products extra characteristics data after processingfiles_annual
(list of pathlib Path objects)files_product
(list of pathlib Path objects): stores name of unrevised products filefiles_variation
(list of pathlib Path objects): stores name of unrevised brand variations filefiles_retailers
(list of pathlib Path objects): stores name of unrevised retailers filefiles_trips
(list of pathlib Path objects): stores names of annual tripes filesfiles_panelists
(list of pathlib Path objects): stores names of annual panelists filesfiles_purchases
(list of pathlib Path objects): stores names of annual purchases filesfiles_extra
(list of pathlib Path objects): stores names of annual products extra filesall_years
(list): list of years included in data. Updates with filtering
PanelReader.filter_years(keep=None, drop=None)
Selects years for which to process annual panelist, purchase, trips, and extra files. Used in pre-processing to limit required memory if desired; otherwise later functions will process all available data. Updates PanelReader.all_years
- keep(list of integers, optional): list of years to keep, e.g. range(2004, 2013). Can only include years that are already present in the data, e.g. specifying
keep=[1999]
will result in an empty set of years - drop(list of integers, optional): list of years to remove, e.g. [2006, 2009, 2013]
PanelReader.read_retailers():
Populates PanelReader.df_retailers
Processes the Master retailers file, which list retailer codes and channels.
Note that the file may be later revised following a call to PanelReader.read_revised_panelists()
or PanelReader.process_open_issues()
PanelReader.read_products(upc_list=None, keep_groups=None, drop_groups=None, keep_modules=None, drop_modules=None) Populates PanelReader.df_products, which should be identical to RetailReader.df_products following a call to RetailReader.read_products().
Reads in the set of product characteristics, typically located in Master_Files/Latest/products.tsv
. Optionally elects product groups (outer category) and product modules (inner category) for which to process annual sales files.
Note that the file may be later revised following a call to PanelReader.read_revised_panelists()
or PanelReader.process_open_issues()
. Such updating is only possible through PanelReader
; the Retail Scanner files do not contain any revisions.
- upc_list(list of integers, optional): list of Universal Product Codes to keep, e.g.
upc_list=[002111039080, 009017445929]
(leading zeros not required) - keep_groups(list of integers, optional): list of product groups to keep, e.g.
keep_groups=[1508, 1048]
- drop_groups(list of integers, optional): list of product groups to exclude, e.g.
drop_groups = [1046]
. Takes precedence if there is any overlap with keep_groups - keep_groups(list of integers, optional): list of product modules to keep, e.g.
keep_modules=[1481, 1482]
- drop_groups(list of integers, optional): list of product modules to exclude, e.g.
drop_groups = [1483]
. Takes precedence if there is any overlap with keep_module
PanelReader.read_extra(years=None, upc_list=None)
Populates PanelReader.df_extra
Selects annual Products Extra files for all post-filtering years. Redundant with RetailReader.read_extra(). Does not have any revisions as of October 2021.
Product group and module filtering are not possible; only UPC and year filtering are available.
Note that UPCs may be repeated for different years. Use RMS versions to select appropriate years. Differences between multiple years for a single UPC may be due not to actual product changes but rather due to Nielsen filling in previously missing data. See Nielsen documentation for an in-depth description.
- years(list of integers, optional): selects years for which to. Note that previous year-filtering of the PanelReader object will carry over unless a new set of years is specified
- upc_list(list of integers, optional): list of Universal Product Codes to keep, e.g.
upc_list=[002111039080, 009017445929]
(leading zeros not required)
PanelReader.read_variations()
Populates PanelReader.df_variations
with brand variations data, typically located in MasterFiles/Latest/brand_variations.tsv
.
Lists brand codes, descriptions, and any alternative descriptions.
PanelReader.read_year(year, keep_dmas=None, drop_dmas=None, keep_states=None, drop_states=None,)
Populates PanelReader.df_panelists
, PanelReader.df_purchases
, and PanelReader.df_trips
Processes a single year of annual data (panelists, purchases, and trips data). Useful if you seek to only process one year at a time; otherwise use read_annual
as described below
- year (int): single year to process
- keep_dmas (list of integers): list of DMAs (Designated Market Areas) to keep, e.g.
keep_dma=[801, 503]
- drop_dmas (list of integers): list of DMAs (Designated Market Areas) to exclude, e.g.
drop_dma=[602]
. Takes precedence if there is any overlap with keep_dma. - keep_states (list of strings): list of states to keep, with list in two-character format, e.g.
keep_states=['TX', 'CA']
- drop_states (list of strings): list of states to exclude, with list in two-character format, e.g.
drop_states=['NJ', 'NY']
Takes precedence if there is any overlap with keep_states.
PanelReader.read_annual(keep_states=None, drop_states=None, keep_dmas=None, drop_dmas=None)
Processes all years (post-PanelReader.filter_years()
) with repeated calls to PanelReader.read_year()
- keep_dmas (list of integers): list of DMAs (Designated Market Areas) to keep, e.g.
keep_dma=[801, 503]
- drop_dmas (list of integers): list of DMAs (Designated Market Areas) to exclude, e.g.
drop_dma=[602]
. Takes precedence if there is any overlap with keep_dma. - keep_states (list of strings): list of states to keep, with list in two-character format, e.g.
keep_states=['TX', 'CA']
- drop_states (list of strings): list of states to exclude, with list in two-character format, e.g.
drop_states=['NJ', 'NY']
Takes precedence if there is any overlap with keep_states.
PanelReader.write_data(dir_write = path.Path.cwd(), _stub = 'out'), compr = 'brotli', as_table = False, separator = 'panel_year') Writes the pandas DataFrames of the RetailReader class to parquet format (see class description abvove).
- dir_write(pathlib Path object, optional): folder within which to write the parquets. Defalt is current working directory.
- stub(str): initial string to name all files. Files will be named 'stub'_'[file type].parquet', e.g. 'out_stores.parquet'
- compr(str): type of compression used for generating parquets. Default is brotli
- as_table(bool): whether to write as pyarrow separated row-tables. See
Example.py
for instance of how to write and read row-groups. Requires a separator to generate rows for the row-tables. Useful if you seek to preserve space when reading in files by using only one row-group at a time. - separator(column name): variable on which to separate row-groups when saving as a pyarrow table. Note that [for now] the separator must be common to all files. Default is
panel_year
. If separator is not present in the file, it cannot be saved as a pyarrow table. If you are looking to save just a single file with a specific separator, modify the following snippet:
RetailReader.read_{FILE_TYPE}()
RR.write_data(dir_write, separator = {VARIABLE_NAME})
PanelReader.read_revised_panelists()
Updates PanelReader.df_products
, PanelReader.df_variations
, PanelReader.df_retailers
, and PanelReader.df_panelists
Corrects the Panelist data using errata provided by Nielsen for issues not yet incorporated into the data as of October 2021. Must have already called PanelReader.read_annual()
, PanelReader.read_products()
, PanelReader.read_variations()
, PanelReader.read_retailers()
PanelReader.process_open_issues()
Updates PanelReader.df_panelists
and PanelReader.df_extra
Corrects the product extra and panelist data using errata provided by Nielsen for two specific issues:
ExtraAttributes_FlavorCode
: adds missing flavor code and flavor description to 2010 products extra characteristics filePanelist_maleHeadBirth_femaleHeadBirth
: corrects issue with male head of household birth month