The Dominick's Finer Foods data set: Promoting the use of a publicly available scanner data set in price index research and for capacity building.
The present material demonstrates how a publicly available scanner data set can be used for price index research and capacity building.
The data set stems from Dominick's Finer Foods, a now-defunct Chicago-area grocery store chain, and is provided for academic research purposes only. It contains sales information at the store level on a weekly basis for each UPC (Universal Product Code) in a category. The data set covers more than 90 stores for almost 400 weeks from September 1989 to May 1997 and totals around 100 million observations (after cleansing) of about 18 000 UPCs (including re-launches) in 29 categories (from analgesics to toothpastes).
The documentation located in the docs/ folder introduces the data set, describes how the data can be acquired and pre-processed, and then presents the estimation of price index numbers to show the usefulness of the data for both research and training purposes. The SAS codes used are located in the SAS/ folder. The newly made CSV files (see link below) should be used to run the code located in the R/ folder. Both sets of code generate analysis-ready data and base their calculations on the very same data, which rules out differences between data sets as a source of incomparability.
In order to run the codes, it is necessary to download (and extract) all category-specific files, i.e. the UPC files and movement files (in SAS format for the SAS codes, in CSV format for the R code), from the website of the James M. Kilts Center at the University of Chicago Booth School of Business: https://www.chicagobooth.edu/research/kilts/datasets/dominicks.
Furthermore, the CSV/ folder provides two files that prepare the information on the week variable and on the stores included, which is otherwise covered only in Dominick's Data Manual.
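As a minimal sketch of the download step, the following Python snippet checks whether the UPC and movement files for a set of categories are present in a local folder before any code is run. The file-name pattern (`upc<category>` and `w<category>`), the function names, and the folder layout are assumptions made for illustration and should be adjusted to the names of the files actually downloaded from the Kilts Center website. The snippet is not part of the SAS or R codes of this repository.

```python
from pathlib import Path


def expected_files(category, ext="csv"):
    """Return the two file names assumed for one category.

    Assumption: UPC files are named upc<cat>.<ext> and movement files
    w<cat>.<ext>; verify this against the downloaded files.
    """
    cat = category.lower()
    return [f"upc{cat}.{ext}", f"w{cat}.{ext}"]


def missing_files(data_dir, categories, ext="csv"):
    """List the expected files that are not found in data_dir."""
    present = {p.name.lower() for p in Path(data_dir).iterdir() if p.is_file()}
    return [
        name
        for cat in categories
        for name in expected_files(cat, ext)
        if name not in present
    ]
```

For example, `missing_files("data", ["ana", "tpa"])` would return the names of any files still to be downloaded for two hypothetical three-letter category acronyms; passing `ext="sas7bdat"` would check for SAS data sets instead, assuming that extension.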
- CSV/: These files are needed to run the SAS code and the R code, respectively. The weeks file codes the week for which a data point is recorded. The stores file lists the stores included in the Dominick's research project. The `upcrfj` file provides the UPC file information for refrigerated juices ('RFJ') in a SAS-readable format (see the documentation about acquiring the data in the docs/ folder). Note that, if using R, there is no movement file available in CSV format for refrigerated juices from the Dominick's website.
- SAS/: The SAS codes replicate the data and results of the paper located in the docs/ folder. The `upc` part reads in all UPC files and adds a category identifier. The `move` part reads in all movement files, adds a category identifier, and calculates total dollar sales; suspect data are dropped. The `weeks_stores` part reads in the week and store files and merges them with the movement and UPC files. The `wtpd` example aggregates the data, calculates unit prices as well as expenditure shares per category, and derives price indices by means of the weighted time-product dummy (WTPD) method. The `sas2csv` code was used to convert the SAS files to the CSV format newly available at the Dominick's website. The CSV files are provided to make the data more useful to researchers.
- R/: The R code generates analysis-ready data and derives price indices equivalent to those of the SAS codes (located in the SAS/ folder). Common to the two sets of codes is that, for the sake of exposition, the weekly store-level UPC data are aggregated to chain-wide item codes (an attempt at tracking products across multiple UPCs) at monthly frequency, although this can be changed. The difference is that, while the SAS codes calculate results for each category, the R code is restricted to one particular category, where the three-letter acronym for the category can be adapted. The folder includes two R codes that both create the same results. The first code can be run with the R base package, whereas the second code requires the installation of the `tidyverse` package.
- docs/: The documentation includes the paper demonstrating how the data set can be used for price index research and capacity building as well as the SAS output from the weighted time-product dummy method at monthly frequency across all 29 categories in CSV format. Note that, if using R, there is a small loss of information from the conversion, which shows in the 'truncated' `PRICE` variable in the CSV files. The annex to the paper gives instructions on how to use the R code located in the R/ folder.
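The steps named for the WTPD example (aggregation, unit prices, expenditure shares, weighted time-product dummy index) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SAS or R code of this repository: the function names and the input layout are hypothetical, and the weighted least-squares problem (log price explained by item and period effects, weighted by expenditure shares within each period) is solved here by simple alternating updates instead of a regression routine, purely to keep the example free of dependencies.

```python
import math
from collections import defaultdict


def unit_values(rows):
    """Aggregate (item, period, dollar sales, units) rows.

    Returns {(item, period): (unit value, units)}, where the unit value
    is total dollar sales divided by total units sold.
    """
    sales, units = defaultdict(float), defaultdict(float)
    for item, period, s, u in rows:
        sales[(item, period)] += s
        units[(item, period)] += u
    return {k: (sales[k] / units[k], units[k]) for k in sales if units[k] > 0}


def wtpd_index(cells, tol=1e-12, max_iter=10_000):
    """Weighted time-product dummy index, first period = 1.

    cells: {(item, period): (price, quantity)}. Weights are the
    expenditure shares of each item within its period.
    """
    cells = {k: v for k, v in cells.items() if v[0] > 0 and v[1] > 0}
    periods = sorted({t for _, t in cells})
    total = defaultdict(float)
    for (_, t), (p, q) in cells.items():
        total[t] += p * q
    # One tuple per observation: item, period, log price, expenditure share.
    obs = [(i, t, math.log(p), p * q / total[t]) for (i, t), (p, q) in cells.items()]

    gamma = defaultdict(float)  # item effects
    delta = defaultdict(float)  # period effects (log index)
    for _ in range(max_iter):
        # Item effects: weighted mean of log prices net of period effects.
        num, den = defaultdict(float), defaultdict(float)
        for i, t, y, w in obs:
            num[i] += w * (y - delta[t])
            den[i] += w
        for i in num:
            gamma[i] = num[i] / den[i]
        # Period effects: weighted mean of log prices net of item effects.
        num, den = defaultdict(float), defaultdict(float)
        for i, t, y, w in obs:
            num[t] += w * (y - gamma[i])
            den[t] += w
        base = num[periods[0]] / den[periods[0]]
        new = {t: num[t] / den[t] - base for t in num}  # base period fixed at 0
        change = max(abs(new[t] - delta[t]) for t in new)
        delta.update(new)
        if change < tol:
            break
    return {t: math.exp(delta[t]) for t in periods}


# Toy data: (item, month, dollar sales, units sold).
rows = [
    ("A", 1, 20.0, 10), ("A", 2, 22.0, 10),
    ("B", 1, 12.0, 3), ("B", 2, 13.2, 3),
]
print(wtpd_index(unit_values(rows)))
```

In the toy data both unit values rise by 10 % between month 1 and month 2, so the index for month 2 should come out close to 1.1. For data of realistic size a weighted regression routine is the more practical choice, since the alternating updates can converge slowly when few items overlap between periods.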
About

| | |
|---|---|
| author | Mehrhoff J. |
| status | since 2018 – closed |
| version | 2.0 |
| license | EUPL (cite the source code or the reference below!) |
- Kilts Center for Marketing (2013, updated 2018): Dominick's Data Manual, Chicago.
- Mehrhoff, J. (2018): Promoting the use of a publicly available scanner data set in price index research and for capacity building, Luxembourg.