/HBSIR_OLD

An intuitive platform designed to simplify the analysis of Iran household budget survey data.

Primary LanguageJupyter NotebookMIT LicenseMIT

Household Budget Survey of Iran (HBSIR)

Simplify Your Analysis of Household Budget Survey Data

Analyzing the Household Budget Survey of Iran (HBSIR) data has historically been challenging due to the constant changes in codings and table schemas, the lack of standardization and alignment with universal coding systems, absence of clear and consistent specifications for data types assigned to variables, and the improper distribution format. With this package, you can seamlessly load, process, and analyze HBSIR data, allowing you to focus on your research and derive insights instead of grappling with these challenges.

Package Structure

The HBSIR package consists of three core components, each designed to not only streamline your experience working with the HBSIR data but also maximize the reproducibility of results and offer a high degree of flexibility for customizations.

  1. Data Extraction:
    This component focuses on extracting data from Microsoft Access files, applying labels, and performing essential data cleaning tasks. Its purpose is to ensure that the data is transformed into a usable format suitable for further analysis.

  2. Data Manipulation:
    The package offers a uniform API designed to streamline the process of working with HBSIR data. It simplifies tasks like data loading, adding goods and other classifications, household attributes, and sampling weights.

  3. Schema System
    We introduce a schema system that empowers users to define customized table structures based on household budget data.

Motivation

The motivation behind the HBSIR Data Python Package stems from the recognition of the extreme difficulty faced by users when dealing with HBSIR data. The data's original format in MS Access, coupled with its lack of standardization across different years, posed significant challenges for analysis. This package was conceived to provide a solution that simplifies the entire process – from data extraction to analysis – making HBSIR data more accessible and usable for everyone.

Installation

Getting started with the package is straightforward. Simply install it using pip:

pip install hbsir

For those interested in the under-development version, you can also download it as follows:

pip install git+https://github.com/Iran-Open-Data/HBSIR.git

Quick Usage Example

Breakdown of food expenditure by urban families in Tehran
The purpose of this example is to provide a brief overview of package usage. In the following steps, we will guide you through various stages, such as loading specific tables, and incorporating attributes, classifications, and sampling weights. As we conclude this example, we will have generated a breakdown of food expenses among the urban residents of Tehran.

# Import the hbsir package
import hbsir

# Load food table for year 1400 into dataframe
food_table = hbsir.load_table("food", 1400) 

By employing the provided code, you will read the table into a Pandas DataFrame format.
Below is an illustration depicting the potential presentation of the loaded table:

Year ID Code Provision_Method Amount Duration Price Expenditure
0 1400 10004004227 11112 Purchase 10.0 30 250000.0 2500000.0
1 1400 10004004227 11142 Purchase 10.0 30 30000.0 300000.0
2 1400 10004004227 11143 Purchase 10.0 30 30000.0 300000.0
3 1400 10004004227 11174 Purchase 2.0 30 350000.0 700000.0
4 1400 10004004227 11211 Purchase 2.0 30 1300000.0 2600000.0

Next, we incorporate the urban-rural and province attributes into the table and apply filtering:

# Drop unnecessary columns 
df = df.drop(columns=["Table_Name", "Provision_Method", "Amount", "Duration"])

# Add urban/rural attribute column
df = hbsir.add_attribute(df, "Urban_Rural")

# Add province attribute column
df = hbsir.add_attribute(df, "Province")

# Filter to only urban Tehran rows
filt = (df["Urban_Rural"] == "Urban") & (df["Province"] == "Tehran")
df = df.loc[filt]
Year ID Code Price Expenditure Urban_Rural Province
107928 1400 12313290719 11144 25000.0 125000.0 Urban Tehran
107929 1400 12313290719 11151 25000.0 400000.0 Urban Tehran
107930 1400 12313290719 11164 128571.0 180000.0 Urban Tehran
107931 1400 12313290719 11165 140000.0 70000.0 Urban Tehran
107932 1400 12313290719 11172 350000.0 350000.0 Urban Tehran

In the next phase, we'll incorporate classification and sampling weights to compute weighted expenditures.

# Remove unnecessary columns
df = df.drop(columns=["Urban_Rural", "Province"])

# Classify based on food types
df = hbsir.add_classification(df, "original", levels=[2], output_column_names=["Food_Type"])

# Integrate sampling weights
df = hbsir.add_weight(df)

# Compute weighted expenditure
df["Weighted_Expenditure"] = df["Expenditure"] * df["Weight"]
Year ID Code Price Expenditure Food_Type Weight Weighted_Expenditure
107928 1400 12313290719 11144 25000.0 125000.0 cereals_and_cereal_products 2155 269375008.0
107929 1400 12313290719 11151 25000.0 400000.0 cereals_and_cereal_products 2155 862000000.0
107930 1400 12313290719 11164 128571.0 180000.0 cereals_and_cereal_products 2155 387900000.0
107931 1400 12313290719 11165 140000.0 70000.0 cereals_and_cereal_products 2155 150850000.0
107932 1400 12313290719 11172 350000.0 350000.0 cereals_and_cereal_products 2155 754249984.0

Additionally, we will calculate the sum of weights for urban residents of Tehran.

# Load weights table for year 1400
weights = hbsir.load_table("Weights", 1400)
Year ID Weight
0 1400 10001000226 1245.0
1 1400 10001000235 1245.0
2 1400 10011009720 201.0
3 1400 10011009735 201.0
4 1400 10003003235 237.0
# Calculate the sum of weights
weights_sum = (
    weights
    .pipe(hbsir.add_attribute, "Urban_Rural")
    .pipe(hbsir.add_attribute, "Province")
    .query("`Urban_Rural`=='Urban' & Province=='Tehran' & Weight.notnull()")
    .sum()
    .loc["Weight"]
)

print(weights_sum)
>>> 4466717.0

At this final stage, we proceed to calculate the weighted expenditure associated with each type of food expense.

# Compute the weighted mean
df = df.groupby("Food_Type")[["Weighted_Expenditure"]].sum()
df = df / weights_sum / 1e4

# Calculate the sum of expenditures
df["sum"] = df.sum()

# Rename column and sort results
df = df.rename(columns={"Weighted_Expenditure": "Food_Expenditure (k Tomans)"})
df = df.sort_values("Food_Expenditure (k Tomans)", ascending=False)

Outcome of the analysis:

Food_Type Food_Expenditure (k Tomans)
sum 2,255
cereals_and_cereal_products 481
meat 425
fruits_and_nuts 325
vegetables_and_pulses 307
milk_other_dairy_product_excluding_butter_and_eggs 268
sugar_confectionery_and_desserts 106
spices_condiments_and_other_food_products 85
oils_fats_and_butter 84
tea_coffee_and_cocoa_drinks 69
soft_drinks 51
fish_and_other_sea_foods 48