Simplify Your Analysis of Household Budget Survey Data
Analyzing the Household Budget Survey of Iran (HBSIR) data has historically been challenging due to the constant changes in codings and table schemas, the lack of standardization and alignment with universal coding systems, absence of clear and consistent specifications for data types assigned to variables, and the improper distribution format. With this package, you can seamlessly load, process, and analyze HBSIR data, allowing you to focus on your research and derive insights instead of grappling with these challenges.
The HBSIR package consists of three core components, each designed to not only streamline your experience working with the HBSIR data but also maximize the reproducibility of results and offer a high degree of flexibility for customizations.
-
Data Extraction:
This component focuses on extracting data from Microsoft Access files, applying labels, and performing essential data cleaning tasks. Its purpose is to ensure that the data is transformed into a usable format suitable for further analysis. -
Data Manipulation:
The package offers a uniform API designed to streamline the process of working with HBSIR data. It simplifies tasks like data loading, adding goods and other classifications, household attributes, and sampling weights. -
Schema System
We introduce a schema system that empowers users to define customized table structures based on household budget data.
The motivation behind the HBSIR Data Python Package stems from the recognition of the extreme difficulty faced by users when dealing with HBSIR data. The data's original format in MS Access, coupled with its lack of standardization across different years, posed significant challenges for analysis. This package was conceived to provide a solution that simplifies the entire process – from data extraction to analysis – making HBSIR data more accessible and usable for everyone.
Getting started with the package is straightforward. Simply install it using pip:
pip install hbsir
For those interested in the under-development version, you can also download it as follows:
pip install git+https://github.com/Iran-Open-Data/HBSIR.git
Breakdown of food expenditure by urban families in Tehran
The purpose of this example is to provide a brief overview of package usage. In the following steps, we will guide you through various stages, such as loading specific tables, and incorporating attributes, classifications, and sampling weights. As we conclude this example, we will have generated a breakdown of food expenses among the urban residents of Tehran.
# Import the hbsir package
import hbsir
# Load food table for year 1400 into dataframe
food_table = hbsir.load_table("food", 1400)
By employing the provided code, you will read the table into a Pandas DataFrame format.
Below is an illustration depicting the potential presentation of the loaded table:
Year | ID | Code | Provision_Method | Amount | Duration | Price | Expenditure | |
---|---|---|---|---|---|---|---|---|
0 | 1400 | 10004004227 | 11112 | Purchase | 10.0 | 30 | 250000.0 | 2500000.0 |
1 | 1400 | 10004004227 | 11142 | Purchase | 10.0 | 30 | 30000.0 | 300000.0 |
2 | 1400 | 10004004227 | 11143 | Purchase | 10.0 | 30 | 30000.0 | 300000.0 |
3 | 1400 | 10004004227 | 11174 | Purchase | 2.0 | 30 | 350000.0 | 700000.0 |
4 | 1400 | 10004004227 | 11211 | Purchase | 2.0 | 30 | 1300000.0 | 2600000.0 |
Next, we incorporate the urban-rural and province attributes into the table and apply filtering:
# Drop unnecessary columns
df = df.drop(columns=["Table_Name", "Provision_Method", "Amount", "Duration"])
# Add urban/rural attribute column
df = hbsir.add_attribute(df, "Urban_Rural")
# Add province attribute column
df = hbsir.add_attribute(df, "Province")
# Filter to only urban Tehran rows
filt = (df["Urban_Rural"] == "Urban") & (df["Province"] == "Tehran")
df = df.loc[filt]
Year | ID | Code | Price | Expenditure | Urban_Rural | Province | |
---|---|---|---|---|---|---|---|
107928 | 1400 | 12313290719 | 11144 | 25000.0 | 125000.0 | Urban | Tehran |
107929 | 1400 | 12313290719 | 11151 | 25000.0 | 400000.0 | Urban | Tehran |
107930 | 1400 | 12313290719 | 11164 | 128571.0 | 180000.0 | Urban | Tehran |
107931 | 1400 | 12313290719 | 11165 | 140000.0 | 70000.0 | Urban | Tehran |
107932 | 1400 | 12313290719 | 11172 | 350000.0 | 350000.0 | Urban | Tehran |
In the next phase, we'll incorporate classification and sampling weights to compute weighted expenditures.
# Remove unnecessary columns
df = df.drop(columns=["Urban_Rural", "Province"])
# Classify based on food types
df = hbsir.add_classification(df, "original", levels=[2], output_column_names=["Food_Type"])
# Integrate sampling weights
df = hbsir.add_weight(df)
# Compute weighted expenditure
df["Weighted_Expenditure"] = df["Expenditure"] * df["Weight"]
Year | ID | Code | Price | Expenditure | Food_Type | Weight | Weighted_Expenditure | |
---|---|---|---|---|---|---|---|---|
107928 | 1400 | 12313290719 | 11144 | 25000.0 | 125000.0 | cereals_and_cereal_products | 2155 | 269375008.0 |
107929 | 1400 | 12313290719 | 11151 | 25000.0 | 400000.0 | cereals_and_cereal_products | 2155 | 862000000.0 |
107930 | 1400 | 12313290719 | 11164 | 128571.0 | 180000.0 | cereals_and_cereal_products | 2155 | 387900000.0 |
107931 | 1400 | 12313290719 | 11165 | 140000.0 | 70000.0 | cereals_and_cereal_products | 2155 | 150850000.0 |
107932 | 1400 | 12313290719 | 11172 | 350000.0 | 350000.0 | cereals_and_cereal_products | 2155 | 754249984.0 |
Additionally, we will calculate the sum of weights for urban residents of Tehran.
# Load weights table for year 1400
weights = hbsir.load_table("Weights", 1400)
Year | ID | Weight | |
---|---|---|---|
0 | 1400 | 10001000226 | 1245.0 |
1 | 1400 | 10001000235 | 1245.0 |
2 | 1400 | 10011009720 | 201.0 |
3 | 1400 | 10011009735 | 201.0 |
4 | 1400 | 10003003235 | 237.0 |
# Calculate the sum of weights
weights_sum = (
weights
.pipe(hbsir.add_attribute, "Urban_Rural")
.pipe(hbsir.add_attribute, "Province")
.query("`Urban_Rural`=='Urban' & Province=='Tehran' & Weight.notnull()")
.sum()
.loc["Weight"]
)
print(weights_sum)
>>> 4466717.0
At this final stage, we proceed to calculate the weighted expenditure associated with each type of food expense.
# Compute the weighted mean
df = df.groupby("Food_Type")[["Weighted_Expenditure"]].sum()
df = df / weights_sum / 1e4
# Calculate the sum of expenditures
df["sum"] = df.sum()
# Rename column and sort results
df = df.rename(columns={"Weighted_Expenditure": "Food_Expenditure (k Tomans)"})
df = df.sort_values("Food_Expenditure (k Tomans)", ascending=False)
Outcome of the analysis:
Food_Type | Food_Expenditure (k Tomans) |
---|---|
sum | 2,255 |
cereals_and_cereal_products | 481 |
meat | 425 |
fruits_and_nuts | 325 |
vegetables_and_pulses | 307 |
milk_other_dairy_product_excluding_butter_and_eggs | 268 |
sugar_confectionery_and_desserts | 106 |
spices_condiments_and_other_food_products | 85 |
oils_fats_and_butter | 84 |
tea_coffee_and_cocoa_drinks | 69 |
soft_drinks | 51 |
fish_and_other_sea_foods | 48 |