::: {.container-fluid .main-container} ::: {#header}

Horticulture Data {#hoticulture-data .title .toc-ignore}

Lakshay Kumar {#lakshay-kumar .author}

24-02-2023 {#section .date}

:::

::: {#about-dataset .section .level1}

About Dataset

The dataset includes information on crop production in different states and districts of India over the years. The columns in the dataset are as follows:

  1. "state": The state where the crop was produced. This is a categorical variable: it takes one of a limited set of values, one per state in the country.

  2. "districts": The district where the crop was produced. This is also a categorical variable; its possible values depend on the state.

  3. "crop": The crop that was produced. This is a categorical variable whose values vary with region and season.

  4. "area_in_thousand_ha": The area of land used for growing the crop, in thousands of hectares. This is a continuous variable and can take any positive value.

  5. "production_in_thousand_mt": The production of the crop, in thousands of metric tons. This is also a continuous variable and can take any positive value.

  6. "year": The year in which the crop was produced. This is treated as a categorical variable, with one value per year.

  7. "location": The region of the country where the crop was produced (Central, East, North, Northeast, South, or West).

  8. "crop_type": The type of crop produced, for example, whether it is a fruit or a vegetable.

  9. "production_per_area": The production per unit of area, calculated by dividing production by the area used for growing the crop (thousand mt / thousand ha, that is, metric tons per hectare).

This dataset can be used to analyze trends in crop production over the years and to identify the factors that influence it, such as the area of land used for growing the crop, the type of crop produced, and the production per unit of area. It can also be used to compare crop production across states and districts, and to identify the crops that are most productive in a given area. :::

::: {#questions-from-data-set .section .level1}

Questions From Data-set

  1. What share of the country's total production of each crop comes from that crop's largest producer?

  2. Which parts of India produce the most fruits and vegetables?

  3. How is the ratio of production to area distributed across the country? This can indicate which areas are most fertile.

  4. How can we make sense of anomalies in crop production, for example by studying the outliers in box plots for every state and crop?

  5. How can we break down each state's crop distribution to see which districts produce which crops?

  6. How strongly is crop production correlated with area, and how does that correlation change for each crop from year to year?

  7. Does a state produce more fruits or more vegetables? Are the states that produce more of one clustered together or far apart (geographic advantage)?

  8. If we divide vegetables into more specific categories (such as root vegetables or leafy greens), does each part of the country (north, west, south, and east) have an alternative within each category, that is, a similar vegetable commonly grown or used in that region as a substitute for one that is not commonly available there? Is the balance good?

  9. What is the correlation between total area and total production, as a measure of productivity?

  10. What is the productivity trend? :::

::: {#data-exploration .section .level1}

Data Exploration

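library(dplyr)    # select(), group_by(), summarize(), %>%
library(ggplot2)  # plotting

# Keep the columns needed to compare crop types across regions
crop_data <- select(crops_data, state, districts, area_in_thousand_ha, production_in_thousand_mt, year, location, crop_type)
# Count the records in each region, split by crop type (no grouping is needed for geom_bar)
ggplot(data = crop_data, aes(x = location, fill = crop_type)) + geom_bar(position = "dodge")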

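# Yield: production divided by area (thousand mt / thousand ha = mt per ha)
crops_data$production_per_area <- crops_data$production_in_thousand_mt/crops_data$area_in_thousand_ha
# Non-finite ratios (Inf or NaN, e.g. from a zero area) are treated as missing
crops_data$production_per_area[!is.finite(crops_data$production_per_area)] <- NA
# Mean yield per region, ignoring missing values
fertility <- crops_data %>% group_by(location) %>% summarize(mean_production_per_area = mean(production_per_area,na.rm = TRUE))
fertility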
## # A tibble: 6 × 2
##   location  mean_production_per_area
##   <chr>                        <dbl>
## 1 Central                       21.9
## 2 East                          17.4
## 3 North                         16.8
## 4 Northeast                     15.5
## 5 South                         30.1
## 6 West                          20.7
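
The table above averages the per-record ratios, so a small plot counts as much as a large one. An area-weighted alternative divides total production by total area within each region. The sketch below contrasts the two on a toy data frame; the values are invented for illustration, only the column names come from the dataset, and base R is used so that it runs without extra packages.

```r
# Toy data, invented for illustration; column names follow the dataset.
toy <- data.frame(
  location = c("South", "South", "North", "North"),
  area_in_thousand_ha = c(1, 9, 2, 2),
  production_in_thousand_mt = c(40, 90, 20, 40)
)

# Per-record yield, computed the same way as production_per_area
toy$production_per_area <- toy$production_in_thousand_mt / toy$area_in_thousand_ha

# Mean of per-record ratios: every record counts equally
mean_of_ratios <- sapply(split(toy$production_per_area, toy$location), mean, na.rm = TRUE)

# Ratio of totals: each record is weighted by its area
total_production <- sapply(split(toy$production_in_thousand_mt, toy$location), sum)
total_area <- sapply(split(toy$area_in_thousand_ha, toy$location), sum)
ratio_of_totals <- total_production / total_area

print(data.frame(mean_of_ratios, ratio_of_totals))
```

With these toy values, South comes out at 25 under the mean of ratios but 13 under the ratio of totals, while North is 15 under both. Which measure is more appropriate depends on whether the question is about a typical record or about the region as a whole.
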
my_data <- subset(crops_data, name == "lakshay")
table(my_data$crop)
## 
##    capsicum      carrot cauliflower 
##          79         172         273
#Scatter Plot for crops
ggplot(my_data, aes(x=area_in_thousand_ha, y=production_in_thousand_mt, color=crop)) +
  geom_point() +
  labs(title="Crop Production in India", x="Area (thousand ha)", y="Production (thousand mt)") +
  theme(plot.title = element_text(hjust = 0.5, size = 12 , face = "bold"), 
        axis.text = element_text(size = 8),
        axis.text.x = element_text(size = 8, angle = 90),
        axis.title = element_text(size = 8, face = "bold"))

#Scatter plot of the Distribution of production across states for each crop
ggplot(my_data, aes(x = area_in_thousand_ha, y = production_in_thousand_mt, color = crop)) +
  geom_point() +
  facet_wrap(~ state) +
  labs(title = "Distribution of production across states for each crop", 
       x = "Area in thousand hectares", 
       y = "Production in thousand metric tons") +
  theme(plot.title = element_text(hjust = 0.5, size = 12 , face = "bold"), 
        axis.text = element_text(size = 8),
        axis.title = element_text(size = 8, face = "bold"))

#Boxplot of the Distribution of production across states for each crop
ggplot(my_data, aes(x = crop, y = production_in_thousand_mt, fill = crop)) +
  geom_boxplot() +
  labs(title = "Distribution of Production across States for Each Crop", x = "Crop", y = "Production (in thousand mt)", fill = "Crop") +
  theme(plot.title = element_text(hjust = 0.5, size = 12 , face = "bold"), 
        axis.text = element_text(size = 8),
        axis.text.x = element_text(size = 8, angle = 0),
        axis.title = element_text(size = 8, face = "bold"))
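
To follow up on the anomalies question, the points drawn beyond the whiskers of a box plot can also be listed as rows. The sketch below is one possible approach, not the analysis used in this report: it applies base R's `boxplot.stats()`, which uses the 1.5 × IQR rule by default, to each crop separately. The toy values and the state labels are invented, and the hinges that `boxplot.stats()` computes can differ slightly from the quartiles used by `geom_boxplot()`.

```r
# Toy data, invented for illustration; column names follow the dataset.
toy <- data.frame(
  crop = rep(c("carrot", "capsicum"), each = 6),
  state = rep(c("A", "B", "C", "D", "E", "F"), times = 2),
  production_in_thousand_mt = c(10, 12, 11, 13, 12, 90,
                                5, 6, 5, 7, 6, 6)
)

# For each crop, keep the rows whose production lies beyond the whiskers.
# boxplot.stats()$out returns the values outside 1.5 * IQR from the hinges.
outliers <- do.call(rbind, lapply(split(toy, toy$crop), function(d) {
  flagged <- boxplot.stats(d$production_in_thousand_mt)$out
  d[d$production_in_thousand_mt %in% flagged, ]
}))

print(outliers)
```

On the toy data only the carrot record with a production of 90 is flagged. On the real data, the same call on `my_data` would show which state and district each outlier belongs to.
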

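#Crop Production: sum over districts and years so that each bar shows a true total
#(with stat = "identity", rows sharing a state and crop are drawn on top of each other)
ggplot(my_data, aes(x = state, y = production_in_thousand_mt, fill = crop)) + 
  geom_bar(stat = "summary", fun = sum, position = "dodge") + 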
  labs(x = "State", y = "Production (in thousand mt)", fill = "Crop", 
       title = "Total Production of Each Crop in Each State") +
  theme(plot.title = element_text(hjust = 0.5, size = 12 , face = "bold"), 
        axis.text = element_text(size = 8),
        axis.text.x = element_text(size = 8, angle = 90),
        axis.title = element_text(size = 8, face = "bold"))
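
The per-state totals also answer the first question in the list: how much of a crop's national production comes from its largest producer. A minimal sketch in base R, assuming the same column names as the dataset; the crops, state labels and numbers are placeholders.

```r
# Toy data, invented for illustration; column names follow the dataset.
toy <- data.frame(
  crop = c("tapioca", "tapioca", "tapioca", "carrot", "carrot"),
  state = c("A", "B", "A", "A", "B"),
  production_in_thousand_mt = c(60, 20, 20, 30, 10)
)

# Total production per crop and state
by_state <- aggregate(production_in_thousand_mt ~ crop + state, data = toy, FUN = sum)

# For each crop, find the top state and its share of the crop's overall total
top_share <- do.call(rbind, lapply(split(by_state, by_state$crop), function(d) {
  top <- d[which.max(d$production_in_thousand_mt), ]
  data.frame(
    crop = top$crop,
    top_state = top$state,
    share = top$production_in_thousand_mt / sum(d$production_in_thousand_mt)
  )
}))

print(top_share)
```

Here state A accounts for 80% of the toy tapioca and 75% of the toy carrot. `which.max()` returns only the first maximum, so ties between states would need separate handling.
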

colnames(my_data)
##  [1] "id"                        "name"                     
##  [3] "table_number"              "table_name"               
##  [5] "state"                     "districts"                
##  [7] "crop"                      "area_in_thousand_ha"      
##  [9] "production_in_thousand_mt" "year"                     
## [11] "location"                  "crop_type"                
## [13] "production_per_area"
my_data_subset <- my_data %>% select(c("crop", "area_in_thousand_ha", "production_in_thousand_mt", "year"))

# Group the data by crop and year, and summarize by taking the sum of area and production
my_data_summary <- my_data_subset %>% 
  group_by(crop, year) %>% 
  summarize(total_area = sum(area_in_thousand_ha), total_production = sum(production_in_thousand_mt))
## `summarise()` has grouped output by 'crop'. You can override using the
## `.groups` argument.
# Create the plot using ggplot2
ggplot(my_data_summary, aes(x = total_area, y = total_production, color = as.factor(year))) +
  geom_point(size = 2.5) +
  facet_wrap(~ crop, ncol = 3) +
  scale_color_discrete(name = "Year") +
  labs(x = "Area (in thousand ha)", y = "Production (in thousand mt)", title = "Production vs Area by Year")
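
The plot shows one point per crop and year, so it cannot by itself tell how strongly area and production move together within a year. One way to quantify that is a Pearson correlation computed on the unaggregated rows of each crop-year group. The sketch below uses base R on a toy data frame; the values and the years are invented, only the column names come from the dataset.

```r
# Toy data, invented for illustration; column names follow the dataset.
toy <- data.frame(
  crop = rep(c("carrot", "capsicum"), each = 6),
  year = rep(rep(c(2019, 2020), each = 3), times = 2),
  area_in_thousand_ha = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
  production_in_thousand_mt = c(10, 20, 30, 12, 19, 33, 5, 9, 4, 3, 6, 9)
)

# One group per crop-year combination; drop = TRUE skips empty combinations
groups <- split(toy, list(toy$crop, toy$year), drop = TRUE)

# Pearson correlation between area and production within each group
correlations <- sapply(groups, function(d) {
  cor(d$area_in_thousand_ha, d$production_in_thousand_mt, use = "complete.obs")
})

print(round(correlations, 2))
```

The result is a named vector with one entry per crop and year (for example `carrot.2019`), which makes it easy to compare a crop's correlation from one year to the next. Groups with fewer than three rows give correlations that are not meaningful.
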

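# Rows contributed by the three group members
group_data <- subset(crops_data, name %in% c("aryan", "lakshay", "diya"))
# Sum over districts and years so that each bar shows a true total
ggplot(group_data, aes(x = state, y = production_in_thousand_mt, fill = crop)) + 
  geom_bar(stat = "summary", fun = sum, position = "dodge") + 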
  labs(x = "State", y = "Production (in thousand mt)", fill = "Crop", 
       title = "Total Production of Each Crop in Each State") +
  theme(plot.title = element_text(hjust = 0.5, size = 12 , face = "bold"), 
        axis.text = element_text(size = 8),
        axis.text.x = element_text(size = 8, angle = 90),
        axis.title = element_text(size = 8, face = "bold"))

:::

::: {#data-explanation .section .level1}

Data Explanation

  • Vegetables are produced mainly in Central and East India.
  • South India produces roughly equal amounts of fruits and vegetables.
  • According to an external source, 70% of vegetables are exported.
  • Production does not change significantly across the two years.
  • Tamil Nadu produces most of the tapioca in India. :::

::: {#further-work .section .level1}

Further Work

  • Check the contribution of individual crops to the vegetable category.
  • Apply ML models to predict crop production in future years. ::: :::