Value-iteration-and-policy-iteration-using-gym-environment

This code implements an agent class that is a custom environment in the OpenAI Gym framework. The agent has a discrete observation space and action space, specified by the parameters n_states and n_actions. The agent's state transition probabilities, rewards, and initial state value and policy are randomly generated.

The agent has two methods to solve for the optimal policy: value iteration and policy iteration. The value iteration method updates the state value function using the maximum expected return from taking each action in the current state. The policy iteration method updates both the policy and state value function. It first performs policy evaluation to update the state value function for the current policy, and then improves the policy by selecting the action with the highest expected return for each state.

The agent also has a step method to simulate a single step in the environment, a reset method to reset the state to the initial state.