- Fall 2015
- Due: Monday, November 2, 11:59pm
In this assignment, you will implement sequential decision-making algorithms and apply them to a gridworld and to Pacman. Note: We will use the Pacman framework developed at Berkeley. This framework is used worldwide to teach AI, so it is very important that you DO NOT publish your solutions online.
Follow the instructions at:
http://ai.berkeley.edu/reinforcement.html
The page includes questions requiring implementation of the sequential decision-making and reinforcement learning algorithms we studied in class. [The grading scheme described on the Berkeley webpage will not be used, but you may use it to test and evaluate your own implementation.]
For this assignment, you are only required to do Questions 1-7. Question 8 is quite interesting, though, so we recommend it if you have time and are interested in machine learning.
To get the assignment, we recommend cloning this repo:
git clone https://github.com/CS182/HW4.git
Solutions should be submitted to the course dropbox folder. Submit only the files qlearningAgents.py, valueIterationAgents.py, and analysis.py. If you work in a pair, only one student should submit the files, but make sure to include the names of both students at the top of each of the files.
The framework for MDPs and reinforcement learning in this assignment is the same as the one discussed in class. However, the reward specification differs from the one given in AIMA. While AIMA provides a helpful guide to this topic, if you want further notes in the style of the assignment, consult the text Reinforcement Learning by Sutton & Barto, available free at https://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html.
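Concretely, the framework used here attaches rewards to transitions, written R(s, a, s'), rather than to states alone as in AIMA's R(s). Under that convention, the value-iteration update you implement takes roughly this form:

```latex
V_{k+1}(s) \;=\; \max_{a} \sum_{s'} T(s, a, s') \, \bigl[ \, R(s, a, s') + \gamma \, V_k(s') \, \bigr]
```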
Answer the following questions individually, and submit them as a PDF to the dropbox folder.
You control a solar-powered Mars rover. At any time, it can drive fast or slow. You get a reward for the distance travelled: driving fast gives +10 (in all conditions), while driving slow gives +4 (in all conditions). However, if the rover shuts off, you will not get any reward no matter what you do (it won’t move anyway...). Your rover can be in one of three conditions: cool, warm, or off. Driving fast tends to heat up the rover, while driving slow tends to cool it down. If the rover overheats, it shuts off, forever. The transitions are shown in the table below.
| Current condition | Fast or slow? | Next condition | Probability |
|---|---|---|---|
| cool | slow | cool | 1 |
| cool | fast | cool | 1/4 |
| cool | fast | warm | 3/4 |
| warm | slow | cool | 1/4 |
| warm | slow | warm | 3/4 |
| warm | fast | warm | 7/8 |
| warm | fast | off | 1/8 |
Model this problem as a Markov Decision Process: Formally specify the states, actions, transition function and reward function.
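For concreteness, here is a minimal sketch of one reasonable encoding, assuming the reward attaches to the state and action (so the off state yields 0 for every action); the notation you use may differ:

```latex
S = \{\text{cool}, \text{warm}, \text{off}\}, \qquad A = \{\text{slow}, \text{fast}\}

T(s' \mid s, a): \text{as given in the table above, with } T(\text{off} \mid \text{off}, a) = 1

R(s, a) =
\begin{cases}
+10 & \text{if } s \neq \text{off} \text{ and } a = \text{fast} \\
+4  & \text{if } s \neq \text{off} \text{ and } a = \text{slow} \\
0   & \text{if } s = \text{off}
\end{cases}
```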
Write down the
Start with a policy where you drive fast no matter what the condition of the rover is. Simulate the first two iterations of the policy iteration algorithm. Show how the policy evolves as you run the algorithm. What is the policy after the second iteration? For this question assume a discount factor of 0.9.
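If you want to sanity-check your hand simulation, below is a minimal, self-contained sketch of policy iteration on this MDP. It assumes the +10/+4 rewards attach to the chosen action and that off is absorbing with zero reward; the names (T, q_value, evaluate) are illustrative and not part of the Berkeley framework.

```python
# Sketch of policy iteration on the rover MDP (assumptions noted above).
GAMMA = 0.9
STATES = ['cool', 'warm', 'off']
ACTIONS = ['slow', 'fast']

# Transition table from above: T[(s, a)] -> list of (next state, probability).
T = {
    ('cool', 'slow'): [('cool', 1.0)],
    ('cool', 'fast'): [('cool', 0.25), ('warm', 0.75)],
    ('warm', 'slow'): [('cool', 0.25), ('warm', 0.75)],
    ('warm', 'fast'): [('warm', 0.875), ('off', 0.125)],
    ('off', 'slow'): [('off', 1.0)],   # off is absorbing
    ('off', 'fast'): [('off', 1.0)],
}

def reward(s, a):
    # Assumed convention: +10 for fast, +4 for slow, nothing once off.
    return 0.0 if s == 'off' else (10.0 if a == 'fast' else 4.0)

def q_value(s, a, V):
    # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')
    return reward(s, a) + GAMMA * sum(p * V[s2] for s2, p in T[(s, a)])

def evaluate(policy, sweeps=200):
    # Iterative policy evaluation; 200 sweeps is plenty at gamma = 0.9.
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {s: q_value(s, policy[s], V) for s in STATES}
    return V

policy = {s: 'fast' for s in STATES}  # initial policy: always drive fast
for i in (1, 2):
    V = evaluate(policy)
    # Greedy improvement step: pick the action with the highest Q-value.
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, V)) for s in STATES}
    print('policy after iteration', i, policy)
```

Exact policy evaluation solves a small linear system; the sketch instead approximates it with repeated sweeps, which converges quickly at a discount of 0.9. In your written answer, show the evaluated values and the improvement step at each iteration rather than just the final policy.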