Reinforcement Learning: Multi-Armed Bandits

This is the classic Multi-Armed Bandit problem. This was one of the tasks completed for HyperionDev.

Imagine you are in a casino facing multiple slot machines (let's say 10) in a row. Each of these slot machines allows you to play for free and has a maximum payout of 10 dollars, so every play is guaranteed to give you a reward between 0 and 10 dollars. Each machine has a different average payout, and you have to figure out which one gives the highest average reward so that you can maximise your total reward as quickly as possible. Every play spent testing a weaker machine is a play not spent on the best one, so you have to balance trying machines out (exploration) against sticking with the one that currently looks best (exploitation).
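Below is a minimal sketch of the setup, assuming an epsilon-greedy strategy. This README does not say which algorithm the original task used, and epsilon-greedy is only one standard option. Each play either explores a random machine (with probability `epsilon`) or exploits the machine with the best estimated average so far. Estimates are updated with an incremental sample average. The payout model is also an assumption made for illustration: each machine pays `10 * Beta(a, b)` dollars, which always lies in [0, 10] and has a chosen hidden mean. The names `SlotMachine`, `epsilon_greedy`, `concentration` and the example means are all hypothetical.

```python
import random


class SlotMachine:
    """One arm: pays a random amount in [0, 10] dollars with a hidden mean.

    Hypothetical payout model: 10 * Beta(alpha, beta), chosen so the payout
    stays within [0, 10] and its average is exactly `mean`.
    """

    def __init__(self, mean: float, concentration: float = 4.0):
        if not 0.0 < mean < 10.0:
            raise ValueError("mean must lie strictly between 0 and 10")
        self.mean = mean
        p = mean / 10.0
        self.alpha = concentration * p          # Beta mean is alpha / (alpha + beta) = p
        self.beta = concentration * (1.0 - p)

    def pull(self, rng: random.Random) -> float:
        """Play the machine once and return the payout in dollars."""
        return 10.0 * rng.betavariate(self.alpha, self.beta)


def epsilon_greedy(machines, steps=5000, epsilon=0.1, seed=0):
    """Play `steps` rounds; return (estimated means, play counts, total reward)."""
    rng = random.Random(seed)
    k = len(machines)
    estimates = [0.0] * k   # Q[a]: running average reward of machine a
    counts = [0] * k        # N[a]: number of times machine a was played
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                             # explore: random machine
        else:
            arm = max(range(k), key=lambda a: estimates[a])    # exploit: best estimate so far
        reward = machines[arm].pull(rng)
        counts[arm] += 1
        # Incremental sample average: Q <- Q + (r - Q) / N
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, counts, total


if __name__ == "__main__":
    # Ten machines with hypothetical hidden means; machine 4 is the best.
    means = [2.0, 3.0, 4.0, 5.0, 8.0, 4.5, 3.5, 2.5, 6.0, 1.0]
    machines = [SlotMachine(m) for m in means]
    estimates, counts, total = epsilon_greedy(machines)
    for i, (q, n) in enumerate(zip(estimates, counts)):
        print(f"machine {i}: estimated mean {q:.2f}, played {n} times")
    print(f"total winnings: {total:.2f} dollars")
```

A fixed `epsilon` keeps exploring forever, so roughly 10% of plays keep going to non-best machines even after the best one is clear. Decaying `epsilon` over time, or switching to a different strategy such as UCB, are common ways to reduce that cost.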