hpi-epic/BP2021

market state during simulation step leaks previous vendor actions

Closed this issue · 1 comment

Assume a linear market with two vendors (a duopoly: RL vs. rule-based). During reset() the vendor_actions of the market will default to:

$$[a_0^{t=0}, a_1^{t=0}] = [\text{prod}_{\text{price}} + 1,\ \text{prod}_{\text{price}} + 1]$$
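For context, this is roughly what that default could look like in code; a minimal sketch, assuming a config attribute named `production_price` based on the formula above, not the repository's exact implementation:

```python
def reset(self):
	# Sketch: before the first step, every vendor's action defaults
	# to the production price plus one (see the formula above).
	self.vendor_actions = [self.config.production_price + 1] * self._number_of_vendors
```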

In the linear market the state simply holds the qualities of both vendors $[x,y]$.
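Under these definitions, an observation for a vendor could plausibly be assembled like this (a sketch for the duopoly case; `_observation` and `vendor_actions` appear in the snippet below, while the `qualities` attribute is an assumption):

```python
def _observation(self, vendor):
	# Sketch: own quality first, then the other vendor's current
	# action and quality (duopoly only, so "other" is 1 - vendor).
	other = 1 - vendor
	return [self.qualities[vendor], self.vendor_actions[other], self.qualities[other]]
```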

Now, in the first episode and first step of the simulation, the RL agent will receive its observation of the market, i.e. $[x, a_1^{t=0}, y]$.
The agent picks an action $a_0^{t=1}$ according to its policy. Now the market will be simulated. As expected, the customers will be split into groups, one per vendor:

```python
customers_per_vendor_iteration = self.config.number_of_customers // self._number_of_vendors
```

First, the probability distribution that defines the purchase behaviour is generated with the prices/actions $[a_0^{t=1}, a_1^{t=0}]$ and qualities $[x, y]$. This iteration simulates the effect of the RL agent's action.
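The exact functional form of that distribution is not shown here; a common choice, used purely as an illustration, is a softmax over a utility that rewards quality and penalizes price:

```python
import numpy as np

def purchase_probabilities(prices, qualities):
	# Hypothetical utility: customers prefer high quality and low prices.
	utility = np.asarray(qualities, dtype=float) - np.asarray(prices, dtype=float)
	# Softmax turns the utilities into one purchase probability per vendor.
	exp_u = np.exp(utility - utility.max())
	return exp_u / exp_u.sum()

# First iteration: fresh RL action a_0^{t=1}, stale default a_1^{t=0}:
# probs = purchase_probabilities([a0_t1, a1_t0], [x, y])
```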

In the second iteration the rule-based agent chooses its action. One would expect it to receive the observation $[y, a_0^{t=0}, x]$, but it actually receives $[y, a_0^{t=1}, x]$.

```python
customers_per_vendor_iteration = self.config.number_of_customers // self._number_of_vendors
for i in range(self._number_of_vendors):
	self._simulate_customers(profits, customers_per_vendor_iteration)
	if i < len(self.competitors):
		action_competitor_i = self.competitors[i].policy(self._observation(i + 1))  # this observation already leaks information
		self.vendor_actions[i + 1] = action_competitor_i  # during the next iteration we would now simulate the customers' behaviour a second time, with the action from vendor 0...
```
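For comparison, the schedule this report expected would take every observation before any action of the current step is written back, so the rule-based agent would still see $a_0^{t=0}$. A sketch of that variant, reusing the names from the snippet above and with `rl_action` as a hypothetical stand-in for the RL agent's choice, not a proposed patch:

```python
# Let every competitor observe the market before any new action is applied.
competitor_actions = [
	self.competitors[i].policy(self._observation(i + 1))
	for i in range(len(self.competitors))
]
self.vendor_actions[0] = rl_action  # a_0^{t=1}, hypothetical variable
for i, action in enumerate(competitor_actions):
	self.vendor_actions[i + 1] = action
# Only now simulate all customer groups with the complete action profile.
for _ in range(self._number_of_vendors):
	self._simulate_customers(profits, customers_per_vendor_iteration)
```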

OK, this can be closed. It seems to be the desired behaviour! I just misunderstood it.