Geometry and Determinism of Optimal Stationary Control in Partially Observable Markov Decision Processes
Abstract
It is well known that for any finite state Markov decision process (MDP) there is a memoryless deterministic policy that maximizes the expected reward. For partially observable Markov decision processes (POMDPs), optimal memoryless policies are generally stochastic. We study the expected reward optimization problem over the set of memoryless stochastic policies. We formulate this as a constrained linear optimization problem and develop a corresponding geometric framework. We show that any POMDP has an optimal memoryless policy of limited stochasticity, which allows us to reduce the dimensionality of the search space. Experiments demonstrate that this approach enables better and faster convergence of the policy gradient on the evaluated systems.
Montúfar, Ay, Ghazi-Zahedi
Keywords: MDP, POMDP, partial observability, memoryless stochastic policy, average reward, policy gradient, reinforcement learning
1 Introduction
The field of reinforcement learning addresses a broad class of problems where an agent has to learn how to act in order to maximize some form of cumulative reward. On choosing action $a$ at some world state $w$, the world undergoes a transition to state $w'$ with probability $\alpha(w'|w,a)$ and the agent receives a reward signal $R(w,a)$. A policy is a rule for selecting actions based on the information that is available to the agent at each time step. In the simplest case, the Markov decision process (MDP), the full world state is available to the agent at each time step. A key result in this context shows the existence of optimal policies which are memoryless and deterministic (see Ross, 1983). In other words, the agent performs optimally by choosing one specific action at each time step based on the current world state. The agent does not need to take the history of world states into account, nor does he need to randomize his actions.
In many cases one has to assume that the agent experiences the world only through noisy sensors and has to choose actions based only on the partial information provided by these sensors. More precisely, if the world state is $w$, the agent only observes a sensor state $s$ with probability $\beta(s|w)$. This setting is known as a partially observable Markov decision process (POMDP). Policy optimization for POMDPs has been discussed by several authors (see Sondik, 1978; Chrisman, 1992; Littman et al., 1995; McCallum, 1996; Parr and Russell, 1995). Optimal policies generally need to take the history of sensor states into account. This requires that the agent be equipped with a memory that stores the sensor history or an encoding thereof (e.g., a properly updated belief state), which may require additional computation.
Although in principle possible, in practice it is often too expensive to find or even to store and execute completely general optimal policies. Some form of representation or approximation is needed. In particular, in the context of embodied artificial intelligence and systems design (Pfeifer and Bongard, 2006) the onboard computation sets limits to the complexity of the controller with respect to both memory and computational cost. We are interested in policies with limited memory (see, e.g., Hansen, 1998). In fact we will focus on memoryless stochastic policies (see Singh et al., 1994; Jaakkola et al., 1995). Memoryless policies may be worse than policies with memory, but they require far fewer parameters and less computation. Among other approaches, the GPOMDP algorithm (Baxter and Bartlett, 2001) provides a gradient-based method to optimize the expected reward over parametric models of memoryless stochastic policies. For interesting systems, the set of all memoryless stochastic policies can still be very high dimensional and it is important to find good models. In this article we show that each POMDP has an optimal memoryless policy of limited stochasticity, which allows us to construct low-dimensional differentiable policy models with optimality guarantees. The amount of stochasticity can be bounded in terms of the amount of perceptual aliasing, independently of the specific form of the reward signal.
We follow a geometric approach to memoryless policy optimization for POMDPs. The key idea is that the objective function (the expected reward per time step) can be regarded as a linear function over the set of stationary joint distributions over world states and actions. For MDPs this set is a convex polytope and, in turn, there always exists an optimizer which is an extreme point. The extreme points correspond to deterministic policies (which cannot be written as convex combinations of other policies). For POMDPs this set is in general not convex, but it can be decomposed into convex pieces. There exists an optimizer which is an extreme point of one of these pieces. Depending on the dimension of the convex pieces, the optimizer is more or less stochastic.
This paper is organized as follows. In Section 2 we review basics on POMDPs. In Section 3 we discuss the reward optimization problem in POMDPs as a constrained linear optimization problem with two types of constraints. The first constraint is about the types of policies that can be represented in the underlying MDP. The second constraint relates policies with stationary world state distributions. We discuss the details of these constraints in Appendices A and B. In Section 4 we use these geometric descriptions to show that any POMDP has an optimal stationary policy of limited stochasticity. In Section 5 we apply the stochasticity bound to define low dimensional policy models with optimality guarantees. In Section 6 we present experiments which demonstrate the usefulness of the proposed models. In Section 7 we offer our conclusions.
2 Partially observable Markov decision processes
A discrete time partially observable Markov decision process (POMDP) is defined by a tuple $(\mathcal{W}, \mathcal{S}, \mathcal{A}, \alpha, \beta, R)$, where $\mathcal{W}$ is a finite set of world states, $\mathcal{S}$ is a finite set of sensor states, $\mathcal{A}$ is a finite set of actions, $\beta\colon \mathcal{W} \to \Delta_{\mathcal{S}}$ is a Markov kernel that describes sensor state probabilities given the world state, $\alpha\colon \mathcal{W}\times\mathcal{A} \to \Delta_{\mathcal{W}}$ is a Markov kernel that describes the probability of transitioning to a world state given the current world state and action, and $R\colon \mathcal{W}\times\mathcal{A} \to \mathbb{R}$ is a reward signal. A Markov decision process (MDP) is the special case where $\mathcal{W} = \mathcal{S}$ and $\beta$ is the identity map.
A policy is a mechanism for selecting actions. In general, at each time step $t$, a policy is defined by a Markov kernel $\pi_t$ taking the history $h_t = (s_0, a_0, \ldots, s_t)$ of sensor states and actions to a probability distribution $\pi_t(\cdot|h_t)$ over $\mathcal{A}$. A policy is deterministic when at each time step each possible history leads to a single positive probability action. A policy is memoryless when the distribution over actions only depends on the current sensor state, $\pi_t(a|h_t) = \pi_t(a|s_t)$. A policy is stationary (homogeneous) when it is memoryless and time independent, $\pi_t(a|s) = \pi(a|s)$ for all $t$. Stationary policies are represented by kernels of the form $\pi\colon \mathcal{S} \to \Delta_{\mathcal{A}}$. We denote the set of all such policies by $\Delta_{\mathcal{S},\mathcal{A}}$.
The goal is to find a policy that maximizes some form of expected reward. We consider the long term expected reward per time step (also called average reward)
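$$\mathcal{R}_\mu(\pi) = \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}_{\Pr\{(w_t,a_t)_{t=0}^{T-1};\, \pi, \mu\}} \left[ \sum_{t=0}^{T-1} R(w_t, a_t) \right]. \tag{1}$$
Here $\Pr\{(w_t,a_t)_{t=0}^{T-1};\, \pi, \mu\}$ is the probability of the sequence $w_0, a_0, w_1, a_1, \ldots, w_{T-1}, a_{T-1}$, given that $w_0$ is distributed according to the start distribution $\mu \in \Delta_{\mathcal{W}}$ and at each time step actions are selected according to the policy $\pi$. Another option is to consider a discount factor $\gamma \in (0,1)$ and the discounted long term expected reward
$$\mathcal{R}^\gamma_\mu(\pi) = \lim_{T\to\infty} \mathbb{E}_{\Pr\{(w_t,a_t)_{t=0}^{T-1};\, \pi, \mu\}} \left[ \sum_{t=0}^{T-1} \gamma^t R(w_t, a_t) \right]. \tag{2}$$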
In the case of an MDP, it is always possible to find an optimal memoryless deterministic policy. In other words, there is a policy that chooses an action deterministically at each time step, depending only on the current world state, which achieves the same or higher long term expected reward as any other policy. This fact can be regarded as a consequence of the policy improvement theorem (Bellman, 1957; Howard, 1960).
In the case of a POMDP, policies with memory may perform much better than the memoryless policies. Furthermore, within the memoryless policies, stochastic policies may perform much better than the deterministic ones (see Singh et al., 1994). The intuitive reason is simple: Several world states may produce the same sensor state with positive probability (perceptual aliasing). On the basis of such a sensor state alone, the agent cannot discriminate the underlying world state with certainty. On different world states the same action may lead to drastically different outcomes. Sometimes the agent is forced to choose probabilistically between the optimal actions for the possibly underlying world states (see Example 2). Sometimes he is forced to choose suboptimal actions in order to minimize the risk of catastrophic outcomes (see Example 1). On the other hand, the sequence of previous sensor states may help the agent identify the current world state and choose one single optimal action. This illustrates why in POMDPs optimal policies may need to take the entire history of sensor states into account and also why the optimal memoryless policies may require stochastic action choices.
The set of policies that take the histories of sensor states and actions into account grows extremely fast. A common approach is to transform the POMDP into a belief-state MDP, where the discrete sensor state is replaced by a continuous Bayesian belief about the current world state. Such belief states encode the history of sensor states and allow for representations of optimal policies. However, belief states are associated with costly internal computations from the side of the acting agent. We are interested in agents subject to perceptual, computational, and storage limitations. Here we investigate stationary policies.
We assume that for each stationary policy $\pi \in \Delta_{\mathcal{S},\mathcal{A}}$ there is exactly one stationary world state distribution $p^\pi(w) \in \Delta_{\mathcal{W}}$ and that it is attained in the limit of infinite time when running policy $\pi$, irrespective of the starting distribution $\mu$. This is a standard assumption that holds true, for instance, whenever the transition kernel $\alpha$ is strictly positive. In this case (1) can be written as
$$\mathcal{R}(\pi) = \sum_{w} p^\pi(w) \sum_{a} p^\pi(a|w)\, R(w,a), \tag{3}$$
where $p^\pi(a|w) = \sum_{s} \pi(a|s)\, \beta(s|w)$. An optimal stationary policy is a policy $\pi^\ast \in \Delta_{\mathcal{S},\mathcal{A}}$ with $\mathcal{R}(\pi^\ast) \geq \mathcal{R}(\pi)$ for all $\pi \in \Delta_{\mathcal{S},\mathcal{A}}$. Note that maximizing (3) over $\Delta_{\mathcal{S},\mathcal{A}}$ is the same as maximizing the discounted expected reward (2) over $\Delta_{\mathcal{S},\mathcal{A}}$ with $\mu = p^\pi$ (see Singh et al., 1994). The expected reward per time step appears more natural for POMDPs than the discounted expected reward, because, assuming ergodicity, it is independent of the starting distribution, which is not directly accessible to the agent. Our discussion focuses on average rewards, but our main Theorem 7 also covers discounted rewards.
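The average reward of a stationary policy can be evaluated numerically by computing the stationary world state distribution of the induced Markov chain. The following is a minimal sketch under stated assumptions: the array layout (`alpha[w, a, w']`, `beta[w, s]`, `R[w, a]`, `pi[s, a]`), the function names, and the two-state toy system with a single fully aliased sensor state are all hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

def stationary_distribution(P):
    """Row vector mu with mu @ P = mu and sum(mu) = 1, assuming it is unique."""
    n = P.shape[0]
    # Stack the stationarity equations (P^T - I) mu = 0 with the normalization.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def average_reward(alpha, beta, R, pi):
    """Average reward per time step of the stationary policy pi[s, a]."""
    p_a_given_w = beta @ pi                          # p(a|w) = sum_s beta(s|w) pi(a|s)
    P = np.einsum("wa,wav->wv", p_a_given_w, alpha)  # induced world state chain
    mu = stationary_distribution(P)
    return float(np.sum(mu[:, None] * p_a_given_w * R))

# Hypothetical toy system: two world states that look identical to the agent.
# In state 0 action 0 is rewarded and moves to state 1; in state 1 action 1
# is rewarded and moves to state 0. The other action leaves the state unchanged.
alpha = np.zeros((2, 2, 2))
alpha[0, 0, 1] = alpha[0, 1, 0] = 1.0
alpha[1, 1, 0] = alpha[1, 0, 1] = 1.0
beta = np.ones((2, 1))                   # a single sensor state
R = np.array([[1.0, 0.0], [0.0, 1.0]])

def policy(p):
    """Memoryless policy choosing action 0 with probability p."""
    return np.array([[p, 1.0 - p]])

reward_deterministic = average_reward(alpha, beta, R, policy(1.0))
reward_uniform = average_reward(alpha, beta, R, policy(0.5))
```

In this toy system both deterministic memoryless policies get stuck in an unrewarded state, whereas the uniform policy collects a reward on half of the time steps, which illustrates why stochasticity can be necessary under perceptual aliasing.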
Our analysis is motivated by the following natural question: Given that every MDP has a stationary deterministic optimal policy, does every POMDP have an optimal stationary policy with small stochasticity? Bounding the required amount of stochasticity for a class of POMDPs would allow us to define a policy model $\mathcal{M} \subseteq \Delta_{\mathcal{S},\mathcal{A}}$ with
$$\max_{\pi \in \mathcal{M}} \mathcal{R}(\pi) = \max_{\pi \in \Delta_{\mathcal{S},\mathcal{A}}} \mathcal{R}(\pi) \tag{4}$$
for every POMDP from that class. We will show that such a model $\mathcal{M}$ can be defined in terms of the number of ambiguous sensor states and actions, such that $\mathcal{M}$ contains optimal stationary policies for all POMDPs with that number of actions and ambiguous sensor states. Depending on this number, $\mathcal{M}$ can be much smaller in dimension than the set of all stationary policies.
The following examples illustrate some cases where optimal stationary control requires stochasticity and some of the intricacies involved in upper bounding the necessary amount of stochasticity.
[Figure 1: (a) the grid world of Example 2; (b) the system of Example 3.]
Example 1.
Consider a system with , , and . The reward function is on and otherwise. The agent starts at some random state. On state action takes the agent to some random state and all other actions leave the state unchanged. In this case the best stationary policy chooses actions uniformly at random.
Example 2.
Consider the grid world illustrated in Figure 1a. The agent has four possible actions, north, east, south, and west, which are effective when there is no wall in that direction. On reaching cells , , and the agent is teleported to cell . On he receives a reward of one and otherwise none. In an MDP setting, the agent knows its absolute position in the maze. A deterministic policy can be easily constructed that leads to a maximal reward, as depicted in the upper right. In a POMDP setting the agent may only sense the configuration of its immediate surrounding, as depicted in the lower right. In this case any memoryless deterministic policy fails. Cells and look the same to the agent. Always choosing the same action on this sensation will cause the agent to loop around never reaching the reward cell . Optimally, the agent should choose probabilistically between east and west. The reader might want to have a look at the experiments treating this example in Section 6.
Example 3.
Consider the system illustrated in Figure 1b. Each node corresponds to a world state . The sensor states are , whereby are sensed as . The actions are . Choosing action in state and action in state has a large negative reward. Choosing action in state and action in state has a large positive reward. Choosing action in has a moderate negative reward and takes the agent to state . From state each action has a large positive reward and takes the agent to . From state any action takes the agent to or with equal probability. In an MDP setting the optimal policy will choose action on and action on . In a POMDP setting the optimal policy chooses action on . This shows that the optimal actions in a POMDP do not necessarily correspond to the optimal actions in the underlying MDP. Similar examples can be constructed where on a given sensor state it may be necessary to choose from a large set of actions at random, larger than the set of actions that would be chosen on all possibly underlying world states, were they directly observed.
3 Average reward maximization as a constrained linear optimization problem
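The expression $\sum_{w} p^\pi(w) \sum_{a} p^\pi(a|w) R(w,a)$ that appears in the expected reward (3) is linear in the joint distribution $p^\pi(w,a) = p^\pi(w)\, p^\pi(a|w)$. We want to exploit this linearity. The difficulty is that the optimization problem is with respect to the policy $\pi$, not the joint distribution, and the stationary world state distribution depends on the policy. This implies that not all joint distributions are feasible. The feasible set is delimited by the following two conditions.

• Representability in terms of the policy:
$$p^\pi(a|w) = \sum_{s} \beta(s|w)\, \pi(a|s). \tag{5}$$
The geometric interpretation is that the conditional distribution $p^\pi(a|w)$ belongs to the polytope $\mathcal{G} = f_\beta(\Delta_{\mathcal{S},\mathcal{A}}) \subseteq \Delta_{\mathcal{W},\mathcal{A}}$ defined as the image of $\Delta_{\mathcal{S},\mathcal{A}}$ by the linear map
$$f_\beta\colon \Delta_{\mathcal{S},\mathcal{A}} \to \Delta_{\mathcal{W},\mathcal{A}}; \quad \pi(a|s) \mapsto \sum_{s} \beta(s|w)\, \pi(a|s). \tag{6}$$
In turn, the joint distribution $p^\pi(w,a)$ belongs to the set $\mathcal{F} \subseteq \Delta_{\mathcal{W}\times\mathcal{A}}$ of joint distributions with conditionals from the set $\mathcal{G}$. In general the set $\mathcal{F}$ is not convex, although it is convex in the marginals $p(w)$ when fixing the conditionals $p(a|w)$, and vice versa. We discuss the details of this constraint in Appendix A.

• Stationarity of the world state distribution:
$$p^\pi(w,a) \in \mathcal{J} := f_\alpha^{-1}(\Xi_{\mathcal{W}}) \subseteq \Delta_{\mathcal{W}\times\mathcal{A}}, \tag{7}$$
where $\Xi_{\mathcal{W}} \subseteq \Delta_{\mathcal{W}\times\mathcal{W}}$ is the polytope of distributions with equal first and second marginals, $\sum_{w'} q(w,w') = \sum_{w'} q(w',w)$ for all $w$. This means that $p^\pi(w)$ is a stationary distribution of the Markov transition kernel $p^\pi(w'|w) = \sum_{a} p^\pi(a|w)\, \alpha(w'|w,a)$. The geometric interpretation is that $p^\pi(w,a)$ belongs to the polytope $\mathcal{J}$ defined as the preimage of $\Xi_{\mathcal{W}}$ by the linear map
$$f_\alpha\colon \Delta_{\mathcal{W}\times\mathcal{A}} \to \Delta_{\mathcal{W}\times\mathcal{W}}; \quad p(w,a) \mapsto \sum_{a} p(w,a)\, \alpha(w'|w,a). \tag{8}$$
We discuss the details of this constraint in Appendix B.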
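The two constraints above can be checked numerically on a small random system. The sketch below builds the conditional of actions given world states from a policy (representability), forms the joint distribution with the stationary world state distribution, pushes it through the transition kernel, and measures the difference between the first and second marginals of the result (stationarity). The array layout, the helper names `random_kernel`, `feasible_joint` and `push_forward`, and the use of power iteration on a strictly positive random kernel are assumptions made for illustration.

```python
import numpy as np

def random_kernel(rng, *shape):
    """Strictly positive Markov kernel, normalized over the last axis."""
    k = rng.random(shape) + 0.1
    return k / k.sum(axis=-1, keepdims=True)

def feasible_joint(alpha, beta, pi, iters=2000):
    """Joint distribution p(w, a) induced by the stationary policy pi[s, a]."""
    p_a_given_w = beta @ pi                          # representability constraint
    P = np.einsum("wa,wav->wv", p_a_given_w, alpha)  # world state transition matrix
    mu = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):                           # power iteration to stationarity
        mu = mu @ P
    return mu[:, None] * p_a_given_w

def push_forward(joint, alpha):
    """Map p(w, a) to the joint over consecutive world states q(w, w')."""
    return np.einsum("wa,wav->wv", joint, alpha)

rng = np.random.default_rng(0)
alpha = random_kernel(rng, 3, 2, 3)   # alpha[w, a, w']
beta = random_kernel(rng, 3, 2)       # beta[w, s]
pi = random_kernel(rng, 2, 2)         # pi[s, a]

joint = feasible_joint(alpha, beta, pi)
q = push_forward(joint, alpha)
# Stationarity: first and second marginals of q coincide.
marginal_gap = float(np.abs(q.sum(axis=1) - q.sum(axis=0)).max())
```

A joint distribution built from an arbitrary world state marginal instead of the stationary one would in general give a nonzero `marginal_gap`, which is the sense in which not all joint distributions are feasible.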
Summarizing, the objective function is the restriction of a linear function on joint distributions over world states and actions to a feasible domain of the form $\mathcal{F} \cap \mathcal{J}$, where $\mathcal{F}$ is the set of joint distributions with conditionals from a convex polytope $\mathcal{G}$, and $\mathcal{J}$ is a convex polytope. We illustrate these notions in the next example.
Example 4.
Consider the system illustrated at the top of Figure 2. There are two world states , two sensor states , and two possible actions . The sensor and transition probabilities are given by
In the following we discuss the feasible set of joint distributions. The policy polytope is a square. The set of realizable conditional distributions of world states given actions is the line
inside of the square . The set of joint distributions with conditionals from is a twisted surface. This set has one copy of for every world state distribution . See the lower left of Figure 2. The set of joint distributions over world states and actions that satisfy the stationarity constraint (7) is the subset of that maps to the polytope shown in the lower right of Figure 2. This is the triangle
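The intersection of the twisted surface of representable joint distributions with the stationarity polytope is a curve. This curve is the feasible domain of the expected reward, viewed as a function of joint distributions over world states and actions. As we will show in Lemma 6, the extreme points of the stationarity polytope can always be written in terms of extreme points of the polytope of conditional distributions of actions given world states.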
4 Determinism of optimal stationary policies
In this section we discuss the minimal stochasticity of optimal stationary policies. In order to illustrate our geometric approach we first consider MDPs and then the more general case of POMDPs.
Theorem 5 (MDPs).
Consider an MDP . Then there is a deterministic optimal stationary policy.
Proof of Theorem 5.
The objective function defined in Equation (3) can be regarded as the restriction of a linear function over to the feasible set defined in Equation (8). Since is a convex polytope, the objective function is maximized at one of its extreme points. By Lemma 6, all extreme points of can be realized by extreme points of , that is, deterministic policies. ∎
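The statement of Theorem 5 can be illustrated numerically: in a small random MDP, the best deterministic stationary policy should do at least as well as any sampled stochastic stationary policy. The sketch below is only a sanity check, not part of the proof. The random MDP, the array layout, and the function name `average_reward_mdp` are assumptions made for illustration.

```python
import itertools
import numpy as np

def average_reward_mdp(alpha, R, pi, iters=2000):
    """Average reward of a stationary policy pi[w, a] in an MDP with
    transitions alpha[w, a, w'] and rewards R[w, a]."""
    P = np.einsum("wa,wav->wv", pi, alpha)       # induced world state chain
    mu = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):                       # power iteration to stationarity
        mu = mu @ P
    return float(np.sum(mu[:, None] * pi * R))

rng = np.random.default_rng(1)
n_w, n_a = 3, 2
alpha = rng.random((n_w, n_a, n_w)) + 0.1        # strictly positive kernel
alpha /= alpha.sum(axis=2, keepdims=True)
R = rng.random((n_w, n_a))

# Enumerate all deterministic policies, one action index per world state.
best_deterministic = max(
    average_reward_mdp(alpha, R, np.eye(n_a)[list(f)])
    for f in itertools.product(range(n_a), repeat=n_w)
)

# Sample stochastic policies; each row is drawn from a Dirichlet distribution.
samples = rng.dirichlet(np.ones(n_a), size=(200, n_w))   # shape (200, n_w, n_a)
best_sampled = max(average_reward_mdp(alpha, R, pi) for pi in samples)
```

Enumeration is exponential in the number of world states and is used here only because the example is tiny.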
Lemma 6.
Each extreme point of can be written as , where and is an extreme point of .
Proof of Lemma 6.
We can view the map from Equation (8) as taking pairs to pairs . Here the marginal distribution is mapped by the identity function and the conditional distribution by
Consider some for which contains a distribution whose marginal has support . For each let denote the set of actions with transitions that stay in . With a slight abuse of notation let us write and for the corresponding sets of conditional and joint probability distributions. Note that out of only points from are mapped to points in and hence . The set consists of all joint distributions with and . Now, for each conditional there is at least one marginal such that the joint is an element of . Hence
The set is the union of the fibers of all points in . Hence
Let us now consider some extreme point of . Suppose that the marginal of has support . By the previous discussion, we know that is an extreme point of the polytope . Furthermore, is the dimensional intersection of an affine space and , where . This implies that lies at the intersection of facets of . In turn , for all . This shows that , where and is an extreme point of . We can extend this conditional arbitrarily on to obtain a conditional that is an extreme point of . ∎
Now we discuss the minimal stochasticity of optimal stationary policies for POMDPs. A policy $\pi \in \Delta_{\mathcal{S},\mathcal{A}}$ is called $m$-stochastic if it is contained in an $m$-dimensional face of $\Delta_{\mathcal{S},\mathcal{A}}$. This means that at most $|\mathcal{S}| + m$ entries $\pi(a|s)$ are nonzero and, in particular, that $\pi$ is a convex combination of at most $m+1$ deterministic policies. For instance, a deterministic policy is $0$-stochastic and has exactly $|\mathcal{S}|$ nonzero entries. The following result holds both in the average reward and in the discounted reward settings.
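The degree of stochasticity of a given policy can be read off from its support. The sketch below assumes the definition in terms of faces of the policy polytope: since the polytope is a product of simplices, one per sensor state, the smallest face containing a policy has dimension equal to the number of nonzero entries minus the number of sensor states. The function name `stochasticity` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def stochasticity(pi, tol=1e-12):
    """Dimension of the smallest face of the policy polytope containing pi[s, a]."""
    pi = np.asarray(pi, dtype=float)
    support_sizes = (pi > tol).sum(axis=1)   # number of actions used per sensor state
    return int((support_sizes - 1).sum())

deterministic = np.array([[1.0, 0.0], [0.0, 1.0]])
one_random_choice = np.array([[0.5, 0.5], [0.0, 1.0]])
uniform = np.full((2, 4), 0.25)

degrees = [stochasticity(p) for p in (deterministic, one_random_choice, uniform)]
```

For the three policies above the degrees are 0, 1 and 6, matching the count of nonzero entries minus the number of sensor states.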
Theorem 7 (POMDPs).
Consider a POMDP . Let . Then there is a stochastic optimal stationary policy. Furthermore, for any there are such that every optimal stationary policy is at least stochastic.
Proof of Theorem 7.
Here we prove the statement for the average reward setting using the geometric descriptions from Section 3. We cover the discounted setting in Section C using value functions and a policy improvement argument.
Consider the sets and from Equation (6). We can write as a union of Cartesian products of convex sets, as , with . See Proposition 9 for details. In turn, we can write , where each is a convex set of dimension . See Proposition 12 for details.
The objective function is linear over each polytope and is maximized at an extreme point of one of these polytopes. If , then each extreme point of can be written as , where is an extreme point of . To see this, note that the arguments of Lemma 6 still hold when we replace by and by . Each extreme point of lies at a face of of dimension at most . See Proposition 9 for details. Now, since is a linear map, the points in the dimensional faces of have preimages by in dimensional faces of . Thus, there is a maximizer of that is contained in a face of .
The second statement, regarding the optimality of the stochasticity bound, follows from Proposition 23, which computes the optimal stationary policies of a class of POMDPs analytically. ∎
Remark 8.

Our Theorem 7 also has an interpretation for nonergodic systems: Among all pairs of stationary policies and associated stationary world state distributions, the highest value of is attained by a pair where the policy is stochastic. However, this optimal stationary average reward is only equal to (1) for start distributions that converge to .

In a reinforcement learning setting the agent does not know anything a priori about the world state transitions or the observation model, aside from the sets $\mathcal{S}$ and $\mathcal{A}$. In particular, he does not know the set $U$ of ambiguous sensor states (nor its cardinality). Nonetheless, he can build a hypothesis about $U$ on the basis of observed sensor states, actions, and rewards. This can be done using a suitable variant of the Baum-Welch algorithm or inexpensive heuristics, without estimating the full kernels $\alpha$ and $\beta$.
5 Application to defining low dimensional policy models
By Theorem 7, there always exists an optimal stationary policy in a dimensional face of the policy polytope . Instead of optimizing over the entire set , we can optimize over a lower dimensional subset that contains the dimensional faces. In the following we discuss various ways of defining a differentiable policy model with this property.
We denote the set of policies in dimensional faces of the polytope by
Note that each policy in can be written as the convex combination of or fewer deterministic policies. For example, is the set of deterministic policies, and is the entire set of stationary policies.
Conditional exponential families
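An exponential policy family is a set of policies of the form
$$\pi_\theta(a|s) = \frac{\exp(\theta^\top F(s,a))}{\sum_{a'} \exp(\theta^\top F(s,a'))},$$
where $F\colon \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d$ is a vector of sufficient statistics and $\theta \in \mathbb{R}^d$ is a vector of parameters. We can choose $F$ suitably, such that the closure of the exponential family contains all policies in the $m$-dimensional faces of the policy polytope.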
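A conditional exponential family of this kind can be sketched as a softmax over the actions of the inner product between parameters and sufficient statistics. The normalized exponential form, the array layout `F[s, a, :]`, and the function name `exponential_policy` are assumptions made for illustration; the one-hot statistics in the usage lines give the full policy polytope and are not one of the low dimensional models discussed in this section.

```python
import numpy as np

def exponential_policy(theta, F):
    """Policy pi[s, a] proportional to exp(theta . F[s, a, :]), normalized per s."""
    logits = F @ theta                              # shape (n_sensors, n_actions)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

n_s, n_a = 2, 3
# One statistic per pair (s, a): the saturated model, used here only as an example.
F = np.eye(n_s * n_a).reshape(n_s, n_a, n_s * n_a)

uniform = exponential_policy(np.zeros(n_s * n_a), F)

theta = np.zeros(n_s * n_a)
theta[0] = 50.0                                     # push pi(a=0 | s=0) towards one
nearly_deterministic = exponential_policy(theta, F)
```

Deterministic choices are only reached in the limit of large parameters, which is why the text refers to the closure of the exponential family.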
The interaction model is defined by the sufficient statistics
Here we can identify each pair with a length binary vector , . Since we do not need to model the marginal distribution over , we can remove all for which is constant for all . The interaction model is neighborly (Kahle, 2010), meaning that, for it contains in its closure. This results in a policy model of dimension at most . Note that this is only an upper bound, both on and the dimension, and usually a smaller model will be sufficient.
An alternative exponential family is defined by taking , , equal to the vertices of a cyclic polytope. The cyclic polytope is the convex hull of , where , , . This results in a neighborly model. Using this approach yields a policy model of dimension .
Mixtures of deterministic policies
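We can consider policy models of the form
$$\pi(a|s) = \sum_{f} p(f)\, \pi^f(a|s), \quad p \in \mathcal{M},$$
where $\pi^f(a|s) = \delta_{f(s)}(a)$ is the deterministic policy defined by the function $f\colon \mathcal{S} \to \mathcal{A}$ and $\mathcal{M}$ is a model of probability distributions over the set of all such functions. Choosing $\mathcal{M}$ as a neighborly exponential family yields a policy model which contains all policies in the $m$-dimensional faces of the policy polytope and, in fact, all mixtures of $m+1$ deterministic policies. This kind of model was proposed in Ay et al. (2013).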
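A mixture of deterministic policies can be written out explicitly for small systems by enumerating all functions from sensor states to actions. The sketch below assumes a weight vector indexed in the order produced by `itertools.product`; the function name `mixture_policy` and this indexing convention are hypothetical, and the enumeration is exponential in the number of sensor states.

```python
import itertools
import numpy as np

def mixture_policy(weights, n_sensors, n_actions):
    """Policy pi[s, a] = sum_f weights[f] * [f(s) == a] over all maps f: S -> A."""
    functions = list(itertools.product(range(n_actions), repeat=n_sensors))
    if len(weights) != len(functions):
        raise ValueError("need one weight per deterministic policy")
    pi = np.zeros((n_sensors, n_actions))
    for p_f, f in zip(weights, functions):
        for s, a in enumerate(f):
            pi[s, a] += p_f          # f chooses action a on sensor state s
    return pi

# Two sensor states and two actions: functions (0,0), (0,1), (1,0), (1,1).
single = mixture_policy([1.0, 0.0, 0.0, 0.0], 2, 2)
two_component = mixture_policy([0.5, 0.0, 0.0, 0.5], 2, 2)
```

The second policy is uniform on both sensor states although it mixes only two deterministic policies, which shows that different mixtures can represent the same stationary policy.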
Identifying each with a length binary vector, , and using a interaction model with yields a model of dimension .
Alternatively, we can use a cyclic exponential family for , which yields a policy model of dimension . If we are only interested in modeling the deterministic policies, , then this model has dimension two.
Conditional restricted Boltzmann machines
A conditional restricted Boltzmann machine (CRBM) is a model of policies of the form
with parameter , , , , . Here we identify each with a vector , , and each with a vector , . There are theoretical results on CRBMs (Montúfar et al., 2015) showing that they can represent every policy from whenever . A sufficient number of parameters is thus .
Each of these models has advantages and disadvantages. The CRBMs can be sampled very efficiently using a Gibbs sampling approach. The mixture models can be very low dimensional, but may have an intricate geometry. The interaction models are smooth manifolds.
6 Experiments
We run computer experiments to explore the practical utility of our theoretical results. We consider the maze from Example 2. In this example, the set of sensor states with has cardinality two. By Theorem 7, there is a stochastic optimal stationary policy. As a family of policy models we choose the interaction models from Section 5. The number of binary variables is . This results in a sufficient statistics matrix with columns, out of which we keep only the first , one for each pair . For , the resulting model dimension is . The policy polytope has dimension .
We consider the reinforcement learning problem, where the agent does not know in advance. We use stochastic gradient with an implementation of the GPOMDP algorithm (Baxter and Bartlett, 2001) for estimating the gradient. We fix a constant learning rate of , a time window of for each Markov chain gradient and average reward estimation, and perform gradient iterations on a random parameter initialization.
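The gradient estimation step can be sketched as follows. This is a minimal illustration in the spirit of the GPOMDP estimator of Baxter and Bartlett (2001), using a discounted eligibility trace of score functions and a running average of reward times trace. It is not the implementation used in the experiments. The tabular softmax parametrization, the toy system, the trace discount `discount`, the horizon `T`, and the convention of pairing the reward of the current state and action with the updated trace are all assumptions.

```python
import numpy as np

def softmax_rows(theta):
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def gpomdp_estimate(theta, alpha, beta, R, discount=0.9, T=2000, seed=0):
    """Estimate the gradient of the average reward with respect to theta[s, a]."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    n_w, n_a = R.shape
    n_s = beta.shape[1]
    pi = softmax_rows(theta)
    w = rng.integers(n_w)                      # random start state
    trace = np.zeros_like(theta)               # eligibility trace
    grad = np.zeros_like(theta)                # running gradient estimate
    for t in range(T):
        s = rng.choice(n_s, p=beta[w])         # observe a sensor state
        a = rng.choice(n_a, p=pi[s])           # sample an action
        score = np.zeros_like(theta)           # gradient of log pi(a|s)
        score[s] = -pi[s]
        score[s, a] += 1.0
        trace = discount * trace + score
        grad += (R[w, a] * trace - grad) / (t + 1)
        w = rng.choice(n_w, p=alpha[w, a])     # world state transition
    return grad

# Hypothetical toy system with two aliased world states (one sensor state).
alpha = np.zeros((2, 2, 2))
alpha[0, 0, 1] = alpha[0, 1, 0] = 1.0
alpha[1, 1, 0] = alpha[1, 0, 1] = 1.0
beta = np.ones((2, 1))
R = np.array([[1.0, 0.0], [0.0, 1.0]])

grad = gpomdp_estimate(np.array([[1.0, -1.0]]), alpha, beta, R)
```

A gradient ascent step would then update the parameters by a learning rate times `grad`. The trace discount trades bias against variance of the estimate.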
The results are shown in Figure 3. The first column shows the learning curves for , for the first gradient iterations. Shown is actually the average of the learning curves for repetitions of the experiment. The individual curves are indeed all very similar for each fixed . The value shown is the estimated average reward, with a running average shown in bold, for better visibility. The second column shows the final policy. The third column gives a detail of the learning curves and shows the reward averaged over the entire learning process.
The independence model performs very poorly, as it learns a fixed distribution of actions for all sensor states. The next model performs better, but still has very limited expressive power. All the other models have sufficient complexity to learn a (nearly) optimal policy, in principle. However, out of these, the least complex one performs best. This indicates that the least complex model which is able to learn an optimal policy does learn faster. This model has fewer parameters to explore and is less sensitive to the noise in the stochastic gradient.
7 Conclusions
Policy optimization for partially observable Markov decision processes is a challenging problem. Scaling is a serious difficulty in most algorithms and theoretical results on approximate methods are scarce. This paper develops a geometric view on the problem of finding optimal stationary policies. The maximization of the long term expected reward per time step can be regarded as a constrained linear optimization problem with two constraints. The first one is a quadratic constraint that arises from the partial observability of the world state. The second is a linear constraint that arises from the stationarity of the world state distribution. We can decompose the feasible domain into convex pieces, on which the optimization problem is linear. This analysis sheds light on the complexity of stationary policy optimization for POMDPs and reveals avenues for designing learning algorithms.
We show that every POMDP has an optimal stationary policy of limited stochasticity. The necessary level of stochasticity is bounded above by the number of sensor states that are ambiguous about the underlying world state, independently of the specific reward function. This allows us to define low dimensional models which are guaranteed to contain optimal stationary policies. Our experiments show that the proposed dimensionality reduction does indeed allow learning better policies faster. Having fewer parameters, these models are less expensive to train and less sensitive to noise, while at the same time being able to learn the best possible stationary policies.
We would like to acknowledge support from the DFG Priority Program Autonomous Learning (DFG-SPP 1527).
References
 Ay et al. (2013) Nihat Ay, Guido Montúfar, and Johannes Rauh. Selection criteria for neuromanifolds of stochastic dynamics. In Yoko Yamaguchi, editor, Advances in Cognitive Neurodynamics (III), pages 147–154. Springer, 2013.
 Baxter and Bartlett (2001) Jonathan Baxter and Peter L. Bartlett. Infinite-horizon policy-gradient estimation. J. Artif. Int. Res., 15(1):319–350, November 2001. URL http://dl.acm.org/citation.cfm?id=1622845.1622855.
 Bellman (1957) Richard Bellman. Dynamic programming. Princeton University Press, Princeton, NJ, 1957.
 Chrisman (1992) Lonnie Chrisman. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 183–188. AAAI Press, 1992.
 Hansen (1998) Eric Anton Hansen. Finite-memory Control of Partially Observable Systems. PhD thesis, 1998.
 Howard (1960) Ronald A. Howard. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA, 1960.
 Jaakkola et al. (1995) Tommi Jaakkola, Satinder P. Singh, and Michael I. Jordan. Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in Neural Information Processing Systems 7, pages 345–352. MIT Press, 1995.
 Kahle (2010) Thomas Kahle. Neighborliness of marginal polytopes. Beiträge zur Algebra und Geometrie, 51(1):45–56, 2010. URL http://eudml.org/doc/224152.
 Littman et al. (1995) Michael L. Littman, Anthony R. Cassandra, and Leslie Pack Kaelbling. Learning policies for partially observable environments: Scaling up. In International Conference on Machine Learning (ICML), pages 362–370. Morgan Kaufmann, 1995.
 McCallum (1996) Andrew Kachites McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, 1996.
 Montúfar et al. (2015) Guido Montúfar, Nihat Ay, and Keyan Ghazi-Zahedi. Geometry and expressive power of conditional restricted Boltzmann machines. JMLR, 16:2405–2436, Dec 2015.
 Parr and Russell (1995) Ronald Parr and Stuart Russell. Approximating optimal policies for partially observable stochastic domains. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, volume 2 of IJCAI’95, pages 1088–1094, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
 Pfeifer and Bongard (2006) Rolf Pfeifer and Josh C. Bongard. How the Body Shapes the Way We Think: A New View of Intelligence. The MIT Press (Bradford Books), Cambridge, MA, 2006.
 Ross (1983) Sheldon M. Ross. Introduction to Stochastic Dynamic Programming: Probability and Mathematical. Academic Press, Inc., Orlando, FL, USA, 1983.
 Singh et al. (1994) Satinder P. Singh, Tommi Jaakkola, and Michael I. Jordan. Learning without state-estimation in partially observable Markovian decision processes. In ICML, pages 284–292, 1994.
 Sondik (1978) Edward J. Sondik. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2):282–304, 1978. URL http://www.jstor.org/stable/169635.
 Sutton and Barto (1998) Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
 Weis (2010) Stephan Weis. Exponential Families with Incompatible Statistics and Their Entropy Distance. PhD thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, 2010.
Appendix A The representability constraint
Here we investigate the set of representable policies in the underlying MDP; that is, the set of kernels of the form . This set is the image of the linear map
We are interested in the properties of this set, depending on the observation kernel .
Consider first the special case of a deterministic kernel , defined by , for some function . Then
where is the set of world states that maps to and is the set of elements of that consist of one repeated probability distribution. This set can be written as a union of Cartesian products,
where is the set of sensor states that can result from several world states. For instance, when is the identity function we have .
Proposition 9.
Consider a measurement and the map . Let be the sensor states that can be obtained from several world states. The set can be written as , where each is a Cartesian product of convex sets, , convex, and each vertex of lies in a face of of dimension at most .
Proof of Proposition 9.
We use as index set the set of policies . We can write
This proves the first part of the claim. For the second part, note that all are equal but for addition of a linear projection of . ∎
Example 10.
Let , , . Let map and to , and to , with probability one. Written as a table this is
The policy polytope is the square with vertices
The polytope is the square with vertices
and can be written as a union of Cartesian products of convex sets, illustrated in Figure 4,
As mentioned in Section 3, the set of joint distributions that are compatible with the representable conditionals , may not be convex. In the following we describe large convex subsets of , depending on the properties of . We use the following definitions.
Definition 11.

Given a set of distributions and a set of kernels , let
denote the set of joint distributions over world states and actions, with world state marginals in and conditional distributions in .

For any let
denote the set of world state distributions with support in .

Given a subset and a set of kernels , let
denote the set of restrictions of elements of to inputs from .
The following proposition states that a set of Markov kernels which is a Cartesian product of convex sets, with one factor for each input, corresponds to a convex set of joint probability distributions. Furthermore, if the considered input distributions assign zero probability to some of the inputs, then the convex factorization property is only needed for the restriction to the positiveprobability inputs.
Proposition 12.
Let . Let be a convex set. Let satisfy , where is a convex set for all . Then is convex.
Proof of Proposition 12.
We need to show that, given any two distributions and in , and any , the convex combination lies in . This is the case if and only if for some and some with . We have
This shows that , where and , , for all . Hence and . ∎
The set of Markov kernels is a Cartesian product of convex sets . The set of joint distributions is a simplex, which is a convex set.
A general set is not necessarily convex, let alone a Cartesian product of convex sets. However, it can always be written as a union of Cartesian products of convex sets of the form
For instance, one can always use , , . Proposition 12, together with this observation, implies that given any and a convex set , the set of joint distributions is a union of convex sets , . The situation is illustrated in Example 13.
Example 13.
Consider the settings from Example 10. The set is the union of following sets:
Each is a polytope with vertices
Appendix B The stationarity constraint
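In the objective function, the marginal distribution over world states is the stationary distribution of the world state transition kernel, and not some arbitrary distribution over world states. The coupling of transition kernels and marginal distributions can be described in terms of the polytope $\Xi_{\mathcal{W}}$ of joint distributions in $\Delta_{\mathcal{W}\times\mathcal{W}}$ with equal first and second marginals. This is given by
$$\Xi_{\mathcal{W}} = \Big\{ p \in \Delta_{\mathcal{W}\times\mathcal{W}} : \sum_{w'} p(w,w') = \sum_{w'} p(w',w) \text{ for all } w \in \mathcal{W} \Big\}.$$
The second marginal is the result of applying the conditional $p(w'|w)$ as a Markov kernel to the first marginal $p(w)$; that is, $\sum_{w} p(w)\, p(w'|w)$. Hence equality of both marginals means that the marginal is a stationary distribution of the transition $p(w'|w)$.
The polytope $\Xi_{\mathcal{W}}$ has been studied by Weis (2010) under the name Kirchhoff polytope. The vertices of $\Xi_{\mathcal{W}}$ are the joint distributions of the following form. For any nonempty subset $\mathcal{W}' \subseteq \mathcal{W}$ and a cyclic permutation $\sigma\colon \mathcal{W}' \to \mathcal{W}'$, there is a vertex defined by
$$c_{\mathcal{W}',\sigma}(w,w') = \begin{cases} 1/|\mathcal{W}'|, & \text{if } w \in \mathcal{W}' \text{ and } w' = \sigma(w), \\ 0, & \text{otherwise.} \end{cases}$$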