Session 12: Markov Decision Processes

Uncertainty - MDP, L.E. Sucar

Markov Decision Processes
– Sequential decision processes
– Markov decision processes (MDP)
– Value iteration method
– Partially observable Markov decision processes (POMDP)
– Extensions (abstraction, partitioning)
– Applications

Sequential Decision Problems
– A decision problem involving a set of decisions whose outcome (utility) is known only at the end.
– A series of states and associated decisions unfolds over time.
– There is uncertainty about the outcomes of the actions (MDP), and possibly also about the states (POMDP).

Example: mobile robot
(Figure: the robot's environment, with the initial cell marked "Start".)

Transition Model
– There is normally uncertainty about the outcome of a decision (action).
– This uncertainty is modeled as the probability of reaching state $j$ given that the agent is in state $i$ and performs action $a$: $M^a_{ij}$

Transition Model (robot example)
– Probability of moving in the desired direction: $P_{ij} = 0.8$
– Probability of moving in each of the 2 neighboring directions: $P_{ik} = 0.1$
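
The 0.8 / 0.1 / 0.1 model above can be sketched as a function that returns a distribution over next cells. This is only an illustration: the 4x3 grid size, the coordinate convention, the names `MOVES`, `NEIGHBORS` and `transition`, and the rule that a blocked move leaves the robot in place are all assumptions, not part of the slides.

```python
# Hypothetical grid model: cells are (column, row), columns 1..4, rows 1..3.
MOVES = {"up": (0, 1), "right": (1, 0), "down": (0, -1), "left": (-1, 0)}
# The two directions perpendicular to each intended direction.
NEIGHBORS = {"up": ("left", "right"), "down": ("left", "right"),
             "left": ("up", "down"), "right": ("up", "down")}

def transition(state, action, width=4, height=3, obstacles=frozenset()):
    """Return {next_state: probability} for taking `action` in `state`."""
    def step(direction):
        dx, dy = MOVES[direction]
        nxt = (state[0] + dx, state[1] + dy)
        inside = 1 <= nxt[0] <= width and 1 <= nxt[1] <= height
        # Assumption: a move into a wall or obstacle leaves the robot in place.
        return nxt if inside and nxt not in obstacles else state

    dist = {}
    side_a, side_b = NEIGHBORS[action]
    for direction, p in ((action, 0.8), (side_a, 0.1), (side_b, 0.1)):
        target = step(direction)
        dist[target] = dist.get(target, 0.0) + p
    return dist

dist = transition((3, 3), "right")
```

Here `dist` assigns 0.8 to the cell on the right, 0.1 to the cell below, and 0.1 to staying put, because the upward move is blocked by the grid boundary.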

Sensor Model
– The agent can normally sense the environment to observe which state it is in.
– There are two main cases:
  – It directly observes its current state: Markov decision process (MDP).
  – It is uncertain about its current state: partially observable Markov decision process (POMDP).

MDP (figure)

POMDP (figure)

Optimal Policy
– Given the transition model and the sensor model, the goal is to find the optimal policy, the one that maximizes expected utility.
– A policy specifies the action to execute given the state (or the probability distribution over states).
– Transition probabilities are assumed to depend only on the current state, which makes these processes Markovian.

Example of a Policy
(Figure: the initial cell is marked "Start".)

Controller Based on an MDP
(Figure: block diagram with the labels MDP solution, controller, system, model, events, state, action and policy.)

Markov Decision Processes
– The problem of obtaining the optimal policy in an observable environment is an MDP.
– The classical method for solving these problems is known as "value iteration".
– The basic idea is to compute the utility of every possible state and use these utilities to select the optimal action in each state.
– Other solution methods are "policy iteration" and linear programming (the problem is transformed into a linear optimization problem).

Markov Decision Processes
Formally, a (discrete) MDP is defined by:
– A finite set of states, $S$
– A finite set of possible actions, $A$
– A transition model, which specifies the probability of moving to a state $s'$ given the current state $s$ and the action $a$: $P(s' \mid s, a)$
– A reward function, which specifies the "value" of executing action $a$ in state $s$: $r(s, a)$

Utility
– The utility of a state $i$ depends on the sequence of actions taken from that state according to the established policy $\pi$.
– In principle, it can be obtained as the expected utility over all possible sequences of actions $H_i$, weighting the utility of each one by its probability:
  $U(i) = EU(H_i(\pi)) = \sum_h P(H_i(\pi)) \, U_h(H_i(\pi))$

Utility
– If the utility is separable, it can be computed from the utility of the current state and the utility of the subsequent states.
– The simplest form is an additive function:
  $U[s_0, s_1, \ldots, s_n] = R(s_0) + U[s_1, \ldots, s_n]$
  where $R$ is known as the reward function.

Dynamic Programming
– Given the separability condition, the utility of a state can be obtained iteratively by maximizing the utility of the next state:
  $U(i) = R(i) + \max_a \sum_j P(s_j \mid s_i, a) \, U(j)$
– The optimal policy is given by the action that yields the highest utility:
  $\pi^*(i) = \arg\max_a \sum_j P(s_j \mid s_i, a) \, U(j)$

Dynamic Programming
If there is a finite number of steps ($n$), the optimal policy can be computed efficiently using dynamic programming (DP):
– The utility of the states at step $n-1$ is obtained from the utility of the terminal states, and the best action is determined.
– The utility of the states at step $n-2$ is obtained from step $n-1$, and so on.
– At the end, the optimal policy is obtained (the best action for each state).
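
The backward sweep described above can be sketched in a few lines. The two-state MDP, the dictionary layout `P[s][a] -> {next_state: probability}` and the function name `backward_induction` are assumptions made for illustration; only the update rule comes from the slides.

```python
def backward_induction(states, actions, P, R, terminal_utility, n_steps):
    """Finite-horizon DP: U(i) = R(i) + max_a sum_j P(j | i, a) U(j).

    Returns the utilities at the first step and one policy per step,
    ordered from the first decision to the last.
    """
    U = dict(terminal_utility)
    policies = []
    for _ in range(n_steps):
        new_U, policy = {}, {}
        for s in states:
            # Expected utility of the next state for each action.
            q = {a: sum(p * U[s2] for s2, p in P[s][a].items()) for a in actions}
            policy[s] = max(q, key=q.get)
            new_U[s] = R[s] + q[policy[s]]
        U = new_U
        policies.append(policy)
    policies.reverse()  # the last policy computed belongs to the first step
    return U, policies

# Hypothetical toy problem: state "b" is rewarding, "go" switches state w.p. 0.9.
states = ["a", "b"]
actions = ["stay", "go"]
P = {"a": {"stay": {"a": 1.0}, "go": {"b": 0.9, "a": 0.1}},
     "b": {"stay": {"b": 1.0}, "go": {"a": 0.9, "b": 0.1}}}
R = {"a": 0.0, "b": 1.0}
U, policies = backward_induction(states, actions, P, R, terminal_utility=R, n_steps=2)
```

With two steps to go, the sketch chooses "go" in state "a" and "stay" in state "b".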

DP – robot example
If the utility function is defined as
  $U_h = \text{value of the final state} - \frac{1}{25} (\text{number of steps})$
then the reward function is:
– $R = +1$ or $R = -1$ for the terminal states
– $R = -1/25$ for all other states

Reward
(Figure: grid showing the rewards $-1/25$ and $+1$.)

DP – robot example
Assuming the goal is reached in $n$ steps:
  $U(a = \text{right}) = 0.8 \cdot 1 - 0.1 \cdot \frac{1}{25} - 0.1 \cdot \frac{1}{25} = 0.792$
  $U(a = \text{down}) = 0.1 \cdot 1 - 0.8 \cdot \frac{1}{25} - 0.1 \cdot \frac{1}{25} = 0.064$
  $U(a = \text{left}) = -0.1 \cdot \frac{1}{25} - 0.8 \cdot \frac{1}{25} - 0.1 \cdot \frac{1}{25} = -0.04$
  $U(s_{33}) = -\frac{1}{25} + \max[0.792, 0.064, -0.04] = 0.752$; $\pi^*(s_{33}) = \text{right}$
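
The arithmetic on this slide can be checked with a short script. The variable names are illustrative, and each action is described only by the (probability, utility) pairs that appear in the slide's own sums.

```python
STEP_REWARD = -1 / 25   # reward of a non-terminal cell
GOAL = 1.0              # reward of the +1 terminal cell

# (probability, utility of the cell reached) for each action, as on the slide.
outcomes = {
    "right": [(0.8, GOAL), (0.1, STEP_REWARD), (0.1, STEP_REWARD)],
    "down":  [(0.1, GOAL), (0.8, STEP_REWARD), (0.1, STEP_REWARD)],
    "left":  [(0.1, STEP_REWARD), (0.8, STEP_REWARD), (0.1, STEP_REWARD)],
}

# Expected utility of the next state under each action.
expected = {a: sum(p * u for p, u in pairs) for a, pairs in outcomes.items()}
best_action = max(expected, key=expected.get)
# Utility of the current cell: its own reward plus the best expected utility.
utility = STEP_REWARD + expected[best_action]

for action, value in expected.items():
    print(f"U(a={action}) = {value:.3f}")
print(f"U = {utility:.3f}, best action = {best_action}")
```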

Value
(Figure: grid showing state values, including 0.752, +1 and 0.422.)

Finite vs. Infinite Horizon
– Problems with a finite number of steps are known as finite-horizon MDPs.
– Problems that may have an infinite number of steps are known as infinite-horizon MDPs.
– Many problems, such as the robot example, are infinite-horizon and cannot be solved directly by DP.

Solution
The main methods for solving MDPs are:
– Value iteration (Bellman, 1957)
– Policy iteration (Howard, 1960)
– Linear programming (Puterman, 1994)

MDPs
Value function (Bellman equation):
  $V^*(s) = \max_a \{ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \}$
Policy:
  $\pi^*(s) = \arg\max_a \{ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \}$
Solution:
– Value iteration
– Policy iteration

Solution: Value Function
A policy for an MDP is a mapping $\pi : S \to A$ (one action per state). Given the policy, the value for a finite horizon is $V_n^\pi : S \to \mathbb{R}$:
  $V_n^\pi(i) = R(i, \pi(i)) + \sum_j P(j \mid i, \pi(i)) \, V_{n-1}^\pi(j)$
For an infinite horizon, a discount factor $0 < \gamma < 1$ is generally used:
  $V^\pi(i) = R(i, \pi(i)) + \gamma \sum_j P(j \mid i, \pi(i)) \, V^\pi(j)$
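
The discounted value of a fixed policy can be approximated by applying the last equation repeatedly until the values stop changing. The sketch below assumes a hypothetical two-state MDP and a dictionary layout (`P[s][a]` for transitions, `R[s][a]` for rewards); the function name `evaluate_policy` and the stopping tolerance are also assumptions.

```python
def evaluate_policy(states, policy, P, R, gamma=0.9, tol=1e-10):
    """Iterate V(i) = R(i, pi(i)) + gamma * sum_j P(j | i, pi(i)) V(j)."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {
            s: R[s][policy[s]]
            + gamma * sum(p * V[s2] for s2, p in P[s][policy[s]].items())
            for s in states
        }
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            return V

# Hypothetical example: from "a" the policy moves to "b"; in "b" it stays
# and collects a reward of 1 at every step.
states = ["a", "b"]
policy = {"a": "go", "b": "stay"}
P = {"a": {"go": {"b": 1.0}}, "b": {"stay": {"b": 1.0}}}
R = {"a": {"go": 0.0}, "b": {"stay": 1.0}}
V = evaluate_policy(states, policy, P, R, gamma=0.9)
```

For this example the fixed point can be checked by hand: $V(b) = 1 / (1 - 0.9) = 10$ and $V(a) = 0.9 \cdot 10 = 9$.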

Solution: Optimal Policy
Solving an MDP yields an optimal policy, that is, the policy that maximizes the Bellman equation:
  $\pi^*(i) = \arg\max_a [ R(i, a) + \gamma \sum_j P(j \mid i, a) \, V^*(j) ]$

Value Iteration
– In the infinite-horizon case, the utility of the states, and with it the optimal policy, can be obtained by an iterative method.
– At each iteration ($t+1$), the utility of each state is estimated from the values of the previous iteration ($t$):
  $U_{t+1}(i) = R(i) + \max_a \sum_j P(s_j \mid s_i, a) \, U_t(j)$
– As $t \to \infty$, the utility values converge to a stable value.

Value Iteration: algorithm
– Initialize: $U_t = U_{t+1} = R$
– Repeat:
  $U_t = U_{t+1}$
  $U_{t+1}(i) = R(i) + \max_a \sum_j P(s_j \mid s_i, a) \, U_t(j)$
– Until: $|U_t - U_{t+1}| < \epsilon$

Value Iteration: how many times should the iteration be repeated?
– Normally, the number of iterations needed to obtain the optimal policy is smaller than the number required for the utilities to converge.
– In practice, the number of iterations is relatively small.

Value Iteration
– To avoid very large (infinite) reward values, a discount factor $0 < \gamma < 1$ is normally applied to the value of the subsequent states.
– The iterative computation of the utility with the discount factor is then:
  $U_{t+1}(i) = R(i) + \gamma \max_a \sum_j P(s_j \mid s_i, a) \, U_t(j)$
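
A minimal sketch of discounted value iteration, following the initialize / repeat / until structure of the algorithm slide. The two-state MDP, the dictionary layout `P[s][a] -> {next_state: probability}` and the function name are assumptions made for illustration.

```python
def value_iteration(states, actions, P, R, gamma=0.9, epsilon=1e-6):
    """Repeat U_{t+1}(i) = R(i) + gamma * max_a sum_j P(j | i, a) U_t(j)."""
    def expected(U, s, a):
        return sum(p * U[s2] for s2, p in P[s][a].items())

    U_next = dict(R)                      # Initialize: U_t = U_{t+1} = R
    while True:
        U = dict(U_next)                  # U_t = U_{t+1}
        for s in states:
            U_next[s] = R[s] + gamma * max(expected(U, s, a) for a in actions)
        # Until: the largest change over all states is below epsilon.
        if max(abs(U[s] - U_next[s]) for s in states) < epsilon:
            break
    # Greedy policy with respect to the converged utilities.
    policy = {s: max(actions, key=lambda a: expected(U_next, s, a)) for s in states}
    return U_next, policy

# Hypothetical deterministic toy problem: only state "b" is rewarding.
states = ["a", "b"]
actions = ["stay", "go"]
P = {"a": {"stay": {"a": 1.0}, "go": {"b": 1.0}},
     "b": {"stay": {"b": 1.0}, "go": {"a": 1.0}}}
R = {"a": 0.0, "b": 1.0}
U, policy = value_iteration(states, actions, P, R)
```

The utilities approach $U(b) = 10$ and $U(a) = 9$, and the greedy policy moves to "b" and stays there.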

Example: state utilities
(Figure: grid labeled with the state utilities 0.812, 0.762, 0.868, 0.912, 0.660, 0.611, 0.705, 0.338 and 0.655; the initial cell is marked "Start".)

Example: optimal policy
(Figure: the initial cell is marked "Start".)

Policy Iteration
– Starting from some (random) policy, the policy is improved by finding, for each state, an action with a better value than the current action.
– Knowledge of the problem can be used to define the initial policy.
– The process ends when no further improvement is possible.
– It normally converges in fewer iterations than value iteration, but each iteration is more costly.
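
The improvement loop described above might be sketched as follows. Policy evaluation is approximated here by a fixed number of sweeps, which is a simplification chosen for brevity; the toy MDP, the names and the dictionary layout are assumptions.

```python
def q_value(s, a, V, P, R, gamma):
    """Value of taking action a in state s and then following the values V."""
    return R[s] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())

def policy_iteration(states, actions, P, R, gamma=0.9, eval_sweeps=200):
    policy = {s: actions[0] for s in states}   # arbitrary initial policy
    while True:
        # Policy evaluation: repeated sweeps approximate V for the fixed policy.
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {s: q_value(s, policy[s], V, P, R, gamma) for s in states}
        # Policy improvement: switch to an action with a strictly better value.
        improved = False
        for s in states:
            best = max(actions, key=lambda a: q_value(s, a, V, P, R, gamma))
            current = q_value(s, policy[s], V, P, R, gamma)
            if q_value(s, best, V, P, R, gamma) > current + 1e-9:
                policy[s] = best
                improved = True
        if not improved:                        # no further improvement possible
            return policy, V

# Hypothetical deterministic toy problem: only state "b" is rewarding.
states = ["a", "b"]
actions = ["stay", "go"]
P = {"a": {"stay": {"a": 1.0}, "go": {"b": 1.0}},
     "b": {"stay": {"b": 1.0}, "go": {"a": 1.0}}}
R = {"a": 0.0, "b": 1.0}
policy, V = policy_iteration(states, actions, P, R)
```

Starting from "stay" everywhere, a single improvement step is enough here: state "a" switches to "go" and the next pass finds nothing left to improve.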

Example: virtual robot (figure)

Optimal policy (figure)

Another configuration (figure)

Value function (figure)

POMDP
– In many real problems the agent's state cannot be observed exactly, so the problem is a POMDP.
– In addition to the elements of an MDP, a POMDP includes:
  – An observation function that specifies the probability of the observations given the state, $P(O \mid S)$
  – An initial probability distribution over the states, $P(S)$

POMDP
– The exact approach to solving a POMDP considers the probability distribution over the states and determines the optimal decisions based on it.
– To do so, a POMDP can be treated as an MDP whose states correspond to probability distributions over the original states.
– The problem is that the state space becomes infinite and the exact solution is very complex.
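
The slides do not spell out how the distribution over states is maintained. A common choice, assumed here, is the standard Bayes filter: predict with the transition model, then reweight by $P(O \mid S)$ and normalize. The two-cell example, the observation names and the function name are hypothetical.

```python
def update_belief(belief, action, observation, P, O):
    """Bayes-filter update of a belief state.

    belief: {s: probability}; P[s][a] -> {s2: probability};
    O[s2][o] -> P(o | s2).
    """
    # Prediction step: push the belief through the transition model.
    predicted = {}
    for s, b in belief.items():
        for s2, p in P[s][action].items():
            predicted[s2] = predicted.get(s2, 0.0) + b * p
    # Correction step: weight each state by the observation likelihood.
    unnormalized = {s2: O[s2].get(observation, 0.0) * p
                    for s2, p in predicted.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s2: v / total for s2, v in unnormalized.items()}

# Hypothetical example: two cells that look alike to a noisy sensor.
P = {"left": {"stay": {"left": 1.0}}, "right": {"stay": {"right": 1.0}}}
O = {"left": {"wall": 0.8, "open": 0.2}, "right": {"wall": 0.3, "open": 0.7}}
belief = update_belief({"left": 0.5, "right": 0.5}, "stay", "wall", P, O)
```

Each call yields a new distribution, so the set of reachable "states" of the equivalent MDP is continuous, which is the source of the complexity mentioned above.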

POMDP: approximate solutions
– Assume the agent is in the most probable state; the problem becomes an MDP that can be solved by value iteration.
– Consider a finite number of steps and model the problem as a dynamic decision network; the quality of the approximation depends on the number of states "seen" ahead (lookahead).

POMDP Example
– The robot detects its position with sonars.
– The readings have errors and noise, and limited range.
– Certain cells look very similar (1,2 and 3,2).

Extensions
– Factored representations
– Abstract (qualitative) representations
– Hierarchical models (serial / parallel)
(The following slides are based on a tutorial given at Iberamia with Alberto Reyes.)

Factored Representations
– Extensional representations of a system's states are those in which each state is explicitly named.
– In AI research, intensional representations are more common.
– An intensional representation is one in which states or sets of states are described using sets of multi-valued features.
– The MDP formalism as used in AI has recently adopted this representation.

Factored MDPs
– Boutilier, Dearden and Goldszmidt (1995) exploit action descriptions and domain structure through state features to represent states as sets of factors (features).
– A factored state is any possible instantiation of a small set of variables defining a problem domain.
– They represent the state transition function as a 2-stage DBN, with which they exploit independence among state variables.
– Conditional probability tables (CPTs), which are the state transition distributions, are represented as decision trees.

Factored MDPs
– Each value $x_i'$ of a variable in $X'$ is associated with a probability distribution $P_T(x_i' \mid \mathrm{Parents}_T(x_i'))$.
– $P_T(X' \mid X) = \prod_i P_T(x_i' \mid u_i)$, where $u_i$ is the value in $X$ of the variables in $\mathrm{Parents}_T(x_i')$.
– There is one DBN per action.
(Figure: 2-stage DBN linking the variables $x_1, \ldots, x_5$ of $X$ at time $t$ to $x_1', \ldots, x_5'$ of $X'$ at time $t+1$.)
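
The product form $P_T(X' \mid X) = \prod_i P_T(x_i' \mid u_i)$ can be illustrated with a tiny model for a single action. The two Boolean variables, their parent sets and all the numbers below are made up for the sketch.

```python
# Hypothetical DBN for one action: parents (in X) of each next-step variable.
PARENTS = {"x1": ("x1",), "x2": ("x1", "x2")}

# CPT[var][parent values] -> P(var' = True | parents); numbers are made up.
CPT = {
    "x1": {(True,): 0.9, (False,): 0.2},
    "x2": {(True, True): 0.8, (True, False): 0.5,
           (False, True): 0.3, (False, False): 0.1},
}

def factored_transition_prob(state, next_state):
    """P(X' | X) as the product of the per-variable conditional probabilities."""
    prob = 1.0
    for var, parents in PARENTS.items():
        u = tuple(state[p] for p in parents)   # values of the parents in X
        p_true = CPT[var][u]
        prob *= p_true if next_state[var] else 1.0 - p_true
    return prob

state = {"x1": True, "x2": False}
p = factored_transition_prob(state, {"x1": True, "x2": True})
```

The two small tables hold 6 numbers in total, whereas a flat transition matrix over the 4 joint states would hold 16 entries; that gap is what the factored form exploits as the number of variables grows.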

Factored MDPs
(Figure: transition graph G linking the actions A1–A4 and the variables X1–X4 at time t to X'1–X'4 at time t+1, together with a reward node that depends on p and q.)
Reward table:
  p    q    R
  T    F    1.0
  T    T    0.9
  F    F    0.1
  F    T    0.0

Algebraic Decision Diagrams
– The SPUDD algorithm (Hoey, 1999) uses algebraic decision diagrams (ADDs) to represent state transitions, utilities, policies and rewards.
– One of its contributions is exploiting the fact that many instantiations of the state variables map to the same value.

Relational Representations
– State aggregation: grouping states with similar properties (utility, features).
– [Morales 2003] uses state aggregation to group states that share the same set of relations, in order to structure and abstract state spaces.
– The value function is approximated over this abstract state space in an RL context.

R-states
– Relational variables are first-order relations.
– States are defined by the possible instantiations of these relational variables (r-states), e.g.:
  relation(agent,goal,south) AND relation(agent,obst,north-west) AND not(relation(agent,border))

R-states
– An r-state can cover a large number of states.
– For $N$ relations there are in principle $2^N$ r-states; in practice, only a small fraction of them is possible.
– The user needs to define the relations.

R-actions
– Actions are represented in terms of first-order relations (r-actions).
– Syntax:
  – pre-conditions (set of relations)
  – g-action (generalized action)
  – post-conditions (set of relations)
– When several primitive actions are applicable, one is chosen at random.
– The user needs to define the r-actions.

R-actions
An example of an r-action in the grid domain:
  r-action(1,agent,goal,Move,State) :-
      relation(agent,goal,Pos),
      not relation(agent,obst,Pos),
      not relation(agent,border),
      move(Pos,[Move|_],State).
If an r-action is applicable to a particular instance of an r-state, it must be applicable to all the instances of that r-state.

R-actions
(Figure, where r-acc1: close and r-acc2: getAway.)
Different movements are possible depending on the current r-state.

Qualitative MDPs
– Qualitative MDPs are an alternative way to improve the efficiency and accuracy of the MDP formalism on real-world problems.
– In a qualitative MDP, states are qualitative change vectors (q-states) and actions are STRIPS-like operators that constrain the set of actions applicable in a particular q-state (r-actions).

Learning QCFs
For each pair of examples, form a qualitative change vector. From [Suc & Bratko 2002].
Examples generated by Pres = 2 Temp / Vol:
  Temp     Vol     Pres
  315.00   56.00   11.25
  315.00   62.00   10.16
  330.00   50.00   13.20
  300.00   50.00   12.00
  300.00   55.00   10.90
Qualitative change vector 1: q_Temp = neg, q_Vol = neg, q_Pres = pos
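
Forming a qualitative change vector from a pair of examples amounts to taking the sign of each difference. The sketch below uses the five examples from the slide; the function names, the "zero" label for unchanged values and the choice of unordered pairs are assumptions.

```python
from itertools import combinations

NAMES = ("Temp", "Vol", "Pres")
ROWS = [
    (315.00, 56.00, 11.25),
    (315.00, 62.00, 10.16),
    (330.00, 50.00, 13.20),
    (300.00, 50.00, 12.00),
    (300.00, 55.00, 10.90),
]

def sign(delta, tol=1e-9):
    """Qualitative direction of a numeric change."""
    if delta > tol:
        return "pos"
    if delta < -tol:
        return "neg"
    return "zero"

def change_vector(first, second):
    """Qualitative change of every variable when going from first to second."""
    return {f"q_{name}": sign(b - a) for name, a, b in zip(NAMES, first, second)}

# One vector per unordered pair of examples.
vectors = [change_vector(a, b) for a, b in combinations(ROWS, 2)]
example = change_vector(ROWS[0], ROWS[3])
```

Going from the first example to the fourth gives q_Temp = neg, q_Vol = neg, q_Pres = pos, which matches the vector shown on the slide.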

Q-States
– State variables q are qualitative change vectors. Example:
  q1 = pos(current_mw, demand_mw)
  q2 = neg(turbine_vel, synchronism_vel)
– A qualitative state is an instance of the set of state variables. Example:
  Q = pos(current_mw, demand_mw) AND neg(turbine_vel, synchronism_vel)

Q-States
– $x_1, x_2, x_3, y_1, y_2, y_3$ are reference values over the variables $x$ and $y$.
– Q1 = pos(x, x2), ~pos(x, x3), pos(y, y1), ~pos(y, y3).
– Q2 = pos(x, x1), ~pos(x, x2), pos(y, y1), ~pos(y, y3).
(Figure: the x–y plane partitioned by the reference values, with regions Q1 and Q2.)

Q-State refinements
Additional improvements can be obtained by refining the initial state partition.
(Figure: an initial partition Q1–Q5 over the reference values x1 and x2, followed by refined partitions with regions such as Q1-p1, Q1-p2, Q3-p1, Q3-p2-1, Q4-p1 and Q4-p2.)

R-Actions
– Additional computational savings can be obtained during the MDP solution by using relational actions (r-actions).
– An r-action is a STRIPS-like operator whose preconditions are qualitative state features.

R-Actions
  raction(pred(term1,term2)) :- q1, q2, ..., qn.
where
– pred = {pos, neg, zero}
– term1 = action ground variable
– term2 = reference value of term1
– q1, q2, ..., qn = qualitative features

R-Actions
An r-action constrains the number of actions for each state. Example from the power domain:
  raction(pos(vel,vel_ref)) :-
      pos(current_mw, demand_mw),
      neg(current_vel, synchronism_vel).

R-Actions
– State values and the optimal policy are obtained with the value iteration algorithm.
– In each iteration ($t+1$), the state utility is computed from the values of the previous iteration ($t$), maximizing only over a constrained set of actions:
  $U_{t+1}(i) = R(i) + \max_a \sum_j P(s_j \mid s_i, a) \, U_t(j)$
– This method avoids explicit enumeration of the action space.
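
One way to read the constrained update is that the maximization in each state runs only over the actions whose preconditions hold there. The sketch below shows a single sweep under that reading; the two qualitative states, the `applicable` table and all the numbers are hypothetical.

```python
def constrained_backup(U, states, applicable, P, R, gamma=1.0):
    """One sweep of U_{t+1}(i) = R(i) + gamma * max_a sum_j P(j | i, a) U_t(j),
    where a ranges only over applicable[i]."""
    return {
        s: R[s] + gamma * max(
            sum(p * U[s2] for s2, p in P[s][a].items()) for a in applicable[s]
        )
        for s in states
    }

# Hypothetical qualitative states and the actions allowed in each of them.
states = ["low", "ok"]
applicable = {"low": ["increase"], "ok": ["hold", "increase"]}
P = {
    "low": {"increase": {"ok": 0.9, "low": 0.1}},
    "ok": {"hold": {"ok": 1.0}, "increase": {"low": 1.0}},
}
R = {"low": -1.0, "ok": 1.0}

U0 = dict(R)
U1 = constrained_backup(U0, states, applicable, P, R)
```

Actions that are not applicable in a state never need a transition entry for it, so `P["low"]` only lists "increase".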

Transition Model
– $q_i$: variable at time $t$; $q_i'$: variable $q_i$ at time $t+1$
– $G$: transition graph (DBN), with $\mathrm{nodes}(G) = \{ \text{R-A}_1, \ldots, \text{R-A}_g, q_1, \ldots, q_n, q_1', \ldots, q_n' \}$
– $\mathrm{Parents}(q_i') \subseteq Q \cup A_G$
– Variables are change relations; actions are r-actions.
[Dearden & Boutilier 97]
(Figure: DBN G linking the r-actions R-A1–R-A4 and the variables q1–q4 of Q at time t to q'1–q'4 of Q' at time t+1.)

Reward Function
– The reward function is represented as a decision tree or an influence diagram.
– The difference now is that the random variables are qualitative, and the function applies to abstract states with these factors (attributes).
(Figure: reward node with parents q1 and q2.)

Partitions
The other alternative for simplifying the solution of an MDP is to "partition" the problem into subproblems, so that each one can be solved separately and the solutions then "integrated". Two main approaches:
– Serial: the task is decomposed into subtasks, each of which is a subgoal that must be achieved to reach the global goal (e.g., hierarchical RL).
– Parallel: the problem is decomposed into subproblems that can be solved "independently" and executed in "parallel" (e.g., parallel MDPs).

Learning an MDP
– Learning the model:
  – State partition by reward (ID3)
  – Learning a transition model (K2)
  – Learning r-actions (C4.5)
– Reinforcement learning

Learning the Reward Function
– The reward function can be approximated by using decision-tree learning algorithms (C4.5) on continuous data.
– The nodes of the resulting decision tree are the qualitative variables needed to represent a state compactly.

The Power Plant Domain
– States were obtained from simulation under different power generation conditions:
  – Minimum load (10 MW)
  – Medium load (20 MW)
  – Maximum load (30 MW)
– The set of actions consisted of those observed in the classical control system.
– Undesirable states were characterized from load disturbances (negative reward).
– Desirable states were those that occurred during normal operation (positive reward).

Qualitative state partition by reward
(Figure: two partitions of the steam pressure (kPa) vs. steam flow (kg/s) plane into desired and undesired regions, one for Generation <= 4804.18 and one for Generation > 4804.18; the axes are marked at pressures 4102, 3826, 3447 and 3445 and at flows 40.6, 40.7, 40.8 and 46.5.)

Learning the Transition Model
– Given the set of qualitative variables, we then take advantage of factored representations to produce DBN-based transition models.
– We obtained DBNs representing probabilistic state transitions for the actions increase/decrease fwv position, decrease msv position and the null action, using structural and parametric learning algorithms (Elvira).
– The training data sets given to Elvira are attribute-value tables in which the set of variables is $X \cup X'$, one table per action.

Reinforcement Learning
Reinforcement learning (RL) is:
– "the problem faced by an agent through trial-and-error interactions with a dynamic environment" [Kaelbling, Littman, Moore, 1995].
– "learning what to do - how to map situations to actions – so as to maximize a numerical reward signal.. but instead must discover which actions yield the most reward by trying them" [Sutton, 98].

Reinforcement Learning
– RL addresses the question of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals [Mitchell, 97].
– Example: when training an agent to play a game, the trainer might provide a positive reward when the game is won, a negative reward when it is lost, and zero in all other states.

Reinforcement Learning (figure)

What's the difference between DP and RL?
– In both types of problem, we want to sequentially control a system to maximize a reward.
– To apply dynamic programming methods we need to assume:
  – The system dynamics and expected rewards are known.
  – We can observe the system state perfectly.
  – The state space is not too large, so the problem is computationally tractable.
– RL approaches operate without these assumptions.

Applications
– Inventory management
– Maintenance of equipment and roads
– Control of communication systems
– Modeling of biological processes
– Planning in mobile robotics
– Map building / localization
– Industrial process control
– Aircraft control
– …

Application Example: Control of a Power Plant Using MDPs

Steam generator and drum (figure)

Control space (figure)

Preliminary results (figure)

Control Architecture
(Figure: block diagram with the blocks Plant, PID and MDP, and the signals set point, new set point and adjustment.)

Application example: Power Plant Domain
(Figure: power plant diagram labeled with water flow, steam flow, steam pressure, d, msv and fwv.)

MDP Ground Elements
– States: any possible instantiation of the process variables: drum pressure ($P_d$), feedwater flow ($F_{fw}$), steam flow ($F_{ms}$), power generation (normal, abnormal), load rejection (false, true).
– Actions: open/close fwv, msv.
– Reward: good for states under optimal operation, bad for the remaining operation states.
– Transition model: causal relations among variables over time (process dynamics).

Reward Function
This function rewards states matching the optimal operation curve and penalizes the remaining ones.

Transition Model
(Figure: 2-stage DBN for the r-action neg(msv, msv_ref), linking the qualitative variables (fms, fms_ref1), (fms, fms_ref2), (ffw, ffw_ref), (d, d_ref), (pd, pd_ref1), (pd, pd_ref2), (pd, pd_ref3) and (g, g_ref) to their primed counterparts in Q' at the next time step.)
Conditional probability table shown in the figure, indexed by the qualitative values 0, + and -:
       0      +      -
  0    0.33   0.13   0.01
  +    0.33   0.82   0.00
  -    0.33   0.05   0.99

r-actions in Prolog-like format (figure)

Operator Assistant Architecture
The power plant operator assistant was implemented in Sun Microsystems Java2.
(Figure: architecture with the components Data Base, Power Plant Simulator, Operator Interface, Factored MDP, Operator, Process and Operator Assistant.)

Experimental Results
The transition model was successfully induced by using the K2 and EM algorithms (Elvira).

Experimental Results
– The value iteration algorithm was used to compute the optimal policy; it converged in 12 iterations with a discount factor $\gamma = 0.9$.
– The experiments showed that important computational savings are possible by doing dynamic programming without explicit enumeration of the state space.

Experimental Results
State space: $|S| = 6^1 \times 8^1 \times 2^3 = 384$ states
Variable discretization:
  Variable   fms   ffw   pd   g   d
  # values   6     2     8    2   2
Parameters enumerated (CPT dimensions) for the actions a0–a3:
– Traditional MDP: 147456 per action, 589824 in total; compilation time 5.6 days
– Factored MDP: 175, 204, … per action, 758 in total; compilation time 2 mins
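
The flat-model figures on this slide follow directly from the discretization, as the short check below shows. The variable names in the dictionary are taken from the transition-model slide, and their pairing with the value counts is an assumption; the factored total of 758 is copied from the slide rather than derived.

```python
from math import prod

# Number of discrete values per state variable (pairing assumed).
values_per_variable = {"fms": 6, "ffw": 2, "pd": 8, "g": 2, "d": 2}
n_actions = 4  # a0..a3

n_states = prod(values_per_variable.values())
# A flat transition matrix has one entry per (state, next state) pair.
flat_params_per_action = n_states ** 2
flat_params_total = n_actions * flat_params_per_action

factored_params_total = 758  # reported on the slide, not recomputed here
ratio = flat_params_total / factored_params_total

print(n_states, flat_params_per_action, flat_params_total)
print(f"flat / factored parameter ratio: {ratio:.0f}")
```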

Experimental Results
– In many cases, the control commands and the MDP commands seem to be similar.
– The difference is that the MDP brings the plant onto the operation curve faster.

Task Coordination
A complex robotic task, such as message delivery, requires several capabilities:
– Path planning
– Obstacle avoidance
– Localization
– Mapping
– Person finding
– Speech synthesis and recognition
– Gesture generation
– …

Task Coordination
– Each task can be implemented fairly independently as a software module.
– Challenge: how to integrate and coordinate these modules so that the robot performs the "best" actions in each situation.
– Our solution: MS-MDPs, Multiply Sectioned Markov Decision Processes, which can be specified and solved independently, and executed concurrently to select the best actions according to the optimal policy.

MS-MDPs
We partition the global task into a number of subtasks, so that each subtask is assigned to an MDP and each one is solved independently:
– The actions of each MDP are independent and can be performed concurrently.
– There is no conflict between the actions of different MDPs.
– All the MDPs have a common goal (reward).

MS-MDPs
– We solve each MDP independently (off-line) and execute the optimal policy of each one concurrently (on-line).
– The MDPs are coordinated by a common state vector; each one only needs to consider the state variables relevant to its subtask, which reduces the complexity of the model.
– Each MDP only considers its own actions, which implies a further reduction in complexity.
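
The on-line side of this scheme can be sketched as several policies reading one shared state vector and each contributing its own action. The hand-written rules below are only stand-ins for policies that would come from solving each MDP off-line; the function names, the choice of variables each one reads and the rules themselves are hypothetical.

```python
# Hypothetical stand-ins for the off-line solutions of each MDP.
def navigator_policy(state):
    # Reads only the variables relevant to navigation.
    if state["battery_low"]:
        return "go home"
    if state["uncertain_location"]:
        return "localize"
    return "navigate" if state["has_message"] else "explore"

def dialogue_policy(state):
    # Reads only the variables relevant to dialogue.
    if state["person_close"] and state["has_message"]:
        return "give message"
    return "ask" if state["person_close"] else "mute"

def gesture_policy(state):
    return "happy" if state["person_close"] else "neutral"

POLICIES = {"N": navigator_policy, "D": dialogue_policy, "G": gesture_policy}

def select_actions(state):
    """One action per MDP, all chosen from the same common state vector."""
    return {name: policy(state) for name, policy in POLICIES.items()}

state = {"battery_low": False, "uncertain_location": False,
         "has_message": False, "person_close": True}
actions = select_actions(state)
```

Because each policy returns an action from its own action set, the three results can be executed together without one MDP having to enumerate the joint action space.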

MS-MDPs
Advantages:
– Reduction in complexity
– Easier to build or learn the models
– Concurrent actions
– Modularity
Current limitations:
– No guarantee of global optimality
– Does not consider action conflicts

Homer
– RWI B-14 robot
– Bumblebee stereo vision camera
– LCD display: animated face
– "Head": pan-tilt unit
– Omnidirectional microphone
– 4 on-board computers, interconnected by a 100 Mbps LAN
– Wireless communication to external computers at 10 Mbps

Homer: "head" (figure)

Homer: Software Architecture (figure)

Message Delivery
– Homer explores the environment looking for a sender.
– A sender is detected by speech or vision.
– Homer asks for the names of the receiver and the sender, and for the message.
– Homer goes to the receiver's expected location (using a model of the environment: a map).
– When the potential receiver is detected, Homer confirms and delivers the message.
– If not, it continues looking for the receiver, or it aborts and looks for a new message.
– At the same time, Homer keeps itself localized in the map and will go home if its battery is low.

Message Delivery: subtasks
(Figure: the Navigator (N), Dialogue manager (D) and Gesture generator (G) MDPs connected to the modules Localization, User Loc., Speech Gen., Navigation and Gesture Gen.)

MDPs for message delivery
– Navigator: explore, navigate, localize, get new goal, go home, wait
– Dialogue: ask, confirm, give message
– Gesture: neutral, happy, sad, angry

State variables
– Has message
– Receiver name
– Sender name
– At location
– Has location
– Location unreachable
– Receiver unreachable
– Battery low
– Uncertain location
– Voice heard
– Person close
– Called Homer
– Yes/No

Experiments
1. Person approached: D: ask; G: smile
2. Message received: N: navigate; D: mute; G: neutral
3. Position uncertain: N: localize

Experiments
4. Deliver message: N: wait; D: deliver; G: smile
5. Battery low, go home: N: go home

Demo 5: Homer's video

Demo: MATLAB tool

References
– [Russell & Norvig], Chapter 17.
– H. A. Taha, "Investigación de Operaciones", Alfaomega, 1991, Chapter 14.
– M. Puterman, "Markov Decision Processes", Wiley, 1994.

Bibliography: classic papers
– J. Blythe. Decision-theoretic planning. AI Magazine, pages 37–54, 1999.
– C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.
– D. Suc and I. Bratko. Qualitative reverse engineering. In Proceedings of the 19th International Conference on Machine Learning, 2002.
– E. Morales. Scaling up reinforcement learning with a relational representation. In Proc. of the Workshop on Adaptability in Multi-agent Systems (AORC-2003), pages 15–26, 2003.
– J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier. SPUDD: Stochastic planning using decision diagrams. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 279–288, 1999.
– K. Forbus. Qualitative process theory. Artificial Intelligence, 24, 1984.
– R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. 1998.

Bibliography: our papers
– P. Elinas, E. Sucar, A. Reyes, and J. Hoey. A decision theoretic approach to task coordination in social robots. IEEE International Workshop on Robots and Human Interactive Communications (RO-MAN 04), Japan, 2004. Demo videos.
– A. Reyes, P. H. Ibarguengoytia, and L. E. Sucar. Power Plant Operator Assistant: An Industrial Application of Factored MDPs. Mexican International Conference on Artificial Intelligence (MICAI-04), Mexico City, April 2004.
– A. Reyes, L. E. Sucar, E. Morales, and P. H. Ibarguengoytia. Abstract MDPs using Qualitative Change Predicates: An Application in Power Generation. Planning under Uncertainty in Real-World Problems Workshop, Neural Information Processing Systems (NIPS-03), Vancouver, Canada, Winter 2003. Poster.
– A. Reyes, L. E. Sucar, and P. Ibarguengoytia. Power Plant Operator Assistant. Bayesian Modeling Applications Workshop, 19th Conference on Uncertainty in Artificial Intelligence (UAI-03), Acapulco, Mexico, August 2003.
– A. Reyes, M. A. Delgadillo, and P. H. Ibarguengoytia. An Intelligent Assistant for Obtaining the Optimal Policy during Operation Transients in a HRSG. 13th Annual Joint ISA POWID/EPRI Controls and Instrumentation Conference, Williamsburg, Virginia, June 2003.
– P. H. Ibargüengoytia and A. Reyes. Continuous Planning for the Operation of Power Plants. Memorias del Encuentro Nacional de Computación (ENC 2001), Aguascalientes, Mexico, 2001.

Software tools
MDPs:
– Markov Decision Process (MDP) Toolbox v1.0 for MATLAB (INRA): http://www.inra.fr/bia/T/MDPtoolbox/
– Markov Decision Process (MDP) Toolbox for Matlab (K. Murphy): http://www.ai.mit.edu/~murphyk/Software/MDP/mdp.html
– SPUDD: http://www.cs.ubc.ca/spider/staubin/Spudd/
Bayesian networks:
– Elvira: http://leo.ugr.es/~elvira/
– Hugin: http://www.hugin.com/
Learning:
– ADEX: http://doc.mor.itesm.mx/ADEX/cgi-bin/sign_in2.ksh

Activities
– MDP exercise in MATLAB (see page)
– Final project
