Last update: 12/19/2012, 8:40 a.m.
Since its inception as a description of the laws of communication in the 1940s, information theory has been applied in fields far beyond its original domain; in particular, there have been long-standing attempts to use it to describe intelligent agents. Attneave (1954) and Barlow (1961) already suspected that neural information processing might follow information-theoretic principles, and Laughlin (1998) demonstrated that information processing comes at a high metabolic cost. This implies an evolutionary pressure pushing organismic information processing towards the optimal levels of data throughput predicted by information theory. The picture becomes particularly interesting when one considers the whole perception-action cycle, including feedback. In the last decade, significant progress has been made in this direction, linking information theory and control. The ensuing insights make it possible to address a wide range of fundamental questions pertaining not only to the perception-action cycle but to general issues of intelligence, and to solve classical problems of AI and machine learning in novel ways.
The workshop will present recent work in AI, machine learning, control, and biologically plausible cognitive modeling that builds on information theory.
Daniel Y. Little (University of California, Berkeley, USA)
Evangelos A. Theodorou (University of Washington, USA)
Naftali Tishby (Hebrew University, Israel)
Daniel Polani (University of Hertfordshire, UK)
Tobias Jung (University of Liège, Belgium)
We gratefully acknowledge our sponsors.
7:30 - 7:45 | Welcome & Opening Remarks
7:45 - 8:30 | Talk
8:30 - 9:00 | Spotlight Presentations
9:00 - 9:30 | Coffee Break + Poster Session
9:30 - 10:30 | Talks
15:30 - 16:15 | Invited Talk
16:15 - 17:00 | Invited Talk
17:00 - 17:30 | Coffee Break
17:30 - 18:10 | Talks
18:10 - 19:00 | Talk
19:00 - 24:00 | Discussion, Closing Remarks, Dinner
"An information-theoretic model of learning-driven exploration"
Daniel Y. Little, University of California, Berkeley
Abstract:
Psychologists have long held that curiosity constitutes the
primary drive of explorative behavior. Such intrinsic desire to learn
does not require reinforcement signals from extrinsic motivators.
Previous computational modeling of exploration, however, has largely
focused on its role in the acquisition of external rewards often
presented in terms of an exploration-exploitation dichotomy. While
these studies have increased our understanding of control for reward
optimization, the investigation of learning as a primary objective of
behavior may provide fresh insights into the principles underlying
human and animal exploration. To this end, we will describe an
information-theoretic approach to studying learning for learning's
sake in embodied agents. In simple worlds without external reward
signals, we demonstrate how an agent can estimate the expected
information gains of an action and use this estimate, called
predictive information gain (PIG), to optimize behavior towards
learning. We discuss the similarities and differences of our approach
with other information-theoretic models of behavior. Finally we
present recent results suggesting how a combination of two
information-theoretic models may explain the interaction between
investigation and play, the two major components of human exploration
identified by the psychology literature.
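As a toy illustration of the PIG idea (a sketch only: a plain empirical-count belief stands in for the talk's actual agent model), the expected information gain of probing an outcome distribution once more can be computed as an expected KL divergence:

```python
import numpy as np

def kl(p, q):
    """KL divergence in bits between discrete distributions p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def pig(counts):
    """Predictive information gain of one more sample, for an agent whose
    belief is the empirical distribution of `counts`: the expected KL
    divergence from the updated belief to the current one, taken under
    the agent's own predictive distribution."""
    counts = np.asarray(counts, float)
    belief = counts / counts.sum()
    gain = 0.0
    for outcome, p_outcome in enumerate(belief):
        updated = counts.copy()
        updated[outcome] += 1
        gain += p_outcome * kl(updated / updated.sum(), belief)
    return gain

# The more an agent already knows about an outcome distribution,
# the less it expects to learn from probing it again.
assert pig([1, 1]) > pig([100, 100])
```

An agent that ranks actions by this quantity probes whatever part of the world its model is most uncertain about.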
"Information Theoretic Views of Path Integral Control and Applications To Robotics"
Evangelos A. Theodorou, University of Washington
Abstract:
Recently reinforcement learning has moved towards combining
classical techniques from stochastic optimal control and dynamic
programming with learning techniques from statistical estimation theory and
the connection between stochastic differential and partial differential
equations via the Feynman-Kac lemma. The resulting framework transforms
nonlinear stochastic optimal control into an approximation of a path
integral. In this talk, I will present connections of Path Integral (PI)
and Kullback-Leibler (KL) control, as developed in the machine
learning and robotics communities, with work in control theory on
logarithmic transformations of diffusion processes. The analysis provides
an information theoretic view of PI stochastic optimal control based on
the duality between Free Energy and Relative Entropy. Comparisons
between the information theoretic and Dynamic Programming point of view
in terms of generalizations and extensions will be discussed. Finally, I
will present algorithmic developments on iterative path integral control
and show applications to robotics as well as connections to free energy
based policy gradients.
Download extended abstract (PDF)
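The flavor of the iterative path-integral update can be sketched in a few lines (a minimal illustration of the exponential reweighting implied by the Free Energy / Relative Entropy duality; the function name and scalar-control setting are assumptions, not the talk's implementation):

```python
import numpy as np

def path_integral_update(u_nominal, noise, costs, lam=1.0):
    """One iteration of sample-based path-integral control: noisy rollouts
    are reweighted by exp(-cost/lambda) (a soft-min over trajectories) and
    the nominal control is shifted toward the weighted average perturbation."""
    costs = np.asarray(costs, float)
    w = np.exp(-(costs - costs.min()) / lam)   # subtract min cost for stability
    w /= w.sum()
    return np.asarray(u_nominal, float) + w @ np.asarray(noise, float)

# Two rollouts: the perturbation +1 was cheap, -1 was expensive,
# so the update moves the control towards +1.
u = path_integral_update([0.0], noise=[[1.0], [-1.0]], costs=[0.0, 10.0], lam=0.1)
assert u[0] > 0.9
```

As the temperature lambda shrinks, the update approaches a hard minimum over the sampled trajectories.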
"Information flow in perception-action cycles and the emergence of hierarchies and reverse-hierarchies"
Naftali Tishby, Hebrew University
Abstract:
Starting from Large Deviation Theory (Sanov's theorem) we can obtain the connection between
the reward rate and the control and sensing information capacities, for systems in "metabolic
information equilibrium" with stationary stochastic environments (Tishby & Polani, 2010). This
result can be considered as an equilibrium characterisation for systems that achieved a certain
value through interactions with the environment, but have no new learning (e.g. "stupid" cleaning
robots). The effect of learning can be considered by revisiting the sub-extensivity of predictive
information in stationary environments (Bialek, Nemenman & Tishby 2002) and combining it with the
requirement of computational tractability of planning. We argue that planning is possible if the
information flow terms remain proportional to the reward terms on the one hand, but still bounded
by the sub-extensive predictive information on the other hand.
I will discuss the possible implications of this new computational principle to the emergence of
hierarchical representations and discounting of rewards in our generalised Bellman equation.
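One common way to write such an information-constrained recursion is the free-energy (KL-control) form below; this is a standard formulation given as background, not necessarily the exact generalised Bellman equation of the talk:

```latex
% Free-energy Bellman recursion: rewards are traded off against the
% information cost of deviating from a default policy \pi_0
F(s) \;=\; \frac{1}{\beta}\,\log \sum_{a} \pi_0(a \mid s)\,
      \exp\!\Big(\beta\,\big[r(s,a) + \sum_{s'} p(s' \mid s,a)\,F(s')\big]\Big)
```

For large \(\beta\) this recovers the ordinary Bellman optimality equation; for small \(\beta\) the policy stays close to the default, bounding the information flow.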
"Information Constraints as Drivers for the Emergence of Cognitive Architectures and Concepts"
Daniel Polani, University of Hertfordshire
Abstract:
Ashby's Law of Requisite Variety (1956) and its extensions by
Touchette and Lloyd (2000, 2004) indicate that Shannon information
constraints govern the potential organisation and administration of
any cognitive task. In addition, there is increasing evidence that
the trade-offs implied by these constraints are indeed exploited by
biological organisms close to the limit in adaptive
(quasi-)equilibrium.
The talk will briefly discuss the above principles and then present
several scenarios which illustrate some consequences of these
hypotheses, selected depending on time and interest.
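A minimal numeric illustration of the Touchette-Lloyd bound (a toy binary-channel example constructed here, not taken from the talk):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# A fair random bit D disturbs the system. The controller observes D through
# a binary symmetric channel with error rate eps and applies the correction
# it saw, so the final outcome is wrong with probability eps.
eps = 0.1
h_uncontrolled = entropy([0.5, 0.5])                      # outcome entropy, no control
h_controlled = entropy([1 - eps, eps])                    # residual entropy with control
sensor_info = h_uncontrolled - entropy([1 - eps, eps])    # I(D; sensor) = 1 - H(eps)

# Touchette & Lloyd: the achievable reduction in outcome entropy is
# bounded by the information the controller acquires about the disturbance.
reduction = h_uncontrolled - h_controlled
assert reduction <= sensor_info + 1e-9
```

In this example the simple copy-and-correct strategy meets the bound with equality; noisier or mismatched strategies fall strictly below it.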
"Higher-Order Predictive Information for Learning an Infinite Stream of Episodes"
Byoung-Tak Zhang, Seoul National University
Abstract:
We consider the problem of lifelong learning from an indefinite stream of
temporal episodes, i.e. a time series consisting of episodes, where the
number of the episodes is potentially infinite and the length of each
episode varies. What kind of objective function should the lifelong
learner use to balance short-term and long-term performance? How should
the learner optimize its model complexity when the statistics of the
episodes change over time? Maximization of the expected future reward, such
as a value function used in reinforcement learning, might be useful if we
could define rewards for a prespecified goal. For learning an indefinite
stream of episodes, we find mutual information-based measures from
information theory, such as predictive information and empowerment,
suitable. Predictive information is, however, typically approximated by
restricting the time horizons to a single time step. Though this is exact
under the Markov assumption, i.e. the probability of a state depends only
on the probability of the previous state, and still can generate
explorative behavior, the predictive power can be improved by increasing
the order of temporal dependency. Here we extend the first-order predictive
information to the *k*th-order predictive information for lifelong learning
from a continuous stream of time-series episodes. This higher-order
predictive information can be efficiently approximated by an importance
sampling-based Monte Carlo method.
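A small sketch of why the order k matters, assuming a naive plug-in estimator and a made-up noisy XOR process (the talk's importance-sampling estimator is not reproduced here):

```python
import numpy as np
from collections import Counter

def plugin_mi(pairs):
    """Plug-in estimate, in bits, of the mutual information between the
    two components of the given (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum(c / n * np.log2(c * n / (pa[a] * pb[b]))
               for (a, b), c in joint.items())

def kth_order_pi(seq, k):
    """Predictive information between the k-step past and the next symbol."""
    return plugin_mi([(tuple(seq[t - k:t]), seq[t]) for t in range(k, len(seq))])

# Noisy second-order process: the next bit is the XOR of the previous two,
# except that 10% of the time it is replaced by a fresh random bit.
rng = np.random.default_rng(0)
seq = [0, 1]
for _ in range(20000):
    seq.append(seq[-1] ^ seq[-2] if rng.random() < 0.9 else int(rng.integers(2)))

# One step of past reveals almost nothing; two steps reveal a lot.
assert kth_order_pi(seq, 2) > kth_order_pi(seq, 1) + 0.3
```

For this process the first-order predictive information is essentially zero, so any agent restricted to a one-step horizon would see no structure at all.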
"Regulating the information in spikes: a useful bias"
David Balduzzi, ETH Zurich
Abstract:
The bias/variance tradeoff is fundamental to learning: increasing a model's complexity can
improve its fit on training data, but potentially worsens performance on future samples. Remarkably, however,
the human brain effortlessly handles a wide range of complex pattern recognition tasks. On the basis of these
conflicting observations, it has been argued that useful biases in the form of "generic mechanisms for representation"
must be hardwired into cortex (Geman et al.).
I describe a useful bias, taking the form of a constraint on the information embedded in spiking outputs,
that encourages cooperative learning. The constraint is both biologically plausible and rigorously justified.
"Information-Maximizing Local Spatial Scale Selection in Early Visual Processing"
Sander M. Bohte, CWI, Amsterdam
H. Steven Scholte, University of Amsterdam
Sennay Ghebreab, University of Amsterdam
Abstract:
From an information theoretic point of view, the optimal amount of spatial
pooling in optical sensors is determined by local contrast and mean object
intensity. In other words, the optimal scale of a filter is determined by
environmental conditions. As neural filters in early visual processing span
many different spatial scales, the question is whether the brain uses
optimal filter-scale selection. In a model of early visual processing, we
derive the local filter scale that maximizes information, based on early work
by Snyder et al (1977). We show that such information maximizing scale
selection produces a neural response distribution that is as predictive for
EEG responses as the best current heuristics for local scale control. We
furthermore show that a simple neural network can quickly learn such scale
selection. Taking predictability of EEG-responses as model evidence, this
finding suggests that the brain may hierarchically pool simple features so
as to maximize information transfer given uncertainty due to local contrast
statistics.
"Relating information theoretic principles in learning to structure in human cultural behavior"
Tessa Verhoef, University of Amsterdam
Abstract:
Many human behaviors are rooted in culture. Cultural traditions such
as language, music, dance and art are built on systems that are often
acquired by social learning and that have been transmitted from
generation to generation. Cultural evolution has been studied at
length with the use of both artificial and human learners and a key
finding from this work is that structure in transmitted systems is
shaped by the cognitive biases of their users. In studies investigating
structures that emerge from cultural evolution in experiments with
humans, compressible and predictable systems appear to be a prevalent
result. Findings from cultural evolution research may therefore
provide additional sources of evidence about information-theoretic
biases in cognition. As an example, data will be shown from an
experiment in which artificial whistled languages (produced with slide
whistles) are transmitted. These languages evolved in such a way that
the set of basic sound primitives was reduced and these primitives
were more extensively reused and combined in a predictable way,
yielding more compressible systems. Such an efficient combinatorial
structure is one of the basic features of linguistic systems, but is
also present in artistic systems such as music and dance. Presumably
these systems exhibit this type of efficient structure because they
are the result of cultural evolution and reflect human compression
biases.
"Continuous-time recursive Bayesian updating in networks of stochastic spiking neurons"
Ben Moran
Nicolas Della Penna, Australian National University
Abstract:
Neural systems operate under uncertainty, and to behave adaptively must
update their beliefs with information from their surroundings. This
entails maintaining a probability distribution over possible states, and
updating this distribution as sensory data arrive. Optimal updates are
given by Bayes' theorem, but it is useful to consider what kinds of
network could support this computation. One such formulation arises
from exploring the formal connection between Bayes' rule and the
replicator equation, a model of biological evolution. We can identify
the composition of species within a population with the prior
distribution, and "evolutionary fitness" with log likelihood. This
analogy is mathematically precise, and holds also in the case of
continuous time [Harper 2010]. The continuous replicator dynamic is a
Lotka-Volterra system, so it is possible to construct a stochastic
spiking network of linear-exponential-Poisson neurons with mean rates
following the same dynamic [Cardanobile & Rotter 2011]. The replicator
dynamic describes only iterated Bayesian inference with no transition
model, but we can implement dynamic state filtering by adapting a
generalization, the replicator-mutator equation. This suggests expanding
the model to incorporate additional linear connections which act as the
generator matrix of a state transition Markov process.
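The Bayes/replicator correspondence described above is easy to verify numerically in discrete time (a minimal sketch):

```python
import numpy as np

def bayes_update(prior, likelihood):
    """Posterior over hypotheses after observing data with these likelihoods."""
    post = prior * likelihood
    return post / post.sum()

def replicator_step(freqs, fitness):
    """Discrete replicator dynamic: each type grows in proportion to its
    fitness relative to the population mean fitness."""
    return freqs * fitness / (freqs @ fitness)

# Identify species frequencies with the prior and fitness with the likelihood:
# one replicator step is exactly one application of Bayes' rule.
prior = np.array([0.5, 0.3, 0.2])
likelihood = np.array([0.9, 0.5, 0.1])
assert np.allclose(bayes_update(prior, likelihood),
                   replicator_step(prior, likelihood))
```

The continuous-time version (identifying fitness with log likelihood) is the form the spiking-network construction builds on.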
"Kernel Information Bottleneck"
Nori Jacoby, Hebrew University
Tali Tishby, Hebrew University
Abstract:
The Information Bottleneck (IB) method was introduced as a principled
approach to extracting efficient representations of one set of variables
with respect to another from empirical data, thus extending the classical
notion of minimal sufficient statistics. The method was proposed as a
general computational principle for information processing in the brain and
has been used in many machine learning and neuroscience applications. The
original algorithm for solving the problem was based on the Arimoto-Blahut
alternating projection algorithm, but was not guaranteed to converge to a
global optimum, which jeopardized the practicality of the approach. One
exception was the multivariate Gaussian case, for which the IB was shown to
have an efficient globally converging algorithm (GIB) that extended
Canonical Correlation Analysis (CCA). The main advantage over CCA was that
it provided a continuous optimal tradeoff between the minimality and
sufficiency of the representation (described by the information curve),
hence allowing for optimal multi-scale analysis of the data using simple
spectral methods.
Here we extend the Gaussian solution of the Information Bottleneck (GIB) to
a much wider family of distributions using the kernel trick, and make it
practical for essentially any empirical data. Our main theoretical result
is a bound, proved using information geometry, that links the true
information curve to the one obtained by our approach. We
illustrate the algorithm on real data, and discuss some of its potential
new applications.
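For background, the IB variational problem referred to above is commonly written as:

```latex
% Information Bottleneck: compress X into a representation T while
% preserving information about Y; sweeping \beta traces the information curve
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta\, I(T;Y)
```

Each value of the trade-off parameter \(\beta\) picks out one point on the information curve between minimality and sufficiency.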
"Adaptive Coding of Actions and Observations"
Pedro A. Ortega, MPI Tuebingen
Daniel A. Braun, MPI Tuebingen
Abstract:
The application of expected utility theory to construct adaptive
agents is both computationally intractable and statistically
questionable. To overcome these difficulties, agents need the
ability to delay the choice of the optimal policy to a later stage
when they have learned more about the environment. How
should agents do this optimally? An information-theoretic
answer to this question is given by the Bayesian control rule - the
solution to the adaptive coding problem when there are not only
observations but also actions. This paper reviews the central
ideas behind the Bayesian control rule.
Download extended abstract (PDF)
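In the special case of a Bernoulli bandit, the Bayesian control rule reduces to posterior (Thompson) sampling; a minimal sketch, with made-up arm probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)
true_p = [0.3, 0.7]                       # unknown Bernoulli arms (made up here)
alpha, beta = np.ones(2), np.ones(2)      # Beta posterior for each arm
pulls = np.zeros(2)

for _ in range(2000):
    # Bayesian control rule: sample an environment from the posterior and
    # act optimally for it, deferring commitment to a single policy.
    theta = rng.beta(alpha, beta)
    a = int(np.argmax(theta))
    reward = rng.random() < true_p[a]
    alpha[a] += reward
    beta[a] += 1 - reward
    pulls[a] += 1

# Play concentrates on the better arm as the posterior sharpens.
assert pulls[1] > pulls[0]
```

Sampling the policy rather than choosing the single most probable one is what keeps the agent exploring while its beliefs are still uncertain.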
"Active inference with embodied cognitive limitations"
Nisheeth Srivastava, University of Minnesota
Paul R. Schrater, University of Minnesota
Abstract:
Scientists trying to find control-theoretic descriptions of the behavior of
biological organisms in choice tasks have gradually begun to turn away from
basic optimal control formulations of the problem to one of active inference,
where the agent both decides which actions to choose, and crucially, which
pieces of information to process while interacting with the environment. In this
talk, we will describe a new decision theory that uses a rational optimality
criterion grounded in embodied limitations of biological agents - trying to
minimize the metabolic costs of decision-making. This theory builds upon a model
for learning relative preferences without utility computations that we proposed
in a companion paper in the NIPS main conference. Agents
in our proposed framework do not experience numerical rewards as outcomes;
they themselves decide how to represent information about the world as they
navigate it. Simulated agents utilizing our model endogenously replicate a
number of deviations from the behavior predicted by simple probability
matching that are systematically observed in human subjects.
"The value of information in online learning: A study of partial monitoring problems"
Gabor Bartok, ETH Zurich
David Pal, University of Alberta
Csaba Szepesvari, University of Alberta
Abstract:
In online learning, a learner makes decisions on a turn-by-turn basis.
After making her decision, the learner suffers some loss depending on
her action and some (possibly random) unknown process running in the
background. Then, the learner receives some feedback and the next turn
begins. The goal of the learner is to minimize her cumulative loss. The
performance is measured by the so-called regret: the excess cumulative
loss of the learner compared to that of the best fixed action in hindsight.
How quickly an agent can learn depends on the quality of feedback
information. While online learning is well understood under the special
cases of "full-information" and "bandit" feedback, other feedback
structures have not been thoroughly investigated. In our work, we study
the problem of partial monitoring, a general framework for online
learning with arbitrary feedback structure. We examine the natural
question of how the feedback structure determines the "hardness" of a
game. What regret rate is achievable for different problems? What
learner strategies are able to achieve the best possible regret? Is
there an algorithm that, given a loss and a feedback function as input,
approaches the worst case regret of the best strategy? In our work, we
answer these and other related questions. We give a full
characterization of finite partial monitoring problems based on the
growth rate of their minimax regret. Furthermore, we present algorithms
that achieve near-optimal regret rate for every partial monitoring problem.
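The regret measure described above can be made concrete in a few lines (a generic sketch, not tied to any particular partial-monitoring feedback structure):

```python
import numpy as np

def regret(losses, actions):
    """Cumulative regret: the learner's total loss minus that of the best
    fixed action in hindsight. losses[t, a] is the loss of action a in round t."""
    losses = np.asarray(losses, float)
    learner_loss = losses[np.arange(len(actions)), actions].sum()
    best_fixed = losses.sum(axis=0).min()
    return learner_loss - best_fixed

rounds = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 0.0]])
# The learner plays actions 0, 1, 1 for total loss 2; the best fixed
# action (1) would have incurred only 1.
assert regret(rounds, [0, 1, 1]) == 1.0
```

In partial monitoring the learner never observes this loss matrix directly; the classification result concerns how fast this quantity can be forced to grow given only the feedback structure.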
"Active Sensing as Bayes-Optimal Sequential Decision-Making"
Sheeraz Ahmad, UC San Diego
Angela J. Yu, UC San Diego
Abstract:
Active sensing, or the way interactive agents use self-motion to focus
limited sensing resources on task-relevant environmental features, is an
important problem for both machine learning and cognitive science. Here, we
present a Bayes-optimal inference and control framework for active sensing,
which minimizes a cost function that explicitly takes into account
behavioral costs such as response delay, error, and effort. Unlike
previously proposed algorithms that optimize abstract statistical
objectives such as expected entropy reduction [Butko & Movellan, 2010] or
one-step look-ahead accuracy [Najemnik & Geisler, 2005], this model is
goal-directed and context-sensitive, and capable of yielding fine temporal
dynamics such as fixation duration and switch times. We use simulations to
illustrate some scenarios in which context-sensitivity is particularly
useful. To address the computational complexity of the optimal
algorithm, we also present two value iteration algorithms that learn
approximations to the value function using either fixed radial basis
functions or a nonparametric Gaussian process, both of which achieve great
reduction in computational complexity while retaining performance
comparable to the optimal algorithm.
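A much-simplified sketch of cost-sensitive sequential sensing (a myopic stopping rule with made-up parameters, far simpler than the Bayes-optimal policy of the talk, but showing how behavioral costs shape response times):

```python
import numpy as np

def active_sense(observations, p1=0.7, p0=0.3, c_sample=0.01, c_error=1.0):
    """Sequential Bayesian evidence accumulation with explicit behavioral
    costs: keep sampling while the expected error cost of deciding now
    exceeds the cost of one more observation."""
    belief = 0.5                                  # P(hypothesis 1)
    for t, x in enumerate(observations, 1):
        lik1 = p1 if x else 1 - p1
        lik0 = p0 if x else 1 - p0
        belief = belief * lik1 / (belief * lik1 + (1 - belief) * lik0)
        if min(belief, 1 - belief) * c_error < c_sample:
            return (1 if belief > 0.5 else 0), t  # decision and response time
    return (1 if belief > 0.5 else 0), len(observations)

decision, rt = active_sense([1] * 20)
assert decision == 1 and rt == 6   # six consistent samples suffice at this cost
```

Raising the per-sample cost shortens response times at the price of more errors, the kind of trade-off the full framework optimizes jointly with self-motion.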