Skip to content

Latest commit

 

History

History
161 lines (129 loc) · 6.3 KB

README.md

File metadata and controls

161 lines (129 loc) · 6.3 KB

EMBO - Empirical Bottleneck

License PyPI version Build status

A Python package for working with the Information Bottleneck [Tishby, Pereira, Bialek 2001] and the Deterministic (and Generalized) Information Bottleneck [Strouse and Schwab 2016]. Embo is especially geared towards the analysis of concrete, finite-size data sets.

Requirements

embo requires Python 3, numpy>=1.7 and scipy.

Installation

To install the latest release, run:

pip install embo

(depending on your system, you may need to use pip3 instead of pip in the command above).

Testing

(requires setuptools). If embo is already installed on your system, look for the copy of the test_embo.py script installed alongside the rest of the embo files and execute it. For example:

python /usr/lib/python3.X/site-packages/embo/test/test_embo.py

Usage

The Information Bottleneck

We refer to [Tishby, Pereira, Bialek 2001] and [Strouse and Schwab 2016] for a general introduction to the Information Bottleneck. Briefly, if X and Y are two random variables, we are interested in finding another random variable M (called the "bottleneck" variable) that solves the following optimisation problem:

min_{p(m|x)} = H(M) - α H(M|X) - β I(M:Y)

for any β>0 and 0≤α≤1, and where M is constrained to be independent on Y conditional on X:

p(x,m,y) = p(x)p(m|x)p(y|x)

Intuitively, we want to find the stochastic mapping p(M|X) that extracts from X as much information about Y as possible while forgetting all irrelevant information. β is a free parameter that sets the relative importance of forgetting irrelevant information versus remembering useful information. α determines what notion of "forgetting" we use: α=1 ("vanilla" bottleneck or IB) implies that we want to minimise the mutual information I(M:X), α=0 (deterministic bottleneck or DIB) that we want to make M a good compression of X by minimising its entropy H(M), and intermediate values interpolate between these two conditions.

Typically, one is interested in the curve described by I(M:Y) as a function of I(M:X) or H(M) at the solution of the bottleneck problem for a range of values of β. This curve gives the optimal tradeoff of compression and prediction, telling us what is the minimum amount of information one needs to know about X (or minimum amount of entropy one needs to retain) to be able to predict Y to a certain accuracy, or vice versa, what is the maximum accuracy one can have in predicting Y given a certain amount of information about X.

Using embo

Embo can solve the information bottleneck problem for discrete random variables starting from a set of joint empirical observations. The main point of entry to the package is the InformationBottleneck class. In its constructor, InformationBottleneck takes as arguments an array of observations for X and an (equally long) array of observations for Y, together with other optional parameters (see the docstring for details). In the most basic use case, users can call the get_bottleneck method of an InformationBottleneck object, which will assume α=1 and return a set of β values and the optimal values of I(M:X), I(M:Y) and H(M) corresponding to those β. The optimal tradeoff can then be visualised by plotting I(M:Y) vs I(M:Y).

For instance:

import numpy as np
from matplotlib import pyplot as plt
from embo import InformationBottleneck

# data sequences
x = np.array([0,0,0,1,0,1,0,1,0,1])
y = np.array([0,1,0,1,0,1,0,1,0,1])

# compute the IB bound from the data (vanilla IB; Tishby et al 2001)
I_x,I_y,H_m,β = InformationBottleneck(x,y).get_bottleneck()

# plot the IB bound
plt.plot(I_x,I_y)

Embo can also operate starting from a joint (X,Y) probability distribution, encoded as a 2D array containing the probability of each combination of states for X and Y.

# define joint probability mass function for a 2x2 joint pmf
pxy = np.array([[0.1, 0.4],[0.35, 0.15]]),

# compute IB
I_x,I_y,H_m,β = InformationBottleneck(pxy=pxy).get_bottleneck()

# plot I(M:Y) vs I(M:X)
plt.plot(I_x,I_y)

The deterministic and generalised bottleneck can be computed by setting appropriately the parameter alpha:

# compute Deterministic Information Bottleneck (Strouse 2016)
I_x,I_y,H_m,β = InformationBottleneck(pxy=pxy, alpha=0).get_bottleneck()

# plot I(M:Y) vs H(M)
plt.plot(H_m,I_y)

More examples

The embo/examples directory contains some Jupyter notebook that should exemplify most of the package's functionality.

  • Basic-Example.ipynb: basics; how to compute and plot an IB bound.
  • Markov-Chains.ipynb: using embo for past-future bottleneck type analyses on data from Markov chains.
  • Deterministic-Bottleneck.ipynb: Deterministic and Generalized Information Bottleneck. Here we reproduce a key figure from the Deterministic Bottleneck paper, and we explore the algorithm's behaviour as α changes from 0 to 1.
  • Compare-embo-dit.ipynb: here we compare embo with dit [James et al 2018]. We compare the solutions found by the two packages on a set of simple IB problems (including a problem taken from dit's documentation), and we show that embo is orders of magnitude faster than dit.

Further details

For more details, please consult the docstrings in InformationBottleneck.

Changelog

See the CHANGELOG.md file for a list of changes from older versions.

Authors

embo is maintained by Eugenio Piasini, Alexandre Filipowicz and Jonathan Levine.

Citing

Piasini, E., Filipowicz, A.L.S., Levine, J. and Gold, J.I., 2021. Embo: a Python package for empirical data analysis using the Information Bottleneck. Journal of Open Research Software, 9(1), p.10. DOI: http://doi.org/10.5334/jors.322