Saturday, 7 January 2017

Practical Kullback-Leibler (KL) Divergence: Discrete Case

KL divergence (Kullback-Leibler57) or KL distance is non-symmetric measure of difference between two probability distributions. It is related to mutual information and can be used to measure the association between two random variables.
Figure: Distance between two distributions. (Wikipedia)

In this short tutorial, I show how to compute KL divergence and mutual information for two categorical variables, interpreted as discrete random variables.

${\bf Definition}$: Kullback-Leibler (KL) Distance on Discrete Distributions

Given two discrete probability distributions ${\it P}(A)$ and ${\it Q}(B)$ with discrete random variates, $A$ and $B$, having realisations $A=a_{j}$ and $B=b_{j}$, over $n$ singletons $j=1,...,n$. KL divergence or distance $D_{KL}$ in between $P$ and $Q$ is defined as follows:

$D_{KL} = D_{KL}\big({\it P}(A) || {\it Q}(B)\big)=\sum_{j=1}^{n} {\it P}(A=a_{j}) \log \Big(  \cfrac{{\it P}(A=a_{j})}{{\it Q}(B=b_{j})} \Big)$

$\log$ is in base $e$.

${\bf Definition}$: Mutual Information on Discrete Random Variates

Two discrete random variables $X$ and $Y$, having realisations $X=x_{k}$ and $Y=y_{l}$, over $m$ and $n$ singletons $k=1,...,n$ and $l=1,...,m$ respectively, are given. Mutual information, $I(X;Y)$ is defined as follows,

$I(X;Y) = \sum_{k=1}^{n} \sum_{l=1}^{m} {\it R}(X=x_{k}, Y=y_{l}) \log \Big(\cfrac{  {\it R}(X=x_{k}, Y=y_{l})  }{  {\it R}(X=x_{k})  {\it R}(Y=y_{l})} \Big)$

$\log$ is in base $e$ and $R$ denotes probabilities.

${\bf Definition}$: Mutual Information on Discrete Random Variates with KL distance

Two discrete random variables $X$ and $Y$. Mutual information, $I(X;Y)$ is defined as follows,

$I(X;Y) = D_{KL} \Big( {\it R}(X, Y) || {\it R}(X) {\it R}(Y) \Big)$

${\bf Problem: Example}$: Given two discrete random variables $X$ and $Y$ defined on the following sample spaces $(a,b,c)$ and $(d,e)$ respectively. Write down the expression for the mutual information, I(X;Y), expanding summation.

${\bf Solution: Example}$:

$I(X;Y)  = {\it R}(X=a, Y=d) \log \Big(\cfrac{ {\it R}(X=a, Y=d) }{  {\it R}(X=a)  {\it R}(Y=d)} \Big) +$

${\it R}(X=b, Y=d) \log \Big(\cfrac{  {\it R}(X=b, Y=d)  }{  {\it R}(X=b)  {\it R}(Y=d)} \Big)+$

${\it R}(X=c, Y=d) \log \Big(\cfrac{  {\it R}(X=c, Y=d)  }{  {\it R}(X=c)  {\it R}(Y=d)} \Big)+$

${\it R}(X=a, Y=e) \log \Big(\cfrac{  {\it R}(X=a, Y=e)  }{  {\it R}(X=a)  {\it R}(Y=e)} \Big)+$

${\it R}(X=b, Y=e) \log \Big(\cfrac{  {\it R}(X=b, Y=e)  }{  {\it R}(X=b)  {\it R}(Y=e)} \Big)+$

${\it R}(X=c, Y=e) \log \Big(\cfrac{  {\it R}(X=c, Y=e)  }{  {\it R}(X=c)  {\it R}(Y=e)} \Big)$

${\bf Definition}$: Two discrete random variables $X$ and $Y$, having realisations $X=x_{k}$ and $Y=y_{l}$, over $m$ and $n$ singletons $k=1,...,n$ and $l=1,...,m$ respectively, are given. Two-way contingency table of realisations of $X$ and $Y$ along the same measured data points, $N$ observations, can be constructed by counting co-occurances, or cross-classifying factors. Normalizing the resulting frequency table will produce joint and marginal probabilities.

                Y=y_{1}              ...    Y=y_{l}              R(X)
    X=x_{1}    R(X=x_{1}, Y=y_{1})   ... R(X=x_{1}, Y=y_{l}) R(X=x_{1})
    X=x_{2}    R(X=x_{2}, Y=y_{1})   ... R(X=x_{2}, Y=y_{l}) R(X=x_{2})
    ...             ...              ...        ...              ...
    X=x_{k}    R(X=x_{k}, Y=y_{1})   ... R(X=x_{k}, Y=y_{1} R(X=x_{k}) 
    R(Y)        R(Y=y{1})           ...  R(Y=y_{l})

Simulated example with R

# Table of counts (contingency)
X <- sample(letters[1:3], 100, 1)
Y <- sample(letters[4:5], 100, 1)
t  <- table(X,Y)
tp  <- t/100 # proportions
tn  <- tp/sum(tp)     # normalized, joints
p_x <- rowSums(tn)    # marginals
p_y <- colSums(tn)

P <- tn 
Q <- p_x %o% p_y 

# P(X, Y)   : bin frequency: P_i
# P(X) P(Y) : bin frequency, Q_i 
mi <- sum(P*log(P/Q))
mi.empirical(t) == mi


(KullbackLeibler57) Kullback, S.; Leibler, R.A. (1951). "On information and sufficiency". Annals of Mathematical Statistics 22 (1): 79–86. 


Nuno Fernandes said...

Hu there.

I found this post very goood.
Is there some issue or typo with the R code command %o% that I do not know?

Thank you,

Nuno Edgar Fernandes

msuzen said...

Hi Nuno,


%o% is an outer product. It is from R base, have a look at ?outer or from the manual outer.

Daniel J said...


Nice example but as of january 2018 can't load entropy library, only EntropyEstimation, would that do it? Or how can i load it? Thanks in advance.

msuzen said...

@Daniel What error message you had? entropy package is still available. You need to install it with install.packages("entropy"). Try a different CRAN mirror if you have issues. Sometimes mirrors can have issues in updating properly. If you want to use EntropyEstimation, use one of the KL or MI functions. Note that it uses a different estimator.

(c) Copyright 2008-2017 Mehmet Suzen (suzen at acm dot org)

Creative Commons Licence
This work is licensed under a Creative Commons Attribution 3.0 Unported License.