KL divergence frequently appears in many fields, such as statistics and information theory. It is defined as the expected value of the logarithmic transformation of a likelihood ratio. Note that:

- expected value: a weighted integral with respect to a probability density.
- logarithmic transformation: converts products into sums, which is convenient for convex optimization and functional analysis.
- likelihood ratio: a measure for comparing likelihoods.
1. What is KL divergence?
1.1 Definition
For any two probability distributions $P$ and $Q$, the KL divergence (Kullback–Leibler divergence)^{1} is defined as follows, using their probability density functions $p(x)$ and $q(x)$:

$$ D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx $$
1.2 Basic properties
KL divergence has the following properties.

- (non-negativity) $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$.
- (completeness) $D_{\mathrm{KL}}(P \,\|\, Q) = 0$ holds if and only if $P$ and $Q$ are equivalent.
- (asymmetry) It is not symmetric in $P$ and $Q$: in general, $D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P)$.
- (absolute continuity) Unless the divergence is infinite, $P$ is absolutely continuous with respect to $Q$.
For example, computing the KL divergence^{2} between two Gaussian distributions gives the following results: the more the shapes of the two distributions differ, the larger the KL divergence becomes.
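As a sketch of how such values can be computed (footnote 2 mentions scipy.stats.entropy), the KL divergence between two Gaussians can be approximated by discretizing their densities on a grid; the parameters and grid below are illustrative choices, not necessarily those used for the figure:

```python
# Approximate D_KL between Gaussians by discretizing their densities.
# scipy.stats.entropy(pk, qk) normalizes its inputs, so on a uniform
# grid it approximates the continuous KL divergence (in nats).
import numpy as np
from scipy.stats import norm, entropy

x = np.linspace(-10, 10, 2001)            # uniform evaluation grid
p = norm.pdf(x, loc=0.0, scale=1.0)       # N(0, 1)
q_near = norm.pdf(x, loc=0.5, scale=1.0)  # slightly shifted Gaussian
q_far = norm.pdf(x, loc=3.0, scale=1.0)   # strongly shifted Gaussian

d_near = entropy(p, q_near)  # ~ (0.5^2)/2 = 0.125
d_far = entropy(p, q_far)    # ~ (3^2)/2  = 4.5
print(d_near, d_far)         # the worse the overlap, the larger the KL
```

For equal variances, the closed form reduces to $(\mu_1 - \mu_2)^2 / 2\sigma^2$, which is what the comments above use as a sanity check.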
1.3 Is KL divergence a metric?
KL divergence is such an important measure in probability and information theory that it goes by various names depending on the field and context:

“KL divergence” “KL metric” “KL information” “Information divergence” “Information gain” “Relative entropy”

Since KL divergence is always non-negative, it is tempting to interpret it as a metric on the space where the probability distributions $P$ and $Q$ live. However, KL divergence is not strictly a metric, because among the following axioms of a metric it satisfies only “non-negativity” and “completeness”.
Axioms of a metric $d$:

- non-negativity: $d(x, y) \ge 0$
- completeness: $d(x, y) = 0 \iff x = y$
- symmetry: $d(x, y) = d(y, x)$
- the triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$

Note that $d$ is called the distance function, or simply the distance.
For example, the Euclidean distance, squared distance, Mahalanobis distance, and Hamming distance satisfy all of these conditions and can clearly be regarded as metrics. KL divergence, on the other hand, is a divergence, not a metric. In mathematics, a “divergence” is a relaxation of a “metric” that satisfies only non-negativity and completeness among the axioms of a metric. By introducing divergences, you loosen the constraints imposed by the metric axioms and gain a higher level of abstraction.
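A quick numeric check makes the failed axioms concrete; the three Bernoulli distributions below are arbitrary illustrative choices:

```python
# Show that KL divergence violates the symmetry and triangle-inequality
# axioms of a metric, using three discrete (Bernoulli) distributions.
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions given as arrays."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

P, Q, R = [0.5, 0.5], [0.9, 0.1], [0.99, 0.01]

print(kl(P, Q), kl(Q, P))             # the two values differ: no symmetry
print(kl(P, R), kl(P, Q) + kl(Q, R))  # left side is larger: no triangle inequality
```

Non-negativity and completeness, on the other hand, do hold for any pair of distributions.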
The word “divergence” is generally interpreted as the process or state of diverging; in physics, for example, it appears as the vector operator div. There is no Japanese word that corresponds exactly to this meaning of divergence, but terms such as 相違度, 分離度, 逸脱度, and 乖離度 seem to be used.
As an example, let’s measure the KL divergence between two Gaussian distributions $P$ (blue) and $Q$ (red). In the figure, the left panel shows the KL divergence of the red distribution as seen from the blue one, $D_{\mathrm{KL}}(P \,\|\, Q)$, and the right panel shows the KL divergence of the blue distribution as seen from the red one, $D_{\mathrm{KL}}(Q \,\|\, P)$. Their values are indeed different.
Note that given two Gaussian distributions

$$ P = \mathcal{N}(\mu_1, \sigma_1^2), \quad Q = \mathcal{N}(\mu_2, \sigma_2^2), $$

the following holds:

$$ D_{\mathrm{KL}}(P \,\|\, Q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}. $$
<img src="images/comparison_of_dkl_norm.png" width="600">
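The closed-form expression above can be verified against direct numerical integration; the parameters below are illustrative:

```python
# Closed-form KL divergence between two Gaussians, checked against
# numerical integration of p(x) * log(p(x) / q(x)).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl_gauss(mu1, s1, mu2, s2):
    """D_KL( N(mu1, s1^2) || N(mu2, s2^2) ) in nats."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
analytic = kl_gauss(mu1, s1, mu2, s2)
numeric, _ = quad(
    lambda x: norm.pdf(x, mu1, s1) * np.log(norm.pdf(x, mu1, s1) / norm.pdf(x, mu2, s2)),
    -20, 20,
)
reverse = kl_gauss(mu2, s2, mu1, s1)  # swap the roles of the two Gaussians
print(analytic, numeric, reverse)     # analytic matches numeric; reverse differs
```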
Incidentally, in addition to the KL divergence, the following measures are known for quantifying the proximity (or closeness) of two probability distributions.
Measures of the closeness between $p(x)$ and $q(x)$:

- $\chi^2$ statistic: $\int \frac{(p(x) - q(x))^2}{q(x)} \, dx$
- $L_1$ norm: $\int |p(x) - q(x)| \, dx$
- $L_2$ norm: $\int (p(x) - q(x))^2 \, dx$
- Hellinger distance: $\frac{1}{2} \int \bigl(\sqrt{p(x)} - \sqrt{q(x)}\bigr)^2 \, dx$
- $\alpha$-divergence
- generalized information
- KL divergence: $\int p(x) \log \frac{p(x)}{q(x)} \, dx$
- JS divergence: $\frac{1}{2} D_{\mathrm{KL}}(p \,\|\, m) + \frac{1}{2} D_{\mathrm{KL}}(q \,\|\, m)$, where $m = \frac{1}{2}(p + q)$
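As an illustration, several of these measures can be computed side by side for a pair of small discrete distributions (an arbitrary example; logarithms are in nats):

```python
# Compute several closeness measures from the list above for two
# discrete distributions p and q.
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.4, 0.5])

chi2 = float(np.sum((p - q)**2 / q))       # chi-squared statistic
l1 = float(np.sum(np.abs(p - q)))          # L1 norm
l2 = float(np.sum((p - q)**2))             # squared L2 norm
hellinger2 = float(0.5 * np.sum((np.sqrt(p) - np.sqrt(q))**2))  # squared Hellinger
kl = float(np.sum(p * np.log(p / q)))      # KL divergence
m = 0.5 * (p + q)                          # mixture used by the JS divergence
js = float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))
print(chi2, l1, l2, hellinger2, kl, js)
```

Unlike the KL divergence, the JS divergence is symmetric in $p$ and $q$ and is bounded above by $\log 2$.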
2. Relationship to other measurements
2.1 KL divergence vs Mutual information
In information theory, the entropy $H(X)$, the joint entropy $H(X, Y)$, the conditional entropy $H(X \mid Y)$, and the mutual information $I(X; Y)$ are defined as follows, using probability densities^{3}:

$$ H(X) = -\int p(x) \log p(x) \, dx $$

$$ H(X, Y) = -\iint p(x, y) \log p(x, y) \, dx \, dy $$

$$ H(X \mid Y) = -\iint p(x, y) \log p(x \mid y) \, dx \, dy $$

$$ I(X; Y) = H(X) - H(X \mid Y) $$
For any two random variables $X$ and $Y$, the mutual information $I(X; Y)$ quantifies the mutual (symmetric) dependence between them.
Here, the following relationship holds between KL divergence and mutual information:

$$ I(X; Y) = D_{\mathrm{KL}}\bigl(p(x, y) \,\|\, p(x)\,p(y)\bigr) $$
Thus, mutual information can be interpreted as the degree of difference (the average degree of deviation) between the joint distribution $p(x, y)$ when $X$ and $Y$ are not independent and the product distribution $p(x)\,p(y)$ that would hold if $X$ and $Y$ were independent.
(cf.) Transformation of the mutual information formula:

$$ I(X; Y) = H(X) - H(X \mid Y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy = D_{\mathrm{KL}}\bigl(p(x, y) \,\|\, p(x)\,p(y)\bigr) $$
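This identity is easy to check numerically; the small joint distribution below is an illustrative choice:

```python
# Verify I(X; Y) = D_KL( p(x,y) || p(x) p(y) ) for a small discrete
# joint distribution, and cross-check via the entropy identity
# I(X; Y) = H(X) + H(Y) - H(X, Y).
import numpy as np

pxy = np.array([[0.30, 0.10],
                [0.15, 0.45]])          # joint distribution p(x, y)
px = pxy.sum(axis=1, keepdims=True)     # marginal p(x)
py = pxy.sum(axis=0, keepdims=True)     # marginal p(y)

mi_kl = float(np.sum(pxy * np.log(pxy / (px * py))))

H = lambda dist: float(-np.sum(dist * np.log(dist)))  # entropy in nats
mi_entropy = H(px) + H(py) - H(pxy)
print(mi_kl, mi_entropy)  # the two computations agree
```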
2.2 KL divergence vs Log likelihood ratio
In the fields of Bayesian inference and statistical modeling, you often face the problem of estimating the true distribution $q(x)$ by $p(x \mid \hat{\theta})$ (that is, the combination of a stochastic model $p(x \mid \theta)$ and an estimated parameter $\hat{\theta}$). KL divergence is therefore used when you want to measure the difference between these two distributions, or when you want to incorporate the estimation error into a loss function or risk function in order to solve the optimization problem for the parameter $\theta$.
KL divergence is also closely related to the log likelihood ratio, which gives it a deep connection to model selection methods^{4} such as the likelihood ratio test, Bayes factors, and the AIC (Akaike information criterion).
The KL divergence of the estimated distribution $p(x \mid \hat{\theta})$ from the true distribution $q(x)$ can be regarded as the expected value, under the true distribution $q(x)$, of the log likelihood ratio:

$$ D_{\mathrm{KL}}\bigl(q \,\|\, p_{\hat{\theta}}\bigr) = \mathbb{E}_{q}\left[\log \frac{q(x)}{p(x \mid \hat{\theta})}\right] = \int q(x) \log \frac{q(x)}{p(x \mid \hat{\theta})} \, dx $$
When KL divergence is used as the evaluation/loss value in model selection and comparison, minimizing the KL divergence and maximizing the expected log likelihood are equivalent:

$$ \operatorname*{arg\,min}_{\theta} D_{\mathrm{KL}}\bigl(q \,\|\, p_{\theta}\bigr) = \operatorname*{arg\,max}_{\theta} \mathbb{E}_{q}\bigl[\log p(x \mid \theta)\bigr] $$

This holds because $D_{\mathrm{KL}}(q \,\|\, p_{\theta}) = \mathbb{E}_{q}[\log q(x)] - \mathbb{E}_{q}[\log p(x \mid \theta)]$, and the first term does not depend on $\theta$.
For any parametric stochastic model $p(x \mid \theta)$ (such as a linear regression model) representing the estimated distribution, if the KL divergence from the true distribution is taken as the loss function, the optimal parameter $\theta^{*}$ is the one satisfying

$$ \theta^{*} = \operatorname*{arg\,min}_{\theta} D_{\mathrm{KL}}\bigl(q \,\|\, p_{\theta}\bigr). $$

Then, for any estimated parameter $\hat{\theta}$, the estimation loss of the model is represented by the KL divergence $D_{\mathrm{KL}}\bigl(q \,\|\, p_{\hat{\theta}}\bigr)$. (Note that $\log p(x \mid \theta)$ denotes the log likelihood function.)
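The equivalence between minimizing the KL divergence and maximizing the expected log likelihood can be sketched numerically; the Gaussian location model and the grids below are illustrative assumptions:

```python
# Over a grid of candidate parameters theta, find the theta that
# minimizes D_KL(q || p_theta) and the theta that maximizes the
# expected log likelihood E_q[log p(x | theta)]; they coincide.
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
q = norm.pdf(x, loc=0.7, scale=1.0)   # "true" distribution q(x)

thetas = np.linspace(-2.0, 3.0, 501)
kl_vals, ll_vals = [], []
for t in thetas:
    p = norm.pdf(x, loc=t, scale=1.0)               # model p(x | theta)
    kl_vals.append(np.sum(q * np.log(q / p)) * dx)  # D_KL(q || p_theta)
    ll_vals.append(np.sum(q * np.log(p)) * dx)      # E_q[log p(x | theta)]

best_by_kl = thetas[np.argmin(kl_vals)]
best_by_ll = thetas[np.argmax(ll_vals)]
print(best_by_kl, best_by_ll)  # both recover the true location near 0.7
```

The two criteria pick the same parameter because they differ only by the $\theta$-independent term $\mathbb{E}_q[\log q(x)]$.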
2.3 KL divergence vs Fisher information
Given a stochastic model $p(x \mid \theta)$, the Fisher information $I(\theta)$ for the parameter $\theta$ is defined as follows. (Note that $\log p(x \mid \theta)$ denotes the log likelihood function.)

$$ I(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log p(x \mid \theta)\right)^{2}\right] = -\,\mathbb{E}\left[\frac{\partial^{2}}{\partial \theta^{2}} \log p(x \mid \theta)\right] $$

Moreover, the following relationship holds between KL divergence and Fisher information:

$$ D_{\mathrm{KL}}\bigl(p_{\theta} \,\|\, p_{\theta + \delta}\bigr) \approx \frac{1}{2} I(\theta)\,\delta^{2} \quad (\delta \to 0) $$
(cf.) This follows from the second-order Taylor expansion of $\log p(x \mid \theta + \delta)$ around $\theta$: the first-order term vanishes in expectation, and the second-order term yields $\frac{1}{2} I(\theta)\,\delta^{2}$.
This formula indicates that, in the parameter space, for each point $\theta$ and its neighboring point $\theta + \delta$, the KL divergence $D_{\mathrm{KL}}(p_{\theta} \,\|\, p_{\theta + \delta})$ is proportional to the Fisher information $I(\theta)$. In short, the Fisher information measures the local information that the stochastic model carries at the point $\theta$.
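This local relationship can be checked numerically for a model where both sides are easy to compute; the Bernoulli model below is an illustrative choice, whose Fisher information is $1/(\theta(1-\theta))$:

```python
# Check D_KL(p_theta || p_{theta+d}) ~ (1/2) I(theta) d^2 for a
# Bernoulli model, where I(theta) = 1 / (theta * (1 - theta)).
import numpy as np

def kl_bernoulli(a, b):
    """D_KL( Bernoulli(a) || Bernoulli(b) ) in nats."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

theta = 0.3
fisher = 1.0 / (theta * (1 - theta))

for d in [0.1, 0.01, 0.001]:
    exact = kl_bernoulli(theta, theta + d)
    approx = 0.5 * fisher * d**2
    print(d, exact, approx)  # the two columns converge as d shrinks
```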
3. References

Also, the f-divergence is defined as a generalization of it. ↩

I used scipy.stats.entropy(). ↩

Although thermodynamic entropy originates with Boltzmann, the historical background of Shannon information is described at the link below. There seems to have been a line of development: Hartley → Nyquist → Shannon. http://www.ieice.org/jpn/books/kaishikiji/200112/2001129.html ↩

Article on the generalized information criterion (GIC): https://www.ism.ac.jp/editsec/toukei/pdf/472375.pdf ↩