# An Intuitive Guide to the Concept of Entropy Arising in Various Sectors of Science

The concept of entropy was traditionally derived in information theory as the only function satisfying certain criteria for a consistent measure of the "amount of uncertainty", by analogy with a quantity that had already arisen in statistical mechanics. It can in fact be given a unified and intuitive interpretation in two equivalent ways: as the logarithm of the "effective number of states" of a system, or as the logarithm of the "effective number of possible values" of a random variable. In the first case, a reference system with that number of equally probable states can be shown to behave identically to the system under consideration when a certain aspect of the latter's behavior is singled out. This "number of states of an equally probable reference system" can therefore be used to characterize the system under consideration, because some aspects of its behavior are encapsulated in that single number. The entropy function applies in an analogous fashion in the second case, by comparison with equally probable random variables with a given number of possible values.

The concepts of entropy in three particular sectors of science can be shown to be mathematically equivalent and to share this fundamental interpretation, although they are applied in different ways. These are the statistical mechanical entropy used to characterize the disorder of a physical system, the information entropy of Shannon used to measure the amount of information that a particular message conveys, and the information entropy maximized in Jaynes's principle to obtain the least biased statistical inference. Within this interpretation, the possible non-uniqueness of the definition of entropy arises in a natural way. This guide is written with the aim of providing newcomers to these fields with an intuitive picture of entropy that can be applied broadly.

# Introduction

Since its inception in the field of thermodynamics in the early 1850s through the work of Rudolf Clausius,[1] the concept of entropy has been introduced into a range of sectors of science and has proved to be a quantity of great importance. Notable among these are the statistical mechanical entropy proposed by Boltzmann, Planck [2] and Gibbs,[3] which gives the phenomenological thermodynamical entropy an insightful microscopic interpretation; the information entropy devised by Shannon [4] to measure the information content of a particular message; and its further development by Jaynes,[5] who showed that the Shannon entropy of a probability distribution can be used to quantify our uncertainty about the corresponding random variable, so that the distribution which maximizes the entropy subject to a set of predefined constraints gives the least biased estimate of that distribution. These concepts of entropy are often categorized as either statistical mechanical or information-theoretic,[6] with their interconnection remaining somewhat obscure and still under debate. In this guide, the three concepts above, which cover most of the definitions of entropy ever proposed, are all shown to be interpretable as the logarithm of the "effective number of states" of the system under consideration, or of the "effective number of possible values" of the random variable: the number of states or values of a reference system or random variable whose probability distribution is uniform and whose behavior is asymptotically equivalent, in some respect, to that of the system or random variable under investigation. From this an intuitive and unified interpretation of the concept of entropy is obtained.
In contrast with a previous attempt toward a unified concept of entropy made by Jaynes,[7] in which the objective statistical mechanical entropy is given a subjective meaning, the present guide tries to give the traditionally subjective information-theoretic entropy and the maximum-entropy principle an objective interpretation and justification, so that their connection with the statistical mechanical entropy becomes more apparent and their nature can be seen more clearly. Because of the mathematical equivalence between the concept of a state of a system and that of a possible value of a random variable, the two terms are used interchangeably in what follows to avoid cumbersome expressions.

# Statistical mechanical entropy

The famous formula carved on the gravestone of Boltzmann,

${\displaystyle S=k_{B}\log \Omega ,\qquad (1)\!}$

gives a tremendously insightful understanding of the thermodynamical entropy and of its relationship with the direction of spontaneous change: all microstates accessible to an isolated system are equally probable (the equal a priori probability postulate), and the more microstates a particular macrostate corresponds to, the more likely that macrostate, so the system spontaneously evolves from macrostates corresponding to fewer microstates toward macrostates corresponding to more. But in the more general case, beyond the fundamental micro-canonical ensemble, the probabilities of occurrence of the microstates are not necessarily equal, so the entropy of a system must be computed by the formula proposed by Gibbs,[8] which is

${\displaystyle S=-k_{B}\sum _{i}p_{i}\log p_{i},\qquad (2)\!}$

where the summation runs over all possible microstates, indexed by ${\displaystyle i\!}$ , and ${\displaystyle p_{i}\!}$ gives their respective probabilities. This formula is readily seen to be equivalent to equation (1) for a uniform probability distribution. But when it comes to justifying the formula for non-uniform probability distributions, some authors,[9] including Gibbs himself, derived it by classical analogy, while other authors [10] take equation (2) as the defining formula for entropy, so that equation (1) comes out as merely a special case.
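The reduction of equation (2) to equation (1) for a uniform distribution can be checked directly. The following Python sketch sets ${\displaystyle k_{B}=1\!}$ and uses arbitrary illustrative probabilities, not values for any particular physical system:

```python
import math

def gibbs_entropy(p):
    """Equation (2) with k_B = 1: S = -sum_i p_i ln p_i (terms with p_i = 0 contribute nothing)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Uniform distribution over Omega microstates: equation (2) reduces to equation (1), S = ln(Omega).
omega = 8
uniform = [1.0 / omega] * omega
assert abs(gibbs_entropy(uniform) - math.log(omega)) < 1e-12

# A non-uniform distribution over the same states has strictly lower entropy.
biased = [0.5, 0.2, 0.1, 0.05, 0.05, 0.05, 0.03, 0.02]
assert gibbs_entropy(biased) < math.log(omega)
```

The second assertion previews the theme of the following sections: a non-uniform distribution behaves like an equally probable one with fewer states.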

Note that the other forms of ensemble, in which the probability distribution may be non-uniform, are all derived by embedding the system under consideration in a large reservoir, so that the micro-canonical equal a priori probability postulate can be invoked. The different probabilities of the different states are considered to stem from the different numbers of times they recur in the large micro-canonical ensemble, a particular state's number of recurrences being the number of states of the reservoir compatible with it. If the number of recurrences of the ${\displaystyle i\!}$ th state of the system, i.e. its statistical weight, is denoted by ${\displaystyle w_{i}\!}$ , then its probability ${\displaystyle p_{i}\!}$ must equal ${\displaystyle {\frac {w_{i}}{Z}}\!}$ , where ${\displaystyle Z=\sum _{i}w_{i}\!}$ is the partition function. Substituting this expression for the probability of a particular state into equation (2) gives, upon rearrangement:

${\displaystyle S=k_{B}\ln {\frac {Z}{\prod _{i}w_{i}^{\frac {w_{i}}{Z}}}}.\qquad (3)\!}$

This suggests a new intuitive interpretation of the statistical mechanical entropy for ensembles whose probability distribution is non-uniform: the denominator ${\displaystyle \prod _{i}w_{i}^{\frac {w_{i}}{Z}}\!}$ can be recognized as a self-weighted geometric mean of the statistical weights of the states, so the quotient of the partition function by it can be interpreted as the number of states the system could be in if the partition function were made up of equal statistical weights.

For the detailed derivation (writing ${\displaystyle k_{B}=1\!}$ for brevity), Gibbs's definition of entropy can be written in terms of the weights and the partition function as ${\displaystyle S=-\sum _{i}p_{i}\ln p_{i}=-\sum _{i}{\frac {w_{i}}{Z}}\ln {\frac {w_{i}}{Z}},\!}$ which decomposes into the sum of two summations, ${\displaystyle -\sum _{i}{\frac {w_{i}}{Z}}\ln w_{i}\!}$ and ${\displaystyle \sum _{i}{\frac {w_{i}}{Z}}\ln Z\!}$ . The latter readily sums to ${\displaystyle \ln Z\!}$ , so the entropy can now be written as ${\displaystyle -\sum _{i}{\frac {w_{i}}{Z}}\ln w_{i}+\ln Z.\!}$ By moving the coefficient of each logarithm into its argument, the former sum can be written as ${\displaystyle -\sum _{i}\ln(w_{i}^{\frac {w_{i}}{Z}})}$ , which equals ${\displaystyle -\ln \prod _{i}w_{i}^{\frac {w_{i}}{Z}}}$ . Combining the two sums gives the final formula ${\displaystyle S=\ln {\frac {Z}{\prod _{i}w_{i}^{\frac {w_{i}}{Z}}}}}$ .
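The rearrangement can be verified numerically. The following Python sketch uses arbitrary illustrative weights, not drawn from any particular physical system, and checks that equation (3) agrees with the Gibbs formula, equation (2), again with ${\displaystyle k_{B}=1\!}$ :

```python
import math

def entropy_from_probs(p):
    """Gibbs entropy, equation (2), with k_B = 1."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Arbitrary positive statistical weights w_i (hypothetical values, chosen for illustration).
w = [3.0, 1.0, 0.5, 0.25]
Z = sum(w)
p = [wi / Z for wi in w]

# Self-weighted geometric mean of the weights: prod_i w_i^(w_i / Z).
geo_mean = math.prod(wi ** (wi / Z) for wi in w)

# Equation (3): S = ln(Z / geo_mean) matches equation (2) term for term.
assert abs(entropy_from_probs(p) - math.log(Z / geo_mean)) < 1e-12

# The "effective number of states" exp(S) = Z / geo_mean lies between 1 and the actual state count.
n_eff = Z / geo_mean
assert 1.0 <= n_eff <= len(w)
```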

It is obvious that if all the weights are multiplied simultaneously by a common factor, the entropy is unaffected; such multiplication can therefore be regarded as a mere change of the unit in which the number of microstates is counted, which changes none of the physics. Alternatively, it can be viewed as replacing the reservoir by another one with identical relevant behavior (temperature in the case of the canonical ensemble; temperature and chemical potential in the case of the grand canonical ensemble) but with a different initial total number of states ${\displaystyle \Omega ^{(0)}\!}$ before its coupling to the system, so that properties of the system such as its entropy are not altered.

When ${\displaystyle w_{i}\!}$ approaches zero, ${\displaystyle w_{i}^{\frac {w_{i}}{Z}}\!}$ approaches unity, so as the statistical weight of a particular state approaches zero, its contribution to the self-weighted geometric mean also vanishes. Hence in the thermodynamic limit, where some particular set of states covers almost all of the total probability, the entropy calculated by equation (2) gives the number of states of this overwhelmingly likely set. In this respect, the present interpretation is quite similar to the "phase extension" interpretation used by authors such as Landau,[11] in which the statistical weight of the states at the average energy plays the role of the self-weighted geometric mean. Specifically, taking the logarithm of the self-weighted geometric mean of the statistical weights in the canonical ensemble gives ${\displaystyle \sum _{i}{\frac {w_{i}}{Z}}\ln w_{i}=\sum _{i}{\frac {e^{-\beta E_{i}}}{Z}}\ln(e^{-\beta E_{i}}),\!}$ which rearranges to ${\displaystyle \sum _{i}-\beta E_{i}p_{i}\!}$ , i.e. ${\displaystyle -\beta \langle E\rangle \!}$ . Hence the self-weighted geometric mean of the statistical weights equals ${\displaystyle e^{-\beta \langle E\rangle }}$ , the statistical weight of a state at the average energy. This equivalence shows the suitability of the self-weighted geometric mean, and of the entropy formula, for the particular form the weights take in the canonical ensemble.
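This identity for canonical weights can also be checked numerically. In the sketch below the inverse temperature and energy levels are arbitrary illustrative values:

```python
import math

# Canonical-ensemble weights w_i = exp(-beta * E_i) for hypothetical energy levels.
beta = 1.3
E = [0.0, 0.7, 1.1, 2.5]
w = [math.exp(-beta * e) for e in E]
Z = sum(w)
p = [wi / Z for wi in w]

# Self-weighted geometric mean of the weights...
geo_mean = math.prod(wi ** (wi / Z) for wi in w)

# ...equals exp(-beta * <E>), the statistical weight at the average energy.
E_avg = sum(pi * e for pi, e in zip(p, E))
assert abs(geo_mean - math.exp(-beta * E_avg)) < 1e-12
```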

If two systems are completely uncorrelated, the entropy of the composite system decouples straightforwardly: ${\displaystyle S=k_{B}\ln(\Omega _{1}\Omega _{2})=k_{B}\ln \Omega _{1}+k_{B}\ln \Omega _{2}.\!}$ For correlated subsystems, as with the system and reservoir in the canonical and grand-canonical ensembles, the non-uniformity of the number of reservoir states to which a particular system state can be coupled rules out a simple product, but the decoupling can still be carried out by averaging the number of reservoir states to which the system states couple. The entropy computation described above can then be seen as factoring the contribution of the system out of the total entropy of system plus reservoir, with a function of ${\displaystyle \Omega ^{(0)}\!}$ as the remaining part. Thus the entropy calculated by the Gibbsian equation (2) can justifiably be interpreted as the logarithm of the "effective number of states" of the system in the ensemble: a system with that number of states and a uniform probability distribution gives the same total number of states when coupled, without correlation, to a reservoir whose number of states equals the self-weighted geometric mean defined above, which in the original canonical ensemble is the statistical weight of states at the average energy, i.e. the value obtained from a micro-canonical ensemble as the thermodynamic limit is approached. So equation (2) does the same thing as equation (1), counting the number of states that a system can be in and taking the logarithm to measure its disorder, and the difference between the two approaches zero in the thermodynamic limit.

# Information entropy of Shannon

In his seminal paper of 1948, Shannon proposed a formula, formally identical to the Gibbsian statistical mechanical entropy formula in equation (2), to measure the information content of a particular message chosen from a set of all possible messages.[12] This choice of measure was justified by the fact that it is the only function able to fulfill a set of criteria necessary for a consistent measure. But such an axiomatic introduction suffers from the deficiency that it is not intuitively clear, so authors such as Kardar have offered more intuitive derivations.[13] The following is a further clarification and detailed analysis of the derivation by Kardar, making clear the interpretation of entropy as a kind of "effective number of states" or "effective number of possible values".

If the message is to be chosen from an equally probable set of possible messages whose cardinality is ${\displaystyle N\!}$ , a series of ${\displaystyle n\!}$ such messages has ${\displaystyle N^{n}\!}$ possibilities. If one of these possibilities is to be transmitted by a device which operates by transmitting one of ${\displaystyle u\!}$ possible values at a time, then the number of such transmission units required to make the transmission of the ${\displaystyle n\!}$ messages possible is at least ${\displaystyle \log _{u}(N^{n})\!}$ , which equals ${\displaystyle n\log _{u}N\!}$ . The number of transmission units required per message is thus ${\displaystyle \log _{u}N\!}$ , its information entropy, which is formally identical to the Boltzmann entropy formula, equation (1). When the set of all possible messages is no longer equally probable, the total number of distinct series that must be distinguished when ${\displaystyle n\!}$ such messages are conveyed becomes ${\displaystyle g={\frac {n!}{\prod _{i}(np_{i})!}}}$ as ${\displaystyle n\!}$ approaches infinity, where ${\displaystyle p_{i}\!}$ denotes the probability of the ${\displaystyle i}$ th possible message. To transmit those ${\displaystyle n\!}$ messages, the number of transmission units required is ${\displaystyle \log _{u}g\!}$ , which can readily be shown to equal ${\displaystyle n(-\sum _{i}p_{i}\log _{u}p_{i})\!}$ when Stirling's approximation is used. So as the number of messages to be transmitted approaches infinity, the number of transmission units required per message approaches ${\displaystyle -\sum _{i}p_{i}\log _{u}p_{i}\!}$ , i.e. the entropy, which is accordingly defined as the measure of the information content of a particular message.
That is to say, each of the series of messages contains the amount of information that has to be transmitted by that number of transmission units.
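The convergence of the per-message transmission cost to the entropy can be illustrated numerically. The sketch below uses a hypothetical three-message alphabet and counts the distinct message series exactly via log-factorials, taking binary transmission units (u = 2):

```python
import math

def units_per_message(p, n, u=2):
    """log_u of the multinomial count n! / prod_i (n p_i)!, divided by n.
    Assumes each n * p_i is an integer; lgamma(k + 1) = ln(k!)."""
    log_g = math.lgamma(n + 1) - sum(math.lgamma(pi * n + 1) for pi in p)
    return log_g / (n * math.log(u))

p = [0.5, 0.25, 0.25]                     # probabilities of the possible messages
H = -sum(pi * math.log2(pi) for pi in p)  # entropy in base-2 units: 1.5 bits

# The per-message cost approaches the entropy from below as n grows.
for n in (40, 400, 40000):
    assert units_per_message(p, n) < H
assert abs(units_per_message(p, 40000) - H) < 0.01
```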

Based on the above analysis, entropy in the sense of Shannon can be viewed as the logarithm of the "effective number of possible values" of the random variable of messages to be transmitted: an equally probable message set of cardinality ${\displaystyle u^{S}\!}$ requires the same number of transmission units as the original message set when the number of messages to be transmitted approaches infinity. Its asymptotic behavior, in the sense of the number of transmission units required to transmit a long series of messages, is identical to that of the message set under consideration, so it can be used to characterize that set.
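A minimal sketch of this "effective number of possible values" (using natural logarithms, so the effective size is exp(S); the probabilities are illustrative):

```python
import math

def effective_num_values(p):
    """exp(S): the cardinality of an equally probable value set with the same entropy."""
    S = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return math.exp(S)

# A four-message set with non-uniform probabilities behaves, for coding purposes,
# like an equally probable set of fewer than four messages.
p = [0.5, 0.25, 0.125, 0.125]
n_eff = effective_num_values(p)
assert 1.0 < n_eff < 4.0

# A uniform set of four messages has effective size exactly 4.
assert abs(effective_num_values([0.25] * 4) - 4.0) < 1e-12
```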

# Information entropy in the maximum entropy principle of Jaynes

The maximum entropy principle was devised by Jaynes in 1957 to allow statistical inference from merely partial knowledge; Jaynes justified its utility by a set of consistency requirements.[14] Owing to its intimate connection with Shannon's theory, it too suffers from a lack of intuitive grounding, and the concept of probability involved has to be understood in the subjective sense. Attempts have therefore been made to give the maximum entropy principle a concrete picture, of which the Wallis derivation,[15] which is entirely combinatorial, is a pre-eminent example. In the present guide a variation of the Wallis derivation is presented in an objective manner, so that the maximum entropy principle is rendered as a kind of objective statistical inference resting on some quite rational and intuitive rules. Consequently, the concept of entropy can again be interpreted as a kind of "effective number of possible values".

If all sides of a fair die are marked with different labels, the probability of occurrence of each label can be considered equal because of the symmetry of the die. But if two of the sides are marked with the same label, the probability of that label is twice the value for the other labels. In the same manner, any random variable whose probability distribution is non-uniform can be considered a consequence of different values being yielded by different numbers of outcomes of another, "true" random variable whose probability distribution is uniform. That is to say, any random experiment can be viewed as yielding one of a set of possible true results with equal probability, with different sets of true results producing different observable values; values of the observable random variable which correspond to a greater number of true results occur with greater probability in the random experiment. This is a purely ontological interpretation of the origin of non-uniform probabilities in random experiments and is not falsifiable by any experimental fact, so it can be used in any pertinent circumstance to clarify what we are doing.

If the total number of possible true random results is assumed to be ${\displaystyle N}$ , then each of the ${\displaystyle N\!}$ results must correspond to one observable value. If each true random result is considered equally plausible to yield each possible observable result, on account of our ignorance of the internal details of their correspondence, then all elements of the space of correspondence maps from true random results to observable results are equally important. For a probability distribution ${\displaystyle \{p_{i}\}\!}$ of the observable random variable ${\displaystyle x_{i}\!}$ , the number of correspondence maps that it covers is ${\displaystyle {\frac {N!}{\prod _{i}(Np_{i})!}}}$ , whose logarithm can be written as ${\displaystyle N(-\sum _{i}p_{i}\ln p_{i})\!}$ when Stirling's approximation is used, and this is easily recognized as ${\displaystyle NS\!}$ , where ${\displaystyle S\!}$ denotes the information entropy of the distribution. So the entropy measures the extent that a particular probability distribution covers in the space of correspondence maps from true random results to observable results, and maximizing entropy amounts to maximizing the extent covered. In this way the "quanta of probability" of the original Wallis derivation are made concrete, and the maximum entropy principle can be stated as finding the probability distribution which covers the greatest part of the space of correspondence maps between true random results and observable results. The principle thus reduces to assuming that all possible ways of mapping true random results to observable results are equally important, so that the probability distribution compliant with the greatest number of such maps, subject to the given constraints, is the least biased one.
Accordingly, it can be considered a direct extension of the principle of insufficient reason. Here, however, we are not bluntly assigning equal probability to every observable result, but rather equal importance to every possible correspondence map from true random results to observable results. What partial knowledge does is to constrain the space of all possible correspondence maps by the information it provides. The entropy itself can in this context be interpreted as the logarithm of an "effective number of possible values" of the random variable, because the extent of the space of correspondence maps covered by this random variable and by an equally likely random variable with that number of values are equal when a common number ${\displaystyle N\!}$ of possible true random results is assumed. The greater the "effective number of possible values", the greater the extent covered and the less biased the probability distribution.

To give the above reasoning a formal form, we can proceed as follows. Suppose a set of ${\displaystyle m\!}$ observable events happens with probabilities ${\displaystyle \{p_{i}\}\!}$ , considered to result from ${\displaystyle N\!}$ distinct true random events happening with equal probability. Each particular map from the true random events to the observable events determines a probability distribution, and a given probability distribution corresponds to a set of such maps. For a particular distribution ${\displaystyle \{p_{i}\}\!}$ , the maps it covers comprise those which have ${\displaystyle p_{i}N\!}$ true random events corresponding to the ${\displaystyle i}$ th observable event; this set of maps has cardinality ${\displaystyle {\frac {N!}{\prod _{i}(p_{i}N)!}}}$ . If each map is assigned equal weight, because of our ignorance of the internal causal relationship between the true random results and the observable results, then the probability distribution which maximizes this cardinality is the least biased one. Since maximizing the cardinality is equivalent to maximizing its logarithm, and the logarithm is easier to work with, it is the logarithm that is maximized in seeking the most unbiased probability assignment. Denoting it by ${\displaystyle T}$ , we have ${\displaystyle T=\ln {\frac {N!}{\prod _{i}(p_{i}N)!}},\!}$ which reads as follows when Stirling's approximation is used,

{\displaystyle {\begin{aligned}T&=N\ln N-N-\sum _{i}[p_{i}N\ln(p_{i}N)-p_{i}N]\\&=N\ln N-N-\sum _{i}(p_{i}N\ln p_{i}+p_{i}N\ln N-p_{i}N)\\&=N\ln N-N-N(\sum _{i}p_{i}\ln p_{i})-N\ln N+N\\&=N(-\sum _{i}p_{i}\ln p_{i})\end{aligned}}}

In this way, the ${\displaystyle T\!}$ function is shown to equal ${\displaystyle NS\!}$ , where ${\displaystyle S\!}$ denotes the Shannon information entropy. Maximizing the ${\displaystyle T\!}$ function to obtain the most unbiased probability assignment is thus equivalent to maximizing the Shannon entropy, and the maximum entropy principle is thereby derived from more concrete and intuitive premises.
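The agreement between T and NS, and the fact that the uniform distribution maximizes T, can be checked numerically. The sketch below uses an illustrative distribution and a hypothetical number N of true random results:

```python
import math

# Count the correspondence maps exactly and compare T = ln(N! / prod_i (p_i N)!) with N * S.
p = [0.5, 0.3, 0.2]
N = 1000                      # hypothetical number of equally probable "true" results
counts = [int(pi * N) for pi in p]

# lgamma(k + 1) = ln(k!), so this is the exact log-cardinality of the map set.
T = math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)
S = -sum(pi * math.log(pi) for pi in p)

# Stirling's approximation makes T / N approach S as N grows.
assert abs(T / N - S) < 0.02

# Among distributions over the same three events, the uniform one covers the most maps.
T_uniform = math.lgamma(N + 1) - 3 * math.lgamma(N / 3 + 1)
assert T_uniform > T
```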

# Summary

Frequently, in many sectors of science, the total number of possible states of a system, or of possible values of a random variable, is a quantity of central importance: many aspects of the behavior of the system or random variable are determined by it, and it is a direct measure of internal disorder, or of our ignorance. For example, when a set of values of a random variable is equally likely, the cardinality of the subset of values that give rise to some event directly indicates the probability of that event. When the set of possible values loses this symmetry and the probability distribution becomes non-uniform, the variable no longer behaves, in many respects, like an equally probable random variable with that number of values; rather, it behaves identically to some other equally probable variable with fewer possible values. It can thus be considered less disordered, or we have more information about it: its number of possible values effectively decreases when certain aspects of its behavior are focused upon. Finding the number of values of an equally probable random variable with identical behavior, and using it as the "effective number of possible values" to characterize the variable, is therefore of great use in studying its behavior, and this is, in the view of the present guide, exactly what the concept of entropy does. By this concept, the variable's disorder, or our ignorance about it, is given a quantitative and intuitive measure: the logarithm of its "effective number of states" or "effective number of possible values".
The three sections above have shown that in the three main fields of application of the concept of entropy, the traditional definitions can always be interpreted as the logarithm of an "effective number of states" in some intuitive way, and the defining formula ${\displaystyle S=-\sum _{i}p_{i}\ln p_{i}\!}$ always arises naturally when certain aspects of behavior are considered. It must be noted, however, that the assignment of an "effective number of states" unavoidably contains some arbitrariness: it depends strongly on which behavior of the random variable is focused upon in finding its equivalent equally probable counterpart. Consequently, definitions of entropy other than the traditional ${\displaystyle -\sum _{i}p_{i}\ln p_{i}\!}$ can be expected whenever the focus is on aspects of behavior that the traditional formula, despite its success in the three sectors above, cannot capture. Recent debates over the possible non-uniqueness of the entropy definition, and proposals of other forms of entropy, can thus be understood quite readily within the framework of the intuitive interpretation given in the present guide. The traditional information-theoretic derivation of entropy, while of great importance and rigor, tends to lack an intuitively concrete picture, so this guide hopes to offer a new perspective on the concept of entropy, in which the different definitions of entropy are regarded as assigning an "effective number of states" by focusing on different aspects of the behavior of the random variable. In this way, the respective merits and utilities of the different kinds of entropy arising in various sectors of science become apparent.

# References

1. Clausius, Rudolf (1867). The mathematical theory of heat: with its applications to the steam-engine and to the physical properties of bodies. London: John van Voorst.
2. Boltzmann, Ludwig (1995). Lectures on Gas Theory. New York: Dover Publications.
3. Gibbs, Josiah Willard (1902). Elementary principles in statistical mechanics. New York: Scribner's Sons.
4. Shannon, C. E. (1948). "A mathematical theory of communication". Bell System Technical Journal 27: 379–423, 623–656.
5. Jaynes, E. T. (1957). "Information Theory and Statistical Mechanics". Physical Review 106 (4): 620–630. doi:10.1103/PhysRev.106.620.
6. Lin, Shu-Kun (1999). "Diversity and Entropy". ENTROPY 1 (1): 1–3. doi:10.3390/e1010001.
7. Jaynes, E. T. (1957). "Information Theory and Statistical Mechanics". Physical Review 106 (4): 620–630. doi:10.1103/PhysRev.106.620.
8. Gibbs, Josiah Willard (1902). Elementary principles in statistical mechanics. New York: Scribner's Sons.
9. Gibbs, Josiah Willard (1902). Elementary principles in statistical mechanics. New York: Scribner's Sons. Pathria, R. K. (1996). Statistical Mechanics (2nd ed.). Oxford: Butterworth-Heinemann.
10. Schwabl, Franz (2006). Statistical Mechanics (3rd ed.). Berlin: Springer.
11. Landau, L. D.; Lifshitz, E. M. (1986). Statistical Physics, Part 1 (3rd ed.). Oxford: Butterworth-Heinemann.
12. Shannon, C. E. (1948). "A mathematical theory of communication". Bell System Technical Journal 27: 379–423, 623–656.
13. Kardar, Mehran (2007). Statistical Physics of Particles. Cambridge University Press.
14. Jaynes, E. T. (1957). "Information Theory and Statistical Mechanics". Physical Review 106 (4): 620–630. doi:10.1103/PhysRev.106.620.
15. Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.