Tags: math, probability, statistics
Yo que siempre trabajo y me desvelo
por parecer que tengo de poeta
la gracia que no quiso darme el cielo
– Miguel de Cervantes Saavedra
Roughly translated by myself:
I, who always work and lose sleep
trying to seem to have the poet's
grace that heaven did not wish to grant me
Miguel de Cervantes was one of the greatest writers of all time and yet he wished so much to be better at poetry. He was actually a good poet, but his poetry was not at the same astounding level as his novels.
I'm not as good at anything as Cervantes was at writing, but I love that quote because I empathize completely, not because of poetry, but because of mathematics. I bet I'm not the only engineer who feels the same.
I added this preface after writing this post because it is so heavy in mathematics and has taken me a lot of effort. I'm really happy nevertheless. I have known about Cox's theorem for some time, but I had never pushed through the proof before, and after writing this I think I have learned something.
This post is an attempt to give a derivation of Cox's theorem in a way I find palatable to my taste. There are plenty of proofs, and Jaynes's book [2] has one under some assumptions of differentiability (the book was previously fully downloadable, and at least the first chapters, where the proof is given, remain so). The assumption of differentiability can be dropped, but it is mentioned in the book that:
Aczél derives the general solution, Eq. (2–27) below, without assuming differentiability; unfortunately, the proof fills eleven pages (256-267) of his book.
Aczél's book [1] is not exactly light reading but I think the proof can be shortened if specialized for our application and I have tried to make it more straightforward from my point of view.
I also consulted The Uncertain Reasoner's Companion [3], which seems to give a proof similar to the one I got for the associativity equation and also contains a proof for the negation equation. It is a very terse book that sometimes skips steps; I have tried to fill in the gaps.
This is, more or less, the second part to this other post.
Instead of probabilities we will talk about plausibilities. The reason is that probabilities already have a well-defined meaning as given by the axioms we have already seen. Plausibilities are, for now, some vague numbers that we assign to uncertain events. The program we are going to follow is this: if we reason about plausibilities without logical contradictions, we reach the conclusion that plausibilities must obey the probability axioms and are therefore themselves probabilities. A more detailed and careful explanation is given in Jaynes's book.
We are going to write plausibilities as \(A|Z\) where \(A\) is some (uncertain) event and \(Z\) is our knowledge. There is always some knowledge. For example, imagine a coin. You might think that in the absence of knowledge you assign probability 1/2 to heads and 1/2 to tails, but actually you were able to do so because you know that there are two possible outcomes and no more. If you want an example of no knowledge you can find one: a stone. But stones are not able to assign probabilities. What kind of object is \(Z\)? We cannot answer that unless we get a complete mathematical formalization of intelligence, but it must at least be able to contain the truth or falsity of the logical propositions whose plausibility we are considering. For example, imagine a fair die: we must be able to add to our knowledge “the result of the throw is odd”.
I have decided to write plausibilities using the above notation instead of, say, writing \(p (A|Z)\) or something similar that mimics more closely the way we write probabilities. There are two reasons for this. The first one is that, since we are going to work with functional equations with the plausibilities appearing inside them, you quickly end up with enough parentheses to choke a Lisp programmer. The second one is that we are going to manipulate plausibilities as if they were just real numbers, without caring that those real numbers come from some event space through a function. This latter point actually carries an implicit requirement: we will assume that for every plausibility value we are able to find events that give that value. This has been mathematically formalized in the literature but I don't think it adds much to the discussion. See Requirement (Co5) of Theorem 3.4, page 24 in [3].
In the following I will write \(AB\) to mean \(A \text{ and } B\), \(A + B\) to mean \(A \text{ or } B\) and \(\bar{A}\) to mean not \(A\).
First, and maybe more controversially (see [4], which I found quite readable and is downloadable), we expect that given two events we can answer which one is more plausible. That is, the plausibilities of all events are comparable. This translates into representing plausibilities with real numbers. Natural and rational numbers are ordered too, but it seems quite restrictive to consider only them, and we are going to assume that plausibilities are continuous.
Let \(Z\) represent known information, let \(A\), \(B\) and \(C\) be arbitrary propositions, and write truth as \(\top\) and falsity as \(\bot\). We write \(A|Z \in \mathbb{R}\) to represent the plausibility of \(A\) given that we know \(Z\). Define \(\alpha = \bot |Z\) and \(\omega = \top |Z\) to be constants independent of \(Z\). We expect then:
where \(\alpha\) and \(\omega\) are not necessarily finite, but without loss of generality we assume them finite and write closed intervals \([\alpha, \omega]\) for convenience.
It is reasonable that if we know the plausibility \(B|Z\), and the plausibility of \(A\) supposing that \(B\) is true, \(A|BZ\), then that determines the plausibility \(AB|Z\). If we expect \(AB|Z\) to depend only on \(B|Z\) and \(A|BZ\), there must exist some function \(F\) such that:
\(\displaystyle AB|Z = F (A|BZ, B|Z) \) | (1) |
It also stands to reason that if \(B|Z\) increases while \(A|BZ\) stays the same then \(AB|Z\) increases, and the same applies when increasing \(A|BZ\) while \(B|Z\) is held constant. We also expect that small changes in \(B|Z\) or \(A|BZ\) translate into small changes in \(AB|Z\). In math lingo:
We expect \(F : [\alpha, \omega]^2 \to [\alpha, \omega]\) to be strictly increasing and continuous in each variable.
Also, notice by setting \(A = \top\) or \(B = \top\) in equation (1) that \(\omega\) transforms \(F\) into an identity: \(\top B|Z = B|Z = F (\omega, B|Z)\) and \(A \top |Z = A|Z = F (A|Z, \omega)\), because \(\top |BZ = \top |Z = \omega\) and \(A| \top Z = A|Z\).
That is:
\(\displaystyle F (\omega, v) = v, \qquad F (u, \omega) = u \)
By the way, \(u\) and \(v\) will always represent plausibilities.
In a similar fashion, setting \(A = \bot\) or \(B = \bot\) makes the conjunction a contradiction, so \(\alpha = \bot B|Z = F (\bot |BZ, B|Z) = F (\alpha, B|Z)\) and \(\alpha = A \bot |Z = F (A| \bot Z, \alpha)\).
We get (\(A| \bot Z\) can take any plausibility you want):
\(\displaystyle F (\alpha, v) = \alpha, \qquad F (u, \alpha) = \alpha \)
Let's see what associativity of logical conjunction implies for our function \(F\):
And with a different grouping:
Renaming \(x = A|BCZ\), \(y = B|CZ\) and \(z = C|Z\) and equating both expressions we get the equation:
As we have said before, we also assume the reverse: that for any \(x, y, z\) we can always find \(A, B, C\) and \(Z\) such that:
About commutativity
It annoyed me, the first time I read a proof of Cox's theorem, that we analyze what condition the associativity of logical conjunction imposes but a discussion of commutativity was missing. Let's do it. We now invoke the commutativity of logical conjunction:
\(\displaystyle F (A|BZ, B|Z) = F (B|AZ, A|Z) \) | (2) |
But we get no condition on \(F\)! What is even worse is that we know \(F\) should be commutative, because the rule for conditional probability, which is what we are trying to get to in case you have forgotten, is just multiplication, which is certainly commutative. Mathematics is playing a prank on us: we will get commutativity from associativity + continuity + strict monotonicity. If you think about (2), it's actually fortunate that it is not satisfied merely by the structure of \(F\); otherwise we could not use it to get \(B|AZ\) from \(A|BZ\), and our beloved Bayes' theorem would not be useful.
To simplify notation we will define an operator \(\circ : [\alpha, \omega]^2 \to [\alpha, \omega]\) as:
\(\displaystyle u \circ v = F (u, v) \)
This operator maps the plausibilities of two propositions to the plausibility of their logical conjunction. All plausibilities lie inside the real interval \([\alpha, \omega]\) where \(\alpha\) is the plausibility of a contradiction and \(\omega\) the plausibility of a tautology.
The operator, as has been said, is continuous and strictly increasing in each variable.
As we have seen, it is associative:
\(\displaystyle u \circ (v \circ w) = (u \circ v) \circ w \) | (3) |
It has a unit element, which is easy to prove is unique:
\(\displaystyle u \circ \omega = \omega \circ u = u \) | (4) |
And a zero element, also unique:
\(\displaystyle u \circ \alpha = \alpha \circ u = \alpha \) | (5) |
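If you want something concrete to play with, here is a small Python sketch. Ordinary multiplication on \([0, 1]\) is just an assumed example operator (the candidate we expect to end up with), and the script checks conditions (3), (4) and (5) numerically:

```python
# Sanity check: ordinary multiplication on [0, 1] (an assumed example operator)
# satisfies associativity (3), the unit element (4) with omega = 1, and the
# zero element (5) with alpha = 0.

def circ(u, v):
    return u * v

alpha, omega = 0.0, 1.0
samples = [0.1, 0.35, 0.5, 0.8, 1.0]

for u in samples:
    assert circ(u, omega) == u and circ(omega, u) == u              # (4)
    assert circ(u, alpha) == alpha and circ(alpha, u) == alpha      # (5)
    for v in samples:
        for w in samples:
            assert abs(circ(u, circ(v, w)) - circ(circ(u, v), w)) < 1e-12  # (3)

print("multiplication satisfies (3), (4) and (5) on the sampled points")
```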
We are going to find now the most general form for an operator that satisfies all the conditions stated in this section.
The following proof was written after following Section 6.2, The Associativity Equation, of [1]. It is vastly simplified, since we can jump directly to strict monotonicity and use some additional information. It has been two decades since I touched an undergrad calculus book, so if you find some discrepancy between what I say and what an accomplished mathematician has said, I'm sure you will apply common sense. I hope that, if not completely rigorous, it is at least correct in spirit. Let's start.
Using associativity we define unambiguous exponentiation for non-negative integers as:
I have used brackets in the exponent as a reminder that, although \(u\) is a real number, this exponentiation is not the usual exponentiation of real numbers.
From the definition and associativity it follows that:
Notice that, for a fixed \(n\), \(v^{[n]}\) is a continuous, strictly increasing function of \(v\) whose range is the whole interval \([\alpha, \omega]\), and so the equation:
\(\displaystyle v^{[n]} = u \)
always has a unique solution for every \(u\), which we will write as \(v = u^{[1 / n]}\).
We define:
\(\displaystyle u^{[m / n]} = (u^{[m]})^{[1 / n]} \)
We are going to show that also:
\(\displaystyle u^{[m / n]} = (u^{[1 / n]})^{[m]} \) | (6) |
Calling \(v = (u^{[m]})^{[1 / n]}\) and \(w = u^{[1 / n]}\) we have by definition:
And from this:
Since exponentiation \(v^{[n]}\) is strictly increasing and continuous in \(v\), this has a unique solution:
And so (6) is proved.
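The bracketed exponentiation is easy to play with numerically. Below is a Python sketch using an assumed example operator (not the one we will end up choosing, just some continuous, strictly increasing, associative operator on \((0, 1]\) with unit 1); the \(n\)-th root is found by bisection, which works precisely because \(v^{[n]}\) is strictly increasing in \(v\), and the last line checks both readings of (6):

```python
# Bracketed exponentiation u^[n] and its inverse u^[1/n] for an assumed example
# operator on (0, 1] (continuous, strictly increasing, associative, omega = 1).

def circ(u, v):
    return u * v / (u + v - u * v)

def power(u, n):
    """u^[n] = u o u o ... o u (n factors); the empty product is omega = 1."""
    result = 1.0
    for _ in range(n):
        result = circ(result, u)
    return result

def root(u, n, tol=1e-12):
    """The unique v in (0, 1] with v^[n] = u, i.e. v = u^[1/n], by bisection."""
    lo, hi = 1e-9, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if power(mid, n) < u:
            lo = mid          # v^[n] is strictly increasing in v
        else:
            hi = mid
    return 0.5 * (lo + hi)

u = 0.7
print(abs(power(root(u, 3), 3) - u) < 1e-9)                     # (u^[1/3])^[3] recovers u
print(abs(power(root(u, 3), 2) - root(power(u, 2), 3)) < 1e-6)  # both readings of (6) agree
```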
We can now add rational exponents with the same denominator:
And from that, rational exponents in general:
We can now define exponentiation for any real number \(x\) in the usual way: pick a sequence of rational numbers \(r_n\) converging to \(x\) and define:
Since \(u^{[r_n]}\) is strictly decreasing and bounded, the above limit always exists and is therefore unique, and the function so defined is continuous in the exponent \(x\). Since the rationals are dense in the reals, it is the unique continuous extension from the rationals to the reals (see for example this).
\(u^{[x]}\) is strictly decreasing in \(x\), and since its range is the whole interval \([\alpha, \omega]\) it is invertible. Here is a proof: let \(r_n \to x\) and \(s_n \to y\) be sequences of rational numbers. If \(x < y\) then for some \(N\) we have that \(n \ge N \Longrightarrow r_n < a < b < s_n\), where \(a\) and \(b\) are rational too, and from strict monotonicity on the rationals:
Taking limits:
From the additivity of rational exponents we get the additivity of real exponents. Let \(t_n \to x + y\) be a sequence of rationals, let \(r_n \to x\) be a sequence of rationals, and define \(s_n = t_n - r_n\), which is a sequence of rationals too; since \(r_n\) and \(t_n\) have limits, we have \(s_n \to y\). Then we have:
And since the limits exist and the operator is continuous on both variables:
And so we proved:
Pick any number \(c \in (\alpha, \omega)\) and define the invertible function \(f\) as:
Notice that:
Now, plugging \(x = f^{- 1} (u)\) and \(y = f^{- 1} (v)\) into the above, we get:
\(\displaystyle u \circ v = f (f^{- 1} (u) + f^{- 1} (v)) \) | (7) |
Note that the base number \(c\) selected is irrelevant for the above equation, since had we picked another base \(c'\) we would have gotten another function \(f'\) satisfying \(f' (u) = f (\gamma u)\) for some constant \(\gamma\), which would disappear in the above expression.
Some comments: I find it striking that we ended up with a commutative operator, as can be seen from the above expression. Why? The proof is there, but as with so many mathematical proofs it is easy not to see the forest for the trees. There are three key steps.
For every associative operator we can define an exponentiation operator as we have done here which satisfies addition of exponents for integer values.
Commutativity \(u \circ v = v \circ u\) is always satisfied for any \(u\) and \(v\) that can be written as exponentials with some common base \(c\):
And now continuity and strict monotonicity always allow us to write any two numbers as exponentials of a common base, thereby getting commutativity for the whole domain.
Equation (7) shows a necessary condition for the operator. Any solution can necessarily be written in this form, where \(f\) is a strictly monotonic, continuous, invertible function.
It could be the case that if we plug (7) into (3) the equation is not satisfied, or that conditions (4) or (5) are not met. That would mean our problem has no solution! Fortunately this is not the case. It is easy to see that any operator defined using (7) with an arbitrary continuous, strictly monotonic, invertible function \(f\) satisfies associativity as well as continuity and strict monotonicity of the operator. Notice also that we have \(\omega = f (0)\) and \(\alpha = f (\pm \infty)\), and so all conditions are satisfied and (7) is truly the most general solution to our problem.
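That claim is also easy to check numerically. Here is a minimal Python sketch with an arbitrarily chosen \(f\) (the choice \(f(x) = 1/(1-x)\) on \((-\infty, 0]\) is just an assumed example: it is continuous, strictly increasing and invertible, with \(f(0) = 1\)):

```python
import math

# Any operator built from (7) with a continuous, strictly increasing,
# invertible f is associative and commutative, with unit omega = f(0).
# f(x) = 1 / (1 - x) is just an assumed example choice.

def f(x):
    return 1.0 / (1.0 - x)

def f_inv(u):
    return 1.0 - 1.0 / u

def circ(u, v):
    return f(f_inv(u) + f_inv(v))    # equation (7)

samples = [0.2, 0.5, 0.75, 1.0]
for u in samples:
    assert math.isclose(circ(u, 1.0), u)               # omega = f(0) = 1 is the unit
    for v in samples:
        assert math.isclose(circ(u, v), circ(v, u))    # commutativity, for free
        for w in samples:
            assert math.isclose(circ(u, circ(v, w)), circ(circ(u, v), w))  # (3)

print("(7) gives an associative and commutative operator with unit f(0)")
```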
Different choices of \(f\) will give different operators (unless \(f' (u) = f (\gamma u)\), which gives the exact same operator), although they are all somehow equivalent, as they represent remappings of plausibilities by continuous invertible functions. We are going to choose \(f (u) = e^u\), with \(f : (- \infty, 0] \to [0, 1]\), so we get the exact form and range we have with probabilities:
And so:
We repeat that any continuous invertible function would have done the trick: for example, using \(f (x) = 100 e^x\) we get multiplication but with percentages, and \(f (x) = x\) would give addition of log probabilities instead of multiplication. But since they are all basically equivalent, let's just use the usual one.
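Just to make the equivalence tangible, here is a tiny Python sketch with those three (assumed, purely illustrative) choices of \(f\); they all express the same rule in different coordinates:

```python
import math

# Three choices of f in (7): probabilities in [0, 1], "percentages",
# and log-probabilities. Same rule, different coordinates.

def make_circ(f, f_inv):
    return lambda u, v: f(f_inv(u) + f_inv(v))

prod = make_circ(math.exp, math.log)                                        # f(x) = e^x
perc = make_circ(lambda x: 100 * math.exp(x), lambda u: math.log(u / 100))  # f(x) = 100 e^x
logp = make_circ(lambda x: x, lambda u: u)                                  # f(x) = x

print(prod(0.5, 0.4))                                      # ~0.2
print(perc(50.0, 40.0))                                    # ~20, the same statement in percent
print(logp(math.log(0.5), math.log(0.4)) - math.log(0.2))  # ~0, addition of log-plausibilities
```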
We now look at what conditions we can get for negation. It seems quite reasonable that \(\bar{A} |Z\) depends only on \(A|Z\) through some function \(S\):
We expect \(S\) to be a continuous and decreasing function, and since \(\overline{\bar{A}} = A\) we also expect that \(S (S (x)) = x\).
To simplify things we will assume multiplication of plausibilities for the logical conjunction and the usual range \([0, 1]\), since all the choices are equivalent anyway. Let's see what additional constraints we can put on \(S\) by using logical conjunction. Applying the formula for conditional plausibility:
Now we substitute \(A|BZ = S (\bar{A} |BZ)\):
But from:
We can substitute \(\bar{A} |BZ = \bar{A} B| Z / B|Z\) to get:
And this time we do apply commutativity of logical conjunction, exchanging the roles of \(A\) and \(B\) in the right side:
And of course equating:
There are four different plausibilities here, and amazingly they can be reduced to two by noticing that, for every event \(D\), if with some inspiration we write:
We get the following equalities:
Substituting these equalities we get finally:
And calling \(A|Z = x\) and \(B|Z = y\) we get:
\(\displaystyle yS \left( \frac{S (x)}{y} \right) = xS \left( \frac{S (y)}{x} \right) \) | (8) |
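Before solving (8) in general, it is reassuring to check that the negation rule we expect to recover in the end, \(S(x) = 1 - x\), does satisfy it. A quick Python check, sampling only points where the arguments of \(S\) stay inside \([0, 1]\):

```python
import math

# The familiar S(x) = 1 - x satisfies (8): y*S(S(x)/y) = x*S(S(y)/x),
# checked on points where S(x)/y and S(y)/x stay inside [0, 1].

S = lambda t: 1.0 - t

pts = [0.1, 0.3, 0.6, 0.9]
for x in pts:
    for y in pts:
        if S(x) <= y and S(y) <= x:
            assert math.isclose(y * S(S(x) / y), x * S(S(y) / x), abs_tol=1e-12)

print("S(x) = 1 - x satisfies (8) on the sampled points")
```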
This proof is from [3], but it has been expanded; in particular, (31) is not proved in the original. Also, the original proof asserts, in a rather convoluted manner, that \(S (1 / 2) = 1 / 2\). This is arbitrary and unnecessary.
We define for \(y \ge x \ge 0\):
\(\displaystyle y \dot{-} x = yS (x / y) \) | (9) |
The following simple property is important:
\(\displaystyle kx \dot{-} ky = kxS \left( \frac{ky}{kx} \right) = k \left[ xS \left( \frac{y}{x} \right) \right] = k (x \dot{-} y) \) | (10) |
We can recover \(S (x)\) by substituting \(y = 1\) in our operator to get for \(x \in [0, 1]\):
\(\displaystyle S (x) = 1 \dot{-} x \) | (11) |
Using (10) and (11) in (8) we get, for \(x, y \in [0, 1]\):
\(\displaystyle y \dot{-} (1 \dot{-} x) = x \dot{-} (1 \dot{-} y) \) | (12) |
Actually, after proving property (10) and getting (12), we can completely forget about \(S\) and simply manipulate our new operator. The reason is that any operator satisfying property (10) will have the form of (9); see Section 5.2, The Equations of Homogeneous Functions and Related Equations, of [1].
Substituting \(y = 1\) into (12) we get the equivalent of \(S (S (x)) = x\), which surprisingly can be considered a consequence of compatibility with conjunction rather than an assumption:
\(\displaystyle 1 \dot{-} (1 \dot{-} x) = x \) | (13) |
The above is the start of our escalation to more general term reordering equations involving triplets of variables.
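If you want to see (10), (12) and (13) in action with a non-trivial \(S\), here is a small Python check. It jumps ahead and borrows a member of the family we will derive at the end (the assumed \(\delta = 2\) case, \(S(x) = (1 - \sqrt{x})^2\)), which does satisfy (8); with it, \(y \dot{-} x\) behaves like subtraction of square roots:

```python
import math

# The dot-minus operator (9) for an S that satisfies (8): here the delta = 2
# member of the family derived later, S(x) = (1 - sqrt(x))**2 (assumed example).

delta = 2.0
S = lambda t: (1.0 - t ** (1.0 / delta)) ** delta
dsub = lambda y, x: y * S(x / y)      # (9), needs y >= x >= 0

pts = [0.2, 0.4, 0.7, 0.9]
for x in pts:
    for y in pts:
        if y >= x:
            for k in pts:
                assert math.isclose(dsub(k * y, k * x), k * dsub(y, x), abs_tol=1e-12)        # (10)
        if S(x) <= y and S(y) <= x:
            assert math.isclose(dsub(y, dsub(1.0, x)), dsub(x, dsub(1.0, y)), abs_tol=1e-12)  # (12)
    assert math.isclose(dsub(1.0, dsub(1.0, x)), x, abs_tol=1e-12)                            # (13)

print("(10), (12) and (13) hold on the sampled points")
```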
Let's search for more general forms. Substituting \(x \to 1 \dot{-} x\) and \(y \to 1 \dot{-} y\) in (12) and using (13) we get another version similar to (12):
\(\displaystyle (1 \dot{-} y) \dot{-} x = (1 \dot{-} x) \dot{-} y \) | (14) |
Multiply by \(z\) both sides of (12) and distribute \(z\) using (10):
\(\displaystyle z [y \dot{-} (1 \dot{-} x)] = z [x \dot{-} (1 \dot{-} y)] = zy \dot{-} (z \dot{-} zx) = zx \dot{-} (z \dot{-} zy) \) | (15) |
And substitute \(zy \to y\) and \(zx \to x\) to get, for \(0 \le x, y \le z\):
\(\displaystyle y \dot{-} (z \dot{-} x) = x \dot{-} (z \dot{-} y) \) | (16) |
Doing the same process to (14) we get another version:
\(\displaystyle (z \dot{-} y) \dot{-} x = (z \dot{-} x) \dot{-} y \) | (17) |
As before we just need to remember (10), (16) and (17) since all properties can be derived from them by a suitable substitution of variables. Actually (16) and (17) are easy to remember: they say that you can swap variables as long as their “sign” remains correct in the resulting expression.
From (16) and (17) it is quite easy to get cancellation results. They all basically say that you can cancel anything as long as you don't get negative results.
Using (16):
\(\displaystyle x \dot{-} (x \dot{-} y) = y \dot{-} (x \dot{-} x) = y \) | (18) |
Using first (17) and then (18):
\(\displaystyle (x \dot{-} z) \dot{-} (y \dot{-} z) = [x \dot{-} (y \dot{-} z)] \dot{-} z = [z \dot{-} (y \dot{-} x)] \dot{-} z = x \dot{-} y \) | (19) |
Using first (16) and then (18):
\(\displaystyle (z \dot{-} x) \dot{-} (z \dot{-} y) = y \dot{-} [z \dot{-} (z \dot{-} x)] = y \dot{-} x \) | (20) |
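Here is a quick numerical check of the cancellation identities (18), (19) and (20), again with the assumed \(\delta = 2\) member of the family (\(S(x) = (1 - \sqrt{x})^2\)), on points chosen so that every intermediate difference stays non-negative:

```python
import math

# Cancellation identities (18), (19), (20) for the same assumed S (delta = 2).

delta = 2.0
S = lambda t: (1.0 - t ** (1.0 / delta)) ** delta
dsub = lambda y, x: y * S(x / y)      # (9), needs y >= x

a, b, c = 0.49, 0.25, 0.09            # square roots 0.7, 0.5, 0.3, so a >= b >= c

assert math.isclose(dsub(a, dsub(a, b)), b, abs_tol=1e-12)                     # (18)
assert math.isclose(dsub(dsub(a, c), dsub(b, c)), dsub(a, b), abs_tol=1e-12)   # (19)
assert math.isclose(dsub(dsub(a, c), dsub(a, b)), dsub(b, c), abs_tol=1e-12)   # (20)

print("(18), (19) and (20) hold for a = 0.49, b = 0.25, c = 0.09")
```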
Define now yet another operator \(\dot{+}\):
\(\displaystyle x \dot{+} y = 1 \dot{-} [(1 \dot{-} x) \dot{-} y] \) | (21) |
Notice that (14) transforms to this:
\(\displaystyle x \dot{+} y = y \dot{+} x \) | (22) |
Let's see how to get the associative property for \(\dot{+}\) from (16) and (17). By repeated application of the definition (21) we get:
After reordering using (17):
But the above is the repeated application of the definition (21) to \(x \dot{+} (y \dot{+} z)\) and so we can say:
\(\displaystyle x \dot{+} (y \dot{+} z) = (x \dot{+} y) \dot{+} z \) | (23) |
Subtraction propagates over addition:
\(\displaystyle \begin{array}{llll} x \dot{-} (y \dot{+} z) & = & x \dot{-} \{ 1 \dot{-} [(1 \dot{-} y) \dot{-} z] \} & (21)\\ & = & [(1 \dot{-} y) \dot{-} z] \dot{-} (1 \dot{-} x) & (16)\\ & = & [(1 \dot{-} z) \dot{-} y] \dot{-} (1 \dot{-} x) & (17)\\ & = & [(1 \dot{-} z) \dot{-} (1 \dot{-} x)] \dot{-} y & (17)\\ & = & (x \dot{-} z) \dot{-} y & (20)\\ & = & (x \dot{-} y) \dot{-} z & (17) \end{array} \) | (24) |
We can easily prove now how multiplication distributes over addition:
\(\displaystyle \begin{array}{llll} k (x \dot{+} y) & = & k \{ 1 \dot{-} [(1 \dot{-} x) \dot{-} y] \} & (21)\\ & = & k \dot{-} [(k \dot{-} kx) \dot{-} ky] & (10)\\ & = & k \dot{-} [k \dot{-} (kx \dot{+} ky)] & (24)\\ & = & kx \dot{+} ky & (18) \end{array} \) | (25) |
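And a numerical check of (22), (23) and (25), with the same assumed \(S\) (\(\delta = 2\)), on values small enough that every intermediate result stays inside \([0, 1]\):

```python
import math

# Commutativity (22), associativity (23) and distributivity (25) of dot-plus,
# built from the same assumed S (delta = 2) through definition (21).

delta = 2.0
S = lambda t: (1.0 - t ** (1.0 / delta)) ** delta
dsub = lambda y, x: y * S(x / y)                  # (9)
dadd = lambda x, y: dsub(1.0, dsub(S(x), y))      # (21)

pts = [0.01, 0.04, 0.09]    # small enough that every intermediate stays in [0, 1]
k = 0.5
for x in pts:
    for y in pts:
        assert math.isclose(dadd(x, y), dadd(y, x), abs_tol=1e-12)              # (22)
        assert math.isclose(k * dadd(x, y), dadd(k * x, k * y), abs_tol=1e-12)  # (25)
        for z in pts:
            assert math.isclose(dadd(x, dadd(y, z)), dadd(dadd(x, y), z), abs_tol=1e-12)  # (23)

print("(22), (23) and (25) hold on the sampled points")
```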
Being a continuous function from \([0, 1]\) to \([0, 1]\), \(S (x)\) necessarily has a fixed point; let's call it \(\gamma\):
\(\displaystyle S (\gamma) = \gamma \) | (26) |
Note that this means that:
\(\displaystyle \gamma \dot{+} \gamma = 1 \) | (27) |
Let's write a new multiplication operator \(\star\) defined as:
\(\displaystyle m \star x = \underbrace{x \dot{+} \cdots \dot{+} x}_m \) | (28) |
From associativity (23) we get that:
\(\displaystyle m \star (n \star x) = (mn) \star x \) | (29) |
And from distributivity (25) we get:
\(\displaystyle k (m \star x) = m \star (kx) \) | (30) |
We prove now by induction that:
\(\displaystyle (m \star x)^n = m^n \star x^n \) | (31) |
Proof of (31):
We prove now:
\(\displaystyle 2^n \star x = \frac{x}{\gamma^n} \) | (32) |
Proof of (32):
We are now going to prove that if \(\delta = \log_2 \frac{1}{\gamma}\):
\(\displaystyle m \star x = m^{\delta} x \) | (33) |
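Before the proof, here is a quick numerical check of (26), of \(\delta = \log_2 \frac{1}{\gamma}\), and of (33), once more with the assumed \(\delta = 2\) member of the family (whose fixed point is \(\gamma = 1/4\)):

```python
import math

# With S(x) = (1 - sqrt(x))**2 (assumed delta = 2): the fixed point is gamma = 1/4,
# delta = log2(1/gamma) = 2, and the m-fold dot-plus multiplies by m**delta.

delta = 2.0
S = lambda t: (1.0 - t ** (1.0 / delta)) ** delta
dsub = lambda y, x: y * S(x / y)
dadd = lambda x, y: dsub(1.0, dsub(S(x), y))

gamma = 0.25
assert math.isclose(S(gamma), gamma)                   # (26)
assert math.isclose(math.log2(1.0 / gamma), delta)     # delta = log2(1/gamma)

def star(m, x):
    """m-fold dot-plus of x with itself, definition (28)."""
    acc = x
    for _ in range(m - 1):
        acc = dadd(acc, x)
    return acc

x = 0.001
for m in [2, 3, 5, 10]:
    assert math.isclose(star(m, x), m ** delta * x, abs_tol=1e-12)   # (33)

print("gamma = 1/4, delta = 2, and (33) holds for the sampled m")
```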
The idea comes from the fact that we can get arbitrarily close to any number \(m\) with numbers of the form \(2^{\frac{a}{k}}\). Let's define two sequences of rational numbers \(a_n / k_n\) and \(b_n / k_n\) converging from below and from above to \(\log_2 m\); then:
It would be great if we could express any addition \(x \dot{+} y\) as an addition of multiplications with a common factor. That way we could use (33) to compute any addition. That is, if there were some \(0 < c < 1\) and natural numbers \(m, n\) such that:
It would require:
Which results in:
Obviously the above is not in general a rational number, but we can get arbitrarily close with rational numbers, so let's throw some limits and continuity arguments into the mix.
Let's go for it. Pick any sequence of rational numbers \(m_k / n_k\) such that:
Define the sequence of real numbers:
And the following sequence:
We have then:
Taking limits:
But continuity everywhere gets:
And so:
Substituting the above in:
And solving for \(1 \dot{-} x\) we get the solution:
\(\displaystyle S (x) = 1 \dot{-} x = (1 - x^{1 / \delta})^{\delta} \) | (34) |
We have proved that \((8) \Longrightarrow (34)\), and it is easy to check that (34) satisfies (8), so (34) is the only possible continuous, strictly decreasing solution defined on \([0, 1]\) (one for each \(\delta > 0\)).
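Here is that check done numerically in Python, for a few values of \(\delta\), sampling only points where the arguments of \(S\) stay inside \([0, 1]\):

```python
import math

# For any delta > 0, S(x) = (1 - x**(1/delta))**delta is an involution and
# satisfies the functional equation (8).

def make_S(delta):
    return lambda t: (1.0 - t ** (1.0 / delta)) ** delta

pts = [0.1, 0.4, 0.7, 0.95]
for delta in [0.5, 1.0, 2.0, 3.7]:
    S = make_S(delta)
    for x in pts:
        assert math.isclose(S(S(x)), x, abs_tol=1e-12)         # S(S(x)) = x
        for y in pts:
            if S(x) <= y and S(y) <= x:                        # keep the arguments in [0, 1]
                assert math.isclose(y * S(S(x) / y), x * S(S(y) / x), abs_tol=1e-12)  # (8)

print("(34) is an involution and satisfies (8) for the sampled deltas")
```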
In the language of plausibilities we now have the somewhat strange:
Or in a more symmetric form:
Notice that we can always remap our plausibilities by the continuous strictly increasing function \(f (x) = x^{\frac{1}{\delta}}\) that gives the more familiar equation:
\(\displaystyle \bar{A} |Z + A| Z = 1 \) | (35) |
Notice that under this transformation the relationship of conditional probability is preserved:
And so we can simply choose (35) as the solution, modulo continuous transformations.
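A last numerical check of this remark, with the assumed \(\delta = 2\): the remapping \(g(x) = x^{1/\delta}\) turns (34) into the familiar sum rule (35) and leaves the product rule untouched:

```python
import math

# Remapping plausibilities by g(x) = x**(1/delta) (delta = 2 assumed): the
# negation rule becomes g(S(x)) + g(x) = 1 and products are still products.

delta = 2.0
S = lambda t: (1.0 - t ** (1.0 / delta)) ** delta
g = lambda t: t ** (1.0 / delta)

pts = [0.04, 0.25, 0.49, 0.81]
for u in pts:
    assert math.isclose(g(S(u)) + g(u), 1.0)         # sum rule (35) after remapping
    for v in pts:
        assert math.isclose(g(u * v), g(u) * g(v))   # the product rule is preserved

print("after remapping: the sum rule (35) and the product rule both hold")
```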
We now just need to combine negation and conjunction. Starting from the De Morgan laws:
To summarize: if we have a set of plausibilities that are logically consistent among themselves and satisfy some reasonable assumptions like:
They are represented by real numbers, so that they can be ordered.
Increasing the plausibility \(A|BZ\) or \(B|Z\) must increase \(AB|Z\).
Increasing the plausibility \(A|Z\) decreases \(\bar{A} |Z\).
Small changes in inputs give small changes in outputs.
Then we can always rescale them by a continuous strictly increasing function (and as such retain their ordering) in such a way that they lie in the interval \([0, 1]\), falsity is 0, truth is 1 and they obey the following laws which are equivalent to Kolmogorov's axioms (for finite cases).
Conditional plausibility
Negation of plausibilities
And as a consequence of the two above also:
Logical disjunction
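To close, a small sanity check on a toy model (a fair die, an assumed example that is not part of the derivation): the product rule, the negation rule, and the disjunction rule obtained from them through De Morgan all agree with plain counting:

```python
from fractions import Fraction

# Toy check on a fair die: product rule, negation rule, and the disjunction
# rule obtained from them via De Morgan, against plain counting.

outcomes = range(1, 7)
P = lambda event: Fraction(sum(1 for o in outcomes if event(o)), 6)

A = lambda o: o % 2 == 0                  # "the result is even"
B = lambda o: o >= 4                      # "the result is at least 4"
AB = lambda o: A(o) and B(o)
A_or_B = lambda o: A(o) or B(o)

P_A_given_B = Fraction(sum(1 for o in outcomes if B(o) and A(o)),
                       sum(1 for o in outcomes if B(o)))

assert P(AB) == P_A_given_B * P(B)               # product rule for conjunction
assert P(lambda o: not A(o)) == 1 - P(A)         # negation rule
assert P(A_or_B) == P(A) + P(B) - P(AB)          # disjunction via De Morgan

print("product, negation and disjunction rules agree on the die example")
```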
It has been hell to reference equations in KaTeX. I have manually numbered them with \tag{} and any time I needed to insert or delete an existing one I wanted to cry. I won't do it again; there must be a better way. I wouldn't be surprised if there were some errors in the equation numbering.
If someone, who knows why, reads this post, my advice is to skim over the proof and then try to replicate it. And be patient: sometimes the steps look magical, but they start to seem obvious after staring at them for some hours straight. It takes time and effort.
Joseph Aczél. Lectures on functional equations and their applications. Academic Press, 1966.
Edwin T. Jaynes. Probability theory: The logic of science. Cambridge University Press, 2003.
Jeff B. Paris. The uncertain reasoner's companion: a mathematical perspective, volume 39. Cambridge University Press, 2006.
Kevin S. Van Horn. Constructing a logic of plausible inference: a guide to Cox's theorem. International Journal of Approximate Reasoning, 34(1):3–24, 2003.