Notes on Probability Theory (Jaynes)
Chapter 1: Plausible Reasoning
- Extending Deductive Logic to Incomplete Information
- Jaynes reinterprets probability as an extension of deductive logic to cases of incomplete information.
- Whereas a classical syllogism (a deductive argument whose conclusion follows necessarily from its premises) goes: if \(P\) then \(Q\); \(P\) is true; therefore \(Q\) is true.
- In plausible reasoning we observe Q and infer that P is made more likely — but not proven — by that evidence.
- Example:
- A police officer sees a masked man near a broken window.
- Although many explanations are possible, the observation increases the plausibility that the man is a thief.
- This isn’t a strict logical deduction but a hypothesis update based on background experience.
- This is huge because it transforms how we approach uncertainty. Instead of being stuck with a binary true/false framework of traditional deductive logic, Jaynes’s reinterpretation bridges the gap between certainty and uncertainty.
- Traditional syllogisms guarantee a conclusion if the premises are met. However, in the real world we rarely have complete information.
- By observing Q and inferring that P becomes more likely (though not proven), we move from rigid deduction to flexible, evidence-based reasoning.
- Additionally, propositional logic can be used as the building block
- And plausibility can be quantitatively represented
- Jaynes asserts that plausibility should be represented by a real number.
- A higher number indicates a greater degree of belief in the truth of a proposition.
- Desiderata for Consistent Reasoning
- Jaynes proposes an imaginary robot whose brain is designed to follow a set of carefully chosen rules of plausible reasoning, derived from fundamental desiderata expected in human cognition.
- Its brain is to be designed by us, so that it reasons according to certain definite rules.
- These rules will be deduced from simple desiderata which would be desirable in human brains; i.e. a rational person, on discovering that they were violating one of these desiderata, would wish to revise their thinking.
- Jaynes introduces a set of criteria or desiderata that any such robot must satisfy:
- Degrees of plausibility are represented by real numbers
- Plausibilities are real numbers to allow a clear, unambiguous comparison (e.g., “more plausible” means numerically larger).
- Qualitative correspondence with common sense
- If additional information makes A more plausible while leaving the plausibility of B given A unchanged, then the joint plausibility of A and B should not decrease:
- If we update information \(C\) to \(C'\) in such a way that the plausibility for \(A\) is increased:
$$ (A \mid C') > (A \mid C) $$ but the plausibility for \(B\) given \(A\) remains unchanged:
$$ (B \mid A C') = (B \mid A C) $$ then this update can only produce an increase (never a decrease) in the plausibility that both \(A\) and \(B\) are true:
$$ (AB \mid C') \geq (AB \mid C) $$ and it must produce a decrease in the plausibility that \(A\) is false:
$$ (\overline{A} \mid C') < (\overline{A} \mid C) $$
- Consistent Reasoning:
- If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.
- The robot always takes into account all of the evidence it has relevant to a question.
- It does not arbitrarily ignore some of the information, basing its conclusions only on what remains.
- In other words, the robot is completely nonideological.
- The robot always represents equivalent states of knowledge by equivalent plausibility assignments.
- If in two problems the robot’s state of knowledge is the same (except perhaps for the labeling of the propositions), then it must assign the same plausibilities in both.
- Mind Projection Fallacy
- Jaynes warns against confusing the properties of our knowledge with properties of the world.
- When we say “the probability of event X is p,” we are expressing our state of knowledge rather than a physical attribute of the event itself.
- Many debates in probability stem from conflating epistemological uncertainty (our lack of complete information) with ontological indeterminacy (the inherent randomness of nature).
Chapter 2: The Quantitative Rules
- Jaynes derives the product rule and the sum rule by manipulating syllogisms - rules we take as axioms when we first encounter probability.
- (Forgive me for any latex/formatting crimes I commit)
Product Rule
Consider the expression \((AB \mid C).\) This can be decomposed in two natural ways:
- Is \(B\) true given \(C\)? (i.e., \(B \mid C\)) and then, given \(B\), ask: Is \(A\) true? (i.e., \(A \mid BC\))
- Alternatively, is \(A\) true given \(C\)? (i.e., \(A \mid C\)) and then, given \(A\), ask: Is \(B\) true? (i.e., \(B \mid AC\))
Thus, assume there exists a function \(F\) such that \((AB \mid C) = F\bigl[B \mid C,\, A \mid BC\bigr],\) and equivalently, \((AB \mid C) = F\bigl[A \mid C,\, B \mid AC\bigr].\)
Write \(u = (AB \mid C), \quad x = B \mid C, \quad y = A \mid BC.\) Requiring that the same scheme give a consistent answer for three propositions, \((ABC \mid D)\), decomposed either as \((AB)C\) or as \(A(BC)\), leads to the Associativity Equation: \(F\bigl[F(x,y),z\bigr] = F\bigl[x,F(y,z)\bigr].\)
It turns out that the general solution can be expressed in terms of a continuous, monotonic function \(w\) as \(F(x,y) = w^{-1}\bigl(w(x)w(y)\bigr).\) Substituting this form into the associativity condition yields \(F\bigl[F(x,y),z\bigr] = w^{-1}\bigl(w(F(x,y))\,w(z)\bigr) = w^{-1}\bigl(w(x)\,w(y)\,w(z)\bigr) = F\bigl[x,F(y,z)\bigr].\)
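Before continuing, a quick numerical sanity check (not part of Jaynes's derivation): a minimal Python sketch, with an arbitrarily chosen monotonic \(w\) (an assumption of this sketch), verifying that \(F(x,y) = w^{-1}\bigl(w(x)\,w(y)\bigr)\) satisfies the associativity equation.

```python
import math
import random

# Assumed, arbitrary continuous monotonic function and its inverse (for the check only).
def w(x):
    return math.exp(x) - 1.0

def w_inv(u):
    return math.log(1.0 + u)

def F(x, y):
    # Product-rule form: F(x, y) = w^{-1}(w(x) * w(y))
    return w_inv(w(x) * w(y))

random.seed(0)
for _ in range(1000):
    x, y, z = (random.uniform(0.1, 3.0) for _ in range(3))
    lhs = F(F(x, y), z)   # combine (x with y), then with z
    rhs = F(x, F(y, z))   # combine x with (y with z)
    assert math.isclose(lhs, rhs, rel_tol=1e-9), (lhs, rhs)

print("F(F(x,y),z) == F(x,F(y,z)) at all sampled points")
```

Any invertible monotone \(w\) works here; the point is only that the multiplicative form automatically satisfies the associativity constraint.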
Thus, we arrive at the following expressions:
$$ w(AB \mid C) = w(B \mid C)\,w(A \mid BC) \qquad \text{(i)} $$
$$ w(AB \mid C) = w(A \mid C)\,w(B \mid AC) \qquad \text{(ii)} $$
Equating (i) and (ii) gives:
$$ w(B \mid C)\,w(A \mid BC) = w(A \mid C)\,w(B \mid AC). $$
This is starting to look familiar to us, except that we see \(w\) instead of \(P\) or \(p\).
Since \(w\) transforms our plausibility levels, it’s helpful to determine what the range of our scale is.
If \(A\) is certain given \(C\):
- \[AB \mid C = B \mid C\]
- \[A \mid BC = A \mid C\]
So, \(w(AB \mid C) = w(B \mid C)\,w(A \mid BC) \;\;\Longrightarrow\;\; w(B \mid C) = w(B \mid C)\,w(A \mid C) \;\;\Longrightarrow\;\; w(A \mid C) = w(A \mid BC) = 1,\) so certainty is represented by \(w = 1\).
If \(A\) is impossible:
- \[AB \mid C = A \mid C\]
- \[A \mid BC = A \mid C\]
Thus, \(w(AB \mid C) = w(B \mid C)\,w(A \mid BC) \;\;\Longrightarrow\;\; w(A \mid C) = w(B \mid C)\,w(A \mid C) \;\;\Longrightarrow\;\; w(A \mid C) = 0 \;\text{ or }\; +\infty.\)
If we adopted the \(+\infty\) convention for impossibility, we could simply work with \(1/w\) instead, so without loss of generality the scale runs from \(0\) (impossibility) to \(1\) (certainty).
Sum Rule
Consider \(A\) and \(\overline{A}\). Intuitively, the plausibility of \(\overline{A}\) should depend on the plausibility of \(A\). So we can guess that there is a function \(S\) such that: \(w(\overline{A} \mid C) \;=\; S\bigl(w(A \mid C)\bigr).\)
From the condition \(w(\overline{\overline{A}} \mid C) = w(A \mid C)\), we get: \(S\bigl(S(x)\bigr) = x, \quad S(0) = 1, \quad S(1) = 0.\)
Also, applying the product rule with \(\overline{B}\) in place of \(B\):
$$ w(A\overline{B} \mid C) = w(A \mid C)\,w(\overline{B} \mid AC). $$
Substituting into (ii), together with \(w(B \mid AC) = S\bigl(w(\overline{B} \mid AC)\bigr)\),
$$ w(AB \mid C) = w(A \mid C)\,w(B \mid AC) = w(A \mid C)\,S\!\Bigl(\frac{w(A\overline{B} \mid C)}{w(A \mid C)}\Bigr). \qquad \text{(iii)} $$
Equivalently, starting from (i) and interchanging the roles of \(A\) and \(B\),
$$ w(AB \mid C) = w(B \mid C)\,S\!\Bigl(\frac{w(\overline{A}B \mid C)}{w(B \mid C)}\Bigr). \qquad \text{(iv)} $$
Now, let \(B = \overline{AD}\) for some \(D\), so that \(A\overline{B} = \overline{B}\), which gives us:
$$ w(A\overline{B} \mid C) = w(\overline{B} \mid C) = S\bigl(w(B \mid C)\bigr), $$
and, since \(\overline{A}B = \overline{A}\),
$$ w(\overline{A}B \mid C) = w(\overline{A} \mid C) = S\bigl(w(A \mid C)\bigr). $$
Define \(x = w(A \mid C)\) and \(y = w(B \mid C)\). Then equating \((iii)\) and \((iv)\) and substituting:
\[x \, S\!\Bigl( \frac{S(y)}{x} \Bigr) \;=\; y \, S\!\Bigl( \frac{S(x)}{y} \Bigr) \quad \text{for all } x,y \text{ in the admissible range}.\]The general solution to this functional equation turns out to be \(S(x) = (1 - x^{m})^{1/m}\) for some constant \(m > 0\).
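As a numerical sanity check of my own (with an arbitrary \(m\), an assumption of this sketch), the following verifies that \(S(x) = (1-x^m)^{1/m}\) satisfies \(S(S(x)) = x\), the boundary conditions, and the functional equation on the domain where both sides are defined (\(x^m + y^m \ge 1\)).

```python
import math
import random

random.seed(0)
m = 2.7  # arbitrary positive constant, chosen just for this check

def S(x):
    return (1.0 - x**m) ** (1.0 / m)

# Boundary conditions and self-reciprocity S(S(x)) = x
assert math.isclose(S(0.0), 1.0) and math.isclose(S(1.0), 0.0)
for _ in range(1000):
    x = random.uniform(0.05, 0.95)
    assert math.isclose(S(S(x)), x, rel_tol=1e-9)

# Functional equation x*S(S(y)/x) == y*S(S(x)/y),
# checked where the arguments stay real (x^m + y^m comfortably >= 1)
checked = 0
while checked < 1000:
    x, y = random.uniform(0.05, 1.0), random.uniform(0.05, 1.0)
    if x**m + y**m < 1.01:
        continue  # skip the boundary of the admissible domain
    lhs = x * S(S(y) / x)
    rhs = y * S(S(x) / y)
    assert math.isclose(lhs, rhs, rel_tol=1e-9)
    checked += 1

print("S(x) = (1 - x^m)^(1/m) passes all sampled checks")
```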
Since \(w(\overline{A} \mid C) \;=\; S\bigl(w(A \mid C)\bigr)\),
$$ w(\overline{A} \mid C) = \bigl(1 - w^{m}(A \mid C)\bigr)^{1/m}, $$
which gives us,
$$ w^{m}(A \mid C) + w^{m}(\overline{A} \mid C) = 1. $$
The ‘p’
Recapping the product and sum rule,
$$ w(AB \mid C) = w(A \mid C)\,w(B \mid AC) = w(B \mid C)\,w(A \mid BC), \qquad w^{m}(A \mid C) + w^{m}(\overline{A} \mid C) = 1. $$
Let \(p(x) = w^{m}(x)\), and you finally get the probability equations we’re familiar with:
$$ P(AB \mid C) = P(A \mid C)\,P(B \mid AC) = P(B \mid C)\,P(A \mid BC), \qquad P(A \mid C) + P(\overline{A} \mid C) = 1. $$
This entails no loss of generality, for the only requirement we have imposed on the function \(w(x)\) is that it is a continuous monotonic increasing function ranging from \(w = 0\) for impossibility to \(w = 1\) for certainty. And if \(w(x)\) satisfies this, then so does \(w^{m}(x),\ 0<m<\infty\).
Connecting Probability Rules to Logical Syllogisms
Jaynes demonstrates that the derived sum and product rules encapsulate classical deductive logic as a special case (where probabilities are 0 or 1) and extend it to handle degrees of plausibility (probabilities between 0 and 1). This is illustrated using strong and weak syllogisms.
Strong Syllogism (Modus Ponens) as a Limit Case
The strong syllogism, or Modus Ponens, is a fundamental rule of deductive logic:
- If A implies B (Major Premise)
- And A is true (Minor Premise)
- Then B must be true (Conclusion)
Let’s see how this arises from the probability rules.
Let our background information \(C\) include the certainty that “ \(A\) implies \(B\) “. This means that if \(A\) is true, \(B\) is guaranteed to be true, given \(C\). Also, assume the minor premise: \(A\) is true, given \(C\). In probability terms, we are interested in the plausibility of \(B\) given both \(A\) and \(C\), i.e., \(P(B \mid AC)\).
Logical Equivalence: Given \(C\) (which includes \(A \implies B\)), the statement “\(A\) and \(B\) are both true” \((AB)\) is logically equivalent to the statement “\(A\) is true” (\(A\)). Why? Because if \(A\) is true, \(B\) must also be true according to the premise embedded in \(C\).
Probability Equality: Since the statements \((AB \mid C)\) and \((A \mid C)\) are logically equivalent under our background information \(C\), their plausibilities must be equal:
\[\begin{aligned} P(AB \mid C) = P(A \mid C) \\ \end{aligned}\]Applying Product Rule:
\[\begin{aligned} P(AB \mid C) = P(A \mid C) P(B \mid AC) \end{aligned}\]Substitute and Solve:
\[P(A \mid C) = P(A \mid C) \cdot P(B \mid AC)\]Assuming \(P(A \mid C) > 0\) (i.e., A is not impossible given C), we can divide both sides by \(P(A \mid C)\) to get:
\[P(B \mid AC) = 1\]Conclusion: This result, \(P(B \mid AC) = 1\), is the probabilistic statement of the strong syllogism. It means: given the background information \(C\) (which includes the major premise \(A \implies B\)) and the minor premise \(A\), the conclusion \(B\) is certain. Probability theory, with \(P = 1\) representing certainty, correctly reproduces deductive logic.
Weak Syllogism (Affirming the Consequent) and Plausibility Updates
The weak syllogism involves affirming the consequent:
- If A implies B (Major Premise)
- And B is true (Evidence)
- What can we say about A? (Conclusion?)
In classical logic, observing B tells us nothing certain about A. However, in plausible reasoning, observing B often increases our belief in A. Let’s see how probability handles this.
Again, let our background information C include “A implies B”. Now, our new information (evidence) is that B is true. We want to find the updated plausibility of A, given this evidence B and the background C, i.e., \(P(A \mid BC)\).
Apply Product Rule (Definition of Conditional Probability): We can express the desired posterior probability \(P(A \mid BC)\) using the product rule in two ways and equating them (this is the core of deriving Bayes’ theorem):
\[P(AB \mid C) = P(A \mid BC) P(B \mid C) \\ P(AB \mid C) = P(B \mid AC) P(A \mid C)\]Equating the right-hand sides and solving for \(P(A \mid BC)\):
\[P(A \mid BC) = P(A \mid C)\,\frac{P(B \mid AC)}{P(B \mid C)}\]
Conclusion (Bayesian Inference Structure): This equation is fundamental for learning from data. It shows how our belief in hypothesis A changes upon observing evidence B:
- \(P(A \mid BC)\): Posterior probability of A, updated after observing B.
- \(P(A \mid C)\): Prior probability of A, based on background information C before observing B.
- \(P(B \mid AC)\): Likelihood of observing evidence B if hypothesis A were true (given C).
- \(P(B \mid C)\): Evidence probability (or marginal likelihood), the overall probability of observing B, given C (averaged over whether A is true or not).
The update factor \(\frac{P(B \mid AC)}{P(B \mid C)}\) (the Likelihood Ratio) determines how the plausibility changes:
If observing B is more likely when A is true than it is in general (i.e., \(P(B \mid AC) > P(B \mid C)\)), then the ratio is > 1, and the posterior \(P(A \mid BC)\) is greater than the prior \(P(A \mid C)\). Observing B makes A more plausible.
If \(A \implies B\) is certain (as in our strong syllogism derivation), then \(P(B \mid AC) = 1\). The update becomes \(P(A \mid BC) = P(A \mid C) / P(B \mid C)\). Since \(P(B \mid C)\) must be \(\leq 1\), observing B cannot decrease the plausibility of A in this case \((P(A \mid BC) \geq P(A \mid C))\).
As an intuition pump, consider the example of a police officer …
- A = “The person is a thief”
- B = “The person is wearing a mask and sneaking near a broken window”
- C = Background knowledge about crime, behavior, etc. (including “Thieves often wear masks and sneak around”)
The equation \(P(A \mid BC) = P(A \mid C) * [P(B \mid AC) / P(B \mid C)]\) tells us: Our updated belief that the person is a thief \((P(A \mid BC))\) equals our initial belief \((P(A \mid C))\) multiplied by a factor. This factor compares how likely the sneaky behavior is if they are a thief \((P(B \mid AC))\) versus how likely that behavior is in general \((P(B \mid C))\). Since \(P(B \mid AC)\) is likely much higher than \(P(B \mid C)\), observing the behavior B significantly increases the plausibility of A.
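As a concrete numerical sketch of this update, with hypothetical prior and likelihood values chosen purely for illustration:

```python
# Hypothetical numbers for the police-officer example (illustration only).
prior_thief = 0.01          # P(A|C): prior that a random passer-by is a thief
p_behavior_if_thief = 0.80  # P(B|AC): masked sneaking, given they are a thief
p_behavior_if_not = 0.001   # P(B|~A,C): such behavior from an innocent passer-by

# Law of total probability: P(B|C) = P(B|AC)P(A|C) + P(B|~AC)P(~A|C)
p_behavior = (p_behavior_if_thief * prior_thief
              + p_behavior_if_not * (1 - prior_thief))

# Bayes' theorem: P(A|BC) = P(A|C) * P(B|AC) / P(B|C)
posterior_thief = prior_thief * p_behavior_if_thief / p_behavior

print(f"P(B|C)  = {p_behavior:.5f}")
print(f"P(A|BC) = {posterior_thief:.3f}")   # ~0.89: a large update from 0.01
```

With these made-up numbers the likelihood ratio is \(0.8 / 0.00899 \approx 89\), so even a 1% prior is pushed close to 0.9.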
These examples show that the sum and product rules, derived from basic desiderata of consistency, provide a quantitative framework for extending logic to handle uncertainty and update beliefs based on evidence, encompassing classical deduction as a limiting case. This forms the foundation for Bayesian statistical inference.
Chapter 3: Elementary Sampling Theory
- All probabilities are conditional on background assumptions. As a corollary, there are no unconditional probabilities.
- What can we say about \(P(A+B \mid C)\)?
$$ \begin{aligned} P(A+B \mid C) &= 1 - P(\overline{A + B} \mid C) \\ &= 1 - P(\overline{A} * \overline{B} \mid C) \\ &= 1 - P(\overline{A} \mid C) P(\overline{B} \mid \overline{A}C) \\ &= 1 - (1 - P(A \mid C)) (1 - P(B \mid \overline{A}C)) \\ &= 1 - (1 - P(A \mid C) - P(B \mid \overline{A}C) + P(A \mid C)*P(B \mid \overline{A}C)) \\ &= P(A \mid C) + P(B \mid \overline{A}C) - P(A \mid C)*P(B \mid \overline{A}C) \\ &= P(A \mid C) + P(B \mid \overline{A}C)(1- P(A \mid C)) \\ &= P(A \mid C) + P(B \mid \overline{A}C)(P(\overline{A} \mid C)) \\ &= P(A \mid C) + P(B*\overline{A} \mid C) \\ &= P(A \mid C) + P(B \mid C)*P(\overline{A} \mid BC) \\ &= P(A \mid C) + P(B \mid C)*(1-P(A \mid BC)) \\ &= P(A \mid C) + P(B \mid C) - P(B \mid C)*P(A \mid BC) \\ &= P(A \mid C) + P(B \mid C) - P(AB \mid C) \end{aligned} $$
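A quick sanity check of this chain on a toy example of my own (not from the book): two fair coin flips, with \(A\) = "first flip heads" and \(B\) = "second flip heads".

```python
from fractions import Fraction
from itertools import product

# Toy sample space standing in for background C: two fair coin flips.
outcomes = list(product("HT", repeat=2))
prob = {o: Fraction(1, 4) for o in outcomes}

A = {o for o in outcomes if o[0] == "H"}   # first flip is heads
B = {o for o in outcomes if o[1] == "H"}   # second flip is heads

def P(event):
    return sum(prob[o] for o in event)

lhs = P(A | B)                   # P(A + B | C): "A or B"
rhs = P(A) + P(B) - P(A & B)     # sum rule with the overlap removed
assert lhs == rhs == Fraction(3, 4)
print(lhs)  # 3/4
```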
Special Case: Mutually Exclusive Events
- If \(A\) and \(B\) are mutually exclusive (\(P(A\cap B)=0\)), then
$$ \begin{aligned} P(A \cup B) &= P(A) + P(B) \\ P(A \cap B) &= 0 \end{aligned} $$ So when two events cannot occur together, their union’s probability is just the sum of the parts.
Additivity over \(n\) Mutually Exclusive Alternatives
For events \(A_1, A_2, \dots, A_n\) with \(P(A_i \cap A_j) = 0\quad (\forall\,i\neq j),\) conditioning on background information \(C\) preserves additivity: \(P\Bigl(\bigcup_{i=1}^n A_i \;\big|\; C\Bigr) = \sum_{i=1}^n P\bigl(A_i\mid C\bigr).\) Mutual exclusivity lets you break a complex union into a simple sum under any conditioning.
Deriving the Principle of Indifference
- Symmetry (Indifference): If \(C\) gives you no reason to favor one \(A_i\) over another, \(P(A_i\mid C) = P(A_j\mid C)\quad\forall\,i,j.\)
- Exhaustiveness: If the \(A_i\) together cover all possibilities under \(C\), \(P\Bigl(\bigcup_{i=1}^n A_i \;\big|\; C\Bigr) = 1,\) then \(\sum_{i=1}^n P(A_i\mid C) = 1 \;\Longrightarrow\; P(A_i\mid C) = \frac{1}{n}.\)
Equal plausibilities plus completeness force a uniform assignment of \(\tfrac1n\) to each alternative.
- These steps form Jaynes’s clean derivation of the uniform prior under total ignorance. By imposing (1) mutual exclusivity, (2) exhaustiveness, and (3) informational symmetry, the only probability assignment consistent with the axioms of probability is the equal‑weight distribution.
The canonical example of balls in an urn
Background \(C\):
An urn contains 10 balls labeled \(1\) through \(10\). Exactly 3 of them (\(4\), \(5\), \(7\)) are red; the other 7 are white. We have no information about their positions and draw one ball at random.
Define the Elementary Events
Let \(A_i = \{\text{“the drawn ball is ball } i\text{”}\},\quad i=1,\dots,10.\)
- Under \(C\), by indifference all \(A_i\) are equally plausible.
- They are mutually exclusive \((A_i\cap A_j=\varnothing)\) for \(i\neq j\)) and exhaustive (\(\bigcup_{i=1}^{10}A_i\) is certain).
Thus \(P(A_i \mid C) \;=\;\frac{1}{10} \quad\bigl(\forall\,i\bigr).\)
Equal ignorance \(\Rightarrow\) uniform prior over the 10 outcomes.
Probability of Drawing a Red Ball
Let \(R = \{\text{“the drawn ball is red”}\} = A_4 \;\cup\; A_5 \;\cup\; A_7.\) Since \(A_4, A_5, A_7\) are mutually exclusive, \(P(R \mid C) = P(A_4\mid C) + P(A_5\mid C) + P(A_7\mid C) = 3 \times \frac{1}{10} = \frac{3}{10}.\)
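A minimal enumeration sketch of this urn (same labels and red set as above), showing how indifference over the ten labels plus additivity gives 3/10:

```python
from fractions import Fraction

balls = range(1, 11)          # labels 1..10
red_labels = {4, 5, 7}        # the three red balls from the example

# Indifference: every label gets the same plausibility 1/10.
p_ball = {i: Fraction(1, 10) for i in balls}

# Additivity over the mutually exclusive A_i making up "red".
p_red = sum(p_ball[i] for i in red_labels)
print(p_red)  # 3/10
```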
Let’s try to generalize this.
Background \(C\):
An urn contains \(N\) balls labeled \(1\) through \(N\). Exactly \(M\) of them are red; the other \(N - M\) are white. We know nothing about their positions and draw one ball at random without looking.
Elementary Events and Indifference
Let \(A_i = \{\text{“the drawn ball is ball } i\text{”}\},\quad i = 1,\dots,N.\)
- Mutually exclusive: \(A_i\cap A_j = \varnothing\) for \(i\neq j\).
- Exhaustive: \(\bigcup_{i=1}^N A_i\) is certain given \(C\).
- Indifference: No label is privileged, so
\(P(A_i \mid C) \;=\;\frac{1}{N} \quad\forall\,i.\)
Uniform assignment is the only way to respect symmetry under total ignorance.
Probability of “Red”
Define the event “red” (the drawn ball is one of the \(M\) reds) as \(R = \bigcup_{i \in \mathcal{R}} A_i,\) where \(\mathcal{R}\) is the set of red‑labeled indices (of size \(M\)). By additivity over mutually exclusive \(A_i\),
\(P(R \mid C) = \sum_{i\in\mathcal{R}} P(A_i\mid C) = M \times \frac{1}{N} = \frac{M}{N}.\)
The ratio \(M/N\) emerges directly from counting symmetrical alternatives.
Sampling Without Replacement
Background \(C\):
An urn contains \(N\) balls, \(M\) of which are red and \(N-M\) white. We draw \(n\) balls without replacement, learning nothing about the unseen positions between draws.
First Draw
- \(R_1\): “red on 1st draw”
- \(W_1\): “white on 1st draw”
\(\begin{aligned}
P(R_1 \mid C) = \frac{M}{N},
\quad
P(W_1 \mid C) = \frac{N - M}{N}.
\end{aligned}\)
Uniform ignorance means each ball is equally likely initially.
Two Reds in a Row
\(P(R_1, R_2 \mid C)
= P(R_1 \mid C)\;P(R_2 \mid R_1, C)
= \frac{M}{N}\;\times\;\frac{M - 1}{N - 1}.\)
Removing one red reduces both the total balls and the red count by one.
Generalization
\[P(R_1,\dots,R_r \mid C) = \prod_{k=0}^{r-1}\frac{M - k}{N - k} = \frac{M!}{(M - r)!} \times \frac{(N - r)!}{N!}\]This product form captures the shrinking pool. In combinatorial terms, it’s the building block of the hypergeometric distribution when you care about exactly \(r\) reds in \(n\) draws.
Sampling without replacement creates negative dependence: each success makes the next slightly less probable. The factorial ratio shows how the count of favorable sequences compares to all possible sequences, directly yielding hypergeometric probabilities and illuminating why combinatorial coefficients arise naturally in discrete Bayesian reasoning.
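A quick check, with arbitrary values of \(N\), \(M\), \(r\) chosen only for the test, that the sequential product and the closed factorial form agree:

```python
from fractions import Fraction
from math import factorial

N, M, r = 20, 7, 4   # arbitrary urn size, red count, and run length (r <= M)

# Sequential product: prod_{k=0}^{r-1} (M - k) / (N - k)
product_form = Fraction(1)
for k in range(r):
    product_form *= Fraction(M - k, N - k)

# Closed form: M!/(M-r)! * (N-r)!/N!
factorial_form = Fraction(factorial(M), factorial(M - r)) * \
                 Fraction(factorial(N - r), factorial(N))

assert product_form == factorial_form
print(product_form)  # 7/969
```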
\(\omega\) Whites in a Row
\(P(W_1,\dots,W_\omega \mid C) = \prod_{k=0}^{\omega-1} \frac{(N-M)-k}{\,N-k\,} = \frac{(N-M)!\,(N-\omega)!}{(N-M-\omega)!\,N!}.\)
Each white draw reduces both the white‑count and total‑count by one.
Whites After \(r\) Reds
\(P(W_{r+1},\dots,W_{r+\omega}\mid R_1,\dots,R_r,C)
= \prod_{k=0}^{\omega-1} \frac{(N-M)-k}{(N-r)-k}
= \frac{(N-M)!\,(N-n)!}{(N-M-\omega)!\,(N-r)!}\)
Conditioning on \(r\) reds shrinks the remaining pool to \(N-r\).
\(r\) Reds Then \(\omega\) Whites
\(P(R_1,\dots,R_r,W_{r+1},\dots,W_{r+\omega}\mid C)
= \frac{M!}{(M-r)!}\;\frac{(N-M)!}{(N-M-\omega)!}\;\frac{(N-n)!}{N!}
= \frac{M!\,(N-M)!\,(N-n)!}{(M-r)!\,(N-M-\omega)!\,N!}\)
Factorials encode successive depletion of reds and whites.
Order Independence & the Hypergeometric Law
- Claim: Any ordering of \(r\) reds and \(\omega\) whites has the same joint probability above.
- Summing over \(\binom{n}{r}\) orderings gives
\(P(\text{exactly }r\text{ reds in }n\text{ draws}\mid C)
= \binom{n}{r}
\frac{M!\,(N-M)!\,(N-n)!}
{(M-r)!\,(N-M-\omega)!\,N!}
= \frac{\binom{M}{r}\,\binom{N-M}{\omega}}{\binom{N}{n}}.\)
This is the hypergeometric distribution—only the counts of reds and whites matter, not the sequence in which they appear.
Why Order‑Independence Is So Remarkable
Two‑Draw Illustration
Compare the probabilities of drawing one red and one white in two draws without replacement:
\(\begin{aligned}
P(R_1, W_2 \mid C)
&= \frac{M}{N}\;\times\;\frac{N - M}{N - 1},\\
P(W_1, R_2 \mid C)
&= \frac{N - M}{N}\;\times\;\frac{M}{N - 1}
\end{aligned}\)
Since multiplication is commutative,
\(\frac{M}{N}\frac{N - M}{N - 1}
\;=\;
\frac{N - M}{N}\frac{M}{N - 1}\)
```
Draw 1
├── R₁ (red, M/N)
│   ├── R₂|R₁  (M-1)/(N-1)
│   └── W₂|R₁  (N-M)/(N-1)
└── W₁ (white, (N-M)/N)
    ├── R₂|W₁  M/(N-1)
    └── W₂|W₁  (N-M-1)/(N-1)
```
Despite the apparent asymmetry—first red then white vs. first white then red—the joint probability is identical.
General Sequence of \(r\) Reds & \(\omega\) Whites
For any specific ordering of \(r\) reds and \(\omega\) whites (in \(n=r+\omega\) draws):
\(P(\underbrace{R,\dots,R}_{r},\underbrace{W,\dots,W}_{\omega}\mid C) = \prod_{k=0}^{r-1}\frac{M - k}{N - k} \;\times\; \prod_{k=0}^{\omega-1}\frac{(N - M) - k}{(N - r) - k},\) which algebraically reduces to the same factorial ratio \(\frac{M!\,(N - M)!\,(N - n)!}{(M - r)!\,(N - M - \omega)!\,N!}.\)
Permuting the order just rearranges factors in the product—nothing more.
From Sequences to Counts: The Hypergeometric
Summing over all \(\binom{n}{r}\) distinct orderings gives \(P(\text{exactly }r\text{ reds in }n\mid C) = \binom{n}{r} \,\frac{M!\,(N - M)!\,(N - n)!}{(M - r)!\,(N - M - \omega)!\,N!} = \frac{\binom{M}{r}\,\binom{N - M}{\omega}}{\binom{N}{n}}.\)
This exchangeability—that all sequences with the same counts are equiprobable—is far from trivial. A naïve intuition might believe the order of draws carries extra information about “where” reds and whites were in the urn. Instead, under true ignorance of positions, order simply shuffles equally likely depletions.
Why it matters:
- Sufficiency of counts: You need only track how many reds (and whites) appear, not the entire draw history.
- Dramatic simplification: Inference collapses from a factorially large space of sequences to a simple combinatorial formula.
- Foundation for Bayesian discrete models: This symmetry justifies using the hypergeometric (and, with replacement, the binomial) as the correct likelihood when only counts matter.
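Here is a brute-force sketch of the exchangeability claim on a small urn (all numbers arbitrary): every ordering of \(r\) reds in \(n\) draws has the same probability, there are \(\binom{n}{r}\) of them, and they sum to the hypergeometric formula.

```python
from fractions import Fraction
from itertools import permutations
from math import comb

N, M, n, r = 8, 3, 4, 2           # small urn: 8 balls, 3 red; draw 4, ask for 2 reds

def sequence_prob(seq):
    """Probability of one specific red/white sequence drawn without replacement."""
    reds_left, total_left, p = M, N, Fraction(1)
    for is_red in seq:
        p *= Fraction(reds_left, total_left) if is_red else \
             Fraction(total_left - reds_left, total_left)
        reds_left -= is_red
        total_left -= 1
    return p

orderings = set(permutations([True] * r + [False] * (n - r)))
probs = {seq: sequence_prob(seq) for seq in orderings}

assert len(set(probs.values())) == 1               # every ordering is equiprobable
assert len(orderings) == comb(n, r)                # C(n, r) distinct orderings
total = sum(probs.values())
hypergeom = Fraction(comb(M, r) * comb(N - M, n - r), comb(N, n))
assert total == hypergeom
print(total)  # 3/7
```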
Mode of the Predictive Distribution
For drawing \(n\) balls without replacement (hypergeometric model), the most probable number of reds \(r\) is the mode of the hypergeometric distribution: \(r^* \;=\;\Bigl\lfloor\frac{(n+1)\,(M+1)}{N+2}\Bigr\rfloor\) (the integer part of \((n+1)(M+1)/(N+2)\)).
As \(N\to\infty\) with \(M/N\) fixed, \(r^*/n\approx M/N\). The sample fraction converges on the population fraction.
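A stdlib-only check (parameters arbitrary) that the floor formula matches a direct argmax of the hypergeometric pmf:

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(k, N, M, n):
    """P(exactly k reds in n draws without replacement from N balls, M of them red)."""
    return Fraction(comb(M, k) * comb(N - M, n - k), comb(N, n))

for N, M, n in [(10, 3, 4), (100, 30, 10), (52, 13, 5), (1000, 237, 40)]:
    support = range(max(0, n - (N - M)), min(n, M) + 1)
    mode_direct = max(support, key=lambda k: hypergeom_pmf(k, N, M, n))
    mode_formula = (n + 1) * (M + 1) // (N + 2)     # floor of (n+1)(M+1)/(N+2)
    assert mode_direct == mode_formula
    print(N, M, n, "mode =", mode_direct)
```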
Marginal Consistency Across Draws
Even though draws happen in sequence, the marginal probability of “red” on the second draw equals that on the first:
\(\begin{aligned}
P(R_2\mid C)
&=P(R_2,R_1\mid C)\;+\;P(R_2,W_1\mid C)\\
&=\frac{M}{N}\,\frac{M-1}{N-1}
\;+\;\frac{N-M}{N}\,\frac{M}{N-1}
=\frac{M}{N}
=P(R_1\mid C)\,
\end{aligned}\)
No causal “push” from draw 1 to draw 2 is needed—symmetry alone guarantees consistency.
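An exact check of this marginal consistency, and of the symmetric conditioning discussed next, for an arbitrary urn:

```python
from fractions import Fraction

N, M = 12, 5   # arbitrary urn: 12 balls, 5 red

def P(first_red, second_red):
    """Joint probability of the colours of the first two draws (no replacement)."""
    p1 = Fraction(M, N) if first_red else Fraction(N - M, N)
    reds_left = M - first_red
    p2 = Fraction(reds_left, N - 1) if second_red else Fraction(N - 1 - reds_left, N - 1)
    return p1 * p2

# Marginal of the second draw equals the marginal of the first draw.
p_R2 = P(True, True) + P(False, True)
assert p_R2 == Fraction(M, N)

# Symmetric conditioning: P(R1 | R2) == P(R2 | R1) == (M-1)/(N-1).
p_R1_given_R2 = P(True, True) / p_R2
p_R2_given_R1 = P(True, True) / Fraction(M, N)
assert p_R1_given_R2 == p_R2_given_R1 == Fraction(M - 1, N - 1)
print(p_R2, p_R1_given_R2)  # 5/12 and 4/11
```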
Symmetric Conditioning
Conditioning on one red in either position yields the same posterior probability for the other: \(P(R_1\mid R_2,C) =\frac{M-1}{N-1} \;=\; P(R_2\mid R_1,C)\,.\)
Probability “propagates” via exchangeability—the order of information doesn’t matter.
Why This Matters
Physical causality isn’t the only mechanism for “updating” beliefs. Under true ignorance of labels, combinatorial symmetry enforces:
- Predictive stability: Marginal probabilities remain constant across draws.
- Order‑irrelevance: Conditional probabilities depend only on counts, not chronology.
Sampling With Replacement ⇒ Binomial Model
Background \(C'\):
As before, an urn has \(N\) balls with \(M\) red. After each draw, you replace the ball in the urn.
Background \(C''\):
Same as \(C'\), but after each draw you replace the ball and (for true randomization) shuffle the urn, so nothing “remembers” past positions.
Identical, Independent Draws
With replacement and randomization (\(C''\)): \(P(R_2 \mid R_1, C'') \;=\; P(R_1 \mid C'') \;=\;\frac{M}{N}\)
Each draw is an independent Bernoulli trial with success probability \(M/N\).
Binomial Probability for \(r\) Reds in \(n\) Draws
\[P(r\text{ reds in }n\mid C'') = \binom{n}{r}\biggl(\frac{M}{N}\biggr)^{\!r} \biggl(1 - \frac{M}{N}\biggr)^{\!n-r}\]Counts of reds follow the binomial distribution, since each trial is identical and independent.
Replacing and reshuffling erases any memory between draws, so information structure (indifference + renewal) drives the result, not physical causation. You end up with the classic binomial law: only the total number of successes matters, not their ordering or history.
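A short sketch (urn numbers arbitrary) confirming the binomial formula by brute-force enumeration of all red/white sequences:

```python
from fractions import Fraction
from itertools import product
from math import comb

N, M, n = 10, 3, 5                 # arbitrary: 10 balls, 3 red, 5 draws with replacement
p = Fraction(M, N)                 # per-draw probability of red

def binomial(r):
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Brute force: enumerate every red/white sequence, group by its red count.
brute = [Fraction(0)] * (n + 1)
for seq in product([True, False], repeat=n):
    seq_prob = Fraction(1)
    for is_red in seq:
        seq_prob *= p if is_red else (1 - p)
    brute[sum(seq)] += seq_prob

assert all(brute[r] == binomial(r) for r in range(n + 1))
assert sum(brute) == 1
print([str(binomial(r)) for r in range(n + 1)])
```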
What if order does matter?
Non-Exchangeable Sampling: “Reinforcement” Model
Setup:
An urn with \(N=2M\) balls (\(M\) red, \(M\) white) yields a baseline probability \(p=\tfrac12\), where \(p\) is the probability of drawing a red ball. Introduce a small reinforcement \(\varepsilon\) so that each draw depends on the previous one:
- \[P(R_k \mid R_{k-1}, C'') = p + \varepsilon\]
- \[P(R_k \mid W_{k-1}, C'') = p - \varepsilon\]
- \[P(W_k \mid W_{k-1}, C'') = p + \varepsilon\]
- \[P(W_k \mid R_{k-1}, C'') = p - \varepsilon\]
Unlike the binomial case, draws are not independent and not exchangeable—there’s a “memory” effect.
Sequence Probability
Label:
- \(c\) = count of reds following a red
- \(c'\) = reds following a white
- \(\omega\) = whites following a white
- \(\omega'\) = whites following a red
Then for any specific sequence, \(\begin{aligned} P(\text{sequence}\mid C'') &= (p+\varepsilon)^{c}\,(p-\varepsilon)^{c'}\,(1-(p-\varepsilon))^{\omega}\,(1-(p+\varepsilon))^{\omega'} \end{aligned}\)
Each factor depends on the immediately preceding draw, so order matters.
Exponential Approximation for Long Runs
Consider \(c\) reds all in a row (so \(c'=0,\omega=\omega'=0\)):
\(P(RRR\cdots R)
= (p+\varepsilon)^{c}
= p^{c}\,\Bigl(1 + \tfrac{\varepsilon}{p}\Bigr)^{c}
\approx p^{c}\,\exp\!\bigl(c\,\tfrac{\varepsilon}{p}\bigr)\)
With \(p=\tfrac12\), this becomes
\(\begin{aligned}
P \approx \bigl(\tfrac12\bigr)^{c}\,\exp\!\bigl(2\,\varepsilon\,c\bigr)
\end{aligned}\)
A tiny reinforcement \(\varepsilon\) produces an exponential tilt of run-length probabilities.
This model illustrates path-dependence: small biases accumulate over time, dramatically altering long-run behavior compared to iid or exchangeable draws (binomial/hypergeometric).
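A small helper for experimenting with this model. It is my own sketch: it treats the very first draw as unconditioned with \(P(\text{red}) = p\), an assumption the text above leaves implicit.

```python
def reinforced_sequence_prob(seq, p=0.5, eps=0.01):
    """Probability of a specific red/white sequence (True = red) when each draw
    repeats the previous colour with probability p + eps and switches with p - eps.
    The first draw is treated as unconditioned with P(red) = p (an assumption)."""
    prob = p if seq[0] else (1 - p)
    for prev, cur in zip(seq, seq[1:]):
        prob *= (p + eps) if cur == prev else (p - eps)
    return prob

# Even short runs show the tilt: a streak beats an alternating pattern.
streak      = [True] * 10                       # RRRRRRRRRR
alternating = [bool(i % 2) for i in range(10)]  # WRWRWRWRWR
print(reinforced_sequence_prob(streak))         # 0.5 * 0.51**9  ≈ 0.00117
print(reinforced_sequence_prob(alternating))    # 0.5 * 0.49**9  ≈ 0.00081
print(0.5 ** 10)                                # binomial baseline ≈ 0.00098
```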
Exponential Tilt for a 50-Red then 50-White Run
For a specific sequence of \(c\) reds in a row followed by \(\omega\) whites:
\(P(\text{sequence}\mid C'') \;\approx\; \exp\!\Bigl(2\varepsilon\,(c - c') + 2\varepsilon\,(\omega - \omega')\Bigr)\; p^{\,c+c'}\,(1-p)^{\,\omega+\omega'}\)
Here:
- \(c\) = number of reds preceded by a red (so for an opening run of 50 reds, \(c=49\), \(c'=0\)).
- \(\omega\) = number of whites preceded by a white (for the 50 whites that follow, \(\omega=49\); the first white follows a red, so \(\omega'=1\)).
- \(p=\tfrac12\), \(\varepsilon\) small. (The unconditioned first draw contributes one more factor of \(p\), which is why the base below works out to \(\bigl(\tfrac12\bigr)^{100}\).)
Approximation for \(c=\omega=49\), \(\varepsilon=0.01\)
- Exponent term: \(2\varepsilon\,(c-c') + 2\varepsilon\,(\omega-\omega') = 2\cdot0.01\cdot49 + 2\cdot0.01\cdot48 = 1.94 \;\approx\; 2 \;\Longrightarrow\; e^{1.94}\approx e^{2}\approx 7.4.\)
- Binomial base: including the first draw, \(p\cdot p^{\,c+c'}\,(1-p)^{\,\omega+\omega'} = \bigl(\tfrac12\bigr)^{100}.\)
- Combined: \(P\bigl(R_1\ldots R_{50}\,W_{51}\ldots W_{100}\mid C''\bigr) \;\approx\; e^{2} \;\times\;\bigl(\tfrac12\bigr)^{100}.\)
A tiny reinforcement (\(\varepsilon=0.01\)) multiplies the probability of the contiguous run by roughly a factor of 7 compared to the pure binomial probability \(\bigl(\tfrac12\bigr)^{100}\). This dramatic exponential tilt shows how even weak path-dependence can overwhelmingly favor long streaks.
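Using the helper sketched above (with the same assumption about the first draw), the exact tilt for this particular sequence comes out close to the \(e^{2} \approx 7.4\) estimate:

```python
# Reuses reinforced_sequence_prob from the earlier sketch.
run = [True] * 50 + [False] * 50          # 50 reds, then 50 whites
exact = reinforced_sequence_prob(run)     # 0.5 * 0.51**49 * 0.49 * 0.51**49
baseline = 0.5 ** 100                     # any specific sequence under the binomial
print(exact / baseline)                   # ~6.8, in line with e^2 ≈ 7.4
```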