A Primer
My work has become much more technical than I am used to, so I thought it would be good to take some notes on basic measure and probability theory in anticipation of working through several theoretical papers. A lot of the definitions below come from Wikipedia and Durrett.
Note: Not all of the proofs are finished/included. I am hoping to find the time to return to this post and complete them.
Measure theory, in my mind, is just about sets, mappings, and ways to describe them. To formalize these ideas, however, we need to define some basic building blocks.
We’ll begin with a $\sigma$-field. This is a collection of subsets of some other set that satisfies a few closure properties.
An example may make it a bit more concrete in one’s mind…
Notice that the first and second properties in the above definition imply that $\emptyset \in \mathcal{S}$ as well. The properties also imply that a $\sigma$-field must be closed under countable intersection. That is, $\cap_{i = 1}^\infty A_i \in \mathcal{S}$ for any sequence $A_1, A_2, \dots \in \mathcal{S}$.
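To see why, note that each $A_i^c \in \mathcal{S}$ by closure under complementation, so by De Morgan’s law and closure under countable union:

\[\bigcap_{i = 1}^\infty A_i = \left( \bigcup_{i = 1}^\infty A_i^c \right)^c \in \mathcal{S}\]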
A $\sigma$-field is a generalization of the concept of an algebra (also called a field).
Now we can define measurable spaces!
We’ll come back to this definition later when we discuss measures, but a measurable space is just a space that could be assigned a measure.
Let’s finish up this sub-section by introducing topological spaces and Borel sets.
This definition is a bit tricky to develop intuition for. The Borel $\sigma$-field is the smallest $\sigma$-field containing all of the open sets of a given space, $X$; it contains not just the open sets themselves, but everything that can be built from them via complements and countable unions.
An important Borel $\sigma$-field that will come up again when we discuss measures and probability is the Borel $\sigma$-field on the real line. Several examples follow from our definition:
Borel sets on $\mathbb{R}$ can also be extended to $[-\infty, \infty]$.
Alongside the Borel set and the $\sigma$-field, we have the semialgebra.
This concept will not be as useful in later discussions, but we include it for completeness. An example of a semialgebra is the union of $\{ \emptyset \}$ and the collection of sets that can be written as:
\[(a_1, b_1] \times \dots \times (a_d, b_d] \subset \mathbb{R}^d \hspace{5mm} \text{for } -\infty \leq a_i < b_i \leq \infty\]Given a semialgebra, $\mathcal{S}$, the collection of finite disjoint unions of sets in $\mathcal{S}$ forms an algebra called the algebra generated by $\mathcal{S}$.
We now need to define a concept that is at the crux of our discussions of mappings: the inverse image.
In words, the inverse image of a subset $A$ of $Y$ under function $f$ is the subset of elements in the domain $X$ that map to elements in $A$. It’s important to note that $f$ need not map onto the whole of $Y$, so the inverse image of a nonempty subset of $Y$ can be empty!
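For a concrete example, take $f: \mathbb{R} \rightarrow \mathbb{R}$ with $f(x) = x^2$. Then:

\[f^{-1}([0, 4]) = [-2, 2], \hspace{5mm} f^{-1}([-2, -1]) = \emptyset\]

The second pre-image is empty because no real number squares to a negative value.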
We now introduce a definition that describes what it means for functions of a certain type to be “nice” with respect to a $\sigma$-field.
The basic idea behind an $\mathcal{S}$-measurable function is that the pre-image of any Borel set in the codomain must be an element of $\mathcal{S}$, the $\sigma$-field on its domain (since $\text{dom}(f) = X$). It is important to remember that measurability is always with respect to the $\sigma$-fields of the two measurable spaces of interest.
To put it intuitively, a measurable function $f$ needs to take on values that “make sense” with respect to the $\sigma$-field of interest. For example, only constant functions are measurable from the trivial $\sigma$-field $\{ \emptyset, \Omega \}$ (for some $\Omega$) to a $\sigma$-field that contains the singletons, such as the Borel $\sigma$-field on $\mathbb{R}$. In addition, we have the following claim:
Constant functions are measurable with respect to any $\sigma$-field.
Suppose we have measurable spaces $(X, \mathcal{S})$ and $(Y, \mathcal{S}’)$. Let $\mathcal{S} = \{ \emptyset, X \}$ be the trivial $\sigma$-field, and suppose we have a non-constant function $f: X \rightarrow Y$. That is, there exist $a, b \in X$ such that $f(a) \neq f(b)$.
Consider the pre-image of one of these points (assuming, as in the Borel case, that the singleton $\{ f(a) \}$ belongs to $\mathcal{S}’$). The set $f^{-1}(\{ f(a) \})$ contains $a$ but not $b$, so it is neither $\emptyset$ nor the whole of $X$ and hence is not in $\mathcal{S}$. Thus $f$ is not measurable.
To prove the second claim, let $\mathcal{S}$ and $\mathcal{S}’$ be arbitrary $\sigma$-fields in the previous set-up. Since $f$ is constant, it must be the case that $f(x) = a$ for all $x \in X$ and some $a \in Y$. Pick any $s \in \mathcal{S}’$. If $a \in s$, then $f^{-1}(s) = X$, since any input value maps to $a$ ($f$ is constant). If $a \notin s$, then $f^{-1}(s) = \emptyset$ by the same argument.
Thus, for any $s \in \mathcal{S}’$, $f^{-1}(s) \in \{ \emptyset, X \} \subseteq \mathcal{S}$, implying that $f$ is $(\mathcal{S}, \mathcal{S}’)$-measurable for any choice of $\mathcal{S}$ and $\mathcal{S}’$.
To check whether an (extended) real-valued function is $\mathcal{S}$-measurable, it is sufficient to check whether \(f^{-1}((a, \infty]) = \{ x \in X \rvert f(x) > a \} \in \mathcal{S}\) for all $a \in \mathbb{R}$.
Furthermore, in the special case that $X \subseteq \mathbb{R}$ and $\mathcal{S}$ is the set of Borel subsets of $\mathbb{R}$ that are contained in $X$, then a function $f: X \rightarrow \mathbb{R}$ is called Borel measurable if $f^{-1}(B)$ is a Borel set for all Borel sets $B \subseteq \mathbb{R}$. It can be shown that any continuous or increasing function $f: X \rightarrow \mathbb{R}$ where $X$ is a Borel subset of $\mathbb{R}$ is Borel measurable.
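As a quick illustration of this criterion, consider the indicator function of a set $A \in \mathcal{S}$, i.e. $f(x) = 1$ if $x \in A$ and $f(x) = 0$ otherwise. Then:

\[\{ x \in X \rvert f(x) > a \} = \begin{cases} X & a < 0 \\ A & 0 \leq a < 1 \\ \emptyset & a \geq 1 \end{cases}\]

All three sets lie in $\mathcal{S}$, so $f$ is $\mathcal{S}$-measurable.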
We have finally come to the star of our discussion: the measure. A measure is a function that assigns a “size” to sets (it is similar to the idea of length for intervals or area for two dimensional regions).
With this definition, we define a measure space, which is the tuple $(X, \mathcal{S}, \mu)$. For measure space $(X, \mathcal{S}, \mu)$ and $A, B \in \mathcal{S}$ such that $A \subseteq B$, we have that $\mu(A) \leq \mu(B)$ and $\mu(B \setminus A) = \mu(B) - \mu(A)$ (assuming that $\mu(A)$ is finite). We also have the additional property of countable subadditivity, which is basically a generalization of Boole’s inequality:
\[\mu\left(\bigcup_{i = 1}^\infty A_i \right) \leq \sum_{i = 1}^\infty \mu(A_i)\]for any sequence of sets $A_1, A_2, \dots \in \mathcal{S}$. Measures also satisfy $\mu(A \cup B) = \mu(A) + \mu(B) - \mu(A \cap B)$ (assuming that $\mu(A \cap B)$ is finite).
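These properties are easy to verify numerically for a simple measure. Below is a minimal sketch (in Python) using the counting measure on a small finite set; the particular sets are arbitrary choices for illustration.

```python
# A minimal sketch of the measure properties above, using the counting
# measure mu(A) = |A| on a small finite set. The sets below are arbitrary
# choices for illustration.
X = {1, 2, 3, 4, 5, 6}

def mu(A):
    """Counting measure: the 'size' of a set is its number of elements."""
    return len(A)

A = {1, 2, 3}
B = {1, 2, 3, 4}
C = {3, 4, 5}

# Monotonicity: A a subset of B implies mu(A) <= mu(B)
assert A <= B and mu(A) <= mu(B)

# mu(B \ A) = mu(B) - mu(A) whenever mu(A) is finite
assert mu(B - A) == mu(B) - mu(A)

# Complement rule on the whole space: mu(A^c) = mu(X) - mu(A)
assert mu(X - A) == mu(X) - mu(A)

# Inclusion-exclusion: mu(A u C) = mu(A) + mu(C) - mu(A n C)
assert mu(A | C) == mu(A) + mu(C) - mu(A & C)

# (Finite) subadditivity
assert mu(A | B | C) <= mu(A) + mu(B) + mu(C)
print("all measure properties verified on this example")
```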
If we have two $\sigma$-finite (see below) measure spaces, $(X, \mathcal{S}, \mu_1)$ and $(Y, \mathcal{T}, \mu_2)$, we can define two additional sets:
\[\begin{aligned} \Omega &= X \times Y = \{ (x, y): x \in X, y \in Y\} \\ \mathcal{U} &= \{ S \times T: S \in \mathcal{S}, T \in \mathcal{T}\} \end{aligned}\]Sets $U \in \mathcal{U}$ are called rectangles. Let $\mathcal{F} = \mathcal{S} \times \mathcal{T}$ be the $\sigma$-field generated by $\mathcal{U}$. The unique measure $\mu = \mu_1 \times \mu_2$ on $\mathcal{F}$ defined as $\mu(S \times T) = \mu_1(S) \mu_2(T)$ is called a product measure. This result can be extended to finitely many $\sigma$-finite measure spaces.
Measures can be characterized in a variety of ways. First, consider the $\sigma$-finite measure.
We can also define a sense of continuity to measures.
Measures can also be “coarsened” by restricting the $\sigma$-field on which they operate.
A restricted measure is basically the original measure, but with its domain shrunk to whatever sub-$\sigma$-field it is restricted to. Measures also satisfy several properties.
Let $\mu$ be a measure on $(\Omega, \mathcal{F})$, and let $A_i \uparrow A$ denote $A_1 \subset A_2 \subset \dots$ with $\cup_i A_i = A$. The measure $\mu$ satisfies the following:
Proof to be completed.
A sense of “convergence” with respect to a measure can be defined for measurable functions.
Before we can move on to some of the core concepts in probability theory, we need one more definition.
With our building blocks in place, we can move on to probability theory. We’ll start with a fundamental definition: the probability space, which is just a special measure space!
Using the above, we can define random variables and vectors in a rigorous way. Note that the following can be generalized to the extended real line (i.e. $\mathbb{R} \cup \{ -\infty, \infty \}$).
Random variables map each element in the sample space to an element in $H$, which is the set of all possible values the variable can take on. Naturally, we need the pre-image of all elements in $\mathcal{H}$ to be in $\mathcal{F}$. When we refer to a random variable being measurable with respect to some $\mathcal{F}’$ (a sub-$\sigma$-field of $\mathcal{F}$), we mean that it is $(\mathcal{F}’, \mathcal{B}(\mathbb{R}))$-measurable.
Let $X_1, X_2, \dots$ be random variables, and let $f: (\mathbb{R}^n, \mathcal{B}^n) \rightarrow (\mathbb{R}, \mathcal{B})$ be a measurable function. Then the following are also random variables:
Proof to be completed.
The distribution of a random variable can also be defined from a measure theoretic perspective.
Durrett provides the best intuition for the distribution of a random variable: “In words, we pull $A \in \mathbb{B}$ back to $X^{-1}(A) \in \mathcal{F}$ and then take $P$ of that set”[^fn-durrett].
It’s important to remember that two different random variables can induce the same distribution. In this case, we say that the random variables (denote them by $X$ and $Y$) are equal in distribution, which we denote with $X \overset{d}{=} Y$.
A distribution function with the form $F(x) = \int_{-\infty}^x f(y) dy$ can also be described by its density function, $f$, satisfying:
\[\mathbb{P}(X = x) = \underset{e \rightarrow 0}{\lim} \int_{x - e}^{x + e} f(y) dy = 0\]In this case, we say that $F$ is absolutely continuous. Integrating the density function over the entire sample space/support will equal $1$, and the density function will always be non-negative.
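For example, the exponential distribution (with rate $1$) is absolutely continuous:

\[F(x) = \int_{-\infty}^x f(y) dy = 1 - e^{-x} \hspace{5mm} \text{for } x \geq 0, \hspace{5mm} \text{where } f(y) = e^{-y} \text{ for } y \geq 0 \text{ and } f(y) = 0 \text{ otherwise}\]

Here $f$ is non-negative and integrates to $1$ over the support $[0, \infty)$.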
Similarly, we can define a discrete distribution function (i.e. an induced probability measure) as one in which there exists a countable set $S$ such that $P(S^c) = 0$.
Now, since random variables are just measurable functions, we can use their mappings to define special $\sigma$-fields.
Put intuitively, the $\sigma$-field generated by a random variable $X$ is the collection of pre-images of Borel sets under $X$. It contains exactly those events whose occurrence can be determined by observing the value of $X$ (i.e. the events that are measurable with respect to $X$).
We can also define $\sigma$-fields generated by arbitrary collections of subsets. The $\sigma$-field generated by such a collection is the smallest $\sigma$-field containing it.
We can also think of having many random variables, each associated with some step in a sequence (perhaps time or space). We call this a stochastic process.
Stochastic processes can be characterized by their continuity (or lack thereof).
Before we can look at random variables any further, we need to discuss a very important concept in mathematics. In the following, we will restrict our discussion to $\mathbb{R}$, but the definitions can easily be generalized to higher dimensions by exchanging lengths for volumes via Cartesian products.
First, we define a special indicator function that got a fancy name (not sure why).
Though not very useful for our discussion, we’ll define the outer measure of a set $A \subseteq \mathbb{R}$. The outer measure formalizes the size of a set by using the lengths of open intervals.
In words, the outer measure of a set is the smallest total length of some sequence of open intervals of $\mathbb{R}$ that, together, contain $A$. Finite sets have outer measure $0$ because we can make our open intervals arbitrarily “short” (i.e. force them to have length approaching $0$). By similar reasoning, any countable subset of $\mathbb{R}$ also has outer measure $0$.
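To make the countable case concrete, write $A = \{ a_1, a_2, \dots \}$ and fix $\varepsilon > 0$. Covering the $n$-th point with an open interval of length $\varepsilon / 2^n$ gives a cover of $A$ with total length

\[\sum_{n = 1}^\infty \frac{\varepsilon}{2^n} = \varepsilon\]

Since $\varepsilon > 0$ was arbitrary, the outer measure of $A$ must be $0$.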
It’s important to remember that the outer measure is not a true measure in the sense that we defined. However, the outer measure allows us to define a special (and true) measure called the Lebesgue measure.
In words, the outer measure becomes a true measure if we restrict ourselves to only Borel sets. The Lebesgue measure leads to a refinement of the idea of a measurable set. A set $A \subseteq \mathbb{R}$ is called Lebesgue measurable if it is really “close” to being a Borel set. Put formally, $A$ is Lebesgue measurable if there exists a Borel set $B \subseteq A$ such that $\rvert A \setminus B \rvert = 0$. There are also many equivalent definitions (see pg. 52 of Axler (2025)).
Note that sometimes the definition of the Lebesgue measure is altered (the change is limited to the function’s domain: Borel vs. Lebesgue measurable sets) to mean the measure on $(\mathbb{R}, \mathcal{L})$ where $\mathcal{L}$ is the $\sigma$-field of Lebesgue measurable subsets of $\mathbb{R}$.
A function $f: A \rightarrow \mathbb{R}$ for $A \subseteq \mathbb{R}$ is Lebesgue measurable if $f^{-1}(B)$ is a Lebesgue measurable set for every Borel set $B \subseteq \mathbb{R}$.
A lot of things in probability depend upon integration. For example, expectation, variance, cumulative probability, and many more things can all be stated as some type of integral. Thus, it’s important we have a solid understanding of the integral.
We start with the integral of the characteristic function:
\[\int \chi_E d\mu = \mu(E) \hspace{5mm} \forall E \in \mathcal{S}\]Recall that a simple function is any function that takes on only finitely many values; any piecewise constant function with finitely many pieces is simple. We can extend the integral of the characteristic function to simple functions by taking linear combinations.
Let $(X, \mathcal{S}, \mu)$ be a measure space, let $A_1, \dots, A_n$ be disjoint sets in $\mathcal{S}$, and let $c_1, \dots, c_n \in [0, \infty]$. Then:
\[\int \left(\sum_{i = 1}^n c_i \chi_{A_i} \right) d\mu = \sum_{i = 1}^n c_i \mu(A_i)\]With these definitions in mind, we can define the integral of any non-negative function.
We begin with a definition.
Notice that if $f(x) \geq 0$, then $f^+(x) \geq 0$ and $f^-(x) = 0$. Alternatively, if $f(x) < 0$, then $f^+(x) = 0$ and $f^-(x) = -f(x) > 0$. Thus, $f^+$ and $f^-$ are both non-negative functions, and $f = f^+ - f^-$. This allows us to extend the definition of the integral to real-valued functions.
If we have $(\Omega, \mathcal{F}, \mu) = (\mathbb{R}^d, \mathcal{B}^d, \lambda)$, then we denote $\int f d\lambda$ with $\int f(x) dx$, and if $(\Omega, \mathcal{F}, \mu) = (\mathbb{R}, \mathcal{B}, \lambda)$ and we have some interval $E = [a, b]$, we write $\int_a^b f(x) dx$ instead of $\int_E f d\lambda$.
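To see these definitions in action, here is a rough numerical sketch (in Python) of approximating a Lebesgue integral by integrals of simple functions. Lebesgue measure on $[0, 1]$ is approximated by a fine grid of points each carrying mass $1/N$; the function $f(x) = x^2$ and the grid size are arbitrary choices for illustration.

```python
import numpy as np

# A rough numerical sketch of building up the Lebesgue integral from simple
# functions. Lebesgue measure on [0, 1] is approximated here by a fine grid
# of N points, each carrying mass 1/N.
N = 1_000_000
x = (np.arange(N) + 0.5) / N       # grid points in (0, 1)
f = x ** 2                         # a non-negative measurable function

def simple_integral(f_vals, n):
    """Integral of the simple function that rounds f down to the nearest
    multiple of 2**-n: a finite sum of level * (approximate) measure of
    the corresponding level set."""
    levels = np.floor(f_vals * 2 ** n) / 2 ** n
    return levels.mean()           # each grid point carries measure 1/N

for n in (1, 2, 4, 8, 12):
    print(f"n = {n:2d}: simple-function approximation = {simple_integral(f, n):.6f}")
print("exact Lebesgue integral of x^2 over [0, 1]:", 1 / 3)
```

As $n$ grows, the simple functions increase pointwise toward $f$, and their integrals approach the true value $1/3$.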
Integration can be restricted to a subset of the domain of a function. That is, for $E \in \mathcal{S}$:
\[\int_E f d\mu = \int f \chi_E d \mu\]It can also be restricted to an interval of the extended real line. First, we call a bounded function $f: [a, b] \rightarrow \mathbb{R}$ Riemann integrable if the set of points in $[a, b]$ at which $f$ is not continuous has length $0$. If we have Lebesgue measure on $\mathbb{R}$, $\lambda$, and $f: (a, b) \rightarrow \mathbb{R}$ is a Lebesgue measurable function, then for $-\infty \leq a < b \leq \infty$ we let $\int_a^b f(x) dx = \int_{(a,b)} f d\lambda$.
Two different measures can be related via the Radon-Nikodym Theorem, which states that (under certain conditions), there exists a function such that one measure is equivalent to the integral of the function with respect to a second measure.
Let $(X, \mathcal{S})$ be a measurable space, and let $\mu$ and $\nu$ denote two $\sigma$-finite measures on this space such that $\nu \ll \mu$ ($\nu$ is absolutely continuous with respect to $\mu$). Then there exists an $\mathcal{S}$-measurable function, $f: X \rightarrow [0, \infty)$, such that, for any $A \in \mathcal{S}$:
\[\nu(A) = \int_A f d\mu\]Proof to be completed.
A fun fact is that $f$ is unique up to a set of $\mu$-measure $0$. That is, for any other $g$ that satisfies the definition, $f(x) = g(x)$ for all $x \in X$ except possibly for $x$ in some $X’ \subset X$ with $\mu(X’) = 0$. Such a function, $f$, is called the Radon-Nikodym derivative and can be denoted by $\frac{d \nu}{d \mu}$.
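A familiar example: if $\mu = \lambda$ is Lebesgue measure on $\mathbb{R}$ and $\nu$ is the distribution of a standard normal random variable, then $\nu \ll \lambda$ and

\[\frac{d\nu}{d\lambda}(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2 / 2}\]

so the Radon-Nikodym derivative is exactly the density function we usually write down.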
Here we list and prove several properties of integrals that are ubiquitous in theoretical statistics.
Let $\phi$ be a convex function (i.e. $\lambda \phi(x) + (1- \lambda)\phi(y) \geq \phi(\lambda x + (1-\lambda)y)$ for all $\lambda \in (0, 1)$, $x,y \in \mathbb{R}$). Let $\mu$ be a probability measure, and let $f$ and $\phi(f)$ be integrable. Jensen’s inequality states:
\[\phi\left(\int f d\mu \right) \leq \int \phi(f)d\mu\]Proof to be completed.
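For example, taking $\phi(x) = x^2$ (which is convex) gives

\[\left( \int f d\mu \right)^2 \leq \int f^2 d\mu\]

which, once we define expectations below, is just the statement that $\mathbb{E}[X]^2 \leq \mathbb{E}[X^2]$, i.e. that variances are non-negative.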
Let $\mu$ be a probability measure, and let $p, q \in (1, \infty)$ such that $\frac{1}{p} + \frac{1}{q} = 1$. Hölder’s inequality states:
\[\int \rvert fg \rvert d\mu \leq \rvert \rvert f \rvert \rvert_p \rvert \rvert g \rvert \rvert_q\]where $\rvert \rvert f \rvert \rvert_p = (\int \rvert f \rvert^p d\mu)^{\frac{1}{p}}$ for $1 \leq p < \infty$.
Proof to be completed.
If $p = q = 2$, the above is called the Cauchy-Schwarz inequality.
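Since probability measures on finite sets are easy to work with, here is a small numerical sanity check (in Python) of Hölder’s inequality; the weights, functions, and exponents below are arbitrary choices for illustration.

```python
import numpy as np

# A small numerical sanity check of Hölder's inequality on a finite
# probability space. The weights, functions, and exponents are arbitrary.
rng = np.random.default_rng(0)

w = rng.random(10)
w /= w.sum()                         # a probability measure on 10 points
f = rng.normal(size=10)
g = rng.normal(size=10)

def lp_norm(h, p):
    """The L^p norm of h with respect to the probability measure w."""
    return float((w @ np.abs(h) ** p) ** (1 / p))

p, q = 3.0, 1.5                      # conjugate exponents: 1/p + 1/q = 1
lhs = float(w @ np.abs(f * g))       # the integral of |fg|
print(lhs <= lp_norm(f, p) * lp_norm(g, q))   # Hölder's inequality
print(lhs <= lp_norm(f, 2) * lp_norm(g, 2))   # Cauchy-Schwarz (p = q = 2)
```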
Let $E$ be a set of finite measure (i.e. $\mu(E) < \infty$), and let $\{ f_n \}$ be a sequence of measurable functions that vanish on $E^c$, are uniformly pointwise bounded (i.e. $\rvert f_n(x) \rvert \leq M$ for some $M < \infty$ and all $n$ and $x$), and satisfy $f_n \rightarrow f$ in measure. Then:
\[\int f d\mu = \underset{n \rightarrow \infty}{\lim} \int f_n d\mu\]Proof to be completed.
Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, and let $X \in \mathcal{F}$ be a measurable set. Let \(\{ f_k \}_{k = 0}^\infty\) be a pointwise non-decreasing sequence of \((\mathcal{F}, \mathbb{B}(\bar{\mathbb{R}}_{\geq 0}))\)-measurable, non-negative functions (i.e. \(0 \leq \dots \leq f_k(x) \leq f_{k+1}(x) \leq \dots \leq \infty\) for every $k \geq 0$ and $x \in X$). Then the pointwise supremum, defined as the function:
\[\underset{k}{\sup} f_k: x \rightarrow \underset{k}{\sup} f_k(x)\]is $(\mathcal{F}, \mathbb{B}(\bar{\mathbb{R}}_{\geq 0}))$-measurable and satisfies:
\[\underset{k}{\sup} \int_X f_k d\mu = \int_X \underset{k}{\sup} f_k d\mu\]Proof to be completed.
Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, and let \(\{ f_n \}_{n \in T}\) be a sequence of measurable functions (with index set $T$) on this space such that \(\underset{n \rightarrow \infty}{\lim} f_n(x) = f(x)\) for some function $f$ for all $x \in \Omega$ (i.e. \(\{ f_n \}_{n \in T}\) converges pointwise to $f$). Suppose that our sequence is dominated by some other integrable function, $g$; that is:
\[\rvert f_n(x) \rvert \leq g(x) \hspace{5mm} \forall x \in \Omega, \hspace{2mm} \forall n \in T\]The Dominated Convergence Theorem states that $f_n$ and $f$ are both (Lebesgue) integrable and:
\[\underset{n \rightarrow \infty}{\lim} \int_\Omega f_n d\mu = \int_\Omega \underset{n \rightarrow \infty}{\lim} f_n d\mu = \int_\Omega f d\mu\]Proof to be completed.
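The domination hypothesis matters. A standard example (with Lebesgue measure on $[0, 1]$) is $f_n = n \chi_{(0, 1/n)}$:

\[\underset{n \rightarrow \infty}{\lim} f_n(x) = 0 \hspace{2mm} \text{for every } x \in [0, 1], \hspace{5mm} \text{but} \hspace{5mm} \underset{n \rightarrow \infty}{\lim} \int_0^1 f_n(x) dx = 1 \neq 0\]

No integrable $g$ dominates every $f_n$ (near $0$ we would need $g(x) \geq 1/x - 1$), so the theorem does not apply.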
Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, and let \(\{ f_n: \Omega \rightarrow [0, \infty]\}\) be a sequence of non-negative measurable functions. Then:
\[\int_\Omega \underset{n \rightarrow \infty}{\lim} \underset{m \geq n}{\inf} f_m \, d\mu \leq \underset{n \rightarrow \infty}{\lim} \underset{m \geq n}{\inf} \int_\Omega f_m \, d\mu\]Proof to be completed.
Let $(X, \mathcal{S}, \mu_1)$ and $(Y, \mathcal{T}, \mu_2)$ be $\sigma$-finite measure spaces, and let $\mu = \mu_1 \times \mu_2$ (the product measure). If we have a measurable function $f$ such that $f \geq 0$ or $\int \rvert f \rvert d \mu < \infty$, then:
\[\int_X \int_Y f(x,y) \mu_2(dy) \mu_1(dx) = \int_{X \times Y} f d \mu = \int_Y \int_X f(x,y) \mu_1(dx) \mu_2(dy)\]Proof to be completed.
Fubini’s Theorem tells us when it is okay to exchange the order of a double integral and to compute a double integral as an iterated integral.
For a random variable $X$ on probability space $(\Omega, \mathcal{F}, P)$, how can we describe its central tendency (i.e. what values $X$ usually takes on)? We answer this question with the following definitions, which use the ideas from integration discussed above.
The expected value or expectation of a random variable is basically just integration with respect to the probability measure of the space the variable is defined on. It can be any real number or even $\infty$. Since it is just an integral, we can extend all of the results in the previous section to the expectation. The results are the same, just rewritten with $\mathbb{E}[X]$ instead of $\int_\Omega X dP$.
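For a discrete random variable, this integral reduces to the familiar weighted sum. For a fair six-sided die, for example:

\[\mathbb{E}[X] = \int_\Omega X dP = \sum_{k = 1}^6 k \cdot P(X = k) = \sum_{k = 1}^6 \frac{k}{6} = 3.5\]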
We can also define the conditional expectation of a random variable with respect to a particular sub-$\sigma$-field.
These definitions are a bit confusing, so let’s parse them by coming at the topic from a different angle (see this post).
Let’s say we have a random variable $X$ on some probability space $(\Omega, \mathcal{F}, P)$. We don’t know anything about it, so our best guess at its value would be some sort of weighted average over all of the possible values it could take on. These weights are determined by the probability measure, $P$, since a good guess should be closer to the more likely outcomes.
Now, suppose we know some information about $X$’s outcome (i.e. we can answer some set of questions about $X$). We could formulate this as a collection of subsets of $\Omega$. For example, if we were rolling a die, the answer to the question “Is $X$ odd?” tells us whether the outcome lies in \(\{1, 3, 5\}\) or in \(\{2, 4, 6\}\). We could imagine outputting a different best guess depending upon what information we are given, which is basically what the conditional expectation does.
In one way of thinking, $\mathbb{E}[X \rvert \mathcal{H}]$ is a random variable mapping each outcome in $\Omega$ to a best guess for $X$. The condition $\int_H \mathbb{E}[X \rvert \mathcal{H}] dP = \int_H X dP$ for all $H \in \mathcal{H}$ can be thought of as enforcing the idea that, if we are only guessing values that are consistent with $H$, then our best guess using only the information in $\mathcal{H}$ should average out to the same value as $X$ itself over $H$. More concretely, if \(H = \{ 1, 3, 5\}\) in our dice rolling example, having $\int_H \mathbb{E}[X \rvert \mathcal{H}] dP = \int_H X dP$ implies that, given $H$, our guess matches the average of $X$ over $H$ exactly.
We can relate this measure theoretic definition to the more common ones learned in statistics courses. First, partition the sample space, $\Omega$, into disjoint sets $\Omega_1, \Omega_2, \dots$ such that $\mu(\Omega_i) > 0$ for all $i$. Let $\mathcal{F} = \sigma(\Omega_1, \Omega_2, \dots)$ be the $\sigma$-field generated by this collection of sets. For a random variable $X$ defined on the original space $(\Omega, \mathcal{F}_0, \mu)$ (with $\mathcal{F}$ a sub-$\sigma$-field of $\mathcal{F}_0$), we have:
\[\mathbb{E}_\mu[X \rvert \mathcal{F}] = \frac{\int_{\Omega_i} X \, d\mu}{\mu(\Omega_i)} \hspace{5mm} \text{on } \Omega_i\]When we are given the information that the outcome lies in $\Omega_i$, our best guess at $X$ becomes the average of $X$ over that set.
In most probability courses, we also learn about conditional expectations with respect to some other random variable. In this case, we write $\mathbb{E}[X \rvert Y]$ to mean $\mathbb{E}[X \rvert \sigma(Y)]$.
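Here is a minimal numerical sketch (in Python) of the partition formula above, using the dice example: the sample space is $\{1, \dots, 6\}$ with the uniform measure, $X$ is the face shown, and we condition on the $\sigma$-field generated by the partition into odd and even faces. All names in the code are illustrative.

```python
import numpy as np

# A minimal sketch of the partition formula for conditional expectation,
# using a fair die: Omega = {1, ..., 6}, X(omega) = omega, and the
# sub-sigma-field generated by the partition into odd and even outcomes.
omega = np.arange(1, 7)
prob = np.full(6, 1 / 6)              # P({k}) = 1/6 for each face k
X = omega.astype(float)               # the random variable X(omega) = omega

partition = [omega % 2 == 1, omega % 2 == 0]   # {1, 3, 5} and {2, 4, 6}

cond_exp = np.empty(6)
for block in partition:
    # On each block Omega_i, E[X | F] equals the integral of X over Omega_i
    # divided by P(Omega_i), i.e. the average of X over that block.
    cond_exp[block] = (X[block] * prob[block]).sum() / prob[block].sum()

print(cond_exp)                       # 3.0 on odd faces, 4.0 on even faces

# The defining property: integrating E[X | F] or X over any block agrees.
for block in partition:
    assert np.isclose((cond_exp[block] * prob[block]).sum(),
                      (X[block] * prob[block]).sum())
```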
We now introduce the idea of vector spaces. We begin with the definition of a field.
A vector space is defined with respect to a field. In full generality, it is a set of elements (vectors) that can be added together and scaled by elements of the field (scalars), subject to a handful of axioms.
Many concepts in linear algebra and general mathematics are derived from the vector space, including linear combinations, subspaces, and bases. It’s important to note that, though we usually think of vectors as tuples, they don’t need to be. You could define a vector to be different cheeses, and as long as the definition is satisfied, it will be a valid vector space.
If we equip a vector space with a special type of map, then we get an inner product space.
Something that will be very useful is a map from a vector space to the real numbers that can be thought of as assigning a “size” to vectors in the space. We call this a norm, and if we equip a vector space with a norm, then we have a normed vector space.
We can define the canonical norm of an inner product space as $\rvert \rvert x \rvert \rvert = \sqrt{\langle x, x \rangle}$. Thus, any inner product space is a normed vector space. A special type of normed vector space is the Banach space.
By “complete”, we mean that the space does not have any “holes” in it. Formally put, any Cauchy sequence taking values in $X$ converges to a point in $X$ as well.
Norms can also induce what we call a distance metric or function (or just metric for short) which assigns a value to represent how “far apart” two vectors in our space are.
The induced metric (i.e. the distance metric induced by the norm of a vector space) is the function $d: V \times V \rightarrow \mathbb{R}$ satisfying $d(x, y) = \rvert \rvert x - y \rvert \rvert$ for all $x,y \in V$. If we combine a metric with a set, then we get a metric space, which is just a set on which we have a particular sense of distance between its elements.
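The canonical example is $\mathbb{R}^n$ over the field $\mathbb{R}$ with the usual dot product, its induced norm, and its induced metric:

\[\langle x, y \rangle = \sum_{i = 1}^n x_i y_i, \hspace{5mm} \rvert \rvert x \rvert \rvert = \sqrt{\langle x, x \rangle}, \hspace{5mm} d(x, y) = \rvert \rvert x - y \rvert \rvert\]

This space is complete with respect to $d$, so it is both a Banach space and (as defined next) a Hilbert space.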
Using our definitions of inner product and complete metric spaces, we can define what is known as a Hilbert space.
Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let \(\mathbb{F} = (\mathcal{F}_t)_{t \geq 0}\) be a filtration such that $\mathcal{F}_t$ is a sub-$\sigma$-field of $\mathcal{F}$ for all $t$. (That is, $(\Omega, \mathcal{F}, \mathbb{F}, P)$ is a filtered probability space). Suppose we also have \(X: [0, \infty) \times \Omega \rightarrow \mathbb{R}\), a right-continuous supermartingale with respect to $\mathbb{F}$.
For $t \geq 0$, define \(X^-_t = \max\{-X_t, 0 \}\). Assume \(\underset{t > 0}{\sup} \mathbb{E}[X_t^-] < +\infty\). Then the pointwise limit \(X_\infty(\omega) = \underset{t \rightarrow + \infty}{\lim} X_t(\omega)\) exists and is finite for all $\omega \in \Omega$ outside a $P$-null set.
Proof to be completed.