In short, measure theory is ordinary
freshman calculus redone in a way
that, in some contexts, is
significantly more powerful.
Here measure is essentially just
a synonym for area, ordinary area.
Measure theory addresses both
the
differentiation and integration
of freshman calculus,
but most of the focus is on
integration.
So, in freshman calculus, you are
given, say, a real valued function
of a real variable, a function f
where, for real numbers x, we have,
e.g.,
f(x) = x^2 + 1
Then, say, we graph f and
want the area under the curve for
x in the interval [0,10]. Okay,
you saw that more than 10,000 times
in freshman calculus.
Well, in this case, measure theory
will give the same answer for the
area under the curve. The difference
is how that area is calculated.
Here is the shortest description
of where measure theory is different:
As you recall, in freshman calculus
you found the area under the curve
by picking points on the X axis,
that is, partitioning the X axis
into little intervals, on each interval
building a rectangle that approximated
the area under the curve over that
little interval, inserting more
points into the partition so that
the length of the longest little
interval got as small as we
pleased, adding up
the areas of the rectangles, and
taking the limit. That was it. That
was the definition of what the
integral was. Of course, to integrate
x^2 + 1 you learned about, for any
constant C, the anti-derivative
(1/3)x^3 + x + C
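If you want to see that freshman
construction in code, here is a
minimal Python sketch, just an
illustration: rectangles on a
partition of the X axis, compared
with the antiderivative answer.

    # Riemann-sum sketch: rectangles over a partition of [0, 10]
    # for f(x) = x^2 + 1, compared with the antiderivative.
    def f(x):
        return x**2 + 1

    def riemann_sum(f, a, b, n):
        # n little intervals; rectangle height from the left endpoint
        dx = (b - a) / n
        return sum(f(a + i * dx) * dx for i in range(n))

    exact = (1 / 3) * 10**3 + 10   # (1/3)x^3 + x at 10, minus its value at 0
    for n in (10, 100, 10000):
        print(n, riemann_sum(f, 0, 10, n), exact)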
So, here's what measure theory does:
Yup, it also works with a partition
but, in this case, the partition is
on the Y axis instead of the X axis.
Then for each little interval on the
Y axis, we get a horizontal bar and
look at the parts that are under the
curve. As we add points to the
partition, we add up the ordinary
areas of the relevant parts of the
horizontal bars. The picture
is less nice than in freshman calculus
but, still, no biggie.
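Here is a minimal Python sketch of
that Y axis idea. One loud
assumption for illustration: the
measure (length) of {x : f(x) > y}
is estimated with a crude grid on
[0, 10], a stand-in for real
Lebesgue measure.

    # Y-axis partition sketch for f(x) = x^2 + 1 on [0, 10]:
    # area = sum over horizontal slabs of (slab thickness) times
    # (approximate length of the set where f exceeds the slab level).
    def f(x):
        return x**2 + 1

    a, b, nx = 0.0, 10.0, 2000
    dx = (b - a) / nx
    xs = [a + (i + 0.5) * dx for i in range(nx)]   # grid standing in for [a, b]

    y_max = max(f(x) for x in xs)
    n_slabs = 500
    dy = y_max / n_slabs

    area = 0.0
    for j in range(n_slabs):
        y = (j + 0.5) * dy                             # slab midlevel
        length = dx * sum(1 for x in xs if f(x) > y)   # approx. length of {x : f(x) > y}
        area += dy * length

    print(area)   # roughly 1000/3 + 10 = 343.33...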
Now, how would one do, say, numerical
integration? Sure: The same
as in freshman calculus, say,
the trapezoid rule with its little
trapezoids, or Simpson's rule with
its little parabolas. Nope, measure
theory is not suggesting that we
change that.
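For the record, a minimal Python
sketch of composite Simpson's rule
(n even); since Simpson's rule is
exact for cubics, it nails
x^2 + 1 exactly.

    # Composite Simpson's rule; assumes n is even.
    def simpson(f, a, b, n):
        h = (b - a) / n
        s = f(a) + f(b)
        for i in range(1, n):
            s += (4 if i % 2 else 2) * f(a + i * h)
        return s * h / 3

    print(simpson(lambda x: x**2 + 1, 0, 10, 10))   # 343.333..., exact up to rounding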
Here are four features of the
approach of measure theory:
(1) There are some goofy, pathological
functions that get the integration
theory (the Riemann integral) of
freshman calculus all confused.
E.g., consider the function 0 on
the rational numbers and 1 otherwise.
Then the lower Riemann sums are
stuck at 0 (every little interval
contains a rational) and the upper
sums are stuck at 1 (every little
interval contains an irrational),
so they never converge to a common
value. Bummer. Well, the measure
theory integral rarely gets
confused: here the rationals are
countable, hence have Lebesgue
measure 0, so the Lebesgue integral
of that function over [0,1] is
just 1 (a tiny covering sketch
follows this list of features).
The generality is mind blowing, so
much so that it's darned tricky
even to construct a case where the
measure theory integral fails.
(2) A big deal is what happens
when a sequence of functions
is used to approximate and converge
to another function. A leading
example is Fourier series. Well,
on taking limits during this
approximation, the Riemann integral
can get confused when the measure
theory integral (Lebesgue) does just fine.
H. Lebesgue was a student of
E. Borel in France and did
his work near 1900.
(3) Often we want to pass some
limits under the integral sign.
Again, Lebesgue does much better
here than Riemann; the dominated
convergence theorem is the standard
tool. Indeed,
the Lebesgue integral has
a super nice theorem on
differentiation under the
integral sign (from the
TOC of Durrett, that theorem
may be the last topic in that
book -- it was a really fun
exercise when I was studying
that stuff).
(4) Notice a biggie: With
Lebesgue, actually we used
next to nothing about
the X axis, that is, about
the domain of the function
we are integrating. In this
way, right away, presto, bingo,
we get a theory of integration
that works on domains
with much, much less in assumptions
and properties than the real
numbers or the usual finite
dimensional real Euclidean
vector space. In particular,
we are now GO for doing
probability theory --
the Lebesgue integral is
used to define both probability
of events and expectation of
random variables. It was
A. Kolmogorov in 1933 who saw how
to use measure theory and the
Lebesgue integral as a solid
foundation for probability theory,
and he wrote a landmark monograph
on it. Since then, essentially all
serious research in probability and
stochastic processes, much of
mathematical statistics, and nearly
all of stochastic optimal control
has rested solidly on the
Kolmogorov, i.e., measure theory,
foundations.
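As promised in feature (1), here is
a tiny Python sketch of the
covering argument behind "the
rationals have measure 0":
enumerate the rationals (they are
countable), cover the k-th one with
an interval of length eps/2^(k+1),
and the total length of the cover
stays below eps, no matter how many
rationals get covered.

    # Cover the k-th rational with an interval of length eps / 2^(k+1).
    # The total length of the cover stays below eps, so the rationals
    # fit inside arbitrarily small total length: Lebesgue measure 0.
    from fractions import Fraction

    eps = Fraction(1, 100)
    total = sum(eps / 2**(k + 1) for k in range(1000))   # first 1000 rationals
    print(total < eps, float(total))                     # True 0.01 (just under eps)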
So, to a mathematician not much
interested in probability,
probability theory is just a
special case of measure theory
where the total area (measure) is
just 1. That's not literally or
logically wrong, but it throws
out a fantastic baby with the
bathwater.
Some of the results in
probability are just astounding
and powerful -- both
beyond belief.
So, in measure theory, here
is what a measure (a definition
of area) is:
We start with a non empty set,
a space, say, X. From
X we have some subsets we
call measurable sets.
So, measurable set A is a
subset of X. In the special
case of probability,
the measurable sets are
the events, that is, e.g.,
all the trials where
our coin comes up heads.
We ask to have enough measurable
sets so that all of them
form a sigma algebra. Why?
Otherwise we don't have much.
A sigma algebra doesn't ask
for much. The sigma part is
supposed to suggest
finite or countably infinite
adding up, as in the usual use of
the capital Greek sigma for
summing.
Say our sigma algebra of
subsets of our measurable
space X is S (probability
theory usually uses script F).
Then we want the empty subset
of X to be an element of S.
For A in S we want X - A (the
relative complement) to be
an element of S. And for
B(i) in S for i = 1, 2, ...,
we want the union of
all the B(i) to be an element of
S. These conditions ensure that
we will have enough sets
in S to have a decently strong
theory.
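Here is a minimal Python sketch
that checks those three conditions
for a family S of subsets of a
small finite X; with X finite,
countable unions reduce to finite
unions, so pairwise unions suffice.
S1 and S2 are made-up examples.

    # Check the sigma-algebra axioms on a finite space.
    def is_sigma_algebra(X, S):
        X = frozenset(X)
        S = {frozenset(A) for A in S}
        if frozenset() not in S:                 # empty set is a member
            return False
        if any(X - A not in S for A in S):       # closed under complement
            return False
        for A in S:                              # closed under union
            for B in S:
                if A | B not in S:
                    return False
        return True

    X = {1, 2, 3, 4}
    S1 = [set(), {1, 2}, {3, 4}, {1, 2, 3, 4}]
    S2 = [set(), {1}, {2, 3, 4}, {2}, {1, 3, 4}, {1, 2, 3, 4}]
    print(is_sigma_algebra(X, S1))   # True
    print(is_sigma_algebra(X, S2))   # False: {1} | {2} = {1, 2} is missing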
In what we commonly
do in applied probability,
we wouldn't settle for less.
E.g., the empty set is the
event it never happens.
If H is the event of heads,
then T = X - H, the relative
complement, is the event tails.
If H(i) is the event that
flip i comes up heads, then
the union of all the H(i) is the
event that the coin comes up heads
at least once, i.e., doesn't
always come up tails.
In probability, those are all
simple situations, and
just from those we need
a sigma algebra of events.
And it turns out, that's enough,
and has been since 1933.
So, for a measure, say, m: to each
measurable set A there is a real
number m(A),
the measure (think area
or, in probability theory, the
probability) of A. Of course
in probability we call the
measure P instead of m
and write P(A) for the probability
of event A.
You can begin to see that we
are essentially forced into
how Kolmogorov applied measure
theory whether we like it or not.
Sorry 'bout that!
Well, for a measure
m, we want countable
additivity. So,
for disjoint measurable
sets B(i), i = 1, 2, ...,
we want
m(union B(i)) = sum m(B(i))
for some sloppy math notation
since I can't type TeX here!
Usually m(A) is real with
m(A) >= 0, and commonly we allow
m to take the value infinity;
e.g., Lebesgue measure gives the
whole real line measure infinity.
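As a toy illustration in Python
(made-up point masses; with X
finite, countable additivity
reduces to finite additivity):

    # A discrete measure on a finite X, defined by point masses.
    mass = {"a": 0.5, "b": 0.25, "c": 0.25}

    def m(A):
        return sum(mass[x] for x in A)

    B1, B2 = {"a"}, {"b", "c"}        # disjoint measurable sets
    assert B1.isdisjoint(B2)
    print(m(B1 | B2), m(B1) + m(B2))  # 1.0 1.0 -- additivity holds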
We can also extend the definition
to let m(A) be any real number
(a signed measure) or any complex
number (a complex measure).
Measure theory is the
total cat's meow for
Fourier theory!
To get a sigma algebra
of measurable sets we want,
commonly we start with
a topology, that is,
its collection of open
sets, and ask for the
unique smallest sigma algebra for
which each open set is also a
measurable set (the Borel sigma
algebra).
When we do this on the real
line and assign
the measure of intervals
their ordinary length
and extend that to
as many subsets of the
reals as we can, we get
Lebesgue measure for the
real line. We get a lot
of sets! It's a tricky exercise,
one that needs the axiom of choice,
even to construct a subset of the
reals that is not Lebesgue
measurable. Powerful theory!
Suppose we have spaces X
and Y, each with a sigma
algebra of subsets and
a function
f: X --> Y
Then f is measurable
if for each measurable
subset B of Y
f^(-1)(B) is also
a measurable subset
of X. In measure
theory, when we integrate
a function, we ask that
it be measurable.
In the usual cases, it's
even tough to construct
a function that is not
measurable. Darned
near any limit of
measurable functions is
also measurable -- super
nice theory.
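Here is a minimal Python sketch of
that preimage test on made-up
finite spaces: f is measurable for
the given sigma algebras, g is not.

    # Measurability via preimages, on finite measurable spaces.
    def preimage(f, X, B):
        return frozenset(x for x in X if f(x) in B)

    def is_measurable(f, X, SX, SY):
        SX = {frozenset(A) for A in SX}
        return all(preimage(f, X, B) in SX for B in SY)

    X = {1, 2, 3, 4}
    SX = [frozenset(), frozenset({1, 2}), frozenset({3, 4}), frozenset(X)]
    Y = {"lo", "hi"}
    SY = [frozenset(), frozenset({"lo"}), frozenset({"hi"}), frozenset(Y)]

    f = lambda x: "lo" if x <= 2 else "hi"   # preimages {1, 2} and {3, 4}: fine
    g = lambda x: "lo" if x <= 1 else "hi"   # preimage {1} is not in SX
    print(is_measurable(f, X, SX, SY))       # True
    print(is_measurable(g, X, SX, SY))       # False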
In probability theory,
a random variable
is just a measurable
function where its
domain is
a probability space,
that is, a sample space
Omega with a sigma
algebra of subsets
script F and
a probability measure P.
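One last Python sketch: two fair
coin flips as a toy probability
space (made-up names), with P and
the expectation E both computed by
summing over outcomes -- the
Lebesgue integral in miniature.

    # A finite probability space and a random variable on it.
    Omega = ["HH", "HT", "TH", "TT"]      # two fair coin flips
    P_point = {w: 0.25 for w in Omega}    # P on single outcomes

    def P(A):                             # probability of event A
        return sum(P_point[w] for w in A)

    def X(w):                             # number of heads: a random variable
        return w.count("H")

    def E(rv):                            # expectation = integral of rv dP
        return sum(rv(w) * P_point[w] for w in Omega)

    print(P({w for w in Omega if X(w) >= 1}))   # 0.75: at least one head
    print(E(X))                                 # 1.0: expected number of heads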
Of course, there's much
more, stacks, shelves,
racks of books
as difficult as you wish,
but the above is
a simple, intuitive view
from 10,000 feet up.
Or, measure theory is
a nicer theory of
area and area under a curve
that in all the simple cases
gives you just what
you have been used to!