2 Probabilities, likelihood, and inference

In the previous chapter, we introduced some data and talked about variables and experiments. In this chapter, we’re going to talk about probabilities and explain how they can be used to make inferences about our data and research questions. Before beginning this chapter we should note that it’s normal if some, or many, of the topics to be discussed don’t make sense the first time you read this chapter. Things will make more sense once you start to actually build models and it becomes less hypothetical and more practical.

In addition, as noted in the preface, it’s a mistake to think that you can read a chapter in a statistics textbook once and move on having fully absorbed the content. If many of the topics in this chapter are new to you, you should probably: (1) Read this chapter, (2) wait a few days and read it again, and then (3) wait a few more days (or weeks) and read it again. It may also be useful to return to this chapter even once you are working through the following chapters. In our experience, you may find that you see information for the first time that was there all along, but that finally makes sense given your increased experience with the subject.

Finally, we want to note that what we present below is a teeny tiny slice of statistical inference based on the assumption of normally distributed data (or actually, errors). In Chapter 5, we will introduce another error distribution (t), and in Chapter 10, we will model binary dependent variables using logistic regression. In each case, those models require us to change aspects of the details outlined below. However, the same general principles still apply with respect to the relationships between data, likelihoods, and the estimation of credible parameter values.

2.1 Chapter pre-cap

In this chapter, we introduce important statistical concepts such as probability, probability distributions, and likelihoods. The discussion in this chapter centers around the Gaussian/Normal distribution, and making inferences related to the population mean parameter. First, we introduce conditional, marginal, and joint probabilities, and the importance of assuming the independence of observations. Then, we present probability densities and the normal distribution, and discuss the parameters of this distribution in detail. Following this, we introduce likelihood functions, and discuss logarithms and the use of log-likelihoods. Finally, we outline the use of likelihood functions and statistical models to make inferences about credible values of parameters.

2.2 Data and research questions

We’re going to think about a hypothetical value “the average apparent height of adult males”. To put it another way, we might wonder “how tall do men ‘sound’?”. The apparent heights of men cannot be known a priori. In other words, you don’t know how tall a random man will ‘sound’ until you actually observe the judgment. For this reason, “the average apparent height of men” can be modeled as a random variable. We’re going to think about how we can use our experimental data to try to answer the following two questions:

(Q1) How tall does the average adult male ‘sound’?

(Q2) Can we set limits on credible average apparent heights based on the data we collected?

These two questions can be thought of as relating to the central location and the spread of the data, respectively. Answering the first question tells you what values your variable tends to take on, while answering the second question tells you how much variation you can expect around the most typical values. Scientific research is often focused on questions such as (Q1) regarding the central location, the average value, of some variable. For example, someone might ask “how tall do adult male speakers sound?” and you can say, for example, “I have some data that suggests 174 cm is a reasonable estimate”. However, reliable inference requires answering question (Q2) as well, and determining what range of values are plausible for a certain variable.

Think of the (actual) average height of the people in a large city. You can go out and sample 100 people and find the mean (average) of your sample, arriving at a single estimate of the population mean. Now imagine that 50 people went out in the same city and each sampled 100 random people. There is no chance that every one of those 50 people would find identical means across all of their samples. Instead, there will be a distribution of sample means, in the same way, there is a distribution of the original data used to calculate the means. Another way to look at this is that there is some degree of uncertainty involved when answering any research question. As a result of uncertainty, it can be difficult to rule out alternative possible answers to our research questions. For example, if 174 cm is a good estimate, what about 173.99 cm? What about 173 cm? 172 cm? Where do estimates of average height stop being ‘good’? Without being able to say what is not a good estimate, it’s not quite as useful to be able to say what is.

A related issue arises with respect to the interpretation of the magnitude of effects. Imagine that you read about a miracle diet that was guaranteed to make you lose one gram of weight a day. You know that is not very impressive. How? Because you understand that a difference of one gram is not large relative to the variation that exists in the weight of a human body on a daily basis. You could gain the weight back (and more) by drinking a teaspoon of water. Without knowing how much human weight tends to vary between and within people, it’s impossible to know whether a reduction of one gram constitutes a meaningful change in the mass of a human. In contrast, a diet that causes one gram of weight loss in the average mouse may actually be of interest to mice owners, as this may be a large value relative to natural variation in the weight of a mouse.

So, we see that imposing limits on credible ranges for our average values can be as important as finding the average values themselves. Further, in order to properly contextualize values and effects, we need to have some idea about the underlying variation in the measurements. Clearly, we need some principled way to ‘guess’ reasonable ranges based on our sample of observations, in addition to just talking about average values. In this chapter, we will discuss how statistics provides us with a framework to answer (Q1) and (Q2) above using only our sample of values and a (statistical) model.

In order to discuss the apparent heights of adult males, we need to extract the subset of speakers judged to be adult males from our data. Below we load the book package (bmmb) and select the relevant rows from our experimental data. Recall that our experiment contained an acoustic manipulation such that speech resonances were changed to make speakers sound bigger (see Section 1.3.2). In the main text, we’re going to focus only on the unmodified productions, the ‘natural’ speech produced by the men in our sample. To do this, we are going to use the data in exp_data, which contains only the trials involving unmodified productions.

# Load book package 
library (bmmb)

# Load experimental data
data (exp_data)

# Take only rows produced by men (`m`)
men = exp_data[exp_data$C_v == 'm',]
mens_height = men$height

We can have a look at some of the quantiles (see Section 1.5.2) to get an idea of what range of values this variable tends to have. We see that the minimum and maximum values are 139.7 and 192.3 cm, that 174.5 cm is the median, and that half of the observed height judgments for adult males fell between 169.2 and 179.1 cm.

quantile (mens_height)
##    0%   25%   50%   75%  100% 
## 139.7 169.2 174.5 179.1 192.3

Obviously, inspecting the distribution of our observed height judgments only gives us direct information about the judgments we actually observed. To make inferences about the probable characteristics of the height judgments we have not observed, or to talk about height judgments for adult males more generally, we rely on methods of statistical inference, as will be described below.

2.3 Empirical Probabilities

The sample space of a variable is the set of all possible outcomes/values that a variable can take. Classic examples are a coin flip, which can take on the values ‘heads’ or ‘tails’, or the roll of a standard six-sided die, which can take on the values one through six. In other cases, the sample space may have an infinite or practically infinite number of members. For example, since time is continuous, there are an infinite number of durations an event may have, given adequate precision in measuring time. If we think of the human population, or the population of fish in the sea, these are theoretically finite but practically infinite. It would be extremely difficult to fully sample either of these populations and impossible to do so before they changed substantially (e.g. before some members have died and others have been born).

The probability of an event/outcome is the number of times an outcome occurs relative to the number of all possible outcomes that can occur (i.e. all of the outcomes in the sample space). For example, there are six possible outcomes when you roll a die. As a result, the probability of any one outcome is 1/6 (0.17), and the probability of observing an even number is 3/6 (0.5). Actually, this is an extremely brief, and vastly oversimplified, presentation of probability, a topic that could occupy one’s entire career. There are actually many ways to conceive of probability mathematically and philosophically. For more detailed, rigorous expositions, see Ross (2019) for a good introductory text, Parzen (1960) for a more classical, and advanced, approach, and Jaynes (2003) for a distinctly Bayesian treatment of the subject.

By convention, the probability of each event is assigned a value between 0 and 1, and the total probability of all of the possible outcomes in the sample space is equal to one. As a result of this, you know that a probability of 0.5 means something is expected to occur half the time (i.e. on 50% of trials), and a probability of 0.25 indicates that something should happen 25% of the time, about one in every four trials.

Empirical probabilities are the probabilities of different outcomes in a sample of data. For example, we can flip a coin 100 times and observe 65 heads. This means that the empirical probability of observing heads in our data is 0.65 (65% of trials). Empirical probabilities can also be referred to as relative frequencies or proportions. Later, we’re going to talk about theoretical probabilities based on mathematical models.

Suppose we want to know the probability of observing an apparent height judgment over 180 cm in our sample. To calculate the empirical probability of this occurring in our data we need to find (1) the number of times it occurred, and (2) the total number of observations for the variable. We can do this easily using the logical operators and variables discussed in Chapter 1. Below, we find the total number of outcomes that satisfy our restrictions (being over 180 cm) and divide this by the total number of observations being considered.

# the evaluation in the parenthesis will return 1 if true, 0 if false
# number of observations the fall above the threshold
sum (mens_height > 180)  
## [1] 136

# divided by total number of events
sum (mens_height > 180) / length (mens_height)  
## [1] 0.2015

# a shortcut to calculate probability, mean = sum/length
mean (mens_height > 180)
## [1] 0.2015

The top value is the frequency of the occurrence. This is not so useful because this number can mean very different things given different sample sizes (e.g. 136/675, 136/675,000). The middle and bottom values have been divided by the total number of observations. As a result, these now represent an estimate of the probability of occurrence in a way that is independent of the total number of observations.

2.3.1 Conditional and marginal probabilities

Figure 2.1 presents boxplots of the overall distribution of height judgments in our data (left), and of the distribution of these provided by each listener individually. We can see that height judgments range from about 140 to 200 cm, with most responses falling between 170 and 180 cm. Notice that our overall boxplot does not give us any information about the ranges used by different listeners, nor even the fact that the data was contributed by different listeners. This overall distribution of height responses is the marginal distribution of height judgments. The marginal distribution of a variable is the overall distribution, across all values of all other variables. The marginal boxplot on the left compresses all of the listener-specific boxplots on the right into one single box. It’s common to denote marginal probabilities using notation like this $P(\mathrm{variable})$, meaning we might refer to the marginal probability of height responses like this $P(\mathrm{height})$.

show plot code

par (mar = c(4,.5,1,0), mfrow = c(1,2), oma = c(0,4,0,1))
layout (mat = t(c(1,2)), widths = c(.15,.85))

boxplot (men$height, ylab = "", xlab = "", col=lavender, 
         ylim = c(135,200), width=2)
mtext (side=2,text="Apparent height (cm)",line=3)

boxplot (height ~ L, data = men, ylab = "Apparent height (cm)", 
         xlab = "Listener", col = bmmb::cols,
         yaxt='n', ylim = c(135,200))
grid()
boxplot (height ~ L, data = men, ylab = "", xlab = "",yaxt='n',
         col = bmmb::cols, add = TRUE)

Figure 2.1: (left) Boxplot showing all height judgements for adult male speakers in our experiment. (right) Individual boxplots for each listener’s responses.

A look at Figure 2.1 reveals that the probability of observing a height response of over 180 cm can depend substantially on the listener that provided it (e.g. compare listener 7 vs listener 12). Recall that height is a quantitative variable and listener is a categorical predictor (a factor) with 15 levels, one for each listener (see Section 1.5.1). We can talk about how height judgments vary across the levels of our listener factor (i.e. for different listeners) by considering the conditional probability of height given the listener. A conditional probability is the probability of an outcome given that some other outcome has occurred. For example, rather than ask “what is the probability of observing an apparent height over 180 cm?”, we can ask “what is the probability of observing an apparent height over 180 cm given that we are observing data from listener 12?”. Conditional probabilities basically reduce the sample space by including only the subset of events that satisfy the given condition.

Conditional probabilities are often denoted like this $P(\mathrm{outcome \, variable} | \mathrm{conditioning \, variable})$, which in this case would look like $P(\mathrm{height} | L)$. For example, the first box in the right plot of Figure 2.1 is the distribution of $P(\mathrm{height}| L=1)$, and the second box is the distribution of $P(\mathrm{height}| L=2)$. Below we divide our vector of height judgments (mens_height) into those contributed by listener 7 and listener 12. We then find the probability of observing a height judgment over 180 cm conditional on the listener and see that these can differ quite a bit from each other.

# create subsets based on listener
L07 = mens_height[men$L==7]
L12 = mens_height[men$L==12]

# find the conditional probability of height>180 for each listener
mean (L07 > 180)
## [1] 0.04444

mean (L12 > 180)
## [1] 0.6444

In the boxplots in Figure 2.1 we see that apparent height distributions vary substantially as a function of the value of L, our listener variable. Contrast this with the boxplots seen in Figure 2.2, which shows the distribution of stimulus durations conditional on listener. Since every individual recording had a fixed duration and all listeners heard the same sounds, we know that the distribution of stimulus durations is identical across all levels of the listener variable. As a result, we can see that all of the conditional distributions of duration given listener look just like each other, and just like the marginal probability. This tells us that duration and listener are statistically independent.

show plot code

par (mar = c(4,.5,1,0), mfrow = c(1,2), oma = c(0,4,0,1))
layout (mat = t(c(1,2)), widths = c(.15,.85))

boxplot (men$dur, ylab = "", xlab = "", col=lavender, 
         ylim = c(120,370), width=2)
mtext (side=2,text="Duration (ms)",line=3)

boxplot (dur ~ L, data = men, ylab = "", xlab = "Listener",
         col = c(yellow, deepgreen,coral,skyblue,darkorange,lavender),
         yaxt='n', ylim = c(120,370))
grid()
boxplot (dur ~ L, data = men, ylab = "", xlab = "",yaxt='n',
         col = c(yellow, deepgreen,coral,skyblue,darkorange,lavender), 
         add = TRUE)

Figure 2.2: (left) Boxplot showing the duration of stimuli for adult male speakers in our experiment. (right) Individual boxplots for the stimuli presented to each listener.

When two variables are statistically independent, the distribution of one variable is not affected by the values of the other. As a result, the conditional distribution of one variable given the other will be the same as its marginal distribution, as seen in Figure 2.1. This can be stated for the general case as $P(\mathrm{variable | conditioning \, variable}) = P(\mathrm{variable})$. In our example in Figure 2.1, $P(\mathrm{height} | L=7)$ is not equal to $P(\mathrm{height} | L=12)$, and neither of these is equal to the marginal distribution. As a result, we conclude that the apparent height and listener variables in our data are not independent. Instead, they are statistically dependent, meaning that these variables are related to each other in some way and that knowing the value of one may tell you something about probable values of the other.

2.3.2 Joint probabilities

Joint probabilities reflect the probability of two or more things occurring together. We can refer to the joint probability of $A$ and $B$ using the notation $P(A \cap B)$ or $P(A \: \& \: B)$. Here are some important things to know about joint probabilities:

The formula for calculating the joint probability of two outcomes $A$ and $B$ is given by $P(A \: \& \: B)=P(A|B) \cdot P(B)$. This means that the probability of $A$ and $B$ is equal to the conditional probability of $A$ given $B$, multiplied by the marginal probability of $B$. In plain English this means that the joint probability of A and B is the probability of $A$ assuming $B$ is true, times the probability that $B$ is true.
Recall from Section 2.3.1) that when $A$ and $B$ are independent $P(A|B)=P(A)$. In other words when $A$ and $B$ are independent the conditional probability of $A$ given $B$ equals the marginal (unconditional) probability of $A$. As a result of this, when $A$ and $B$ are independent $P(A|B) \cdot P(B)=P(A) \cdot P(B)$. Thus, when $A$ and $B$ are independent their joint probability can be found by simply multiplying their individual marginal probabilities.
The probability of $A$ and $B$ is equal to the probability of $B$ and $A$. As a result of this $P(A \: \& \: B)=P(B \: \& \: A)$, and as a result of that $P(A|B) \cdot P(B)=P(B|A) \cdot P(A)$. This relation is very important and will become useful later.

We can demonstrate the above properties using the empirical probabilities in our data. Consider the joint probability of observing a response in our data that was contributed by listener seven (L='12'), and also being longer than 250 ms in duration. We can find this by joining two logical variables using the & (and) symbol as shown below.

# TRUE if the listener is 12
L12 = (men$L=='12')

# TRUE if the duration is greater than 250 ms
dur_250 = men$dur > 250

# The probability of A and B
mean (L12 & dur_250)
## [1] 0.02815

mean (men$L=='12' & men$dur > 250)
## [1] 0.02815

So, we see that the probability of observing this event is 0.028, indicating that we expect this in about 3% of trials. Below, we see that this same joint probability can be calculated based on $P(A \: \& \: B) = P(A|B) \cdot P(B)$.

# Marginal probability of observing listener 12 (i.e. P(L=12))
p_L12 = mean (men$L=='12')

# Subset containing only listener 12 
L12 = men[men$L == '12',]

# Probability of dur>250 given listener 12 (i.e., P(dur>250 | L=12))
p_dur_250_given_L12 = mean (L12$dur > 250)

# Joint probability = P(dur>250 | L=12) * P(L=12)
p_dur_250_given_L12 * p_L12
## [1] 0.02815

Because of the experimental design, we know that every listener heard the same vowel durations (see Section 1.3 for a review of the experimental design). This means that duration is independent of the listener: Knowing who the listener is in no way changes the fact that every listener heard the same number of trials with durations longer than 250 ms. As a result of this independence, we could also have calculated the joint probability of the above events by simply multiplying their marginal probabilities, as seen below.

# Joint probability = P(dur>250) * P(L=12)
mean (dur_250) * mean (p_L12)
## [1] 0.02815

This shortcut will not work for variables that are not independent. For example, let’s consider the probability of observing a height judgment of greater than 180 cm provided by listener 12. Below we calculate this the wrong way by multiplying the marginal probabilities, substantially underestimating the probability of the event. The problem is that this method of calculation does not take into account that listener 12 was among the most likely listeners to report heights greater than 180 cm.

# TRUE if the listener is 12
L12 = (men$L=='12')

# TRUE if the height is over 180
over_180 = men$height > 180

# Empirical joint probability of the observation
mean(men$L=='12' & men$height > 180)
## [1] 0.04296

# Wrong: multiplying marginal probabilities
mean(L12) * mean(over_180)
## [1] 0.01343

Below we calculate the joint probability correctly by using the conditional probability.

# Marginal probability of observing listener 12
p_L12 = mean (men$L=='12')

# Subset containing only listener 12 (i.e., given listener 12)
L12 = men[men$L == '12',]
# Conditional probability of a height > 180 given listener 12
p_over_180_given_L12 = mean (L12$height > 180)

# Correct joint probability
p_over_180_given_L12 * p_L12
## [1] 0.04296

This highlights a very important point: The calculation of joint probabilities is much (much) simpler when you can assume that your observations are independent. For example, consider the comparison of the joint probability of four events, one for independent events, and another for dependent events:

\[ \begin{equation} \begin{split} \mathrm{Independent \, events:} \\ P(A \,\&\, B \,\&\, C \,\&\, D) = P(A) \cdot P(B) \cdot P(C) \cdot P(D) \\ \\ \\ \mathrm{Dependent \, events:} \\ P(A \,\&\, B \,\&\, C \,\&\, D) = P(A|B,C,D) \cdot P(B|C,D) \cdot P(C|D) \cdot P(D) \end{split} \end{equation} \]

This means that to calculate the joint probability of dependent events, we may need to calculate many conditional probabilities. Although this may not matter when calculating the joint probability of a handful of observations, calculating the joint probability of hundreds or thousands of correlated variables can become difficult if not impossible in practice.

2.4 Probability distributions

A probability distribution is a function that assigns probabilities to the different possible outcomes in a sample space. We will illustrate and discuss exactly what this means with reference to histograms. In the left plot of Figure 2.3, we see a histogram of counts. This histogram shows the number of times different ranges of height values were observed. The height values are indicated on the x-axis, and the heights of the bars reflect the number of times that different ranges of heights were observed. As a result, the bars of the histogram tell you about the values of the variable that are more or less frequent. This sort of representation makes it difficult to compare distributions across samples of different sizes. For example, if the number of observations were tripled (and if the relative frequency of each height observation was preserved), the heights of the bars in the histogram would also triple.

show plot code

par (mfrow = c(1,3), mar = c(4.1,4.1,1,1))

hist (mens_height, main="", col = lavender, xlim = c(135,200),
      ylim = c(0,205), xlab="Apparent height (cm)",
      cex.lab=1.3,cex.axis=1.3)

hist (mens_height, freq = FALSE, main = "", col = deepgreen, 
      xlim = c(135,200), ylim = c(0,0.06), xlab="Apparent height(cm)",
      cex.lab=1.3,cex.axis=1.3)

hist (mens_height/100, freq = FALSE, main = "", col = skyblue, 
      xlim = c(1.35,2.00), ylim = c(0,06.0), xlab="Apparent height (m)",
      cex.lab=1.3,cex.axis=1.3)

Figure 2.3: (left) A histogram of adult male height judgments showing counts in each bin. (middle) The same data from the left plot, this time showing the density of the distribution. (right) The same data from the middle plot, this time with heights expressed in meters.

Histograms can also be used to show the probability densities associated with different values, i.e. the amount of probability ‘mass’ per unit of the variable of interest. In the middle of Figure 2.3, we see a histogram that shows the density of the distribution of height judgments, i.e. the thickness of the distribution at different locations along the number line. The values of the density, and the heights of the bars of the density histogram, are constrained by the fact that the area under the curve must equal one since these reflect probabilities. In the case of our histograms, this means that the total area of all the bars (rectangles) is constrained to equal one. As a result of this, when variables have wide ranges densities tend to be very small (as in the middle panel in Figure 2.3).

In the right plot of Figure 2.3, we present heights in meters rather than centimeters. This has the effect of substantially increasing the values of the density (100-fold) but does not otherwise affect the shape of the distribution. The relationship between variable ranges and density values means that densities can’t really be interpreted in an absolute sense – they always reflect the amount of probability per unit. So, if you change the unit, you change the interpretation of the corresponding density. For example, a density value of 0.06 means very different things in the middle and right histograms in Figure 2.3. As a result, density values should be interpreted relative to other density values only along the same x-axis (i.e. given the same variable). A higher density in one location tells us that values in that vicinity are more probable than values in locations with lower densities, and differences in densities reflect differences in the relative probabilities of different values.

Probability distributions sometimes have shapes that can be represented using mathematical functions. These functions typically have a limited number of parameters that determine their exact shape. Think of parameters as the properties of a system that can be ‘set’ separately from each other. For example, a radio has three parameters: Tuner frequency, band (AM/FM), and volume. A toaster may have only one, a single knob determining the degree of toasting required. The more parameters something has, the more complicated it is (e.g. an airplane may have thousands).

Consider the code below, which defines a slope and intercept and draws a line based on these parameters. You can change the values of the intercept and the slope and draw many kinds of lines. However, there’s no way to change the characteristics of a line other than by changing its slope and intercept: A line is entirely defined by these parameters.

intercept = 3
slope = 1
x = seq (-10,10,.1)
y = intercept + x*slope

plot (x,y, type='l',lwd=3,col=4)

Sometimes the same sorts of probability density shapes pop up over and over, and these shapes are often well-defined mathematically. Parametric probability distributions are those density shapes that can be represented using curves that vary in terms of a limited number of parameters. Just like with lines, the characteristics of a parametric probability distribution are entirely defined by the values of its parameters.

2.5 The normal distribution

The distribution of many random variables (at least approximately) follows what’s known as the normal or Gaussian distribution. This means that if you take a sample of a random variable and arrange observations into bins, the resulting histogram will tend to have the familiar bell-shaped curve (seen in the histograms in Figure 2.3). Normal distributions have two parameters, meaning they differ from each other in only two ways: A mean ($\mu$) and a variance ($\sigma^2$). The normal distribution has the following important characteristics:

The distribution is symmetrical i.e. observations above and below the mean are equally likely.
The probability of observing a given value decreases as you get further from the mean (i.e., average) value.
It’s easy to work with, very well understood, and (approximately) normally distributed data arises in many domains.

The mean ($\mu$) of a normal distribution determines the location of the distribution along the number line. When the mean of a normal distribution changes, the whole shape of the distribution ‘slides’ along the number line. The mean is the 50% halfway point of the ‘mass’ of the distribution (i.e. with a normal distribution, the mean is identical to the median). If the distribution were a physical object, its mean would be its center of gravity and you would balance the distribution on your fingertip at this point. The mean of a normal distribution is also the expected value of that variable. For discrete variables, the expected value ($\mathbb{E}$) is the sum of all possible values of $y$multiplied by the probability of observing each value,$P(y)$, as in Equation 2.1.

\[ \begin{equation} \mathbb{E}(y_{[i]}) = \sum_{i=1}^{\infty} y_{[i]} P(y_{[i]}) \end{equation} \tag{2.1}\]

For example, for the roll of a single die the possible outcomes are the integers from 1 to 6 and each has a 1/6 chance of occurring. To calculate the expected value, we multiply each outcome by 1/6 and sum the resulting values (1/6, 2/6, 3/6, 4/6, 5/6, 6/6), resulting in 3.5. So, the expected value of a roll of a single die is 3.5. This is why the game of craps involves rolling two dice and centers on whether a player rolls a 7: This is the most probable outcome. Continuous variables have an analogous formula defining the expected value with an integral and a probability density function ($f(y)$), as seen below. A probability density function is just a function that assigns some value, e.g. $f(y)=x$, for the different values of a variable (e.g. $y$).

\[ \begin{equation} \mathbb{E}(y) = \int_{-\infty}^{\infty} yf(y) \; dx \end{equation} \tag{2.2}\]

The $y_{[i]} P(y_{[i]})$ part in the discrete equation above is analogous to the $y f(y)$ in the continuous equation. In each case, the first element ($y$) represents a possible value and the second element ($P(y)$, or $f(y)$) multiplies this by a value reflecting the probability that this value is observed. In the discrete case we use probabilities ($P(y)$ ) and add our terms up (with $\Sigma$). For the continuous case we use the density ($f(y)$) and integrate (i.e. find the area under the curve) across a given interval (with $\int dx$).

The variance ($\sigma^2$), or standard deviation ($\sigma=\sqrt{\sigma^2}$), of a normal distribution determines its spread along the x-axis. When the standard deviation changes, the distribution is stretched wide or made very narrow, but stays in the same place. Since every distribution has an area under the curve equal to one (i.e. they all have the same ‘volume’), distributions with smaller variances must necessarily be more dense.

In principle, a given probability distribution can be thought of as having a set of ‘true’, fixed parameters. For example, we might imagine that the average apparent height of adult male speakers is exactly 175.91254… cm. This is questionable at least in part because, as noted earlier, many real-life populations are practically infinite and constantly changing. In any case, even if ‘true’, fixed parameters exist, in most situations we can’t be certain of what the true parameters of a distribution are. Instead, we must be satisfied with estimating the values of parameters based on our samples.

Estimates of our population parameters based on our sample are called statistics. Statistical inference consists of using statistics (based on our sample) to make inferences about the characteristics of the overall population (i.e. the ‘true’ parameters). In the case of the normal distribution we’re interested in two statistics: the sample mean and the sample standard deviation.

2.5.1 The sample mean

The sample mean is our ‘best’ guess for the population mean, i.e. the expected value of the distribution. We’ll be more specific about what ‘best’ means later, but for now, we can just say that if you don’t know the population mean parameter for a given normal distribution, and all you have is a sample of observations, the sample mean will provide you the best available estimate. The formula for the sample mean is given in Equation 2.3. Initially, reading these mathematical formulas may seem daunting. However, learning to read these is just a skill that is developed with practice. In addition, you will begin to see the same ‘chunks’ or structures come up in formulas over and over, and reading these becomes much easier once you start to recognize the meaning of these repetitive structures intuitively.

\[ \begin{equation} \hat{\mu}_{y} = \sum_{i=1}^{n} y_{[i]}/n \end{equation} \tag{2.3}\]

The sample mean is an estimator of the expected value. In Equation 2.3, the division by $n$ reflects the fact that we treat each of our $n$ observations as equally likely, estimating each probability as $1/n$. If we were to replace $1/n$ with $P(y)$ above, equation Equation 2.3 starts to look a lot more like Equation 2.1.

Equation Equation 2.3 says that the sample mean of $y$ ($\hat{\mu}_{y}$) is equal to the sum of all of the elements of the vector $y$, divided by $n$, where $n$ is equal to the length of the $y$ vector (i.e. the number of observations). We use the little hat symbol ($\hat{\mu}$) to indicate that this is an estimate of the mean, and to distinguish it from the population mean, which we cannot observed directly and goes hatless ($\mu$). The summation ($\sum$) symbol represents the repetitive adding of whatever is to the right of the symbol to some total. The summation begins at the number below the $\sum$ and performs one operation for every integer value of $i$ between the starting point and the end point (indicated below and above the $\sum$). The counter variable, in this case $i$ , is also often used to index values of a vector (or other structure) that is being summed.

Equation Equation 2.3 says that the sample mean of $y$ ($\hat{\mu}_{y}$) is equal to the sum of all of the elements of the vector $y$, divided by $n$, where $n$ is equal to the length of the $y$ vector (i.e., the number of observations). We use the little hat symbol ($\hat{\mu}$) to indicate that this is an estimate of the mean, and to distinguish it from the population mean, which we cannot observed directly and goes hatless ($\mu$). The summation ($\sum$) symbol represents the repetitive adding of whatever is to the right of the symbol to some total. The summation begins at the number below the $\sum$ and performs one operation for every integer value of $i$ between the starting point and the end point (indicated below and above the $\sum$). The counter variable, in this case $i$, is also often used to index values of a vector (or other structure) that is being summed.

This behavior is similar to a for loop in R. Below we define the variable n (equal to the length of the vector) and initialize a variable to receive the summation (mean_height). The for loop then proceeds to increment a count variable (i) from one to n along the integers. For each iteration, it adds the value of the vector, divided by n, to the summation variable mean_height. As we can see below, this results in a value identical to that returned by the mean function in R.

# initialize values
n = length (mens_height)
mean_height = 0

# summation equivalent to equation 2.3
for (i in 1:n) mean_height = mean_height + mens_height[i]/n

# replicates values of the mean function
mean (mens_height)
## [1] 173.8
mean_height
## [1] 173.8

Here are some useful things to know about sample means, in no particular order:

The mean of a set of observations is affected by addition and multiplication. Adding $a$ to each member of a set of observations increases its mean by $a$, and multiplying observations by $b$ results in a change in the mean by the same factor.
The mean of the sum of two sets of variables (of the same length) $x$ an $y$ is equal to $\mu_{x}+\mu_{y}$. In other words, the average of the sum is just the sum of the averages.
The sum of the sample’s deviations from the sample mean equals zero (seen in Equation 2.4). This means that the sum of the distances between positive and negative differences from the sample mean exactly balance out. To some extent this makes sense since the mean is the ‘center of gravity’ of a distribution. It’s worth noting that this does not apply to deviations from the population mean since the population mean is not estimated from the characteristics of a sample. So, there is no guarantee that some set of observations will ‘balance out’ around the population mean.

\[ \begin{equation} 0 = \sum_{i=1}^{n} y_{[i]} - \hat{\mu}_{y} \end{equation} \tag{2.4}\]

2.5.2 The sample variance (or standard deviation)

The formula to calculate the sample variance is seen in Equation 2.5. Note that it is quite similar to the structure of Equation 2.3 and clearly involves the averaging of a value. In fact, if we were to replace $(y_{[i]} - \mu_{y})^2$ with $y_{[i]}$, the two equations would be identical. The value being averaged consists of a difference ($y_{[i]} - \mu_{y}$) followed by a squaring operation. So, we see that what’s being averaged is squared deviations from the sample mean. This is what the variance is: The expected value of squared deviations around the mean of the variable.

\[ \begin{equation} \hat{\sigma}^2_{y} = \sum_{i=1}^n (y_{[i]} - \hat{\mu}_{y})^2 / n \end{equation} \tag{2.5}\]

Below we see that we can use a for loop to recreate equation Equation 2.5. However, we don’t manage to exactly recreate the output of the var (variance) function included in R.

# initialize values
n = length (mens_height)
variance_height = 0
mean_height = mean (mens_height)

# equivalent to equation 2.5 above
for (i in 1:n) variance_height = 
  variance_height + (mens_height[i]-mean_height)^2/n

# this time the values don't match
var (mens_height)
## [1] 60.27
variance_height
## [1] 60.18

This is because R does not use Equation 2.5 to estimate variances but rather Equation 2.6.
\[ \begin{equation} \hat{\sigma}^2_{y} = \sum_{i=1}^n (y_{[i]} - \hat{\mu}_{y})^2 / (n-1) \end{equation} \tag{2.6}\]

Above we said that the sample mean is our ‘best’ estimate of the population mean given a sample. A more formal way to state this is that the sample mean is the value which minimizes the sample variance. In other words, if we choose any value of $\mu_{y}$ to calculate the sample variance other than the sample mean, the variance will necessarily be larger. However, we know that our sample mean is just an estimate of the population mean and will never be exactly equal to it. As a result of this, the true variance must be greater than the sample variance when calculated using the sample mean.

We can put it like this: $\sum_{i=1}^n (y_{[i]} - \hat{\mu}_{y})^2 \leq \sum_{i=1}^n (y_{[i]} - \mu_{y})^2$, the sum of squares around the sample mean will always be less than or equal to the sum of squares around the population mean. For reasons that we won’t get into (but which aren’t too complicated), this expected difference may be offset by dividing the squared deviations by $(n-1)$ rather than $n$ as in Equation 2.6. We can update our R code to reflect this change, and see that this now matches the calculation of the variance carried out by R.

# initialize variable
variance_height = 0

# equivalent to 2.6 above
for (i in 1:n) variance_height = variance_height + (mens_height[i]-mean_height)^2/(n-1)

# this time the values do match
var (mens_height)
## [1] 60.27
variance_height
## [1] 60.27

The sample standard deviation ($\sigma$) is simply the square root of the sample variance, as in Equation 2.7.

\[ \begin{equation} \hat{\sigma}_{y} = \sqrt{\hat{\sigma}^2_{y}} = \sqrt{\sum_{i=1}^n (y_{[i]} - \hat{\mu}_{y})^2 / (n-1)} \end{equation} \tag{2.7}\]

Here are some useful things to know about variances, again in no particular order:

Variances are always positive, and can only be zero for variables that do not actually take on different values (i.e., constants).
The variance of a set of observations is not affected by addition. So, adding or subtracting some arbitrary value from a data set will not affect the variance of that data.
Multiplication does affect the variances of a set of values. Multiplying numbers by $x$ results in a change of the variances equal to $x^2$. So, if we took our heights and multiplied them by 10 to express them in millimeters, we would expect the value of $\sigma^2_{\mathrm{height}}$ to increase by a value of $10^2=100$. Since standard deviations are the square roots of variances, this implies that multiplying data by $x$ results in an increase in the standard deviation of the data by a factor of $x$.
The variance of the sum of variables depends on whether they are independent or not. When variables are independent, the sum of their variances is simply equal to $\sigma^2_{x}+\sigma^2_{y}$ for variables x and y. However, when variables are not independent, the variance can be much greater or smaller than this based on the nature of the relationship between the variables. As a result, just as with the calculation of joint probabilities, we need to take into account whether variables are independent or not when we consider the variance of their sum.

2.5.3 The normal density

The parameters of a probability distribution are used to draw its shape, which can be used to make inferences about likely values. Think back to high school math and the function defining the shape of a parabola $y=a(x-h)^2+k$. This function draws a shape based on the settings of its parameters $a$, $h$ and $k$. The $a$ parameter determines the width of the parabola (and whether it points up or down), while the vertex of the parabola will have x and y axis coordinates of $h$ and $k$ respectively. In the same way, the formula defining the density of the normal distribution draws a shape given the settings of its $\mu$ and $\sigma$ parameters.

The formula for the probability density function of the normal distribution is seen in Equation 2.8. The function returns a density value for the probability distribution based on the value of $x$, and the values of its parameters. The equation in Equation 2.8 features exponentiation ($\exp(x)$), that is, raising the base $e=2.718...$ (Euler’s number) to some power as in $e^x$. In Equation 2.8, the value being exponentiated is $-\frac{1}{2\sigma^2}(x-\mu)^{2}$.

\[ \begin{equation} f(x) = \frac{1}{\sigma\sqrt{2\pi}} \cdot \exp( -\frac{1}{2\sigma^2}(x-\mu)^{2}\,) \end{equation} \tag{2.8}\]

It’s much more common to denote exponentiation like $e^x$ rather than $\exp(x)$. However, for now we will use the latter notation for two reasons. First, it makes the exponent bigger and more isolated, which can both make equations easier to read. Second, this makes it look just like the R function exp(x), which makes it easier to remember what it is and what it does.

2.5.4 The standard normal distribution

The standard normal distribution is a normal distribution with a mean of zero and a standard deviation of one. Variables drawn from a standard normal distribution are often represented by the symbol $z$ (sometimes called a z score). Any normally distributed variable can be turned into a standard normal variable by an operation known as standardization, which consists of centering and then scaling the variable as in Equation 2.9. To center a variable we subtract the mean from the value of each observation, making the new mean equal to zero. By dividing our observations by the standard deviation, we scale these values so that the new standard deviation is equal to one (since anything divided by itself is equal to one).

\[ \begin{equation} z=(x-\mu) / \sigma \end{equation} \tag{2.9}\]

Figure 2.4 presents our height data again, but this time compares the data to its centered and standardized versions.

show plot code

par (mfrow = c(1,3), mar = c(4.1,4.1,1,1))

hist (mens_height, main="", col = lavender, xlim = c(130,205),
      freq = FALSE, ylim = c(0,.06), xlab="Apparent height (cm)",
      cex.lab=1.3,cex.axis=1.3,breaks=10)

hist (mens_height-mean(mens_height), freq = FALSE, main = "", 
      col = deepgreen, xlim = c(-40,30), ylim = c(0,0.057), 
      xlab="Centered apparent height (cm)",cex.lab=1.3,
      cex.axis=1.3,breaks=10)

hist (scale(mens_height), freq = FALSE, main = "", col = skyblue, 
      xlim = c(-5.2,4), ylim = c(0,0.45), cex.lab=1.3,cex.axis=1.3,
      xlab="Scaled apparent height (cm)", breaks=10)

Figure 2.4: (left) A histogram of apparent height judgments for adult male speakers. (middle) The same data from the left plot, this time the data has been centered around the mean so that the new mean is zero. (right) The same data from the middle plot, this time the data has been scaled according to the standard deviation so that the standard deviation is now one (and the mean is still zero).

Equation Equation 2.10 re-arranges the terms in Equation 2.9 to isolate $x$ on the left-hand side. From Equation 2.10 we can see that any normally-distributed variable can be thought of as a standard normal that has been multiplied by a standard deviation and then had a mean added to this product.

\[ \begin{equation} x=z \cdot \sigma+\mu \end{equation} \tag{2.10}\]

Normally-distributed data is often discussed in terms of ‘standard deviations from the mean’. This is because stating things in terms of standard deviations from the mean effectively standardizes a variable, making all variables seem standard normal. For example, if someone says “my test score was two standard deviations above the mean” what do you know about their score? You don’t know what the mean is, nor what the standard deviation is. All you know is that their score is two distance units (standard deviations) above the mean, so their test score can be thought of as a standard normal variable like $z$. This is despite the obvious fact that the true average test score was not zero and the true average standard deviation was not one. This is an extremely useful property because it means that we can discuss the probability of any given event from any given normal distribution in consistent terms. For example, we can say that an observation four standard deviations from the mean is very unusual in any normal distribution. This means that if you have an observation equal to 140, the mean is 174, and the standard deviation is 8, you know that this observation is very improbable. That is because it is 4.25 (34/8) standard deviations from the mean of 174.

2.6 Models and inference

Models are simplified representations that help us understand things. For example, we may want to understand the movement of balls on a billiards table, perhaps to create a video game about playing pool. To do this we may assume the balls are spherical and that their mass is evenly distributed about their volume, among other things. Neither of these things is exactly true but assuming this helps us keep our model simple and manageable. It also helps us build our model in terms of things, like regular spheres, whose properties and behaviors are well understood and easy to work with.

To build an exact model for our billiards game we would need to include friction from the felt on the table, the effect of wind resistance, the gravitational effect of the moon, and a large number of other factors. As a result, a perfect or exact model is not really possible for most things (and maybe for anything?). And yet, the simulation of realistic behavior of billiard balls is easy and can be done with great accuracy, suggesting that a simplified model can still be useful to understand the behavior of the more-complicated phenomenon it is meant to represent.

In general, it’s impossible to know what the ‘true’ data distribution is, so perfect inference is not possible. As a result, scientists often use theoretical probability distributions to make inferences about real-life populations and observations. If our measurements more or less follow the ‘shape’ predicted by the theoretical normal distribution, we may be able to use the characteristics of an appropriate normal distribution to make inferences about our variables. Using a normal distribution to make inferences about your data is like using a mathematical model for spheres to understand the behavior of billiard balls. The billiard balls are spherical enough to allow us to make useful predictions based on the simplified model.

It’s useful to keep in mind that reality will never exactly conform to our model. This can result in unpredictable errors in our conclusions. In general, the things you don’t know you don’t know are the things that will cause the most problems. If you had known that your model was wrong, you would have fixed it! Further, using models to make inferences about the general properties of data assumes that the things you have not seen are more or less just like those you have. Under those conditions then the conclusions you draw may be reliable. It’s important to keep this limitation in mind because you never know for sure that what you have not seen will conform to your model, and as a result, the fit between a model and some set of observations can never definitively prove the truth of the relations encoded in the model (this is related to the problem of induction, discussed in Section 1.2.1).

Parametric distributions can be used to establish theoretical probabilities, that is expectations about which events are and are not likely based on the general shape expected for the distribution. Basically, if we expect our distribution of values to have the shape of the normal distribution, we can use the shape of the normal distribution to make inferences about our distribution of values. When we used empirical probabilities above, our probabilities were estimated only with respect to the data we observed. However, when we refer to theoretical probability distributions we can also think about the behavior of values we did not observe or the behavior of the distribution in general.

In order to calculate theoretical probabilities you first need to commit to a model of the data. You may be thinking, what model? It may seem too simple to be a model, but by assuming that our data can be understood as coming from a normal distribution with some given $\mu$ and $\sigma$, we’ve already already created a simple model for our data. This is analogous to committing to a spherical shape for the model of our billiard balls: Saying that you expect your data to be normally distributed commits you to a certain distribution ‘shape’ and to more and less probable parameter values for your variable.

In Figure 2.5 we compare the histogram of apparent height judgments to the density of a normal distribution with a mean equal to the sample mean ($\mu$ = 173.8 cm) and a standard deviation equal to the sample standard deviation ($\sigma$ = 7.8 cm) of our mens_height vector. The density was drawn using the dnorm function, which draws a curve representing the shape of a theoretical normal distribution with a given mean and standard deviation. Clearly, there is a good alignment between our random sample of real-world data and the theoretical normal density. This suggests that we could potentially use the shape of the theoretical normal distribution to talk about the characteristics of our observed random sample of data. Although the distribution of our sample is unlikely to be perfectly normal, it is normal enough to make the comparison worthwhile.

show plot code


par (mfrow = c(1,1), mar = c(4,4,1,1))

hist (mens_height, freq=FALSE, main="", breaks=20, col='grey',
      xlim = c(140,200), xlab = "Apparent height (cm)")
abline (v = 162.1, lwd = 2, col = 2, lty=3)

# plots the normal density (red line) 
# using stats calculated form our sample. 
curve (dnorm (x, mean(mens_height), sd(mens_height)),
       from=100, to=300, lwd=2, col = 2, add=TRUE)

x = c(140,seq(140,162.1,length.out = 100),162.1)
y = c(0,dnorm(seq(140,162.1,length.out = 100), 
              mean (mens_height), sd (mens_height)),0)
polygon(x, y, col=rgb(1, 0, 0,0.3))
abline (v = 63.8, lwd = 2, col = 2, lty=3); 
abline (v = 70, lwd = 2,col=1,lty=3)
x = c(162.1,seq(162.1,200,length.out = 100),200)
y = c(0,dnorm(seq(162.1,200,length.out = 100), 
              mean (mens_height), sd (mens_height)),0)
polygon(x, y, col=rgb(0, 0, 1,0.3))
abline (v = 63.8, lwd = 2, col = 2, lty=3); 
abline (v = 70, lwd = 2,col=1,lty=3)

Figure 2.5: The histogram shows the empirical distribution of height judgments for adult male speakers in our data. The shaded area shows the theoretical density of the equivalent normal distribution. The red area corresponds to the theoretical probability of observing a height under 162 cm, based on the shape of the normal density.

The average female over 20 in the United states is 162.1 cm, according to the information in height_data. A vertical line has been placed at this value in Figure 2.5. We might wonder, what is the probability of observing a height judgment for an adult male that is shorter than this average adult female height? Asking this question is equivalent to asking: What is the area under the curve of the density above, to the left of the vertical line? Since the total area of the density is always equal to one, the area of the red portion below corresponds to a percentage/probability of observing values less than 162.1 cm. One way to answer this question is to calculate the empirical probability of observing an apparent height of less than 162.1 cm for male speakers in our data. Another way to do this is by calculating the theoretical probability by finding the proportion of values expected to be less than 162.1 cm in the normal distribution that has approximately the same ‘shape’ as our data distribution (i.e. the one seen in Figure 2.5).

Below, we use the function pnorm to find the proportion of values that are expected to be less than 162.1 cm. This function takes in a value, a mean, and a standard deviation. It then tells you the proportion of the distribution that is to the left of (i.e. less than) a given value. Below, we use parameters estimated from our sample to run the pnorm function, as these are our best guesses of the population parameters. The output of this function is equal to the area of the red section in the density in Figure 2.5. As we can see, the theoretical and empirical probabilities are very similar to one another. If we subtract this area from one, we get the area under the curve to the right of the vertical line, the blue section of the density above.

# empirical probability of height < 162.1
mean (mens_height < 162.1)
## [1] 0.07704

#  Red area of distribution, x < 162.1
pnorm (162.1, mean (mens_height), sd(mens_height))
## [1] 0.0661

#  Blue area of distribution, x > 162.1
1 - pnorm (162.1, mean (mens_height), sd(mens_height))
## [1] 0.9339

Imagine you had 1 pound of clay and you were asked to make a shape exactly like the normal density above. This shape should be perfectly flat and should have a constant depth (like a coin). If you had this shape made of clay and used a knife to remove the part to the left of 162.1 cm (the red subsection) and weighed it, it should weigh 6.6% of a pound (0.066 pounds). The ‘area under the curve’ of this clay sculpture would just correspond to the amount of clay in a certain area, and in this case, we know that only 6.6% of the clay should be in that section of the shape. The area under the curve, the probability, is just the amount of the stuff in the density that falls below/above a certain point, or between two points. The pnorm function allows you to slice and ‘weigh’ the sections of the distribution to tell you how much of it is in any given interval.

This is what our theoretical probabilities tell us: If height judgments come from a normal distribution, and that distribution has a mean and standard deviation that is close to the sample estimates, then we expect (in the long run) that 6.6% of height judgments will be lower than 162.1 cm. What we are expressing here is effectively a conditional probability, we’re saying if the parameters have certain values, and the probability distribution has a certain shape, then we expect certain height judgments to be more or less probable. Of course, if you change any part of that, either the values of the parameters or the probability distribution, then your estimated theoretical probabilities are likely to change. This is an important thing to keep in mind because it means that inference based on theoretical probabilities can change when our assumptions change, and many of these assumptions cannot be ‘proven’ true or false.

2.7 Probabilities of events and likelihoods of parameters

We’re going to switch from talking about probabilities to talking about likelihoods. When we talk about a probability, we are talking about the probability of observing some particular data/event/outcome, given some parameter(s). A likelihood inverts this, and places the focus on different parameter values, given some observed data. For example, you could say “how probable is it that a random man will sound shorter than 162.1 cm in height if the mean is 174 cm?”. When stated this way, we are discussing the probability of observing data (apparent height < 162.1) from some fixed distribution. So, probability puts the data in question and takes the distribution for granted.

In contrast, you could ask “how likely is it that the average man sounds 162.1 cm tall, given some observations?”. Now, we are discussing the likelihood of a particular mean parameter ($\mu$ = 162.1) given some fixed data. So, likelihood puts the parameters in question and takes the data for granted. The likelihood of a parameter represents the joint probability (density) of observing all the data you observed, given specific parameter values. In other words, the likelihood relates to the probability of your first observation, and your second observation, and your third observation, and so on, for all observations, given the parameter value(s) of interest.

The likelihood function is a curve showing the relative likelihoods of different parameter values, given a fixed set of observations/data. The likelihood function tells you what parameter values are credible given your data. If a value is very unlikely, that means that it is not supported by your data. In other words, unlikely parameter estimates represent conclusions that your data is rejecting as not viable, and hence they are not credible. Here’s a simple example of how you use likelihoods informally in your everyday life. Suppose your friend tells you they can hit about 90% of their three-point shots in basketball. You know this friend is prone to making spectacular claims, so you ask to watch them shoot. They heave up 100 shots, sinking 20/100, not too bad for an amateur, but not 90%. Do you believe that your friend sinks 90% of their shots overall? If you decide not to, it may be because their claim is extremely unlikely: A distribution which generates 90% successful shots is extremely unlikely to generate only 20 successes in 100 shots. In other words, given the data you have, a ‘real’ ability to score 90% of shots does not seem credible.

Every parameter for every probability distribution has a likelihood function, given some data. Here, we’re only going to discuss the likelihood of the normal mean parameter, $\mu$, in detail. The likelihood of the sample mean reflects the joint probability of observing all of your data, given different values of the mean, for a given standard deviation. An example of how this is calculated is given in Figure 2.6. The left plot shows the likelihood function for $\mu$ based on a random sample of ten height judgments from our data (indicated by the blue points at the bottom of the plot). We can see that the most likely values of $\mu$ are centered on the bulk of the observations, and that values become less likely as we deviate from them. The vertical dotted lines indicate three possible mean values that will be highlighted in this discussion.

show plot code

set.seed (4)
n = 10
height_sample = sample (mens_height,n)   ## tiny sub sample for example
par (mfrow = c(1,2), mar = c(4,4,2,1))
plot (height_sample,rep(0,n), ylim = c(0,.2), pch=16, col=4, 
      xlim = c(140,195), ylab='Density', cex.main=.8, cex.main=1,
      main = 'Likelihood of mean', xlab='Apparent height (cm)')

## here the likelihood sd is divided by the sample size
curve (dnorm (x, mean(height_sample), sd(height_sample) / sqrt (n)), 
       from = c(140,195), add=TRUE, col = 2, lwd = 2)
abline (v = c(mean (height_sample), 160, 170), lwd=2,lty=3)

plot (height_sample,rep(0,n), ylim = c(0,.06), pch=16,col=4,
      xlim = c(130,200),cex.main=.8, ylab='Density', cex.main=1,
      main = "Distribution if mean = 174 cm",xlab='Apparent height (cm)')
## now it is centered at mean = 175 cm
curve (dnorm (x, mean(height_sample), sd(height_sample)), 
       from = c(130,200), add=TRUE, col = 2, lwd = 2)
segments (height_sample,rep(0,n),height_sample, 
          dnorm (height_sample, mean(height_sample), sd(height_sample) ))

Figure 2.6: (left) The red curve indicates the likelihood of the population mean given the blue points in the figure. The vertical lines indicate three different parameter values that will be considered. (right) The red curve indicates the probability density given an assumed mean of 174 cm (the sample mean). Vertical lines highlight the density over each point.

The likelihood of a parameter value (e.g., $\mu$ = 174 cm in the right plot of Figure 2.6) is equal to the product of the density of each observation in the sample, given the value of the parameter. This sounds like a mouthful but is actually deceptively simple. For example, to calculate the likelihood that $\mu=174$, we:

Assume that the data is generated by a normal distribution with $\mu=174$ and $\sigma$ equal to your sample standard deviation (7.8 cm).
Find the height of the curve of the probability density over each point (indicated by the vertical lines in the right plot in Figure 2.6. This reflects the relative probability of each observation given your parameter value.
The joint probability of all of the observations (the likelihood) is the product of all of these densities (heights). This assumes that all of your observations are statistically independent of each other (see Section 2.3.2).

So, the value of the likelihood function in the left plot of Figure 2.6 at 174 cm is equal to the product of the densities over the points in the right plot (i.e., the heights of the lines in the plot). Imagine we did this for a range of values along the $x$ axis, sliding our probability distribution along the x axis and recording the likelihood values at each step. If we do this and then plot the product of the densities for each corresponding $x$ value the result would be a curve identical to that of the left plot in Figure 2.6.

Let’s discuss what the likelihood in Figure 2.6 means. When you conduct an experiment, you know you have your data, but you don’t know much else. Suppose all your data was the blue points in the figure. If someone said to you “hey do you think the mean is 140 cm?” you might respond “that is unlikely”. It is unlikely because the observations you have are extremely improbable given a probability distribution centered at 140 cm. Since your data are incompatible with a mean of 140 cm, and you know you have your data, a mean of 140 cm does not seem credible given your probability model. This information is reflected by the very low value of the likelihood function at 140 cm. In contrast, based on the high value of the likelihood function, we can see that a mean of 174 cm is likely. This is because the data you have is very probable given a mean of 174 cm, making this parameter value consistent with our data.

The right plot of Figure 2.6 shows the probability of points assuming that the population mean is equal to the sample mean (for our tiny sample). We can see in the left plot of Figure 2.6 that this is the most likely value for the mean parameter. When we said earlier that the sample mean is the ‘best’ estimate of the population mean, what we really meant was that the sample mean is the maximum-likelihood estimate of the population mean. This means that the sample mean provides an estimate of $\mu$ that maximizes the value of the likelihood function given the data. This is related to the fact that, as mentioned in Section 2.5.1, the sample mean minimizes the variance of the sample. As a result, if you want to know which mean estimate is most likely given your data, you simply need to calculate the sample mean as in Equation 2.3.

In the left plot of Figure 2.7 we see that a normal distribution with a $\mu$ of 170 cm is a reasonable fit to the data. However, several observations are very improbable and this relative lack of fit is reflected by the low value of the likelihood function at 170 in Figure 2.6. In the right plot of Figure 2.7, we see that a normal distribution with a mean of 160 is very unlikely to generate this data: Many points are extremely improbable and have densities close to zero. Correspondingly, the value of the likelihood corresponding to $\mu = 160$ in Figure 2.6 reflects a very unlikely parameter value.

show plot code

par (mfrow = c(1,2), mar = c(4,4,2,1))
plot (height_sample,rep(0,n), ylim = c(0,.06), pch=16,col=4,
      xlim = c(140,195),cex.main=.8, ylab='Density', cex.main=1,
      main = "Distribution if mean = 170 cm",xlab='Apparent height (cm)')
## now it is centered at mean = 175 cm
curve (dnorm (x, 170, sd(height_sample)), from = c(130,200), 
       add=TRUE, col = 2, lwd = 2)
segments (height_sample,rep(0,n),height_sample,
          dnorm (height_sample, 170, sd(height_sample) ))
#abline (v = 170, lwd=2,lty=3)

plot (height_sample,rep(0,n), ylim = c(0,.06), pch=16,col=4,
      xlim = c(140,195),cex.main=.8, ylab='Density', cex.main=1,
      main = "Distribution if mean = 160 cm",xlab='Apparent height (cm)')
## now it is centered at mean = 175 cm
curve (dnorm (x, 160, sd(height_sample)), from = c(130,200), 
       add=TRUE, col = 2, lwd = 2)
segments (height_sample,rep(0,n),height_sample, 
          dnorm (height_sample, 160, sd(height_sample) ))
#abline (v = 160, lwd=2,lty=3)

Figure 2.7: (left) The red curve indicates the probability density given an assumed mean of 170 cm (the sample mean). Vertical lines highlight the density over each point. (right) The same information as in the left plot, but given a mean of 160 cm.

We want to talk about why it makes sense to multiply densities to calculate joint probabilities. Above, we stated that probabilities relate to the area under the curve of the density function. A problem we have is that the area under the curve of a single point is always zero. This is because a single point is so thin (it’s width is zero) that the area under the curve under the point will equal zero regardless of its density. Suppose that we said “ok let’s agree to use a fixed width, $a$, around our point to calculate our area under the curve”. We can make $a$ so teeny tiny that it is almost as if we were calculating just the area under our single point. We can use $f(x)$ to represent the height of a density for a given value of $x$. This means that we could approximate the value of the area under the curve at $x$ by treating it like a trapezoid and finding $f(x)\cdot a$. If we wanted to calculate the joint probability of a number of observations, we could multiply a series of these areas, as seen in Equation 2.11.

\[ \begin{equation} P(x_1\,\&\,x_2\,\&\, \dots \& \, x_n)=[f(x_1) \cdot a] \cdot [f(x_2) \cdot a] \cdot \dots \cdot [f(x_n) \cdot a] \end{equation} \tag{2.11}\]

Since each term in Equation 2.11 contains $a$, this can be factored out as in Equation 2.12.

\[ \begin{equation} P(x_1\,\&\,x_2\,\&\, \dots \& \, x_n)=[f(x_1) \cdot f(x_2) \cdot \dots \cdot f(x_n)] \cdot a^n \end{equation} \tag{2.12}\]

At this point, we see that the joint probability of a set of values ($x_n$) can be thought of as the product of the densities over those values ($f(x_n)$) times some arbitrary constant $a^n$. If we agree to use the same constant for all our calculations, we can ignore it and use the product of the densities to evaluate the relative probabilities of different combinations of observations.

2.7.1 Characteristics of likelihoods

It’s fairly important to understand at least the gist of the previous section: Likelihoods tell you which parameter values are ‘likely’ (i.e. credible) given the data you have. It’s less important that you understand the information from here to Section 2.8. We’re presenting it here to take the mystery out of it, and because in order to understand it you will probably need to read about it several times so you might as well get started now. If this is the first time you read about these topics, you should accept the fact that you may not understand much of this, however, given time and effort you will eventually get used to it.

We can think about the characteristics of the likelihood of $\mu$ to make some predictions about its expected shape given different numbers of observations and underlying data distributions. The likelihood is the joint probability of your data given some parameter values. We can calculate the joint probability of two observations $x_1$ and $x_2$ (using a comma instead of ‘&’ to indicate ‘and’) by multiplying their individual densities as in Equation 2.13, assuming these observations are independent.

\[ \begin{equation} f(x_1,x_2) = [\frac{1}{\sigma\sqrt{2\pi}} \exp( -\frac{1}{2\sigma^2}(x_1-\mu)^{\!2}\,)] \cdot [\frac{1}{\sigma\sqrt{2\pi}} \exp( -\frac{1}{2\sigma^2}(x_2-\mu)^{\!2}\,)] \end{equation} \tag{2.13}\]

Equation Equation 2.13 defines the likelihood of $\mu$ given the data $x_1$ and $x_2$ and some $\sigma$. We will update the left-hand side to reflect this, replacing a term representing the joint probability of $x_1$ and $x_2$ ($f(x_1,x_2)$) with a term representing the likelihood of the mean given the data, $\mathcal{L}_{(\mu|x)}$. Notice that nothing has changed except our perspective. In equation Equation 2.13 we treat the parameter $\mu$ as fixed and use it to calculate the joint probability of some data. In equation Equation 2.14 we treat the data $x$ as fixed and use it to calculate the likelihood of values of $\mu$.

\[ \begin{equation} \mathcal{L}_{(\mu|x)} = [\frac{1}{\sigma\sqrt{2\pi}} \exp( -\frac{1}{2\sigma^2}(x_1-\mu)^{\!2}\,)] \cdot [\frac{1}{\sigma\sqrt{2\pi}} \exp( -\frac{1}{2\sigma^2}(x_2-\mu)^{\!2}\,)] \end{equation} \tag{2.14}\]

2.7.2 A brief aside on logarithms

The logarithm (log) is the inverse function to exponentiation, it basically erases or undoes exponentiation. We can apply a logarithmic transformation to both sides of the density in equation Equation 2.8, resulting in the log density of the normal distribution seen in equation Equation 2.15.

\[ \begin{equation} \log(f(x)) = \log(\frac{1}{\sigma\sqrt{2\pi}})-\frac{1}{2\sigma^2}\left(x-\mu\right)^2 \end{equation} \tag{2.15}\]

Understanding logarithms and log densities is important because, in practice, much of statistics is done on the logarithms of probabilities, likelihoods, and densities. In addition, it is extremely common to log-transform probability density functions or other equations before manipulating them, making a good understanding of the properties of logarithms extremely useful. As a result, as we will see below, using logarithms can help us simplify and understand at least some probability distributions.

Before explaining Equation 2.15, we will discuss some basic properties of logarithms that are useful to understand probabilities and probability distributions, as these often involve exponentiation and logarithms. The first line in Equation 2.16, shows the basic behavior of logarithms. The next four lines pertain to the values expected, or undefined, for different $x$. The next two lines highlight the fact that the exponentiation of numbers is equivalent to the multiplication of their logarithms. The final two lines highlight the fact that the multiplication of two numbers is equivalent to the addition of the logarithms of the numbers.

\[ \begin{equation} \begin{split} \log(e^x)=x \\ \\ \log(1) = 0 \\ \textrm{if} \: x < 1, \: \log(x) < 0 \\ \textrm{if} \: x > 1, \: \log(x) > 0 \\ \textrm{if} \: x < 0, \: \log(x) = \textrm{undefined} \\ \\ \log(x^y)=\log(x) \cdot y \\ \log(\sqrt[y]{x})=\log(x)/y \\ \\ \log(x)+\log(y)= log(x \cdot y) \\ \log(x)-\log(y)= log(x/y) \\ \end{split} \end{equation} \tag{2.16}\]

Armed with knowledge of the behavior of logarithms, we can see that compared to equation Equation 2.8, equation Equation 2.15 reflects the following changes associated with taking the logarithm of both sides of the equation. First, we remove the $\exp$ function around the rightmost term $-\frac{1}{2\sigma^2}\left(x-\mu\right)^2$, and added the $\log$ function around all terms that were not previously exponentiated. Then, multiplication of the two terms in the right hand side of the equation turns into addition, or subtraction in this case because we are adding a negative term.

The properties of logarithms presented above can be used to make the simplifications presented in Equation 2.17. When presented in its final form, this function is that of a parabola in vertex form, $y=a(x-h)^2+k$, where $a=-1/2\sigma^2$, $h=\mu$, and $h=-\log({\sigma\sqrt{2\pi}})$. So, we can see that the density of the normal distribution is just an exponentiated parabola that is scaled by $1 / \sigma\sqrt{2\pi}$ so that the area under the curve is equal to one.

\[ \begin{equation} \begin{split} \log(f(x)) = \log(\frac{1}{\sigma\sqrt{2\pi}})-\frac{1}{2\sigma^2}\left(x-\mu\right)^2 \\ \log(f(x)) = \log(1)-\log(\sigma\sqrt{2\pi})-\frac{1}{2\sigma^2}\left(x-\mu\right)^2 \\ \log(f(x)) = 0-\log(\sigma\sqrt{2\pi})-\frac{1}{2\sigma^2}\left(x-\mu\right)^2 \\ \log(f(x)) = -\frac{1}{2\sigma^2}\left(x-\mu\right)^2 - \log({\sigma\sqrt{2\pi}}) \end{split} \end{equation} \tag{2.17}\]

The parabola defined in Equation 2.17 has its vertex at $(\mu, -\log({\sigma\sqrt{2\pi}})$, and opens downwards since the $a$ term is negative. When a parabola is in vertex form, the relationship between parabola width and the value of $a$ is inverted, its width decreases as the value of $a$ increases. However, because $a=-\frac{1}{2\sigma^2}$, the relationship is doubly inverted. As a result, in the case of normal distributions the parabola width increases as $\sigma$ grows larger, leading to wider parabolas and wider probability densities.

Equation Equation 2.17 shows how the $\mu$ and $\sigma$ parameters work to make observations further from the mean less probable. First, we know that negative logarithmic values will fall between 0 and 1, with more negative values being closer to zero (i.e. less probable). As observations ($x$) are further from the mean, the value of $(x-\mu)^2$ will be greater so that values further from the mean will be generally less probable. However, whether a deviation is considered “big” or “small” is relative, and so this distance is scaled with respect to the average expected squared deviation from the mean (i.e. the variance $\sigma^2$). Variation of 1 cm in body length means different things for an earthworm as opposed to an anaconda. As a result, large values of $(x-\mu)^2$ can be offset by large values of $\sigma^2$ when determining the probability of an outcome.

2.7.3 Characteristics of likelihoods, continued

We will pick up where we left off, considering the likelihood of $\mu$ given some data. This was presented in equation Equation 2.14 and is presented again as Equation 2.18.

We can log-transform both sides of equation Equation 2.18 so that each bracketed element in the right hand side of Equation 2.18 is equal to the final version of equation Equation 2.17. This is presented in equation Equation 2.19.

\[ \begin{equation} \mathcal{L}\mathcal{L}_{(\mu|x)} = [-\frac{1}{2\sigma^2}(x_1-\mu)^2 - \log({\sigma\sqrt{2\pi}})] + [-\frac{1}{2\sigma^2}(x_2-\mu)^2 - \log({\sigma\sqrt{2\pi}})] \end{equation} \tag{2.19}\]

Because they make working with likelihoods much simpler, statisticians often use the logarithms of likelihood functions, referred to as log-likelihoods. Log-likelihoods are denoted using the symbol $\mathcal{L}\mathcal{L}_{(\mu|x)}$. When you see this, $\mathcal{L}\mathcal{L}_{(\mu|x)}$, just think “the logarithm of the likelihood of the mean given the data $x$”. In equation Equation 2.19 we see that the log-likelihood of these two observations is the sum of two parabolas. Since the sum of parabolas will (almost always) also be a parabola, we know that the log-likelihood of the mean given these, and any number of other observations, will also be a parabola. As a result, we see that the (non-log) likelihood of the mean is an exponentiated parabola and has the same general shape as the normal distribution.

Equation Equation 2.19 is specifically for two variables, however, because each term is identical except for $x_n-\mu$, the log-likelihood function can be greatly simplified as in equation Equation 2.20. In Equation 2.20 we can see that the log-likelihood of $\mu$ given $n$ observations of $x$ is equal to the sum of the individual squared deviations from the mean, divided by $-1 / 2\sigma^2$, with the value $n \cdot (\log({\sigma\sqrt{2\pi}}))$ subtracted from it. So, we see that the log-likelihood is a parabola that has its vertex at $(\hat{\mu}, - n \cdot (\log({\sigma\sqrt{2\pi}}))$, meaning that this parabola has its maximum value when $\mu$ is equal to the sample mean. This again reflects the fact that the sample mean is the maximum-likelihood estimate for the population mean.

\[ \begin{equation} \mathcal{L}\mathcal{L}_{(\mu|x)} = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2 - n \cdot (\log({\sigma\sqrt{2\pi}})) \end{equation} \tag{2.20}\]

There are two mechanisms by which the likelihood function may get narrower or wider. The first was discussed in Section 2.5.3, where a smaller value of $\sigma$ results in a narrower parabola and a narrower likelihood. Independently of the underlying variation in the data, likelihood functions can also become narrower when the sample size is increased. To see why this is the case, consider what happens to the relative value of our $a$ parameter ($-1 / 2\sigma^2$) as the sample size grows. Imagine that we are calculating the sum of squared deviations about our sample mean. We know that the average squared variation we expect from our mean is equal to the variance. So, let’s replace $(x_i-\mu)^2$ in Equation 2.20 with $\sigma^2$ in Equation 2.21. Now, instead of adding n squared deviations around the mean, $\sum_{i=1}^{n}(x_i-\mu)^2$, we are multiplying our expected squared deviation by n, $n \cdot \sigma^2$.

\[ \begin{equation} \mathcal{L}\mathcal{L}_{(\mu|x)} = -\frac{1}{2\sigma^2} \cdot (n \cdot \sigma^2) - n \cdot (\log({\sigma\sqrt{2\pi}})) \end{equation} \tag{2.21}\]

We can also move the $\sigma^2$ over from under the $-1/2$ to make the following point simpler, as in Equation 2.22.

\[ \begin{equation} \mathcal{L}\mathcal{L}_{(\mu|x)} = -\frac{1}{2} \cdot \frac{n \cdot \sigma^2}{\sigma^2} - n \cdot (\log({\sigma\sqrt{2\pi}})) \end{equation} \tag{2.22}\]

Multiplying the numerator of a fraction is exactly equivalent to dividing the denominator of the fraction by the same amount, e.g. $(x \cdot y)/z = x/(z/y)$. Thus, we see that calculating the likelihood using $n$ data points is expected to have about the same effect on parabola width as dividing the data variance by $n$, as shown in Equation 2.23. As a result, increasing numbers of observations reduces the uncertainty in parameter estimates by making the likelihood narrower and narrower for any given underlying $\sigma$.

\[ \begin{equation} \mathcal{L}\mathcal{L}_{(\mu|x)} = -\frac{1}{2} \cdot \frac{\sigma^2}{(\sigma^2/n)} - n \cdot (\log({\sigma\sqrt{2\pi}})) \end{equation} \tag{2.23}\]

2.8 Answering our research questions

In Section 2.6, we discussed using the normal distribution to make inferences about the probable values of a random variable. When variables are normally distributed we can use the theoretical normal distribution and functions such as pnorm to answer questions about values we expect, and don’t expect, to see. In the same way, we can use likelihood functions to understand probable, and improbable, values of parameters given our data.

For example, suppose that you measured the heights of 100 women in a small town (pop. 1,500) and found the average height was 160 cm, with a standard deviation of 6 cm. You might accept that the actual population average is 161 cm, but may find it difficult to accept that it was actually 180 cm. This is because a true mean of 180 cm is unlikely given your observed data: The observations you have are improbable given a true mean of 180 cm. The logic is quite simple: A true mean of 180 cm is unlikely to result in a random selection of so many women around 160 cm. You know you observed the short women, therefore, you have no reason to believe that the true mean is 180 cm.

Below we calculate the likelihood of different mean parameters given our data. We do this by finding the log density of each observation and then adding the points together. Log densities are used because the likelihood is often a number so small that computers have a hard time representing them otherwise.

# make candidates for mean parameter
mus = seq (172.5,175, .01)

# easy way to make zero vector of same length as above
log_likelihood = mus*0

# add the log-density of all observations. Notice only the 
# mean changes across iterations of the for loop.
for (i in 1:length(mus)) log_likelihood[i] = 
  sum (dnorm (mens_height, mus[i], sd(mens_height), log = TRUE))

For example, the highest point of the probability densities in Figure 2.7 is about 0.07. Let’s pretend it’s 0.1 for the sake of simplicity. Recall that to find the likelihood we need to multiply the densities above each of the points. This means that the likelihood of two observations at the mean is $0.1 \cdot 0.1=0.001$, and the likelihood of observing $n$ observations at the mean is equal to $0.1^n$. Since we have 675 observations in our mens_height vector, the likelihood of observing every one of those at the mean, the most probable outcome, would be equal to $0.1^{675}$, or a decimal point followed by 674 zeros and then a one. By relying on the logarithms of densities instead, we can accurately represent very small numbers more comfortably. This is because adding together the logarithms of two numbers is equivalent to multiplying those numbers, and multiplying a logarithm by another number is equivalent to raising it to that power (see Equation 2.7). For example,the number $0.1^{675}$ can also be expressed like $\log(0.1) \cdot 675$, which equals −1554.245 (i.e. $0.1^{675}=e^{-1554.245}$). This is obviously a much easier number to deal with than one with 675 decimal places.

In Figure 2.8, we plot the log-likelihood calculated above, and the ‘scaled likelihood’. This is just the likelihood that has been scaled so that its peak is equal to one, and the peak of the log-likelihood equals zero. This allows us to actually plot and consider the likelihood function, though the values of the density no longer reflect the actual values of the likelihood. However, the relational characteristics are maintained by this scaling. So, a scaled likelihood value of 0.2 is still five times less likely than a value of one.

show plot code

mus = seq (172.5,175, .01)

log_likelihood = mus*0
for (i in 1:length(mus)) log_likelihood[i] = 
  sum (dnorm (mens_height, mus[i], sd(mens_height), log = TRUE))


par (mar = c(4.1,4.1,1,1), mfrow = c(1,2))
plot (mus, log_likelihood, col = skyblue, lwd=2, type = 'l', 
      ylab = "Log-Likelihood", xlab = "Apparent height (cm)")

scaled_log_likelihood = log_likelihood-max(log_likelihood)
scaled_log_likelihood = exp (scaled_log_likelihood)

plot (mus, scaled_log_likelihood, col = skyblue, lwd=2, type = 'l', 
      ylab = "Scaled Likelihood", xlab = "Apparent height (cm)")

Figure 2.8: (left) The curve represents the log-likelihood for different values of our mean given our apparent height judgments and assuming a normal distribution. (right) The likelihood implied by the log-likelihood on the left. The curve has been scaled to have a peak of 1 so that we can plot it. For example, exp (-2300) is a value so small that it is difficult to represent in a figure axis.

At this point we can provide answers to the questions posed in Section 2.2. The questions were:

(Q1) How tall does the average adult male sound?

(Q2) Can we set limits on credible average apparent heights based on the data we collected?

Below, we see that the maximum likelihood estimate, the value of the parameter which maximizes the likelihood for the mean, corresponds to the sample mean:

# find index number of highest values in log-likelihood
maximum = which.max(scaled_log_likelihood)

# print and compare to sample mean
mus[maximum]
## [1] 173.8
mean (mens_height)
## [1] 173.8

So, we may conclude that the average male speaker is most likely to sound about 174 cm tall. We can also conclude informally based on Figure 2.8 that the most likely mean values fall between (approximately) 173 and 174.5 cm. This means that although the sample mean was 174 cm, it is reasonable that the true population mean might actually be 173.5 cm. This is because mean parameters in this range are also reasonably likely given our data. Basically, maybe our sample mean is wrong and 173.5 cm is the true population $\mu$. This outcome is compatible with our data. However, a value of 172 cm is very unlikely given our data. Since we think that 172 cm is not a plausible mean parameter given our sample, we can rule it out as a credible value for the $\mu$ parameter of our variable. Using this approach, we can use the information in likelihood functions to rule out implausible values of $\mu$ based on the characteristics of our data.

2.9 Exercises

Use the techniques outlined in this chapter to start thinking about credible estimates, and bounds, of the mean apparent height for other groups of speakers. This can be done for the original or modified voice resonance conditions.

2.10 References

Ross, S. M. (2019). A first course in probability. Boston: Pearson.

Parzen, E. (1960). Modern probability theory and its applications. Wiley.

Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge university press.