Chapter 45 Bayesian inference
Bayesian data analysis is the use of Bayesian inference to learn from data.
Bayesian inference is a method for reasoning about unobservable quantities given known facts; it uses probability to describe the uncertainty about what the values of the unknown quantities could be.
In practice, Bayesian inference means conditioning on data in order to learn about parameter values.
More formally, Bayesian inference is a method of statistical inference in which Bayes’ theorem is used to update the probability of a hypothesis as more evidence or information becomes available.
Bayes’ theorem
\[P(\theta | D) = \frac{P(D | \theta) \cdot P(\theta)}{\sum_{\theta} P(D | \theta) \cdot P(\theta)}\]
\(\theta\) - parameter values
\(D\) - new data
\(P(D | \theta)\) - likelihood: the (relative) probability of the data given different parameter values
\(P(\theta)\) - prior
\(\sum_{\theta} P(D | \theta) \cdot P(\theta)\) - the total sum of the likelihood weighted by the prior (the normalizing constant)
- A prior is a probability distribution that represents what the model knows before seeing the data.
- A posterior is a probability distribution that represents what the model knows after having seen the data.
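As a minimal worked example of the formula on a discrete grid (the numbers are made up purely for illustration), suppose a coin’s probability of heads \(\theta\) can only be 0.2, 0.5, or 0.8, the prior over these values is uniform, and we observe 3 heads in 4 flips:
# Hypothetical worked example of Bayes' theorem on a discrete grid
theta <- c(0.2, 0.5, 0.8)                        # candidate parameter values
prior <- rep(1 / 3, 3)                           # uniform prior over theta
likelihood <- dbinom(3, size = 4, prob = theta)  # P(D | theta): 3 heads in 4 flips
posterior <- likelihood * prior / sum(likelihood * prior)  # Bayes' theorem
round(posterior, 3)
## [1] 0.037 0.365 0.598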
Bayesian inference components:
- Data
- Bayesian Model:
- Generative model (dbinom, dnorm, dpois, etc.)
- Priors
- Computational methods:
- Rejection sampling
- Grid approximation
- Markov Chain Monte Carlo (MCMC)
# Visualization function of Bayesian inference for binary data
library(ggjoy)
library(ggridges)
library(ggExtra)
library(tidyverse)
# prop_model function to visualize bayesian posterior distributions
prop_model <- function(data = c(), prior_prop = c(1, 1), n_draws = 10000, show_plot = TRUE) {
  data <- as.logical(data)
  proportion_success <- c(0, seq(0, 1, length.out = 100), 1)
  data_indices <- round(seq(0, length(data), length.out = min(length(data) + 1, 20)))
  post_curves <- map_dfr(data_indices, function(i) {
    value <- ifelse(i == 0, "Prior", ifelse(data[i], "Success", "Failure"))
    label <- paste0("n=", i)
    probability <- dbeta(proportion_success, prior_prop[1] + sum(data[seq_len(i)]),
                         prior_prop[2] + sum(!data[seq_len(i)]))
    probability <- probability / max(probability)
    tibble(value, label, proportion_success, probability)
  })
  post_curves$label <- fct_rev(factor(post_curves$label, levels = paste0("n=", data_indices)))
  post_curves$value <- factor(post_curves$value, levels = c("Prior", "Success", "Failure"))
  p <- ggplot(post_curves, aes(x = proportion_success, y = label,
                               height = probability, fill = value)) +
    geom_joy(stat = "identity", color = "white", alpha = 1, panel_scaling = TRUE, size = 1) +
    scale_y_discrete("", expand = c(0.01, 0)) +
    scale_x_continuous("Underlying proportion of success") +
    scale_fill_manual(values = hcl(120 * 2:0 + 15, 100, 65),
                      name = "", drop = FALSE,
                      labels = c("Prior ", "Success ", "Failure ")) +
    theme_light(base_size = 18) +
    theme(legend.position = "top")
  if (show_plot) {
    print(p)
  }
  # Invisibly return draws from the final posterior Beta distribution
  invisible(rbeta(n_draws, prior_prop[1] + sum(data), prior_prop[2] + sum(!data)))
}
45.1 Simple model with one binary parameter
# Generate 10 random binary data points with P(1) = 0.75
data <- sample(c(1, 0), prob = c(0.75, 0.25), size = 10, replace = TRUE)
# Visualize posteriors
posterior <- prop_model(data)
# View posterior
head(posterior)
## [1] 0.6885118 0.9954317 0.9157077 0.8484793 0.8414593 0.8479106
# Center of the posterior
median(posterior)
## [1] 0.8518252
# The credible interval (CI)
quantile(posterior, c(0.05, 0.95))
## 5% 95%
## 0.6327662 0.9674745
# Probability of successes > 0.3
sum(posterior > 0.3) / length(posterior)
## [1] 1
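With the default Beta(1, 1) prior, the draws returned by prop_model follow the analytic Beta posterior, so we can cross-check the credible interval above (a quick sketch, using the data vector simulated above):
# Analytic posterior under a Beta(1, 1) prior: Beta(1 + successes, 1 + failures)
successes <- sum(data)
failures <- sum(data == 0)
# Analytic 90% interval; should be close to quantile(posterior, c(0.05, 0.95))
qbeta(c(0.05, 0.95), 1 + successes, 1 + failures)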
Generate binomial data
p <- 0.42   # probability of success
n <- 100    # number of observations

# Create logical vector, TRUE if a random value from the uniform distribution is < p
data <- c()
for (i in 1:n) {
  data[i] <- runif(1, min = 0, max = 1) < p
}
# convert logical to numeric 0/1
data <- as.numeric(data)
data
## [1] 0 1 0 1 0 0 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 0 1 1
## [45] 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0
## [89] 0 0 0 0 0 0 1 0 0 0 0 1
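The same simulation can be written without the explicit loop; a minimal vectorized sketch:
# Vectorized equivalent: draw n uniform values at once and compare with p
data <- as.numeric(runif(n, min = 0, max = 1) < p)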
# the same vector using the rbinom distribution
n <- 100      # number of observations
size <- 1     # number of trials per observation: 0 - fail, 1 - success
p <- 0.42     # probability of success
rbinom(n, size, p)
## [1] 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0
## [45] 1 1 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 1 1 1 0 0 0 1 1 0 0 0 1 1 1 1 1 1 0 1 1 1 0 1
## [89] 1 1 0 0 1 0 0 0 0 1 1 0
45.2 Rejection sampling
To get more visitors to your website you are considering paying for an ad to be
shown 100 times on a popular social media site.
According to the social media site, their ads get clicked on 10% of the time.
How many visitors could your site get?
We could fix the click rate at 0.1 (the 10% claimed by the site), but we do not know the true rate exactly; it is uncertain. We therefore assume the proportion of clicks is uniformly distributed between 0 and 0.2.
Define model:
\(n_{ads} = 100\)
\(p_{clicks} \sim \text{Uniform}(0.0, 0.2)\)
\(n_{visitors} \sim \text{Binomial}(n_{ads}, p_{clicks})\)
n <- 100000          # number of simulations
n_ads_shown <- 100   # number of shown ads

# probability that an ad will be clicked (assume uniform on [0, 0.2])
p_clicks <- runif(n, min = 0.0, max = 0.2)

# MODEL: We assume that the binomial distribution is a reasonable generative model
n_visitors <- rbinom(n = n, size = n_ads_shown, prob = p_clicks)
# Visualize the prior model simulation
library(ggplot2)
df <- data.frame(n_visitors, p_clicks)
p <- ggplot(df, aes(n_visitors, p_clicks)) + geom_point() + theme_classic()
ggExtra::ggMarginal(p, type = "histogram", bins = 18)
Generate posterior distribution
After the first day of showing ads, we registered that 13 people clicked and visited the site when the ad was shown 100 times.
# Prior distribution: binomial model with uniform prior on clicks
prior <- data.frame(p_clicks, n_visitors)

# To get the posterior, subset the prior to simulations with exactly 13 visitors
posterior <- prior[prior$n_visitors == 13, ]
head(posterior)
## p_clicks n_visitors
## 39 0.13318400 13
## 57 0.15608873 13
## 64 0.09112852 13
## 75 0.11852016 13
## 76 0.12653570 13
## 106 0.10590670 13
# visualize posterior
p <- ggplot(posterior, aes(n_visitors, p_clicks)) + geom_point() + theme_classic()
ggExtra::ggMarginal(p, type = "histogram", bins = 18)
For the next iteration we use the posterior as the new prior and simulate new data from it; new observations could then be used to subset again and obtain an updated posterior.
# Assign posterior to prior for the next iteration
prior <- posterior

# next iteration: simulate new data from the prior
n <- nrow(prior)
n_ads_shown <- 100
prior$n_visitors <- rbinom(n, size = n_ads_shown, prob = prior$p_clicks)

# plot of prior distributions
p <- ggplot(prior, aes(n_visitors, p_clicks)) + geom_point() + theme_classic()
ggExtra::ggMarginal(p, type = "histogram", bins = 18)
Use the result of the model for prediction:
# Calculate the probability that you will get 5 or more visitors.
sum(prior$n_visitors >= 5) / length(prior$n_visitors)
## [1] 0.9889606
# Median number of visitors
median(prior$n_visitors)
## [1] 13
Updated model. We will change the prior distribution based on domain knowledge: the proportion of clicks is typically around 5%, and sometimes as low as 2% or as high as 8%. We will use a beta distribution for the proportion of clicks to encode this prior in our model.
n <- 100000
n_ads_shown <- 100

# Change the prior on the proportion of clicks
p_clicks <- rbeta(n, shape1 = 5, shape2 = 95)

# Updated model
n_visitors <- rbinom(n = n, size = n_ads_shown, prob = p_clicks)

prior <- data.frame(p_clicks, n_visitors)
posterior <- prior[prior$n_visitors == 13, ]

# Plot the prior and the posterior
par(mfcol = c(1, 2))
hist(prior$p_clicks, xlim = c(0, 0.25))
hist(posterior$p_clicks, xlim = c(0, 0.25))
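As a quick check that the Beta(5, 95) prior really encodes this knowledge, we can look at its mean and a central 95% interval (a sketch, with no assumptions beyond the prior itself):
# Prior mean of a Beta(shape1, shape2) distribution is shape1 / (shape1 + shape2)
5 / (5 + 95)
# Central 95% prior interval for the proportion of clicks
qbeta(c(0.025, 0.975), shape1 = 5, shape2 = 95)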
We tested a video banner and now we would like to compare the posterior distributions for:
- video ads (13 out of 100 clicked)
- text ads (6 out of 100 clicked)
n <- 100000
n_ads_shown <- 100
p_clicks <- runif(n, min = 0.0, max = 0.2)
n_visitors <- rbinom(n, size = n_ads_shown, prob = p_clicks)
prior <- data.frame(p_clicks, n_visitors)

# Create the posteriors for video and text ads
posterior_video <- prior[prior$n_visitors == 13, ]
posterior_text <- prior[prior$n_visitors == 6, ]

par(mfrow = c(1, 2))
# Visualize the posteriors
hist(posterior_video$p_clicks, xlim = c(0, 0.25))
hist(posterior_text$p_clicks, xlim = c(0, 0.25))
# combine data into a data frame
# Make the sample sizes the same (4000) so they fit into one data frame
posterior <- data.frame(video_prop = posterior_video$p_clicks[1:4000],
                        text_prop = posterior_text$p_clicks[1:4000])

# Calculate the posterior difference: video_prop - text_prop
posterior$prop_diff <- posterior$video_prop - posterior$text_prop

# Calculate the median of prop_diff
median(posterior$prop_diff)
## [1] 0.06535207
# Calculate the proportion of simulations where the video ad outperforms the text ad
mean(posterior$prop_diff > 0.0)
## [1] 0.94825
# Visualize prop_diff
hist(posterior$prop_diff)
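We can also express the uncertainty in the difference as a credible interval:
# 95% credible interval for the difference in click proportions
quantile(posterior$prop_diff, c(0.025, 0.975))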
Decision analysis
Our analysis indicated that the video ads are clicked more often.
We would like to estimate the probable profit of video ads vs text ads, given that:
- Each visitor spends $2.53 on average on our website.
- Video ad costs us $0.25 per click.
- Text ad costs us $0.05 per click.
visitor_spend <- 2.53
video_cost <- 0.25
text_cost <- 0.05

# Add the column posterior$video_profit
posterior$video_profit <- posterior$video_prop * visitor_spend - video_cost

# Add the column posterior$text_profit
posterior$text_profit <- posterior$text_prop * visitor_spend - text_cost
head(posterior)
## video_prop text_prop prop_diff video_profit text_profit
## 1 0.1591797 0.04933518 0.10984449 0.152724569 0.07481801
## 2 0.1639834 0.13350068 0.03048271 0.164877969 0.28775671
## 3 0.1342420 0.04359866 0.09064333 0.089632257 0.06030462
## 4 0.1007833 0.03186180 0.06892146 0.004981632 0.03061034
## 5 0.1434389 0.12331007 0.02012884 0.112900432 0.26197447
## 6 0.1953154 0.14542950 0.04988593 0.244148032 0.31793663
# Visualize the video_profit and text_profit columns
par(mfrow=c(1,2))
hist(posterior$video_profit)
hist(posterior$text_profit)
# Difference between video and text ad profits
posterior$profit_diff <- posterior$video_profit - posterior$text_profit
# Visualize posterior$profit_diff
hist(posterior$profit_diff)
# Calculate a "best guess" for the difference in profits
median(posterior$profit_diff)
## [1] -0.03465928
# Calculate the probability that text ads are better than video ads
mean(posterior$profit_diff < 0)
## [1] 0.63525
Our analysis suggests that text ads will probably bring more profit, but the result is too uncertain; we need more data to make a well-founded decision.
Another model for another case
We would like to put up an ad on the site and pay for it per day.
The site admin promises that we will get 19 clicks per day.
How many daily clicks should we expect on average?
We are going to use the Poisson distribution for our new model, which describes the number of events in a fixed period of time.
n <- 100000
mean_clicks <- runif(n, min = 0, max = 80)  # uniform prior between 0 and 80 mean daily clicks
# model
n_visitors <- rpois(n = n, mean_clicks)

prior <- data.frame(mean_clicks, n_visitors)

# first day of the trial showed 13 clicks
posterior <- prior[prior$n_visitors == 13, ]

# visualize prior and posterior
par(mfrow = c(1, 2))
hist(prior$mean_clicks)
hist(posterior$mean_clicks)
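To answer the original question, how many daily clicks we should expect on average, we can summarize the posterior over mean_clicks (a short sketch):
# Posterior "best guess" and 90% credible interval for the mean number of daily clicks
median(posterior$mean_clicks)
quantile(posterior$mean_clicks, c(0.05, 0.95))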
45.3 Grid approximation
The following example shows how we can build the previous model more efficiently.
We calculate the distribution using dbinom instead of simulating with rbinom.
This lets us include the conditioning directly in the model.
Instead of using the result of our trial (13 visitors after showing the ad 100 times), we can compute the probability for every possible number of visitors, seq(0, 100).
n_ads_shown <- 100
p_clicks <- 0.1
n_visitors <- seq(0, 100)  # instead of 13, compute for all possible numbers of visitors
# model
prob <- dbinom(n_visitors, size = n_ads_shown, prob = p_clicks)
head(prob)
## [1] 0.0000265614 0.0002951267 0.0016231966 0.0058916025 0.0158745955 0.0338658038
# Plot the distribution
plot(n_visitors, prob, type = "h")
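Since n_visitors covers every possible outcome from 0 to 100, these probabilities form a complete distribution and sum to one:
# Sanity check: the probabilities over all possible outcomes sum to 1
sum(prob)
## [1] 1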
Calculating a joint distribution
n_ads_shown <- 100
p_clicks <- seq(0, 1, by = 0.01)
n_visitors <- seq(0, 100, by = 1)

# Define a grid over all the parameter combinations you need to evaluate
pars <- expand.grid(proportion_clicks = p_clicks,
                    n_visitors = n_visitors)
pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
pars$likelihood <- dbinom(pars$n_visitors,
                          size = n_ads_shown,
                          prob = pars$proportion_clicks)
### According to Bayes' theorem:
# Combined probability by the multiplication rule of probabilities
pars$probability <- pars$likelihood * pars$prior
# normalize so that all probabilities sum to 1
pars$probability <- pars$probability / sum(pars$probability)
###

# Condition on the data: n_visitors == 6
pars <- pars[pars$n_visitors == 6, ]

# Normalize again so that all probabilities sum to 1
pars$probability <- pars$probability / sum(pars$probability)

# Plot the posterior pars$probability
plot(pars$proportion_clicks, pars$probability, type = "h")
We can also condition directly by setting n_visitors <- 6 before building the grid. In that case we do not need to subset with pars[pars$n_visitors == 6, ]; the result is the same, as shown in the sketch below.
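A minimal sketch of this direct-conditioning variant; fixing n_visitors at the observed value up front makes the grid one-dimensional, so no subsetting is needed:
n_ads_shown <- 100
n_visitors <- 6                   # condition directly on the observed data
p_clicks <- seq(0, 1, by = 0.01)

pars <- expand.grid(proportion_clicks = p_clicks, n_visitors = n_visitors)
pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
pars$likelihood <- dbinom(pars$n_visitors, size = n_ads_shown, prob = pars$proportion_clicks)
pars$probability <- pars$likelihood * pars$prior
pars$probability <- pars$probability / sum(pars$probability)

plot(pars$proportion_clicks, pars$probability, type = "h")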
45.4 Model of birth weights using normal distribution
Let’s assume that the Normal distribution is a decent model of birth weight data.
# Assign mu and sigma
m <- 3500  # central value (mean weight in grams)
s <- 600   # standard deviation

weight_dist <- rnorm(n = 100000, mean = m, sd = s)
hist(weight_dist, 60, xlim = c(0, 6000))
We calculate the distribution using dnorm instead of simulating with rnorm.
# Create a grid of weights
weight <- seq(0, 6000, by = 100)  # 100 g increments

# Calculate likelihood
likelihood <- dnorm(weight, m, s)

# Plot the distribution of weight
plot(weight, likelihood, type = "h")
Here is a small data set with the birth weights of six newborn babies in grams.
c(3164, 3362, 4435, 3542, 3578, 4529)
Note: what should we do with these data? To learn about the mean and spread of birth weights, we should condition on them to get a posterior.
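One possible approach (a sketch, not part of the original analysis): condition on the six weights with the same grid approximation used for the temperature and IQ models below; the grid ranges and the prior used here are assumptions chosen only to cover plausible birth weights.
# Sketch: grid approximation over mu and sigma for the six birth weights
weights <- c(3164, 3362, 4435, 3542, 3578, 4529)
pars <- expand.grid(mu = seq(2000, 5000, length.out = 100),
                    sigma = seq(100, 1500, length.out = 100))
pars$prior <- dnorm(pars$mu, mean = 3500, sd = 1000) * dunif(pars$sigma, min = 100, max = 1500)
for (i in 1:nrow(pars)) {
  pars$likelihood[i] <- prod(dnorm(weights, pars$mu[i], pars$sigma[i]))
}
pars$probability <- pars$likelihood * pars$prior / sum(pars$likelihood * pars$prior)
# Posterior "best guess" for the mean birth weight
pars$mu[which.max(pars$probability)]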
45.5 A Bayesian model of Zombie IQ
We first build a grid-approximation model for a small set of temperature measurements, and then reuse the same approach for zombie IQs.
temp <- c(19, 23, 20, 17, 23)
mu <- seq(8, 30, by = 0.5)
sigma <- seq(0.1, 10, by = 0.3)
pars <- expand.grid(mu = mu, sigma = sigma)

pars$mu_prior <- dnorm(pars$mu, mean = 18, sd = 5)
pars$sigma_prior <- dunif(pars$sigma, min = 0, max = 10)

pars$prior <- pars$mu_prior * pars$sigma_prior

for (i in 1:nrow(pars)) {
  likelihoods <- dnorm(temp, pars$mu[i], pars$sigma[i])
  pars$likelihood[i] <- prod(likelihoods)
}

pars$probability <- pars$likelihood * pars$prior
pars$probability <- pars$probability / sum(pars$probability)

sample_indices <- sample(1:nrow(pars), size = 10000,
                         replace = TRUE, prob = pars$probability)
head(sample_indices)
## [1] 524 705 478 749 298 296
pars_sample <- pars[sample_indices, c("mu", "sigma")]
head(pars_sample)
## mu sigma
## 524 22.0 3.4
## 705 22.5 4.6
## 478 21.5 3.1
## 749 22.0 4.9
## 298 21.5 1.9
## 296 20.5 1.9
hist(pars_sample$mu, 30)
quantile(pars_sample$mu, c(0.05, 0.95))
## 5% 95%
## 17.5 22.5
pred_temp <- rnorm(10000, mean = pars_sample$mu, sd = pars_sample$sigma)
hist(pred_temp, 30)
# Probability that the temperature is 18C or higher
sum(pred_temp >= 18) / length(pred_temp)
## [1] 0.7315
The model for temperature above was defined as:
\(\mu \sim \text{Normal}(\text{mean} = 18, \text{sd} = 5)\)
\(\sigma \sim \text{Uniform}(\text{min} = 0, \text{max} = 10)\)
\(temp = 19, 23, \ldots\)
Now we reuse the same model structure, but with zombie IQs instead of temperatures.
How much can we learn about the mean zombie IQ from these data?
We need to calculate the probability of each parameter combination in pars. We use Bayes' theorem to calculate these probabilities and assign them to pars$probability to complete the model.
# The IQ of a bunch of zombies
iq <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 46)

# Defining the parameter grid
pars <- expand.grid(mu = seq(0, 150, length.out = 100),
                    sigma = seq(0.1, 50, length.out = 100))

# Defining and calculating the prior density for each parameter combination
pars$mu_prior <- dnorm(pars$mu, mean = 100, sd = 100)
pars$sigma_prior <- dunif(pars$sigma, min = 0.1, max = 50)
pars$prior <- pars$mu_prior * pars$sigma_prior

# Calculating the likelihood for each parameter combination
for (i in 1:nrow(pars)) {
  likelihoods <- dnorm(iq, pars$mu[i], pars$sigma[i])
  pars$likelihood[i] <- prod(likelihoods)
}

# Calculate the probability of each parameter combination
pars$probability <- pars$likelihood * pars$prior / sum(pars$likelihood * pars$prior)
Inspect the grid and sample parameter values from the posterior to calculate summaries.
head(pars)
## mu sigma mu_prior sigma_prior prior likelihood probability
## 1 0.000000 0.1 0.002419707 0.02004008 4.849113e-05 0 0
## 2 1.515152 0.1 0.002456367 0.02004008 4.922578e-05 0 0
## 3 3.030303 0.1 0.002493009 0.02004008 4.996010e-05 0 0
## 4 4.545455 0.1 0.002529617 0.02004008 5.069373e-05 0 0
## 5 6.060606 0.1 0.002566174 0.02004008 5.142633e-05 0 0
## 6 7.575758 0.1 0.002602661 0.02004008 5.215754e-05 0 0
sample_indices <- sample(nrow(pars), size = 10000,
                         replace = TRUE, prob = pars$probability)
head(sample_indices)
## [1] 2429 2729 2631 3333 2732 2030
# Sample from pars to calculate some new measures
pars_sample <- pars[sample_indices, c("mu", "sigma")]
# Calculate quantiles
quantile(pars_sample$mu, c(0.025, 0.5, 0.975))
## 2.5% 50% 97.5%
## 34.84848 42.42424 50.00000
head(pars_sample)
## mu sigma
## 2429 42.42424 12.19697
## 2729 42.42424 13.70909
## 2631 45.45455 13.20505
## 3333 48.48485 16.73333
## 2732 46.96970 13.70909
## 2030 43.93939 10.18081
pred_iq <- rnorm(10000,
                 mean = pars_sample$mu,
                 sd = pars_sample$sigma)

# Calculate the probability that a zombie has an IQ of 60 or more
sum(pred_iq >= 60) / length(pred_iq)
## [1] 0.0837
par(mfrow=c(1,2))
# Visualize the mean IQ
hist(pars_sample$mu, 100)
# Visualize pred_iq
hist(pred_iq)
45.6 The BEST models
The t-test is a classical statistical procedure used to compare the means of two data sets.
In 2013 John Kruschke developed a souped-up Bayesian version of the t-test he named BEST (standing for Bayesian Estimation Supersedes the t-test).
We would like to compare the IQ of two groups of 10 patients on different diets (a and b).
# The IQ of patients
iq_a <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 46)
iq_b <- c(44, 52, 42, 66, 53, 42, 55, 57, 56, 51)

# Calculate the mean difference in IQ between the two groups
mean(iq_b) - mean(iq_a)
## [1] 9.3

require(BEST)
# Fit the BEST model to the data from both groups
best_posterior <- BEST::BESTmcmc(iq_b, iq_a)
# Plot the model result
plot(best_posterior)
The Bayesian model behind BEST assumes that the generative model for the data is a t-distribution; a more flexible distribution than the normal distribution as it assumes that data points might be outliers to some degree. This makes BEST’s estimate of the mean difference robust to outliers in the data.
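As a sketch of this robustness claim (the extra data point below is hypothetical and not part of the original data), we could append an extreme outlier to one group and refit; the estimated group difference should change only slightly:
# Hypothetical outlier added to group a to illustrate robustness to outliers
iq_a_outlier <- c(iq_a, 150)
best_posterior_outlier <- BEST::BESTmcmc(iq_b, iq_a_outlier)
plot(best_posterior_outlier)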
Sources