---
title: "Statistical Rethinking Chapter 8 Problems"
author: "John Fox"
date: "2016-11-28"
output: html_document
---


```{r echo=FALSE}
# include this code chunk as-is to set options
knitr::opts_chunk$set(comment=NA, prompt=TRUE, fig.height=5, fig.width=9, R.options=list(digits=6)) 
```

8H1
---

```{r}
set.seed(706963) # for reproducibility
library(rethinking)
library(car) # for qqPlot(), and later densityPlot()
mp <- map2stan(
    alist(
        a ~ dnorm(0, 1),
        b ~ dcauchy(0, 1)
    ),
    data = list(y=1),
    start = list(a=0, b=0),
    iter=1e4, warmup=100, WAIC=FALSE
)
precis(mp)
plot(mp)
pairs(mp)
post <- extract.samples(mp)
par(mfrow=c(1, 2))
qqPlot(post$a)
qqPlot(post$b, distribution="cauchy", location=0, scale=1)
```

This "model" just draws pairs of values from the standard-normal distribution and the Cauchy distribution with location 0 and scale 1. The two sets of values are drawn independently. The Cauchy distribution has very heavy tails and thus is prone to producing extreme outliers, as is apparent in both the trace plot and the pairs plot. The QQ plot for the Cauchy data, however, suggests that the sampled values aren't as extreme as expected.

8H2
---

Fitting the models (I used underscores rather than dots in variable names because **stan** apparently doesn't like dots):

```{r}
data(WaffleDivorce)
d <- WaffleDivorce[, c("Divorce", "MedianAgeMarriage", "Marriage")]

d$MedianAgeMarriage_s <- (d$MedianAgeMarriage-mean(d$MedianAgeMarriage))/
    sd(d$MedianAgeMarriage)
m5.1 <- map2stan(
    alist(
        Divorce ~ dnorm(mu, sigma),
        mu <- a + bA * MedianAgeMarriage_s,
        a ~ dnorm(10, 10),
        bA ~ dnorm(0, 1),
        sigma ~ dcauchy(0 ,1)
    ), data = d)

d$Marriage_s <- (d$Marriage - mean(d$Marriage))/sd(d$Marriage)
m5.2 <- map2stan(
    alist(
        Divorce ~ dnorm(mu, sigma),
        mu <- a + bR * Marriage_s,
        a ~ dnorm(10, 10),
        bR ~ dnorm(0, 1),
        sigma ~ dcauchy(0 , 1)
    ), data = d)

m5.3 <- map2stan(
    alist(
        Divorce ~ dnorm(mu, sigma) ,
        mu <- a + bR*Marriage_s + bA*MedianAgeMarriage_s,
        a ~ dnorm(10, 10),
        bR ~ dnorm(0, 1),
        bA ~ dnorm(0, 1),
        sigma ~ dunif(0, 10)
    ),
    data = d)

precis(m5.1)
precis(m5.2)
precis(m5.3)
compare(m5.1, m5.2, m5.3)
plot(m5.3)
```

The parameter trace plots (shown for `m5.3`) look OK to me.

The WAIS prefers model `m5.1` which has only the predictor `MedianAgeMarriage_s`. The model `m5.2` with only the predictor `Marriage_s` is by far the worst according to the WAIS. 

The results are quite similar to those in Ch. 5 fit with `map()`, although the models there used a uniform prior for the SD of the errors.

8H3
---

```{r}
N <- 100                            # number of individuals
height <- rnorm(N, 10, 2)           # sim total height of each
leg_prop <- runif(N, 0.4, 0.5)      # leg as proportion of height
leg_left <- leg_prop*height +       # sim left leg as proportion + error
    rnorm(N, 0, 0.02)
leg_right <- leg_prop*height +      # sim right leg as proportion + error
    rnorm(N, 0, 0.02)
                                    # combine into data frame
d <- data.frame(height, leg_left, leg_right)

m5.8s <- map2stan(
    alist(
        height ~ dnorm(mu, sigma),
        mu <- a + bl*leg_left + br*leg_right,
        a ~ dnorm(10, 100),
        bl ~ dnorm(2, 10),
        br ~ dnorm(2, 10),
        sigma ~ dcauchy(0, 1)
    ) ,
    data=d, chains=4, start=list(a=0, bl=0, br=0, sigma=1))

m5.8s2 <- map2stan(
    alist(
        height ~ dnorm(mu, sigma),
        mu <- a + bl*leg_left + br*leg_right,
        a ~ dnorm(10, 100),
        bl ~ dnorm(2, 10),
        br ~ dnorm(2, 10) & T[0, ],
        sigma ~ dcauchy(0, 1)
    ) ,
    data=d, chains=4, start=list(a=0, bl=0, br=0, sigma=1))
precis(m5.8s)
precis(m5.8s2)
plot(m5.8s)
plot(m5.8s2)
pairs(m5.8s)
pairs(m5.8s2)
```

The left- and right-leg coefficients are perfectly negatively correlated and effectively sum to 2, so the distribution of one mirrors the distribution of the other. There is an indication of a convergence problem for the chains for model `m5.8s2`, where there are occassionally unusual parameter values sampled.

8H4
---

```{r}
compare(m5.8s, m5.8s2)
compare(m5.8s, m5.8s2, func=DIC)
```

The two criteria rate the models very closely; both show `m5.8s` as having more effective parameters. It's not clear to me why (I suppose it's because the prior in model `m5.8s2` is more constraining): in both cases two of the parameters are perfectly (negatively) correlated.

8H5
---

I didn't find this exercise interesting, so I skipped it.

8H6
----

I decided to apply the `metropolis()` function defined in my "Some Notes on Markov Chain Monte Carlo (MCMC)" to this problem. Repeating the R code for the function (and related functions):

```{r}
metropolis <- function(likelihood, prior, proposal, pars0, m=1e5){
    # likelihood: a function of the parameter vector
    # prior:      a function of the parameter vector
    # proposal:   a function of the current value of the parameter 
    #              vector, returning a proposed parameter vector
    # pars0:      a vector of initial values of the parameters
    # m:          number of samples, defaults to 10^5
    post <- function(pars) likelihood(pars)*prior(pars)
    pars <- matrix(0, m, length(pars0))
    accepted <- rejected <- 0
    pars.current <- pars0
    for (i in 1:m){
        pars.proposed <- proposal(pars.current) 
        a <- post(pars.proposed)/post(pars.current)  # aceptance ratio
        if (a >= 1 || runif(1) <= a) {
            pars[i, ] <- pars.proposed
            pars.current <- pars.proposed
            accepted <- accepted + 1
        }
        else {
            pars[i, ] <- pars.current
            rejected <- rejected + 1
        }
    }
    if (length(pars0) == 1) pars <- as.vector(pars)
    result <- list(samples=pars, accepted=accepted, rejected=rejected, 
                   thinned=FALSE)
    class(result) <- "metropolis"
    result
}

print.metropolis <- function(x, ...){
    cat("number of samples: ", 
        with(x, if (is.matrix(samples)) nrow(samples) 
             else length(samples)))
    cat("\nnumber of parameters: ", 
        with(x, if (is.matrix(samples)) ncol(samples) else 1))
    if (x$thinned){
        cat("\nPrior to thinning:")
    }
    cat("\nnumber of proposals accepted: ", x$accepted)
    cat("\nnumber of proposals rejected: ", x$rejected)
    cat("\npercentage of proposals accepted: ", 
        with(x, round(100*accepted/(accepted + rejected), 2)), "\n")
    if (is.matrix(x$samples)){
        cat("\nestimated posterior medians")
        print(apply(x$samples, 2, median))
    }
    else cat("\nestimated posterior median: ", median(x$samples))
    invisible(x)
}

plot.metropolis <- function(x, ...){
    n.par <- if (is.matrix(x$samples)) ncol(x$samples) else 1
    if (n.par == 1) acf(x$samples, main="", ...)
    else{
        save.mfrow <- par(mfrow =  n2mfrow(n.par))
        on.exit(options(save.mfrow))
        for (j in 1:npar){
            acf(x$samples[, j], main = paste("parameter", j), ...)
        }
    }
}

thin <- function(object, ...){
    UseMethod("thin")
}

thin.metropolis <- function(object, by, ...){
    # by: every by-th sample in object is retained
    samples <- object$samples
    object$samples <- if (is.matrix(samples)){
        samples[seq(1, nrow(samples), by=by), ]
    }
    else samples[seq(1, length(samples), by=by)]
    object$thinned <- TRUE
    object
}
```

For the example from Ch. 2 (see p. 42), the likelihood function is $W\sim \mathrm{Bin}(n=9, p)$ with data w = 6 hits on water, and the prior is $p \sim \mathrm{Unif}(0, 1)$. I'll use as a proposal function $N(p, 0.1^2)$:
```{r}
L <- function(p) {
    if (p < 0 || p > 1) 0
    else dbinom(6, size=9, prob=p)
}
prior <- function(p) 1
proposal <- function(p) rnorm(1, mean=p, sd=0.1)
```

Then
```{r}
(res8H6 <- metropolis(L, prior, proposal, pars0=0.5))
plot(res8H6)
```

I'll thin the chain by taking every 25th value and then compute the mean, standard deviation, and 5.5 and 94.5 percentiles to compare with the results on p. 42:
```{r}
(res8H6.th <- thin(res8H6, 25))
```
```{r}
mean(res8H6.th$samples)
sd(res8H6.th$samples)
quantile(res8H6.th$sample, c(0.055, 0.945))
```
The results are reasonably similar to those on p. 42. The differences are probably due to `map()` using a normal approximation to the posterior. Here's a density estimate of the posterior, which is negatively skewed:

```{r}
densityPlot(res8H6.th$samples, xlab="p")
p <- seq(0, 1, length=50)
lines(p, dbeta(p, 7, 4), col="magenta", lty=2)
```

The true posterior here is Beta$(7, 4)$, shown in magenta on the graph. Here's a QQ plot:

```{r}
qqPlot(res8H6.th$samples, distribution="beta", shape1=7, shape2=4)
```