---
title: "Statistical Rethinking Chapter 3 Problems"
author: "John Fox"
output: html_document
---

#### `r as.character(Sys.Date())`

```{r echo=FALSE}
# include this code chunk as-is to set options
knitr::opts_chunk$set(comment=NA, prompt=TRUE, fig.height=5, fig.width=5) 
```

Data for the problems:

```{r}
library(rethinking)
data(homeworkch3) # note: rethinking pkg doesn't use "lazy data"
rbind(birth1, birth2)
```


3H1
----

With a uniform prior over [0, 1] for the $\mathrm{Pr(boy)} = p$, the maximum-posterior-density estimate and the ML estimate coincide:


```{r}
(data <- sum(birth1, birth2))  # number of boys
```
```{r}
p.grid <- seq(0, 1, by = 0.001) 
prior <- rep(1, length(p.grid))
plot(p.grid, prior, type="l", xlab="p", col="magenta", ylim=c(0, 1))
likelihood <- dbinom(data, size=200, prob=p.grid)
plot(p.grid, likelihood, type="l", xlab="p", col="cyan")
posterior <- likelihood*prior
posterior <- posterior/sum(posterior) # normalize
plot(p.grid, posterior, type="l", col="blue")
p.grid[which.max(posterior)]  # maximum of posterior
```
```{r}
p.grid[which.max(likelihood)] # ML estimate
```

3H2
----

Sampling from the posterior distribution to approximate various HPD intervals:

```{r}
set.seed(12345) # for reproducibility
samples <- sample(p.grid, size=1e6, replace=TRUE, prob=posterior) # I drew 1 x 10^6 samples
head(samples) # first few
plot(samples[1:10000], type="p", pch=".")  # first 10,000
abline(h=median(samples), col="magenta")
```

Various HPD intervals:

```{r}
HPD <- matrix(0, 3, 2)
rownames(HPD) <- c(0.5, 0.89, 0.97)
colnames(HPD) <- c("lower", "upper")
for (prob in c(0.5, 0.89, 0.97))
    HPD[as.character(prob), ] <- HPDI(samples, prob=prob)
HPD
```

3H3
----

10,000 simulated samples, each of size $n = 200$, of a binomial r.v. with probability of success taken from the first 10,000 draws from the posterior distribution of $\mathrm{Pr(boy)}$:

```{r}
simulated.counts <- rbinom(1e4, size=200, prob=samples[1:1e4])
head(simulated.counts)  # first few
```

Comparing the simulated distribution of counts to the observed value of 111 boys:

```{r}
dens(simulated.counts)
abline(v=111, col="magenta")
```

The observed value seems near the centre of the simulated counts.


3H4
----

I wasn't entirely sure what McElreath intended here: To base the simulation of 100 first births on the samples from the posterior constructed above from all 200 births (assuming independence between first and second births), or to redo the entire analysis using only the 100 first births? I decided the latter wasn't really interesting, so interpreted the question to mean the former.

Then,

```{r}
simulated.counts.1 <- rbinom(1e4, size=100, prob=samples[1:1e4])
dens(simulated.counts.1)
(data.1 <- sum(birth1)) # observed number of boys in first births
abline(v=data.1, col="magenta")
```

It looks as if there are fewer boys among the first births than one would expect using all 200 births to estimate $\mathrm{Pr(boy)}$, assuming independence between first and second births within families (and `dens()` doesn't seem to be doing a great job of picking the bandwidth for the density estimate).



3H5
----

The number of first borns who were girls:

```{r}
(first.girls <- sum(1 - birth1))
```

And the number of boy second births following girl first births:

```{r}
(b.after.g <- sum(birth2[birth1 == 0])) # boys following a girl
```

Simulating the number of boy second births after girl births, based on the model fit to all 200 births assuming independence between the sexes of first and second births:

```{r}
simulated.counts.b.after.g <- rbinom(1e4, size=first.girls, prob=samples[1:1e4])
dens(simulated.counts.b.after.g)
abline(v=b.after.g, col="magenta")
simplehist(simulated.counts.b.after.g,
     xlab="simulated counts of number of boys after a girl birth", main="")
abline(v=b.after.g, col="magenta")
```

`dens()` does a pretty poor job of adapting the bandwidth to the discreteness of the counts, so I also showed a "spike-graph" using `simplehist()`. That would have been a better choice for Q. 3H4 too.

There seem to be many more boys born after a girl first birth than expected assuming independence between births --- which would happen, e.g., if parents exercised sex-selection in second births favouring boys following the birth of a girl. On the other hand, there were nearly equal numbers of boy and girl first births.


Addendum
-----------

The canonical "frequentist" analysis of the relationship between first and second births:

```{r}
(freqs <- table(birth1, birth2)) # frequency table
```
```{r}
round(100*prop.table(freqs, margin=1), 1) # row percentages
```

Thus, the percentage of boy second births is roughly twice as high following a girl first birth (`birth1` = 0) than a boy first birth (`birth1` = 1).

By a chisquare test of independence, or by Fisher's exact test, the (horrors!) $p$-value for the null hypothesis of independence between second and first births is very small:

```{r}
chisq.test(freqs)
```

```{r}
fisher.test(freqs)
```

(This isn't really fair: The purpose is to demonstrate in a simple setting how the posterior distribution can be approximated by simulation, and how to check the model by simulating the assumed data-generating mechanism and comparing to the observed data.)