# Chapter 6 Spatial modeling of areal data. Lip cancer in Scotland

In this chapter we estimate the risk of lip cancer in males in Scotland, UK, using the **R-INLA** package (Rue et al. 2018). We use data that on the number of observed and expected lip cancer cases, and the proportion of population engaged in agriculture, fishing, or forestry (AFF) for each of the Scotland counties. These data are obtained from the **SpatialEpi** package (Kim and Wakefield 2018). First, we describe how to calculate and interpret the Standardized Incidence Ratios (SIRs) in the Scotland counties. Then, we show how to fit the Besag-York-Mollié (BYM) model to obtain relative risk estimates and quantify the effect of the AFF variable. We also show how to calculate exceedance probabilities of relative risk being greater than a given threshold value. Results are shown by means of tables, static plots created with **ggplot2** (Wickham, Chang, et al. 2018), and interactive maps created with **leaflet** (Cheng, Karambelkar, and Xie 2018). A similar example that analyzes lung cancer data in Pennsylvania, USA, appears in Moraga (2018a).

## 6.1 Data and map

We start by loading the **SpatialEpi** package and attaching the `scotland`

data. The data contain for each of the Scotland counties, the number of observed and expected lip cancer cases between years 1975 and 1980, and a variable that indicates the proportion of the population engaged in agriculture, fishing, or forestry (AFF). The AFF variable is related to exposure to sunlight which is a risk factor for lip cancer. The data also contain a map of the Scotland counties.

```
library(SpatialEpi)
data(scotland)
```

Next we inspect the data.

`class(scotland)`

`## [1] "list"`

`names(scotland)`

```
## [1] "geo" "data"
## [3] "spatial.polygon" "polygon"
```

We see that `scotland`

is a list object with the following elements:

`geo`

: data frame with names and centroid coordinates (eastings/northings) of each of the counties,`data`

: data frame with names, number of observed and expected lip cancer cases, and AFF values of each of the counties,`spatial.polygon`

:`SpatialPolygons`

object with the map of Scotland,`polygon`

: polygon map of Scotland.

We can type `head(scotland$data)`

to see the number of observed and expected lip cancer cases and the AFF values of the first counties.

`head(scotland$data)`

```
## county.names cases expected AFF
## 1 skye-lochalsh 9 1.4 0.16
## 2 banff-buchan 39 8.7 0.16
## 3 caithness 11 3.0 0.10
## 4 berwickshire 9 2.5 0.24
## 5 ross-cromarty 15 4.3 0.10
## 6 orkney 8 2.4 0.24
```

The map of Scotland counties is given by the `SpatialPolygons`

object called `scotland$spatial.polygon`

(Figure 6.1).

```
map <- scotland$spatial.polygon
plot(map)
```

`map`

does not contain information about the Coordinate Reference System (CRS), so we specify a CRS by assigning the corresponding proj4 string to the map. The map is in the projection OSGB 1936/British National Grid which has EPSG code 27700. The proj4 string of this projection can be seen in https://spatialreference.org/ref/epsg/27700/proj4/ or can be obtained with R as follows:

```
codes <- rgdal::make_EPSG()
codes[which(codes$code == "27700"), ]
```

We assign this proj4 string to `map`

and set `+units=km`

because this is the unit of the map projection.

```
proj4string(map) <- "+proj=tmerc +lat_0=49 +lon_0=-2
+k=0.9996012717 +x_0=400000 +y_0=-100000 +datum=OSGB36
+units=km +no_defs"
```

We wish to use the **leaflet** package to create maps. **leaflet** expects data to be specified in latitude and longitude using WGS84, so we transform `map`

to this projection as follows:

```
map <- spTransform(map,
CRS("+proj=longlat +datum=WGS84 +no_defs"))
```

## 6.2 Data preparation

In order to analyze the data, we create a data frame called `d`

with columns containing the counties ids, the observed and expected number of lip cancer cases, the AFF values, and the SIRs. Specifically, `d`

contains the following columns:

`county`

: id of each county,`Y`

: observed number of lip cancer cases in each county,`E`

: expected number of lip cancer cases in each county,`AFF`

: proportion of population engaged in agriculture, fishing, or forestry,`SIR`

: SIR of each county.

We create the data frame `d`

by selecting the columns of `scotland$data`

that denote the counties, the number of observed cases, the number of expected cases, and the variable AFF. Then we set the column names of the data frame to `c("county", "Y", "E", "AFF")`

.

```
d <- scotland$data[,c("county.names", "cases", "expected", "AFF")]
names(d) <- c("county", "Y", "E", "AFF")
```

Note that in this example the number of expected counts are given in the data. If the expected counts were not availabe we could calculate them using indirect standardization as demonstrated in Chapter 5. Then, we can compute the SIR values as the ratio of the observed to the expected number of lip cancer cases.

`d$SIR <- d$Y / d$E`

The first rows of the data frame `d`

are the following:

`head(d)`

```
## county Y E AFF SIR
## 1 skye-lochalsh 9 1.4 0.16 6.429
## 2 banff-buchan 39 8.7 0.16 4.483
## 3 caithness 11 3.0 0.10 3.667
## 4 berwickshire 9 2.5 0.24 3.600
## 5 ross-cromarty 15 4.3 0.10 3.488
## 6 orkney 8 2.4 0.24 3.333
```

### 6.2.1 Adding data to map

The map of Scotland counties is given by the `SpatialPolygons`

object called `map`

. Using this object and the data frame `d`

we can create a `SpatialPolygonsDataFrame`

that we will use to make maps of the variables in `d`

. We create the `SpatialPolygonsDataFrame`

by first setting the row names of `d`

equal to `d$county`

. Then we merge the `SpatialPolygons`

object `map`

and the data frame `d`

matching the `SpatialPolygons`

polygons ID slot values with the data frame row names (`match.ID = TRUE`

). We call `map`

the `SpatialPolygonsDataFrame`

obtained which contains the Scotland counties and the data of data frame `d`

.

```
library(sp)
rownames(d) <- d$county
map <- SpatialPolygonsDataFrame(map, d, match.ID = TRUE)
head(map@data)
```

```
## county Y E AFF SIR
## skye-lochalsh skye-lochalsh 9 1.4 0.16 6.429
## banff-buchan banff-buchan 39 8.7 0.16 4.483
## caithness caithness 11 3.0 0.10 3.667
## berwickshire berwickshire 9 2.5 0.24 3.600
## ross-cromarty ross-cromarty 15 4.3 0.10 3.488
## orkney orkney 8 2.4 0.24 3.333
```

## 6.3 Mapping SIRs

We can visualize the observed and expected lip cancer cases, the SIRs, as well as the AFF values in an interactive chropleth map using the **leaflet** package. We create a map with the SIRs by first calling `leaflet()`

and adding the default OpenStreetMap map tiles to the map with `addTiles()`

. Then we add the Scotland counties with `addPolygons()`

where we specify the areas boundaries color (`color`

) and the stroke width (`weight`

). We fill the areas with the colors given by the color palette function generated with `colorNumeric()`

, and set `fillOpacity`

to a value less than 1 to be able to see the background map. We use `colorNumeric()`

to create a color palette function that maps data values to colors according to a given palette. We create the function using the parameters `palette`

with color function that values will be mapped to, and `domain`

with the possible values that can be mapped. Finally, we add the legend by specifying the color palette function (`pal`

) and the values used to generate colors from the palette function (`values`

). We set `opacity`

to the same value as the opacity in the map, and specify a title and a position for the legend (Figure 6.2).

```
library(leaflet)
l <- leaflet(map) %>% addTiles()
pal <- colorNumeric(palette = "YlOrRd", domain = map$SIR)
l %>%
addPolygons(
color = "grey", weight = 1,
fillColor = ~ pal(SIR), fillOpacity = 0.5
) %>%
addLegend(
pal = pal, values = ~SIR, opacity = 0.5,
title = "SIR", position = "bottomright"
)
```

We can improve the map by highlighting the counties when the mouse hovers over them, and showing information about the observed and expected counts, SIRs, and AFF values. We do this by adding the arguments `highlightOptions`

, `label`

and `labelOptions`

to `addPolygons()`

. We choose to highlight the areas using a bigger stroke width (`highlightOptions(weight = 4)`

). We create the labels using HTML syntax. First, we create the text to be shown using the function `sprintf()`

which returns a character vector containing a formatted combination of text and variable values and then applying `htmltools::HTML()`

which marks the text as HTML. In `labelOptions`

we specify the `style`

, `textsize`

, and `direction`

of the labels. Possible values for `direction`

are `left`

, `right`

and `auto`

and this specifies the direction the label displays in relation to the marker. We set `direction`

equal to `auto`

so the optimal direction will be chosen depending on the position of the marker.

```
labels <- sprintf("<strong> %s </strong> <br/>
Observed: %s <br/> Expected: %s <br/>
AFF: %s <br/> SIR: %s",
map$county, map$Y, round(map$E, 2),
map$AFF, round(map$SIR, 2)
) %>%
lapply(htmltools::HTML)
l %>%
addPolygons(
color = "grey", weight = 1,
fillColor = ~ pal(SIR), fillOpacity = 0.5,
highlightOptions = highlightOptions(weight = 4),
label = labels,
labelOptions = labelOptions(
style = list(
"font-weight" = "normal",
padding = "3px 8px"
),
textsize = "15px", direction = "auto"
)
) %>%
addLegend(
pal = pal, values = ~SIR, opacity = 0.5,
title = "SIR", position = "bottomright"
)
```

We can examine the map of SIRs and see which counties in Scotland have SIR equal to 1 indicating observed counts are the same as expected counts, and which counties have SIR greater (or smaller) than 1, indicating observed counts are greater (or smaller) than expected counts. This map gives a sense of the lip cancer risk across Scotland. However, SIRs may be misleading and insufficiently reliable in counties with small populations. In contrast, model-based approaches enable to incorporate covariates and borrow information from neighboring counties to improve local estimates, resulting in the smoothing of extreme values based on small sample sizes. In the next section we show how to obtain disease risk estimates using a spatial model with the **R-INLA** package.

## 6.4 Modeling

In this section we specify the model for the data, and detail the required steps to fit the model and obtain the disease risk estimates using **R-INLA**.

### 6.4.1 Model

We specify a model assuming that the observed counts, \(Y_i\), are conditionally independently Poisson distributed:

\[Y_i \sim Poisson(E_i \theta_i),\ i=1,\ldots,n,\]

where \(E_i\) is the expected count and \(\theta_i\) is the relative risk in area \(i\). The logarithm of \(\theta_i\) is expressed as

\[\log(\theta_i) = \beta_0 + \beta_1 \times AFF_i + u_i + v_i,\]

where \(\beta_0\) is the intercept that represents the overall risk, \(\beta_1\) is the coefficient of the AFF covariate, \(u_i\) is a spatial structured component modeled with a CAR distribution, \(u_i|\boldsymbol{u_{-i}} \sim N\left(\bar u_{\delta_i}, \frac{\sigma^2_u}{n_{\delta_i}}\right)\), and \(v_i\) is an unstructured spatial effect defined as \(v_i \sim N(0, \sigma^2_v)\). The relative risk \(\theta_i\) quantifies whether area \(i\) has higher (\(\theta_i > 1\)) or lower (\(\theta_i < 1\)) risk than the average risk in the standard population.

### 6.4.2 Neighborhood matrix

We create the neighborhood matrix needed to define the spatial random effect using the `poly2nb()`

and the `nb2INLA()`

functions of the **spdep** package (Bivand 2018). First, we use `poly2nb()`

to create a neighbors list based on areas with contiguous boundaries. Each element of the list `nb`

represents one area and contains the indices of its neighbors. For example, `nb[[2]]`

contains the neighbors of area 2.

```
library(spdep)
library(INLA)
nb <- poly2nb(map)
head(nb)
```

```
## [[1]]
## [1] 5 9 19
##
## [[2]]
## [1] 7 10
##
## [[3]]
## [1] 12
##
## [[4]]
## [1] 18 20 28
##
## [[5]]
## [1] 1 12 19
##
## [[6]]
## [1] 0
```

Then, we use `nb2INLA()`

to convert this list into a file with the representation of the neighborhood matrix as required by **R-INLA**. Then we read the file using the `inla.read.graph()`

function of **R-INLA**, and store it in the object `g`

which we will later use to specify the spatial random effect.

```
nb2INLA("map.adj", nb)
g <- inla.read.graph(filename = "map.adj")
```

### 6.4.3 Inference using INLA

The model includes two random effects, namely, \(u_i\) for modeling the spatial residual variation, and \(v_i\) for modeling unstructured noise. We need to include two vectors in the data that denote the indices of these random effects. We call `idareau`

the indices vector for \(u_i\), and `idareav`

the indices vector for \(v_i\). We set both `idareau`

and `idareav`

equal to \(1,\ldots,n\), where \(n\) is the number of counties. In our example, \(n\)=56 and this can be obtained with the number of rows in the data (`nrow(map@data)`

).

```
map$idareau <- 1:nrow(map@data)
map$idareav <- 1:nrow(map@data)
```

We specify the model formula by including the response in the left-hand side, and the fixed and random effects in the right-hand side. The response variable is `Y`

and we use the covariate `AFF`

. Random effects are defined using `f()`

with parameters equal to the name of the index variable and the chosen model. For \(u_i\), we use `model = "besag"`

with neighborhood matrix given by `g`

. We also use option `scale.model = TRUE`

to make the precision parameter of models with different CAR priors comparable (Freni-Sterrantino, Ventrucci, and Rue 2018). For \(v_i\), we choose `model = "iid"`

.

```
formula <- Y ~ AFF +
f(idareau, model = "besag", graph = g, scale.model = TRUE) +
f(idareav, model = "iid")
```

We fit the model by calling the `inla()`

function. We specify the formula, family (`"poisson"`

), data, and the expected counts (`E`

). We also set `control.predictor`

equal to `list(compute = TRUE)`

to compute the posteriors of the predictions.

```
res <- inla(formula,
family = "poisson", data = map@data,
E = E, control.predictor = list(compute = TRUE)
)
```

### 6.4.4 Results

We can inspect the results object `res`

by typing `summary(res)`

.

`summary(res)`

```
## Fixed effects:
## mean sd 0.025quant 0.5quant
## (Intercept) -0.305 0.1195 -0.5386 -0.3055
## AFF 4.330 1.2766 1.7435 4.3562
## 0.975quant mode kld
## (Intercept) -0.0684 -0.3067 0
## AFF 6.7702 4.4080 0
##
## Random effects:
## Name Model
## idareau Besags ICAR model
## idareav IID model
##
## Model hyperparameters:
## mean sd 0.025quant
## Precision for idareau 4.15 1.449 2.022
## Precision for idareav 19340.52 19386.226 1347.042
## 0.5quant 0.975quant mode
## Precision for idareau 3.914 7.629 3.486
## Precision for idareav 13601.004 70979.161 3679.387
##
## Expected number of effective parameters(std dev): 28.54(3.533)
## Number of equivalent replicates : 1.962
##
## Marginal log-Likelihood: -189.69
```

We observe the intercept \(\hat \beta_0=\) -0.305 with a 95% credible interval equal to (-0.5386, -0.0684), and the coefficient of AFF is \(\hat \beta_1=\) 4.330 with a 95% credible interval equal to (1.7435, 6.7702). This indicates that AFF increases lip cancer risk. We can plot the posterior distribution of the AFF coefficient by first calculating a smoothing of the marginal distribution of the coefficient with `inla.smarginal()`

, and then using the `ggplot()`

function of the **ggplot2** package (Figure 6.4).

```
library(ggplot2)
marginal <- inla.smarginal(res$marginals.fixed$AFF)
marginal <- data.frame(marginal)
ggplot(marginal, aes(x = x, y = y)) + geom_line() +
labs(x = expression(beta[1]), y = "Density") +
geom_vline(xintercept = 0, col = "black") + theme_bw()
```

## 6.5 Mapping relative risks

The estimates of the relative risk of lip cancer and their uncertainty for each of the counties are given by the mean posterior and the 95% credible intervals which are contained the object `res$summary.fitted.values`

. Column `mean`

is the mean posterior and `0.025quant`

and `0.975quant`

are the 2.5 and 97.5 percentiles, respectively.

`head(res$summary.fitted.values)`

```
## mean sd 0.025quant 0.5quant
## fitted.Predictor.01 4.964 1.4515 2.643 4.788
## fitted.Predictor.02 4.396 0.6752 3.172 4.362
## fitted.Predictor.03 3.621 1.0170 1.919 3.524
## fitted.Predictor.04 3.083 0.8950 1.627 2.982
## fitted.Predictor.05 3.329 0.7501 2.042 3.266
## fitted.Predictor.06 2.975 0.9195 1.491 2.869
## 0.975quant mode
## fitted.Predictor.01 8.294 4.449
## fitted.Predictor.02 5.817 4.295
## fitted.Predictor.03 5.884 3.336
## fitted.Predictor.04 5.113 2.789
## fitted.Predictor.05 4.973 3.144
## fitted.Predictor.06 5.070 2.666
```

We add these data to `map`

to be able to create maps of these variables. We assign `mean`

to the estimate of the relative risk, and `0.025quant`

and `0.975quant`

to the lower and upper limits of 95% credible intervals of the risks.

```
map$RR <- res$summary.fitted.values[, "mean"]
map$LL <- res$summary.fitted.values[, "0.025quant"]
map$UL <- res$summary.fitted.values[, "0.975quant"]
```

Then, we show the relative risk of lip cancer in an interactive map using **leaflet**. In the map, we add labels that appear when the mouse hovers over the counties showing information about observed and expected counts, SIRs, AFF values, relative risks, and lower and upper limits of 95% credible intervals. The map created is shown in Figure 6.5. We observe counties with greater lip cancer risk are located in the north of Scotland, and counties with lower risk are located in the center. The 95% credible intervals indicate the uncertainty in the risk estimates.

```
pal <- colorNumeric(palette = "YlOrRd", domain = map$RR)
labels <- sprintf("<strong> %s </strong> <br/>
Observed: %s <br/> Expected: %s <br/>
AFF: %s <br/> SIR: %s <br/> RR: %s (%s, %s)",
map$county, map$Y, round(map$E, 2),
map$AFF, round(map$SIR, 2), round(map$RR, 2),
round(map$LL, 2), round(map$UL, 2)
) %>% lapply(htmltools::HTML)
lRR <- leaflet(map) %>%
addTiles() %>%
addPolygons(
color = "grey", weight = 1, fillColor = ~ pal(RR),
fillOpacity = 0.5,
highlightOptions = highlightOptions(weight = 4),
label = labels,
labelOptions = labelOptions(
style =
list(
"font-weight" = "normal",
padding = "3px 8px"
),
textsize = "15px", direction = "auto"
)
) %>%
addLegend(
pal = pal, values = ~RR, opacity = 0.5, title = "RR",
position = "bottomright"
)
lRR
```

## 6.6 Exceedance probabilities

We can also calculate the probabilities of relative risk estimates being greater than a given threshold value. These probabilities are called exceedance probabilities and are useful to assess unusual elevation of disease risk. The probability that the relative risk of area \(i\) is higher than a value \(c\) can be written as \(P(\theta_i > c)\). This probability can be calculated by substracting \(P(\theta_i \leq c)\) to 1 as follows:

\[P(\theta_i > c) = 1- P(\theta_i \leq c).\]

In **R-INLA**, the probability \(P(\theta_i \leq c)\) can be calculated using the `inla.pmarginal()`

function with arguments equal to the marginal distribution of \(\theta_i\) and the threshold value \(c\). Then, the exceedance probability \(P(\theta_i > c)\) can be calculated by substracting this probability to 1:

`1 - inla.pmarginal(q = c, marginal = marg)`

where `marg`

is the marginal distribution of the predictions, and `c`

is the threshold value.

The marginals of the relative risks are in the list `res$marginals.fitted.values`

, and the marginal corresponding to the first county is `res$marginals.fitted.values[[1]]`

. In our example, we can calculate the probability that the relative risk of the first county exceeds 2, \(P(\theta_1 > 2)\), as follows:

```
marg <- res$marginals.fitted.values[[1]]
1 - inla.pmarginal(q = 2, marginal = marg)
```

`## [1] 0.9975`

To calculate the exceedance probabilities for all counties we can use the `sapply()`

function passing as arguments the list with the marginals of all counties (`res$marginals.fitted.values`

), and the function to calculate the exceedance probabilities (`1- inla.pmarginal()`

). `sapply()`

returns a vector of the same length as the list `res$marginals.fitted.values`

with values equal to the result of applying the function `1- inla.pmarginal()`

to each of the elements of the list of marginals.

```
exc <- sapply(res$marginals.fitted.values,
FUN = function(marg){1 - inla.pmarginal(q = 2, marginal = marg)})
```

Then we can add the exceedance probabilities to the `map`

and create a map of the exceedance probabilities with **leaflet**.

`map$exc <- exc`

```
pal <- colorNumeric(palette = "YlOrRd", domain = map$exc)
labels <- sprintf("<strong> %s </strong> <br/>
Observed: %s <br/> Expected: %s <br/>
AFF: %s <br/> SIR: %s <br/> RR: %s (%s, %s) <br/> P(RR>2): %s",
map$county, map$Y, round(map$E, 2),
map$AFF, round(map$SIR, 2), round(map$RR, 2),
round(map$LL, 2), round(map$UL, 2), round(map$exc, 2)
) %>% lapply(htmltools::HTML)
lexc <- leaflet(map) %>%
addTiles() %>%
addPolygons(
color = "grey", weight = 1, fillColor = ~ pal(exc),
fillOpacity = 0.5,
highlightOptions = highlightOptions(weight = 4),
label = labels,
labelOptions = labelOptions(
style =
list(
"font-weight" = "normal",
padding = "3px 8px"
),
textsize = "15px", direction = "auto"
)
) %>%
addLegend(
pal = pal, values = ~exc, opacity = 0.5, title = "P(RR>2)",
position = "bottomright"
)
lexc
```

Figure 6.6 shows the map of the exceedance probabilities. This map provides evidence of excess risk within individual areas. In areas with probabilities close to 1 it is very likely that the relative risk exceeds 2, and areas with probabilities close to 0 correspond to areas where it is very unlikely that the relative risk exceeds 2. Areas with probabilities around 0.5 have the highest uncertainty, and correspond to areas where the relative risk is below or above 2 with equal probability. We observe that the counties in the north of Scotland are the counties where it is most likely that relative risk exceeds 2.

### References

Rue, Havard, Finn Lindgren, Daniel Simpson, Sara Martino, Elias Teixeira Krainski, Haakon Bakka, Andrea Riebler, and Geir-Arne Fuglstad. 2018. *INLA: Full Bayesian Analysis of Latent Gaussian Models Using Integrated Nested Laplace Approximations*.

Kim, Albert Y., and Jon Wakefield. 2018. *SpatialEpi: Methods and Data for Spatial Epidemiology*. https://CRAN.R-project.org/package=SpatialEpi.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, and Kara Woo. 2018. *Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics*. https://CRAN.R-project.org/package=ggplot2.

Cheng, Joe, Bhaskar Karambelkar, and Yihui Xie. 2018. *Leaflet: Create Interactive Web Maps with the Javascript ’Leaflet’ Library*. https://CRAN.R-project.org/package=leaflet.

Moraga, Paula. 2018a. “Small Area Disease Risk Estimation and Visualization Using R.” *The R Journal* 10 (1): 495–506. https://journal.r-project.org/archive/2018/RJ-2018-036/index.html.

Bivand, Roger. 2018. *Spdep: Spatial Dependence: Weighting Schemes, Statistics and Models*. https://CRAN.R-project.org/package=spdep.

Freni-Sterrantino, Anna, Massimo Ventrucci, and Håvard Rue. 2018. “A note on intrinsic conditional autoregressive models for disconnected graphs.” *Spatial and Spatio-Temporal Epidemiology*, no. 26: 25–34.