Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny describes spatial and spatio-temporal statistical methods and visualization techniques for analyzing georeferenced health data in R. After a detailed introduction to geospatial data, the book shows how to develop Bayesian hierarchical models for disease mapping and how to apply computational approaches such as Integrated Nested Laplace Approximations (INLA) and the Stochastic Partial Differential Equation (SPDE) approach to analyze areal and geostatistical data. These approaches make it possible to quantify disease burden, understand geographic patterns and changes over time, identify risk factors, and measure inequalities between populations. The book also shows how to create interactive and static visualizations such as disease maps and time plots, and describes several R packages that can be used to easily turn analyses into visually informative and interactive reports, dashboards, and Shiny web applications that facilitate the communication of insights to collaborators and policy makers.
The book features detailed worked examples of several disease and environmental applications using real-world data, such as malaria in The Gambia, cancer in Scotland and the USA, and air pollution in Spain. Examples in the book focus on health applications, but the approaches covered are also applicable to other fields that use georeferenced data, including epidemiology, ecology, demography or criminology. The book covers the following topics:
- Types of spatial data and coordinate reference systems,
- Manipulate and transform point, areal, and raster data,
- Retrieve high-resolution spatially referenced environmental data,
- Fit and interpret Bayesian spatial and spatio-temporal models with the R-INLA package,
- Model disease risk and quantify risk factors in different settings,
- Create interactive and static visualizations such as disease risk maps and time plots,
- Create reproducible reports with R Markdown,
- Develop dashboards with flexdashboard,
- Build interactive Shiny web applications.
The book uses publicly available data, and provides clear descriptions of the R code for data importing, manipulation, modeling and visualization, as well as the interpretation of the results. This ensures contents are fully reproducible and accessible for students, researchers and practitioners.
This book is primarily aimed at epidemiologists, biostatisticians, public health specialists, and professionals in government agencies working with georeferenced health data. Moreover, since the methods discussed in the book are applicable not only to health data but also to many other fields that deal with georeferenced data, the book is also suitable for researchers and practitioners in other areas, such as epidemiology, ecology, demography or criminology, who wish to learn how to model and visualize this type of data. The book is also appropriate for postgraduate students of statistics and epidemiology, or of other subjects with a strong statistical background.
Prerequisites and recommended reading
It is assumed that readers are familiar with R and with the basics of data analysis. R (https://www.r-project.org) is a free, open-source software environment for statistical computing and graphics with many excellent packages for importing and manipulating data, statistical modeling, and visualization. R can be downloaded from CRAN (the Comprehensive R Archive Network) (https://cran.rstudio.com). It is recommended to run R using the integrated development environment (IDE) RStudio, which can be freely downloaded from https://www.rstudio.com/products/rstudio/download. RStudio makes it easier to interact with R. It includes a console and a syntax-highlighting editor that supports direct code execution, as well as a variety of tools for plotting, history, debugging and workspace management.
Resources available for readers wanting to improve their R skills include Grolemund (2014), which provides a friendly introduction to R with hands-on examples. Books for readers already comfortable with R include Wickham and Grolemund (2016), which teaches how to do data science with R, and Wickham (2019), which is designed primarily for R users who want to improve their programming skills and understanding of the language. Excellent resources for learning how to handle, analyze, and visualize spatial and spatio-temporal data in R are Bivand, Pebesma, and Gómez-Rubio (2013), Lovelace, Nowosad, and Muenchow (2019), and the website https://www.r-spatial.org.
It is also recommended that readers have a working knowledge of linear models, generalized linear models, the Gaussian, Poisson and Binomial probability distributions, and Bayesian inference. Wang, Ryan, and Faraway (2018) covers a wide range of Bayesian regression models and gives detailed examples of how to fit them using INLA. Specific resources that focus on spatial and spatio-temporal modeling include Blangiardo and Cameletti (2015), which provides an introduction to the Bayesian approach and presents practical examples using real data problems. Krainski et al. (2019) describes the SPDE approach in detail and presents models that can deal with a variety of problems including multivariate data, measurement error, non-stationarity, and point process models. Further resources for learning INLA and SPDE can be found on the website http://www.r-inla.org/.
This book describes several R packages that can be used to easily turn our analyses into visually informative and interactive reports (Allaire et al. 2018), dashboards (Borges and Allaire 2017), and Shiny web applications (Chang et al. 2018). These tools facilitate communication with collaborators and allow stakeholders to understand our research and make informed decisions. Resources to deepen expertise in these packages can be found on the RStudio website https://www.rstudio.com/, which contains excellent tutorials, articles and examples on advanced concepts as well as information on hosting and deployment of web products.
Why read this book
Geospatial health data are essential to inform public health and policy across high-, middle-, and low-income countries. These data can be used to understand the burden and geographic patterns of disease, and can help in the development of hypotheses that relate disease risk to potential demographic and environmental factors.
This book shows how to apply cutting-edge spatial and spatio-temporal statistical methods to disease data to produce disease risk maps and quantify risk factors. Specifically, the book shows how to develop Bayesian hierarchical models and apply computational approaches such as INLA and SPDE to analyze data collected in areas (e.g., counties or provinces) and at particular locations by disease registries, national and regional institutes of statistics, and other organizations. These approaches make it possible to quantify the disease burden, understand geographic and temporal patterns, identify risk factors, and measure inequalities.
This book also provides the necessary tools to design and develop web-based digital applications such as disease atlases that incorporate interactive visualizations to make disease risk estimates available and accessible to a wide audience, including policy makers, researchers, health professionals, and the general public. These tools allow users to explore vast amounts of data in an interactive and approachable way by means of maps, time plots, tables and other visualizations that support interactive filtering and zooming over different regions and periods of time to display the information of interest. These tools are beneficial when trying to identify information for specific regions, compare risks between populations, and understand how disease patterns have changed over time.
The statistical methods and visualization techniques presented in this book are valuable for analyzing a wide range of conditions including infectious diseases, non-communicable diseases, injuries, and health-related behaviors, and for providing policy makers with actionable information for the development and implementation of appropriate population health policies.
Structure of the book
This book consists of three parts and an appendix. The first part provides an overview of geospatial health and the R-INLA package (Rue et al. 2018). The goal of this part is to provide background on geospatial data and computational methods that supports the subsequent chapters. Chapter 1 provides an overview of geospatial health and discusses methods for analysis and tools for communication of results. Chapter 2 reviews the basic characteristics of spatial data including areal, geostatistical and point pattern data, and introduces coordinate reference systems and geographical data storage formats. This chapter also presents R packages that are commonly used to create maps. Chapter 3 provides an introduction to Bayesian inference and to INLA, which performs approximate Bayesian inference in latent Gaussian models. This first part concludes with Chapter 4, which provides an overview of the R-INLA package. This chapter details how to use R-INLA to specify and fit models and how to interpret the results.
The second part of the book is devoted to modeling and visualization of both areal and geostatistical data. Health data that are aggregated over areas such as administrative divisions are common in public health surveillance. Examples include the number of disease cases or the number of road accidents in provinces. Chapter 5 introduces methods to analyze this type of data including spatial proximity matrices and Standardized Incidence Ratios (SIRs), and discusses common areal issues such as the Misaligned Data Problem (MIDP) and the Modifiable Areal Unit Problem (MAUP). This chapter also introduces Bayesian hierarchical models to obtain small area disease risk estimates in spatial and spatio-temporal settings. Chapter 6 provides an example of how to use INLA to obtain cancer risk estimates in Scottish counties and quantify risk factors. Chapter ?? uses a spatio-temporal model to obtain cancer risk estimates in Ohio counties across several years.
Geostatistical data are data about a spatially continuous phenomenon that have been collected at particular sites. Examples of this type of data are disease prevalence observations collected at specific villages using surveys, and air pollution levels measured at several monitoring stations. Chapter 7 shows how to develop spatial and spatio-temporal models that make it possible to predict at unsampled locations and times using the SPDE approach. Chapter 8 presents an example that predicts malaria prevalence in The Gambia using survey data and high-resolution environmental covariates. Chapter ?? shows how to model measurements of air pollution obtained at several monitoring stations in Spain across different years to produce continuous maps representing the spatial variation of air pollution over time. The examples presented in these chapters provide the R code needed for data importing, manipulation and modeling, and show how to create static and interactive visualizations such as maps and time plots of disease risk and risk factors using the R packages ggplot2 (Wickham, Chang, et al. 2018), gganimate (Pedersen and Robinson 2019), plotly (Sievert et al. 2018), leaflet (Cheng, Karambelkar, and Xie 2018), mapview (Appelhans et al. 2018) and tmap (Tennekes 2018).
A key aspect of geospatial research is to determine how to share the results of our analyses in a proper, timely and actionable way. The third part of the book describes several R packages that facilitate communication with collaborators and stakeholders. In Chapter 9 we introduce the package R Markdown (Allaire et al. 2018). This package enables the easy creation of high-quality, fully reproducible reports including narrative text, tables and visualizations, as well as the R code to generate them. While documents generated with R Markdown can be easily used to reproduce results and help other researchers determine how they were derived, they may not be the best tool for reporting to the relevant stakeholders. Stakeholders may not be interested in the statistical analyses, but they need to fully understand the results to support decision making. Dashboards can help to communicate large amounts of information visually and quickly and support data-driven decision making. In Chapter 10 we introduce the R package flexdashboard (Borges and Allaire 2017), which can be used to create dashboards in HTML format that contain visual displays of the most important information arranged on a single screen.
Interactive web applications are also essential tools for communicating information in an approachable and actionable way. In Chapter 11 we introduce the package shiny (Chang et al. 2018), which provides a framework to turn results into web applications that allow users to experiment with different data scenarios so that they can answer their own questions. For example, they can filter data to obtain specific summaries, or change several options to obtain different visualizations. In Chapter 12 we show how to create interactive dashboards with Shiny, and Chapter 13 describes how to build a Shiny app that allows users to upload and visualize spatio-temporal data. Chapter 14 presents SpatialEpiApp (Moraga 2017), a Shiny web application that allows users to visualize spatial and spatio-temporal disease data, estimate disease risk and detect clusters. Finally, Appendix A contains resources about R and shows the packages used in this book.
Grolemund, Garrett. 2014. Hands-on Programming with R. 1st ed. Sebastopol, California: O’Reilly. https://rstudio-education.github.io/hopr/index.html.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science. 1st ed. Sebastopol, California: O’Reilly. https://r4ds.had.co.nz/.
Wickham, Hadley. 2019. Advanced R. 2nd ed. Boca Raton, Florida: Chapman & Hall/CRC The R Series. http://adv-r.had.co.nz/.
Bivand, Roger, Edzer J. Pebesma, and Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis with R. 2nd ed. Springer. https://asdar-book.org/.
Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. Geocomputation with R. 1st ed. Boca Raton, Florida: Chapman & Hall/CRC. https://geocompr.robinlovelace.net/.
Wang, Xiaofeng, Yu Yue Ryan, and Julian J. Faraway. 2018. Bayesian Regression Modeling with INLA. 1st ed. Boca Raton, Florida: Chapman & Hall/CRC.
Blangiardo, Marta, and Michela Cameletti. 2015. Spatial and Spatio-Temporal Bayesian Models with R-INLA. 1st ed. Chichester, UK: John Wiley & Sons, Ltd.
Krainski, Elias T., Virgilio Gómez-Rubio, Haakon Bakka, Amanda Lenzi, Daniela Castro-Camilo, Daniel Simpson, Finn Lindgren, and Håvard Rue. 2019. Advanced Spatial Modeling with Stochastic Partial Differential Equations Using R and INLA. 1st ed. Boca Raton, Florida: Chapman & Hall/CRC. http://www.r-inla.org/spde-book.
Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, and Winston Chang. 2018. rmarkdown: Dynamic Documents for R. https://CRAN.R-project.org/package=rmarkdown.
Borges, Barbara, and JJ Allaire. 2017. flexdashboard: R Markdown Format for Flexible Dashboards. https://CRAN.R-project.org/package=flexdashboard.
Chang, Winston, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson. 2018. shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.
Rue, Håvard, Finn Lindgren, Daniel Simpson, Sara Martino, Elias Teixeira Krainski, Haakon Bakka, Andrea Riebler, and Geir-Arne Fuglstad. 2018. INLA: Full Bayesian Analysis of Latent Gaussian Models Using Integrated Nested Laplace Approximations.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, and Kara Woo. 2018. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Pedersen, Thomas Lin, and David Robinson. 2019. gganimate: A Grammar of Animated Graphics. https://CRAN.R-project.org/package=gganimate.
Sievert, Carson, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, and Pedro Despouy. 2018. plotly: Create Interactive Web Graphics via ’Plotly.js’. https://CRAN.R-project.org/package=plotly.
Appelhans, Tim, Florian Detsch, Christoph Reudenbach, and Stefan Woellauer. 2018. mapview: Interactive Viewing of Spatial Data in R. https://CRAN.R-project.org/package=mapview.
Tennekes, Martijn. 2018. tmap: Thematic Maps. https://CRAN.R-project.org/package=tmap.
Moraga, Paula. 2017. “SpatialEpiApp: A Shiny Web Application for the Analysis of Spatial and Spatio-Temporal Disease Data.” Spatial and Spatio-Temporal Epidemiology 23: 47–57. https://doi.org/10.1016/j.sste.2017.08.001.