A scabies mite swimming in mineral oil at 40x magnification.
Live mites on human skin (courtesy of Berghofer QIMR).
Getting started
This session, we will focus on modelling of count outcomes in epidemiology.
We will learn how to:
- Plot the relationships between count variables and other variables
- Run and interpret regression models
- Interpret the model fit and summary statistics of models
- Check models for over-dispersion
- Consider the effects of multi-collinearity on exposure-disease relationships
- Use regression nomograms to better understand and communicate the results of regression models
We will be leaning on the following libraries to do these analyses:
Background
We will be diving into a real-life research battleground today.
Research battle by Sander Meyer from unsplash.com
There is some controversy about the cause of acute rheumatic fever, which is an overall rare disease, but it occurs much more commonly in Māori and Pacific children than children of any other ethnic group in NZ.
This disease often presents with fever and inflamed joints and can lead to life-limiting sequelae, including thickened and dysfunctional heart valves. As if these weren't bad enough, it may lead to the need for cardiac surgery and, in some cases, premature death.
Thickened heart valves associated with rheumatic heart disease
The most prevalent explanation in the health sector is that it is caused by bacterial streptococcal throat infection.
An alternative explanation is that it is caused by the common skin condition scabies, whose skin appearance, shown below, is easy to miss. The little crops of raised lumps are scabies lesions, and they are caused by the microscopic mite pictured above.
Typical skin appearance of scabies lesions: image courtesy of Dr Daniel Engelman
The treatment for scabies in New Zealand is pictured below, and we will use records of pharmacy dispensing of this drug as an indirect measure of scabies prevalence in the community.
Permethrin lotion
The proposed mechanism by which scabies causes streptococcal infection and rheumatic fever is given below:
Consequences of scabies infection
We will be looking for evidence that prescribing of a certain treatment for scabies (the lotion permethrin) is linked with acute rheumatic fever at a suburb level (1 row of data for each suburb). Since the counts of acute rheumatic fever over a certain period constitute a 'count' variable, Poisson regression will be used. The population of the suburb will be used as an offset in the model. An offset supplies the denominator when rates are modelled: here, our outcome is effectively the count of acute rheumatic fever cases in a suburb divided by the population of that suburb. The count of cases is naturally influenced by the population, so this information is incorporated in the model as an offset, which allows us to model rates rather than counts. Technically, the offset is a covariate or independent variable whose beta coefficient is forced to be one.
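The offset idea can be sketched with a little simulated data (hypothetical numbers, not the tutorial dataset): a Poisson model of counts with log(population) as an offset recovers the underlying per-person rate.

```r
## Minimal simulated sketch of an offset (hypothetical data, not the
## tutorial dataset). The offset enters the linear predictor with its
## coefficient fixed at 1, so the model describes a rate:
##   log(E[cases]) = beta0 + 1 * log(population)
set.seed(42)
population <- round(runif(200, 500, 5000))            # suburb sizes
cases      <- rpois(200, lambda = 0.002 * population) # 2 cases per 1000

fit_offset <- glm(cases ~ 1, offset = log(population),
                  family = poisson(link = "log"))
exp(coef(fit_offset))  # estimated rate per person, close to 0.002
```

Because the offset's coefficient is fixed at 1, exponentiating the intercept returns the rate per unit of population rather than a raw count.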
Covariates, or potential confounders in the analysis include:
- The proportion of Māori and Pacific people living in a suburb
- Socioeconomic deprivation
We will use the counts of acute rheumatic fever by suburb (or census area unit). This information is derived from routine datasets used in the health sector. The formal write-up of the analysis is here.
Brief revision
Count regression is used when an outcome is an integer or count, rather than a continuous measure.
Examples of counts include:
- the number of hospital admissions per day
- the number of people who die each week
- the number of teeth in a person's mouth that are affected by decay
In comparison, continuous measures can include fractions or decimals, such as:
- height
- weight
- blood pressure
- age.
Count variables have special distributional characteristics that make linear regression unsuitable. First, the distribution is non-negative, asymmetric and right-skewed, which is why the model works on the natural logarithm of the count. Also, the variance is usually proportional to the mean, rather than constant, as is assumed for linear regression.
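A quick simulation (toy data, not the tutorial dataset) illustrates the variance point: for Poisson-type counts, the variance rises with the mean rather than staying constant as linear regression assumes.

```r
## For Poisson counts, the variance tracks the mean, violating the
## constant-variance (homoscedasticity) assumption of linear regression.
set.seed(1)
for (m in c(2, 10, 50)) {
  x <- rpois(10000, lambda = m)
  cat("mean =", round(mean(x), 2), "  variance =", round(var(x), 2), "\n")
}
```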
Lecture revision
Our outcome is acute rheumatic fever counts and we will build models to predict this outcome. For the following 3 questions, please answer True or False in response to each statement.
Question 1
Question 2
Question 3
Set-up
To set up our session for analysis and call the necessary libraries, we'll use pacman::p_load() instead of install.packages() and library(). The pacman library shortens this procedure: p_load() looks for a package and installs it if it can't be found, and it also puts the package on the search path, like the library() command does. We won't run this code, as it would mess up our tutorial, but it is essential for getting this running on a local machine or Posit Cloud.
if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse,
epiDisplay,
visdat, # missing data visualisation
skimr, # summarise data frame
lattice, # basic histograms
jtools, # exponentiate beta coefficients
ggstance,
broom.mixed,
finalfit, # final tables
dplyr,
kableExtra,
car,
MixAll,
regplot, # nomograms
PerformanceAnalytics,
corrplot,
performance) # overdispersion check
View the dataframe, called for_shiny. Notice that there's 1 row for each suburb:
for_shiny
for_shiny |> names()
Rename our dataframe as df. Then rename the variables so they're more understandable. Although this seems tedious, it really makes life easier and reduces confusion.
df <- for_shiny # Rename dataframe
# Rename variables
names(df) <- c("arf_count", "population", "Pacific_proportion",
"Maori_proportion", "Permethrin_rate", "Suburb",
"nzdep_decile", "arf_rate")
df |> names() # Check that the variables have been renamed
Data integrity checks
1) Duplicates
There should be 1 row for each Suburb, so check whether this variable is duplicated.
stopifnot(!anyDuplicated(df$Suburb))
If Suburb is duplicated, the R output would read: !anyDuplicated(df$Suburb) is not TRUE.
Are there any duplicates?
2) Ranges
df |> skim()
Are there any implausible extreme values (check p0 [minimum] and p100 [maximum])?
Question 4
From the summary data above, what is the variance of arf_count? Note, variance is standard deviation squared.
var(df$arf_count)
::: {#variance_2-hint} Hint: Variance = (standard deviation)^2 :::
3) Missing data
df |> vis_miss(sort_miss = TRUE)
Do you see any missing data?
Univariable distributions: histograms
Use the histogram() function for these.
arf_count
df$arf_count |> histogram()
arf_rate
df$arf_rate |> histogram()
Question 5
A formal check for overdispersion in Poisson regression models can be performed by running the check_overdispersion(x) function from the performance package. The argument x is the Poisson regression model object, which will be fitted later. We will see that even though the variance is greater than the mean, a formal statistical test of the difference between the conditional (on covariates) mean and variance of arf_count from the model is not statistically significant.
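As a quick illustration of what this check does (a toy simulated model, not the tutorial one), the dispersion ratio for genuinely Poisson data should sit near 1 and the test should not be significant.

```r
## Toy sketch of the overdispersion check on simulated Poisson data
## (the real check on the tutorial model comes later).
set.seed(9)
y <- rpois(500, lambda = 4)            # genuinely Poisson counts
m <- glm(y ~ 1, family = poisson)
performance::check_overdispersion(m)   # dispersion ratio should be near 1
```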
Question 6
Next, we will check the distribution of our exposure
Exposure: Permethrin_rate
df$Permethrin_rate |> histogram()
Question 7
Then the distribution of our confounders. Adjust the following code to plot histograms of the following three variables:
df$Pacific_proportion
df$Maori_proportion
df$nzdep_decile
df$Pacific_proportion |> histogram()
df$Maori_proportion |> histogram()
df$nzdep_decile |> histogram()
::: {#histogram_6-hint} Hint: You need to use the histogram() function :::
Question 8
For nzdep_decile, which has 10 categories (last histogram above), we'll combine consecutive pairs to create quintiles (5 groups). The upper break is included in its category. So, for example, the first category includes any value from negative infinity (-Inf) up to 2. Any value greater than 2 falls into the next category, and so on.
# 5 groups instead of 10
df$nzDepQuint <- cut(df$nzdep_decile, breaks = c(-Inf, 2, 4, 6, 8, Inf),
labels = c("1 to 2 (least deprived)",
"3 to 4",
"5 to 6",
"7 to 8",
"9 to 10 (most deprived)"))
# Check the frequency of our 5 groups
epiDisplay::tabpct(df$nzDepQuint, df$nzdep_decile)
Bivariate associations: scatterplots
Before undertaking any regression analysis, it is useful to visualise the relationships between key exposure, confounding and outcome variables.
Use the car::scatterplot() function.
Exposure and outcome
car::scatterplot(arf_rate ~ Permethrin_rate, data = df)
Question 9
Question 10
Complete the following lines of code to make scatterplots between covariates and the outcome arf_rate:
arf_rate ~ nzdep_decile
arf_rate ~ Pacific_proportion
arf_rate ~ Maori_proportion
::: {#scatterplot_2-hint} Hint: You need to use the car::scatterplot() function. You'll also need to fill the data = ... argument, so that R knows which variables in which data.frame you are referring to. :::
car::scatterplot(arf_rate ~ nzdep_decile, data = df)
car::scatterplot(arf_rate ~ Pacific_proportion, data = df)
car::scatterplot(arf_rate ~ Maori_proportion, data = df)
Question 11
Quick method to make several scatterplots
There is a method of making these scatterplots in one line of code!
Use the PerformanceAnalytics::chart.Correlation() function to get all correlations in one chart. Here, we select five continuous variables from df and put them into a data.frame called my_data. Then, we apply our PerformanceAnalytics::chart.Correlation() function to my_data to get histograms, scatter plots and Pearson correlation coefficients.
## Select variables of interest into new data.frame
my_data <- df[, c("Pacific_proportion",
"Maori_proportion",
"Permethrin_rate",
"nzdep_decile",
"arf_rate")]
## Chart the correlations and regressions
## and distributions of each variable.
PerformanceAnalytics::chart.Correlation(my_data,
histogram = TRUE,
pch = 19)
Question 12
Regression
We will build a Poisson regression model with our outcome as arf_count and exposure as Permethrin_rate. To adjust for population size, we will include log_population as an offset. We will also adjust for NZDep quintile (nzDepQuint), Pacific_proportion and Maori_proportion.
Run the following code using the glm() (generalised linear model) function to make a Poisson regression model of arf_count with all other risk factors as independent variables or covariates.
## Make offset variable
df$log_population <- log(df$population)
## Build model
fit <- glm(arf_count ~ Permethrin_rate +
nzDepQuint +
Pacific_proportion +
Maori_proportion,
offset = log_population,
family = poisson(link = "log"),
data = df)
## Print summary of model
summary(fit)
## check for overdispersion
performance::check_overdispersion(fit)
Look at the term offset = log_population in the model. With counts (arf_count) as the outcome, having log_population as the offset makes the model represent a rate (count per unit population). The overdispersion test assesses whether the residual deviance differs substantially from its degrees of freedom. Here, the p-value is greater than 0.05, so the test is not significant. Hence, we can stick with our Poisson model.
Check your understanding.
Question 13
Question 14
While the exponentiated coefficient for permethrin doesn't seem large (1.000294 is a 0.0294% increase for a one-unit increase in permethrin dispensing), the nature of the association can only be understood by also knowing how much permethrin dispensing varies in the population.
From our skim() output earlier, we can see that the minimum value of permethrin dispensing is 0 and the maximum is 3,201. Therefore, the rate ratio for a change of 3,000 is exp(0.000294 * 3000), which equals 2.41. Comparing the lowest to the highest suburbs thus corresponds to a 100 * (2.41 - 1) = 141% increase in acute rheumatic fever incidence. That apparently weak association is now looking very strong.
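This back-of-envelope calculation can be reproduced directly in R (the coefficient 0.000294 is taken from the model output above):

```r
## Rate ratio across (roughly) the observed range of permethrin dispensing
beta <- 0.000294     # slope for Permethrin_rate from the model above
span <- 3000         # approximately min (0) to max (3201)
irr  <- exp(beta * span)
irr                  # about 2.41
100 * (irr - 1)      # about a 141% increase in incidence
```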
Use the jtools::summ() function, which is called on our model fit object:
jtools::summ(fit, exp = TRUE)
This output includes an exp(Est.) which shows exponentiated beta coefficients or slope terms.
These are interpreted like risk ratios, where a value greater than one indicates higher incidence of the outcome with increasing exposure and a value less than one indicates lower incidence with increasing exposure. The correct term for these is incidence rate ratios.
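If you prefer base R, exponentiating the coefficients and their profile confidence intervals gives the same incidence rate ratios. Here is a sketch on a small simulated Poisson model (hypothetical data, not the tutorial model); for the tutorial, you would replace fit_demo with fit.

```r
## Base-R alternative to jtools::summ(model, exp = TRUE), shown on a
## toy simulated Poisson model.
set.seed(5)
x        <- rnorm(300)
y        <- rpois(300, lambda = exp(0.2 + 0.4 * x))
fit_demo <- glm(y ~ x, family = poisson)

## Exponentiated coefficients (IRRs) with 95% profile confidence intervals
exp(cbind(IRR = coef(fit_demo), confint(fit_demo)))
```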
Question 15
Question 16
We saw before that, for the outcome, there's some evidence of over-dispersion. A formal check for this was negative, so here a Poisson model is sufficient. If the performance::check_overdispersion() function returned a significant \(P\)-value (< 0.05), we would have to think about relaxing the strict assumptions of the Poisson model relating to the true mean and variance of the outcome variable being equal.
In such a situation, a quasi-Poisson or negative binomial model would be useful.
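The negative binomial option isn't covered by our p_load() list above, so here is a hedged sketch using MASS::glm.nb() on simulated overdispersed counts (hypothetical data, not the tutorial data frame). For the tutorial model you would keep the same formula and offset and simply swap glm(..., family = poisson(...)) for MASS::glm.nb(...).

```r
## Sketch of a negative binomial model with MASS::glm.nb() (note: MASS
## is not in the p_load() call above, so load it first). glm.nb()
## estimates an extra dispersion parameter, theta, so the variance can
## exceed the mean.
library(MASS)
set.seed(7)
pop    <- round(runif(300, 500, 5000))
counts <- rnbinom(300, mu = 0.002 * pop, size = 1.5)  # overdispersed

fit_nb <- glm.nb(counts ~ 1 + offset(log(pop)))
exp(coef(fit_nb))  # recovers the underlying rate (about 0.002)
fit_nb$theta       # the estimated dispersion parameter
```

The coefficients are interpreted exactly as in the Poisson model; only the variance assumption is relaxed.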
Please change the code below from a Poisson to a quasi-Poisson model. This can be achieved by changing poisson to quasipoisson in the code below and running it (please use fit_qp as the new model object).
# fit model
fit1 <- glm(arf_count ~ I(Permethrin_rate/1000) +
nzDepQuint +
Pacific_proportion +
Maori_proportion + offset(log(population)),
family = poisson(link = "log"),
data = df)
# view characteristics of model
summary(fit1)
# fit model
fit_qp <- glm(arf_count ~ I(Permethrin_rate/1000) +
nzDepQuint +
Pacific_proportion +
Maori_proportion + offset(log(population)),
family = quasipoisson(link = "log"),
data = df)
# view characteristics of model
summary(fit_qp)
::: {#regression_6-hint} Hint: Please change the family argument to quasipoisson, and change the name of the model object from fit1 to fit_qp. :::
Question 17
Using the code format below, exponentiate your \(\beta\) coefficients so that you get interpretable incidence rate ratios.
# jtools::summ(model_object, exp = logical_value)
jtools::summ(fit_qp, exp = TRUE)
::: {#regression_7-hint} Hint: What is the new model object's name? Also, because the outcome is modelled on a natural-logarithm scale, we set the exp argument to TRUE so that the beta coefficients are exponentiated. :::
Use the jtools::plot_summs() function to plot the incidence rate ratios:
jtools::plot_summs(fit_qp, exp = TRUE)
You should get a graph showing incidence rate ratios:
- Circles for the estimates
- Horizontal lines for the 95% confidence intervals
- A dotted vertical line for a ratio of 1: the null value
Which variables are significantly associated with the outcome?
Multi-collinearity
When 2 or more predictors in a model are strongly correlated with one another, this is called multi-collinearity. It can be problematic, since it:
- increases the variance of estimates of regression coefficients
- makes these estimates unstable and unreliable
We can detect it by first looking for strong correlations between variables.
Use the cor() function to estimate the correlation between Pacific_proportion and Permethrin_rate:
cor(df$Pacific_proportion, df$Permethrin_rate)
You should get a correlation of 0.88, which is strong. Thus, multi-collinearity in the predictor variables may be a problem for our analysis.
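Another standard diagnostic, not used in the tutorial itself, is the variance inflation factor from car::vif() (the car package is already loaded above). As a rough rule of thumb, VIF values above about 5 to 10 flag problematic collinearity. A sketch on simulated correlated predictors:

```r
## VIF sketch on simulated data with two strongly correlated predictors
set.seed(3)
x1 <- rnorm(200)
x2 <- 0.9 * x1 + 0.1 * rnorm(200)        # nearly a copy of x1
y  <- rpois(200, lambda = exp(0.5 + 0.3 * x1))

m  <- glm(y ~ x1 + x2, family = poisson)
car::vif(m)                               # both VIFs will be very large
```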
To deal with this, we can remove one of these predictors from the model. So that we retain our exposure (Permethrin_rate) in our model, let’s remove Pacific_proportion by adding # before it.
Another change is that we'll divide Permethrin_rate by 1000. This will increase the coefficient for this variable by 1000 so that it’s easier to interpret. (This is like converting grams into kilograms; the latter is more meaningful for larger weights). Below, we use the I() function to do this calculation within the 'model' argument.
Now run our revised code below:
df$log_population <- log(df$population)
fit3 <- glm(arf_count ~ I(Permethrin_rate/1000) +
nzDepQuint +
# Pacific_proportion +
Maori_proportion, offset = log(population),
family = poisson(link = "log"),
data = df)
summary(fit3)
What is the beta-coefficient or slope for the Permethrin_rate/1000 term?
Is it statistically significant?
What is the incidence rate ratio for I(Permethrin_rate/1000)? We need to exponentiate the beta-coefficient.
This next model has Pacific_proportion included again, so we can test for the effect of multi-collinearity. Run the code below.
fit5 <- glm(arf_count ~ I(Permethrin_rate/1000) +
nzDepQuint +
Pacific_proportion +
Maori_proportion, offset = log_population,
family = poisson(link = "log"),
data = df)
summary(fit5)
Because of multicollinearity, we expect the standard error for I(Permethrin_rate/1000) to increase after Pacific_proportion has been added. Has it?
Regression nomogram
This is a visual summary of the structure of a model. We will display regression nomograms by suburb. First, let's print the names of the Suburbs in our dataset to see if there is one we recognise.
df$Suburb |> print()
The code below will run in Posit, but unfortunately not on this platform, so we will just show you the code and the output. We have plotted the values for Otara East, highlighted on the figure in red.
regplot::regplot(fit,
observation = df[df$Suburb == "Otara East",], clickable = FALSE)
Regplot
For Otara East, for example, we see that Pacific_proportion is about 0.8 or 80%. Also, the dispensing rate of permethrin (Permethrin_rate) is relatively high, as expected from the correlation between this variable and Pacific_proportion discussed earlier. Each variable contributes a score (the β(X-m) terms), and these scores sum to produce a predicted count of acute rheumatic fever (arf_count). Note that for Otara East, the predicted arf_count is very high.
Please attempt the next 2 questions.
Question 18
(Hint: for the β(X-m) terms, identify the red dot that's furthest from 0, then read vertically down from it.)
Question 19
regplot::regplot(fit, observation = df[df$Suburb == "Remuera South",])
Regplot2
Model fit
How well a model fits the data can be assessed with a 'marginal model plot', which shows the relationship between model predictions (Model) and observed data (Data). In a well-fitting model, we expect to see the two lines closely superimposed. Poor fit shows up as separation between the two lines, which may indicate over- or under-prediction by the model compared to the data.
Run the following code on our fit model object:
## The assign commands are only needed in
## the shiny app to get the marginalModelPlot function working
assign("fit", fit, envir = .GlobalEnv)
assign("df", df, envir = .GlobalEnv)
## model fit plot----
car::marginalModelPlots(fit)
What is the nature of the overall fit of the data to the model?
- Good fit is indicated by good alignment between Data (blue line) and the Model (red line)
- Poor fit is indicated by poor alignment between these lines
- Often, for more extreme or outlying values of an exposure, there is deviation of the Data from the Model. This may not invalidate your model, but would be important to be aware of, and may be worthwhile discussing in your results.
Question 20
Publication-quality regression tables
In published articles, it's common to present univariable (unadjusted) and multivariable (adjusted) regression coefficients in a table. To create these, run the following code.
df$Permethrin_rate_per_1000 <- df$Permethrin_rate/1000 # Like converting grams into kilograms
df$log_population <- log(df$population) # Offset
dependent <- "arf_count"
explanatory <- c("Pacific_proportion", "Maori_proportion",
"Permethrin_rate_per_1000","nzdep_decile")
# `fit_uni` creates univariable (unadjusted) models
fit_uni = df |>
glmuni(dependent, explanatory, family = quasipoisson,
offset = df$log_population) |>
fit2df(estimate_name = "Rate ratio (univariable)")
# `fit_multi` creates a multivariable (adjusted) model
fit_multi = df |>
glmmulti(dependent, explanatory, family = quasipoisson,
offset = df$log_population) |>
fit2df(estimate_name = "Rate ratio (multivariable)")
df |>
## Crosstable
summary_factorlist(dependent, explanatory, cont = "median", fit_id=TRUE) |>
#### Add univariable table
ff_merge(fit_uni, estimate_name = "Rate ratio") |>
#### Add multivariable table
ff_merge(fit_multi, estimate_name = "Rate ratio",
last_merge = TRUE) |>
dependent_label(df, dependent) -> regression_table
regress_tab <- knitr::kable(regression_table, row.names = FALSE,
align = c("l", "l", "r", "r", "r"))
#### Puts output into viewer so that it can be cut and pasted into Excel,
#### then Word.
regress_tab |> kableExtra::kable_styling()
Notice that the adjusted Rate ratio (multivariable) for Permethrin_rate_per_1000 is not statistically significant. Does that mean that permethrin or scabies is not related to acute rheumatic fever incidence?
Summary
We have learned a number of things related to count models.
- The outcome is a count (non-negative integer) and this is transformed in the model using a natural logarithm.
- We use offsets to convert count outcomes to rates by incorporating the denominator into our model.
- The beta coefficients or slopes for each exposure or confounder term need to be exponentiated to be readily interpretable as incidence rate ratios, which are interpreted similarly to risk ratios or odds ratios (jtools::summ(model, exp = TRUE)).
- We usually start with a Poisson model and check for over- or under-dispersion (mean and variance unequal); we may use a quasi-Poisson or negative binomial model to accommodate this feature of the data.
- We can check model fit using marginal model plots.
- Collinearity, or strong associations among exposure or independent variables, can be a feature of any regression analysis and can be checked for using correlation coefficients and scatterplots (PerformanceAnalytics::chart.Correlation()). The inclusion or exclusion of collinear variables can make a big difference to the results of our regression!
- We often present models in scientific papers by comparing crude and adjusted incidence rate ratios (finalfit).
🔑 Key R libraries
- car - for scatterplots and marginal model plots (model diagnostics)
- jtools - for displaying exponentiated coefficients and plotting rate ratios
- performance - for checking overdispersion in Poisson models
- PerformanceAnalytics - for correlation matrices with histograms and scatterplots
- regplot - for creating regression nomograms (visual model summaries)
- finalfit - for publication-ready regression tables
- skimr - for comprehensive data summaries
- lattice - for histograms
Essential R commands
Data exploration:
- skim() - get comprehensive summary statistics including mean, variance, and ranges
- histogram() - visualise distributions of count and continuous variables
- car::scatterplot(y ~ x, data = df) - create scatterplots with regression lines
- cor(x, y) - calculate Pearson correlation between two variables
- PerformanceAnalytics::chart.Correlation() - create a correlation matrix with histograms and scatterplots

Building count regression models:
- glm(y ~ x, family = poisson(link = "log"), data = df) - fit Poisson regression
- glm(y ~ x + offset(log(denominator)), family = poisson()) - Poisson with offset for rates
- glm(y ~ x, family = quasipoisson(link = "log"), data = df) - quasi-Poisson for overdispersion
- summary(model) - display model coefficients and statistics
- exp(coef) - exponentiate coefficients to get incidence rate ratios manually

Model interpretation:
- jtools::summ(model, exp = TRUE) - display exponentiated coefficients (incidence rate ratios)
- jtools::plot_summs(model, exp = TRUE) - plot incidence rate ratios with confidence intervals
- predict(model, newdata = data.frame(...)) - predict outcomes for specific values

Model diagnostics:
- performance::check_overdispersion(model) - test for overdispersion in Poisson models
- car::marginalModelPlots(model) - check model fit with diagnostic plots
- regplot::regplot(model, observation = ...) - create a regression nomogram for specific observations

Publication tables:
- finalfit::summary_factorlist(dependent, explanatory, fit_id = TRUE) - create Table 1 style summaries
- knitr::kable() and kableExtra::kable_styling() - format tables for export
Key Concepts
- Offset: offset(log(population)) - accounts for varying population sizes when modelling rates
- Incidence Rate Ratio (IRR): exp(β) - ratio of rates comparing exposed to unexposed
- Overdispersion: when variance > mean in count data, requiring quasi-Poisson or negative binomial models
- Collinearity: high correlation between predictors can distort regression results
- Formula notation: y ~ x1 + x2 + offset(log(n)) - outcome, exposures, and offset
Homework
Select only suburbs that have a socioeconomic deprivation of 6 or more:
df_dep <- df[df$nzdep_decile >= 6, ]
Then, for this subgroup, repeat the analyses we’ve covered in this tutorial.
Please have a go at these tasks in Posit (desktop) or Posit Cloud.
