

[Image: marijuana joint]



If you have a few minutes, this is worth watching on the subject of “how marijuana works”, by National Geographic. It is a useful introduction to what we’ll be studying today.

This session, we will focus on modelling of binary outcomes in epidemiology. These are very common, since we often deal with hypotheses about risk factors or causes of diseases that a person either does or does not have.

We will learn how to:

  • Plot relationships with binary variables
  • Run and interpret logistic regression models
  • Use regression nomograms to better understand regression models
  • Interpret the model fit

Introduction to logistic regression

We have, by now, covered linear regression which is used when an outcome variable is continuous. For count outcomes, we use Poisson regression and its derivatives, where the outcome variable is a count or integer. For outcomes that are binary (“yes” or “no” for example), logistic regression is frequently the answer.

Mathematical theory

Logistic models rely on a mathematical trick to match the two sides of a regression equation: the right-hand side,

\(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_z x_z\), with \(z\)

exposures or confounders (covariates), can take any real value (positive or negative, with decimals), whereas the outcome is a probability (constrained to be between zero and one).

This is accomplished by the magical logit transformation of the outcome. The logit transformation or log(odds) is:

\(\text{logit}(p) = ln\left( \frac{p}{1-p} \right)\)

Where \(p\) is the probability of disease: the probability that the binary outcome variable equals 1 (presence) rather than 0 (absence).

The logistic model is therefore:

\(\text{logit}(p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_z x_z + \epsilon\)

Notice, we’ve added the \(\epsilon\) or error term here to describe the distance between observed and model predicted data (residual).
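In base R, qlogis() and plogis() implement the logit and inverse-logit functions, so the transformation can be explored directly:

```r
## logit: maps a probability in (0, 1) to any real number
qlogis(0.5)              # 0
qlogis(c(0.1, 0.9))      # -2.197 and 2.197

## inverse logit: maps any real number back into (0, 1)
plogis(0)                # 0.5
plogis(c(-10, 10))       # very close to 0 and 1, but never outside (0, 1)
```

Whatever value the linear predictor takes, the inverse logit always returns a valid probability.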

Interpreting \(\beta\) coefficients

The main thing to remember with logistic models is that the \(\beta\) coefficients need to be exponentiated to be easily interpretable in epidemiological terms. They are interpreted as odds ratios which describe the increased or decreased odds of the outcome \(p\) for a one unit change in the exposure or covariate \(x\). This has parallels with linear and count regressions.

Remember that odds are simply a transformation of risk:

\(\text{odds} = \frac{p}{1 - p}\)

Note also, that on a logit scale, the logistic regression function is linear. This is nicely illustrated in the following shiny app, by ticking the Plot in logit domain check box.

Statistical tests of associations for \(\beta\) coefficients are based on a null-hypothesis value of 0. Exponentiated, this corresponds to an odds ratio of 1, which is, of course, the null value of the odds ratio.
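As a quick illustration (simulated data, not NHANES), exponentiating the coefficient from a logistic model recovers the odds ratio:

```r
## Simulate a binary exposure and outcome with known risks
set.seed(42)
exposure <- rbinom(5000, 1, 0.5)
p <- ifelse(exposure == 1, 0.4, 0.2)   # risk 0.4 if exposed, 0.2 if not
outcome <- rbinom(5000, 1, p)

## Fit a logistic model and exponentiate the coefficient
fit <- glm(outcome ~ exposure, family = binomial(link = "logit"))
exp(coef(fit)[["exposure"]])   # close to the true OR, (0.4/0.6)/(0.2/0.8), about 2.67
```

The exponentiated coefficient estimates the ratio of the odds of the outcome in the exposed versus the unexposed.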

Getting started

We will be leaning on a number of libraries to do these analyses (loaded in the Loading libraries and data section below).

Background

Marijuana by Matthew Brodeur from unsplash.com

Imagine that we have been assigned a task to consider risk factors for marijuana use, and we have in mind some programmes to help reduce harmful use of the drug, but we need to know where to target this resource. How can we do that?

Well, we have a survey of marijuana use in a representative population, so we will analyse the data to see what the major risk behaviours are.

This is available in the NHANES US national health survey.

What sort of factors do you think are likely to increase the risk of taking recreational marijuana?

Summary of analysis

  1. Load libraries and import data (here the data is imported from an installed library).

  2. Check data integrity (we will skip through this)

  3. Subset data. It doesn’t make sense to ask very young children about their marijuana use! In an epidemiological study, the denominator should consist of people at risk of the event of interest!

  4. Plot associations between various demographic and behavioural risk factors and marijuana use, and conduct crude analyses using odds ratios. We’ll be using histograms, boxplots (lattice package) and Euler plots (eulerr package). Check the bivariable associations and correlations between variables (PerformanceAnalytics::chart.Correlation()). For this, we will need to convert character or factor variables ("Yes" or "No", or TRUE or FALSE) to numeric variables (1 or 0).

  5. If writing a paper, you would normally make a ‘Table 1’ in which the various risk factors are divided by the outcome, with column percentages and appropriate statistical tests. We will skip through this, but include it for completeness. It gives similar information to the previous point, but limited to associations between exposures or confounders and the outcome, rather than also between exposures and confounders. We will use the arsenal::tableby() function.

  6. Logistic regression: use finalfit() and regplot() to finalise and visualise models. jtools is useful for making initial model output interpretable!

  7. Check model fit (car::marginalModelPlots()) and adjust the model if necessary. The Hosmer-Lemeshow test is also relevant for checking the calibration between observed and predicted values.

  8. Interpret the study results. Which variables are significantly associated with the outcome? Could there be explanations other than a causal one? Could it be: reverse causation, measurement error, social desirability bias?

Loading libraries and data

Again, we use the pacman library to load other libraries.

if(!require(pacman)) install.packages("pacman")


pacman::p_load(magrittr, epiDisplay, tidyverse,
               rio, NHANES, visdat, skimr, finalfit, knitr,
               kableExtra, readr, dplyr, lattice, jtools,
               ggstance, broom.mixed, car, eulerr, regplot,
               AF, # Model based attributable fractions
               glmtoolbox, # Hosmer-Lemeshow test
               PerformanceAnalytics, # correlations
               arsenal) # Making tables

## Respond with a "Y" if asked for a personal library.
update.packages(ask = FALSE)

Run the next line of code to increase the width of the output in the console.

## Setup work space
options(width = 600)

Load our data, which is the NHANES dataset.

## Load data----
data("NHANES")

Use ? to find out more about this dataset.

?NHANES

It doesn’t really make sense to ask children about marijuana use. So we’ll focus on those aged 20 or over. Run this line of code, which creates a new dataset called Na that includes only those aged ≥20 years.

## Restrict to adults
Na <- NHANES |> subset(Age >= 20) |> data.frame()

Univariable distributions and associations

In this section, we’ll look at the distribution of age and behaviour group, and see what groups are associated with marijuana use.
Use the epiDisplay::tab1() function to view the group sizes for our outcome, Marijuana.

Na$Marijuana |> tab1()

Notice that there are a lot of missing responses (NAs).
Remove these with the next line of code, which creates a new data.frame called M:
- is.na() selects NA (missing) values and ! means “not”
- So, !is.na() selects non-missing values.

Restrict to those with complete responses to Marijuana variable

M <- Na |> subset(!is.na(Na$Marijuana))

Use the skim() function to get a summary of our dataset

M |> skim()

Check for any implausible values (check p0 and p100).

Outcome: marijuana use

We just removed missing responses for Marijuana. Check if they’ve been removed.

M$Marijuana |> tab1()

Do depressed people self-medicate with marijuana?

Inspect the groups for the Depressed variable.

M$Depressed |> tab1()

You should get three depression groups: None, Several and Most, indicating the number of days in the last month that the subject felt “down”, “depressed” or “hopeless”.

Use the str() function to find out what type of variable our outcome is. For our analyses, it should be of the type, factor. Check that.

M$Marijuana |> str()
Question 1

The cc() function (from the epiDisplay package) calculates and displays odds ratios.
Use this to look at the relationship between Depressed (exposure) and Marijuana (outcome).

cc(M$Marijuana, M$Depressed) # Outcome first, exposure second

The P-values that we’ve previously dealt with in this course compare 2 groups. This P-value is different because we have 2 groups for marijuana and 3 groups for depression – but just one P-value. So the P-value is not for comparing 2 groups. Instead, what it’s testing has a special name called the main effect or overall effect: this is for the overall association between the two variables. As it’s almost zero, it means that, overall, depression is associated with marijuana use.
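As a sketch of this idea (toy counts, not the NHANES data), a chi-square test on a 2 × 3 table returns a single P-value for the overall association:

```r
## Toy 2 x 3 table: marijuana use (rows) by depression group (columns)
tab <- matrix(c(120, 45, 30,    # marijuana = No
                 80, 55, 40),   # marijuana = Yes
              nrow = 2, byrow = TRUE,
              dimnames = list(Marijuana = c("No", "Yes"),
                              Depressed = c("None", "Several", "Most")))

chisq.test(tab)   # one P-value for the overall (main) effect, on 2 degrees of freedom
```

However many groups each variable has, the test gives one P-value for the hypothesis that the two variables are unrelated.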

Question 2
Question 3

Do men or women tend to take marijuana?

Question 4

Use the epiDisplay::cc() function to show the relationship between gender (exposure) and marijuana use (outcome).

# cc(dataset$outcome, dataset$exposure)
cc(M$Marijuana, M$Gender)

Hint: What are dataset, outcome and exposure for this question? Use the names() function to list the variable names.

The graph shows that female is the reference group as the red arrow points away from it.

Is age associated with marijuana?

As age is a continuous variable, we’ll use a box and whisker plot to visualise its association with our outcome. For this, we’ll use the lattice::bwplot function, which we’ve used in a previous tutorial.

lattice::bwplot(Age ~ Marijuana, data = M)
Question 5

Is BMI associated with marijuana?

Question 6

Create a plot with lattice::bwplot() to show the relationship between BMI (exposure) and Marijuana use (outcome).

# lattice::bwplot(variable1 ~ variable2, data = dataset)
lattice::bwplot(BMI ~ Marijuana, data = M)

Hint: What are variable1, variable2 and dataset for this question? Use ?lattice::bwplot if needed

Question 7

Are smokers more likely to take marijuana?

The question for SmokeNow is only asked if a person smokes, and it tells us whether a smoker is an ex-smoker or a current smoker. As missing values indicate that the person doesn’t smoke, we need to replace these with a group called Never, which indicates that the person is a non-smoker (never smoked). To do this, run this code.

## Convert SmokeNow from factor to character
M$SmokeNow <- M$SmokeNow |> as.character()

## Missing data for SmokeNow indicate 'never smokers'
M$SmokeCurrent <- ifelse(is.na(M$SmokeNow), "Never", M$SmokeNow)

M$SmokeCurrent |> tab1()

Unfortunately, many of the epidemiological tools for analysing \(n \times 2\) tables fall over for SmokeCurrent and Marijuana, because they rely on Fisher’s exact tests, which are computationally intensive. We will just use a simple table and a chi-square test instead.

tab1 <- arsenal::tableby(Marijuana ~ # column
                            SmokeCurrent,
                            cat.stats="countrowpct", #row percents are best for estimating risk ratios.
                            data = M)

summary(tab1, text = TRUE)

You can use the table for a quick mental calculation of risk ratios. For example, you can see that 81% of current smokers (SmokeCurrent = "Yes") are marijuana users, compared to 42.7% of never smokers (SmokeCurrent = "Never"), so the risk ratio is about 81%/42.7% \(\approx 2\). The \(P\)-value for the association is low (< 0.001), so the association between the two behaviours is statistically significant.

Question 8

What about alcohol?

There are two alcohol variables and the first one we’ll look at is AlcoholYear: the number of days over the past year that the participant drank alcohol. This is a continuous variable - so use a box plot to visualise its relationship with the categorical outcome Marijuana use.

lattice::bwplot(AlcoholYear ~ Marijuana, data = M,
                ylab = "Average number of days in year alcohol consumed")

The second alcohol variable we’ll consider is AlcoholDay: the average number of drinks consumed on days that a participant drank alcohol. This is also a continuous variable - so use a box plot.

Notice that individuals who responded to AlcoholYear with a 0 have a missing (NA) value for AlcoholDay. So we will have to correct AlcoholDay with the code below.

## These values are missing but should rather be zero
M[M$AlcoholYear == 0,]$AlcoholDay |> summary()

## Change them to zero rather than missing
M$AlcoholDay_corrected <- ifelse(M$AlcoholYear == 0, 0, M$AlcoholDay)

M$AlcoholDay_corrected |> summary()

lattice::bwplot(AlcoholDay_corrected ~ Marijuana, data = M,
                ylab = "Average number of alcoholic drinks consumed 
                in a drinking session")
Question 9

With the right-skewed nature of the plot, we can use a log transformation to achieve a more symmetric distribution and better compare the two distributions.

We can confirm the need for a log transformation with the following code, which tests for the optimum power transformation \(\lambda\) to achieve normality for the distribution.

A significant \(P\)-value means we reject the proposed value of \(\lambda\): the test considers both \(\lambda = 0\) (meaning a log transformation is appropriate) and \(\lambda = 1\) (meaning no transformation is necessary).

Notice that for log transformations, we often need to add a small number to avoid negative values.

car::powerTransform(I(AlcoholDay_corrected + 0.5) ~ 1, data = M,
                          family = "bcPower") |> summary()

car::symbox(~ I(AlcoholDay_corrected + 0.5), data = M)

The results indicate that both no transformation (\(\lambda = 1\)) and a log transformation (\(\lambda = 0\)) are rejected, and that the suggested power is -0.10, i.e. approximately \(x^{-0.1} = \frac{1}{\sqrt[10]{x}}\). We could use such a transformation, but it is unfamiliar and the result would be hard to interpret.

However, the boxplot of a range of powers (car::symbox() function) shows that a close approximation to a symmetrical distribution is achieved with a log transformation. Since this is familiar and easy to interpret, we will use this.

summary(M$AlcoholDay_corrected)

lattice::bwplot(I(log(AlcoholDay_corrected + 0.5)) ~ Marijuana, data = M,
                ylab = "ln(Average number of alcoholic drinks consumed 
                in a drinking session)")
Question 10

Is marijuana a rich or poor issue?

The next risk factor is household income (HHIncomeMid). Another continuous variable - so use a box plot.

lattice::bwplot(HHIncomeMid ~ Marijuana, data = M,
                ylab = "Household income (US$ per year)")
Question 11

Hard drugs?

HardDrugs is a categorical variable which indicates whether a participant has ever tried cocaine, crack cocaine, heroin or methamphetamine. Use the cc() function to test whether this behaviour is associated with Marijuana.

cc(M$Marijuana, M$HardDrugs)
Question 12

Euler plots

Euler plots are Venn-like diagrams that allow us to visualise associations between categorical variables, such as exposures and outcomes. These show the overlap between groups, which are represented as circles or ellipses.

When interpreting a Euler plot please note that:
- The area of the ellipse is proportional to the counts of individuals with that characteristic.
- More overlap between variables indicates stronger statistical association between them.

Create a Euler plot to visualise associations between Marijuana, HardDrugs and Alcohol12PlusYr.

M$All <- 1 # Assigns `1` to all rows. We’ll use this to create an ellipse for all participants

# euler() plots the reference level of each factor.
# We want to show the "Yes" category, so we'll change the reference
# from "No" to "Yes"

M$Marijuana_factor <- M$Marijuana |> factor() |> relevel(ref = "Yes")
M$HardDrugs_factor <- M$HardDrugs |> factor() |> relevel(ref = "Yes")
M$Alcohol_factor <- M$Alcohol12PlusYr |> factor() |> relevel(ref = "Yes")

# Euler plot
fit1 <- eulerr::euler(M[, c("Marijuana_factor",
                           "HardDrugs_factor",
                           "Alcohol_factor"  ,
                           "All")] |> na.omit(),# subset data and remove missing
                            shape = "ellipse")

plot(fit1, 
     main = "Marijuana, Alcohol and Hard drugs")

Note:
- The 3 variable names in the plot end in _Yes because relevel(ref = "Yes") in our code above makes "Yes" the reference level, and the reference group is the one a Euler plot shows.
- The area of the Alcohol_factor_Yes (dark pink) group is similar to the area of the All (white) group, indicating that a high percentage of the sample were in the Alcohol_factor = "Yes" group.

M$Marijuana |> tab1()
M$HardDrugs |> tab1()
M$Alcohol12PlusYr |> tab1()

Let’s add a legend to make it easier to see what the colours represent. Also, to help quantify the degree of overlap between the groups, we’ll add count and % values.

plot(fit1, 
     legend = list(labels = c( "All", "Marijuana","HardDrugs", "Alcohol")),
     quantities = list(type = c("percent", "counts")))

Let’s now focus on the association between Alcohol_factor and Marijuana_factor by dropping HardDrugs_factor from our plot.

fit1 <- eulerr::euler(M[,c("Marijuana_factor",
                           "Alcohol_factor",
                           "All")] |>  na.omit(),
                          shape = "ellipse")

plot(fit1, main = "Marijuana and Alcohol")

Correlations between variables

In the previous sections we worked with categorical (factor) variables. To look at correlations, we need numeric (1 or 0) versions of these variables, alongside continuous variables such as Age.
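The _bin variables used below are not created elsewhere in this tutorial; one way to build them (assuming "Yes" and "male" are the categories of interest) is with a small hypothetical helper:

```r
## Hypothetical helper: convert a "Yes"/"No"-style factor or character to 1/0
to_bin <- function(x, yes = "Yes") ifelse(as.character(x) == yes, 1, 0)

to_bin(c("Yes", "No", "Never"))            # 1 0 0
to_bin(factor(c("male", "female")), yes = "male")  # 1 0
```

Applied to our dataset, you would then run, for example, M$Marijuana_bin <- to_bin(M$Marijuana), and similarly M$Male <- to_bin(M$Gender, yes = "male"), plus Alcohol_bin (from Alcohol12PlusYr), SmokeNow_bin (from SmokeCurrent) and HardDrugs_bin (from HardDrugs).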

# Select variables for correlation analysis
my_data <- M[, c("Male",
                  "Marijuana_bin", "Alcohol_bin", "SmokeNow_bin",
                  "HardDrugs_bin", "Age")]

PerformanceAnalytics::chart.Correlation(my_data, 
                                        histogram=TRUE, pch=19)

Publication-quality by-group frequency tables (Table 1)

We can use the tableby() function in the arsenal library to produce a publication-quality table of count and % values for variables by Marijuana use. Usually, this is Table 1 in a publication.

tab1 <- arsenal::tableby(Marijuana ~ # column
                            SmokeCurrent + # rows
                            HardDrugs + 
                            Race1 + 
                            Age + 
                            Gender + 
                            Depressed + 
                            Alcohol12PlusYr, 
                            cat.stats="countrowpct", #row percents are best for estimating risk ratios.
                            data = M)

summary(tab1, text = TRUE)

In Posit, the code below writes this table to a file (here we’ve named the file trash) in HTML or Word format; the file is accessible from the Files tab and its contents can be copied into a document. This means you don’t have to create a table manually by typing in values, so it saves you time!
write2html(tab1, "./trash.html", title = "My table in html")
write2word(tab1, "./trash.doc", title = "My table in Word")

Logistic model

We’ll build models using logistic regression. Before we start, our categorical variables should be of the type, factor rather than character - check if they are using the str() function.

M$HardDrugs |> str()
M$SmokeNow |> str()
M$Race1 |> str()

Run this code to look at the association between HardDrugs and Marijuana.
- glm() stands for generalised linear model
- family = binomial(link = "logit") indicates that we’re using logistic regression (logit link and binomial errors).

The basic logistic model is produced with the following code:

model <- glm(Marijuana ~ HardDrugs, data = M,
             family = binomial(link = "logit"))
summary(model)
Question 13

Exponentiate the model \(\beta\) coefficients with the jtools::summ() function to derive odds ratios.

jtools::summ(model, exp = TRUE)
Question 14

That was a univariable model with just one exposure (HardDrugs). Now we’ll create a multivariable model so that we can also adjust for potential confounders.

model <- glm(Marijuana ~ SmokeCurrent + HardDrugs + Race1 + Age + 
               Gender + Depressed + Alcohol12PlusYr, data = M,
             family = binomial(link = "logit"))

jtools::summ(model, exp = TRUE) # Exponentiate the beta coefficients to get odds ratios

Regression nomogram

This is a visual summary of the contributions of predictors to an outcome. We covered regression nomograms in the previous tutorial.

The following code draws a regression nomogram for our model from the logistic model section.

regplot(model)

The output is shown below

[Image: regression nomogram]

The nomogram depicts the structure of the model as a weighted average.

Each variable contributes a score, \(\beta(X - m)\), represented by its horizontal deviation. These scores are summed for each individual to give a Total score.
Reading vertically down from the Total score then gives the probability of Marijuana use for each individual in the model (ranging between 0 and 1).

The horizontal distance between boxes, or the degree of spread of a variable’s density plot, represents the relative influence of that variable on the model outcome.

The size of each box shows the distribution of a categorical variable, and the density plots show the distributions of continuous variables.

Question 15

Model fit

How well a model fits the data can be assessed with a marginal model plot, which shows the relationship between model predicted and observed data.

# Note, all predictor variables must be factors for marginal model plot to work....
model <- glm(Marijuana ~ factor(SmokeCurrent) + factor(HardDrugs) + factor(Race1) + Age + 
               factor(Gender) + factor(Depressed) + factor(Alcohol12PlusYr), data = M,
             family = binomial(link = "logit"))

assign("M", M, envir = .GlobalEnv)
assign("model", model, envir = .GlobalEnv)

## Hosmer-Lemeshow test of model fit

glmtoolbox::hltest(model)

## marginal model plot
car::marginalModelPlots(model)

Hosmer-Lemeshow test

This test divides the data into deciles based on the ordered predicted probability of the outcome (here: marijuana use). The test then compares observed and model-expected counts of cases in each decile. The null hypothesis is that there is no difference between observed and model-predicted cases throughout the range of risk.

You can see that the observed and expected counts in the output above are very closely aligned.

Here, the \(P\)-value is 0.91, which is not significant. Therefore, the calibration of the model between observed and predicted numbers is good. Note that the Hosmer-Lemeshow test has some well-known limitations, such as its sensitivity to the number of groups chosen.
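To see what the test is doing, here is a minimal sketch of the decile table on simulated data (not the NHANES model):

```r
## Simulate a simple binary outcome and fit a logistic model
set.seed(1)
n <- 2000
x <- rnorm(n)
p <- plogis(-0.5 + 0.8 * x)
y <- rbinom(n, 1, p)
fit <- glm(y ~ x, family = binomial(link = "logit"))

## Group individuals by deciles of predicted probability
phat   <- fitted(fit)
decile <- cut(phat, quantile(phat, probs = seq(0, 1, 0.1)),
              include.lowest = TRUE, labels = FALSE)

## Observed vs model-expected cases per decile
observed <- tapply(y, decile, sum)
expected <- tapply(phat, decile, sum)
round(cbind(observed, expected), 1)
```

If the model is well calibrated, observed and expected counts should be close in every decile; the Hosmer-Lemeshow statistic summarises these discrepancies.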

Model fit - marginal model plot

We covered the theory related to marginal model plots in past tutorials.
Look at the plots above.

Publication-quality regression tables (Table 2)

The code below allows us to produce publication-quality regression tables. Usually, a regression table is Table 2 in a publication. Note that the dependent variable must be a factor variable with two levels for this code to produce a logistic regression model, rather than a linear or other regression.

dependent <- "Marijuana"
explanatory <- c(
  "SmokeCurrent", 
  "HardDrugs","Race1", "Age", "Depressed", "Alcohol12PlusYr")

M |>
 finalfit(dependent, explanatory, metrics = FALSE) -> regression_table
  
  
regress_tab <- knitr::kable(regression_table, row.names = FALSE,
                            
                            align = c("l", "l", "r", "r", "r", "r"))

regress_tab |> kableExtra::kable_styling()


M |>
 finalfit(dependent, explanatory, metrics = TRUE) -> metrics_table

## show metrics only for display
metrics_table[[2]]
Question 16

Summary

Today, we have extended regression to help us estimate the strength of association between risk factors and binary outcomes, which are common in epidemiology. Logistic regression is used to predict the probability of a binary outcome such as disease status.

  • Logistic regression is often used when the outcome is a binary variable and there are confounders to adjust for.
  • The principles are similar to usual regression, however the beta coefficients are exponentiated and interpreted as odds ratios (either crude or adjusted).
  • The outcome variable must have only two levels (1 or 0, Yes or No, or TRUE or FALSE).
  • We often present results in a table with both crude and adjusted odds ratios to see the effect of adjusting for confounders on an association.

Homework

Repeat the above analyses looking at risk factors for alcohol use. For the outcome, use the response to the question: consumption of at least 12 drinks of any type of alcoholic beverage in any one year (Alcohol12PlusYr == "Yes").

Please complete this in Posit.

POPLHLTH 304 tutorial 8: logistic regression

Simon Thornley and John Sluyter

30 January, 2025