Biostat III exercises in R

Laboratory exercise 3

Suggested solutions by

Author: Annika Tillander, 2014-01-30
Edited: Andreas Karlsson, 2015-02-28, 2016-03-08

Localised melanoma: Comparing estimates of cause-specific survival between periods; first graphically and then using the log rank test


You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest") for each package you require.

require(foreign)  # needed to read data set from Stata
require(survival) # for Surv and survfit
require(muhaz)    # for hazard estimates
require(dplyr)    # for data manipulation

We will now analyse the full data set of patients diagnosed with localised skin melanoma. We start by reading the data selecting those with a localised stage and then define a 1/0 varible for the events that we are interested in.

melanoma_raw <- read.dta("http://biostat3.net/download/melanoma.dta")
melanoma <- melanoma_raw %>%
    filter(stage=="Localised") %>% # subset those with localised cancer
    mutate(death_cancer = ifelse( status == "Dead: cancer", 1, 0),
           death_all = ifelse( status == "Dead: cancer" |
                               status == "Dead: other", 1, 0))

(a) Estimate the cause-specific survivor function, using the Kaplan-Meier method with survival time in months, separately for each of the two calendar periods 1975–1984 and 1985–1994. The variable year8594 indicates whether a patient was diagnosed 1985–1994 or 1975–1984. Without making reference to any formal statistical tests, does it appear that patient survival is superior during the most recent period?

mfityear8594 <- survfit(Surv(surv_mm, death_cancer==1) ~ year8594, data = melanoma)

plot(mfityear8594, col = 1:2,
     xlab = "Follow-up Time",
     ylab = "Survival",
     main = "Kaplan-Meier survival estimates")
legend("topright", c("Diagnosed 75-84", "Diagnosed 85-94"), col=1:2, lty = 1)

(b) The following commands can be used to plot the hazard function (instantaneous mortality rate): i. At what point in the follow-up is mortality highest? ii. Does this pattern seem reasonable from a clinicial/biological perspective? [HINT:Consider the disease with which these patients were classified as being diagnosed along with the expected fatality of the disease as a function of time since diagnosis.]

hazByGroup <- function(group){
    with(subset(melanoma, group),
         muhaz(times=surv_mm,
               delta=death_cancer,
               min.time = min(surv_mm),
               max.time = max(surv_mm)))
}

plot(hazByGroup(melanoma$year8594 == "Diagnosed 75-84"), col=1, main="Smoothed hazard estimates")
lines(hazByGroup(melanoma$year8594 == "Diagnosed 85-94"), col=2)
legend("topright", c("Diagnosed 75-84", "Diagnosed 85-94"), col=1:2, lty = 1)

(c) Use the log rank and the Wilcoxon test to determine whether there is a statistically significant difference in patient survival between the two periods.

## Log-rank test for equality of survivor functions
survdiff(Surv(surv_mm, death_cancer==1) ~ year8594, data=melanoma)
## Call:
## survdiff(formula = Surv(surv_mm, death_cancer == 1) ~ year8594, 
##     data = melanoma)
## 
##                             N Observed Expected (O-E)^2/E (O-E)^2/V
## year8594=Diagnosed 75-84 2145      572      512      7.03      15.5
## year8594=Diagnosed 85-94 3173      441      501      7.18      15.5
## 
##  Chisq= 15.5  on 1 degrees of freedom, p= 8.23e-05
## Equivalent to the Peto & Peto modfication of the Gehan-Wilcoxon test
survdiff(Surv(surv_mm, death_cancer==1) ~ year8594, data=melanoma, rho=1)
## Call:
## survdiff(formula = Surv(surv_mm, death_cancer == 1) ~ year8594, 
##     data = melanoma, rho = 1)
## 
##                             N Observed Expected (O-E)^2/E (O-E)^2/V
## year8594=Diagnosed 75-84 2145      502      447      6.82      16.2
## year8594=Diagnosed 85-94 3173      400      455      6.70      16.2
## 
##  Chisq= 16.2  on 1 degrees of freedom, p= 5.82e-05

Haven’t heard of the log rank (or Wilcoxon) test? It’s possible you may reach this exercise before we cover the details of these tests during lectures. You should nevertheless do the exercise and try and interpret the results. Both of these tests (the log rank and the generalised Wilcoxon) are used to test for differences between the survivor functions. The null hypothesis is that the survivor functions are equivalent for the two calendar periods (i.e., patient survival does not depend on calendar period of diagnosis).

(d) Estimate cause-specific mortality rates for each age group, and graph Kaplan-Meier estimates of the cause-specific survivor function for each age group. Are there differences between the age groups? Is the interpretation consistent between the mortality rates and the survival proportions?

melanoma %>%
    select(death_cancer, surv_mm, agegrp) %>%
    group_by(agegrp) %>%
    summarise(D = sum(death_cancer), Y = sum(surv_mm)/1000, Rate = D/Y) %>%
    mutate(CI_low = Rate + qnorm(0.025) * Rate / sqrt(D),
           CI_high = Rate + qnorm(0.975) * Rate / sqrt(D))
## Source: local data frame [4 x 6]
## 
##   agegrp   D        Y     Rate   CI_low  CI_high
## 1   0-44 217 157.1215 1.381097 1.197340 1.564853
## 2  45-59 282 148.8215 1.894887 1.673727 2.116048
## 3  60-74 333 121.3380 2.744400 2.449637 3.039163
## 4    75+ 181  36.2380 4.994757 4.267106 5.722408
mfit_agegrp <- survfit(Surv(surv_mm, death_cancer==1) ~ agegrp, data = melanoma)
plot(mfit_agegrp, col = 1:4,
     xlab = "Months since diagnosis",
     ylab = "Survival",
     main = "Kaplan-Meier survival estimates")
legend("bottomleft", c("0-44", "45-59", "60-74", "75+"), col=1:4, lty = 1)

What are the units of the estimated hazard rates? HINT: look at how you defined time.

(e) Repeat some of the previous analyses using years instead of months. This is equivalent to dividing the time variable by 12 so all analyses will be the same except the units of time will be different (e.g., the graphs will have different labels).

mfit_agegrp_year <- survfit(Surv(surv_mm/12, death_cancer==1) ~ agegrp, data = melanoma)
plot(mfit_agegrp_year, col = 1:4,
     xlab = "Months since diagnosis",
     ylab = "Survival",
     main = "Kaplan-Meier survival estimates")
legend("bottomleft", c("0-44", "45-59", "60-74", "75+"), col=1:4, lty = 1)

melanoma %>%
    select(death_cancer, surv_mm, agegrp) %>%
    group_by(agegrp) %>%
    summarise(D = sum(death_cancer), Y = sum(surv_mm)/12/1000, Rate = D/Y) %>%
    mutate(CI_low = Rate + qnorm(0.025) * Rate / sqrt(D),
           CI_high = Rate + qnorm(0.975) * Rate / sqrt(D))
## Source: local data frame [4 x 6]
## 
##   agegrp   D         Y     Rate   CI_low  CI_high
## 1   0-44 217 13.093458 16.57316 14.36809 18.77824
## 2  45-59 282 12.401792 22.73865 20.08473 25.39257
## 3  60-74 333 10.111500 32.93280 29.39564 36.46996
## 4    75+ 181  3.019833 59.93708 51.20527 68.66890

(f) Study whether there is evidence of a difference in patient survival between males and females. Estimate both the hazard and survival function and use the log rank test to test for a difference.

par(mfrow=c(1, 2))
mfit_sex <- survfit(Surv(surv_mm, death_cancer) ~ sex, data = melanoma)

plot(mfit_sex, col = 1:2,
     xlab = "Follow-up Time",
     ylab = "Survival",
     main = "Kaplan-Meier survival estimates")
legend("topright", c("Male", "Female"), col=1:2, lty = 1)

plot(hazByGroup(melanoma$sex == "Male"), col=1, main="Smoothed hazard estimates")
lines(hazByGroup(melanoma$sex == "Female"), col=2)
legend("topright", c("Male", "Female"), col=1:2, lty = 1)

## Log-rank test for equality of survivor functions
survdiff(Surv(surv_mm, death_cancer==1) ~ sex, data=melanoma)
## Call:
## survdiff(formula = Surv(surv_mm, death_cancer == 1) ~ sex, data = melanoma)
## 
##               N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Male   2405      542      433      27.7      48.6
## sex=Female 2913      471      580      20.6      48.6
## 
##  Chisq= 48.6  on 1 degrees of freedom, p= 3.22e-12