Biostat III exercises in R

Laboratory exercise 2

Suggested solutions by

Author: Annika Tillander, 2014-01-30
Edited: Andreas Karlsson, 2015-02-28, 2016-03-07

Comparing survival, proportions and mortality rates by stage for cause-specific and all-cause survival.


You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest") for each package you require.

require(foreign)  # for reading data set from Stata
require(survival) # for Surv and survfit
require(muhaz)    # for hazard estimates
require(dplyr)    # for data manipulation

We start by reading the data and listing the first few observations to get an idea about the data. We then define a 1/0 varible for the events that we are interested in.

## Get the data for exercise 2
melanoma_raw <- read.dta("http://biostat3.net/download/melanoma.dta")

## Examine the data
head(melanoma_raw)
##      sex age     stage mmdx yydx surv_mm surv_yy       status
## 1 Female  81 Localised   11 1981    26.5     2.5  Dead: other
## 2 Female  75 Localised   10 1975    55.5     4.5  Dead: other
## 3 Female  78 Localised    1 1978   177.5    14.5  Dead: other
## 4 Female  75   Unknown   12 1975    29.5     2.5 Dead: cancer
## 5 Female  81   Unknown    4 1981    57.5     4.5  Dead: other
## 6 Female  75 Localised   12 1975    19.5     1.5 Dead: cancer
##            subsite        year8594         dx       exit agegrp      bdate
## 1    Head and Neck Diagnosed 75-84 1981-11-07 1984-01-22    75+ 1900-11-07
## 2    Head and Neck Diagnosed 75-84 1975-10-07 1980-05-22    75+ 1900-10-07
## 3            Limbs Diagnosed 75-84 1978-01-07 1992-10-22    75+ 1900-01-07
## 4 Multiple and NOS Diagnosed 75-84 1975-12-07 1978-05-22    75+ 1900-12-07
## 5    Head and Neck Diagnosed 75-84 1981-04-07 1986-01-22    75+ 1900-04-07
## 6            Trunk Diagnosed 75-84 1975-12-07 1977-07-22    75+ 1900-12-07
##   id
## 1  1
## 2  2
## 3  3
## 4  4
## 5  5
## 6  6
## Create 0/1 outcome variable
melanoma <- melanoma_raw %>%
    mutate(death_cancer = ifelse( status == "Dead: cancer", 1, 0),
           death_all = ifelse( status == "Dead: cancer" |
                               status == "Dead: other", 1, 0))

(a) We now tabulate the distribution of stages.

## Tabulate by stage
melanoma %>%
    group_by(stage) %>%
    summarise(Freq = n(), Percent = n() / nrow(.)) %>%
    mutate(Cum = cumsum(Percent))
## Source: local data frame [4 x 4]
## 
##       stage Freq    Percent       Cum
## 1   Unknown 1631 0.20977492 0.2097749
## 2 Localised 5318 0.68398714 0.8937621
## 3  Regional  350 0.04501608 0.9387781
## 4   Distant  476 0.06122186 1.0000000

Survival depends heavily on stage. It is interesting to note that patients with an unknown stage appear to have a similar survival as patients with a localized stage. Tabulate events by stage Fit and plot survival and smoothed hazard.

par(mfrow=c(1, 2))
mfit <- survfit(Surv(surv_mm, death_cancer) ~ stage, data = melanoma)

plot(mfit, col=1:4,
     xlab = "Follow-up Time",
     ylab = "Survival")
legend("topright", levels(melanoma$stage), col=1:4, lty = 1)

hazByGroup <- function(group){
    with(subset(melanoma, group),
         muhaz(times=surv_mm,
               delta=death_cancer,
               min.time = min(surv_mm),
               max.time = max(surv_mm)))
}

plot(hazByGroup(melanoma$stage=="Unknown"), col=1, xlim=c(0,250), ylim=c(0,0.08))
lines(hazByGroup(melanoma$stage=="Localised"), col=2)
lines(hazByGroup(melanoma$stage=="Regional"), col=3)
lines(hazByGroup(melanoma$stage=="Distant"), col=4)
legend("topright", levels(melanoma$stage), col=1:4, lty = 1)

(b) The time unit that we used was months (since we specified surv_mm as the analysis time). Therefore, the units of the rates shown above are events/person-month. We could multiply these rates by 12 to obtain estimates with units events/person-year or we can change the default time unit by specifying the time scale. As an example we here tabulate crude rates, first by month and then by year.

melanoma %>%
    select(death_cancer, surv_mm, stage) %>%
    group_by(stage) %>%
    summarise(D = sum(death_cancer), M = sum(surv_mm), Rate = D/M) %>%
    mutate(CI_low = Rate + qnorm(0.025) * Rate / sqrt(D),
           CI_high = Rate + qnorm(0.975) * Rate / sqrt(D))
## Source: local data frame [4 x 6]
## 
##       stage    D        M        Rate      CI_low     CI_high
## 1   Unknown  274 123205.5 0.002223927 0.001960601 0.002487252
## 2 Localised 1013 463519.0 0.002185455 0.002050874 0.002320037
## 3  Regional  218  18003.0 0.012109093 0.010501665 0.013716521
## 4   Distant  408  10509.0 0.038823865 0.035056682 0.042591049
melanoma %>%
    select(death_cancer, surv_yy, stage) %>%
    group_by(stage) %>%
    summarise(D = sum(death_cancer), Y = sum(surv_yy), Rate = D/Y) %>%
    mutate(CI_low = Rate + qnorm(0.025) * Rate / sqrt(D),
           CI_high = Rate + qnorm(0.975) * Rate / sqrt(D))
## Source: local data frame [4 x 6]
## 
##       stage    D       Y       Rate     CI_low    CI_high
## 1   Unknown  274 10273.5 0.02667056 0.02351261 0.02982851
## 2 Localised 1013 38633.0 0.02622111 0.02460640 0.02783582
## 3  Regional  218  1501.0 0.14523651 0.12595701 0.16451600
## 4   Distant  408   907.0 0.44983462 0.40618596 0.49348328

(c) Here we tabulate crude rates per 1000 person years.

melanoma %>%
    select(death_cancer, surv_yy, stage) %>%
    group_by(stage) %>%
    summarise(D = sum(death_cancer), Y = sum(surv_yy)/1000, Rate = D/Y) %>%
    mutate(CI_low = Rate + qnorm(0.025) * Rate / sqrt(D),
           CI_high = Rate + qnorm(0.975) * Rate / sqrt(D))
## Source: local data frame [4 x 6]
## 
##       stage    D       Y      Rate    CI_low   CI_high
## 1   Unknown  274 10.2735  26.67056  23.51261  29.82851
## 2 Localised 1013 38.6330  26.22111  24.60640  27.83582
## 3  Regional  218  1.5010 145.23651 125.95701 164.51600
## 4   Distant  408  0.9070 449.83462 406.18596 493.48328

(d) Below we see that the crude mortality rate is higher for males than for females.

melanoma %>%
    select(death_cancer, surv_yy, sex) %>%
    group_by(sex) %>%
    summarise(D = sum(death_cancer), Y = sum(surv_yy)/1000, Rate = D/Y) %>%
    mutate(CI_low = Rate + qnorm(0.025) * Rate / sqrt(D),
           CI_high = Rate + qnorm(0.975) * Rate / sqrt(D))
## Source: local data frame [2 x 6]
## 
##      sex    D       Y     Rate   CI_low  CI_high
## 1   Male 1074 21.9810 48.86038 45.93823 51.78253
## 2 Female  839 29.3335 28.60211 26.66673 30.53749

A difference which is also reflected in the survival and hazard curves below.

par(mfrow=c(1, 2))
sfit <- survfit(Surv(surv_mm, death_cancer) ~ sex, data = melanoma)

plot(sfit, col=1:2,
     xlab = "Follow-up Time",
     ylab = "Survival")
legend("bottomleft", levels(melanoma$sex), col=1:2, lty = 1)

plot(hazByGroup(melanoma$sex=="Male"), col=1, xlim=c(0,250))
lines(hazByGroup(melanoma$sex=="Female"), col=2)
legend("topright", levels(melanoma$sex), col=1:2, lty = 1)

(e) The majority of patients are alive at end of study. The cause of death is highly depending of age, as young people die less from other causes. To observe this we tabulate the events by age group.

with(melanoma,table(status, agegrp))
##                    agegrp
## status              0-44 45-59 60-74  75+
##   Alive             1615  1568  1178  359
##   Dead: cancer       386   522   640  365
##   Dead: other         39   147   461  487
##   Lost to follow-up    6     1     1    0

(f) The survival is worse for all-cause survival than for cause-specific, since you now can die from other causes, and these deaths are incorporated in the Kaplan-Meier estimates. The ”other cause” mortality is particularly present in patients with localised and unknown stage.

par(mfrow=c(1, 1))
afit <- survfit(Surv(surv_mm, death_all) ~ stage, data = melanoma)
plot(afit, col=1:4,
     xlab = "Follow-up Time",
     ylab = "Survival",
     main = "Kaplan-Meier survival estimates\nAll-cause")
legend("topright", levels(melanoma$stage), col=1:4, lty = 1)