This article is part of an extended series of tutorials on time-to-event (or survival) analysis. The next article is here.
Introduction
Applications involving the comparison of time-to-failure data from different groups are common in the fields of reliability and manufacturing. For example, a designer might want to implement a new design to reduce costs, but wants to make sure that this change does not adversely affect the reliability of the product. Or, a manufacturer might want to find out if there is any difference in the time-to-failure of products manufactured in different batches or on different assembly lines. Parametric and non-parametric methods for comparing failure-time distributions can be used for such purposes. Two of my previous blog articles (here and here) dealt with non-parametric methods for comparing failure data. In this article, I demonstrate the use of a parametric method for the same purpose.
The full R code for this analysis is not included within the article, but it is available on my Github page and can be accessed here. The short code snippets that appear below are only illustrative sketches.
Data
The data used in this article is from the book Statistical Methods for Reliability Data, Second Edition (William Q. Meeker, Luis A. Escobar, Francis G. Pascual). It consists of results from a pull test to measure the bond strength of wire bonds from three separate manufacturing batches. Table 1 shows this data, and it is also publicly available on DataShare, Iowa State University’s open data repository, which can be accessed through this link.
We are interested in finding out whether the bond strength differs across the manufacturing batches. If we do find a statistically significant difference, we want to find out where this difference lies, i.e., between which pairs of batches.
Batch Number | Strength (g) | Censoring Indicator |
---|---|---|
Batch1 | 504 | Failed |
Batch1 | 168 | Failed |
Batch1 | 420 | Failed |
Batch1 | 336 | Failed |
Batch1 | 540 | Failed |
Batch1 | 516 | Failed |
Batch1 | 936 | Failed |
Batch1 | 756 | Failed |
Batch1 | 936 | Failed |
Batch1 | 592 | Failed |
Batch1 | 468 | Failed |
Batch1 | 600 | Censored |
Batch1 | 376 | Failed |
Batch1 | 600 | Censored |
Batch1 | 600 | Censored |
Batch1 | 588 | Failed |
Batch1 | 600 | Censored |
Batch1 | 600 | Censored |
Batch4 | 396 | Failed |
Batch4 | 300 | Failed |
Batch4 | 420 | Failed |
Batch4 | 336 | Failed |
Batch4 | 516 | Failed |
Batch4 | 540 | Failed |
Batch4 | 396 | Failed |
Batch4 | 696 | Failed |
Batch4 | 600 | Failed |
Batch5 | 516 | Failed |
Batch5 | 300 | Failed |
Batch5 | 420 | Failed |
Batch5 | 540 | Failed |
Batch5 | 276 | Failed |
Batch5 | 276 | Failed |
Analysis with Separate Distributions
Our initial approach will be to fit separate distributions to the three groups. To proceed with a parametric comparison of the different groups, we first need to decide which parametric distribution to use for the data. The Weibull and log-normal distributions are the most commonly used to describe time to failure in reliability applications, so we will consider these two for the purpose of demonstrating a parametric comparison. We have three groups in our data set: Batch 1, Batch 4 and Batch 5. Generally, when the failure mode is the same in all groups, the same distribution is used to fit the data.
Figure 1 shows a Weibull probability plot and Figure 2 shows a log-normal probability plot of the data. The points on these plots are the empirical CDF[^1] estimates and the lines are the ML[^2] estimates from fitting the respective distributions.
Figure 1: Weibull Probability Plot
Figure 2: Log-normal Probability Plot
Note that 3 separate Weibull distributions were fit, one for each batch. Similarly, 3 separate log-normal distributions were fit. Looking at both plots, the Weibull distribution seems to fit slightly better than the log-normal, but not by much[^3]. The data could have come from either of the two distributions considered here. We are mainly looking for any indication that a particular distribution is definitely not a good fit, rather than trying to prove that the data came from a particular distribution. The AIC[^4] for the Weibull model is 381.29 and for the log-normal model is 382.29. Since the AIC is lower for the Weibull model, we will continue with the Weibull distribution, although the log-normal wouldn’t have been a terrible choice either.
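As a rough sketch of how such a comparison could be done in R (this is not my original code; the data frame `bond` and its column names are assumptions), one can fit each batch separately with `survival::survreg()` and sum the AICs of the three independent fits:

```r
library(survival)

# Assumed layout of the data from Table 1: a data frame 'bond' with columns
# batch, strength, and status (1 = Failed, 0 = Censored).
fit_batches <- function(dist) {
  lapply(split(bond, bond$batch),
         function(d) survreg(Surv(strength, status) ~ 1, data = d, dist = dist))
}
fits_weib  <- fit_batches("weibull")
fits_lnorm <- fit_batches("lognormal")

# Total AIC across the three independent per-batch fits
sum(sapply(fits_weib, AIC))    # the article reports 381.29
sum(sapply(fits_lnorm, AIC))   # the article reports 382.29
```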
The log-location-scale model for the Weibull distribution is shown in Eq. A below[^5].
\[ \hspace{-2cm} \begin{align*} F(t;\mu,\sigma) & = Pr(T\le t)\\ & = \Phi_{SEV} \left[\frac{log(t)-\mu}{\sigma}\right] \tag{Eq. A} \end{align*} \] \(\Phi_{SEV}\) in Eq. A is the CDF of the standard smallest extreme value (SEV) distribution.
Table 2 shows the ML estimates of the location (\(\mu\)) and scale (\(\sigma\)) parameters from the log-location scale representation of the Weibull distribution for all three batches.
Batch | Parameter | ML Estimate | Standard Error | Wald Lower 90% CI | Wald Upper 90% CI |
---|---|---|---|---|---|
Batch 1 | \(\hat{\mu}_{Batch1}\) | 6.536 | 0.092 | 6.385 | 6.687 |
 | \(\hat{\sigma}_{Batch1}\) | 0.331 | 0.071 | 0.306 | 0.357 |
Batch 4 | \(\hat{\mu}_{Batch4}\) | 6.243 | 0.086 | 6.101 | 6.384 |
 | \(\hat{\sigma}_{Batch4}\) | 0.243 | 0.062 | 0.218 | 0.270 |
Batch 5 | \(\hat{\mu}_{Batch5}\) | 6.064 | 0.110 | 5.883 | 6.245 |
 | \(\hat{\sigma}_{Batch5}\) | 0.254 | 0.083 | 0.214 | 0.303 |
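For reference, here is a sketch of how these estimates could be pulled from the per-batch fits above (the list name `fits_weib` comes from the earlier assumed snippet). For `dist = "weibull"`, `survreg()` already reports the parameters in the log-location-scale form of Eq. A:

```r
# Location (mu) and scale (sigma) for Batch 1 from the fit above.
fit_b1    <- fits_weib[["Batch1"]]
mu_hat    <- coef(fit_b1)[["(Intercept)"]]  # location parameter mu
sigma_hat <- fit_b1$scale                   # scale parameter sigma
```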
Figure 3 shows the approximate 90% joint confidence regions, based on the relative likelihood, for the Weibull distribution median and the shape parameter \(\beta\)[^6]. There is substantial overlap between Batches 4 and 5, which suggests that the difference in medians between these two batches might not be statistically significant. We can also see from Table 2 that the confidence intervals for the location parameter \(\mu\) for Batches 4 and 5 overlap.
In contrast, the overlap between the confidence regions of Batch 1 and Batch 5 in Figure 3 is very small, which suggests that the difference in the medians of Batch 1 and Batch 5 might be statistically significant.
Figure 3: Relative Likelihood Contour Plot
Although the confidence regions give us an idea of how different the batches are from each other, we can formalize the analysis by conducting pairwise comparisons. These pairwise comparisons can be conducted at any particular quantile. For example, we may want to test whether the 20% quantiles of the three batches differ. Since we have constructed the confidence regions using the median (50% quantile), we will continue with pairwise comparisons of the medians.
Constructing simultaneous confidence intervals for the difference in log(medians)[^7] of two groups is a good way of making such comparisons. If an interval contains zero, we do not have evidence that the medians differ for that pair. If the interval does not contain zero, then we do have evidence of a difference in the medians for that pair. That is, we want to construct simultaneous confidence intervals for the following:
\[ \hspace{-2cm} \begin{align*} \Delta_{41} & = \hat{y}_{Batch4} - \hat{y}_{Batch1}\\ \Delta_{51} & = \hat{y}_{Batch5} - \hat{y}_{Batch1}\\ \Delta_{54} & = \hat{y}_{Batch5} - \hat{y}_{Batch4}\\ \end{align*} \] Where,
\[ \hat{y}_{Batch1} = log(t_{0.5}) = \hat{\mu}_{Batch1} + \Phi^{-1}_{SEV}(0.5)\hat{\sigma}_{Batch1}\\ \hat{y}_{Batch4} = log(t_{0.5}) = \hat{\mu}_{Batch4} + \Phi^{-1}_{SEV}(0.5)\hat{\sigma}_{Batch4}\\ \hat{y}_{Batch5} = log(t_{0.5}) = \hat{\mu}_{Batch5} + \Phi^{-1}_{SEV}(0.5)\hat{\sigma}_{Batch5}\\ \] \(t_{0.5}\) is the 50% quantile, i.e., the median.
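As a small numerical sketch (reusing the assumed `mu_hat` and `sigma_hat` from the earlier snippet), the log(median) follows directly from this formula, using \(\Phi^{-1}_{SEV}(p) = log(-log(1-p))\):

```r
# Quantile function of the standard smallest extreme value distribution
qsev <- function(p) log(-log(1 - p))

y_b1 <- mu_hat + qsev(0.5) * sigma_hat  # log(median) for Batch 1
exp(y_b1)                               # median bond strength in grams
```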
Why should we construct simultaneous confidence intervals and not regular confidence intervals? The short answer is that, if we do not use simultaneous confidence intervals, the actual probability that all three intervals will contain the true differences will be lower than the nominal confidence level that we have chosen, i.e., 90%.[^8]
Table 3 shows the estimated difference in the log(medians), the standard error and the simultaneous confidence interval[^9] for each of the three comparisons.
Delta | ML Estimate | Standard Error | Simultaneous Approximate 90% CI (Lower) | Simultaneous Approximate 90% CI (Upper) |
---|---|---|---|---|
\(\Delta_{41}\) | -0.261 | 0.136 | -0.550 | 0.028 |
\(\Delta_{51}\) | -0.444 | 0.157 | -0.777 | -0.111 |
\(\Delta_{54}\) | -0.183 | 0.156 | -0.515 | 0.149 |
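To sketch the Bonferroni adjustment behind these intervals (the point estimate and standard error below are taken straight from Table 3, so this reproduces the table rather than re-deriving it): with k = 3 comparisons at an overall 90% level, each interval is computed at level \(1 - 0.10/3\).

```r
k <- 3
z <- qnorm(1 - 0.10 / (2 * k))    # two-sided Bonferroni critical value, ~2.13

delta_51 <- -0.444                # ML estimate from Table 3
se_51    <-  0.157                # standard error from Table 3
delta_51 + c(-1, 1) * z * se_51   # approximately (-0.777, -0.111)
```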
From Table 3, we see that only the simultaneous confidence interval for \(\Delta_{51}\) does not contain zero, which means we have evidence at the 10% significance level that the medians of Batches 1 and 5 are different. Since the entire CI is negative, the median of Batch 1 appears to be greater than the median of Batch 5 (at the 10% significance level). We cannot say the same for the other two comparisons, since their CIs include zero.
Analysis with an Equal Shape (\(\sigma\)) Parameter
In the previous section, we fit three separate Weibull distributions, one for each batch. This led to the estimation of 6 parameters (\(\mu_{Batch1}, \mu_{Batch4},\mu_{Batch5},\sigma_{Batch1},\sigma_{Batch4},\sigma_{Batch5}\)), as shown in Table 2. When comparing groups where the mode of failure is the same in all groups, it is reasonable to try fitting a model that has the same \(\sigma\) parameter across all groups and lets only the location (i.e., \(\mu\)) vary. Such a model can be represented by the following regression equation:
\[ y = log(t_{p}) = {\beta}_{0} + {\beta}_{1}x_{Batch4} + {\beta}_{2}x_{Batch5} + \Phi^{-1}_{SEV}(p){\sigma}\\ \tag{Eq. B} \] Where
\[ \begin{align*} x_{Batch4} & = 1 \quad \text{when the group is Batch 4 and}\\ x_{Batch4} & = 0 \quad \text{otherwise}\\ x_{Batch5} & = 1 \quad \text{when the group is Batch 5 and}\\ x_{Batch5} & = 0 \quad \text{otherwise}\\ \end{align*} \] Notice that in this model, we have the same \(\sigma\) parameter for all batches. The advantage of using such a model is that it uses the available data to estimate only 4 parameters (\(\beta_{0}\), \(\beta_{1}\), \(\beta_{2}\) and \(\sigma\)) instead of the 6 parameters that we estimated in the last section[^10].
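A minimal sketch of such a fit in R (again assuming the `bond` data frame from earlier, with `batch` as a factor whose reference level is Batch1): a single `survreg()` call with `batch` as a covariate estimates one common \(\sigma\) and batch-specific intercept shifts.

```r
# Equal-sigma model of Eq. B: common scale, batch-specific locations.
fit_eq <- survreg(Surv(strength, status) ~ batch, data = bond,
                  dist = "weibull")
coef(fit_eq)   # beta_0 (intercept), beta_1 (Batch4), beta_2 (Batch5)
fit_eq$scale   # the common sigma
```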
Table 4 below shows the estimate, standard error and the Wald 90% confidence interval for the parameters from Eq. B.
Parameter | ML Estimate | Standard Error | Wald Lower 90% CI | Wald Upper 90% CI |
---|---|---|---|---|
\(\hat{\beta_{0}}\) | 6.540 | 0.080 | 6.409 | 6.672 |
\(\hat{\beta_{1}}\) | -0.315 | 0.125 | -0.521 | -0.110 |
\(\hat{\beta_{2}}\) | -0.490 | 0.142 | -0.724 | -0.256 |
\(\hat{\sigma}\) | 0.288 | 0.042 | 0.226 | 0.366 |
Figure 4 shows the Weibull probability plot for the equal \(\sigma\) model. The \(\sigma\) parameter determines the slope of the line on the Weibull plot. Since it is the same for all three batches, the lines on this plot are parallel to each other as opposed to the lines shown in Figure 1.
Figure 4: Weibull Probability Plot for the Equal \(\sigma\) Model
A likelihood ratio test comparing the model from the previous section (separate distributions) with the model in this section (equal \(\sigma\)) gives a p-value of 0.609, which indicates that we do not have enough evidence to conclude that the more complex model fits better. Hence, it is reasonable to choose the model with fewer parameters, i.e., the one with equal \(\sigma\).
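Here is a sketch of that likelihood ratio test, reusing the assumed objects from the earlier snippets (`fits_weib` for the separate fits and `fit_eq` for the equal-\(\sigma\) fit). The separate model has 6 parameters and the equal-\(\sigma\) model has 4, giving 2 degrees of freedom:

```r
# Log-likelihood of the separate model is the sum over the three
# independent per-batch fits.
ll_sep <- sum(sapply(fits_weib, logLik))
ll_eq  <- as.numeric(logLik(fit_eq))

lrt <- 2 * (ll_sep - ll_eq)
pchisq(lrt, df = 2, lower.tail = FALSE)  # the article reports p = 0.609
```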
We can conduct pairwise comparisons using the equal \(\sigma\) model, as we did for the median (50% quantile) in the previous section. With the equal \(\sigma\) model, however, we are not limited to any particular quantile for such comparisons: since the lines in the Weibull plot are parallel, the difference in log quantiles between batches is the same at every quantile.
Table 5 below shows the estimate of the difference between each pair of batches, along with the standard error and simultaneous confidence intervals[^11]. The simultaneous confidence intervals for the difference between Batches 1 and 4 and between Batches 1 and 5 both do not contain zero. Thus, we have evidence that Batches 1 and 4 are different to each other, and so are Batches 1 and 5, at the 10% significance level. We cannot say the same for the comparison between Batches 4 and 5.
Delta | ML Estimate | Standard Error | Simultaneous Approximate 90% CI (Lower) | Simultaneous Approximate 90% CI (Upper) |
---|---|---|---|---|
\(\Delta_{41}\) | -0.315 | 0.125 | -0.581 | -0.050 |
\(\Delta_{51}\) | -0.490 | 0.142 | -0.792 | -0.187 |
\(\Delta_{54}\) | -0.174 | 0.152 | -0.497 | 0.148 |
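A sketch of how these differences could be obtained from the equal-\(\sigma\) fit (coefficient names like `batchBatch4` are assumptions that depend on how the `batch` factor is coded): \(\hat{\beta}_{1}\) and \(\hat{\beta}_{2}\) are already \(\Delta_{41}\) and \(\Delta_{51}\), and \(\Delta_{54}\) is their difference, with its standard error taken from the coefficient covariance matrix.

```r
b <- coef(fit_eq)
V <- vcov(fit_eq)

# Delta_54 = beta_2 - beta_1, with SE from the covariance matrix
d54  <- b[["batchBatch5"]] - b[["batchBatch4"]]
se54 <- sqrt(V["batchBatch5", "batchBatch5"] +
             V["batchBatch4", "batchBatch4"] -
             2 * V["batchBatch5", "batchBatch4"])

z <- qnorm(1 - 0.10 / 6)   # Bonferroni: 3 comparisons, 90% overall
d54 + c(-1, 1) * z * se54  # approximately (-0.497, 0.148)
```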
Conclusion
In this case study, the question of interest was whether there is any difference in the bond strength of wire bonds produced in three separate batches. We approached this problem initially by fitting separate Weibull distributions to the three batches, and then by fitting a model with a common \(\sigma\) parameter. Since the likelihood ratio test comparing the two approaches gave a non-significant p-value, and since the mode of failure is the same in all three batches, we were justified in using the simpler model with fewer parameters. The results of the analysis revealed that there is evidence in the data to suggest that Batch 1 is different to Batches 4 and 5, at the 10% significance level. We did not find any evidence (again, at the 10% significance level) of a difference between Batches 4 and 5.
Acknowledgements
Most of what I have learnt about time-to-event analysis is from the book Survival Analysis: Techniques for Censored and Truncated Data, Second Edition (John P. Klein and Melvin L. Moescheberger) and Statistical Methods for Reliability Data, Second Edition (William Q. Meeker, Luis A. Escobar, Francis G. Pascual).
All analyses were performed using R Statistical Software (v4.4.1; R Core Team 2024). The survival R package (v3.7.0; Therneau T 2024) was used extensively. Please refer to the next section for all the packages used, their versions and the names of the package developers.
Credits for R packages used
- Auguie B (2017). _gridExtra: Miscellaneous Functions for "Grid" Graphics_. R package version 2.3, <https://CRAN.R-project.org/package=gridExtra>.
- Grolemund G, Wickham H (2011). "Dates and Times Made Easy with lubridate." _Journal of Statistical Software_, *40*(3), 1-25. <https://www.jstatsoft.org/v40/i03/>.
- Jackson C (2016). "flexsurv: A Platform for Parametric Survival Modeling in R." _Journal of Statistical Software_, *70*(8), 1-33. doi:10.18637/jss.v070.i08 <https://doi.org/10.18637/jss.v070.i08>.
- Kassambara A (2023). _ggpubr: 'ggplot2' Based Publication Ready Plots_. R package version 0.6.0, <https://CRAN.R-project.org/package=ggpubr>.
- Kassambara A, Kosinski M, Biecek P (2024). _survminer: Drawing Survival Curves using 'ggplot2'_. R package version 0.5.0, <https://CRAN.R-project.org/package=survminer>.
- Müller K, Wickham H (2023). _tibble: Simple Data Frames_. R package version 3.2.1, <https://CRAN.R-project.org/package=tibble>.
- R Core Team (2024). _R: A Language and Environment for Statistical Computing_. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>.
- Strobl R (2022). _km.ci: Confidence Intervals for the Kaplan-Meier Estimator_. R package version 0.5-6, <https://CRAN.R-project.org/package=km.ci>.
- Therneau T (2024). _A Package for Survival Analysis in R_. R package version 3.7-0, <https://CRAN.R-project.org/package=survival>. Terry M. Therneau, Patricia M. Grambsch (2000). _Modeling Survival Data: Extending the Cox Model_. Springer, New York. ISBN 0-387-98784-3.
- Wickham H (2016). _ggplot2: Elegant Graphics for Data Analysis_. Springer-Verlag New York. ISBN 978-3-319-24277-4, <https://ggplot2.tidyverse.org>.
- Wickham H (2023). _forcats: Tools for Working with Categorical Variables (Factors)_. R package version 1.0.0, <https://CRAN.R-project.org/package=forcats>.
- Wickham H (2023). _stringr: Simple, Consistent Wrappers for Common String Operations_. R package version 1.5.1, <https://CRAN.R-project.org/package=stringr>.
- Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." _Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
- Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A Grammar of Data Manipulation_. R package version 1.1.4, <https://CRAN.R-project.org/package=dplyr>.
- Wickham H, Henry L (2023). _purrr: Functional Programming Tools_. R package version 1.0.2, <https://CRAN.R-project.org/package=purrr>.
- Wickham H, Hester J, Bryan J (2024). _readr: Read Rectangular Text Data_. R package version 2.1.5, <https://CRAN.R-project.org/package=readr>.
- Wickham H, Vaughan D, Girlich M (2024). _tidyr: Tidy Messy Data_. R package version 1.3.1, <https://CRAN.R-project.org/package=tidyr>.
- Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and Pipe Syntax_. R package version 1.4.0, <https://CRAN.R-project.org/package=kableExtra>.
End
I hope you found this article informative! If you have any comments or suggestions, or if you find any errors and are keen to help correct them, please write to me at shishir909@gmail.com.
[^1]: Cumulative Distribution Function.
[^2]: Maximum likelihood.
[^3]: We say that we have a "good fit" when the points (empirical estimates) are reasonably well aligned with the line (maximum likelihood estimate).
[^4]: Akaike Information Criterion, a statistical metric widely used to compare and choose among candidate models. The lower the AIC, the better the model.
[^5]: Although we have used the letter "T", which usually denotes "time to failure", it denotes the strength variable in our current example.
[^6]: \(\beta = 1/\sigma\).
[^7]: Why the log of the median and not just the median? Because we use the log-location-scale representation of the Weibull distribution shown in Eq. A above, and it is easier to work with logarithms. The conclusion is not affected by this transformation.
[^8]: I hope to write a separate article in the future explaining the difference between regular and simultaneous confidence intervals and why they are needed.
[^9]: The Bonferroni method was used to adjust the Type 1 error rate when constructing the simultaneous CIs.
[^10]: The fewer the parameters to be estimated, the simpler the model and the lower the risk of overfitting.
[^11]: The Bonferroni method was used to adjust the Type 1 error rate when constructing the simultaneous CIs.