1 p
2 Clinical Trial Investigation Interpretation of Results “to p or not to p” Ferran Torres Hospital Clínic Barcelona / Universitat Autònoma Barcelona. EMA: Scientific Advice Working Party (SAWP) Biostatistics Working Party (BSWP).
3 p
4 Today’s talk is on statistics
5
6 Statistics Considerations
7 Basic statistics: Why Statistics? Samples and populations. P-value. Random and systematic errors. Statistical errors. Sample size. Confidence intervals. Interpretation of CI: superiority, non-inferiority, equivalence
8 The role of statistics “Thus statistical methods are no substitute for common sense and objectivity. They should never aim to confuse the reader, but instead should be a major contributor to the clarity of a scientific argument.” The role of statistics. Pocock SJ. Br J Psychiat 1980; 137:
9 Why Statistics? Variation!!!!
10 Variability
11 Why Statistics?
Medicine is a quantitative science, but not an exact one (not like physics or chemistry).
Variation characterises much of medicine; statistics is about handling and quantifying variation and uncertainty.
– Humans differ in response to exposure to adverse effects. Example: not every smoker dies of lung cancer; some non-smokers die of lung cancer
– Humans differ in response to treatment. Example: penicillin does not cure all infections
– Humans differ in disease symptoms. Example: sometimes cough and sometimes wheeze are presenting features for asthma
12 Why Statistics Are Necessary
Statistics can tell us whether events could have happened by chance, and help us make decisions.
We need statistics because of the variability in our data.
Generalize: can what we know help to predict what will happen in new and different situations?
13 Population and Samples Target Population Population of the Study Sample
14 Extrapolation Sample Population Inferential analysis Statistical Tests Confidence Intervals Study Results “Conclusions”
15 Statistical Inference Statistical Tests=> p-value Confidence Intervals
16 Valid samples? Population Likely to occur Unlikely to occur Invalid Sample and Conclusions
17 P-value
The p-value is a “tool” to answer the question:
– Could the observed results have occurred by chance*?
– Remember: the decision is taken from the observed results in a SAMPLE, extrapolating the results to the POPULATION
*: accounts exclusively for random error, not bias
p < .05: “statistically significant”
18 P-value: an intuitive definition
The p-value is the probability of observing data at least as extreme as ours when the null hypothesis is true (no differences exist).
Steps:
1) Calculate the treatment difference in the sample (A-B)
2) Assume that both treatments are equal (A=B) and then…
3) …calculate the probability of obtaining a difference at least as large as the one observed, given assumption 2
4) Conclude according to that probability:
a. p<0.05: the differences are unlikely to be explained by chance; we assume that the treatment explains the differences
b. p>0.05: the differences could be explained by chance; we assume that chance explains the differences
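The four steps above can be sketched as a permutation test in Python. This is only an illustration under stated assumptions: the blood pressure values are invented, and `permutation_p_value` is a hypothetical helper name, not something from the talk.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for the difference in means of a and b."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))      # step 1: observed difference (A-B)
    pooled = list(a) + list(b)             # step 2: under A=B the labels are exchangeable
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:       # step 3: at least as large as observed
            hits += 1
    return (hits + 1) / (n_perm + 1)       # add-one so the estimate is never exactly 0

# Hypothetical systolic blood pressure values (mm Hg), invented for illustration
treatment_a = [128, 131, 125, 122, 127, 130, 124, 126]
treatment_b = [135, 138, 133, 136, 131, 139, 134, 137]
p = permutation_p_value(treatment_a, treatment_b)
print(f"p = {p:.4f}")                      # step 4: compare with 0.05
```

Shuffling the labels is one way to make "assume A=B" concrete; a t-test would answer the same question under a normality assumption.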
19 Factors influencing statistical significance Signal Noise (background) Quantity Difference Variance (SD) Quantity of data
20 Random vs Systematic error. Random / Systematic (Bias), each shown against the True Value. Example: Systolic Blood Pressure (mm Hg)
21 Random vs Systematic error Sample size Random Bias
22 P-value
A “statistically significant” result (p<.05) tells us NOTHING about clinical or scientific importance, only that the results are unlikely to be due to chance alone.
A p-value does NOT account for bias, only for random error.
23 P-value
A “very low” p-value does NOT imply:
– Clinical relevance (NO!!!)
– Magnitude of the treatment effect (NO!!)
With ↑ n or ↓ variability ⇒ ↓ p
Please never compare p-values!! (NO!!!)
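A short Python sketch of why a very low p-value says little about the size of the effect: the same difference gives very different p-values as n or the SD changes. It assumes a two-sample z-test with known SD; the difference of 5 and SD of 20 are hypothetical numbers.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(diff, sd, n_per_group):
    """Two-sided p-value of a two-sample z-test with known, common SD."""
    se = sd * sqrt(2 / n_per_group)        # standard error of the difference in means
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same observed difference, increasing sample size (hypothetical values)
for n in (20, 80, 320):
    print(n, round(two_sided_p(diff=5, sd=20, n_per_group=n), 4))
```

The treatment effect is identical in every row; only the amount of data changes, which is why p-values from different trials should not be compared.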
24 RCT from a statistical point of view 1 homogeneous population 2 distinct populations Randomisation Treatment B (control) Treatment A
25 RCT Sample Population
26 Statistics can never PROVE anything beyond any doubt, just beyond reasonable doubt!! … because of working with samples and random error
27 Type I & II Error & Power
28 The Usefulness of Believing in the Existence of God (according to Pascal) H0: God does not exist H1: God exists
29 Type I & II Error & Power
Type I Error (α)
– False positive
– Rejecting the null hypothesis when in fact it is true
– Standard: α=0.05
– In words, chance of finding statistical significance when in fact there truly was no effect
Type II Error (β)
– False negative
– Not rejecting the null hypothesis when in fact the alternative is true
– Standard: β=0.20 or 0.10
– In words, chance of not finding statistical significance when in fact there was an effect
30 Sample Size
The planned number of participants is calculated on the basis of:
– Expected effect of treatment(s)
– Variability of the chosen endpoint
– Accepted risks in conclusion
↗ effect ⇒ ↘ number; ↗ variability ⇒ ↗ number; ↗ risk ⇒ ↘ number
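A minimal Python sketch of how these three ingredients combine, assuming the usual normal-approximation formula for comparing two means, n per group = 2·((z for 1−α/2 plus z for 1−β)·SD/δ)². The function name and the example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Participants per arm for a two-sided, two-sample comparison of means."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sd / delta) ** 2)

# Hypothetical endpoint: expected difference of 10 mmHg, SD of 20 mmHg
print(n_per_group(delta=10, sd=20))    # 63 per group
print(n_per_group(delta=20, sd=20))    # larger effect, fewer participants
print(n_per_group(delta=10, sd=30))    # more variability, more participants
```

Relaxing α or the power in the call reproduces the third arrow: accepting more risk lowers the required number.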
33 Interval Estimation Confidence interval Sample statistic (point estimate) Confidence limit (lower) Confidence limit (upper) “A probability that the population parameter falls somewhere within the interval”
34 95% CI
Better than p-values…
– …use the data collected in the trial to give an estimate of the treatment effect size, together with a measure of how certain we are of our estimate
A CI is a range of values within which the “true” treatment effect is believed to be found, with a given level of confidence.
– If the trial were repeated many times, 95% of the 95% CIs calculated this way would contain the ‘true’ treatment effect
Generally, the 95% CI is calculated as
– Sample Estimate ± 1.96 × Standard Error
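The "estimate ± 1.96 × standard error" rule can be sketched in Python as follows. This uses the normal approximation, so for small samples a t quantile would be more appropriate; `mean_diff_ci` and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_diff_ci(a, b, level=0.95):
    """Point estimate and normal-approximation CI for mean(a) - mean(b)."""
    diff = mean(a) - mean(b)
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for level=0.95
    return diff, diff - z * se, diff + z * se

# Hypothetical measurements in two arms
test_arm = [12.1, 9.8, 11.4, 10.9, 13.0, 10.2, 11.7, 12.4]
control_arm = [9.9, 8.7, 10.1, 9.4, 10.8, 8.9, 9.6, 10.3]
estimate, lower, upper = mean_diff_ci(test_arm, control_arm)
print(f"difference = {estimate:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```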
35 Superiority study. d > 0: + effect (Test better). d = 0: No differences. d < 0: - effect (Control better). 95% CI
36 Treatment-Control difference, from “Treatment less effective” to “Treatment more effective”, with a lower and an upper equivalence boundary. Conclusions read from the position of the CI: Statistical superiority; Statistical and clinical superiority; Non-inferiority; Equivalence; Inferiority
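Reading these conclusions off a confidence interval can be sketched in Python. The sketch assumes d = Treatment − Control with positive values favouring the treatment, and a symmetric margin; `interpret_ci` and its labels are hypothetical names.

```python
def interpret_ci(lower, upper, margin):
    """Claims supported by a CI for (treatment - control), given an equivalence margin."""
    claims = []
    if lower > 0:
        claims.append("statistical superiority")       # CI excludes 0 on the good side
    if lower > margin:
        claims.append("clinical superiority")          # CI lies above the upper boundary
    if lower > -margin:
        claims.append("non-inferiority")               # CI lies above the lower boundary
    if lower > -margin and upper < margin:
        claims.append("equivalence")                   # CI lies between both boundaries
    if upper < 0:
        claims.append("inferiority")                   # CI excludes 0 on the bad side
    return claims

print(interpret_ci(-2, 2, margin=3))   # ['non-inferiority', 'equivalence']
print(interpret_ci(1, 5, margin=3))    # ['statistical superiority', 'non-inferiority']
```

An empty list means the interval is too wide to support any of these claims.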
37 Effect measurement scales: Risks
38 Calculating RR and OR. RR or OR > 1: risk factor. RR or OR = 1: absence of ‘effect’. RR or OR < 1: protective factor
39 Calculating RR and OR. Unexposed / Exposed / Diseased. Proportion in exposed: 0.50. Proportion in unexposed: 0.25. RR = 2. Odds in exposed: 2/2 = 1. Odds in unexposed: 1/3. OR = 3
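The slide's arithmetic can be checked with a few lines of Python. The 2×2 counts are inferred from the stated proportions and odds (2 of 4 exposed and 1 of 4 unexposed are diseased); the function name is hypothetical.

```python
def risk_ratio_and_odds_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Return (RR, OR) from the counts of a 2x2 table."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    exposed_healthy = exposed_total - exposed_cases
    unexposed_healthy = unexposed_total - unexposed_cases
    rr = risk_exposed / risk_unexposed
    # Cross-product form of the odds ratio: (a * d) / (b * c)
    odds_ratio = (exposed_cases * unexposed_healthy) / (exposed_healthy * unexposed_cases)
    return rr, odds_ratio

rr, odds_ratio = risk_ratio_and_odds_ratio(2, 4, 1, 4)
print(rr, odds_ratio)   # 2.0 3.0
```

The OR (3) is further from 1 than the RR (2) because the outcome is common in this example.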
40
41
42 Let's be critical. Sometimes things are not what they seem
43 Let's be critical. Obtaining the results: is the statistical technique used appropriate? T-test / Repeated-measures ANOVA
44
45
46 Let's be critical. Claims without the results being specified. Percentages without the denominator. Means without a confidence interval. Can I trust the value?
47 Let's be critical. One more example: a patient is advised to undergo surgery and asks about the probability of surviving. The surgeon answers that in the 30 operations he has performed, no patient has died. Which values of P(death) are compatible with this information, with 95% confidence?
48 Let's be critical. Solution: upper limit of the 95% CI for p=0 with n=30: Pr(X=0 | n=30, p_s) = 0.025. The approximate solution does not work. Exact solution, based on the binomial: {0; 0.116}. Even if mortality is 11.6%, no death will be observed in 30 operations with Pr=0.025
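The exact upper limit can be reproduced in Python. With zero events, Pr(X=0) = (1 − p)^n, so solving (1 − p)^30 = 0.025 gives the limit directly; the function name is hypothetical.

```python
def upper_limit_zero_events(n, tail_prob=0.025):
    """Largest event probability p such that Pr(X=0 | n, p) is still tail_prob."""
    # Pr(X=0) = (1 - p) ** n, so p = 1 - tail_prob ** (1 / n)
    return 1 - tail_prob ** (1 / n)

p_upper = upper_limit_zero_events(30)
print(round(p_upper, 3))                 # 0.116
print(round((1 - p_upper) ** 30, 3))     # 0.025, probability of seeing no deaths
```

So "no deaths in 30 operations" is compatible with a true mortality anywhere from 0 up to about 11.6%.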
49 Let's be critical. If data are available... ...they should not be wasted. Well ‘tortured’ data end up singing. p<0.05 !!!
50 And what about the denominator? The famous fantastic dog
51 Because afterwards, what happens happens
52 Key statistical issues Multiplicity Subgroups: interaction & confounding Superiority and non-inferiority (and ) Adjustment by covariates Missing data Others –Interim analyses –Meta-analysis vs one pivotal study –Flexible designs
53 MULTIPLICITY
54 Roland Garros tournament, … round: Carlos Moyá vs Markus Hipfl
55 Lancet 2005; 365: 1591–95 To say it colloquially, torture the data until they speak...
56 Torturing data… –Investigators examine additional endpoints, manipulate group comparisons, do many subgroup analyses, and undertake repeated interim analyses. –Investigators should report all analytical comparisons implemented. Unfortunately, they sometimes hide the complete analysis, handicapping the reader’s understanding of the results. Lancet 2005; 365: 1591–95
57 Design / Conduct / Results
58 Multiplicity
$K$ independent hypotheses: $H_{01}, H_{02}, \ldots, H_{0K}$
$S$ significant results ($p < \alpha$)
$\Pr(S \ge 1 \mid H_{01} \cap H_{02} \cap \cdots \cap H_{0K} = H_{0\cdot}) = 1 - \Pr(S = 0 \mid H_{0\cdot}) = 1 - (1 - \alpha)^K$
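A few lines of Python make the growth of this probability concrete, assuming independent tests each run at the same level α. The function name is hypothetical.

```python
def familywise_error(k, alpha=0.05):
    """Probability of at least one false positive among k independent true nulls."""
    return 1 - (1 - alpha) ** k

for k in (1, 2, 5, 10, 20):
    print(k, round(familywise_error(k), 3))
```

With 5 tests the chance of at least one spurious "significant" result is already about 23%.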
59 Some examples
60 Multiplicity
Bonferroni correction (simplified version)
– K tests with an overall significance level of α
– Each test can be tested at the α/K level
Example:
– 5 independent tests
– Global level of significance = 5%
– Each test should be tested at the 1% level: 5% / 5 = 1%
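The simplified Bonferroni rule can be sketched in Python as below. The p-values are hypothetical, and `bonferroni_reject` is an illustrative name rather than a library function.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Flag each test as significant when its p-value is below alpha / K."""
    per_test_level = alpha / len(p_values)
    return [p < per_test_level for p in p_values]

# Five hypothetical tests at a global 5% level: each is judged at the 1% level
p_values = [0.004, 0.030, 0.012, 0.200, 0.008]
print(bonferroni_reject(p_values))   # [True, False, False, False, True]
```

Two results with p < 0.05 (0.030 and 0.012) are no longer declared significant once the level is split across the five tests.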
61 But this is the simplified version for the general public
62 Cautionary Example. RCT to treat rheumatoid arthritis (Basic Clin Med 1981, 15: 445). Several end-points repeated at various timepoints and in various subdivisions. 48 of these gave p-values < 0.05. But… by chance alone we expect 5% of 850 = 850/20 = 42.5, so finding 48 is not very impressive
63 Some strategies to cope with multiple comparisons
64 Handling Multiplicity in Variables
Scenario 1: One Primary Variable
– Identify one primary variable; other variables are secondary
– Trial is positive if and only if the primary variable shows significant (p < 0.05), positive results
65
66 Handling Multiplicity in Variables
Scenario 2: Divide Type I Error
– Identify two (or more) co-primary variables
– Divide the 0.05 experiment-wise Type I error over these co-primary variables, e.g., 0.04 for the 1st, and 0.01 for the 2nd co-primary variable
– Trial is positive if at least one of the co-primary variables shows significant, positive results
67 Handling Multiplicity in Variables
Scenario 3: Sequentially Rejective Procedure
– Identify n co-primary variables, e.g., n = 3
– Order the obtained p-values
– Interpret the variable with the smallest p-value at the 0.05/3 level; if significant, interpret the variable with the 2nd smallest p-value at the 0.05/2 level; if significant, interpret the variable with the largest p-value at the 0.05 level
– The test procedure stops when a test is not significant
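A minimal Python sketch of one sequentially rejective procedure, Holm's step-down method. Taking Holm as the intended variant is an assumption: p-values are visited from smallest to largest against 0.05/n, 0.05/(n-1), ..., 0.05, and testing stops at the first non-significant result. The function name and p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down decisions, returned in the original order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):             # alpha/m, alpha/(m-1), ..., alpha
            reject[i] = True
        else:
            break                                         # stop at the first failure
    return reject

print(holm_reject([0.010, 0.040, 0.030]))   # [True, False, False]
print(holm_reject([0.010, 0.020, 0.040]))   # [True, True, True]
```

In the first call 0.030 fails its 0.05/2 threshold, so 0.040 is never tested even though it is below 0.05.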
68 Handling Multiplicity in Variables
Scenario 4: Hierarchy
– Prespecify a hierarchy among the n co-primary variables
– All are tested at the same level: interpret the 1st variable at the 0.05 level; if significant, interpret the 2nd variable at the 0.05 level; if significant, interpret the 3rd variable at the 0.05 level…
– The test procedure stops when a test is not significant
– Trial is positive if the first co-primary variable shows a significant, positive result
69 Secondary Variables
Secondary variables can be claimed if and only if
– the primary variable shows significant results, and
– the comparisons related to the secondary variables are also protected under the same Type I error rate as the primary variable.
Similar procedures to those already discussed can be used to protect the Type I error
70 Handling Multiplicity in Treatments
Similar procedures to those used to handle multiplicity in variables.
Additional procedures are available, mainly geared to very specific settings of the statistical hypotheses.
– Dunnett, Scheffé, REGW, Williams…
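The hierarchical (fixed-sequence) rule can be sketched in Python. The p-values must be supplied in the prespecified order, not sorted by size; the function name and values are hypothetical.

```python
def fixed_sequence_reject(p_values_in_order, alpha=0.05):
    """Test each variable at the full alpha, in the prespecified order, until one fails."""
    decisions = []
    still_testing = True
    for p in p_values_in_order:
        if still_testing and p < alpha:
            decisions.append(True)
        else:
            still_testing = False      # every later variable is left unclaimed
            decisions.append(False)
    return decisions

# Hypothetical hierarchy: the 2nd variable fails, so the 3rd cannot be claimed
print(fixed_sequence_reject([0.010, 0.200, 0.001]))   # [True, False, False]
```

No α is split here, which is why the order has to be fixed before the data are seen.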
71 SUBGROUPS
72 Subgroups Indiscriminate subgroup analyses pose serious multiplicity concerns. Problems reverberate throughout the medical literature. Even after many warnings, some investigators doggedly persist in undertaking excessive subgroup analyses. Lancet 2000; 355: 1033–34 Lancet 2005; 365: 1657–61
73 Interaction. Age < 45 years / Age >= 45 years. d=5% d=0.7% d=11.5%
74 Confounding factors. Non-smokers / Smokers. d=6% d=0%
75 Subgroups & Simpson’s Paradox
76 Subgroups & Simpson’s Paradox cont.
77 Subgroups. ISIS-2: Vascular death by star signs (Aspirin vs Placebo; vascular deaths / total). Gemini/Libra: …% vs 10.2%, d=-0.9, p=… Other star signs: …% vs 12.1%, d=3.1, p<0.0001. Interaction p = … Lancet 1988; 2: 349–60.
78 Changes from ISIS-2 results Lancet 2005; 365: 1657–61
79 “The answer to a randomized controlled trial that does not confirm one’s beliefs is not the conduct of several subanalyses until one can see what one believes. Rather, the answer is to re-examine one’s beliefs carefully.” –BMJ 1999; 318: 1008–09.
80 Lancet 2005; 365: 1657–61
81 the question is NOT: ‘Is the treatment effect in this subgroup statistically significantly different from zero?’ BUT… are there any differences in the treatment effect between the various subgroups? The correct statistical procedures are either a test of heterogeneity or a test for interaction
82 Subgroups Recommendations: –1) Examine the global effect –2) Test for the interaction –3) Plan adjustments for confirmatory analyses –4) Some points which increase the credibility: Pre-specification Biologic plausibility
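A test for interaction between two subgroups can be sketched in Python as a z-test on the difference between the subgroup effects. This is a simplified illustration assuming independent, approximately normal estimates; the effects, standard errors, and function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def interaction_p_value(effect_1, se_1, effect_2, se_2):
    """Two-sided p-value for H0: the treatment effect is the same in both subgroups."""
    z = (effect_1 - effect_2) / sqrt(se_1 ** 2 + se_2 ** 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical subgroup estimates: treatment difference (%) and its standard error
p = interaction_p_value(effect_1=0.7, se_1=3.0, effect_2=11.5, se_2=3.5)
print(f"interaction p = {p:.3f}")
```

The question answered is whether the two effects differ from each other, not whether either differs from zero.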
83 Lancet 2005; 365: 176–86
84 MULTIPLE INSPECTIONS
85 Interim Analyses in the CDP [Figure: Z value by month of follow-up (Month 0 = March 1966, Month 100 = July 1974)] Coronary Drug Project Mortality Surveillance. Circulation. 1973;47:I-1
86 Lancet 2005; 365: 1657–61
87 Types of sequential design: 1) Sample size re-estimation 2) Group sequential methods 3) α spending function approach 4) Repeated confidence intervals 5) Stochastic curtailment 6) Bayesian methods 7) Continuous boundaries (likelihood function)
88 Design NOT suitable for a sequential method. Analysis? Total duration. Recruitment
89 Design suitable for a sequential method. Analysis. Total duration. Recruitment
90 Group sequential methods. Pocock (1977). Repeated significance tests. K = maximum number of inspections to be performed. K fixed a priori. Analysis with classical statistical tests (χ², t-test, ...)
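The reason repeated significance tests need adjusted boundaries can be illustrated with a small Monte Carlo sketch in Python. It assumes K equally sized groups, a normally distributed test statistic, and a naive unadjusted 0.05 threshold at every look; all names and settings are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def overall_type_i_error(n_looks, nominal_alpha=0.05, n_trials=20_000, seed=1):
    """Estimated chance of crossing the unadjusted boundary at any look, under H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - nominal_alpha / 2)
    false_positives = 0
    for _ in range(n_trials):
        cumulative = 0.0
        for k in range(1, n_looks + 1):
            cumulative += rng.gauss(0, 1)            # one more group of data, no true effect
            if abs(cumulative / sqrt(k)) > z_crit:   # z-statistic at look k
                false_positives += 1
                break                                # the trial would stop and claim an effect
    return false_positives / n_trials

for looks in (1, 2, 5, 10):
    print(looks, overall_type_i_error(looks))
```

Group sequential designs such as Pocock's replace the fixed 0.05 threshold with a stricter one at each look so that the overall error stays at the planned level.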
91 Group Sequential Methods
92 Two-sided triangular model
93 CPMP/EWP/482/99: PtC on Switching between Superiority and Non-Inferiority & CPMP/EWP/2158/99: PtC on the Choice of Delta
94 RANDOMIZATION & COVARIATES
95 Adjustment. The objective should be not to compensate for imbalance (randomisation) but to improve precision. Avoid adjusting for post-randomization variables. In an RCT, never use this widespread strategy: “adjust for any baseline variable that is significant (at the 5% or 10% level)”
96 Stratification. A priori. May desire to have treatment groups balanced with respect to prognostic or risk factors (covariates). For large studies, randomization “tends” to give balance. For smaller studies a better guarantee may be needed. Useful only to a limited extent (especially for small trials); avoid too many variables (i.e. many empty or partly filled strata)
97 Testing for “baseline homogeneity” All observed differences are known with certainty to be due to chance. We must not test for it: there is no alternative hypothesis whose truth can be supported by such a test. If significant, the estimator is still unbiased Balance: –Decreases the variance and increases the power. –It has no effect on type I error.
98 Observed imbalance… NEVER justifies post-hoc adjustment:
– Randomization is more important
– The treatment effect is unbiased without adjustment (randomization)
– The type I error level already accounts for “chance error”
– Post-hoc: data-driven analyses
– Multiplicity issues: allowing a post-hoc adjustment increases the type I error
99 Adjusted Analyses ‘ When the potential value of an adjustment is in doubt, it is often advisable to nominate the unadjusted analysis as the one for primary attention, the adjusted analysis being supportive.’
100 Adjustment for covariates. Defined a priori. The appearance of baseline imbalances does NOT justify adjustment per se: – Randomisation is given more importance – Danger of post-hoc analyses – Multiplicity. As a general strategy, a priori adjustment for significant baseline variables (e.g., p<0.1 or p<0.05) is NOT valid
101 Definition of the different populations of a study
102 Objective: to evaluate the efficacy of a weight-reduction programme versus usual advice. Design: randomised clinical trial. Candidates: 790. Obese: 320. Intervention group: 161. Control group: 159. Refusal: 59. Spontaneous request: 54. Completed: 102. Completed: 105
103 Intervention group: 161. Control group: 159. Refusal: 59. Spontaneous request: 54. Completed: 102. Completed: 105
104 MISSING DATA
105 Ex: LOCF & linear extrapolation [Figure: ADAS-Cog (> worse, < better) against time (months), comparing LOCF with linear regression and marking the resulting bias]
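Last observation carried forward (LOCF) can be sketched in a few lines of Python. The visit values are hypothetical ADAS-Cog scores, with `None` marking visits missed after drop-out.

```python
def locf(values):
    """Replace each missing value (None) with the last observed value before it."""
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)       # stays None until a first value is observed
    return filled

# Hypothetical patient who worsens and then drops out after the second visit
print(locf([20, 22, None, None]))      # [20, 22, 22, 22]
```

Because the score is frozen at 22, any later worsening is ignored, which is where the bias in a progressive disease comes from.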
106 Ex: Early drop-out due to AE [Figure: ADAS-Cog (> worse, < better) against time (months), Placebo vs Active] Bias: favours Active
107 Ex: Early drop-out due to lack of efficacy [Figure: ADAS-Cog (> worse, < better) against time (months), Placebo vs Active] Bias: favours Placebo
108 Drop-outs and missing data: ≠ frequencies [Figure: patients randomised (RND) to A or B, followed from Baseline through Visit 1 and Visit 2 to Last Visit]
109 Drop-outs and missing data: ≠ timing [Figure: patients randomised (RND) to A or B, followed from Baseline through Visit 1 and Visit 2 to Last Visit]
110 MD and incorrect use of populations (1)
Design: surgery vs medical treatment in bilateral carotid stenosis (Sackett et al., 1985)
Primary endpoint: number of patients presenting with TIA, stroke or death
Patient distribution: randomised patients: 167; surgical treatment: 94; medical treatment: 73
– Patients who did not complete the study because of a stroke in the early phases of hospitalisation: surgical treatment: 15 patients; medical treatment: 1 patient
111 MD and incorrect use of populations (2)
First analysis performed:
Per-protocol (PP) population: patients who completed the study
Analysis
– Surgical treatment: 43 / (94 - 15) = 43 / 79 = 54%
– Medical treatment: 53 / (73 - 1) = 53 / 72 = 74%
– Risk reduction: 27%, p = 0.02
112 MD and incorrect use of populations (3)
The definitive analysis is as follows:
Intention-to-treat (ITT) population: all randomised patients
Analysis
– Surgical treatment: 58 / 94 = 62%
– Medical treatment: 54 / 73 = 74%
– Risk reduction: 18%, p = 0.09 (PP: 27%, p = 0.02)
Conclusions: The correct analysis population is the ITT. Surgical treatment has not been shown to be significantly superior to medical treatment
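The two sets of event proportions can be recomputed from the counts on these slides with a short Python sketch. Only the proportions are recomputed here, not the risk reductions or p-values; the variable and function names are hypothetical.

```python
def event_rate(events, patients):
    """Proportion of patients with an event (TIA, stroke or death)."""
    return events / patients

randomised = {"surgical": 94, "medical": 73}
early_strokes = {"surgical": 15, "medical": 1}        # excluded from PP, counted in ITT
events_in_completers = {"surgical": 43, "medical": 53}

for arm in ("surgical", "medical"):
    # Per protocol: completers only, early strokes dropped from the denominator
    pp = event_rate(events_in_completers[arm], randomised[arm] - early_strokes[arm])
    # Intention to treat: everyone randomised, early strokes counted as events
    itt = event_rate(events_in_completers[arm] + early_strokes[arm], randomised[arm])
    print(f"{arm}: PP {pp:.0%}, ITT {itt:.0%}")
```

Dropping the 15 early strokes flatters the surgical arm far more than dropping 1 flatters the medical arm, which is what moves the comparison.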
113 Handling of MD Methods for imputation: –Many techniques –No gold standard for every situation –In principle, all methods may be valid: Simple methods to more complex: –From LOCF to multiple imputation methods –Worst Case, “Mean methods” Multiple Imputation But their appropriateness has to be justified Statistical approaches less sensitive to MD: –Mixed models –Survival models They assume no relationship between treatment and the missing outcome, and generally this cannot be assumed.
114 CONCLUSION
115
116
117
118 JAMA 2002; 287:
119 Effect Size & Sample Size [Table: relative difference (%), absolute difference (mmHg) and power*; other columns not recoverable. Power by row: 4.9%, 5.9%, 8.5%, 13.3%, 20.2%, 28.2%, 39.3%, 49.3%, 61.1%, 71.0%, 80.4%] * Statistical power assuming constant variability (SD=20mmHg)
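How power rises with the size of the effect at fixed variability can be sketched in Python. The sketch assumes a two-sided two-sample z-test; the sample size of 63 per group and the differences shown are hypothetical and are not the values behind the slide's table.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(delta, sd, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample z-test to detect a true difference delta."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / (sd * sqrt(2 / n_per_group))   # true difference in SE units
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

# SD fixed at 20 mmHg as on the slide; n per group is an assumed value
for delta in (0, 2, 4, 6, 8, 10):
    print(delta, f"{power_two_sample(delta, sd=20, n_per_group=63):.1%}")
```

With no true difference the "power" equals α, and it climbs towards 80% only as the difference approaches the value the trial was sized for.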
120
121 CPMP/EWP/482/99: PtC on Switching between Superiority and Non-Inferiority & CPMP/EWP/2158/99: PtC on the Choice of Delta
122 NON-INFERIORITY TRIALS. NEED: Legal implications. Methodological implications. Ethical and practical limitations on the use of placebo. Practical limitations on showing superiority over an active control. Need for comparative information. Possible added value.
123
125 Power approach (classical test + power calculation)
127
129 Lancet 2001,356:
131 Added value. Dosing: once daily. Route: oral. Safety: adverse events. Special populations: elderly, paediatrics. Interactions
132 Equivalence trials. Bioequivalence trials (generic vs marketed product). Our product is not worse and may offer other advantages (safety, dosing convenience…) – Non-inferiority
133 SUPERIORITY STUDY. d > 0: + effect (Test better). d = 0: No difference. d < 0: - effect (Control better). 95% CI
134 INTERVAL ESTIMATION (SUPERIORITY STUDY). Statistically significant. d > 0: + effect (Test better). d = 0: No difference. d < 0: - effect (Control better). 95% CI
135 INTERVAL ESTIMATION (SUPERIORITY STUDY). Statistically significant with P=0.05 (right at the limit). d > 0: + effect (Test better). d = 0: No difference. d < 0: - effect (Control better). 95% CI
136 EQUIVALENCE STUDY. d > 0: + effect. d = 0: No difference. d < 0: - effect. Region of clinical equivalence: from −Δ to +Δ. Delta (Δ): the largest difference without clinical relevance, or the smallest difference with clinical relevance
137 EQUIVALENCE. 0. Equivalence / No equivalence
138 THERAPEUTIC NON-INFERIORITY. Non-inferiority / No non-inferiority. Axis marks: 0 and −Δ. Test better / Control better