The Pharmaceutical Industry’s Dilemma with Large Patient Sample Sizes in Clinical Trials: The Statistician’s Role Now and in the Future

Davis Gates Jr., Ph.D.
Associate Director, Schering-Plough Research Institute

Brief Background on the Pharmaceutical Industry’s Clinical Trials Process

There are many areas of drug development in the pharmaceutical industry, such as toxicology, drug device development (inhalers, tablet composition), market research, and long-term safety and surveillance, each warranting its own discussion. This presentation focuses on Clinical Trials, which are performed in a sequence of steps called Phases. The first three Phases (I, II, III) allow researchers to collect reliable information while best protecting patients. Phased clinical trials require considerable input from statisticians; in particular, determination of sample size can at times challenge the statistician’s role in a drug development program.

Phase I: Focusing on Safety

Phase I trials are the first step in testing a new treatment approach in humans.
Usually small sample sizes, divided into cohorts (for example, N=12/dose)
What range of doses is safe (monitoring side effects)
Route of administration (oral, injection into vein or muscle)
Schedule of dosing (once-a-week, once-a-day, twice-a-day)

Phase II: Studying Effectiveness

Sample sizes are much larger (approaching/reaching large clinical trials)
Select optimal doses based on a range of dose levels
Optimal doses are selected from a combination of safety and efficacy
Sometimes more than one dose is carried forward to Phase III

Phase III: Final Decision for Proposal of Treatment Options

Collective sample sizes across studies are in the thousands
Determine target dosing for market approval (Pivotal Studies)
Compile the safety profile by pooling studies (Adverse Events)
Determine if observed adverse events are acceptable with treatment
Compare the test drug against standard therapy, or in some cases placebo
Finalization of the program results in submission to health authorities

Key Statistical Terms

Randomization

The process of randomly assigning a treatment to subjects enrolled in a clinical trial, such that each subject has a pre-assigned probability of being allocated to either the test drug(s) or a standard therapy/placebo. A simple case equally allocates subjects to a test drug or standard treatment so that the sample sizes of the two groups are as close to equal as possible. Other cases assign unequal sample sizes across treatments, such as twice as many subjects to the active treatment as to placebo, for ethical reasons, to minimize the number assigned to placebo. Here, a 2:1 ratio is designated such that for every subject assigned to placebo (P), two are assigned to the active treatment (A), set up in blocks of size three (APA, AAP, PAA). Randomization is usually enforced through a pre-specified randomization schedule generated by a statistician.
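The 2:1 blocked randomization described above can be sketched in Python. This is a minimal illustration only (the seed and number of blocks are arbitrary); production schedules are generated and locked by the study statistician before the trial starts.

```python
import random

def randomization_schedule(n_blocks, block=("A", "A", "P"), seed=2024):
    """Generate a 2:1 active:placebo schedule in blocks of three.

    Shuffling within each block keeps the running allocation close
    to the 2:1 target while leaving individual assignments
    unpredictable. Illustrative sketch only.
    """
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_blocks):
        b = list(block)
        rng.shuffle(b)       # one of APA, AAP, PAA per block
        schedule.extend(b)
    return schedule

schedule = randomization_schedule(4)   # 12 subjects
print(schedule)
print("A:", schedule.count("A"), "P:", schedule.count("P"))  # 8 A, 4 P
```

Because the imbalance can never exceed what one partial block allows, the 2:1 ratio holds almost exactly at any interim point in enrollment.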
Blinding

A double-blind trial is one in which neither the subject nor the investigator conducting the trial knows the treatment assignment. An evaluator-blind trial is one in which the subject has knowledge of the treatment assignment but the investigator does not; when two treatment regimens differ substantially, it may not be feasible to blind the subject. Open-label studies, in which both the subject and investigator are un-blinded, are at times performed, usually to assess safety. Pivotal trials, used to assess the efficacy of a drug for approval, are usually double-blind, though there are exceptions, such as evaluator-blind studies with well-defined biological endpoints.

Superiority

The test drug “beating” the standard treatment or placebo, usually confirmed with a p-value (p < 0.05) indicating that the test drug is statistically significantly better than the standard treatment or placebo.

Clinically Meaningful Difference

A pre-defined minimum magnitude justifying a clinically relevant difference, regardless of the observed statistical test. For example, a trial can be so large that small (meaningless) differences can be detected, so a predefined difference is stated, such as a 0.5-point difference on a quality-of-life questionnaire. In many cases, study team members do not want to commit to a magnitude if one doesn’t already exist. The general belief in the industry is that the p-value “gets you in the door” with health authorities such as the FDA; one then argues the clinical relevance during the approval process.

Primary Endpoint

The designated measure of efficacy (driven by a primary hypothesis) that determines the success or failure of the trial, and from which the study is sized. Generally, primary endpoints are powered at 80% to 90% to detect differences, with a two-sided 5% level of significance. Low-powered endpoints are a company risk, but overpowered studies (>99% power) are not well received by regulatory agencies.
Secondary Endpoints

Additional measures of efficacy generally used as supportive information for the trial; they cannot determine study success when the primary endpoint is not successful. They are usually not considered in sizing trials unless they are determined to be key in providing additional test drug benefit, and even in these cases, sizing of the trial can be overlooked.

Addition of “Key” Secondary Endpoints

Lately, Key Secondary Endpoints have been added to trial designs to support additional benefits of the test drug. These should also be considered when sizing the study, as the treatment difference in these endpoints may require a larger sample size than the primary endpoint.

Typical Sample Size Calculations Address the Primary Endpoint: Patient Reported Symptoms Study

A sample size of 125 subjects per treatment arm is required to detect a difference of 1.0 point or more between active and placebo, assuming a two-sided 5% level of significance and 90% power, with a pooled standard deviation of 2.40 points.

Primary Endpoint
Total Nasal Symptom Score (a sum of congestion, nasal itching, sneezing, and post nasal drip) averaged over four weeks.

Key Secondary Endpoints
1) Proportion of Symptom-Free Days (each determined per patient) across the four-week treatment period
2) Quality of Life Questionnaire at Endpoint (the last post-baseline observation carried forward)
3) Morning Peak Nasal Inspiratory Flow Rate (Liters/minute) averaged over four weeks

Power Calculations for 125 subjects per treatment arm

Endpoint                        Delta(a)   Pooled STD (CV)   Power
Total Nasal Symptom Score       1.0        2.4 (0.42)        90%
Symptom Free Days               0.1        0.28 (0.36)       80%
Quality of Life Questionnaire   0.5        0.8 (0.63)        >99%
Morning Peak Flow               10         36 (0.28)         60%

Joint power (assuming independent endpoints) = 43%
a: Delta = the treatment difference between active and placebo.
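The powers in the table above can be reproduced with the standard two-sample normal approximation. This is a sketch (the presentation presumably used a t-test calculation, so individual endpoints can differ by a percentage point), but the joint power comes out at the 43% quoted on the slide.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_sample(delta, sd, n_per_arm, z_alpha=1.959964):
    """Approximate power of a two-sided 5% two-sample test for a
    mean difference `delta` with common standard deviation `sd`."""
    se = sd * math.sqrt(2.0 / n_per_arm)
    return norm_cdf(delta / se - z_alpha)

endpoints = {  # (delta, pooled SD) from the table above
    "Total Nasal Symptom Score": (1.0, 2.4),
    "Symptom Free Days": (0.1, 0.28),
    "Quality of Life Questionnaire": (0.5, 0.8),
    "Morning Peak Flow": (10.0, 36.0),
}

joint = 1.0
for name, (delta, sd) in endpoints.items():
    p = power_two_sample(delta, sd, 125)
    joint *= p
    print(f"{name}: {p:.0%}")
print(f"Joint power: {joint:.0%}")  # about 43%, matching the slide
```

Rerunning the loop with n_per_arm=205 reproduces the re-powered table that follows, including the roughly 74% joint power.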
Re-powering around the weakest endpoint to assure a reasonable probability of study success: propose 205 subjects per arm for 80% power to detect the Morning Peak Flow difference.

Power Calculations for 205 subjects per treatment arm

Endpoint                        Delta   Pooled STD (CV)   Power
Total Nasal Symptom Score       1.0     2.4 (0.42)        98%
Symptom Free Days               0.1     0.28 (0.36)       95%
Quality of Life Questionnaire   0.5     0.8 (0.63)        >99%
Morning Peak Flow               10      36 (0.28)         80%

Joint power (assuming independent endpoints) = 74%

Multiplicity

Multiple endpoints require adjustments to control the overall alpha level of significance. In general, the primary endpoint is tested first, followed by tests of the key secondary endpoints.

Bonferroni: Split the alpha level across all the key secondary endpoints and test them simultaneously, so the failure of one key secondary does not affect the testing of another.
Sequential: Order the key secondary endpoints and test them in the prespecified sequence; if one test fails, the following tests lose the overall alpha control. This approach is more desirable in cases where one has knowledge of the probability of success.
Create a family tree: Divide the key secondary endpoints into groups, assign a partitioned alpha to each, and test the families simultaneously, but sequentially within each family.
More complex methods, such as Hochberg’s procedure, can be applied, but are worthy of a separate presentation.

When is it time to reconsider including Key Secondary Endpoints?

When powering a key secondary endpoint up to 80% forces the primary endpoint to be heavily overpowered (>99%), such that meaningless treatment differences become statistically significant.
o Example: An easily powered primary endpoint (such as forced expiratory volume in an asthma study using an analysis of covariance) followed by a more difficult-to-power key secondary endpoint (such as a time-to-infrequent-event analysis using a logrank test)

When the joint power across the primary and key secondary endpoints is <50%: this is usually a sign that there are too many key secondary endpoints, or that it is not feasible to properly power one or more of them without overpowering the primary endpoint. In these cases, consider running a separate trial.

Overview of Factors Influencing Sample Size Calculations

Powering the Primary Endpoint
The need to power Key Secondary Endpoints
Minimum sample size requirements for a pooled-studies safety database
Superiority/Non-Inferiority margins (treatment differences)

Additional Tools for Dealing with Sample Size Issues

Adaptive designs: adjustments carried out during the trial
o Dropping ineffective treatment arms
o Re-estimating the sample size
Pooling studies
o Pool similar-design studies prior to analysis of endpoints
o Pool all studies for evaluation of adverse events
Non-Inferiority for studies in which a placebo is not feasible: a defined criterion whereby the test drug is no worse than the standard treatment. The criterion requires confidence intervals for the treatment difference, where the lower bound cannot fall below a pre-specified margin, such as a percentage of the treatment effect size.

Back to the Pools of Studies for Examination of Safety

A single study designed to examine efficacy is too small for a thorough examination of adverse events (an undesired outcome such as a headache or cough, or a more serious medical condition such as elevated blood pressure, dizziness, or liver toxicity).
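The Bonferroni and sequential (fixed-sequence) strategies for key secondary endpoints, described under Multiplicity above, can be sketched as follows. The endpoint names and p-values are hypothetical, chosen only to show how the two procedures can disagree.

```python
def bonferroni(p_values, alpha=0.05):
    """Test each key secondary endpoint at alpha/k, so the failure
    of one does not affect the testing of another."""
    k = len(p_values)
    return {name: p < alpha / k for name, p in p_values.items()}

def fixed_sequence(ordered_p_values, alpha=0.05):
    """Test in the prespecified order at the full alpha; once a test
    fails, all later endpoints lose alpha control and cannot be
    declared significant."""
    results, gate_open = {}, True
    for name, p in ordered_p_values:
        results[name] = gate_open and p < alpha
        if not results[name]:
            gate_open = False
    return results

# Hypothetical p-values for three key secondary endpoints
p = {"symptom-free days": 0.012, "quality of life": 0.08, "peak flow": 0.004}
print(bonferroni(p))                    # each tested at 0.05/3
print(fixed_sequence(list(p.items())))  # peak flow blocked by the failure before it
```

The contrast matches the trade-off in the text: Bonferroni rescues "peak flow" despite the "quality of life" failure, while the sequential procedure spends no alpha splitting but forfeits every endpoint after the first failure.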
At 125 subjects per treatment arm, if the underlying adverse event rate in the control/placebo arm is 2%, a rate of 8.5% or more must occur in the active treatment arm to reach statistical significance (p < 0.05 at 50% power using a binary outcome test). Thus, it would require more than a four-fold increase in adverse events for chance to be reasonably ruled out. Therefore, similar studies are pooled to accrue sufficient sample size to examine adverse events.

Suppose the placebo event rate is 2%; what is the least significant difference (LSD) event rate in the active treatment arm for the following pools of data?

Sample size per treatment        125    250    500    1000
LSD active treatment event rate  8.5%   6.0%   4.5%   3.6%

Rare Events Occurring in Less Than 1% of Subjects

Rare but serious adverse events, which can include death or severe liver damage, require large databases to detect safety signals. For example, detecting a difference of 0.5% in adverse event rates could require over 3000 subjects per treatment arm.

Some Notes on Non-Inferiority

Placebo-controlled trials are easy examples to use when discussing treatment effect sizes and sample size calculations in public presentations, but in the world of clinical trials they are not always ethical. For example, trials in oncology (cancer), HIV, or diseases involving pain require every patient to be on some type of therapy. Therefore, new drugs are tested against active comparators, usually the “standard of care”. Here, if the new drug is “at least as effective” and is believed to have another advantage, such as better safety, lower cost, or easier dosing compliance, then tests for non-inferiority can be considered.

Non-Inferiority Criteria

Criteria are based on lower bounds of confidence intervals of the treatment difference.
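The LSD table above can be approximated with a two-proportion normal approximation: find the active-arm rate whose difference from 2% just reaches the two-sided 5% critical value. This is a sketch only; the presentation’s figures come from an exact binary-outcome test, so these values run somewhat lower than the slide table, but they show the same shrinking-LSD pattern as the pooled sample grows.

```python
import math

def lsd_event_rate(p_placebo, n_per_arm, z=1.959964):
    """Smallest active-arm event rate whose difference from the
    placebo rate reaches two-sided 5% significance (i.e. 50% power),
    using an unpooled normal approximation for two proportions.
    Solved by fixed-point iteration on the margin."""
    p1 = p_placebo
    for _ in range(100):
        se = math.sqrt(p_placebo * (1 - p_placebo) / n_per_arm
                       + p1 * (1 - p1) / n_per_arm)
        p1 = p_placebo + z * se
    return p1

for n in (125, 250, 500, 1000):
    print(f"n={n}: LSD active event rate = {lsd_event_rate(0.02, n):.1%}")
```

Whatever test is used, the qualitative message is the same: at 125 per arm only a several-fold excess is detectable, and even at 1000 per arm the detectable excess over 2% remains well above the 0.5% differences relevant to rare, serious events.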
Usually the assumption is that both drugs are equally effective (though there are exceptions), and the magnitude of the lower bound is a pre-defined fraction of the full treatment effect against an inactive treatment. This fractional approach drives up the sample size.

Example of a Sample Size Statement

For the test of non-inferiority in lung function, defined as the forced expiratory volume in Liters (FEV1), a sample size of 145 subjects per treatment is required, assuming a standard deviation of 0.30 Liters at about 80% power. Non-inferiority is achieved when the lower bound of the 95% confidence interval of the treatment difference is -0.10 Liters or more (no upper bound requirement), which is one-half the magnitude of an estimated treatment difference of an active treatment vs. an inactive treatment (0.20 Liters).

Pre-Defined Fractions: Effect on Sample Size for Non-Inferiority

Fraction of Effect Size for Lower Bound 95% CI   1.00   0.75   0.67   0.50   0.33   0.25
Sample Size/Treatment                            37     64     81     143    316    567

There are many other methods and criteria used to determine the lower bound for non-inferiority. However, the one-half treatment effect criterion has been used in clinical trials, and it illustrates the effect on sample size. In fact, this criterion results in a four-fold increase in sample size over the test of superiority (beating the standard drug at 80% power and 5% significance). The important consideration is that the magnitude of the lower bound criterion must be below a reasonable definition of the therapeutic advantage of the standard of care.

Conclusions and Overall Thoughts for Statisticians

The pharmaceutical industry will be reacting to increased scrutiny of submitted data for drug approval.
o More safety data will be required (large safety trials designed to evaluate rare adverse events)
o More efficient pivotal trials will be required to contain enrollment costs in Phase III programs.
Other methods of drug evaluation will evolve and work their way into the industry.
o The map of the human genome will allow more efficient enrollment criteria through better subject identification.
o The development of more specific biomarkers will help reduce the variability of outcome measures.
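As a closing illustration of the non-inferiority discussion, the fraction table (FEV1 example, SD = 0.30 L, full effect = 0.20 L) can be approximated with the usual normal-approximation formula for a margin-based test. This is a sketch under the assumption of equal true means; the slide’s exact values likely reflect unrounded fractions and a t-distribution, so these come out a few subjects away from the table, while showing the same roughly four-fold jump from the full effect to the one-half fraction.

```python
import math

def ni_sample_size(fraction, full_effect=0.20, sd=0.30):
    """Per-arm sample size for a non-inferiority test whose margin is
    `fraction` of the full treatment effect, at about 80% power,
    using n = 2 * (z_{alpha/2} + z_power)^2 * sd^2 / margin^2."""
    z_a = 1.959964   # 95% CI lower bound, i.e. one-sided 2.5%
    z_b = 0.841621   # 80% power
    margin = fraction * full_effect
    return math.ceil(2 * (z_a + z_b) ** 2 * sd ** 2 / margin ** 2)

for f in (1.00, 0.75, 0.67, 0.50, 0.33, 0.25):
    print(f"fraction {f:.2f}: n/arm = {ni_sample_size(f)}")
```

Because the margin enters the denominator squared, halving the fraction quadruples the sample size, which is exactly the four-fold increase over superiority noted above.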