preference : An R Package for Two-Stage Clinical Trial Design Accounting for Patient Preference

The consideration of a patient’s treatment preference may be essential in determining how a patient will respond to a particular treatment. While traditional clinical trials are unable to capture these eﬀects, the two-stage randomized preference design provides an important tool for researchers seeking to understand the role of patient preferences. In addition to the treatment eﬀect, these designs seek to estimate the role of preferences through testing of selection and preference eﬀects. The R package preference facilitates the use of two-stage clinical trials by providing the necessary tools to design and analyze these studies. To aid in the design, functions are provided to estimate the required sample size and to estimate the study power when a sample size is ﬁxed. In addition, analysis functions are provided to determine the signiﬁcance of each eﬀect using either raw data or summary statistics. The package is able to incorporate either an unstratiﬁed or stratiﬁed preference design. The functionality of the package is demonstrated using data from a study evaluating two management methods in women found to have an atypical Pap smear.


Introduction
In a traditional clinical trial setting, investigators are interested in determining the effect of a treatment on a specified outcome. The conventional clinical trial design involves the randomization of all patients to one of multiple treatment arms, which allows for the estimation of this treatment effect. In addition to the treatment effect, the outcome may also be influenced by a patient's preference for one of the available treatment options, especially in situations where the patient is not blinded to treatment assignment (e.g., surgical versus medical intervention). Rucker (1989) proposed a two-stage trial design in order to address this issue of patient preference. In the first stage, patients are randomized to either a choice or a random arm. In the second stage, patients in the choice arm are allowed to select their preferred treatment, while patients in the random arm will undergo a second randomization to one of the available treatment options. In addition to the treatment effect, the two-stage design allows the estimation of two additional effects, the selection effect and the preference effect. The selection effect refers to the additional effect of a patient's preferred treatment on the outcome, while the preference effect refers to the influence on the outcome of whether or not a patient actually receives his/her preferred treatment.
In 2016, the two-stage randomized clinical trial design was extended to allow for the inclusion of stratification factors (Cameron and Esserman 2018). The stratified version should be employed if a patient's treatment preference is expected to be closely related to a certain measurable covariate. In this design, patients are stratified according to a specified stratification variable prior to being assigned a treatment, either through choice or randomization. For example, if older patients may be more likely to choose a medical intervention over a surgical intervention, patients would be stratified by age before randomization. This paper introduces the R (R Core Team 2020) package preference for the design and analysis of two-stage (or doubly) randomized clinical trials for both the original Rucker (1989) version (unstratified) and the stratified design by Cameron and Esserman (2018). This package facilitates the implementation of the two-stage randomized design by providing the necessary sample size estimation and analysis tools for clinicians and statisticians seeking to disentangle the roles of patient preference in clinical trials. The package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=preference (Cameron 2020).
The preference package fits into the general category of the "Design and Analysis of Clinical Trials" (Zhang and Zhang 2020) and focuses on Phase III trials. It is the first package to implement both unstratified and stratified clinical trial designs incorporating patient preference. The interface is simple and will readily fit into clinical trial framework packages.

Trial design methodology
The general design, originally proposed by Rucker (1989), is illustrated in Figure 1. In this design, patients are initially randomized to either the choice or the random group. Those in the choice arm are then allowed to choose from the available treatment options, possibly after a shared decision process with their physician or family; patients in the random group are assigned a treatment through a second randomization. We assume that all patients have a preferred treatment (i.e., no undecided patients). For simplicity, we present the case where there are only two possible treatment choices.
Let N represent the total number of patients enrolled in the study and θ represent the proportion of these patients initially randomized to the choice arm. Of note, work has been done to estimate θ and determine the optimal allocation of participants into the choice and random arms (Walter, Turner, Macaskill, McCaffery, and Irwig 2012). We use φ to represent the proportion of patients who prefer treatment A; because we assume all patients have a preference, we let 1 − φ represent the proportion of patients preferring treatment B.
The notations for the mean and standard deviation (SD) of each group are shown in Figure 1.  Figure 1: The flow graph for the two-stage randomized preference trial.
We use µ i and σ i to represent the mean response and SD of patients randomized to treatment i. µ ii and σ ii are used to represent the mean responses and standard deviations of patients choosing treatment i. For the aspects of design discussed below, we assume σ 1 = σ 2 = σ 11 = σ 22 = σ.
In cases where a treatment may not be equally preferred across the study population, the above design can be extended to include important stratification variables (Cameron and Esserman 2018). The notation is similar to that of the unstratified case, with the addition of ξ l used to denote the proportion of patients in stratum l ( s l=1 ξ l = 1).

Package overview
The design and analysis functions in preference assume a continuous, approximately normally distributed outcome measure. Analyses may be conducted on raw data or summary data. The raw data must contain the outcome variable of interest, treatment indicator, and arm (i.e., choice or random). If the design is stratified, there must also be a stratum indicator. The summary data must contain the mean and SD of the outcome, and sample size for each arm by treatment combination. These should be broken down further by stratum if the design is stratified. This package includes methods to compute treatment effects, standard error estimates, test statistics, p values and (1 − α) confidence intervals for the preference, selection, and treatment effects. In the functions preference() and fit_preference(), the study results are computed from provided raw data. An alternative analysis function is provided with fit_preference_summary(), which computes statistics using summary data.

Preference trial data structure
To aid in the design of the two-stage randomized trial, functions are provided to determine the number of participants required to detect a treatment effect, preference effect, and/or selection effect at a specified Type I and Type II error level (Turner, Walter, Macaskill, McCaffery, and Irwig 2014;Cameron and Esserman 2018 a 'data.frame' whose class attributes are 'preference.trial' and 'data.frame' with each row corresponding to an individual trial. Each trial is parameterized by the parameters shown in Table 1. Functions are provided to construct 'preference.trial' objects based on three common scenarios. Each of these scenarios assume the effect sizes (pref_effect, selection_effect, and treatment_effect), the proportions (pref_prop, choice_prop, stratum_prop), and distributional parameters (k, alpha, sigma2) are known. When the sample size for each arm is known the preference.trial() function may be used to specify a set of trials. For the more common cases, when the maximum sample size of any arm is known, the pt_from_ss() function is used, and when the power must be above a threshold for each arm the pt_from_power() function is used. Each of these functions may take vector arguments to create multiple trials at once. When the lengths do not agree, value recycling occurs so that the number of rows is equal to the argument vector with the longest length. As a result, sets of trials can be quickly created based on many different configurations allowing the user to investigate the trial characteristics and select the best trial from a large set of possible trials.

Case study: IMAP
In the Improving Management of Mildly Abnormal Pap Smears (IMAP) study, a two-stage randomized trial design was used to evaluate psychosocial outcomes in women found to have atypical squamous cells of undetermined significance (ASCUS) with a Pap smear (McCaffery, Irwig, Turner, Chan, Macaskill, Lewicka, Clarke, Weisberg, and Barratt 2010;McCaffery, Turner, Macaskill, Walter, Chan, and Irwig 2011). In this study, the investigators sought to determine whether women given an informed choice between two systems of management for Pap smear abnormalities showed improvements in psychosocial outcomes, satisfaction, and health-related quality of life (QOL) over women given no choice. The two arms were: (1) standard management, i.e., repeated Pap testing every 6 months to monitor for more serious abnormalities; and (2) use a human papilomavirus (HPV) triage approach, which tests for the presence of HPV types that are associated with an increased risk of cervical cancer. The original trial enrolled 330 women, 314 were randomized (16 did not meet eligibility criteria), and 235 met criteria for inclusion in the analysis. A decision aid was used to support the choice of women initially randomized to the choice arm of the study. The primary outcome for this example is health related QOL measured with the mental component score (MCS) of the 36 question short form health survey (SF36; Ware Jr and Gandek 1994). While the original trial did not stratify on any factors, for demonstration purposes, we stratified according to the six item abbreviated state trait anxiety inventory (STAI; Marteau and Bekker 1992). The STAI score is a measure of a person's anxiety, with higher STAI scores indicating higher levels of anxiety. A cutoff of 10 was the threshold for stratification (low versus high). We were kindly provided with summary data for this trial by the original investigators. We used this summary data to simulate a data set representative of the 208 individuals with stratification information from the original trial. The simulated data raw set (imap), as well as summarized versions of this data set (imap_summary and imap_stratified_summary) are available with the package (see Section 3.3).

Sample size determination
Suppose we wish to estimate the required sample size to have sufficient power for detecting treatment, preference, and selection effects for the SF36 MCS outcome in the IMAP study. First, we need to define the size of the preference, selection, and treatment effects we would like to detect. If the effect sizes are known, these effect sizes are used directly in the pt_from_power function. If not, the function effects_from_means() can be used to compute these effects from the means of each treatment group in both the choice and random arm.
In the IMAP study, we may expect the HCV triage method to result in only a small increase in SF36. That is, we expect women randomized to the HCV triage method to have a mean SF36 score of 46 (µ 1 ), while women randomized to repeated Pap screening to have a mean SF36 score of 45 (µ 2 ). In the choice arm, we expect the mean SF36 score to increase by 4 points in both management groups. Therefore, we assume that women choosing the HCV triage method to have a mean score of 50 (µ 11 ) and women choosing the repeated Pap smear management option to have a mean SF36 of 49 (µ 22 ). In addition, we must specify the proportion of women preferring the HPV triage management method. Because management with an HPV triage approach is generally a speedier option, we expect 70% (φ) of women to prefer this approach, as compared to the 30% of women who would prefer management with repeated Pap smears. With these assumptions, we can calculate the preference, selection and treatment effects.
R> effects_from_means(mu1 = 46, mu2 = 45, mu11 = 50, mu22 = 49, phi = 0.7) Therefore, with the mean SF36 scores assumed above, we need to design a study to detect a treatment effect of 1, selection effect of 3.81, and a preference effect of 9.52. To determine the necessary sample size, we must also specify the variance of the responses. In this example, we assume a variance of 5 (σ 2 ) for the SF36 outcome. Then, the sample sizes to detect the above effects with 80% power and a Type I error of 0.05 is given by the pt_from_power() function.
R> imap_trial <-pt_from_power(power = 0.8, pref_effect = 9.52, + selection_effect = 3.81, treatment_effect = 1, sigma2 = 5, + pref_prop = 0.7) R> class(imap_trial) [1] "preference.trial" "data.frame" The function returns an object of class 'preference.trial' and 'data.frame'. The trial includes three samples sizes, pref_ss -the preference effect sample size, selection_ss -the selection effect sample size, and treatment_ss -the treatment effect sample size. In order to achieve at least 80% power for each of the effects, the largest sample size should be selected. In this example, 314 patients are needed to detect the specified effects with at least 80% power each with a Type I error rate of 5%. If all three of these effects are of primary interest, the Type I error rate should be adjusted to account for multiple testing (e.g., Bonferroni correction).
Since it is common during the design phase to explore the effect on the sample size by varying the power, a vector of power constraints can be specified returning a set of trial designs. Additionally, a plot() function has been implemented to visualize the power vs. sample size relationship, shown in Figure 2.

Sample size determination under stratification
We may also be interested in stratifying for STAI score (low vs. high) in our study. Because women with low STAI scores have decreased anxiety, we expect women in this stratum to have improved SF36 compared to women in the high STAI stratum. Therefore, we expect women randomized to HCV triage with low STAI scores to have a mean SF36 of 52, while women randomized to HCV triage with high STAI scores to have a mean SF36 of 42. In addition, we assume that women randomized to repeated Pap screening will have a mean SF36 of 50 in the low STAI stratum and a mean SF36 of 42 in the high STAI group.
In the choice arm, we expect both strata to have a small increase in SF36 across both management systems. For the low STAI group, we expect a 2 point increase in mean SF36 for both management groups. In the high STAI group, we also expect a 2 point increase in mean SF36 for women choosing repeated Pap screening. Because the HCV triage method is a speedier approach, we expect a slightly greater increase in SF36 for women in the high STAI group choosing this method; we assume a mean SF36 score of 46 for this group.
Finally, we assume that both management systems will be equally preferable to women in the low STAI stratum (φ 1 = 0.5). Because women with high STAI score have increased anxiety, we may expect a greater proportion of these women to prefer the HPV triage management method, which is the speedier option. Therefore, we assume that 70% of the women in the high STAI stratum will prefer HPV triage, and only 30% will prefer repeated Pap screening (φ 2 = 0.7). Furthermore, we assume that 40% (ξ 1 ) of women will have low STAI scores and 60% (ξ 2 ) of women will be in the high STAI score stratum. Based on these assumptions, we can calculate the preference, selection and treatment effects we wish to test in our study.  With the assumptions listed above, we will need to design a study that can detect an overall treatment effect of 0.8, selection effect of 3.14, and preference effect of 6.46. We also need to specify the variance of the responses. We assume a variance of 3 for the SF36 outcome in the low STAI stratum and a variance of 4 in the high STAI stratum. With 80% power and a Type I error rate of 0.05, we can use the pt_from_power function to find the sample size needed to detect the above effect.

Power determination
Alternatively, we may have a pre-specified number of women to be enrolled (e.g., 208), or a range of sample sizes to explore (e.g., 200 to 500 by 50) and instead want to know the power of our study to detect each effect. If we assume the same preference, treatment, and selection effects as above (Section 3.1) and make the same assumptions regarding the preference rate and response variance, we can use pt_from_ss() to estimate the study power, assuming a Type I error of 0.0167. The calculation below (row 6) shows with a sample size of 450, we would have over 80% power for each of the effects, accounting for multiple testing with the Bonferroni correction. As with the sample size determination, a visualization of the trials can be created with the plot() function, whose output is shown in Figure 3.

Power determination under stratification
We may also wish to know our study power for a two-stage randomized trial stratified by STAI score with 208 patients. We will continue to assume that we wish to detect a treatment effect of 0.8, selection effect of 3.14, and preference effect of 6.46 and retain the assumptions above regarding the variance and preference rate within each stratum.

Analyzing two-stage clinical trial data
Raw data for the primary outcome, SF36 MCS (outcome), is provided in the imap data frame within the preference package.

R> head(imap)
outcome treatment arm strata 1 53. The treatment (HPV triage, "HPV", or repeated Pap screening, "Pap") is indicated in the treatment column; the study arm is either the character value "choice" (denoting the choice arm) or "random" (denoting the random arm) in the arm column; and the strata column contains the stratum indicator: strata = 1 indicates the low STAI group and strata = 2 indicates the high STAI group.
Analyses can be conducted using one of the three functions: preference, fit_preference, or fit_preference_summary. It is important to note, all three functions produce identical output; they only vary in how the data is inputted.

Analyzing preference data with the formula interface
preference allows the user to enter the model description as a formula. The functionality is similar to that of lm. The user is required to provided the formula and the data frame and has the option of specifying the Type I error rate (alpha), which is 0.05 by default. form requires outcome, treatment, and arm (strata is only required if the design is stratified).
Unstratified call with default alpha: R> pt_model <-preference(form = outcome~treatment:arm, data = imap) Stratified call with alpha set at 0. The output is provided as a list: $alpha is the Type I error rate used for the confidence intervals; $unstratified_statistics are the stratum specific statistics with each column corresponding to a stratum; $overall_statistics are the statistics combined across the strata. (note that for the unstratified case $unstratified_statistics and $overall_statistics would be identical.) Both $unstratified_statistics and $overall_statistics contain the effect estimate, standard error (SE), test statistic, p value, and (1 − α)100% upper and lower bounds for each of the three effects (preference, pref; selection, sel; treatment, treat).
For this example, none of the overall effects are statistically significant at a Type I error rate of 5%, although the overall selection effect is close to reaching statistical significance with a p value of 0.07.

Analyzing data with the low-level interface
It is also possible to specify the data elements directly with fit_preference. The user is required to provide the outcome, arm (the character that indicates if the arm is random, "random", or choice, "choice"), and treatment. If the design is stratified, then strata, an indicator of stratum is required. The default Type I error is 0.05, and can be changed by varying alpha, which should be a value between 0 and 1. Analyzing unstratified data is illustrated below using the imap data set.

Analyzing trial summary data
It may be that the raw data is unavailable and the user only has access to the summary data. imap_summary and imap_stratified_summary contain the summary statistics for the unstratified and stratified imap data set.

Optimal distribution between choice and random arms
In the examples presented above, we have assumed the choice proportion (choice_prop), the proportion of individuals that are distributed to the choice arm, to be 0.5 (equal distribution). In general, however, the choice proportion is not a fixed parameter and can be specified according to the main goals of the study. If a desired choice proportion is known in advance, the user can directly set this argument in the function calls above. If the desired choice proportion is not known, the function optimal_proportion() can be used to estimate the optimal value given the goals of a particular study (Walter et al. 2012).
To use optimal_proportion(), a set of weights must be defined that indicate the relative importance of estimating each of the three effects (w_sel, w_pref, w_treat) with the constraint w_sel + w_pref + w_treat = 1. For example, if we are equally interested in estimating the selection, preference, and treatment effects, we would assign a weight of 1 3 to each effect (i.e., w_sel = w_pref = w_treat = 0.33). If we want to determine the optimal choice proportion for the unstratified IMAP example above to detect a preference effect of 9.52 and a selection effect of 3.81 (assuming 70% of women prefer the HCV triage method and a variance of 5), we can use the following function call.
Note, this function is only available for the unstratified design at this time, although work is currently underway to expand this to the stratified case.

Summary
The two-stage randomized clinical trial design is an important tool for researchers seeking to understand the impact of treatment preference on outcome. In cases where equivalence has been demonstrated between two treatments, patient preferences may be an important consideration for health care professionals seeking to provide the best possible treatment for a particular patient. Estimation of the effect of treatment preference will also be of particular interest when blinding is not feasible and when treatment success is closely tied to a patient's adherence.
This paper introduces the R package preference for the design and analysis of these trials. The IMAP study discussed above is used to demonstrate the main functionality of the package.
To aid in the design of the two-stage randomized trial, sample size formulas are provided to estimate the required sample size to detect a given preference, selection, and treatment effect. In cases where the sample size is fixed, functions to compute the study power to detect each effect are included. Finally, analysis functions are provided that can calculate the test statistic, p value, effects estimates and 95% confidence intervals for each effect using either raw or summary data. These functions can be used for both the unstratified and stratified study designs. In addition, the package provides tools for optimizing the distribution of patients in the initial randomization between choice and random arm in the unstratified case.
This package is an important contribution as it is the first available software to design twostage clinical trials incorporating patient preference. Previous simulation studies are neither available as part of supplementary material with the paper nor publicly available from the investigators over the web. Extensions to the preference package are being actively developed concurrently with extensions to the methodology. The above methods assume a continuous, approximately normally distributed outcome measure; extensions to these methods are being developed to allow for the design and analysis of studies with a binary or count endpoints.