OneArmPhaseTwoStudy: An R Package for Planning, Conducting, and Analysing Single-Arm Phase II Studies

In clinical phase II studies, the efficacy of a promising therapy is tested in patients for the first time. Based on the results, it is decided whether the development programme should be stopped or whether the benefit-risk profile is promising enough to justify the initiation of large phase III studies. In oncology, phase II trials are commonly conducted as single-arm trials with planned interim analyses to allow for early stopping for futility. The specification of an adequate study design that guarantees control of the type I and II error rates is a key task in the planning stage of such a trial. A variety of statistical methods exist which can be used to optimise the planning and analysis of such studies. However, there are currently neither commercial nor non-commercial software tools available that support the practical application of these methods comprehensively. The R package OneArmPhaseTwoStudy was implemented to fill this gap. The package allows determining an adequate study design for the particular situation at hand as well as monitoring the progress of the study and evaluating the results with valid and efficient analysis methods. This article describes the features of the R package and its application.


Introduction
In phase II clinical trials, the activity of a new therapy is evaluated to decide whether it warrants further investigation in large-scale phase III trials. In oncology, these trials are frequently performed in a single-arm design (Gan, Grothey, Pond, Moore, Siu, and Sargent 2010; Baghdadi and Laffler 2013). The primary endpoint is commonly a binary outcome measuring therapy response based on tumor shrinkage (Eisenhauer et al. 2009). For ethical and economic reasons, these trials are usually performed with interim analyses to allow for an early termination in case of a low observed response rate ("stopping for futility"). Due to logistical restrictions and the limited gain in statistical efficiency when increasing the number of interim analyses (Chen 1997), two-stage designs are most commonly applied. Early termination due to overwhelming activity ("stopping for efficacy") plays no major role for these phase II studies as there is no ethical imperative for early stopping in this situation. Furthermore, the collection of sufficient information on efficacy and safety in phase II is important before initiating a large phase III program.
We consider two-stage designs with the option of early stopping for futility where the null hypothesis $H_0: \pi \le \pi_0$ is tested at one-sided level $\alpha$ and where the power $1 - \beta$ is evaluated at a response rate $\pi_1 > \pi_0$. By search algorithms based on the exact binomial distribution, designs fulfilling the type I and type II error restrictions can be identified. These designs are characterized by the sample sizes for the first and second stage, $n_1$ and $n - n_1$, respectively, and by the boundary values $r_1$ and $r$ ($r_1 < r$). The study is continued after the first stage if the number of responses at the interim analysis is greater than $r_1$, and the null hypothesis can be rejected after stage 2 if the total number of responses exceeds $r$. Usually, several solutions $(n_1, r_1, n, r)$ exist and additional criteria are required for selecting a specific design. Simon (1989) proposed the "minimax design" minimizing the maximum sample size $n$ and the "optimal design" minimizing the expected sample size under the null hypothesis among those two-stage designs satisfying the constraints. Other criteria are available such as "admissible designs" (Jung, Lee, Kim, and George 2004) that are compromises between the minimax and the optimal design. By the choice of an adequate design, control of the type I and type II error rate is assured. A further important demand from a biostatistical viewpoint is the appropriate estimation of the treatment effect in the analyses. Treatment effect estimates obtained from phase II studies are used to compare the activity of different therapies under investigation and are the basis for planning the subsequent phase III studies. However, the common maximum likelihood estimator (MLE) is typically biased in multi-stage designs that allow for early stopping (Kunz and Kieser 2012). Similarly, calculation of the "naive" confidence interval without taking the sequential nature of the trial into account usually does not guarantee the desired coverage probability. Therefore, proper methods are required that are tailored to the design applied.
The above described two-stage designs foresee early stopping only after the first stage. However, it may become evident during the course of the trial that, based on the currently observed results, it is very unlikely or even impossible to reject the null hypothesis after the second stage. This disadvantage of standard two-stage designs can be resolved by implementing a statistical monitoring of the results and a curtailment procedure. The idea is to stop the trial if the probability to reject the null hypothesis, given the observed number of responses, falls below a pre-specified threshold ("stochastic curtailment") or is zero ("non-stochastic curtailment"). Curtailment can be restricted to the second stage only (Ayanlowo and Redden 2007) or can be performed in both stages (Kunz and Kieser 2012). Whether or not to implement a curtailment procedure depends on the balance between the reduction of sample size and the loss in power to be expected in the specific situation at hand. Again, appropriate methods and related software are needed.
The "classical" two-stage designs described above require conduct of the study exactly as pre-defined by specification of $(n_1, r_1, n, r)$. Changing the design mid-course carries the risk of compromising the type I error rate (Englert and Kieser 2012a). However, if, for example, an unexpectedly high response rate is observed at the interim analysis, it may be desirable to reduce the sample size for the second stage. Recently, flexible single-arm two-stage designs have been developed that allow (data-driven) modifications of the initially specified design while still controlling the type I error rate (Englert and Kieser 2012a,b). Furthermore, it turns out that using these designs may even lead to an increased statistical efficiency (Englert and Kieser 2012b). Application of these methods is therefore highly attractive but requires related software. The same holds true for the so-called subset designs that allow a simultaneous assessment of two nested endpoints.
Currently available software for single-arm phase II studies is restricted to "classical" two-stage designs and focuses on specific aspects. For example, there are a number of non-commercial (e.g., Kirk and Fay 2014; Seshan 2015; Southwest Oncology Group 2015) and commercial software packages (e.g., Cytel 2015, NCSS 2015, SAS Institute Inc. 2015, and StataCorp 2015) providing the feature of determining Simon's optimal and minimax design. Admissible designs are implemented in two other programs (Kunz and Kieser 2011b). In addition, there are a number of web-based software tools. These are not explicitly referenced here, because a quality assessment of the source code is not directly possible. Until now, there exists no software dealing with flexible phase II designs. Furthermore, no software package is currently available that comprehensively supports all aspects of performing single-arm phase II studies, namely planning, conduct, and analysis. The R (R Core Team 2017) package OneArmPhaseTwoStudy (Wirths 2017) that is described below fills this gap. The package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=OneArmPhaseTwoStudy. As a helpful supplement, we recommend the validated, web-based software tool implemented by Englert (https://imbi.shinyapps.io/phaseII-app/). This tool is based on the work of Englert and Kieser (2015), which allows for a proper handling of over- and underrunning.
The paper is organized as follows. In Section 2, we outline the statistical methods for single-arm two-stage designs. Implementation of these methods in the OneArmPhaseTwoStudy package as well as its features are described in Section 3. In Section 4, its application is demonstrated by an example, and we provide a brief discussion in Section 5.

Identification of designs
For a two-stage design defined by $(n_1, r_1, n, r)$, the probability of rejecting the null hypothesis in case of a true response rate $\pi^*$ is given by

$$1 - \left[ B(n_1, r_1, \pi^*) + \sum_{x = r_1 + 1}^{\min(n_1, r)} b(n_1, x, \pi^*)\, B(n - n_1, r - x, \pi^*) \right], \tag{1}$$

where $B(\cdot)$ denotes the cumulative binomial distribution function and $b(\cdot)$ the binomial probability mass function (see, e.g., Simon 1989). Evaluating (1) at $\pi^* = \pi_0$ or $\pi^* = \pi_1$ provides the type I error rate and power, respectively. The probability of early termination ($PET$) and the expected sample size ($EN$) under the null hypothesis are given by

$$PET(\pi_0) = B(n_1, r_1, \pi_0), \tag{2}$$

$$EN(\pi_0) = n_1 + (1 - PET(\pi_0))(n - n_1). \tag{3}$$
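The quantities above can be evaluated directly with base R's binomial functions. The following is a minimal sketch, not the package's implementation; the function names `simon_reject_prob`, `simon_pet`, and `simon_en` are hypothetical, and the design (n1 = 10, r1 = 1, n = 29, r = 5) with response rates 0.1 and 0.3 is used purely as an illustration.

```r
# Probability of rejecting H0 for a two-stage design (n1, r1, n, r)
# at a true response rate p, following expression (1).
simon_reject_prob <- function(n1, r1, n, r, p) {
  stopifnot(r1 < n1, r1 < r, n1 < n)
  n2 <- n - n1
  # First-stage results that continue to stage 2 but may still end in acceptance
  x <- seq(r1 + 1, min(n1, r))
  accept <- pbinom(r1, n1, p) + sum(dbinom(x, n1, p) * pbinom(r - x, n2, p))
  1 - accept
}

# Probability of early termination and expected sample size
simon_pet <- function(n1, r1, p) pbinom(r1, n1, p)
simon_en <- function(n1, r1, n, p) n1 + (1 - simon_pet(n1, r1, p)) * (n - n1)

# Illustrative design with pi0 = 0.1 and pi1 = 0.3
alpha_act <- simon_reject_prob(10, 1, 29, 5, 0.1)  # actual type I error rate
power_act <- simon_reject_prob(10, 1, 29, 5, 0.3)  # actual power
pet0 <- simon_pet(10, 1, 0.1)
en0 <- simon_en(10, 1, 29, 0.1)
print(round(c(alpha = alpha_act, power = power_act, PET = pet0, EN = en0), 4))
```

Because the sum in (1) only runs up to min(n1, r), first-stage results with more than r responses contribute nothing to the acceptance probability, which is why they are left out of `x`.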
Among all designs fulfilling the type I and type II error constraints, the optimal design is defined as the one with the smallest $EN(\pi_0)$ and the minimax design is the one with the smallest total sample size $n$. If more than one design exists with smallest $n$, the one with smallest $EN(\pi_0)$ is selected as minimax design (Simon 1989). Admissible designs minimize the Bayes risk $q\,n + (1 - q)\,EN(\pi_0)$ for a given weight $q \in [0, 1]$. In general, a design is admissible not only for a single value but for a range of values of $q$. Admissible designs have a higher total sample size than the minimax design but a smaller total sample size than the optimal design, and vice versa with respect to $EN(\pi_0)$.
The algorithm to determine two-stage designs fulfilling the requirements with respect to the type I and II error rates, and to choose among them the optimal, minimax, and admissible designs, follows the description in Kunz and Kieser (2011b). A crude algorithm searches for each value of $n$ over $n_1 \in [1, n - 1]$, $r_1 \in [0, n_1 - 1]$, and $r \in [r_1 + 1, n - 1]$ for those designs for which expression (1) is less than or equal to $\alpha$ at $\pi^* = \pi_0$ and at least $1 - \beta$ at $\pi^* = \pi_1$. This approach can be improved as follows. The starting values of the search procedure are determined based on the following considerations. As $(1 - \pi_1)^{n_1} \le B(n_1, r_1, \pi_1)$, the constraint with respect to the type II error rate leads to $n_1 > \log(\beta)/\log(1 - \pi_1)$. Hence, for $\beta \ne 1 - \pi_1$ the starting value for $n_1$ is $\mathrm{ceil}(\log(\beta)/\log(1 - \pi_1))$, while it is 2 for $\beta = 1 - \pi_1$. Here, $\mathrm{ceil}(x)$ denotes the function which returns the smallest integer value that is not less than $x$. As $n$ has to be larger than $n_1$, the starting value for $n$ is $n_1 + 1$. For every pair $(r_1, n_1)$ with $r_1 \in [0, n_1 - 1]$ it is checked whether $B(n_1, r_1, \pi_1) \ge 1 - \beta$, as it can be shown that this inequality has to be true in order to find a parameter set $(n_1, r_1, n, r)$ which fulfills the type II error condition. If this is not the case, $n_1$ is increased by 1 and the search continues; if the inequality holds true, the algorithm searches over $r$ in the range $[r_1 + 1, n - n_1 + r_1]$. For every value of $r$, $B(n, r, \pi_1) < \beta$ has to hold true for any solution. Therefore, this condition is checked and the search over $r$ is stopped and continued with the next $r_1$ whenever the inequality does not hold true. Otherwise, the type I error rate and power are calculated via (1) and the algorithm continues.
To improve the search algorithm, the OneArmPhaseTwoStudy package provides a method to approximate a maximal sample size maxN such that the search algorithm stops if n = maxN . The parameter maxN is approximated in a way that the optimal design is included among the identified designs. To determine maxN , n 1 is set to the minimal possible value (as described above) and r 1 is set to 0. After that, an algorithm searches over all possible values of maxN and r ∈ [1, maxN ], where maxN is increased until a combination of r 1 , n 1 , r, and maxN is found such that the corresponding error constraints are fulfilled. The idea behind this approach is that maxN has to be very large when n 1 is set to the minimal possible value and r 1 is set to 0. There is no formal proof that the search algorithm will always find the optimal design when using maxN as upper boundary for the total sample size. However, in the multitude of examples we considered there was no case indicating this approach may be wrong.
If not all possible designs should be identified but only optimal, minimax, and admissible designs, the algorithm can be further improved by taking into account the inequalities $EN(\pi_0)_{\mathrm{optimal}} \le EN(\pi_0)_{\mathrm{admissible}} \le EN(\pi_0)_{\mathrm{minimax}}$ and $n_1 < EN(\pi_0)$ for $\pi_0 > 0$. Therefore, $n_1$ is smaller than $EN(\pi_0)_{\mathrm{admissible}}$ for admissible designs and smaller than $EN(\pi_0)_{\mathrm{optimal}}$ for the optimal design. Consequently, the above described algorithm starts searching for $n_1$ up to a maximal value of $n - 1$ until the first solution is identified. The maximal value for $n_1$ is then replaced by $EN(\pi_0)$ of this design, and whenever another solution is identified the upper bound of the range of $n_1$ is replaced with the smallest $EN(\pi_0)$ found so far.
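To make the search concrete, the following sketch enumerates feasible designs by brute force in base R and then selects the optimal, minimax, and one admissible design. It is a simplified illustration rather than the package's C++ algorithm: the names `reject_prob` and `find_designs` are hypothetical, the upper bound `max_n` is fixed by hand instead of being approximated, and only two of the shortcuts are used (the starting value for n1 described above, and the fact that the rejection probability decreases in r).

```r
# Rejection probability of a two-stage design, following expression (1)
reject_prob <- function(n1, r1, n, r, p) {
  x <- seq(r1 + 1, min(n1, r))
  1 - pbinom(r1, n1, p) - sum(dbinom(x, n1, p) * pbinom(r - x, n - n1, p))
}

find_designs <- function(alpha, beta, p0, p1, max_n) {
  # Starting value for n1 as described in the text
  n1_min <- if (isTRUE(all.equal(beta, 1 - p1))) 2 else
    ceiling(log(beta) / log(1 - p1))
  stopifnot(max_n > n1_min)
  out <- list()
  for (n in (n1_min + 1):max_n) {
    for (n1 in n1_min:(n - 1)) {
      for (r1 in 0:(n1 - 1)) {
        for (r in (r1 + 1):(n - 1)) {
          # Power decreases in r: once it is too low, larger r cannot help
          if (reject_prob(n1, r1, n, r, p1) < 1 - beta) break
          # Type I error still too large: try the next r
          if (reject_prob(n1, r1, n, r, p0) > alpha) next
          pet <- pbinom(r1, n1, p0)
          out[[length(out) + 1]] <- data.frame(
            n1 = n1, r1 = r1, n = n, r = r,
            EN0 = n1 + (1 - pet) * (n - n1)
          )
        }
      }
    }
  }
  do.call(rbind, out)
}

designs <- find_designs(alpha = 0.05, beta = 0.2, p0 = 0.1, p1 = 0.3, max_n = 30)

# Optimal design: smallest expected sample size under H0
optimal <- designs[which.min(designs$EN0), ]
# Minimax design: smallest n, ties broken by the smallest EN0
minimax <- designs[order(designs$n, designs$EN0)[1], ]
# One admissible design: minimizes the Bayes risk for a chosen weight q
q <- 0.5
admissible <- designs[which.min(q * designs$n + (1 - q) * designs$EN0), ]

print(rbind(optimal = optimal, minimax = minimax, admissible = admissible))
```

Varying `q` between 0 and 1 in the last step traces out the designs that lie between the optimal and the minimax design.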

Point estimation, confidence intervals, and p values
In an early work by Girshick, Mosteller, and Savage (1946), unbiased estimators for several samples from binomial distributions were developed. Based on this approach, an unbiased estimator for the true response rate was derived and shown to be the uniformly minimum variance unbiased estimator (UMVUE). Let $t_1$ denote the number of responses in the first stage and $t$ the cumulative number of responses when the trial is continued to the second stage. With $n_2 = n - n_1$, this estimator is given by

$$\hat{\pi}_{\mathrm{UMVUE}} = \begin{cases} \dfrac{t_1}{n_1} & \text{if } t_1 \le r_1, \\[2ex] \dfrac{\sum_{x = \max(r_1 + 1,\, t - n_2)}^{\min(t,\, n_1)} \binom{n_1 - 1}{x - 1} \binom{n_2}{t - x}}{\sum_{x = \max(r_1 + 1,\, t - n_2)}^{\min(t,\, n_1)} \binom{n_1}{x} \binom{n_2}{t - x}} & \text{if } t_1 > r_1. \end{cases} \tag{4}$$

For the derivation of appropriate p values and confidence intervals that match the test decision and that take into account the sequential nature of the design, an ordering of the sample space has to be defined. Armitage (1957) suggested a stage-wise ordering of the sample space. In the case of Simon's design, stage-wise ordering means that outcomes observed in the second stage of the trial are more extreme than outcomes observed in the first stage of the trial. Another option would be to sort the sample space based on the UMVUE. For Simon's design this approach, however, leads to the same ordering. The corresponding p value is given by Jung, Owzar, and George (2006) and Koyama and Chen (2008). For the stage-wise ordering, Koyama and Chen (2008) derived the p value for testing $H_0: \pi \le \pi_0$ by

$$p = \begin{cases} 1 - B(n_1, t_1 - 1, \pi_0) & \text{if } t_1 \le r_1, \\[1ex] \sum_{x = r_1 + 1}^{n_1} b(n_1, x, \pi_0) \left[ 1 - B(n - n_1, t - x - 1, \pi_0) \right] & \text{if } t_1 > r_1, \end{cases} \tag{5}$$

where $B(m, j, \pi) = 0$ for $j < 0$. This p value reflects the decision rule of the underlying design in that the null hypothesis can be rejected at level $\alpha$ if and only if $p \le \alpha$. Furthermore, a two-sided $(1 - 2\alpha)$-confidence interval $[\hat{\pi}_L, \hat{\pi}_U]$ can be obtained by inverting the test: $[\hat{\pi}_L, \hat{\pi}_U]$ includes all values $\pi_0^*$ for which the p value for testing $H_0^*: \pi \le \pi_0^*$ within the given two-stage design lies within the interval $[\alpha, 1 - \alpha]$. Note that this $(1 - 2\alpha)$-confidence interval (CI) matches the test decision as the null hypothesis $H_0$ is rejected if and only if $\hat{\pi}_L > \pi_0$.
Jovic and Whitehead (2010) give a nice overview of the computation and evaluation of point estimates and confidence intervals for single-arm two-stage designs.
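As an illustration of the estimator and the p value discussed in this section, the sketch below implements both in base R. It assumes the usual conditional-expectation form of the UMVUE (the expected first-stage response proportion given the stopping stage and the total number of responses) and the stage-wise ordering of the sample space. The function names `umvue_two_stage` and `p_value_stagewise` are hypothetical and are not the package's `get_UMVUE_GMS` and `get_p_KC`.

```r
# UMVUE of the response rate; k is the number of responses observed
# when the trial ended (k <= r1 means the trial stopped after stage 1).
umvue_two_stage <- function(k, r1, n1, n) {
  if (k <= r1) return(k / n1)
  n2 <- n - n1
  # First-stage results that are compatible with continuation and total k
  x <- seq(max(r1 + 1, k - n2), min(k, n1))
  sum(choose(n1 - 1, x - 1) * choose(n2, k - x)) /
    sum(choose(n1, x) * choose(n2, k - x))
}

# p value for H0: pi <= p0 under stage-wise ordering
p_value_stagewise <- function(k, r1, n1, n, p0) {
  if (k <= r1) {
    # Stopped after stage 1: P(X1 >= k)
    return(pbinom(k - 1, n1, p0, lower.tail = FALSE))
  }
  n2 <- n - n1
  x <- (r1 + 1):n1
  # Continued to stage 2: P(X1 > r1 and X1 + X2 >= k)
  sum(dbinom(x, n1, p0) * pbinom(k - x - 1, n2, p0, lower.tail = FALSE))
}

# Illustrative design (n1 = 10, r1 = 1, n = 29, r = 5), pi0 = 0.1,
# with 6 responses observed among all 29 patients
est <- umvue_two_stage(6, r1 = 1, n1 = 10, n = 29)
mle <- 6 / 29
pval <- p_value_stagewise(6, r1 = 1, n1 = 10, n = 29, p0 = 0.1)
print(c(UMVUE = est, MLE = mle, p = pval))
```

Since k = 6 exceeds r = 5 in this example, the p value coincides with the actual type I error rate of the design and therefore does not exceed the nominal level.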

Non-stochastic and stochastic curtailment
Non-stochastic and stochastic curtailment are based on the conditional power, i.e., the probability to reject $H_0: \pi \le \pi_0$ after the second stage under the alternative $\pi = \pi_1$ given the results observed so far. If we denote by $\tilde{n}$, $0 \le \tilde{n} \le n$, the number of patients for which the outcome has been observed and by $k$ the number of responses that occurred for these patients, the null hypothesis cannot be rejected after the second stage if $r_1 - k + 1 > n_1 - \tilde{n}$ or $r - k + 1 > n - \tilde{n}$, independently of any result that may be observed for future patients. The conditional power is thus zero in those cases, and stopping the trial for this reason is referred to as non-stochastic curtailment. Stochastic curtailment means to stop the study for futility if the conditional power falls below a pre-defined threshold $\theta$ ($0 < \theta < 1$). A formula for the conditional power when a stochastic curtailment procedure is applied in both stages of the study is given in the Appendix of Kunz and Kieser (2012). This formula was implemented in our package to allow a simulation-based investigation of the effect of including stochastic curtailment with a defined threshold $\theta$. Monte Carlo simulations are used to generate possible study outcomes based on the corresponding binomial distribution under $H_0$ and $H_1$, respectively. The type I and type II error rates are estimated by the relative frequencies of rejection or acceptance of $H_0$ over all simulated data sets. Furthermore, $PET(\pi_0)$ and $EN(\pi_0)$ are simulated analogously.
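The conditional power can be sketched in a few lines of base R. The version below is a simplified illustration and not the formula from Kunz and Kieser (2012): it computes the conditional power of the original design without any further curtailment looks, and for patients observed in the second stage it assumes that the trial has passed the interim analysis. The function name `cond_power` and its argument names are hypothetical.

```r
# Conditional power given k responses among n_obs observed patients,
# evaluated under the alternative response rate p1.
cond_power <- function(k, n_obs, r1, n1, r, n, p1) {
  stopifnot(k >= 0, k <= n_obs, n_obs <= n)
  n2 <- n - n1
  if (n_obs <= n1) {
    m1 <- n1 - n_obs            # stage-1 patients still to be observed
    x <- 0:m1                   # possible further stage-1 responses
    pass <- (k + x) > r1        # does the trial continue to stage 2?
    # P(X2 >= r + 1 - k - x) for the stage-2 responses X2
    win <- pbinom(r - k - x, n2, p1, lower.tail = FALSE)
    sum(dbinom(x, m1, p1) * pass * win)
  } else {
    # Stage 2 (assumes the interim analysis was passed):
    # P(X >= r + 1 - k) among the remaining n - n_obs patients
    pbinom(r - k, n - n_obs, p1, lower.tail = FALSE)
  }
}

# Illustrative design (n1 = 10, r1 = 1, n = 29, r = 5) with pi1 = 0.3
cp_start <- cond_power(0, 0, 1, 10, 5, 29, 0.3)    # equals the design's power
cp_zero <- cond_power(0, 9, 1, 10, 5, 29, 0.3)     # r1 can no longer be exceeded
cp_mid <- cond_power(2, 15, 1, 10, 5, 29, 0.3)

# Stochastic curtailment with threshold theta: stop if CP falls below theta
theta <- 0.2
stop_for_futility <- cp_mid < theta
print(c(cp_start = cp_start, cp_zero = cp_zero, cp_mid = cp_mid))
```

The second call illustrates non-stochastic curtailment: with no response among nine patients, the condition r1 - k + 1 > n1 - ñ holds (2 > 1), so the conditional power is exactly zero.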

Adaptive single-arm two-stage designs
The "classical" two-stage designs discussed in Section 2.1 require strict adherence to the sample sizes and decision rules pre-specified in the planning stage of the study. In case of deviations from these values, control of the type I error rate is no longer guaranteed (Englert and Kieser 2012a). For practical applications, this is a severe restriction. Englert and Kieser (2012b) defined the conditional error function for two-stage designs with discrete outcomes based on the approach initially introduced for continuous test statistics (see, e.g., Proschan and Hunsberger 1995; Posch and Bauer 1999; Müller and Schäfer 2001). It can be shown that any "classical" one-arm two-stage design with a binary endpoint can be re-written in terms of the conditional error function $CE$, which is given by

$$CE(k) = \begin{cases} 0 & \text{if } k \le r_1, \\ 1 - B(n - n_1, r - k, \pi_0) & \text{if } r_1 < k \le r, \\ 1 & \text{if } k > r, \end{cases}$$

where $k$ defines the number of responses observed in the first stage. The p values $p_1$ and $p_2$ of the first and second stage, respectively, are given by $p_1(k) = 1 - B(n_1, k - 1, \pi_0)$ and $p_2(l) = 1 - B(n - n_1, l - 1, \pi_0)$, where $k$ and $l$ denote the number of observed responses at stage 1 and 2. Then the null hypothesis can be rejected if $p_2(l) \le CE(k)$. Furthermore, the type I error rate is controlled when applying this decision rule even if arbitrary design modifications are performed after the first stage, e.g., a recalculation of the sample size based on the results of the interim analysis (Englert and Kieser 2012b). Due to the discreteness of the outcome and with it the test statistic, the available type I error rate $\alpha$ is usually not exhausted; the actual level, denoted here by $\alpha'$, generally falls below $\alpha$. By increasing the boundaries of the "natural" conditional error function $CE(k)$ given above and thereby implementing the remaining level $\alpha - \alpha'$, the conservatism can be reduced and the efficiency can be increased (Englert and Kieser 2012b). Such a modification of the conditional error function can be done in a multitude of ways. The software package includes the options of increasing the boundaries equally, proportionally to the probability of observing $p_1(k)$, or increasing only the smallest value of the conditional error function that is unequal to zero.
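The relation between a "classical" design and its conditional error function can be checked numerically. The sketch below assumes that the "natural" conditional error function is the probability, under the null hypothesis, that the second stage yields enough responses for the total to exceed r. The names `cond_error` and `p2` are hypothetical, and the package's `getD_*` functions are not used here.

```r
# "Natural" conditional error function of a design (n1, r1, n, r)
cond_error <- function(k, r1, n1, r, n, p0) {
  ifelse(k <= r1, 0,
         ifelse(k > r, 1,
                pbinom(r - k, n - n1, p0, lower.tail = FALSE)))
}

# Second-stage p value for l responses among n2 patients: P(X2 >= l)
p2 <- function(l, n2, p0) pbinom(l - 1, n2, p0, lower.tail = FALSE)

n1 <- 10; r1 <- 1; n <- 29; r <- 5; p0 <- 0.1; alpha <- 0.05
k <- 0:n1
ce <- cond_error(k, r1, n1, r, n, p0)

# Actual level: conditional error averaged over the first-stage distribution
actual_level <- sum(dbinom(k, n1, p0) * ce)
remaining_level <- alpha - actual_level

print(data.frame(k = k, CE = round(ce, 4)))
print(c(actual = actual_level, remaining = remaining_level))
```

The remaining level computed in the last step is the amount that could be redistributed over the values of the conditional error function, for instance equally or proportionally as described above.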

Subset designs
Lin, Allred, and Andrews (2008) proposed a single-arm phase II design which is based on two endpoints where a response for endpoint 1 implies a response for endpoint 2. Thus, endpoint 1 defines a subset of endpoint 2, as, e.g., in case of disease-free survival and overall survival as endpoints 1 and 2. These designs are also called "Simon's designs with ordinal outcomes". However, due to ease of readability we use the term "subset design". The decision to continue to the second stage is based only on one endpoint. In our package we implemented a subset design where the decision to proceed is based on endpoint 1.
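The global test problem for the subset design is given by

$$H_0: \pi_{\mathrm{sub}} \le \pi_{\mathrm{sub},0} \ \text{and} \ \pi_{\mathrm{super}} \le \pi_{\mathrm{super},0} \quad \text{versus} \quad H_1: \pi_{\mathrm{sub}} > \pi_{\mathrm{sub},0} \ \text{or} \ \pi_{\mathrm{super}} > \pi_{\mathrm{super},0},$$

where $\pi_{\mathrm{sub}}$ and $\pi_{\mathrm{super}}$ denote the true response rates for endpoint 1 (subset) and endpoint 2 (superset), respectively. The probabilities $\pi_{\mathrm{sub},0}$ and $\pi_{\mathrm{super},0}$ denote the response rates for endpoint 1 and endpoint 2 under the global null hypothesis, and $\pi_{\mathrm{sub},a}$ and $\pi_{\mathrm{super},a}$ denote the response rates under the global alternative hypothesis.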

Identification of designs
The subset design implemented in the OneArmPhaseTwoStudy package is defined by the parameters $(n_1, r_1, n, r, s)$, where more than $r_1$ responses for endpoint 1 among the first $n_1$ patients are needed to proceed to the second stage. To reject $H_0$ after the second stage, more than $r$ responses for endpoint 1 or more than $s$ responses for endpoint 2 among all $n$ enrolled patients are needed.
The probability of rejecting the global null hypothesis for true response rates $\pi^*_{\mathrm{sub}}$ and $\pi^*_{\mathrm{super}}$ is obtained by summing multinomial probabilities over all outcomes that lead to rejection (expression (9)), where $m(\cdot)$ denotes the multinomial probability mass function and $n_2 = n - n_1$ (see, e.g., Kunz and Kieser 2011a).
Evaluating (9) at $\pi^*_{\mathrm{sub}} = \pi_{\mathrm{sub},0}$ and $\pi^*_{\mathrm{super}} = \pi_{\mathrm{super},0}$ or at $\pi^*_{\mathrm{sub}} = \pi_{\mathrm{sub},a}$ and $\pi^*_{\mathrm{super}} = \pi_{\mathrm{super},a}$ provides the type I error rate and power, respectively. The probability of early termination and the expected sample size under the global null hypothesis are based on the subset endpoint only and are therefore given by

$$PET(\pi_{\mathrm{sub},0}) = B(n_1, r_1, \pi_{\mathrm{sub},0}),$$

$$EN(\pi_{\mathrm{sub},0}) = n_1 + (1 - PET(\pi_{\mathrm{sub},0}))(n - n_1).$$

Among all designs fulfilling the type I and type II error constraints, the optimal design is defined as the one with the smallest $EN(\pi_{\mathrm{sub},0})$. If this solution is not unique, the one with the highest power is chosen. The minimax design is the one with the smallest total sample size $n$. If more than one design with smallest $n$ exists, the one with smallest $EN(\pi_{\mathrm{sub},0})$ is selected as minimax design (Kunz and Kieser 2011a). Admissible designs can be derived in the same way as described in Section 2.1.
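The rejection probability of a subset design can be computed by enumerating the multinomial outcomes of both stages. The sketch below is one way to do this in base R and is not the package's implementation. It assumes that every patient falls into one of three categories: response for endpoint 1 (and hence also for endpoint 2) with probability `p_sub`, response for endpoint 2 only with probability `p_super - p_sub`, and no response with probability `1 - p_super`. The function name `subset_reject_prob` is hypothetical.

```r
subset_reject_prob <- function(n1, r1, n, r, s, p_sub, p_super) {
  stopifnot(r1 < n1, n1 < n, p_sub <= p_super, p_super <= 1)
  n2 <- n - n1
  probs <- c(p_sub, p_super - p_sub, 1 - p_super)

  # P(H0 is accepted after stage 2 | x1 endpoint-1 responses and
  # y1 endpoint-2-only responses in stage 1)
  accept_stage2 <- function(x1, y1) {
    total <- 0
    for (x2 in 0:n2) {
      for (y2 in 0:(n2 - x2)) {
        few_ep1 <- (x1 + x2) <= r            # endpoint 1 does not exceed r
        few_ep2 <- (x1 + y1 + x2 + y2) <= s  # endpoint 2 does not exceed s
        if (few_ep1 && few_ep2) {
          total <- total + dmultinom(c(x2, y2, n2 - x2 - y2), prob = probs)
        }
      }
    }
    total
  }

  reject <- 0
  # Only first-stage results with more than r1 endpoint-1 responses continue
  for (x1 in (r1 + 1):n1) {
    for (y1 in 0:(n1 - x1)) {
      p_stage1 <- dmultinom(c(x1, y1, n1 - x1 - y1), prob = probs)
      reject <- reject + p_stage1 * (1 - accept_stage2(x1, y1))
    }
  }
  reject
}

# Illustrative parameter values (not taken from the paper)
rp_null <- subset_reject_prob(10, 1, 29, 5, 10, p_sub = 0.1, p_super = 0.3)
rp_simon <- subset_reject_prob(10, 1, 29, 5, 29, p_sub = 0.1, p_super = 0.3)
print(c(subset = rp_null, endpoint1_only = rp_simon))
```

Setting s = n makes a rejection via endpoint 2 impossible, so the second call reduces to the rejection probability of the corresponding single-endpoint two-stage design; this also illustrates why the type I error rate of a subset design is never smaller than that of the single-endpoint design with the same (n1, r1, n, r).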
The algorithm to determine subset designs fulfilling the type I and II error requirements and to choose among them the optimal, minimax, and admissible designs follows the description in Kunz (2011). This algorithm is very similar to the algorithm described in Section 2.1 for Simon's designs and differs in only a few respects. As before, a naive algorithm to detect subset designs searches for each value of $n$ over $n_1 \in [1, n - 1]$ and $r_1 \in [0, n_1 - 1]$ as well as $r \in [r_1 + 1, n - n_1 + r_1]$ and $s \in [r, n]$ for those designs for which expression (9) is less than or equal to $\alpha$ at $\pi^*_{\mathrm{sub}} = \pi_{\mathrm{sub},0}$ and $\pi^*_{\mathrm{super}} = \pi_{\mathrm{super},0}$ and greater than or equal to $1 - \beta$ at $\pi^*_{\mathrm{sub}} = \pi_{\mathrm{sub},a}$ and $\pi^*_{\mathrm{super}} = \pi_{\mathrm{super},a}$. This approach can be improved in a similar way as described in Section 2.1. The starting value for $n_1$ is determined in the same way as for Simon's design but is based on the response rate for endpoint 1 under the alternative hypothesis. Therefore, for $\beta \ne 1 - \pi_{\mathrm{sub},a}$ the starting value for $n_1$ is $\mathrm{ceil}(\log(\beta)/\log(1 - \pi_{\mathrm{sub},a}))$, while it is 2 for $\beta = 1 - \pi_{\mathrm{sub},a}$. As $n$ has to be larger than $n_1$, the starting value for $n$ is $n_1 + 1$. As for Simon's two-stage designs, the same inequality $B(n_1, r_1, \pi_{\mathrm{sub},a}) \ge 1 - \beta$ has to hold true for every pair $(r_1, n_1)$ with $r_1 \in [0, n_1 - 1]$. If the inequality does not hold true, $n_1$ is increased by 1 and the search continues; otherwise, the algorithm searches backwards over $r$, from $n - n_1 + r_1$ down to $r_1 + 1$. Because the actual type I error rate $\alpha_{\mathrm{simon}}$ of Simon's design for the parameter set $(n_1, r_1, n, r)$ is smaller than or equal to the actual type I error rate of a subset design with the same parameters, the condition $\alpha_{\mathrm{simon}}(n_1, r_1, n, r) \le \alpha$ is checked for every value of $r$. If it holds true, the algorithm searches over $s$ in the range $[r, n - 1]$; otherwise, $r_1$ is increased and the search continues. The last step of the algorithm is to check whether the type II error rate of the multinomial test for the parameter set $(n_1, r_1, n, r, s)$ under the alternative is less than $\beta$. If this condition holds true, the type I error rate and power are calculated via (9) and the algorithm continues. Otherwise, $s$ is skipped and $r$ is decreased by 1. As for Simon's design, the inequalities $EN(\pi_{\mathrm{sub},0})_{\mathrm{optimal}} \le EN(\pi_{\mathrm{sub},0})_{\mathrm{admissible}} \le EN(\pi_{\mathrm{sub},0})_{\mathrm{minimax}}$ and $n_1 < EN(\pi_{\mathrm{sub},0})$ hold true for subset designs, which leads to the same improvement of the algorithm as described in Section 2.1.

Point estimation, confidence intervals, and p values
Based on the work of Girshick et al. (1946), the uniformly minimum variance unbiased estimator for endpoint 1 ($\hat{\pi}_{\mathrm{sub,UMVUE}}$) can be obtained by (4), and for endpoint 2 ($\hat{\pi}_{\mathrm{super,UMVUE}}$) the estimator is defined analogously in terms of $t_1$, $u_1$, $t$, and $u$, where $t_1$ denotes the number of observed responses for endpoint 1 in the first stage, $u_1$ denotes the number of observed responses for endpoint 2 in the first stage, and $t$ and $u$ denote the numbers of observed responses for endpoint 1 and 2 in the whole trial.
For the derivation of appropriate p values that match the test decision, Kunz (2011) derived a formula for the exact p value $p_{\mathrm{exact}}$. Since the exact p value depends on $\pi_{\mathrm{sub},0}$ and $\pi_{\mathrm{super},0}$, the confidence interval for the response rate of endpoint 1 depends on the response rate of endpoint 2 and vice versa. This results in a one-sided confidence area which is called the confidence set (Reiczigel, Abonyi-Tóth, and Singer 2008). The boundary of this area is given by all combinations of $\hat{\pi}_{\mathrm{sub,lower}}$ and $\hat{\pi}_{\mathrm{super,lower}}$ with $p_{\mathrm{exact}}(\hat{\pi}_{\mathrm{sub,lower}}, \hat{\pi}_{\mathrm{super,lower}}) = \alpha$.
The idea and implementation of (non-)stochastic curtailment can be applied to subset designs in the same way as described in Section 2.1 for Simon's designs.

Package overview
The OneArmPhaseTwoStudy package consists of 25 functions. These functions are implemented for the purpose of planning, monitoring, and analyzing one-arm phase II studies with binary outcomes. The supported designs are two-stage designs with a single endpoint as well as subset designs, where for the single endpoint designs the "classical" as well as the adaptive variants are available. In the following, all functions will be outlined. Each section is separated into the three parts planning, monitoring, and analysis.

Classical two-stage designs
The OneArmPhaseTwoStudy package implements 8 functions, which are provided solely for the "classical" two-stage designs. In the following each of these functions will be described.

Planning classical two-stage designs
In the planning stage, a main task is to identify an adequate design for the given study situation at hand. Our package implements three functions to fulfill this purpose, which will be described below.
The algorithm to find possible designs for given values of $\alpha$, $\beta$, $\pi_0$, and $\pi_1$ requires a high computational effort. Therefore, the package uses the programming language C++ internally. Because C++ is a compiled language, computations can be performed up to 80 times faster. The function setupSimon, which takes the arguments alpha, beta, p0, and p1, returns what we will refer to as a 'simon' object. This 'simon' object allows access to an internally used C++ object. The arguments alpha, beta, p0, and p1 correspond to $\alpha$, $\beta$, $\pi_0$, and $\pi_1$, respectively. The parameters can be changed at any time by invoking the function setSimonParams.
Once a 'simon' object is generated, the function getSolutions can be used to start the search algorithm described in Section 2.1 to identify possible designs for given $\alpha$, $\beta$, $\pi_0$, and $\pi_1$:

getSolutions(simon = setupSimon(), useCurtailment = FALSE, curtail_All = FALSE, cut = 0, replications = 10000, upperBorder = 0)

The first argument passed to getSolutions must be a pre-specified 'simon' object. To investigate the effect of (non-)stochastic curtailment, the argument useCurtailment must be set to TRUE. In this case, the function getSolutions will determine the changes in the type I and II error rates for all identified designs as well as the impact on $PET(\pi_0)$ and $EN(\pi_0)$. The threshold $\theta$ for the conditional power can be specified by the argument cut and has to be chosen as a value between 0 and 1. To evaluate the effect of different thresholds simultaneously, the argument curtail_All can be set to TRUE. In this case, the algorithm will calculate the effect of curtailment for all values from the value of cut to 1 in steps of 0.05. This gives the user an impression of which threshold offers the best trade-off between the reduction in sample size and the loss in power. The argument replications determines how many studies are simulated to evaluate the effect of curtailment. Because C++ is used internally, even large values for replications (like 100,000) lead to results within a couple of seconds (tested on a computer with a 2.8 GHz dual-core processor).
The function getSolutions returns a list object containing several data frames which summarize all identified designs and the consequences of curtailment. The application of function getSolutions is described in Section 4.

Monitoring classical two-stage designs
The OneArmPhaseTwoStudy package includes the two functions plot_simon_study_state and getCP_simon which are dedicated to monitoring purposes. The first function allows the user to get a graphical overview of the current status of the study. An exemplary call of this function is given below.

R> set.seed(25)
R> design <- getSolutions()$Solutions[3, ]
R> stoppingRules <- data.frame(Enrolled_patients = c(design$n1, design$n),
+    Needed_responses_ep1 = c(design$r1, design$r))
R> enrolledPat <- data.frame(ep1 = rbinom(18, 1, design$p1))
R> plot_simon_study_state(stoppingRules, enrolledPat, design$r1, design$n1,
+    design$r, design$n)

This call results in the output shown in Figure 1. The horizontal green dashed lines indicate the critical values ($r_1$, $r$) for the given two-stage design, whereas the blue lines denote the sample size for the interim and the final analysis ($n_1$, $n$). The black circles depict the patients who have already been enrolled. The red area illustrates the stopping rules for the given design. If the black circles enter the red area, the study has to be stopped. Moreover, the user can easily see when the interim analysis has to be performed and which number of responses is required for continuation. When the design is planned without curtailment, the stopping rules are simply defined by $r_1$, $n_1$, $r$, and $n$. If curtailment is applied, these stopping rules change as there are more options to stop for futility. Figure 2 shows the same design as illustrated in Figure 1 but with stochastic curtailment for a threshold of $\theta = 0.2$, which can be generated through the call given below.

+    tmp$Curtailment_Results$ StoppingrulesForID:2 $ Stoppingrules_for_Row:1
R> names(stoppingRules) <- c("Needed_responses_ep1", "Enrolled_patients")
R> enrolledPat <- data.frame(ep1 = rbinom(18, 1, design$p1))
R> plot_simon_study_state(stoppingRules, enrolledPat, design$r1,
+    design$n1, design$r, design$n)

The second function dedicated to monitoring purposes is the function getCP_simon, which allows calculating the conditional power at any time point of an ongoing study. This function can also be used to decide whether a study should be stopped for futility when (non-)stochastic curtailment is applied. As arguments, this function requires specification of the number of observed responses, the number of enrolled patients, and the design parameters $r_1$, $n_1$, $n$, and $\pi_1$.

Analyzing classical two-stage designs
As the conduct of interim analysis is included in the monitoring procedure, this section focuses on the functions provided for the final analysis of a "classical" two-stage designs, which are get_p_KC, get_CI, and get_UMVUE_GMS. The function get_p_KC calculates the exact p value based on the approach of Koyama and Chen (2008) according to (5) given in Section 2.1. With this tool, the user can decide whether to reject or accept H 0 . Based on the function get_p_KC it is possible to derive the (1 − 2α)-CI given by [π L ,π U ], which can be calculated by calling the function get_CI. Internally, this function performs a stage-wise ordering by iterating over different values forπ L , which is increased with every iteration step. Analogously,π U is decreased with every iteration step as long as the values of get_p_KC(π L ) and get_p_KC(π U ) are less than α. As mentioned in Section 2.1, H 0 can be rejected if and only if π 0 is less thanπ L . The definition of this function is illustrated below.
get_CI(k, r1, n1, n, alpha = 0.05, precision = 4) The first argument has to be set to the number of observed responses. The following four arguments correspond to r 1 , n 1 , n, and α, respectively. The argument precision can be used to select to which digit the result of get_CI should be accurate.
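The test-inversion idea behind such a confidence interval can be sketched in base R as follows. This is an illustration only: instead of the stepwise iteration with a `precision` argument described above, it uses `uniroot` to locate the response rates at which the stage-wise p value equals α and 1 − α. The function names `p_value_stagewise` and `ci_two_stage` are hypothetical, and the sketch is restricted to k ≥ 1 observed responses.

```r
# p value for H0: pi <= p0 under stage-wise ordering
# (k <= r1 means that the trial stopped after stage 1)
p_value_stagewise <- function(k, r1, n1, n, p0) {
  if (k <= r1) return(pbinom(k - 1, n1, p0, lower.tail = FALSE))
  x <- (r1 + 1):n1
  sum(dbinom(x, n1, p0) * pbinom(k - x - 1, n - n1, p0, lower.tail = FALSE))
}

# Two-sided (1 - 2 * alpha) confidence interval by inverting the test
ci_two_stage <- function(k, r1, n1, n, alpha = 0.05, tol = 1e-8) {
  stopifnot(k >= 1, k <= n)
  eps <- 1e-8
  # The p value increases in p0, so each bound is a single root
  f_lower <- function(p0) p_value_stagewise(k, r1, n1, n, p0) - alpha
  f_upper <- function(p0) p_value_stagewise(k, r1, n1, n, p0) - (1 - alpha)
  lower <- uniroot(f_lower, c(eps, 1 - eps), tol = tol)$root
  upper <- uniroot(f_upper, c(eps, 1 - eps), tol = tol)$root
  c(lower = lower, upper = upper)
}

# Illustrative design (n1 = 10, r1 = 1, n = 29, r = 5) with pi0 = 0.1
ci_reject <- ci_two_stage(6, r1 = 1, n1 = 10, n = 29)  # 6 responses: H0 rejected
ci_accept <- ci_two_stage(5, r1 = 1, n1 = 10, n = 29)  # 5 responses: H0 not rejected
print(rbind(ci_reject, ci_accept))
```

In this example the lower bound exceeds 0.1 exactly when more than r = 5 responses are observed, so the interval agrees with the test decision.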
Besides a correct test decision, the estimated response rate plays a major role in the planning of proceeding phase III studies. Therefore, the package includes the function get_UMVUE_GMS which implements the UMVUE of the true response rate based on the work of  (see Section 2.1). The listing below illustrates the definition of this function get_UMVUE_GMS(k, r1, n1, n), where k, r1, n1, and n correspond to the number of observed responses, the critical value for the first stage, the number of patients enrolled in the first stage, and the total number of patients enrolled in the whole trial, respectively. The calculation of the UMVUE is illustrated in Section 4.3.
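To make the final test decision concrete, the following base R sketch computes a p value under stage-wise ordering for a trial that continued to the second stage. It assumes the usual form of this p value, namely the probability under π_0 of passing the interim analysis and then observing at least k responses in total. Equation 5 is not reproduced in this section, so both this form and the helper name p_stagewise are assumptions and not the package's implementation of get_p_KC.

```r
# Hypothetical helper (not part of OneArmPhaseTwoStudy): p value under
# stage-wise ordering for a two-stage design that reached stage two.
# k: total responses, r1/n1: first-stage critical value and sample size,
# n: total sample size, p0: response rate under the null hypothesis.
p_stagewise <- function(k, r1, n1, n, p0) {
  n2 <- n - n1
  x1 <- (r1 + 1):n1                       # first-stage results that continue
  # P(X2 >= k - x1); equals 1 whenever k - x1 <= 0
  tail2 <- pbinom(k - x1 - 1, n2, p0, lower.tail = FALSE)
  sum(dbinom(x1, n1, p0) * tail2)
}

p_val <- p_stagewise(k = 7, r1 = 1, n1 = 23, n = 56, p0 = 0.05)
print(p_val)
```

The call uses the design and the number of responses of the application example, so the result can be compared with the p value that the package reports there.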

Adaptive two-stage designs
As described in Section 2.2, every "classical" two-stage design presented in Section 2.1 can be "translated" into an adaptive design and may furthermore be improved with respect to efficiency. The OneArmPhaseTwoStudy package provides eight functions which implement the algorithms described in Section 2.2.

Planning adaptive two-stage designs
To plan an adaptive two-stage design, the first step is to identify a "classical" two-stage design using the functions described in Section 3.2. After that, a rule to increase the boundaries of the conditional error function CE(k) must be specified. For this purpose, the package implements four functions denoted by getD_none, getD_equally, getD_proportional, and getD_distributeToOne. These functions return data frames with all possible values of k (number of observed responses at the interim analysis) and the corresponding value of the conditional error function. The function getD_none corresponds to the case where the remaining level, that is, the difference between the nominal level α and the actual type I error rate of the design, is not used to modify the conditional error function and the original function CE(k) is used instead. The functions getD_equally, getD_proportional, and getD_distributeToOne spend the remaining level by increasing the boundaries returned by CE(k) either equally, proportionally to the probability of observing p_1(k), or by allocating it to the smallest value of CE(k) that is unequal to zero, respectively.
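As an illustration of what the unmodified conditional error function looks like, the sketch below evaluates CE(k) for a Simon-type design in base R. It assumes that CE(k) is the probability under π_0 of exceeding r responses in total given k responses in the first stage, with CE(k) = 0 for k ≤ r_1. The helper name conditional_error is hypothetical; the column names k and ce merely mirror the package output shown in the application example, and the code is not the implementation of getD_none.

```r
# Hypothetical helper: unmodified conditional error function CE(k) of a
# Simon-type design, i.e., the probability under p0 of exceeding r responses
# in total given k responses among the first n1 patients.
conditional_error <- function(k, r1, n1, r, n, p0) {
  ce <- pbinom(r - k, n - n1, p0, lower.tail = FALSE)  # P(X2 > r - k)
  ce[k <= r1] <- 0   # stop for futility at the interim analysis
  ce[k > r]  <- 1    # boundary already crossed after stage one
  data.frame(k = k, ce = ce)
}

ce_tab <- conditional_error(k = 0:8, r1 = 1, n1 = 23, r = 5, n = 56, p0 = 0.05)
print(ce_tab)
```

Any rule that spends the remaining level can only raise these values for r_1 < k ≤ r, so the table serves as a lower bound for the modified boundaries.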

Monitoring adaptive two-stage designs
For monitoring purposes, the same functions as described in Section 3.2 can be used. The main differences to a "classical" design occur during the conduct of the interim analysis, which is described in the next section.

Analyzing adaptive two-stage designs
As mentioned in Section 2.2, adaptive designs allow modifying the number of patients to be enrolled in the second stage based on the results of the interim analysis. The package implements three functions dedicated to this purpose, which are denoted by getCP, getN2, and get_r2_flex.

The function getCP returns the conditional power of the study if the number of patients to be enrolled in the second stage is changed to n2 when k responses were observed at the interim analysis and under the assumption that p1 is the true response rate.

getCP(n2, p1, design, k, mode = 0, alpha = 0.05)

The argument design is to be specified as a data frame containing the columns r1, n1, r, n, and p0, which correspond to the values of r_1, n_1, r, n, and π_0. The assumed true response rate is that under the alternative hypothesis and is given by p1. The argument mode determines how the remaining level is spent to modify the boundaries returned by CE(k). mode has to be a value in {0, 1, 2, 3}, where 0 indicates that the remaining level is not spent (getD_none), 1 indicates a proportional spending (getD_proportional), 2 indicates an equal spending (getD_equally), and 3 stands for an allocation to the smallest value of CE(k) which is unequal to zero (getD_distributeToOne). The argument alpha specifies the overall type I error rate of the study.

The function getN2 returns the number of patients to be enrolled in the second stage in order to achieve the specified conditional power.

getN2(cp, p1, design, k, mode = 0, alpha = 0.05)

The arguments of getN2 are the same as for getCP, with the only difference that the first argument specifies the conditional power the study should achieve.

Changing the number of patients to be enrolled after the interim analysis results in a different critical value to be applied to the number of responses observed in the second stage of the study. To calculate the new value for r_2, the function get_r2_flex can be used. This function requires three arguments: the first argument is the conditional error (value of the modified CE(k)), the second is π_0, and the last argument is n_2. As outlined in Section 2.2, H_0 can be rejected if p_2(l) ≤ CE(k), where p_2(l) is implemented in the function getP.
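The rejection rule p_2(l) ≤ CE(k) can be turned into a required number of second-stage responses. The base R sketch below assumes that p_2(l) is the binomial tail probability of at least l responses among n_2 patients under π_0. The helper min_responses_stage2 is hypothetical, and the convention used by get_r2_flex for r_2 may differ from the "minimum number of responses" returned here.

```r
# Hypothetical helper: smallest number of second-stage responses l whose
# second-stage p value P(X2 >= l | n2, p0) does not exceed the conditional
# error ce. Returns NA if no l in 0..n2 leads to rejection.
min_responses_stage2 <- function(ce, p0, n2) {
  l  <- 0:n2
  p2 <- pbinom(l - 1, n2, p0, lower.tail = FALSE)  # P(X2 >= l)
  ok <- which(p2 <= ce)
  if (length(ok) == 0) NA_integer_ else l[ok[1]]
}

# Conditional error values taken from the application example
l_small <- min_responses_stage2(ce = 0.818108,   p0 = 0.05, n2 = 10)
l_full  <- min_responses_stage2(ce = 0.08085087, p0 = 0.05, n2 = 33)
print(c(l_small, l_full))
```

With the originally planned 33 second-stage patients and two first-stage responses, the rule asks for four further responses, which agrees with the requirement of more than r = 5 responses in total.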

Subset designs
The following sections will outline all functions of the OneArmPhaseTwoStudy package which support the subset designs described in Section 2.3.

Planning subset designs
Planning a subset design follows similar steps as for the "classical" two-stage designs with a single endpoint described in Section 3.2. The corresponding functions supporting these steps for subset designs are given by setupSub1Design, setSub1Params, and getSolutionsSub1.
Similar to the procedure described in Section 3.2, at first a 'sub1' object has to be defined which establishes a link between C++ and R code. By this, it is possible to perform the calculations much faster as compared to plain R code. Nevertheless, the identification of subset designs is computationally more intensive than the identification of "classical" designs. Therefore, depending on the underlying parameter constellation it may take several minutes until all possible designs are identified. To generate a 'sub1' object, the function setupSub1Design is used which is illustrated below.
setupSub1Design(alpha = 0.1, beta = 0.1, pc0 = 0.6, pt0 = 0.7, pc1 = 0.8, pt1 = 0.9)

The arguments alpha and beta are used to specify the significance level and the type II error rate. The other arguments are used to set π_0^sub, π_0^super, π_a^sub, and π_a^super, respectively. The arguments can be changed at any time by invoking the function setSub1Params.
The identification of subset designs is similar to the two-stage designs with a single endpoint. The function getSolutionsSub1 starts the search algorithm described in Section 2.3. This function accepts the same arguments as getSolutions but uses a 'sub1' object instead of a 'simon' object. Moreover, the function getSolutionsSub1 provides the additional arguments skipS, skipR, and skipN1 which should be set either to TRUE or FALSE. These arguments instruct the search algorithm to skip the range of s, r or n 1 every time a design is identified which fulfills the type I and II error constraints. This results in a performance improvement. However, if one or more of these arguments are set to TRUE the algorithm will only be able to determine the minimax, admissible, and optimal design among the identified designs. This does not assure that the overall minimax, admissible, and optimal designs are found.

Monitoring subset designs
The OneArmPhaseTwoStudy package provides two functions for monitoring subset designs. The function plot_sub1_study_state generates a plot similar to the one described in Section 3.2. The listing below illustrates the application of this function.
The second function dedicated to monitoring purposes is get_conditionalPower, which can be used to calculate the conditional power for a given subset design in a similar manner as described in Section 3.2 for the "classical" two-stage designs. The function requires the number of observed responses for endpoints 1 and 2 as well as the number of enrolled patients. Moreover, the parameter set (r_1, n_1, r, s, n, π_a^sub, π_a^super) has to be provided.

Analyzing subset designs
The OneArmPhaseTwoStudy package implements four functions which can be used to perform the final analysis of a subset design, three of which are designated to calculate the exact p value and the confidence set which are described in Section 2.3. The function get_p_exact_subset computes the exact p value for a given subset design (see Equation 13). The function is defined as follows.
get_p_exact_subset(t, u, r1, n1, n, pc0, pt0, sub1 = setupSub1Design())

The first two arguments t and u have to be set equal to the number of responses observed for endpoints 1 and 2, respectively. The following arguments correspond to the values of r_1, n_1, n, π_0^sub, and π_0^super. As the decision to proceed to the second stage of the study is based on r_1 only, the function does not depend on the parameters r and s. The last argument sub1 is used internally by the function get_confidence_set and should not be overwritten.

As described in Section 2.3, the confidence interval for the response rate of endpoint 1 depends on the response rate of endpoint 2 and vice versa, which results in a so-called confidence set. The boundaries of this confidence set can be calculated through the function get_confidence_set. Internally, this function uses get_p_exact_subset for different values of pc0 and pt0 to determine different sets of [π̂_sub,lower, π̂_super,lower] for which p_exact(π̂_sub,lower, π̂_super,lower) ≤ α. To illustrate the calculated confidence set, the function plot_confidence_set can be used. A call of this function results in a plot as shown in Figure 4, where the green area represents the confidence set. The black dot illustrates the point estimate of the true response rates of endpoints 1 and 2 given by [π̂_sub,UMVUE, π̂_super,UMVUE]. Finally, the red area indicates the acceptance area, which means that H_0 cannot be rejected if the confidence set overlaps with this region.

The point estimates π̂_sub,UMVUE and π̂_super,UMVUE are provided by the function get_UMVUE_GMS for the subset endpoint and the function get_UMVUE_GMS_subset_total for the superset endpoint.

Graphical user interface
In addition to the R package OneArmPhaseTwoStudy, a graphical user interface (GUI) was developed in order to provide an easy-to-use application. The GUI is implemented in Qt (Nord and Chambe-Eng 2017), a cross-platform application framework for C++ (for more information visit http://qt-project.org/). This framework is especially suited for the development of platform-independent graphical interfaces. The purpose of the GUI is to provide the full functionality of the OneArmPhaseTwoStudy package to users with no or limited knowledge of R. Internally, the GUI uses the R package RInside (Eddelbuettel and François 2015) to access the OneArmPhaseTwoStudy package. A GUI installer for Windows can be downloaded at http://www.klinikum.uni-heidelberg.de/fileadmin/inst_med_biometrie/Aktuelles/R-Paket/installer.exe. The source code is also available on GitHub at https://github.com/imbi-heidelberg/OneArmPhaseTwoStudy_GUI. The tools provided by the GUI are the same as described in Sections 3.2 to 3.4. Therefore, the following subsections illustrate the application of the GUI by example only.

Study planning with the GUI
To plan a new study, the option "Create new study" in the "File" menu must be selected. After that, some general information like the study name, the principal investigator, and the name of the involved biometrician must be provided. Once the general information has been entered, the GUI displays a window with all available design options (Figure 5).
At first, a choice between Simon's two-stage design and the subset design must be made. If "Simon's Design" is selected (see section a) of Figure 5), all necessary design parameters (α, β, π_0, π_1) have to be entered in section b) of Figure 5. After that, the search algorithm described in Section 2.1 can be started by pressing the button "Start calculation", which internally invokes the function getSolutions provided by the R package. All identified designs are displayed in a table as illustrated in section c) of Figure 5. With a click into the table, the user can select which design to use. After clicking "Next", the GUI provides an overview of all selections made during the planning of the study (see Figure 6).
If the selected design should be applied, the user has to click on the "Create study" button. Then a save menu will be provided so that this design can be re-used for monitoring and analysis of the study at any time later.

Monitoring with the GUI
After planning a new or opening a previously saved study, the GUI continues to the monitoring mode. To add a new patient to the study, a patient ID must be provided and the information whether or not a response was observed for this patient. With a click on "Add patient" the provided information is included into the study. It is possible to save the current study state at any time through the "File" menu.
Moreover, the monitoring mode of the GUI provides three different pages which sum up all available information up to the current study state. The first page "Study details" displays all design parameters as well as the number of enrolled patients, the number of observed responses, and the current conditional power (Figure 7). On the second page "Enrolled patients", all included patients are displayed in a table. The third page "Study progress" (see Figure 8) shows a graphical overview of the current study state, which is generated internally.
After n_1 patients have been entered, a pop-up message informs the user that the interim analysis is to be performed. Depending on the number of observed responses, the pop-up message reports whether the study has to be stopped or can proceed to the second stage.
If the study was planned with an adaptive design, the number of patients to be enrolled in the second stage of the trial can be changed at the interim analysis. Note that it is impossible to change the number of patients to be enrolled in the second stage after more than n_1 patients have been enrolled. If further patients are added to the study, the GUI switches back to the monitoring mode until n patients in total are included.

Analyzing with the GUI
After having entered a total of n patients, the GUI switches to the final analysis window which shows the results of the test decision as well as the estimated response rate together with the confidence interval (see Figure 9).
In the middle of the screen, the graphical overview of the study with all enrolled patients is displayed, which is obtained internally by a call of plot_simon_study_state (see Section 3.2). All design parameters are displayed in the group box "Study design". The group box "Final study state" contains the number of enrolled patients, the number of observed responses, and the exact p value, which is calculated internally through invocation of get_p_KC. Moreover, two point estimators are provided, namely the MLE and the UMVUE. The MLE is simply the number of observed responses divided by the number of enrolled patients. For the calculation of the UMVUE, the GUI internally invokes the function get_UMVUE_GMS. Finally, the (1 − 2α)-CI is displayed, which is calculated with the function get_CI.

Razak et al. (2013) conducted a single-arm phase II trial to investigate the clinical activity of a new orally administered agent in recurrent or metastatic squamous-cell cancer of the head and neck. The primary endpoint was objective response, for which the null hypothesis H_0: π ≤ π_0 = 0.05 was assessed at one-sided level α = 0.05. A power of 1 − β = 0.80 was to be reached for a true objective response rate of π_1 = 0.15. Simon's optimal two-stage design was implemented. This design is identified by calling the function getSolutions, which provides the result (n_1, r_1, n, r) = (23, 1, 56, 5). All designs fulfilling the constraints with respect to the type I and type II error rates with a maximum sample size of at most 100 are obtained by using the following code. This generates the following output (note that some columns are hidden due to space limitations).

The maximum sample size of the minimax design is four lower than that of the optimal design, while its expected sample size is more than six higher. Two admissible designs are identified with maximum and expected sample sizes in between those of the minimax and optimal designs.
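The operating characteristics of the selected design can be checked independently of the package with a few lines of base R. The sketch below assumes the standard decision rule of a Simon-type design (stop for futility if at most r_1 of the first n_1 patients respond, reject H_0 if more than r of all n patients respond). The helper simon_oc is hypothetical and only evaluates one given design; it does not perform the design search of getSolutions.

```r
# Hypothetical helper: operating characteristics of a two-stage design that
# stops for futility if at most r1 of n1 patients respond and rejects H0 if
# more than r of n patients respond.
simon_oc <- function(r1, n1, r, n, p) {
  n2 <- n - n1
  x1 <- (r1 + 1):n1                                   # continue to stage two
  reject <- sum(dbinom(x1, n1, p) *
                  pbinom(r - x1, n2, p, lower.tail = FALSE))
  pet <- pbinom(r1, n1, p)                            # early termination
  c(reject = reject, pet = pet, en = n1 + (1 - pet) * n2)
}

under_h0 <- simon_oc(r1 = 1, n1 = 23, r = 5, n = 56, p = 0.05)
under_h1 <- simon_oc(r1 = 1, n1 = 23, r = 5, n = 56, p = 0.15)
print(round(rbind(under_h0, under_h1), 4))
```

The column reject gives the type I error rate in the first row and the power in the second row; pet and en are the probability of early termination and the expected sample size under the respective response rate.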

Statistical monitoring
Four responses were observed in the 23 evaluable patients of the first stage and thus the study proceeded to stage two, which can be seen in the graphical overview (Figure 10) generated by calling

R> set.seed(20)
R> sr <- data.frame(Enrolled_patients = c(23, 56),
+    Needed_responses_ep1 = c(1, 5))
R> enrolledPat <- data.frame(ep1 = logical(23))
R> enrolledPat[sample.int(23, 4), ] <- TRUE
R> plot_simon_study_state(sr, enrolledPat, 1, 23, 5, 56)

Here, the black circles, which represent the enrolled patients together with the observed responses, do not fall into the red area at the interim analysis, which is indicated by the vertical blue dotted line.

Let us assume that two responses were observed within the 23 patients of the first stage. The study could then be continued, but the conditional power to reject the null hypothesis after the second stage amounts to

R> getCP_simon(2, 23, 1, 23, 5, 56, 0.15)
[1] 0.7504551

If after a total of 25 (30 / 35) patients still only two responses were observed, the conditional power amounts to 0.7039 (0.5615 / 0.3887), and one may think about stopping the trial for futility based on stochastic curtailment considerations.
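The conditional power values quoted above follow from a single binomial tail probability once the interim analysis has been passed. The base R sketch below makes this explicit; the helper cond_power is hypothetical and covers only the case n_1 ≤ m ≤ n, whereas getCP_simon can be called at any time point. The curtailment check assumes that stopping is considered when the conditional power falls below the threshold θ.

```r
# Hypothetical helper: conditional power once the interim analysis has been
# passed. k responses among m enrolled patients (n1 <= m <= n); H0 is
# rejected if more than r responses are observed among all n patients.
cond_power <- function(k, m, r, n, p1) {
  stopifnot(m <= n, k <= m)
  # probability of at least r + 1 - k further responses in n - m patients
  pbinom(r - k, n - m, p1, lower.tail = FALSE)
}

enrolled <- c(23, 25, 30, 35)
cp <- cond_power(k = 2, m = enrolled, r = 5, n = 56, p1 = 0.15)
print(round(cp, 4))

theta <- 0.2                 # curtailment threshold used for Figure 2
print(cp < theta)            # TRUE would suggest stopping for futility
```

With a threshold of θ = 0.2, none of the four time points would trigger a stop under this rule, although the conditional power has dropped markedly by the time 35 patients are enrolled.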

Analysis
After the second stage, seven responses were observed for the 56 evaluable patients enrolled in the study performed by Razak et al. (2013). As this exceeds r = 5, the null hypothesis is rejected. The MLE of the response rate is 7/56 = 0.125, and the related "naive" two-sided 90%-CI and one-sided p value, which do not take into account the sequential design, are given by [0.0602, 0.2006] and p = 0.0212, respectively. To obtain the UMVUE as well as the 90%-CI and p value tailored to the sequential nature of the design, the functions get_UMVUE_GMS, get_CI, and get_p_KC have to be called.

R> results
This results in the following output.
      UMVUE CI_low CI_high    p_value
1 0.1379133 0.0617 0.21439 0.01882311

As can be seen, the MLE underestimates the response rate, which is a general feature of two-stage designs with the option of early stopping for futility. Consequently, reporting this estimate may lead to an inappropriate judgment of the treatment effect.
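The difference between the MLE and the UMVUE can be reproduced with a short base R sketch. It assumes the usual closed form of the UMVUE for two-stage designs with futility stopping, a ratio of counts over the admissible first-stage results for trials that reached the second stage. The helper umvue_two_stage is hypothetical and is not the code behind get_UMVUE_GMS.

```r
# Hypothetical helper (not the package's get_UMVUE_GMS): closed-form UMVUE for
# a two-stage design with futility stopping, assuming the usual ratio of
# path counts for trials that reached the second stage.
umvue_two_stage <- function(k, r1, n1, n) {
  if (k <= r1) return(k / n1)                 # trial stopped after stage one
  n2 <- n - n1
  x1 <- max(r1 + 1, k - n2):min(n1, k)        # admissible first-stage counts
  sum(choose(n1 - 1, x1 - 1) * choose(n2, k - x1)) /
    sum(choose(n1, x1) * choose(n2, k - x1))
}

mle   <- 7 / 56
umvue <- umvue_two_stage(k = 7, r1 = 1, n1 = 23, n = 56)
print(c(MLE = mle, UMVUE = umvue))
```

For seven responses in the design (n_1, r_1, n, r) = (23, 1, 56, 5), the sketch yields a UMVUE above the MLE, in line with the package output shown above.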

Planning and performing adaptive designs
As an alternative to the "classical" Simon's optimal design, the study by Razak et al. (2013) could also have been planned within an adaptive framework that allows reacting flexibly to unforeseen events by data-driven modifications while still controlling the type I error rate. The flexible counterpart of Simon's optimal design that exhausts the available significance level by equal allocation of the available undershoot in type I error rate to the conditional error function can be obtained by calling getD_equally.

    k         ce
1   0 0.00000000
2   1 0.00000000
3   2 0.08085087
4   3 0.22730544
5   4 0.49677692
6   5 0.81810800
7   6 1.00000000
8   7 1.00000000
9   8 1.00000000
10  9 1.00000000
11 10 1.00000000
12 11 1.00000000
13 12 1.00000000
14 13 1.00000000
15 14 1.00000000

Values of the conditional error function of 0 or 1, respectively, mean that the study is to be stopped for futility (number of observed responses is less than or equal to r_1) or efficacy (number of observed responses is greater than r) after the first stage. Let us assume that five responses were observed within the 23 patients of the first stage. With the "classical" Simon's optimal design, a further 33 patients must be included in stage two although only one additional response has to occur to reject the null hypothesis. In contrast, the adaptive design allows recalculation of the sample size taking into account the result observed in the interim analysis. Within the conditional error rate approach pursued in the adaptive design framework, a p value smaller than or equal to 0.818108 has to be achieved in the second stage. Calling the function getN2 results in

R> getN2(0.8, 0.15, optimal_design, 5, 2)
[1] 10

Based on these considerations, the total sample size n could, for example, be changed from 56 (23 + 33) to 33 (23 + 10) while maintaining a conditional power of 80%, which corresponds to the initial power the study was planned for. The choice of an adaptive design might therefore have led to a much smaller sample size and thus to considerable savings in time and financial resources.
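The sample size recalculation can be retraced in base R. The sketch below assumes that the second-stage test rejects H_0 whenever the binomial second-stage p value does not exceed the conditional error, and that the smallest second-stage sample size reaching the target conditional power is sought. The helpers cp_stage2 and smallest_n2 are hypothetical and are not the implementation of getCP or getN2.

```r
# Hypothetical helpers: conditional power of the second stage for a given
# conditional error ce, and the smallest n2 reaching a target power.
cp_stage2 <- function(n2, ce, p0, p1) {
  l  <- 0:n2
  p2 <- pbinom(l - 1, n2, p0, lower.tail = FALSE)   # second-stage p values
  sum(dbinom(l[p2 <= ce], n2, p1))                  # P(reject | p1)
}

smallest_n2 <- function(target, ce, p0, p1, max_n2 = 200) {
  for (n2 in seq_len(max_n2)) {
    if (cp_stage2(n2, ce, p0, p1) >= target) return(n2)
  }
  NA_integer_
}

# Conditional error for k = 5 taken from the table above
n2_new <- smallest_n2(target = 0.8, ce = 0.818108, p0 = 0.05, p1 = 0.15)
print(n2_new)
```

With a conditional error of 0.818108, a single second-stage response is sufficient for rejection, so the conditional power equals the probability of at least one response among the n_2 additional patients under π_1 = 0.15.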

Discussion
In this article, we presented an overview of the OneArmPhaseTwoStudy package to plan, monitor, and analyze single-arm two-stage clinical trials with a binary outcome. The theory behind the implemented methods is sketched, the package is described in detail, and practical application is illustrated by a real clinical study example. Although to our knowledge OneArmPhaseTwoStudy provides the most comprehensive spectrum of methods among the available software tools in this field, there are several options for extension of the package. Such extensions may cover designs with more than two stages (Chen 1997) or alternative designs with more than one endpoint (Kunz and Kieser 2011a, 2012). One line of methodological research we are currently pursuing, and whose results will be integrated in the package, concerns the construction of point estimates and confidence intervals for adaptive single-arm two-stage designs. Finally, we are working on the problem of how flexible designs can be used to deal with the situation that the initially specified sample sizes n_1 or n are not exactly met. As this is frequently the case in practice, the availability of related methods and software would be a further major step forward.