Semi-Parametric Maximum Likelihood Method for Interaction in Case-Mother Control-Mother Designs: Package SPmlﬁcmcm

The analysis of interaction eﬀects involving genetic variants and environmental exposures on the risk of adverse obstetric and early-life outcomes is generally performed using standard logistic regression in the case-mother and control-mother design. However such an analysis is ineﬃcient because it does not take into account the natural family-based constraints present in the parent-child relationship. Recently, a new approach based on semi-parametric maximum likelihood estimation was proposed. The advantage of this approach is that it takes into account the parental relationship between the mother and her child in estimation. But a package implementing this method has not been widely available. In this paper, we present SPmlﬁcmcm , an R package implementing this new method and we propose an extension of the method to handle missing oﬀspring genotype data by maximum likelihood estimation. Our choice to treat missing data of the oﬀspring genotype was motivated by the fact that in genetic association studies where the genetic data of mother and child are available, there are usually more missing data on the genotype of the oﬀspring than that of the mother. The package builds a non-linear system from the data and solves and computes the estimates from the gradient and the Hessian matrix of the log proﬁle semi-parametric likelihood function. Finally, we analyze a simulated dataset to show the usefulness of the package.


Introduction
We focus on the problem of analyzing interaction effects involving genetic variants and environmental exposures on the risk of adverse obstetric and early-life outcomes such as premature birth and preeclampsia, and small-for-gestational-age (SGA) neonates. Obstetric and earlylife outcomes involve the mother and her child. Several epidemiological studies suggest that the fetal susceptibility to environmental factors depends on his or her genotype and on the genotype of his or her mother (Infante-Rivard 2007). It thus appears important to include both genotypes as well as the environmental factors as predictors in the same model. Usually, standard logistic regression is used in the case-mother and control-mother design. However such analysis is inefficient here because it does not take into account the natural family-based constraints present in the parent-child relationship (Shi, Umbach, Vermeulen, and Weinberg 2008).
Recently, Chen, Lin, and Hochner (2012) proposed an extension of the approach based on semi-parametric maximum likelihood estimation. The extension proposed by Chen et al. (2012) in case-mother and control-mother design is interesting, because it takes into account both the genotype of the mother and of the child. The parental link between mother and her child is modeled through the parametric function of the joint distribution of the maternal and fetal genotype under the assumptions of random mating, Hardy-Weinberg equilibrium (HWE), and Mendelian inheritance. The distribution of the environmental variables given the maternal genotype and child genotype is considered as a nuisance parameter in the estimation. Furthermore, the authors make the assumption that the environmental factors are linked only to the genetic profile of the mother and not the one of her child, reducing the dimension of the nuisance parameter and permiting to simplify the likelihood function. In their approach, they also assume that the totals of case and control mother-child pairs eligible for recruitment (population totals) are available. The authors show by simulation studies the greater efficiency of their approach for estimating the association parameters compared to logistic regression. Chen et al. (2012)'s method is actually appealing because of the wide array of studies of obstetric and early-life outcomes where it can be applied. For example, the study seeking to link SGA neonates to drinking water disinfection by-products conducted on case-mother and controlmother pairs from Quebec City (Canada) area (Levallois et al. 2012) or that of Infante-Rivard (2004), where they used logistic regression to study the modifying effect of genetic variants. A package implementing this method has not been widely available. In this paper, we present SPmlficmcm (Nguile-Makao and Bureau 2015), an R (R Core Team 2015) package implementing this method and an extension of the method to handle missing offspring genotype data by maximum likelihood (Allison 2001). The R package is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=SPmlficmcm.
Our choice of how to treat missing offspring genotype data was motivated by the fact that in genetic association studies where the genetic data of mother and child are available, there are usually more missing data on the genotype of the offspring than that of the mother. Parents may be reluctant to consent to collect saliva and blood samples from their children. And even with the parent's consent, it is not easy to get a sufficient sample of saliva from the children. For example, in a sample of mother-child pairs from the Quebec City area (SGA cases and controls), among 1719 genotyped mother-child pairs we found on average 1.5% of missing maternal genotypes and on average 8% of missing offspring genotypes.
This article is structured in the following way: In Section 2, we describe Chen et al. (2012)'s method and the extension that we propose. In Section 3, we present a description of the package. In Section 4, we present an illustration on simulated data. And we finish by a discussion in Section 5.

Mathematical background
In this section, we will review briefly the semi-parametric maximum likelihood method (SML) proposed by Chen et al. (2012) and an extension of this method to the treatment of missing offspring genotype data that we propose. In the rest of the paper, we will denote this extension by SMLMD. Section 2.1 presents the notations, Section 2.2 gives the empirical log-likelihood function for the complete data with a new parametrization and Section 2.3 presents the empirical log-likelihood function for missing offspring genotype data.

Notations
Let Y denote the binary case-control status, G M and G C the respective maternal and offspring genotype, X the vector of environmental variables collected from the mother and G the set of possible values of the genotype. Data (Y, X, G M , G C ) is collected from n 1 case mother pairs and n 0 control mother pairs, which are sampled from N 1 (N 1 > n 1 ) case pairs and N 0 (N 0 > n 0 ) control pairs. Let n ijmc denote the total number of pairs in the case-control sample with Y = i, X = j, G M = m and G C = c and P ijmc (β) their probability with P ijmc (β) = P(Y = i | X = j, G M = m, G C = c; β) and θ the log odds of minor allele frequency (MAF) Under random mating, Hardy Weinberg equilibrium (HW) and Mendelien inheritance, the joint distribution, represents the mother genotype and P c|m (θ) = P(G C = c|G M = m; θ) the conditional offspring genotype distribution given maternal genotype. We denote by C jm = i, c n ijmc the number of subjects in the case-control sample with X = j and G M = m.

General semi-parametric maximum likelihood estimation
Let first suppose the data collected on the mother-child pairs (Y, X, G M , G C ), has no missing data. For all subjects in the sample with Y = i, X = j, G M = m and G C = c, we define the following functions: h ijmc (β, θ) = P ijmc (β)P c|m (θ) where P ijmc (β) is derived from a logistic regression model, and P c|m (θ) is the conditional distribution of children genotype given maternal genotype. Let n ijmc be the number of mother-child pairs having the status (i, j, m, c). Chen et al. (2012) make the following assumption that the paternal allele of offspring genotype is independent from environmental factors, by consequence, P(X = j | G M = m, G C = c) = P(X = j | G M = m) = δ jm and δ jm satisfies the constraint j δ jm = 1. The empirical log-likelihood is written: where V x is the set of observed values of the vector X and the maternal and child genotype G M and G C in the data and d i = N i − n i . For more detail, see Chen et al. (2012). The semi-parametric likelihood estimator is obtained in the following way: Step 1: We estimate δ jm using Lagrange multipliers for the constraints j δ jm = 1 and we where Following Chen et al. (2012) we obtain a closed-form expression for δ jm as We note that the u im (β, θ), i = 0, 1, m ∈ G M constitute a non-linear equation system. We have and when we plug δ jm in Equation 3 and we substitute the expression obtained for P(Y = i) in Equation 2, we have the following equation system ∀i, m: where f jm (β, θ; u m ) = i, c h ijm c (β, θ)u im (β, θ) and u m = {u im ; i = 1, 2}. We note that, for all i, m the u im (β, θ) are bounded. Chen et al. (2012) If we denote n ++m+ = j C jm and ρ i = #{u; Yu=i} n , an empirical estimator of P(Y = i), and n = n 0 + n 1 , then we have the inequalities: with N − n = N 0 + N 1 − (n 0 + n 1 ).
Step 2: We solve the non-linear system in Equation 4. We denote by u im (β, θ) the solution of the non-linear system, we plug δ jm as well as u im (β, θ) in Equation 1 to obtain the log profile likelihood: Let η 0 = ( β 0 , θ 0 ) be initial values of the parameters η = (β, θ) obtained respectively by the modified logistic regression of Chen et al. (2012) and the following equation: Finally, an estimate of the parameter η is obtained by the following formula: Note that the variance of η is estimated by the gradient and the inverse of the Hessian matrix of the log profile likelihood evaluated at point η. In numerical computations of the gradient, we have checked that keeping the u im (β, θ) fixed to the solution for the specified values of β and θ or resolving them at each evaluation of p leads to nearly identical values. We have therefore implemented the analytical gradient of the log profile likelihood function p for fixed values of u im (β, θ). The Hessian matrix is computed numerically from the gradient.

Generalization to missing offspring genotype data
Let (Y, X, G M , G C ) be the data collected on the mother-child pairs where we suppose that the genotype of a subset of children is missing at random (Little and Rubin 2002), as missingness is allowed to depend on case-control status and maternal genotype. We suppose that the missing offspring genotype data are completely at random. From the total sample, we constitute two sub-samples that we denote by S 1 IJM C for the complete data and S 2 IJM for the missing offspring genotype data. If we examine the non-linear system of the complete data in Equation 4, we notice that the system is completely defined through the information (i, j, m). The offspring genotypes in the sample are not required because the system is written with a summation over all possible values of the offspring genotype compatible with the maternal genotype. Consequently, the non-linear system with missing offspring genotype data stays identical to the one for the complete data. Starting from the likelihood function of Equation 6 for the complete data, p can be written the following way: and p 2 (β, θ, u im (β, θ)) is equal to the remaining terms of the right-hand side of Equation 6. The function p 1 (β, θ) is completely defined only if we observe n ijmc in the study sample. However, we only need the totals for each combination of (i, j, m) in the study sample to define the function p 2 (β, θ, u im (β, θ)). As a consequence, the missing offspring genotype data modify the likelihood function of Equation 6 only within the function p 1 (β, θ). For all (i, j, m), we denote by n ijm = c n ijmc the number of mother-child pairs in the samples with Y = i, X = j, G M = m and where the offspring genotype is observed. We denote also for all (i, j, m), n ijm the number of mother-child pairs in the sample with Y = i, X = j, G M = m and where the offspring genotype is missing. The log profile semi-parametric likelihood function with summation over missing offspring genotype is written: This quantity represents the modification brought by the missing offspring genotype data to the likelihood function. The log profile semi-parametric likelihood is then written: Remark: Tthe distribution of the offspring genotype is conditional on the mother genotype and not on the covariate values, and so is the missingness probability factoring out of the likelihood. We apply the same steps as in Section 2.2 to obtain the parameter estimates.

Package description
The R package SPmlficmcm (semi-parametric maximum likelihood for interaction in casemother control-mother) implements the method of general semi-parametric maximum likelihood estimation for the complete data and data with missing offspring genotype. It contains one main function for the user: Spmlficmcm() performing the analysis for the complete data and data with missing offspring genotype. This function uses two auxiliary functions: Est.Inpar() to compute the initial values of the parameters (β, θ) and of the non-linear system, and Nlsysteq() to build the non-linear system. Finally the function Spmlficmcm() solves the non-linear system, builds the log profile likelihood and its gradient and computes the parameter estimates as well as the estimates of their standard errors. In this section, we describe the main functions and the estimation steps.

Main arguments of the functions
All functions use the main arguments described below and some functions use optional arguments. The main arguments are: • formula: The model formula.
• N: A numeric vector of length two giving the number of eligible controls and cases in the population (N = (N0, N1)). If this information is unavailable, it is possible to specify the disease population prevalence in the argument p instead of N. In that case, N1 is set equal to 5 n1, in order to avoid observing N1 < n1 when prevalence is small. We then set N0 = 1−p p N1.
• gnma: A character variable representing the name of the maternal genotype.
• gnch: A character variable representing the name of the child genotype.
• start: Vector of the initial values of the model parameters.
• data: A data frame in long format containing the following variables: id: Identity of the mother-child pairs.
Remark: In the formula, the variables coding the maternal and child genotypes can differ from the gnma and gnch variables representing the number of minor alleles. This allows for instance to code dominant or recessive effects.

Description of the main function
Spmlficmcm() is the function used to estimate the regression parameters and the minor allele frequency (β, θ) via generalized semi-parametric maximum likelihood estimation. It uses the following steps: Step 1: Obtaining initial values for the parameters The main function calls Est.Inpar() to compute initial values for the parameters, which takes the optional argument typ to distinguish the data with missing offspring genotype (2) and the complete data (1). Est.Inpar() uses logistic regression (glm()) to estimate β 0 (initial value of β) and Equation 7 to estimate θ 0 (initial value of θ).
To resolve Equation 4 the nleqslv() function from the nleqslv package (Hasselman 2015) is called. We choose the Broyden method of global strategies such as line search and trust region. In particular we use the Broyden method because this method often shows superlinear convergence towards a solution (Dennis and Schnabel 1983). The same function computes the initial values of the relevant non-linear system using Equation 5.
Step 2: Construction of the non-linear system Spmlficmcm() calls the function Nlsysteq() to build the non-linear system of Equation 4. It must be noted that the number of equations depends on the number of distinct maternal genotypes. The function uses the nleqslv() function of the nleqslv package and initial values to solve the non-linear system.

Step 3: Determining the log profile likelihood and estimation of parameters
In this last step, Spmlficmcm() determines and evaluates the log profile likelihood function of Equation 6 or 11 and its gradient, and computes the Hessian matrix numerically from the gradient. The method described in Equation 8 is then applied to evaluate the parameter estimates.
Spmlficmcm() returns an object containing as components the solution of the non-linear system, the matrix of the estimates with their standard errors, the variance-covariance matrix from the sample, the log likelihood function and the value of the log likelihood function assessed at the estimated parameters.
Remark: It is important to verify the following arguments: The data frame data must not contain missing values on the covariates such as environmental factors. The maternal and offspring genotypes must be coded as number of minor alleles carried by the individual. The missing offspring genotype data should be coded as NA.

Illustrations
In this section, we present an illustration of the use of the SPmlficmcm package on simulated data. We use the functions include in package SPmlficmcm to generate the data according to the model equation formula.

Simulation parameters
We generate the data respecting the constraint between offspring genotype and maternal genotype. Firstly, we generate genotype data (G M , G C ) for a cohort of N sample = 20000 mother-child pairs and consider two continuous covariates X1 and X2, which are correlated with G M . Given the values of (G M , G C , X1, X2), we generate a binary disease outcome outc from a logistic regression model with the following covariate effects: the log odds of the MAF 0.3, β = c(−0.916, 0.857, 0.588, 0.405, −0.693, 0.488) corresponding respectively to the coefficients of the following terms X1, X2, gm, gnch, X1 : gnch, X2 : gm and the intercept −2.23, in a simulation scenario describing the main and interaction effect of (G M , G C , X1, X2). We solve for the value of the intercept parameter so that the resultant phenotype prevalence, P(outc = 1), is around 6% in the simulation. We then sample n 1 = 327 case pairs and n 0 = 1232 control pairs from the cohort. We mainly report results when the MAF is 0.3 and both G M and G C are coded as the number of minor alleles. Secondly, we create another database from the first, introducing an average 9% of missing offspring genotype data and run the analysis summing over missing offspring genotypes. We repeat this process B = 500 times to create 500 complete samples and 500 samples containing the missing offspring genotypes data.

Simulated database
In the following code, we use the function FtSmlrmCMCM() to generate a dataset with environment factors that are continuous variables. M represents the size of the population, rho is the disease prevalence and N is a vector containing the number of affected and unaffected subjects in the population. On the last line, n mother-child pairs are sampled with the function SeltcEch(), where n= n0 + n1.

Creation of the complete and missing data
The following code creates the sample with complete data (DatfEmd) and the full sample including subjects with missing offspring genotype (DatfEcd) from the full sample with complete data DatfE1. We created 9% missing data with the sample function. Each database is a data.frame containing 6 columns. Column 1 represents the number of the mother-child pair; the next three columns represent the binary response variable and two continuous environmental variables. The last two variables represent respectively the maternal genotype and the offspring genotype. Both genotypes are coded as number of minor alleles. We have finally two databases, DatfEmd that contains the missing data on the offspring genotype and DatfEcd that contains only the pairs with complete data. Firstly, we show the use of the two mains functions of the package, thereafter we compare the results of three estimators obtained respectively by logistic regression, the SML approach and the SMLMD approach.

Estimation of parameters without missing data
On the first line, we give the model equation, then we apply the function Spmlficmcm() to estimate the parameters on the sample with complete data.

R> round(Rsnm[["Uim"]], digits = 3)
The results shown below include first the solution U im of the non-linear system, second the estimates of parameters, third the variance-covariance matrix, and fourth the value of the likelihood function evaluated at the parameter estimates.
[  [1] -17897.71 We also illustrate the use of the function with specification of the disease prevalence p, assuming N is unknown. Coefficient estimates and standard errors changed little. When varying N while keeping the disease prevalence p = txpN 1 N constant, we observed that the coefficient estimates were insensitive to the value of N, but the standard errors of the coefficient estimates decreased slightly with N (data not shown). It is thus preferable to underestimate N and slightly overestimate standard errors than the opposite. Also, setting N too large leads to numerical errors.

Estimation of the parameters (with missing data)
We use the function Spmlficmcm() with the option typ equal to 2 to estimate the model parameters on the missing data. The following code gives the solution U im of the non-linear system, the parameter estimates, the variance-covariance matrix, the log-likelihood function and the log-likelihood function evaluated at the parameter estimates.  Here again disease prevalence can be specified when N is unknown, with little change to the coefficient estimates and standard errors.  Using the same data, we applied logistic regression with the same equation, we computed the standard errors of the estimates and we compared them to the other two estimates. The results are reported in Table 1. When we compare the SML approach with the standard approach (logistic regression) on one simulated sample, we notice an amelioration i.e., a reduction of variance and the estimates are nearer to the true values. These results corroborate the results of Chen et al. (2012). Indeed Chen et al. (2012) showed that using the SML approach when the assumption is satisfied reduces the variance on average by 30%. When using the SMLMD approach, we again observed a reduction of the standard errors of the estimates. To confirm this error reduction and validate the properties of the semi-parametric maximum likelihood estimator, we assessed the following quantities: the bias, the mean square error (MSE), the empirical variance (V emp ), the mean estimated variance and the confidence interval coverage for the two approaches (SML and SMLMD) on B = 500 replicates of the generated data. We removed 20 replicates where we observed negative variances and/or coefficient estimates beyond 3 median absolute deviations (scaled to estimate the standard deviation) from the median for both methods. The results are reported in Table 2. Tables 3 and 4 show the empirical covariances and the mean estimated covariances.
The MSE and empirical variance of the coefficient estimates for the mother genotype gm and the covariates X1 and X2 are slightly reduced when making use of the entire dataset, as  Table 2: Comparison of results of two methods on the simulated data (B = 480 replicates where estimation succeeded). MSE: mean square error, V: estimated variance, V emp : empirical variance, P( η ∈ CI95): 95% confidence interval coverage, SML: semi-parametric maximum likelihood method (no missing data), SMLMD: semi-parametric maximum likelihood method (missing data).     expected. The estimates were unbiased and the mean estimated variance of the coefficient estimates was slightly below the empirical variance, and confidence interval coverage was close to or slightly below the nominal 95% level. Some covariances are also estimated closer to 0 than the empirical value.
Computing times are reported in Table 5. In summary the computing time depends on the number of terms in the formula and the size of the sample.

Discussion
Package SPmlficmcm implements the semi-parametric maximum likelihood estimation for case-mother control-mother designs, allowing for missing offspring genotype. This method is important in studies where we want to determine the role of a polymorphism in interaction with the mother exposure to an environmental factor on obstetric and early-life outcome risk. Indeed these models permit to take into account the correlation between the maternal genotype and offspring genotype under the assumptions of Hardy-Weinberg equilibrium and Mendelian inheritance. The statistical properties of the estimates were satisfactory, although a slight underestimation of the empirical variance by the variance estimate led to a slight undercoverage of the confidence intervals. SPmlficmcm has been made available on CRAN. This package executes the computation relatively quickly. We hope that this package will encourage applied researchers to use this type of modeling, as it provides a relevant way to study gene-environment interactions in case-mother and control-mother designs.