scikit-fda: A Python Package for Functional Data Analysis

The library scikit-fda is a Python package for Functional Data Analysis (FDA). It provides a comprehensive set of tools for representation, preprocessing, and exploratory analysis of functional data. The library is built upon and integrated in Python's scientific ecosystem. In particular, it conforms to the scikit-learn application programming interface so as to take advantage of the functionality for machine learning provided by this package: pipelines, model selection, and hyperparameter tuning, among others. The scikit-fda package has been released as free and open-source software under a 3-Clause BSD license and is open to contributions from the FDA community. The library's extensive documentation includes step-by-step tutorials and detailed examples of use.

Due to the growing interest in functional data, several specialized software tools for FDA have emerged in recent years (Scheipl, Arnone, Hooker, Tucker, and Wrobel 2022).One of the main references in the field is the fda package (Ramsay, Hooker, and Graves 2024), which is available in R (R Core Team 2024) and MATLAB (The MathWorks Inc. 2021).This general-purpose library provides an implementation of the methods described in Ramsay and Silverman (2005) and Ramsay, Hooker, and Graves (2009).It utilizes a basis expansion representation of the functional observations.Another important reference in the FDA community is the fda.usc package (Febrero-Bande and Oviedo de la Fuente 2012), in which the non-parametric approach developed in Ferraty and Vieu (2006) is adopted.One of the contributions of this package is the introduction of a novel structure for the representation of functional data in discrete form, as a collection of measurements at a grid of points.This R package provides an extensive range of FDA tools, including methods for regression and classification.
A more recent general-purpose R package is tidyfun (Scheipl, Goldsmith, and Wrobel 2024).In this package, a novel data structure (vectors of class 'tf') is introduced to represent functional observations.These 'tf' vectors can be included as columns in an R 'data.frame'alongside with other variables.Furthermore, they can be manipulated using the tools of the tidyverse ecosystem (Wickham et al. 2019).Another recent contribution is funData (Happ-Kurz 2020).This package provides a representation for discretized univariate and multivariate functional data of arbitrary dimensions based on S4 classes in R.
Finally, a variety of computational tools have been developed to address specific problems in FDA.Some relevant examples are the packages refund (Goldsmith et al. 2024), which, among others, provides tools for functional regression and principal component analysis (FPCA), FDboost (Brockhaus, Rügamer, and Greven 2020), focused on regression problems, funFEM (Bouveyron, Côme, and Jacques 2015) and funHDDC (Schmutz, Jacques, Bouveyron, Chèze, and Martin 2020) for functional clustering, fpca (Peng and Paul 2011) and fdapace (Zhou et al. 2022), which are mainly devoted to functional principal component analysis (FPCA).The fdapace package available in R, and its MATLAB counterpart PACE (Yao, Müller, and Wang 2015), provide methods for both sparsely or densely sampled random trajectories based on FPCA, via the principal analysis by conditional estimation (PACE) algorithm.The package fdasrvf (Tucker 2024b) contains tools for alignment, elastic registration, PCA, and regression with functional data based on the square-root velocity framework (SRVF) described in Srivastava and Klassen (2016).This package is available under this name both in R and in MATLAB, as fdasrsf (Tucker 2024a) in Python (Van Rossum et al. 2011), and as ElasticFDA (Tucker 2021) in Julia (Bezanson, Edelman, Karpinski, and Shah 2017).The package roahd (Ieva, Paganoni, Romo, and Tarabelloni 2019) includes a collection of methods for robust analysis of functional data.Outlier detection tools are provided also in fdaoutlier (Ojo, Lillo, and Fernandez Anta 2023).Visualization tools, including interactive ones, are provided also in the packages rainbow (Hyndman and Shang 2010) and refund.shiny(Wrobel, Park, Staicu, and Goldsmith 2016).A recent contribution is the R package mlrFDA described in Pfisterer, Beggel, Sun, Scheipl, and Bischl (2021), which gives access and extends the machine learning framework mlr (Bischl et al. 2016) for the analysis of functional data.
In recent years, the Python language has become more relevant in statistics, data science, and machine learning.However, in contrast to the wide variety of alternatives available for FDA in R, the options in Python are much more limited both in number and functionality.Some Python libraries devoted to FDA are fdasrsf, which has been described earlier, and the recently released FDApy (Golovkine 2021), that provides methods for principal component analysis and clustering.
In this context, we present scikit-fda (Ramos-Carreño et al. 2022), a general-purpose library that makes functional data processing and analysis accessible to the Python community.This package offers data structures for the representation of functional data both in discretized form and as a basis expansion, and an extensive set of tools for preprocessing (smoothing, registration and dimensionality reduction), and statistical analysis, including interactive visualization and outlier detection tools.In addition, it provides infrastructure to facilitate the application of the machine learning tools of scikit-learn to functional data.Comprehensive documentation is supplied that includes installation instructions, tutorials, API reference, and illustrative examples.
In what follows, an overview of the functionality provided by the scikit-fda package is given: In Section 2, the discretized and the basis expansion representations of the functional observations are introduced.Interpolation, extrapolation, and derivation tools are presented also in this section.In Section 3, the functionality of the package is described.It includes subsections devoted to preprocessing (smoothing, registration, dimensionality reduction, and variable selection), exploratory analysis (descriptive statistics, depth measures, interactive visualization, and outlier detection), and the integration with the machine learning tools of scikit-learn.Examples of code that illustrate scikit-fda's functionality are provided throughout the article.The tools and processes implemented to ensure the quality of the code are described Section 4. This section includes also an overview of the project's extensive documentation.

Representation of functional data in scikit-fda
The scikit-fda package provides tools for the representation of functional observations of the form x : T ⊆ R p → R q , with p ≥ 1, and q ≥ 1.The parameter p is the dimension of the domain of the functions (p = 1 for curves, p = 2 for surfaces, etc.).The parameter q is the dimension of the co-domain; that is, the number of output coordinates for vector-valued functions.For instance, a gray-scale two-dimensional image can be treated as a functional datum with p = 2, for the location of the pixels in the image, and q = 1, for the intensity at each pixel.A color image consisting of three channels (e.g., red, green, and blue) would have p = 2 and q = 3.The values of p and q are the same for all the observations in a functional t 1 t 2 t 3 t 4 t 5 t 6 x 1 (t 4 ) x 1 (t 1 ) x 2 (t 1 ) x 3 (t 1 ) Figure 1: Functional observations in discretized form.The quantity x n (t j ) represents the value of the n-th trajectory at t j .
dataset.For the sake of clarity, we focus on the case of real-valued univariate functions (that is, p = q = 1) for which most of the functionality described in this article is implemented.
We will further assume that the functions are defined on a compact interval in the real line.
In higher dimensions, the domain is assumed to be rectangular.Typically, the continuous parameter on which the functions depend, t ∈ T , is assumed to be time.
A functional dataset consists of a collection of N observations {x i (t), t ∈ T } N i=1 , where x i (t) is the i-th observation in the sample.Each observation can be represented either in discretized form, or as a basis expansion.In the former representation, a functional observation consists of a set of measurements at a grid of points t = (t 1 , . . ., t M ) ∈ T M , which is common for all observations.This grid need not be regularly spaced.The i-th observation in the sample is represented by the vector {x i (t)} N i=1 , where x i (t) = (x i (t 1 ), . . ., x i (t M )).The discretization grid is assumed to be sufficiently fine so that the functional character of the data is apparent (Ramsay and Silverman 2005;Ferraty and Vieu 2006;Hsing and Eubank 2015).As an illustration of this representation, three example trajectories measured at a grid of irregularlyspaced points are displayed in Figure 1.
Alternatively, a functional observation can be represented as an expansion in a functional basis where {c i } i≥1 are the coefficients of the expansion.
The package scikit-fda provides data structures for both types of representation: The class 'FDataGrid' for discretized data, and the class 'FDataBasis' for expansions in a functional basis.They are derived from the abstract class 'FData', which provides common properties and methods.In what follows, these classes are described in detail.

The class 'FData'
In the class 'FData', the attributes and methods shared by the discretized and the basis expansion representations of the functional dataset are collected.Thus, it provides a common interface for both 'FDataGrid' and 'FDataBasis' objects.Specifically, objects of the 'FData' class have the following attributes: • dataset_name: Name of the functional dataset.
• n_samples: Size (number of functional observations) of the dataset.
• dim_domain: Dimension of the domain in which the functions are defined (p ≥ 1).
• argument_names: Names of each of the p arguments of the function (domain dimensions).
• domain_range: Limits of the intervals for each of the domain arguments.They are used as the default ranges for plotting and numerical integration.
Values q > 1 correspond to vector-valued functions.
• coordinate_names: Names of the q codomain coordinates.
• extrapolation: Default extrapolation strategy; for instance, constant or periodic.See Section 2.4 for details.
As an illustration of this data structure, consider the case of observations that are bidimensional RGB images.For this functional dataset, dim_domain would be 2, corresponding to the two dimensions of the image.The names of these dimensions (e.g., "x" and "y", or "horizontal" and "vertical") would be stored in the attribute argument_names.The attribute dim_codomain would be 3, corresponding to the three color channels.The names of these channels ("R", "G", and "B") would be stored in the attribute coordinate_names.
The class 'FData' provides also methods that are common to both representations; for instance, methods for evaluation, addition, multiplication by a scalar, and plotting.Since 'FData' is abstract, it is not possible to directly instantiate an object of this class.Instead, objects of one of its subclasses, 'FDataGrid' or 'FDataBasis', need to be created.

Discretized representation: The class 'FDataGrid'
Functional data are often the result of monitoring a continuous process at a discrete set of points.For the general case, x : R p → R q , with p, q ≥ 1, we assume that the discretization grid in the j-th dimension is t j = (t j1 , . . ., t jM j ), with j = 1, . . ., p.In the scikit-fda library, the points in the grid need not be regularly spaced.The grid has to be the same for all observations in the functional dataset.The dataset {x i : R p → R q } N i=1 is represented as an object of the class 'FDataGrid'.The values of the functional observations are stored in the tensor {x Here t 1 × . . .× t p is the grid of points obtained as the Cartesian product of t 1 , . . ., t p .In the case p = q = 1, the discretized sample is simply an N × M matrix {x i (t) = (x i (t 1 ), . . ., x i (t M ))} N i=1 .In addition to those inherited from 'FData', objects of this class have the following attributes: • grid_points: Sequence of discretization grids, one for each of the domain dimensions (t 1 , t 2 , . . ., t p ).The values of the functions are specified at the Cartesian product of the 1-D arrays in the sequence.
• data_matrix: NumPy array of dimensions N × M 1 × • • • × M p × q in which the values of the N functional observations are stored.
• interpolation: Default interpolation strategy for locations within the rectangular discretization grid.See Section 2.4 for details.
Since the attribute data_matrix is a NumPy array, it is possible to carry out pointwise operations, such as powers, exponentials, logarithms, and trigonometric functions by directly applying the corresponding NumPy functions (Harris et al. 2020).

Basis expansion representation: The class 'FDataBasis'
Assume that the functional observations in the dataset considered belong to F, a separable Hilbert space; for instance, L 2 .Under such assumption there exists a countable basis {ϕ i (t)} i≥1 that is complete, so that any x(t) ∈ F can be expressed as where {c i } i≥1 , are the coefficients of basis expansion.The scikit-fda library provides support for such a representation in the constant, monomial, B-spline, and Fourier bases.In addition, other types of bases can be implemented by the user.An example showing how to define new bases is available in https://fda.readthedocs.io/create_new_bases.The choice of basis should be made taking into consideration the characteristics of the data at hand.For instance, the Fourier basis is well-suited to representing periodic functions.For non-periodic data, a representation in the B-splines basis is probably more appropriate.The monomial basis is useful to represent polynomials.Monomials are also the building blocks of the Maclaurin series.Therefore, the monomial basis can be used to represent local approximations of analytic functions.The first five elements of the different bases provided in scikit-fda are shown in Figure 2. In general, these types of representation are infinite-dimensional.In practice, the expansion is truncated to the first K terms (1) Truncation often results in the smoothing of the original functional observations.This smoothing effect can be beneficial for the representation of noisy data (see Section 3.3.1).
In scikit-fda, the class 'FDataBasis' is used to represent functional data as a finite basis expansion.In addition to those inherited from 'FData', objects of this class have the following attributes: • basis: The basis for the representation of the functional observations.It is an object of the class 'Basis', one of whose attributes, n_basis is the number of elements of the basis considered.The bases available are 'ConstantBasis', 'MonomialBasis', 'BSplineBasis' and 'FourierBasis'.
• coefficients: The N × K matrix that contains the coefficients of the basis expansion.
The following code is used to illustrate scikit-fda's support for this type of representation.
>>> import skfda >>> from skfda.representation.basisimport FourierBasis, BSplineBasis >>> X, y = skfda.datasets.fetch_phoneme(return_X_y=True)>>> X.to_basis(BSplineBasis(n_basis=5)) >>> X.to_basis(FourierBasis(n_basis=5)) In this code, the Phoneme dataset (Hastie, Tibshirani, and Friedman 2009) is used.The curves in this dataset correspond to log-periodograms of the time series of utterances of five different phonemes by different speakers.The functional observations are transformed from their original discretized representation into different basis expansions.A sample of the transformed trajectories, together with the original ones, is displayed in Figure 3.In these plots, the smoothing effect of the transformation to the basis representations is apparent.

Interpolation and extrapolation
The scikit-fda package provides a variety of interpolation methods for functional data in discretized form.By default, linear interpolation is performed.Other types of interpolation can be specified in the attribute interpolation of the 'FDataGrid' object.Specifically, support for spline interpolation is provided by the class 'SplineInterpolation'.It is also possible to employ other interpolation and extrapolation strategies defined by the user.An example showing how to define such custom strategies is available at https: //fda.readthedocs.io/create_new_interpolation.Different extrapolation strategies for 'FDataGrid' and 'FDataBasis' objects are available in scikit-fda.In particular, it is possible to specify a constant value (for instance, the value of the function at one of the limits of the domain), or to assume a periodic structure.In the case of 'FDataBasis' objects, one can also directly evaluate the basis expansion outside the domain.Finally, as in interpolation, user-defined extrapolation strategies can be utilized.

Derivatives
The computation of derivatives is of particular importance in functional data analysis.For instance, derivatives can reveal significant information that is not apparent in the original curves.Furthermore, the norm of a derivative is a natural measure of the function's roughness (Ramsay and Silverman 2005).For this reason, they are often employed to define penalties for regularization.In the scikit-fda package, the method derivative() can be used to perform this operation for both 'FDataGrid' and 'FDataBasis' objects.In the case of 'FDataGrid' objects, derivatives are approximated using finite differences.For 'FDataBasis' objects, they are computed exactly in terms of the derivatives of the basis functions.Therefore, if a new type of basis is designed, it is necessary to implement the derivatives of the basis functions in the corresponding class.

Regularization
Regularization methods consist in favoring simpler models to improve the quality and robustness of the solutions of an optimization problem.In FDA, regularization is used to obtain smooth functional approximations to noisy discrete data, for registration, and for principal component analysis, among others.For the purpose of regularization, the complexity of a function can be quantified in terms of its norm, or of a linear transformation thereof (e.g., the function derivatives).A penalty term proportional to this measure of complexity is then added to the cost function to be minimized.The package scikit-fda provides the necessary infrastructure to implement regularization based on the L 2 -norm of the function in class 'L2Regularization'.Alternatively, a linear operator can be passed as a parameter to the constructor of 'L2Regularization' objects.Some common linear operators are readily available in scikit-fda's operators module for that purpose.The following code illustrates this type of regularization to obtain smooth representations of a set of functions in the basis of B-splines.In this example, the L 2 norm of second derivatives of the trajectories is used to penalize their curvature.

Functionality of scikit-fda
In this section, an overview of the utilities for functional data analysis provided by the scikitfda package is given.The first step in the analysis is to generate functional datasets or to retrieve them from external sources.In the scikit-fda package it is possible to generate synthetic data, to simulate random trajectories from stochastic processes, and to load data from files in standard formats and from repositories of real-word datasets.The library provides also an extensive set of tools for the analysis of functional data both in discretized and basis expansion forms.In particular, it offers methods for exploratory analysis, smoothing, registration, dimensionality reduction, the computation of functional depths, outlier detection, and interactive visualization, among others.An important feature of scikit-fda is the integration with the extensive collection of scikit-learn's tools for machine learning, including data preprocessing, training, testing, and hyperparameter selection.Specifically, the methods provided are designed so that they can be utilized in scikit-learn pipelines.In what follows these functionalities will be described in detail.

Generation of synthetic data
A variety of methods for the generation of functional data, either from some simple models or sampled from stochastic processes, are available in scikit-fda.In particular, the function make_multimodal_samples() can be used to generate functions with several maxima.This is used in the synthetic registration example illustrated in Figure 8.The function make_gaussian_process() can be used to simulate trajectories sampled from a Gaussian process with a specified mean and covariance function.Several commonly employed covariance functions, such as Brownian, exponential, radial basis function (RBF), Matérn, and polynomial kernels are supplied in the package.Additional types of covariance functions can be defined by the user.
The following code illustrates the generation of 50 trajectories from standard Brownian Motion, an Ornstein-Uhlenbeck process (a Gaussian process with an exponential covariance function), and a Gaussian process with an RBF covariance function.A regular grid of 100 equally spaced points in T = [0, 1] is employed for the discretized representation.The simulated trajectories are displayed in Figure 5.

Real-world data
The package scikit-fda provides tools to retrieve functional datasets from other libraries and public access repositories.The datasets themselves are retrieved using the package scikitdatasets (Díaz-Vico and Ramos-Carreño 2022).The data are downloaded only once and cached on disk, so as to reduce network traffic and make it possible to work with the data offline.They are then converted to the 'FData' format for further processing and analysis.For instance, the function fetch_cran() can be used to retrieve datasets from R packages in the CRAN repository.The following code illustrates the functionality described for three well-known datasets: Gun-Point from the UCR repository, Canadian Weather, and the Berkeley Growth Study.scikitfda's functions are used to load the data and plot some of the datasets' trajectories.Note that the Canadian Weather dataset has two codomain dimensions (q = 2): temperature and precipitation.In this example, the first is selected using the coordinates property of an 'FData' object.

Preprocessing
Functional observations often need to be subject to some form of processing to facilitate ulterior manipulation.To this end, the scikit-fda package provides utilities for smoothing, registration, and dimensionality reduction.In what follows, these utilities are described in detail.

Smoothing
Smoothing consists in replacing the original functional observation x(t) with a smoothed version x(t).This replacement yields a more manageable, possibly more faithful, representation of the underlying process.In particular, smoothing can be used to recover the signal component from noisy measurements.Furthermore, smoothed approximations with continuous derivatives can be used for regularization (Wang et al. 2016).The methods of scikit-fda's classes 'BasisSmoother' and 'KernelSmoother' can be employed to this end.The package provides also utilities to determine an appropriate degree of smoothing using some form of statistical validation.
As discussed in Section 2, the approximation of a function by a truncated basis expansion, as in (1), is a form of smoothing.The coefficients of the finite basis expansion can be estimated by least squares (Ramsay and Silverman 2005).This kind of smoothing is performed in scikitfda by the class 'BasisSmoother'.Further smoothing can be achieved by a regularization approach based on roughness penalties, as described in Section 2.6 (Green and Silverman 1993;Ramsay and Silverman 2005).
Smoothing can be achieved also by performing a linear transformation of the original functional observations x(t) = T s t (τ )x(τ )dτ. (2) The weighting function s t (τ ) quantifies the contribution of the value of the function at τ to the smoothed value at t.This weighting function should be localized, so that the values of the function at points close to t contribute more to the average.
For functional data in discretized form, (2) can be expressed as a matrix transformation x = Sx, where x = x(t) is the vector of values of the function at the discretization points t = (t 1 , . . ., t M ), the vector x consists of the smoothed function values at a grid of points, which can be different from the original ones, and S is the smoothing matrix.By default, the grid at which the smoothed values are computed is t, the set of sampling points.The smoothing matrix S is sometimes referred to as the "hat" matrix, a name borrowed from regression analysis because it transforms the dependent variable vector x into its fitted version x (Ramsay and Silverman 2005).
Smoothing with local weights can be implemented using kernels (Wasserman 2006).scikit-fda provides three smoothers of this type: Nadaraya-Watson, local linear regression, and k-nearest neighbors.As an illustration, for the the Nadaraya-Watson smoother, the hat matrix is where K is the kernel function and h is the parameter that controls the degree of smoothing.
Commonly used kernel functions, such as Gaussian, uniform, and Epanechnikov, are available in scikit-fda.It is also possible to employ user-defined kernels for this type of smoothing.
In these methods, the value of the smoothing parameter needs to be carefully adjusted to avoid under-or oversmoothing.In scikit-fda this value can be determined by statistical validation with the help of the class 'SmoothingParameterSearch'.Particular scoring criteria, such as the leave-one-out cross-validation ('LinearSmootherLeaveOneOutScorer') or, alternatively, generalized cross-validation ('LinearSmootherGeneralizedCVScorer') are provided to guide the search.The criterion that is maximized in leave-one-out cross validation is where h is the smoothing parameter and x(t m ; h) is the smoothed value.
The generalized cross validation (GCV) criterion is where Ξ is a penalty function.By default, the penalty is Additional penalty functions, such as Akaike's information criterion (AIC), implemented as akaike(), or Shibata's model selector, implemented as shibata(), are provided as well.In the code that follows, the smoothing functionality provided by scikit-fda is illustrated using the functions of the Phoneme dataset (Hastie et al. 2009)  In this example, k-nearest neighbors is used to smooth the data.The optimal number of neighbors is selected between the values 2, 3, 4 and 5. Model selection is carried out using 'LinearSmootherGeneralizedCVScorer' with the shibata() penalty function.The first five original curves and the smoothed ones are displayed in Figure 7.

Registration
Another type of preprocessing, which is especially relevant in FDA, is registration.Registration consists in applying transformations to the raw data so that the functional observations are properly aligned.There is a variety of reasons why misalignment can occur.In some cases, it is the result of errors in the measurement process.In others, the domain has to be warped because the functions depend on an internal parameter, which is different from the one observed.For periodic functions, such as the signal of a heartbeat, the starting time for the different measurements could be different.A number of strategies can be used for registration.For instance, maxima, minima, zeros, and other landmarks can be used as reference points for alignment.Alternatively, some measure of dispersion between the observations can be minimized.It is also possible to register a set of functional observations to a reference function.After registration, it may be necessary to evaluate the functional observations at points in the domain that are different from the ones in the original grid.This can be made utilizing the interpolation and extrapolation techniques described in Section 2.4.To carry out such an alignment, the package scikit-fda offers support for shift registration, and for elastic registration.

Shift registration consists in aligning the functional observations by a translation xi
where δ i is the time shift applied to x i (t), and xi (t) is the registered function (Ramsay and Silverman 2005).Shifting modifies the lower and upper bounds of the interval on which the function observations are defined.The values of the shifted functions that lie outside the original interval are discarded.For the subinterval in which the function values are not available, they are estimated by extrapolation.The method for extrapolation can be provided as an input.
The shifting constants {δ i } N i=1 can be determined using different procedures.If a single landmark, such as a maximum, a minimum, or a zero crossing, is present in every curve, and their locations, {τ i } N i=1 , are known, then the i-th curve can be shifted by δ i = τ i − τ * .After registration, the location of the landmark is τ * for every curve, xi (τ * ) = x i (τ i ), i = 1, . . ., N.
In scikit-fda, the function landmark_shift_registration() can be used to carry out this transformation.The values of the δ i can be retrieved using the landmark_shift_deltas() function.
Alternatively, the values δ i can be computed by minimizing a least squares criterion (Ramsay and Silverman 2005) where μ(t) is the sample mean of the registered data {x i (t)} N i=1 .This type of shift registration can be performed with methods of class 'LeastSquaresShiftRegistration'.Instead of the sample mean, which is the default value, a user-defined template function can be employed.In this case, the values for the δ i are stored as the attribute deltas_ after the registration.
Another type of registration available in the package scikit-fda is elastic registration.In elastic registration, one attempts to align the data by applying a warping transformation The warping γ i is a monotonically increasing function defined in T = [a, b].Assuming that the values of the function at the endpoints of this interval are fixed, it obeys the constraints γ i (a) = a and γ i (b) = b.If the locations of some landmarks are known, the warping function for elastic registration can be approximated by monotonically increasing splines (Ramsay and Silverman 2005).Besides the boundary constraints specified earlier, the spline interpolator for the i-th functional observation has to satisfy where {τ il } L l=1 are the landmark locations in the i-th observation, and τ * l = γ −1 i (τ il ) is the location of the l-th landmark in the registered curves.In scikit-fda this type of registration can be carried out using the function landmark_elastic_registration().The warpings can be retrieved with the function landmark_elastic_registration_warping().A drawback of this approach is that the landmarks and their locations need to be identified beforehand.
An alternative type of elastic registration is to align the observations to a reference template.This has the advantage that no information on landmarks is needed.In the elastic registration method described in Srivastava, Wu, Kurtek, Klassen, and Marron (2011), the template is defined in terms of the Karcher mean under the Fisher-Rao metric (Srivastava and Klassen 2016).Then, an energy function depending on the Fisher-Rao distance between each curve and the template is minimized.To efficiently compute this distance, the square root velocity function (SRVF) transform is introduced (Joshi, Klassen, Srivastava, and Jermyn 2007).The main reason for introducing this transform is that the Fisher-Rao distance between two functions is given by the L 2 distance between their SRVF representations.The scikitfda's class 'FisherRaoElasticRegistration' includes methods for this particular type of registration.The warping functions used are stored in the warping_ attribute of this class.The implementation makes use of the dynamic programming routines for alignment to a scikit-fda: Functional Data Analysis in Python template from the Python package fdasrsf (Tucker 2024a).This type of elastic registration is illustrated in Figure 8 with synthetic data.The original trajectories, which are displayed on the left plot, exhibit two local maxima whose relative locations are different in each of the curves.In consequence, alignment cannot be achieved via a simple shift.Note that after this type of elastic registration curves are well aligned even without previous information about the location of the landmarks.
The results of applying shift registration by least squares and elastic Fisher-Rao registration to the Berkeley Growth Study data (Tuddenham and Snyder 1954) are compared in Figure 9.This figure has been generated using the following code: >>> import skfda >>> from skfda.preprocessing.registrationimport ( ... FisherRaoElasticRegistration, ... LeastSquaresShiftRegistration) >>> X, y = skfda.datasets.fetch_growth(return_X_y=True)>>> X_aligned_elastic = FisherRaoElasticRegistration().fit_transform(X)>>> X_aligned_shift = LeastSquaresShiftRegistration().fit_transform(X)>>> X.plot() >>> X_aligned_shift.plot()>>> X_aligned_elastic.plot() The curves displayed in the left panel of Figure 9 trace the evolution of the heights of 54 girls and 39 boys since their birth until their 18th birthday.The nominal ages at which the measurements are made coincide for all individuals.However, each child has a different growth profile.In particular, landmark features manifest themselves at different ages.For instance, even though most curves exhibit a growth spurt at puberty, the precise ages at which this occurs are different for each individual.Therefore, an elastic deformation of the actual age axis may uncover an internal age, which is more meaningful from a biological perspective.The effects of shift and elastic registration based on the Fisher-Rao distance are displayed in the middle and right panels of Figure 9 respectively.

Dimensionality reduction
Functional data are infinite-dimensional objects.Even in the case that they are represented by a set of discrete measurements, their dimensionality is typically very high.In addition, nearby observations exhibit a large degree of dependence.Due to these characteristics, technical and computational difficulties arise in the analysis of these types of data.To alleviate such difficulties one can represent the functional data in a lower-dimensional space while preserving as much information as possible.The use of dimensionality reduction methods leads to gains in efficiency and, in some cases, improvements in interpretability and predictive capacity.Furthermore, in this lower-dimensional representation, the methods of multivariate statistics can be employed (Vieu 2018).
A simple dimensionality-reduction method is to select a set of impact or design points that are relevant for the task at hand; for instance, the most informative points for clustering, classification, or regression (Delaigle, Hall, and Bathia 2012;Ferraty, Hall, and Vieu 2010;Kneip, Poss, and Sarda 2016).Specifically, scikit-fda's class 'EvaluationTransformer' can be used to evaluate the functions as a set of points in the domain of the function.Alternatively, a truncated basis representation can be used (Biau, Bunea, and Wegkamp 2005;Poskitt and Sengarapillai 2013).The coefficients of a functional data object represented as a basis expansion can be extracted using the methods of the class 'CoefficientsTransformer'.Besides these, the scikit-fda package provides methods for functional principal components analysis (FPCA) and variable selection methods.These types of methods are described in what follows.
Functional principal components analysis.Functional principal component analysis (FPCA) is a widely used dimensionality reduction method in FDA.In this method, the individual functions are represented in the orthonormal basis of eigenfunctions of the stochastic process' covariance operator.Dimensionality reduction is achieved by retaining the projections of the original functions onto the subspace of L 2 spanned by the set of eigenfunctions that correspond to the largest eigenvalues.This representation is the one, among those of the same dimension, that explains the most of the data's variance.
A random function X ∈ L 2 can be represented as where µ(t) = E [X(t)], ϕ j (t) is the j-th principal component, and ξ j = T (X(t) − µ(t))ϕ j (t)dt, denotes the projection (score) along the j-th principal component.By the Karhunen-Loève theorem, the scores {ξ j } j≥1 are uncorrelated random variables (Wang et al. 2016).Smoothed versions of the principal components can also be computed by applying the regularization penalties described in Section 2.6, using the procedure described in Section 9. 4.2 of Ramsay and Silverman (2005).The smooth principal components are obtained by optimizing a function that takes into account not only the sample variance but also a term that penalizes the roughness of the principal components.A reduction of the dimension of the data can be achieved by truncating the basis expansion in (3), so that only the first K components are included In scikit-fda, functional principal component analysis can be carried out using the methods of class 'FPCA'.The following code illustrates this functionality for the Berkeley Growth Study data: >>> import skfda >>> import matplotlib.pyplotas plt >>> X, y = skfda.datasets.fetch_growth(return_X_y=True)>>> fpca = skfda.preprocessing.dim_reduction.FPCA(n_components=2) >>> fpca.fit(X)>>> skfda.exploratory.visualization.FPCAPlot(X.mean(), .In this example, the first two principal components are computed.Then, the functional observations are projected onto the two-dimensional subspace spanned by these components.A numerical quadrature is used to compute the corresponding inner products.The class 'FPCAPlot' is used to display the curves {µ(t) ± ϕ j (t); j = 1, 2}, which are the result of adding and subtracting the j-th eigenfunction to the sample mean.In the case of the Berkely growth study, the first component captures overall deviations (either positive or negative) with respect to the mean.The second one reveals patterns associated to differences of growth speed.In particular, it exhibits a maximum followed by a sign change at around puberty.The resulting plots are shown in the left and middle panels in Figure 10.Finally, the projection of the curves onto the first two principal components is obtained using the transform method of the class 'FPCA'.The scores of these two components are displayed as points in the right panel of Figure 10.Note that, with some exceptions, boys (blue) and girls (orange) appear grouped in two separate clusters in this plot.
Variable selection.The package scikit-fda provides a variety of tools to carry out variable selection.A simple approach is to apply a multivariate variable selection method to the discretized representation of the functional observations (Berrendero, Cuevas, and Torrecilla 2016a;Jiménez-Cordero and Maldonado 2021).The scikit-fda class 'EvaluationTransformer' can be used to transform 'FData' objects into NumPy arrays.Then, any Python library for multivariate variable selection can be used.If this approach does not take into account the functional nature of the data, there can be difficulties in the analysis (Aneiros and Vieu 2016).A multivariate method that takes into account the redundancy that arises from the continuity of functional data is minimum-redundancy-maximum-relevance (mRMR) (Ding and Peng 2005;Peng, Long, and Ding 2005;Berrendero et al. 2016a).This method is implemented in scikit-fda in the class 'MinimumRedundancyMaximumRelevance'.The dependence measures that quantify the relevance and the redundancy can be specified by the user.
In addition, the scikit-fda package includes a collection of variable selection methods that specifically take into account the functional nature of the data: a method based on the theory of Reproducing Kernel Hilbert Spaces (RKHS-VS), maxima hunting (MH), and recursive maxima hunting (RMH).
The RKHS-VS method, implemented in the class 'RKHSVariableSelection', was introduced for binary classification problems (Berrendero, Cuevas, and Torrecilla 2018).For a specified value of d, the goal is to identify the set of points t = (t 1 , . . ., t d ) ⊤ ∈ T d and select the corresponding function values, X(t 1 ), . . ., X(t d ), that maximize the Mahalanobis distance between groups (µ where µ 0 (t), µ 1 (t), and K(t, t) are the mean functions of each class and the covariance function evaluated at X(t 1 ), . . ., X(t d ), respectively.In homoscedastic binary classification problems, with a fixed dimension d this selection is optimal in terms of classification error.In practice, the exploration of all possible combinations is often infeasible.To reduce the computational costs, a greedy search is implemented.
In MH (Berrendero, Cuevas, and Torrecilla 2016b; Ordóñez, Oviedo de la Fuente, Roca-Pardiñas, and Rodríguez-Pérez 2018) one selects the values of t ∈ T that correspond to local maxima of a non-negative dependence measure between X(t) and the class label.The selected variables are thus the most relevant in a region.Furthermore, the values of X(t) that are close to these local maxima, which generally provide redundant information, are automatically discarded.In Berrendero et al. (2016b), the distance correlation (Székely, Rizzo, and Bakirov 2007) is used as the dependence measure.This variable selection method is implemented in the class 'MaximaHunting', using the implementation of distance correlation function provided by the dcor package (Ramos-Carreño 2020).MH is an interpretable, fully functional method with optimal performance in an important class of functional classification problems.In spite of its simplicity and good performance, MH has some limitations.Specifically, there can be numerical difficulties to identify the local maxima of the dependence measure.Furthermore, MH takes into account only the marginal relevance of a single variable.Variables that are only relevant when selected in combination with other cannot be identified by these procedures.
RMH (Torrecilla and Suárez 2016) addresses these limitations by assuming a particular form of the stochastic process from which the trajectories are sampled.The algorithm proceeds as follows: First the value of t that is the global maximum of the dependence between the variable X(t) and the label is selected.Let t * be such optimum and, therefore, X(t * the variable selected.The information conveyed by X(t * ) is removed by subtracting from the trajectories the conditional expectation of the process given the value of the selected variable.
Then, the global maximum of the resulting process is identified.The algorithm proceeds in this iterative manner until a pre-specified number of variables have been selected, or until a convergence criterion is fulfilled.In scikit-fda, the class 'RecursiveMaximaHunting' provides an enhanced, very customizable implementation of this method.

Exploratory analysis
Exploratory analysis methods are used to identify salient features, visualize, and describe the data from a statistical point of view.Specifically, the scikit-fda package provides tools for the computation of summary statistics, including robust ones, interactive tools for visual analysis, and outlier detection.

Summary statistics
Common summary statistics, such as the sample mean function and the sample covariance function can be estimated using the tools provided by scikit-fda.Consider a set of functional observations {x i (t)} N i=1 .The sample mean, x i (t), can be computed by applying the function mean() to the 'FData' object in which the data are stored.The functional observations can be either in discrete form or in a basis representation.The resulting mean function is a 'FData' object of the same type as the input (i.e., discretized or in a basis representation).
The sample covariance function k, can be computed applying the function cov() to the corresponding 'FData' object.Similarly, the function var() can be used to computed the sample variance, k(t, t).Irrespective of the representation of the functional observations, the sample variance and covariance are returned in discretized form.Instead of the functions mean(), cov() and var(), the 'FData' methods of the same name can be used to compute these summary statistics.
Figure 11 presents an illustration of this functionality for the Canadian Weather dataset.The sample estimate of the mean temperature curves is shown as a blue curve.The shaded area corresponds to one standard deviation around the estimated mean.Other measures of centrality, such as the trimmed mean, the geometric mean, and the median, are displayed in this figure as well.These robust statistics will be described in some detail in Section 3.4.3.

Depth measures
Depth measures quantify the centrality of a function in relation to a set of functions.These measures are used for exploratory analysis, to compute robust statistics, detect outliers, and for data visualization (e.g., the functional box-plot).In contrast to the univariate case, a variety of definitions of functional depth can be given that yield different orderings of the functional observations in the sample.Each of these functional depths lead to different definitions of robust statistics and of degrees of outlyingness.Some common functional depth measures are implemented in the package scikit-fda.In particular, the methods of class 'IntegratedDepth' can be used to compute integrated depth measures (Fraiman and Muniz 2001), which are averages of univariate depths.Specifically, the integrated depth of the function x is where D is an univariate depth function.This function can be selected by the user.The default is the measure proposed by Fraiman and Muniz (2001): where F X(t) denotes the distribution function of the marginal.
An alternative definition is the band depth (BD), introduced by López-Pintado and Romo (2009).To compute this functional depth, one needs to identify the bands that are delimited by all possible pairs of functional observations in the sample.The BD value is the fraction of bands that completely encompass the curve.In scikit-fda, this quantity can be computed using methods of the class 'BandDepth'.A related, less restrictive measure, is the modified band depth (MBD).This measure takes into account not only the number of bands that contain x, but also the time that x lies within each band.The MBD has better statistical properties than the original BD, in part because it is an integrated depth measure (Nagy, Gijbels, Omelka, and Hlubinka 2016).In scikit-fda, MBD is implemented in the class 'ModifiedBandDepth'.

Robust statistics
The package scikit-fda provides support for the computation of robust statistics.Robust statistics may provide a better characterization of the data than non-robust ones (e.g., the mean or the covariance functions), especially in the presence of outliers.One of the most important robust statistics is the geometric median (Lardin-Puech, Cardot, and Goga 2014) It can be computed with the function geometric_median().Alternatively, the median can be defined as the deepest point in the sample.Different depth measures yield different definitions of the median.These types of medians can be computed with the function depth_based_median().
Functional depth measures can be used also to define the degree of outlyingness of a function in a sample: The larger the depth value the more central the functional observation is.Finally, functional depth measures can be used to define trimmed means (Fraiman and Muniz 2001).A trimmed mean is a robust version of the standard mean in which the most outlying functional observations (the ones with the lowest depth values) are discarded.In scikit-fda, the trimmed mean is implemented in function trim_mean().
The geometric median, the MBD-based median, and the MBD-based trimmed mean in which 10% of the data are discarded, of the Canadian Weather dataset are shown in Figure 11.

Interactive visualization tools and outlier detection
Visualization tools can be used to gain insight into the data.In particular, trends, salient features, and other patterns in the data can be identified simply by inspection.Visualization tools can be utilized also to single out functional observations that are markedly different from the other observations in the sample (outliers).Outlier detection is useful to identify rare events, novel patterns, anomalies, or erroneous measurements.The package scikit-fda provides a number of interactive tools for data visualization and outlier detection.Their implementation utilizes the functionality provided by matplotlib (Hunter 2007).Functional data objects have a plot() method that can be used to graph the curves.Some customization options, such as group colors or labels, are available for this method.An illustration of its use with the Berkeley Growth Study dataset is shown in the left panel of Figure 12.For 'FDataGrid' objects, the scatter() method can be used to display the values of the function as individual points in a graph.This method was used to generate Figure 1.
Another tool for visual exploration provided by scikit-fda is the functional boxplot (Sun and Genton 2011).This is an generalization of the univariate boxplot for functional data.The functional boxplot consists of a graph of the functional median (i.e., the deepest curve in the sample) surrounded by a central envelope, which encompasses the deepest 50% of the observations, and a maximum non-outlying envelope.The width of this outer envelope is determined by scaling the central one by a constant factor.This constant factor can be selected by the user.Its default value is 1.5.In scikit-fda, the class 'Boxplot' can be used to generate and customize functional boxplots.In this plot, a trajectory is marked as an outlier if it lies beyond the maximum non-outlying envelope for some interval.The class 'BoxplotOutlierDetector' can be used for outlier detection based on this criterion.Some customizable elements of 'Boxplot' objects are the depth measure, and the definition of centered bands that encompasses a user-specified fraction of the deepest observations.The following code provides an illustration of these functionalities with the Berkeley Growth Study dataset.The plots that result from the execution of this code are displayed in Figure 12.

>>> boxplot.plot()
Another tool for functional data visualization and outlier detection is the magnitude-shape plot (MS-plot) (Dai andGenton 2018, 2019).In this method, the degree of outlyingness of a functional observation is characterized in terms of two quantities: The magnitude outlyingness (MO) and the shape outlyingness (VO).The MS-plot is the scatter plot of the values MO and VO for each functional observation.This two-dimensional representation of the data can be used, for instance, to identify clusters of functions, or detect potential outliers, either in shape or in magnitude.
The following code can be used to display the MS-plot for the temperature curves of the Canadian Weather dataset together with the original trajectories.Additionally, outliers are identified according to the MS-plot criterion and marked in red.The class 'MagnitudeShapePlot' generates the MS-plot and uses internally the methods of the class 'MSPlotOutlierDetector' for outlier detection.The resulting plots are shown in Figure 13.
>>> import skfda >>> X, y = skfda.datasets.fetch_weather(return_X_y=True)>>> X = X.coordinates [0] >>> ms_plot = skfda.exploratory.visualization.MagnitudeShapePlot(X) >>> ms_plot.plot()>>> fig = X.plot(group=ms_plot.outliers, group_colors=["blue", "red"]) The class 'Outliergram' provides an additional method for data visualization and detection of shape outliers (Arribas-Gil and Romo 2014).The graph is defined in terms of two related quantities: The modified epigraph index (MEI) and the MBD.The MEI of a trajectory is the average over time of the fraction of curves in the sample that lie above it.Each curve is a point (MEI, MBD) in the scatter plot.The outliergram takes advantage of the fact that points corresponding to typical functional observations lie on a parabola, whose analytical form is known.This parabola is used as a reference for the identification of shape outliers.Specifically, the degree of outlyingness of a curve is quantified in terms of its vertical distance to the parabola.The scikit-fda's classes 'Outliergram' and 'OutliergramOutlierDetection' can be used to generate the outliergram and to detect outliers by using this criterion, respectively.The following code illustrates this functionality with the temperatures of the Canadian Weather dataset.The original trajectories and the corresponding outliergram are shown in Figure 14.
>>> import skfda >>> import matplotlib.pyplotas plt >>> X, y = skfda.datasets.fetch_weather(return_X_y=True)>>> X = X.coordinates [0] >>> fig = X.plot() >>> fig = skfda.exploratory.visualization.Outliergram(X).plot() In addition to standard plotting capabilities, most graphs generated with scikit-fda incorporate some interactive features.For example, the cursor can be placed at a point in the graph to display the actual coordinate values and the label of the observation.In addition, if different plots are used for visual exploration of some functional dataset, selecting a particular curve in one plot highlights the corresponding curve in the other active plots.Finally, widgets such as sliders can be used to select curves by some property, such as the label of the observation, or their depth in the sample.
An illustration of this interactive functionality is presented in Figure 15.In this figure, three different kinds of plots are displayed for the temperature curves of the Canadian Weather dataset: A graph of the sample trajectories, the MS-plot, and the outliergram.In the lower right side a slider has been created that displays the MBD value of the selected curve.A functional observation can be selected either by choosing a value in the slider widget or by clicking on the corresponding point in the MS-plot or in the outliergram.The datum selected is then highlighted in all plots.Finally, the cursor has been placed at a point in the MS- plot.This brings up a tool-tip in which relevant information of the corresponding functional observation is displayed.

Integration with scikit-learn for machine learning
The scikit-fda package has been especially designed for seamless integration with scikit-learn (Pedregosa et al. 2011).Specifically, there are a number of methods that transform the functional data into a two-dimensional array so that scikit-learn's machine learning algorithms can be applied.For instance, the method transform() of the class 'EvaluationTransformer' returns an array with the values of the functions in the sample at a specified set of points.
The coefficients of a functional data object represented as a basis expansion can be extracted using the methods of the class 'CoefficientsTransformer'. Additionally, other scikit-fda methods, such as variable selection, can be used to this end.
The provided classes and methods for preprocessing conform to scikit-learn's application programming interface (API) (Buitinck et al. 2013).An advantage of adopting this standard is that they can be employed in a scikit-learn pipeline (class 'Pipeline').The following code illustrates how to build such a pipeline for a classification problem with functional data: >>> import skfda >>> from sklearn.model_selection import GridSearchCV, train_test_split >>> from sklearn.pipeline import Pipeline >>> from sklearn.svm import SVC package can be readily applied to functional data.To ensure the quality and robustness of the software, a comprehensive suite of unit and integration tests is provided.These automated tests are executed regularly in a continuous integration environment.We have attempted also to minimize the number of dependencies and to provide flexible interfaces that are intuitive and consistent throughout the application.
The scikit-fda package is free and open-source software distributed under the OSI-approved 3-clause BSD license (https://opensource.org/licenses/BSD-3-Clause).The version described in this paper is 0.9.1 and was released on 2024-02-26.The GitHub page https: //github.com/GAA-UAM/scikit-fda is the main communication channel with the developers of the package for questions, bug reports, and feature requests.Contributions from the members of the FDA community are encouraged, as are comments and suggestions to improve the quality of the software.
To facilitate the use of scikit-fda, exhaustive documentation, including installation instructions, tutorials, API references, and illustrative examples are provided.The documentation, which is available online at https://fda.readthedocs.io/, is built with the Python tool Sphinx (Sphinx Development Team 2020).The examples and tutorials have been devised with Sphinx-Gallery (Nájera et al. 2020).They can be viewed online or downloaded as interactive Jupyter notebooks (Kluyver et al. 2016).In Figure 16, two screenshots of the documentation pages are shown: A reference page for the Brownian covariance function and an example of use of the functional boxplot functionality are displayed on the left and right panels of the figure, respectively.
The package scikit-fda is under active development, with new features being continuously incorporated.Users are encouraged to consult the documentation at https://fda.readthedocs.io/ for an up-to-date description of the project functionality.

Figure 3 :
Figure 3: Different representations of the first ten trajectories of the Phoneme dataset.From left to right: original trajectories, B-spline, and Fourier basis representation.In both cases, 5 basis functions are considered

Figure 4 :
Figure 4: Smoothed representation of the first ten trajectories of the Phoneme dataset in a B-spline basis with 40 basis functions for different values for the regularization parameter λ ≥ 0; from left to right: λ = 0 (no regularization), λ = 1, and λ = 10.

Figure
Figure 6: Real datasets fetched with scikit-fda.From left to right: GunPoint, Canadian Weather, and the Berkeley Growth Study.

Figure 7 :
Figure 7: Five trajectories of the Phoneme dataset before (left) and after (right) nearest neighbors smoothing.

Figure 8 :
Figure 8: Elastic Fisher-Rao registration with synthetic data: The original curves are shown in the left panel.The registered curves are displayed in the right panel.

Figure 9 :
Figure 9: Registration of the Berkeley Growth Study data with different methods.From left to right: the original curves, shift registration by least squares and elastic Fisher-Rao registration.

Figure
Figure Principal components analysis for the Berkeley Growth Study data.The first and second principal components are plotted in the leftmost panel as perturbations around the mean (in blue).In the right panel, the scores of individual functional observations for the first two components are plotted.The orange and blue points correspond to girls' and boys' growth curves, respectively.

Figure 11 :
Figure 11: Centrality statistics of the Canadian Weather dataset.The shaded band corresponds to one standard deviation around the mean.

Figure 12 :
Figure 12: Functional boxplots of the Berkeley Growth Study dataset.The original curves are depicted in the left panel.The standard functional boxplot is shown in the central panel.In this panel, The black line stands for the functional median.The central envelope is displayed as a pink band around the median.The blue whiskers and their fences mark the maximum non-outlying envelope.Outliers are shown as a red dashed lines.In the right panel, different shades of pink are used for the deepest 25%, 50%, and 75% of the data.

Figure 13 :
Figure 13: MS-plot and outliers of the Canadian Weather dataset.The original yearly temperature curves are displayed in the left panel.The corresponding MS-plot is shown in the right panel.The observations identified as outliers by the MS-plot criterion (those outside the blue ellipse) are marked in red.

Figure 14 :
Figure 14: Outliergram of the yearly temperature curves for the Canadian Weather dataset.The original trajectories are in the left panel and the corresponding outliergram is shown in the right panel.The blue line corresponds to the reference parabola.The orange dashed line separates the typical curves (above) from the outliers (below).

Figure 15 :
Figure 15: Interactive features in multiple plots for the Canadian Weather dataset.

Figure 16 :
Figure 16: scikit-fda documentation: On the left panel, the reference page of the Brownian covariance function is shown.On the right one, an illustration of the functional boxplot functionality.
6: Real datasets fetched with scikit-fda.From left to right: GunPoint, Canadian Weather, and the Berkeley Growth Study.
, which are rather noisy.