intRinsic: an R Package for Model-Based Estimation of the Intrinsic Dimension of a Dataset

This article illustrates intRinsic, an R package that implements novel state-of-the-art likelihood-based estimators of the intrinsic dimension of a dataset, an essential quantity for most dimensionality reduction techniques. In order to make these novel estimators easily accessible, the package contains a small number of high-level functions that rely on a broader set of efficient, low-level routines. Generally speaking, intRinsic encompasses models that fall into two categories: homogeneous and heterogeneous intrinsic dimension estimators. The first category contains the two nearest neighbors estimator, a method derived from the distributional properties of the ratios of the distances between each data point and its first two closest neighbors. The functions dedicated to this method carry out inference under both the frequentist and Bayesian frameworks. In the second category, we find the heterogeneous intrinsic dimension algorithm, a Bayesian mixture model for which an efficient Gibbs sampler is implemented. After presenting the theoretical background, we demonstrate the performance of the models on simulated datasets. This way, we can facilitate the exposition by immediately assessing the validity of the results. Then, we employ the package to study the intrinsic dimension of the Alon dataset, obtained from a famous microarray experiment. Finally, we show how the estimation of homogeneous and heterogeneous intrinsic dimensions allows us to gain valuable insights into the topological structure of a dataset.


Introduction
Statisticians and data scientists are often called to manipulate, analyze, and summarize datasets that present high-dimensional and elaborate dependency structures. In numerous cases, these large datasets contain variables characterized by a considerable amount of redundant information. One can exploit these redundancies to represent a large dataset on a much lower-dimensional scale. This summarization procedure, called dimensionality reduction, is a fundamental step in many statistical analyses. For example, dimensionality reduction techniques grant the feasibility of otherwise challenging tasks such as large data manipulation and visualization by reducing computational time and memory requirements.
More formally, dimensionality reduction is possible whenever the data points lie on one or more manifolds characterized by a lower dimension than what has been observed initially. In this context, the word manifold is used to indicate a constrained surface embedded in a high-dimensional space along which dissimilarities between data points are best represented (Tenenbaum, De Silva, and Langford 2000; Lee, Rodriguez, and Madabhushi 2008). We call the dimension of a latent, potentially nonlinear manifold the intrinsic dimension (ID). Several other definitions of ID exist in the literature. For example, we can regard the ID as the minimal number of parameters needed to represent all the information contained in the data without significant information loss (Ansuini, Laio, Macke, and Zoccolan 2019; Rozza, Lombardi, Rosa, Casiraghi, and Campadelli 2011; Bennett 1969).
Intuitively, the ID is an indicator of the complexity of the features of a dataset. It is a necessary piece of information to have before attempting to perform any dimensionality reduction, manifold learning, or visualization tasks. Indeed, most dimensionality reduction methods would be worthless without a reliable estimate of the true ID they need to target: an underestimated ID value can cause needless information loss. At the same time, the reverse can lead to an unnecessary waste of time and computational resources (Hino, Fujiki, Akaho, and Murata 2017). Beyond dimensionality reduction, ID estimation methods have been successfully employed, for instance, in studying physical systems (Mendes-Santos, Turkeshi, Dalmonte, and Rodriguez 2021) and analyzing neural networks (Ansuini et al. 2019). For more examples of applications, see also Carter, Raich, and Hero (2010) and the references in the discussion in Bac, Mirkes, Gorban, Tyukin, and Zinovyev (2021).
Over the past few decades, a vast number of methods for ID estimation and dimensionality reduction have been developed. The algorithms can be broadly classified into two main categories: projection and geometric approaches. The former maps the original data to a lower-dimensional space. The projection function can be linear, as in the case of principal component analysis (PCA; Hotelling 1933), or nonlinear, as in the case of locally linear embedding (Roweis and Saul 2000), Isomap (Tenenbaum et al. 2000), and t-SNE (Van der Maaten and Hinton 2009). For more examples, see Jolliffe and Cadima (2016) and the references therein. In consequence, there is a plethora of R packages that implement these types of algorithms. To mention some examples, one can use the packages RDRToolbox (Bartenhagen 2022), lle (Kayo 2006), Rtsne (Krijthe 2022), and the classic princomp() function from the base package stats (R Core Team 2022).
Instead, geometric approaches rely on the topology of a dataset, exploiting the properties of the distances between data points. Within this family, we can find fractal methods (Falconer 2003), graphical methods (Costa and Hero 2004), model-based likelihood approaches (Levina and Bickel 2005), and methods based on nearest neighbors distances (Pettis, Bailey, Jain, and Dubes 1979). Also in this case, numerous packages are available: for example, for fractal methods alone there are fractaldim (Sevcikova, Percival, and Gneiting 2021), nonlinearTseries (Garcia 2022), and tseriesChaos (Di Narzo 2019), among others. For a recent review of the methodologies used for ID estimation we refer to Campadelli, Casiraghi, Ceruti, and Rozza (2015).
Given the abundance of approaches in this area, several R developers have also attempted to provide unifying collections of dimensionality reduction and ID estimation techniques. For example, remarkable ensembles of methodologies are implemented in the packages ider (Hino 2017), dimred and coRanking (Kraemer, Reichstein, and Mahecha 2018), dyndimred (Cannoodt and Saelens 2021), IDmining (Golay and Kanevski 2017), and intrinsicDimension (Johnsson 2019). Among the various options, the package Rdimtools (You 2022a) stands out, implementing 150 different algorithms, 17 of which are exclusively dedicated to ID estimation (You 2022b). Finally, it is worth mentioning that there are also Python (Van Rossum et al. 2011) packages implementing different methods for ID estimation: two prominent examples are scikit-learn (Bac et al. 2021) and DADApy (Glielmo et al. 2022). See Appendix B for more details.
In this paper, we introduce and discuss the R package intRinsic (version 1.0.0; Denti and Gilardi 2023). The package is openly available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=intRinsic, and can be installed by running

R> install.packages("intRinsic")

Future developments and updates will be uploaded both on CRAN and GitHub at https://github.com/Fradenti/intRinsic.
The package implements the two nearest neighbors (TWO-NN), the generalized ratio ID estimator (GRIDE), and the heterogeneous ID algorithm (HIDALGO) models, three state-of-the-art ID estimators recently introduced in Facco, D'Errico, Rodriguez, and Laio (2017), Denti, Doimo, Laio, and Mira (2022), and Allegra, Facco, Denti, Laio, and Mira (2020), respectively. These methods are likelihood-based estimators that rely on the theoretical properties of the distances among nearest neighbors. The first two models estimate a global, unique ID of a dataset and are implemented under both the frequentist and Bayesian paradigms. Moreover, one can exploit these models to study how the ID depends on the scale of the neighborhood considered for its estimation. On the contrary, HIDALGO is a Bayesian mixture model that allows for the estimation of clusters of points characterized by heterogeneous ID. In this article, we focus our attention on the exposition of TWO-NN and HIDALGO, and we discuss the pros and cons of both models with the aid of simulated data. More details about the additional routines, such as GRIDE, are reported in Appendix A. Finally, in Appendix B, we elaborate more on the strengths and weaknesses of the methods implemented in intRinsic in comparison to the other existing packages.
Broadly speaking, the package contains two sets of functions, organized into high-level and low-level routines. The former set contains user-friendly and straightforward R functions. Our goal is to make the package as accessible and intuitive as possible by automating most tasks. The low-level routines are not exported, as they represent the package's core. The most computationally intensive low-level functions are written in C++, exploiting the interface with R provided by the packages Rcpp and RcppArmadillo (Eddelbuettel and François 2011; Eddelbuettel and Sanderson 2014). The C++ implementation considerably speeds up time-consuming tasks, like running the Gibbs sampler for the Bayesian mixture model HIDALGO. Moreover, intRinsic is well integrated with external R packages. For example, we enriched the package's functionalities by defining ad hoc methods for generic functions like autoplot() from the ggplot2 package (Wickham 2016) to produce detailed graphical outputs.
The article is structured as follows. Section 2 introduces and describes the theoretical background of the TWO-NN and HIDALGO methods. Section 3 illustrates the basic usage of the implemented routines on simulated data. We show how to obtain, manipulate, and interpret the different outputs. Additionally, we assess the robustness of the methods by monitoring how the results vary when the input parameters change. Section 4 presents an application to a famous real microarray dataset. Finally, Section 5 concludes by discussing future directions and potential extensions to the package.

The modeling background
Let X be a dataset with n data points measured over D variables. We denote each observation as x_i ∈ R^D, with i = 1, . . . , n. Despite being observed over a D-dimensional space, we suppose that the points lie on a latent manifold M with intrinsic dimension d ≤ D. Generally, we expect that d ≪ D. Thus, we postulate that a low-dimensional data-generating mechanism can accurately describe the dataset.
Then, consider a single data point x_i. Starting from this point, one can order the remaining n − 1 observations according to their distance from x_i. This way, we obtain a list of nearest neighbors (NNs) of increasing order. Formally, let ∆ : R^D × R^D → R^+ be a generic distance function between data points. We denote with x_(i,l) the l-th NN of x_i and with r_{i,l} = ∆(x_i, x_(i,l)) their distance, for l = 1, . . . , n − 1. Given the sequence of NNs for each data point, we can define the volume of the hyper-spherical shell enclosed between two successive neighbors of x_i as

ν_{i,l} = ω_d (r_{i,l}^d − r_{i,l−1}^d),    (1)

for l = 1, . . . , n − 1 and i = 1, . . . , n, where d is the dimensionality of the latent manifold in which the points are embedded (the ID) and ω_d is the volume of the d-dimensional hyper-sphere with unitary radius. For this formula to hold, we need to set x_(i,0) ≡ x_i and r_{i,0} = 0. We provide a visual representation of the introduced quantities in Figure 1, which depicts the three-dimensional case.
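To fix ideas, the shell volumes of Equation 1 are easy to evaluate directly. The helper below, shell_volumes(), is a hypothetical function written for this exposition, not a routine exported by intRinsic:

```r
# Volume of the d-dimensional hyper-sphere with unit radius
omega <- function(d) pi^(d / 2) / gamma(d / 2 + 1)

# Hyper-spherical shell volumes for one point, given its increasing
# NN distances r = (r_{i,1}, ..., r_{i,L}) and a candidate ID d.
# By convention, r_{i,0} = 0.
shell_volumes <- function(r, d) {
  r_ext <- c(0, r)
  omega(d) * (r_ext[-1]^d - r_ext[-length(r_ext)]^d)
}

# Shells computed from three NN distances, evaluated at d = 3;
# the volumes telescope to that of the ball of radius r_{i,3}
nu <- shell_volumes(c(0.5, 0.8, 1.0), d = 3)
```

Note that the shells tile the enclosing ball, so sum(nu) equals omega(3) times the cube of the largest distance.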
From a modeling perspective, we assume that the dataset X is a realization of a Poisson point process characterized by density function ρ(x). Facco et al. (2017) showed that the hyper-spherical shells defined in Equation 1 are the multivariate extension of the well-known interarrival times (Kingman 1992). Indeed, they proved that under the assumption of homogeneity of the Poisson point process, i.e., ρ(x) = ρ ∀x, all the ν_{i,l}'s are independently drawn from an Exponential distribution with rate equal to the density ρ: ν_{i,l} ∼ Exp(ρ), for l = 1, . . . , n − 1, and i = 1, . . . , n. This fundamental result motivates the derivation of the estimators we will introduce in the following.

The TWO-NN estimator
Building on the distribution of the hyper-spherical shells, Facco et al. (2017) noticed that if the intensity of the underlying Poisson point process is assumed to be constant on the scale of the second NN, the following distributional result holds:

r_{i,2} / r_{i,1} ∼ Pareto(1, d), for i = 1, . . . , n.    (2)

In other words, if the intensity of the Poisson point process that generates the data can be regarded as locally constant (on the scale of the second NN), the ratio of the first two NN distances from each point is Pareto distributed. Recall that a Pareto random variable is characterized by a shape parameter a, a scale parameter b, and density function f_X(x) = a b^a x^{−a−1} defined over x ∈ [b, +∞). Remarkably, Equation 2 states that the ratio r_{i,2}/r_{i,1} follows a Pareto distribution with scale b = 1 and shape a = d, i.e., the shape parameter can be interpreted as the ID of the dataset. Estimating d by exploiting the ratios of distances between each point and its first two NNs is the core of the TWO-NN procedure.

Figure 1: An illustration in three dimensions of the quantities involved in the TWO-NN modeling framework. The central dot represents the i-th data point. The selected observation is connected by dashed lines, representing the distances r_{i,j}, j = 1, 2, 3, to its first three NNs. The different spherical shells have volumes ν_{i,j}, j = 1, 2, 3.
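The distributional result in Equation 2 can be checked empirically with base R alone. The sketch below is an illustration for this section (compute_mus() and twonn() in the package perform these steps far more efficiently): we simulate points with known ID d = 3 and recover it from the ratios, exploiting the fact that Equation 2 implies log of the ratio is Exponentially distributed with rate d.

```r
set.seed(1)
n <- 1500; d_true <- 3
X <- matrix(runif(n * d_true), nrow = n)   # points filling a 3-d hyper-cube
D_mat <- as.matrix(dist(X))                # Euclidean distance matrix
mu <- apply(D_mat, 1, function(dists) {
  s <- sort(dists[dists > 0])              # NN distances in increasing order
  s[2] / s[1]                              # second-to-first NN distance ratio
})
# Since mu_i ~ Pareto(1, d), log(mu_i) ~ Exp(d): 1 / mean(log(mu)) estimates d
d_hat <- 1 / mean(log(mu))
```

With this sample size, d_hat lands close to the true value of 3, up to small boundary effects of the hyper-cube.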
One can also attain more general results by considering ratios of distances with NNs of generic orders. A generalized ratio will be denoted with µ_{i,n1,n2} = r_{i,n2} / r_{i,n1}, for i = 1, . . . , n. Here, n1 and n2 are the NN orders, integers that need to comply with the following constraint: 1 ≤ n1 < n2 ≤ n. In this paper, we will mainly focus on methods involving the ratio of the first two NN distances. The generalized ratios will be only mentioned when discussing the function compute_mus(), for the sake of completeness. Therefore, to simplify the notation, we will write µ_i = µ_{i,1,2} and µ = (µ_i)_{i=1}^n. Once the vector µ is computed, we can employ different estimation techniques for the TWO-NN model. All of the following methods can be called via the intRinsic function twonn(). The reader can find examples of its usage in Section 3.3.
Linear estimator. Facco et al. (2017) proposed to estimate the ID via the linearization of the Pareto cumulative distribution function F(µ_i) = 1 − µ_i^{−d}. The ID estimate is obtained as the least squares solution of the following zero-intercept linear regression problem:

−log(1 − F̂(µ_(i))) = d log(µ_(i)), for i = 1, . . . , n,    (3)

where F̂(µ_(i)) = i/n is the empirical cumulative distribution function of the sample and the µ_(i)'s are the ratios defined in Equation 2 sorted in increasing order. The authors also suggested trimming from µ a percentage c_TR of the most extreme values to obtain a more robust estimation. This choice is justified because the extreme ratios often correspond to observations that do not comply with the local homogeneity assumption.
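A bare-bones version of this linear fit can be sketched as follows. This is an illustration under simplifying assumptions, not the package's implementation: the ratios are drawn directly from a Pareto(1, 2) via inverse-CDF sampling, the empirical CDF is evaluated at i/(n + 1) rather than i/n to avoid log(0) at the largest ratio, and no trimming is applied.

```r
set.seed(2)
n <- 5000; d_true <- 2
mu <- runif(n)^(-1 / d_true)       # Pareto(1, 2) draws: P(mu > t) = t^(-2)
mu_sorted <- sort(mu)              # order statistics mu_(1) <= ... <= mu_(n)
F_emp <- (1:n) / (n + 1)           # empirical CDF at the sorted ratios
x <- log(mu_sorted)
y <- -log(1 - F_emp)               # linearized Pareto CDF: y = d * x
fit <- lm(y ~ x - 1)               # zero-intercept least squares
d_lin <- unname(coef(fit))
```

The fitted slope d_lin recovers the shape parameter, here close to 2.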

Maximum likelihood estimator.
In a similar spirit, Denti et al. (2022) took advantage of the distributional results in Equation 2 to derive a simple maximum likelihood estimator (MLE) and the corresponding confidence interval (CI). Trivially, the (unbiased) MLE for the shape parameter of a Pareto distribution is given by

d̂ = (n − 1) / (Σ_{i=1}^n log µ_i),    (4)

while the CI of level (1 − α) is defined as

CI_{1−α} = ( q^{IG_{n,(n−1)d̂}}_{α/2}, q^{IG_{n,(n−1)d̂}}_{1−α/2} ),    (5)

where q^{IG_{a,b}}_{α/2} denotes the quantile of order α/2 of an Inverse-Gamma distribution of shape a and scale b.
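For completeness, here is a self-contained numerical check of the MLE, with ratios again simulated by inverse-CDF sampling. As a simplification, the interval below uses the exact Gamma pivotal quantity d · Σ_i log µ_i ∼ Gamma(n, 1) rather than the Inverse-Gamma quantiles implemented in the package, so it is a closely related sketch and not the package's formula:

```r
set.seed(3)
n <- 1000; d_true <- 2
mu <- runif(n)^(-1 / d_true)       # Pareto(1, 2) draws: P(mu > t) = t^(-2)
S <- sum(log(mu))                  # sufficient statistic; log(mu_i) ~ Exp(d)
d_mle <- (n - 1) / S               # unbiased MLE

# Since S ~ Gamma(n, rate = d), the quantity d * S ~ Gamma(n, 1) is pivotal,
# which yields an exact 95% confidence interval for d
ci <- qgamma(c(0.025, 0.975), shape = n) / S
```

The point estimate d_mle falls near the true value of 2 and inside the interval.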
Bayesian estimator. It is also straightforward to derive an estimator from a Bayesian perspective, obtained with the specification of a prior distribution on the shape parameter d. The most natural choice is the conjugate prior d ∼ Gamma(a, b). It is then immediate to derive the posterior distribution for the shape parameter:

d | µ ∼ Gamma( a + n, b + Σ_{i=1}^n log µ_i ).    (6)

With this method, one can quickly recover the principal quantiles of the posterior distribution and obtain point estimates and uncertainty quantification with a credible interval (CRI) of level α.
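Thanks to conjugacy, the Bayesian update is essentially free to compute. The sketch below illustrates it on simulated Pareto ratios, using a vague Gamma(0.001, 0.001) prior; it is a stand-alone illustration of the conjugate update, not a call into the package:

```r
set.seed(4)
n <- 1000; d_true <- 2
mu <- runif(n)^(-1 / d_true)       # Pareto(1, 2) draws
a <- 0.001; b <- 0.001             # vague Gamma prior on d

# Conjugacy: d | mu ~ Gamma(a + n, b + sum(log(mu)))
a_post <- a + n
b_post <- b + sum(log(mu))
post_mean <- a_post / b_post
cri_95 <- qgamma(c(0.025, 0.975), shape = a_post, rate = b_post)
```

The posterior mean concentrates near the true d = 2, and cri_95 gives the 95% credible interval.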

HIDALGO: The heterogeneous intrinsic dimension algorithm
The TWO-NN model implicitly assumes the existence of a single manifold. However, postulating a global, unique ID value for the entire dataset can often be limiting, especially when the data present complex dependence structures among variables. To extend the previous modeling framework, one can imagine that the data points are divided into clusters, each belonging to a latent manifold with its specific ID. Allegra et al. (2020) employed this heterogeneous ID estimation approach in their model: the heterogeneous ID algorithm (HIDALGO). The authors proposed as density function of the generating point process a mixture of K distributions defined on K different latent manifolds, expressed as ρ(x) = Σ_{k=1}^K π_k ρ_k(x), where π = (π_1, . . . , π_K) is the vector of mixture weights. This assumption induces a mixture of Pareto distributions as the distribution of the ratios µ_i:

µ_i ∼ Σ_{k=1}^K π_k Pareto(1, d_k), for i = 1, . . . , n,    (7)

where d = (d_1, . . . , d_K) is the vector of ID parameters. Allegra et al. (2020) adopted a Bayesian perspective, specifying independent Gamma priors on the elements of d, d_k ∼ Gamma(a_k, b_k) for k = 1, . . . , K, and a Dirichlet prior for the mixture weights, π ∼ Dirichlet(α_1, . . . , α_K). Regarding the latter, we suggest setting α_1 = . . . = α_K = α < 0.05, fitting a sparse mixture model as indicated by Malsiner-Walli, Frühwirth-Schnatter, and Grün (2017). This prior specification encourages the data to populate only the necessary number of mixture components. Thus, we see K as an upper bound on the number of active clusters K* ≤ K.
Unfortunately, a model-based clustering approach like the one presented in Equation 7 is ineffective at modeling the distance ratios. The problem lies in the fact that the different Pareto kernels have extremely similar shapes. Therefore, Pareto densities with varying shape parameters can fit the same data points equally well, compromising the clustering allocation and the consequent ID estimation. Even when considering very diverse shape parameters, the right tails of the resulting Pareto distributions overlap to a great extent. This issue is evident in Figure 2, where we depict examples of various Pareto(1, d) densities.
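The extent of the overlap can be quantified directly from the Pareto survival function, P(X > t) = t^(−d), which follows from the density introduced in Section 2.1 with scale b = 1:

```r
# Survival function of a Pareto(1, d) random variable
pareto_surv <- function(t, d) t^(-d)

# Even very different IDs leave substantial mass beyond the same threshold
pareto_surv(1.2, d = 1)   # ~0.83
pareto_surv(1.2, d = 5)   # ~0.40
pareto_surv(1.2, d = 9)   # ~0.19
```

Hence a ratio of 1.2 is plausible under any of the three shape parameters, which is why the likelihood alone struggles to separate the mixture components.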
To address this problem, Allegra et al. (2020) introduced a local homogeneity assumption, assuming that neighboring points are more likely to be part of the same latent manifold.
To incorporate this idea in the model, the authors added an extra penalizing term in the likelihood. We now summarize their approach.
First, they introduced the latent membership labels z = (z_1, . . . , z_n) to assign each observation to a cluster, where z_i = k means that the i-th observation belongs to the k-th mixture component. Then, they defined the binary adjacency matrix N^(q), whose entries are N^(q)_{i,j} = 1 if the point x_j is among the first q NNs of x_i, and 0 otherwise. Finally, they defined N_{z_i} to be the number of observations with label equal to z_i and assumed the following probabilities: P(N^(q)_{i,j} = 1 | z_i = z_j) = ζ_0, with ζ_0 > 0.5, and P(N^(q)_{i,j} = 1 | z_i ≠ z_j) = ζ_1, with ζ_1 < 0.5. These probabilities are employed to define a distribution over the neighboring structure of each data point, with normalizing constant Z; here, 1_A is the indicator function, equal to 1 when the event A is true and equal to 0 otherwise. A more technical discussion of this model extension and of the validity of the underlying hypotheses can be found in the Supplementary Material of Allegra et al. (2020). For simplicity, we assume ζ_0 = ζ and ζ_1 = 1 − ζ. The new likelihood becomes

π(µ, N^(q) | z, d, ζ) ∝ Π_{i=1}^n [ Pareto(µ_i | 1, d_{z_i}) Z^{−1} ζ^{n_i^(q)} (1 − ζ)^{q − n_i^(q)} ], with z_i | π ∼ Cat_K(π),    (8)

where n_i^(q) = Σ_{j=1}^n N^(q)_{i,j} 1_{z_i = z_j} is the number of NNs of x_i sharing its cluster label, and Cat_K denotes a Categorical distribution over the set {1, . . . , K}. A closed-form posterior distribution is not available, so we rely on Markov chain Monte Carlo (MCMC) techniques to simulate a posterior sample.
In our package, this Bayesian mixture model is implemented by the function Hidalgo().
For the ID parameters, we use conjugate Gamma prior specifications or variations thereof to account for modeling inconsistencies. For example, when the nominal dimension D is low, the unbounded support of a Gamma prior may provide unrealistic results, where the posterior distribution assigns positive density to the interval (D, +∞). Santos-Fernandez, Denti, Mengersen, and Mira (2022) proposed to employ a more informative prior for d:

π(d_k) = ρ (Gamma(d_k; a, b) / C_{a,b,D}) 1_{(0,D]}(d_k) + (1 − ρ) δ_D(d_k),    (9)

where C_{a,b,D} denotes the normalizing constant of a Gamma(a, b) truncated over (0, D]. That is, the prior distribution for d_k is a mixture between a truncated Gamma distribution over (0, D] and a point mass located at D. The parameter ρ denotes the mixing proportion. When ρ = 1, the distribution in Equation 9 reduces to a simple truncated Gamma. Both approaches are implemented in intRinsic, but we recommend using the latter. We report the details of the implemented Gibbs sampler in Appendix C, while in Appendix F we comment more on the concept of heterogeneous ID estimation related to global and local ID definitions.
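To make this prior concrete, one can sample from it by inverse-CDF sampling of the truncated Gamma component. The helper r_trunc_prior() below is a hypothetical illustration written for this exposition, not the package's internal code:

```r
# Draw n values from  rho * TruncGamma(a, b, (0, D]) + (1 - rho) * delta_D
r_trunc_prior <- function(n, a, b, D, rho) {
  u <- runif(n)
  # Inverse-CDF draw from a Gamma(a, b) truncated to (0, D]; the factor
  # pgamma(D, a, b) plays the role of the normalizing constant C_{a,b,D}
  trunc_gamma <- qgamma(runif(n) * pgamma(D, shape = a, rate = b),
                        shape = a, rate = b)
  ifelse(u < rho, trunc_gamma, D)   # point mass at D with probability 1 - rho
}

set.seed(5)
draws <- r_trunc_prior(10000, a = 1, b = 1, D = 5, rho = 0.75)
```

All draws respect the bound d ≤ D, and roughly a quarter of them sit exactly at the nominal dimension, as the mixing proportion dictates.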

Examples using intRinsic
This section illustrates the main routines of the intRinsic package. Here, we indicate the number of observations and the observed nominal dimension with n and D, respectively. Also, N_k(m, Σ) represents a multivariate normal distribution of dimension k, mean m, and covariance matrix Σ. Moreover, let U^(k)_{(a,b)} represent a multivariate Uniform distribution with support (a, b) in k dimensions. Finally, we denote with I_k an identity matrix of dimension k.
To start, we load our package by running

R> library("intRinsic")

Simulated datasets
To illustrate how to apply the different ID estimation techniques available in the package, we will use three simulated datasets: Swissroll, HyperCube, and GaussMix. This way, we can compare the results of the experiments with the ground truth. One can generate exact replicas of the three simulated datasets used in this paper by running the code reported below.
The first dataset, Swissroll, is obtained via the classical Swissroll transformation S : R 2 → R 3 defined as S(x, y) = (x cos(x), y, x sin(x)), where each pair of points (x, y) is sampled from two independent Uniform distributions on (0, 10). To simulate such a dataset, we can use the intRinsic function Swissroll(), specifying the number of observations n as the input parameter.
R> set.seed(123456)
R> Swissroll <- Swissroll(n = 1000)

The second dataset, HyperCube, contains a cloud of 500 points sampled from U^(5)_{(0,1)} and embedded in an eight-dimensional space R^8. To fill the gap between the nominal dimension (8) and the ID (5), we add three columns of zeros.
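The generating code for HyperCube is not reproduced in this excerpt; a construction consistent with the description above (hypothetical, written for illustration, and not necessarily identical to the paper's exact script) is:

```r
# 500 points uniform on the 5-dimensional unit hyper-cube (the ID),
# padded with three zero columns to reach nominal dimension D = 8
HyperCube <- cbind(replicate(5, runif(500)), 0, 0, 0)
```

The resulting matrix has 500 rows and 8 columns, of which only the first five carry information.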
The third dataset, GaussMix, collects three groups of 500 points each, drawn from Gaussian distributions supported on manifolds of different intrinsic dimensions. We embed each of them in a five-dimensional space by adding the corresponding number of columns of zeros, as we did for the HyperCube dataset.
R> x0 <- rnorm(500, mean = -5, sd = 1)
R> x1 <- cbind(x0, 3 * x0, 0, 0, 0)
R> x2 <- cbind(replicate(3, rnorm(500)), 0, 0)
R> x3 <- replicate(5, rnorm(500, 5))
R> GaussMix <- rbind(x1, x2, x3)
R> class_GMix <- rep(c("A", "B", "C"), rep(500, 3))

Next, we need to establish the true values of the ID for the different datasets. Scatterplots are useful exploratory tools to spot any clear dependence across the columns of a dataset. For example, if we plot all the possible two-dimensional scatterplots from the Swissroll dataset, we obtain Figure 3. The different panels show that two of the three coordinates are free, and we can recover the last coordinate as a function of the others. Therefore, the ID of Swissroll is equal to 2. Moreover, from the description of the data simulation, it is also evident that the ID is equal to 5 for the HyperCube dataset. However, it is not as simple to figure out the value of the true ID for GaussMix, given the heterogeneity of its data-generating mechanism. Table 1 summarizes the sample sizes along with the true nominal dimensions D and IDs that characterize the three datasets.

Ratios of nearest neighbors distances
The ratios of NN distances constitute the core quantities on which the theoretical development presented in Section 2 is based. We can compute the ratios µ_i defined in Equation 2 with the function compute_mus(). The function can also compute the generalized ratios µ_{n1,n2}, with n1 < n2, as presented in Section 2.1. The function needs the following arguments:

• X: a dataset of dimension n × D of which we want to compute the distance ratios;
• dist_mat: an n × n symmetric matrix containing the distances between data points, which one can pass instead of X;
• n1 and n2: the orders of the closest and furthest nearest neighbors to consider, respectively. By default, n1 = 1 and n2 = 2.
The function has two additional arguments, Nq and q, that we will introduce later in Section 3.4 when we will illustrate the HIDALGO model.
Note that the specification of dist_mat overrides the argument passed as X. Instead, if the distance matrix dist_mat is not specified, compute_mus() relies on the function get.knn() from the FNN package to compute the needed distances. The main output of the function is the vector of ratios µ_{n1,n2}, an object of class 'mus' for which appropriate print() and plot() methods are defined. To use the function, we can simply write

R> mus_Swissroll <- compute_mus(X = Swissroll)
R> mus_HyperCube <- compute_mus(X = HyperCube)
R> mus_GaussMix <- compute_mus(X = GaussMix)

Calling the function with default arguments produces µ = µ_{1,2}. To explicitly compute generalized ratios, we need to specify the NN orders n1 and n2. Here, we report two different examples:

R> mus_Swissroll_1 <- compute_mus(X = Swissroll, n1 = 5, n2 = 10)
R> mus_HyperCube_1 <- compute_mus(X = HyperCube, n1 = 5, n2 = 10)
R> mus_GaussMix_1 <- compute_mus(X = GaussMix, n1 = 5, n2 = 10)

and

R> mus_Swissroll_2 <- compute_mus(X = Swissroll, n1 = 10, n2 = 20)
R> mus_HyperCube_2 <- compute_mus(X = HyperCube, n1 = 10, n2 = 20)
R> mus_GaussMix_2 <- compute_mus(X = GaussMix, n1 = 10, n2 = 20)
R> mus_GaussMix_2

Ratio statistics mu's:
NN orders: n1 = 10, n2 = 20
Sample size: 1500
Nominal Dimension: 5

Table 2: Summary statistics of the generalized ratios obtained from the GaussMix dataset. Each row corresponds to a different combination of NN orders.

The histograms of the computed ratios are presented in Figure 4. The panels in each row correspond to different datasets, while the varying NN orders are reported by column. The horizontal axes are truncated over the interval [0, 4] to improve the visualization. The histograms present the right-skewed shape typical of the Pareto distribution. However, some ratios can assume extreme values, especially when low NN orders are chosen. To provide an example, in Table 2 we display the summary statistics of the three vectors of ratios computed on GaussMix. The maximum in the first line has a high magnitude, but it reduces significantly when higher NN orders are considered. It is also interesting to observe that the distribution of the ratios of GaussMix for n1 = 10 and n2 = 20 (bottom-right panel) is multimodal, a symptom of the presence of heterogeneous manifolds. We remark again that we will focus on the TWO-NN and HIDALGO models for the rest of the paper. Therefore, we will only use compute_mus() in its default specification, simply computing µ = (µ_i)_{i=1}^n. The ratios of NNs of generic order are necessary when using the GRIDE model. See Appendix A and Denti et al. (2022) for more details.
Finally, recall that the model is based on the assumption that a Poisson point process is the generating mechanism of the dataset. Therefore, the model cannot handle ties among data points. From a more practical point of view, if there exist i ≠ j such that x_i = x_j, the computation of µ_i would be unfeasible since r_{i,1} = 0. We devised the function compute_mus() to automatically detect if duplicates are present in a dataset. In that case, the function removes the duplicates and provides a warning. We showcase this behavior with a simple example:

R> Dummy_Data_with_replicates <- rbind(c(1, 2, 3), c(1, 2, 3),
+    c(1, 4, 3), c(1, 4, 3), c(1, 4, 5))
R> mus_shorter <- compute_mus(X = Dummy_Data_with_replicates)

Warning: Duplicates are present and will be removed.
The function compute_mus() is at the core of many other high-level routines we use to estimate the ID. The following subsection shows how to implement the TWO-NN model to obtain a point estimate of a global, homogeneous ID accompanied by the corresponding CIs or CRIs.

Estimating a global ID value with TWO-NN
We showcase how to carry out inference on the ID with the TWO-NN model using the linear, MLE, and Bayesian estimation methods. The low-level functions that implement these methods are twonn_linfit(), twonn_mle(), and twonn_bayes(), respectively. One can call these low-level functions via the high-level function twonn(). Regardless of the preferred estimation method, the twonn() function takes the following arguments: the dataset X or the distance matrix dist_mat (refer to the previous input descriptions for more details), along with

• mus: the vector of second-to-first NN distance ratios. If this argument is provided, X and dist_mat will be ignored;
• method: a string stating the preferred estimation method. Can be "mle" (the default), "linfit", or "bayes";
• alpha: the confidence level (for "mle" and "linfit") or the posterior probability included in the CRI ("bayes");
• c_trimmed: the proportion of most extreme ratios to exclude from the analysis.
The object that the function returns is a list, characterized by a class that varies according to the selected estimation method. Tailored R methods have been devised to extend the generic functions print(), summary(), plot(), and autoplot() to interact with these new classes. The first element of the returned list always contains the estimates, while the others provide additional information about the chosen estimation process.
Linear estimator. We apply the linear estimator to the Swissroll dataset. We fit five linear models by setting method = "linfit", adopting different trimming proportions. The function summary() provides an informative recap of the estimation process; we show the results for lin_2 as an example.
R> lin_1 <- twonn(X = Swissroll, method = "linfit", c_trimmed = 0)
R> lin_2 <- twonn(X = Swissroll, method = "linfit", c_trimmed = 0.001)
R> lin_3 <- twonn(X = Swissroll, method = "linfit", c_trimmed = 0.01)
R> lin_4 <- twonn(X = Swissroll, method = "linfit", c_trimmed = 0.05)
R> lin_5 <- twonn(X = Swissroll, method = "linfit", c_trimmed = 0.1)

The results of these experiments are collected in Table 3. This first example allows us to comment on the choice of the trimming level. Trimming the most extreme observations can be fundamental, since outliers may distort the estimate. However, too much trimming removes important information regarding the tail of the Pareto distribution, which is essential for the correct estimation of the ID. The estimates improve for very low levels of trimming but start to degenerate as more than 5% of the observations are removed from the dataset.
We can also visually assess the goodness of fit via the dedicated autoplot() function, which displays the data and the estimated regression line. The slope of the regression line corresponds to the linear-fit ID estimate. For example, we obtain the plots in Figure 5 with the following two lines of code:

R> autoplot(lin_1, title = "No trimming")
R> autoplot(lin_5, title = "10% trimming")

MLE.
A second way to obtain an ID estimate, along with its CI, is via MLE. The implemented formulas are presented in Equations 4 and 5. We compute the MLE by setting method = "mle" in twonn(), which calls the low-level function twonn_mle(). In addition to the previous arguments, one can also specify

• unbiased: logical; if TRUE, the point estimate is computed according to the unbiased estimator (where the numerator is n − 1, as in Equation 4).
We compute the ID on HyperCube via MLE using different distance definitions: Euclidean, Manhattan, and Canberra. These distances are calculated with the dist() function of the stats package.
Other distance matrices can be employed as well. In this example, we also show how the widths of the CIs change as the confidence level varies. As an example, we print the results stored in the object mle_12. The results are reported in Table 4. The type of distance can lead to differences in the results. Overall, the estimators agree with each other, obtaining values close to the ground truth.
Bayesian estimator. The third option for ID estimation is to adopt a Bayesian perspective and specify a prior distribution for the parameter d. This approach is implemented in the low-level function twonn_bayes(), called by setting method = "bayes" in twonn(). Along with the arguments mentioned above, we can also specify the following:

• a_d and b_d: shape and rate parameters of the Gamma prior distribution on d. A vague specification is adopted as default, with a_d = 0.001 and b_d = 0.001. This implies E(d) = 1 and Var(d) = 1000.

Table 4: MLEs obtained with the TWO-NN model applied to the HyperCube dataset. Different distance functions and confidence level specifications are adopted.
Differently from the previous two cases, alpha is now assumed to be the probability contained in the CRI computed from the posterior distribution. Along with the CRI, the function outputs the posterior mean, median, and mode. In the following, four examples showcase the usage of this function on the Swissroll dataset with different combinations of alpha and Gamma hyperparameters. The results are summarized in Table 5.
As an example, we compare the prior specification used for the object bay_4 (d ∼ Gamma(1, 1)) with a more informative one (d ∼ Gamma(10, 10)). In the resulting plot, the posterior distribution is depicted in black, the prior in blue, and the dashed vertical red lines represent the estimates.
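The conjugacy behind the Bayesian TWO-NN can be sketched directly: since the ratios μᵢ are Pareto(1, d), the likelihood is proportional to d^n exp(−d Σ log μᵢ), so a Gamma(a, b) prior on d yields a Gamma(a + n, b + Σ log μᵢ) posterior (consistent with the default a_d = b_d = 0.001 giving E(d) = 1 and Var(d) = 1000). A minimal sketch, not the package implementation:

```r
# Conjugate update for the TWO-NN Bayesian estimator:
# posterior of d is Gamma(a_d + n, b_d + sum(log(mu)))
posterior_id <- function(mu, a_d = 0.001, b_d = 0.001, alpha = 0.95) {
  a_post <- a_d + length(mu)
  b_post <- b_d + sum(log(mu))
  c(mean   = a_post / b_post,
    median = qgamma(0.5, a_post, b_post),
    lower  = qgamma((1 - alpha) / 2, a_post, b_post),
    upper  = qgamma(1 - (1 - alpha) / 2, a_post, b_post))
}

set.seed(3)
mu <- exp(rexp(2000, rate = 2))  # synthetic ratios from a Pareto with d = 2
round(posterior_id(mu), 3)       # posterior summaries concentrate near d = 2
```

Here the synthetic ratios exploit the fact that log μ is exponential with rate d under the Pareto(1, d) model.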
So far, we have discussed methods to determine a global ID estimate accurately and efficiently.
Knowing the simulated data-generating processes, we could easily compare the obtained estimates with the ground truth for the Swissroll and HyperCube datasets. However, the same task is not immediate when dealing with GaussMix. For GaussMix, relying only on a global ID estimate may constitute an oversimplification since the data points are generated from Gaussian distributions defined over different manifolds of heterogeneous dimensions. This scenario is more likely to occur with datasets describing real phenomena, often characterized by complex dependencies, and it will be the focus of the next section.

Detecting the presence of multiple manifolds
In contexts where data may exhibit heterogeneous ID, we face two main challenges: (i) detecting the actual presence of multiple manifolds in the data, and (ii) accurately estimating their IDs.
To tackle these problems, we start by applying the twonn() function to GaussMix with method equal to "linfit" and "mle".
R> mus_gm <- compute_mus(GaussMix)
R> summary(twonn(mus = mus_gm, method = "linfit"))

The estimates obtained with the different methods do not agree. Figure 7 raises concerns about the appropriateness of a model postulating the existence of a single, global manifold.

In the top panel of Figure 7, the data points are colored according to their generating mixture component. The sorted log-ratios present a non-linear pattern, where the slope values vary across the different mixture components A, B, and C.
We can get another empirical assessment by inspecting the evolution of the cumulative average distances between a point and its nearest neighbors. More formally, for each point x_i we consider the evolution of the ergodic mean of the sorted sequence of distances (r_{i,1}, ..., r_{i,n}), given by r̄_i(j) = (Σ_{k=1}^{j} r_{i,k}) / j, for all i and j. Figure 8 displays these quantities for a homogeneous dataset (left) and for GaussMix (heterogeneous case, right). On the one hand, the left plot displays the ideal condition: there are no visible departures from the overall mean distance, which reassures us about the homogeneity assumption we made about the data. On the other hand, the right panel tells a different story. We immediately detect the different clusters in the data by focusing on their different starting values. The ergodic means remain approximately constant until the 500th NN, which corresponds to the size of each subgroup. The behavior of the cumulative means abruptly changes after the 500th NN, negating the presence of a unique manifold. This type of plot provides an empirical but valuable overview of the structure between data points, highlighting clusters that may reflect the existence of multiple manifolds.
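This diagnostic is easy to sketch directly: sort each point's distances to all other points and track the running mean. A minimal version (an assumption-free re-implementation, not the package's plotting code):

```r
# For each point, cumulative means of its sorted distances to all other points
nn_cumulative_means <- function(X) {
  dm <- as.matrix(dist(X))
  t(apply(dm, 1, function(r) {
    r <- sort(r[r > 0])          # drop the zero self-distance
    cumsum(r) / seq_along(r)     # running (ergodic) means
  }))
}

set.seed(7)
M <- nn_cumulative_means(matrix(rnorm(100 * 2), ncol = 2))
dim(M)  # one row per point, one column per NN order (here 100 x 99)
# matplot(t(M), type = "l") would draw one curve per point, as in the text
```

Because each row is a running mean of an increasing sequence, the curves are non-decreasing; jumps or stratified starting levels are the visual cues discussed above.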
These visual assessments help detect signs of the inappropriateness of the global ID assumption. The most direct approach to adopt in this case would be to divide the dataset into homogeneous subgroups and apply the TWO-NN estimator within each cluster. Such an approach is highlighted in the bottom panels of Figure 7. However, knowing ex-ante such well-separated groups is not a realistic expectation to have about actual data. Therefore, we will rely on Hidalgo(), the Bayesian finite mixture model for heterogeneous ID estimation described in Section 2.2.

Fitting the HIDALGO model
HIDALGO allows for the presence of multiple manifolds in the same dataset, yielding a vector of different estimated ID values. As already discussed, estimating a mixture model with Pareto components is challenging because of their extensive overlap. A naive model-based estimation can lead to inaccurate results, since there is no clear separation between the kernel densities. The extra term ∏_{i=1}^{n} π(N^(q)_i | z) added to the likelihood in Equation 8 induces local homogeneity, which helps identify the model parameters.
The adjacency matrix N^(q) can be easily computed by specifying two additional arguments in the function compute_mus():

• Nq: logical, if TRUE, the function adds the adjacency matrix to the output;
• q: integer, the number of NNs to be considered in the construction of the matrix N^(q). The default value is 3.
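Conceptually, N^(q) is the directed q-NN graph: entry (i, l) equals 1 when x_l is among the q nearest neighbors of x_i. A from-scratch sketch (not the package routine):

```r
# Build the q-NN adjacency matrix N^(q) directly from the data
build_Nq <- function(X, q = 3) {
  dm <- as.matrix(dist(X)); diag(dm) <- Inf   # exclude self-neighborhood
  n <- nrow(dm)
  Nq <- matrix(0L, n, n)
  for (i in seq_len(n)) Nq[i, order(dm[i, ])[seq_len(q)]] <- 1L
  Nq
}

set.seed(11)
Nq <- build_Nq(matrix(rnorm(50 * 2), ncol = 2), q = 3)
all(rowSums(Nq) == 3)   # each point has exactly q = 3 neighbors
```

Note that the matrix is generally asymmetric: x_l can be a neighbor of x_i without the converse holding, which is why the likelihood term distinguishes in- and out-neighbors.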
To provide an idea of the structure of the adjacency matrix N^(q), we report three examples obtained from a random sub-sample of the GaussMix dataset for increasing values of q. We display the heatmaps of the resulting matrices in Figure 9. Given this premise, we are now ready to discuss Hidalgo(), the high-level function that fits the Bayesian mixture. It implements the Gibbs sampler described in Appendix C, relying on low-level Rcpp routines. Also, the function internally calls compute_mus() to automatically generate the ratios of distances and the adjacency matrix needed to evaluate the likelihood from the data points.
The function has the following arguments: X, dist_mat, q, D, and

• K: integer, number of mixture components;
• nsim, burn_in, and thinning: number of MCMC iterations to collect, initial iterations to discard, and thinning interval, respectively;
• verbose: logical, if TRUE, the progress of the sampler is printed;
• xi: real number between 0 and 1, a local homogeneity parameter. The default is 0.75;
• alpha_Dirichlet: hyperparameter of the Dirichlet prior on the mixture weights;
• a0_d and b0_d: shape and rate parameters of the Gamma prior on d. The default is 1 for both values;
• prior_type: string, type of Gamma prior on d, which can be
  - "Conjugate": a classic Gamma prior is adopted (default);
  - "Truncated": a truncated Gamma prior on the interval (0, D) is used. This specification is advised when dealing with datasets characterized by a small number of columns, to avoid the estimated ID exceeding the nominal dimension D;
  - "Truncated_PointMass": same as "Truncated", but a point mass is placed on D. That is, the estimated ID is allowed to be exactly equal to the nominal dimension D;
• pi_mass: probability placed a priori on D when the "Truncated_PointMass" prior specification is chosen.
We apply the HIDALGO model to the GaussMix dataset with two different prior configurations: conjugate, and truncated with a point mass at D = 5. Once the models are run, we can print one of the returned objects to visualize a short summary of the run. By using alpha_Dirichlet = 0.05, we have adopted a sparse mixture modeling approach in the spirit of Malsiner-Walli, Frühwirth-Schnatter, and Grün (2016). The sparse mixture approach automatically lets the data estimate the number of mixture components required. As a consequence, the argument K should be interpreted as an upper bound on the number of active clusters. Nonetheless, we stress that estimating the number of well-separated clusters with Pareto kernels is challenging. Hence, we will discuss how to analyze the output to perform proper inference. The output object hid_fit is a list of class 'Hidalgo', containing six elements:

• cluster_prob: matrix of dimension nsim × K. Each column contains the MCMC sample of a mixing weight for every mixture component;
• membership_labels: matrix of dimension nsim × n. Each column contains the MCMC sample of a membership label for every observation;
• id_raw: matrix of dimension nsim × K. Each column contains the MCMC sample for the ID estimated in every cluster;
• id_postpr: matrix of dimension nsim × n. It contains a chain for each observation, corrected for label-switching;
• id_summary: a matrix containing the posterior mean and the 5%, 25%, 50%, 75%, 95% quantiles for each observation;
• recap: a list with the specifications passed to the function as inputs.
To inspect the output, we can employ the dedicated autoplot() function devised for objects of class 'Hidalgo'. Several arguments can be specified, producing different graphs. The most important is

• type: string that indicates the type of plot that is requested. It can be:
  - "raw_chains": plot the MCMC chains and the ergodic means, not corrected for label-switching (default);
  - "point_estimates": plot the posterior mean and median ID for each observation, along with their CRIs;
  - "class_plot": plot the estimated ID distributions stratified by the groups specified in an additional class vector;
  - "clustering": plot the posterior co-clustering matrix. Rows and columns can be stratified by an exogenous class and/or a clustering structure.
For example, we can plot the raw chains of the two models with the aid of the patchwork package:

R> autoplot(hid_fit) / autoplot(hid_fit_TR)
Plotting the traceplots of the elements in d allows us to assess the convergence of the algorithm. First, however, we need to be aware that these chains may suffer from label-switching issues, preventing us from directly drawing inference from the MCMC output. Due to label-switching, mixture components can be discarded, emptied, or repopulated across iterations. This behavior is observed in Figure 10, which shows the MCMC traceplots of the two models, with the ergodic means for each mixture component superimposed. In this type of plot, we can often notice that various chains overlap around the prior mean of d_k. These chains represent the parameters of the empty clusters, which are sampled from the prior; this is visible, for example, in the top panel of Figure 10. To address the label-switching issue and perform meaningful inference, the raw MCMC sample needs to be postprocessed. In Appendix D, we discuss the algorithm used to map the K chains to n observation-specific chains that can be employed for inference. The algorithm is already implemented in Hidalgo(), and produces the elements id_postpr and id_summary in the returned list. We can obtain a visual summary of the postprocessed estimates via

R> autoplot(hid_fit, type = "point_estimates") +
+    autoplot(hid_fit_TR, type = "point_estimates")

The resulting plots are shown in Figure 11. The panels display the mean and median ID estimates for each data point. Here, the separation of the data into different generating manifolds is evident. We also notice that some of the estimates in the conjugate case are incorrectly above the nominal value D = 5, once again justifying the need for a truncated prior. The default plots were modified with coord_cartesian(ylim = c(0, 5.5)) to highlight the effect of the truncation.

Estimated clustering solutions
It is natural to seek model-based clustering solutions when dealing with a mixture model. To this end, the key source of information is the posterior similarity, or co-clustering, matrix (PSM). The entries {s_{i,j}}_{i,j=1}^{n} of this matrix are computed as the proportion of times in which two observations have been assigned to the same mixture component across the MCMC iterations. Thus, the PSM describes the underlying clustering structure of the data detected by HIDALGO. Given the PSM, one can evaluate various loss functions on the space of the partitions. By minimizing a loss function, we can retrieve the optimal partition of the dataset into clusters. To obtain such an estimate, we rely on the function salso() from the R package salso (Dahl, Johnson, and Müller 2022a). Otherwise, a faster alternative proceeds by building a dendrogram from the implied posterior dissimilarity matrix (PDM), whose entries are obtained by transforming the corresponding entries of the PSM. Once the dendrogram is built, we can threshold it to segment the data into a pre-specified number of clusters K.
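The dendrogram route can be sketched with base R on a toy PSM. Note that the complement transform δ = 1 − s used below is one simple choice of dissimilarity and is an assumption here, not necessarily the package's exact definition:

```r
# Toy PSM with three blocks of strongly co-clustered observations
n <- 30; lab <- rep(1:3, each = 10)
psm <- outer(lab, lab, function(a, b) ifelse(a == b, 0.9, 0.1))
diag(psm) <- 1

pdm <- 1 - psm                               # dissimilarity (assumed transform)
hc  <- hclust(as.dist(pdm), method = "complete")
cl  <- cutree(hc, k = 3)                     # threshold at K = 3 clusters
table(cl, lab)                               # recovers the three blocks
```

With a real fit, `psm` would come from the sampled membership labels; the point of the sketch is only the PSM → PDM → dendrogram → cutree pipeline.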
These approaches are implemented in the dedicated function clustering(), which takes as arguments, along with the object output from the Hidalgo() function,

• clustering_method: string indicating the method used to perform the clustering. It can be "dendrogram" or "salso". The former thresholds the dendrogram constructed from the PDM to retrieve exactly K clusters. The latter estimates the optimal clustering solution by minimizing a loss function on the space of the partitions. The default loss function is the variation of information (VI, Wade and Ghahramani 2018). For additional details about the VI loss function, see Appendix E;
• K: integer, used when "dendrogram" is chosen. It corresponds to the number of clusters to recover when thresholding the dendrogram obtained from the PDM;
• nCores: integer, argument used in the functions called from package salso. It represents the number of cores used to compute the PSM and the optimal clustering solution.
Additional arguments can be passed to modify the partition estimation via salso(). Given the large sample size of the GaussMix dataset, we opt for the dendrogram approach, truncating the dendrogram at K = 3 groups. We highlight that relying on the minimization of a loss function is a more principled approach. However, the method can be misled by the strongly overlapping clusters estimated across the MCMC iterations, providing overly conservative solutions.
R> psm_cl <- clustering(object = hid_fit_TR,
+                       clustering_method = "dendrogram", K = 3, nCores = 5)
R> psm_cl

To visualize the results, we can also plot the PSM by passing an object of class 'Hidalgo' to autoplot() with type = "clustering". The autoplot() method internally calls the function clustering() to compute the PSM. One can also specify an additional argument class to stratify the observations according to exogenous factors. Finally, we remark that the user can also rely on partitioning around medoids applied to the dissimilarity matrix PDM, a robust clustering procedure implemented in the R package PReMiuM (Liverani, Hastie, Azizi, Papathomas, and Richardson 2015).

The presence of patterns in the data uncovered by the ID
Once the observation-specific estimates are computed, we can investigate the presence of potential patterns between the recovered IDs and given exogenous variables. To explore these possible relations, we can use the function id_by_class(). Along with an object of class 'Hidalgo', we need to specify

• class: factor, a variable used to stratify the ID posterior estimates.
For the GaussMix dataset, the exogenous information is contained in the class_GMix vector, which we pass as class.

R> id_by_class(object = hid_fit_TR, class = class_GMix)
Posterior ID by class:

The estimates in the three classes are very close to the ground truth. The same argument, class, can be passed to the autoplot() function, in combination with

• class_plot_type: string, if type = "class_plot", one can visualize the stratified ID estimates with a "density" plot or a "histogram", or using "boxplots" or "violin" plots;
• class: a vector containing class information used to stratify the observations;

to visualize ID estimates of the GaussMix dataset stratified by the generating manifold of the observations. As an example of possible graphs, Figure 12 shows the stratified boxplots (left panel) and histograms (right panel).
R> autoplot(hid_fit_TR, type = "class", class = class_GMix,
+           class_plot_type = "boxplot") +
+    autoplot(hid_fit_TR, type = "class", class = class_GMix,
+           class_plot_type = "histogram")

We have introduced and discussed the principal functions of the intRinsic package concerning the TWO-NN and HIDALGO models. Then, by employing simulated data with known IDs, we have suggested a pipeline to guide the analysis. In the next section, we present a real data analysis, highlighting how ID estimation can be used to effectively reduce the size of a dataset while capturing and preserving its important features.

The ID of gene microarray measurements
In this section, we present a real data example investigating the ID of the Alon dataset. The dataset, first presented in Alon et al. (1999), contains microarray measurements for 2000 genes measured on 62 patients. Among the patients, 40 were diagnosed with colon cancer, and 22 were healthy subjects. A factor variable named status describes the patient health condition (coded as "Cancer" vs. "Healthy"). A copy of this famous dataset can be found in the R package HiDimDA (Duarte Silva 2015). We store the gene measurements in the object Xalon, a matrix of nominal dimension D = 2000, with n = 62 observations. To load and prepare the data, we write:

R> data("AlonDS", package = "HiDimDA")
R> status <- factor(AlonDS$grouping, labels = c("Cancer", "Healthy"))
R> Xalon <- as.matrix(AlonDS[, -1])

To obtain a visual summary of the dataset, we plot the heatmap of the log-data values annotated by status. The result is shown in Figure 13. No clear structure is immediately visible.
We ultimately seek to uncover hidden patterns in this dataset. The task is challenging, especially given the small number of available observations. As a first step, we investigate how well a unique, global ID estimate can represent the data.

Homogeneous ID estimation
Let us start by describing the overall complexity of the dataset by estimating a homogeneous ID value. Using the TWO-NN model, we can compute the ID with the different methods described above; the resulting estimates are very similar.
The results are also illustrated in Figure 14, which shows the linear fit (left panel) and the posterior distribution (right panel) for the TWO-NN model. According to these results, we conclude that the information contained in the D = 2000 genes can be summarized with approximately ten variables. For example, the first ten eigenvalues computed from the spectral decomposition of the matrix Λ = X_Alon^⊤ X_Alon explain 95.4% of the total variance. Although the linear fit plot and the TWO-NN estimates do not raise any evident sign of concern, as a final check we explore the evolution of the average distances between NNs, reported in Figure 15. As expected, the plot does not highlight any abrupt change in the evolution of the ergodic means. However, it suggests that investigating the presence of multiple manifolds could be interesting: although most of the ergodic means evolve in a stationary manner, their heterogeneous levels highlight potential data inhomogeneities that deserve further investigation.

Heterogeneous ID estimation
To investigate the presence of heterogeneous latent manifolds in the Alon dataset, we employ Hidalgo(). Since the nominal dimension D is large, we do not need to truncate the prior on d.
Moreover, given the small number of data points, we opt for an informative and regularizing Gamma(1, 1) prior (the default) instead of a vague specification. We also set a conservative upper bound for the number of mixture components, K = 15, and again choose alpha_Dirichlet = 0.05 to fit a sparse mixture. We run:

R> set.seed(1234)
R> Alon_hid <- Hidalgo(X = Xalon, K = 15, a0_d = 1, b0_d = 1,
+                      alpha_Dirichlet = 0.05, nsim = 10000,
+                      burn_in = 100000, thin = 5)
R> Alon_hid

Once the model is fitted, we first explore the estimated clustering structure. Here, instead of directly plotting the heatmap of the PSM, we build the dendrogram from the PDM, reported in the top panel of Figure 16. We construct such a plot with the help of the package ggdendro (De Vries and Ripley 2022). We can detect four clusters, and therefore we set K = 4 when running

R> Alon_psm <- clustering(Alon_hid, K = 4)

As illustrated in the previous section, all these plots can be obtained by calling autoplot() with the proper argument specifications. The next natural step is to investigate how strongly the estimated partition and, more generally, the estimated IDs are associated with the health status, as shown in the bottom two panels of Figure 16.

Table 7: Confusion matrices summarizing the classification performance of the two random forest models, trained with the original gene expressions (rfm1, left) and with a summary of the ID estimates (rfm2, right).

R> id_by_class(Alon_hid, class = Alon_psm$clust)
to obtain summary results linking the variations in the ID with health status and cluster membership. We report the output in Table 6. As the estimated ID increases, the proportion of healthy subjects in each cluster decreases. This result suggests that the microarray profiles of people diagnosed with cancer are slightly more complex than those of healthy patients.
The analyses conducted so far helped us uncover interesting descriptive characteristics of the IDs in the dataset. Nevertheless, these results can also be used as a representative data summary. Here, we show how the estimated individual IDs can help classify the health status of new patients according to their genomic profiles. As a simple example, we perform a classification analysis with two random forest models, predicting the target variable Y = status. To train the models, we use two different sets of covariates: X_OR, the original dataset composed of 2000 genes,

R> X_OR <- data.frame(Y = status, X = Xalon)
R> set.seed(1231)
R> rfm1 <- randomForest::randomForest(Y ~ ., data = X_OR,
+                                     type = "classification", ntree = 100)

and X_ID, the observation-specific ID summary returned by Hidalgo(), along with our estimated partition.
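The construction of the second covariate set is not shown above; the following is a hypothetical sketch of its shape, using simulated stand-ins for the id_summary element and the estimated partition (the column names and values below are assumptions for illustration, not the package's exact output):

```r
# Simulated stand-ins with the same shape as the real objects:
# id_summary (posterior mean + five quantiles per patient) and a partition.
set.seed(1231)
n <- 62
id_summary <- data.frame(mean = runif(n, 8, 12),
                         q05  = runif(n, 7, 8),  q25 = runif(n, 8, 9),
                         q50  = runif(n, 9, 10), q75 = runif(n, 10, 11),
                         q95  = runif(n, 11, 12))
clust  <- factor(sample(1:4, n, replace = TRUE))
status <- factor(sample(c("Cancer", "Healthy"), n, replace = TRUE))

# Seven covariates in total: six ID summaries plus the cluster label
X_ID <- data.frame(Y = status, id_summary, cluster = clust)
# rfm2 <- randomForest::randomForest(Y ~ ., data = X_ID,
#                                    type = "classification", ntree = 100)
ncol(X_ID) - 1   # 7 covariates
```

With the real objects in place of the stand-ins, the commented randomForest call mirrors the rfm1 fit above.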
Remarkably, a simple dataset with seven variables summarizing the main distributional traits of the observation-specific posterior IDs achieves good performance in predicting the health status, similar to that of the original dataset. More precisely, the random forest on the original dataset attained an out-of-bag estimated error rate of 20.97%, while the error is reduced to 17.74% when using our ID-based covariates. We can conclude that, in this case, the topological properties of the dataset are associated with the outcome of interest and convey important information.
We showed how the estimation of heterogeneous ID provides a reliable complexity index for elaborate data structures and helps unveil relationships among data points hidden at the topological level. The application to the Alon dataset showcases how reliable ID estimates give additional fundamental perspectives that help us discover non-trivial data patterns. Furthermore, one can exploit the extracted information in many downstream investigations, such as patient segmentation or predictive analyses.

Summary and discussion
In this paper, we illustrated intRinsic, an R package that implements novel routines for ID estimation according to the models recently developed in Facco et al. (2017), Allegra et al. (2020), Denti et al. (2022), and Santos-Fernandez et al. (2022). intRinsic consists of a collection of high-level, user-friendly functions that, in turn, rely on efficient, low-level routines implemented in R and C++. We also remark that intRinsic integrates functionalities from external packages. For example, all the graphical outputs returned by the functions are built with the well-known package ggplot2 and are therefore easily customizable using the grammar of graphics (Wilkinson 2005).
The package includes frequentist and Bayesian model specifications for the TWO-NN global ID estimator. Moreover, it implements the Gibbs sampler for posterior simulation of the HIDALGO model, which can capture the presence of heterogeneous ID within a single dataset. We showed how discovering multiple latent manifolds could help unveil the topological traits of a dataset, primarily when additional exogenous variables are used to stratify the ID estimates.
As a general analysis pipeline for practitioners, we suggested starting with the efficient TWO-NN functions to understand how appropriate the hypothesis of homogeneity is for the data at hand. If departures from the assumptions are visible from nonuniform estimates obtained with different estimation methods and from visual assessment of the evolution of the average NN distances, one should rely on HIDALGO.
The most promising future research directions stem from HIDALGO. First, we plan to develop more reliable methods to obtain an optimal partition of the data based on the ID estimates since the one proposed heavily relies on a mixture model of overlapping distributions. Moreover, another research avenue worth exploring is a version of HIDALGO with likelihood distributions based on generalized NN ratios, exploiting the information coming from varying neighborhood sizes.
We also know that the mixture model fitting may become computationally expensive if the analyzed datasets are large. Therefore, faster solutions, such as the Variational Bayes approach, will be explored. Also, we highlight that HIDALGO, a mixture model within a Bayesian framework, lacks a frequentist-based estimation counterpart, such as an expectation maximization algorithm. Its derivation is not immediate since the neighboring structure introduced via the N (q) matrix makes the problem non-trivial. We plan to keep working on this package and continuously update it in the long run as contributions to this line of research become available. The novel ID estimators we discussed have started a lively research branch, and we intend to include all the future advancements in intRinsic.

A. Additional methods implemented in the package
In this paper, we have focused our attention on the TWO-NN and the HIDALGO models. In Section 2, we explained that both methods are based on the distributional properties of the ratios of distances between a point and its first two NNs. However, this modeling framework has been extended by Denti et al. (2022), where the authors developed a novel ID estimator called GRIDE. This new estimator is based upon the ratios of distances between a point and two of its NNs of generic order, namely n 1 and n 2 . Extending the neighborhood size leads to two major implications: more stringent local homogeneity assumptions and the possibility of computing ID estimates as a function of the chosen NN orders. Monitoring the ID evolution as the order of the furthest NN n 2 increases allows the extraction of meaningful information regarding the link between the ID and the scale of the considered neighborhood. In doing so, GRIDE produces estimates that are more robust to noise present in the data, which is not directly addressed by the model formulation.
The GRIDE model is implemented in intRinsic, and the estimation can be carried out under both the frequentist and Bayesian frameworks via the function gride(), which is very similar to twonn() in its usage. Additionally, one can use the functions twonn_decimation() and gride_evolution() to study the ID dynamics. More details about these functions are available in the package documentation.
The map in Figure 17 provides a visual summary of the most important functions contained in the package. The main topics are reported in the diamonds, while the high-level, exported functions are displayed in the blue rectangles. These routines are linked to the (principal) lowlevel function via dotted lines. Finally, the light-blue area highlights the functions discussed in this paper.

B. intRinsic and other packages
As mentioned in Section 1, a large number of ID estimators are available in the literature, and many of them have been implemented in R. A valuable survey of the available dimensionality reduction methods and ID estimators has recently been reported in You (2022b). From there, we see that two packages are the most important when it comes to ID estimation: Rdimtools and intrinsicDimension. Rdimtools, in particular, comprises an unprecedented collection of methods, including the least-squares TWO-NN.
At the moment of writing, two main traits of intRinsic are unique to this package. First, our package is devoted to the recently proposed likelihood-based estimation methods introduced by the seminal work of Facco et al. (2017) and the literature that followed. Therefore, as of today, many of the R implementations presented here are exclusively contained in this package. This is true, for example, for the MLE and Bayesian versions of the TWO-NN model, the HIDALGO model, and all the routines linked to GRIDE. To the best of our knowledge, implementations of HIDALGO outside this package can only be found in GitHub repositories, coded in Python and C++. Note that Python versions of the TWO-NN estimator have also been implemented in the recent scikit-dimension and DADApy packages (Bac et al. 2021; Glielmo et al. 2022). Moreover, DADApy contains routines dedicated to GRIDE. Table 8 presents a summary of recent software packages containing ensembles of ID estimation methods. Second, all the functions in our package allow for, and emphasize, the uncertainty quantification around the ID estimates, a crucial component granted by our model-based approach. This feature is often overlooked in other implementations.
Overall, the wide variety of methods and ongoing research in this area indicate that there is no globally optimal estimator to employ regardless of the application. Thus, a practitioner should be aware of the strengths and limitations of every method.
One limitation of the likelihood-based models offered in this package, shared with many other ID estimators, is the underestimation of the ID when the true latent manifold's dimension is large. As an empirical rule, when the estimated ID is large (e.g., d > 20), the retrieved value should be cautiously regarded as a lower bound for the actual ID (Ansuini et al. 2019). An alternative method we found particularly robust to this issue is the expected simplex skewness (ESS) algorithm proposed by Johnsson, Soneson, and Fontes (2015). For example, consider 5000 observations sampled from a D = d = 50 dimensional Gaussian distribution. In this setting, twonn() underestimates the true ID, which the ESS instead recovers well.
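The original code is not reproduced above; the underestimation itself can be illustrated with a self-contained TWO-NN MLE computed from scratch (the ESS comparison would require an external package such as intrinsicDimension, whose API we do not assume here; a smaller sample is used to keep the sketch fast):

```r
# High-dimensional Gaussian: the TWO-NN MLE falls below the true ID of 50
set.seed(12345)
n <- 2000; d_true <- 50
X <- matrix(rnorm(n * d_true), ncol = d_true)
dm <- as.matrix(dist(X)); diag(dm) <- Inf
mu <- apply(dm, 1, function(r) { s <- sort(r)[1:2]; s[2] / s[1] })
id_hat <- (n - 1) / sum(log(mu))   # unbiased TWO-NN MLE
id_hat                             # below the true value of 50
```

The intuition is that in high dimension the nearest-neighbor distances are no longer small relative to the scale over which the density varies, violating the local homogeneity assumption behind the Pareto model.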

C. Gibbs sampler for Hidalgo()
The steps of the Gibbs sampler are the following:

1. Sample the mixture weights from their full conditional, a Dirichlet distribution whose parameters are updated with the cluster frequencies.

2. Let z_{-i} denote the vector z without its i-th element. Sample the cluster indicators z_i from their full conditionals, which depend on (z_1, ..., z_{i-1}, k, z_{i+1}, ..., z_n, d). We emphasize that, given the new likelihood we are considering, the cluster labels are no longer independent given all the other parameters. Let us define z_i^k = (z_1, ..., z_{i-1}, k, z_{i+1}, ..., z_n). Then, let N_{z_i}(z_{-i}) be the number of elements in the (n - 1)-dimensional vector z_{-i} that are assigned to the manifold (mixture component) indicated by the label z_i. Moreover, let m_i^in = Σ_l N^(q)_{l,i} 1{z_l = z_i} be the number of points sampled from the same manifold as the i-th observation that have x_i as a neighbor, and let n_i^in(z) = Σ_l N^(q)_{i,l} 1{z_l = z_i} ≤ q be the number of neighbors of x_i sampled from the same manifold. With these quantities, we can simplify the previous formula, obtaining the corresponding full conditional.

3. Sample each d_k from its full conditional. For example, if G_0 is assumed to be a truncated Gamma with a point mass at D, define a* = a_0 + n_k and b* = b_0 + Σ_{i: z_i = k} log μ_i; then we obtain

d_k | · · · ∼ (ρ*_1 / (ρ*_1 + ρ*_0)) Gamma(a*, b*) 1_{(0,D)}(·) + (ρ*_0 / (ρ*_1 + ρ*_0)) δ_D(·),

where ρ*_1 = ρ · (C_{a*,b*,D} / C_{a,b,D}) and ρ*_0 = (1 - ρ) · D^{n_k} · exp{-D Σ_{i: z_i = k} log μ_i}.

D. Postprocessing to address label-switching
The postprocessing procedure adopted for the raw chains fitted by Hidalgo() works as follows.
Recall that we are working with n observations and K mixture components. Let us consider an MCMC sample of length T, and denote a particular MCMC iteration with t, t = 1, ..., T. Let z_i(t) indicate the cluster membership of observation i at the t-th iteration, with i = 1, ..., n. Similarly, d_k(t) represents the value of the estimated ID in the k-th mixture component at the t-th iteration, where k = 1, ..., K.
We map the K chains of the parameters in d to each data point via the values of z. That is, we construct n chains, one for each observation. At the t-th iteration, we will have {d_{z_i(t)}(t)}_{i=1}^{n}. In so doing, we obtain a collection of n chains that link every observation to its ID estimate. Once the chains have been postprocessed, the local, observation-specific ID can be estimated by the ergodic mean or median.
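The mapping can be sketched in a few lines of R. The objects below are mock-ups kept self-contained for illustration (in practice, the chains would come from a fitted Hidalgo() run): d_chain is a T x K matrix of posterior draws for (d_1, ..., d_K) and z_chain a T x n matrix of cluster labels.

```r
# Mock chains: T_iter iterations, K components, n observations
set.seed(42)
T_iter <- 100; K <- 3; n <- 10
d_chain <- matrix(rgamma(T_iter * K, shape = 2), T_iter, K)
z_chain <- matrix(sample.int(K, T_iter * n, replace = TRUE), T_iter, n)

# Map the K parameter chains to n observation-specific chains via z:
# at iteration t, observation i is linked to d_{z_i(t)}(t).
obs_chains <- sapply(seq_len(n), function(i)
  d_chain[cbind(seq_len(T_iter), z_chain[, i])])

# Local, observation-specific ID estimates (ergodic mean or median)
id_mean   <- colMeans(obs_chains)
id_median <- apply(obs_chains, 2, median)
```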

E. The variation of information metric
This section provides additional details regarding the variation of information (VI) distance between partitions, a quantity often employed to estimate optimal posterior clustering configurations.
Wade and Ghahramani (2018) used Equation 11 as a loss function to measure the discrepancy between the target posterior partition ρ_1 = ρ and an estimated one ρ_2 = ρ̂. By minimizing the posterior expectation of L_VI(ρ, ρ̂) with respect to ρ̂, one obtains the optimal clustering solution ρ̂* = argmin_{ρ̂} E[L_VI(ρ, ρ̂) | D], where D denotes the data. Unfortunately, minimizing this quantity (or one of its variations) over the partition space is extremely challenging. Therefore, numerous authors have focused on developing efficient algorithms for this task; see, for example, Wade and Ghahramani (2018), Rastelli and Friel (2018), and the review in Dahl et al. (2022b).
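For intuition, the VI between two label vectors is straightforward to compute from their contingency table via the identity VI(ρ_1, ρ_2) = 2H(ρ_1, ρ_2) − H(ρ_1) − H(ρ_2). The following base-R function is an illustrative sketch (using base-2 logarithms), not the routine used by the algorithms cited above.

```r
# Variation of information between two partitions, given as label vectors.
# H() is the Shannon entropy (base 2) of a probability vector.
vi_dist <- function(c1, c2) {
  n   <- length(c1)
  p12 <- table(c1, c2) / n          # joint relative frequencies
  p1  <- rowSums(p12)
  p2  <- colSums(p12)
  H <- function(p) -sum(p[p > 0] * log2(p[p > 0]))
  2 * H(as.vector(p12)) - H(p1) - H(p2)
}
```

Identical partitions (up to a relabeling of the clusters) are at VI distance zero, since the metric depends only on the induced set partition, not on the labels themselves.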

F. Global and local intrinsic dimensions
In Section 2.2, we introduced HIDALGO as a heterogeneous ID estimator. The Bayesian mixture segments the data into groups, each belonging to a specific manifold with a specific ID. However, to be more precise, we provide additional comments on the meaning of the word heterogeneous. In the ID literature, there is a clear distinction between global and local ID estimators.
On the one hand, methods in the former group estimate a single ID value for the whole dataset. A single measure for the complexity of the data is extremely useful, for example, as a starting point for dimensionality reduction techniques.
On the other hand, the latter group contains methods that attribute a specific ID to each data point. Such output is instrumental in monitoring the behavior of the ID across the entire dataset to detect significant differences in its topology. Moreover, local ID estimation has an impact when employed for subspace outlier detection, subspace clustering, or, more generally, other applications in which the ID can vary from location to location (Amsaleg, Chelly, Houle, Kawarabayashi, Radovanović, and Treeratanajaru 2019). Finally, we can recover a global ID value by aggregating local ID estimates.
Broadly speaking, HIDALGO can be seen as both a global and a local estimator. Its mixture formulation allows the segmentation of the observations into homogeneous and spatially related model-based clusters. Hence, we can see HIDALGO as an estimator of multiple global IDs (one for each manifold, but not for each point). Nonetheless, we can take advantage of the Bayesian framework and of the postprocessing procedure we propose to deal with the label-switching issue (see Appendix D for more details). Indeed, while solving label-switching, our procedure delivers a valuable byproduct: by mapping the K mixture parameters {d_k}_{k=1}^{K} into n observation-specific chains, we can effortlessly obtain an ID value for each data point. For example, see Figure 11, where observation-specific posterior mean and median ID estimates are displayed.

G. System configuration
We obtained the results in this vignette by running our R code on a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 processor.