tidypaleo: Visualizing Paleoenvironmental Archives Using ggplot2

This paper presents the tidypaleo package for R, which enables high-quality reproducible visualizations of time-stratigraphic multivariate data that is common to several disciplines of the natural sciences. Rather than introduce new plotting functions, the tidypaleo package defines several orthogonal components of the ggplot2 package that, when combined, enable most types of stratigraphic diagrams to be created. We do so by conceptualizing multi-parameter data as a series of measurements (rows) with attributes (columns), enabling the use of the ggplot2 facet mechanism to display multi-parameter data. The orthogonal components include (1) scales that represent relative abundance and concentration values, (2) geometries that are commonly used in paleoenvironmental diagrams created elsewhere, (3) facets that correctly assign scales and sizes to panels representing multiple parameters, and (4) theme elements that enable tidypaleo to create elegant graphics. Collectively, this approach demonstrates the efficacy of a minimal ggplot2 wrapper to create domain-specific plots.


Introduction
Paleoenvironmental archives are critical to our understanding of the past, present, and future (Smol 2009). Analysis of rocks, soils, sediments, and fossils from throughout Earth's history is a powerful approach to reconstruct environmental conditions in the absence of direct human measurements (Smol 2010). For example, tree rings, ice cores, and lake sediments can be used to reconstruct environmental conditions over hundreds to thousands of years (Cohen 2003). Ocean sediments, loess deposits, and ancient soils can be used to reconstruct environmental conditions over hundreds to millions of years (Williams et al. 2018). Collectively, paleoenvironmental archives are our link to the past and inform our predictions about the future environment.
Effective visualization of these complex, multivariate data matters. For example, researchers have used paleoenvironmental archives from previous periods of rapid warming to understand our current warming planet (Vincent and Cwynar 2016). Aquatic scientists have used lake sediment archives to recognize that lakes acidified as a result of sulfur emissions from coal burning (Charles et al. 1990). Environmental scientists have used records from lake and ocean sediments to confirm the source of potentially toxic pollutants (Dunnington et al. 2020). These are data that directly inform policy, and lead to legislation (Smol 2009); high-quality reproducible figures are critical if these data are to be understood and trusted.
tidypaleo is a package for R statistical software (R Core Team 2021) to create effective visualizations for paleoenvironmental data. These data have several characteristics that distinguish them from traditional time series data (Dunnington and Spooner 2018): • They are generally multi-proxy and multi-archive, in that evidence from multiple parameters (e.g., the relative abundance of dozens of taxa and/or geochemical measures) and multiple archives (e.g., two lake sediment cores from different locations) must be interpreted collectively. • The connection between position in the archive (e.g., depth in the archive) and calendar age (e.g., AD 1900±15) is important and both must be communicated alongside parameter measurements. • These archives tend to be oriented vertically (e.g., an ice core), and tend to be plotted vertically with parameter measurements on the x-axis and depth or time on the y-axis.
Because of the unique nature of paleoenvironmental data, constructing diagrams that are elegant, technically correct, and reproducible is challenging and often time-consuming. Whereas previous software packages for creating stratigraphic diagrams (e.g., C2, Tillia*Graph, rioja, and analogue) use a graphical user interface or base R plotting approach (Grimm 2016;Juggins 2011Juggins , 2020Simpson 2007), in the tidypaleo package, we use the defaults and flexible interface of the ggplot2 package (Wickham 2016) as a base on which effective paleoenvironmental diagrams can be built. Package tidypaleo (Dunnington 2022) is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=tidypaleo.

Example data
In this paper, we use the kellys_lake_geochem and kellys_lake_cladocera data set, which contain geochemical measurements and microfossil zooplankton (Cladocera) counts from Kellys Lake, Nova Scotia, Canada, to demonstrate the features of the tidypaleo package ( Figure 1-2).

Design and implementation
Like ggplot2, the tidypaleo package provides a number of reusable components that can be combined in flexible and powerful ways to communicate a wide variety of data types. Here, we use the structure laid out by Wilkinson (2005) when describing the Grammar of Graphics.

Data
Essential to tidypaleo is a tabular data structure composed of one row per measurement. Columns contain information about each measurement, including common dimensions (for example depth, age, core identifier, and parameter measured), and values specific to each measurement (e.g., measured values, units, and errors). This form of data provides an opportunity for measurement-level details to be retained that is not possible with a more traditional "tidy" format (Wickham 2014;Dunnington and Spooner 2018), where columns are mix of common dimensions and which parameter was measured, rows represent individual sediment samples, and cells can only represent a single value. Data with measurements as rows and attributes as columns (e.g., core identifier and parameter name) allows a natural use of the ggplot2 grouping and facet mechanisms to plot multiple parameters and locations. Importantly, this also supports communication of error, since value error is another column in the data structure. The kellys_lake_geochem data set used in Figure 1 is shown below. Most data are not provided to or entered by paleoenvironmental researchers in this format. A more common storage/data entry format is a spreadsheet with one row per sample and one column per geochemical parameter and/or taxon. For geochemical data, paired columns are required to store error and/or detection limits for each measurement (Dunnington and Spooner 2018). To read and convert a spreadsheet to the form required by tidypaleo, one can use the readxl and tidyr packages (Wickham and Bryan 2019;Wickham 2021). In particular, the pivot_longer() function from the tidyr package converts the table from a form where each row represents a sample to one where each row represents a measurement. Because the table contains error information in addition to measurement values for each parameter, two pivot operations are needed with a join to ensure that error information is available in the final table.

Scales
Paleoenvironmental data in a one row per measurement structure has several common data types that should be scaled differently when passed to ggplot2. Discrete variables such as location, parameter, and sample groupings, such as zones, are well-represented by the existing discrete scales in ggplot2. Continuous variables require special consideration, including archive position (typically depth), age, relative abundance, and concentration values.
Position in the archive (e.g., depth) and its age (e.g., year common era (CE) or years before present) are values that are related by a monotonic transformation. Whereas position in the archive is known to a high degree of precision, age values are typically estimated, and communicating uncertainty around this value is essential. In the tidypaleo package, the relationship between archive position, age, and age uncertainty is represented by an age_depth_model(). An age_depth_model() is constructed using previously estimated age and depth values, and provides various options for interpolating and extrapolating. Default interpolation and extrapolation provides a reasonable approximation for visualization, although ideally these values should be provided at a high enough resolution so that interpolation and extrapolation is minimal. Age-depth models can be passed to scale_(x|y)_depth_age() and scale_(x|y)_age_depth(), which use ggplot2's sec_axis() framework to communicate both age and depth ( Figure 3). These scales also enforce the convention that time should be visualized from bottom to top when on the y-axis, and from left to right when on the x-axis. R> data("kellys_lake_ages", package = "tidypaleo") R> kellys_adm <-age_depth_model(depth = kellys_lake_ages$depth, + age = kellys_lake_ages$age_ad) R> kellys_geochem_plot + + scale_y_depth_age(kellys_adm, age_name = "Year CE") Relative abundance of microfossils is a common type of value in paleoenvironmental diagrams.
Relative abundance values are always zero or positive, and breaks should always occur at the same intervals on all panels. Because negative values are impossible, zero should always be the minimum limit. Expansion below zero can be misleading and tends to produce unnecessary space. Expansion above the maximum value should be a fixed amount (rather than the default 5%) to keep spacing between panels uniform. These defaults are encapsulated in the scale_(x|y)_abundance() scales, which wrap ggplot2::scale_(x|y)_continuous() to produce the optimal labels, limits, and expansion for relative abundance values.
Concentration values are also common in paleoenvironmental diagrams. Concentration values are theoretically always positive, although in practice values below detection or quantification limits are frequently (if incorrectly) encoded as zeroes. Concentration values are usually wellrepresented by the default continuous scale, which scales to the minimum and maximum of the data. Occasionally it is useful to convey the relative change in concentration between parameters, in which setting the bottom limit to zero (limits = c(0, NA)) is appropriate. Similarly, log-scales are appropriate for some parameters (e.g., 210 Pb activities). Because the scale modifications required for concentration values are minimal, the tidypaleo package does not provide specific scales for concentration values.

Geometries
Two common visual representations of relative abundance values are difficult to recreate using existing ggplot2 geometries. First, vertical or horizontal segments drawn from the xor y-axis to the relative abundance value are common and recommended for some types of data (Juggins and Telford 2012). The tidypaleo package provides geom_col_segs() to create this type of visual representation. This geometry is implemented as a subclass of 'ggplot2::GeomSegment', and is parameterized identically to geom_col(), geom_area(), geom_point(), and geom_line(). Second, "exaggerations" are common to communicate low-magnitude variability when one or more large relative abundance values obscure this trend. From a Grammar of Graphics perspective, these "exaggerations" are statistics, as they modify the original data in such a way that existing graphical representations can be used to draw them (Wilkinson 2005). In tidypaleo we implement exaggerations as subclassses of the existing 'ggplot2::Geom' classes, because implementing them as subclassess of 'ggplot2::Stat' results in the scales expanding to include all exaggerated values. The tidypaleo package provides geom_point_exaggerate(), geom_line_exaggerate(), and geom_area_exaggerate() to create this type of graphical representation (Figure 4).
R> patchwork::wrap_plots( + kellys_demo_base + geom_col_segsh(), + kellys_demo_base + geom_lineh(), + kellys_demo_base + geom_col_segsh() + geom_lineh(), + kellys_demo_base + geom_areah(), + nrow = 1) Communicating error is important for geochemical data, age-depth models, and quantities calculated from bioindicator data such as the results of inference models. Communicating error in stratigraphic diagrams is infrequently discussed and we suspect that the complexity of adding error bars to multi-panel diagrams has contributed to their infrequent use. Geometries that communicate error (e.g., geom_errorbar() and geom_errorbarh()) are already included in ggplot2; however, the data structure used by tidypaleo is well-suited to using them with minimal effort (Figure 6), particularly for those already familiar with adding error bars to diagrams created in ggplot2.

Facets
In the Grammar of Graphics, facets are defined as displaying subsets of a data set in panels of the same graphic (Wilkinson 2005). Using a data structure with one row per measurement, plotting multiple parameters on same plot using facets is a natural way to represent these data. In ggplot2, facets also coordinate the scales and labels for each panel. As noted in Section 3.2, different parameter types can have different scaling requirements. Furthermore, proper labeling of geochemical parameters and species names can be challenging. Facets are a useful way to solve these challenges for diagrams with a common data type. The tidypaleo package provides facet types for two common cases: relative abundance data and geochemical data.
Facets for relative abundance data (facet_abundance() and facet_abundanceh()) wrap ggplot2::facet_grid(), applying best-practice rules for the relationship between abundance scales. When plotted, relative abundance values on each panel should have the same weight (i.e., 5% on one panel should take up the same amount of space as 5% on another panel; space = "free_x" or space = "free_y"). Facet labels must have species names italicized for publication in most journals, but modifiers (e.g., strain III) must not be italicized. This constraint can be accommodated by setting the default labeller to a function that understands the form of most species input (label_species()). Finally, facet labels are typically too long to fit horizontally above each panel and must be rotated to be legible. While rotating a facet label is possible using ggplot2 theme modifications, eliminating the horizontal clip is not. Control over the clip parameter of the strip text is likely in a future ggplot2 version, however packaging both as part of the facet is a way make this implementation detail transparent to the user.
Panels representing different geochemical parameters do not require space = "free_x" or space = "free_y", and thus can wrap either facet_wrap() or facet_grid(). The tidypaleo package provides facet_geochem_wrap(), facet_geochem_grid(), facet_geochem_wraph(), and facet_geochem_gridh() for geochemical measurements (Figure 1). Scales should be independent between panels representing different parameters, as fixed scales could hide meaningful trends in parameters with a smaller absolute magnitude (for this reason, it is also generally not appropriate to display two geochemical parameters on the same panel). Units of measurement may be different between panels, and thus need to be included in the facet label ( Figure 7). Furthermore, many common parameter names contain non-ASCII char-acters that require R plotmath to display on all graphics devices. Both of these labeling constraints are practically difficult to achieve, so we include label_geochem() as the default labeller to convert common ASCII representations of parameter names to parseable plotmath and provide an interface to specify measurement units. Finally, including many parameters on a vertically-oriented plot inevitably results in overlapping x-axis labels. Because of this, we also rotate x-axis labels by 90 degrees by default in facet_geochem_wraph() and facet_geochem_gridh().
The facets included in the tidypaleo package do not solve the problem of paleoenvironmental diagrams that display more than one data type; however, the patchwork package provides syntax to align multiple ggplot2 plots and assign common elements among them (Pedersen 2020). Creating diagrams in this way imposes the constraint that plots with the same data type must be grouped. This constraint makes it easier to describe any differences in plot scales (e.g., for panels where a log scale is appropriate), and we do not think a more complex implementation of a facet would result in better diagrams or syntax.

Theme elements
This package includes theme_paleo(), which is a minimally-modified version of the function ggplot2::theme_bw() that removes the grey background of panel labels. This reflects the look and feel of these diagrams created by other software. Elements of the abundance and geochemical facets are also available as theme modifiers, including rotated_axis_labels() and rotated_facet_labels().

Statistical helpers
The definition of data used by tidypaleo is different than the type of data that is needed to compute common multivariate summaries such as ordinations and/or stratigraphicallyconstrained cluster analysis (Juggins 2020;Grimm 1987). To facilitate visualizing the results of these analyses, tidypaleo provides a framework for transforming one-row-per-measurement data into a data frame suitable for input to most multivariate summaries (generally one row per sample). The tidypaleo package exposes the most common analyses (principal components analysis (PCA) and CONISS) and transformation options with the nested_*() functions. The example below computes a PCA for each location in the data set using the nested_data() and nested_prcomp() functions.
R> keji_lakes_prcomp <-keji_lakes_plottable %>% + group_by(location) %>% + nested_data(qualifiers = depth, key = taxon, value = rel_abund, + trans = sqrt) %>% + nested_prcomp() The purpose of the nested_data() function is to compute a data frame with one row per sample and one column per predictor, which is usually a geochemical parameter or taxon. The options provided by nested_data() include (1) the transformation to apply to each column, (2) the filter to apply to the data, and (3) the selection criteria to apply to the data. Option (1) can be used to apply scaling appropriate for a specific statistical analysis (e.g., scale() for PCA); options (2) and (3) can be used to handle non-finite values resulting from parameter measurements on different samples. The keji_lakes_plottable data set does not contain any non-finite values and are on a common scale (relative abundance), but applying a squareroot transformation makes euclidean distance (preserved by PCA) a reasonable measure of dispersion between samples (Legendre and Birks 2012).
A single data frame can be obtained from the result of a nested_*() function by calling unnested_data() on one or more list-columns whose values contain nested data frames with the same number of rows. For example, the object returned by nested_prcomp() has columns qualifiers and scores, both of which are aligned row wise (each contain one row per sample). The unnested_data() function can combine these columns into a single data frame whose value is more useful. Whereas experienced users may prefer to use their existing knowledge of data frame manipulation to fit one or more ordinations and/or cluster analyses to their data, new users may prefer the nested_data(), nested_prcomp(), and nested_chlust() functions. These functions are designed to produce reasonable results for common analyses in most situations; however, as analyses become more specialized and users become more familiar with data frame manipulation, we expect that users will transition to more generic approaches.

Example
A motivating example for the development of this package was the ability to create stratigraphic diagrams of multiple archives with statistical summaries. For relative abundance data, it is common to plot the results of an ordination and a cluster analysis alongside the raw data. For this example, we will use diatom count data from lakes in Kejimkujik National Park, Nova Scotia, Canada sourced from the Neotoma paleoecological database (Goring et al. 2015;Ginn, Cumming, and Smol 2007).
The stratigraphic diagram for these lakes is an example where exaggerated geometries are useful: while common scales for the same taxa between lakes correctly communicates the difference in magnitude, relative trends are difficult to assess for some taxa without the exaggerated geometry (geom_areah_exaggerate()). Relative abundance scales with comparable size across panels are added via facet_abundanceh(), which also correctly labels taxa with a rotated facet label.
One measure of the quality of a stratigraphically-constrained cluster analysis is a brokenstick analysis to obtain the number of statistically plausible groups (Bennett 1996).
These results can be obtained by unnesting the broken_stick column in the result of nested_coniss_chclust(). For both cores, the cluster analysis identified 3 groups whose dispersion was more than would be expected from a cluster analysis of a random shuffle of the samples. The dendrogram associated with these cluster analyses can be added to a plot using function layer_dendrogram(). Because the scale of the dendrogram is not in relative abundance, a separate plot is needed R> dendro_plot <-ggplot() + + layer_dendrogram(keji_coniss, aes(y = depth), label = "CONISS") + + scale_y_reverse() + + facet_grid(vars(location), vars(label), scales = "free_y") + + labs(y = NULL, x = "Dispersion")

R>
These plots can be combined using patchwork::wrap_plots(), resulting in the finished stratigraphic plot (Figure 8). In this case we have chosen to align the relative abundance scales to communicate trend between cores at the expense of trend within each core being less clear for some taxa. The patchwork::wrap_plots() function can also be used to stack stratigraphic diagrams without aligning the relative abundance scales between cores.

Discussion
Several programs are available that are capable of producing paleoenvironmental diagrams of the type produced by this package. In R, the analogue and rioja packages have functions that specifically produce stratigraphic diagrams (Simpson 2007;Juggins 2020). Functions in both these packages require data with parameters as columns and use dozens of arguments to a single function to control the various options for display. This makes the functions highly useful for specific types of data, but limit the ease with which multiple plots can be aligned and the ease with which error can be communicated.
Several desktop applications are also commonly used to create paleoenvironmental diagrams, including Tilla*Graph (Grimm 2016) and C2 (Juggins 2011). Tilla*Graph creates graphics that are highly customizable, and the program includes the ability to visualize core lithology in addition to plots of multiple parameters. C2 also creates graphics that are highly customizable, and the program includes the ability to use two y-axes next to each other to clearly communicate possible changes to the sedimentation rate. Both programs share the disadvantages of many desktop applications, notably, these applications are only available for Windows, and the need for point-and-click edits introduces non-reproducibility if the figure data must be updated.
The tidypaleo package has attempted to build on the best from each of these software packages: functions from analogue and rioja produce excellent reproducible representations of relative abundance data and include the ability to visualize hierarchical clusters as zones or dendrograms as part of the visualization. Tilla*Graph and C2 produce visually appealing diagrams and have interfaces that are well-suited to a non-technical audience. While tidypaleo does require coding to make a publishable diagram, there are a large number of resources from which users can draw to learn ggplot2 (Wickham and Grolemund 2017;Wickham 2016;Healy 2018). We think this also benefits users, who can re-use ggplot2 concepts from tidypaleo to create high-quality diagrams in other disciplines.

Conclusions
The tidypaleo package is an extension of ggplot2 for R statistical software that provides a number of reusable components for visualizing paleoenvironmental data (Wickham 2016; R Core Team 2021). These components are based on a data structure in the form of one row per measurement, which allows existing concepts in ggplot2 and the Grammar of Graphics (Wilkinson 2005) to be used to create high-quality paleoenvironmental diagrams. It is our hope that this software will increase the reproducibility of figures included in paleoenvironmental publications, and demonstrate best practices for creating a minimal discipline-specific wrapper around the ggplot2 package.