trackeR : Infrastructure for Running and Cycling Data from GPS-Enabled Tracking Devices in R

The use of GPS-enabled tracking devices and heart rate monitors is becoming increasingly common in sports and fitness activities. The trackeR package aims to fill the gap between the routine collection of data from such devices and their analyses in R. The package provides methods to import tracking data into data structures which preserve units of measurement and are organized in sessions. The package implements core infrastructure for relevant summaries and visualizations, as well as support for handling units of measurement. There are also methods for relevant analytic tools such as time spent in zones, work capacity above critical power (known as W′), and distribution and concentration profiles. A case study illustrates how the latter can be used to summarize the information from training sessions and use it in more advanced statistical analyses.


Introduction
Recent technological advances allow the collection of detailed data on fitness activities and on multiple aspects of training and competition in professional sport. The focus of this paper is on data collected by GPS-enabled tracking devices and heart rate monitors. Such devices are routinely used in fitness activities such as running and cycling, and also during training in sports like field hockey and football. Basic questions associated with tracking data include how often, how much, or how hard an individual or a group trains, while a more advanced outlook tries to explain the impact of training on athlete physiology or performance.
Tools for basic analytics are usually offered by the manufacturers of the tracking devices, such as Garmin, Polar, and Catapult, and through a wide range of applications for devices such as smartphones and smartwatches, e.g., Strava Running and Cycling GPS, Endomondo Running & Walking, and Runtastic Running GPS Tracker. A notable open-source effort is Golden Cheetah (http://www.goldencheetah.org/), which focuses on the analysis of cycling data.

Despite the wide range of R packages available, there is only a handful of packages specific to sports data and their analysis. The available packages focus on topics such as sports management (RcmdrPlugin.SM, Champely 2012), ranking sports teams (mvglmmRank, Karl and Broatch 2015), and accessing betting odds (pinnacle.API, Blume, Jhirad, and Gassem 2017). SportsAnalytics is a package that focuses on the analysis of performance data, and currently offers only "a selection of data sets, functions to fetch sports data, examples, and demos" (Eugster 2013). The cycleRtools package (Mackie 2016) provides functionality to import cycling data into R as well as tools for cycling-specific, descriptive analyses.
The trackeR package (Frick and Kosmidis 2017) aims to fill the gap between the routine collection of data from GPS-enabled tracking devices and the analyses of such data within the R ecosystem. The package provides utilities to import sports data from GPS-enabled devices, and, after careful processing, organizes them in data objects which are split into separate sessions/workouts and carry information about the units of measurement (e.g., distance and speed units) as well as about any data operations that have been carried out (e.g., smoothing). The package also implements core infrastructure for the handling of measurement units and for summarizing and visualizing tracking data. It also provides functionality for calculating time in zones (e.g., Seiler and Kjerland 2006), work capacity W′ (Skiba, Chidnok, Vanhatalo, and Jones 2012), and distribution and concentration profiles (Kosmidis and Passfield 2015), including a few methods for the analysis of these profiles. The package is available from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=trackeR.

Section 2 gives an overview of the package and introduces the basic objects and the methods that apply to them. Section 3 describes the importing utilities, and Section 4 details the structure and construction of the 'trackeRdata' object, which is at the core of package trackeR. Section 5 is devoted to the calculation of relevant summaries (time in zones, work capacity, distribution and concentration profiles) and the corresponding methods for visualization. Section 6 and Section 7 focus on basic methods for unit manipulation as well as smoothing and thresholding. The case study in Section 8 investigates the key features in 27 sessions through a functional principal components analysis (e.g., Ramsay and Silverman 2005) on the concentration profiles for speed. Figures 1 and 2 show a schematic overview of the package structure, split into two parts for reading data and further operations.
Squared boxes indicate objects of a particular class, diamonds indicate files of a particular format, and boxes with rounded corners represent methods that apply to those objects. The respective class and method names are given in the boxes. An arrow from an object/file type to a method indicates that the method applies to objects of the respective class; an arrow from a method to an object indicates that the method outputs objects of the respective class. A bi-directional arrow between a method and an object indicates that the method's input and output are of the same class, such as the method threshold() and objects of class 'trackeRdata'. Arrows to or from groups of boxes apply to each box in the group. For example, the method changeUnits() applies to objects of classes 'trackeRdataZones', 'trackeRdataSummary', 'trackeRWprime', 'distrProfile', and 'conProfile'. Data in various formats are imported and stored in the central data object of the class 'trackeRdata', from which summaries for descriptive purposes or further analyses can be derived. Methods for visualization and data handling are available for data objects and summary objects. A list of all functionality is provided in Tables 1 and 2.

Import utilities
Package trackeR provides utilities for data in common formats from GPS-enabled tracking devices. The family of the supplied reading functions, read*(), currently includes functions for reading TCX (Training Center XML), DB3 (for SQLite, used, e.g., by devices from GPSports), and Golden Cheetah's JSON files. These functions read the tracking data and return a data.frame with a specific structure.
The following code chunk illustrates the use of the readTCX() function using a TCX file that ships with the package and shows the name and type of variables that are present in the resulting data frame.
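A sketch of such a chunk is given below; the exact name of the example file is an assumption and may differ from the file actually shipped with the package.

```r
## locate a TCX file shipped with trackeR (the file name here is illustrative)
filepath <- system.file("extdata", "myrun.tcx", package = "trackeR")
## read the tracking data into a data frame with a specific structure
runDF <- readTCX(file = filepath, timezone = "GMT")
## inspect the names and types of the resulting variables
str(runDF)
```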
Package trackeR can accommodate the addition of extra formats by simply authoring appropriate import functions. Such functions should take as input the path of the file to be read and return a data frame with the same structure as in the above example.
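As a minimal illustration, a user-supplied reader for a hypothetical CSV export could look as follows; the input column names are invented for this sketch, while the output column names follow the structure produced by the read*() functions.

```r
## sketch of a custom reader for a hypothetical CSV format; the raw column
## names (timestamp, lat, lon, ...) are assumptions for this illustration
readMyCSV <- function(file, timezone = "") {
  raw <- read.csv(file, stringsAsFactors = FALSE)
  data.frame(
    time = as.POSIXct(raw$timestamp, tz = timezone),
    latitude = raw$lat, longitude = raw$lon, altitude = raw$ele,
    distance = raw$dist, heart.rate = raw$hr,
    speed = raw$speed, cadence = raw$cad, power = raw$watts
  )
}
```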

Object structure
The core object of package trackeR has class 'trackeRdata'. The 'trackeRdata' objects are session-based, unit-aware and operation-aware structures, which organize the data in a list of multivariate 'zoo' objects (Zeileis and Grothendieck 2005) with one element per session. The observations within each session are ordered according to the time stamps as these are read from the GPS-enabled tracking devices. Each 'trackeRdata' object has an attribute on the measurement units of the data it holds, and, if applicable, an attribute detailing the operations, such as smoothing, it has gone through.
'trackeRdata' objects result from the constructor function trackeRdata(), which takes as input the output of the read*() functions. Apart from the allocation of observations into distinct sessions, the constructor function also performs some data processing, including basic sanity checks (for example, removing observations with negative or missing values for cumulative distance or speed), handling of measurement units, correction of distances using altitude data if required, and data imputation, as discussed in Section 4.5.

Constructor function
The interface of the constructor function for class 'trackeRdata' is

trackeRdata(dat, units = NULL, cycling = FALSE, sessionThreshold = 2,
    correctDistances = FALSE, country = NULL, mask = TRUE,
    fromDistances = TRUE, lgap = 30, lskip = 5, m = 11)

dat is the data frame containing the tracking data and units is used to specify the units of measurement. Table 3 shows the currently supported units and notes the units that are used by default when units = NULL. The argument cycling flags the data as coming from cycling session(s) rather than running session(s). This affects the calculation of W′ (based on power or speed for cycling and running, respectively) and the thresholds applied before plotting the session data. The other arguments are specific to the data processing operations, which are briefly described in the following subsections.

Identifying distinct sessions
The constructor function groups the observations into sessions according to their time stamps. Specifically, the time stamps in the data from the read*() functions are first sorted, and all consecutive observations whose time stamps are no further apart from each other than a specified threshold t * are considered to belong to a distinct session. The value of t * is set via the sessionThreshold argument of the trackeRdata() function and it defaults to 2 hours.
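For instance, a smaller threshold can be used to split workouts that were recorded close together in time; the value below is purely illustrative, and runDF is assumed to be a data frame as returned by one of the read*() functions.

```r
## split into separate sessions whenever consecutive time stamps
## are more than 30 minutes (0.5 hours) apart
runTr <- trackeRdata(runDF, sessionThreshold = 0.5)
```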

Distance correction using altitude data
If the distances in the data have been calculated solely based on latitude and longitude data, without taking into account the altitude, then the distance covered can be underestimated. The correctDistances argument of the trackeRdata() function controls whether the distances should be corrected for altitude changes.
If the uncorrected cumulative distance covered at time point $t_i$ is $d_{2,i}$, then setting correctDistances = TRUE uses the Pythagorean theorem to correct the distance covered between time point $t_{i-1}$ and time point $t_i$, so that

$$d_i = d_{i-1} + \sqrt{(d_{2,i} - d_{2,i-1})^2 + (a_i - a_{i-1})^2},$$

where $d_i$ and $a_i$ are the corrected cumulative distance and the altitude at time $t_i$, respectively.
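The correction can be sketched in a few lines of base R; this is a simplified illustration of the idea, not the package's internal implementation, and the variable names are invented.

```r
## simplified sketch of the altitude correction for cumulative distances:
## each uncorrected increment is combined with the altitude change
## via the Pythagorean theorem
correct_distances <- function(dist2, altitude) {
  horizontal <- diff(dist2)                      # uncorrected increments
  vertical <- diff(altitude)                     # altitude changes
  increments <- sqrt(horizontal^2 + vertical^2)  # corrected increments
  cumsum(c(dist2[1], increments))                # corrected cumulative distance
}
correct_distances(dist2 = c(0, 100, 200), altitude = c(10, 40, 10))
```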
If no altitude measurements are available, these are extracted from the SRTM 90m Digital Elevation Data via the raster package (Hijmans 2016) using the latitude and longitude measurements. The arguments country and mask control the extraction of altitudes.

Figure 3: Illustration of the imputation process for speed with m = 11.

Imputation process
Occasionally, there is a large time difference between consecutive observations in the same session, sometimes of the order of several minutes. This can happen, for example, if the device is intentionally paused by the athlete or if the proprietary algorithm controlling the operating sampling rate of the device detects no significant change in position. For example, the manual of a GPS device, the Forerunner 310XT, states that "The Forerunner uses smart recording. It records key points where you change direction, speed, or heart rate" (Garmin Ltd. 2013). In both cases, interpolating directly to get the speed or power will lead to overestimation of the total workload within those intervals.
We assume that such intervals appear only when there is no significant work happening, and hence impute them with observations with zero speed (for running) or zero speed and power (for cycling). Figure 3 shows a schematic representation of the imputation process for speed. The parameters l gap , m and l skip control the imputation, and can be specified via the lgap, m and lskip arguments of the trackeRdata() function, respectively.
If the observations at times t_i and t_{i+1} are more than l_gap seconds apart, then it is assumed that there is no significant work happening between t_i and t_{i+1}. The number of imputed records in the interval is m, and consists of two "outer" records and m − 2 "inner" records. The "outer" records are l_skip seconds apart from the existing observations forming the beginning and the end of the interval, respectively. The "inner" records are spread evenly over the remaining part of the interval. The imputed records between t_i and t_{i+1} have zero speed or power, and the latitude, longitude and altitude measurements are set to their values at time t_i. All other variables are set to NA.
trackeRdata() also adds five records at the beginning and five at the end of a session, based on the assumption that there is no activity before and after the available records. These observations have zero speed or power; their latitude, longitude and altitude measurements are as in the first and last observations, respectively, and all other variables are set to NA. These added records are one second apart from each other and from the first and the last observation, respectively.
After the imputation process, the cumulative distances are updated based on the imputed speeds and the time differences between consecutive observations, according to

$$d_i = d_{i-1} + (t_i - t_{i-1}) \, s_i,$$

where $s_i$ and $d_i$ denote the speed and cumulative distance at time point $t_i$, respectively.
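The update can be sketched as follows; this is an illustrative helper, not a package function.

```r
## rebuild cumulative distances from speeds and time stamps:
## each increment is the elapsed time multiplied by the speed at the
## end of the interval
update_distances <- function(times, speeds, d0 = 0) {
  cumsum(c(d0, diff(times) * speeds[-1]))
}
update_distances(times = 0:4, speeds = c(0, 2, 2, 0, 0))
```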
The following code chunk takes as input the raw data in the data frame runDF and constructs the corresponding 'trackeRdata' object.

R> runTr0 <- trackeRdata(runDF)
The function readContainer() is a convenience wrapper that calls the suitable reading function and, then, trackeRdata() for the data processing and the organization of the data in a 'trackeRdata' object (see ?readContainer for the available arguments). For example,

R> runTr1 <- readContainer(filepath, type = "tcx", timezone = "GMT")
R> identical(runTr0, runTr1)
[1] TRUE

The function readDirectory() allows the user to read all files of a supported format in a directory, rather than calling, e.g., readContainer() on each file separately. Using the argument aggregate, the user can decide whether all data are first combined into one data frame and then split into sessions solely based on the time difference between consecutive observations. This way, e.g., warm-up and cool-down phases are put into the same session as the central part of the training, even if they are recorded in separate container files. Alternatively, data from different container files are always stored in separate sessions.
Package trackeR ships with two 'trackeRdata' objects containing 1 and 27 running sessions, respectively, which can be loaded via

R> data("run", package = "trackeR")
R> data("runs", package = "trackeR")

We will use these objects for the illustrations throughout the paper.

Session summaries and visualization
Package trackeR provides methods for summarizing sessions in terms of scalar summaries, the time spent exercising in specified zones, the concept of work capacity, and distribution and concentration profiles.

Visualization
For a first visual inspection of the data, the plot() method shows by default the evolution of heart rate and pace over the course of the selected sessions. For example, Figure 4 shows the evolution of heart rate and pace for the first three sessions in the runs object.

R> plot(runs, session = 1:3)

The route covered during a session can also be displayed on a static map via the plotRoute() method. The plotRoute() method uses the ggmap package (Kahle and Wickham 2013) and, hence, can work with the sources and maps supported by ggmap. For example, Figure 5 shows the route covered during session 4 in runs using a map downloaded from Google. Interactive maps can be produced with leafletPlot(), using the leaflet package (Cheng and Xie 2017).

Scalar summaries
Each session can be summarized through common summary statistics using the summary() method. Such a session summary includes estimates of the total distance covered, the total duration, the time spent moving, and work to rest ratio. It also includes averages of speed, pace, cadence, power, and heart rate, calculated based on total duration or the time spent moving.
An athlete is considered to be moving if the speed is larger than some threshold s*. This threshold can be set via the movingThreshold argument of the summary() method, and the package assumes that anything between not moving at all and walking with a speed below that threshold is resting. The default value for movingThreshold is 1 meter per second, which is just below the speed humans prefer to walk at on average (1.4 meters per second; see Bohannon 1997).

The "average speed moving" is calculated as the total distance covered divided by the time moving, while "average speed" is calculated as the total distance divided by the total duration. The average pace (moving) is calculated as the inverse of the average speed (moving). The work to rest ratio is calculated as the time moving divided by (total duration − time moving). The averages for cadence, power, and heart rate (total and moving) are weighted averages with weights depending on the time difference to the next observation. These averages also need to take into account missingness in the observations. For a variable of interest V, we can calculate a weighted mean for the total session while accounting for missing values via

$$\bar{v} = \frac{\sum_{i=1}^{n} K_i \Delta_i v_i}{\sum_{i=1}^{n} K_i \Delta_i},$$

and its counterpart for the part of the session spent in motion via

$$\bar{v}_{moving} = \frac{\sum_{i=1}^{n} K_i I(s_i > s^*) \Delta_i v_i}{\sum_{i=1}^{n} K_i I(s_i > s^*) \Delta_i},$$

where $v_i$ is the value of V at time point $t_i$, $K_i$ is 1 if $v_i$ is available, i.e., not missing, and 0 otherwise, $I(\cdot)$ denotes the indicator function, $s_i$ is the speed at $t_i$, and $\Delta_i = t_i - t_{i-1}$ is the time difference between the observations at $t_i$ and $t_{i-1}$.

The plot() method shows the evolution of the various summary statistics over calendar time or over the course of the sessions. For example, the following code chunk produces Figure 6.
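A sketch of such a chunk is given below; the particular choice of statistics to plot is illustrative.

```r
## scalar summaries for all sessions, then plot selected statistics,
## both for the total session and for the time spent moving
runSummary <- summary(runs, movingThreshold = 1)
plot(runSummary, group = c("total", "moving"),
     what = c("avgSpeed", "distance", "duration", "avgHeartRate"))
```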

Times in zones
A common way to summarize and characterize a session is to calculate how much time was spent exercising in certain zones, e.g., heart rate zones.
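In trackeR this is handled by the zones() function, which has a corresponding plot() method; the heart rate zone limits below are chosen for illustration only.

```r
## time spent in heart rate zones for the first four sessions;
## the zone limits are illustrative, not physiologically calibrated
runZones <- zones(runs, session = 1:4, what = "heart.rate",
                  breaks = list(heart.rate = c(0, 100, 120, 140, 160, 180, 250)))
plot(runZones)
```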

Quantifying work capacity
The critical power model (Monod and Scherrer 1965) describes the relationship between the power output P and the time $t_e$ to exhaustion at that power output in terms of two parameters $W'_0$ and CP, via

$$t_e = \frac{W'_0}{P - CP}. \quad (1)$$

The critical power (CP) is defined by Monod and Scherrer (1965) as "the maximum rate (of work) that [can be kept] up for a very long time without fatigue." Skiba et al. (2012) describe CP as "a power output that could theoretically be maintained indefinitely on the basis of principally 'aerobic' metabolism." W′ (read "W prime") represents a finite work capacity above CP. Skiba et al. (2012) assume that W′ gets depleted during exercise with a power output above CP but also replenished during exercise with a power output at or below CP. We denote by W′ the general concept of work capacity above CP, and by W′(t) the state of W′ at time t. The latter is also sometimes referred to as W′ balance at time t. Additionally, the initial state of W′ at the start of an exercise t = t₀ is W′₀ = W′(t₀), which is one of the parameters in the critical power model (Equation 1). Total depletion of W′₀ results in the inability to produce a power output above CP. Thus, knowledge of the current state W′(t), i.e., how much of that finite work capacity W′₀ is left at time t, is important to an athlete, particularly in a race.
While this concept is most commonly applied to cycling, where the power output is routinely measured, Skiba et al. (2012) suggest that it can also be applied to running by substituting power and critical power with speed and critical speed, respectively. For running, the model postulates that each runner has a finite capacity in terms of the distance covered above the critical speed. Depending on how much the runner exceeds the critical speed, the finite capacity W′₀ is exhausted in shorter times. Below we describe the models for the depletion and replenishment of work capacity and how they are combined in package trackeR.

Depletion of work capacity
Assuming constant power P over periods of exertion above CP, Skiba, Fulford, Clarke, Vanhatalo, and Jones (2015) assume that W′ is depleted at a rate directly proportional to the difference between the power output and CP,

$$\frac{d}{dt} W'(t) = -(P - CP). \quad (2)$$

Solving Equation 2 for W′(t) gives

$$W'(t) = -(P - CP)\, t + D, \quad (3)$$

where $D \in \mathbb{R}$ is constant over t.
Suppose that the exercise over time t₀ = 0 to t_n = T can be split into n intervals with breakpoints t₀, t₁, …, t_n such that the power output within each interval is constant, that is, $P(t) = P_i$ for $t \in [t_{i-1}, t_i)$, $i \in \{1, \ldots, n\}$. Then, using Equation 3, the change in W′(t) over the interval can be expressed as

$$W'(t_i) - W'(t_{i-1}) = -(P_i - CP)(t_i - t_{i-1}). \quad (4)$$

Replenishment of work capacity

Skiba et al. (2015) assume that the periods with a power output at or below CP are periods of recovery during which W′ is replenished at a rate that depends on the difference between CP and the power output, and on the amount of W′₀ remaining, as follows:

$$\frac{d}{dt} W'(t) = (CP - P_i)\, \frac{W'_0 - W'(t)}{W'_0}. \quad (5)$$

Equation 5 assumes that recovery slows down as W′(t) approaches the initial capacity W′₀. Employing the substitution rule for integrals while solving Equation 5 and re-expressing in terms of W′(t_{i−1}) (see Appendix A for details) gives

$$W'(t_i) = W'_0 - \left(W'_0 - W'(t_{i-1})\right) \exp\left( -\frac{(CP - P_i)(t_i - t_{i-1})}{W'_0} \right). \quad (6)$$

Since W′(t_{i−1}) is the amount of W′₀ remaining at the start of the interval [t_{i−1}, t_i), W′₀ − W′(t_{i−1}) is the amount of W′₀ which has been depleted prior to t_{i−1} and not yet been replenished. Skiba et al. (2012) refer to this as W′ expended. Skiba et al. (2015) describe the replenishment of W′ indirectly, by describing how W′ expended is reduced over the course of such a recovery interval. The exponential decay factor used in Equation 6 here is the same as in their Equation 4, with only different notation: Skiba et al. (2015) use t for the length of the interval, $D_{CP} = CP - P_i$ for the difference between critical power and power output, and $W'_{exp}$ for the amount of W′ previously expended. For $P_i < CP$, as is required for replenishment, $-D_{CP}$ and $P_i - CP$ are negative and thus the exponential factor is smaller than 1, leading to the exponential decay described above. Skiba et al. (2012) also assume an exponential decay of previously expended W′ to describe the replenishment of W′, albeit with a different decay factor: instead of $(P_i - CP)/W'_0$, they use $1/\tau_{W'}$.
The relationship between the time constant of replenishment $\tau_{W'}$ and the difference between critical power and recovery power $\bar{P}$ is estimated from experimental data as

$$\tau_{W'} = 546 \exp\left(-0.01\,(CP - \bar{P})\right) + 316,$$

with the recovery power $\bar{P}$ estimated by the mean of all power outputs below CP.
Using Equation 6, i.e., the formulation of Skiba et al. (2015), the change in W′ over the corresponding interval [t_{i−1}, t_i) can be described through

$$W'(t_i) - W'(t_{i-1}) = \left(W'_0 - W'(t_{i-1})\right) \left( 1 - \exp\left( -\frac{(CP - P_i)(t_i - t_{i-1})}{W'_0} \right) \right). \quad (7)$$

Work capacity at time t_j

Equation 4 describes the depletion of W′ (when $P_i > CP$) and Equation 7 describes the replenishment of W′ (when $P_i \le CP$) over an interval [t_{i−1}, t_i). These two aspects can be combined to describe the change over the interval as

$$W'(t_i) - W'(t_{i-1}) = \begin{cases} -(P_i - CP)(t_i - t_{i-1}) & \text{if } P_i > CP, \\ \left(W'_0 - W'(t_{i-1})\right) \left( 1 - \exp\left( -\frac{(CP - P_i)(t_i - t_{i-1})}{W'_0} \right) \right) & \text{if } P_i \le CP. \end{cases}$$

The amount of W′ left at time point $t_j$, $j \in \{1, \ldots, n\}$, can thus be described through the initial amount W′₀ and the changes happening in the j intervals of constant power previous to $t_j$:

$$W'(t_j) = W'_0 + \sum_{i=1}^{j} \left\{ W'(t_i) - W'(t_{i-1}) \right\}. \quad (8)$$

Function Wprime() can be used to calculate W′ expended by setting the argument quantity to "expended". If quantity is set to "balance", Wprime() calculates the current state W′(t) (Equation 8). Wprime() contains implementations of both Skiba et al. (2012) and Skiba et al. (2015), which can be selected via the version argument. For example, session 11 of the example data is an interval training with a warm-up and a cool-down phase. Assuming a critical speed of 4 meters per second, the following code chunk produces Figure 8, which shows W′ expended, based on the specification of Skiba et al. (2012), along with the corresponding speed profile.
R> wexp <- Wprime(runs, session = 11, quantity = "expended", cp = 4,
+   version = "2012")
R> plot(wexp, scaled = TRUE)

During the warm-up phase speed rarely exceeds 4 meters per second and W′ expended remains low. Over the course of the interval training, W′ expended rises during the high-intensity phases and drops during the recovery phases. In the last part of the session, speeds are mostly below 4 meters per second and W′ expended drops again.

Kosmidis and Passfield (2015) introduce the concept of distribution profiles, for which the trackeR package provides an implementation. These profiles are motivated by the need to compare sessions and to use information on variables such as heart rate or speed during a session for further modeling.

Distribution and concentration profiles
For a session lasting $t_n$ seconds, the distribution profile is defined as the curve $\{(v, \Pi(v)) \mid v \ge 0\}$, where

$$\Pi(v) = \int_0^{t_n} I\{V(t) > v\} \, dt.$$

The function Π(v) is monotone decreasing and describes the time spent exercising above a threshold v for a variable V under consideration (e.g., heart rate or speed).
On the basis of observations $v_0, \ldots, v_n$ of V at respective time points $t_0, \ldots, t_n$, the observed version of Π(v) can be calculated as

$$\hat{\Pi}(v) = \sum_{i=1}^{n} \Delta_i \, I(v_i > v),$$

where $\Delta_i = t_i - t_{i-1}$. This can subsequently be smoothed, respecting the positivity and monotonicity of Π(v), e.g., via a shape-constrained additive model with Poisson responses (Pya and Wood 2015).
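The observed profile can be illustrated in plain R, independently of the package internals; the function and variable names below are invented for this sketch.

```r
## observed distribution profile: time spent above each threshold v,
## computed from raw observations and their time stamps
observed_profile <- function(times, values, grid) {
  dt <- diff(times)  # time differences between consecutive observations
  sapply(grid, function(v) sum(dt * (values[-1] > v)))
}
observed_profile(times = 0:5, values = c(0, 3, 4, 4, 5, 2), grid = 0:5)
```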
The concentration profile is defined in Kosmidis and Passfield (2015) as the negative derivative of a distribution profile and is suitable for revealing concentrations of time around certain values of the variable under consideration.
Distribution profiles can be calculated using the distributionProfile() function, which returns an object of class 'distrProfile'. Concentration profiles can be derived from distribution profiles using concentrationProfile(), which returns an object of class 'conProfile'. Table 2 includes an overview of constructor functions and available methods for distribution and concentration profiles.
By default, distribution profiles are calculated for speed and heart rate on grids covering the ranges of [0, 12.5] meters per second and [0, 250] beats per minute, respectively. The following code chunk illustrates the use of distributionProfile() and shows how users can specify the variables for which to calculate profiles and the respective grids.
R> dProfile <- distributionProfile(runs, session = 1:4,
+   what = c("speed", "heart.rate"),
+   grid = list(speed = seq(0, 12.5, by = 0.05), heart.rate = seq(0, 250)))
R> plot(dProfile, multiple = TRUE)

The multiple argument of the plot() method determines whether to plot the profiles in separate panels (FALSE) or overlay them in a common panel (TRUE), as in Figure 9. The different session lengths are clearly visible in the height of the curves at 0. Amongst the distribution profiles for speed, the descent of the profile for session 3 is slower than for the other sessions. This difference is most apparent in the concentration profiles, which are shown in Figure 10 and are produced by the following code chunk.
R> cProfile <- concentrationProfile(dProfile, what = "speed")
R> plot(cProfile, multiple = TRUE)

The profile for session 3 has a mode at around 3.5 meters per second and another one at 5 meters per second, showing that this session involved training at a combination of low and high speeds.

Table 3 shows the variables and the corresponding units that are currently supported in package trackeR.

Handling units of measurement
If objects with different units are c()ombined in one object, the units of the first session are applied to all other sessions. Furthermore, the changeUnits() method uses name matching to figure out which conversion needs to be done. This allows the user to easily add support for converting from unitOld to unitNew by authoring a function named unitOld2unitNew.
If we wish to report the speed summaries for session 1 in runSummary in feet per hour (not currently supported) instead of meters per second, we need to simply provide the appropriately named conversion function as illustrated below. Note that the conversion applies to all speed summaries, i.e., to "average speed" and "average speed moving".
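A sketch of such a conversion is given below; the function name follows the unitOld2unitNew convention, and the numeric factor converts meters per second to feet per hour (1 m = 1/0.3048 ft, 1 h = 3600 s).

```r
## user-defined conversion from meters per second to feet per hour;
## the name follows the unitOld2unitNew convention used by changeUnits()
m_per_s2ft_per_h <- function(variable) variable * 3600 / 0.3048
runSummaryFt <- changeUnits(summary(runs, session = 1),
                            variable = "speed", unit = "ft_per_h")
```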

Thresholding and smoothing
There are instances where the data include artifacts due to inaccuracies in the GPS measurements. These can be handled with the threshold() method for objects of class 'trackeRdata', which replaces values outside the specified thresholds with NA. The variables and the (lower and upper) thresholds which should be applied for each variable can be specified through the arguments variable, lower, and upper, respectively. An example is given in ?threshold.
The default thresholds are listed in Table 4 and, if necessary, are converted to the units of measurement used for the 'trackeRdata' object.
The other option for data handling is the smoother() method for 'trackeRdata' objects. This applies a summarizing function, such as the mean or median, over a rolling window. Both operations, threshold() and smoother(), are used in the plot() method for 'trackeRdata' objects: the default settings for plot() are to apply the thresholds specified in Table 4 and to smooth the data afterwards.
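An explicit call can be sketched as follows; the argument names are assumptions based on the rolling-window description above, so the exact interface should be checked in ?smoother.

```r
## smooth speed and heart rate with a rolling mean over 20 observations;
## argument names are assumptions -- see ?smoother for the exact interface
runsSmooth <- smoother(runs, what = c("speed", "heart.rate"),
                       width = 20, fun = "mean")
```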

Case study
The example data set included in the package contains 27 sessions of a single male runner in June 2013. A visualization of scalar summaries for the sessions can be found in Figure 6. The distance covered in those sessions ranges from 2.79 km to 22.35 km, and most sessions were spent moving almost the entire time.
The code chunk below loads the data, applies thresholds, and calculates the smoothed distribution profiles for the 27 sessions. The corresponding concentration profiles are shown in Figure 12.
R> library("trackeR")
R> data("runs", package = "trackeR")
R> runsT <- threshold(runs)
R> dpRuns <- distributionProfile(runsT, what = "speed")
R> dpRunsS <- smoother(dpRuns)
R> cpRuns <- concentrationProfile(dpRunsS)

The majority of the profiles for speed concentrate around 4 meters per second. However, the curves differ in their shape (unimodal or multimodal), height, and location (revealing concentrations at higher or lower speeds). Functional PCA (e.g., Ramsay and Silverman 2005) can be used to explain those differences while treating the profiles directly as functions. Package trackeR contains a convenience function funPCA(), which converts concentration/distribution profiles to the required functional data format and performs a functional PCA. Package trackeR can also be viewed as a stepping stone to further analysis of tracking data with other R packages. For example, it contains a conversion function, profile2fd(), that transforms concentration and distribution profiles to class 'fd' so that users have direct access to the facilities of the fda package (Ramsay, Wickham, Graves, and Hooker 2017) for functional data analysis.
The following code chunk shows the conversion to the required functional data format and the fitting of a functional PCA in separate steps. The PCA has four components, and the share of variance captured by each is displayed in the last step.

Figure 14: PC1 score vs. "duration moving" (left) and PC2 score vs. "average speed moving" (right).
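These steps can be sketched as below, assuming the cpRuns object from the previous chunk; pca.fd() is from package fda, while the exact arguments of profile2fd() should be checked in its documentation.

```r
## convert the speed concentration profiles to class 'fd' and
## perform a functional PCA with four harmonics (sketch)
library("fda")
cpFd <- profile2fd(cpRuns, what = "speed")
cpPCA <- pca.fd(cpFd, nharm = 4)
## share of variance captured by each harmonic
round(cpPCA$varprop, 2)
```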
The first two harmonics capture 91% of the variation between curves. Since further harmonics capture considerably less variation, only the first two are chosen for further inspection. Figure 13 shows the mean function (solid line) and the variation captured in the two harmonics (between the dashed and dotted lines). The first harmonic (top panel) illustrates that the most important characteristic of the concentration profiles is the relative value, which is closely related to the overall session duration. The left panel of Figure 14 shows the score on the first harmonic versus "duration moving" which is calculated as part of the scalar session summaries. The second harmonic in the bottom panel of Figure 13 shows variation along the speed thresholds in the center of the curve. This variation can be explained well by the scalar measure "average speed moving" as shown in the right panel of Figure 14.
The concentration profiles and a functional PCA thus indicate that the two scalar summaries "duration moving" and "average speed moving" provide a good summary of the speed information in the sessions and can be used, for example, in order to incorporate speed as explanatory information in regression analyses.