Analyzing Remote Sensing Data in R : The landsat Package

Research and development on atmospheric and topographic correction methods for multispectral satellite data such as Landsat images has far outpaced the availability of those methods in geographic information systems software. As Landsat and other data become more widely available, demand for these improved correction methods will increase. Open source R statistical software can help bridge the gap between research and implementation. Sophisticated spatial data routines are already available, and the ease of program development in R makes it straightforward to implement new correction algorithms and to assess the results. Collecting radiometric, atmospheric, and topographic correction routines into the landsat package will make them readily available for evaluation for particular applications.


Introduction
Satellite remote sensing data can provide a complement or even alternative to ground-based research for large scale studies or over long periods. The Landsat platform is the premier example: images have been collected continuously since the early 1970s. If both Landsat 5 and 7 are considered, every location within the United States is imaged every 8 days. Other satellite platforms have become available more recently. Images from all platforms can be used for mapping land cover, tracking land use change, and even estimating plant biomass.
Image characteristics vary from date to date. Some of this variation is due to solar elevation angle and can be easily corrected for given date and time (radiometric calibration). Additional variation is caused by atmospheric conditions at the time of imaging: scatter at different wavelengths due to haze. Atmospheric corrections are very complex. If one image every few years is analyzed, for example to track development patterns, atmospheric corrections may be unnecessary because of the magnitude of the change in land use during that interval. If researchers wish to take advantage of the temporal data density offered by Landsat, careful correction is required. Otherwise differences in apparent reflectance due to atmospheric variation can swamp differences caused by actual change on the ground. In mountainous areas, topographic corrections are also necessary to compensate for differences in incident radiation due to slope and aspect. Without correction, the same ground cover on opposite sides of the mountain could return a different signal.
Many procedures to implement atmospheric and topographic corrections have been proposed, each with strengths and weaknesses, but few are readily available in geographic information systems (GIS) or image processing software. It is impossible to compare methods to identify the one most suited for a given application. Indeed, it is not always clear exactly what proprietary GIS software is doing. R offers an alternative to GIS software for research and testing of satellite image processing algorithms. The interpreted nature of R makes it possible to implement, test and modify algorithms easily. The available graphical and statistical tools are vastly superior to anything available in common GIS packages, making it straightforward to compare algorithms. Tools for processing and display of spatially-referenced data are already available in R (R Development Core Team 2011). The sp (Bivand, Pebesma, and Gomez-Rubio 2008) and rgdal (Keitt, Bivand, Pebesma, and Rowlingson 2011) packages provide these capabilities for several commonly-used spatial formats.
The landsat package described here, and available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=landsat, provides basic tools for working with satellite imagery such as automated georeferencing and cloud detection. It contains functions for radiometric normalization, and several different approaches to atmospheric correction. Four topographic correction algorithms have been implemented. Other useful functions such as bare soil line and tasseled cap calculations have been included. While these functions were developed with Landsat data in mind, they are suitable for use with satellite imagery from other platforms as long as appropriate calibration data are used.

Landsat platform characteristics
Landsat 5 (TM) was launched on 1984-03-01 and Landsat 7 (ETM+) on 1999-04-15. Each revisits a location every 16 days, and the two orbits are staggered so an image is taken for each location every 8 days. The Landsat Data Continuity Mission is planned to launch in 2011, and will record data in bands compatible with the TM and ETM+ instruments. Due to a hardware failure on 2003-05-31, Landsat 7 scenes are now missing 22% of the pixels. The problem is most severe near the edges of a image. Band characteristics are largely consistent, but ETM+ added an additional band (Table 1).
The Landsat images have been converted to integer digital numbers (DN ) before distribution to facilitate storage and display. They may require conversion to radiance or reflectance, topographic correction or atmospheric correction. If using a single image, or images widely separated in time to examine gross changes, minimal processing may be required. For detailed comparison of vegetation indices from multiple images, however, careful correction is needed. The most accurate atmospheric corrections require ground data taken during the satellite overpass. For retrospective studies this is impossible to obtain, and less-accurate image-based correction methods must be used. The following sections describe the procedures needed for calibration of Landsat data and the R implementations of these algorithms in the landsat package.

Sample data
The landsat package includes a 300 × 300 pixel subset of United States Geological Survey (USGS) Landsat ETM+ images from two dates, 2002-07-20 and 2002-11-25 (US Geological Survey 2010b. These images are in SpatialGridDataFrame format, and can be displayed using image(). Even after clouds are removed, the July image has a higher mean DN and greater dynamic range than the November image (Figure 1c, d). Complete metadata are included in the help files. A USGS digital elevation model (DEM) covering the same area and at the same resolution has been included ( Figure 1f; US Geological Survey 2010a).

Tools
This package provides a few basic tools for working with Landsat images. The lssub() function is an R interface to the image subsetting tools from the Geospatial Data Abstraction Library (GDAL, GDAL Development Team 2011). This function is provided for convenience; while the same effect can be obtained using the subsetting property of sp, it is considerably faster to use the GDAL functions to remove a smaller section of a geotiff image. Landsat images are usually distributed as geotiff files, and can be imported in SpatialGridDataFrame format using readGDAL. The lssub() function preserves the geotiff format.
Image processing requires some information not often available in the metadata. The Earth-Sun distance for a given date can be calculated with ESdist(). A date can be conveniently converted to decimal format with ddist().

Automated georeferencing
A simple error-minimization routine can be used to provide relative georeferencing by matching one image to a reference image (georef and geoshift by means of vertical and horizontal shifts. If the reference image has been absolutely georeferenced, then all of the subsequent images will also be spatially referenced. This function can find local minima, so results should always be checked visually. The matching process only needs to be done once for each date; Band 3 or Band 4 generally provide good results, but any band may be used. The results of this step can then be used for each band image from that date. Sufficient padding must be added around the edges of the image to accommodate the magnitude of the shift. The larger the image area, the more effective the matching process. The below example illustrates the procedure, but because of the small image sizes used for the demo data the shift coefficients are unreliable. R> july.shift <-georef(nov3, july3, maxdist = 50) R> july1.corr <-geoshift(july1, padx = 10, pady = 10, july.shift$shiftx, + july.shift$shifty)

Topographic calculations
Topographic corrections require the use of slope and aspect data calculated from a DEM of the same resolution as the satellite data. The slopeasp() function will calculate both given a DEM. While all GIS software will provide topographic calculations, this function was included for the convenience of being able to do all the processing within R and to allow exploration of different algorithms. Most GIS software offers only a tiny subset of the methods that have been proposed.
The most common method for slope and aspect calculations is the third-order finite difference weighted by reciprocal of distance (Unwin 1981;Clarke and Lee 2007). This method is the equivalent of using a 3 × 3 Sobel filter to determine the east-west slope and the north-south slope. Given a cell Z i,j with east-west cell size EW res and north-south cell size NS res , slope in either percent or degrees, and aspect with north as 0 • , east as 90 • and south as 180 • can be calculated using Equation 1. A simple smoothing correction (dividing by a smoothing parameter before taking the arctan) can reduce extreme slope values (Riano, Chuvieco, Salas, and Aguado 2003).

Radiometric calibration
Landsat images are distributed as digital numbers, integer values from 0-255. While less crucial now, this correction was originally necessary to make it possible to store, distribute and portray these images efficiently. Radiometric calibration is a two-step process. First the DN values are converted to at-satellite radiance using parameters provided in the image metadata. Data on solar intensity are used to convert the at-satellite radiance to at-satellite reflectance. These parameters are also in the image metadata.

At-sensor radiance
The first step in processing is to convert DN to at-sensor spectral radiance L, also called top-of-atmosphere radiance. The conversion coefficients are available in the metadata accompanying the images. Whenever possible the metadata values should be used, as coefficients vary by platform and over time, but standard coefficients are given in Table 2.
Coefficients are provided in one of three band-specific formats: gain 2 and offset; G rescale (also called gain) and B rescale (bias); or radiances associated with minimum and maximum DN values (L max and L min ). Any of the three can be used to convert from DN to at-sensor radiance (Equation 2).
It is not always clear which form of coefficients the metadata contain because "gain" has been used to refer to both gain 2 and G rescale . Most recent image metadata provide G rescale and B rescale , but these formulations are interconvertible (Equation 3). The magnitudes of G rescale and gain 2 are similar for most bands, but the former is paired with B rescale , which will have   negative values for non-thermal bands, while gain 2 is paired with offset, which is positive for non-thermal bands.
All radiometric functions in landsat accept either gain and offset or G rescale and B rescale .

At-sensor reflectance
The at-sensor radiance values calculated using Equation 2 must be corrected for solar variability caused by annual changes in the Earth-Sun distance d, producing unitless at-sensor (or top-of-atmosphere) reflectance ρ AS (Equation 4).
E sun is the band-specific exoatmospheric solar constant (Table 3; W m −2 µm −1 ). The solar zenith angle θ z can be derived from the image metadata, where the solar elevation angle θ s is usually included; θ z = 90 • −θ s . At-sensor reflectance can be calculated using the radiocorr() function with method = "apparentreflectance".

Thermal bands
Band 6 contains thermal infrared data. Landsat 7 offers two thermal bands, while Landsat 5 provides one. Instead of calculating top-of-atmosphere reflectance, these data can be converted to temperature (K • ) using thermalband(). This function provides default coefficients, and requires only the DN data and the band number (6 for Landsat 5; 61 or 62 for Landsat 7).

Cloud identification
Clouds are reflective (high) in Band 1 and cold (low) in Band 6, so the ratio of the two bands is high over clouds (Martinuzzi, Gould, and González 2007). The absolute value of this ratio must be adjusted for data type, whether reflectance, radiance, or DN . The clouds() function will create a cloud mask (1 where clouds are present; NA where they are not) given Band 1 and Band 6. The default parameters for the ratio level (level) and for adding a buffer around the cloud edge (buffer) were adequate for the test data once converted to at-sensor reflectance and temperature. This function can be used with DN data if the level argument is adjusted appropriately. The mask does not demarcate areas of cloud shadow, but only the clouds themselves ( Figure 1e).

Atmospheric correction
For most applications, ground reflectance is of greater interest than at-sensor reflectance so atmospheric correction is required. Variation in atmospheric conditions at the time of overpass can overwhelm any changes in surface reflectance, so it is crucial to correct for these differences. If measured atmospheric data such as optical depth are available an accurate correction can be applied, but these are rarely available for retrospective studies. Most commonly, an image-based method is used instead. Two categories of corrections are available. Relative normalization methods match the spectral characteristics of each image to a reference image in such a way that each transformed image appears to have been taken using the same sensor and with the same atmospheric conditions as the reference image. Functions are available for relative atmospheric correction methods using the entire image or unchanging subsets.
Absolute atmospheric correction methods rely on a mechanistic understanding of atmospheric effects to adjust each image individually. Instead of correcting to a reference image, information contained in part of an image, for instance the darkest areas, is extracted and used to correct the rest of the image. This extracted information substitutes for measured parameters. Three such methods have been included here.

Relative atmospheric correction using the entire image
Statistical correction methods are entirely empirical and do not consider physical principles or atmospheric conditions in any way. These algorithms force the distribution of values in one image to match that in another. Unless otherwise indicated, these methods can operate on DN or reflectance.

Relative normalization
The most aggressive correction method is to regress all of the pixels in the image to be corrected onto the corresponding pixels in the reference image. This method requires georeferenced images covering the same area and at the same resolution. Since both variables are random, model II regression method such as Major Axis regression implemented in lmodel2() from the lmodel2 package is recommended (Legendre 2008). The relnorm() function implements this method for spatial data, and returns both a corrected image and the coefficients used. In the example given here, Band 4 of the July image is corrected to match the November image.

Histogram matching
Histogram matching forces the distribution of the DN values in one image to match another. This algorithm is commonly used in other areas of image processing, and the images do not need to cover the same area, or to match in any way. The histmatch() function should be used only with the integer DN values. In the example given here, Band 4 of the July image is corrected to match the November image.
R> july4.hmcoef <-histmatch(nov4, july4, mask = july.cloud) R> july4.hm <-july4.hmcoef[["newimage"]] For this image pair, histogram matching performs better than relative normalization (Figure 2). The former method forces the histogram for July to acquire the shape of the histogram for November, while keeping the range of values (compare Figure 1c to Figure 2c). Relative normalization created a similar histogram profile, but greatly compressed the dynamic range, thereby losing much of the information contained in the data. This method also inverted the values of the original image: the slope of the regression line was negative. This inversion is a drawback of relative normalization methods, especially for image pairs where seasonal vegetation differences are pronounced.

Relative atmospheric correction using a subset of the image
Using the entire image for statistical correction includes areas of the image that are likely to change between dates, particularly vegetation, so the correction factors incorporate nonatmospheric effects. As shown above this may have unwanted side effects. These corrections may thus conceal actual changes between dates. Identifying pixels that cover developed areas or other land uses that would be expected to possess constant reflectance properties from date to date could produce more accurate statistical corrections. Two methods for doing so are included in landsat: pseudo-invariant features and radiometric control sets.

Pseudo-invariant features
The reflectance of a specific developed area such as a large rooftop or parking lot should not change seasonally, so differences in apparent reflectance in these areas between dates are assumed to be due to atmospheric differences. Such areas, termed pseudo-invariant features (PIF) by Schott, Salvaggio, and Volchok (1988), can be identified using Band 7 and the ratio of Band 4 to Band 3. The Band 4 to Band 3 ratio is low where there is no vegetation, including water and developed areas. Band 7 is low over water areas, so it can be used to distinguish between the unvegetated areas identified using the Band 4 to Band 3 ratio. As implemented here, the images must cover the same area at the same resolution, but that is not a requirement of the method. The PIF() function can be used to identify invariant features within an image using the above criteria. Visual inspection may be needed to ensure that the threshold value is correct for the particular images being analyzed. Having an insufficient number of PIF points is a concern with this method, especially with the small images in these examples.
Using the data included in the landsat package, November is the clearer image, as would be expected because of the higher humidity in the summer months, so it is used to identify PIF pixels. These PIFs are then used in major axis regression to correct the July image (only Band 4 is shown).

Radiometric control sets
The radiometric control set (RCS) procedure implemented in RCS() takes a different approach to choosing invariant features Hall, Strebel, Nickeson, and Goetz (1991). The tasseled cap procedure (tasscap()) is used to identify the greenness and brightness components of the image (Crist 1985;Crist and Kauth 1986;Huang, Wylie, Yang, Homer, and Zylstra 2002). These are then used to identify dark sets and bright sets with low greenness: unvegetated radiometric control sets. As with PIF(), the threshold value may need to be adjusted for a particular set of images. The tasselled-cap procedure included in landsat contains default values for at-sensor reflectance for both Landsat 5 (TM) and 7 (ETM+), although other formulations are available in the literature. In the example given, reflectance is used for RCS identification, but the adjustment is done on the DN values to maintain consistency with previous correction examples.
The clearer November image is used to identify the radiometric control set, and major axis regression is used to correct the July image based on that RCS. Finally the corrected data are put in SpatialGridDataFrame format, and cloud areas removed.
R> nov.tasscap <-tasscap("nov", 7) R> nov.RCS <-RCS(nov.tasscap) R> july4.rcscorr <-lmodel2(nov4@data[nov.  dates, rather than just in the clearer November image set, did not improve the results. The RCS correction reduced the dynamic range somewhat, and inverted the data range, as did relative normalization (Figure 3d; Figure 2d). Of the four whole-image relative atmospheric correction methods implemented here, only the histogram matching method produced acceptable results.
The root mean square error (RM SE) between the target and reference image can be used to assess the effectiveness of a relative correction. The greater the reduction in RM SE, the more effective the transformation was at matching that image pair. The RM SE for the untransformed Band 4 data was 0.2, and after histogram matching it was reduced to 0.067. Despite their poor overall performance in other ways, the other three methods reduced RM SE values as much or more: 0.043 for relative normalization, 0.043 for PIF, and 0.065 for RCS. The correspondence between image pairs can increase even when other properties such as range that are crucial for further analysis and interpretation are not preserved.

Absolute atmospheric correction
This class of methods attempts to deduce values for atmospheric parameters from information contained within the image itself rather than using externally-measured data. Each image is treated on its own.
The conversion of at-sensor radiance to atmospherically-corrected surface reflectance is described in Equation 5. Absolute atmospheric correction methods use measurements or atmospheric simulation models to determine the parameters T z , T v , E down , and L haze (Chavez 1989). The major difference between the relative atmospheric correction methods is the procedure for estimating these values. With default parameter choices, this simplifies to the equation for at-sensor reflectance (Table 4; Equation 4).

Dark object subtraction
The dark object subtraction (DOS) method assumes that if there are areas in an image with very low actual reflectance values, any apparent reflectance should be due to atmospheric scattering effects, and this information can be used to calibrate the rest of the image (Chavez 1988(Chavez , 1989. The darkest pixels can be selected by examining the histogram of the DN values in an image, or by setting a threshold such as 'lowest DN value found in at least n pixels,' or some other criterion appropriate for the size of image being analyzed. The chosen DN value, the Starting Haze Value (SHV ) is then converted to radiance (e.g., Equation 2 or similar conversion). It is unlikely that most images contain entire pixels that are true black, so a correction is applied that assumes a 1% actual reflectance of these areas.
The simplest form of DOS simply converts the calculated L haze value to at-sensor reflectances (Equation 4) and subtracts it from the entire image (also converted to at-sensor reflectances).
A new SHV value must be calculated for each band (Chavez 1988).
The improved method developed by Chavez (1989) uses information from a single band to calculate L haze values for the remaining bands of an image. This method produces correlated haze values, and may be more accurate if there are few shadows or dark areas. Scattering is band-specific, and the band effects are correlated with atmospheric conditions. The DOS method implemented here uses a realistic relative atmospheric scattering model, and maintains the spectral relationship between bands (Chavez 1989). This may be important for vegetation Iterative Iterative Iterative Iterative Table 4: Parameters values used in four different atmospheric correction methods.
indices whose values depend on the ratios between bands. The DOS method works poorly for bands 5 and 7, hugely overestimating the L haze value. Only small components of the total atmospheric scattering occurs in these bands, so the DN corrections for very clear atmosphere were used for bands 5 and 7. The DOS() function will take the SHV value for the starting band (usually Band 1, though bands 2 or 3 may be used), and calculate the related SHV value for each other band.
DOS correction as implemented here is a two-step process. In this example, the July image is corrected. Band 1 is used to find the SHV , here the lowest DN with at least 1000 pixels. That SHV (69) is then used in DOS() to find the corrected SHV values for the remaining bands (Chavez 1989). The above R output shows the possible SHV values calculated for five scattering coefficients, with the smallest coefficient (-4) describing the clearest conditions and -0.5 representing extreme haze. For this example, Band 1 was used to calculate the SHV from the actual image, so it is not adjusted for scattering coefficient. The user must select the appropriate set of band-specific SHV values. Chavez (1989) suggested using the original SHV as a guide: SHV ≤ 55 very clear (-4); SHV 56-75 clear (-2); SHV 76-95 moderate (-1); SHV 96-115 hazy (-0.7); SHV > 115 very hazy (-0.5). The SHV for Band 1 from July was 69, so the atmosphere was clear, and the corresponding column coef-2 was selected for use in DOS().  Chavez (1996) improved on his earlier dark object subtraction methods by adding a correction for the multiplicative transmittance component of the atmospheric scatter (cos θ z , abbreviated as COSTZ). In this revised procedure, cos θ z is used as an approximation of T z , and cos θ v is used as an approximation of T v . For Landsat, the latter parameter is one because the satellite sensor has a nadir view (θ v = 0 • ). The value for L haze is determined just as for the DOS method. This method is not appropriate for use with bands 5 or 7; the original DOS method should be used with these data (Song, Woodcock, Seto, Lenney, and Macomber 2001).

Modified dark object subtraction
The modified dark object subtraction method (DOS4) was developed to incorporate the effect of atmospheric aerosols into atmospheric correction (Song et al. 2001). The value for L haze calculated in the DOS method is used. Corrected values for T z and T v are determined through an iterative process. Both are set to initial values of one, and Equation 7 is solved for τ , the Rayleigh atmospheric optical depth. New values are calculated for T z and T v (Equation 8), and the process is repeated until τ stabilizes. This generally requires fewer than ten iterations. When compared to at-sensor reflectance, all three absolute atmospheric correction methods produced similar results for the July test image (Figure 4). Each reduced the dynamic range, but only slightly, and the overall appearance was similar to the original, unlike the results of most of the relative correction methods.
Other research has found that COSTZ was effective in the visible-light bands (1-3), but less accurate in the near-infrared, particularly in humid conditions (Wu, Wang, and Bauer 2005). Song et al. (2001) found that DOS4 outperformed COSTZ or DOS.

Topographic correction
The interaction between sun angle, surface slope and satellite position produces variations in surface reflectance unrelated to true reflectance. The same land use type can return a different signal if located on opposite sides of a hill. The intent of topographic correction is to remove this source of variation, and leave only the portion of the reflectance signal actually due to ground cover, thus converting the reflectance from an inclined surface (ρ T ) to that from an equivalent horizontal surface (ρ H ).
The sun angle at the time of the November image was lower, so differential shading of the north and south faces of the ridge in the middle of the image can be clearly seen (Figure 1b, and see also Figure 1f). A successful topographic correction would eliminate this shading difference; corrected images should appear flat.

Illumination
Slope and aspect are used in conjunction with solar and satellite parameters to model illumination (IL) conditions. Information derived from the DEM is required to compute the incident angle (γ i ), defined as the angle between the normal to the ground and the sun rays.
The IL parameter varies from −1 (minimum illumination) to 1 (maximum illumination) and is calculated from the slope angle θ p , solar zenith angle θ z , solar azimuth angle φ a and aspect angle φ o (Equation 9). IL = cos γ i = cos θ p cos θ z + sin θ p sin θ z cos (φ a − φ o ) The illumination is then used in one of several topographic correction methods. Seven such algorithms are currently implemented in topocorr(). Many of these algorithms are described and compared in Riano et al. (2003).

Lambertian methods
Lambertian methods assume that the reflectance for all wavelengths is constant regardless of viewing angle. The correction factor is identical across all bands.

Cosine correction
The cosine method is a trigonometric approach that assumes that irradiance is proportional to the cosine of the incidence angle (Teillet, Guindon, and Goodenough 1982). The cosine correction discounts indirect illumination and saturates in dark areas.

Improved cosine correction
The improved cosine method attempts to compensate for the overcorrection often seen in the cosine method by including average illumination (IL) in the calculation (Civco 1989).

Gamma correction
The Gamma method is an extension of the cosine correction that adds terms for the sensor view angle on flat terrain (θ v ) and on inclined terrain (β v ). This is another attempt to reduce overcorrection in areas with low illumination (Richter, Kellenberger, and Kaufmann 2009).

Sun-canopy-sensor method
The sun-canopy-sensor method (SCS) was developed by Gu and Gillespie (1998) for use in forested areas where the canopy geometry contributes to reflectance on a sloped surface (Gao and Zhang 2009).

Non-Lambertian methods
This group of methods recognizes that the combination of angles of incidence and observation can affect reflectance: surface roughness matters.

Minnaert method
The Minnaert method adds a band-specific constant K to the cosine method (Minnaert 1941, in Riano et al. 2003Equation 14). If K = 1, the Minnaert and cosine methods are equivalent (Lambertian behavior is assumed).
The K parameter is the slope of the ordinary linear regression with K and ln ρ H as regression coefficients, and is constant across the entire image for each band (Equation 15).
ln(ρ T ) = ln(ρ H ) + K ln IL cos θ z The topocorr() function with the argument method = "minnaert" will take care of both steps, returning the corrected image.

C-correction
The C-correction method is a statistical approach based on the ratio of the band-specific regression coefficients b and m, as given in Equation 17 (Teillet et al. 1982).
The cosine and SCS methods also resulted in reflectances greater than one. The Minnaert methods increased the dynamic range, though not the median, and the gamma correction reduced both. Meyer, Itten, Kellenberger, Sandmeier, and Sandmeier (1993) found that the Minnaert and C-correction methods gave similar results, but that the cosine method results were very different. Other work found that Minnaert was superior under most conditions (Richter et al. 2009).

Conclusions
Implementing atmospheric and topographic correction methods in R was straightforward, given R's good preexisting spatial capabilities. While large-scale production work is probably best done in a GIS environment, the scripting features available in R make it a feasible solution for small-and medium-sized tasks involving satellite remote sensing imagery.
The landsat package will be useful for researchers investigating the utility of these algorithms for their own work, and for facilitating development of new methods. The topographic algorithms especially appear to need further research. The wide range of sophisticated statistical tools available in R make it suitable for research and development on these methods. The open source nature of both R and of the landsat package allow for examination and alteration of the code base, facilitating further work on processing satellite imagery.