webchem: An R Package to Retrieve Chemical Information from the Web

A wide range of chemical information is freely available online, including identifiers, experimental and predicted chemical properties. However, these data are scattered over various data sources and not easily accessible to researchers. Manual searching and downloading of such data is time-consuming and error-prone. We developed the open-source R package webchem that allows users to automatically query chemical data from currently 14 web sources. These cover a broad spectrum of information. The data are automatically imported into an R object and can directly be used in subsequent analyses. webchem enables easy, structured and reproducible data retrieval and usage from publicly available web sources. In addition, it facilitates data cleaning, identification and reporting of substances. Consequently, it reduces the time researchers need to spend on chemical data compilation.


Introduction
Before a statistical analysis, data cleaning is often required to ensure good data quality. Data cleaning is the process of detecting errors and inconsistencies in data sets (Chapman 2005). In practice, the data cleaning step is often more time-consuming than the subsequent statistical analysis, particularly when the analysis relies on joining multiple data sources.
When dealing with chemical data sets (e.g., environmental monitoring data, toxicological data), a first step is often to validate the names of chemicals or to link them to unique codes that simplify subsequent querying and appending of compound-related physico-chemical or toxicological information. Several web sources provide chemical names or link them to unique codes (see also Section 3). However, manual searching for each compound, often through a graphical web interface, is tedious, error-prone and not reproducible (Peng 2009).
To simplify and automate this task and to make it more robust, i.e., to search and retrieve chemical information from the web, we created the webchem package (Szöcs et al. 2020) for the free and open source R language (R Core Team 2020; Wehrens 2011). R is one of the most widely used software environments for cleaning, analyzing and visualizing data, and supports full reproducibility of each step (Marwick 2016).
In the following, we describe the basic functionality of the package and demonstrate with a few use cases how to clean and retrieve new data with webchem.

Implementation and design details
The webchem package is written entirely in R and available under an MIT license. The development repository is hosted on GitHub at https://github.com/ropensci/webchem/ and a stable version is released on the Comprehensive R Archive Network (CRAN) and available at https://CRAN.R-project.org/package=webchem. webchem is part of the rOpenSci project (Boettiger, Chamberlain, Hart, and Ram 2015), which aims at fully reproducible data analysis. webchem is registered on the Resource Identification Portal (SciCrunch Inc 2020).
Some data sources provide application programming interfaces (APIs). Web APIs define functions that allow accessing services and data via HTTP and return data in a specific way. webchem uses the API of a data source provider, where available. For sources where an API is lacking, but policies of the service provider permit programmable access, data are directly searched and extracted from the web pages, analogous to manual interaction with a website.
Only a few design decisions have been made: Each function name has a prefix and a suffix separated by an underscore (Chamberlain and Szöcs 2013). Function names follow the format source_function, e.g., cs_compinfo uses ChemSpider as source (see Section 3) to retrieve compound information. Some functions require first querying a unique identifier from the data source and then using this identifier to query further information. The prefix get is used to denote the functions that retrieve such identifiers, e.g., get_csid retrieves the identifier used in ChemSpider.
webchem is friendly to the resources of data providers. Between each request there is a time-out of 0.3 to 2 seconds depending on the data source. Therefore, processing of larger data sets can take some time, but still represents a major improvement compared to manual lookup. We provide a link to the Terms of Use of data providers in the documentation of each function and we encourage users to read these before using webchem. Moreover, all functions return a URL of the source, which can be used for (micro-)attribution.
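The pause between requests described above can be illustrated with a minimal base R sketch. It is not webchem's internal code: the helper polite_map() and the stand-in function fake_fetch() are hypothetical names introduced only for this illustration, and no web request is actually made.

```r
# Hypothetical helper: call `fetch` once per query, pausing between requests.
polite_map <- function(queries, fetch, pause = 0.3) {
  results <- vector("list", length(queries))
  names(results) <- queries
  for (i in seq_along(queries)) {
    # Wait before every request except the first one
    if (i > 1) Sys.sleep(pause)
    # A failed request yields NA instead of aborting the whole batch
    value <- tryCatch(fetch(queries[i]), error = function(e) NA)
    results[i] <- list(value)
  }
  results
}

# Stand-in for a real web request; it only returns the query in upper case
fake_fetch <- function(q) toupper(q)
res <- polite_map(c("atrazine", "diuron"), fake_fetch, pause = 0.3)
str(res)
```

With a pause of 0.3 seconds, a batch of n queries spends at least 0.3 × (n − 1) seconds waiting, which is the price paid for not overloading the data provider.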

Data sources
The backbone of webchem is formed by data sources that provide their data and functionality to the public. Currently, data can be retrieved from 14 sources. These cover a broad spectrum of available data, like identifiers, experimental and predicted properties and regulatory information (a detailed overview of all sources is included in Figure 1).
NIH Chemical Identifier Resolver (CIR) (NIH 2020) A web service that converts from and to various chemical identifiers. The database holds millions of identifier combinations.
ChemSpider (Pence and Williams 2010) A free chemical structure database providing access to over 67 million structures from hundreds of data sources. It provides identifiers, properties and can also be used to convert identifiers.
Chemical Translation Service (CTS) (Wohlgemuth, Haldiya, Willighagen, Kind, and Fiehn 2010) A web service that converts from and to various chemical identifiers.
ETOX (UBA 2020) Information System Ecotoxicology and Environmental Quality Targets by the German Federal Environmental Agency. Provides basic identifiers, synonyms, ecotoxicological data for over 64,000 entries and quality targets for different countries.
ChemIDplus (Tomasulo 2002) A large web-based database provided by the National Library of Medicine. It contains identifiers, synonyms, toxicological data and chemical properties for over 420,000 records.
OPSIN (Lowe, Corbett, Murray-Rust, and Glen 2011) The Open Parser for Systematic IUPAC nomenclature is a chemical name interpreter and provides InChI and SMILES identifiers.
Compendium of Pesticide Common Names (Wood 2020) The compendium provides information on pesticide common names, identifiers and classification. The compendium contains more than 1800 active ingredients and more than 350 ester and salt derivatives.
PubChem (Kim et al. 2019) PubChem is a public repository for information on more than 250 million chemical substances, providing identifiers, properties and synonyms. We use an interface to the PUG-REST web service (Kim, Thiessen, Bolton, and Bryant 2015).
PAN Pesticide Database (PAN 2020) Information on pesticides. Provides basic identifiers, ecotoxicological data, chemical properties, uses and regulatory status for 6,500 pesticides.
Flavornet (Acree and Arn 2020) Flavornet is a compilation of 738 aroma compounds found in human odor space.
NIST (NIST 2018) The NIST Chemistry WebBook provides access to data compiled and distributed by NIST under the Standard Reference Data Program. The WebBook provides chemical and physical property data on over 40,000 compounds.
ChEBI (Hastings, Owen, Dekker, Ennis, Kale, and Muthukrishnan 2016) Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on "small" chemical compounds. The database contains more than 56,000 compounds.
SRS (U.S. EPA 2020b) Substance Registry Services is the US EPA central system for information about substances that are tracked or regulated by EPA or other sources.
Though the data sources exhibit some overlap in the provided information, each has been selected because it also provides unique information and we encourage the interested reader to consult the related source for details.

Installation
webchem can be easily installed from CRAN and loaded:
R> install.packages("webchem")
R> library("webchem")
The package is under active development. The latest development version is available from GitHub and also permanently available at Szöcs et al. (2020). This document has been created using webchem version 1.0.0.

Sample data sets
To demonstrate the capabilities of webchem we use two small publicly available real world data sets. The data sets are only used for the purpose of demonstration, have been slightly preprocessed (not shown) and are available through the package.
(i) jagst: This data set comprises environmental monitoring data of organic substances in the river Jagst, Germany, sampled in 2013. The data are publicly available and can be retrieved from LUBW (2020). It comprises concentrations (in µg / L) of 34 substances on 13 sampling occasions. First we load the data set and inspect the first six rows:
R> data("jagst", package = "webchem")
R> head (

Query identifiers
The jagst data set covers 34 substances that are identified by (German) names. Merging and linking these to other tables is hampered by differences and ambiguity in compound names.
One way to resolve this is to use chemical identifiers that allow easy identification. There are several identifiers available, e.g., registry numbers like CAS or EC, database identifiers like PubChemCID (Kim et al. 2019) or ChemSpiderID (Pence and Williams 2010), and line notations like SMILES (Weininger 1990), InChI and InChIKey (Heller, McNaught, Pletnev, Stein, and Tchekhovskoi 2015). In this first example we query several identifiers to create a table that can be used (i) as supplemental information to a research article or (ii) to facilitate subsequent matching with other data.
As we are dealing with German substance names, we start by querying ETOX for CAS registry numbers. A common workflow when dealing with web resources is to 1) query a unique identifier of the source, 2) use this identifier to retrieve additional information and 3) extract the parts that are needed from the R object (Chamberlain and Szöcs 2013).
First we search for ETOX internal ID numbers using the substance names:
R> subs <- unique(jagst$substance)
R> ids <- get_etoxid(subs, match = "best")
R> head(ids)
Only three substances could not be found in ETOX. Here we specify that only the "best" match (in terms of the Levenshtein distance between query and results) is returned. A manual check confirms appropriate matches. Other options include: "all", which returns all matches; "first", which returns only the first match (not necessarily the best match); "ask", which enters an interactive mode where the user is asked for a choice if multiple matches are found; and "na", which returns NA in case of multiple matches.
We use these data to retrieve basic information on the substances. When possible, webchem returns a data frame with one or more rows per substance. However, data from some sources can be very voluminous and not tabular. In these cases webchem always returns a named list (one entry for each substance). We provide extractor functions for the common identifiers: CAS, SMILES and InChIKeys.
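To make the idea behind the "best" match concrete, the following minimal sketch selects the candidate with the smallest Levenshtein distance to the query, using adist() from the utils package that ships with R. The helper best_match() and the candidate names are assumptions made for illustration; webchem's own matching code may differ in detail.

```r
# Hypothetical helper: return the candidate closest to the query
best_match <- function(query, candidates) {
  # adist() computes the generalized Levenshtein (edit) distance
  d <- utils::adist(query, candidates, ignore.case = TRUE)
  # which.min() returns the first candidate in case of ties
  candidates[which.min(d)]
}

# Made-up candidate names, as a search might return them
hits <- c("Atrazin", "Atrazin-desethyl", "Atrazin-desisopropyl")
best_match("Atrazin", hits)
```

Because ties are resolved in favour of the first candidate, a manual check of the matches, as done above, remains advisable.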
In the same manner, we can now use these CAS numbers to query other identifiers from another source, such as PubChem (see Figure 1).

Toxicity of different pesticide groups
Another question we might ask is: How does toxicity vary between insecticide groups? Answering this question would require tedious lookup of insecticide groups for each of the 124 CAS numbers in the lc50 data set. The Compendium of Pesticide Common Names (Wood 2020) contains such information and can be easily queried using CAS numbers with webchem:
R> aw_data <- aw_query(lc50$cas, type = "cas")
To extract the chemical group from the retrieved data set, we write a simple extractor function and apply this to the retrieved data:
R> igroup <- sapply(aw_data, function(y) {
+   if (is(y, "list")) y$subactivity[1] else NA_character_
+ })
R> igroup[1:3]
50-29-3
"organochlorine insecticides"
52-68-6
"phosphonate insecticides"
55-38-9
"phenyl organothiophosphate insecticides"
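Once a group has been assigned to each compound, toxicity can be summarized per group. The sketch below uses invented toy values and hypothetical column names (cas, lc50, group) rather than the actual lc50 data set; it only illustrates one possible way to aggregate with base R.

```r
# Toy values, invented for illustration; not taken from the lc50 data set
tox <- data.frame(
  cas   = c("a", "b", "c", "d"),
  lc50  = c(1.0, 4.0, 10.0, 30.0),
  group = c("group A", "group A", "group B", NA),
  stringsAsFactors = FALSE
)

# Median LC50 per group; rows without a group (e.g., failed lookups)
# are dropped by the default na.action of the formula interface
by_group <- aggregate(lc50 ~ group, data = tox, FUN = median)
by_group
```

Using the median rather than the mean is a design choice made here because toxicity values are typically strongly skewed.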

Identify common names and roles of chemicals
We might be interested in common names and roles of the most toxic chemicals in the lc50 data set, i.e., chemicals with the lowest LC50 values. webchem can query the ChEBI database (Hastings et al. 2016) to retrieve common names and roles of chemicals. We extract the CAS numbers of the three most toxic chemicals, and then use get_chebiid() to retrieve ChEBI identifiers and substance names. Subsequently, we use the acquired ChEBI identifiers to query the complete entity by using chebi_comp_entity(). From the result, we extract the parents table and refine the data to chemical roles.

Querying partitioning coefficients
Some data sources also provide data on chemical properties that can be queried. Here we query predicted octanol-water partitioning coefficients (log P oct/wat) from the PubChem database for the compounds in the lc50 data set and use them to build a simple quantitative structure-activity relationship (QSAR) to predict toxicity. The resulting data and model are displayed in Figure 3.
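A simple QSAR of this kind can be sketched as a linear regression of log-transformed toxicity on the partitioning coefficient. The values and column names below are invented for illustration and are not the data shown in Figure 3; the model form (log10 of LC50 against log P) is likewise an assumption of this sketch.

```r
# Toy values, invented for illustration only
qsar <- data.frame(
  logp = c(1.5, 2.8, 3.6, 4.7, 5.9),
  lc50 = c(900, 150, 40, 6, 0.8)
)

# Linear model of log10(LC50) against log P
mod <- lm(log10(lc50) ~ logp, data = qsar)
coef(mod)

# Predicted LC50 for a hypothetical compound with log P = 4,
# back-transformed from the log10 scale
pred <- 10^predict(mod, newdata = data.frame(logp = 4))
pred
```

A negative slope in such a model means that more lipophilic compounds have lower LC50 values, i.e., are more toxic.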

Regulatory information
Regulatory information is of particular interest if concentrations exceed national thresholds. In the European Union (EU) the Water Framework Directive (WFD; EU 2000) defines Environmental Quality Standards (EQS). Similarly, the U.S. and Canadian EPA and the WHO define Quality Standards. Information on these standards can be queried with webchem from the PAN Pesticide Database (using pan_query()) and from ETOX (using etox_targets()).
In this example we search for the minimum EQS for the EU for the compounds in the jagst data set, join these with measured concentrations and evaluate whether exceedances occurred.
We re-use the ETOX IDs queried above to obtain further information from ETOX, namely the Maximum Acceptable Concentration EQS (MAC-EQS). Finally, we can compare the measured values to the MAC-EQS, which reveals that there have been no exceedances for these 6 compounds.
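The comparison of measured concentrations with thresholds amounts to a join followed by an element-wise comparison. The following sketch uses invented toy values and hypothetical column names (substance, value, mac_eqs); it is not the code or the data behind the result reported above.

```r
# Toy values, invented for illustration only
# (concentrations and thresholds in micrograms per litre)
conc <- data.frame(
  substance = c("A", "A", "B", "C"),
  value     = c(0.02, 0.05, 0.30, 0.01)
)
mac <- data.frame(
  substance = c("A", "B"),
  mac_eqs   = c(0.10, 0.50)
)

# Inner join: keeps only substances for which a threshold is available
joined <- merge(conc, mac, by = "substance")

# Flag every measurement above its threshold
joined$exceeds <- joined$value > joined$mac_eqs
any(joined$exceeds)
```

Substances without a threshold (here "C") drop out of the inner join, so the absence of exceedances only refers to compounds for which a standard could be retrieved.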

Utility functions
webchem also provides basic functions to check identifiers that can be used for data quality assessment. The functions either use simple formatting rules,
[1] FALSE
R> is.cas("64-17-6")
(Swain 2018) and CIRpy (Swain 2016) are available for similar tasks as those outlined here. webchem is not specialized and tries to integrate many data sources, and for some of these it provides a unique programmatic interface. The Chemical Translation Service (Wohlgemuth et al. 2010), which is also one of the sources that can be queried, allows batch conversion of chemical identifiers. However, it does not provide access to other data (experimental, modeled or regulatory data).
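The kind of formatting rule behind a check such as is.cas() can be sketched in base R. CAS registry numbers end in a check digit: the remaining digits are weighted 1, 2, 3, ... from right to left, and the sum modulo 10 must equal the check digit. The helper check_cas() below is a hypothetical re-implementation for illustration, not webchem's own code.

```r
# Hypothetical helper; webchem's own is.cas() may differ in detail
check_cas <- function(x) {
  # Format: 2-7 digits, 2 digits, 1 check digit, separated by hyphens
  if (!grepl("^[0-9]{2,7}-[0-9]{2}-[0-9]$", x)) return(FALSE)
  digits <- as.integer(strsplit(gsub("-", "", x), "")[[1]])
  check <- digits[length(digits)]
  body <- rev(digits[-length(digits)])
  # Weight the remaining digits 1, 2, 3, ... from right to left
  sum(body * seq_along(body)) %% 10 == check
}

check_cas("64-17-5")  # ethanol: TRUE
check_cas("64-17-6")  # wrong check digit: FALSE
```

For 64-17-5 the weighted sum is 7·1 + 1·2 + 4·3 + 6·4 = 45, and 45 modulo 10 gives the check digit 5, which is why the variant ending in 6 is rejected.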

Open science
An increasing amount of scientific data is becoming publicly available (Gewin 2016; Reichman, Jones, and Schildhauer 2011; O'Boyle et al. 2011), either in public data repositories or as supplements to publications. To be usable for other researchers, chemical compounds should be properly identified, not only by chemical names but also with accompanying identifiers like InChIKey, SMILES and authority-assigned identifiers. webchem provides an easy way to create such meta tables, as shown in Table 1, and facilitates chemical data availability to researchers. However, good data quality is crucial for every analysis (Stieger, Scheringer, Ng, and Hungerbühler 2014), and additional effort and methods are needed to validate data quality.

Further development
We have outlined only a few use cases that will likely be useful for many researchers. Given the huge amount of publicly available information, many other possibilities can be envisioned. webchem is currently under active development and several other data sources have not been implemented yet but may be in the future. GitHub makes contributing easy and we strongly encourage contribution to the package. Moreover, comments, feedback and feature requests are highly welcome.

Conclusions
Researchers need to have easy access to global knowledge on chemicals. webchem can save hundreds of working hours gathering this knowledge (Münch and Galizia 2016), so that researchers can focus on other tasks.