Working with User Agent Strings in Stata: The parseuas Command

With the rising popularity of web surveys and the increasing use of paradata by survey methodologists, assessing information stored in user agent strings becomes inevitable. These data contain meaningful information about the browser, operating system, and device that a survey respondent uses. This article provides an overview of user agent strings, their specific structure and history, how they can be obtained when conducting a web survey, and what kind of information can be extracted from the strings. Further, the user-written command parseuas is introduced as an efficient means to gather detailed information from user agent strings. The application of parseuas is illustrated by an example that draws on a pooled data set consisting of 29 web surveys.


Introduction
In recent years, web surveys have become an increasingly popular and important data collection mode, and today, they account for a great share of the studies in the social sciences. Web surveys have been used to complement traditional offline surveys as a less expensive form of pre-test or a mode of choice for respondents who are not willing to use "traditional" modes, such as face-to-face or telephone interviews.
Along with the rising popularity of web surveys, survey methodologists have taken an increasing interest in paradata, which are collected as a byproduct of the survey process (Couper 2000). Examples of paradata include the device used to complete a survey, response latencies, and response patterns. These data are used for several purposes, such as to address the strength and consistency of attitudes (Bassili 1993; Mayerl 2013; Meyer and Schoen 2014), to study data quality issues, for example, those associated with interview durations or item level
Until now, user agent strings had to be coded manually by referring to freely available sources like http://www.useragentstring.com/ as suggested by Callegaro (2013) or by employing external tools, for example, the application programming interface (API) of the aforementioned website. Both approaches impose a burden on researchers using Stata (StataCorp 2019). First, researchers need to separately code the strings manually for every data set by rifling through the different sources of user agent string information. This approach is highly inefficient, since user agent strings are widely available data, and extracting information from them is often needed. Second, researchers need to convert data sets and switch between different software, which is labor intensive and takes up time that could be better used for substantive research.
The present article introduces the user-written Stata command parseuas, which automatically extracts information from user agent strings and thus remedies the aforementioned shortcomings. We give a brief introduction to the technical properties and the structure of user agent strings and explain how to extract information from them. Then we explain how to collect these paradata with frequently used web survey software and, more generally, by using JavaScript. In the next section, we introduce the syntax and options of parseuas. After introducing the command, we provide a comprehensive example using a pooled data set of 29 web surveys and, in the process, demonstrate the application of parseuas.

User agent strings
Since many different browsers and devices are used to access the Internet, the software developers of web pages need to be able to detect the capabilities of browsers. This is necessary because different browsers have different ways of rendering and displaying web pages and vary in their implementation of JavaScript. With the emergence of more browsers and smartphones, variety has increased substantially. With respect to survey research, the ability of respondents to use different devices to participate in web surveys (and even change their device during an interview) poses new challenges in terms of equivalent measurement. A lot of research has identified best practices and problems in visual design (for an overview, see Couper 2008; Dillman, Smyth, and Christian 2014; Tourangeau, Conrad, and Couper 2013), and the impact of participants who use smartphones and other mobile devices to answer web surveys. While the discussions about the best visual design continue, researchers still need to decide whether to develop their survey along the principles of a mode-specific design or opt for a uni-mode design. Unfortunately, a single right answer does not exist, since much depends on the research goals. Meanwhile, the software industry has coined the term mobile-first design, which means that developers program survey software with the goal of improving the survey experience for mobile participants. As it turns out, respondents using desktop computers and notebooks also can benefit from such an approach. Several web survey software companies claim that they can detect the device type being used and are able to utilize this information to send device-tailored questionnaires to respondents.
Advanced survey software offers this capability as part of a rich set of features that survey researchers can use to decide how different types of questions (e.g., grid questions) should behave on different device types or whether mobile devices actually should receive a different questionnaire layout.
In summary, survey researchers need a way to find out which device respondents are using, and the user agent string offers this type of information. Researchers can either use web survey software that includes the user agent string as part of the standard data download, or they can collect the user agent string themselves. Once the user agent strings are in the data set, the Stata command parseuas extracts the most useful information for further analyses.

The structure of a user agent string
The user agent string was developed so that each device can inform a server about the particular product and product version being used. In an ideal world, each device would send specific information that enables servers (and, for the purpose of this article, survey software) to detect the device, so that the server knows the capabilities of the browser and can send appropriate web pages. However, as more products and new versions are developed, the variety of user agent strings has increased to many thousands. For a comprehensive overview of the history and current use of user agent strings and their different formats, see Zakas (2015, pp. 276-307).
The modern user agent string includes information about the web browser, the browser version, the rendering engine of the browser, the device type, and the operating system running on the device. In addition, the user agent string can contain additional device-specific information such as supported encryption or other proprietary information. Different companies are using different formats. Since the strings are produced to be parsed by specialized programs, it will be difficult or even impossible for most researchers to understand every part of a given user agent string. By comparison, Internet Explorer 8 follows a rather easy-to-read format (Zakas 2015, p. 279):
    Mozilla/4.0 (compatible; MSIE Version; Operating System; Trident/TridentVersion)
The first term 'Mozilla/4.0' is fixed and a remnant of the early days of the Internet. The same holds for the word 'compatible'.
In the next example, the interesting information is 'MSIE 8.0', which indicates Internet Explorer version 8. 'Windows NT 5.1' actually means Windows XP. 'Trident/4.0' is a token used to indicate that this is Internet Explorer version 8. This enables us to detect the correct version even when Internet Explorer is running in compatibility mode; in this case, the first part would be 'MSIE 7.0'. While older detection scripts would see Internet Explorer 7, newer scripts are able to look for 'Trident' and detect Internet Explorer 8. This also illustrates why a simple string match approach would deliver insufficient and potentially wrong data. The '.NET CLR' entries are part of the potentially long series of additional information, in this case indicating that the machine runs five different versions of the .NET framework:
    Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
A principal problem of parsing user agent strings is that there is only modest uniformity among the different parts of the user agent string, which complicates coding. Furthermore, due to fast technological development, the content of user agent strings changes over time. As a consequence, the coding scheme or dictionary used for coding has to be updated on a regular basis to facilitate correct coding. We used information from the World Wide Web Consortium, the Mozilla Developer Network, and specialized web pages to generate the coding scheme for this Stata command.
The script is written so that it also works with user agent string formats that are 10 years old, which is possible because the industry has not changed the basic user agent string format. Instead, user agent strings have been gradually complemented with additional content for newly developed browsers, devices, and operating systems.
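The compatibility mode case described above can be sketched in a few lines. The following JavaScript fragment is an illustration only and not the code of parseuas; the function name detectInternetExplorer is hypothetical. It also assumes, beyond what is stated above, that the major version of Internet Explorer equals the major version of the Trident token plus four, which holds for Internet Explorer 8 to 11.

```javascript
// Hypothetical helper (not part of parseuas): infer the Internet Explorer
// version, preferring the Trident token over the MSIE token.
function detectInternetExplorer(ua) {
  const msie = /MSIE (\d+)\.\d+/.exec(ua);
  const trident = /Trident\/(\d+)\.\d+/.exec(ua);
  if (!msie && !trident) return null; // not Internet Explorer

  if (trident) {
    // Assumption: Trident/4.0 belongs to IE 8, Trident/5.0 to IE 9, and so on.
    const version = Number(trident[1]) + 4;
    const reported = msie ? Number(msie[1]) : version;
    // A lower MSIE number than the engine implies compatibility mode.
    return { version: version, compatibilityMode: reported < version };
  }

  // No Trident token (IE 7 and older): rely on the MSIE token alone.
  return { version: Number(msie[1]), compatibilityMode: false };
}

// Internet Explorer 8 running in compatibility mode reports 'MSIE 7.0'.
console.log(
  detectInternetExplorer("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0)")
);
```

A script that looked only at the MSIE token would code this string as Internet Explorer 7, which is the error the Trident token helps to avoid.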

Device type: mobile, tablet, or desktop
The identification of the device used by the client is a common application of user agent strings. In general, three types of devices can be detected: desktop computers (including notebooks, ultrabooks, and netbooks), tablet computers, and mobile phones. Depending on the brand and type of the device, it can be identified directly or by combining different pieces of information. For instance, the user agent string of Android smartphones contains the terms 'android' and 'mobi(le)', whereas the user agent string of Android tablets only contains 'android' or a combination of 'android' and 'tablet'. The user agent strings of other brands of mobile phones include the terms 'iPhone', 'Windows Phone', 'Symbian', or 'BlackBerry'. Other tablet computers are associated with the terms 'iPad', 'Playbook', or 'Kindle (Fire)'. User agent strings that do not include any of these terms most likely belong to desktop computers. However, some minor exceptions exist, especially since some older devices have to be identified by other information included in the user agent string, for instance, the device number.
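The rules in this paragraph translate into a short decision sequence. The JavaScript sketch below is a simplified illustration built only from the terms listed above; it is not the code of parseuas, the function name classifyDevice is hypothetical, and the exceptions for older devices are deliberately ignored.

```javascript
// Hypothetical classifier based on the terms named in the text.
function classifyDevice(ua) {
  const s = ua.toLowerCase();
  // Tablets that identify themselves by a product term.
  if (/ipad|playbook|kindle/.test(s)) return "tablet";
  // Android with 'mobi(le)' is a smartphone; Android without it is a tablet.
  if (s.includes("android")) return s.includes("mobi") ? "mobile" : "tablet";
  // Other mobile phone brands.
  if (/iphone|windows phone|symbian|blackberry/.test(s)) return "mobile";
  // No known term found: most likely a desktop or notebook computer.
  return "desktop";
}

// Abbreviated fragments, not complete user agent strings.
console.log(classifyDevice("Linux; Android 4.4.2; GT-I9505 Mobile")); // mobile
console.log(classifyDevice("Linux; Android 4.4.2; Tablet"));          // tablet
console.log(classifyDevice("Windows NT 6.1; WOW64"));                 // desktop
```

The order of the checks matters: the Android test has to look for 'mobi' before falling back to the tablet category, because both device types share the term 'android'.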

Operating system
The operating system is given in most user agent strings. In addition, it often is possible to identify the version of the operating system. For instance, the token 'Windows NT 6.1' refers to Windows 7 and 'Windows NT 6.3' to Windows 8.1, whereas the value of the Windows NT token changed to 'Windows NT 10.0' with the release of Windows 10. In some cases, the user agent string also contains information on the hardware of the device (e.g., if the software runs on a 32- or 64-bit system, or if a Macintosh computer is used).
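Because the Windows NT token does not carry the marketing name, a lookup table is needed. The JavaScript sketch below is an illustration and not the dictionary of parseuas: the table contains only the tokens mentioned in this article, and the function name detectWindows as well as the label "Windows (other)" are hypothetical. It further assumes that 'WOW64' or 'Win64' in the string indicates a 64-bit system.

```javascript
// Lookup limited to the tokens mentioned in this article.
const WINDOWS_NT_NAMES = {
  "5.1": "Windows XP",
  "6.1": "Windows 7",
  "6.3": "Windows 8.1",
  "10.0": "Windows 10",
};

// Hypothetical helper: map the Windows NT token to a product name.
function detectWindows(ua) {
  const match = /Windows NT (\d+\.\d+)/.exec(ua);
  if (!match) return null; // no Windows NT token in the string
  const name = WINDOWS_NT_NAMES[match[1]] || "Windows (other)";
  // Assumption: 'WOW64' or 'Win64' signals a 64-bit operating system.
  const is64bit = /WOW64|Win64/.test(ua);
  return { name: name, is64bit: is64bit };
}

console.log(detectWindows("Mozilla/5.0 (Windows NT 6.1; WOW64)"));
```

A production dictionary would need entries for every released token and regular updates, which is the maintenance issue discussed earlier.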

Browser name, version, and rendering engine
With regard to the web browser, the user agent string includes several bits of information. First, the rendering engine can be Gecko, WebKit, AppleWebKit, Presto, Trident, EdgeHTML, or Blink. Second, the user agent string usually includes the name of the browser. The currently most popular browsers are Firefox, Internet Explorer, Chrome, Safari, and Opera. Third, in most instances, the user agent string includes the version of the web browser. Usually, the version directly follows the browser name (e.g., 'Firefox/*.*' or 'MSIE *.*'). Detection of the browser and the browser version is complicated by some notable exceptions. First, the same user agent string can potentially be used by different browsers; for example, some Chrome versions use the Safari user agent string. Second, some specific issues with respect to gathering the version of the browser need to be considered (e.g., the version number statement changes between different versions of the Opera browser). Accordingly, rules for the detection of the browser name and version have to take into account some specific cases.
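One common way to handle these exceptions is an ordered list of rules in which the more specific tokens are tested first. The JavaScript sketch below illustrates the idea; it is not the rule set of parseuas, and the names BROWSER_RULES and detectBrowser are hypothetical. The patterns for Edge ('Edge/') and for newer Opera versions ('OPR/') are assumptions that go beyond the examples given in this article.

```javascript
// A version is a sequence of digits separated by dots, e.g. 34.0.1847.114.
const VERSION = "(\\d+(?:\\.\\d+)*)";

// Order matters: Chrome strings also contain 'Safari', and newer Opera
// strings also contain 'Chrome', so the specific tokens are tested first.
const BROWSER_RULES = [
  ["Edge", "Edge/" + VERSION],
  ["Opera", "OPR/" + VERSION],
  ["Opera", "Opera.*Version/" + VERSION], // some Opera versions report here
  ["Opera", "Opera[/ ]" + VERSION],
  ["Chrome", "Chrome/" + VERSION],
  ["Safari", "Version/" + VERSION + ".*Safari"],
  ["Firefox", "Firefox/" + VERSION],
  ["Internet Explorer", "MSIE " + VERSION],
].map(([name, source]) => ({ name: name, pattern: new RegExp(source) }));

// Hypothetical helper: return the first rule that matches.
function detectBrowser(ua) {
  for (const rule of BROWSER_RULES) {
    const match = rule.pattern.exec(ua);
    if (match) return { name: rule.name, version: match[1] };
  }
  return { name: "Browser (other)", version: null };
}

// Abbreviated fragment: contains 'Safari' but is coded as Chrome.
console.log(detectBrowser("AppleWebKit/537.36 Chrome/34.0.1847.114 Mobile Safari/537.36"));
```

Reordering the list so that Safari is tested before Chrome would code every Chrome string as Safari, which shows why a simple sequence of independent string matches is not sufficient.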

Examples indicating different devices and browsers
In the following, we discuss five instructive examples of common combinations of the browser, operating system, and device in user agent strings.
The first example contains the substring 'Firefox/31.0', which indicates that the browser is Firefox 31.0. The expression 'Gecko' reveals that the rendering engine of the browser is Gecko. 'Windows NT 6.1' is Windows 7, while 'WOW64' shows that a 32-bit application is running on a 64-bit processor. The computer is a desktop or notebook.
The third example is Chrome 34.0.1847.114, which uses an AppleWebKit rendering engine. It can be inferred from the combination of the expressions 'Android 4.4.2' and 'Mobile' that the device is a smartphone using Android 4.4.2 as the operating system. The term 'GT-I9505' tells us that the device is a Samsung Galaxy S4.
The substrings 'Safari' and 'Version/7.0' indicate that the browser in the fourth example is Safari 7.0. The combination of 'iPhone' and 'OS 7_1_1' further reveals that the device is an iPhone using iOS 7.1.1 as the operating system.
The last example is a desktop or notebook computer. The substrings 'Safari' and 'Version/7.0.3' show that the browser is Safari 7.0.3. The operating system is Mac OS X 10.9.3, as displayed by the expression 'Mac OS X 10_9_3'.
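The last two examples show that Apple systems report the version with underscores instead of dots. A minimal JavaScript sketch of how such tokens could be normalized follows; the function name detectAppleOs is hypothetical, the inputs are fragments assembled from the substrings quoted above rather than complete user agent strings, and the code is not taken from parseuas.

```javascript
// Hypothetical helper: extract the operating system version of Apple devices.
function detectAppleOs(ua) {
  // iOS is tested first, because iOS strings also mention 'Mac OS X'.
  const ios = /(iPhone|iPad).*? OS (\d+(?:_\d+)*)/.exec(ua);
  if (ios) {
    return { os: "iOS", version: ios[2].replace(/_/g, "."), device: ios[1] };
  }
  // Mac OS X versions appear with underscores or, in some browsers, dots.
  const mac = /Mac OS X (\d+(?:[_.]\d+)*)/.exec(ua);
  if (mac) {
    return { os: "Mac OS X", version: mac[1].replace(/_/g, "."), device: "desktop" };
  }
  return null; // not an Apple operating system token
}

console.log(detectAppleOs("iPhone; CPU iPhone OS 7_1_1 like Mac OS X"));
console.log(detectAppleOs("Macintosh; Intel Mac OS X 10_9_3"));
```

The first call yields version 7.1.1 for an iPhone and the second yields version 10.9.3 for a desktop or notebook computer, matching the interpretation given in the text.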

How to obtain the user agent string during a web survey
Basically, there are two ways to access user agent strings when conducting a web survey. The information is either part of the standard export of a survey software tool or the survey researcher needs to insert program code in the web survey to obtain and store the information.
Many survey software tools provide this information as a variable in the standard data export. These variables usually have names like "browser" or "agent". The information usually is obtained when the respondent begins to answer a survey. This means that the variable stores the user agent from the device that is being used when a respondent starts a survey. This information might be sufficient to get an overview of the devices being used; however, when survey researchers expect that respondents may pause a survey and switch their devices during participation, one measurement will not convey the full story of which devices have been used to complete the survey. In this case, researchers could include program code on every page of a web survey or at least on several pages to capture the user agent information. When comparing six different surveys in a Dutch probability-based panel (LISS), Lugtig and Toepoel (2016) found that between 1.5% and 4.7% of respondents switched from a personal computer to a mobile device in a following survey. However, the likelihood of switching from a tablet or smartphone to a personal computer was much higher. Among respondents who used a mobile device in a survey, between 16.4% and 46.0% switched to using a PC in a following survey.
When looking at the variables in a data set, survey researchers will find it easy to see whether the user agent string is provided by the survey software tool as a standard output variable because of its unique content and string type (see Section 2). Since hundreds of web survey software tools are available to conduct web surveys (e.g., http://www.websm.org/c/1283/Software/?preid=1283 lists about 340 web survey software tools) and updates often are released several times a year, a list of features would soon be outdated. The authors of the present article suggest a general rule of thumb: the more features and flexibility a web survey software tool offers, the more likely it is that the user agent string is already in the data set. Whenever a survey software tool offers the flexibility to insert HTML and JavaScript code to program survey questions, it is always possible to collect the user agent string. It seems that professional survey software tools that focus on the low-budget, fast, do-it-yourself market (with a large but fixed set of features) are less likely to offer user agent information.
We checked whether the user agent string is available for a couple of web survey software tools. 4 At the time of the writing of this article, customer support at Google Consumer Surveys and SurveyMonkey informed us that they do not provide access to the raw user agent string data in their standard data output, and it also is not possible to program your own survey questions from scratch with the standard tools they offer. However, if you are a professional programmer and plan to program your own tool or app and are able to access an available API, it is likely that you also will be able to access this information.
To give some examples, the following software tools provide the user agent string as a standard variable or allow the collection of this information as part of a web survey: Blaise, Confirmit, EFS, Illume, and Qualtrics. It should be noted that companies that support research by providing respondents, such as online access panel providers 5 or paid crowd-sourcing services (e.g., Amazon Mechanical Turk or WorkHub), are not part of this discussion of whether the user agent string can be obtained, because a researcher would still need a web survey software tool or at least an online form to collect the data.
4 The authors of the present article do not endorse any particular products. All product names are only provided as examples. A survey researcher should always consider the full set of required features before making a purchase decision. Kaczmirek (2008) gives an overview of the features of web survey software tools that should be taken into consideration.
5 Many professional companies are members of professional survey and market research organizations such as http://www.aapor.org/ or http://www.dgof.org/.
Whenever it is possible to insert your own survey questions in a web survey software tool, the user agent string can be obtained by adapting and inserting the following code. We provide two examples. The first includes an input field that will be filled with the user agent string.
    <html>
    <!-- This is the input field that will be filled with the user agent
         string. The input field is usually invisible, and its content is
         submitted to the server together with other information when a
         respondent clicks on a next button. -->
    <input id="v1" value="" type="text" size="120">
    <script type="text/javascript">
    /* to store the user agent string in an input field
       we need to know the id of the input field */
    var id = "v1";
    // to access the input field we get the reference to the input field
    var useragent = document.getElementById(id);
    /* we change the value of the input field
       to contain the user agent string */
    useragent.value = navigator.userAgent;
    </script>
    </html>
In contrast, the second example features a hidden input field.
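The code of the second example is not reproduced here. As a hedged sketch of what a hidden-field variant could look like, the following JavaScript wraps the same two steps in a function; the field id "v2" and the function name storeUserAgent are hypothetical, and the markup shown in the comment is an assumption about the survey page rather than the original example.

```javascript
// Assumed markup on the survey page (the id "v2" is hypothetical):
//   <input id="v2" value="" type="hidden">

// Copy the user agent string into the field with the given id.
// doc and nav are passed in so that the function can be tried with stubs.
function storeUserAgent(fieldId, doc, nav) {
  const field = doc.getElementById(fieldId);
  if (field === null) return false; // field missing on this page
  field.value = nav.userAgent;
  return true;
}

// Run only inside a browser, where document and navigator exist.
if (typeof document !== "undefined" && typeof navigator !== "undefined") {
  storeUserAgent("v2", document, navigator);
}
```

As in the first example, the script has to be placed after the input field so that the field exists when the code runs. Including such a field on several pages would capture the user agent string more than once per respondent.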

Description
The parseuas command is built on an engine that acknowledges the variable structure of user agent strings. As discussed previously in this article, user agent strings incorporate a varying degree of information on browser, operating system, and device. This information is partly inconsistent over time; for instance, the string 'Windows NT 6.3' does not refer to a later version of Windows NT but to Windows 8.1, while the Android operating system is expressed by a string 'Android *.*', where *.* represents the Android version. Accordingly, when programming parseuas, the parsing process was designed stepwise, to search the user agent string for information on the browser, then the operating system, and finally the device. In each step, we relied on information from a variety of sources (see Section 2) to draw as much information from the string as possible.
Each step builds on a sequence of queries about whether the user agent string contains an identifying piece of information. Most of these queries rely on regular expressions, especially the regexm() function, because the exact position of information in the user agent string is unknown. regexm() checks whether a string (in this case the user agent string) matches a regular expression (i.e., the identifying information). For example, to detect an Android operating system, we would use an expression like:
    (...) regexm(useragentstring, "Android") (...)
The parseuas command relies on queries like this to parse user agent strings into useful information. Depending on what information the user requires, respective variables are created.
The code is written to minimize maintenance and to anticipate future, as yet unknown user agent strings. Thus, parseuas is able to automatically parse the version of the most common browsers and operating systems. For example, the code can detect "Android 5.5" even if it was released later than the most recent version of parseuas. In this example, it would detect 'Android' and then search the string for information on the version.
However, when extracting information from user agent strings, we sometimes fail to achieve optimal results. This failure may be a result of a less common user agent or technological development (e.g., new browsers or operating systems) not yet covered by parseuas. Thus, we believe residual categories to be a crucial indicator of validity when applying the parseuas command. To prevent misinterpretation, the command is designed to extract as much information as possible. If detailed information is missing (e.g., the version of an operating system), parseuas provides more general information. For example, the user agent string is searched for information about whether the operating system was Android. If the information on the version of the operating system is missing, the user agent string will be coded as "Android (other)". Those user agent strings that contain non-interpretable information will be coded into broader residual categories, e.g., "Browser (other)". Note that these residual categories only apply when more detailed information cannot be extracted.
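The stepwise logic with fallback categories can be illustrated for the Android case. The JavaScript sketch below mirrors the description above but is not the Stata code of parseuas; the function name detectOperatingSystem is hypothetical, and only Android is handled to keep the example short.

```javascript
// Hypothetical helper: detect Android, then its version, then fall back.
function detectOperatingSystem(ua) {
  // Step 1: does the string contain the identifying term at all?
  if (/Android/.test(ua)) {
    // Step 2: search for a version number after the term. No list of known
    // versions is needed, so future releases are parsed as well.
    const match = /Android (\d+(?:\.\d+)*)/.exec(ua);
    // Step 3: without a version, fall back to the more general category.
    return match ? "Android " + match[1] : "Android (other)";
  }
  // No identifying term found: broad residual category.
  return "OS (other)";
}

console.log(detectOperatingSystem("Linux; Android 5.5; Mobile")); // Android 5.5
console.log(detectOperatingSystem("Linux; Android; Mobile"));     // Android (other)
console.log(detectOperatingSystem("X11; Linux x86_64"));          // OS (other)
```

Because the version is captured by a pattern rather than looked up in a list, the first call returns "Android 5.5" although no such version is named anywhere in the code.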

Syntax
The syntax for parseuas to extract information from user agent strings is:

Options
The parseuas command optionally stores the information from the user agent strings in new variables.
• browser(newvar) generates a variable containing the information on the browser name.
• browserversion(newvar) generates a variable containing the information on the version of the browser.
• os(newvar) generates a variable containing the information on the operating system of the device.
• device(newvar) generates a variable containing the information on the device type.
• smartphone(newvar) generates a dummy variable, which indicates whether a smartphone was used.
• tablet(newvar) generates a dummy variable, which indicates whether a tablet computer was used.
• numeric causes parseuas to create numeric variables instead of strings.
• noisily provides output of frequency tables for browser name, browser version, operating system, and device type.

Examples
To demonstrate the use of parseuas, we present two examples. The first illustrates the application of the parseuas command, while the second provides the reader with a basic understanding of how the content of user agent strings has changed over time and, hence, how technical development has progressed over several years. In the second example, we also show how to interpret the residual categories to face-validate the application of parseuas.

Application of parseuas
To illustrate the application of parseuas, we pooled 29 web surveys, which were conducted between 2009 and 2015 as part of the German Longitudinal Election Study (Rattinger, Roßteutscher, Schmitt-Beck, Weßels, and Wolf 2009-2015). Each survey included about 1,100 respondents, thus giving us a total of 36,575 observations. The user agent strings were collected by the web survey software and included in the data set as a string variable.
We used parseuas with all options to extract the information from the user agent strings:
    . parseuas useragent, bro(browser) browserv(browserversion)
    > os(operatingsystem) device(device) smart(smartphone) tab(tablet) numeric
    > noisily
parseuas performed well in our example and extracted the requested information from 36,575 strings in approximately 2 seconds. After noisily running parseuas, we could assess the data stored in the newly created variables, and the command reported basic findings on the data:
    (output omitted: frequency tables with the columns Freq., Percent, and Cum., including the operating system version; each table totals 36,575 observations)

Further applications
The data set we collected enabled us to illustrate how user agent strings changed between 2009 and 2015. As mentioned previously, due to technological development, new user agent strings emerge. For example, when a new operating system is released, we can find this information in the user agent string.
In our example, we focus on the emergence of mobile devices and different versions of Windows. The former is an application that we would expect to be used in the context of a web survey, whereas the latter illustrates the use of parseuas when, for instance, analyzing data of website users.
We relied on information on the device type to analyze the use of mobile devices to complete the 29 web surveys over time (2009-2015). The results shown in Figure 1 reflect the technological development that has led to the increasing spread of mobile devices. The ability of parseuas to detect the increasing use of smartphones and tablets is due to user agent strings containing the necessary information to detect a mobile device. In Figure 2,
User agent strings may lack information or include content that parseuas is not able to fully interpret. As outlined in Section 4.1, the command is set up to handle these cases by coding the available information into residual categories. Evidently, the residual categories are an important source when validating the results of the application of parseuas. "Browser (other)", "OS (other)", "Mobile phone (other)", or "Tablet (other)" indicate that parseuas did not identify the exact type of browser, operating system, or device. In our examples, we relied on a data set that was collected when we were developing parseuas. Hence, only a few user agent strings (<1% in each variable) were not optimally parsed, and these numbers are what we consider an ideal case of residuals when using the command. Table 1 illustrates the distribution of "others" for parsed information on the browser, operating system, and device.

Remarks
The Stata command will be updated on a regular basis to keep up with the development of new browsers, operating systems, and devices. Thus, we recommend using the latest version of parseuas. For this purpose, Stata provides the command adoupdate, which automatically updates user-written ados (see help adoupdate). In addition, to guarantee the reproducibility of analyses using parseuas, we recommend citing the ado with information on the version used, e.g., as in Roßmann and Gummer (2016a).
Moreover, as outlined in Section 4.1 (for an application, see Section 4.4), we recommend inspecting the frequencies of residual categories after applying parseuas to a data set. If user agent strings cannot be parsed in an optimal way, they are coded into these categories. This situation may be the result of either drawing on a data set that includes a large number of uncommon user agents or technological developments (e.g., completely new browsers or operating systems) that have not yet been accounted for in the most recent version of parseuas.
In the example application based on 36,575 user agent strings that we collected over 7 years, the residual categories did not exceed 1% of all observations. Hence, we would recommend this threshold as a rule of thumb for researchers to face-validate the successful application of parseuas.
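The rule of thumb amounts to a simple share calculation per variable. The JavaScript sketch below shows the arithmetic on an array of category labels; the function names residualShare and passesRuleOfThumb are hypothetical, and the assumption that every residual label ends in "(other)" is taken from the category names quoted in this article.

```javascript
// Share of values coded into a residual category such as "Browser (other)".
// Assumption: residual labels end in "(other)".
function residualShare(values) {
  if (values.length === 0) return 0;
  const residual = values.filter((value) => value.endsWith("(other)")).length;
  return residual / values.length;
}

// Rule of thumb from the text: residuals should not exceed 1 percent.
function passesRuleOfThumb(values, threshold = 0.01) {
  return residualShare(values) <= threshold;
}

// Made-up data for illustration: 995 parsed values and 5 residual values.
const browsers = Array(995).fill("Firefox").concat(Array(5).fill("Browser (other)"));
console.log(residualShare(browsers));     // 0.005
console.log(passesRuleOfThumb(browsers)); // true
```

In Stata, the same check can be read off the frequency tables that parseuas displays when the noisily option is specified.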

Conclusion
In this article, we introduced the new Stata command parseuas to extract detailed information from user agent strings. These data can be used for methodological and substantive research questions. Particularly in the field of web survey research, interest in paradata (e.g., device types) has been increasing. Apart from survey methodology, user agent strings are commonly available data on the user level when using web services. Thus, analyzing information from user agent strings is of great importance to researchers and practitioners in a multitude of fields ranging from computer science to market research.
As our example illustrates, parseuas provides Stata users with an easily applicable command to automatically generate meaningful data. Nevertheless, new user agents will emerge due to technological developments (e.g., new devices, browsers, and operating systems), or users may want to extract information that is not provided by the latest version of parseuas. Thus, our article details how to parse user agent strings, which should enable users to modify the parseuas command to serve their individual purposes. In addition to detailed information on user agent strings and how to parse them, we have provided an overview of how to collect these paradata with frequently used web survey software and, more generally, by implementing JavaScript code.