ITA 2.0: A Program for Classical and Inductive Item Tree Analysis

Item Tree Analysis (ITA) is an explorative method of data analysis which can be used to establish a hierarchical structure on a set of dichotomous items from a questionnaire or test. There are currently two different algorithms available to perform an ITA. We describe a computer program called ITA 2.0 which implements both of these algorithms. In addition, we show with a concrete data set how the program can be used for the analysis of questionnaire data.


Introduction
Item tree analysis (ITA) is a data analytical method which allows constructing a hierarchical structure on the items of a questionnaire or test from observed response patterns. Assume that we have a questionnaire I with m items and that subjects can answer positive (1) or negative (0) to each of these items, i.e. the items are dichotomous. If n subjects answer the items in I this results in a binary data matrix D with m columns and n rows.
Typical examples of this data format are test items which can be solved (1) or failed (0) by subjects. Other typical examples are questionnaires where the items are statements to which subjects can agree (1) or disagree (0).
Depending on the content of the items it is possible that the response of a subject to an item j determines her or his responses to other items. It is, for example, possible that each subject who agrees to item j will also agree to item i. In this case we say that item j implies item i and write i ≤ j for short. The goal of an ITA is to uncover such deterministic implications from the data set D.
ITA was originally developed by Van Leeuwe (1974). The result of his algorithm, which we refer to in the following as classical ITA, is a logically consistent set of implications i ≤ j. Logically consistent means that i ≤ j and j ≤ k imply i ≤ k for each triple i, j, k of items, i.e. the relation ≤ is transitive. Thus the outcome of an ITA is a reflexive and transitive relation on the item set, i.e. a quasi-order on the items.
More recently, Schrepp (1999) developed a different algorithm to perform an ITA, which we refer to in the following as inductive ITA. Classical and inductive ITA both construct a quasi-order on the item set by explorative data analysis, but they use different algorithms to do so. For a given data set the quasi-orders resulting from classical and inductive ITA will therefore usually differ.
ITA belongs to a group of data analysis methods called Boolean analysis of questionnaires. Boolean analysis was introduced by Flament (1976). The goal of a Boolean analysis is to detect deterministic dependencies (formulas from Boolean logic connecting the items, like for example i → j, i ∧ j → k, and i ∨ j → k) between the items of a questionnaire or test.
These methods share the goal of deriving deterministic dependencies between the items of a questionnaire from data, but differ in the algorithms used to reach this goal. A comparison of ITA to other methods of Boolean data analysis can be found in Schrepp (2003).
Boolean analysis is closely related to the GUHA method (Hájek, Havel, and Chytil (1966) or Hájek and Havránek (1977)). The basic idea of this method is to use formal logic to generate all hypotheses which are of interest in a given research task and supported by the data. Statistical methods are used to evaluate these generated hypotheses. There is also a close connection between Boolean analysis and data mining techniques, especially the extraction of association rules from data. See, for example, Klemettinen, Mannila, Ronkainen, Toivonen, and Verkamo (1994) or Toivonen (1996).
There is a close connection between item tree analysis and knowledge space theory. The theory of knowledge spaces (Doignon and Falmagne (1985), Doignon and Falmagne (1998) or Albert and Lukas (1999)) provides a theoretical framework for the formal description of human knowledge. A knowledge domain is in this approach represented by a set I of problems. The knowledge of a subject in the domain is then described by the subset of problems from I he or she is able to solve. This set is called the knowledge state of the subject. Because of dependencies between the items (for example, if solving problem j implies solving problem i) not all elements of the power set of I will, in general, be possible knowledge states. The set of all possible knowledge states is called the knowledge structure. Obviously, item tree analysis can be used to construct a knowledge structure from data. See, for example, Schrepp (1999).
The investigation of deterministic dependencies has some tradition in educational psychology. In this area the items usually represent skills or cognitive abilities of subjects. Bart and Airasian (1974) use ITA to establish implications on a set of Piagetian tasks. Other examples in this tradition are the learning hierarchies of Gagné (1968) and the theory of structural learning of Scandura (1991).
A recent application of classical ITA to educational test items can be found in Held and Korossy (1998), who extracted logical implications on a set of mathematical problems from observed response patterns. The extracted implications were then compared to implications obtained by querying an expert.
Another example of the use of deterministic dependencies in psychology are approaches which formalize the diagnostic process of psychologists. The goal of these approaches is to uncover the rules on which the decisions of diagnosticians are based. See, for example, Härtner, Mattes, and Wottawa (1980) or Wottawa and Echterhoff (1982).
An example of the application of ITA to sociological data is Bart and Krus (1973), who used an ITA-related procedure to establish a hierarchical order on items that describe socially unaccepted behavior. Janssens (1999) used a method of Boolean analysis to investigate the integration process of minorities into the value system of the dominant culture. Applications of inductive ITA to the analysis of questionnaire data can be found in Schrepp (2002) and Schrepp (2003). In these papers subsets of questions from the International Social Science Survey Program (ISSSP) are analyzed by inductive ITA.

Algorithms of classical and inductive ITA
The main purpose of this paper is to describe the program ITA 2.0. For an understanding of the program, however, a rough description of the implemented algorithms is necessary. For a more detailed introduction to these algorithms please refer to Van Leeuwe (1974) for classical ITA and to Schrepp (2003) for inductive ITA.
Let us first define some basic notational conventions which are necessary for both algorithms:
• I is a set of m dichotomous items.
• ≤ is a quasi-order (reflexive and transitive relation) on I. Such a quasi-order on I can be represented as a set {(i, j) | i ≤ j} of item pairs.
• D = {d_1, ..., d_n} is a set of n observed response patterns to the items in I. Each d_s is a mapping d_s : I → {0, 1} which assigns to each item i the response d_s(i) ∈ {0, 1} of subject s.
• p_i is the relative frequency of subjects who answer positive (1) to item i, i.e.
$$p_i = \frac{|\{\, s \mid d_s(i) = 1 \,\}|}{n}$$
• For a quasi-order ≤ on I the set of all consistent patterns Cons(≤) is given by
$$\mathrm{Cons}(\le) = \{\, c : I \to \{0, 1\} \mid i \le j \text{ and } c(j) = 1 \text{ imply } c(i) = 1 \text{ for all } i, j \in I \,\}$$
Thus, Cons(≤) contains all 0-1-patterns of length m which are consistent with all the dependencies i ≤ j in the quasi-order ≤.
• For a response pattern d let
$$\mathrm{dist}(d, \le) = \min_{c \in \mathrm{Cons}(\le)} |\{\, i \in I \mid d(i) \neq c(i) \,\}|$$
be the minimal distance of d to a consistent 0-1-pattern. Then the reproducibility coefficient Repro(≤, D) is defined by:
$$\mathrm{Repro}(\le, D) = 1 - \frac{\sum_{s=1}^{n} \mathrm{dist}(d_s, \le)}{n\, m}$$
The reproducibility coefficient can be interpreted as the percentage of cells in the data matrix which can be reproduced by the quasi-order ≤.
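The definitions of Cons(≤) and of the reproducibility coefficient can be illustrated with a minimal Python sketch. It is not the code of ITA 2.0. It assumes that items are numbered 0, ..., m − 1, that a quasi-order is stored as a set of pairs (i, j) meaning i ≤ j, that the distance between two patterns is the number of cells in which they differ, and that the data contain no missing answers. The function names are hypothetical.

```python
from itertools import product


def consistent_patterns(m, order):
    """Enumerate Cons(<=): all 0-1 patterns of length m that respect `order`.

    A pair (i, j) in `order` encodes i <= j, i.e. a positive answer to
    item j implies a positive answer to item i.
    """
    return [
        c for c in product((0, 1), repeat=m)
        if all(c[i] == 1 for (i, j) in order if c[j] == 1)
    ]


def repro(data, m, order):
    """Share of cells of the data matrix reproduced by the quasi-order.

    Assumption: the distance of a pattern to Cons(<=) is the minimal number
    of cells that must be changed to obtain a consistent pattern.
    """
    cons = consistent_patterns(m, order)
    changed = sum(
        min(sum(x != y for x, y in zip(d, c)) for c in cons)
        for d in data
    )
    return 1 - changed / (len(data) * m)


# Chain 0 <= 1 <= 2: item 2 implies item 1, item 1 implies item 0.
chain = {(0, 0), (1, 1), (2, 2), (0, 1), (1, 2), (0, 2)}
observed = [(1, 1, 0), (0, 1, 0), (1, 1, 1), (0, 0, 1)]
print(consistent_patterns(3, chain))
print(repro(observed, 3, chain))
```

The enumeration visits all 2^m patterns, so this sketch is only practical for small item sets.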

The algorithm of classical ITA
To describe the algorithm of classical ITA we need, in addition to the definitions given above, the following conventions:
• r_ij is the Pearson correlation of items i and j.
• C(≤, D) is the number of response patterns which are not consistent with ≤, i.e.
$$C(\le, D) = |\{\, s \mid d_s \notin \mathrm{Cons}(\le) \,\}|$$
• r*_ij is the expected correlation of items i and j under the assumption that ≤ is the correct quasi-order underlying the data (see Van Leeuwe (1974) for a description of how r*_ij is derived from this assumption).
According to Van Leeuwe (1974), the algorithm of classical ITA consists of 5 steps. The threshold τ is interpreted as the lowest acceptable REP-PO(≤, D) value for the best-fitting quasi-order ≤ resulting from the analysis. Van Leeuwe (1974) showed that ≤_0 is transitive, but that this need not be the case for relations ≤_L with L > 0.
In a direct reply, Schrepp (2006) showed that the critique of CA(≤_L, D) raised by Ünlü and Albert (2004) is not justified. In fact, Schrepp (2006) shows that some of the problems of CA(≤_L, D) described by Ünlü and Albert (2004) are properties which a good measure of fit for a quasi-order should have.
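To illustrate how an expected correlation can follow from an assumed quasi-order, consider the simplest case. The derivation below is a sketch under stated assumptions and is not taken from Van Leeuwe (1974): it assumes that i ≤ j holds without any error and that 0 < p_j ≤ p_i < 1.

```latex
% Assumption: i \le j holds without error, so no pattern has d(i) = 0 and d(j) = 1.
% Hence the relative frequency of patterns with d(i) = 1 and d(j) = 1 equals p_j.
\begin{align*}
\operatorname{Cov}(i, j) &= p_j - p_i\, p_j = p_j\, (1 - p_i), \\
r^{*}_{ij} &= \frac{p_j\, (1 - p_i)}{\sqrt{p_i\, (1 - p_i)\; p_j\, (1 - p_j)}}
            = \sqrt{\frac{(1 - p_i)\, p_j}{p_i\, (1 - p_j)}} .
\end{align*}
```

Under these assumptions the expected correlation is at most 1 and equals 1 exactly when p_i = p_j, i.e. when the two items are answered identically.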

Classical ITA can be generalized to the case that some subjects have not answered all items (missing data). This is done simply by ignoring such rows in the calculation of the values r_ij and r*_ij.

The algorithm of inductive ITA
Inductive ITA is a two step procedure. In the first step a set of quasi-orders is constructed inductively on I. In the second step a best-fitting quasi-order is chosen from this set.
Step 1: Inductive construction of a set of quasi-orders
Let b_ij be the number of response patterns d with d(i) = 0 and d(j) = 1, i.e. the number of observed violations of i ≤ j. We start with the quasi-order ≤_0 defined by i ≤_0 j ⇔ b_ij = 0 for all i, j ∈ I. Assume that we have constructed a quasi-order ≤_L. In step L + 1 of the process we construct a quasi-order ≤_{L+1} by adding all item pairs (i, j) to ≤_L for which the following two conditions hold:
• b_ij ≤ L + 1
• i ≤_{L+1} j does not cause an intransitivity with other dependencies which are already contained in ≤_L or are added to ≤_L in this step.
The algorithm to construct ≤_{L+1} from ≤_L consists of three sub-steps:
1. Define the set of candidate pairs A_{L+1} = {(i, j) | b_ij ≤ L + 1 and (i, j) is not contained in ≤_L}.
2. The following steps are repeated until the stopping criterion is met:
• Elements of A_{L+1} which cause an intransitivity with elements in ≤_L ∪ A_{L+1} are added to a set B_{L+1}. If no such elements exist in A_{L+1}, the process stops.
• All elements of B_{L+1} are removed from A_{L+1}.
• All elements are removed from B_{L+1} and we proceed with the first step.
When this process ends, none of the remaining elements in A_{L+1} causes an intransitivity with other elements in ≤_L ∪ A_{L+1}.
3. Set ≤_{L+1} = ≤_L ∪ A_{L+1}.
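The inductive construction can be sketched in Python as follows. This is an illustration, not the implementation in ITA 2.0. It assumes complete data (no missing answers), items numbered 0, ..., m − 1, and a particular reading of "causes an intransitivity": a candidate pair (i, j) is rejected if some k exists with j ≤ k but not i ≤ k, or with k ≤ i but not k ≤ j, in the union of the current quasi-order and the candidates. As a shortcut, only tolerance levels at which new candidate pairs appear are visited. All function names are hypothetical.

```python
def violation_counts(data, m):
    """b[i][j] = number of patterns with d(i) = 0 and d(j) = 1."""
    return [[sum(1 for d in data if d[i] == 0 and d[j] == 1) for j in range(m)]
            for i in range(m)]


def breaks_transitivity(pair, rel, m):
    """Assumed criterion: does (i, j) lack a pair required by transitivity?"""
    i, j = pair
    for k in range(m):
        if (j, k) in rel and (i, k) not in rel:
            return True
        if (k, i) in rel and (k, j) not in rel:
            return True
    return False


def next_quasi_order(prev, b, level, m):
    """Construct the quasi-order for tolerance `level` from its predecessor."""
    candidates = {(i, j) for i in range(m) for j in range(m)
                  if b[i][j] <= level and (i, j) not in prev}
    while True:
        rel = prev | candidates
        bad = {p for p in candidates if breaks_transitivity(p, rel, m)}
        if not bad:
            return rel
        candidates -= bad


def inductive_quasi_orders(data, m):
    """Return the distinct quasi-orders obtained for increasing tolerance."""
    b = violation_counts(data, m)
    current = {(i, j) for i in range(m) for j in range(m) if b[i][j] == 0}
    orders = [current]
    # Shortcut of this sketch: visit only levels that occur as a b value.
    for level in sorted({v for row in b for v in row if v > 0}):
        nxt = next_quasi_order(current, b, level, m)
        if nxt != current:
            orders.append(nxt)
        current = nxt
    return orders
```

Each returned relation is a set of pairs (i, j) meaning i ≤ j; the last one contains all item pairs, since every pair is admitted once the tolerance reaches the largest b value.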
Step 2: Determine the best-fitting quasi-order
Assume that ≤_L is the correct quasi-order underlying the data set D. A violation of i ≤_L j is then only possible through the influence of random errors. The probability γ that a true dependency i ≤_L j is violated due to random errors is estimated from the data (see Schrepp (2003) for the estimator). We distinguish two cases to calculate the expected number of violations t_ij of i ≤_L j:
1. i ≤_L j does not hold: In this case we assume that the items i and j are independent. Thus, t_ij equals the expected number of response patterns d with d(i) = 0 and d(j) = 1.
2. i ≤_L j and i ≠ j: In this case all violations of i ≤_L j must result from random errors. Thus, t_ij = γ p_j n.
The fit between ≤_L and the data set D is measured by the coefficient diff(≤_L, D), which is based on the expected numbers of violations t_ij. To determine the optimal tolerance level L we calculate the diff(≤_L, D) value for all quasi-orders ≤_L in IPQO(D), the set of quasi-orders constructed in step 1, and choose the quasi-order for which this value is minimal as the best-fitting quasi-order according to inductive ITA.
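The selection step can be sketched in Python under explicitly stated assumptions; the exact formulas should be taken from Schrepp (2003). The sketch assumes that γ is estimated as the mean of the relative violation rates b_ij / (p_j n) over all non-reflexive dependencies in the quasi-order, that t_ij for independent items is the plain independence expectation (1 − p_i) p_j n, and that diff is the mean squared difference between b_ij and t_ij over all pairs i ≠ j. It also assumes complete data. The function name is hypothetical.

```python
def diff_coefficient(order, data, m):
    """Fit of a quasi-order (set of pairs (i, j) meaning i <= j) to the data.

    All three formulas used here are assumptions of this sketch:
    gamma as mean relative violation rate, t_ij under independence as
    (1 - p_i) * p_j * n, and diff as mean squared deviation of b_ij from t_ij.
    """
    n = len(data)
    p = [sum(d[i] for d in data) / n for i in range(m)]
    b = [[sum(1 for d in data if d[i] == 0 and d[j] == 1) for j in range(m)]
         for i in range(m)]

    pairs = [(i, j) for (i, j) in order if i != j]
    rates = [b[i][j] / (p[j] * n) if p[j] > 0 else 0.0 for (i, j) in pairs]
    gamma = sum(rates) / len(rates) if rates else 0.0

    total = 0.0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            if (i, j) in order:
                t = gamma * p[j] * n          # violations caused by random errors
            else:
                t = (1 - p[i]) * p[j] * n     # items treated as independent
            total += (b[i][j] - t) ** 2
    return total / (m * (m - 1))


def best_fitting(orders, data, m):
    """Pick the quasi-order with the minimal diff value."""
    return min(orders, key=lambda rel: diff_coefficient(rel, data, m))
```

Here `orders` stands for the set of quasi-orders produced in step 1, each given as a set of item pairs.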
Again, this algorithm can easily be extended to the case of missing data in some response patterns. Response patterns in which one of the items i or j is not answered by the subject are ignored in the calculation of the b_ij values and of the error parameter γ. The rest of the algorithm remains essentially unchanged. For an exact mathematical description of the generalized algorithm see Schrepp (2003).

The analysis program ITA 2.0
ITA 2.0 is a Windows-based program which allows a binary data set D to be analysed by classical or inductive ITA. The program was tested under Windows XP and Windows 2000, but should also run on other Windows versions. It has a simple user interface (see Figure 1) which allows the user to set the available options for the analysis.
The data are read from the ASCII file specified in the field Input file. Currently two input formats (Patterns with frequencies, List of Patterns) are supported. The format List of Patterns assumes that the input file contains one row of data per subject, which describes the answers of that subject to the items in the item set. Each row can contain the characters 1, 0, - (for missing data), and white space. White space is ignored, so it does not matter whether the entries for two items are separated by a space or not.
The format Patterns with frequencies assumes that the input file contains, for each observed response pattern, a string which describes the response pattern and in addition the frequency of this response pattern in the data. The two parts of the row must be separated by a space. Inside the part of the row which describes the response pattern only the characters 1, 0, and - (for missing data) are allowed.
In both data formats the number of items need not be specified explicitly. This number is determined from the first row of the input file. The number of items is restricted to a maximum of 30 and the number of response patterns in the data file is restricted to a maximum of 10000. Please make sure that each row of the data file ends with a new line (¶)!
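For readers who want to prepare or check input files programmatically, the two formats can be read with a few lines of Python. The reader below is a hypothetical helper, not part of ITA 2.0. It assumes that in the format Patterns with frequencies the pattern comes first and the frequency second, and it represents missing answers as None.

```python
MAX_ITEMS = 30  # item limit stated for ITA 2.0


def parse_ita_rows(lines, fmt="list"):
    """Parse rows in 'list' or 'frequencies' format into response patterns.

    Returns one list per subject with entries 1, 0, or None (missing data).
    """
    symbols = {"1": 1, "0": 0, "-": None}
    patterns = []
    n_items = None
    for raw in lines:
        row = raw.strip()
        if not row:
            continue
        if fmt == "list":
            body, freq = "".join(row.split()), 1   # white space is ignored
        else:
            # Assumption: "<pattern> <frequency>", separated by white space.
            body, freq_text = row.rsplit(maxsplit=1)
            freq = int(freq_text)
        if any(ch not in symbols for ch in body):
            raise ValueError(f"unexpected character in row: {row!r}")
        pattern = [symbols[ch] for ch in body]
        if n_items is None:
            n_items = len(pattern)                 # taken from the first row
            if n_items > MAX_ITEMS:
                raise ValueError("more than 30 items")
        elif len(pattern) != n_items:
            raise ValueError(f"row has {len(pattern)} items, expected {n_items}")
        patterns.extend(list(pattern) for _ in range(freq))
    return patterns


print(parse_ita_rows(["1 0 -", "110"]))
print(parse_ita_rows(["10- 2", "111 1"], fmt="frequencies"))
```

Expanding the frequencies into individual patterns keeps both formats interchangeable for later processing.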
The result of the analysis is written to a file (enter the name of the file into the field Result file) which can be a simple ASCII file (choose Result format Text) or an HTML file (choose Result format HTML).
Both formats contain the same information. The only exception is that the HTML file contains also a graphical visualization of the best fitting quasi-order as a Hasse-Diagram. This Hasse-Diagram is displayed in a Java-Applet inside the HTML file. For drawing this Hasse-Diagram the program uses the Interactive Poset and Lattice Drawing Java Applet from Peter Jipsen (freely available under http://math.vanderbilt.edu/~pjipsen/gap/posets.html).

Result file for classical ITA
The header of the result file contains some descriptive data, like for example the number of items in the data file.
• Distribution of values per column: This table provides an overview of the distribution of the values 0, 1, and - in each of the columns of the data matrix.
• Table of the a_ij: The value a_ij is the number of rows in the data matrix which contain a 0 in column j and a 0 in column i.
• Table of the b_ij: The value b_ij is the number of rows in the data matrix which contain a 1 in column j and a 0 in column i.
• Table of the c_ij: The value c_ij is the number of rows in the data matrix which contain a 0 in column j and a 1 in column i.
• Table of the d_ij: The value d_ij is the number of rows in the data matrix which contain a 1 in column j and a 1 in column i.
• Empirical correlations r_ij: This table contains the Pearson correlations of all item pairs.
• CA values: For each constructed relation ≤_L the number of non-reflexive implications in ≤_L and the fit indices CA(≤_L, D) and REP-PO(≤_L, D) are listed. The relatively best-fitting relation ≤_L is the transitive relation which shows the maximal CA-value. Since the constructed relations ≤_L are not always transitive, the last column indicates whether the relation is transitive.
• Constructed quasi-order for level x: This table lists all non-reflexive implications from the best-fitting quasi-order. Together with each implication the b_ij-value of the implication is listed.
• Fit indices: The mean violation of an implication (mean over all b_ij-values for all i ≤ j in the best-fitting quasi-order ≤), the reproducibility coefficient, the correlational agreement coefficient, and the REP-PO value of the best-fitting quasi-order are listed.
• Compatible states: This is the list of consistent patterns Cons(≤) for the best-fitting quasi-order ≤.
In the HTML format in addition to this information the best-fitting quasi-order is displayed as a Hasse-Diagram.
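The four count tables listed above can be reproduced from a data matrix with a short Python sketch. It is an illustration, not the code of ITA 2.0. It assumes that missing answers are represented as None and that a row is skipped for an item pair whenever one of the two answers is missing. The function name is hypothetical.

```python
def pair_tables(data, m):
    """Count the four answer combinations for every item pair (i, j).

    a: 0 in column j and 0 in column i    b: 1 in column j and 0 in column i
    c: 0 in column j and 1 in column i    d: 1 in column j and 1 in column i
    """
    tables = {name: [[0] * m for _ in range(m)] for name in "abcd"}
    # Key: (answer to item i, answer to item j) -> table name.
    names = {(0, 0): "a", (0, 1): "b", (1, 0): "c", (1, 1): "d"}
    for row in data:
        for i in range(m):
            for j in range(m):
                if row[i] is None or row[j] is None:
                    continue  # assumption: pairwise deletion of missing answers
                tables[names[(row[i], row[j])]][i][j] += 1
    return tables


tables = pair_tables([[1, 0], [0, 1], [1, 1], [None, 1], [0, 0]], 2)
print(tables["b"])
```

The entry tables["b"][i][j] is the number of observed violations of the implication i ≤ j.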

Result file for inductive ITA
The header of the result file contains some descriptive data, like for example the number of items in the data file.
• Distribution of values per column: This table provides an overview of the distribution of the values 0, 1, and - in each of the columns of the data matrix.
• Table of the b_ij: The value b_ij is the number of rows in the data matrix which contain a 1 in column j and a 0 in column i.
• Diff-values: For each constructed quasi-order ≤_L the number of non-reflexive implications in ≤_L and the fit index diff(≤_L, D) are listed. The relatively best-fitting quasi-order is the quasi-order which shows the minimal diff-value.
• Constructed quasi-order for level x: This table lists all non-reflexive implications from the best-fitting quasi-order. Together with each implication i ≤ j the Support, Confidence, and the b_ij-value of the implication are listed. Support and confidence are often used in data mining to evaluate the quality of an implication for prediction. They are not used in inductive ITA, but since they are easy to compute we list them together with the implication. In terms of the quantities x(i, j), y(i, j), and q(i, j), support and confidence of an implication i ≤ j are given by:
- Support(i ≤ j) = y(i, j)/x(i, j)
- Confidence(i ≤ j) = y(i, j)/q(i, j)
• Fit indices: The mean violation of an implication and the reproducibility coefficient are listed.
In the HTML format in addition to this information the best-fitting quasi-order is displayed as a Hasse-Diagram.

Options
The user is able to decide if the set of consistent patterns for the best-fitting quasi-order should be computed and listed in the output file (mark the checkbox Calculate the consistent patterns for the best-fitting quasi-order). Please use that feature carefully if your input file contains many items! The time necessary to compute the compatible states increases exponentially with the number of items. Thus, for higher numbers of items you must expect a considerable time until the program finishes. Note also that the output file can be in this case quite large.
With the checkbox Display results automatically after analysis finishes you can force the program to display the result file directly after the analysis is finished. If this checkbox is checked, the two additional fields Command to view text files and Command to view HTML files are visible (otherwise these fields are hidden).
If you choose this option you must specify the program which should be used to display the result files. If the output format is Text then you have to enter the name of an editor which is able to display .txt files in the field Command to view text files (simply choose notepad which is usually available on each machine). If the output format is HTML, then you have to enter the program name for your browser in the field Command to view HTML files (for example iexplore for MS Internet Explorer or firefox for the Mozilla Firefox Browser).

Installation
ITA 2.0 consists of the following executables:
• IITA_UI.exe: The graphical user interface of ITA 2.0. Click on this executable to start the program.
• Classic_ITA_UI.exe: Analysis module for classical ITA (cannot be started standalone).
• Inductive_ITA_UI.exe: Analysis module for inductive ITA (cannot be started standalone).
• Node.class, Poset.class, Edge.class, GraphPanel.class: These are the Java classes for the Interactive Poset and Lattice Drawing Java Applet from Peter Jipsen, which is used to paint the Hasse-Diagram of the best-fitting quasi-order in the HTML output.
Please make sure that these files and the two files ita_mslg and ita_settings are located in the same directory. Otherwise the program will not work correctly. The mentioned executables and the corresponding source files required to compile them are available together with this article. Details concerning the installation can be found in the file README.txt.

Example of an analysis by ITA
We now give an example of the analysis of a data set by ITA. To this end we analyse the statements of question 4 of the International Social Science Survey Programme (ISSSP) for the year 1995 by inductive and classical ITA.
The ISSSP is a continuing annual program of cross-national collaboration on surveys covering important topics for social science research. Each year the program conducts one survey with comparable questions in each of the participating nations. The theme of the 1995 survey was national identity. We analyze the results for question 4 for the data set of Western Germany. The statement for question 4 was: Some people say the following things are important for being truly German. Others say they are not important. How important do you think each of the following is:
The subjects could respond to each statement with Very important, Important, Not very important, Not important at all, or Can't choose. To apply ITA to this data set we recoded the answer categories: Very important and Important are coded as 1, Not very important and Not important at all are coded as 0, and Can't choose is handled as missing data. Figure 2 shows the resulting quasi-orders ≤_IITA from inductive ITA and ≤_CITA from classical ITA.
Figure 2: Hasse-Diagrams of the best-fitting quasi-orders according to classical and inductive ITA.
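The recoding of the answer categories described above can be written down in a few lines of Python. This is only a sketch: the function name and the assumption that each subject's answers are available as a list of category labels are ours, not part of the survey data or of ITA 2.0.

```python
# Mapping of the five answer categories to the characters used in the input file.
RECODE = {
    "Very important": "1",
    "Important": "1",
    "Not very important": "0",
    "Not important at all": "0",
    "Can't choose": "-",   # handled as missing data
}


def recode_subject(answers):
    """Turn one subject's answer labels into a row in List of Patterns format."""
    return "".join(RECODE[answer] for answer in answers)


print(recode_subject(["Very important", "Not very important", "Can't choose"]))
```

Each resulting string is one row of an input file in the format List of Patterns.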
As we can see, the quasi-order ≤_IITA is more restrictive than ≤_CITA. The set Cons(≤_IITA) contains 9 consistent patterns while Cons(≤_CITA) contains 36 consistent patterns.
Another remarkable result is that, despite the fact that ≤_IITA is much stricter than ≤_CITA, the reproducibility coefficients of both solutions are similar. We have Repro(≤_IITA, D) = 0.94 and Repro(≤_CITA, D) = 0.956. So the quasi-order ≤_IITA, with only 9 possible response patterns, still explains 94% of the cells in the data matrix. Thus, the additional restrictions contained in ≤_IITA seem to be well supported by the data.