Venn Diagrams in R

This article describes R functions used to produce Venn diagrams and related incidence tables. The methods have applications in bioinformatics.


Introduction
In bioinformatics, an "expression library" is a collection of thousands of samples of messenger RNA (mRNA) that have been extracted from a tissue, cloned, and annotated (identified in a standard list). Information on the function of the protein derived from an mRNA can be inferred by comparing the incidence of different types of mRNA in samples drawn from different tissues or under different conditions. Biologists commonly use Venn diagrams to compare libraries: mRNA types that occur uniquely, or more prevalently, under one condition than under others appear outside the intersections in the diagram.

This note describes two functions written in R (Ihaka and Gentleman 1996). The function incidence.table(id, category) calculates a logical table indicating the incidence of each unique value of id in each category. With the optional parameter cutoff (default value 1), an entry is set to TRUE only if at least that many replicates of the id occur in that category. This could be used to select mRNA types that are heavily expressed, for instance.
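The behaviour described for incidence.table() can be illustrated with a minimal base R sketch. This is not the package's code: the function name incidence_sketch and the toy data are hypothetical, and the sketch simply assumes that a cross-tabulation compared against cutoff captures the idea.

```r
# Hypothetical stand-in for incidence.table(); not the package's actual code.
incidence_sketch <- function(id, category, cutoff = 1) {
  # Cross-tabulate: one row per unique id, one column per category,
  # each cell holding the number of replicates observed.
  counts <- unclass(table(id, category))
  # An entry is TRUE when at least `cutoff` replicates were observed.
  counts >= cutoff
}

# Made-up toy data: five samples drawn from two tissues.
id       <- c("m1", "m1", "m2", "m3", "m1")
category <- c("A",  "A",  "A",  "B",  "B")

incidence_sketch(id, category)              # any occurrence counts
incidence_sketch(id, category, cutoff = 2)  # at least two replicates required
```

With cutoff = 2, only ids that were sampled at least twice in a category remain TRUE for that category, which is the sense in which the parameter selects heavily expressed types.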
The function venn() draws a Venn diagram. Venn diagrams are commonly used in set theory to illustrate unions and intersections. Here the diagrams are augmented with numerical counts in each region, giving a graphical presentation of a cross-tabulation. We draw circles to represent the sets; this limits us to 2 or 3 sets, as the general Venn diagram with 4 or more sets requires more complicated shapes (Ruskey 2001). The function venn() can work with the same arguments as incidence.table(). These two functions are contained in the tiny R package venn which accompanies this note. They were written as part of the work described in McKinney et al. (2003).
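The counts shown in each region of such a diagram amount to a tabulation of membership patterns. The sketch below shows one way to obtain them from a logical incidence matrix; the helper region_counts and the example matrix are hypothetical and are not taken from the venn package.

```r
# Hypothetical helper (not part of the venn package): count the ids in
# each region of a Venn diagram, given a logical incidence matrix with
# one row per id and one column per set.
region_counts <- function(inc) {
  # Encode each row's membership pattern as a string such as "101".
  pattern <- apply(inc, 1, function(r) paste(as.integer(r), collapse = ""))
  # Drop ids that appear in none of the sets; they lie outside all circles.
  pattern <- pattern[rowSums(inc) > 0]
  table(pattern)
}

# Made-up incidence matrix for three sets A, B and C.
inc <- matrix(
  c(TRUE,  FALSE, FALSE,
    TRUE,  TRUE,  FALSE,
    TRUE,  TRUE,  TRUE,
    FALSE, FALSE, TRUE,
    TRUE,  FALSE, FALSE),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("m", 1:5), c("A", "B", "C"))
)

region_counts(inc)
```

Each pattern string reads in column order, so "110" is the region inside A and B but outside C. With three sets there are at most seven non-empty patterns, one per region of the three-circle diagram.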

Example
A test dataset consisted of 1920 samples from 3 tissues. There were 1267 unique "accession numbers", or mRNA identifiers, in the dataset. Figure 1 shows three Venn diagrams of these data. On the left, all id values (accession numbers) are included. This figure was produced using the code

R> venn(accession, libname, main = "All samples")

where accession was a vector containing the codes identifying the RNA sequences, and libname was a vector containing the codes identifying the tissue sample (library). We see that 22 of the mRNA types were found in all three tissues, 25 in both tissues A and B, etc. In the centre, only those mRNA types for which 5 or more samples were observed are shown. We now see that there were 8 types of mRNA which appeared at least 5 times in tissue A but fewer than 5 times in each of the other tissues.
To narrow our attention to those 8 types, we compute a logical variable keep that is TRUE for those samples whose mRNA is one of the 8 types identified above. Drawing the Venn diagram again for only those samples produces the diagram on the right, indicating that 5 of those mRNA types do not occur at all in the other tissues. Since they occurred frequently in the samples of tissue A, but are unique to that tissue, there is a strong suggestion that their function is related to the function of that tissue.
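The selection step can be sketched in base R as follows. This is an illustration under stated assumptions, not the calculation used for Figure 1: the data are made up, and the names counts, frequent, others and selected are hypothetical; only accession, libname and keep follow the text.

```r
# Made-up toy data standing in for the article's accession and libname vectors.
accession <- c(rep("x1", 6), rep("x2", 5), rep("x2", 2),
               rep("x3", 5), rep("x3", 5), "x4")
libname   <- c(rep("A", 6),  rep("A", 5),  rep("B", 2),
               rep("A", 5),  rep("B", 5),  "C")

# Replicate counts of each accession number in each tissue.
counts <- unclass(table(accession, libname))

# Types seen at least 5 times in tissue A but fewer than 5 times elsewhere.
others   <- setdiff(colnames(counts), "A")
frequent <- counts >= 5
selected <- rownames(counts)[
  frequent[, "A"] & rowSums(frequent[, others, drop = FALSE]) == 0
]

# TRUE for every sample whose mRNA is one of the selected types.
keep <- accession %in% selected

# Incidence (any occurrence) of the selected types in the tissues
# where they were sampled.
unclass(table(accession[keep], libname[keep])) >= 1
```

In this toy example both x1 and x2 are selected, but the final table shows that only x1 is confined to tissue A, which mirrors the distinction the right-hand diagram draws among the 8 types.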