Somoclu: An Efficient Parallel Library for Self-Organizing Maps

Somoclu is a massively parallel tool, written in C++, for training self-organizing maps on large data sets. It builds on OpenMP for multicore execution and on MPI for distributing the workload across the nodes in a cluster. It is also able to boost training by using CUDA if graphics processing units are available. A sparse kernel is included, which is useful for high-dimensional but sparse data, such as the vector spaces common in text mining workflows. Python, R, and MATLAB interfaces facilitate interactive use. Apart from fast execution, memory use is highly optimized, enabling the training of large emergent maps even on a single node.


Introduction
Self-organizing maps are a widespread visualization tool that embeds high-dimensional data in a two-dimensional plane while preserving the topological layout of the original data (Kohonen, 2001). They provide a visual representation of groups of similar data instances, and they are useful in analyzing the dynamics of evolving data collections. Training a map, however, is computationally demanding. While tools exist that scale to large data sets using cluster resources (Sul and Tovchigrechko, 2011; Wittek and Darányi, 2012), there is room for improvement in memory use, speed, and usability. Self-Organizing Maps On a Cluster (Somoclu) addresses these aspects with the following improvements:
• The efficiency of distributing the workload is improved;
• Memory use is reduced;
• A kernel for sparse data is introduced;
• Training time is reduced by graphics processing units (GPUs) when available;
• An extensive command-line interface is available.

Self-Organizing Maps in Batch Mode
The self-organizing map (SOM) training algorithm constructs a nonlinear, topology-preserving mapping of the input data set $X = \{x(t) \mid t \in \{t_0, \ldots, t_f\}\}$, where $t_0$ and $t_f$ are the beginning and the end of the current training session, onto a set of neurons $M = \{n_1, \ldots, n_k\}$ of a neural network with associated weight vectors $W = \{w_1(t), \ldots, w_k(t)\}$ at a given time step $t$. $W$ is known as the codebook.
Each data point $x(t)$ is mapped to its best matching unit $\mathrm{bm}(x(t)) = \operatorname{argmin}_k d(x(t), w_k(t))$, where $d$ is the distance function on the data set. The neurons are arranged on a two-dimensional map: each neuron $i$ has two coordinates in a grid. Next, the weight vector of the best matching neuron and its neighbours are adjusted toward the input pattern using the following equation: $w_k(t+1) = w_k(t) + \alpha h_{ck}(t)(x(t) - w_k(t))$, where $0 < \alpha < 1$ is the learning factor and $h_{ck}(t)$ is the neighbourhood function, which decreases for neurons further away from the best matching neuron in grid coordinates. A frequently used neighbourhood function is the Gaussian $h_{ck}(t) = \exp(-\|r_c - r_k\|^2 / \delta^2(t))$, where $r_k$ and $r_c$ stand for the grid coordinates of the respective neurons. The width $\delta(t)$ decreases from iteration to iteration to narrow the area of influence.
The training is repeated on the same data set to increase the fit; a training cycle is called an epoch. Eventually, the neighbourhood function decreases to an extent that training might stop. The time needed to train an SOM grows linearly with the data set size and linearly with the number of neurons in the SOM. The SOM also has a batch formulation of updating the weights, which is widely used in parallel implementations:

$$w_k(t_f) = \frac{\sum_{t'=t_0}^{t_f} h_{ck}(t')\, x(t')}{\sum_{t'=t_0}^{t_f} h_{ck}(t')}. \qquad (1)$$
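To make the batch formulation concrete, the following minimal C++ sketch performs one serial epoch: it accumulates the numerator and denominator of Equation 1, then updates the codebook. The names, data layout, and fixed neighbourhood radius are illustrative assumptions, not Somoclu's implementation:

    #include <cmath>
    #include <limits>
    #include <vector>

    // One batch epoch over nVectors instances of dimension dim on an
    // nSomX x nSomY map; the codebook is row-major, one weight vector
    // per neuron. Illustrative sketch only.
    void batchEpoch(const std::vector<float>& data, std::vector<float>& codebook,
                    unsigned nVectors, unsigned dim, unsigned nSomX,
                    unsigned nSomY, float radius) {
        const unsigned nNodes = nSomX * nSomY;
        std::vector<float> numerator(nNodes * dim, 0.0f);
        std::vector<float> denominator(nNodes, 0.0f);
        for (unsigned t = 0; t < nVectors; ++t) {
            const float* x = &data[t * dim];
            // Find the best matching unit (minimum squared Euclidean distance).
            unsigned bmu = 0;
            float best = std::numeric_limits<float>::max();
            for (unsigned k = 0; k < nNodes; ++k) {
                float d = 0.0f;
                for (unsigned j = 0; j < dim; ++j) {
                    float diff = x[j] - codebook[k * dim + j];
                    d += diff * diff;
                }
                if (d < best) { best = d; bmu = k; }
            }
            // Accumulate h_ck(t) * x(t) and h_ck(t) for every neuron.
            float cx = bmu % nSomX, cy = bmu / nSomX;
            for (unsigned k = 0; k < nNodes; ++k) {
                float dx = cx - (k % nSomX), dy = cy - (k / nSomX);
                float h = std::exp(-(dx * dx + dy * dy) / (radius * radius));
                denominator[k] += h;
                for (unsigned j = 0; j < dim; ++j)
                    numerator[k * dim + j] += h * x[j];
            }
        }
        // Batch update: w_k = sum_t h_ck(t) x(t) / sum_t h_ck(t).
        for (unsigned k = 0; k < nNodes; ++k)
            if (denominator[k] > 0.0f)
                for (unsigned j = 0; j < dim; ++j)
                    codebook[k * dim + j] = numerator[k * dim + j] / denominator[k];
    }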

The Software Package
The starting point was a MapReduce-based distributed implementation of SOM (Sul and Tovchigrechko, 2011). This implementation was later accelerated by GPUs (Wittek and Darányi, 2012). We found the MapReduce framework unnecessary and derived a purely MPI-based implementation that is more modular, faster, and includes a sparse kernel. Design details are provided in Subsection 3.1. We improved usability by developing a command-line interface; usage examples are given in Subsection 3.2.

Design
The dense CPU kernel is almost identical to the one implemented by Sul and Tovchigrechko (2011). The MapReduce calls were replaced by MPI functions, leading to more streamlined code. A single multicore node in a cluster launches as many MPI threads as there are cores. A performance bottleneck in the original implementation was the accumulation of local weights into a new global codebook by a single process on the master node; this step is now parallelized by an OpenMP directive. Furthermore, the influence radius δ(t) of a best matching node is thresholded, which translates to further speed improvements without compromising the quality of the trained map.

The GPU variant is considerably more complex than the CPU kernel. The complexity stems from the way the distance function is evaluated between the nodes of the SOM and the training data. To maximize parallelism, the Gram matrix is calculated, that is, a matrix of the distances between every data instance and the nodes of the SOM. A naïve approach would be to extend an efficient matrix multiplication algorithm, replacing the dot product by the distance function. Opting for the Euclidean distance, it is possible to derive an alternative formulation of calculating the Gram matrix using linear algebra operations (Li et al., 2010). Benchmarking the two approaches, we found the latter to be an order of magnitude faster on the GPU, mainly owing to a more favourable memory access pattern. We implemented the GPU kernel with Thrust, a C++ template library for CUDA that provides high-performance primitives, avoiding the need to manually tune individual GPU calls. Compared to the implementation by Wittek and Darányi (2012), the device memory use of the GPU code is reduced to approximately one-third, and the new implementation completely avoids costly matrix transposing operations.
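To illustrate the algebraic reformulation, the following sketch computes the full distance matrix through the identity ‖x − w‖² = ‖x‖² − 2x·w + ‖w‖². It is a minimal serial stand-in, not Somoclu's Thrust kernel; all names are illustrative, and the inner product loop represents what becomes a single dense matrix multiplication on the GPU:

    #include <vector>

    // Distance-matrix trick: d^2(x_i, w_k) = ||x_i||^2 - 2 x_i.w_k + ||w_k||^2,
    // so the whole matrix reduces to one matrix product (X W^T) plus rank-1
    // norm corrections. Plain loops stand in for the BLAS/Thrust primitives
    // used on the GPU.
    std::vector<float> distanceMatrix(const std::vector<float>& X,  // n x dim
                                      const std::vector<float>& W,  // m x dim
                                      unsigned n, unsigned m, unsigned dim) {
        std::vector<float> xNorm(n, 0.0f), wNorm(m, 0.0f), D(n * m);
        for (unsigned i = 0; i < n; ++i)
            for (unsigned j = 0; j < dim; ++j)
                xNorm[i] += X[i * dim + j] * X[i * dim + j];
        for (unsigned k = 0; k < m; ++k)
            for (unsigned j = 0; j < dim; ++j)
                wNorm[k] += W[k * dim + j] * W[k * dim + j];
        // The dominant cost, X W^T, maps to a single dense matrix product,
        // which is what gives the GPU its favourable memory access pattern.
        for (unsigned i = 0; i < n; ++i)
            for (unsigned k = 0; k < m; ++k) {
                float dot = 0.0f;
                for (unsigned j = 0; j < dim; ++j)
                    dot += X[i * dim + j] * W[k * dim + j];
                D[i * m + k] = xNorm[i] - 2.0f * dot + wNorm[k];
            }
        return D;
    }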
Further complexity arises from the disparity between the number of GPUs and the number of cores in a computer. The CPU kernel achieves maximum speed by running an MPI thread on each available core, resulting in a far lower number of data instances per core and a speedier local update of the weights. For instance, if there are eight cores and two GPUs, then each GPU has four times more data to process, and its corresponding MPI thread would have four times more data with which to update the local weights. While the GPU handles the load efficiently, it would be highly inefficient to use a single thread to update the local weights. We thus hybridized the kernel and rely on OpenMP to parallelize the weight update. The GPU implementation runs as many MPI threads on a node as there are GPUs, and uses all CPU cores in the weight update.
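A minimal sketch of the hybridized update follows, assuming the BMUs of the local data chunk have already been computed on the GPU; the names and data layout are illustrative assumptions, not Somoclu's internals:

    #include <cmath>
    #include <vector>

    // Each MPI process accumulates its local share of Equation 1 on the CPU.
    // The loop is parallelized over neurons with OpenMP, so every thread
    // writes to disjoint rows of the accumulators and no locking is needed.
    void accumulateLocal(const std::vector<float>& data,     // local instances
                         const std::vector<unsigned>& bmus,  // BMU per instance
                         std::vector<float>& numerator,      // nNodes x dim
                         std::vector<float>& denominator,    // nNodes
                         unsigned nLocal, unsigned dim,
                         unsigned nSomX, unsigned nSomY, float radius) {
        const int nNodes = static_cast<int>(nSomX * nSomY);
        #pragma omp parallel for
        for (int k = 0; k < nNodes; ++k) {
            for (unsigned t = 0; t < nLocal; ++t) {
                float dx = float(bmus[t] % nSomX) - float(k % nSomX);
                float dy = float(bmus[t] / nSomX) - float(k / nSomX);
                float h = std::exp(-(dx * dx + dy * dy) / (radius * radius));
                denominator[k] += h;
                for (unsigned j = 0; j < dim; ++j)
                    numerator[k * dim + j] += h * data[t * dim + j];
            }
        }
    }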
The sparse kernel is a straightforward extension of the dense kernel, and its main virtue is reduced memory use. A vector space coming from a text processing pipeline typically contains only 1-5% nonzero elements, leading to a 20-100x reduction in memory use with a sparse representation. This kernel does not have a GPU implementation, as the irregular access patterns that are inevitable with sparse data structures are not efficient on streaming architectures.
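As a sketch of how a sparse kernel can avoid touching zero entries, the following illustrative routine (not Somoclu's actual data structures) computes the squared Euclidean distance between a sparse instance and a dense codebook vector via the same norm identity as above:

    #include <vector>

    // A sparse instance stored as (index, value) pairs, as produced by
    // text-processing pipelines. Illustrative types only.
    struct SparseEntry { unsigned index; float value; };

    // ||x - w||^2 = ||w||^2 - 2 x.w + ||x||^2: with ||w||^2 precomputed,
    // only the nonzero entries of x are ever visited.
    float sparseDistance(const std::vector<SparseEntry>& x,
                         const float* w, float wNormSquared) {
        float dot = 0.0f, xNormSquared = 0.0f;
        for (const SparseEntry& e : x) {
            dot += e.value * w[e.index];
            xNormSquared += e.value * e.value;
        }
        return wNormSquared - 2.0f * dot + xNormSquared;
    }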

Command-line interface
The provided I/O routines read either a dense or a sparse matrix. The dense matrix is a plain-text matrix where the rows contain the data instances. The sparse representation is similarly row-oriented, and it uses the same sparse format as LIBSVM (Chang and Lin, 2001). Example files are available in the distribution. Extension to further matrix types is fairly simple.
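For illustration only, a hypothetical sparse input file with three instances in this index:value style could look as follows; the exact conventions (label column, index base) should be taken from the example files in the distribution:

    0:0.5 4:1.2 17:0.3
    2:0.8 4:0.1
    0:1.0 9:2.5 17:0.7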
The tool is used via the command line. Using default settings (50 × 50 map, dense CPU kernel) and the provided example file, the following command will write the codebook of the trained map and a corresponding U-matrix for visualization to respective files:

$ somoclu data/rgbs.txt data/rgbs

As a more sophisticated example, the following command will run on four cores in a single node, using the dense CPU kernel, and it will generate a map of 20 × 20 neurons:

$ mpirun -np 4 somoclu -k 0 -x 20 -y 20 data/rgbs.txt \
    data/rgbs

Further documentation and instructions for compilation are on the website.

As an API
Designed for MPI, Somoclu was not conceived to be used as an API. Yet, given sufficient preparation on the calling side, it is possible to interface with Somoclu as an API. The primary entry point for training is the following function, as specified in somoclu.h:

    void train(int itask, float *data, svm_node **sparseData,
               unsigned int nSomX, unsigned int nSomY,
               unsigned int nDimensions, unsigned int nVectors,
               unsigned int nVectorsPerRank, unsigned int nEpoch,
               const char *outPrefix, bool shouldSaveInterim,
               unsigned int kernelType);

The parameters nSomX, nSomY, nEpoch, shouldSaveInterim, and kernelType are self-explanatory: they are the same as in the command-line interface. The parameter outPrefix corresponds to the file name prefix that is common to all output files; this is always the last parameter on the command line. The parameter itask specifies the rank of the current MPI process. Data are stored in either data or sparseData, depending on whether the data are dense or sparse. Note that only the data belonging to the process are expected, not the full matrix. The parameter nVectors stores the total number of instances, whereas nVectorsPerRank stores the number of instances belonging to one MPI thread.
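A minimal sketch of a dense call to train() from C++ follows. Only the signature above is taken from somoclu.h; the build setup, the synthetic data, and the even chunking across ranks are assumptions for illustration:

    #include <mpi.h>
    #include <vector>
    #include "somoclu.h"

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nProcs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nProcs);

        const unsigned int nSomX = 20, nSomY = 20;
        const unsigned int nDimensions = 3, nVectors = 1000, nEpoch = 10;
        // Assume nVectors divides evenly across ranks for this sketch.
        const unsigned int nVectorsPerRank = nVectors / nProcs;

        // Each process holds only its own chunk of the data matrix.
        std::vector<float> data(nVectorsPerRank * nDimensions, 0.5f);

        train(rank, data.data(), NULL /* sparseData unused for dense input */,
              nSomX, nSomY, nDimensions, nVectors, nVectorsPerRank,
              nEpoch, "output/som", false /* shouldSaveInterim */,
              0 /* kernelType: dense CPU, as in the -k 0 example above */);

        MPI_Finalize();
        return 0;
    }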

Visualization
A gnuplot script for plotting the U-matrix of the trained map is provided with the source code as plot_som.gp.

Comparison and Benchmarks
To ensure replicability of the results, we benchmarked with publicly available cluster GPU instances provided by Amazon Web Services. The instance type was cg1.4xlarge, equipped with 22 GiB of memory, two Intel Xeon X5570 quad-core CPUs, and two NVIDIA Tesla M2050 GPUs, running Ubuntu 12.04. The most direct comparison would be with the implementations by Sul and Tovchigrechko (2011) and Wittek and Darányi (2012). Unfortunately, the former was not able to handle the matrix sizes we were benchmarking with. The implementation by Wittek and Darányi (2012) is sensitive to the size of the map and did not scale to the map sizes benchmarked here, so we left it out of the comparison. The number of data instances ranged from 12,500 to 100,000, and the number of dimensions was fixed at 1,000. The data elements were randomly generated, as we were interested in scalability. We tested map sizes of 25 × 25, 50 × 50, and 75 × 75. The largest map size with the largest data matrix filled the memory of a single device in a dual-GPU configuration, giving an upper limit to the single-node experiments. The map size did not affect the relative speed of the different kernels, so only results for the 50 × 50 map are shown in Figure 1. As shown in the papers cited above, scaling across nodes is near-linear, since communication is limited to the end of each training epoch; those results are omitted here.
The GPU results translate to a steady 5x speedup across the range. The bottleneck for the GPU kernel is still the local weight update loop; performing this update on the device as well should yield better performance. The sparse and dense kernels are nearly identical in performance, which is not surprising given that the codebook remains dense; the main advantage of the sparse kernel is the memory savings.

Summary
The core contributions of this work are as follows:
• Purely MPI-based, distributed implementation of self-organizing maps.
• Optimized hybrid CPU-GPU kernel for boosting training on dense data.
• Massive reduction in memory use and training time for all kernel types.

Acknowledgement
This work was supported by an AWS in Education Machine Learning Grant award.