Published by the Foundation for Open Access Statistics. Editors-in-chief: Bettina Grün, Torsten Hothorn, Rebecca Killick, Edzer Pebesma, Achim Zeileis. ISSN 1548-7660; CODEN JSSOBK
Authors: Alexander H. Foss, Marianthi Markatou
Title: kamila: Clustering Mixed-Type Data in R and Hadoop
Abstract: In this paper we discuss the challenge of equitably combining continuous (quantitative) and categorical (qualitative) variables for the purpose of cluster analysis. Existing techniques require strong parametric assumptions or difficult-to-specify tuning parameters. We describe the kamila package, which includes a weighted k-means approach to clustering mixed-type data, a method for estimating weights for mixed-type data (Modha-Spangler weighting), and an additional semiparametric method recently proposed in the literature (KAMILA). We include a discussion of strategies for estimating the number of clusters in the data, and describe the implementation of one such method in the current R package. Background and usage of these clustering methods are presented. We then show how the KAMILA algorithm can be adapted to a map-reduce framework, and implement the resulting algorithm using Hadoop for clustering very large mixed-type data sets.

Page views: 13202. Submitted: 2016-08-28. Published: 2018-02-27.
Paper: kamila: Clustering Mixed-Type Data in R and Hadoop (PDF; downloads: 6816)
kamila_0.1.1.2.tar.gz: R source package (downloads: 154; 39KB)
Replication materials (downloads: 136; 133KB)

DOI: 10.18637/jss.v083.i13

This work is licensed under the following licenses:
Paper: Creative Commons Attribution 3.0 Unported License
Code: GNU General Public License (at least one of version 2 or version 3) or a GPL-compatible license.