
Authors: Kurt Hornik, Patrick Mair, Johannes Rauch, Wilhelm Geiger, Christian Buchta, Ingo Feinerer
Title: The textcat Package for n-Gram Based Text Categorization in R
Reference: Vol. 52, Issue 6, Feb 2013
Submitted 2011-07-11, Accepted 2012-10-01
Type: Article
Abstract:

Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization, which implements both the Cavnar and Trenkle approach and a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.
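
To make the idea of categorization by character n-gram frequencies concrete, the following is a minimal Python sketch of a rank-based scheme in the spirit of Cavnar and Trenkle (1994). It is not the textcat implementation and does not cover the reduced n-gram approach. The function names (ngram_profile, out_of_place, categorize), the underscore padding, the profile size of 400, and the toy training sentences are all assumptions made for illustration.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=5, size=400):
    """Return the `size` most frequent character n-grams (n = 1..max_n),
    most frequent first. Words are lower-cased and padded with "_"."""
    counts = Counter()
    for word in re.findall(r"[^\W\d_]+", text.lower()):  # runs of letters only
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    counts.pop("_", None)  # the bare padding character carries no information
    # Sort by descending count; break ties alphabetically so ranks are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences between two profiles. An n-gram missing from
    the category profile receives a fixed maximum penalty (an assumption here:
    the length of the category profile)."""
    rank = {gram: r for r, gram in enumerate(cat_profile)}
    penalty = len(cat_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else penalty
        for r, gram in enumerate(doc_profile)
    )


def categorize(text, profiles):
    """Return the category whose profile is closest to that of `text`."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda cat: out_of_place(doc, profiles[cat]))


# Toy training data (hypothetical); real profiles need far more text.
samples = {
    "english": "the quick brown fox jumps over the lazy dog and runs away",
    "german": "der schnelle braune fuchs springt über den faulen hund und läuft weg",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
print(categorize("the dog runs over the fox", profiles))
```

Ranking n-grams rather than comparing raw counts keeps the distance insensitive to document length, which is one reason the approach works on short texts. With training samples this small the result is only indicative.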

Paper: The textcat Package for n-Gram Based Text Categorization in R (application/pdf, 421.2 KB)
Supplements:
textcat_1.0-0.tar.gz: R source package (application/x-gzip, 215.7 KB)
v52i06.R: R example and replication code (application/octet-stream, 5.7 KB)
simTCWiki.rda: Supplementary data in R binary format (application/octet-stream, 334.8 KB)
This work is licensed under the following licenses:
Paper: Creative Commons Attribution 3.0 Unported License
Code: GNU General Public License (at least one of version 2 or version 3) or a GPL-compatible license.