VCorpus {tm}    R Documentation

Volatile Corpus

Description

Data structures and operators for volatile corpora.

Usage

Corpus(x, readerControl = list(reader = x$DefaultReader, language = "en"))
VCorpus(x, readerControl = list(reader = x$DefaultReader, language = "en"))
## S3 method for class 'VCorpus'
DMetaData(x)
## S3 method for class 'Corpus'
CMetaData(x)

Arguments

x

A Source object for Corpus and VCorpus, and a corpus for the other functions.

readerControl

A list with the named components reader, a reading function capable of handling the file format found in x, and language, giving the text's language (preferably as an IETF language tag). The default language is assumed to be English ("en"). Use NA to avoid internal assumptions (e.g., when the language is unknown or is deliberately not set).
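As a minimal sketch of how readerControl can be supplied, the following builds a small corpus from an in-memory character vector. The sample texts and the variable names (txt, corp) are made up for illustration, and the sketch assumes that a reader component left out of readerControl falls back to the source's default reader.

```r
library(tm)

# Hypothetical sample texts, used only for illustration
txt <- c("Erstes Dokument.", "Zweites Dokument.")

# VectorSource treats each element of the vector as one document.
# Only the language is set here; the reader is assumed to default
# to the one provided by the source.
corp <- VCorpus(VectorSource(txt),
                readerControl = list(language = "de"))

print(length(corp))  # number of documents in the corpus
print(class(corp))   # class vector of the corpus object
```

Passing language = NA instead of "de" would follow the advice above for texts whose language is unknown.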

Details

Volatile means that the corpus is kept entirely in memory, so all changes affect only the corresponding R object. In contrast, a corpus implementation with permanent semantics is also available (see PCorpus).

The constructed corpus object inherits from a list and has two attributes containing meta information:

CMetaData

Corpus Meta Data contains corpus-specific metadata in the form of tag-value pairs and information about children in the form of a binary tree. This information is useful for reconstructing metadata after, e.g., merging corpora.

DMetaData

Document Meta Data of class data.frame contains document-specific metadata for the corpus. This data frame typically holds clustering or classification results, which are essentially metadata for documents but form a separate entity (e.g., with its own name, value range, etc.).
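The two attributes can be inspected with the accessors listed under Usage. The sketch below assumes the tm version documented on this page (0.5-10); other releases of the package may not provide DMetaData and CMetaData. The sample texts and variable names are hypothetical.

```r
library(tm)

# Hypothetical two-document corpus, used only for illustration
corp <- VCorpus(VectorSource(c("first document", "second document")))

# Document-level metadata: a data frame describing the documents
doc_meta <- DMetaData(corp)
print(doc_meta)

# Corpus-level metadata: tag-value pairs plus information about children
corp_meta <- CMetaData(corp)
print(corp_meta)
```

Keeping document-level results in a data frame means that values such as cluster labels can sit next to the documents without altering the documents themselves.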

Value

An object of class VCorpus, which extends the classes Corpus and list, containing a collection of text documents.

Author(s)

Ingo Feinerer

Examples

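# Locate the sample Reuters-21578 XML files shipped with tm
reut21578 <- system.file("texts", "crude", package = "tm")
# Build a volatile corpus from that directory; the outer parentheses print the result
(r <- Corpus(DirSource(reut21578),
             readerControl = list(reader = readReut21578XMLasPlain)))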

[Package tm version 0.5-10]