PCorpus {tm}    R Documentation

Permanent Corpus Constructor

Description

Construct a permanent corpus.

Usage

PCorpus(x,
        readerControl = list(reader = x$DefaultReader, language = "en"),
        dbControl = list(dbName = "", dbType = "DB1"))
DBControl(x)
## S3 method for class 'PCorpus'
DMetaData(x)

Arguments

x

A Source object for PCorpus, and a corpus for the other functions.

readerControl

A list with the named components reader representing a reading function capable of handling the file format found in x, and language giving the text's language (preferably as IETF language tags). The default language is assumed to be English ("en"). Use NA to avoid internal assumptions (e.g., when the language is unknown or is deliberately not set).

dbControl

A list with the named components dbName, giving the name of the file that holds the documents stored outside of R (i.e., the database), and dbType, giving a valid database type as supported by package filehash. With database support activated, the tm package uses the database to keep as few resources in memory as possible.

Details

Permanent means that documents are physically stored outside of R (e.g., in a database) and that R objects are only pointers to external structures. Consequently, a change in the underlying external representation can affect multiple R objects simultaneously.
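The pointer semantics can be illustrated with package filehash alone. The following is a minimal sketch, not the code tm uses internally: the handle names, the key "doc1", and the temporary file are assumptions made for illustration. Two handles are opened on the same DB1 file, and a value written through one is read back through the other.

```r
library(filehash)

# Illustrative temporary database file; any writable path would do.
db_file <- tempfile(fileext = ".db")
dbCreate(db_file, type = "DB1")

# Two independent R objects pointing at the same external storage.
handle_a <- dbInit(db_file, type = "DB1")
handle_b <- dbInit(db_file, type = "DB1")

# Write through the first handle, read through the second.
dbInsert(handle_a, "doc1", "original text")
first_read <- dbFetch(handle_b, "doc1")

# Overwrite through the first handle; the second handle sees the change.
dbInsert(handle_a, "doc1", "changed text")
second_read <- dbFetch(handle_b, "doc1")

print(first_read)
print(second_read)

unlink(db_file)
```

Neither handle holds the document itself, so copying a handle in R does not copy the stored data.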

The constructed corpus object inherits from a list and has three attributes containing meta and database management information:

CMetaData

Corpus Meta Data contains corpus-specific metadata in the form of tag-value pairs, and information about children in the form of a binary tree. This information is useful for reconstructing metadata, for example after merging corpora.

DMetaData

Document Meta Data of class data.frame contains document-specific metadata for the corpus. This data frame typically holds clustering or classification results, which are metadata for documents but form a separate entity (e.g., with its own name, value range, etc.).

DBControl

The database control field is a list with two named components: dbName holds the path to the permanent database storage, and dbType stores the database type.
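To make the DMetaData description concrete, the sketch below builds a plain data frame of the kind such an attribute might hold. It is illustrative only: the column names cluster and label and their values are hypothetical and are not produced by tm.

```r
# Hypothetical per-document results for a three-document corpus,
# one row per document in corpus order.
doc_meta <- data.frame(
  cluster = c(1L, 2L, 1L),
  label = c("poetry", "prose", "poetry"),
  stringsAsFactors = FALSE
)

# Each column is a separate entity with its own name and value range.
names(doc_meta)
range(doc_meta$cluster)
table(doc_meta$label)
```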

Value

An object of class PCorpus, which extends the classes Corpus and list, containing a permanent corpus.

Author(s)

Ingo Feinerer

Examples

txt <- system.file("texts", "txt", package = "tm")
## Not run: PCorpus(DirSource(txt),
        dbControl = list(dbName = "myDB.db", dbType = "DB1"))
## End(Not run)
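A runnable variant of the example above, offered as a sketch: it assumes that packages tm and filehash are installed, and it writes to a temporary file (the name db_file is illustrative) rather than to myDB.db in the working directory. After construction, the same file is opened directly with filehash to show that the documents live in the database.

```r
library(tm)
library(filehash)

txt <- system.file("texts", "txt", package = "tm")

# Illustrative temporary path; any writable file name works.
db_file <- tempfile(fileext = ".db")

pc <- PCorpus(DirSource(txt),
              dbControl = list(dbName = db_file, dbType = "DB1"))

# The corpus object in R refers to documents held in the database.
class(pc)
length(pc)

# Open the same file with filehash and list the stored keys.
db <- dbInit(db_file, type = "DB1")
stored_keys <- dbList(db)
print(stored_keys)

unlink(db_file)
```

The exact set of keys may differ between tm versions, so the sketch only prints them.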

[Package tm version 0.5-10 Index]