readXML {tm}    R Documentation

Read In an XML Document

Description

Return a function that reads in an XML document. The structure of the XML document is described by a specification.

Usage

readXML(spec, doc)

Arguments

spec

A named list of lists, each containing two components. The constructed reader maps each list entry to the attribute or metadata entry given by the entry's name. Valid names are Content (to access the document's content), any valid attribute name, and any other character string, which is mapped to a LocalMetaData entry.

Each list entry must consist of two components: the first is a string giving the type of the second component, and the second is the specification entry itself. Valid combinations are:

type = "node", spec = "XPathExpression"

The XPath expression spec extracts information from an XML node.

type = "attribute", spec = "XPathExpression"

The XPath expression spec extracts information from an attribute of an XML node.

type = "function", spec = function(tree) ...

The function spec is called with a tree representation of the parsed XML document (as delivered by xmlInternalTreeParse from package XML) as its first argument.

type = "unevaluated", spec = "String"

The character vector spec is returned without modification.

doc

An (empty) document of some subclass of TextDocument.
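
As a rough illustration of the four type/spec combinations listed for spec, the following sketch builds a specification list in plain R. The XPath expressions and the element names (item, title, date, id) are hypothetical and not taken from any real feed. The function entry is only defined here, not called, so package XML is not needed to construct the list.

```r
# Hypothetical specification; all XPath expressions are made up for illustration.
spec <- list(
  # "node": XPath expression selecting an XML node
  Heading = list("node", "/item/title"),
  # "attribute": XPath expression selecting an attribute of an XML node
  ID = list("attribute", "/item/@id"),
  # "function": called by the reader with the parsed tree as first argument
  DateTimeStamp = list("function", function(tree)
    sapply(XML::getNodeSet(tree, "/item/date"), XML::xmlValue)),
  # "unevaluated": returned without modification
  Origin = list("unevaluated", "Example archive")
)

# Every entry has exactly two components; the first one names the type.
types <- vapply(spec, function(entry) entry[[1]], character(1))
n_components <- vapply(spec, length, integer(1))
```

Under these assumptions, Heading and ID would be treated as attribute names if the document class defines them, and as LocalMetaData entries otherwise.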

Details

Formally, this function is a function generator: it returns a function (which reads in a text document) with a well-defined signature, and the returned function can still access the arguments passed to the generator (e.g., the specification) via lexical scoping.
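
The function-generator pattern can be sketched in base R as follows. This is not the implementation used by tm. make_reader and the fields of its return value are hypothetical names chosen only to show how the inner function finds spec through lexical scoping.

```r
# Hypothetical generator: returns a reader with the signature (elem, language, id).
make_reader <- function(spec) {
  function(elem, language, id) {
    # `spec` is not an argument of this inner function; R looks it up
    # in the environment where the inner function was created.
    list(id = id,
         language = language,
         fields = names(spec),
         content = elem$content)
  }
}

reader <- make_reader(list(Heading = list("node", "/item/title")))
result <- reader(elem = list(content = "<item><title>Hi</title></item>"),
                 language = "en",
                 id = "doc-1")
```

The caller of reader never passes spec again, yet result$fields still reflects the specification given to make_reader.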

Value

A function with the signature elem, language, id:

elem

a list with the named component content which must hold the document to be read in.

language

a string giving the text's language.

id

a unique identification string for the returned text document.

The function returns doc augmented with the information that spec extracts from the XML document in elem$content.

Author(s)

Ingo Feinerer

See Also

Vignette 'Extensions: How to Handle Custom File Formats', XMLSource.

getReaders to list available reader functions.

Examples

# Reader for items from the Gmane mailing list archive.
readGmane <-
    readXML(spec = list(Author = list("node", "/item/creator"),
                        Content = list("node", "/item/description"),
                        # Parse the timestamp in /item/date as GMT.
                        DateTimeStamp = list("function", function(node)
                            strptime(sapply(XML::getNodeSet(node, "/item/date"),
                                            XML::xmlValue),
                                     format = "%Y-%m-%dT%H:%M:%S",
                                     tz = "GMT")),
                        Description = list("unevaluated", ""),
                        Heading = list("node", "/item/title"),
                        ID = list("node", "/item/link"),
                        Origin = list("unevaluated", "Gmane Mailing List Archive")),
            doc = PlainTextDocument())

[Package tm version 0.5-10 Index]