tokenizer {tm}                R Documentation
Description:

Tokenize a document or character vector.
Usage:

MC_tokenizer(x)
scan_tokenizer(x)
Arguments:

x: A character vector.
Details:

The quality and correctness of a tokenization algorithm depend heavily on the context and application scenario. Relevant factors are the language of the underlying text and the notions of whitespace and of punctuation marks, both of which can vary with the language and the encoding used. Consequently, for superior results you probably need a custom tokenization function.
scan_tokenizer: Relies on scan(..., what = "character").
MC_tokenizer: Implements the functionality of the tokenizer in the MC toolkit (http://www.cs.utexas.edu/users/dml/software/mc/).
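Since scan_tokenizer relies on scan() as noted above, its behaviour can be approximated directly in base R. The sketch below is an assumption about the arguments involved (tm's actual internal call may differ):

```r
# Hypothetical approximation of scan_tokenizer() via base R's scan();
# the exact arguments tm uses internally are an assumption here.
my_scan_tokenizer <- function(x) {
  # sep defaults to "", so tokens are separated by any run of whitespace
  scan(text = x, what = "character", quote = "", quiet = TRUE)
}

my_scan_tokenizer("A sample sentence.")  # "A" "sample" "sentence."
```

Note that, as with whitespace splitting in general, trailing punctuation stays attached to its token ("sentence.").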
Value:

A character vector consisting of the tokens obtained by tokenization of x.
Author(s):

Ingo Feinerer
Examples:

data("crude")
MC_tokenizer(crude[[1]])
scan_tokenizer(crude[[1]])
## A custom tokenizer that splits on runs of whitespace
strsplit_space_tokenizer <- function(x) unlist(strsplit(x, "[[:space:]]+"))
strsplit_space_tokenizer(crude[[1]])
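A custom tokenizer can also handle punctuation instead of splitting on whitespace alone. The following sketch (the function name and regular expression are illustrative assumptions, not part of tm) extracts maximal runs of alphanumeric characters and drops punctuation entirely:

```r
# Hypothetical punctuation-dropping tokenizer (not part of tm):
# keep maximal runs of alphanumeric characters only.
word_tokenizer <- function(x)
  unlist(regmatches(x, gregexpr("[[:alnum:]]+", x)))

word_tokenizer("Crude oil prices rose, traders said.")
# "Crude" "oil" "prices" "rose" "traders" "said"
```

Compare this with strsplit_space_tokenizer above, which would keep "rose," and "said." as tokens with punctuation attached.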