| GPos-class {GenomicRanges} | R Documentation |
The GPos class is a container for storing a set of genomic positions, that is, genomic ranges of width 1. Even though a GRanges object can be used for that, using a GPos object can be much more memory-efficient, especially when the object contains long runs of adjacent positions.
GPos(pos_runs) # constructor function
pos_runs |
A GRanges object (or any other GenomicRanges derivative)
where each range is interpreted as a run of adjacent genomic positions.
If |
A GPos object.
GPos objects support the same set of getters as GRanges
objects (i.e. seqnames(), start(), end(),
ranges(), strand(), mcols(), seqinfo(),
etc...), plus the pos() getter which is equivalent to
start() or end(). See ?GRanges for the list of
getters supported by GRanges objects.
Note that a GPos object cannot hold names i.e. names()
always returns NULL on it.
Like GRanges objects, GPos objects support the following setters:
The mcols() and metadata() setters.
The family of setters that operate on the seqinfo component of
an object:
seqlevels(),
seqlevelsStyle(),
seqlengths(),
isCircular(),
genome(),
and seqinfo().
These setters are defined and documented in the GenomeInfoDb
package.
However, there is no seqnames(), pos(), or strand()
setter for GPos objects at the moment (although they might be
added in the future).
From GenomicRanges to GPos:
A GenomicRanges object x in which all the ranges have a width
of 1 can be coerced to a GPos object with as(x, "GPos").
The names on x are not propagated (a warning is issued if x
has names on it).
From GPos to GRanges:
A GPos object x can be coerced to a GRanges object
with as(x, "GRanges"). However be aware that the resulting object
can use thousands times (or more) memory than x!
See "MEMORY USAGE" in the Examples section below.
From GPos to ordinary R objects:
Like with a GRanges object, as.character(), as.factor(),
and as.data.frame() work with a GPos object x.
Note that as.data.frame(x) returns a data frame with a pos
column (containing pos(x)) instead of the start, end,
and width columns that one gets when x is a GRanges
object.
A GPos object can be subsetted exactly like a GRanges object.
GPos objects can be combined (a.k.a. appended) with c() or
append().
Like with a GRanges object, split() and relist() work
with a GPos object x. Note that they return a
GenomicRangesList object instead of a GRangesList object.
Like for any Vector derivative, the length of a
GPos object cannot exceed .Machine$integer.max (i.e. 2^31 on
most platforms). GPos() will return an error if pos_runs
contains too many genomic positions.
Hervé Pagès; based on ideas borrowed from Georg Stricker georg.stricker@in.tum.de and Julien Gagneur gagneur@in.tum.de
GRanges objects.
The seqinfo accessor and family in
the GenomeInfoDb package for accessing/modifying the seqinfo
component of an object.
GenomicRanges-comparison for comparing and ordering genomic positions.
findOverlaps-methods for finding overlapping genomic ranges and/or positions.
nearest-methods for finding the nearest genomic range/position neighbor.
The snpsBySeqname,
snpsByOverlaps, and
snpsById methods for
SNPlocs objects defined in the BSgenome
package for extractors that return a GPos object.
SummarizedExperiment objects in the SummarizedExperiment package.
## ---------------------------------------------------------------------
## BASIC EXAMPLES
## ---------------------------------------------------------------------
## Example 1:
gpos1 <- GPos(c("chr1:44-53", "chr1:5-10", "chr2:2-5"))
gpos1
length(gpos1)
seqnames(gpos1)
pos(gpos1) # same as 'start(gpos1)' and 'end(gpos1)'
strand(gpos1)
as.character(gpos1)
as.data.frame(gpos1)
as(gpos1, "GRanges")
as.data.frame(as(gpos1, "GRanges"))
gpos1[9:17]
## Example 2:
pos_runs <- GRanges("chrI", IRanges(c(1, 6, 12, 17), c(5, 10, 16, 20)),
strand=c("+", "-", "-", "+"))
gpos2 <- GPos(pos_runs)
gpos2
strand(gpos2)
## Example 3:
gpos3A <- gpos3B <- GPos(c("chrI:1-1000", "chrI:1005-2000"))
npos <- length(gpos3A)
mcols(gpos3A)$sample <- Rle("sA")
sA_counts <- sample(10, npos, replace=TRUE)
mcols(gpos3A)$counts <- sA_counts
mcols(gpos3B)$sample <- Rle("sB")
sB_counts <- sample(10, npos, replace=TRUE)
mcols(gpos3B)$counts <- sB_counts
gpos3 <- c(gpos3A, gpos3B)
gpos3
## Example 4:
library(BSgenome.Scerevisiae.UCSC.sacCer2)
genome <- BSgenome.Scerevisiae.UCSC.sacCer2
gpos4 <- GPos(seqinfo(genome))
gpos4 # all the positions along the genome are represented
mcols(gpos4)$dna <- do.call("c", unname(as.list(genome)))
gpos4
## Note however that, like for any Vector derivative, the length of a
## GPos object cannot exceed '.Machine$integer.max' (i.e. 2^31 on most
## platforms) so the above only works with a "small" genome.
## For example it doesn't work with the Human genome:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
## Not run:
GPos(seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)) # error!
## End(Not run)
## You can use isSmallGenome() to check upfront whether the genome is
## "small" or not.
isSmallGenome(genome)
isSmallGenome(TxDb.Hsapiens.UCSC.hg19.knownGene)
## ---------------------------------------------------------------------
## MEMORY USAGE
## ---------------------------------------------------------------------
## Coercion to GRanges works...
gr4 <- as(gpos4, "GRanges")
gr4
## ... but is generally not a good idea:
object.size(gpos4)
object.size(gr4) # 6951 times bigger than the GPos object!
## Shuffling the order of the positions impacts memory usage:
gpos4r <- rev(gpos4)
object.size(gpos4r) # significantly
gpos4s <- sample(gpos4)
object.size(gpos4s) # even worse!
## AN IMPORTANT NOTE: In the worst situations, GPos still performs as
## good as a GRanges object.
object.size(as(gpos4r, "GRanges")) # same size as 'gpos4r'
object.size(as(gpos4s, "GRanges")) # same size as 'gpos4s'
## Best case scenario is when the object is strictly sorted (i.e.
## positions are in strict ascending order).
## This can be checked with:
is.unsorted(gpos4, strict=TRUE) # 'gpos4' is strictly sorted
## ---------------------------------------------------------------------
## USING MEMORY-EFFICIENT METADATA COLUMNS
## ---------------------------------------------------------------------
## In order to keep memory usage as low as possible, it is recommended
## to use a memory-efficient representation of the metadata columns that
## we want to set on the object. Rle's are particularly well suited for
## this, especially if the metadata columns contain long runs of
## identical values. This is the case for example if we want to use a
## GPos object to represent the coverage of sequencing reads along a
## genome.
## Example 5:
library(pasillaBamSubset)
library(Rsamtools) # for the BamFile() constructor function
bamfile1 <- BamFile(untreated1_chr4())
bamfile2 <- BamFile(untreated3_chr4())
gpos5 <- GPos(seqinfo(bamfile1))
library(GenomicAlignments) # for "coverage" method for BamFile objects
cov1 <- unlist(coverage(bamfile1), use.names=FALSE)
cov2 <- unlist(coverage(bamfile2), use.names=FALSE)
mcols(gpos5) <- DataFrame(cov1, cov2)
gpos5
object.size(gpos5) # lightweight
## Keep only the positions where coverage is at least 10 in one of the
## 2 samples:
gpos5[mcols(gpos5)$cov1 >= 10 | mcols(gpos5)$cov2 >= 10]
## ---------------------------------------------------------------------
## USING A GPos OBJECT IN A SummarizedExperiment OBJECT
## ---------------------------------------------------------------------
## Because the GPos class extends the GenomicRanges virtual class, a GPos
## object can be used as the rowRanges component of a SummarizedExperiment
## object.
## As a 1st example, we show how the counts for samples sA and sB in
## 'gpos3' can be stored in a SummarizedExperiment object where the rows
## correspond to unique genomic positions and the columns to samples:
library(SummarizedExperiment)
counts <- cbind(sA=sA_counts, sB=sB_counts)
mcols(gpos3A) <- NULL
rse3 <- SummarizedExperiment(list(counts=counts), rowRanges=gpos3A)
rse3
rowRanges(rse3)
head(assay(rse3))
## Finally we show how the coverage data from Example 5 can be easily
## stored in a lightweight SummarizedExperiment object:
cov <- mcols(gpos5)
mcols(gpos5) <- NULL
rse5 <- SummarizedExperiment(list(cov=cov), rowRanges=gpos5)
rse5
rowRanges(rse5)
assay(rse5)
## Keep only the positions where coverage is at least 10 in one of the
## 2 samples:
rse5[assay(rse5)$cov1 >= 10 | assay(rse5)$cov2 >= 10]