EnsDb-AnnotationDbi.Rd
Several of the methods available for AnnotationDbi
objects are
also implemented for EnsDb
objects. This enables to extract
data from EnsDb
objects in a similar fashion than from objects
inheriting from the base annotation package class
AnnotationDbi
.
In addition to the standard usage, the select
and
mapIds
for EnsDb
objects support also the filter
framework of the ensembdb package and thus allow to perform more
fine-grained queries to retrieve data.
# S4 method for EnsDb
columns(x)
# S4 method for EnsDb
keys(x, keytype, filter,...)
# S4 method for EnsDb
keytypes(x)
# S4 method for EnsDb
mapIds(x, keys, column, keytype, ..., multiVals)
# S4 method for EnsDb
select(x, keys, columns, keytype, ...)
(In alphabetic order)
For mapIds
: the column to search on, i.e. from which values
should be retrieved.
For select
: the columns from which values should be
retrieved. Use the columns
method to list all possible
columns.
The keys/ids for which data should be retrieved from the
database. This can be either a character vector of keys/IDs, a
single filter object extending
AnnotationFilter
, an combination of
filters AnnotationFilterList
or a
formula
representing a filter expression (see
AnnotationFilter
for more details).
For mapIds
and select
: the type (column) that matches
the provided keys. This argument does not have to be specified if
argument keys
is a filter object extending
AnnotationFilter
or a list
of such objects.
For keys
: which keys should be returned from the database.
For keys
: either a single object extending
AnnotationFilter
or a list of such object to
retrieve only specific keys from the database.
What should mapIds
do when there are multiple values that
could be returned? Options are: "first"
(default), "list"
,
"filter"
, "asNA"
. See
mapIds
in the AnnotationDbi
package
for a detailed description.
The EnsDb
object.
Not used.
List all the columns that can be retrieved by the mapIds
and select
methods. Note that these column names are
different from the ones supported by the genes
,
transcripts
etc. methods that can be listed by the
listColumns
method.
Returns a character vector of supported column names.
Retrieves all keys from the column name specified with
keytype
. By default (if keytype
is not provided) it
returns all gene IDs. Note that keytype="TXNAME"
will
return transcript ids, since no transcript names are available in
the database.
Returns a character vector of IDs.
List all supported key types (column names).
Returns a character vector of key types.
Retrieve the mapped ids for a set of keys that are of a particular
keytype. Argument keys
can be either a character vector of
keys/IDs, a single filter object extending
AnnotationFilter
or a list of such objects. For
the latter, the argument keytype
does not have to be
specified. Importantly however, if the filtering system is used,
the ordering of the results might not represent the ordering of
the keys.
The method usually returns a named character vector or, depending
on the argument multiVals
a named list, with names
corresponding to the keys (same ordering is only guaranteed if
keys
is a character vector).
Retrieve the data as a data.frame
based on parameters for
selected keys
, columns
and keytype
arguments. Multiple matches of the keys are returned in one row
for each possible match. Argument keys
can be either a
character vector of keys/IDs, a single filter object extending
AnnotationFilter
or a list of such objects. For
the latter, the argument keytype
does not have to be
specified.
Note that values from a column "TXNAME"
will be the same
than for a column "TXID"
, since internally no database
column "tx_name"
is present and the column is thus mapped
to "tx_id"
.
Returns a data.frame
with the column names corresponding to
the argument columns
and rows with all data matching the
criteria specified with keys
.
The use of select
without filters or keys and without
restricting to specicic columns is strongly discouraged, as the
SQL query to join all of the tables, especially if protein
annotation data is available is very expensive.
See method description above.
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86
## List all supported keytypes.
keytypes(edb)
#> [1] "ENTREZID" "EXONID" "GENEBIOTYPE"
#> [4] "GENEID" "GENENAME" "PROTDOMID"
#> [7] "PROTEINDOMAINID" "PROTEINDOMAINSOURCE" "PROTEINID"
#> [10] "SEQNAME" "SEQSTRAND" "SYMBOL"
#> [13] "TXBIOTYPE" "TXID" "TXNAME"
#> [16] "UNIPROTID"
## List all supported columns for the select and mapIds methods.
columns(edb)
#> [1] "ENTREZID" "EXONID" "EXONIDX"
#> [4] "EXONSEQEND" "EXONSEQSTART" "GENEBIOTYPE"
#> [7] "GENEID" "GENENAME" "GENESEQEND"
#> [10] "GENESEQSTART" "INTERPROACCESSION" "ISCIRCULAR"
#> [13] "PROTDOMEND" "PROTDOMSTART" "PROTEINDOMAINID"
#> [16] "PROTEINDOMAINSOURCE" "PROTEINID" "PROTEINSEQUENCE"
#> [19] "SEQCOORDSYSTEM" "SEQLENGTH" "SEQNAME"
#> [22] "SEQSTRAND" "SYMBOL" "TXBIOTYPE"
#> [25] "TXCDSSEQEND" "TXCDSSEQSTART" "TXID"
#> [28] "TXNAME" "TXSEQEND" "TXSEQSTART"
#> [31] "UNIPROTDB" "UNIPROTID" "UNIPROTMAPPINGTYPE"
## List /real/ database column names.
listColumns(edb)
#> [1] "seq_name" "seq_length" "is_circular"
#> [4] "gene_id" "entrezid" "exon_id"
#> [7] "exon_seq_start" "exon_seq_end" "gene_name"
#> [10] "gene_biotype" "gene_seq_start" "gene_seq_end"
#> [13] "seq_strand" "seq_coord_system" "symbol"
#> [16] "tx_id" "protein_id" "protein_sequence"
#> [19] "protein_domain_id" "protein_domain_source" "interpro_accession"
#> [22] "prot_dom_start" "prot_dom_end" "tx_biotype"
#> [25] "tx_seq_start" "tx_seq_end" "tx_cds_seq_start"
#> [28] "tx_cds_seq_end" "tx_name" "exon_idx"
#> [31] "uniprot_id" "uniprot_db" "uniprot_mapping_type"
## Retrieve all keys corresponding to transcript ids.
txids <- keys(edb, keytype = "TXID")
length(txids)
#> [1] 216741
head(txids)
#> [1] "ENST00000000233" "ENST00000000412" "ENST00000000442" "ENST00000001008"
#> [5] "ENST00000001146" "ENST00000002125"
## Retrieve all keys corresponding to gene names of genes encoded on chromosome X
gids <- keys(edb, keytype = "GENENAME", filter = SeqNameFilter("X"))
length(gids)
#> [1] 2300
head(gids)
#> [1] "TSPAN6" "TNMD" "LAS1L" "CD99" "KLHL13" "ARX"
## Get a mapping of the genes BCL2 and BCL2L11 to all of their
## transcript ids and return the result as list
maps <- mapIds(edb, keys = c("BCL2", "BCL2L11"), column = "TXID",
keytype = "GENENAME", multiVals = "list")
maps
#> $BCL2
#> [1] "ENST00000398117" "ENST00000333681" "ENST00000590515" "ENST00000589955"
#>
#> $BCL2L11
#> [1] "ENST00000432179" "ENST00000308659" "ENST00000393256" "ENST00000393252"
#> [5] "ENST00000433098" "ENST00000405953" "ENST00000415458" "ENST00000436733"
#> [9] "ENST00000437029" "ENST00000452231" "ENST00000361493" "ENST00000431217"
#> [13] "ENST00000439718" "ENST00000438054" "ENST00000337565" "ENST00000622509"
#> [17] "ENST00000619294" "ENST00000610735" "ENST00000622612" "ENST00000357757"
#> [21] "ENST00000615946" "ENST00000621302" "ENST00000620862" "LRG_620t1"
#> [25] "LRG_620t2" "LRG_620t3" "LRG_620t4" "LRG_620t5"
#>
## Perform the same query using a combination of a GeneNameFilter and a
## TxBiotypeFilter to just retrieve protein coding transcripts for these
## two genes.
mapIds(edb, keys = list(GeneNameFilter(c("BCL2", "BCL2L11")),
TxBiotypeFilter("protein_coding")), column = "TXID",
multiVals = "list")
#> Warning: Got 2 filter objects. Will use the keys of the first for the mapping!
#> Note: ordering of the results might not match ordering of keys!
#> $BCL2
#> [1] "ENST00000398117" "ENST00000333681" "ENST00000589955"
#>
#> $BCL2L11
#> [1] "ENST00000432179" "ENST00000308659" "ENST00000393256" "ENST00000393252"
#> [5] "ENST00000405953" "ENST00000438054" "ENST00000337565" "ENST00000622509"
#> [9] "ENST00000619294" "ENST00000610735" "ENST00000622612" "ENST00000357757"
#> [13] "ENST00000615946" "ENST00000621302" "ENST00000620862"
#>
## select:
## Retrieve all transcript and gene related information for the above example.
select(edb, keys = list(GeneNameFilter(c("BCL2", "BCL2L11")),
TxBiotypeFilter("protein_coding")),
columns = c("GENEID", "GENENAME", "TXID", "TXBIOTYPE", "TXSEQSTART",
"TXSEQEND", "SEQNAME", "SEQSTRAND"))
#> Note: ordering of the results might not match ordering of keys!
#> GENEID GENENAME TXID TXBIOTYPE TXSEQSTART TXSEQEND
#> 1 ENSG00000171791 BCL2 ENST00000398117 protein_coding 63123346 63320128
#> 2 ENSG00000171791 BCL2 ENST00000333681 protein_coding 63127035 63319786
#> 3 ENSG00000171791 BCL2 ENST00000589955 protein_coding 63313802 63318812
#> 4 ENSG00000153094 BCL2L11 ENST00000432179 protein_coding 111119378 111124112
#> 5 ENSG00000153094 BCL2L11 ENST00000308659 protein_coding 111120914 111165048
#> 6 ENSG00000153094 BCL2L11 ENST00000393256 protein_coding 111120929 111168447
#> 7 ENSG00000153094 BCL2L11 ENST00000393252 protein_coding 111122670 111123960
#> 8 ENSG00000153094 BCL2L11 ENST00000405953 protein_coding 111123746 111128837
#> 9 ENSG00000153094 BCL2L11 ENST00000438054 protein_coding 111123752 111146284
#> 10 ENSG00000153094 BCL2L11 ENST00000337565 protein_coding 111120914 111128844
#> 11 ENSG00000153094 BCL2L11 ENST00000622509 protein_coding 111120914 111168445
#> 12 ENSG00000153094 BCL2L11 ENST00000619294 protein_coding 111120914 111168445
#> 13 ENSG00000153094 BCL2L11 ENST00000610735 protein_coding 111120914 111168445
#> 14 ENSG00000153094 BCL2L11 ENST00000622612 protein_coding 111120914 111168445
#> 15 ENSG00000153094 BCL2L11 ENST00000357757 protein_coding 111120914 111146281
#> 16 ENSG00000153094 BCL2L11 ENST00000615946 protein_coding 111120914 111168445
#> 17 ENSG00000153094 BCL2L11 ENST00000621302 protein_coding 111120914 111168445
#> 18 ENSG00000153094 BCL2L11 ENST00000620862 protein_coding 111120914 111168445
#> SEQNAME SEQSTRAND
#> 1 18 -1
#> 2 18 -1
#> 3 18 -1
#> 4 2 1
#> 5 2 1
#> 6 2 1
#> 7 2 1
#> 8 2 1
#> 9 2 1
#> 10 2 1
#> 11 2 1
#> 12 2 1
#> 13 2 1
#> 14 2 1
#> 15 2 1
#> 16 2 1
#> 17 2 1
#> 18 2 1
## Get all data for genes encoded on chromosome Y
Y <- select(edb, keys = "Y", keytype = "SEQNAME")
head(Y)
#> ENTREZID EXONID EXONIDX EXONSEQEND EXONSEQSTART GENEBIOTYPE
#> 1 8284 ENSE00001902471 1 19716695 19716596 protein_coding
#> 2 8284 ENSE00003665408 2 19716473 19716279 protein_coding
#> 3 8284 ENSE00003572891 3 19716004 19715823 protein_coding
#> 4 8284 ENSE00003452486 4 19715739 19715584 protein_coding
#> 5 8284 ENSE00003535737 5 19710472 19710376 protein_coding
#> 6 8284 ENSE00003306433 6 19710260 19709451 protein_coding
#> GENEID GENENAME GENESEQEND GENESEQSTART INTERPROACCESSION ISCIRCULAR
#> 1 ENSG00000012817 KDM5D 19744939 19703865 <NA> 0
#> 2 ENSG00000012817 KDM5D 19744939 19703865 <NA> 0
#> 3 ENSG00000012817 KDM5D 19744939 19703865 <NA> 0
#> 4 ENSG00000012817 KDM5D 19744939 19703865 <NA> 0
#> 5 ENSG00000012817 KDM5D 19744939 19703865 <NA> 0
#> 6 ENSG00000012817 KDM5D 19744939 19703865 <NA> 0
#> PROTDOMEND PROTDOMSTART PROTEINDOMAINID PROTEINDOMAINSOURCE PROTEINID
#> 1 NA NA <NA> <NA> <NA>
#> 2 NA NA <NA> <NA> <NA>
#> 3 NA NA <NA> <NA> <NA>
#> 4 NA NA <NA> <NA> <NA>
#> 5 NA NA <NA> <NA> <NA>
#> 6 NA NA <NA> <NA> <NA>
#> PROTEINSEQUENCE SEQCOORDSYSTEM SEQLENGTH SEQNAME SEQSTRAND SYMBOL
#> 1 <NA> chromosome 57227415 Y -1 KDM5D
#> 2 <NA> chromosome 57227415 Y -1 KDM5D
#> 3 <NA> chromosome 57227415 Y -1 KDM5D
#> 4 <NA> chromosome 57227415 Y -1 KDM5D
#> 5 <NA> chromosome 57227415 Y -1 KDM5D
#> 6 <NA> chromosome 57227415 Y -1 KDM5D
#> TXBIOTYPE TXCDSSEQEND TXCDSSEQSTART TXID TXNAME
#> 1 retained_intron NA NA ENST00000469599 ENST00000469599
#> 2 retained_intron NA NA ENST00000469599 ENST00000469599
#> 3 retained_intron NA NA ENST00000469599 ENST00000469599
#> 4 retained_intron NA NA ENST00000469599 ENST00000469599
#> 5 retained_intron NA NA ENST00000469599 ENST00000469599
#> 6 retained_intron NA NA ENST00000469599 ENST00000469599
#> TXSEQEND TXSEQSTART UNIPROTDB UNIPROTID UNIPROTMAPPINGTYPE
#> 1 19716695 19703865 <NA> <NA> <NA>
#> 2 19716695 19703865 <NA> <NA> <NA>
#> 3 19716695 19703865 <NA> <NA> <NA>
#> 4 19716695 19703865 <NA> <NA> <NA>
#> 5 19716695 19703865 <NA> <NA> <NA>
#> 6 19716695 19703865 <NA> <NA> <NA>
nrow(Y)
#> [1] 28264
## Get selected columns for all lincRNAs encoded on chromosome Y. Here we use
## a filter expression to define what data to retrieve.
Y <- select(edb, keys = ~ seq_name == "Y" & gene_biotype == "lincRNA",
columns = c("GENEID", "GENEBIOTYPE", "TXID", "GENENAME"))
#> Note: ordering of the results might not match ordering of keys!
head(Y)
#> GENEID GENEBIOTYPE TXID GENENAME SEQNAME
#> 1 ENSG00000129816 lincRNA ENST00000250776 TTTY1B Y
#> 2 ENSG00000129845 lincRNA ENST00000250805 TTTY1 Y
#> 3 ENSG00000131538 lincRNA ENST00000253838 TTTY6 Y
#> 4 ENSG00000147753 lincRNA ENST00000457100 TTTY7 Y
#> 5 ENSG00000147753 lincRNA ENST00000276770 TTTY7 Y
#> 6 ENSG00000147753 lincRNA ENST00000449828 TTTY7 Y
nrow(Y)
#> [1] 73