Several of the methods available for AnnotationDbi objects are also implemented for EnsDb objects. This enables to extract data from EnsDb objects in a similar fashion than from objects inheriting from the base annotation package class AnnotationDbi. In addition to the standard usage, the select and mapIds for EnsDb objects support also the filter framework of the ensembdb package and thus allow to perform more fine-grained queries to retrieve data.

# S4 method for EnsDb
columns(x)
# S4 method for EnsDb
keys(x, keytype, filter,...)
# S4 method for EnsDb
keytypes(x)
# S4 method for EnsDb
mapIds(x, keys, column, keytype, ..., multiVals)
# S4 method for EnsDb
select(x, keys, columns, keytype, ...)

Arguments

(In alphabetic order)

column

For mapIds: the column to search on, i.e. from which values should be retrieved.

columns

For select: the columns from which values should be retrieved. Use the columns method to list all possible columns.

keys

The keys/ids for which data should be retrieved from the database. This can be either a character vector of keys/IDs, a single filter object extending AnnotationFilter, an combination of filters AnnotationFilterList or a formula representing a filter expression (see AnnotationFilter for more details).

keytype

For mapIds and select: the type (column) that matches the provided keys. This argument does not have to be specified if argument keys is a filter object extending AnnotationFilter or a list of such objects.

For keys: which keys should be returned from the database.

filter

For keys: either a single object extending AnnotationFilter or a list of such object to retrieve only specific keys from the database.

multiVals

What should mapIds do when there are multiple values that could be returned? Options are: "first" (default), "list", "filter", "asNA". See mapIds in the AnnotationDbi package for a detailed description.

x

The EnsDb object.

...

Not used.

Methods and Functions

columns

List all the columns that can be retrieved by the mapIds and select methods. Note that these column names are different from the ones supported by the genes, transcripts etc. methods that can be listed by the listColumns method.

Returns a character vector of supported column names.

keys

Retrieves all keys from the column name specified with keytype. By default (if keytype is not provided) it returns all gene IDs. Note that keytype="TXNAME" will return transcript ids, since no transcript names are available in the database.

Returns a character vector of IDs.

keytypes

List all supported key types (column names).

Returns a character vector of key types.

mapIds

Retrieve the mapped ids for a set of keys that are of a particular keytype. Argument keys can be either a character vector of keys/IDs, a single filter object extending AnnotationFilter or a list of such objects. For the latter, the argument keytype does not have to be specified. Importantly however, if the filtering system is used, the ordering of the results might not represent the ordering of the keys.

The method usually returns a named character vector or, depending on the argument multiVals a named list, with names corresponding to the keys (same ordering is only guaranteed if keys is a character vector).

select

Retrieve the data as a data.frame based on parameters for selected keys, columns and keytype arguments. Multiple matches of the keys are returned in one row for each possible match. Argument keys can be either a character vector of keys/IDs, a single filter object extending AnnotationFilter or a list of such objects. For the latter, the argument keytype does not have to be specified.

Note that values from a column "TXNAME" will be the same than for a column "TXID", since internally no database column "tx_name" is present and the column is thus mapped to "tx_id".

Returns a data.frame with the column names corresponding to the argument columns and rows with all data matching the criteria specified with keys.

The use of select without filters or keys and without restricting to specicic columns is strongly discouraged, as the SQL query to join all of the tables, especially if protein annotation data is available is very expensive.

Value

See method description above.

Author

Johannes Rainer

Examples


library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

## List all supported keytypes.
keytypes(edb)
#>  [1] "ENTREZID"            "EXONID"              "GENEBIOTYPE"        
#>  [4] "GENEID"              "GENENAME"            "PROTDOMID"          
#>  [7] "PROTEINDOMAINID"     "PROTEINDOMAINSOURCE" "PROTEINID"          
#> [10] "SEQNAME"             "SEQSTRAND"           "SYMBOL"             
#> [13] "TXBIOTYPE"           "TXID"                "TXNAME"             
#> [16] "UNIPROTID"          

## List all supported columns for the select and mapIds methods.
columns(edb)
#>  [1] "ENTREZID"            "EXONID"              "EXONIDX"            
#>  [4] "EXONSEQEND"          "EXONSEQSTART"        "GENEBIOTYPE"        
#>  [7] "GENEID"              "GENENAME"            "GENESEQEND"         
#> [10] "GENESEQSTART"        "INTERPROACCESSION"   "ISCIRCULAR"         
#> [13] "PROTDOMEND"          "PROTDOMSTART"        "PROTEINDOMAINID"    
#> [16] "PROTEINDOMAINSOURCE" "PROTEINID"           "PROTEINSEQUENCE"    
#> [19] "SEQCOORDSYSTEM"      "SEQLENGTH"           "SEQNAME"            
#> [22] "SEQSTRAND"           "SYMBOL"              "TXBIOTYPE"          
#> [25] "TXCDSSEQEND"         "TXCDSSEQSTART"       "TXID"               
#> [28] "TXNAME"              "TXSEQEND"            "TXSEQSTART"         
#> [31] "UNIPROTDB"           "UNIPROTID"           "UNIPROTMAPPINGTYPE" 

## List /real/ database column names.
listColumns(edb)
#>  [1] "seq_name"              "seq_length"            "is_circular"          
#>  [4] "gene_id"               "entrezid"              "exon_id"              
#>  [7] "exon_seq_start"        "exon_seq_end"          "gene_name"            
#> [10] "gene_biotype"          "gene_seq_start"        "gene_seq_end"         
#> [13] "seq_strand"            "seq_coord_system"      "symbol"               
#> [16] "tx_id"                 "protein_id"            "protein_sequence"     
#> [19] "protein_domain_id"     "protein_domain_source" "interpro_accession"   
#> [22] "prot_dom_start"        "prot_dom_end"          "tx_biotype"           
#> [25] "tx_seq_start"          "tx_seq_end"            "tx_cds_seq_start"     
#> [28] "tx_cds_seq_end"        "tx_name"               "exon_idx"             
#> [31] "uniprot_id"            "uniprot_db"            "uniprot_mapping_type" 

## Retrieve all keys corresponding to transcript ids.
txids <- keys(edb, keytype = "TXID")
length(txids)
#> [1] 216741
head(txids)
#> [1] "ENST00000000233" "ENST00000000412" "ENST00000000442" "ENST00000001008"
#> [5] "ENST00000001146" "ENST00000002125"

## Retrieve all keys corresponding to gene names of genes encoded on chromosome X
gids <- keys(edb, keytype = "GENENAME", filter = SeqNameFilter("X"))
length(gids)
#> [1] 2300
head(gids)
#> [1] "TSPAN6" "TNMD"   "LAS1L"  "CD99"   "KLHL13" "ARX"   

## Get a mapping of the genes BCL2 and BCL2L11 to all of their
## transcript ids and return the result as list
maps <- mapIds(edb, keys = c("BCL2", "BCL2L11"), column = "TXID",
               keytype = "GENENAME", multiVals = "list")
maps
#> $BCL2
#> [1] "ENST00000398117" "ENST00000333681" "ENST00000590515" "ENST00000589955"
#> 
#> $BCL2L11
#>  [1] "ENST00000432179" "ENST00000308659" "ENST00000393256" "ENST00000393252"
#>  [5] "ENST00000433098" "ENST00000405953" "ENST00000415458" "ENST00000436733"
#>  [9] "ENST00000437029" "ENST00000452231" "ENST00000361493" "ENST00000431217"
#> [13] "ENST00000439718" "ENST00000438054" "ENST00000337565" "ENST00000622509"
#> [17] "ENST00000619294" "ENST00000610735" "ENST00000622612" "ENST00000357757"
#> [21] "ENST00000615946" "ENST00000621302" "ENST00000620862" "LRG_620t1"      
#> [25] "LRG_620t2"       "LRG_620t3"       "LRG_620t4"       "LRG_620t5"      
#> 

## Perform the same query using a combination of a GeneNameFilter and a
## TxBiotypeFilter to just retrieve protein coding transcripts for these
## two genes.
mapIds(edb, keys = list(GeneNameFilter(c("BCL2", "BCL2L11")),
                        TxBiotypeFilter("protein_coding")), column = "TXID",
       multiVals = "list")
#> Warning: Got 2 filter objects. Will use the keys of the first for the mapping!
#> Note: ordering of the results might not match ordering of keys!
#> $BCL2
#> [1] "ENST00000398117" "ENST00000333681" "ENST00000589955"
#> 
#> $BCL2L11
#>  [1] "ENST00000432179" "ENST00000308659" "ENST00000393256" "ENST00000393252"
#>  [5] "ENST00000405953" "ENST00000438054" "ENST00000337565" "ENST00000622509"
#>  [9] "ENST00000619294" "ENST00000610735" "ENST00000622612" "ENST00000357757"
#> [13] "ENST00000615946" "ENST00000621302" "ENST00000620862"
#> 

## select:
## Retrieve all transcript and gene related information for the above example.
select(edb, keys = list(GeneNameFilter(c("BCL2", "BCL2L11")),
                        TxBiotypeFilter("protein_coding")),
       columns = c("GENEID", "GENENAME", "TXID", "TXBIOTYPE", "TXSEQSTART",
                   "TXSEQEND", "SEQNAME", "SEQSTRAND"))
#> Note: ordering of the results might not match ordering of keys!
#>             GENEID GENENAME            TXID      TXBIOTYPE TXSEQSTART  TXSEQEND
#> 1  ENSG00000171791     BCL2 ENST00000398117 protein_coding   63123346  63320128
#> 2  ENSG00000171791     BCL2 ENST00000333681 protein_coding   63127035  63319786
#> 3  ENSG00000171791     BCL2 ENST00000589955 protein_coding   63313802  63318812
#> 4  ENSG00000153094  BCL2L11 ENST00000432179 protein_coding  111119378 111124112
#> 5  ENSG00000153094  BCL2L11 ENST00000308659 protein_coding  111120914 111165048
#> 6  ENSG00000153094  BCL2L11 ENST00000393256 protein_coding  111120929 111168447
#> 7  ENSG00000153094  BCL2L11 ENST00000393252 protein_coding  111122670 111123960
#> 8  ENSG00000153094  BCL2L11 ENST00000405953 protein_coding  111123746 111128837
#> 9  ENSG00000153094  BCL2L11 ENST00000438054 protein_coding  111123752 111146284
#> 10 ENSG00000153094  BCL2L11 ENST00000337565 protein_coding  111120914 111128844
#> 11 ENSG00000153094  BCL2L11 ENST00000622509 protein_coding  111120914 111168445
#> 12 ENSG00000153094  BCL2L11 ENST00000619294 protein_coding  111120914 111168445
#> 13 ENSG00000153094  BCL2L11 ENST00000610735 protein_coding  111120914 111168445
#> 14 ENSG00000153094  BCL2L11 ENST00000622612 protein_coding  111120914 111168445
#> 15 ENSG00000153094  BCL2L11 ENST00000357757 protein_coding  111120914 111146281
#> 16 ENSG00000153094  BCL2L11 ENST00000615946 protein_coding  111120914 111168445
#> 17 ENSG00000153094  BCL2L11 ENST00000621302 protein_coding  111120914 111168445
#> 18 ENSG00000153094  BCL2L11 ENST00000620862 protein_coding  111120914 111168445
#>    SEQNAME SEQSTRAND
#> 1       18        -1
#> 2       18        -1
#> 3       18        -1
#> 4        2         1
#> 5        2         1
#> 6        2         1
#> 7        2         1
#> 8        2         1
#> 9        2         1
#> 10       2         1
#> 11       2         1
#> 12       2         1
#> 13       2         1
#> 14       2         1
#> 15       2         1
#> 16       2         1
#> 17       2         1
#> 18       2         1

## Get all data for genes encoded on chromosome Y
Y <- select(edb, keys = "Y", keytype = "SEQNAME")
head(Y)
#>   ENTREZID          EXONID EXONIDX EXONSEQEND EXONSEQSTART    GENEBIOTYPE
#> 1     8284 ENSE00001902471       1   19716695     19716596 protein_coding
#> 2     8284 ENSE00003665408       2   19716473     19716279 protein_coding
#> 3     8284 ENSE00003572891       3   19716004     19715823 protein_coding
#> 4     8284 ENSE00003452486       4   19715739     19715584 protein_coding
#> 5     8284 ENSE00003535737       5   19710472     19710376 protein_coding
#> 6     8284 ENSE00003306433       6   19710260     19709451 protein_coding
#>            GENEID GENENAME GENESEQEND GENESEQSTART INTERPROACCESSION ISCIRCULAR
#> 1 ENSG00000012817    KDM5D   19744939     19703865              <NA>          0
#> 2 ENSG00000012817    KDM5D   19744939     19703865              <NA>          0
#> 3 ENSG00000012817    KDM5D   19744939     19703865              <NA>          0
#> 4 ENSG00000012817    KDM5D   19744939     19703865              <NA>          0
#> 5 ENSG00000012817    KDM5D   19744939     19703865              <NA>          0
#> 6 ENSG00000012817    KDM5D   19744939     19703865              <NA>          0
#>   PROTDOMEND PROTDOMSTART PROTEINDOMAINID PROTEINDOMAINSOURCE PROTEINID
#> 1         NA           NA            <NA>                <NA>      <NA>
#> 2         NA           NA            <NA>                <NA>      <NA>
#> 3         NA           NA            <NA>                <NA>      <NA>
#> 4         NA           NA            <NA>                <NA>      <NA>
#> 5         NA           NA            <NA>                <NA>      <NA>
#> 6         NA           NA            <NA>                <NA>      <NA>
#>   PROTEINSEQUENCE SEQCOORDSYSTEM SEQLENGTH SEQNAME SEQSTRAND SYMBOL
#> 1            <NA>     chromosome  57227415       Y        -1  KDM5D
#> 2            <NA>     chromosome  57227415       Y        -1  KDM5D
#> 3            <NA>     chromosome  57227415       Y        -1  KDM5D
#> 4            <NA>     chromosome  57227415       Y        -1  KDM5D
#> 5            <NA>     chromosome  57227415       Y        -1  KDM5D
#> 6            <NA>     chromosome  57227415       Y        -1  KDM5D
#>         TXBIOTYPE TXCDSSEQEND TXCDSSEQSTART            TXID          TXNAME
#> 1 retained_intron          NA            NA ENST00000469599 ENST00000469599
#> 2 retained_intron          NA            NA ENST00000469599 ENST00000469599
#> 3 retained_intron          NA            NA ENST00000469599 ENST00000469599
#> 4 retained_intron          NA            NA ENST00000469599 ENST00000469599
#> 5 retained_intron          NA            NA ENST00000469599 ENST00000469599
#> 6 retained_intron          NA            NA ENST00000469599 ENST00000469599
#>   TXSEQEND TXSEQSTART UNIPROTDB UNIPROTID UNIPROTMAPPINGTYPE
#> 1 19716695   19703865      <NA>      <NA>               <NA>
#> 2 19716695   19703865      <NA>      <NA>               <NA>
#> 3 19716695   19703865      <NA>      <NA>               <NA>
#> 4 19716695   19703865      <NA>      <NA>               <NA>
#> 5 19716695   19703865      <NA>      <NA>               <NA>
#> 6 19716695   19703865      <NA>      <NA>               <NA>
nrow(Y)
#> [1] 28264

## Get selected columns for all lincRNAs encoded on chromosome Y. Here we use
## a filter expression to define what data to retrieve.
Y <- select(edb, keys = ~ seq_name == "Y" & gene_biotype == "lincRNA",
            columns = c("GENEID", "GENEBIOTYPE", "TXID", "GENENAME"))
#> Note: ordering of the results might not match ordering of keys!
head(Y)
#>            GENEID GENEBIOTYPE            TXID GENENAME SEQNAME
#> 1 ENSG00000129816     lincRNA ENST00000250776   TTTY1B       Y
#> 2 ENSG00000129845     lincRNA ENST00000250805    TTTY1       Y
#> 3 ENSG00000131538     lincRNA ENST00000253838    TTTY6       Y
#> 4 ENSG00000147753     lincRNA ENST00000457100    TTTY7       Y
#> 5 ENSG00000147753     lincRNA ENST00000276770    TTTY7       Y
#> 6 ENSG00000147753     lincRNA ENST00000449828    TTTY7       Y
nrow(Y)
#> [1] 73