Generating an Ensembl annotation package from Ensembl

The functions described on this page build EnsDb annotation objects/databases from Ensembl annotations. The most complete set of annotations, which also includes the NCBI Entrezgene identifiers for each gene, can be retrieved with the functions that use the Ensembl Perl API (i.e. fetchTablesFromEnsembl and makeEnsemblSQLiteFromTables). Alternatively, the functions ensDbFromAH, ensDbFromGRanges, ensDbFromGff and ensDbFromGtf build EnsDb databases from GFF or GTF files from Ensembl, which can either be downloaded manually from the Ensembl ftp server or retrieved directly from within R using AnnotationHub. The generated SQLite database can be packaged into an R package using makeEnsembldbPackage.

ensDbFromAH(ah, outfile, path, organism, genomeVersion, version)

ensDbFromGRanges(x, outfile, path, organism, genomeVersion,
                 version, ...)

ensDbFromGff(gff, outfile, path, organism, genomeVersion,
             version, ...)

ensDbFromGtf(gtf, outfile, path, organism, genomeVersion,
             version, ...)

fetchTablesFromEnsembl(version, ensemblapi, user="anonymous",
                       host="ensembldb.ensembl.org", pass="",
                       port=5306, species="human")

makeEnsemblSQLiteFromTables(path=".", dbname)

makeEnsembldbPackage(ensdb, version, maintainer, author,
                     destDir=".", license="Artistic-2.0")

Arguments

(in alphabetical order)

ah

For ensDbFromAH: an AnnotationHub object representing a single resource (i.e. a GTF file from Ensembl) from AnnotationHub.

author

The author of the package.

dbname

The name for the database (optional). By default a name based on the species and Ensembl version will be automatically generated (and returned by the function).

destDir

Where the package should be saved to.

ensdb

The file name of the SQLite database generated by makeEnsemblSQLiteFromTables.

ensemblapi

The path to the Ensembl Perl API installed locally on the system. The version of the Ensembl Perl API has to match the Ensembl version specified with version.

genomeVersion

For ensDbFromAH, ensDbFromGtf and ensDbFromGff: the version of the genome (e.g. "GRCh37"). If not provided the function will try to guess it from the file name (assuming file name convention of Ensembl GTF files).

gff

The GFF file to import.

gtf

The GTF file name.

host

The hostname to access the Ensembl database.

license

The license of the package.

maintainer

The maintainer of the package.

organism

For ensDbFromAH, ensDbFromGff and ensDbFromGtf: the organism name (e.g. "Homo_sapiens"). If not provided the function will try to guess it from the file name (assuming file name convention of Ensembl GTF files).

outfile

The desired file name of the SQLite file. If not provided the name of the GTF file will be used.

pass

The password for the Ensembl database.

path

The directory in which the tables retrieved by fetchTablesFromEnsembl or the SQLite database file generated by ensDbFromGtf are stored.

port

The port to be used to connect to the Ensembl database.

species

The species for which the annotations should be retrieved.

user

The username for the Ensembl database.

version

For fetchTablesFromEnsembl, ensDbFromGRanges and ensDbFromGtf: the Ensembl version for which the annotation should be retrieved (e.g. 75). The ensDbFromGtf function will try to guess the Ensembl version from the GTF file name if not provided.

For makeEnsembldbPackage: the version for the package.

x

For ensDbFromGRanges: the GRanges object.

...

Currently not used.

Functions

ensDbFromAH: Create an EnsDb (SQLite) database from a GTF file provided by AnnotationHub. The function returns the file name of the generated database file. For usage see the examples below.
ensDbFromGff: Create an EnsDb (SQLite) database from a GFF file from Ensembl. The function returns the file name of the generated database file. For usage see the examples below.
ensDbFromGtf: Create an EnsDb (SQLite) database from a GTF file from Ensembl. The function returns the file name of the generated database file. For usage see the examples below.
ensDbFromGRanges: Create an EnsDb (SQLite) database from a GRanges object (e.g. from AnnotationHub). The function returns the file name of the generated database file. For usage see the examples below.
fetchTablesFromEnsembl: Uses the Ensembl Perl API to fetch all required data from an Ensembl database server and stores them locally in text files (that can be used as input for the makeEnsemblSQLiteFromTables function).
makeEnsemblSQLiteFromTables: Creates the SQLite EnsDb database from the tables generated by fetchTablesFromEnsembl.
makeEnsembldbPackage: Creates an R package containing the EnsDb database from an EnsDb SQLite database created by any of the above functions ensDbFromAH, ensDbFromGff, ensDbFromGtf or makeEnsemblSQLiteFromTables.

Details

The fetchTablesFromEnsembl function internally calls the Perl script get_gene_transcript_exon_tables.pl to retrieve all required information from the Ensembl database using the Ensembl Perl API.

Alternatively, an EnsDb database file can be generated with ensDbFromGtf or ensDbFromGff from a GTF or GFF file downloaded from the Ensembl ftp server, or with ensDbFromAH directly from the corresponding resource in AnnotationHub. The returned database file name can then be passed to makeEnsembldbPackage, or the database can be loaded directly with the EnsDb constructor.
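The organism, genomeVersion and version arguments are guessed from the file name when they are not provided, assuming the Ensembl naming convention <organism>.<genome version>.<Ensembl version>.<extension>. The following base R sketch illustrates that convention on the two file names used in the examples. The helper parse_ensembl_file_name is hypothetical and is not part of ensembldb; the package's own parsing may differ.

```r
## Hypothetical helper (NOT part of ensembldb): split an Ensembl GTF/GFF
## file name into the pieces the ensDbFrom* functions try to guess.
parse_ensembl_file_name <- function(file) {
    ## drop the directory and the .gtf/.gff/.gff3 (optionally .gz) suffix
    base <- sub("\\.(gtf|gff3?)(\\.gz)?$", "", basename(file))
    parts <- strsplit(base, ".", fixed = TRUE)[[1]]
    if (length(parts) < 3)
        stop("file name does not follow <organism>.<genome>.<version>")
    n <- length(parts)
    list(organism = parts[1],
         ## the genome version may itself contain dots (e.g. "Rnor_6.0")
         genomeVersion = paste(parts[2:(n - 1)], collapse = "."),
         ## returned as a character string, not converted to a number
         version = parts[n])
}

parse_ensembl_file_name("Homo_sapiens.GRCh37.75.gtf.gz")
parse_ensembl_file_name("Rattus_norvegicus.Rnor_6.0.83.gff3.gz")
```

If a file has been renamed and no longer follows this pattern, the safer choice is to pass organism, genomeVersion and version explicitly.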

Note

A local installation of the Ensembl Perl API is required for fetchTablesFromEnsembl. See http://www.ensembl.org/info/docs/api/api_installation.html for installation instructions.
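The examples on this page note that the ensemblapi argument can be omitted when the Ensembl API is listed in the PERL5LIB environment variable. A rough way to check this from R before calling fetchTablesFromEnsembl is sketched below. The helper ensembl_api_on_perl5lib is hypothetical (not part of ensembldb) and only looks for the word "ensembl" in the path entries; it cannot verify that the API is installed correctly or that its version matches.

```r
## Hypothetical helper (NOT part of ensembldb): heuristic check whether any
## PERL5LIB entry looks like an Ensembl API path.
ensembl_api_on_perl5lib <- function(perl5lib = Sys.getenv("PERL5LIB")) {
    if (!nzchar(perl5lib))
        return(FALSE)
    ## PERL5LIB entries are separated by ":" on Unix-alikes, ";" on Windows
    entries <- strsplit(perl5lib, .Platform$path.sep, fixed = TRUE)[[1]]
    any(grepl("ensembl", entries, ignore.case = TRUE))
}

if (!ensembl_api_on_perl5lib())
    message("No Ensembl API entry found in PERL5LIB; ",
            "pass its path with the 'ensemblapi' argument.")
```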

A database generated from GTF/GFF files lacks the NCBI Entrezgene IDs, as these are not available in the GTF/GFF files from Ensembl.

Value

makeEnsemblSQLiteFromTables, ensDbFromAH, ensDbFromGRanges, ensDbFromGff and ensDbFromGtf: the name of the SQLite file.

Author

Johannes Rainer

Examples


if (FALSE) {

    ## get all human gene/transcript/exon annotations from Ensembl (75)
    ## the resulting tables will be stored by default to the current working
    ## directory; if the correct Ensembl api (version 75) is defined in the
    ## PERL5LIB environment variable, the ensemblapi parameter can also be omitted.
    fetchTablesFromEnsembl(75,
                           ensemblapi="/home/bioinfo/ensembl/75/API/ensembl/modules",
                           species="human")

    ## These tables can then be processed to generate a SQLite database
    ## containing the annotations
    DBFile <- makeEnsemblSQLiteFromTables()

    ## and finally we can generate the package
    makeEnsembldbPackage(ensdb=DBFile, version="0.0.1",
                         maintainer="Johannes Rainer <johannes.rainer@eurac.edu>",
                         author="J Rainer")

    ## Build an annotation database from a GFF file from Ensembl.
    ## ftp://ftp.ensembl.org/pub/release-83/gff3/rattus_norvegicus
    gff <- "Rattus_norvegicus.Rnor_6.0.83.gff3.gz"
    DB <- ensDbFromGff(gff=gff)
    edb <- EnsDb(DB)
    edb

    ## Build an annotation file from a GTF file.
    ## the GTF file can be downloaded from
    ## ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/
    gtffile <- "Homo_sapiens.GRCh37.75.gtf.gz"
    ## generate the SQLite database file (assumes the GTF file was
    ## downloaded to the current working directory)
    DB <- ensDbFromGtf(gtf=gtffile)

    ## load the DB file directly
    EDB <- EnsDb(DB)

    ## Alternatively, we could fetch a GTF file directly from AnnotationHub
    ## and build the database from that:
    library(AnnotationHub)
    ah <- AnnotationHub()
    ## Query for all GTF files from Ensembl for Ensembl version 81
    query(ah, c("Ensembl", "release-81", "GTF"))
    ## We could get the one from e.g. Bos taurus:
    DB <- ensDbFromAH(ah["AH47941"])
    edb <- EnsDb(DB)
    edb
}

## Generate a sqlite database for genes encoded on chromosome Y
chrY <- system.file("chrY", package="ensembldb")
DBFile <- makeEnsemblSQLiteFromTables(path=chrY, dbname=tempfile())
#> Processing 'chromosome' table ... 
#> OK
#> Processing 'gene' table ... 
#> OK
#> Processing 'trancript' table ... 
#> OK
#> Processing 'exon' table ... 
#> OK
#> Processing 'tx2exon' table ... 
#> OK
#> Creating indices ... 
#> OK
#> Checking validity of the database ... 
#> OK
## load this database:
edb <- EnsDb(DBFile)

edb
#> EnsDb for Ensembl:
#> |Backend: SQLite
#> |Db type: EnsDb
#> |Type of Gene ID: Ensembl Gene ID
#> |Supporting package: ensembldb
#> |Db created by: ensembldb package from Bioconductor
#> |script_version: 0.1.2
#> |Creation time: Wed Mar 18 09:30:54 2015
#> |ensembl_version: 75
#> |ensembl_host: manny.i-med.ac.at
#> |Organism: homo_sapiens
#> |genome_build: GRCh37
#> |DBSCHEMAVERSION: 1.0
#> | No. of genes: 495.
#> | No. of transcripts: 731.

## Generate a sqlite database from a GRanges object specifying
## genes encoded on chromosome Y
load(system.file("YGRanges.RData", package="ensembldb"))

Y
#> GRanges object with 7155 ranges and 16 metadata columns:
#>          seqnames            ranges strand |               source       type
#>             <Rle>         <IRanges>  <Rle> |             <factor>   <factor>
#>      [1]        Y   2652790-2652894      + |       snRNA          gene      
#>      [2]        Y   2652790-2652894      + |       snRNA          transcript
#>      [3]        Y   2652790-2652894      + |       snRNA          exon      
#>      [4]        Y   2654896-2655740      - |       protein_coding gene      
#>      [5]        Y   2654896-2655740      - |       protein_coding transcript
#>      ...      ...               ...    ... .                  ...        ...
#>   [7151]        Y 28772667-28773306      - | processed_pseudogene transcript
#>   [7152]        Y 28772667-28773306      - | processed_pseudogene exon      
#>   [7153]        Y 59001391-59001635      + | pseudogene           gene      
#>   [7154]        Y 59001391-59001635      + | processed_pseudogene transcript
#>   [7155]        Y 59001391-59001635      + | processed_pseudogene exon      
#>              score     phase         gene_id   gene_name    gene_source
#>          <numeric> <integer>     <character> <character>    <character>
#>      [1]        NA      <NA> ENSG00000251841  RNU6-1334P        ensembl
#>      [2]        NA      <NA> ENSG00000251841  RNU6-1334P        ensembl
#>      [3]        NA      <NA> ENSG00000251841  RNU6-1334P        ensembl
#>      [4]        NA      <NA> ENSG00000184895         SRY ensembl_havana
#>      [5]        NA      <NA> ENSG00000184895         SRY ensembl_havana
#>      ...       ...       ...             ...         ...            ...
#>   [7151]        NA      <NA> ENSG00000231514     FAM58CP         havana
#>   [7152]        NA      <NA> ENSG00000231514     FAM58CP         havana
#>   [7153]        NA      <NA> ENSG00000235857     CTBP2P1         havana
#>   [7154]        NA      <NA> ENSG00000235857     CTBP2P1         havana
#>   [7155]        NA      <NA> ENSG00000235857     CTBP2P1         havana
#>            gene_biotype   transcript_id transcript_name transcript_source
#>             <character>     <character>     <character>       <character>
#>      [1]          snRNA            <NA>            <NA>              <NA>
#>      [2]          snRNA ENST00000516032  RNU6-1334P-201           ensembl
#>      [3]          snRNA ENST00000516032  RNU6-1334P-201           ensembl
#>      [4] protein_coding            <NA>            <NA>              <NA>
#>      [5] protein_coding ENST00000383070         SRY-001    ensembl_havana
#>      ...            ...             ...             ...               ...
#>   [7151]     pseudogene ENST00000435741     FAM58CP-001            havana
#>   [7152]     pseudogene ENST00000435741     FAM58CP-001            havana
#>   [7153]     pseudogene            <NA>            <NA>              <NA>
#>   [7154]     pseudogene ENST00000431853     CTBP2P1-001            havana
#>   [7155]     pseudogene ENST00000431853     CTBP2P1-001            havana
#>          exon_number         exon_id         tag     ccds_id  protein_id
#>            <numeric>     <character> <character> <character> <character>
#>      [1]          NA            <NA>        <NA>        <NA>        <NA>
#>      [2]          NA            <NA>        <NA>        <NA>        <NA>
#>      [3]           1 ENSE00002088309        <NA>        <NA>        <NA>
#>      [4]          NA            <NA>        <NA>        <NA>        <NA>
#>      [5]          NA            <NA>        CCDS   CCDS14772        <NA>
#>      ...         ...             ...         ...         ...         ...
#>   [7151]          NA            <NA>        <NA>        <NA>        <NA>
#>   [7152]           1 ENSE00001616687        <NA>        <NA>        <NA>
#>   [7153]          NA            <NA>        <NA>        <NA>        <NA>
#>   [7154]          NA            <NA>        <NA>        <NA>        <NA>
#>   [7155]           1 ENSE00001794473        <NA>        <NA>        <NA>
#>   -------
#>   seqinfo: 1 sequence from GRCh37 genome

DB <- ensDbFromGRanges(Y, path=tempdir(), version=75,
                       organism="Homo_sapiens")
#> Processing genes ... 
#> Warning:  I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
#>  Attribute availability:
#>   o gene_id ... OK
#>   o gene_name ... OK
#>   o entrezid ... Nope
#>   o gene_biotype ... OK
#> OK
#> Processing transcripts ... 
#>  Attribute availability:
#>   o transcript_id ... OK
#>   o gene_id ... OK
#>   o source ... OK
#>   o transcript_name ... OK
#> OK
#> Processing exons ... 
#> OK
#> Processing chromosomes ... 
#> OK
#> Processing metadata ... 
#> OK
#> Generating index ... 
#> OK
#>   -------------
#> Verifying validity of the information in the database:
#> Checking transcripts ... 
#> OK
#> Checking exons ... 
#> OK
edb <- EnsDb(DB)