makeEnsemblDbPackage.Rd
The functions described on this page build EnsDb annotation
objects/databases from Ensembl annotations. The most complete set of
annotations, which also includes the NCBI Entrezgene identifiers for
each gene, can be retrieved with the functions that use the Ensembl
Perl API (i.e. fetchTablesFromEnsembl and makeEnsemblSQLiteFromTables).
Alternatively, the functions ensDbFromAH, ensDbFromGRanges, ensDbFromGff
and ensDbFromGtf can be used to build EnsDb objects from GFF or GTF
files from Ensembl, which can either be downloaded manually from the
Ensembl ftp server or fetched directly from within R using AnnotationHub.
The generated SQLite database can be packaged into an R package using
the makeEnsembldbPackage function.
ensDbFromAH(ah, outfile, path, organism, genomeVersion, version)
ensDbFromGRanges(x, outfile, path, organism, genomeVersion,
                 version, ...)
ensDbFromGff(gff, outfile, path, organism, genomeVersion,
             version, ...)
ensDbFromGtf(gtf, outfile, path, organism, genomeVersion,
             version, ...)
fetchTablesFromEnsembl(version, ensemblapi, user="anonymous",
                       host="ensembldb.ensembl.org", pass="",
                       port=5306, species="human")
makeEnsemblSQLiteFromTables(path=".", dbname)
makeEnsembldbPackage(ensdb, version, maintainer, author,
                     destDir=".", license="Artistic-2.0")
(in alphabetical order)
ah
For ensDbFromAH: an AnnotationHub object representing a single
resource (i.e. a GTF file from Ensembl) from AnnotationHub.
author
The author of the package.
dbname
The name for the database (optional). By default, a name based on the
species and Ensembl version is generated automatically (and returned
by the function).
destDir
The directory in which the package should be saved.
ensdb
The file name of the SQLite database generated by
makeEnsemblSQLiteFromTables.
ensemblapi
The path to the Ensembl Perl API installed locally on the system. The
version of the Ensembl Perl API has to match the Ensembl version given
by version.
genomeVersion
For ensDbFromAH, ensDbFromGtf and ensDbFromGff: the version of the
genome (e.g. "GRCh37"). If not provided, the function will try to guess
it from the file name (assuming the file name convention of Ensembl GTF
files).
gff
The GFF file to import.
gtf
The GTF file name.
host
The hostname to access the Ensembl database.
license
The license of the package.
maintainer
The maintainer of the package.
organism
For ensDbFromAH, ensDbFromGff and ensDbFromGtf: the organism name
(e.g. "Homo_sapiens"). If not provided, the function will try to guess
it from the file name (assuming the file name convention of Ensembl GTF
files).
outfile
The desired file name of the SQLite file. If not provided, the name of
the GTF file will be used.
pass
The password for the Ensembl database.
path
The directory in which the tables retrieved by fetchTablesFromEnsembl
or the SQLite database file generated by ensDbFromGtf are stored.
port
The port to be used to connect to the Ensembl database.
species
The species for which the annotations should be retrieved.
user
The username for the Ensembl database.
version
For fetchTablesFromEnsembl, ensDbFromGRanges and ensDbFromGtf: the
Ensembl version for which the annotation should be retrieved
(e.g. 75). The ensDbFromGtf function will try to guess the Ensembl
version from the GTF file name if not provided.
For makeEnsembldbPackage: the version for the package.
x
For ensDbFromGRanges: the GRanges object.
...
Currently not used.
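The file-name guessing described for genomeVersion, organism and version can be illustrated with a small base R sketch. The helper parse_ensembl_filename and its regular expression are hypothetical; they only assume file names of the form <organism>.<genome version>.<Ensembl version>.gtf.gz (or .gff3.gz) and are not the code used inside ensembldb.

```r
## Hypothetical helper (not part of ensembldb): split an Ensembl-style
## GTF/GFF file name into organism, genome version and Ensembl version.
parse_ensembl_filename <- function(file) {
    ## <organism>.<genome version>.<Ensembl version>.gtf|gff3[.gz]
    pattern <- "^([^.]+)\\.(.+)\\.([0-9]+)\\.(gtf|gff3)(\\.gz)?$"
    fname <- basename(file)
    m <- regmatches(fname, regexec(pattern, fname))[[1]]
    if (length(m) == 0)
        stop("File name does not follow the assumed Ensembl convention: ",
             fname)
    list(organism = m[2],
         genomeVersion = m[3],
         version = as.integer(m[4]))
}

## File names taken from the examples on this page.
human <- parse_ensembl_filename("Homo_sapiens.GRCh37.75.gtf.gz")
rat <- parse_ensembl_filename("Rattus_norvegicus.Rnor_6.0.83.gff3.gz")
str(human)
str(rat)
```

If a file name does not follow this convention, passing organism, genomeVersion and version explicitly avoids relying on any guessing.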
ensDbFromAH
Create an EnsDb (SQLite) database from a GTF file provided by
AnnotationHub. The function returns the file name of the generated
database file. For usage see the examples below.
ensDbFromGff
Create an EnsDb (SQLite) database from a GFF file from Ensembl. The
function returns the file name of the generated database file. For
usage see the examples below.
ensDbFromGtf
Create an EnsDb (SQLite) database from a GTF file from Ensembl. The
function returns the file name of the generated database file. For
usage see the examples below.
ensDbFromGRanges
Create an EnsDb (SQLite) database from a GRanges object (e.g. from
AnnotationHub). The function returns the file name of the generated
database file. For usage see the examples below.
fetchTablesFromEnsembl
Uses the Ensembl Perl API to fetch all required data from an Ensembl
database server and stores them locally as text files (which can be
used as input for the makeEnsemblSQLiteFromTables function).
makeEnsemblSQLiteFromTables
Creates the SQLite EnsDb database from the tables generated by
fetchTablesFromEnsembl.
makeEnsembldbPackage
Creates an R package containing the EnsDb database from an EnsDb
SQLite database created by any of the above functions ensDbFromAH,
ensDbFromGff, ensDbFromGtf or makeEnsemblSQLiteFromTables.
The fetchTablesFromEnsembl function internally calls the Perl script
get_gene_transcript_exon_tables.pl to retrieve all required
information from the Ensembl database using the Ensembl Perl API.
Alternatively, an EnsDb database file can be generated with
ensDbFromGtf or ensDbFromGff from a GTF or GFF file downloaded from
the Ensembl ftp server, or with ensDbFromAH, which builds the database
directly from the corresponding resource in AnnotationHub. The
returned database file name can then be used as input to
makeEnsembldbPackage, or the file can be loaded directly with the
EnsDb constructor.
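As a minimal sketch of the last point, the code below builds a database file and loads it with the EnsDb constructor. It assumes that the ensembldb package is installed and uses the chrY tables bundled with it (the same data as in the examples on this page); the variable names db_file and gns are chosen for illustration only.

```r
## Runs only if ensembldb is available; otherwise nothing is done.
if (requireNamespace("ensembldb", quietly = TRUE)) {
    library(ensembldb)
    ## Tables for chromosome Y shipped with the package.
    chrY <- system.file("chrY", package = "ensembldb")
    ## Build the SQLite file; the function returns its file name.
    db_file <- makeEnsemblSQLiteFromTables(path = chrY, dbname = tempfile())
    ## Load the file directly with the EnsDb constructor.
    edb <- EnsDb(db_file)
    ## Retrieve the gene annotations as a GRanges object.
    gns <- genes(edb)
    print(length(gns))
}
```

The same db_file could instead be passed as the ensdb argument of makeEnsembldbPackage to wrap the database into an installable package.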
A local installation of the Ensembl Perl API is required for
fetchTablesFromEnsembl. See
http://www.ensembl.org/info/docs/api/api_installation.html for
installation instructions.
A database generated from GTF/GFF files lacks some features because they are not available in the GTF files from Ensembl. Specifically, the NCBI Entrezgene IDs are missing.
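The missing identifiers can be made explicit with a small base R sketch that compares the gene-level attributes a database expects with the columns a GTF import provides. The helper missing_gene_attributes is hypothetical and not part of ensembldb; the required attributes and the column names are taken from the "Attribute availability" output and the chrY GRanges shown in the examples on this page.

```r
## Hypothetical helper (not part of ensembldb): report which required
## gene attributes are absent from the available column names.
missing_gene_attributes <- function(available,
                                    required = c("gene_id", "gene_name",
                                                 "entrezid", "gene_biotype")) {
    setdiff(required, available)
}

## Metadata columns of the chrY GRanges from the examples (GTF-derived).
gtf_columns <- c("source", "type", "score", "phase", "gene_id",
                 "gene_name", "gene_source", "gene_biotype",
                 "transcript_id", "transcript_name", "transcript_source",
                 "exon_number", "exon_id", "tag", "ccds_id", "protein_id")

missing <- missing_gene_attributes(gtf_columns)
print(missing)
```

For these columns only entrezid is reported, which matches the warning issued by ensDbFromGRanges in the examples.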
makeEnsemblSQLiteFromTables, ensDbFromAH, ensDbFromGRanges and
ensDbFromGtf: the name of the SQLite file.
if (FALSE) {
## Get all human gene/transcript/exon annotations from Ensembl (75).
## The resulting tables are stored by default in the current working
## directory. If the correct Ensembl API (version 75) is defined in the
## PERL5LIB environment variable, the ensemblapi parameter can be omitted.
fetchTablesFromEnsembl(75,
                       ensemblapi="/home/bioinfo/ensembl/75/API/ensembl/modules",
                       species="human")
## These tables can then be processed to generate a SQLite database
## containing the annotations
DBFile <- makeEnsemblSQLiteFromTables()
## and finally we can generate the package
makeEnsembldbPackage(ensdb=DBFile, version="0.0.1",
                     maintainer="Johannes Rainer <johannes.rainer@eurac.edu>",
                     author="J Rainer")
## Build an annotation database from a GFF file from Ensembl.
## ftp://ftp.ensembl.org/pub/release-83/gff3/rattus_norvegicus
gff <- "Rattus_norvegicus.Rnor_6.0.83.gff3.gz"
DB <- ensDbFromGff(gff=gff)
edb <- EnsDb(DB)
edb
## Build an annotation file from a GTF file.
## the GTF file can be downloaded from
## ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/
gtffile <- "Homo_sapiens.GRCh37.75.gtf.gz"
## generate the SQLite database file (assuming the GTF file was
## downloaded to the current working directory)
DB <- ensDbFromGtf(gtf=gtffile)
## load the DB file directly
EDB <- EnsDb(DB)
## Alternatively, we could fetch a GTF file directly from AnnotationHub
## and build the database from that:
library(AnnotationHub)
ah <- AnnotationHub()
## Query for all GTF files from Ensembl for Ensembl version 81
query(ah, c("Ensembl", "release-81", "GTF"))
## We could get the one from e.g. Bos taurus:
DB <- ensDbFromAH(ah["AH47941"])
edb <- EnsDb(DB)
edb
}
## Generate a sqlite database for genes encoded on chromosome Y
chrY <- system.file("chrY", package="ensembldb")
DBFile <- makeEnsemblSQLiteFromTables(path=chrY, dbname=tempfile())
#> Processing 'chromosome' table ...
#> OK
#> Processing 'gene' table ...
#> OK
#> Processing 'trancript' table ...
#> OK
#> Processing 'exon' table ...
#> OK
#> Processing 'tx2exon' table ...
#> OK
#> Creating indices ...
#> OK
#> Checking validity of the database ...
#> OK
## load this database:
edb <- EnsDb(DBFile)
edb
#> EnsDb for Ensembl:
#> |Backend: SQLite
#> |Db type: EnsDb
#> |Type of Gene ID: Ensembl Gene ID
#> |Supporting package: ensembldb
#> |Db created by: ensembldb package from Bioconductor
#> |script_version: 0.1.2
#> |Creation time: Wed Mar 18 09:30:54 2015
#> |ensembl_version: 75
#> |ensembl_host: manny.i-med.ac.at
#> |Organism: homo_sapiens
#> |genome_build: GRCh37
#> |DBSCHEMAVERSION: 1.0
#> | No. of genes: 495.
#> | No. of transcripts: 731.
## Generate a sqlite database from a GRanges object specifying
## genes encoded on chromosome Y
load(system.file("YGRanges.RData", package="ensembldb"))
Y
#> GRanges object with 7155 ranges and 16 metadata columns:
#>          seqnames            ranges strand |               source       type
#>             <Rle>         <IRanges>  <Rle> |             <factor>   <factor>
#>      [1]        Y   2652790-2652894      + |                snRNA       gene
#>      [2]        Y   2652790-2652894      + |                snRNA transcript
#>      [3]        Y   2652790-2652894      + |                snRNA       exon
#>      [4]        Y   2654896-2655740      - |       protein_coding       gene
#>      [5]        Y   2654896-2655740      - |       protein_coding transcript
#>      ...      ...               ...    ... .                  ...        ...
#>   [7151]        Y 28772667-28773306      - | processed_pseudogene transcript
#>   [7152]        Y 28772667-28773306      - | processed_pseudogene       exon
#>   [7153]        Y 59001391-59001635      + |           pseudogene       gene
#>   [7154]        Y 59001391-59001635      + | processed_pseudogene transcript
#>   [7155]        Y 59001391-59001635      + | processed_pseudogene       exon
#>              score     phase         gene_id   gene_name    gene_source
#>          <numeric> <integer>     <character> <character>    <character>
#>      [1]        NA      <NA> ENSG00000251841  RNU6-1334P        ensembl
#>      [2]        NA      <NA> ENSG00000251841  RNU6-1334P        ensembl
#>      [3]        NA      <NA> ENSG00000251841  RNU6-1334P        ensembl
#>      [4]        NA      <NA> ENSG00000184895         SRY ensembl_havana
#>      [5]        NA      <NA> ENSG00000184895         SRY ensembl_havana
#>      ...       ...       ...             ...         ...            ...
#>   [7151]        NA      <NA> ENSG00000231514     FAM58CP         havana
#>   [7152]        NA      <NA> ENSG00000231514     FAM58CP         havana
#>   [7153]        NA      <NA> ENSG00000235857     CTBP2P1         havana
#>   [7154]        NA      <NA> ENSG00000235857     CTBP2P1         havana
#>   [7155]        NA      <NA> ENSG00000235857     CTBP2P1         havana
#>            gene_biotype   transcript_id transcript_name transcript_source
#>             <character>     <character>     <character>       <character>
#>      [1]          snRNA            <NA>            <NA>              <NA>
#>      [2]          snRNA ENST00000516032  RNU6-1334P-201           ensembl
#>      [3]          snRNA ENST00000516032  RNU6-1334P-201           ensembl
#>      [4] protein_coding            <NA>            <NA>              <NA>
#>      [5] protein_coding ENST00000383070         SRY-001    ensembl_havana
#>      ...            ...             ...             ...               ...
#>   [7151]     pseudogene ENST00000435741     FAM58CP-001            havana
#>   [7152]     pseudogene ENST00000435741     FAM58CP-001            havana
#>   [7153]     pseudogene            <NA>            <NA>              <NA>
#>   [7154]     pseudogene ENST00000431853     CTBP2P1-001            havana
#>   [7155]     pseudogene ENST00000431853     CTBP2P1-001            havana
#>          exon_number         exon_id         tag     ccds_id  protein_id
#>            <numeric>     <character> <character> <character> <character>
#>      [1]          NA            <NA>        <NA>        <NA>        <NA>
#>      [2]          NA            <NA>        <NA>        <NA>        <NA>
#>      [3]           1 ENSE00002088309        <NA>        <NA>        <NA>
#>      [4]          NA            <NA>        <NA>        <NA>        <NA>
#>      [5]          NA            <NA>        CCDS   CCDS14772        <NA>
#>      ...         ...             ...         ...         ...         ...
#>   [7151]          NA            <NA>        <NA>        <NA>        <NA>
#>   [7152]           1 ENSE00001616687        <NA>        <NA>        <NA>
#>   [7153]          NA            <NA>        <NA>        <NA>        <NA>
#>   [7154]          NA            <NA>        <NA>        <NA>        <NA>
#>   [7155]           1 ENSE00001794473        <NA>        <NA>        <NA>
#>   -------
#>   seqinfo: 1 sequence from GRCh37 genome
DB <- ensDbFromGRanges(Y, path=tempdir(), version=75,
                       organism="Homo_sapiens")
#> Processing genes ...
#> Warning: I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
#> Attribute availability:
#> o gene_id ... OK
#> o gene_name ... OK
#> o entrezid ... Nope
#> o gene_biotype ... OK
#> OK
#> Processing transcripts ...
#> Attribute availability:
#> o transcript_id ... OK
#> o gene_id ... OK
#> o source ... OK
#> o transcript_name ... OK
#> OK
#> Processing exons ...
#> OK
#> Processing chromosomes ...
#> OK
#> Processing metadata ...
#> OK
#> Generating index ...
#> OK
#> -------------
#> Verifying validity of the information in the database:
#> Checking transcripts ...
#> OK
#> Checking exons ...
#> OK
edb <- EnsDb(DB)