Map positions along the genome to positions within the protein sequence if
a protein is encoded at the location. The provided coordinates have to be
completely within the genomic position of an exon of a protein coding
transcript (see genomeToTranscript()
for details). Also, the provided
positions have to be within the genomic region encoding the CDS of a
transcript (excluding its stop codon; soo transcriptToProtein()
for
details).
For genomic positions for which the mapping failed an IRanges
with
negative coordinates (i.e. a start position of -1) is returned.
genomeToProtein(x, db)
GRanges
with the genomic coordinates that should be mapped to
within-protein coordinates.
EnsDb
object.
An IRangesList
with each element representing the mapping of one of the
GRanges
in x
(i.e. the length of the IRangesList
is length(x)
).
Each element in IRanges
provides the coordinates within the protein
sequence, names being the (Ensembl) IDs of the protein. The ID of the
transcript encoding the protein, the ID of the exon within which the
genomic coordinates are located and its rank in the transcript are provided
in metadata columns "tx_id"
, "exon_id"
and "exon_rank"
. Metadata
columns "cds_ok"
indicates whether the length of the CDS matches the
length of the encoded protein. Coordinates for which cds_ok = FALSE
should
be taken with caution, as they might not be correct. Metadata columns
"seq_start"
, "seq_end"
, "seq_name"
and "seq_strand"
provide the
provided genomic coordinates.
For genomic coordinates that can not be mapped to within-protein sequences
an IRanges
with a start coordinate of -1 is returned.
genomeToProtein
combines calls to genomeToTranscript()
and
transcriptToProtein()
.
Other coordinate mapping functions:
cdsToTranscript()
,
genomeToTranscript()
,
proteinToGenome()
,
proteinToTranscript()
,
transcriptToCds()
,
transcriptToGenome()
,
transcriptToProtein()
library(EnsDb.Hsapiens.v86)
## Restrict all further queries to chromosome x to speed up the examples
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")
## In the example below we define 4 genomic regions:
## 630898: corresponds to the first nt of the CDS of ENST00000381578
## 644636: last nt of the CDS of ENST00000381578
## 644633: last nt before the stop codon in ENST00000381578
## 634829: position within an intron.
gnm <- GRanges("X", IRanges(start = c(630898, 644636, 644633, 634829),
width = c(5, 1, 1, 3)))
res <- genomeToProtein(gnm, edbx)
#> Warning: 1 genomic region(s) could not be mapped to a transcript; hint: see ?seqlevelsStyle if you used UCSC chromosome names
#> Warning: Transcript(s) '' could not be found
#> Warning: Provided coordinates for 'ENST00000381578', 'ENST00000554971' are not within the coding region
## The result is an IRangesList with the same length as gnm
length(res)
#> [1] 4
length(gnm)
#> [1] 4
## The first element represents the mapping for the first GRanges:
## the coordinate is mapped to the first amino acid of the protein(s).
## The genomic coordinates can be mapped to several transcripts (and hence
## proteins).
res[[1]]
#> IRanges object with 4 ranges and 8 metadata columns:
#> start end width | tx_id cds_ok
#> <integer> <integer> <integer> | <character> <logical>
#> ENSP00000335505 1 2 2 | ENST00000334060 TRUE
#> ENSP00000370987 1 2 2 | ENST00000381575 TRUE
#> ENSP00000370990 1 2 2 | ENST00000381578 TRUE
#> ENSP00000452016 1 2 2 | ENST00000554971 TRUE
#> exon_id exon_rank seq_start seq_end seq_name
#> <character> <integer> <integer> <integer> <character>
#> ENSP00000335505 ENSE00001489177 2 630898 630902 X
#> ENSP00000370987 ENSE00001489169 1 630898 630902 X
#> ENSP00000370990 ENSE00001489177 2 630898 630902 X
#> ENSP00000452016 ENSE00001489169 1 630898 630902 X
#> seq_strand
#> <character>
#> ENSP00000335505 *
#> ENSP00000370987 *
#> ENSP00000370990 *
#> ENSP00000452016 *
## The stop codon is not translated, thus the mapping for the second
## GRanges fails
res[[2]]
#> IRanges object with 2 ranges and 8 metadata columns:
#> start end width | tx_id cds_ok exon_id
#> <integer> <integer> <integer> | <character> <logical> <character>
#> -1 -1 1 | ENST00000381578 <NA> ENSE00001489174
#> -1 -1 1 | ENST00000554971 <NA> ENSE00002438186
#> exon_rank seq_start seq_end seq_name seq_strand
#> <integer> <integer> <integer> <character> <character>
#> 6 644636 644636 X *
#> 5 644636 644636 X *
## The 3rd GRanges is mapped to the last amino acid.
res[[3]]
#> IRanges object with 2 ranges and 8 metadata columns:
#> start end width | tx_id cds_ok
#> <integer> <integer> <integer> | <character> <logical>
#> ENSP00000370990 292 292 1 | ENST00000381578 TRUE
#> ENSP00000452016 292 292 1 | ENST00000554971 TRUE
#> exon_id exon_rank seq_start seq_end seq_name
#> <character> <integer> <integer> <integer> <character>
#> ENSP00000370990 ENSE00001489174 6 644633 644633 X
#> ENSP00000452016 ENSE00002438186 5 644633 644633 X
#> seq_strand
#> <character>
#> ENSP00000370990 *
#> ENSP00000452016 *
## Mapping of intronic positions fail
res[[4]]
#> IRanges object with 1 range and 8 metadata columns:
#> start end width | tx_id cds_ok exon_id exon_rank
#> <integer> <integer> <integer> | <character> <logical> <character> <integer>
#> -1 -1 1 | <NA> <NA> <NA>
#> seq_start seq_end seq_name seq_strand
#> <integer> <integer> <character> <character>
#> 634829 634831 X *