R/transcriptToX.R
transcriptToProtein.Rd
transcriptToProtein
maps within-transcript coordinates to the corresponding
coordinates within the encoded protein sequence. The provided coordinates
have to be within the coding region of the transcript (excluding the stop
codon) but are supposed to be relative to the first nucleotide of the
transcript (which includes the 5' UTR). Positions relative to the CDS of a
transcript (e.g. /PKP2 c.1643delg/) have to be first converted to
transcript-relative coordinates. This can be done with the
cdsToTranscript()
function.
transcriptToProtein(x, db, id = "name")
IRanges
with the coordinates within the transcript. Coordinates
are counted from the start of the transcript (including the 5' UTR). The
Ensembl IDs of the corresponding transcripts have to be provided either
as names
of the IRanges
, or in one of its metadata columns.
EnsDb
object.
character(1)
specifying where the transcript identifier can be
found. Has to be either "name"
or one of colnames(mcols(prng))
.
IRanges
with the same length (and order) than the input IRanges
x
. Each element in IRanges
provides the coordinates within the
protein sequence, names being the (Ensembl) IDs of the protein. The
original transcript ID and the transcript-relative coordinates are provided
as metadata columns. Metadata columns "cds_ok"
indicates whether the
length of the transcript's CDS matches the length of the encoded protein.
IRanges
with a start coordinate of -1
is returned for transcript
coordinates that can not be mapped to protein-relative coordinates
(either no transcript was found for the provided ID, the transcript
does not encode a protein or the provided coordinates are not within
the coding region of the transcript).
Transcript-relative coordinates are mapped to the amino acid residues they encode. As an example, positions within the transcript that correspond to nucleotides 1 to 3 in the CDS are mapped to the first position in the protein sequence (see examples for more details).
cdsToTranscript()
and transcriptToCds()
for conversion between
CDS- and transcript-relative coordinates.
Other coordinate mapping functions:
cdsToTranscript()
,
genomeToProtein()
,
genomeToTranscript()
,
proteinToGenome()
,
proteinToTranscript()
,
transcriptToCds()
,
transcriptToGenome()
library(EnsDb.Hsapiens.v86)
## Restrict all further queries to chromosome x to speed up the examples
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")
## Define an IRanges with the positions of the first 2 nucleotides of the
## coding region for the transcript ENST00000381578
txpos <- IRanges(start = 692, width = 2, names = "ENST00000381578")
## Map these to the corresponding residues in the protein sequence
## The protein-relative coordinates are returned as an `IRanges` object,
## with the original, transcript-relative coordinates provided in metadata
## columns tx_start and tx_end
transcriptToProtein(txpos, edbx)
#> IRanges object with 1 range and 4 metadata columns:
#> start end width | tx_id tx_start
#> <integer> <integer> <integer> | <character> <integer>
#> ENSP00000370990 1 1 1 | ENST00000381578 692
#> tx_end cds_ok
#> <integer> <logical>
#> ENSP00000370990 693 TRUE
## We can also map multiple ranges. Note that for any of the 3 nucleotides
## encoding the same amino acid the position of this residue in the
## protein sequence is returned. To illustrate this we map below each of the
## first 4 nucleotides of the CDS to the corresponding position within the
## protein.
txpos <- IRanges(start = c(692, 693, 694, 695),
width = rep(1, 4), names = rep("ENST00000381578", 4))
transcriptToProtein(txpos, edbx)
#> IRanges object with 4 ranges and 4 metadata columns:
#> start end width | tx_id tx_start
#> <integer> <integer> <integer> | <character> <integer>
#> ENSP00000370990 1 1 1 | ENST00000381578 692
#> ENSP00000370990 1 1 1 | ENST00000381578 693
#> ENSP00000370990 1 1 1 | ENST00000381578 694
#> ENSP00000370990 2 2 1 | ENST00000381578 695
#> tx_end cds_ok
#> <integer> <logical>
#> ENSP00000370990 692 TRUE
#> ENSP00000370990 693 TRUE
#> ENSP00000370990 694 TRUE
#> ENSP00000370990 695 TRUE
## If the mapping fails, an IRanges with negative start position is returned.
## Mapping can fail (as below) because the ID is not known.
transcriptToProtein(IRanges(1, 1, names = "unknown"), edbx)
#> Warning: Transcript(s) 'unknown' could not be found
#> IRanges object with 1 range and 4 metadata columns:
#> start end width | tx_id tx_start tx_end cds_ok
#> <integer> <integer> <integer> | <character> <integer> <integer> <logical>
#> [1] -1 -1 1 | unknown 1 1 <NA>
## Or because the provided coordinates are not within the CDS
transcriptToProtein(IRanges(1, 1, names = "ENST00000381578"), edbx)
#> Warning: Provided coordinates for 'ENST00000381578' are not within the coding region
#> IRanges object with 1 range and 4 metadata columns:
#> start end width | tx_id tx_start tx_end
#> <integer> <integer> <integer> | <character> <integer> <integer>
#> [1] -1 -1 1 | ENST00000381578 1 1
#> cds_ok
#> <logical>
#> [1] <NA>