R/proteinToX.R
proteinToTranscript.Rd
proteinToTranscript
maps protein-relative coordinates to positions within
the encoding transcript. Note that the returned positions are relative to
the complete transcript length, which includes the 5' UTR.
Like the proteinToGenome()
function, proteinToTranscript
checks
for each protein whether the length of its sequence matches the length of
the encoding CDS and issues a warning if it does not. An incomplete
3' or 5' CDS of the encoding transcript is the most common reason for a
mismatch between protein and transcript sequences.
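The relationship between protein-relative and transcript-relative positions can be illustrated with plain arithmetic. The following is a minimal sketch, not the implementation used by the package: `protein_to_tx` is a hypothetical helper, and it assumes a complete CDS that begins immediately after a 5' UTR of known length. The UTR length of 13 nt used here is inferred from the SYP example output on this page rather than looked up in the database.

```r
## Hypothetical helper (not part of ensembldb): map a protein-relative range
## to transcript-relative nucleotide positions, assuming a complete CDS that
## starts right after a 5' UTR of `utr5_len` nucleotides.
protein_to_tx <- function(protein_start, protein_end, utr5_len) {
  c(
    ## first nucleotide of the codon encoding `protein_start`
    start = utr5_len + (protein_start - 1L) * 3L + 1L,
    ## last nucleotide of the codon encoding `protein_end`
    end = utr5_len + protein_end * 3L
  )
}

## Residues 4 to 17 with an assumed 13 nt 5' UTR
tx_pos <- protein_to_tx(4L, 17L, utr5_len = 13L)
tx_pos
## start 23, end 64
```

Each amino acid corresponds to three nucleotides, so a 14-residue range maps to a 42 nt region, shifted by the length of the 5' UTR.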
proteinToTranscript(x, db, id = "name", idType = "protein_id")
IRanges
with the coordinates within the protein(s). The
object must also provide a means to identify the protein (see
details).
EnsDb
object to be used to retrieve genomic coordinates of
encoding transcripts.
character(1)
specifying where the protein identifier can be
found. Has to be either "name"
or one of colnames(mcols(x))
.
character(1)
defining what type of IDs are provided. Has to
be one of "protein_id"
(default), "uniprot_id"
or "tx_id"
.
IRangesList
, each element being the mapping results for one of the input
ranges in x
. Each element is an IRanges
object with the positions within
the encoding transcript (relative to the start of the transcript, which
includes the 5' UTR). The transcript ID is reported as the name of each
IRanges
. The IRanges
can be of length > 1 if the provided
protein identifier is annotated to more than one Ensembl protein ID (which
can be the case if Uniprot IDs are provided). If the coordinates cannot be
mapped (because the protein identifier is unknown to the database), an
IRanges
with negative coordinates is returned.
The following metadata columns are available in each IRanges
in the result:
"protein_id"
: the ID of the Ensembl protein for which the within-protein
coordinates were mapped to the transcript.
"tx_id"
: the Ensembl transcript ID of the encoding transcript.
"cds_ok"
: contains TRUE
if the length of the CDS matches the length
of the amino acid sequence and FALSE
otherwise.
"protein_start"
: the within-protein sequence start coordinate of the
mapping.
"protein_end"
: the within-protein sequence end coordinate of the mapping.
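Because unmapped input ranges are reported as an IRanges with negative coordinates, it can be useful to separate them from successful mappings before further processing. The sketch below uses a hand-built IRangesList that only mimics the shape of a result (the element names `known` and `unknown` are made up for illustration); the same test should apply to an actual result of proteinToTranscript.

```r
library(IRanges)

## Mock result mimicking the shape of the output; built by hand for
## illustration only, the element names are hypothetical.
res <- IRangesList(
  known = IRanges(start = 23, end = 64, names = "ENST00000479808"),
  unknown = IRanges(start = -1, end = -1)
)

## One logical value per input range: TRUE if all start coordinates are
## positive, i.e. the range could be mapped
is_mapped <- all(start(res) > 0)

## Keep only the successfully mapped elements
res_mapped <- res[is_mapped]
names(res_mapped)
```

Checking the "tx_id" metadata column for NA would be an alternative, assuming the metadata columns are present in every element.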
Protein identifiers (Ensembl protein IDs and Uniprot IDs are supported) can
be passed to the function as names
of the x
IRanges
object, or
alternatively in any one of the metadata columns (mcols
) of x
.
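As an illustration of the second option, the identifier can be stored in a metadata column and that column named through the id argument. This is a sketch under the assumption that the EnsDb.Hsapiens.v86 package is installed; the column name `my_id` is arbitrary and chosen only for this example.

```r
library(IRanges)
library(EnsDb.Hsapiens.v86)

## Restrict queries to chromosome X, as in the examples on this page
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")

## Unnamed range; the protein ID is stored in a metadata column instead.
## The column name "my_id" is arbitrary.
prng <- IRanges(start = 4, end = 17)
mcols(prng) <- DataFrame(my_id = "ENSP00000418169")

## Tell the function which metadata column holds the identifier
res <- proteinToTranscript(prng, edbx, id = "my_id")
res
```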
While mapping of Ensembl protein IDs to Ensembl transcript IDs is 1:1, a
single Uniprot identifier can be annotated to several Ensembl protein IDs.
proteinToTranscript
calculates transcript-relative
coordinates for each annotated Ensembl protein in such cases.
Mapping using Uniprot identifiers also requires additional internal checks that
can significantly slow down the function. It is thus
strongly suggested to first identify the Ensembl protein identifiers for the
list of input Uniprot identifiers (e.g. using the proteins()
function) and
to use these as input for the mapping function.
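One possible way to follow this suggestion is sketched below. It assumes that the EnsDb.Hsapiens.v86 package is installed and that the "uniprot_id" column can be requested from proteins() together with a UniprotFilter; the variable names are chosen for illustration only.

```r
library(ensembldb)
library(EnsDb.Hsapiens.v86)

## Restrict queries to chromosome X, as in the examples on this page
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")

## Protein-relative ranges, named by Uniprot ID
uniprot_rng <- IRanges(start = c(13, 43), end = c(21, 80),
                       names = c("O15266", "Q9HBJ8"))

## Look up the Ensembl protein IDs annotated to these Uniprot IDs
prts <- proteins(edbx, filter = UniprotFilter(names(uniprot_rng)),
                 columns = c("protein_id", "uniprot_id"))

## Repeat each range once per annotated Ensembl protein and rename the
## ranges by Ensembl protein ID
prng <- uniprot_rng[prts$uniprot_id]
names(prng) <- prts$protein_id

## Map using the default idType = "protein_id"
res <- proteinToTranscript(prng, edbx)
```

The result then contains one element per Ensembl protein rather than one per Uniprot ID, so the link back to the Uniprot ID has to be kept separately (here in `prts`).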
Other coordinate mapping functions:
cdsToTranscript()
,
genomeToProtein()
,
genomeToTranscript()
,
proteinToGenome()
,
transcriptToCds()
,
transcriptToGenome()
,
transcriptToProtein()
library(EnsDb.Hsapiens.v86)
## Restrict all further queries to chromosome X to speed up the examples
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")
## Define an IRanges object with protein-relative coordinates within a
## protein for the gene SYP
syp <- IRanges(start = 4, end = 17)
names(syp) <- "ENSP00000418169"
res <- proteinToTranscript(syp, edbx)
#> Fetching CDS for 1 proteins ...
#> 1 found
#> Checking CDS and protein sequence lengths ...
#> 1/1 OK
res
#> IRangesList object of length 1:
#> $ENSP00000418169
#> IRanges object with 1 range and 5 metadata columns:
#> start end width | protein_id
#> <integer> <integer> <integer> | <character>
#> ENST00000479808 23 64 42 | ENSP00000418169
#> tx_id cds_ok protein_start protein_end
#> <character> <logical> <integer> <integer>
#> ENST00000479808 ENST00000479808 TRUE 4 17
#>
## Positions 4 to 17 within the protein are encoded by the region
## from nt 23 to 64 of the transcript.
## Perform the mapping for multiple proteins identified by their Uniprot
## IDs.
ids <- c("O15266", "Q9HBJ8", "unexistant")
prngs <- IRanges(start = c(13, 43, 100), end = c(21, 80, 100))
names(prngs) <- ids
res <- proteinToTranscript(prngs, edbx, idType = "uniprot_id")
#> Fetching CDS for 3 proteins ...
#> Warning: No CDS found for: unexistant
#> 2 found
#> Checking CDS and protein sequence lengths ...
#> 2/2 OK
## The result is a list, same length as the input object
length(res)
#> [1] 3
names(res)
#> [1] "O15266" "Q9HBJ8" "unexistant"
## No protein/encoding transcript could be found for the last one
res[[3]]
#> IRanges object with 1 range and 6 metadata columns:
#> start end width | protein_id tx_id cds_ok
#> <integer> <integer> <integer> | <character> <character> <logical>
#> [1] -1 -1 1 | <NA> <NA> <NA>
#> protein_start protein_end uniprot_id
#> <integer> <integer> <character>
#> [1] 100 100 unexistant
## The first protein could be mapped to multiple Ensembl proteins. The
## regions within all transcripts encoding the region in the protein are
## returned
res[[1]]
#> IRanges object with 4 ranges and 6 metadata columns:
#> start end width | protein_id
#> <integer> <integer> <integer> | <character>
#> ENST00000334060 728 754 27 | ENSP00000335505
#> ENST00000381575 128 154 27 | ENSP00000370987
#> ENST00000381578 728 754 27 | ENSP00000370990
#> ENST00000554971 128 154 27 | ENSP00000452016
#> tx_id cds_ok protein_start protein_end
#> <character> <logical> <integer> <integer>
#> ENST00000334060 ENST00000334060 TRUE 13 21
#> ENST00000381575 ENST00000381575 TRUE 13 21
#> ENST00000381578 ENST00000381578 TRUE 13 21
#> ENST00000554971 ENST00000554971 TRUE 13 21
#> uniprot_id
#> <character>
#> ENST00000334060 O15266
#> ENST00000381575 O15266
#> ENST00000381578 O15266
#> ENST00000554971 O15266
## The result for the region within the second protein
res[[2]]
#> IRanges object with 1 range and 6 metadata columns:
#> start end width | protein_id
#> <integer> <integer> <integer> | <character>
#> ENST00000380342 383 496 114 | ENSP00000369699
#> tx_id cds_ok protein_start protein_end
#> <character> <logical> <integer> <integer>
#> ENST00000380342 ENST00000380342 TRUE 43 80
#> uniprot_id
#> <character>
#> ENST00000380342 Q9HBJ8