Map within-protein coordinates to genomic coordinates

proteinToGenome maps protein-relative coordinates to genomic coordinates based on the genomic coordinates of the CDS of the encoding transcript. The encoding transcript is identified using protein-to-transcript annotations (and, where needed, Uniprot to Ensembl protein identifier mappings) from the submitted EnsDb object (the mapping is thus based on annotations from Ensembl).

Not all coding regions of protein-coding transcripts are complete. The function therefore also checks whether the length of the coding region matches the length of the protein sequence and issues a warning if it does not.

For each input range, the function reports the genomic coordinates of the within-protein coordinates, the Ensembl protein ID, the ID of the encoding transcript, and the within-protein start and end coordinates.
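The arithmetic behind the mapping can be illustrated with a minimal base R sketch. It assumes 1-based, inclusive coordinates and three nucleotides per amino acid. The helper protein_to_cds is hypothetical and not part of ensembldb; it only shows the step from protein positions to positions within the CDS, not the projection of the CDS onto exons.

```r
## Hypothetical helper (not part of ensembldb): convert 1-based, inclusive
## protein positions to 1-based, inclusive positions within the CDS.
protein_to_cds <- function(start, end) {
    stopifnot(all(start >= 1), all(end >= start))
    data.frame(cds_start = (start - 1) * 3 + 1, cds_end = end * 3)
}

## Residues 4 to 17, as used for ENSP00000418169 in the Examples section
cds <- protein_to_cds(start = 4, end = 17)
cds
## Nucleotides covered: 3 per amino acid, 14 residues
cds$cds_end - cds$cds_start + 1

## Widths of the two genomic ranges reported for this input in the
## Examples section; they add up to the same number of nucleotides (42).
exon_widths <- c(49200177 - 49200151 + 1, 49199033 - 49199019 + 1)
sum(exon_widths)
```

When a range spans an exon boundary, the total width of the returned genomic ranges should still equal three times the number of residues, which gives a quick sanity check for a mapping result.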

# S4 method for EnsDb
proteinToGenome(x, db, id = "name", idType = "protein_id")

Arguments

x: IRanges with the coordinates within the protein(s). The object must also provide a way to identify the protein (see Details).
db: EnsDb object to be used to retrieve genomic coordinates of encoding transcripts.
id: character(1) specifying where the protein identifier can be found. Has to be either "name" or one of colnames(mcols(x)).
idType: character(1) defining what type of IDs are provided. Has to be one of "protein_id" (default), "uniprot_id" or "tx_id".

Value

list, each element being the mapping results for one of the input ranges in x and names being the IDs used for the mapping. Each element can be either a:

GRanges object with the genomic coordinates calculated from the protein-relative coordinates for the respective Ensembl protein (stored in the "protein_id" metadata column).
GRangesList object, if the provided protein identifier in x was mapped to several Ensembl protein IDs (e.g. if Uniprot identifiers were used). Each element in this GRangesList is a GRanges with the genomic coordinates calculated for the protein-relative coordinates from the respective Ensembl protein ID.

The following metadata columns are available in each GRanges in the result:

"protein_id": the ID of the Ensembl protein for which the within-protein coordinates were mapped to the genome.
"tx_id": the Ensembl transcript ID of the encoding transcript.
"exon_id": ID of the exons that have overlapping genomic coordinates.
"exon_rank": the rank/index of the exon within the encoding transcript.
"cds_ok": contains TRUE if the length of the CDS matches the length of the amino acid sequence and FALSE otherwise.
"protein_start": the within-protein sequence start coordinate of the mapping.
"protein_end": the within-protein sequence end coordinate of the mapping.

Genomic coordinates are returned ordered by the exon index within the transcript.

Details

Protein identifiers can be passed to the function as names of the x IRanges object or, alternatively, in any one of the metadata columns (mcols) of x. Supported are Ensembl protein IDs and Uniprot IDs; the idType argument additionally accepts "tx_id".
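As a minimal sketch of the second option, the identifier can be stored in a metadata column and that column named through the id argument. The column name "my_id" is an arbitrary choice made for this illustration; the protein ID and range are those used in the Examples section.

```r
library(ensembldb)
library(EnsDb.Hsapiens.v86)
library(IRanges)

## Restrict queries to chromosome X, as in the Examples section
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")

## Store the protein ID in a metadata column instead of the names.
## "my_id" is an arbitrary column name chosen for this sketch.
syp <- IRanges(start = 4, end = 17)
mcols(syp) <- DataFrame(my_id = "ENSP00000418169")

## Tell proteinToGenome which metadata column holds the identifier
res <- proteinToGenome(syp, edbx, id = "my_id")
res
```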

Note

While the mapping for Ensembl protein IDs to encoding transcripts (and thus CDS) is 1:1, the mapping between Uniprot identifiers and encoding transcripts (which is based on Ensembl annotations) can be one to many. In such cases proteinToGenome calculates genomic coordinates for within-protein coordinates for all of the annotated Ensembl proteins and returns all of them. See below for examples.

Mapping using Uniprot identifiers also requires additional internal checks that have a significant impact on the performance of the function. It is thus strongly suggested to first identify the Ensembl protein identifiers for the list of input Uniprot identifiers (e.g. using the proteins() function) and use these as input for the mapping function.
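One way to follow this suggestion is sketched below. It assumes that the within-protein coordinates apply unchanged to every Ensembl protein annotated to the Uniprot ID, which should be verified for real data. The Uniprot ID and coordinates are taken from the Examples section.

```r
library(ensembldb)
library(EnsDb.Hsapiens.v86)
library(IRanges)

## Restrict queries to chromosome X, as in the Examples section
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")

## Look up the Ensembl protein IDs annotated to the Uniprot ID
prts <- proteins(edbx, filter = UniprotFilter("O15266"),
                 columns = c("uniprot_id", "protein_id"))
ids <- unique(prts$protein_id)

## One input range per Ensembl protein, named by its protein ID.
## Assumption: positions 13 to 21 are valid for each of these proteins.
prng <- IRanges(start = rep(13, length(ids)), end = rep(21, length(ids)),
                names = ids)

## Map using Ensembl protein IDs instead of Uniprot IDs
res <- proteinToGenome(prng, edbx, idType = "protein_id")
```

With this approach the result contains one list element per Ensembl protein ID rather than a nested GRangesList per Uniprot ID.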

A warning is issued for proteins whose sequence length does not match the coding sequence length of any encoding transcript. For such proteins/transcripts, FALSE is reported in the respective "cds_ok" metadata column. The most common reason for such discrepancies is an incomplete 3' or 5' end of the CDS. The positions within the protein might not be correctly mapped to the genome in such cases, and it might be necessary to check the mapping manually in the Ensembl genome browser.

Author

Johannes Rainer based on initial code from Laurent Gatto and Sebastian Gibb

Examples


library(EnsDb.Hsapiens.v86)
## Restrict all further queries to chromosome X to speed up the examples
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")

## Define an IRanges object with protein-relative coordinates within a
## protein for the gene SYP
syp <- IRanges(start = 4, end = 17)
names(syp) <- "ENSP00000418169"
res <- proteinToGenome(syp, edbx)
#> Fetching CDS for 1 proteins ... 
#> 1 found
#> Checking CDS and protein sequence lengths ... 
#> 1/1 OK
res
#> $ENSP00000418169
#> GRanges object with 2 ranges and 7 metadata columns:
#>       seqnames            ranges strand |      protein_id           tx_id
#>          <Rle>         <IRanges>  <Rle> |     <character>     <character>
#>   [1]        X 49200151-49200177      - | ENSP00000418169 ENST00000479808
#>   [2]        X 49199019-49199033      - | ENSP00000418169 ENST00000479808
#>               exon_id exon_rank    cds_ok protein_start protein_end
#>           <character> <integer> <logical>     <integer>   <integer>
#>   [1] ENSE00001902363         1      TRUE             4          17
#>   [2] ENSE00003520347         2      TRUE             4          17
#>   -------
#>   seqinfo: 1 sequence from GRCh38 genome
#> 
## Positions 4 to 17 within the protein span two exons of the encoding
## transcript.

## Perform the mapping for multiple proteins identified by their Uniprot
## IDs.
ids <- c("O15266", "Q9HBJ8", "unexistant")
prngs <- IRanges(start = c(13, 43, 100), end = c(21, 80, 100))
names(prngs) <- ids

res <- proteinToGenome(prngs, edbx, idType = "uniprot_id")
#> Fetching CDS for 3 proteins ... 
#> Warning: No CDS found for: unexistant
#> 2 found
#> Checking CDS and protein sequence lengths ... 
#> 2/3 OK

## The result is a list, same length as the input object
length(res)
#> [1] 3
names(res)
#> [1] "O15266"     "Q9HBJ8"     "unexistant"

## No protein/encoding transcript could be found for the last one
res[[3]]
#> GRanges object with 0 ranges and 0 metadata columns:
#>    seqnames    ranges strand
#>       <Rle> <IRanges>  <Rle>
#>   -------
#>   seqinfo: no sequences

## The first protein could be mapped to multiple Ensembl proteins. The
## mapping results for all of their encoding transcripts are returned
res[[1]]
#> GRangesList object of length 4:
#> $ENSP00000335505
#> GRanges object with 1 range and 8 metadata columns:
#>       seqnames        ranges strand |  uniprot_id           tx_id
#>          <Rle>     <IRanges>  <Rle> | <character>     <character>
#>   [1]        X 630934-630960      + |      O15266 ENST00000334060
#>            protein_id         exon_id exon_rank    cds_ok protein_start
#>           <character>     <character> <integer> <logical>     <integer>
#>   [1] ENSP00000335505 ENSE00001489177         2      TRUE            13
#>       protein_end
#>         <integer>
#>   [1]          21
#>   -------
#>   seqinfo: 1 sequence from GRCh38 genome
#> 
#> $ENSP00000370987
#> GRanges object with 1 range and 8 metadata columns:
#>       seqnames        ranges strand |  uniprot_id           tx_id
#>          <Rle>     <IRanges>  <Rle> | <character>     <character>
#>   [1]        X 630934-630960      + |      O15266 ENST00000381575
#>            protein_id         exon_id exon_rank    cds_ok protein_start
#>           <character>     <character> <integer> <logical>     <integer>
#>   [1] ENSP00000370987 ENSE00001489169         1      TRUE            13
#>       protein_end
#>         <integer>
#>   [1]          21
#>   -------
#>   seqinfo: 1 sequence from GRCh38 genome
#> 
#> $ENSP00000370990
#> GRanges object with 1 range and 8 metadata columns:
#>       seqnames        ranges strand |  uniprot_id           tx_id
#>          <Rle>     <IRanges>  <Rle> | <character>     <character>
#>   [1]        X 630934-630960      + |      O15266 ENST00000381578
#>            protein_id         exon_id exon_rank    cds_ok protein_start
#>           <character>     <character> <integer> <logical>     <integer>
#>   [1] ENSP00000370990 ENSE00001489177         2      TRUE            13
#>       protein_end
#>         <integer>
#>   [1]          21
#>   -------
#>   seqinfo: 1 sequence from GRCh38 genome
#> 
#> $ENSP00000452016
#> GRanges object with 1 range and 8 metadata columns:
#>       seqnames        ranges strand |  uniprot_id           tx_id
#>          <Rle>     <IRanges>  <Rle> | <character>     <character>
#>   [1]        X 630934-630960      + |      O15266 ENST00000554971
#>            protein_id         exon_id exon_rank    cds_ok protein_start
#>           <character>     <character> <integer> <logical>     <integer>
#>   [1] ENSP00000452016 ENSE00001489169         1      TRUE            13
#>       protein_end
#>         <integer>
#>   [1]          21
#>   -------
#>   seqinfo: 1 sequence from GRCh38 genome
#> 

## The coordinates within the second protein span two exons
res[[2]]
#> GRanges object with 2 ranges and 8 metadata columns:
#>       seqnames            ranges strand |  uniprot_id           tx_id
#>          <Rle>         <IRanges>  <Rle> | <character>     <character>
#>   [1]        X 15659016-15659092      - |      Q9HBJ8 ENST00000380342
#>   [2]        X 15644993-15645029      - |      Q9HBJ8 ENST00000380342
#>            protein_id         exon_id exon_rank    cds_ok protein_start
#>           <character>     <character> <integer> <logical>     <integer>
#>   [1] ENSP00000369699 ENSE00001202097         3      TRUE            43
#>   [2] ENSP00000369699 ENSE00000978331         4      TRUE            43
#>       protein_end
#>         <integer>
#>   [1]          80
#>   [2]          80
#>   -------
#>   seqinfo: 1 sequence from GRCh38 genome