Use Cases for Metabolomics Data Annotation Using the `MetaboAnnotation` package
Johannes Rainer, Andrea Vicini, Liesa Salzer, Jan Stanstrup, Josep M Badia, Steffen Neumann, Michael Stravs, Vinicius Verri Hernandes, Laurent Gatto, Sebastian Gibb and Michael Witting
Source:vignettes/annotation-use-cases.Rmd
annotation-use-cases.Rmd
This documents provides use cases describing the annotation of
untarget metabolomics and small compound datasets using functionality
from the r Biocpkg("MetaboAnnotation")
, MetaboCoreUtils
and r Biocpkg("CompoundDb")
packages (Rainer et al. 2022). All data used in this
document is available within the github
repository https://github.com/jorainer/MetaboAnnotationTutorials.
Additional use cases and examples can be found in the packages’
vignettes and workshops/tutorials. See section Package repositories
and additional tutorials for links to the package repositories and
tutorials.
Installation instructions
The R packages as well as the example files used in this document can be installed with the R code below:
install.packages("BiocManager")
BiocManager::install("jorainer/MetaboAnnotationTutorials",
dependencies = TRUE, ask = FALSE, update = TRUE)
The source code for this document along with the test data can be downloaded from the github repository https://github.com/jorainer/MetaboAnnotationTutorials with the command (or alternatively downloading the zip archive directly from the github page).
git clone https://github.com/jorainer/MetaboAnnotationTutorials
\(MS^1\) annotation
In this example we annotate chromatographic peaks identified in a test sample provided within this repository first based only on their m/z values and then using their m/z and retention times against previously determined values for a set of pure standards. The test data represents LC-MS/MS measurements of a set of metabolite standards. In brief, aliquots for pure standards of 15 metabolites were solved in water and injected into a LC-MS system. Details are provided in the Material and Methods section in the main text.
Below we load the data which is distributed within the
MetaboAnnotationTutorials
package and perform the
chromatographic peak detection using the xcms
package.
library(xcms)
fl <- system.file("mzML", "HighIS_Mix07_CE20_POS.mzML",
package = "MetaboAnnotationTutorials")
std_data <- readMSData(fl, mode = "onDisk")
We next perform the chromatographic peak detection using the
centWave algorithm followed by a peak refinement to
reduce the number of peak detection artifacts, such as split-peaks or
overlapping peaks (see the LC-MS/MS data analysis with xcms
vignette from the xcms
package for details on the
chromatographic peak detection settings).
#' Peak detection
cwp <- CentWaveParam(snthresh = 10, prefilter = c(3, 4000), ppm = 40,
peakwidth = c(2, 8), integrate = 2)
std_data <- findChromPeaks(std_data, param = cwp)
#' Peak refinement
std_data <- refineChromPeaks(
std_data, MergeNeighboringPeaksParam(expandRt = 3))
In total 175 chromatographic peaks have been identified. This is even more than would be expected since the sample represents a mixture of 15 pure standards solved in water. Below we display the first 6 of the detected chromatographic peaks.
head(chromPeaks(std_data))
## mz mzmin mzmax rt rtmin rtmax into intb
## CP001 355.28240 355.28193 355.28369 30.495 24.575 37.641 83909.909 78939.997
## CP002 358.98005 358.97954 358.98037 10.390 10.065 11.356 7002.894 6897.381
## CP003 413.26607 413.26573 413.26704 32.892 25.729 37.641 102217.289 100770.550
## CP004 301.14091 301.13962 301.14279 25.729 24.575 40.013 47095.520 44283.920
## CP005 80.02494 80.02419 80.02543 40.013 35.261 56.554 53223.003 53005.070
## CP006 63.99953 63.99950 63.99958 42.371 35.261 61.268 156775.850 156184.543
## maxo sn sample
## CP001 10329.126 25 1
## CP002 5138.685 32 1
## CP003 22355.266 90 1
## CP004 7085.818 40 1
## CP005 6082.308 181 1
## CP006 17705.469 242 1
We next annotate these chromatographic peaks using only their
m/z values. As a reference we load a CompDb
database containing all compounds from the Human Metabolom Database
(Wishart et al. 2021) version 5.0 which is
also provided as a data release on the
MetaboAnnotationTutorials
github
repository. The database was build from HMDB 5.0 (2021-11-02) using
the CompoundDb
package (see the package
documentation for more information).
Below we download the database from github to a temporary folder.
#' Download the CompDb database using curl
library(curl)
dbname <- "CompDb.Hsapiens.HMDB.5.0.sqlite"
db_file <- file.path(tempdir(), dbname)
curl_download(
paste0("https://github.com/jorainer/MetaboAnnotationTutorials/",
"releases/download/2021-11-02/", dbname),
destfile = db_file)
We next load the database using the CompoundDb
package.
library(CompoundDb)
#' Load a CompDb database with compound annotation from HMDB
cdb <- CompDb(db_file)
cdb
## class: CompDb
## data source: HMDB
## version: 5.0
## organism: Hsapiens
## compound count: 217776
## MS/MS spectra count: 64920
Note: while we are here annotating chromatographic
peaks from an xcms
result object, it would also be possible
to annotate LC-MS features using the same approach by simply exchanging
the chromPeaks
call with featureDefinitions
.
In addition, the annotation could also be performed on a
SummarizedExperiment
representing the preprocessing results
which can be generated from an xcms
result object using the
quantify
method.
Annotation using m/z values
We next extract all peaks from the xcms
result object
and annotate them using the matchMz
function against all
compounds from HMDB (the compound annotations can be extracted from the
CompDb
database with the compounds
function).
The reference database does not provide m/z values but only
exact masses for each compound. Thus, in order to annotate our
chromatographic peaks’ m/z values, m/z values for
expected ions of the compounds need to be calculated first. With
Mass2MzParam
we can configure the parameters for the
annotation. Parameter adducts
allows to specify the adducts
(ions) we expect to see mostly in the data set. For our example we
assume to see mostly [M+H]+
and [M+Na]+
positive ions. With parameters tolerance
and
ppm
we can finally define also the absolute and/or
m/z relative acceptable difference, respectively, for
m/z values to be considered matching.
library(MetaboAnnotation)
#' Get the chromatographic peaks and add a column with the peak_id
pks <- data.frame(peak_id = rownames(chromPeaks(std_data)),
chromPeaks(std_data))
#' Set parameters for the m/z-based annotation
param <- Mass2MzParam(adducts = c("[M+H]+", "[M+Na]+"),
tolerance = 0, ppm = 10)
#' Perform the matching
pks_match <- matchMz(
pks,
compounds(cdb, c("compound_id", "exactmass", "formula", "name")),
param = param)
pks_match
## Object of class Matched
## Total number of matches: 697
## Number of query objects: 175 (112 matched)
## Number of target objects: 217776 (393 matched)
The result object handles the potential many-to-many matching between chromatographic peaks (query) and reference (target) and contains all information from both the query and the target object along with the score for the match (in this case the difference between the m/z values). Below we list all available information in the result object.
colnames(pks_match)
## [1] "peak_id" "mz" "mzmin"
## [4] "mzmax" "rt" "rtmin"
## [7] "rtmax" "into" "intb"
## [10] "maxo" "sn" "sample"
## [13] "target_compound_id" "target_name" "target_formula"
## [16] "target_exactmass" "adduct" "score"
## [19] "ppm_error"
The n:m mapping is represented in the matches
data
frame. It maps each annotated query element to all matching target
elements. Below we list the first 6 rows from that table. Column
"score"
provides the difference between the query and
target m/z values (in Da) and "ppm_error"
the ppm
error.
## query_idx target_idx adduct score ppm_error
## 1 1 6977 [M+H]+ -0.001888540 5.315573
## 2 1 7006 [M+H]+ -0.001888540 5.315573
## 3 1 158233 [M+H]+ -0.001888540 5.315573
## 4 1 160274 [M+H]+ -0.001888540 5.315573
## 5 1 170679 [M+H]+ -0.001888540 5.315573
## 6 2 164617 [M+H]+ 0.003242238 9.031887
For users it’s however more convenient to extract annotation
information with the matchedData
function, that will return
a data frame with all annotations. If query elements are annotated to
more than one target element they will be listed in as many rows as
there are matches. Not matching query elements are represented with a
single row in the result table, containing NA
for all
annotations. Below we extract all annotations, restricting to some of
the annotation columns.
matchedData(pks_match, c("peak_id", "target_name",
"target_formula", "adduct"))
## DataFrame with 760 rows and 4 columns
## peak_id target_name target_formula adduct
## <character> <character> <character> <character>
## CP001 CP001 MG(0:0/18:... C21H38O4 [M+H]+
## CP001.1 CP001 MG(18:2(9Z... C21H38O4 [M+H]+
## CP001.2 CP001 Glyceryl m... C21H38O4 [M+H]+
## CP001.3 CP001 2-Linoleoy... C21H38O4 [M+H]+
## CP001.4 CP001 RIOPROSTIL C21H38O4 [M+H]+
## ... ... ... ... ...
## CPM186.9 CPM186 4-Ethoxy-p... C12H22O [M+Na]+
## CPM186.10 CPM186 cis-3-Etho... C12H22O [M+Na]+
## CPM187 CPM187 NA NA NA
## CPM188 CPM188 NA NA NA
## CPM189 CPM189 NA NA NA
As we can see, the number of rows of this data frame is much larger than the number of chromatographic peaks. Most peaks are in fact annotated to several possible compounds based on their m/z. The average number of compounds to which a chromatographic peak in this data is annotated is calculated below.
## [1] 6.223214
Annotations in the result object are organized by query, annotations for individual query elements can thus be accessed very easily. Below we list all annotations for the first chromatographic peak.
matchedData(pks_match[1], c("peak_id", "target_name",
"target_formula", "adduct"))
## DataFrame with 5 rows and 4 columns
## peak_id target_name target_formula adduct
## <character> <character> <character> <character>
## CP001 CP001 MG(0:0/18:... C21H38O4 [M+H]+
## CP001.1 CP001 MG(18:2(9Z... C21H38O4 [M+H]+
## CP001.2 CP001 Glyceryl m... C21H38O4 [M+H]+
## CP001.3 CP001 2-Linoleoy... C21H38O4 [M+H]+
## CP001.4 CP001 RIOPROSTIL C21H38O4 [M+H]+
All compounds have thus the exact same chemical formula and hence, using only the m/z value, we can not discriminate between them or determine which would be the correct annotation. Considering also the retention times would allow a better annotation and increase its confidence.
Annotation using reference m/z and retention times
For LC-MS experiments compounds are first chromatographically separated based on some property of the compound (other than their mass). Thus, also the retention times contain information that can improve the annotation. In this section we use information from a previous experiment in which pure standards were measured with the same LC-MS setup used to generate the present test data and in which the retention time as well as the ion (adduct) and its m/z value were determined. These reference values were thus empirically determined on the same instruments with the same setup and should thus help to get higher confidence annotations.
Below we load the reference retention time and m/z data for some of the measured standards.
fl <- system.file("txt", "std_ions.txt",
package = "MetaboAnnotationTutorials")
std_ions <- read.table(fl, sep = "\t", header = TRUE)
head(std_ions)
## compound_id ion_adduct ion_rt ion_mz
## 1 HMDB0001539 [M+H]+ 176 203.1503
## 2 HMDB0000905 [M+H]+ 208 332.0754
## 3 HMDB0001847 [M+H]+ 35 195.0877
## 4 HMDB0001413 [M+H]+ 217 490.1224
## 5 HMDB0000562 [M+H]+ 87 114.0662
## 6 HMDB0000086 [M+H]+ 189 258.1101
We can now annotate the chromatographic peaks using this reference
data. We configure the matching using the MzRtParam
, which
allows to annotate query elements based on m/z and retention
times. With ppm
we define again the m/z-relative
acceptable difference of the m/z values and with
toleranceRt
the absolute acceptable difference in measured
retention times. The parameters mzColname
and
rtColname
allow us to define the names of the columns in
the query and target objects containing the m/z values and
retention time values for the matching. Settings for ppm
and toleranceRt
should be adapted to the experimental
setup, retention time difference could e.g. be larger for HILIC-based
chromatographic separation. Below we use a difference of 7 seconds
(since retention times in std_ions
and in the
xcms
result are reported in seconds).
param <- MzRtParam(ppm = 10, toleranceRt = 7)
pks_match <- matchMz(pks, std_ions, param = param,
mzColname = c("mz", "ion_mz"),
rtColname = c("rt", "ion_rt"))
pks_match
## Object of class Matched
## Total number of matches: 9
## Number of query objects: 175 (9 matched)
## Number of target objects: 19 (9 matched)
In addition to supplying the reference (target) data as a
data.frame
we could also directly use an
IonDb
. This type of annotation object extends the above
used CompDb
database providing in addition to general
compound annotations also information on measured ions including their
retention times and m/z values. It is thus designed to contain
reference annotation databases for specific LC-MS setups and MS
instrumentation used in a lab. Below we create such an
IonDb
database (which will be saved in the present example
to a temporary file) from the HMDB CompDb
database and add
our ion annotations to it.
#' Create an IonDb filling it with data from cdb.
idb <- IonDb(tempfile(), cdb)
#' Insert measured m/z and retention times for ions
idb <- insertIon(idb, std_ions)
idb
## class: IonDb
## data source: HMDB
## version: 5.0
## organism: Hsapiens
## compound count: 217776
## MS/MS spectra count: 64920
## ion count: 19
The advantage of such an annotation database over individual (text)
file-based annotation approaches is obvious: along with the annotation
data also metadata can be saved and the database can be versioned, which
is crucial for reproducible workflows. Also, the database’s SQLite file
is portable and can be distributed and used on different computers. In
addition, it would also be possible to use a MySQL-based database system
which would enable centralized annotation resources in a laboratory.
More information on the creation and usage of annotation resources with
the CompoundDb
package is provided in its package vignettes
which are also available here.
We can now annotate our peaks against that database using the same
call as above, but providing the idb
as target
parameter.
pks_match <- matchMz(
pks, ions(idb, c("compound_id", "ion_adduct", "ion_mz", "ion_rt", "name")),
param = param, mzColname = c("mz", "ion_mz"),
rtColname = c("rt", "ion_rt"))
pks_match
## Object of class Matched
## Total number of matches: 9
## Number of query objects: 175 (9 matched)
## Number of target objects: 19 (9 matched)
In total 9 chromatographic peaks were annotated to an ion of one of
the standards. We can identify the chromatographic peaks with available
annotation using the whichQuery
function, that returns
their index in the query. Below we thus also subset the matched results
to only those chromatographics peaks with an annotation.
whichQuery(pks_match)
## [1] 61 68 74 86 88 131 142 148 162
pks_match <- pks_match[whichQuery(pks_match)]
We next build a result table for the successfully annotated chromatographic peaks.
library(pander)
pandoc.table(as.data.frame(
matchedData(pks_match, c("peak_id", "target_name", "score", "score_rt"))),
style = "rmarkdown", split.tables = Inf)
peak_id | target_name | score | score_rt | |
---|---|---|---|---|
CP067 | CP067 | Caffeine | -0.0006847 | 0.261 |
CP074 | CP074 | Creatinine | 0.0001926 | 0.238 |
CP080 | CP080 | Sphingosine | -0.0003391 | 3.013 |
CP094 | CP094 | L-Methionine | -0.0001487 | -6.839 |
CP096 | CP096 | 1-Methyluric acid | 0.0002982 | -0.626 |
CP143 | CP143 | 3-Methylhistidine | -0.0002466 | 0.734 |
CP154 | CP154 | Asymmetric dimethylarginine | -0.0003642 | 3.688 |
CP162 | CP162 | Glycerophosphocholine | -0.0003917 | 0.098 |
CP176 | CP176 | Deoxyadenosine monophosphate | 0.0007856 | 2.337 |
Columns "score"
and "score_rt"
contain the
difference between the query and the annotated target’s m/z (in
Da) and retention time (in seconds), respectively.
Annotation using reference m/z and retention indices
Retention time is a valuable orthogonal information for metabolite identification, but chromatographic conditions are far from being standardized. Even when using nominally the same chromatographic system, meaning column and eluents, differences in retention times arise due to dead volumes, gradient delay volumes, etc. In GC-MS retention time indexing has been used to normalize for drifts in retention times by converting them into retention indices. A similar approach has been recently described for LC-MS (Stoffel et al. 2021) using N-Alkyl-pyridinium sulfonates as retention indexing substances. Their use allows to convert retention times into retention indices and to compare retention information across different LC-MS systems.
We first load data from an untargeted metabolomic experiments. This data has been completely processed. Each feature is identified by a unique name.
library(MetaboCoreUtils)
fl <- system.file("txt", "rti_ms_pos_features.txt",
package = "MetaboAnnotationTutorials")
ms1_features <- read.table(fl, header = TRUE, sep = "\t")
head(ms1_features)
## Name mz rtime Cel_pure_1.B.1_01_21144
## 1 Cluster_00003 71.01397 1.527515 946.2993
## 2 Cluster_00004 72.33667 1.529377 1402.1213
## 3 Cluster_00006 86.37420 1.946073 7245.5283
## 4 Cluster_00018 66.03701 1.454697 1306.3455
## 5 Cluster_00025 71.03044 1.259239 256197.7500
## 6 Cluster_00030 80.03526 1.259666 242820.3125
## Cel_pure_1.B.1_01_21149 Cel_pure_1.B.1_01_21154 Cel_pure_1.B.1_01_21159
## 1 806.088 796.521 771.9626
## 2 1427.545 1334.492 1400.0985
## 3 8062.006 8196.895 8502.9092
## 4 1389.757 1338.029 1341.1783
## 5 238996.312 212299.125 242366.2188
## 6 209218.312 174476.219 223134.0000
## Cel_pure_1.B.1_01_21164
## 1 713.608
## 2 1503.299
## 3 9668.902
## 4 1584.487
## 5 239547.453
## 6 215851.828
Likewise we load the information on the N-Alkyl-pyridinium sulfonates. The retention index is defined as the number of carbons in the alkyl chain times 100.
fl <- system.file("txt", "rti_ms_pos_naps.txt",
package = "MetaboAnnotationTutorials")
ms1_naps <- read.table(fl, header = TRUE, sep = "\t")
head(ms1_naps)
## rtime rindex
## 1 NA 100
## 2 1.5 200
## 3 1.7 300
## 4 2.6 400
## 5 5.9 500
## 6 10.3 600
The relationship between the retention time and retention index can be plotted.
plot(ms1_naps)
We can use the N-Alkly-pyridinium sulfonates to convert the retention
times of the measured features using the indexRtime
function. By default linear interpolation is used for conversion, but
other (custom) functions can be defined if required.
ms1_features$rindex <- indexRtime(ms1_features$rtime, ms1_naps)
This data can then be annotated using reference databases. Two different databases are used. One was obtained on a UHPLC column and one on a HPLC column of the same column chemistry. The untargeted data was obtained from a HPLC column, but using a different LC hardware, resulting in different absolute retention times.
fl <- system.file("txt", "rti_db_hplc.txt",
package = "MetaboAnnotationTutorials")
db_hplc <- read.table(fl, header = TRUE, sep = "\t",
quote = "", comment.char = "")
fl <- system.file("txt", "rti_db_uhplc.txt",
package = "MetaboAnnotationTutorials")
db_uhplc <- read.table(fl, header = TRUE, sep = "\t",
quote = "", comment.char = "")
Next we can define the Mass2MzRtParam
similar to above.
We will use an absolute error of 10 retention index units and use
[M+H]+
and [M+Na]+
adducts.
match_param <- Mass2MzRtParam(adducts = c("[M+H]+", "[M+Na]+"),
tolerance = 0.005,
toleranceRt = 10)
Matching can now be performed. Important is that the column with the
retention index needs to be defined using rtColname
.
Matching is performed against both databases and results are then
filtered to only contain features with at least one match.
matches_hplc <- matchMz(ms1_features,
db_hplc,
match_param,
massColname = "exact.mass",
mzColname = "mz",
rtColname = "rindex")
matches_hplc <- matches_hplc[whichQuery(matches_hplc)]
matches_uhplc <- matchMz(ms1_features,
db_uhplc,
match_param,
massColname = "exact.mass",
mzColname = "mz",
rtColname = "rindex")
matches_uhplc <- matches_uhplc[whichQuery(matches_uhplc)]
We next build a result table for the successfully annotated chromatographic peaks. First for the results matching with the HPLC database.
library(pander)
pandoc.table(head(as.data.frame(
matchedData(matches_hplc, c("Name", "mz", "rindex", "target_name",
"adduct", "score", "score_rt")))),
style = "rmarkdown", split.tables = Inf)
Name | mz | rindex | target_name | adduct | score | score_rt | |
---|---|---|---|---|---|---|---|
24 | Cluster_00116 | 132.1 | 328.8 | 5-AMINOLEVULINATE | [M+H]+ | 0.005076 | -1.226 |
24.1 | Cluster_00116 | 132.1 | 328.8 | N-ACETYLALANINE | [M+H]+ | 0.005076 | -1.226 |
395 | Cluster_01443 | 125.1 | 418.4 | 4-METHYLCATECHOL | [M+H]+ | -0.0002352 | 8.425 |
434 | Cluster_01544 | 133 | 392.7 | GLUTARATE | [M+H]+ | -0.000176 | 0.733 |
434.1 | Cluster_01544 | 133 | 392.7 | GLUTARATE | [M+H]+ | -0.000176 | -2.267 |
469 | Cluster_01596 | 138.1 | 582.2 | SALICYLAMIDE | [M+H]+ | 0.0001774 | -1.791 |
Then for matching with the UPLC database.
library(pander)
pandoc.table(head(as.data.frame(
matchedData(matches_uhplc, c("Name", "mz", "rindex", "target_name",
"adduct", "score", "score_rt")))),
style = "rmarkdown", split.tables = Inf)
Name | mz | rindex | target_name | adduct | score | score_rt | |
---|---|---|---|---|---|---|---|
24 | Cluster_00116 | 132.1 | 328.8 | N-ACETYLALANINE | [M+H]+ | 0.005076 | 4.774 |
380 | Cluster_01397 | 123 | 679.6 | BENZOATE | [M+H]+ | -0.0003409 | -1.38 |
434 | Cluster_01544 | 133 | 392.7 | GLUTARATE | [M+H]+ | -0.000176 | 4.733 |
434.1 | Cluster_01544 | 133 | 392.7 | GLUTARATE | [M+H]+ | -0.000176 | 3.733 |
469 | Cluster_01596 | 138.1 | 582.2 | 4-AMINOBENZOATE | [M+H]+ | 0.0001774 | -2.791 |
469.1 | Cluster_01596 | 138.1 | 582.2 | ANTHRANILATE | [M+H]+ | 0.0001774 | -2.791 |
\(MS^2\) annotation
In addition to \(MS^1\) annotation
described in the previous section, it is also possible to annotate
features (or chromatographic peaks) by comparing experimentally measured
\(MS^2\) (MS/MS) spectra against
reference spectra. Such annotation can be performed using the
matchSpectra
from the MetaboAnnotation
package
that takes Spectra
objects (with \(MS^2\) spectra) as query and target
parameters. Such Spectra
objects can be either imported
from a variety of input formats, including MGF, MSP, mzML, mzXML or
netCDF files, extracted from xcms
result objects or even
retrieved from databases such as MassBank (see also the
tutorials from SpectraTutorials
for more examples and use cases).
To illustrate this, we will extract in this section \(MS^2\) spectra for the annotated chromatographic peaks from the previous section and compare them against reference \(MS^2\) spectra from HMDB. Annotation reliability can be increased if, in addition to m/z and retention time, also the \(MS^2\) spectra for a chromatographic peak or a feature matches the reference spectrum from a pure standard.
We first extract all \(MS^2\)
spectra for the annotated chromatographic peaks using the
chromPeakSpectra
function from the xcms
result
object. This will return all \(MS^2\)
spectra with the precursor m/z and retention time within the
m/z ranges and retention time ranges of the detected
chromatographic peaks. For an xcms
result object containing
also correspondence results (i.e. grouped chromatographic peaks across
samples, also referred to as features) the
featureSpectra
function could be used instead.
std_spectra <- chromPeakSpectra(std_data, return.type = "Spectra",
peaks = pks_match$peak_id)
std_spectra
## MSn data (Spectra) with 11 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## F1.S107 2 35.650 107
## F1.S193 2 66.622 193
## F1.S224 2 78.600 224
## F1.S127 2 38.323 127
## F1.S159 2 50.239 159
## F1.S581 2 184.743 581
## F1.S649 2 197.112 649
## F1.S548 2 175.749 548
## F1.S600 2 188.408 600
## F1.S707 2 209.046 707
## F1.S760 2 221.347 760
## ... 38 more variables/columns.
##
## file(s):
## HighIS_Mix07_CE20_POS.mzML
For some of the chromatographic peaks we have more than one \(MS^2\) spectrum, but in total we have \(MS^2\) spectra only for 7 chromatographic peaks. Below we count the number of spectra per peak.
table(std_spectra$peak_id)
##
## CP067 CP074 CP080 CP143 CP154 CP162 CP176
## 1 2 2 2 1 1 2
Before matching the spectra we are cleaning our experimental spectra removing peaks with an intensity lower than 5% of a spectra’s highest peak intensity and removing spectra with less than 3 peaks.
#' Define a function to remove low intensity peaks
low_int <- function(x, ...) {
x > max(x, na.rm = TRUE) * 0.05
}
#' Remove peaks with an intensity below 5% of BPI
std_spectra <- filterIntensity(std_spectra, intensity = low_int)
#' Remove peaks with less than 3 peaks
std_spectra <- std_spectra[lengths(std_spectra) > 2]
std_spectra
## MSn data (Spectra) with 8 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## F1.S107 2 35.650 107
## F1.S127 2 38.323 127
## F1.S159 2 50.239 159
## F1.S581 2 184.743 581
## F1.S649 2 197.112 649
## F1.S548 2 175.749 548
## F1.S600 2 188.408 600
## F1.S707 2 209.046 707
## ... 38 more variables/columns.
##
## file(s):
## HighIS_Mix07_CE20_POS.mzML
## Lazy evaluation queue: 1 processing step(s)
## Processing:
## Remove peaks based on their intensities and a user-provided function in spectra of MS level(s) 2. [Tue Dec 5 09:03:15 2023]
In addition, we scale the peak intensities within each spectrum to values between 0 and 100. This is mostly for visualization reason, since most spectra similarity scoring algorithms are independent of absolute peak intensities.
#' Define a function to *scale* the intensities
scale_int <- function(x, ...) {
maxint <- max(x[, "intensity"], na.rm = TRUE)
x[, "intensity"] <- 100 * x[, "intensity"] / maxint
x
}
#' *Apply* the function to the data
std_spectra <- addProcessing(std_spectra, scale_int)
Below we match now these cleaned spectra against the
reference spectra from HMDB. We configure the matching using the
CompareSpectraParam
allowing m/z differences of
50ppm between peaks from a query and a target spectrum. Since HMDB does
not provide precursor m/z we in addition have to disable the
additional matching filter (i.e. set
requirePrecursor = FALSE
), which has however a negative
impact on the performance of the comparisons. By default, spectra
similarities (calculated with the normalized dot product) higher than
0.7 will be considered matching.
spectra_match <- matchSpectra(
std_spectra, Spectra(cdb),
param = CompareSpectraParam(ppm = 50, requirePrecursor = FALSE))
spectra_match
## Object of class MatchedSpectra
## Total number of matches: 27
## Number of query objects: 8 (4 matched)
## Number of target objects: 64920 (27 matched)
Thus, in total 4 of the query spectra were matched with 27. Below we create a result table for this matching.
res <- spectraData(spectra_match,
c("peak_id", "target_compound_id",
"target_name", "score"))
rownames(res) <- NULL
pandoc.table(as.data.frame(res),
style = "rmarkdown", split.tables = Inf)
peak_id | target_compound_id | target_name | score |
---|---|---|---|
CP067 | HMDB0001847 | Caffeine | 0.7258 |
CP067 | HMDB0001847 | Caffeine | 0.705 |
CP067 | HMDB0001847 | Caffeine | 0.7016 |
CP067 | HMDB0001847 | Caffeine | 0.7046 |
CP080 | NA | NA | NA |
CP080 | HMDB0253992 | Laurocapram | 0.7589 |
CP080 | HMDB0253992 | Laurocapram | 0.7789 |
CP143 | NA | NA | NA |
CP143 | HMDB0000479 | 3-Methylhistidine | 0.8739 |
CP143 | HMDB0000479 | 3-Methylhistidine | 0.8739 |
CP154 | NA | NA | NA |
CP162 | NA | NA | NA |
CP176 | HMDB0000034 | Adenine | 0.7165 |
CP176 | HMDB0000034 | Adenine | 0.7094 |
CP176 | HMDB0000034 | Adenine | 0.7092 |
CP176 | HMDB0000034 | Adenine | 0.7133 |
CP176 | HMDB0000034 | Adenine | 0.7165 |
CP176 | HMDB0000034 | Adenine | 0.7123 |
CP176 | HMDB0000034 | Adenine | 0.7165 |
CP176 | HMDB0000034 | Adenine | 0.7165 |
CP176 | HMDB0000034 | Adenine | 0.7139 |
CP176 | HMDB0000050 | Adenosine | 0.7067 |
CP176 | HMDB0000050 | Adenosine | 0.7165 |
CP176 | HMDB0000101 | Deoxyadenosine | 0.7165 |
CP176 | HMDB0000905 | Deoxyadenosine monophosphate | 0.9523 |
CP176 | HMDB0000905 | Deoxyadenosine monophosphate | 0.8143 |
CP176 | HMDB0000905 | Deoxyadenosine monophosphate | 0.7407 |
CP176 | HMDB0000905 | Deoxyadenosine monophosphate | 0.9136 |
CP176 | HMDB0000905 | Deoxyadenosine monophosphate | 0.9523 |
CP176 | HMDB0000905 | Deoxyadenosine monophosphate | 0.9524 |
CP176 | HMDB0001983 | 5’-Deoxyadenosine | 0.7086 |
Matching results can also be very easily visualized with the
plotSpectraMirror
function. Below we show the results for
the first peak (Caffeine). The upper spectrum represents the query, the
lower the target (reference) spectrum.
plotSpectraMirror(spectra_match[1])
The results seem to be reasonable. For the second peak (third spectrum) the match was based however on a single peak.
plotSpectraMirror(spectra_match[3])
Note that we could now also add some of the (cleaned and annotated)
\(MS^2\) spectra from the present test
set, to the IonDb
using the insertSpectra
function. This enables generation of lab-internal reference databases
including also MS/MS spectra.
Working with data from other tools
\(MS^1\)-based annotations can be
performed on any data.frame
or similar that provides \(m/z\) and/or retention time values. For
\(MS^2\)-based annotations a
Spectra
object would be needed, and that can be imported
from a variety of formats. In this section we show how pre-processing
results from an external tool (in this case MZmine) can be imported and used as
input for MetaboAnnotation
.
The test data we are using is one of the example data sets for GNPS Feature-Based Molecular Networking (Nothias et al. 2020). A description on how this data was generated is provided in the GNPS Feature-Based Molecular Networking documentation. The data set consists of feature abundances (along with retention time and m/z values) and \(MS^2\) spectra for these features. These example files are distributed through the github repository of the GNPS documentation https://github.com/CCMS-UCSD/GNPSDocumentation.git.
Below we first download the feature table (in csv format) from the above mentioned repository to a temporary file.
f_file <- "MZmine-GNPS_AG_test_featuretable.csv"
url <- paste0("https://raw.githubusercontent.com/CCMS-UCSD/GNPSDocumentation",
"/master/docs/tutorials/AG_tutorial_files/")
curl_download(paste0(url, f_file),
destfile = file.path(tempdir(), f_file))
We next load this file into R using the base read.csv
function.
We can now use this file as an input for \(MS^1\)-based annotation, as shown in
section Annotation using m/z values above,
and match the m/z values of the features against HMDB assuming
[M+H]+
and [M+Na]+
adducts. With parameter
mzColname
we define the column in the input file containing
the m/z values. For MZmine this is column "row m/z"
.
param <- Mass2MzParam(adducts = c("[M+H]+", "[M+Na]+"), ppm = 10)
f_match <- matchMz(
ftable,
compounds(cdb, c("compound_id", "exactmass", "formula", "name")),
param = param,
mzColname = "row m/z")
f_match
## Object of class Matched
## Total number of matches: 31333
## Number of query objects: 3443 (2251 matched)
## Number of target objects: 217776 (13194 matched)
For 2251 of the in total 3443 features a compound matching the feature’s m/z was found.
\(MS^2\) spectra can be exported from MZmine in mgf file format. Below we download the MGF file with all \(MS^2\) spectra for the features from the example above to a temporary folder.
s_file <- "MZmine-GNPS_AG_test_GNPS.mgf"
curl_download(paste0(url, s_file),
destfile = file.path(tempdir(), s_file))
Files in MGF format can be imported with the MsBackendMgf
package. Below we use this package to create a Spectra
object with all spectra from the example file.
library(MsBackendMgf)
sps_mgf <- Spectra(file.path(tempdir(), s_file),
source = MsBackendMgf())
By default all data fields from the original MGF file are imported.
Thus, the Spectra
object contains also a variable called
FEATURE_ID
with the identifiers for the features the
spectra are associated with.
head(sps_mgf$FEATURE_ID)
## [1] "1" "2" "3" "4" "6" "7"
This Spectra
object could then simply be used as in
input parameter to the matchSpectra
function as shown in
the \(MS^2\) annotation
section above to e.g. identify matches with reference spectra from
HMDB.
Package repositories and additional tutorials
- Tutorial with additional examples and explanations for \(MS^2\)-based annotations: https://jorainer.github.io/SpectraTutorials/
- Repository of the
MsCoreUtils
package: https://rformassspectrometry.github.io/MsCoreUtils/ - Repository of the
MetaboCoreUtils
package: https://rformassspectrometry.github.io/MetaboCoreUtils/ - Repository of the
Spectra
package: https://rformassspectrometry.github.io/Spectra/ - Repository of the
MetaboAnnotation
package: https://rformassspectrometry.github.io/MetaboAnnotation/ - Repository of the
CompoundDb
package: https://rformassspectrometry.github.io/CompoundDb/
Session information
The R version and versions of used packages are listed below.
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] MsBackendMgf_1.10.0 Spectra_1.12.0 MetaboCoreUtils_1.10.0
## [4] pander_0.6.5 MetaboAnnotation_1.6.1 CompoundDb_1.6.0
## [7] AnnotationFilter_1.26.0 curl_5.1.0 xcms_4.0.0
## [10] MSnbase_2.28.1 ProtGenerics_1.33.1 S4Vectors_0.40.2
## [13] mzR_2.37.1 Rcpp_1.0.11 Biobase_2.62.0
## [16] BiocGenerics_0.48.1 BiocParallel_1.36.0 knitr_1.45
## [19] BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_1.8.7
## [3] MultiAssayExperiment_1.28.0 magrittr_2.0.3
## [5] MALDIquant_1.22.1 rmarkdown_2.25
## [7] fs_1.6.3 zlibbioc_1.48.0
## [9] ragg_1.2.6 vctrs_0.6.5
## [11] multtest_2.58.0 memoise_2.0.1
## [13] RCurl_1.98-1.13 base64enc_0.1-3
## [15] htmltools_0.5.7 S4Arrays_1.2.0
## [17] progress_1.2.2 AnnotationHub_3.10.0
## [19] SparseArray_1.2.2 mzID_1.40.0
## [21] sass_0.4.7 bslib_0.6.1
## [23] htmlwidgets_1.6.3 desc_1.4.2
## [25] plyr_1.8.9 impute_1.76.0
## [27] cachem_1.0.8 igraph_1.5.1
## [29] mime_0.12 lifecycle_1.0.4
## [31] iterators_1.0.14 pkgconfig_2.0.3
## [33] Matrix_1.6-4 R6_2.5.1
## [35] fastmap_1.1.1 shiny_1.8.0
## [37] GenomeInfoDbData_1.2.11 MatrixGenerics_1.14.0
## [39] clue_0.3-65 digest_0.6.33
## [41] pcaMethods_1.94.0 rsvg_2.6.0
## [43] colorspace_2.1-0 AnnotationDbi_1.64.1
## [45] rprojroot_2.0.4 textshaping_0.3.7
## [47] GenomicRanges_1.54.1 RSQLite_2.3.3
## [49] filelock_1.0.2 fansi_1.0.5
## [51] httr_1.4.7 abind_1.4-5
## [53] compiler_4.3.2 bit64_4.0.5
## [55] doParallel_1.0.17 DBI_1.1.3
## [57] highr_0.10 MASS_7.3-60
## [59] MsExperiment_1.4.0 ChemmineR_3.54.0
## [61] rappdirs_0.3.3 DelayedArray_0.28.0
## [63] rjson_0.2.21 tools_4.3.2
## [65] interactiveDisplayBase_1.40.0 httpuv_1.6.12
## [67] glue_1.6.2 promises_1.2.1
## [69] QFeatures_1.12.0 grid_4.3.2
## [71] cluster_2.1.6 generics_0.1.3
## [73] gtable_0.3.4 preprocessCore_1.64.0
## [75] hms_1.1.3 xml2_1.3.5
## [77] utf8_1.2.4 XVector_0.42.0
## [79] BiocVersion_3.18.1 RANN_2.6.1
## [81] foreach_1.5.2 pillar_1.9.0
## [83] stringr_1.5.1 limma_3.58.1
## [85] later_1.3.1 robustbase_0.99-1
## [87] splines_4.3.2 dplyr_1.1.4
## [89] BiocFileCache_2.10.1 lattice_0.22-5
## [91] bit_4.0.5 survival_3.5-7
## [93] tidyselect_1.2.0 Biostrings_2.70.1
## [95] gridExtra_2.3 IRanges_2.36.0
## [97] SummarizedExperiment_1.32.0 xfun_0.41
## [99] statmod_1.5.0 matrixStats_1.1.0
## [101] DEoptimR_1.1-3 DT_0.30
## [103] stringi_1.8.2 lazyeval_0.2.2
## [105] yaml_2.3.7 evaluate_0.23
## [107] codetools_0.2-19 MsCoreUtils_1.14.1
## [109] tibble_3.2.1 BiocManager_1.30.22
## [111] cli_3.6.1 affyio_1.72.0
## [113] xtable_1.8-4 systemfonts_1.0.5
## [115] munsell_0.5.0 jquerylib_0.1.4
## [117] GenomeInfoDb_1.38.1 MassSpecWavelet_1.68.0
## [119] dbplyr_2.4.0 png_0.1-8
## [121] XML_3.99-0.16 parallel_4.3.2
## [123] ellipsis_0.3.2 blob_1.2.4
## [125] pkgdown_2.0.7.9000 ggplot2_3.4.4
## [127] prettyunits_1.2.0 bitops_1.0-7
## [129] MsFeatures_1.10.0 scales_1.3.0
## [131] affy_1.80.0 ncdf4_1.22
## [133] purrr_1.0.2 crayon_1.5.2
## [135] rlang_1.1.2 KEGGREST_1.42.0
## [137] vsn_3.70.0