Last modified: 2021-02-26 08:25:17
Compiled: Fri Feb 26 08:30:04 2021

Overview

Description

Mass spectrometry (MS) data is a key technology in modern proteomics and metabolomics experiments. Due to continuous improvements in MS instrumentation, the generated data can easily become very large. Also, additional resources of MS data exist, such as spectra libraries and databases, all with their own specific file formats and database systems that sometimes do not support manipulations of the original data.

Learning from experiences with the MSnbase Bioconductor package, the Spectra package was developed to provide an even more flexible and expandable infrastructure for MS data in R. This package implements a clear separation of user functionality from code to provide, store and import MS data. Different backends can hence be used that enable access to data from various resources or that are designed specifically for very large MS data sets. Data manipulations are by default not directly applied to the data but cached in a lazy evaluation queue which allows analyses also of read-only data representations.

This (instructor-led live demo) workshop shows the expandability of the new infrastructure to enable a seamless integration and analysis of MS data from a variety of input formats illustrated by a simple matching of experimental MS2 spectra against public spectral databases and export of the data in a format commonly used for exchange of MS2 data.

Pre-requisites

  • Basic familiarity with R and Bioconductor.
  • Basic understanding of Mass Spectrometry (MS) data.

Participation

  • Get the docker image of this tutorial with docker pull jorainer/spectra_tutorials:latest.

  • Start docker using

    docker run \
        -e PASSWORD=bioc \
        -p 8787:8787 \
        jorainer/spectra_tutorials:latest
  • Enter http://localhost:8787 in a web browser and log in with username rstudio and password bioc.

  • Open this R-markdown file (vignettes/analyzing-MS-data-from-different-sources-with-Spectra.Rmd) in the RStudio server version in the web browser and evaluate the R code blocks.

  • To get the source code: clone this github repository, e.g. with git clone https://github.com/jorainer/SpectraTutorials.

  • Optionally, to run also the code to import the MS2 spectra from HMDB the All Spectra Files (XML) archive from the hmdb downloads page has to be downloaded. The contents of the hmdb_all_spectra.zip archive should then be unzipped into the folder data/hmdb_all_spectra.

R/Bioconductor packages used

  • Spectra
  • MsCoreUtils

Other R packages not (yet) in Bioconductor:

Time outline

Time outline

Activity Time
Introduction (LC-MS/MS, Spectra package) 10min
MS data import and handling 5min
Data processing and manipulation 5min
Spectrum data comparison 5min
Comparing spectra against MassBank 10min
Data export 5min
(Comparing spectra against HMDB) (5min)

Workshop goals and objectives

Learning goals

  • Understand how to import MS data into R.
  • Understand the basic concept of backends in Spectra and how they can be used to work with MS data from various sources.

Learning objectives

  • Import and export MS data with Spectra.
  • Integrate MS data from different resources into an MS data analysis workflow.
  • Apply different data manipulations on MS data represented as a Spectra object.
  • Use Spectra to perform spectra matching in R.

Workshop

LC-MS/MS in a nutshell

  • Mass spectrometry (MS) instruments measure mass-to-charge ratios (m/z) of ions.
  • Most compounds are not charged, they need to be ionized first (with e.g. electro spray ionization (ESI).
  • MS is usually combined with another separation technique, such as liquid chromatography (LC). This adds another dimension: retention time (rt).

LC-MS setup

  • With LC-MS we measure features characterized by m/z and retention time - we still don’t know what molecule was actually measured.
  • Create in addition fragment (MS/MS) spectra from the ions to get some information about their structure.

CID-based fragmentation

  • Commonly used method: collision induced dissociation (CID). In a collision chamber filled with e.g. N2, ions get fragmented, a spectrum of these fragments is recorded.
  • Matching fragment spectra from an ion against a reference helps identifying the compound.

The Spectra package

The Spectra package implements a clear separation of user functionality from code to provide, store and read mass spectrometry data. Thus, different data or file format-specific backends can be implemented and directly plugged-in without affecting the way the user would access or analyze the data. This represents an extension to the in-memory and on-disk data modes already available in the MSnbase package that enabled either a fast data processing or an analysis of very large data sets by keeping only a limited amount of data in the computer’s memory (Gatto, Gibb, and Rainer 2020).

Spectra: separation into user functionality and data representation

In this workshop we will import MS data from mzML files, match the MS2 fragment spectra for one ion against MS2 spectra from public databases (i.e. MassBank and the Human Metabolom Database HMDB) and export the data as a MGF file. A different backend is used for each data import and export operation.

MS data import and handling

Below we import the MS data from the mzML files provided within this package. These files contain MSn data of a mix of 8 standard compounds (solved either in water or a pool of human serum samples) measured with a HILIC-based LC-MS/MS setup. MS2 data was generated by data dependent acquisition using two different collision energies. For data import and representation of these experimantal data we use the MsBackendMzR backend which supports import (and export) of data from the most common raw mass spectrometry file formats (i.e. mzML, mzXML and CDF).

library(Spectra)

#' Define the input files
fls <- dir(system.file("mzML", package = "SpectraTutorials"),
           full.names = TRUE)

#' Import the data
sps_all <- Spectra(fls, backend = MsBackendMzR())

The MS data is now represented by a Spectra object, which can be thought of as a data.frame with columns being the spectra variables (such as "rtime", i.e. the retention time) and rows the individual spectra. Each spectra variable can be accessed either via $ and its name or by using its dedicated access function (which is the preferred way). The spectraVariables function can be used to list all available variables within such a Spectra object.

#' List all available spectra variables (attributes)
spectraVariables(sps_all)
##  [1] "msLevel"                  "rtime"                   
##  [3] "acquisitionNum"           "scanIndex"               
##  [5] "dataStorage"              "dataOrigin"              
##  [7] "centroided"               "smoothed"                
##  [9] "polarity"                 "precScanNum"             
## [11] "precursorMz"              "precursorIntensity"      
## [13] "precursorCharge"          "collisionEnergy"         
## [15] "isolationWindowLowerMz"   "isolationWindowTargetMz" 
## [17] "isolationWindowUpperMz"   "peaksCount"              
## [19] "totIonCurrent"            "basePeakMZ"              
## [21] "basePeakIntensity"        "ionisationEnergy"        
## [23] "lowMZ"                    "highMZ"                  
## [25] "mergedScan"               "mergedResultScanNum"     
## [27] "mergedResultStartScanNum" "mergedResultEndScanNum"  
## [29] "injectionTime"            "filterString"            
## [31] "spectrumId"               "ionMobilityDriftTime"    
## [33] "scanWindowLowerLimit"     "scanWindowUpperLimit"

Below we access the retention times of the first spectra using either $rtime or the function rtime.

#' Access the spectras' retention time
head(sps_all$rtime)
## [1] 0.273 0.570 0.873 1.183 1.491 1.798
head(rtime(sps_all))
## [1] 0.273 0.570 0.873 1.183 1.491 1.798

Our Spectra object contains information from in total 3203 spectra from length(unique(dataOrigin(sps_all))) mzML files. By using the MsBackendMzR backend only general information about each spectrum is kept in memory resulting in a low memory footprint.

print(object.size(sps_all), units = "MB")
## 0.8 Mb

With a MsBackendMzR backend, intensity or m/z values are retrieved on demand from the original data files.

mz(sps_all)
## NumericList of length 3203
## [[1]] 50.2264681118888 52.9758613330164 ... 998.412321909748 999.815483842029
## [[2]] 50.0188439310345 51.0188982589394 ... 994.093212059943 998.65669267328
## [[3]] 50.0188439310345 51.0260087879784 ... 999.549654921888 999.904649454908
## [[4]] 50.0347997547605 51.0271913631291 ... 984.467433193755 998.839380503321
## [[5]] 50.1615393614339 51.0280734441081 ... 996.599267645346 999.81102566575
## [[6]] 50.0158524973623 50.2809808965161 ... 992.93057113729 999.650537939432
## [[7]] 50.0178467765376 51.02704020796 ... 999.217711368357 999.757528325686
## [[8]] 50.0158524973623 51.0257255950236 ... 993.780890433629 999.583671855333
## [[9]] 50.0158524973623 50.9727348052523 ... 999.508971233846 999.663911424618
## [[10]] 50.0158524973623 50.0264589058649 ... 997.312177169353 999.672827131107
## ...
## <3193 more elements>

We can also load the full data into memory by changing the backend from MsBackendMzR to MsBackendDataFrame. This does not affect the way we use the Spectra object itself: the same operations and functions are available, independently of the way the data is stored (i.e. which backend is used).

#' Change backend to a MsBackendDataFrame: load data into memory
sps_all <- setBackend(sps_all, MsBackendDataFrame())

The size of our Spectra object is now larger, since the full data has been loaded into memory.

print(object.size(sps_all), units = "MB")
## 35.9 Mb

At last we subset our data to MS2 spectra with a precursor ion that matches the m/z of the [M+H]+ ion of the metabolite Cystine (accepting a difference in m/z of 10 parts-per-million (ppm)).

#' Define the m/z ratio for the [M+H]+ ion of Cystine
mz <- 241.0311

#' Subset the dataset to MS2 spectra matching the m/z
sps <- filterPrecursorMz(sps_all, mz = mz + ppm(c(-mz, mz), 10))
sps
## MSn data (Spectra) with 11 spectra in a MsBackendDataFrame backend:
##      msLevel     rtime scanIndex
##    <integer> <numeric> <integer>
## 1          2   209.936       673
## 2          2   220.072       714
## 3          2   231.604       734
## 4          2   211.481       687
## 5          2   224.634       718
## 6          2   236.326       731
## 7          2   215.089       761
## 8          2   225.739       781
## 9          2   240.020       804
## 10         2   212.161       771
## 11         2   224.662       801
##  ... 33 more variables/columns.
## Processing:
##  Switch backend from MsBackendMzR to MsBackendDataFrame [Fri Feb 26 08:30:11 2021]
##  Filter: select spectra with a precursor m/z within [241.028689689, 241.033510311] [Fri Feb 26 08:30:11 2021]

In total 11 spectra matched our target precursor m/z.

Data processing and manipulation

The plotSpectra function can be used to visualize spectra. Below we plot the first spectrum from our data subset.

#' Plot the first spectrum
plotSpectra(sps[1])

This raw MS2 spectrum contains many very low abundance peaks, most likely representing noise. Thus we next filter the spectra removing all peaks with an intensity smaller than 5% of the maximum intensity of each spectrum (i.e. the base peak intensity). To this end we define a function that takes intensity values from each spectrum and returns a logical value whether the peak should be kept (TRUE) or not (FALSE). This function is then passed to the filterIntensity function.

#' Define a filtering function
low_int <- function(x, ...) {
    x > max(x, na.rm = TRUE) * 0.05
}
#' Apply the function to filter the spectra
sps <- filterIntensity(sps, intensity = low_int)

After filtering, the spectra are cleaner:

#' Plot the first spectrum after filtering
plotSpectra(sps[1])

In addition we normalize each spectrum replacing the absolute intensity values with values relative to the spectrum’s maximum intensity (which is set to 100). For this operation we also define a function which takes a peak matrix as input and returns a matrix with the same dimensions. The peak matrix is the two-column matrix with m/z (first column) and intensity values (second column) of each peak of a spectrum. This function is then passed with parameter FUN to the addProcessing function which allows to apply any user-defined function to the peak matrix of each spectrum in a Spectra object.

#' Define a function to *normalize* the intensities
norm_int <- function(x, ...) {
    maxint <- max(x[, "intensity"], na.rm = TRUE)
    x[, "intensity"] <- 100 * x[, "intensity"] / maxint
    x
}
#' *Apply* the function to the data
sps <- addProcessing(sps, norm_int)

To show the effect of the normalization we extract the intensities of the first spectrum:

#' Get the intensities after normalization
intensity(sps)[[1]]
## [1]  13.439402  93.631648  49.287979 100.000000   9.422415   6.581111  12.969807
## [8]  54.633224

The intensity values are now all between 0 and 100. Note that all these data manipulations (intensity filtering and normalization) did not change the original m/z and intensity values. Data manipulation operations are cached by default in the lazy evaluation queue of the Spectra object and applied to the data on-the-fly each time m/z or intensity values are accessed. This ensures that the same data manipulations can be used for any type of backend, even if the data resource is read-only (e.g. if the data is retrieved on the fly from mzML files).

This mechanism enables us also to undo cached data manipulations with the reset function:

#' Remove any processing steps
sps_orig <- reset(sps)
head(intensity(sps_orig)[[1]])
## [1]  10 163  41  20  31 210

Spectrum data comparison

We next perform a pairwise comparison of the spectra using the dot product as similarity measure. Prior to the actual similarity calculation, the peaks of the individual spectra have to be matched against each other (i.e. it has to be determined which peak from one spectrum correspond to which from the other spectrum based on their mass-to-charge ratios). We specify ppm = 20 so that peaks with a difference in m/z smaller than 20ppm will be considered matching.

#' Pairwise comparison of all spectra
cormat <- compareSpectra(sps, ppm = 20)

The pairwise spectra similarities are represented with the heatmap below (note that RStudio in the docker might crash by the pheatmap call - to avoid this add filename = "hm.pdf" to the heatmap call).

library(pheatmap)
hm <- pheatmap(cormat)

The 11 spectra appear to be grouping into 3 clusters, which are in fact related to the collision energy used for the fragmentation (see below; the collision energy is encoded in the file name as CE20 and CE30). We thus use the cutree function to cut the tree (dendrogram) into 3 clusters. Subsequently we reduce our dataset to the cluster with the spectra generated with a collision energy of 20eV.

#' Get the cluster-assignment of the spectra
cl <- cutree(hm$tree_row, 3)
cl
##  [1] 1 1 1 2 2 3 1 1 1 2 3
#' From which file are they?
split(basename(dataOrigin(sps)), cl)
## $`1`
## [1] "HighIS_Mix06_CE20_POS.mzML"    "HighIS_Mix06_CE20_POS.mzML"   
## [3] "HighIS_Mix06_CE20_POS.mzML"    "QC_HighIS_Mix06_CE20_POS.mzML"
## [5] "QC_HighIS_Mix06_CE20_POS.mzML" "QC_HighIS_Mix06_CE20_POS.mzML"
## 
## $`2`
## [1] "HighIS_Mix06_CE30_POS.mzML"    "HighIS_Mix06_CE30_POS.mzML"   
## [3] "QC_HighIS_Mix06_CE30_POS.mzML"
## 
## $`3`
## [1] "HighIS_Mix06_CE30_POS.mzML"    "QC_HighIS_Mix06_CE30_POS.mzML"
#' Select spectra from one collision energy
sps_ce20 <- split(sps, cutree(hm$tree_row, 3))[[1L]]

Comparing spectra against MassBank

Although the precursor m/z of our spectra matches the m/z of Cystine, we can still not exclude that they might represent fragmentations of ions from a different compound (i.e. that would have the same precursor m/z).

Matching experimental spectra against a public spectral library can be used as a first step in the identification process. Several (public) spectral libraries for small molecules are available, such as:

For some of these databases MsBackend interfaces are already implemented allowing inclusion of their data directly into R-based analysis workflows. Access to MassBank data is for example possible with the MsBackendMassbank package. This package provides the MsBackendMassbank for import/export of MassBank files as well as the MsBackendMassbankSql backend that directly interfaces the MassBank MySQL database.

Below we load the MsBackendMassbank package and connect to a local installation of the MassBank MySQL database (release 2021.02 which is provided within the docker image of this tutorial).

library(RMariaDB)
library(MsBackendMassbank)

#' Connect to the MassBank MySQL database
con <- dbConnect(MariaDB(), user = "massbank", dbname = "MassBank",
                 host = "localhost", pass = "massbank")

We can now initialize a Spectra object with a MsBackendMassbankSql backend to access all the data in MassBank.

#' Access the spectra data in MassBank
mbank <- Spectra(con, source = MsBackendMassbankSql())
mbank
## MSn data (Spectra) with 88168 spectra in a MsBackendMassbankSql backend:
##      msLevel precursorMz  polarity
##    <integer>   <numeric> <integer>
## 1          2     506.000         0
## 2         NA          NA         1
## 3         NA          NA         0
## 4         NA          NA         1
## 5         NA          NA         0
## 6          2     449.380         1
## 7          2     426.022         0
## 8          2     131.060         0
## 9          2     183.170         1
## 10         2     358.270         0
##  ... 42 more variables/columns.
##  Use  'spectraVariables' to list all of them.

The Spectra object mbank represents now the MS data from the MassBank database with in total length(mbank) spectra. In fact, the mbank object does only contain the primary keys of the spectra but no MS data. Hence, it’s size in memory is only relatively small:

print(object.size(mbank), units = "MB")
## 6.1 Mb

Any operation on this Spectra object will load the requested data from the database on-the-fly.

We can now compare our experimental spectra against the reference spectra from MassBank. Because loading data from the database takes some time, we first screen for spectra that have a peak matching the precursor m/z (even this operation takes ~ 30 seconds to finish).

#' Identify spectra that contain a peak matching the m/z of Cystine
has_mz <- containsMz(mbank, mz = mz, ppm = 20)

Note that, to improve performance, we could also load all the spectra data into memory by simply changing the backend with setBackend to a MsBackendDataFrame (as we did with the experimental spectra data above).

In total 217 spectra contain a peak with the required m/z and we can proceed to calculate spectral similarities between our experimental spectra and this subset from MassBank. compareSpectra will calculate the similarity between each spectrum from sps_ce20 and each spectrum from mbank_with_mz and returns this as a numeric matrix with the similarity scores (number of rows equal to the number of spectra in sps_ce20 and number of columns equal to the number of spectra in mbank_with_mz - element res[3, 15] would thus represent the similarity between experimental spectrum 3 to MassBank spectrum 15).

#' Subset the MassBank Spectra
mbank_with_mz <- mbank[has_mz]

#' Compare experimental spectra against MassBank
res <- compareSpectra(sps_ce20, mbank_with_mz, ppm = 20)

The highest similarity between our spectra and the spectra from MassBank is 0.9236822. Below we indentify the best matching spectrum and access its intensity values. With parameter arr.ind = TRUE which returns an integer of length 2, the first element representing the row in res, the second the column.

#' Identify the best-matching pair
idx <- which(res == max(res), arr.ind = TRUE)
idx
##      row col
## [1,]   4   5
intensity(mbank_with_mz[idx[2]])
## NumericList of length 1
## [["PB000446"]] 50.369 88.328 45.922 37.428 ... 950.368 69.414 130.733 2338.193

Absolute intensities are reported for the MassBank spectrum, but for better visualizations we would like to normalize them the same way we did with our experimental spectra. As a side note it should also be mentioned that the used dot product function for spectra similarity is independent of absolute intensity values, thus, it did not matter if we performed the spectra comparisons on absolute or relative intensities.

The above described lazy evaluation queue of Spectra objects allows us to perform data manipulations on the MassBank data without actually modifying the original data. We thus normalize the MassBank spectra with the same function we used for the experimental spectra.

#' *Normalize* the MassBank data
mbank_with_mz <- addProcessing(mbank_with_mz, norm_int)

Below we can then compare the two best matching spectra with a mirror plot, in the upper panel showing our experimental spectrum and in the lower panel the best matching MS2 spectrum from MassBank. Plotting functions in Spectra are highly customizable and in the example below we add the m/z for each individual peak as an annotation to it if the intensity of the peak is higher than 5.

#' Specifying a function to draw peak labels
label_fun <- function(x) {
    ints <- unlist(intensity(x))
    mzs <- format(unlist(mz(x)), digits = 4)
    mzs[ints < 5] <- ""
    mzs
}
plotSpectraMirror(sps_ce20[idx[1]], mbank_with_mz[idx[2]], tolerance = 0.2,
                  labels = label_fun, labelPos = 2, labelOffset = 0.2,
                  labelSrt = -30)
grid()

Our experimental spectrum nicely matches the reference MS2 in MassBank. We next want to know which compound this spectrum actually represents. Spectra objects can have arbitrarily many additional annotation fields (so called spectra variables) and we can use the spectraVariables function to list all of them which are available in a specific Spectra object.

#' What variables are available in MassBank
spectraVariables(mbank_with_mz)
##  [1] "msLevel"                 "rtime"                  
##  [3] "acquisitionNum"          "scanIndex"              
##  [5] "dataStorage"             "dataOrigin"             
##  [7] "centroided"              "smoothed"               
##  [9] "polarity"                "precScanNum"            
## [11] "precursorMz"             "precursorIntensity"     
## [13] "precursorCharge"         "collisionEnergy"        
## [15] "isolationWindowLowerMz"  "isolationWindowTargetMz"
## [17] "isolationWindowUpperMz"  "spectrum_id"            
## [19] "spectrum_name"           "date"                   
## [21] "authors"                 "license"                
## [23] "copyright"               "publication"            
## [25] "splash"                  "compound_id"            
## [27] "adduct"                  "ionization"             
## [29] "ionization_voltage"      "fragmentation_mode"     
## [31] "collision_energy_text"   "instrument"             
## [33] "instrument_type"         "formula"                
## [35] "exactmass"               "smiles"                 
## [37] "inchi"                   "inchikey"               
## [39] "cas"                     "pubchem"                
## [41] "synonym"                 "precursor_mz_text"      
## [43] "compound_name"

In fact we get, in addition to spectra specific information like the instrument on which it was measured or the ionization voltage used, we get also information on the compound such as its name ("compound_name"), its chemical formula ("formula") or its INChI key ("inchikey"). Thus, we have all relevant information already in the Spectra object. Below we select the best matching spectrum from MassBank and display its associated compound name.

mbank_best_match <- mbank_with_mz[idx[2]]
mbank_best_match$compound_name
## [1] "Cystine"

Thus, the best matching spectrum is in fact a MS2 spectrum of Cystine. Below we add the name and the chemical formula for the best matching spectrum to our experimental spectra. We also set the collision energy for these spectra to 20eV and assign the ion/adduct of Cystine from which the reference spectrum was created.

#' Add annotations to the experimental spectra
sps_ce20$name <- mbank_best_match$compound_name
sps_ce20$formula <- mbank_best_match$formula
sps_ce20$adduct <- mbank_best_match$adduct
sps_ce20$collisionEnergy <- 20

Data export

At last we want to export our spectra to a file in MGF format. For this we use the MsBackendMgf R package which provides the MsBackendMgf backend that adds support for MGF file import/export to Spectra objects.

Data from Spectra objects can generally be exported with the export function. The format in which the data is exported depends on the specified MsBackend class. By using an instance of MsBackendMgf we can write below the data to a file in MGF format.

library(MsBackendMgf)

#' Export the spectra to a MGF file
export(sps_ce20, backend = MsBackendMgf(), file = "Cystine_ce20.mgf")

Comparing spectra against HMDB

In addition to the MsBackendMassbank, which provides access to MassBank data, there is also the MsBackendHmdb package supporting spectral data from the public Human Metabolome Database (HMDB). This package does however not yet provide direct access to the HMDB database but, through the MsBackendHmdbXml backend, allows to import MS2 spectra files in HMDB format. These are provided by HMDB as individual xml files in a custom file format which are bundled (and can hence be downloaded) in a single archive.

To reproduce the following code it is expected (as detailed in the Installation section) that all xml files from HMDB are available in a folder data/hmdb_all_spectra. Below we identify all xml files containing the key word "ms_ms" in their file name and load them into a Spectra object using the MsBackendHmdbXml backend. Note that this import operation from the ~ 500,000 individual xml files takes up to ~ 2 hours to finish.

library(MsBackendHmdb)

#' Get all MS2 spectra xml files and import data
fls <- dir("data/hmdb_all_spectra/", full.names = TRUE, pattern = "ms_ms")
hmdb <- Spectra(fls, source = MsBackendHmdbXml(), nonStop = TRUE)

With this we have now a Spectra object containing all MS2 spectra from HMDB. Note that with the MsBackendHmdbXml all spectra data is kept in memory.

hmdb
## MSn data (Spectra) with 458963 spectra in a MsBackendHmdbXml backend:
##          msLevel     rtime scanIndex
##        <integer> <numeric> <integer>
## 1              2        NA        NA
## 2              2        NA        NA
## 3              2        NA        NA
## 4              2        NA        NA
## 5              2        NA        NA
## ...          ...       ...       ...
## 458959         2        NA        NA
## 458960         2        NA        NA
## 458961         2        NA        NA
## 458962         2        NA        NA
## 458963         2        NA        NA
##  ... 21 more variables/columns.

Also here, to avoid comparing our experimental spectra against all these ~500,000 spectra, we first determine with the containsMz function which of the HMDB spectra contain a peak matching the m/z of our ion of interest. We have to use a rather large tolerance value (which defines the maximal acceptable absolute difference in m/z values) since some of the experimental spectra in HMDB seem to be recorded by not well calibrated instruments.

#' Identify spectra containing a peak matching Cystine m/z
has_mz <- containsMz(hmdb, mz = mz, tolerance = 0.2)

In total 46772 spectra contain a peak with the required m/z (+/- 0.2 Dalton) and we can proceed to calculate spectral similarities between our experimental spectra and this subset from HMDB.

#' Subset HMDB
hmdb_with_mz <- hmdb[has_mz]

#' Compare experimental spectra against HMDB
res <- compareSpectra(sps_ce20, hmdb_with_mz, tolerance = 0.2)

The highest similarity between our spectra and the spectra from HMDB is r max(res). Below we compare the two best matching spectra with a mirror plot, in the upper panel showing our experimental spectrum and in the lower panel the best matching MS2 spectrum from HMDB.

idx <- which(res == max(res), arr.ind = TRUE)
## Specifying a function to draw peak labels
label_fun <- function(x) {
    format(unlist(mz(x)), digits = 4)
}
plotSpectraMirror(sps_ce20[idx[1]], hmdb_with_mz[idx[2]], tolerance = 0.2,
                  labels = label_fun, labelPos = 2, labelOffset = 0.2,
                  labelSrt = -30)
grid()

Our experimental spectrum seems to nicely match the reference MS2 in HMDB. Below we extract the compound identifier from the best matching HMDB spectrum (stored in a spectra variable called "compound_id")

hmdb_with_mz[idx[2]]$compound_id
## [1] "HMDB0000192"

In fact, the matching spectrum from HMDB is an experimental spectrum for L-Cystine.

Summary

With the simple use case of matching experimental MS2 spectra against a public database we illustrated in this short tutorial the flexibility and expandability of the Spectra package that enables the seamless integration of mass spectrometry data from different sources. This was only possible with a clear separation of the user functionality (Spectra object) from the representation of the data (MsBackend object). Backends such as the MsBackendMgf, the MsBackendMassbank or the MsBackendHmdbXml can provide support for additional data formats or data sources, while others, due to their much lower memory footprint (MsBackendMzR, MsBackendHdf5Peaks), enable the analysis of also very large data sets. Most importantly however, these backends are interchangeable and do not affect the way users can handle and analyze MS data with the Spectra package.

References

Gatto, Laurent, Sebastian Gibb, and Johannes Rainer. 2020. MSnbase, Efficient and Elegant R-Based Processing and Visualization of Raw Mass Spectrometry Data.” Journal of Proteome Research, September. https://doi.org/10.1021/acs.jproteome.0c00313.

  1. Institute for Biomedicine, Eurac Research, Bolzano, Italy; ↩︎

  2. Research Unit Analytical BioGeoChemistry, Helmholtz Zentrum München and Chair of Analytical Food Chemistry, TUM School or Life Sciences, Technical University of Munich, Germany↩︎

  3. Department of Anaesthesiology and Intensive Care, University Medicine Greifswald, Germany↩︎

  4. Computational Biology Unit, de Duve Institute, UCLouvain, Brussels, Belgium↩︎