Seamless Integration of Mass Spectrometry Data from Different Sources with the `Spectra` Package

Last modified: 2024-08-07 13:59:30.28495
Compiled: Wed Aug 7 14:04:19 2024

Overview

Description

Mass spectrometry (MS) data is a key technology in modern proteomics and metabolomics experiments. Due to continuous improvements in MS instrumentation, the generated data can easily become very large. Also, additional resources of MS data exist, such as spectra libraries and databases, all with their own specific file formats and database systems that sometimes do not support manipulations of the original data.

Learning from experiences with the MSnbase Bioconductor package, the Spectra package was developed to provide an even more flexible and expandable infrastructure for MS data in R. This package implements a clear separation between the user interface and the code to provide, store and import MS data. The latter is defined by the MsBackend interface which allows implementation of data type, format or storage-dependent backends. Backends can thus be implemented for specific file types and data resources or different ways to represent MS data (e.g. in memory or on-disk data representations as described in (Gatto, Gibb, and Rainer 2020)). They are also supposed to be interchangeable hence allowing the user to switch backends without affecting the analysis. To enable processing of also very large MS data sets, data manipulations are by default not directly applied to the data but cached in a lazy evaluation queue which allows analyses also of read-only data representations.

This workshop shows the expandability of the new infrastructure to enable a seamless integration and analysis of MS data from a variety of input formats. This is illustrated by a simple comparison and matching of experimental MS2 spectra against public spectral databases and export of the data in a format commonly used for exchange of MS2 data.

Pre-requisites

Basic familiarity with R and Bioconductor.
Basic understanding of Mass Spectrometry (MS) data.

Installation and Participation

The present version of the tutorial bases on package versions available through the Bioconductor release 3.19.
Get the docker image of this tutorial with docker pull jorainer/spectra_tutorials:RELEASE_3_19.

Start docker using

docker run \
    -e PASSWORD=bioc \
    -p 8787:8787 \
    jorainer/spectra_tutorials:RELEASE_3_19

Enter http://localhost:8787 in a web browser and log in with username rstudio and password bioc.
Open this R-markdown file (vignettes/analyzing-MS-data-from-different-sources-with-Spectra.Rmd) in the RStudio server version in the web browser and evaluate the R code blocks.

R/Bioconductor packages used

Time outline

Activity	Time
Introduction (LC-MS/MS, `Spectra` package)	10min
MS data import and handling	5min
Data processing and manipulation	5min
Spectrum data comparison	5min
Comparing spectra against MassBank	10min
Data export	5min
(Comparing spectra against HMDB)	(5min)
(An alternative way to store MS data)	(5min)

Workshop goals and objectives

Learning goals

Understand how to import MS data into R.
Understand the basic concept of backends in Spectra and how they can be used to work with MS data from various sources.

Learning objectives

Import and export MS data with Spectra.
Integrate MS data from different resources into an MS data analysis workflow.
Apply different data manipulations on MS data represented as a Spectra object.
Use Spectra to perform spectra comparisons in R.

Workshop

LC-MS/MS in a nutshell

Mass spectrometry (MS) instruments measure mass-to-charge ratios (m/z) and abundances of ions. The resulting m/z and intensity values are stored/represented as a spectrum.
Most compounds are not charged, they need to be ionized first (with e.g. electro spray ionization (ESI)).
MS is usually combined with another separation technique, such as liquid chromatography (LC). This adds another dimension to the data: retention time (rt). LC-MS data is represented by multiple spectra, each spectrum with its own retention time.

LC-MS setup

With LC-MS we measure features characterized by m/z and retention time - we still don’t know what molecule was actually measured.
Create in addition fragment (MS/MS) spectra from the ions to get some information about their structure.

CID-based fragmentation

Commonly used method: collision induced dissociation (CID). In a collision chamber filled with e.g. N2, ions get fragmented and a spectrum of these fragments is recorded (Steckel and Schlosser 2019).
Comparing and matching such fragment spectra against a reference helps identifying the compound.

The `Spectra` package

Purpose: provide an expandable, well tested and user-friendly infrastructure for mass spectrometry (MS) data.
Design: separation between user interface (functions) and code to provide, store and read MS data.
Advantage: support different ways to store and handle the data without affecting the user functionality: all functionality of a Spectra object is independent of where the data is, either in-memory, on-disk, locally or remote.
Expandable: through implementation of additional backend classes (extending the MsBackend virtual class) it is possible to add support for additional input file formats or data representations without needing to change the Spectra package or object.

Spectra: separation into user functionality and data representation

In this workshop we will:

import MS data from mzML files,
select MS2 spectra for a certain compound,
compare and match the MS2 spectra against reference MS2 spectra from a public database,
annotate the spectra and export them to a file in MGF format.

MS data import and handling

Below we import the MS data from the mzML files provided within this package. These files contain MSn data of a mix of 8 standard compounds (added either to water or a pool of human serum samples) measured with a HILIC-based LC-MS/MS setup. MS2 data was generated by data dependent acquisition using a collision energy of 20eV. For data import and representation of these experimental data we use the MsBackendMzR backend which supports import (and export) of data from the most common raw mass spectrometry file formats (i.e. mzML, mzXML and CDF).

library(Spectra)

#' Define the input files
fls <- dir(system.file("mzML", package = "SpectraTutorials"),
           full.names = TRUE)

#' Import the data
sps_all <- Spectra(fls, source = MsBackendMzR())

The MS data is now represented by a Spectra object, which can be thought of as a data.frame with rows being the individual spectra and columns the spectra variables (such as "rtime", i.e. the retention time). The spectraVariables() function lists all available variables within such a Spectra object.

#' List all available spectra variables (attributes)
spectraVariables(sps_all)

##  [1] "msLevel"                  "rtime"                   
##  [3] "acquisitionNum"           "scanIndex"               
##  [5] "dataStorage"              "dataOrigin"              
##  [7] "centroided"               "smoothed"                
##  [9] "polarity"                 "precScanNum"             
## [11] "precursorMz"              "precursorIntensity"      
## [13] "precursorCharge"          "collisionEnergy"         
## [15] "isolationWindowLowerMz"   "isolationWindowTargetMz" 
## [17] "isolationWindowUpperMz"   "peaksCount"              
## [19] "totIonCurrent"            "basePeakMZ"              
## [21] "basePeakIntensity"        "ionisationEnergy"        
## [23] "lowMZ"                    "highMZ"                  
## [25] "mergedScan"               "mergedResultScanNum"     
## [27] "mergedResultStartScanNum" "mergedResultEndScanNum"  
## [29] "injectionTime"            "filterString"            
## [31] "spectrumId"               "ionMobilityDriftTime"    
## [33] "scanWindowLowerLimit"     "scanWindowUpperLimit"

Each spectra variable can be accessed either via $ and its name or by using its dedicated access function (which is the preferred way). Below we extract the retention times of the first spectra using either $rtime or the function rtime().

#' Access the spectras' retention time
sps_all$rtime |> head()

## [1] 0.273 0.570 0.873 1.183 1.491 1.798

rtime(sps_all) |> head()

## [1] 0.273 0.570 0.873 1.183 1.491 1.798

Our Spectra object contains information from in total 1578 spectra from 2 mzML files. By using the MsBackendMzR backend, only general information about each spectrum is kept in memory resulting in a low memory footprint.

print(object.size(sps_all), units = "MB")

## 0.4 Mb

The spectra’s m/z and intensity values can be accessed with the mz() or intensity() functions or also with the peaksData() function that returns a list of numeric matrices with the m/z and intensity values. By using an MsBackendMzR backend, these are retrieved on demand from the original data files each time the functions are called.

## Show the first 6 lines of the first spectrum's peak matrix
peaksData(sps_all)[[1L]] |> head()

##            mz intensity
## [1,] 50.22647  383.7802
## [2,] 52.97586  377.5944
## [3,] 59.97846  518.6014
## [4,] 60.04677 4716.1259
## [5,] 60.06897  181.5105
## [6,] 61.02565  250.6573

We can also load the full data into memory by changing the backend from MsBackendMzR to MsBackendDataFrame. This does not affect the way we use the Spectra object itself: the same operations and functions are available, independently of the way the data is stored (i.e. which backend is used).

#' Change backend to a MsBackendDataFrame: load data into memory
sps_all <- setBackend(sps_all, MsBackendDataFrame())

The size of our Spectra object is now larger, since the full data has been loaded into memory.

print(object.size(sps_all), units = "MB")

## 16.5 Mb

To subset Spectra objects, we can use either [ or one of the many available filter* functions (that are usually more efficient than [). We could this subset the sps_all to some arbitrary spectra simply using:

sps_all[c(4, 2, 5)]

## MSn data (Spectra) with 3 spectra in a MsBackendDataFrame backend:
##     msLevel     rtime scanIndex
##   <integer> <numeric> <integer>
## 1         1     1.183         4
## 2         1     0.570         2
## 3         1     1.491         5
##  ... 33 more variables/columns.
## Processing:
##  Switch backend from MsBackendMzR to MsBackendDataFrame [Wed Aug  7 14:04:23 2024]

In our workshop, we next want to identify in our experimental data MS2 spectra that were generated from an ion that matches the m/z of the [M+H]+ ion of the metabolite cystine (one of the standards added to the sample mix measured in the present experimental data). We thus use below the filterPrecursorMzValues() function to subset the data to MS2 spectra matching that m/z (accepting a difference in m/z of 10 parts-per-million (ppm)).

#' Define the m/z ratio for an ion of cystine
mz <- 241.0311

#' Subset the dataset to MS2 spectra matching the m/z
sps <- filterPrecursorMzValues(sps_all, mz = mz, ppm = 10)
sps

## MSn data (Spectra) with 6 spectra in a MsBackendDataFrame backend:
##     msLevel     rtime scanIndex
##   <integer> <numeric> <integer>
## 1         2   209.936       673
## 2         2   220.072       714
## 3         2   231.604       734
## 4         2   215.089       761
## 5         2   225.739       781
## 6         2   240.020       804
##  ... 33 more variables/columns.
## Processing:
##  Switch backend from MsBackendMzR to MsBackendDataFrame [Wed Aug  7 14:04:23 2024]
##  Filter: select spectra with precursor m/z matching 241.0311 [Wed Aug  7 14:04:23 2024]

In total 6 spectra matched our target precursor m/z.

Data processing and manipulation

The plotSpectra() function can be used to visualize spectra. Below we plot the first spectrum from our data subset.

#' Plot the first spectrum
plotSpectra(sps[1])

This raw MS2 spectrum contains many very low intensity peaks, most likely representing noise. Thus we next filter the spectra removing all peaks with an intensity smaller than 5% of the maximum intensity of each spectrum (i.e. the base peak intensity). To this end we define a function that takes intensity values from each spectrum as input and returns a logical value whether the peak should be retained (TRUE) or not (FALSE). This function is then passed to the filterIntensity() function to perform the actual filtering of the spectra.

#' Define a filtering function
low_int <- function(x, ...) {
    x > max(x, na.rm = TRUE) * 0.05
}
#' Apply the function to filter the spectra
sps <- filterIntensity(sps, intensity = low_int)

After filtering, the spectra are cleaner:

#' Plot the first spectrum after filtering
plotSpectra(sps[1])

In addition we want to replace the absolute peak intensities with values that are relative to each spectrum’s highest peak intensity. To this end we define a function that takes a peak matrix as input (i.e. a numeric matrix with columns containing the m/z and intensity values) and returns a two-column numeric matrix. This function is then passed with parameter FUN to the addProcessing() function to perform the desired data processing.

#' Define a function to scale the intensities. `x` is expected to be a
#' *peak matrix*, i.e. numeric matrix with columns m/z and intensity
scale_int <- function(x, ...) {
    maxint <- max(x[, "intensity"], na.rm = TRUE)
    x[, "intensity"] <- 100 * x[, "intensity"] / maxint
    x
}
#' *Apply* the function to the data
sps <- addProcessing(sps, scale_int)

The addProcessing() function allows to add any user-defined data manipulation operation on MS data (i.e. the peaks matrices) in a Spectra object. Note that, for scaling of peak intensities, there would also be a dedicated function, scalePeaks(), available.

To show the effect of the data processing we extract the intensities of the first spectrum:

#' Get the intensities after scaling
peaksData(sps)[[1L]]

##             mz  intensity
## [1,]  74.02289  13.439402
## [2,] 120.01111  93.631648
## [3,] 122.02667  49.287979
## [4,] 151.98277 100.000000
## [5,] 153.99924   9.422415
## [6,] 177.99880   6.581111
## [7,] 195.02556  12.969807
## [8,] 241.03074  54.633224

The intensity values are now all between 0 and 100.

Spectrum data comparison

We next perform a pairwise comparison of the subsetted spectra using the dot product as similarity measure. Prior to the actual similarity calculation, the peaks of the individual spectra have to be matched against each other (i.e. it has to be determined which peak from one spectrum correspond to which from the other spectrum based on their mass-to-charge ratios). We specify ppm = 20 so that peaks with a difference in m/z smaller than 20ppm will be considered matching. By default, compareSpectra() uses the (normalized) dot-product to calculate the similarity between the spectra, but a variety of different algorithms can be used instead (see the Spectra vignette for more information).

#' Pairwise comparison of all spectra
sim_mat <- compareSpectra(sps, ppm = 20)
sim_mat

##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
## [1,] 1.0000000 0.9717669 0.8907129 0.9721095 0.8795912 0.8818719
## [2,] 0.9717669 1.0000000 0.9992127 0.9964783 0.9525726 0.9486682
## [3,] 0.8907129 0.9992127 1.0000000 0.9947650 0.9518138 0.9478201
## [4,] 0.9721095 0.9964783 0.9947650 1.0000000 0.9543584 0.9494561
## [5,] 0.8795912 0.9525726 0.9518138 0.9543584 1.0000000 0.9994800
## [6,] 0.8818719 0.9486682 0.9478201 0.9494561 0.9994800 1.0000000

The pairwise spectra similarities are represented with the heatmap below (note that RStudio in the docker might crash by the pheatmap() call - to avoid this add filename = "hm.pdf" to the heatmap call).

library(pheatmap)
hm <- pheatmap(sim_mat, breaks = seq(0, 1, length.out = 101))

The similarity between all the selected experimental MS2 spectra is very high (in fact above 0.88) suggesting that all of them are fragment spectra from the same original compound.

Comparing spectra against MassBank

Although the precursor m/z of our spectra matches the m/z of cystine, we can still not exclude that they represent fragments of ions from different compounds (with the same m/z than cystine).

Matching experimental spectra against a public spectral library can be used as a first step in the identification process of untargeted metabolomics data. Several (public) spectral libraries for small molecules are available, such as:

MassBank
MassBank of North America MoNa
Human Metabolom Database HMDB
GNPS
…

For some of these databases MsBackend interfaces are already implemented allowing inclusion of their data directly into R-based analysis workflows. Access to MassBank data is for example possible with the MsBackendMassbank package. This package provides the MsBackendMassbank for import/export of MassBank files as well as the MsBackendMassbankSql backend that directly interfaces the MassBank MySQL database.

Below we connect to a local installation of the MassBank MySQL database (release 2023.11 which is provided within the docker image of this tutorial).

library(RMariaDB)
#' Connect to the MassBank MySQL database
con <- dbConnect(MariaDB(), user = "massbank", dbname = "MassBank",
                 host = "localhost", pass = "massbank")

We next load the MsBackendMassbank package and initialize a Spectra object with a MsBackendMassbankSql backend providing it with the database connection.

library(MsBackendMassbank)
#' Access the spectra data in MassBank
mbank <- Spectra(con, source = MsBackendMassbankSql())
mbank

## MSn data (Spectra) with 86576 spectra in a MsBackendMassbankSql backend:
##         msLevel precursorMz  polarity
##       <integer>   <numeric> <integer>
## 1             2         506         0
## 2            NA          NA         1
## 3            NA          NA         0
## 4            NA          NA         1
## 5            NA          NA         0
## ...         ...         ...       ...
## 86572         2     449.380         1
## 86573         2     426.022         0
## 86574         2     131.060         0
## 86575         2     183.170         1
## 86576         2     358.270         0
##  ... 42 more variables/columns.
##  Use  'spectraVariables' to list all of them.

Alternatively, if no MySQL database system is available or if this tutorial can not be run within docker, the MassBank data can also be retrieved from Bioconductor’s AnnotationHub. In that case MsBackendCompDb backend (defined in the CompoundDb) is used instead, but both backends retrieve their data from a SQL database and have the same properties. Below we list MassBank releases/versions available in Bioconductor’s AnnotationHub.

library(AnnotationHub)
ah <- AnnotationHub()
query(ah, "MassBank")

## AnnotationHub with 6 records
## # snapshotDate(): 2024-04-30
## # $dataprovider: MassBank
## # $species: NA
## # $rdataclass: CompDb
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH107048"]]' 
## 
##              title                                
##   AH107048 | MassBank CompDb for release 2021.03  
##   AH107049 | MassBank CompDb for release 2022.06  
##   AH111334 | MassBank CompDb for release 2022.12.1
##   AH116164 | MassBank CompDb for release 2023.06  
##   AH116165 | MassBank CompDb for release 2023.09  
##   AH116166 | MassBank CompDb for release 2023.11

We can retrieve the respective release (2021.03 in this case) and access its spectra data with the code below. Note that AnnotationHub will cache the downloaded database locally and any subsequent call will not download the database again.

mb <- ah[["AH107048"]]
mbank <- Spectra(mb)

The Spectra object mbank represents now the MS data from the MassBank database with in total 86576 spectra. In fact, the mbank object itself does not contain any MS data, but only the primary keys of the spectra in the MassBank database. Hence, it has also a relatively low memory footprint.

print(object.size(mbank), units = "MB")

## 6.6 Mb

Any operation on a Spectra object with this backend will load the requested data from the database on demand. Calling intensity() on such a Spectra object would for example retrieve the intensity values for all spectra from the database. Below we extract the peaks matrix for the first spectrum.

#' Get peaks matrix for the first spectrum
peaksData(mbank[1])[[1L]]

##             mz    intensity
##  [1,] 146.7608     0.980184
##  [2,] 158.8635   938.145447
##  [3,] 174.9888    56.718353
##  [4,] 176.8688   847.169678
##  [5,] 213.9867   133.479233
##  [6,] 221.0402     3.189743
##  [7,] 238.9681    76.114944
##  [8,] 246.3159     2.592528
##  [9,] 255.0206     2.059480
## [10,] 256.8184     1.597428
## [11,] 272.9973   801.059082
## [12,] 290.9484     4.848598
## [13,] 293.9753     1.233919
## [14,] 310.1210    33.710808
## [15,] 352.9528    49.103123
## [16,] 370.9348   461.860779
## [17,] 390.0576     4.611422
## [18,] 408.0003 23963.912109
## [19,] 409.0284     0.884388
## [20,] 426.1263    69.098648
## [21,] 470.0574     2.051622
## [22,] 487.9893   812.386597
## [23,] 585.8845     1.093564

As we can see, also MassBank provides absolute intensities for each spectrum. To have all the data on the same scale, we would however like to scale them to values between 0 and 100, just like we did for our experimental data. Below we thus apply the same data processing operation also to the MassBank Spectra object.

#' Scale intensities for all MassBank spectra
mbank <- addProcessing(mbank, scale_int)
peaksData(mbank[1])[[1L]]

##             mz    intensity
##  [1,] 146.7608 4.090250e-03
##  [2,] 158.8635 3.914826e+00
##  [3,] 174.9888 2.366824e-01
##  [4,] 176.8688 3.535189e+00
##  [5,] 213.9867 5.570010e-01
##  [6,] 221.0402 1.331061e-02
##  [7,] 238.9681 3.176232e-01
##  [8,] 246.3159 1.081847e-02
##  [9,] 255.0206 8.594089e-03
## [10,] 256.8184 6.665973e-03
## [11,] 272.9973 3.342773e+00
## [12,] 290.9484 2.023292e-02
## [13,] 293.9753 5.149072e-03
## [14,] 310.1210 1.406732e-01
## [15,] 352.9528 2.049045e-01
## [16,] 370.9348 1.927318e+00
## [17,] 390.0576 1.924319e-02
## [18,] 408.0003 1.000000e+02
## [19,] 409.0284 3.690499e-03
## [20,] 426.1263 2.883446e-01
## [21,] 470.0574 8.561298e-03
## [22,] 487.9893 3.390042e+00
## [23,] 585.8845 4.563378e-03

This worked, although we are (for obvious reasons) not allowed to change the m/z and intensity values within the MassBank database. How is this then possible? It works, because any data manipulation operation on a Spectra object is cached within the object’s lazy evaluation queue and any such operation is applied to the m/z and intensity values on-the-fly whenever the data is requested (i.e. whenever mz() or intensity() is called on it). As a side effect, this also allows to undo certain data operations by simply calling reset() on the Spectra object.

We next want to compare our experimental spectra against the spectra from MassBank. Instead of comparing against all 86576 available spectra, we first filter the MassBank database to spectra with a precursor m/z matching the one of the [M+H]+ ion of cystine. Note that, as an alternative to the filterPrecursorMzValues() used below, we could also use the containsMz() function to screen for spectra containing an actual peak matching the precursor m/z.

#' Filter MassBank for spectra of precursor ion
mbank_sub <- filterPrecursorMzValues(mbank, mz = mz, ppm = 10)
mbank_sub

## MSn data (Spectra) with 12 spectra in a MsBackendMassbankSql backend:
##       msLevel precursorMz  polarity
##     <integer>   <numeric> <integer>
## 1           2     241.031         1
## 2           2     241.031         1
## 3           2     241.031         1
## 4           2     241.031         1
## 5           2     241.031         1
## ...       ...         ...       ...
## 8           2     241.031         1
## 9           2     241.031         1
## 10          2     241.031         1
## 11          2     241.031         1
## 12          2     241.031         1
##  ... 42 more variables/columns.
##  Use  'spectraVariables' to list all of them.
## Lazy evaluation queue: 1 processing step(s)
## Processing:
##  Filter: select spectra with precursor m/z matching 241.0311 [Wed Aug  7 14:04:31 2024]

This left us with 12 spectra for which we calculate the similarity against each of our experimental spectra using the compareSpectra() function.

#' Compare MassBank subset to experimental spectra
res <- compareSpectra(sps, mbank_sub, ppm = 20)
res

##       RP014401  RP014402    RP014403  CE000600  CE000598  CE000602  CE000595
## [1,] 0.2026317 0.3343503 0.024658599 0.9021127 0.9069832 0.7714246 0.4711052
## [2,] 0.1807144 0.2933160 0.021440961 0.8863867 0.8655798 0.7507946 0.4307388
## [3,] 0.1801990 0.2892779 0.005165072 0.8862688 0.8541779 0.7335706 0.4406949
## [4,] 0.1596113 0.3067826 0.005668103 0.8666230 0.8760503 0.7770975 0.4382448
## [5,] 0.1784321 0.3608116 0.005392068 0.8792278 0.9101864 0.7761744 0.5016645
## [6,] 0.1871595 0.3629835 0.005564441 0.8849054 0.8705634 0.7200598 0.5052482
##       CE000599  CE000596  CE000597  CE000603  CE000601
## [1,] 0.5101048 0.7358534 0.2843629 0.6267950 0.6085564
## [2,] 0.5304578 0.7415840 0.3069911 0.6079285 0.5669921
## [3,] 0.5379880 0.7480320 0.3157130 0.6220260 0.5458613
## [4,] 0.4738241 0.6957804 0.2580864 0.5819616 0.6049588
## [5,] 0.4913249 0.7185230 0.2783447 0.6469762 0.6055037
## [6,] 0.5013606 0.7278495 0.2864854 0.6554185 0.5523657

As a result we got the (normalized dot product) similarity score between each tested MassBank spectrum (columns) against each experimental spectrum (rows).

pheatmap(res)

We get some very high similarity scores, but also some lower ones. We next determine the best matching pair from all the comparisons.

best_match <- which(res == max(res), arr.ind = TRUE)
best_match

##      row col
## [1,]   5   5

The best_match variable contains now the index of the best matching MassBank and experimental spectrum. Below we visualize these two using a mirror plot showing the best matches between experimental spectra (upper panel) and fragment spectra from MassBank (lower panel). Matching peaks are highlighted with a blue color. Plotting functions in Spectra are highly customizable and in the example below we add the m/z for each individual peak as an annotation to it but only if the intensity of the peak is higher than 5.

#' Specifying a function to draw peak labels
label_fun <- function(x) {
    ints <- unlist(intensity(x))
    mzs <- format(unlist(mz(x)), digits = 4)
    mzs[ints < 5] <- ""
    mzs
}
plotSpectraMirror(sps[best_match[1]], mbank_sub[best_match[2]],
                  ppm = 20, labels = label_fun, labelPos = 2,
                  labelOffset = 0.2, labelSrt = -30)
grid()

As a comparison we plot also two spectra with a low similarity score.

plotSpectraMirror(sps[3], mbank_sub[1],
                  ppm = 20, labels = label_fun, labelPos = 2,
                  labelOffset = 0.2, labelSrt = -30)
grid()

Since we could identify a MassBank spectrum with a high similarity to our experimental spectra we would also like to know which compound this spectrum actually represents. Spectra objects are generally very flexible and can have arbitrarily many additional annotation fields (i.e. spectra variables) for each spectrum. Thus, we below use the spectraVariables() function to list all of the variables that are available in our MassBank Spectra object.

#' What variables are available in MassBank
spectraVariables(mbank_sub)

##  [1] "msLevel"                 "rtime"                  
##  [3] "acquisitionNum"          "scanIndex"              
##  [5] "dataStorage"             "dataOrigin"             
##  [7] "centroided"              "smoothed"               
##  [9] "polarity"                "precScanNum"            
## [11] "precursorMz"             "precursorIntensity"     
## [13] "precursorCharge"         "collisionEnergy"        
## [15] "isolationWindowLowerMz"  "isolationWindowTargetMz"
## [17] "isolationWindowUpperMz"  "spectrum_id"            
## [19] "spectrum_name"           "date"                   
## [21] "authors"                 "license"                
## [23] "copyright"               "publication"            
## [25] "splash"                  "compound_id"            
## [27] "adduct"                  "ionization"             
## [29] "ionization_voltage"      "fragmentation_mode"     
## [31] "collision_energy_text"   "instrument"             
## [33] "instrument_type"         "formula"                
## [35] "exactmass"               "smiles"                 
## [37] "inchi"                   "inchikey"               
## [39] "cas"                     "pubchem"                
## [41] "synonym"                 "precursor_mz_text"      
## [43] "compound_name"

In fact, in addition to spectra specific information like the instrument on which it was measured or the ionization voltage used, we get also information on the originating compound such as its name ("compound_name"), its chemical formula ("formula") or its INChI key ("inchikey"). We thus next subset mbank_sub to the best matching spectrum and display its associated compound name. Note that the database provided by AnnotationHub does contain compound names in a column/spectra variable "name" instead of "compound_name", thus, if we used the option to download the MassBank data from AnnotationHub above, we would have to use mbank_best_match$name instead of mbank_best_match$compound_name below.

mbank_best_match <- mbank_sub[best_match[2]]
mbank_best_match$compound_name

## [1] "Cystine"

Indeed, our experimental cystine spectrum matches (one) cystine spectrum in MassBank. Below we next add the name and the chemical formula of this spectrum to our experimental spectra. We also set the collision energy for to 20eV and assign the ion/adduct of cystine from which the reference spectrum was created. Spectra allows adding new (or replacing existing) spectra variables using $<-.

#' Add annotations to the experimental spectra
sps$name <- mbank_best_match$compound_name
sps$formula <- mbank_best_match$formula
sps$adduct <- mbank_best_match$adduct
sps$collisionEnergy <- 20

Note: a more convenient spectra matching functionality designed for less experienced R users is available in the MetaboAnnotation package. A second tutorial is available within this github repository/R package and additional tutorials can be found in the MetaboAnnotationTutorials repository/R package.

Data export

At last we want to export our spectra to a file in MGF format. For this we use the MsBackendMgf package which provides the MsBackendMgf backend that adds support for MGF file import/export to Spectra objects.

Data from Spectra objects can generally be exported with the export() function. The format in which the data is exported depends on the specified MsBackend class. By using an instance of MsBackendMgf we define to export the data to a file in MGF format.

library(MsBackendMgf)

#' Export the spectra to a MGF file
export(sps, backend = MsBackendMgf(), file = "Cystine_ce20.mgf")

Below we read the first 6 lines from the resulting file.

readLines("Cystine_ce20.mgf", n = 6)

## [1] "BEGIN IONS"                               
## [2] "TITLE=msLevel 2; retentionTime ; scanNum "
## [3] "msLevel=2"                                
## [4] "RTINSECONDS=209.93599999998"              
## [5] "SCANS=673"                                
## [6] "scanIndex=673"

Comparing spectra against HMDB

In addition to the MsBackendMassbank, which provides access to MassBank data, there is also the MsBackendHmdb package supporting spectral data from the public Human Metabolome Database (HMDB). This package does however not yet provide direct access to the HMDB database but, through the MsBackendHmdbXml backend, allows to import MS2 spectra files in HMDB format. These are provided by HMDB as individual xml files in a custom file format which are bundled (and can hence be downloaded) in a single archive.

As an alternative, it is also possible to create portable annotation resources using the CompoundDb package. This packages allows for example to create annotation databases for MoNA, MassBank or HMDB. CompDb with annotations for small compounds from different resources will become available also on Bioconductors AnnotationHub.

A CompDb database containing all compound annotations along with all (experimental) MS2 spectra is provided as a data release in the MetaboAnnotationTutorials repository. Below we download this SQLite database to a temporary file.

#' Download the CompDb database using curl
library(curl)
dbname <- "CompDb.Hsapiens.HMDB.5.0.sqlite"
db_file <- file.path(tempdir(), dbname)
curl_download(
    paste0("https://github.com/jorainer/MetaboAnnotationTutorials/",
           "releases/download/2021-11-02/", dbname),
    destfile = db_file)

We can now load this annotation resource using the CompDb function from the CompoundDb package:

library(CompoundDb)
#' Load a CompDb database with compound annotation from HMDB
cdb <- CompDb(db_file)
cdb

## class: CompDb 
##  data source: HMDB 
##  version: 5.0 
##  organism: Hsapiens 
##  compound count: 217776 
##  MS/MS spectra count: 64920

We can now again create a Spectra object to get access to the MS2 spectra within this package:

hmdb <- Spectra(cdb)

With this we have now a Spectra object containing all MS2 spectra from HMDB along with compound annotations.

hmdb

## MSn data (Spectra) with 64920 spectra in a MsBackendCompDb backend:
##         msLevel precursorMz  polarity
##       <integer>   <numeric> <integer>
## 1            NA          NA         1
## 2            NA          NA         1
## 3            NA          NA         1
## 4            NA          NA         1
## 5            NA          NA         1
## ...         ...         ...       ...
## 64916        NA          NA         0
## 64917        NA          NA         0
## 64918        NA          NA         0
## 64919        NA          NA         0
## 64920        NA          NA         1
##  ... 32 more variables/columns.
##  Use  'spectraVariables' to list all of them.
##  data source: HMDB 
##  version: 5.0 
##  organism: Hsapiens

The spectra variables (including compound annotations) available for each spectrum are:

spectraVariables(hmdb)

##  [1] "msLevel"                 "rtime"                  
##  [3] "acquisitionNum"          "scanIndex"              
##  [5] "dataStorage"             "dataOrigin"             
##  [7] "centroided"              "smoothed"               
##  [9] "polarity"                "precScanNum"            
## [11] "precursorMz"             "precursorIntensity"     
## [13] "precursorCharge"         "collisionEnergy"        
## [15] "isolationWindowLowerMz"  "isolationWindowTargetMz"
## [17] "isolationWindowUpperMz"  "compound_id"            
## [19] "name"                    "inchi"                  
## [21] "inchikey"                "formula"                
## [23] "exactmass"               "smiles"                 
## [25] "original_spectrum_id"    "predicted"              
## [27] "splash"                  "instrument_type"        
## [29] "instrument"              "spectrum_id"            
## [31] "msms_mz_range_min"       "msms_mz_range_max"      
## [33] "synonym"

Also here, we want to filter the data resource first for spectra with a matching precursor m/z. Unfortunately, HMDB does not provided the spectra’s precursor m/z and we hence need to use the containsMz() function to find spectra containing a peak with an m/z matching the m/z of our ion of interest. In addition, We need to use a rather large tolerance value (which defines the maximal acceptable absolute difference in m/z values) since some of the experimental spectra in HMDB seem to be recorded by not well calibrated instrument.

#' Identify spectra containing a peak matching cystine m/z
has_mz <- containsMz(hmdb, mz = mz, tolerance = 0.2)

In total 3904 spectra contain a peak with the required m/z (+/- 0.2 Dalton) and we can proceed to calculate spectral similarities between these and our experimental spectra.

#' Subset HMDB
hmdb_sub <- hmdb[has_mz]
#' Compare HMDB against experimental spectra
res <- compareSpectra(hmdb_sub, sps, tolerance = 0.2)

The highest similarity between our spectra and the spectra from HMDB is r max(res). Below we compare the two best matching spectra with a mirror plot, in the upper panel showing our experimental spectrum and in the lower panel the best matching MS2 spectrum from HMDB.

best_match <- which(res == max(res), arr.ind = TRUE)
## Specifying a function to draw peak labels
label_fun <- function(x) {
    format(unlist(mz(x)), digits = 4)
}
plotSpectraMirror(hmdb_sub[best_match[1]], sps[best_match[2]], tolerance = 0.2,
                  labels = label_fun, labelPos = 2, labelOffset = 0.2,
                  labelSrt = -30)
grid()

Our experimental spectrum seems to nicely match the reference MS2 in HMDB. Below we extract the compound identifier from the best matching HMDB spectrum (stored in a spectra variable called "compound_id")

hmdb_sub[best_match[1]]$compound_id

## [1] "HMDB0000192"

And the name of this compound

hmdb_sub[best_match[1]]$name

## [1] "L-Cystine"

In fact, the matching spectrum from HMDB is an experimental spectrum for L-Cystine.

An alternative way to store MS data

For some use cases it might be useful to store the data from whole MS experiments in a central place. The MsBackendSql backend allows for example to store such data into a relational database, such as simple SQLite files or MySQL/MariaDB database systems. The latter case would allow also remote access to a centrally stored data set. Also, performance, especially for large datasets would be much better with a MySQL database compared to a SQLite database as used here. In this section we show how data from an MS experiment can be inserted into such a database and how it can be again retrieved.

Below we create an empty SQLite database (in a temporary file). As detailed above, any SQL database system could be used instead (resulting eventually in even better performance).

library(RSQLite)
db_file <- tempfile()
scon <- dbConnect(SQLite(), db_file)

We can next import the data from our example experiment into the selected database.

library(MsBackendSql)
## Create a database containing the full raw data
createMsBackendSqlDatabase(scon, fls)

## [1] TRUE

dbDisconnect(scon)

The MsBackendSql or MsBackendOfflineSql backends allow now to access and use the MS data stored in such a database. The MsBackendOfflineSql does not keep an active connection to the database but reconnects to the database when needed. While having a slightly lower performance, this allows the object to be serialized, e.g. saved to and reloaded from disk using the save() and load() functions. To access the data within the SQL database We simply the name to the database as a parameter to the Spectra() function and use the MsBackendOfflineSql as backend. For the MsBackendOfflineSql we need to specify the SQL driver, in our case, since we are using a SQLite database, it will be SQLite() as well as the name of the database (which for SQLite is the file name of the database).

sps_db <- Spectra(db_file, drv = SQLite(),
                  source = MsBackendOfflineSql())
sps_db

## MSn data (Spectra) with 1578 spectra in a MsBackendOfflineSql backend:
##        msLevel precursorMz  polarity
##      <integer>   <numeric> <integer>
## 1            1          NA         1
## 2            1          NA         1
## 3            1          NA         1
## 4            1          NA         1
## 5            1          NA         1
## ...        ...         ...       ...
## 1574         2    107.9501         1
## 1575         1          NA         1
## 1576         2     95.0608         1
## 1577         1          NA         1
## 1578         1          NA         1
##  ... 34 more variables/columns.
##  Use  'spectraVariables' to list all of them.
## Database: /tmp/RtmpDqKO6m/fileeb02e28809b

The memory footprint of this Spectra object is only very small as all the data is retrieved from the database on-the-fly.

print(object.size(sps_db), units = "MB")

## 0.1 Mb

The performance of this backend is also higher than for example of a MsBackendMzR. Below we compare the performance of the MsBackendOfflineSql on the present small data set to a MsBackendDataFrame (data in-memory) and a MsBackendMzR (data retrieved from original file). Note that in more recent versions of the Spectra package a higher-performance in-memory backend would be available (MsBackendMemory).

## Load the data to different backends
sps_mzr <- Spectra(fls, source = MsBackendMzR())
sps_df <- setBackend(sps_mzr, MsBackendDataFrame())

library(microbenchmark)
microbenchmark(peaksData(sps_mzr),
               peaksData(sps_df),
               peaksData(sps_db),
               times = 10)

## Unit: milliseconds
##                expr       min        lq      mean    median        uq      max
##  peaksData(sps_mzr) 251.35233 262.96548 469.85594 264.44346 934.94526 947.7634
##   peaksData(sps_df)  75.78775  88.86324  97.47453  97.11475 105.58224 124.3042
##   peaksData(sps_db)  42.41971  52.02533 115.15461  66.11412  82.27333 575.4869
##  neval
##     10
##     10
##     10

Thus, the MsBackendOfflineSql has similar performance than having all the data in memory. The performance is even better if smaller subsets of the data are loaded.

microbenchmark(peaksData(sps_mzr[1:10]),
               peaksData(sps_df[1:10]),
               peaksData(sps_db[1:10]),
               times = 10)

## Unit: milliseconds
##                      expr      min       lq      mean    median        uq
##  peaksData(sps_mzr[1:10]) 9.876781 9.965817 11.540823 10.296418 10.460980
##   peaksData(sps_df[1:10]) 3.384398 3.433821  3.521974  3.481565  3.543696
##   peaksData(sps_db[1:10]) 3.220835 3.276307  3.677648  3.464268  3.690149
##        max neval
##  21.308500    10
##   3.992773    10
##   5.792197    10

A real world use case for such a backend would be to store a large MS experiment in a central place. Data analysts could connect to that central database, explore and subset the data and ultimately either load subsets of the data into the (local) memory by changing the backend of the data subset to e.g. a MsBackendDataFrame or store the data subset to local files e.g. in mzML format by using the export() method and the MsBackendMzR backend.

Short summary

Spectra provides a powerful infrastructure for MS data.
Independence between data analysis and data storage functionality allows to easily add support for additional data types or data handling/storage modes.
Caching data operations and applying them on-the-fly allows data manipulations regardless of how and where data is stored.

Summary

With the simple use case of matching experimental MS2 spectra against a public database we illustrated in this short tutorial the flexibility and expandability of the Spectra package that enables the seamless integration of mass spectrometry data from different sources. This was only possible with a clear separation of the user functionality (Spectra object) from the representation of the data (MsBackend object). Backends such as the MsBackendMgf, the MsBackendMassbank or the MsBackendHmdbXml provide support for additional data formats or data sources, while others, due to their much lower memory footprint (MsBackendMzR, MsBackendHdf5Peaks), enable the analysis of also very large data sets. Most importantly however, these backends are interchangeable and do not affect the way users can handle and analyze MS data with the Spectra package. Also, by caching data operations within the Spectra object and applying them only upon data requests, the same data operations can be applied to any data resource regardless of how and where the data is stored, even if the data itself is read-only.

What else?

MetaboAnnotation (Andrea Vicini, Michael Witting) to provide simple functions for matching of spectra or m/z and retention time data (Rainer et al. 2022).
MsExperiment to represent whole MS experiments (incl. Spectra, sample annotation, data quantitation, annotation files etc).
MsQuality (Thomas Naake): QC measures for MS data (Naake, Rainer, and Huber 2023).
MsBackendMsp (Steffen Neumann): backend for files in MSP format.
...

References

Gatto, Laurent, Sebastian Gibb, and Johannes Rainer. 2020. “MSnbase, Efficient and Elegant R-Based Processing and Visualization of Raw Mass Spectrometry Data.” Journal of Proteome Research, September. https://doi.org/10.1021/acs.jproteome.0c00313.

Naake, Thomas, Johannes Rainer, and Wolfgang Huber. 2023. “MsQuality: An Interoperable Open-Source Package for the Calculation of Standardized Quality Metrics of Mass Spectrometry Data.” Bioinformatics (Oxford, England) 39 (10): btad618. https://doi.org/10.1093/bioinformatics/btad618.

Rainer, Johannes, Andrea Vicini, Liesa Salzer, Jan Stanstrup, Josep M. Badia, Steffen Neumann, Michael A. Stravs, et al. 2022. “A Modular and Expandable Ecosystem for Metabolomics Data Annotation in R.” Metabolites 12 (2): 173. https://doi.org/10.3390/metabo12020173.

Steckel, Arnold, and Gitta Schlosser. 2019. “An Organic Chemist’s Guide to Electrospray Mass Spectrometric Structure Elucidation.” Molecules (Basel, Switzerland) 24 (3): E611. https://doi.org/10.3390/molecules24030611.

Johannes Rainer1, Michael Witting2, Sebastian Gibb3, Laurent Gatto4

Overview

Description

Pre-requisites

Installation and Participation

R/Bioconductor packages used

Time outline

Time outline

Workshop goals and objectives

Learning goals

Learning objectives

Workshop

LC-MS/MS in a nutshell

The Spectra package

MS data import and handling

Data processing and manipulation

Spectrum data comparison

Comparing spectra against MassBank

Data export

Comparing spectra against HMDB

An alternative way to store MS data

Short summary

Summary

What else?

References

Johannes Rainer¹, Michael Witting², Sebastian Gibb³, Laurent Gatto⁴

The `Spectra` package