## Hands-on? {.build}
- https://github.com/jotsetung/metabolomics2018
- Open *xcms-preprocessing.Rmd* in e.g. [RStudio](https://www.rstudio.com/).
## Content {.build}
This presentation focuses on updates of `xcms`:
- re-use data structures from Bioconductor's `MSnbase` package
- simplified raw data access
Content:
- Basic MS data handling ([`MSnbase`](https://github.com/lgatto/MSnbase))
- Simple MS data centroiding (`MSnbase`)
- LC-MS data pre-processing ([`xcms`](https://github.com/sneumann/xcms)):
- chromatographic peak detection
- alignment
- correspondence
# Basic MS data handling
## Data import and representation {.build}
```{r load-libs, message = FALSE, results = "hide", echo = FALSE}
library(xcms)
library(magrittr)
## Set up parallel processing using 3 cores
register(bpstart(MulticoreParam(3)), default = TRUE)
```
- **Data set**:
- subset from 2 files with pooled human serum samples
- UHPLC (Agilent 1290) coupled with Q-TOF MS (TripleTOF 5600+ AB Sciex)
- HILIC-based chromatographic separation
- Define file names and sample descriptions.
```{r load-data, message = FALSE }
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
## Define a data.frame with additional information on the files.
pd <- data.frame(file = basename(fls), injection_idx = c(1, 19),
sample = c("POOL_1", "POOL_2"), group = "POOL")
```
## Data import and representation {.build}
- Read data from mzML/mzXML/CDF files with `readMSData` function.
```{r, message = FALSE}
## Read the data
data <- readMSData(fls, pdata = new("NAnnotatedDataFrame", pd),
mode = "onDisk")
```
- `mode = "onDisk"`: reads only spectrum header from files, but no data.
- on-disk mode enables analysis of very large experiments.
## Basic data access {.build}
- Access sample/phenotype information using `pData` or `$`:
```{r show-fData, message = FALSE}
## Access phenotype information
pData(data)
```
```{r show-fData2, message = FALSE}
## Or individual columns directly using the $ operator
data$injection_idx
```
## Basic data access {.build}
- Access general spectrum information: `msLevel`, `centroided`, `rtime`,
`polarity`.
- Access MS data: `spectra`, `mz`, `intensity`: reads data from files.
- In most cases we work with subsets: use filter functions to
subset the data:
- `filterFile` subset to individual files/samples.
- `filterRtime` restrict to specific retention time window.
- `filterMz` restrict to m/z range.
- `filterMsLevel` subset to certain MS level(s).
## {.build}
- Example: extract all spectra measured between 180 and 181
seconds. Using the `%>%` (pipe) operator to avoid nested function calls.
```{r spectra-filterRt, message = FALSE }
## Get all spectra measured between 180 and 181 seconds
## Use %>% for better readability
sps <- data %>%
filterRt(rt = c(180, 181)) %>%
spectra
```
```{r spectra-filterRt-length}
## How many spectra?
length(sps)
```
```{r spectra-filterRt-sapply}
## From which file?
sapply(sps, fromFile)
```
## {.build}
- Example: plot the data from the last spectrum
```{r spectrum-plot, message = FALSE, fig.width = 5, fig.height = 3.5}
plot(sps[[6]])
```
- But how to get chromatographic data?
## {.build}
- `chromatogram`: extract chromatographic data.
- Example: XIC for Serine (m/z of [M+H]+ adduct 106.0455).
```{r serine-xic, message = FALSE, fig.height = 4, fig.width = 7.5, eval = FALSE}
data %>%
filterRt(rt = c(175, 189)) %>%
filterMz(mz = c(106.02, 106.07)) %>%
chromatogram(aggregationFun = "max") %>%
plot()
```
```{r serine-xic-plot, message = FALSE, fig.height = 3.3, fig.width = 7.5, echo = FALSE}
par(mar = c(4, 4.5, 1, 0.5))
data %>%
filterRt(rt = c(175, 189)) %>%
filterMz(mz = c(106.02, 106.07)) %>%
chromatogram(aggregationFun = "max") %>%
plot()
```
## Centroiding of profile MS data {.build .faster}
- *centroiding* is the process in which mass peaks are reduced to a
single, representative signal, their centroids.
- `xcms`, specifically *centWave* was designed for centroided data.
- `MSnbase` provides basic tools to perform MS data smoothing and
centroiding: `smooth` and `pickPeaks`.
- Example: show the combined m/z, rt and intensity data for
Serine.
```{r serine-profile-mode-data, message = FALSE, eval = FALSE}
data %>%
filterRt(rt = c(175, 189)) %>%
filterMz(mz = c(106.02, 106.07)) %>%
plot(type = "XIC")
```
## {.flexbox .vcenter}
```{r serine-profile-mode-data2, message = FALSE, echo = FALSE}
## Filter the MS data to the signal from the Serine ion and plot it using
## type = "XIC"
data %>%
filterRt(rt = c(175, 189)) %>%
filterMz(mz = c(106.02, 106.07)) %>%
plot(type = "XIC")
```
- plot `type = "XIC"` creates a combined chromatographic and *map*
visualization of the data.
## {.build}
- Example: smooth data with Savitzky-Golay filter followed by a centroiding that
simply reports the maximum signal for each mass peak in each spectrum. See
`?pickPeaks` for more advanced options.
```{r centroiding, message = FALSE, warning = FALSE, eval = FALSE}
## Smooth the signal, then do a simple peak picking.
data_cent <- data %>%
smooth(method = "SavitzkyGolay", halfWindowSize = 6) %>%
pickPeaks()
```
```{r, eval = FALSE}
## Plot the centroided data for Serine
data_cent %>%
filterRt(rt = c(175, 189)) %>%
filterMz(mz = c(106.02, 106.07)) %>%
plot(type = "XIC")
```
----
```{r centroiding2, message = FALSE, warning = FALSE, fig.width = 10, fig.height = 5, echo = FALSE}
## Smooth the signal, then do a simple peak picking.
data_cent <- data %>%
smooth(method = "SavitzkyGolay", halfWindowSize = 6) %>%
pickPeaks()
## Plot the centroided data for Serine
data_cent %>%
filterRt(rt = c(175, 189)) %>%
filterMz(mz = c(106.02, 106.07)) %>%
plot(type = "XIC")
```
## {.build}
- Note: since data is not available in memory, data smoothing and centroiding is
applied *on-the-fly* each time m/z or intensity values are accessed.
- To make changes persistent: export and re-read the data.
```{r remove-exported, message = FALSE, echo = FALSE, results = "hide"}
lapply(basename(fileNames(data)), function (z) {
if (file.exists(z))
file.remove(z)
})
```
```{r export-centroided, message = FALSE, warning = FALSE }
## Write the centroided data to files with the same names in the current
## directory
fls_new <- basename(fileNames(data))
writeMSData(data_cent, file = fls_new)
## Read the centroided data.
data_cent <- readMSData(fls_new, pdata = new("NAnnotatedDataFrame", pd),
mode = "onDisk")
```
# LC-MS data preprocessing
## Chromatographic peak detection {.build}
- Aim: identify chromatographic peaks in the data.
- Function: `findChromPeaks`.
- Available methods:
- *matchedFilter* (`MatchedFilterParam`) [Smith *Anal. chem.* 2006].
- *centWave* (`CentWaveParam`) [Tautenhahn *BMC Bioinformatics* 2008].
- *massifquant* (`MassifquantParam` [Conley *Bioinformatics* 2014].
## | centWave
- First step: identify regions of interest.
```{r out.width="600px", echo = FALSE}
knitr::include_graphics("images/centWave-ROI.png")
```
[Tautenhahn *BMC Bionformatics* 2008]
----
- 2nd: peak detection in these regions using continuous wavelet transform.
```{r out.width="600px", echo = FALSE}
knitr::include_graphics("images/centWave-CWT.png")
```
[Tautenhahn *BMC Bionformatics* 2008]
## {.build}
- Crucial centWave parameters: `peakwidth`, `ppm`; list all with `?CentWaveParam`.
- `peakwidth`: minimal and maximal expected peak width.
- Example: extract chromatographic data for Serine.
```{r centWave-default, message = FALSE, results = "hide", fig.height = 3.5, fig.width = 5, eval = FALSE}
srn_chr <- chromatogram(data_cent, rt = c(165, 200),
mz = c(106.03, 106.06),
aggregationFun = "max")
plot(srn_chr)
```
```{r, echo = FALSE, fig.height = 3.2, fig.width = 5}
srn_chr <- chromatogram(data_cent, rt = c(165, 200),
mz = c(106.03, 106.06),
aggregationFun = "max")
par(mar = c(4, 4.5, 1.5, 0.5))
plot(srn_chr)
```
## {.build}
- **New**: peak detection on `Chromatogram` objects.
- Perform peak detection using default centWave parameters in that data.
```{r centWave-default2, message = FALSE, results = "hide"}
cwp <- CentWaveParam()
findChromPeaks(srn_chr, param = cwp)
```
- **What went wrong?** What's the default for `peakwidth`?
```{r centWave-default3, message = FALSE}
peakwidth(cwp)
```
- Default for `peakwidth` does not match the current data.
## {.smaller .build}
- Reduce `peakwidth` and run peak detection again.
- *Note*: `chromatogram` will also extract identified chrom peaks.
```{r centWave-adapted, message = FALSE, fig.height = 3.5, width = 5, eval = FALSE}
peakwidth(cwp) <- c(2, 10)
srn_chr <- findChromPeaks(srn_chr, param = cwp)
## Plot the data and higlight identified peak area
plot(srn_chr)
```
```{r, message = FALSE, fig.height = 3.5, width = 5, echo = FALSE}
cwp <- CentWaveParam(peakwidth = c(2, 10))
srn_chr <- findChromPeaks(srn_chr, param = cwp)
## Plot the data and higlight identified peak area
par(mar = c(4, 4.5, 1.5, 0.5))
plot(srn_chr)
```
- Ideally check settings on more known compounds.
## {.smaller .build}
- `ppm`: maximal allowed scattering of m/z values for one ion.
- Example: evaluate the m/z scattering of the signal for Serine.
```{r Serine-mz-scattering-plot, message = FALSE, fig.height = 3.5, width = 5 }
## Restrict the data to signal from Serine
srn <- data_cent %>%
filterRt(rt = c(179, 186)) %>%
filterMz(mz = c(106.04, 106.06))
## Plot the data
plot(srn, type = "XIC")
```
## {.build}
- Example: calculate the difference of m/z values between consecutive scans.
```{r define-ppm, message = FALSE }
## Extract mz values for Serine from first file
srn_mz <- unlist(mz(filterFile(srn, 1)))
## The difference between m/z values from consecutive scans in ppm
diff(srn_mz) * 1e6 / mean(srn_mz)
```
- This should be performed ideally on more compounds.
- `ppm`: large enough to capture the full chromatographic peak.
## {.build}
- Perform chromatographic peak detection with our data set-specific settings.
```{r findPeaks-centWave, message = FALSE }
## Perform peak detection
ppm(cwp) <- 30
data_cent <- findChromPeaks(data_cent, param = cwp)
```
- Result: `XCMSnExp` object extends the `OnDiskMSnExp`, contains preprocessing
results **and** enables data access as described above.
## {.build}
- Use `chromPeaks` to access the peak detection results.
```{r xcmsnexp, message = FALSE}
head(chromPeaks(data_cent), n = 5)
```
## Alignment - in short {.build}
- Aim: adjust shifts in retention times between samples.
- Function: `adjustRtime`.
- Available methods:
- *obiwarp* (`ObiwarpParam`) [Prince *Anal. chem.* 2006]: warps the (full)
data to a reference sample.
- *peakGroups* (`PeakGroupsParam`) [Smith *Anal. chem.* 2006]:
- align spectra from different samples based on *hook* peaks.
- Need to define the hook peaks first: peaks present in most/all samples.
## {.build}
- Example: perform a peak grouping to define potential hook peaks and align the
samples based on these.
- *Note:* details on initial peak grouping provided in the next section.
```{r alignment-correspondence, message = FALSE }
## Define the settings for the initial peak grouping
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8,
minFraction = 1, binSize = 0.02)
data_cent <- groupChromPeaks(data_cent, pdp)
```
- Align the samples.
```{r alignment-correspondence-alignment, message = FALSE}
## Define settings for the alignment
pgp <- PeakGroupsParam(minFraction = 1, span = 0.6)
data_cent <- adjustRtime(data_cent, param = pgp)
```
## {.build}
- Inspect difference between raw and adjusted retention times.
```{r alignment-result, message = FALSE, fig.width = 8, fig.height = 4 , eval = FALSE}
plotAdjustedRtime(data_cent)
```
```{r alignment-result-plot, message = FALSE, fig.width = 8, fig.height = 3.5 , echo = FALSE}
par(mar = c(4, 4.5, 0.5, 0.5))
plotAdjustedRtime(data_cent)
```
- Difference between raw and adjusted retention times resonable.
- Hook peaks along the full retention time range.
## {.build}
- Plot BPC before and after alignment.
- Plot XIC of known compounds before and after alignment.
```{r serine-xic-adjusted, message = FALSE, fig.width = 8, fig.height = 3.1 }
## Use adjustedRtime parameter to access raw/adjusted retention times
par(mfrow = c(1, 2), mar = c(4, 4.5, 0.9, 0.5))
plot(chromatogram(data_cent, mz = c(106.04, 106.06),
rt = c(179, 186), adjustedRtime = FALSE))
plot(chromatogram(data_cent, mz = c(106.04, 106.06),
rt = c(179, 186)))
```
## Correspondence {.build}
- Aim: group signal (peaks) from the same ion across samples.
- Function: `groupChromPeaks`.
- Methods available:
- *peak density* (`PeakDensityParam`) [Smith *Anal. chem.* 2006].
- *nearest* (`NearestPeaksParam`) [Katajamaa *Bioinformatics* 2006].
## | peak density {.build}
- Iterate through slices along m/z.
- Compare retention times of peaks within each slice and group peaks if they are
close.
- Distribution of peaks along retention time axis is used to define which peaks
to group.
- `plotChromPeakDensity`: plot distribution of identified peaks along rt for a
given m/z slice; simulates correspondence analysis.
## {.build}
- Example:
- Plot data for the m/z slice containing the Serine peak.
- Use `plotChromPeakDensity` to simulate a correspondence analysis in the same
slice.
```{r correspondence-example, message = FALSE, eval = FALSE}
## Extract a BPC for an m/z slice containing serine
bpc_srn <- chromatogram(data_cent, mz = c(106.04, 106.06),
aggregationFun = "max")
## Get default parameters for the grouping
pdp <- PeakDensityParam(sampleGroups = data_cent$group)
## Dry-run correspondence and show the results.
plotChromPeakDensity(bpc_srn, param = pdp)
```
## {.smaller}
```{r correspondence-example-plot, message = FALSE, width = 7, height = 5, echo = FALSE}
## Plot the BPC for the m/z slice containing serine
par(mfrow = c(2, 1), mar = c(4, 4.3, 1, 0.5))
bpc_srn <- chromatogram(data_cent, mz = c(106.04, 106.06),
aggregationFun = "max")
## Get default parameters for the grouping
pdp <- PeakDensityParam(sampleGroups = data_cent$group)
## Dry-run correspondence and show the results.
plotChromPeakDensity(bpc_srn, param = pdp)
```
- Points are peaks per sample;
- black line: peak density distribution;
- grey rectangles: grouped peaks (features).
##
- Parameters:
- `binSize`: m/z width of the data slice in which peaks are grouped.
- `bw` defines the smoothness of the density function.
- `maxFeatures`: maximum number of features to be defined in one bin.
- `minFraction`: minimum proportion of samples (of one group!) for
which a peak has to be present.
- `minSamples`: minimum number of samples a peak has to be present.
> - Parameters `minFraction` and `minSamples` depend on experimental layout!
> - `binSize` should be small enough to avoid peaks from different ions
measured at similar retention times to be grouped together.
> - `bw` is the most important parameter.
## {.build}
- Test default settings for a slice containing ions with similar m/z and rt:
isomers Betaine and Valine ([M+H]+ m/z 118.08625).
```{r correspondence-bw, message = FALSE, eval = FALSE}
## Plot the chromatogram for an m/z slice containing Betaine and Valine
mzr <- 118.08625 + c(-0.01, 0.01)
chr <- chromatogram(data_cent, mz = mzr, aggregationFun = "max")
## Correspondence in that slice using default settings
pdp <- PeakDensityParam(sampleGroups = data_cent$group)
plotChromPeakDensity(chr, param = pdp)
```
##
```{r correspondence-bw-plot, message = FALSE, width = 7, height = 5, echo = FALSE}
mzr <- 118.08625 + c(-0.01, 0.01)
chr <- chromatogram(data_cent, mz = mzr, aggregationFun = "max")
## Correspondence in that slice using default settings
pdp <- PeakDensityParam(sampleGroups = data_cent$group)
plotChromPeakDensity(chr, param = pdp)
```
> - **Correspondence failed**: all peaks grouped into one feature!
> - Default for `bw` (`30`) too large for present data set.
## {.build}
- `plotChromPeakDensity` allows to evaluate and tune settings on data subsets.
- Test smaller `bw` (`1.8`) on the same slice.
```{r correspondence-bw2, message = FALSE, eval = FALSE}
## Reducing the bandwidth
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8)
plotChromPeakDensity(chr, param = pdp)
```
##
```{r correspondence-bw2-plot, message = FALSE, fig.width = 7, fig.height = 5, echo = FALSE}
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8)
plotChromPeakDensity(chr, param = pdp)
```
> - Reducing the `bw` enabled grouping of isomers into different
features.
##
- Perform the correspondence analysis with tuned settings.
```{r correspondence-analysis, message = FALSE}
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8,
minFraction = 0.4, binSize = 0.02)
## Perform the correspondence analysis
data_cent <- groupChromPeaks(data_cent, param = pdp)
```
> - Evaluate results after correspondence: `plotChromPeakDensity` with `simulate
= FALSE` shows the actual results from the correspondence.
> - Feature definitions are stored within the `XCMSnExp` object, can be accessed
with `featureDefinitions`.
## {.build}
- Use `featureValues` to access the features' abundance estimates.
```{r}
## feature intensity matrix
fmat <- featureValues(data_cent, value = "into", method = "maxint")
head(fmat)
```
- `featureValues` parameters:
- `value`: name of the column in `chromPeaks` that should be returned.
- `method`: for features with multiple peaks in one sample: from which
peak should the value be returned?
## Missing values {.build}
- Peak detection may have failed in one sample.
- Ion is not present in a sample.
- `fillChromPeaks` allows to *fill-in* signal for missing peaks from the
feature area (defined by the median rt and mz of all peaks assigned to the
feature).
- `fillChromPeaks` Parameters:
- `expandMz`, `expandRt`: expands the region from which signal is integrated
in m/z or rt dimension. A value of 0 means no expansion, 1 means the region
is grown by half of the feature's m/z width on both sides.
- `ppm`: expand the m/z width by a m/z dependent value.
## Summary {.build}
- The new data objects and functions aim to:
- simplify data access and inspection of results
- facilitate data set-dependent definition of algorithm parameters.
- More work to come for the analysis of chromatographic data (SRM/MRM)
and eventually for data normalization.
- **Don't blindly use default parameters!**
## Acknowledgments
> - Jan Stanstrup (University of Copenhagen, Denmark)
> - Laurent Gatto (University of Cambridge, UK); `MSnbase`, `mzR`.
> - Steffen Neumann (IPB Halle, Germany); `xcms`, `mzR`
> - **YOU for your attention!**
https://github.com/jorainer/metabolomics2018