## Hands-on? {.build}
- https://github.com/jotsetung/metabolomics2018 - Open *xcms-preprocessing.Rmd* in e.g. [RStudio](https://www.rstudio.com/).
## Content {.build} This presentation focuses on updates of `xcms`: - re-use data structures from Bioconductor's `MSnbase` package - simplified raw data access
Content: - Basic MS data handling ([`MSnbase`](https://github.com/lgatto/MSnbase)) - Simple MS data centroiding (`MSnbase`) - LC-MS data pre-processing ([`xcms`](https://github.com/sneumann/xcms)): - chromatographic peak detection - alignment - correspondence
# Basic MS data handling ## Data import and representation {.build} ```{r load-libs, message = FALSE, results = "hide", echo = FALSE} library(xcms) library(magrittr) ## Set up parallel processing using 3 cores register(bpstart(MulticoreParam(3)), default = TRUE) ```
- **Data set**: - subset from 2 files with pooled human serum samples - UHPLC (Agilent 1290) coupled with Q-TOF MS (TripleTOF 5600+ AB Sciex) - HILIC-based chromatographic separation
- Define file names and sample descriptions. ```{r load-data, message = FALSE } fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE) ## Define a data.frame with additional information on the files. pd <- data.frame(file = basename(fls), injection_idx = c(1, 19), sample = c("POOL_1", "POOL_2"), group = "POOL") ```
## Data import and representation {.build}
- Read data from mzML/mzXML/CDF files with `readMSData` function.
```{r, message = FALSE} ## Read the data data <- readMSData(fls, pdata = new("NAnnotatedDataFrame", pd), mode = "onDisk") ```
- `mode = "onDisk"`: reads only spectrum header from files, but no data. - on-disk mode enables analysis of very large experiments.
## Basic data access {.build} - Access sample/phenotype information using `pData` or `$`:
```{r show-fData, message = FALSE} ## Access phenotype information pData(data) ```
```{r show-fData2, message = FALSE} ## Or individual columns directly using the $ operator data$injection_idx ```
## Basic data access {.build}
- Access general spectrum information: `msLevel`, `centroided`, `rtime`, `polarity`.
- Access MS data: `spectra`, `mz`, `intensity`: reads data from files.
- In most cases we work with subsets: use filter functions to subset the data: - `filterFile` subset to individual files/samples. - `filterRtime` restrict to specific retention time window. - `filterMz` restrict to m/z range. - `filterMsLevel` subset to certain MS level(s).
## {.build} - Example: extract all spectra measured between 180 and 181 seconds. Using the `%>%` (pipe) operator to avoid nested function calls. ```{r spectra-filterRt, message = FALSE } ## Get all spectra measured between 180 and 181 seconds ## Use %>% for better readability sps <- data %>% filterRt(rt = c(180, 181)) %>% spectra ``` ```{r spectra-filterRt-length} ## How many spectra? length(sps) ``` ```{r spectra-filterRt-sapply} ## From which file? sapply(sps, fromFile) ``` ## {.build} - Example: plot the data from the last spectrum ```{r spectrum-plot, message = FALSE, fig.width = 5, fig.height = 3.5} plot(sps[[6]]) ``` - But how to get chromatographic data? ## {.build} - `chromatogram`: extract chromatographic data. - Example: XIC for Serine (m/z of [M+H]+ adduct 106.0455). ```{r serine-xic, message = FALSE, fig.height = 4, fig.width = 7.5, eval = FALSE} data %>% filterRt(rt = c(175, 189)) %>% filterMz(mz = c(106.02, 106.07)) %>% chromatogram(aggregationFun = "max") %>% plot() ``` ```{r serine-xic-plot, message = FALSE, fig.height = 3.3, fig.width = 7.5, echo = FALSE} par(mar = c(4, 4.5, 1, 0.5)) data %>% filterRt(rt = c(175, 189)) %>% filterMz(mz = c(106.02, 106.07)) %>% chromatogram(aggregationFun = "max") %>% plot() ``` ## Centroiding of profile MS data {.build .faster}
- *centroiding* is the process in which mass peaks are reduced to a single, representative signal, their centroids.
- `xcms`, specifically *centWave* was designed for centroided data. - `MSnbase` provides basic tools to perform MS data smoothing and centroiding: `smooth` and `pickPeaks`.
- Example: show the combined m/z, rt and intensity data for Serine. ```{r serine-profile-mode-data, message = FALSE, eval = FALSE} data %>% filterRt(rt = c(175, 189)) %>% filterMz(mz = c(106.02, 106.07)) %>% plot(type = "XIC") ```
## {.flexbox .vcenter} ```{r serine-profile-mode-data2, message = FALSE, echo = FALSE} ## Filter the MS data to the signal from the Serine ion and plot it using ## type = "XIC" data %>% filterRt(rt = c(175, 189)) %>% filterMz(mz = c(106.02, 106.07)) %>% plot(type = "XIC") ``` - plot `type = "XIC"` creates a combined chromatographic and *map* visualization of the data. ## {.build}
- Example: smooth data with Savitzky-Golay filter followed by a centroiding that simply reports the maximum signal for each mass peak in each spectrum. See `?pickPeaks` for more advanced options. ```{r centroiding, message = FALSE, warning = FALSE, eval = FALSE} ## Smooth the signal, then do a simple peak picking. data_cent <- data %>% smooth(method = "SavitzkyGolay", halfWindowSize = 6) %>% pickPeaks() ```
```{r, eval = FALSE} ## Plot the centroided data for Serine data_cent %>% filterRt(rt = c(175, 189)) %>% filterMz(mz = c(106.02, 106.07)) %>% plot(type = "XIC") ```
---- ```{r centroiding2, message = FALSE, warning = FALSE, fig.width = 10, fig.height = 5, echo = FALSE} ## Smooth the signal, then do a simple peak picking. data_cent <- data %>% smooth(method = "SavitzkyGolay", halfWindowSize = 6) %>% pickPeaks() ## Plot the centroided data for Serine data_cent %>% filterRt(rt = c(175, 189)) %>% filterMz(mz = c(106.02, 106.07)) %>% plot(type = "XIC") ``` ## {.build}
- Note: since data is not available in memory, data smoothing and centroiding is applied *on-the-fly* each time m/z or intensity values are accessed.
- To make changes persistent: export and re-read the data. ```{r remove-exported, message = FALSE, echo = FALSE, results = "hide"} lapply(basename(fileNames(data)), function (z) { if (file.exists(z)) file.remove(z) }) ```
```{r export-centroided, message = FALSE, warning = FALSE } ## Write the centroided data to files with the same names in the current ## directory fls_new <- basename(fileNames(data)) writeMSData(data_cent, file = fls_new) ## Read the centroided data. data_cent <- readMSData(fls_new, pdata = new("NAnnotatedDataFrame", pd), mode = "onDisk") ``` # LC-MS data preprocessing ## Chromatographic peak detection {.build} - Aim: identify chromatographic peaks in the data.
- Function: `findChromPeaks`.
- Available methods: - *matchedFilter* (`MatchedFilterParam`) [Smith *Anal. chem.* 2006]. - *centWave* (`CentWaveParam`) [Tautenhahn *BMC Bioinformatics* 2008]. - *massifquant* (`MassifquantParam` [Conley *Bioinformatics* 2014].
## | centWave - First step: identify regions of interest. ```{r out.width="600px", echo = FALSE} knitr::include_graphics("images/centWave-ROI.png") ```
[Tautenhahn *BMC Bionformatics* 2008]
---- - 2nd: peak detection in these regions using continuous wavelet transform. ```{r out.width="600px", echo = FALSE} knitr::include_graphics("images/centWave-CWT.png") ```
[Tautenhahn *BMC Bionformatics* 2008]
## {.build} - Crucial centWave parameters: `peakwidth`, `ppm`; list all with `?CentWaveParam`. - `peakwidth`: minimal and maximal expected peak width.
- Example: extract chromatographic data for Serine. ```{r centWave-default, message = FALSE, results = "hide", fig.height = 3.5, fig.width = 5, eval = FALSE} srn_chr <- chromatogram(data_cent, rt = c(165, 200), mz = c(106.03, 106.06), aggregationFun = "max") plot(srn_chr) ```
```{r, echo = FALSE, fig.height = 3.2, fig.width = 5} srn_chr <- chromatogram(data_cent, rt = c(165, 200), mz = c(106.03, 106.06), aggregationFun = "max") par(mar = c(4, 4.5, 1.5, 0.5)) plot(srn_chr) ```
## {.build} - **New**: peak detection on `Chromatogram` objects. - Perform peak detection using default centWave parameters in that data. ```{r centWave-default2, message = FALSE, results = "hide"} cwp <- CentWaveParam() findChromPeaks(srn_chr, param = cwp) ``` - **What went wrong?** What's the default for `peakwidth`? ```{r centWave-default3, message = FALSE} peakwidth(cwp) ``` - Default for `peakwidth` does not match the current data. ## {.smaller .build} - Reduce `peakwidth` and run peak detection again. - *Note*: `chromatogram` will also extract identified chrom peaks. ```{r centWave-adapted, message = FALSE, fig.height = 3.5, width = 5, eval = FALSE} peakwidth(cwp) <- c(2, 10) srn_chr <- findChromPeaks(srn_chr, param = cwp) ## Plot the data and higlight identified peak area plot(srn_chr) ``` ```{r, message = FALSE, fig.height = 3.5, width = 5, echo = FALSE} cwp <- CentWaveParam(peakwidth = c(2, 10)) srn_chr <- findChromPeaks(srn_chr, param = cwp) ## Plot the data and higlight identified peak area par(mar = c(4, 4.5, 1.5, 0.5)) plot(srn_chr) ``` - Ideally check settings on more known compounds. ## {.smaller .build} - `ppm`: maximal allowed scattering of m/z values for one ion. - Example: evaluate the m/z scattering of the signal for Serine. ```{r Serine-mz-scattering-plot, message = FALSE, fig.height = 3.5, width = 5 } ## Restrict the data to signal from Serine srn <- data_cent %>% filterRt(rt = c(179, 186)) %>% filterMz(mz = c(106.04, 106.06)) ## Plot the data plot(srn, type = "XIC") ``` ## {.build} - Example: calculate the difference of m/z values between consecutive scans. ```{r define-ppm, message = FALSE } ## Extract mz values for Serine from first file srn_mz <- unlist(mz(filterFile(srn, 1))) ## The difference between m/z values from consecutive scans in ppm diff(srn_mz) * 1e6 / mean(srn_mz) ``` - This should be performed ideally on more compounds. - `ppm`: large enough to capture the full chromatographic peak. ## {.build}
- Perform chromatographic peak detection with our data set-specific settings. ```{r findPeaks-centWave, message = FALSE } ## Perform peak detection ppm(cwp) <- 30 data_cent <- findChromPeaks(data_cent, param = cwp) ```
- Result: `XCMSnExp` object extends the `OnDiskMSnExp`, contains preprocessing results **and** enables data access as described above.
## {.build} - Use `chromPeaks` to access the peak detection results. ```{r xcmsnexp, message = FALSE} head(chromPeaks(data_cent), n = 5) ``` ## Alignment - in short {.build}
- Aim: adjust shifts in retention times between samples.
- Function: `adjustRtime`.
- Available methods: - *obiwarp* (`ObiwarpParam`) [Prince *Anal. chem.* 2006]: warps the (full) data to a reference sample.
- *peakGroups* (`PeakGroupsParam`) [Smith *Anal. chem.* 2006]: - align spectra from different samples based on *hook* peaks. - Need to define the hook peaks first: peaks present in most/all samples.
## {.build} - Example: perform a peak grouping to define potential hook peaks and align the samples based on these. - *Note:* details on initial peak grouping provided in the next section. ```{r alignment-correspondence, message = FALSE } ## Define the settings for the initial peak grouping pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8, minFraction = 1, binSize = 0.02) data_cent <- groupChromPeaks(data_cent, pdp) ```
- Align the samples. ```{r alignment-correspondence-alignment, message = FALSE} ## Define settings for the alignment pgp <- PeakGroupsParam(minFraction = 1, span = 0.6) data_cent <- adjustRtime(data_cent, param = pgp) ```
## {.build} - Inspect difference between raw and adjusted retention times. ```{r alignment-result, message = FALSE, fig.width = 8, fig.height = 4 , eval = FALSE} plotAdjustedRtime(data_cent) ``` ```{r alignment-result-plot, message = FALSE, fig.width = 8, fig.height = 3.5 , echo = FALSE} par(mar = c(4, 4.5, 0.5, 0.5)) plotAdjustedRtime(data_cent) ``` - Difference between raw and adjusted retention times resonable. - Hook peaks along the full retention time range. ## {.build} - Plot BPC before and after alignment. - Plot XIC of known compounds before and after alignment. ```{r serine-xic-adjusted, message = FALSE, fig.width = 8, fig.height = 3.1 } ## Use adjustedRtime parameter to access raw/adjusted retention times par(mfrow = c(1, 2), mar = c(4, 4.5, 0.9, 0.5)) plot(chromatogram(data_cent, mz = c(106.04, 106.06), rt = c(179, 186), adjustedRtime = FALSE)) plot(chromatogram(data_cent, mz = c(106.04, 106.06), rt = c(179, 186))) ``` ## Correspondence {.build}
- Aim: group signal (peaks) from the same ion across samples.
- Function: `groupChromPeaks`. - Methods available: - *peak density* (`PeakDensityParam`) [Smith *Anal. chem.* 2006]. - *nearest* (`NearestPeaksParam`) [Katajamaa *Bioinformatics* 2006].
## | peak density {.build}
- Iterate through slices along m/z. - Compare retention times of peaks within each slice and group peaks if they are close.
- Distribution of peaks along retention time axis is used to define which peaks to group.
- `plotChromPeakDensity`: plot distribution of identified peaks along rt for a given m/z slice; simulates correspondence analysis.
## {.build}
- Example: - Plot data for the m/z slice containing the Serine peak. - Use `plotChromPeakDensity` to simulate a correspondence analysis in the same slice.
```{r correspondence-example, message = FALSE, eval = FALSE} ## Extract a BPC for an m/z slice containing serine bpc_srn <- chromatogram(data_cent, mz = c(106.04, 106.06), aggregationFun = "max") ## Get default parameters for the grouping pdp <- PeakDensityParam(sampleGroups = data_cent$group) ## Dry-run correspondence and show the results. plotChromPeakDensity(bpc_srn, param = pdp) ```
## {.smaller} ```{r correspondence-example-plot, message = FALSE, width = 7, height = 5, echo = FALSE} ## Plot the BPC for the m/z slice containing serine par(mfrow = c(2, 1), mar = c(4, 4.3, 1, 0.5)) bpc_srn <- chromatogram(data_cent, mz = c(106.04, 106.06), aggregationFun = "max") ## Get default parameters for the grouping pdp <- PeakDensityParam(sampleGroups = data_cent$group) ## Dry-run correspondence and show the results. plotChromPeakDensity(bpc_srn, param = pdp) ``` - Points are peaks per sample; - black line: peak density distribution; - grey rectangles: grouped peaks (features). ## - Parameters: - `binSize`: m/z width of the data slice in which peaks are grouped. - `bw` defines the smoothness of the density function. - `maxFeatures`: maximum number of features to be defined in one bin. - `minFraction`: minimum proportion of samples (of one group!) for which a peak has to be present. - `minSamples`: minimum number of samples a peak has to be present. > - Parameters `minFraction` and `minSamples` depend on experimental layout! > - `binSize` should be small enough to avoid peaks from different ions measured at similar retention times to be grouped together. > - `bw` is the most important parameter. ## {.build} - Test default settings for a slice containing ions with similar m/z and rt: isomers Betaine and Valine ([M+H]+ m/z 118.08625). ```{r correspondence-bw, message = FALSE, eval = FALSE} ## Plot the chromatogram for an m/z slice containing Betaine and Valine mzr <- 118.08625 + c(-0.01, 0.01) chr <- chromatogram(data_cent, mz = mzr, aggregationFun = "max") ## Correspondence in that slice using default settings pdp <- PeakDensityParam(sampleGroups = data_cent$group) plotChromPeakDensity(chr, param = pdp) ``` ## ```{r correspondence-bw-plot, message = FALSE, width = 7, height = 5, echo = FALSE} mzr <- 118.08625 + c(-0.01, 0.01) chr <- chromatogram(data_cent, mz = mzr, aggregationFun = "max") ## Correspondence in that slice using default settings pdp <- PeakDensityParam(sampleGroups = data_cent$group) plotChromPeakDensity(chr, param = pdp) ``` > - **Correspondence failed**: all peaks grouped into one feature! > - Default for `bw` (`30`) too large for present data set. ## {.build} - `plotChromPeakDensity` allows to evaluate and tune settings on data subsets. - Test smaller `bw` (`1.8`) on the same slice. ```{r correspondence-bw2, message = FALSE, eval = FALSE} ## Reducing the bandwidth pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8) plotChromPeakDensity(chr, param = pdp) ``` ## ```{r correspondence-bw2-plot, message = FALSE, fig.width = 7, fig.height = 5, echo = FALSE} pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8) plotChromPeakDensity(chr, param = pdp) ``` > - Reducing the `bw` enabled grouping of isomers into different features. ## - Perform the correspondence analysis with tuned settings. ```{r correspondence-analysis, message = FALSE} pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8, minFraction = 0.4, binSize = 0.02) ## Perform the correspondence analysis data_cent <- groupChromPeaks(data_cent, param = pdp) ``` > - Evaluate results after correspondence: `plotChromPeakDensity` with `simulate = FALSE` shows the actual results from the correspondence. > - Feature definitions are stored within the `XCMSnExp` object, can be accessed with `featureDefinitions`. ## {.build} - Use `featureValues` to access the features' abundance estimates. ```{r} ## feature intensity matrix fmat <- featureValues(data_cent, value = "into", method = "maxint") head(fmat) ``` - `featureValues` parameters: - `value`: name of the column in `chromPeaks` that should be returned. - `method`: for features with multiple peaks in one sample: from which peak should the value be returned? ## Missing values {.build}
- Peak detection may have failed in one sample. - Ion is not present in a sample.
- `fillChromPeaks` allows to *fill-in* signal for missing peaks from the feature area (defined by the median rt and mz of all peaks assigned to the feature).
- `fillChromPeaks` Parameters: - `expandMz`, `expandRt`: expands the region from which signal is integrated in m/z or rt dimension. A value of 0 means no expansion, 1 means the region is grown by half of the feature's m/z width on both sides. - `ppm`: expand the m/z width by a m/z dependent value.
## Summary {.build}
- The new data objects and functions aim to: - simplify data access and inspection of results - facilitate data set-dependent definition of algorithm parameters.
- More work to come for the analysis of chromatographic data (SRM/MRM) and eventually for data normalization.
- **Don't blindly use default parameters!**
## Acknowledgments > - Jan Stanstrup (University of Copenhagen, Denmark) > - Laurent Gatto (University of Cambridge, UK); `MSnbase`, `mzR`. > - Steffen Neumann (IPB Halle, Germany); `xcms`, `mzR` > - **YOU for your attention!**
https://github.com/jorainer/metabolomics2018