24 June 2018; Updated 20 June 2019

Hands-on?

Content

This presentation focuses on updates of xcms:

  • re-use data structures from Bioconductor’s MSnbase package
  • simplified raw data access

Content:

  • Basic MS data handling (MSnbase)
  • Simple MS data centroiding (MSnbase)
  • LC-MS data pre-processing (xcms):
    • chromatographic peak detection
    • alignment
    • correspondence

Basic MS data handling

Data import and representation

  • Data set:
    • subset from 2 files with pooled human serum samples
    • UHPLC (Agilent 1290) coupled with Q-TOF MS (TripleTOF 5600+ AB Sciex)
    • HILIC-based chromatographic separation
  • Define file names and sample descriptions.
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)

## Define a data.frame with additional information on the files.
pd <- data.frame(file = basename(fls), injection_idx = c(1, 19),
                 sample = c("POOL_1", "POOL_2"), group = "POOL")

Data import and representation

  • Read data from mzML/mzXML/CDF files with readMSData function.
## Read the data
data <- readMSData(fls, pdata = new("NAnnotatedDataFrame", pd), 
                   mode = "onDisk") 
  • mode = "onDisk": reads only spectrum header from files, but no data.
  • on-disk mode enables analysis of very large experiments.

Basic data access

  • Access sample/phenotype information using pData or $:
## Access phenotype information
pData(data)
##                               file injection_idx sample group
## 1 20171016_POOL_POS_1_105-134.mzML             1 POOL_1  POOL
## 2 20171016_POOL_POS_3_105-134.mzML            19 POOL_2  POOL
## Or individual columns directly using the $ operator
data$injection_idx
## [1]  1 19

Basic data access

  • Access general spectrum information: msLevel, centroided, rtime, polarity.
  • Access MS data: spectra, mz, intensity: reads data from files.
  • In most cases we work with subsets: use filter functions to subset the data:
    • filterFile subset to individual files/samples.
    • filterRtime restrict to specific retention time window.
    • filterMz restrict to m/z range.
    • filterMsLevel subset to certain MS level(s).

  • Example: extract all spectra measured between 180 and 181 seconds. Using the %>% (pipe) operator to avoid nested function calls.
## Get all spectra measured between 180 and 181 seconds
## Use %>% for better readability
sps <- data %>%
    filterRt(rt = c(180, 181)) %>%
    spectra
## How many spectra?
length(sps)
## [1] 6
## From which file?
sapply(sps, fromFile) 
## F1.S646 F1.S647 F1.S648 F2.S646 F2.S647 F2.S648 
##       1       1       1       2       2       2

  • Example: plot the data from the last spectrum
plot(sps[[6]]) 

  • But how to get chromatographic data?

  • chromatogram: extract chromatographic data.

  • Example: XIC for Serine (m/z of [M+H]+ adduct 106.0455).

data %>%
    filterRt(rt = c(175, 189)) %>%
    filterMz(mz = c(106.02, 106.07)) %>%
    chromatogram(aggregationFun = "max") %>%
    plot()

Centroiding of profile MS data

  • centroiding is the process in which mass peaks are reduced to a single, representative signal, their centroids.
  • xcms, specifically centWave was designed for centroided data.
  • MSnbase provides basic tools to perform MS data smoothing and centroiding: smooth and pickPeaks.
  • Example: show the combined m/z, rt and intensity data for Serine.
data %>%
    filterRt(rt = c(175, 189)) %>%
    filterMz(mz = c(106.02, 106.07)) %>%
    plot(type = "XIC") 

  • plot type = "XIC" creates a combined chromatographic and map visualization of the data.

  • Example: smooth data with Savitzky-Golay filter followed by a centroiding that simply reports the maximum signal for each mass peak in each spectrum. See ?pickPeaks for more advanced options.
## Smooth the signal, then do a simple peak picking.
data_cent <- data %>%
    smooth(method = "SavitzkyGolay", halfWindowSize = 6) %>%
    pickPeaks()
## Plot the centroided data for Serine
data_cent %>%
    filterRt(rt = c(175, 189)) %>%
    filterMz(mz = c(106.02, 106.07)) %>%
    plot(type = "XIC") 

  • Note: since data is not available in memory, data smoothing and centroiding is applied on-the-fly each time m/z or intensity values are accessed.
  • To make changes persistent: export and re-read the data.
## Write the centroided data to files with the same names in the current
## directory
fls_new <- basename(fileNames(data))
writeMSData(data_cent, file = fls_new)

## Read the centroided data.
data_cent <- readMSData(fls_new, pdata = new("NAnnotatedDataFrame", pd),
                        mode = "onDisk") 

LC-MS data preprocessing

Chromatographic peak detection

  • Aim: identify chromatographic peaks in the data.
  • Function: findChromPeaks.
  • Available methods:
    • matchedFilter (MatchedFilterParam) [Smith Anal. chem. 2006].
    • centWave (CentWaveParam) [Tautenhahn BMC Bioinformatics 2008].
    • massifquant (MassifquantParam [Conley Bioinformatics 2014].

centWave

  • First step: identify regions of interest.

[Tautenhahn BMC Bionformatics 2008]

  • 2nd: peak detection in these regions using continuous wavelet transform.

[Tautenhahn BMC Bionformatics 2008]

  • Crucial centWave parameters: peakwidth, ppm; list all with ?CentWaveParam.

  • peakwidth: minimal and maximal expected peak width.

  • Example: extract chromatographic data for Serine.
srn_chr <- chromatogram(data_cent, rt = c(165, 200),
                        mz = c(106.03, 106.06),
                        aggregationFun = "max")
plot(srn_chr)

  • New: peak detection on Chromatogram objects.
  • Perform peak detection using default centWave parameters in that data.
cwp <- CentWaveParam()
findChromPeaks(srn_chr, param = cwp)
  • What went wrong? What’s the default for peakwidth?
peakwidth(cwp)
## [1] 20 50
  • Default for peakwidth does not match the current data.

  • Reduce peakwidth and run peak detection again.
  • Note: chromatogram will also extract identified chrom peaks.
peakwidth(cwp) <- c(2, 10)
srn_chr <- findChromPeaks(srn_chr, param = cwp)
## Plot the data and higlight identified peak area
plot(srn_chr)

  • Ideally check settings on more known compounds.

  • ppm: maximal allowed scattering of m/z values for one ion.

  • Example: evaluate the m/z scattering of the signal for Serine.

## Restrict the data to signal from Serine
srn <- data_cent %>%
    filterRt(rt = c(179, 186)) %>%
    filterMz(mz = c(106.04, 106.06))

## Plot the data
plot(srn, type = "XIC") 

  • Example: calculate the difference of m/z values between consecutive scans.
## Extract mz values for Serine from first file
srn_mz <- unlist(mz(filterFile(srn, 1)))
## The difference between m/z values from consecutive scans in ppm
diff(srn_mz) * 1e6 / mean(srn_mz) 
##       F1.S643       F1.S644       F1.S645       F1.S646       F1.S647 
##  13.695973646 -27.391665930   1.112565444  13.695804399   0.000000000 
##       F1.S648       F1.S649       F1.S650       F1.S651       F1.S652 
##  -0.158840806   0.000000000   0.000000000  -0.682098923   0.000000000 
##       F1.S653       F1.S654       F1.S655       F1.S656       F1.S657 
##   0.000000000   0.007189239 -13.695795336  13.695795336 -12.807200180 
##       F1.S658       F1.S659       F1.S660       F1.S661       F1.S662 
##   0.000000000   0.000000000  13.443799681   0.000000000 -13.695795190 
##       F1.S663       F1.S664       F1.S665       F1.S666 
##  13.957010392   0.000000000   0.000000000 -14.085629933
  • This should be performed ideally on more compounds.
  • ppm: large enough to capture the full chromatographic peak.

  • Perform chromatographic peak detection with our data set-specific settings.
## Perform peak detection
ppm(cwp) <- 30
data_cent <- findChromPeaks(data_cent, param = cwp) 
  • Result: XCMSnExp object extends the OnDiskMSnExp, contains preprocessing results and enables data access as described above.

  • Use chromPeaks to access the peak detection results.
head(chromPeaks(data_cent), n = 5) 
##             mz    mzmin    mzmax     rt  rtmin  rtmax     into     intb
## CP001 111.0443 111.0431 111.0476 25.670 24.554 27.065 574.3303 562.2062
## CP002 129.0541 129.0522 129.0553 25.391 24.275 27.065 806.1325 770.1417
## CP003 114.0727 114.0715 114.0731 26.786 25.670 28.460 767.0744 764.5634
## CP004 111.0057 111.0044 111.0073 28.739 27.623 29.297 397.1809 395.7859
## CP005 127.0387 127.0378 127.0410 28.739 27.902 29.576 227.0865 223.2144
##           maxo  sn sample
## CP001 578.7692  30      1
## CP002 733.1538  21      1
## CP003 498.2657 497      1
## CP004 400.4895 399      1
## CP005 231.5594  16      1

Alignment - in short

  • Aim: adjust shifts in retention times between samples.
  • Function: adjustRtime.
  • Available methods:
    • obiwarp (ObiwarpParam) [Prince Anal. chem. 2006]: warps the (full) data to a reference sample.
  • peakGroups (PeakGroupsParam) [Smith Anal. chem. 2006]:
    • align spectra from different samples based on hook peaks.
    • Need to define the hook peaks first: peaks present in most/all samples.

  • Example: perform a peak grouping to define potential hook peaks and align the samples based on these.
  • Note: details on initial peak grouping provided in the next section.
## Define the settings for the initial peak grouping
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8,
                        minFraction = 1, binSize = 0.02)
data_cent <- groupChromPeaks(data_cent, pdp)
  • Align the samples.
## Define settings for the alignment
pgp <- PeakGroupsParam(minFraction = 1, span = 0.6)
data_cent <- adjustRtime(data_cent, param = pgp)

  • Inspect difference between raw and adjusted retention times.
plotAdjustedRtime(data_cent) 

  • Difference between raw and adjusted retention times resonable.
  • Hook peaks along the full retention time range.

  • Plot BPC before and after alignment.
  • Plot XIC of known compounds before and after alignment.
## Use adjustedRtime parameter to access raw/adjusted retention times
par(mfrow = c(1, 2), mar = c(4, 4.5, 0.9, 0.5))
plot(chromatogram(data_cent, mz = c(106.04, 106.06),
                  rt = c(179, 186), adjustedRtime = FALSE))
plot(chromatogram(data_cent, mz = c(106.04, 106.06),
                  rt = c(179, 186))) 

Correspondence

  • Aim: group signal (peaks) from the same ion across samples.
  • Function: groupChromPeaks.
  • Methods available:
    • peak density (PeakDensityParam) [Smith Anal. chem. 2006].
    • nearest (NearestPeaksParam) [Katajamaa Bioinformatics 2006].

peak density

  • Iterate through slices along m/z.
  • Compare retention times of peaks within each slice and group peaks if they are close.
  • Distribution of peaks along retention time axis is used to define which peaks to group.
  • plotChromPeakDensity: plot distribution of identified peaks along rt for a given m/z slice; simulates correspondence analysis.

  • Example:
    • Plot data for the m/z slice containing the Serine peak.
    • Use plotChromPeakDensity to simulate a correspondence analysis in the same slice.
## Extract a BPC for an m/z slice containing serine
bpc_srn <- chromatogram(data_cent, mz = c(106.04, 106.06),
                        aggregationFun = "max")
## Get default parameters for the grouping
pdp <- PeakDensityParam(sampleGroups = data_cent$group)

## Dry-run correspondence and show the results.
plotChromPeakDensity(bpc_srn, param = pdp)

  • Points are peaks per sample;
  • black line: peak density distribution;
  • grey rectangles: grouped peaks (features).

  • Parameters:
    • binSize: m/z width of the data slice in which peaks are grouped.
    • bw defines the smoothness of the density function.
    • maxFeatures: maximum number of features to be defined in one bin.
    • minFraction: minimum proportion of samples (of one group!) for which a peak has to be present.
    • minSamples: minimum number of samples a peak has to be present.
  • Parameters minFraction and minSamples depend on experimental layout!
  • binSize should be small enough to avoid peaks from different ions measured at similar retention times to be grouped together.
  • bw is the most important parameter.

  • Test default settings for a slice containing ions with similar m/z and rt: isomers Betaine and Valine ([M+H]+ m/z 118.08625).
## Plot the chromatogram for an m/z slice containing Betaine and Valine
mzr <- 118.08625 + c(-0.01, 0.01)
chr <- chromatogram(data_cent, mz = mzr, aggregationFun = "max")
## Correspondence in that slice using default settings
pdp <- PeakDensityParam(sampleGroups = data_cent$group)
plotChromPeakDensity(chr, param = pdp)

  • Correspondence failed: all peaks grouped into one feature!
  • Default for bw (30) too large for present data set.

  • plotChromPeakDensity allows to evaluate and tune settings on data subsets.

  • Test smaller bw (1.8) on the same slice.

## Reducing the bandwidth
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8)
plotChromPeakDensity(chr, param = pdp) 

  • Reducing the bw enabled grouping of isomers into different features.

  • Perform the correspondence analysis with tuned settings.
pdp <- PeakDensityParam(sampleGroups = data_cent$group, bw = 1.8,
                        minFraction = 0.4, binSize = 0.02)

## Perform the correspondence analysis
data_cent <- groupChromPeaks(data_cent, param = pdp) 
  • Evaluate results after correspondence: plotChromPeakDensity with simulate = FALSE shows the actual results from the correspondence.
  • Feature definitions are stored within the XCMSnExp object, can be accessed with featureDefinitions.

  • Use featureValues to access the features’ abundance estimates.
## feature intensity matrix
fmat <- featureValues(data_cent, value = "into", method = "maxint")
head(fmat) 
##       20171016_POOL_POS_1_105-134.mzML 20171016_POOL_POS_3_105-134.mzML
## FT001                        3159.7569                         3093.752
## FT002                        4762.3987                               NA
## FT003                         744.8752                         1033.232
## FT004                       20211.2634                        15839.550
## FT005                       10220.8762                        10837.710
## FT006                       19653.1073                        31816.844
  • featureValues parameters:
    • value: name of the column in chromPeaks that should be returned.
    • method: for features with multiple peaks in one sample: from which peak should the value be returned?

Missing values

  • Peak detection may have failed in one sample.
  • Ion is not present in a sample.
  • fillChromPeaks allows to fill-in signal for missing peaks from the feature area (defined by the median rt and mz of all peaks assigned to the feature).
  • fillChromPeaks Parameters:
    • expandMz, expandRt: expands the region from which signal is integrated in m/z or rt dimension. A value of 0 means no expansion, 1 means the region is grown by half of the feature’s m/z width on both sides.
    • ppm: expand the m/z width by a m/z dependent value.

Summary

  • The new data objects and functions aim to:
    • simplify data access and inspection of results
    • facilitate data set-dependent definition of algorithm parameters.
  • More work to come for the analysis of chromatographic data (SRM/MRM) and eventually for data normalization.
  • Don’t blindly use default parameters!

Acknowledgments

  • Jan Stanstrup (University of Copenhagen, Denmark)
  • Laurent Gatto (University of Cambridge, UK); MSnbase, mzR.
  • Steffen Neumann (IPB Halle, Germany); xcms, mzR
  • YOU for your attention!

https://github.com/jorainer/metabolomics2018