Lipid-oriented databases
Databases that curate individual lipid structures from both historical and new publications into organized repositories are essential for researchers who aim to identify the specific molecules present in their biological samples. Databases also serve as a foundation for many data analysis pipelines and as key knowledge bases for lipid research. Over the last 5–10 years, the size of lipidomics research datasets generated using MS and tandem MS (MS/MS) has increased massively, and their routine analysis requires automated programmatic approaches to enable database searching. To support the selection of databases suitable for a particular application, the ‘Task specific information’ tabs within the ‘Lipid Oriented Databases’ section provide an overview of the database functionalities, including the number of included lipid structures, structural ontology, covered lipid (sub)classes, and levels of curation and annotation. Automated approaches to support data searchability and utility are also described, including the identifiers used, structural representation, availability of spectral libraries, and calculated physicochemical properties when available.
The most widely used lipid-specific databases are provided by LIPID MAPS and SwissLipids. LIPID MAPS hosts several databases in which lipid structures are cataloged according to the LIPID MAPS nomenclature and classification22,23,24. Specific databases provide utility for different use-cases as follows. LIPID MAPS Structure Database (LMSD)22 contains over 47,000 lipids (August 2022) obtained from sources that include experimental work performed by the LIPID MAPS consortium, from other lipid databases, from the scientific literature, and also some that are computationally generated on the basis of commonly occurring fatty acid chains in mammalian lipids. LMSD can return either bulk (lipid species) annotations for MS data, on the basis of the shorthand nomenclature described by Liebisch et al.24, or fully annotated names (structure-defined lipids), where users already have additional structural information, for example, from MS/MS experiments. LMSD has recently implemented a display of reaction data to link together lipid species by biochemical transformations. This was initially obtained from Rhea25, WikiPathways26, Reactome27, and other sources and is now in place for many generic lipids. This is in the process of being cascaded down to individual lipid species. In the case of (high-resolution) MS experiments, the user may only have information on the m/z value of detected lipid ions. In this case, searching databases will provide information on elemental compositions and, using this, generate putative matches. It is recommended to use the BULK search tool on LIPID MAPS to perform this operation since this returns shorthand nomenclature as a first step. Putative matches on the basis of MS indicate the number of carbons in fatty acyl chains and double bonds or rings present, but not how these are distributed between or within acyl chains in the molecule. 
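The m/z-based lookup described above can be illustrated with a minimal sketch. It assumes a diacyl phosphatidylcholine (PC) bulk species with elemental composition C(n+8)H(2n+16−2d)NO8P for n acyl carbons and d double bonds, detected as [M+H]+. The function names, the search ranges, and the 5 ppm tolerance are illustrative assumptions and do not reproduce the LIPID MAPS BULK search tool.

```python
# Monoisotopic masses (u) of the elements involved, and the proton mass.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146, "P": 30.9737616}
PROTON = 1.00727647


def pc_bulk_mz(carbons, double_bonds):
    """m/z of [M+H]+ for a diacyl PC bulk species such as PC 34:1."""
    formula = {
        "C": carbons + 8,
        "H": 2 * carbons + 16 - 2 * double_bonds,
        "N": 1,
        "O": 8,
        "P": 1,
    }
    neutral = sum(MONO[element] * count for element, count in formula.items())
    return neutral + PROTON


def match_bulk(observed_mz, tol_ppm=5.0):
    """Return putative PC bulk annotations within tol_ppm of observed_mz."""
    hits = []
    for carbons in range(24, 49):           # assumed range of total acyl carbons
        for double_bonds in range(0, 13):   # assumed range of double bonds
            theoretical = pc_bulk_mz(carbons, double_bonds)
            if abs(observed_mz - theoretical) / theoretical * 1e6 <= tol_ppm:
                hits.append(f"PC {carbons}:{double_bonds}")
    return hits


putative = match_bulk(760.5851)
```

As in the text, such a match reports only the total number of acyl carbons and double bonds; it says nothing about how they are distributed between the chains, and a real search would also have to consider other lipid classes and adducts.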
For some users, the LIPID MAPS Computationally Generated Bulk Lipids (COMP_DB)22 may be a more suitable resource to query. This database contains over 59,000 lipid species in shorthand format (in the major classes such as fatty acyls, glycero- and glycerophospholipids, sterols, and sphingolipids), computationally generated from a list of commonly occurring acyl and alkyl chains. Most entries in this database represent hierarchical structures that could map to many different specific annotations. The LIPID MAPS In Silico Structure Database (LMISSD)22 contains over 1,100,000 entries derived from the computational expansion of headgroups and chains for common lipid classes. These are provided as specific structural annotations but can also be provided as a hierarchy of sum composition and chain composition. Lastly, the Lipidomic Ion Mobility Database22 was developed using data from the McLean and Griffin labs28,29,30 to provide collisional cross section measurements for drift tube MS experiments.
The SwissLipids knowledgebase31 was developed to aid lipidomics researchers in interpreting experimental datasets and integrating them with prior biological knowledge, allowing also for data exploration and hypothesis generation. In SwissLipids, experimentally characterized lipids are curated from peer-reviewed literature using the ChEBI32 ontology (https://www.ebi.ac.uk/chebi/). Lipid metabolism is described using the Rhea knowledgebase25 for biochemical and transport reactions (https://www.rhea-db.org), which is itself based on ChEBI, while enzymes, transporters, and interacting proteins are described using the UniProt Knowledgebase UniProtKB33 (https://www.uniprot.org), for which Rhea is the reference vocabulary for such annotation34. As the number of experimentally characterized lipid structures represents only a small fraction of the possible structures that may exist in nature, expert-curated knowledge of lipid structures and metabolism in ChEBI, Rhea, and UniProt is used to design and create a library of all theoretically feasible lipid structures in silico, which is fully mapped to these three resources. The current version of the SwissLipids library contains almost 600,000 lipid structures from over 550 lipid classes, organized into two distinct hierarchical lipid classifications—one that parallels the structural classification of LIPID MAPS35 and one that is based on the shorthand notation for MS data36 that links lipid identifications from MS-based experiments to structures and biological knowledge.
MS data repositories
Raw and/or processed data deposition using free repository services, although a standard task before publication in the field of proteomics for many years, is only now finding its way into the lipidomics community37. MS data repositories increase data transparency and reproducibility, and allow reanalysis for new discoveries, data-driven hypothesis generation, and benchmarking of new software tools38. Although numerous platforms for upload of raw MS datasets exist (for example, MassIVE (https://massive.ucsd.edu/), ProteomeXchange (http://www.proteomexchange.org/)), specific functionalities to support metadata, sample preparation protocols, and data matrices are necessary to improve the reusability of the deposited datasets following FAIR principles39. To select the optimal solution for data upload or download, users need to be informed about the types of stored raw data, processed data, and metadata, the curation strategy, the total number of available datasets, and species coverage.
Repositories tuned for metabolomics and lipidomics data, such as Metabolomics Workbench40 and MetaboLights41, have the functionality to associate deposited data with compound query results to enhance the reusability of the datasets, allowing further interrogation. Each dataset is assigned a unique project accession ID and sufficient space to host the raw and/or processed data, and is supported by detailed information, including study design, associated metadata, details on sample preparation, and analysis protocols. Datasets can be browsed and searched by specific keywords, organisms of origin, and reported compounds, and are usually associated with a source publication. MetaboLights has unique fields for data transformation and metabolite identification and provides an online viewer to review lipid identifiers, quantities, and corresponding structures, while Metabolomics Workbench is bundled with the RefMet42 data resource (containing over 160,000 annotated metabolite species, including a large collection of lipids) and a suite of online data analysis tools. MetaboLights and Metabolomics Workbench are accepted by mainstream journals as data repositories for publications of lipidomics datasets.
Analysis of targeted lipidomics datasets
Lipidomics data acquisition strategies can be generally subdivided into targeted and untargeted workflows. In targeted lipidomics, a predefined set of lipids with known mass-to-charge ratios (m/z) of the precursor and fragment/product ion(s) needs to be provided by the user before data acquisition. Moreover, ionization and MS parameters must be optimized for each pair of precursor–product ions (a so-called ‘transition’) to maximize the sensitivity of the method. Targeted analysis using single or multiple reaction monitoring (SRM, MRM) on triple quadrupole instruments and, more recently, parallel reaction monitoring (PRM) on orbitrap and quadrupole time-of-flight instruments has been successfully applied to the quantification of selected sets of lipids as well as hundreds of lipids in large sample cohorts (for example, over 600 lipid species in one liquid chromatography (LC)–MS/MS analysis43). However, to quantify a large number of lipids in a correspondingly large sample cohort, targeted lipidomics workflows should be quick to establish, and the results obtained should be easy to inspect and validate. Method design can be extremely time consuming and is most often not accessible to non-experts. Thus, specialized tools can be used to facilitate both method design and data processing steps. For software-assisted method design, the user should define the type of targeted acquisition method planned (SRM/MRM or PRM) and the lipid (sub)classes/species to be covered. Transitions can be selected from experimentally validated or computationally optimized sets, or even predicted on the fly on the basis of common knowledge of lipid subclass-specific gas phase fragmentation chemistry. The set of fragment ions and their yield will strongly depend on the lipid class, the number of double bonds, the fatty acyl length, and even the type of instrument on which data were acquired. For instance, with LipidCreator44 the targeted assay can be generated in three steps.
In brief, during step 1 the user would select the lipid category and class to work with and define fatty acyl, double bond, hydroxyl group, and adduct constraints (precursor selection) as well as the polarity mode to analyze lipids of interest. In step 2 the monitored fragments at the MS/MS level can be defined. In step 3 the designed molecules can be added to the target list, reviewed, and transferred to the MS instrument for data acquisition. METLIN-MRM45 is another data-rich resource where users can choose from experimentally and/or computationally optimized transitions or even public repository transitions with links to corresponding DOIs.
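The three steps can be sketched in a few lines of code. This is an illustration only and not the LipidCreator interface: it assumes diacyl PC species detected as [M+H]+ in positive mode, uses the well-known phosphocholine product ion at m/z 184.0733 as the single monitored fragment, and all function names and the fatty acyl constraints are hypothetical.

```python
from itertools import combinations_with_replacement

MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146, "P": 30.9737616}
PROTON = 1.00727647
PC_HEADGROUP_MZ = 184.0733  # phosphocholine product ion of PC in positive mode


def pc_precursor_mz(carbons, double_bonds):
    """[M+H]+ of a diacyl PC with the given total acyl carbons and double bonds."""
    neutral = (
        MONO["C"] * (carbons + 8)
        + MONO["H"] * (2 * carbons + 16 - 2 * double_bonds)
        + MONO["N"]
        + 8 * MONO["O"]
        + MONO["P"]
    )
    return neutral + PROTON


def build_transitions(fatty_acyls):
    """Enumerate precursors (step 1), attach a fragment (step 2), collect targets (step 3)."""
    targets = {}
    for (c1, d1), (c2, d2) in combinations_with_replacement(fatty_acyls, 2):
        name = f"PC {c1 + c2}:{d1 + d2}"  # bulk-level name of the precursor
        targets[name] = {
            "precursor_mz": round(pc_precursor_mz(c1 + c2, d1 + d2), 4),
            "product_mz": PC_HEADGROUP_MZ,
            "polarity": "positive",
        }
    return targets


# Hypothetical fatty acyl constraints: 16:0, 18:1 and 20:4.
transitions = build_transitions([(16, 0), (18, 1), (20, 4)])
```

Because the headgroup fragment is shared by all PC species, this target list resolves lipids only at the bulk level; chain-specific fragments would be needed to monitor individual molecular species.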
Although method design requires careful optimization and is time consuming, post-acquisition data processing of targeted lipidomics datasets is relatively straightforward and follows general rules of LC–MS/MS-based targeted quantification accepted in both the proteomics and metabolomics communities. Indeed, several open-access tools originally developed for targeted analysis of peptides (Skyline) or metabolites (XCMS-MRM) have been adapted for lipidomics applications. LipidCreator is fully integrated with Skyline46 for small molecules, which makes it vendor independent. METLIN-MRM-assisted method development can be directly extended to post-acquisition data processing using the XCMS-MRM45 platform. Both Skyline and XCMS-MRM provide automated solutions for peak integration, relative and absolute quantification, and data quality control.
Lipid identification from untargeted lipidomics datasets
A second analytical strategy commonly used in lipidomics relates to untargeted workflows on the basis of data-dependent (DDA) or data-independent acquisition (DIA). Here users perform MS analysis of a lipidome in so-called ‘discovery’ mode without prior knowledge of the exact set of lipids to be analyzed in the sample. Generally, the main aim of untargeted lipidomics is to analyze and ideally identify as many lipid species as possible (ultimately all ionizable constituents extracted from the sample). Both DDA and DIA experiments rely on the iteration of instrument cycles that include MS1 survey scans (usually acquired at high resolution to define the elemental composition of the lipid ions) and a number of MS/MS spectra in which lipid ions, selected on the basis of their abundance (DDA) or within a given m/z range (DIA), undergo collision-induced dissociation (CID). MS/MS information is then used to assign lipids to particular molecular species on the basis of their known gas phase fragmentation patterns. Thus, untargeted lipidomics experiments can support lipid identification at different levels of structural assignment with high-resolution MS spectra providing only elemental composition and thus the putative bulk composition of the lipid (for example, PC 36:4), but with additional MS/MS information supporting the identification of lipids at molecular species levels (for example, PC 16:0_20:4). Although this is possible by manually checking MS and corresponding MS/MS spectra, lipid identification requires automated solutions to support analysis throughput, as within commonly used LC–MS/MS DDA setups, thousands of individual MS/MS spectra are generated within a single analysis.
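The relationship between the two annotation levels in this example can be made explicit with a small helper. The sketch below collapses a molecular species name into its sum composition; it is a simplified illustration that assumes plain `carbons:double_bonds` chains separated by `_` or `/` and does not handle ether prefixes, hydroxylation suffixes, or other extensions of the shorthand nomenclature. The function name is hypothetical.

```python
import re


def to_sum_composition(name):
    """Collapse e.g. 'PC 16:0_20:4' (molecular species) to 'PC 36:4' (bulk level)."""
    lipid_class, chains = name.split(" ", 1)
    carbons = 0
    double_bonds = 0
    # '_' separates chains of unknown position, '/' chains of known position.
    for chain in re.split(r"[_/]", chains):
        c, d = chain.split(":")
        carbons += int(c)
        double_bonds += int(d)
    return f"{lipid_class} {carbons}:{double_bonds}"


bulk = to_sum_composition("PC 16:0_20:4")  # 'PC 36:4'
```

The reverse direction is not possible from the name alone: PC 36:4 is compatible with PC 16:0_20:4, PC 18:2_18:2, and other chain combinations, which is why MS/MS information is required for the molecular species level.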
Owing to the high demand and popularity of untargeted lipidomics workflows, numerous tools have been developed to support this area. Accordingly, the untargeted lipidomics section of the interactive chart comprises nine software tools with open access for academic users. By clicking on the corresponding ‘Task specific information’ tabs, users can see which tools support only specific acquisition strategies and which cover broader application areas. To support the selection of the optimal identification tool, the user can choose among high-resolution MS applications (Lipid Data Analyzer (LDA)47, LipidFinder48, MS-DIAL49, XCMS online50), DDA (LDA51,52, MS-DIAL, LipidHunter253, LipidXplorer54, Lipostar255, and MZmine56,57,58), DIA (MS-DIAL and Lipostar2), and even datasets acquired using ion mobility methods, which provide separation orthogonal to LC–MS/MS (MS-DIAL, MZmine, Lipostar2). Furthermore, analysis of epilipidomics datasets focusing on the identification of oxidized lipids is supported by LDA59, Lipostar2, LPPtiger60, and MS-DIAL. For each application listed above, the ‘Task specific information’ tab provides information about the main principles of operation, scoring, and accuracy measures.
Lipid quantification from untargeted lipidomics datasets
The quantification of lipids provides their abundance (relative or absolute) in a biological sample, enabling comparison with other samples. Quantified values aid harmonization across lipidomics datasets. Quantitative analysis can be performed using data acquired from targeted and untargeted approaches regardless of whether they were acquired using Full-MS, DDA, or DIA modes. Untargeted lipidomics quantification can be subdivided into relative (for example, fold change between condition 1 and condition 2) and semi-absolute (for example, expressed in pmol μg−1 of proteins). Owing to the extremely large diversity of lipid structures in natural lipidomes and relatively limited numbers of commercially available lipid standards, it is not feasible to perform absolute quantification at a true lipidome level61,62. On the other hand, owing to the close similarity in ionization and MS behavior of lipids from the same subclass, the use of one or a small number of internal standards per subclass is currently considered as a compromise. Isotopic correction algorithms can be used during data processing to minimize the effect of structural differences between internal standards and individual lipid molecular species63. Lipids present a particular challenge for accurate identification since there will be several hundreds of lipids distributed over a relatively narrow m/z range (for example, from 400 to 900 m/z), as well as a high number of isobaric and even isomeric species. Additionally, lipids are detected over a large dynamic range of concentrations in natural lipidomes. These issues result in significant challenges for accurate peak assignment and integration and downstream accurate quantification62. Tools for processing quantitative lipidomics datasets have benefited from previously developed software solutions designed for quantitative proteomics and metabolomics. 
However, owing to the special properties of lipids as outlined above, additional optimizations are necessary to ensure the accuracy of lipidomics data processing. For instance, data normalization using a preconfigured set of internal standards (for example, in Lipostar2 and MS-DIAL) has been introduced to simplify the normalization process and reduce the post-processing of the data matrix. Additionally, robust peak picking and peak boundary selection algorithms are critical for obtaining accurate peak areas for quantitative analysis. Although several robust peak picking algorithms are available, manual adjustment and re-integration are often required because of the high number of isobaric and isomeric species. Additional features integrated within data processing tools, such as peak alignment and deconvolution, are important to handle lipid species with multiple adduct types and to process DIA datasets. Currently available quantification tools such as LDA, Lipostar2, MS-DIAL, MZmine, and XCMS online generally provide integrated pipelines from lipid identification up to quantification, including essential normalization functions. For each tool, the ‘Task specific information’ section within the LIPID MAPS Lipidomics Tools Guide displays multiple features to guide the choice of the tool on the basis of user requirements, including details on quantification methods and accuracy measures.
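The arithmetic behind single-point internal standard normalization is simple enough to write out. The sketch below assumes one internal standard per lipid subclass and, as a simplification, an identical detector response for the analyte and its standard (a response factor of 1); the function name and the example numbers are hypothetical.

```python
def semi_absolute_amount(analyte_area, istd_area, istd_pmol, protein_ug):
    """Lipid amount in pmol per ug protein from a single internal standard.

    analyte_area / istd_area : peak areas of the lipid and its subclass standard
    istd_pmol                : amount of internal standard spiked into the sample
    protein_ug               : protein content of the extracted sample
    """
    if istd_area <= 0 or protein_ug <= 0:
        raise ValueError("internal standard area and protein amount must be positive")
    return analyte_area / istd_area * istd_pmol / protein_ug


# Hypothetical example: analyte peak twice as large as 50 pmol of standard,
# extracted from 25 ug of protein.
amount = semi_absolute_amount(2.0e6, 1.0e6, 50.0, 25.0)  # 4.0 pmol per ug protein
```

The result is semi-absolute rather than absolute precisely because of the response-factor assumption: lipids of the same subclass ionize similarly, but not identically, to the standard.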
Statistical analysis and visualization of lipidomics datasets
Lipidomics research generates large datasets, and the complexity of experimental design is also increasing. Therefore, a critical bottleneck in lipidomics data processing is often the statistical analysis, which requires extensive use of tailored approaches that take into account the specific characteristics of lipid data. Different methods are available for the analysis of lipidomics data, each with its own advantages and pitfalls. The choice of statistical methods should first be guided by the aim of the lipidomic study. When testing for statistical significance between predefined groups is desired (for example, health versus disease), differences between groups of samples are usually evaluated by applying parametric (for example, t-test, ANOVA) or non-parametric (for example, Wilcoxon signed-rank test, Kruskal–Wallis) statistical hypothesis tests64. With thousands of lipids being considered in some lipidomics experiments, the high number of variables increases the chance of finding spuriously correlated variables (false positives). Therefore, correction for multiple comparison testing is required. In addition, in lipidomics, variables (lipids) are usually not all truly independent (for example, one lipid can be represented by several ions or adducts), meaning that corrections commonly applied in genomics/transcriptomics, such as Bonferroni or Benjamini–Hochberg, can significantly overcorrect. Here, softer corrections, such as sequential goodness of fit, represent an alternative that may be more appropriate65.
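To make the effect of multiple-testing correction concrete, the two standard procedures named above can be written in a few lines. This is a minimal sketch of the textbook Bonferroni and Benjamini–Hochberg adjustments, not of the sequential goodness of fit approach; the function names and example P values are illustrative.

```python
def bonferroni(p_values):
    """Multiply every P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that adjusted values
    # never increase as the raw P values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

With these four hypothetical P values, Bonferroni leaves two of them below 0.05, whereas Benjamini–Hochberg leaves all four below 0.05, which illustrates how strongly the choice of correction can change the list of reported lipids.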
Another consideration is that detected features might not always follow a normal distribution66. Thus, multivariate statistical approaches, in which all the variables are considered simultaneously, often by assuming they are correlated and not fully independent, are extensively applied in lipidomics. For explorative purposes, principal component analysis (PCA)67 represents the most widely used approach in omics, including lipidomics68. Using PCA, the original dataset is represented in a lower-dimensional subspace that maintains most of the relevant information (variance). Being an unsupervised method, PCA does not require a priori knowledge of the dataset and can be used not only to explore any clusters of samples that form but also for interpretation without imposing any information on classification or cluster association. Hierarchical or non-hierarchical clustering methods aim at grouping samples by similarity, which is measured using statistical distances or similarities between samples69. Supervised algorithms for dimensionality reduction, such as linear discriminant analysis70,71 or partial least squares discriminant analysis72,73, are also available to evaluate and classify sample identity. In addition to partial least squares methods, other machine learning approaches have also been used in lipidomics applications. Among them, supervised methods such as support vector machine74 and random forest75 have been used for classification purposes and can also be used for feature selection. Despite the wide availability of statistical tools applied to lipidomics, several potential issues need to be considered. For example, in large studies, the so-called ‘batch effect’ can hamper statistical analysis, and correction with internal standards and/or quality controls must be made before the application of statistical tools.
Also, missing data, which result from molecule concentrations below detection limits and are very common in lipidomics, can be detrimental to model generation and interpretation, with some tools being more sensitive than others68. Nevertheless, several strategies for missing data imputation have been proposed76.
Generally, the multi-functional tools described above for quantitative lipidomics (LDA, Lipostar2, MS-DIAL, MZmine, and XCMS online) all provide integrated platforms for statistical analysis and data visualization. Additionally, several tools were specifically developed to support chemometrics analysis and result visualization of metabolomics and lipidomics data (LIPID MAPS Statistical Analysis Tools77 and MetaboAnalyst 5.078). Integrated statistical analysis and visualization functions provide easy access to the most common functions, including univariate (parametric and non-parametric testing) as well as multivariate (unsupervised and supervised) solutions, with a close interactive connection to the corresponding lipid quantification data matrix, and are often bundled with data pretreatments including normalization, scaling, and visualization of filtered data subsets. Dedicated tools (LIPID MAPS Statistical Analysis Tools and MetaboAnalyst 5.0) might require researchers to transform the quantification data according to specific templates for dataset import but can provide a more extensive set of statistical and visualization functions with detailed customizable configurations. MetaboAnalyst 5.0, for instance, has a dedicated utility for batch-effect correction, which contains nine methods well established in the field of metabolomics, as well as eight methods for missing value imputation79.
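As an illustration of what such an imputation step does, the sketch below replaces each missing value with half of the smallest value observed for the same lipid. Half-minimum replacement is used here only as a simple, commonly encountered example of a strategy for values below the detection limit; it is an assumption of this sketch and not a recommendation derived from the cited work. The function name and the data layout (rows are samples, columns are lipids, `None` marks a missing value) are hypothetical.

```python
def impute_half_minimum(matrix):
    """Replace None in each lipid column with half of that column's minimum."""
    n_cols = len(matrix[0])
    imputed = [list(row) for row in matrix]  # copy so the input is not modified
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        if not observed:
            continue  # lipid never detected: leave the column untouched
        fill = min(observed) / 2
        for row in imputed:
            if row[j] is None:
                row[j] = fill
    return imputed


data = [
    [10.0, None],
    [4.0, 6.0],
    [None, 8.0],
]
completed = impute_half_minimum(data)
```

Whatever strategy is chosen, it is worth recording which values were imputed, since downstream multivariate models differ in how sensitive they are to them.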
Data integration solutions
The ultimate aim of many lipidomics studies is to investigate biological relevance and mechanisms behind lipidome remodeling driven by the specific biological conditions. Considering the nature of ‘big data’ produced by lipidomics experiments, manual evaluation of the biological significance of obtained results would be extremely time consuming and require extensive knowledge in diverse areas of biochemistry and cell biology. Such advanced data integration goes well beyond single lipidomics data matrices and extends into related multiomics approaches using curated pathways or network analysis strategies. The combination and utilization of multiomics data from different sources require sophisticated data pretreatments, including manual curation and advanced bioinformatics solutions. This type of workflow can be generally divided into three steps: conversion of lipid annotations to their corresponding IDs within knowledge and ontology databases, lipid ontology enrichment, and advanced pathway/network analysis.
Tools that are capable of bridging lipid annotations supported by purely lipidomics software with the structural or functional IDs in data integration tools provide the first critical step towards systems biology integration of lipidomics datasets. To reduce the complexity of ID cross-validation and database queries, several tools are available to assist this conversion (Goslin80, LipidLynxX81, and RefMet42) and to link lipid identifiers to various databases (BridgeDb82, Goslin, LipidLynxX, and RefMet). For example, BridgeDb has mappings to other databases for almost 19,000 LIPID MAPS identifiers (https://doi.org/10.6084/m9.figshare.13550384.v1).
Biological interpretation of lipidomics data is often driven by the focus on individual lipids. Although this approach is useful in biomarker discovery, it obscures the possible effects of shared properties of molecules related to the biological phenomenon. A way to circumvent this is to manually curate lipid groups that share specific properties (for example, lipid class, level of unsaturation) and report aggregate statistics. However, the manual construction of these groups is often laborious owing to the ambiguity in lipid nomenclature and introduces a risk of cherry-picking. Ontologies, which are formalizations of concepts and their relations, have been successful in other omics fields in providing frameworks for constructing groups of molecules with shared biological properties. For lipidomics data, several ontologies, such as lipid ontology (LION/web)16 and Lipid Mini-On83, aid biological interpretation. Currently, LION links over 50,000 lipids to chemical (for example, LIPID MAPS classification and fatty acid associations), physicochemical (for example, membrane fluidity and intrinsic curvature), and cell biological (for example, predominant subcellular localization) properties, and Lipid Mini-On uses a text mining strategy to attribute Lipid Ontology structural terms to lipids.
Typically, ontology-derived groups of molecules (‘terms’) are analyzed using enrichment analysis approaches. In these analyses, a given term is enriched if the molecules belonging to the term are overrepresented in a target list or are ranked higher in a list of molecules ordered by a statistic (for example, fold change and P value) than expected by chance. Both LION/web and Lipid Mini-On are freely available online tools that perform ontology-term enrichment analysis of user-provided lipidomics data. LION/web allows specific LION-term categories to be included for analysis. After submission, LION/web reports descriptive matching statistics and enrichment analysis, as well as publication-ready figures. Traditionally, enrichment analyses compare two groups of samples. To analyze datasets with more sample groups, LION/web was recently expanded with the PCA-LION heat map module. This module generates a heat map showing the most dynamic LION-terms for all samples on the basis of the enrichment analysis of a given number of principal components. Lipid IDs of significantly enriched terms can be further mapped to available pathways and networks to investigate the changes at the systems level. Lipid Mini-On can generate a variety of visualizations of lipid enrichment by structural characteristics. Lipids and their associated lipid ontology terms can be visualized as a network to hierarchize interpretations of the enrichment performed.
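For the target-list case, ‘overrepresented than expected by chance’ can be quantified with a one-sided hypergeometric test, sketched below. This is a generic textbook formulation; whether LION/web or Lipid Mini-On use exactly this test is not stated here, and the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_term, n_target, n_overlap):
    """P(X >= n_overlap) when n_target lipids are drawn from n_background lipids,
    of which n_term carry the ontology term."""
    total = comb(n_background, n_target)
    upper = min(n_term, n_target)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_target - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 10 measured lipids, 4 annotated with the term,
# 3 lipids in the target list, all 3 of them annotated with the term.
p = enrichment_p_value(10, 4, 3, 3)  # 4 / 120
```

Because one such test is run per term, the resulting P values are themselves subject to the multiple-testing considerations discussed in the statistical analysis section.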
Several tools are available to support pathway and network analysis of lipidomics datasets, including integrated pathway graph analysis modules in Lipostar2 and stand-alone web application BioPAN84, which allows the visualization of quantitative lipidomics data in the context of known biosynthetic pathways, as well as the central hub of community-driven pathways represented by the Lipid Portal on WikiPathways26, in collaboration with LIPID MAPS. Though more advanced analysis can be performed with highly customized programs and scripts by experienced bioinformaticians, these tools provide simple interfaces for researchers to begin to map lipidomics data to obtain essential lipid centric analysis results from predefined pathways and networks in, for example, PathVisio85 and Cytoscape86. Furthermore, the pathways from WikiPathways can be easily converted to a network through the WikiPathways App87, after which these networks can be extended with additional knowledge such as micro RNAs, transcription factors, or drugs through the CyTargetLinker88 app.