How the KEGG Database Is Powering Modern Biology
From genomic reference to computational framework for biological systems
Imagine trying to understand a complex machine by examining its individual parts without any assembly diagram. For decades, this was the challenge facing biologists studying living organisms. Then, in 1995, as the first complete bacterial genome was sequenced, Professor Minoru Kanehisa at Kyoto University foresaw the coming data deluge and created a solution: the Kyoto Encyclopedia of Genes and Genomes (KEGG) [9]. What began as a reference resource has evolved into an indispensable computational framework that helps researchers worldwide decode the molecular wiring of life itself.
KEGG has transformed from a simple database into a comprehensive biological systems resource that integrates genomic, chemical, and health information. By representing biological systems in terms of molecular networks, KEGG allows scientists to move beyond studying individual genes or proteins to understanding how they interact in complex pathways, much like understanding how individual components work together in a sophisticated machine [5, 9].
This systems approach has become crucial for analyzing massive molecular datasets generated by modern high-throughput technologies, making KEGG an essential tool in today's bioinformatics research.
KEGG is built upon four interconnected categories of information that work together to provide a holistic view of biological systems [7, 9]:
At the core of KEGG are the manually drawn pathway maps: visual representations of molecular interaction networks that capture knowledge from published literature [1, 5].
The true power of KEGG emerges through the KEGG Orthology (KO) system, which groups functionally similar genes across different organisms [7]. This allows researchers to study pathways conserved in species from bacteria to humans.
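The idea of grouping genes from different organisms under a shared KO entry can be sketched in a few lines of Python. This is a minimal illustration, not KEGG's own implementation: it assumes input in a two-column, tab-separated `gene<TAB>ko` layout like the text returned by the KEGG REST `link` operation, and the organism codes, gene names, and K numbers in `SAMPLE` are hypothetical placeholders. The helper names `group_genes_by_ko` and `conserved_kos` are likewise made up for this example.

```python
from collections import defaultdict

# Hypothetical sample rows: "<organism>:<gene>\tko:<K number>" (IDs are placeholders).
SAMPLE = "aaa:gene1\tko:K90001\nbbb:gene7\tko:K90001\naaa:gene2\tko:K90002\n"


def group_genes_by_ko(link_text):
    """Return {ko_id: {organism_code: [gene_ids]}} from tab-separated gene/KO pairs."""
    groups = defaultdict(lambda: defaultdict(list))
    for line in link_text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # skip blank or malformed rows
        gene_id, ko_id = parts
        if ko_id.startswith("ko:"):
            ko_id = ko_id[len("ko:"):]
        organism = gene_id.split(":", 1)[0]
        groups[ko_id][organism].append(gene_id)
    # Convert nested defaultdicts to plain dicts for predictable behavior.
    return {ko: dict(orgs) for ko, orgs in groups.items()}


def conserved_kos(groups, organisms):
    """List KO ids that have at least one gene in every organism given."""
    wanted = set(organisms)
    return sorted(ko for ko, orgs in groups.items() if wanted <= set(orgs))


if __name__ == "__main__":
    groups = group_genes_by_ko(SAMPLE)
    print(conserved_kos(groups, ["aaa", "bbb"]))
```

Keying the result by KO first, then by organism, makes the "conserved across species" question a simple set comparison.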
| Category | Database | Primary Content | Research Application |
|---|---|---|---|
| Systems Information | PATHWAY | Manually drawn pathway maps | Pathway mapping and analysis |
| | BRITE | Hierarchical functional classifications | Gene function categorization |
| | MODULE | Functional units and complexes | Module identification in genomes |
| Genomic Information | ORTHOLOGY | Ortholog groups (KO entries) | Cross-species functional analysis |
| | GENES | Genes from complete genomes | Genomic annotation |
| | GENOME | Complete genome sequences | Comparative genomics |
| Chemical Information | COMPOUND | Metabolites and small molecules | Metabolomics research |
| | REACTION | Biochemical reactions | Metabolic network reconstruction |
| | ENZYME | Enzyme nomenclature | Enzyme function prediction |
| Health Information | DISEASE | Disease genes and networks | Disease mechanism studies |
| | DRUG | Drug targets and interactions | Pharmaceutical research |
Traditional pathway analysis methods have long treated each pathway as independent, despite biological knowledge that pathways extensively cross-talk and regulate one another. This limitation prompted researchers to develop more sophisticated analytical approaches.
A breakthrough methodology published in BMC Bioinformatics introduced a decision analysis model that accounts for the inherent dependencies among pathways [8]. This approach recognizes that in real biological systems, pathways don't operate in isolation; they influence each other through complex regulatory relationships.
The model computes a decision coefficient (DC) for each pathway, which identifies the most relevant pathways by considering both direct impact and indirect influences from related pathways [8].
To validate their approach, researchers applied the decision analysis model to a microarray dataset from bovine mammary tissue collected throughout the entire lactation cycle [8]. This time-course experiment presented the perfect scenario to test the method, as lactation involves precisely orchestrated changes in multiple interacting pathways.
Impact values for each pathway were computed using the Dynamic Impact Approach (DIA), which aggregates gene-level statistics including proportion of differentially expressed genes, their average fold change, and statistical significance [8].
The correlation structure among pathway impacts was analyzed to quantify their interrelationships.
The decision coefficient (DC) was calculated for each pathway, incorporating both direct effects and indirect effects through other pathways.
The sign and magnitude of DC values were used to identify the most biologically relevant pathways and their activation states.
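The impact computation in the first step above can be sketched as follows. This is a simplified illustration rather than the published DIA code: it assumes the three gene-level statistics are combined by multiplication (proportion of differentially expressed genes, times their mean absolute log2 fold change, times their mean negative log10 p-value), and the `GeneStat` type, the `pathway_impact` name, and the `alpha` cutoff of 0.05 are assumptions made for this example.

```python
import math
from typing import NamedTuple, Sequence


class GeneStat(NamedTuple):
    log2_fc: float  # log2 fold change for one gene in the pathway
    p_value: float  # significance of that change


def pathway_impact(genes: Sequence[GeneStat], alpha: float = 0.05) -> float:
    """Combine gene-level statistics into one impact score (assumed product form)."""
    if not genes:
        return 0.0
    # Differentially expressed genes: those at or below the significance cutoff.
    deg = [g for g in genes if g.p_value <= alpha]
    if not deg:
        return 0.0
    proportion = len(deg) / len(genes)
    mean_abs_fc = sum(abs(g.log2_fc) for g in deg) / len(deg)
    # Clamp p-values so that log10 never receives zero.
    mean_neg_log_p = sum(-math.log10(max(g.p_value, 1e-300)) for g in deg) / len(deg)
    return proportion * mean_abs_fc * mean_neg_log_p


if __name__ == "__main__":
    demo = [GeneStat(2.0, 0.01), GeneStat(-1.0, 0.001),
            GeneStat(0.1, 0.5), GeneStat(0.0, 0.9)]
    print(pathway_impact(demo))
```

Computing one such value per pathway at each time point yields the impact series whose correlations feed the later steps.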
| KEGG Pathway Category | Direct Determination Ratio | Indirect Determination Ratio | Decision Coefficient | Biological Interpretation |
|---|---|---|---|---|
| Lipid Metabolism | 0.32 | 0.68 | +0.45 | Highly cooperative regulation with other pathways |
| Carbohydrate Metabolism | 0.41 | 0.59 | +0.38 | Moderate cooperative regulation |
| Signal Transduction | 0.55 | 0.45 | +0.29 | More independent function |
| Amino Acid Metabolism | 0.38 | 0.62 | +0.41 | Strong network regulation |
| Cellular Processes | 0.49 | 0.51 | +0.25 | Balanced direct and indirect regulation |
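For readers who want the arithmetic behind a decision coefficient, classical path analysis offers one common definition. The block below is a sketch under stated assumptions: it treats the impact of pathway i as a standardized predictor x_i of some response y, with path coefficient b_i and correlations r_ij between pathway impacts. Whether the cited study uses exactly this form, and how the ratio columns in the table are normalized, is not stated here, so the symbols should be read as illustrative.

```latex
% Assumed notation: b_i = standardized path coefficient of pathway i,
% r_{ij} = correlation between impacts of pathways i and j,
% r_{iy} = correlation between pathway i and the response y.
\begin{align}
r_{iy} &= b_i + \sum_{j \neq i} r_{ij}\, b_j \\
R(i)   &= \underbrace{b_i^{2}}_{\text{direct}}
        + \underbrace{2\, b_i \sum_{j \neq i} r_{ij}\, b_j}_{\text{indirect}}
        = 2\, b_i\, r_{iy} - b_i^{2}
\end{align}
```

The second equality follows by substituting the first line into the indirect term. Under this definition R(i) can be negative, which is consistent with using both its sign and its magnitude when ranking pathways.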
The results demonstrated that traditional methods would have overlooked crucial biological insights. For instance, the analysis revealed that for lipid metabolism pathways, approximately 68% of their determination came from indirect effects through other pathways [8]. This highlighted the extensive cross-talk between metabolic and signaling pathways during lactation, a finding that would have been masked by conventional approaches treating pathways as independent entities.
Researchers start with lists of differentially expressed genes, proteins, or metabolites, ensuring proper ID formatting.
Molecular entities are mapped to KEGG pathways using tools like BlastKOALA or KEGG Mapper [3].
Results are visualized on KEGG pathway maps, where colors indicate regulation states [4].
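The last step of this workflow, turning regulation states into colors, might look like the sketch below. It assumes a simple `ID color` line layout of the kind accepted by pathway-coloring tools such as KEGG Mapper's color function; the exact input syntax should be checked against the tool's documentation. The function name `color_lines`, the fold-change threshold of 1.0, the red/blue/gray scheme, and the K numbers are all assumptions chosen for illustration.

```python
def color_lines(log2_fc_by_id, threshold=1.0, up="red", down="blue", neutral="gray"):
    """Build one 'ID color' line per entry from log2 fold changes.

    Entries at or above +threshold get the 'up' color, entries at or below
    -threshold get the 'down' color, and everything else gets 'neutral'.
    """
    lines = []
    for entry_id, fold_change in sorted(log2_fc_by_id.items()):
        if fold_change >= threshold:
            color = up
        elif fold_change <= -threshold:
            color = down
        else:
            color = neutral
        lines.append(f"{entry_id} {color}")
    return lines


if __name__ == "__main__":
    # Hypothetical KO identifiers with made-up fold changes.
    demo = {"K90001": 2.3, "K90002": -1.5, "K90003": 0.2}
    print("\n".join(color_lines(demo)))
```

Sorting by identifier keeps the output stable between runs, which makes it easier to compare color assignments across experiments.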
| Tool Name | Tool Type | Primary Function | Best For |
|---|---|---|---|
| KEGG Mapper | Mapping tool | Pathway/BRITE/MODULE mapping | Visualizing user data on KEGG pathways |
| BlastKOALA | Annotation server | Automatic genome annotation with KOs | Annotating newly sequenced genomes |
| GhostKOALA | Annotation server | Metagenome annotation with KOs | Analyzing metagenomic datasets |
| KEGG OC | Orthology tool | Browsing ortholog clusters | Comparative genomics across species |
| PathPred | Prediction tool | Pathway prediction from compounds | Predicting metabolic routes |
| SIMCOMP | Chemical tool | Chemical structure similarity search | Metabolite identification |
KEGG pathways enable systematic identification of drug targets by revealing critical nodes in disease-associated pathways. The integration of drug information allows researchers to explore drug repurposing opportunities and understand mechanisms of drug action and toxicity.
KEGG helps map molecular networks underlying disease processes. By integrating genetic variation data with signaling pathway information, researchers can visualize how genetic perturbations disrupt normal cellular functions and identify potential biomarkers.
KEGG bridges the gap between genetic information and metabolic processes. Researchers can interpret high-throughput metabolomic data by mapping identified metabolites onto KEGG metabolic pathways, then connecting these to the genes and enzymes responsible for their synthesis and degradation.
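The chain from metabolite to gene described here can be pictured as two lookups: compound to reactions, then reaction to ortholog groups. The sketch below uses plain dictionaries with hypothetical placeholder identifiers. In practice these tables would be filled from KEGG's COMPOUND, REACTION, and ORTHOLOGY cross-links, and the function name `kos_for_metabolites` is an assumption for this example.

```python
def kos_for_metabolites(metabolites, compound_to_reactions, reaction_to_kos):
    """Map each metabolite to the KO groups of reactions that involve it."""
    result = {}
    for compound in metabolites:
        kos = set()
        # Follow compound -> reaction -> KO; unknown ids simply yield nothing.
        for reaction in compound_to_reactions.get(compound, ()):
            kos.update(reaction_to_kos.get(reaction, ()))
        result[compound] = sorted(kos)
    return result


if __name__ == "__main__":
    # Placeholder ids, not real KEGG entries.
    compound_to_reactions = {"C_demo1": ["R_demo1", "R_demo2"], "C_demo2": ["R_demo2"]}
    reaction_to_kos = {"R_demo1": ["K90001"], "R_demo2": ["K90002", "K90003"]}
    print(kos_for_metabolites(["C_demo1", "C_demo2", "C_unknown"],
                              compound_to_reactions, reaction_to_kos))
```

A metabolite with no known reactions maps to an empty list rather than raising an error, which suits exploratory metabolomics data where many features remain unannotated.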
Recent advances continue to expand KEGG's capabilities. The emergence of specialized bioinformatics tools like the "ggkegg" package has enhanced pathway visualization, enabling simultaneous analysis of transcriptomic and proteomic data [7]. In cancer research, systems like BRCA-Pathway integrate genomic cancer databases with KEGG pathways to visualize signaling network alterations in tumors [7].
The KEGG NETWORK database represents another innovation, enabling visualization of how genetic variations influence cellular signaling pathways, which is particularly valuable for understanding complex diseases with polygenic inheritance [7].
From its beginnings as a genomic reference in 1995, KEGG has evolved into a dynamic modeling framework for biological systems. As Professor Kanehisa stated, KEGG aims to be a "computer representation of the biological system" [9], an ambitious goal that continues to drive its development.
The future of KEGG lies in increasingly sophisticated integration of diverse biological data types and in developing more powerful analytical approaches that account for the true complexity of biological networks. Methods like the decision analysis model represent just the beginning of this journey toward understanding biology as an integrated system rather than a collection of isolated parts.
As sequencing technologies advance and multi-omics datasets grow increasingly complex, resources like KEGG will become ever more essential for extracting meaningful biological insights from the data deluge. They serve not merely as databases, but as conceptual frameworks that help researchers ask better questions, design more informative experiments, and ultimately piece together the magnificent puzzle of life at the molecular level.