SHARE: The Internet Search Engine for Life's Big Data

Revolutionizing bioinformatics by connecting biological data across distributed sources using Semantic Web technologies

The Biological Data Tsunami

Imagine trying to find a single, specific sentence in a library containing every book ever written, while simultaneously figuring out how the ideas in different books connect to tell a completely new story.

This is the monumental challenge facing biologists today. The field of bioinformatics is experiencing a data deluge of unprecedented scale, with high-throughput technologies generating petabytes of molecular information about everything from human DNA to deadly viruses [4].

[Figure: Bioinformatics Data Growth]

The Bioinformatics Bottleneck

For decades, biological discovery followed a straightforward path. Today, a single sequencing machine can generate massive genomic datasets in a day [1].

This embarrassment of riches created a new problem: data silos. Critical information became scattered across countless specialized databases and web services.

How SHARE Works: The Internet of Biological Data

At its core, SHARE functions as a specialized query engine for bioinformatics that understands both data and meaning. Think of it as a super-powered search engine that doesn't just find keywords but comprehends biological concepts and relationships.

SHARE builds upon Semantic Web standards, which are designed to make online information machine-readable [1, 7].

SHARE Framework Components
  • SPARQL: Sophisticated query language for multi-database searches
  • OWL Ontologies: Formal representations of biological concepts
  • Web Services: Bridges to biological databases and tools [1, 7]
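
To make the role of the query language more concrete, here is a minimal Python sketch of the kind of pattern matching a SPARQL query performs over subject-predicate-object statements. It is a toy in-memory matcher and not SHARE's implementation; the identifiers (`ex:proteinA`, `ex:encodedBy`, and so on) and the `match` function are invented for illustration.

```python
# Toy illustration of SPARQL-style triple pattern matching.
# All identifiers below are hypothetical; this is not SHARE's code.

TRIPLES = [
    ("ex:proteinA", "rdf:type", "ex:Protein"),
    ("ex:proteinA", "ex:encodedBy", "ex:geneA"),
    ("ex:proteinB", "rdf:type", "ex:Protein"),
    ("ex:proteinB", "ex:encodedBy", "ex:geneB"),
    ("ex:geneA", "ex:locatedOn", "ex:chromosome1"),
]


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all patterns.

    Terms starting with '?' are variables, as in SPARQL.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # Every term matched: continue with the remaining patterns.
            yield from match(rest, triples, candidate)


# Roughly: SELECT ?protein ?gene WHERE {
#   ?protein rdf:type ex:Protein .
#   ?protein ex:encodedBy ?gene .
#   ?gene ex:locatedOn ex:chromosome1 }
QUERY = [
    ("?protein", "rdf:type", "ex:Protein"),
    ("?protein", "ex:encodedBy", "?gene"),
    ("?gene", "ex:locatedOn", "ex:chromosome1"),
]

results = list(match(QUERY, TRIPLES))
```

In this toy data only `ex:proteinA` satisfies all three patterns, because its gene is the only one with a recorded chromosome. A real SPARQL engine performs the same kind of joining, but over standardized identifiers and data that may sit on different servers.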

SHARE Workflow Process

  1. Query Input: User submits a biological question in natural or formal language
  2. Semantic Analysis: System interprets biological concepts and relationships
  3. Distributed Querying: On-demand data retrieval from multiple sources [7]
  4. Result Integration: Automatic assembly of analytical workflows [1]
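
One way to picture the automatic assembly of a workflow is as a small planning problem. In the hedged Python sketch below, each service declares the property it needs on its input and the property it attaches to its output, and a planner chains services until the requested property can be produced. The service names, property names, and `plan_workflow` function are all hypothetical; SHARE's actual service discovery is more involved than this.

```python
# Toy sketch of automatic workflow assembly by chaining services.
# Service and property names are hypothetical, not real SADI services.

SERVICES = {
    # name: (property required on the input, property the service attaches)
    "gene_to_protein": ("gene_id", "protein_id"),
    "protein_to_sequence": ("protein_id", "sequence"),
    "protein_to_pathway": ("protein_id", "pathway"),
}


def plan_workflow(available, wanted, services=SERVICES):
    """Return an ordered list of service names that derives `wanted`
    from the properties in `available`, or None if no chain exists."""
    known = set(available)
    plan = []
    progress = True
    # Forward pass: keep applying any service whose input is available.
    while wanted not in known and progress:
        progress = False
        for name, (needs, gives) in services.items():
            if needs in known and gives not in known:
                known.add(gives)
                plan.append(name)
                progress = True
    if wanted not in known:
        return None
    # Backward pass: keep only the steps the target actually depends on.
    needed, pruned = {wanted}, []
    for name in reversed(plan):
        needs, gives = services[name]
        if gives in needed:
            needed.add(needs)
            pruned.append(name)
    return list(reversed(pruned))


workflow = plan_workflow({"gene_id"}, "sequence")
```

Starting from a gene identifier and asking for a sequence, the planner selects `gene_to_protein` followed by `protein_to_sequence` and leaves out the pathway service, which the question does not need.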

A Real-World Expedition: Tracking Viral Hijackers

To understand SHARE's power in action, consider a pressing biological question: how do viruses mimic human proteins to hijack our cellular machinery? This molecular mimicry allows pathogens to evade our immune systems and manipulate our cells for their replication.

Experimental Methodology

Researchers employed a systematic bioinformatics approach to identify what they termed the "share-ome": the complete set of sequences shared between hosts and pathogens.

Research Workflow:
  1. Data Collection: Retrieving protein sequences from NCBI databases
  2. Data Processing: Cleaning and removing redundant sequences
  3. K-mer Dictionary Construction: Breaking proteins into fragments
  4. Exhaustive Matching: Identifying identical sequences
  5. Functional Characterization: Analyzing biological roles
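
Steps 2 through 4 of this workflow can be sketched in a few lines of Python. This is a minimal illustration and not the researchers' pipeline: the sequences are invented toy strings, exact-duplicate removal stands in for the similarity-based clustering that a tool such as CD-HIT performs, and the function names are hypothetical. The fragment length of nine matches the nonamers reported in the results.

```python
# Minimal sketch of k-mer dictionary construction and exhaustive matching.
# Sequences and function names are illustrative assumptions.
from collections import defaultdict

K = 9  # nonamers


def deduplicate(proteins):
    """Drop exact duplicate sequences (a crude stand-in for clustering)."""
    seen, unique = set(), {}
    for name, seq in proteins.items():
        if seq not in seen:
            seen.add(seq)
            unique[name] = seq
    return unique


def kmer_index(proteins, k=K):
    """Map every length-k fragment to the set of proteins containing it."""
    index = defaultdict(set)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def shared_kmers(host, pathogen, k=K):
    """Return {kmer: (host proteins, pathogen proteins)} for identical k-mers."""
    host_index = kmer_index(deduplicate(host), k)
    pathogen_index = kmer_index(deduplicate(pathogen), k)
    common = host_index.keys() & pathogen_index.keys()
    return {kmer: (host_index[kmer], pathogen_index[kmer]) for kmer in common}


# Invented toy sequences, not real proteins.
human = {"human_1": "MKTAYIAKQRQISFVKSH", "human_2": "GGGGGGGGGGGG"}
virus = {"virus_1": "AAAYIAKQRQISLLL", "virus_2": "AAAYIAKQRQISLLL"}

shared = shared_kmers(human, virus)
```

Indexing both sides by k-mer turns the exhaustive comparison into a set intersection, so no pair of proteins has to be aligned directly. Each shared fragment keeps track of which host and pathogen proteins it came from, which is the starting point for the functional characterization in step 5.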

[Figure: Virus Distribution in Share-ome]

Revealing Results: Hidden Connections Between Humans and Viruses

Metric | Count | Biological Significance
Shared Nonamers | 2,430 | Molecular mimicry candidates
Mapped Flaviviridae Proteins | 16,946 | Viral proteins involved in sharing
Mapped Human Proteins | 7,506 | Human pathways potentially targeted
Flaviviridae Species Affected | 125 | Breadth of mimicry across virus types

[Figure: Shared Sequence Distribution]

Key Findings

These shared sequences were not evenly distributed across viral species. The majority (~68%) mapped to Hepatitis C virus, with significant representation from West Nile, Dengue, and Zika viruses.

Further analysis provided insights into the structural and functional implications of these shared sequences, supporting the hypothesis that molecular mimicry represents an evolutionary adaptation by viruses.

The Scientist's Toolkit

Research in the age of big data biology requires both computational infrastructure and specialized analytical frameworks.

  • SHARE Framework: Query engine for distributed querying across biological data sources
  • SADI Services: Web service standard that enables automated workflow assembly
  • NCBI Entrez APIs: Programming interface for retrieval of molecular data from public databases
  • CD-HIT: Computational tool for removal of redundant sequences from large datasets
  • SPARQL: Query language for semantic queries across linked data sources
  • OWL Ontologies: Knowledge representation for standardized biological concepts and relationships
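
As one example of how a toolkit entry is used in practice, the Python sketch below builds a request URL for NCBI's E-utilities `efetch` endpoint, asking for protein records in FASTA format. It only constructs the URL and sends nothing over the network; the helper name `protein_fasta_url` is hypothetical and the accession is just an example value.

```python
# Sketch: build an NCBI E-utilities efetch URL for protein FASTA records.
# No request is sent here; the accession below is only an example value.
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def protein_fasta_url(accessions):
    """Return an efetch URL requesting FASTA text for the given accessions."""
    params = {
        "db": "protein",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


url = protein_fasta_url(["NP_000537.3"])
```

A real pipeline would pass this URL to an HTTP client and would need to respect NCBI's published usage guidelines, including its limits on request rate.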

The Future of Discovery: Beyond Single Experiments

The implications of SHARE and similar semantic technologies extend far beyond individual experiments.

By making data integration automated and reliable, these systems accelerate the entire scientific discovery process. What previously took months of custom programming and data wrestling can now be accomplished through thoughtful query design.

  • Drug Discovery: Identifying unintended cross-reactions between drugs and human proteins by finding shared sequences
  • Vaccine Development: Pinpointing viral regions that are unlikely to mutate for more stable vaccine targets
  • Disease Surveillance: Tracking the evolution of pathogens by monitoring changes in shared sequences across outbreaks
  • Personalized Medicine: Understanding how individual genetic variations affect susceptibility to different pathogens

Conclusion: A New Era of Biological Exploration

SHARE represents more than just a technical solution to a data management problem—it embodies a fundamental shift in how we conduct biological research. By creating a "bioinformatics nation" where data flows freely and meaning is machine-interpretable, SHARE and similar Semantic Web technologies are breaking down the barriers between discrete biological datasets.

The framework demonstrates that the true power of bioinformatics lies not just in accumulating data, but in connecting disparate information to reveal biological truths that remain invisible when examining datasets in isolation [1].

References