GEOAgent Knowledge Base: An Integrated and Self-Contained Resource of SQLite Metadata, Vector Embeddings, and Local Re-ranking Models for Automated GEO Data Reuse

Overview

This repository contains the core knowledge base for GEOAgent, an intelligent assistant designed to automate the discovery and analysis of biomedical datasets. The database integrates high-fidelity metadata from 180,000+ GEO series, 5 million+ samples, and 84,000+ PubMed abstracts (updated as of April 2026) into a multi-modal retrieval system.

Core Components

SQLite DB (Structured Metadata): A relational database optimized for high-speed filtering. It features a multi-table schema (including gse_metadata, gsm_metadata, and pubmed_metadata) with B-tree indexing for range/equality queries and FTS5 indexing for full-text search of study titles and summaries.

Vector DB (Semantic Embeddings): An unstructured knowledge store generated via Hierarchical Semantic Chunking (HSC). Raw metadata is partitioned into five distinct semantic zones and embedded with the Nomic Embed v2 model (768 dimensions) to enable granular similarity matching:
gse_core: study design and research rationale.
gse_sample: sample-specific metadata and clinical/biological attributes.
gse_protocol: experimental protocols, including sample treatment, growth, and extraction.
gse_processing: downstream data processing and computational workflows.
pub_core: bibliographic context from publication titles and abstracts.

BGE-Reranker (Deep Ranking): A widely used cross-encoder model (BAAI/bge-reranker-v2-m3), co-packaged to enable fully offline execution. Deployed in the final stage of the hybrid retrieval pipeline, it performs deep semantic re-ranking of dataset candidates to improve precision for complex natural language queries.

Technical Workflow Integration

The database is specifically engineered to support the GEOAgent 5-stage pipeline:
Intent Parsing: LLM-based extraction of research goals.
Hybrid Retrieval: Concurrent SQL Hard Filtering (for structured attributes) and Semantic Matching (for unstructured context).
Logical Filtering & Reranking: Final validation and precision sorting of results.

Application & Compatibility

This database serves as the foundational data asset for automated technology modality identification and cross-platform sample pairing (e.g., ChIP-seq IP/Input control matching and single-cell multi-omics linkage). It offers native, turnkey integration with bioStream, an industrialized, containerized Nextflow workflow platform, enabling automated and highly reproducible standardized processing for 6 major omics types: RNA-seq, scRNA-seq, ATAC-seq, scATAC-seq, ChIP-seq, and scMultiome.

Links & Ecosystem

Web Application Portal: http://geoagent.ccla.ac.cn/ (try GEOAgent directly in your browser with the turnkey web interface).
GEOAgent Core & Deployment Source: GitHub - JiekaiLab/GEOAgent (repository for the desktop client and intelligent agent backend).
Multi-Omics Processing Pipeline: GitHub - JiekaiLab/bioStream (Nextflow workflow backend for automated primary analysis and sequencing data quantification).
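
To make the Hybrid Retrieval stage described under Technical Workflow Integration more concrete, here is a minimal sketch in Python. It is not the GEOAgent implementation. Only the table name gse_metadata and the zone name gse_core come from this description. The column names, the gse_fts table, the accessions, and the 3-dimensional vectors are hypothetical stand-ins (the real embeddings are 768-dimensional). Combining the two branches by intersection is also an assumption, and the cross-encoder re-ranking step is left out. The sketch requires a SQLite build with FTS5 enabled.

```python
import math
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy records: (accession, organism, year, title).
SERIES = [
    ("GSE_A", "Homo sapiens", 2021, "ChIP-seq of H3K27ac in liver tumors"),
    ("GSE_B", "Mus musculus", 2019, "RNA-seq of liver regeneration"),
    ("GSE_C", "Homo sapiens", 2023, "scRNA-seq atlas of liver cancer"),
    ("GSE_D", "Homo sapiens", 2015, "RNA-seq of cardiac fibroblasts"),
]
# Hypothetical stand-ins for gse_core embeddings (3-d instead of 768-d).
GSE_CORE_VECTORS = {
    "GSE_A": [0.9, 0.1, 0.0],
    "GSE_B": [0.2, 0.9, 0.1],
    "GSE_C": [0.8, 0.3, 0.1],
    "GSE_D": [0.0, 0.2, 0.9],
}


def build_toy_db():
    """Create an in-memory database with a B-tree index and an FTS5 table."""
    # check_same_thread=False lets a worker thread use this connection;
    # the sketch only ever touches it from one thread at a time.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute(
        "CREATE TABLE gse_metadata "
        "(gse_id TEXT PRIMARY KEY, organism TEXT, year INTEGER, title TEXT)"
    )
    # Ordinary SQLite indexes are B-trees: good for equality and range filters.
    conn.execute("CREATE INDEX idx_org_year ON gse_metadata (organism, year)")
    # FTS5 virtual table for full-text search over titles (assumed layout).
    conn.execute("CREATE VIRTUAL TABLE gse_fts USING fts5(gse_id UNINDEXED, title)")
    conn.executemany("INSERT INTO gse_metadata VALUES (?, ?, ?, ?)", SERIES)
    conn.executemany(
        "INSERT INTO gse_fts (gse_id, title) VALUES (?, ?)",
        [(s[0], s[3]) for s in SERIES],
    )
    return conn


def sql_hard_filter(conn, organism, min_year, keyword):
    """Structured branch: equality/range filter plus an FTS5 keyword match."""
    rows = conn.execute(
        "SELECT gse_id FROM gse_metadata "
        "WHERE organism = ? AND year >= ? "
        "AND gse_id IN (SELECT gse_id FROM gse_fts WHERE gse_fts MATCH ?)",
        (organism, min_year, keyword),
    ).fetchall()
    return {row[0] for row in rows}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_match(query_vec, top_k):
    """Semantic branch: rank series by cosine similarity to the query vector."""
    scored = [(gse_id, cosine(query_vec, vec)) for gse_id, vec in GSE_CORE_VECTORS.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def hybrid_retrieve(conn, organism, min_year, keyword, query_vec, top_k=3):
    """Run both branches concurrently, then keep semantic hits that pass the SQL filter."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        sql_future = pool.submit(sql_hard_filter, conn, organism, min_year, keyword)
        sem_future = pool.submit(semantic_match, query_vec, top_k)
        allowed, ranked = sql_future.result(), sem_future.result()
    # Assumption: the branches are combined by intersection, ordered by similarity.
    return [gse_id for gse_id, _ in ranked if gse_id in allowed]


conn = build_toy_db()
hits = hybrid_retrieve(conn, "Homo sapiens", 2018, "liver", [1.0, 0.2, 0.0])
print(hits)  # ['GSE_A', 'GSE_C']
```

In a fuller pipeline, the candidates returned here would then go to the re-ranking stage, where a cross-encoder such as BAAI/bge-reranker-v2-m3 scores each query/candidate pair. That step is omitted because it needs the model weights.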
