AutoSchemaKG Paper and Code Release

AutoSchemaKG is a system that automatically constructs large-scale knowledge graphs from text without requiring predefined schemas.

Introducing AutoSchemaKG: Autonomous Knowledge Graph Construction with Code Release

I’m excited to share our latest research paper and code release for AutoSchemaKG, a significant advancement in knowledge graph construction that eliminates the need for predefined schemas.

What is AutoSchemaKG?

AutoSchemaKG leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text. Unlike traditional approaches that require domain experts to create predefined schemas, our system:

  • Models both entities and events as first-class citizens
  • Employs conceptualization to organize instances into semantic categories
  • Scales to web-scale corpora without manual intervention
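To make the points above more concrete, here is a toy Python sketch of the data model. It is not the atlas-rag API. It assumes hand-written triples and a hard-coded concept dictionary standing in for what a language model would produce, and the names `Node`, `Triple`, and `induce_schema` are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A graph node; entities and events are both allowed (hypothetical type)."""
    name: str
    kind: str  # "entity" or "event"


@dataclass(frozen=True)
class Triple:
    """A (head, relation, tail) fact extracted from text (hypothetical type)."""
    head: Node
    relation: str
    tail: Node


def induce_schema(triples, concepts):
    """Group every node that appears in a triple under its concept labels.

    `concepts` maps a node name to a list of category labels. In this sketch
    it is hand-written; nodes without an entry fall under "<unlabeled>".
    """
    schema = defaultdict(set)
    for triple in triples:
        for node in (triple.head, triple.tail):
            for concept in concepts.get(node.name, ["<unlabeled>"]):
                schema[concept].add(node.name)
    return {concept: sorted(names) for concept, names in schema.items()}


# Toy data: one entity-entity triple and one event-entity triple.
curie = Node("Marie Curie", "entity")
prize = Node("Nobel Prize in Physics", "entity")
award_event = Node("Marie Curie won the Nobel Prize in Physics", "event")
year = Node("1903", "entity")

triples = [
    Triple(curie, "won", prize),
    Triple(award_event, "happened in", year),
]

# Assumed concept labels; a real pipeline would generate these automatically.
concepts = {
    "Marie Curie": ["person", "scientist"],
    "Nobel Prize in Physics": ["award"],
    "Marie Curie won the Nobel Prize in Physics": ["achievement"],
    "1903": ["year"],
}

schema = induce_schema(triples, concepts)
for concept, members in sorted(schema.items()):
    print(concept, "->", members)
```

The event node is stored exactly like an entity node, which is what treating both as first-class citizens means in this sketch. The concept labels form the induced schema: no category list was fixed in advance.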

The ATLAS Knowledge Graphs

By applying our framework to the Dolma 1.7 corpus across three diverse subsets (Wikipedia, Semantic Scholar, and Common Crawl), we constructed the ATLAS family of knowledge graphs containing:

  • 900+ million nodes
  • 5.9 billion edges
  • Billions of facts comparable in scale to parametric knowledge in LLMs

Key Results

Our experiments demonstrate that AutoSchemaKG:

  • Achieves 95% semantic alignment with human-crafted schemas with zero manual intervention
  • Outperforms state-of-the-art baselines by 12-18% on multi-hop question answering tasks
  • Enhances large language model factuality by up to 9%
  • Improves LLM performance on reasoning tasks across domains including Global Facts, History, Law, Religion, Philosophy, Medicine, and Social Sciences

Code and Resources Now Available!

We’ve released the complete implementation on GitHub at HKUST-KnowComp/AutoSchemaKG, along with a Python package to make it easier to use our technology in your own projects.

Getting Started

Install our package using pip after setting up the required dependencies:

# First install PyTorch with CUDA support
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.1 -c pytorch -c nvidia

# Then install FAISS-GPU
conda install -c pytorch -c nvidia faiss-gpu

# Finally install atlas-rag
pip install atlas-rag
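
After installing, a quick sanity check can confirm that the pieces are visible to your Python environment. The sketch below uses only the standard library. The helper `check_environment` is a hypothetical name, and it assumes the import names `torch` and `faiss` and the pip distribution name `atlas-rag` from the commands above.

```python
import importlib.util
from importlib import metadata


def check_environment(modules, distributions):
    """Report which modules are importable and which distributions are installed.

    Modules are looked up without importing them; distributions are looked up
    by their pip name and reported with their version, or None if missing.
    """
    report = {}
    for name in modules:
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for dist in distributions:
        try:
            report[f"dist:{dist}"] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            report[f"dist:{dist}"] = None
    return report


# Assumed names: `torch` and `faiss` as import names, `atlas-rag` as pip name.
report = check_environment(["torch", "faiss"], ["atlas-rag"])
for key, value in report.items():
    print(key, "->", value)
```

A `False` or `None` entry points at the install step to revisit. If `torch` is present, `torch.cuda.is_available()` additionally reports whether the CUDA build can see a GPU.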

Example Usage

The repository includes several example notebooks:

  • atlas_full_pipeline.ipynb: Build new knowledge graphs and implement RAG
  • atlas_billion_kg_usage.ipynb: Host and use our billion-scale ATLAS knowledge graphs
  • atlas_multihopqa.ipynb: Replicate our multi-hop QA evaluation results

Available Resources

  • Paper: Read our research paper for technical details
  • Code: Use our code on GitHub
  • Full Dataset: Download our complete dataset
  • Neo4j CSV Dumps: Access our Neo4j database dumps as a Hugging Face dataset

Why This Matters

This research represents a fundamental rethinking of knowledge graph construction, transforming what was once a heavily supervised process requiring significant domain expertise into a fully automated pipeline. This advancement not only accelerates KG development but also dramatically expands the potential application domains for knowledge-intensive AI systems.

We’re excited to see how the community will use AutoSchemaKG to build and leverage knowledge graphs without the traditional constraints of manual schema design!

Licensed under CC BY-NC-SA 4.0