KGSum: Automatic Knowledge Graphs Profiling

KGSum: Automatic Classification and Profiling of Knowledge Graphs

KGSum is a Python application for extracting, preparing, and classifying Knowledge Graphs (KGs). It combines Large Language Models (such as Mistral 7B Instruct with QLoRA) with traditional machine learning to classify and profile graphs.

Getting Started

Follow these steps to set up KGSum locally.

Prerequisites

For the local machine learning backend:
- Miniconda (required)
- Python 3.12 (suggested)
- CUDA 12.8 (for transformer models like Mistral)
- NVIDIA GPU (recommended: RTX 3070 or higher)

For the frontend:
- Node.js
- npm

For Docker deployment:
- Docker
- Docker Compose

Installation

Local Setup (Machine Learning Backend)

1. Clone the repository:

       git clone https://github.com/mariocosenza/kgsum.git
       cd kgsum

2. Create and activate the conda environment:

       conda env create -f environment.yml
       conda activate kgsum

3. For GPU/Transformer models (Mistral):
   - Comment out CUDA libraries in environment.yml
   - Change the TensorFlow version to a GPU-compatible version, as suggested in the comments

Frontend Setup

1. Install dependencies:

       npm install

2. Run the frontend:

       npm run dev

3. For GraphDB embedding visualization, replace GraphDB's security-config.xml with the one in /docker/graphdb.

Configuration

Environment Variables

Set the following environment variables in your shell:

    export GEMINI_API_KEY=your_gemini_api_key_here
    export LOCAL_ENDPOINT_LOV=http://your-local-endpoint
    export LOCAL_ENDPOINT=http://your-local-endpoint
    export SECRET_KEY=your_secret_key_here
    export UPLOAD_FOLDER=/path/to/uploads
    export UPLOAD=true
    export NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=your_clerk_publishable_key
    export CLASSIFICATION_API_URL=http://localhost:5000
    export GITHUB_TOKEN=your_github_token_here

Backend Configuration

Configure the backend by editing config.json:

    {
      "labeling": {
        "use_gemini": false,
        "search_zenodo": true,
        "search_github": true,
        "search_lod_cloud": true,
        "stop_before_merging": false
      },
      "extraction": {
        "start_offset": 0,
        "step_numbers": 10,
        "step_range": 16,
        "extract_sparql": true,
        "query_lov": false
      },
      "processing": {
        "use_ner": false,
        "use_filter": true
      },
      "training": {
        "classifier": "NAIVE_BAYES",
        "feature": ["CURI", "PURI", "LAB", "CON", "TLDS", "VOC", "LCN", "LPN", "DSC", "SBJ"],
        "oversample": true,
        "max_token": 36000,
        "use_tfidf_autoencoder": true
      },
      "profile": {
        "store_profile_after_training": false,
        "base_domain": "http://www.isislab.it"
      },
      "general_settings": {
        "info": "Possible classifiers: SVM, NAIVE_BAYES, KNN, J48, MISTRAL, MLP, DEEP, BATCHNORM, Phase: LABELING, EXTRACTION, PROCESSING, TRAINING, STORE",
        "start_phase": "labeling",
        "stop_phase": "training",
        "allow_upload": true
      }
    }

- Available Classifiers: SVM, NAIVE_BAYES, KNN, J48, MISTRAL, MLP, DEEP, BATCHNORM
- Available Features: CURI, PURI
- Processing Phases: LABELING, EXTRACTION, PROCESSING, TRAINING, STORE

Usage

Training Process

Full Training Pipeline

Run the complete training process, from extraction to model training:

    python train.py

Individual Script Training

For finer control, run the individual scripts in /src:

    # Run scripts in /src directory for specific phases

Running the Application

Local Flask Server

After training completes, start the WSGI Flask server on port 5000:

    python app.py

Prerequisites for Complete Profiling

A Linked Open Vocabularies (LOV) instance is required for complete profiling and for the initial data extraction.

API Usage

Send POST requests to:
- /api/v1/profile/sparql
- /api/v1/profile/file

Refer to the Swagger documentation for detailed request and response formats.

Profile Evaluation

The src/profile_evaluation folder contains the material used to compare KGSum-generated VoID profiles with reference profiles from LOD Cloud.

Folder contents

- lodcloud_profiles/: reference Turtle profiles used as the evaluation baseline.
- kgsum_profiles/: Turtle profiles generated by KGSum. Files are matched with `lodcloud_profiles/` by filename.
- lodcloud_sources.json: source list used by the KGSum profile generation script.
- kgsum_profile_timings.csv: timing log written while generating profiles.
- profile_evaluation_results.csv: optional per-profile evaluation output.
- missing_fields.svg: optional chart showing the most common missing fields.
- create_kgsum_profiles.py: calls the local KGSum API to generate Turtle profiles from SPARQL endpoints or RDF dumps.
- evaluate.py: compares KGSum profiles against the LOD Cloud reference profiles.

Typical workflow

1. Start the backend API from the project root:

       python app.py

2. Generate KGSum profiles from the configured sources:

       python src/profile_evaluation/create_kgsum_profiles.py

   This writes generated profiles to src/profile_evaluation/kgsum_profiles/ and appends timing data to src/profile_evaluation/kgsum_profile_timings.csv.

3. Run the evaluation:

       python src/profile_evaluation/evaluate.py \
         --details-csv src/profile_evaluation/profile_evaluation_results.csv \
         --missing-fields-chart src/profile_evaluation/missing_fields.svg

   The command prints the coverage summary in the terminal, writes per-profile details to CSV, and creates an SVG chart for missing fields.

4. To evaluate custom folders, pass explicit paths:

       python src/profile_evaluation/evaluate.py \
         --kgsum-dir path/to/generated_profiles \
         --lodcloud-dir path/to/reference_profiles \
         --timings path/to/kgsum_profile_timings.csv

Docker Deployment

Quick Setup with Pre-trained Model

For a simpler deployment using the pre-trained Naive Bayes model:

1. From the repository root, navigate to the docker directory:

       cd docker

2. Fill in the .env file with your configuration.

3. Run with Docker Compose:

       docker-compose up

Individual Docker Services

Three individual Dockerfiles are provided for custom deployments:
- Backend service
- Frontend service
- GraphDB configuration

Hardware Requirements

Tested Configuration

| Component | Specification       |
|-----------|---------------------|
| CPU       | AMD Ryzen 7 5800X   |
| RAM       | 32 GB DDR4 3600 MHz |
| GPU       | NVIDIA RTX 3070     |

Recommended Configuration

| Component | Specification                                   |
|-----------|-------------------------------------------------|
| RAM       | 64+ GB (larger size suggested)                  |
| GPU       | High-performance GPU for better LLM performance |

Roadmap

- Add Swagger API documentation
- Expand coverage for more LLMs
- Improve Docker deployment documentation
- Add more dataset preparation examples
- Add performance optimization guides
- Enhance frontend visualization features

See the open issues for a full list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

1. Fork the Project
2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
3. Commit your Changes (git commit -m 'Add some AmazingFeature')
4. Push to the Branch (git push origin feature/AmazingFeature)
5. Open a Pull Request
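
As a companion to the Backend Configuration section, here is a minimal sketch of a pre-flight check for config.json. It is not part of KGSum: the function names `check_config` and `check_config_file` are hypothetical, and the rule that phases run in the order they are listed is an assumption. The classifier and phase names are the ones documented in this README.

```python
import json

# Values copied from the classifier and phase lists documented in this README.
CLASSIFIERS = {"SVM", "NAIVE_BAYES", "KNN", "J48", "MISTRAL", "MLP", "DEEP", "BATCHNORM"}
PHASES = ["LABELING", "EXTRACTION", "PROCESSING", "TRAINING", "STORE"]


def check_config(config: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    classifier = config.get("training", {}).get("classifier")
    if classifier not in CLASSIFIERS:
        problems.append(f"unknown classifier: {classifier!r}")

    general = config.get("general_settings", {})
    # The sample config writes phases in lowercase, so compare case-insensitively.
    start = str(general.get("start_phase", "")).upper()
    stop = str(general.get("stop_phase", "")).upper()
    for key, phase in (("start_phase", start), ("stop_phase", stop)):
        if phase not in PHASES:
            problems.append(f"unknown {key}: {phase!r}")

    # Assumption: phases run in the listed order, so start must not follow stop.
    if start in PHASES and stop in PHASES and PHASES.index(start) > PHASES.index(stop):
        problems.append(f"start_phase {start} comes after stop_phase {stop}")

    return problems


def check_config_file(path: str = "config.json") -> list:
    """Load a JSON config file (hypothetical helper) and run check_config on it."""
    with open(path, encoding="utf-8") as handle:
        return check_config(json.load(handle))
```

Calling `check_config_file()` from the project root before `python train.py` would surface a misspelled classifier or phase name early, rather than partway through the pipeline.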
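
The Profile Evaluation section states that generated profiles are matched with reference profiles by filename. The sketch below illustrates that pairing step only. It assumes both folders hold Turtle files with a `.ttl` extension, and the helper name `pair_profiles` is hypothetical; the actual logic in evaluate.py may differ.

```python
from pathlib import Path


def pair_profiles(kgsum_dir, lodcloud_dir, pattern="*.ttl"):
    """Pair generated and reference profiles that share a filename.

    Returns (pairs, unmatched): pairs is a list of (generated, reference)
    paths, unmatched is a sorted list of filenames present in only one folder.
    """
    generated = {p.name: p for p in Path(kgsum_dir).glob(pattern)}
    reference = {p.name: p for p in Path(lodcloud_dir).glob(pattern)}

    shared = sorted(generated.keys() & reference.keys())
    pairs = [(generated[name], reference[name]) for name in shared]

    # Filenames that appear in exactly one of the two folders.
    unmatched = sorted(generated.keys() ^ reference.keys())
    return pairs, unmatched
```

For example, `pair_profiles("src/profile_evaluation/kgsum_profiles", "src/profile_evaluation/lodcloud_profiles")` would list which generated profiles have a reference counterpart before an evaluation run.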
