# Automated Quantum Issue Labeling in Qiskit: Large Language Models and Fine-Tuned Transformers

This repository contains the implementation code for the paper **"Automated Quantum Issue Labeling in Qiskit: Large Language Models and Fine-Tuned Transformers"** submitted to EMSE (Empirical Software Engineering, Springer).

The code includes experiments comparing fine-tuned transformer models (DistilBERT, RoBERTa), Large Language Models (GPT-4o Mini, GPT-5 Mini, GPT-5 Nano), and RAG-enhanced LLM pipelines on quantum computing software issue classification.

## Table of Contents

- [Study Overview](#study-overview)
- [Repository Structure](#repository-structure)
- [Installation](#installation)
- [Configuration](#configuration)
- [Usage](#usage)
- [Input Data Format](#input-data-format)
- [Output and Results](#output-and-results)
- [Analysis Tools](#analysis-tools)
- [GPU Acceleration](#gpu-acceleration-notes)
- [Computational Requirements](#computational-requirements)
- [Hyperparameters](#hyperparameters)
- [Troubleshooting](#troubleshooting)
- [Citation](#citation)
- [License](#license)

## Study Overview

This study compares the performance of multiple approaches for classifying GitHub issues in quantum computing repositories:

**Fine-tuned Transformer Models:**

- DistilBERT (F1=0.95)
- RoBERTa (F1=0.94)

**Large Language Models (direct prompting, 242-issue test set):**

- GPT-5 Mini: zero-shot (F1=0.77) and few-shot (F1=0.82)
- GPT-5 Nano: zero-shot (F1=0.59) and few-shot (F1=0.65)
- GPT-4o Mini: zero-shot (F1=0.62) and few-shot (F1=0.64)

**RAG-Enhanced LLM Pipelines (with threshold tuning, 721-issue test set):**

- Agentic RAG + GPT-4o Mini: zero-shot F1=0.606, few-shot F1=0.682
- Adaptive RAG + GPT-4o Mini: zero-shot F1=0.613, few-shot F1=0.744
- Adaptive RAG + GPT-5 Mini: F1=0.836
- Direct GPT-5 Mini few-shot (no RAG, same 721-issue test set): F1=0.843

We evaluate models on their ability to automatically classify Qiskit GitHub repository issues across 12 quantum module labels (e.g., `mod: circuit`, `mod: transpiler`).

## Repository Structure

```
.
├── Data/
│   └── qiskit_repo_quantum_issues.json      # 2,415 labeled Qiskit issues
│
├── Analysis/
│   ├── predictions/
│   ├── results/
│   ├── bot_labeling_analysis.py
│   ├── config.py
│   ├── ml_baselines.py
│   ├── quantum_term_analysis.py
│   ├── statisticalanalysis.py
│   └── analysis_requirements.txt
│
├── Fine_tuned_Experiments/
│   ├── output/
│   ├── distilbert_final.py
│   ├── finetunedconfig.py
│   ├── roberta_distilbert_requirements.txt
│   └── roberta_final.py
│
├── Gpt_Experiments/
│   ├── scripts/
│   │   ├── gpt4o_mini_fewshot_gridsearch.py
│   │   ├── gpt4o_mini_zeroshot_gridsearch.py
│   │   ├── gpt_5_mini_fewshot_gridsearch.py
│   │   ├── gpt_5_mini_zeroshot_gridsearch.py
│   │   ├── gpt_5_nano_fewshot.py
│   │   └── gpt_5_nano_zeroshot.py
│   ├── config.py
│   └── gpt_requirements.txt
│
├── RAG_Experiments/
│   ├── code/
│   │   ├── 01_agentic_rag_zeroshot.py       # Agentic RAG, GPT-4o Mini, zero-shot
│   │   ├── 02_agentic_rag_fewshot.py        # Agentic RAG, GPT-4o Mini, few-shot
│   │   ├── 03_adaptive_rag_zeroshot.py      # Adaptive RAG, GPT-4o Mini, zero-shot
│   │   ├── 04_adaptive_rag_fewshot.py       # Adaptive RAG, GPT-4o Mini, few-shot
│   │   ├── 05_adaptive_rag_gpt5mini.py      # Adaptive RAG, GPT-5 Mini, few-shot
│   │   ├── 06_threshold_tuning.py           # Per-label threshold tuning utility
│   │   ├── 07_gpt5mini_fewshot_direct.py    # Direct GPT-5 Mini baseline (no RAG)
│   │   └── config.py                        # RAG-specific configuration
│   ├── predictions/                         # Saved prediction JSON files
│   ├── results/                             # Threshold tuning result JSONs
│   ├── rag_requirements.txt
│   └── README.md
│
├── finetuned_distilbert_results/
│   └── DistilBERT_Results/
│
├── finetuned_roberta_results/
│   └── RoBERTa_Results/
│
├── gpt_4o_mini_results/
│   ├── few_shot_results/
│   ├── visuals_insights/
│   └── zero_shot_results/
│
├── gpt_5_mini_results/
│   ├── few_shot_results/
│   └── zero_shot_results/
│
├── gpt_5_nano_results/
│   ├── few_shot_results/
│   └── zero_shot_results/
│
├── .gitignore
├── LICENSE
└── README.md
```

## Installation

### Requirements

- Python 3.8+
- CUDA-compatible GPU (recommended for fine-tuning experiments)
- PyTorch
- OpenAI API key (for GPT and RAG experiments)

### Setup

1. **Clone the repository:**

```bash
git clone [repository URL]
cd quantum-bug-labeling-main
```

2. **Create and activate a virtual environment:**

```bash
# For Windows
python -m venv venv
venv\Scripts\activate

# For macOS and Linux
python3 -m venv venv
source venv/bin/activate
```

3. **Install Fine-tuned experiments requirements:**

```bash
cd Fine_tuned_Experiments
pip install -r roberta_distilbert_requirements.txt
```

**Note**: The `roberta_distilbert_requirements.txt` includes:

- `torch>=1.12.0`: PyTorch deep learning framework
- `transformers>=4.30.0`: Hugging Face transformers
- `numpy>=1.24.0`, `pandas>=2.0.0`: Data handling
- `scikit-learn>=1.2.0`: Machine learning metrics
- `matplotlib>=3.7.0`, `seaborn>=0.12.0`: Visualizations
- `tqdm>=4.65.0`: Progress bars
- `datasets>=2.12.0`: Dataset utilities

4. **Install PyTorch (for Fine-tuned experiments):**

Option 1: CPU-only (simpler but slower for training)

```bash
pip install torch torchvision torchaudio
cd ..
```

Option 2: GPU with CUDA 11.8 (recommended)

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
cd ..
```

5. **Install GPT experiments requirements:**

```bash
cd Gpt_Experiments
pip install -r gpt_requirements.txt
cd ..
```

6. **Install RAG experiments requirements:**

```bash
cd RAG_Experiments
pip install -r rag_requirements.txt
cd ..
```

**Note**: The `rag_requirements.txt` includes:

- `openai>=1.13.3`: OpenAI API client (GPT-4o Mini + GPT-5 Mini + embeddings)
- `numpy>=1.24.0`, `pandas>=2.0.0`: Data handling
- `scikit-learn>=1.2.0`: NearestNeighbors retrieval index and metrics

7. **Install Analysis requirements:**

```bash
cd Analysis
pip install -r analysis_requirements.txt
cd ..
```

## Configuration

### Fine-tuned Models Configuration

Edit `Fine_tuned_Experiments/finetunedconfig.py`:

```python
INPUT_FILE = r"../data/qiskit_repo_quantum_issues.json"  # Update path
OUTPUT_BASE_DIR = "output"
```

### GPT Models Configuration

Edit `Gpt_Experiments/config.py`:

```python
OPENAI_API_KEY = "your-api-key-here"  # REQUIRED
INPUT_FILE = r"../data/qiskit_repo_quantum_issues.json"
```

### RAG Experiments Configuration

Edit `RAG_Experiments/code/config.py`:

```python
OPENAI_API_KEY = "your-api-key-here"  # REQUIRED
DATA_PATH = r"C:\path\to\Data\qiskit_repo_quantum_issues.json"  # Update path
GPT4O_MINI_MODEL = "gpt-4o-mini"
GPT5_MINI_MODEL = "gpt-5-mini"
EMBED_MODEL = "text-embedding-3-small"
TOP_K = 5
MAX_CHARS = 6000
```

### Analysis Configuration

Edit `Analysis/config.py`:

```python
INPUT_FILE = r"../data/qiskit_repo_quantum_issues.json"
GITHUB_TOKEN = ""  # Only for bot_labeling_analysis.py
```

## Usage

### Fine-tuned Experiments

```bash
cd Fine_tuned_Experiments
python distilbert_final.py
python roberta_final.py
```

### GPT Experiments

```bash
cd Gpt_Experiments/scripts
python gpt4o_mini_zeroshot_gridsearch.py   # GPT-4o Mini zero-shot
python gpt4o_mini_fewshot_gridsearch.py    # GPT-4o Mini few-shot
python gpt_5_mini_zeroshot_gridsearch.py   # GPT-5 Mini zero-shot
python gpt_5_mini_fewshot_gridsearch.py    # GPT-5 Mini few-shot
python gpt_5_nano_zeroshot.py              # GPT-5 Nano zero-shot
python gpt_5_nano_fewshot.py               # GPT-5 Nano few-shot
```

### RAG Experiments

Run from the project root directory. Scripts are self-contained and use `RAG_Experiments/code/config.py`.
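For orientation, the retrieval step that a RAG pipeline of this kind depends on can be sketched as a top-k cosine similarity search over issue embeddings. This is an illustration, not code from the repository: the requirements file indicates that scikit-learn's `NearestNeighbors` serves as the retrieval index, whereas the sketch below uses a dependency-free brute-force search, hand-written toy vectors in place of `text-embedding-3-small` embeddings, and hypothetical function names (`cosine`, `retrieve`).

```python
import math
from typing import List, Sequence, Tuple

TOP_K = 5  # mirrors TOP_K in RAG_Experiments/code/config.py


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: Sequence[float],
             corpus: List[Sequence[float]],
             k: int = TOP_K) -> List[Tuple[int, float]]:
    """Return (corpus_index, similarity) for the k vectors most similar to the query."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for embedded training issues (assumption).
corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
    [0.1, 0.0, 0.9],
    [0.5, 0.5, 0.5],
]
query = [1.0, 0.05, 0.0]

neighbors = retrieve(query, corpus)
for idx, sim in neighbors:
    print(f"issue {idx}: similarity {sim:.3f}")
```

In a full pipeline the retrieved issues and their labels would typically be placed into the prompt as context; how the repository scripts do this is not described here.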
**Step 1: Run any RAG pipeline:**

```bash
python RAG_Experiments/code/01_agentic_rag_zeroshot.py
python RAG_Experiments/code/02_agentic_rag_fewshot.py
python RAG_Experiments/code/03_adaptive_rag_zeroshot.py
python RAG_Experiments/code/04_adaptive_rag_fewshot.py
python RAG_Experiments/code/05_adaptive_rag_gpt5mini.py
```

**Step 2: Run per-label threshold tuning on saved predictions:**

```bash
python RAG_Experiments/code/06_threshold_tuning.py agentic_rag_zeroshot_predictions.json
python RAG_Experiments/code/06_threshold_tuning.py agentic_rag_fewshot_predictions.json
python RAG_Experiments/code/06_threshold_tuning.py adaptive_rag_zeroshot_predictions.json
python RAG_Experiments/code/06_threshold_tuning.py adaptive_rag_fewshot_predictions.json
python RAG_Experiments/code/06_threshold_tuning.py adaptive_rag_gpt5mini_predictions.json
```

**Step 3: Run direct GPT-5 Mini baseline (no RAG, same 721-issue test set):**

```bash
python RAG_Experiments/code/07_gpt5mini_fewshot_direct.py
python RAG_Experiments/code/06_threshold_tuning.py gpt5mini_fewshot_direct_predictions.json
```

> **Note on runtime**: Each RAG script makes ~1–3 API calls per test issue (721 issues total). Expect 30–120 minutes per script depending on model and network latency.

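To make the threshold tuning step more concrete, the sketch below shows one way a global sweep over τ from 0.05 to 0.95 in steps of 0.05 could be run against per-label confidence scores, keeping the τ that gives the best F1. It is a hedged illustration only: the layout of the saved prediction JSON files and the internals of `06_threshold_tuning.py` are not documented in this README, so the score format, the use of micro-averaged F1, and the function names `micro_f1` and `sweep_global_threshold` are all assumptions.

```python
from typing import Dict, List, Set, Tuple


def micro_f1(y_true: List[Set[str]],
             scores: List[Dict[str, float]],
             tau: float) -> float:
    """Micro-averaged F1 when every label with score >= tau is predicted."""
    tp = fp = fn = 0
    for truth, score_map in zip(y_true, scores):
        predicted = {label for label, s in score_map.items() if s >= tau}
        tp += len(predicted & truth)
        fp += len(predicted - truth)
        fn += len(truth - predicted)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_global_threshold(y_true: List[Set[str]],
                           scores: List[Dict[str, float]]) -> Tuple[float, float]:
    """Try tau = 0.05, 0.10, ..., 0.95 and return (best_tau, best_f1).

    Ties are resolved in favour of the smallest tau.
    """
    best_tau, best_f1 = 0.05, -1.0
    for step in range(1, 20):
        tau = round(step * 0.05, 2)
        f1 = micro_f1(y_true, scores, tau)
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1


# Toy data (assumption): gold label sets and per-label scores for three issues.
y_true = [{"mod: circuit"}, {"mod: transpiler"}, {"mod: circuit", "mod: qpy"}]
scores = [
    {"mod: circuit": 0.90, "mod: transpiler": 0.40, "mod: qpy": 0.10},
    {"mod: circuit": 0.30, "mod: transpiler": 0.70, "mod: qpy": 0.20},
    {"mod: circuit": 0.80, "mod: transpiler": 0.35, "mod: qpy": 0.60},
]

best_tau, best_f1 = sweep_global_threshold(y_true, scores)
print(f"best tau = {best_tau:.2f}, micro-F1 = {best_f1:.3f}")
```

Per-label tuning could reuse the same loop by restricting `y_true` and `scores` to a single label at a time and storing one τ per label.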
### Analysis

```bash
cd Analysis
python ml_baselines.py            # Classical ML baselines (LR, SVM)
python statisticalanalysis.py     # McNemar's significance tests
python quantum_term_analysis.py   # Quantum terminology analysis
python bot_labeling_analysis.py   # Bot labeling analysis (requires GitHub token)
```

## Input Data Format

The experiments expect a JSON file with GitHub issues in the following format:

```json
[
  {
    "ID": "issue-123",
    "Title": "Fix barrier label position when bits are reversed",
    "Body": "Issue description text here...",
    "Labels": ["mod: visualization", "bug"]
  }
]
```

### Dataset Details

- **Source**: Qiskit GitHub repository issues
- **Total**: 2,415 issues across 12 quantum-specific categories
- **Labels**: `mod: algorithms`, `mod: circuit`, `mod: opflow`, `mod: primitives`, `mod: pulse`, `mod: qasm2`, `mod: qasm3`, `mod: qpy`, `mod: quantum info`, `mod: transpiler`, `mod: visualization`, `qamp`
- **Label type**: Multi-label (issues can have multiple labels)
- **Few-shot examples**: 13 curated examples excluded from all evaluation sets

## Output and Results

### Fine-tuned Models

- `finetuned_distilbert_results/DistilBERT_Results/`
- `finetuned_roberta_results/RoBERTa_Results/`

### GPT Models

- `gpt_4o_mini_results/`: GPT-4o Mini results (grid search)
- `gpt_5_mini_results/`: GPT-5 Mini results
- `gpt_5_nano_results/`: GPT-5 Nano results

### RAG Experiments

- `RAG_Experiments/predictions/`: JSON predictions for all 5 RAG variants + direct baseline
- `RAG_Experiments/results/`: Threshold tuning results (global sweep + per-label)

### Analysis Results

- `Analysis/predictions/`: Model predictions (`.npy` files for McNemar's test)
- `Analysis/results/`: Baseline metrics, quantum term analysis, bot labeling analysis

## Analysis Tools

### ML Baselines (`ml_baselines.py`)

- Trains Logistic Regression and Linear SVM with TF-IDF features
- **Outputs**: `baseline_results.csv`, `per_category_baseline_results.csv`, prediction `.npy` files

### Statistical Analysis (`statisticalanalysis.py`)

- McNemar's tests on 242 held-out test issues
- **Outputs**: `predictions/mcnemar_results.json`

### Quantum Terminology Analysis (`quantum_term_analysis.py`)

- Validates quantum-specific nature of dataset (hybrid TF-IDF + documentation approach)
- **Outputs**: Console statistics, domain specificity ratios

### Bot Labeling Analysis (`bot_labeling_analysis.py`)

- Compares bot labeling patterns across 10 classical + 10 quantum repositories
- Requires GitHub API token in `config.py`
- **Outputs**: Excel report, visualization PNG

## GPU Acceleration Notes

Fine-tuned models benefit from GPU acceleration:

- **Memory**: At least 8 GB GPU memory recommended
- **Training time**: 60–90 min with GPU (vs. days on CPU)
- Install PyTorch with CUDA as shown in the Installation section

## Computational Requirements

### Fine-tuned Models

- GPU: 8 GB+ VRAM
- Training: ~60 min (DistilBERT), ~90 min (RoBERTa) on RTX 3080
- Disk: ~300 MB (DistilBERT), ~500 MB (RoBERTa)

### GPT Experiments

- API costs: see `Gpt_Experiments/` for per-configuration cost breakdown
- Runtime: 30–60 min per configuration; several hours for full grid search

### RAG Experiments

- API costs: ~$0.50–2.00 per full run (embeddings + GPT calls for 721 issues)
- Runtime: 30–120 min per script

### Analysis Scripts

- RAM: 4 GB+
- Runtime: minutes

## Hyperparameters

### Fine-tuned Models (best configurations)

- **DistilBERT**: lr=8e-5, epochs=10, batch=12, weight_decay=0.005, cosine schedule
- **RoBERTa**: lr=3e-5, epochs=18, batch=32, weight_decay=0.15, cosine schedule

### GPT Models (Grid Search)

- Temperature: 0.0–1.0 (0.1 increments), Top-p: [0.8, 0.9, 1.0], Seed: 42

### GPT-5 Models (Responses API)

- Reasoning effort: [minimal, medium, high], Verbosity: low, Max output tokens: 300

### RAG: Threshold Tuning

- Global sweep: τ ∈ [0.05, 0.95] step 0.05
- Per-label tuning: independently optimized for labels with support < 50

## Troubleshooting

**OpenAI API Errors**

- `No API key found` → Set `OPENAI_API_KEY` in the relevant `config.py`
- Rate limit exceeded → Reduce concurrency or upgrade API tier
- `unsupported_parameter` with GPT-5 models → Scripts automatically retry without unsupported params

**GPU/CUDA Issues**

- `CUDA out of memory` → Reduce batch size in config or switch to CPU
- `CUDA not available` → Verify: `python -c "import torch; print(torch.cuda.is_available())"`

**Data Issues**

- JSON parsing errors → Ensure input file is valid UTF-8 JSON
- Missing labels in results → Confirm labels start with `mod:` or equal `qamp`

## Citation

If you use this work in your research, please cite:

```bibtex
@article{thatamsetty2026quantum,
  title={Automated Quantum Issue Labeling in Qiskit: Large Language Models and Fine-Tuned Transformers},
  author={Thatamsetty, Poojitha and Zhang, Lei},
  journal={Empirical Software Engineering},
  publisher={Springer},
  year={2026},
  note={Under review}
}
```

## Paper Status

This work has been submitted to **EMSE (Empirical Software Engineering, Springer)**. The replication package is publicly available on Zenodo (DOI: https://zenodo.org/records/18775645).

## Acknowledgments

This research is funded by the Strategic Awards for Research Transitions (START) at the University of Maryland, Baltimore County.

## Contact

For questions or issues, please open an issue in this repository or contact pthatam1@umbc.edu.

## License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.

---

**Note**: Ensure your OpenAI API key is configured in the appropriate `config.py` before running GPT or RAG experiments. API keys and large model files are excluded from version control via `.gitignore`.

Keywords

quantum software engineering, issue labeling, multi-label classification, large language models, fine-tuned transformers, retrieval-augmented generation, Qiskit
