
# Automated Quantum Issue Labeling in Qiskit: Large Language Models and Fine-Tuned Transformers

This repository contains the implementation code for the paper **"Automated Quantum Issue Labeling in Qiskit: Large Language Models and Fine-Tuned Transformers"** submitted to EMSE (Empirical Software Engineering, Springer). The code includes experiments comparing fine-tuned transformer models (DistilBERT, RoBERTa), Large Language Models (GPT-4o Mini, GPT-5 Mini, GPT-5 Nano), and RAG-enhanced LLM pipelines on quantum computing software issue classification.

## Table of Contents

- [Study Overview](#study-overview)
- [Repository Structure](#repository-structure)
- [Installation](#installation)
- [Configuration](#configuration)
- [Usage](#usage)
- [Input Data Format](#input-data-format)
- [Output and Results](#output-and-results)
- [Analysis Tools](#analysis-tools)
- [GPU Acceleration](#gpu-acceleration-notes)
- [Computational Requirements](#computational-requirements)
- [Hyperparameters](#hyperparameters)
- [Troubleshooting](#troubleshooting)
- [Citation](#citation)
- [License](#license)

## Study Overview

This study compares the performance of multiple approaches for classifying GitHub issues in quantum computing repositories:

**Fine-tuned Transformer Models:**

- DistilBERT (F1=0.95)
- RoBERTa (F1=0.94)

**Large Language Models (direct prompting, 242-issue test set):**

- GPT-5 Mini: zero-shot (F1=0.77) and few-shot (F1=0.82)
- GPT-5 Nano: zero-shot (F1=0.59) and few-shot (F1=0.65)
- GPT-4o Mini: zero-shot (F1=0.62) and few-shot (F1=0.64)

**RAG-Enhanced LLM Pipelines (with threshold tuning, 721-issue test set):**

- Agentic RAG + GPT-4o Mini: zero-shot F1=0.606, few-shot F1=0.682
- Adaptive RAG + GPT-4o Mini: zero-shot F1=0.613, few-shot F1=0.744
- Adaptive RAG + GPT-5 Mini: F1=0.836
- Direct GPT-5 Mini few-shot (no RAG, same 721-issue test set): F1=0.843

We evaluate models on their ability to automatically classify Qiskit GitHub repository issues across 12 quantum module labels (e.g., `mod: circuit`, `mod: transpiler`).

## Repository Structure

```
.
├── Data/
│   └── qiskit_repo_quantum_issues.json      # 2,415 labeled Qiskit issues
│
├── Analysis/
│   ├── predictions/
│   ├── results/
│   ├── bot_labeling_analysis.py
│   ├── config.py
│   ├── ml_baselines.py
│   ├── quantum_term_analysis.py
│   ├── statisticalanalysis.py
│   └── analysis_requirements.txt
│
├── Fine_tuned_Experiments/
│   ├── output/
│   ├── distilbert_final.py
│   ├── finetunedconfig.py
│   ├── roberta_distilbert_requirements.txt
│   └── roberta_final.py
│
├── Gpt_Experiments/
│   ├── scripts/
│   │   ├── gpt4o_mini_fewshot_gridsearch.py
│   │   ├── gpt4o_mini_zeroshot_gridsearch.py
│   │   ├── gpt_5_mini_fewshot_gridsearch.py
│   │   ├── gpt_5_mini_zeroshot_gridsearch.py
│   │   ├── gpt_5_nano_fewshot.py
│   │   └── gpt_5_nano_zeroshot.py
│   ├── config.py
│   └── gpt_requirements.txt
│
├── RAG_Experiments/
│   ├── code/
│   │   ├── 01_agentic_rag_zeroshot.py       # Agentic RAG, GPT-4o Mini, zero-shot
│   │   ├── 02_agentic_rag_fewshot.py        # Agentic RAG, GPT-4o Mini, few-shot
│   │   ├── 03_adaptive_rag_zeroshot.py      # Adaptive RAG, GPT-4o Mini, zero-shot
│   │   ├── 04_adaptive_rag_fewshot.py       # Adaptive RAG, GPT-4o Mini, few-shot
│   │   ├── 05_adaptive_rag_gpt5mini.py      # Adaptive RAG, GPT-5 Mini, few-shot
│   │   ├── 06_threshold_tuning.py           # Per-label threshold tuning utility
│   │   ├── 07_gpt5mini_fewshot_direct.py    # Direct GPT-5 Mini baseline (no RAG)
│   │   └── config.py                        # RAG-specific configuration
│   ├── predictions/                         # Saved prediction JSON files
│   ├── results/                             # Threshold tuning result JSONs
│   ├── rag_requirements.txt
│   └── README.md
│
├── finetuned_distilbert_results/
│   └── DistilBERT_Results/
│
├── finetuned_roberta_results/
│   └── RoBERTa_Results/
│
├── gpt_4o_mini_results/
│   ├── few_shot_results/
│   ├── visuals_insights/
│   └── zero_shot_results/
│
├── gpt_5_mini_results/
│   ├── few_shot_results/
│   └── zero_shot_results/
│
├── gpt_5_nano_results/
│   ├── few_shot_results/
│   └── zero_shot_results/
│
├── .gitignore
├── LICENSE
└── README.md
```

## Installation

### Requirements

- Python 3.8+
- CUDA-compatible GPU (recommended for fine-tuning experiments)
- PyTorch
- OpenAI API key (for GPT and RAG experiments)

### Setup

1. **Clone the repository:**

```bash
git clone [repository URL]
cd quantum-bug-labeling-main
```

2. **Create and activate a virtual environment:**

```bash
# For Windows
python -m venv venv
venv\Scripts\activate

# For macOS and Linux
python3 -m venv venv
source venv/bin/activate
```

3. **Install Fine-tuned experiments requirements:**

```bash
cd Fine_tuned_Experiments
pip install -r roberta_distilbert_requirements.txt
```

**Note**: The `roberta_distilbert_requirements.txt` includes:

- `torch>=1.12.0`: PyTorch deep learning framework
- `transformers>=4.30.0`: Hugging Face transformers
- `numpy>=1.24.0`, `pandas>=2.0.0`: Data handling
- `scikit-learn>=1.2.0`: Machine learning metrics
- `matplotlib>=3.7.0`, `seaborn>=0.12.0`: Visualizations
- `tqdm>=4.65.0`: Progress bars
- `datasets>=2.12.0`: Dataset utilities

4. **Install PyTorch (for Fine-tuned experiments):**

Option 1: CPU-only (simpler but slower for training)

```bash
pip install torch torchvision torchaudio
cd ..
```

Option 2: GPU with CUDA 11.8 (recommended)

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
cd ..
```

5. **Install GPT experiments requirements:**

```bash
cd Gpt_Experiments
pip install -r gpt_requirements.txt
cd ..
```

6. **Install RAG experiments requirements:**

```bash
cd RAG_Experiments
pip install -r rag_requirements.txt
cd ..
```

**Note**: The `rag_requirements.txt` includes:

- `openai>=1.13.3`: OpenAI API client (GPT-4o Mini + GPT-5 Mini + embeddings)
- `numpy>=1.24.0`, `pandas>=2.0.0`: Data handling
- `scikit-learn>=1.2.0`: NearestNeighbors retrieval index and metrics

7. **Install Analysis requirements:**

```bash
cd Analysis
pip install -r analysis_requirements.txt
cd ..
```

## Configuration

### Fine-tuned Models Configuration

Edit `Fine_tuned_Experiments/finetunedconfig.py`:

```python
INPUT_FILE = r"../data/qiskit_repo_quantum_issues.json"  # Update path
OUTPUT_BASE_DIR = "output"
```

### GPT Models Configuration

Edit `Gpt_Experiments/config.py`:

```python
OPENAI_API_KEY = "your-api-key-here"  # REQUIRED
INPUT_FILE = r"../data/qiskit_repo_quantum_issues.json"
```

### RAG Experiments Configuration

Edit `RAG_Experiments/code/config.py`:

```python
OPENAI_API_KEY = "your-api-key-here"  # REQUIRED
DATA_PATH = r"C:\path\to\Data\qiskit_repo_quantum_issues.json"  # Update path
GPT4O_MINI_MODEL = "gpt-4o-mini"
GPT5_MINI_MODEL = "gpt-5-mini"
EMBED_MODEL = "text-embedding-3-small"
TOP_K = 5
MAX_CHARS = 6000
```

### Analysis Configuration

Edit `Analysis/config.py`:

```python
INPUT_FILE = r"../data/qiskit_repo_quantum_issues.json"
GITHUB_TOKEN = ""  # Only for bot_labeling_analysis.py
```

## Usage

### Fine-tuned Experiments

```bash
cd Fine_tuned_Experiments
python distilbert_final.py
python roberta_final.py
```

### GPT Experiments

```bash
cd Gpt_Experiments/scripts
python gpt4o_mini_zeroshot_gridsearch.py   # GPT-4o Mini zero-shot
python gpt4o_mini_fewshot_gridsearch.py    # GPT-4o Mini few-shot
python gpt_5_mini_zeroshot_gridsearch.py   # GPT-5 Mini zero-shot
python gpt_5_mini_fewshot_gridsearch.py    # GPT-5 Mini few-shot
python gpt_5_nano_zeroshot.py              # GPT-5 Nano zero-shot
python gpt_5_nano_fewshot.py               # GPT-5 Nano few-shot
```

### RAG Experiments

Run from the project root directory. Scripts are self-contained and use `RAG_Experiments/code/config.py`.
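Before running the pipelines, it may help to see what the `TOP_K` setting in `config.py` governs. The sketch below is not taken from the repository scripts. It illustrates top-k retrieval by cosine similarity in plain Python, whereas `rag_requirements.txt` indicates that the actual pipelines use a scikit-learn `NearestNeighbors` index over OpenAI embeddings. The toy vectors and the helper names `cosine` and `retrieve_top_k` are hypothetical.

```python
import heapq
import math

TOP_K = 5  # mirrors the TOP_K value shown in config.py


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, indexed, k=TOP_K):
    """Return the ids of the k indexed issues most similar to the query.

    `indexed` is a list of (issue_id, embedding) pairs. In a real pipeline
    the embeddings would come from an embedding model, not toy vectors.
    """
    scored = ((cosine(query_vec, vec), issue_id) for issue_id, vec in indexed)
    return [issue_id for _, issue_id in heapq.nlargest(k, scored)]


# Toy 2-D "embeddings" standing in for real issue embeddings (assumption).
indexed = [
    ("issue-1", [1.0, 0.0]),
    ("issue-2", [0.9, 0.1]),
    ("issue-3", [0.0, 1.0]),
]
neighbors = retrieve_top_k([1.0, 0.05], indexed, k=2)
print(neighbors)
```

With a larger `k`, more retrieved issues would be available as context for the prompt, at the cost of longer inputs.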
**Step 1: Run any RAG pipeline:**

```bash
python RAG_Experiments/code/01_agentic_rag_zeroshot.py
python RAG_Experiments/code/02_agentic_rag_fewshot.py
python RAG_Experiments/code/03_adaptive_rag_zeroshot.py
python RAG_Experiments/code/04_adaptive_rag_fewshot.py
python RAG_Experiments/code/05_adaptive_rag_gpt5mini.py
```

**Step 2: Run per-label threshold tuning on saved predictions:**

```bash
python RAG_Experiments/code/06_threshold_tuning.py agentic_rag_zeroshot_predictions.json
python RAG_Experiments/code/06_threshold_tuning.py agentic_rag_fewshot_predictions.json
python RAG_Experiments/code/06_threshold_tuning.py adaptive_rag_zeroshot_predictions.json
python RAG_Experiments/code/06_threshold_tuning.py adaptive_rag_fewshot_predictions.json
python RAG_Experiments/code/06_threshold_tuning.py adaptive_rag_gpt5mini_predictions.json
```

**Step 3: Run direct GPT-5 Mini baseline (no RAG, same 721-issue test set):**

```bash
python RAG_Experiments/code/07_gpt5mini_fewshot_direct.py
python RAG_Experiments/code/06_threshold_tuning.py gpt5mini_fewshot_direct_predictions.json
```

> **Note on runtime**: Each RAG script makes ~1–3 API calls per test issue (721 issues total). Expect 30–120 minutes per script depending on model and network latency.
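The threshold tuning in Step 2 can be pictured with the following sketch. It is an illustration only and is not taken from `06_threshold_tuning.py`. It assumes each prediction carries a per-label confidence score in [0, 1] (the schema of the prediction files is not described here) and that the objective is micro-averaged F1 (also an assumption). It sweeps a global threshold τ from 0.05 to 0.95 in steps of 0.05, the range listed under Hyperparameters. All function names and the toy data are hypothetical.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over parallel lists of label sets."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def apply_threshold(scores, tau):
    """Keep every label whose score is at least tau."""
    return [{label for label, s in row.items() if s >= tau} for row in scores]


def sweep_global_threshold(y_true, scores):
    """Try tau = 0.05, 0.10, ..., 0.95 and return (best_tau, best_f1)."""
    best_tau, best_f1 = None, -1.0
    for step in range(1, 20):
        tau = round(step * 0.05, 2)
        f1 = micro_f1(y_true, apply_threshold(scores, tau))
        if f1 > best_f1:  # strict '>' keeps the lowest tau on ties
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1


# Toy data: gold label sets and assumed per-label confidence scores.
y_true = [{"mod: circuit"}, {"mod: transpiler"}, {"mod: circuit", "mod: qpy"}]
scores = [
    {"mod: circuit": 0.9, "mod: transpiler": 0.4},
    {"mod: transpiler": 0.7, "mod: circuit": 0.3},
    {"mod: circuit": 0.8, "mod: qpy": 0.6, "mod: pulse": 0.2},
]
best_tau, best_f1 = sweep_global_threshold(y_true, scores)
print(best_tau, best_f1)
```

Per-label tuning would repeat the same sweep for one label at a time while holding the other labels' thresholds fixed. The Hyperparameters section states this is done only for labels with support below 50.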
### Analysis

```bash
cd Analysis
python ml_baselines.py           # Classical ML baselines (LR, SVM)
python statisticalanalysis.py    # McNemar's significance tests
python quantum_term_analysis.py  # Quantum terminology analysis
python bot_labeling_analysis.py  # Bot labeling analysis (requires GitHub token)
```

## Input Data Format

The experiments expect a JSON file with GitHub issues in the following format:

```json
[
  {
    "ID": "issue-123",
    "Title": "Fix barrier label position when bits are reversed",
    "Body": "Issue description text here...",
    "Labels": ["mod: visualization", "bug"]
  }
]
```

### Dataset Details

- **Source**: Qiskit GitHub repository issues
- **Total**: 2,415 issues across 12 quantum-specific categories
- **Labels**: `mod: algorithms`, `mod: circuit`, `mod: opflow`, `mod: primitives`, `mod: pulse`, `mod: qasm2`, `mod: qasm3`, `mod: qpy`, `mod: quantum info`, `mod: transpiler`, `mod: visualization`, `qamp`
- **Label type**: Multi-label (issues can have multiple labels)
- **Few-shot examples**: 13 curated examples excluded from all evaluation sets

## Output and Results

### Fine-tuned Models

- `finetuned_distilbert_results/DistilBERT_Results/`
- `finetuned_roberta_results/RoBERTa_Results/`

### GPT Models

- `gpt_4o_mini_results/`: GPT-4o Mini results (grid search)
- `gpt_5_mini_results/`: GPT-5 Mini results
- `gpt_5_nano_results/`: GPT-5 Nano results

### RAG Experiments

- `RAG_Experiments/predictions/`: JSON predictions for all 5 RAG variants + direct baseline
- `RAG_Experiments/results/`: Threshold tuning results (global sweep + per-label)

### Analysis Results

- `Analysis/predictions/`: Model predictions (`.npy` files for McNemar's test)
- `Analysis/results/`: Baseline metrics, quantum term analysis, bot labeling analysis

## Analysis Tools

### ML Baselines (`ml_baselines.py`)

- Trains Logistic Regression and Linear SVM with TF-IDF features
- **Outputs**: `baseline_results.csv`, `per_category_baseline_results.csv`, prediction `.npy` files

### Statistical Analysis (`statisticalanalysis.py`)

- McNemar's tests on 242 held-out test issues
- **Outputs**: `predictions/mcnemar_results.json`

### Quantum Terminology Analysis (`quantum_term_analysis.py`)

- Validates quantum-specific nature of dataset (hybrid TF-IDF + documentation approach)
- **Outputs**: Console statistics, domain specificity ratios

### Bot Labeling Analysis (`bot_labeling_analysis.py`)

- Compares bot labeling patterns across 10 classical + 10 quantum repositories
- Requires GitHub API token in `config.py`
- **Outputs**: Excel report, visualization PNG

## GPU Acceleration Notes

Fine-tuned models benefit from GPU acceleration:

- **Memory**: At least 8 GB GPU memory recommended
- **Training time**: 60–90 min with GPU (vs. days on CPU)
- Install PyTorch with CUDA as shown in the Installation section

## Computational Requirements

### Fine-tuned Models

- GPU: 8 GB+ VRAM
- Training: ~60 min (DistilBERT), ~90 min (RoBERTa) on RTX 3080
- Disk: ~300 MB (DistilBERT), ~500 MB (RoBERTa)

### GPT Experiments

- API costs: see `Gpt_Experiments/` for per-configuration cost breakdown
- Runtime: 30–60 min per configuration; several hours for full grid search

### RAG Experiments

- API costs: ~$0.50–2.00 per full run (embeddings + GPT calls for 721 issues)
- Runtime: 30–120 min per script

### Analysis Scripts

- RAM: 4 GB+
- Runtime: minutes

## Hyperparameters

### Fine-tuned Models (best configurations)

- **DistilBERT**: lr=8e-5, epochs=10, batch=12, weight_decay=0.005, cosine schedule
- **RoBERTa**: lr=3e-5, epochs=18, batch=32, weight_decay=0.15, cosine schedule

### GPT Models (Grid Search)

- Temperature: 0.0–1.0 (0.1 increments), Top-p: [0.8, 0.9, 1.0], Seed: 42

### GPT-5 Models (Responses API)

- Reasoning effort: [minimal, medium, high], Verbosity: low, Max output tokens: 300

### RAG: Threshold Tuning

- Global sweep: τ ∈ [0.05, 0.95] step 0.05
- Per-label tuning: independently optimized for labels with support < 50

## Troubleshooting

**OpenAI API Errors**

- `No API key found` → Set `OPENAI_API_KEY` in the relevant `config.py`
- Rate limit exceeded → Reduce concurrency or upgrade API tier
- `unsupported_parameter` with GPT-5 models → Scripts automatically retry without unsupported params

**GPU/CUDA Issues**

- `CUDA out of memory` → Reduce batch size in config or switch to CPU
- `CUDA not available` → Verify: `python -c "import torch; print(torch.cuda.is_available())"`

**Data Issues**

- JSON parsing errors → Ensure input file is valid UTF-8 JSON
- Missing labels in results → Confirm labels start with `mod:` or equal `qamp`

## Citation

If you use this work in your research, please cite:

```bibtex
@article{thatamsetty2026quantum,
  title={Automated Quantum Issue Labeling in Qiskit: Large Language Models and Fine-Tuned Transformers},
  author={Thatamsetty, Poojitha and Zhang, Lei},
  journal={Empirical Software Engineering},
  publisher={Springer},
  year={2026},
  note={Under review}
}
```

## Paper Status

This work has been submitted to **EMSE (Empirical Software Engineering, Springer)**. The replication package is publicly available on Zenodo (DOI: https://zenodo.org/records/18775645).

## Acknowledgments

This research is funded by the Strategic Awards for Research Transitions (START) at the University of Maryland, Baltimore County.

## Contact

For questions or issues, please open an issue in this repository or contact pthatam1@umbc.edu.

## License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.

---

**Note**: Ensure your OpenAI API key is configured in the appropriate `config.py` before running GPT or RAG experiments. API keys and large model files are excluded from version control via `.gitignore`.

**Keywords**: quantum software engineering, issue labeling, multi-label classification, large language models, fine-tuned transformers, retrieval-augmented generation, Qiskit
