General Linear Model (GLM) framework for robust ANOVA-PCA in unbalanced chemometric designs

Torres, Éder Rissi; Ferreira, Márcia Miguel Castro


Software

Data sources: ZENODO


Type: Research software (Software). Status: Under curation. Language: English. Publisher: Zenodo


doi: 10.5281/zenodo.19821115


Abstract

This repository contains the computational framework, simulated datasets, and a standalone Python solver designed to evaluate and mitigate variance leakage in ANOVA-PCA under unbalanced experimental designs. Classical ANOVA-PCA estimates factor effects from marginal means, which implicitly assumes that the design is structurally orthogonal (balanced). Missing data or unbalanced designs violate this assumption, leading to severe effect-estimation errors (measured in the Frobenius norm) and spurious rotations of the estimated subspaces during the PCA stage. To address this, the repository provides:

- Monte Carlo Simulation Engine: A complete Python pipeline that simulates synthetic vibrational spectra based on a hierarchical $2^4$ factorial design. It allows the impact of data loss to be evaluated under distinct attrition regimes (e.g., Missing Completely at Random (MCAR) vs. systematic structural imbalance).
- Theoretical Bound Assessment: Scripts that quantify subspace distortion and interpretability loss using Davis-Kahan perturbation bounds.
- Robust GLM Solver (glm_anova_pca.py): A standalone, object-oriented Python module that implements General Linear Model (GLM) orthogonal projections via the Moore-Penrose pseudoinverse. The solver is designed for direct application to new experimental datasets and is intended to preserve structural integrity even under severe imbalance.
- Real-World Validation: Execution logs and scripts that apply the GLM framework to experimental Near-Infrared (NIR) spectroscopy data (bread staling kinetics), showing that it is equivalent to, and backward compatible with, classical approaches when the design is perfectly balanced.

This open-science package is intended to make the associated manuscript fully reproducible and to give chemometricians and data scientists a robust standard for routine variance partitioning in multivariate analysis.
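The contrast between marginal-mean and GLM effect estimation described above can be illustrated with a minimal sketch. It does not reproduce the API of glm_anova_pca.py: every function and variable name below is hypothetical, and the example assumes a noise-free, two-factor, main-effects-only toy design rather than the hierarchical $2^4$ design used in the repository. It shows how, under imbalance, the marginal-mean estimate of factor A absorbs part of factor B, while a pseudoinverse-based least-squares fit does not.

```python
import numpy as np

def effect_coding(levels):
    """Map two-level factor labels (0/1) to -1/+1 sum-to-zero coding."""
    return np.where(np.asarray(levels) == 0, -1.0, 1.0)

def glm_effect_matrices(X, Y):
    """Least-squares fit via the Moore-Penrose pseudoinverse.

    Returns one effect matrix per column of the design matrix X.
    """
    B = np.linalg.pinv(X) @ Y
    return [np.outer(X[:, j], B[j]) for j in range(X.shape[1])]

def marginal_mean_effect(levels, Y):
    """Classical estimate: level mean minus grand mean, repeated per sample."""
    levels = np.asarray(levels)
    grand = Y.mean(axis=0)
    E = np.zeros_like(Y)
    for lv in np.unique(levels):
        mask = levels == lv
        E[mask] = Y[mask].mean(axis=0) - grand
    return E

def first_loading(E):
    """First PCA loading (right singular vector) of an effect matrix."""
    _, _, Vt = np.linalg.svd(E, full_matrices=False)
    return Vt[0]

def angle_deg(u, v):
    """Angle between two directions, ignoring sign."""
    c = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, 0.0, 1.0)))

# Hypothetical unbalanced 2x2 layout: replicate counts per (A, B) cell.
cells = {(0, 0): 4, (0, 1): 1, (1, 0): 1, (1, 1): 4}
a = np.array([ai for (ai, bi), n in cells.items() for _ in range(n)])
b = np.array([bi for (ai, bi), n in cells.items() for _ in range(n)])

# Toy "spectral" signatures over five variables (assumed values).
mu = np.array([5.0, 5.0, 5.0, 5.0, 5.0])
alpha = np.array([1.0, 2.0, 1.0, 0.0, 0.0])   # signature of factor A
beta = np.array([0.0, 0.0, 1.0, 2.0, 1.0])    # signature of factor B

# Noise-free data, so any estimation error comes from the imbalance alone.
Y = mu + np.outer(effect_coding(a), alpha) + np.outer(effect_coding(b), beta)
E_A_true = np.outer(effect_coding(a), alpha)

# Classical marginal-mean estimate of the factor A effect matrix.
E_A_classical = marginal_mean_effect(a, Y)

# GLM estimate: design matrix with intercept and both main effects.
X = np.column_stack([np.ones(len(a)), effect_coding(a), effect_coding(b)])
_, E_A_glm, _ = glm_effect_matrices(X, Y)

err_classical = np.linalg.norm(E_A_classical - E_A_true)  # Frobenius norm
err_glm = np.linalg.norm(E_A_glm - E_A_true)

# Rotation of the first loading of factor A away from its true signature.
rot_classical = angle_deg(first_loading(E_A_classical), alpha)
rot_glm = angle_deg(first_loading(E_A_glm), alpha)

print(f"Frobenius error  classical: {err_classical:.3f}  GLM: {err_glm:.3e}")
print(f"Loading rotation classical: {rot_classical:.2f} deg  GLM: {rot_glm:.2e} deg")
```

In this toy layout the two factors are correlated through the unequal cell counts, so the marginal means of A carry a fraction of the B signature. The GLM fit estimates both factors jointly and therefore separates them. With balanced cell counts the two estimates coincide.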
