
As Large Language Models (LLMs) scale, their deployment cost on commodity hardware becomes prohibitive. While unstructured pruning offers theoretical compression, it often requires specialized kernels to realize speedups. We propose a robust Structured Minification framework that physically reduces the intermediate dimensions of Transformer MLPs, ensuring compatibility with standard GEMM operations. Our methodology combines (1) a global first-order Taylor sensitivity analysis to identify redundant feature dimensions, and (2) a closed-form Ridge Regression reconstruction that restores the output distribution of the pruned layers in the least-squares sense.

We investigate the efficacy of this approach across model scales, applying it to a parameter-dense 135M model and a 1.7B model. Our results demonstrate that minification is effective even for smaller, dense models at high retention rates: the 135M model retains significant coherence at 90% retention (Perplexity 4.33 → 4.89). We also observe a strong scaling law: the 1.7B model is considerably more robust, tolerating 30% structural removal (70% retention) with only minor degradation (Perplexity 3.16 → 4.09). This suggests that while smaller models require conservative minification (80-90% retention), larger over-parameterized models possess a highly compressible subspace recoverable via linear least squares.

Because our framework reduces model topology without altering weight precision, it remains orthogonal to quantization, enabling composite compression pipelines that apply structural minification followed by bit-width reduction.

The code is available at https://github.com/VladimerKhasia/minisp
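The two steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository for that) and it makes several simplifying assumptions: a single two-matrix ReLU MLP rather than a gated Transformer MLP, per-layer rather than global ranking, and a caller-supplied output gradient `grad_Y`. The function names `taylor_importance`, `ridge_reconstruct`, and `minify_mlp` and the regularizer `lam` are hypothetical, chosen only for illustration.

```python
import numpy as np

def taylor_importance(H, grad_H):
    """First-order Taylor score per hidden unit: |sum_n h_nj * dL/dh_nj|."""
    return np.abs((H * grad_H).sum(axis=0))

def ridge_reconstruct(H_kept, Y_target, lam=1e-3):
    """Closed-form ridge solution W = (H^T H + lam * I)^{-1} H^T Y."""
    k = H_kept.shape[1]
    gram = H_kept.T @ H_kept + lam * np.eye(k)
    return np.linalg.solve(gram, H_kept.T @ Y_target)

def minify_mlp(W_in, W_out, X, grad_Y, retention=0.7, lam=1e-3):
    """Shrink the hidden width of y = relu(x @ W_in) @ W_out (assumed layout).

    X      : (n, d) calibration inputs
    grad_Y : (n, d) gradient of some loss w.r.t. the MLP output (assumed given)
    """
    H = np.maximum(X @ W_in, 0.0)      # hidden activations, shape (n, h)
    Y = H @ W_out                      # original outputs to be matched
    grad_H = grad_Y @ W_out.T          # backpropagate through the output projection
    scores = taylor_importance(H, grad_H)

    k = max(1, int(round(retention * W_in.shape[1])))
    keep = np.sort(np.argsort(scores)[-k:])   # indices of the k highest-scoring units

    W_in_small = W_in[:, keep]                # physically smaller input projection
    W_out_small = ridge_reconstruct(H[:, keep], Y, lam)  # refit output projection
    return W_in_small, W_out_small, keep
```

Both returned matrices are ordinary dense arrays with a smaller inner dimension, so the pruned layer still runs as two standard matrix multiplications. Refitting `W_out_small` by ridge regression, rather than simply slicing `W_out[keep]`, is what lets the kept units compensate for the removed ones on the calibration data.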
