
Abstract: Recovering function names from stripped binaries remains a bottleneck in software maintenance, program comprehension, binary debugging, and security analysis. Although recent years have seen a wave of machine-learning-based techniques, the practical state of the art remains difficult to assess. Prior studies are confounded by three recurring problems: a widespread assumption that heavy manual preprocessing is needed to help tokenizers, even though such processing can erase domain-specific semantics or simplify labels in ways that inflate scores; evaluations that are not directly comparable because tools rely on different function-discovery backends or on permissive metrics such as token-level top-$k$; and severe reproducibility barriers caused by missing artifacts, undocumented bugs, and extreme computational cost. This experience paper reports our effort to systematize and re-evaluate function-name recovery through a within-pipeline sensitivity analysis. We reproduce four representative state-of-the-art models on a common dataset and controlled pipeline, then retrain them under multiple preprocessing configurations to test whether manual segmentation and normalization are necessary. Across models, we find that these hand-engineered strategies often provide limited benefit over modern tokenizers and can silently discard useful semantic information. We further re-evaluate model outputs under stricter, analyst-facing criteria and show that permissive scoring schemes can substantially overstate practical performance. Finally, we document the scalability and reproducibility challenges encountered during reproduction, including missing artifacts, software bugs, and prohibitive resource demands. Based on these findings, we propose a unified evaluation framework and concrete best practices for more robust, comparable, and reproducible research on function-name recovery.
