
Large language models (LLMs) with reasoning capabilities offer a promising path for improving candidate evaluation in planning frameworks, but their performance relative to traditional non-reasoning models remains largely underexplored. In this study, we benchmark a distilled 1.5B-parameter reasoning model (DeepSeek-R1) against several state-of-the-art non-reasoning LLMs within a generator-discriminator LLM planning framework for the text-to-SQL task. To this end, we introduce a novel method for extracting soft scores from the chain-of-thought (CoT) outputs of reasoning models that enables fine-gr

Research goal: How does the reasoning accuracy of multimodal large language models compare to diffusion-based trajectory policies in dynamic task planning environments when evaluated on the RoboBench benchmark with varying levels of environmental noise?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.6/10.
