
Large language models (LLMs) are commonly aligned with human preferences via RLHF or direct preference optimization. We introduce Ouroboros, a human-led recursive reinforcement (HLRR) procedure that repeatedly distills a single teacher’s judgments, meta-commentary, and persona into future model behavior. In contrast to conventional RLHF, which freezes supervision into a static reward model, Ouroboros closes the loop: model outputs are archived, summarized, and then re-expressed as deliberately intricate “labyrinth” prompts that probe coherence and reasoning. The same human then scores and rewrites the exchange, producing rich signals that assess factuality, logical self-consistency, and identity coherence. Across three base models (GPT-J 6B, Llama-2 70B, GPT-4o), Ouroboros improves long-horizon factual accuracy by 8–14 percentage points, roughly halves adversarial mode collapse, and reaches a target persona about 3× faster than RLHF baselines. We release code, evaluation suites, and annotated traces to support reproducibility.
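The abstract describes the Ouroboros loop in prose; the following Python pseudocode restates one iteration of it. This is a minimal sketch, not the released implementation: the `model` and `human` objects and every method on them (`generate`, `summarize`, `build_labyrinth_prompt`, `score_and_rewrite`, `update`) are hypothetical placeholders, since the abstract does not specify an API.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    """Human-provided signals named in the abstract (scales are assumed)."""
    factuality: float          # factual-accuracy rating, assumed in [0, 1]
    self_consistency: float    # logical self-consistency rating, assumed in [0, 1]
    identity_coherence: float  # persona/identity coherence rating, assumed in [0, 1]
    rewrite: str               # the human's rewritten version of the exchange


@dataclass
class Archive:
    """Store of past exchanges; the abstract says outputs are archived."""
    exchanges: list = field(default_factory=list)

    def add(self, prompt: str, output: str, feedback: Feedback) -> None:
        self.exchanges.append((prompt, output, feedback))


def ouroboros_iteration(model, human, archive: Archive, seed_prompt: str) -> str:
    """One pass of the human-led recursive reinforcement (HLRR) loop.

    `model` and `human` are hypothetical interfaces standing in for the
    LLM under training and the single human teacher, respectively.
    """
    # 1. The model answers the current prompt.
    output = model.generate(seed_prompt)

    # 2. Archived outputs are summarized and re-expressed as an intricate
    #    "labyrinth" prompt that probes coherence and reasoning.
    summary = model.summarize([o for _, o, _ in archive.exchanges] + [output])
    labyrinth_prompt = human.build_labyrinth_prompt(summary)

    # 3. The model responds to the labyrinth prompt.
    probe_output = model.generate(labyrinth_prompt)

    # 4. The same human scores and rewrites the exchange, producing signals
    #    for factuality, logical self-consistency, and identity coherence.
    feedback: Feedback = human.score_and_rewrite(labyrinth_prompt, probe_output)
    archive.add(labyrinth_prompt, probe_output, feedback)

    # 5. The scored, rewritten exchange is distilled back into the model
    #    (e.g. via supervised fine-tuning or preference optimization).
    model.update(labyrinth_prompt, feedback)

    # The labyrinth prompt seeds the next iteration, closing the loop.
    return labyrinth_prompt
```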
autoregressive transformers, GPT-4o, reward modeling, Machine Learning, Llama-2, prompt engineering, human-led recursive reinforcement, human-in-the-loop learning, long-horizon reasoning, Artificial Intelligence, meta-feedback, Reflexion, large language models, Ouroboros, conversational agents, alignment research, self-consistency, GPT-J, reinforcement learning from human feedback, preference optimization, recursive alignment, RLAIF, labyrinth prompts, AI safety, persona alignment, RLHF
