
This paper extends prior work on the articulation–application gap in AI safety and the Contextual Ethical Consistency Test (CECT) by introducing a multi-layer, longitudinal evaluation of ethical behavior in large language models (LLMs). Using a set of nine commercial models evaluated across bilingual, multi-turn scenarios, the study examines ethical consistency as a trajectory-dependent property rather than a static attribute of isolated outputs. It introduces a key distinction between consistency of choice and consistency of justification, showing that models may maintain stable decisions while substantially reconfiguring the moral frameworks that support them. Additional layers of analysis include full-history reconstruction (CTH), affective framing sensitivity (EDP), localized perturbations (LOS family), self-auditing under blind and revealed conditions, and cross-model auditing, including double adjudicative audits. The findings suggest that observed ethical behavior in LLMs is highly sensitive to contextual variables such as language, authority, narrative accumulation, stake inversion, and reset conditions. Furthermore, the study shows that the evaluation layer itself is not stable: auditors (LLMs evaluating other LLMs) may change their interpretation depending on identity disclosure. The paper argues that evaluating ethical consistency in AI systems requires moving from local snapshot assessments to longitudinal, multi-layer frameworks that explicitly account for trajectory, justification, retrospective reconstruction, and auditor stability. This work does not propose a mechanistic theory of moral reasoning in LLMs; instead, it provides a behaviorally grounded and auditable framework for studying persistence, contextual inducibility, and evaluation robustness in deployed conversational systems.
