
How should retrieval-augmented generation (RAG) systems be configured for clinical decision support in Portuguese? We evaluate 500 clinical queries across 6 medical specialties, comparing BM25, dense, and hybrid retrieval. We report four findings: (1) BM25 and hybrid retrieval surface statistically distinct document sets (McNemar p<0.001), confirming complementarity; (2) dense-only retrieval fails for 22.2% of queries; (3) authority-weighted scoring affects ranking but not recall; (4) inter-annotator agreement reaches kappa=0.954, supporting the use of LLM-as-judge for Portuguese clinical text. In addition, deterministic citation verification all but eliminates hallucinations (461/500 vs 1/500, Fisher p<0.001).
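
As a rough check on the Fisher comparison reported above, the following minimal sketch computes a two-sided Fisher exact p-value using only the Python standard library. It rests on assumptions not stated in the abstract: that 461/500 and 1/500 count responses containing at least one hallucinated citation without and with verification, respectively, and that the two groups can be treated as independent. The function name `fisher_exact_two_sided` is hypothetical and is not taken from the paper's code.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)  # number of ways to fill the first column

    def weight(x):
        # Unnormalised probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    observed = weight(a)
    tail = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return Fraction(tail, total)  # exact rational, no floating-point rounding


# ASSUMPTION: rows are (without verification, with verification);
# columns are (hallucinated, not hallucinated).
p_value = fisher_exact_two_sided(461, 39, 1, 499)
print(f"p = {float(p_value):.3e}")
```

Exact integer arithmetic is used so that the very small tail probability does not depend on floating-point behaviour. If the same 500 queries were scored under both conditions, a paired test such as McNemar's might be the more natural choice; the abstract reports Fisher, so the sketch follows it.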
