
This preprint describes a fully reproducible computational pipeline applied to the complete corpus of UFO sighting records declassified by the Spanish Air Force (1962–1995) and published through the Biblioteca Virtual de Defensa (BVMDefensa). The work includes automated scraping, dual OCR (Apple Vision and olmOCR), corpus fusion, relational database construction (SQLite), structured field extraction, semantic embeddings, clustering (UMAP + HDBSCAN), knowledge graph construction, and a retrieval-augmented generation (RAG) system validated on historical ground-truth cases. The resulting database (78 canonical cases, 2,135 OCR pages, 6,460 indexed text chunks) is intended as an open, auditable computational resource for research on declassified unidentified aerial phenomena (UAP) records in Spain. This work serves as a benchmark prior to scaling the methodology to larger international databases (NUFORC, UFOCAT, GEIPAN).
OCR, knowledge graph, declassified records, retrieval-augmented generation, Natural language processing, semantic embeddings, Data mining
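The storage and retrieval stages named in the abstract (a relational SQLite database of cases, OCR pages, and text chunks, queried by a retrieval step) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the table and column names, the word-window chunker, and the helper functions are hypothetical, and a simple bag-of-words cosine score stands in for the semantic embeddings the authors describe. The sample records are invented toy strings, not text from the corpus.

```python
import math
import re
import sqlite3
from collections import Counter

# Hypothetical schema: names are illustrative, not the authors' actual layout.
SCHEMA = """
CREATE TABLE cases  (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE pages  (id INTEGER PRIMARY KEY,
                     case_id INTEGER NOT NULL REFERENCES cases(id),
                     page_no INTEGER NOT NULL,
                     ocr_text TEXT NOT NULL);
CREATE TABLE chunks (id INTEGER PRIMARY KEY,
                     page_id INTEGER NOT NULL REFERENCES pages(id),
                     body TEXT NOT NULL);
"""


def chunk_words(text, size=40, overlap=10):
    """Split text into overlapping windows of `size` words (size > overlap)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def bow(text):
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_db(records):
    """Load (case_label, page_no, ocr_text) tuples into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    case_ids = {}
    for label, page_no, text in records:
        if label not in case_ids:
            cur = conn.execute("INSERT INTO cases(label) VALUES (?)", (label,))
            case_ids[label] = cur.lastrowid
        cur = conn.execute(
            "INSERT INTO pages(case_id, page_no, ocr_text) VALUES (?, ?, ?)",
            (case_ids[label], page_no, text),
        )
        page_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO chunks(page_id, body) VALUES (?, ?)",
            [(page_id, c) for c in chunk_words(text)],
        )
    conn.commit()
    return conn


def retrieve(conn, query, k=3):
    """Return the k chunks most similar to the query as (label, page_no, body)."""
    rows = conn.execute(
        "SELECT cases.label, pages.page_no, chunks.body "
        "FROM chunks "
        "JOIN pages ON pages.id = chunks.page_id "
        "JOIN cases ON cases.id = pages.case_id"
    ).fetchall()
    q = bow(query)
    return sorted(rows, key=lambda r: cosine(q, bow(r[2])), reverse=True)[:k]


if __name__ == "__main__":
    # Invented toy records for demonstration only.
    demo = [
        ("case-A", 1, "radar operator reports an unidentified echo near the coast"),
        ("case-B", 1, "pilot observes a bright light above the runway at night"),
    ]
    for label, page_no, body in retrieve(build_db(demo), "radar echo", k=1):
        print(label, page_no, body)
```

Keeping the case and page keys alongside each chunk means every retrieved passage can be traced back to its source page, which is one plausible way to support the auditability the abstract emphasizes. In a fuller system the retrieved chunks would be passed to a language model as context; that generation step is omitted here.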
