Data sharing and standardization in Linguistics

Linguistic typology stands to gain significantly from advances in the use of extremely large datasets. However, our ability to secure these gains will depend on the availability of machine-readable data that is precise and comparable. Here we identify the challenges and opportunities ahead, relating to the quality, longevity, and (re-)usability of linguistic data in typology. In response, we introduce the DeAR principles (Decentralized, Automatically verified, Revisable), designed to guide and assist researchers in creating diverse, high-resolution, and robust datasets. We demonstrate the DeAR principles in action through the example of Paralex, a data standard (i.e., a set of scientific conventions) developed collaboratively for lexicons of morphologically inflected forms. Our proposals aim to foster a more resilient and equitable infrastructure for the future of linguistic research.

Funded by

UKRI, EC| MOLOR