A Compendium of Regular Expression Shapes in SPARQL Queries

Regular path queries (RPQs) are at the heart of navigational queries in graph databases. Motivated by new features of regular path queries in the languages Cypher, GQL, and SQL/PGQ, which require new approaches for indexing and compactly storing intermediate query results, we investigate a large corpus of real-world RPQs. Our corpus consists of 148.7 million RPQs occurring in 937.2 million SPARQL queries, used on 29 different data sets. We investigate three main questions on these logs. First, what is the syntactic structure of RPQs in practice? Second, how much non-determinism do they have? Third, can they be evaluated tractably under simple path and trail semantics? Concerning the first question, we show that all the RPQs can be classified in only 572 different syntactic shapes, which we provide in a downloadable data set in Zenodo. Furthermore, we classify the the relative use of various RPQ operators, and popular predicates that are used for transitive navigation. Concerning the second question, we show that although non-determinism occurs in the RPQs, less than one in ten million requires a deterministic finite automaton with more states than the size of the regular expression. This is remarkable because this blow-up is known to be exponential in the worst case. When using this data set, please cite the following paper:@inproceedings{HM25, author = {Janik Hammerer and Wim Martens}, title = {A Compendium of Regular Expression Shapes in SPARQL Queries}, booktitle = {Joint International Workshop on Graph Data Management Experiences {\&} Systems {(GRADES)} and Network Data Analytics (NDA)}, publisher = {{ACM}}, year = {2025}, url = {https://doi.org/10.1145/3735546.3735853}, doi = {10.1145/3735546.3735853}}

Related Organizations

University of Bayreuth
Germany

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	1
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

1

Average