
While qualitative research plays a vital role in understanding complex phenomena, it lends itself poorly to testing formal hypotheses because statistical models cannot be fit directly to text data. Approaches traditionally used to quantify text data (e.g., content analysis) are generally time-consuming, prone to researcher bias, and neglect a substantial amount of potentially important semantic context. Although novel approaches have been proposed, these typically require large amounts of text data and tend to be inductive in nature. To enable researchers to ask both hypothesis-based and open-ended questions of their text data, the current study proposes a novel retrieval-augmented generation (RAG)-based approach, called text embedding similarity analysis (TESA), that transforms a hypothesis into two specific search terms: a population (or sample) and a variable of interest. Using pretrained large language models (LLMs), we extract semantic embeddings of the search terms and the text data and use cosine similarity to match the search terms to the text. This allows hypothesis testing by assessing the alignment between the distribution of similarity scores for a variable of interest and the expectation for the population.
Psychology/methods, Humans, semantics, qualitative research
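
The matching step described in the abstract, scoring text against a search term by cosine similarity of embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `score_passages` are hypothetical, and the three-dimensional vectors are toy stand-ins for the embeddings a pretrained LLM would produce. How the resulting score distribution is compared with the population expectation is not specified in the abstract and is left out here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def score_passages(term_embedding, passage_embeddings):
    """Return one similarity score per passage for a single search term."""
    return [cosine_similarity(term_embedding, p) for p in passage_embeddings]


# Toy embeddings (assumption): real embeddings would come from a pretrained
# model and have hundreds or thousands of dimensions.
variable_embedding = [1.0, 0.0, 0.0]
passage_embeddings = [
    [2.0, 0.0, 0.0],  # same direction as the search term
    [0.0, 3.0, 0.0],  # orthogonal to the search term
    [1.0, 1.0, 0.0],  # 45 degrees from the search term
]

scores = score_passages(variable_embedding, passage_embeddings)
```

Because cosine similarity depends only on direction, the first passage scores the same as an identical vector would, regardless of its length. In the approach described above, the list of scores for a variable of interest is the distribution that would then be assessed against the expectation for the population.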
