
doi: 10.2139/ssrn.2670326
This presentation to the Moore-Sloan Data Science Environments Summit at Suncadia, WA describes two hybrid text analysis pipelines and closes with a call to accelerate the uptake of text analysis techniques by social scientists. The first text analysis pipeline allows researchers to inductively identify sets of actions cohering in recognizable performances (e.g., marches and demonstrations) during events of the Occupy movement. The approach uses human annotation, coreference resolution, named-entity resolution and replacement, a novel approach to date recognition in text, and structural topic modeling to show how and why protester-initiated events and police responses varied over the course of 184 American Occupy campaigns. Findings upend the going sociological theories of American protest policing, which argue that police are independent from city politics and mostly reactive to protesters' activities. The second pipeline adapts the traditional content analysis method of 'hand-coding' to a new workforce: crowd workers and citizen scientists. The complex process of content analysis is decomposed into two steps. First, texts are 'chunked' according to the researcher's units of analysis (as opposed to at the sentence or paragraph level). Next, a new Text Thresher interface presents a series of 'reading comprehension' questions to crowd workers, asking them to highlight the text they use to justify their answers. The combination of highlighting and question answering effectively 'labels' text without requiring face-to-face training of crowd workers. All of these labeled data can then be used to train algorithms via well-known machine learning processes. All told, these and other Hybrid Text Analysis approaches stand to greatly increase the capacity and quality of text analysis work, enabling social scientists and digital humanists to take full advantage of the terabytes of text data now available.
But universities must make special efforts to bring social scientists and digital humanists up to speed on the many tools that computer scientists and computational scientists have created. Without changes to incentives and collaborative structures, the growth of text analysis will plod on slowly, senior scholars will be left behind, and they may even hold back the progress of scholarship by failing to support powerful new approaches.
