
As the data generated on the internet exponentially increases, developing guided data collection methods become more and more essential to the research process. This paper proposes an approach to building a self-guiding web-crawler to collect data specifically from extremist websites. The guidance component of the web-crawler is achieved through the use of sentiment-based classification rules which allow the crawler to make decisions on the content of the webpage it downloads. First, content from 2,500 webpages was collected for each of the four different sentiment-based classes: pro-extremist websites, anti-extremist websites, neutral news sites discussing extremism and finally sites with no discussion of extremism. Then parts of speech tagging was used to find the most frequent keywords in these pages. Utilizing sentiment software in conjunction with classification software a decision tree that could effectively discern which class a particular page would fall into was generated. The resulting tree showed an 80% success rate on differentiating between the four classes and a 92% success rate at classifying specifically extremist pages. This decision tree was then applied to a randomly selected sample of pages for each class. The results from the secondary test showed similar results to the primary test and hold promise for future studies using this framework.
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 19 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Top 10% | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Top 10% |
