A corpus-based investigation of junk emails

Part of book or chapter of book English OPEN
Orasan, Constantin ; Krishnamurthy, Ramesh (2002)
  • Publisher: ELRA

Almost everyone who has an email account receives unwanted emails from time to time. These emails can be jokes from friends or commercial product offers from unknown people. In this paper we focus on unwanted messages which try to promote a product or service, or to offer some “hot” business opportunities. These messages are called junk emails. Several methods for filtering junk emails have been proposed, but none considers the linguistic characteristics of junk emails. In this paper, we investigate the linguistic features of a corpus of junk emails, and try to decide whether they constitute a distinct genre. Our corpus of junk emails was built from the messages received by the authors over a period of time. Initially, the corpus consisted of 1563 files, but after automatically eliminating the duplicates we kept only 673 files, totalling just over 373,000 tokens. In order to decide whether the junk emails constitute a different genre, a comparison with a corpus of leaflets extracted from the BNC and with the whole BNC is carried out. Several characteristics at the lexical and grammatical levels were identified.
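
The abstract does not say how the duplicates were eliminated. As a rough illustration only, one common approach is to compare messages by the overlap of their word n-grams ("shingles") and keep the first message of each near-duplicate group. The sketch below assumes this approach; the function names, the shingle width `w`, the `threshold` value and the sample messages are all hypothetical and are not taken from the paper.

```python
import re


def shingles(text, w=4):
    """Return the set of w-word shingles in text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Jaccard overlap of two shingle sets, between 0.0 and 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(messages, threshold=0.9, w=4):
    """Keep the first message of every group of near-duplicates.

    The threshold is an assumed value, not one reported by the authors.
    """
    kept = []  # pairs of (message, shingle set) retained so far
    for msg in messages:
        s = shingles(msg, w)
        if all(resemblance(s, t) < threshold for _, t in kept):
            kept.append((msg, s))
    return [m for m, _ in kept]


# Invented sample messages, for illustration only.
messages = [
    "Make money fast with this amazing business opportunity today",
    "Make money fast with this amazing business opportunity today",
    "Lose weight now with our new miracle product order today",
]
unique = deduplicate(messages)
print(len(messages), "->", len(unique))
```

This pairwise comparison is quadratic in the number of messages, which is acceptable for a corpus of a few thousand files; a larger collection would probably need hashing or sampling of the shingles instead.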