Telenor Nordics Customer Service Self-Help Corpus

This is a multilingual customer service self-help corpus comprising 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling over one million tokens. The documents were sourced from the public self-help pages of four Nordic telecommunications operators and then filtered for personally identifiable information (PII) and relevance through a combined LLM and human annotation pipeline. The accompanying paper has been submitted to the Nordic Machine Intelligence Journal and is pending peer review.

Version 1.1

- Added a derived metadata.topic_classification field to every document (zero-shot category, similarity score, model, text source, prompt language).
- Corpus size is now reported in spaCy word tokens and characters (previously subword tokens); added per-language linguistic statistics and a length figure.
- Updated and simplified the reproduction code (analysis/lingcount, analysis/topicclass) and the documentation.
- Document text, filtering, PII and content spans are unchanged from v1.0.
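
As a rough illustration of how the metadata.topic_classification field might be used, the sketch below tallies documents per zero-shot category. It assumes the corpus is stored as JSON Lines with one document per line; the key names inside the records (id, language, category, similarity_score) and the sample values are hypothetical placeholders, not the corpus's documented schema, so adjust them to the actual files.

```python
import json
from collections import Counter

# Made-up sample records. The nesting metadata -> topic_classification follows
# the field name given in the changelog; all other key names are assumptions.
SAMPLE_JSONL = """\
{"id": "doc-1", "language": "sv", "metadata": {"topic_classification": {"category": "billing", "similarity_score": 0.71}}}
{"id": "doc-2", "language": "fi", "metadata": {"topic_classification": {"category": "roaming", "similarity_score": 0.64}}}
{"id": "doc-3", "language": "da", "metadata": {"topic_classification": {"category": "billing", "similarity_score": 0.58}}}
{"id": "doc-4", "language": "no", "metadata": {}}
"""


def count_categories(lines):
    """Count documents per topic category, tolerating missing fields."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        topic = record.get("metadata", {}).get("topic_classification", {})
        counts[topic.get("category", "unknown")] += 1
    return counts


category_counts = count_categories(SAMPLE_JSONL.splitlines())
print(category_counts.most_common())
```

To run this against real data, replace SAMPLE_JSONL.splitlines() with an open file handle, since iterating over a file yields one line at a time.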
