
handle: 1959.4/51451
With the continual increases in storage and bandwidth capacity, there has been a corresponding increase in the need for effective data analysis. Applications range from marketing and customer relations to fraud and risk management to epidemiology. Many techniques focus on detecting useful patterns in data: common trends that can be exploited and applied to most cases. Often, however, it is the unusual cases that are interesting. Cases of fraud or network intrusion are not the norm, and as such, specific tools are needed to identify these abnormal scenarios. This thesis analyses several problems related to the identification of unusual patterns in large data sets. We focus on the development of efficient and accurate techniques for detecting such patterns. These patterns are identified in a number of domains, including network analysis (determining the protocol for encrypted data) and census records (looking for patterns of unusual deaths in mortality data). To be useful across these domains, we analyse a number of different data types: detection of outliers and density estimation for spatial data, identification of unusual sequences for network data, and groups of unusual points for categorical data. Our approaches have many real-world applications, and many of the data sets we use to evaluate our methods are real-world extracts. This demonstrates that our techniques can be used on data from different domains while maintaining high levels of performance and accuracy. Furthermore, our techniques are novel and provide new tools for mining unusual patterns, facilitating improved analysis compared to existing methods. We provide increased speed for the identification of local outliers in spatial data; this is complemented by a novel technique for density estimation in high-dimensional spatial data. Additionally, we present improved techniques for the identification of protocols and users in network data.
Finally, we develop an approach for grouping anomalies and demonstrate this approach on behavioural risk factor and mortality data. Unlike existing techniques such as clustering, our approach is able to group instances based on why they are considered anomalous.
data, methodologies, analysis, patterns, 004
