Learning from imbalanced and heterogeneous data

Big data analysis refers to the analysis of large and/or complex datasets using a range of techniques, including but not limited to machine learning. The contemporary digital world comprises many forms of unstructured and semi-structured data that are interlinked in complex networks. These data present many opportunities for discovering patterns and extracting actionable knowledge, along with the challenge of predicting future conditions. The large amount of raw data gives data scientists an excellent opportunity to retrieve valuable information using advanced data mining and machine learning techniques, which has also led to the rapid emergence of new technologies in data analytics. The major inspiration has come from the variety, variability, and veracity of data, in addition to their ever-growing volume. Data streams pose many new challenges and have wide applicability, such as analyzing messages in social networks or predicting stock changes. Big data is used to enhance decision making, provide insight and discovery, and support and optimize processes. It has become a core topic in industry, in research disciplines, and in society as a whole, because analysing unprecedented amounts of diverse data can fundamentally change the way industries operate, how research is done, and how people live with modern technology. Industries such as finance, health care, and manufacturing can benefit dramatically from improved and faster data analysis. However, due to the complexity of data in many real-world problems, one-size-fits-all solutions are seldom able to provide satisfactory answers. In addition to being large, variable, and volatile, most real datasets are also imbalanced, meaning that a large proportion of the data belongs to one class and only a small proportion belongs to the other.
Although data mining has been an active research area for many years, big data requires new technologies and techniques to capture, store, and analyze, as most developments related to big data are still at an early stage. This thesis therefore summarizes our investigations of various parametric and non-parametric methods aimed at providing better data mining solutions to problems involving big data that are not only large but also heterogeneous and imbalanced, such as financial market data and most health-related data. First, the imbalanced learning problem greatly affects the performance of learning algorithms because of underrepresented data and severe class distribution skews. For real-world data, choosing a metric learning method that suits the properties and domain characteristics of the data is critical to achieving high performance in most machine learning and data mining algorithms. When the dataset is large and imbalanced, it is extremely difficult to achieve good learning performance even with an accurate metric. In this thesis, an integrated method is therefore first proposed to tackle the imbalanced learning problem. Further improvement is achieved by an ensemble method that combines balancing techniques with metric learning techniques for learning from imbalanced datasets. In addition, combinations of metric learning algorithms and balancing techniques are tested, and their performance is compared using different evaluation metrics on bootstrap datasets of various sizes. Distance metric learning is a powerful tool that can improve the performance of similarity-based classification. In general, global metric learning techniques do not consider local data distributions and hence do not perform well on complex or heterogeneous data. Local metric learning methods address this problem but are usually expensive at runtime and prone to overfitting.
Therefore, a fuzzy-based local metric learning approach is proposed that outperforms recently proposed local metric learning methods while still being faster than popular global metric learning methods in most cases. Furthermore, unlike existing work in the literature, we extend the proposed approach to handle both numeric and categorical data. Classification is a research area in data mining with a long history. While many efficient classification algorithms have been developed, it can be difficult to interpret and apply their results in prediction applications, especially stock market prediction. The last chapter therefore investigates the benefits of combining a classification method with a regression approach to enhance prediction efficiency. In particular, data are first classified by a hierarchical beta process based method before being projected by a local metric learning based support vector regression method. Experiments on real stock market datasets show the effectiveness of our proposal. We also show that the prediction returns are further enhanced by considering other data sources such as news and overseas financial markets. Extensive experiments on various public datasets demonstrate the effectiveness and performance of our proposed frameworks. In addition, experiments using various real-world datasets, including financial, water pipeline, and horse race data, show that our proposed methods are practical.
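
As a minimal illustration of the class imbalance problem described in the abstract, the sketch below shows how plain accuracy can look high for a classifier that ignores the minority class, and how random oversampling equalizes class counts. It is not the integrated or ensemble method proposed in the thesis. The function names, the toy data, and the choice of random oversampling and balanced accuracy are assumptions made purely for illustration.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of each smaller class
    until every class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall values."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy dataset: 95 majority samples (class 0), 5 minority (class 1).
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5

# A trivial classifier that always predicts the majority class.
always_majority = [0] * len(y)
acc = accuracy(y, always_majority)            # high despite missing class 1
bal = balanced_accuracy(y, always_majority)   # exposes the missed class

# Balance the classes by oversampling the minority class.
X_bal, y_bal = random_oversample(X, y)
class_counts = Counter(y_bal)

print(acc, bal, dict(class_counts))
```

On this toy data the majority-only classifier is correct on 95 of 100 samples yet recalls none of the minority class, which is why evaluation on imbalanced data usually relies on metrics other than plain accuracy.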

Country

Australia

Related Organizations

UNSW Sydney
Australia

Keywords

Imbalanced, Classification, Metric Learning, 004
