Oracle R technologies for data analytics and machine learning in hybrid data systems
Martin Marquez, Manuel
Romero Marin, Antonio
CERN openlab summer student
The goal of this CERN openlab summer student project is to evaluate Oracle R Advanced Analytics for Hadoop (ORAAH) as a platform for CERN Big Data Analytics. ORAAH provides an R interface for manipulating data stored in HDFS, relational databases and file systems, while leveraging the capabilities of CRAN R packages. It also offers Hive transparency and maps HDFS data directly as input to machine learning algorithms that can run as MapReduce jobs or inside an Apache Spark container. The objective was to determine whether the cryogenic valve degradation analysis and the automatic detection of faulty valves can be carried out efficiently, in terms of CPU usage and job completion time, using ORAAH and Apache Spark. ORAAH lets you run R computations on Hadoop against data in HDFS, giving you the benefits of R while taking advantage of the horizontal scalability of Hadoop and MapReduce. ORAAH is designed for parallel reads and writes, and has resource management and database connectivity features, so it can be used together with Oracle R Enterprise (ORE). In addition, components such as the ORAAH Spark MLlib algorithms simplify analysis, classification and prediction problems.
BigDataLite VM 4.5.0
Oracle R Advanced Analytics for Hadoop 2.6.0
o ORCH package
  o Interaction with HDFS
  o orch.ml.svm (Spark MLlib container for the Support Vector Machine algorithm)
Apache Spark 1.6.0
o Spark SQL
o Spark MLlib
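As a minimal sketch of how these components fit together, the R snippet below attaches an HDFS-resident data set and fits a Spark-backed SVM through the ORCH package. It assumes a configured Hadoop/Spark environment such as the BigDataLite VM; the HDFS path, column names and connection parameters are illustrative assumptions, not the project's actual configuration.

```r
# Hedged sketch of an ORAAH session (ORAAH 2.6 API as described in the
# text; paths and column names are illustrative).
library(ORCH)

# Enable Spark execution so orch.ml.* algorithms run in a Spark container
spark.connect("yarn-client", memory = "2g")

# Attach a CSV of valve sensor readings already stored in HDFS
valves <- hdfs.attach("/user/oracle/valve_readings.csv")

# Fit a linear SVM via the Spark MLlib container; 'faulty' is assumed
# to be a binary label column in the attached data
fit <- orch.ml.svm(formula = faulty ~ ., data = valves)

spark.disconnect()
```

Because the data stays in HDFS and the model is built inside Spark, the same R script scales with the cluster rather than with the client machine.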
I present an evaluation of Oracle R Advanced Analytics for Hadoop as a Big Data Analytics platform for advanced analytics and machine learning. I used R as the basic modelling tool, as it is one of the most powerful statistical computing languages, with many predefined functions that make it easy to analyse and test data. To provide a comparison and truly judge the performance of ORAAH, Apache Spark was used to model the same approaches. Performance was measured by the time taken to build each model and by the model's accuracy.
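The measurement described above can be sketched in plain R: `system.time` is base R, while the model call and the hold-out objects (`valves_train`, `valves_test`, label column `faulty`) are illustrative assumptions for this sketch.

```r
# Hedged sketch of the measurement loop: wall-clock build time via base
# R's system.time(), accuracy via a simple hold-out comparison.
build_time <- system.time(
  fit <- orch.ml.svm(formula = faulty ~ ., data = valves_train)
)["elapsed"]

pred <- predict(fit, valves_test)            # score the hold-out set
accuracy <- mean(pred == valves_test$faulty) # fraction correctly labelled

cat(sprintf("build time: %.1f s, accuracy: %.3f\n", build_time, accuracy))
```

Repeating this for ORAAH and for the equivalent Spark MLlib job gives the time/accuracy comparison reported here.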
This project studied the potential applicability of the aforementioned technologies on real CERN analytics use cases: (a) the degradation analysis of cryogenic valves in LHCb; (b) the prediction of faulty cryogenic valves.
Both use cases were run and modelled with the technologies listed above, and the computed results are very promising for the future use of scalable services as a CERN Big Data Analytics platform.