RGen: Data generator for benchmarking Big Data workloads

[Summary] This Bachelor's Thesis (Trabajo Fin de Grado, TFG) presents the design and implementation of RGen, a parallel data generator for benchmarking Big Data workloads. The tool is developed in Java under the MapReduce programming paradigm, specifically using the Apache Hadoop processing framework. In addition, RGen supports generating data directly on the Hadoop distributed file system, the cornerstone of storage for Big Data batch processing frameworks. RGen combines the integration of pre-existing features with the development of new functionality in a standalone tool. The ultimate goal is a complete, parallel and scalable tool that gathers the functionality needed, without depending on third-party software, to generate data for the workloads supported by the Big Data Evaluator (BDEv) benchmarking suite. The main functionalities developed in this thesis are the generation of text and graphs that meet the characteristics defined by the 4 Vs of Big Data: Volume, Variety, Velocity and Veracity. Special emphasis is placed on the last of these, since many specific benchmarks need a large amount of truthful information. For text generation, the LDA model, which is used to extract the topics covered in a series of documents, was chosen. Graph generation, in turn, is based on the Kronecker model. RGen was developed following well-established software engineering practices. Design and architectural patterns were used to obtain an easily maintainable and extensible tool with clean, high-quality code. Scrum, an agile development framework based on Sprints, was used to organize the work. The performance and scalability of the data generator were evaluated experimentally both in a local environment and on a high-performance cluster, with different configurations of the number of nodes and the amount of data generated in parallel. The tool is available for download from the following Git repository: https://github.com/rubenperez98/RGen.

[Abstract] This BSc Thesis presents the design and implementation of RGen, a parallel data generator for benchmarking Big Data workloads. The tool is developed in Java under the MapReduce programming paradigm, specifically using the Apache Hadoop processing framework. In addition, RGen supports generating data directly on the Hadoop distributed file system, the cornerstone of storage for Big Data batch processing frameworks. RGen combines the integration of existing features with the development of new functionality in a standalone tool. The main objective is a complete, parallel and scalable tool that gathers the necessary functionality, without depending on third-party software, to generate data for the different workloads supported by the Big Data Evaluator (BDEv) benchmarking suite. The main functionalities developed in this BSc Thesis are the generation of text and graphs that meet the characteristics defined by the 4 Vs of Big Data: Volume, Variety, Velocity and Veracity. Special emphasis is placed on the last one, since many specific benchmarks require a large amount of truthful information. Text generation relies on the Latent Dirichlet Allocation (LDA) model, which is normally used to extract the topics covered in a series of documents. Graph generation is based on the Kronecker model. RGen has been developed following well-established software engineering practices. Design and architectural patterns have been used to obtain an easily maintainable and extensible tool with clean, high-quality code. Scrum, an agile development framework based on Sprints, has been used to organize the work. To evaluate the performance and scalability of the data generator, multiple experiments have been carried out both in a local environment and on a high-performance cluster. Different configurations have been evaluated, varying both the number of nodes and the amount of data generated in parallel. The developed tool is publicly available for download from the following Git repository: https://github.com/rubenperez98/RGen.
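As a rough illustration of the Kronecker-based graph generation mentioned in the abstract, the sketch below samples edges of a stochastic Kronecker graph by descending k levels of a 2x2 initiator matrix, choosing one quadrant per level. It is not taken from RGen: the class name `KroneckerSketch`, the method names, the initiator values and the seed handling are all assumptions made for this example, and the Hadoop integration is left out entirely.

```java
import java.util.Random;

/** Illustrative sketch only; names and parameters are hypothetical, not RGen's API. */
public class KroneckerSketch {

    /**
     * Samples one directed edge of a graph with 2^k nodes.
     * At each of the k levels a quadrant of the 2x2 initiator is chosen with
     * probability proportional to its weight; the chosen row and column bits
     * are appended to the source and destination node identifiers.
     */
    public static long[] sampleEdge(double[][] initiator, int k, Random rng) {
        double p00 = initiator[0][0];
        double p01 = initiator[0][1];
        double p10 = initiator[1][0];
        double total = p00 + p01 + p10 + initiator[1][1];
        long src = 0;
        long dst = 0;
        for (int level = 0; level < k; level++) {
            double r = rng.nextDouble() * total;
            int row;
            int col;
            if (r < p00) {
                row = 0; col = 0;
            } else if (r < p00 + p01) {
                row = 0; col = 1;
            } else if (r < p00 + p01 + p10) {
                row = 1; col = 0;
            } else {
                row = 1; col = 1;
            }
            src = (src << 1) | row;
            dst = (dst << 1) | col;
        }
        return new long[] {src, dst};
    }

    /** Generates numEdges edges (duplicates are possible) from a fixed seed. */
    public static long[][] generate(double[][] initiator, int k, int numEdges, long seed) {
        Random rng = new Random(seed);
        long[][] edges = new long[numEdges][];
        for (int i = 0; i < numEdges; i++) {
            edges[i] = sampleEdge(initiator, k, rng);
        }
        return edges;
    }

    public static void main(String[] args) {
        // Assumed initiator weights, chosen only for illustration.
        double[][] initiator = {{0.57, 0.19}, {0.19, 0.05}};
        for (long[] edge : generate(initiator, 4, 10, 42L)) {
            System.out.println(edge[0] + "\t" + edge[1]);
        }
    }
}
```

Because every edge is sampled independently, a parallel generator could in principle give each worker its own seed and a share of the edge count; whether RGen partitions the work this way is not stated in the abstract.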

Bachelor's thesis (UDC.FIC). Computer Engineering. Academic year 2019/2020

Country

Spain

Related Organizations

University of A Coruña
Spain

Keywords

Data generator, Big Data, HDFS, Benchmarking, Apache Hadoop, MapReduce, Java
