A distributed data processing scheme based on Hadoop for synchrotron radiation experiments

Publication: Article, Other literature type

Published: 24 Apr 2024

Publisher: International Union of Crystallography (IUCr)

Journal: Journal of Synchrotron Radiation, volume 31, pages 635-645 (eissn: 1600-5775)

Authors: Ding Zhang; Ze-Yi Dai; Xue-Ping Sun; Xue-Ting Wu; Hui Li; Lin Tang; Jian-Hua He

doi: 10.1107/s1600577524002637

pmid: 38656774

pmc: PMC11075715

Abstract

With the development of synchrotron radiation sources and high-frame-rate detectors, the amount of experimental data collected at synchrotron radiation beamlines has increased exponentially. As a result, data processing for synchrotron radiation experiments has entered the era of big data. It is becoming increasingly important for beamlines to be able to process large-scale data in parallel to keep up with this rapid growth. Currently, there is no set of data processing solutions for beamlines based on a big data technology framework. Apache Hadoop is a widely used distributed system architecture for massive data storage and computation. This paper presents a distributed data processing scheme for beamline experimental data based on Hadoop. The Hadoop Distributed File System (HDFS) is used as the distributed file storage system, and Hadoop YARN serves as the resource scheduler for the distributed computing cluster. A distributed data processing pipeline that can carry out massively parallel computation is designed and developed using Apache Spark. The entire data processing platform adopts a distributed microservice architecture, which makes the system easy to expand, reduces module coupling and improves reliability.
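
The partition, process and merge pattern that such a pipeline relies on can be illustrated with a minimal sketch. This is not the authors' implementation and it does not use Spark, HDFS or YARN: it uses only the Python standard library, with local threads standing in for cluster executors. The function names (`split_into_partitions`, `process_partition`, `merge`, `mean_intensity`) and the choice of mean pixel intensity as the computed quantity are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(frames, num_partitions):
    """Split a list of frames into contiguous, roughly equal partitions."""
    size, extra = divmod(len(frames), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional frame.
        end = start + size + (1 if i < extra else 0)
        partitions.append(frames[start:end])
        start = end
    return partitions


def process_partition(partition):
    """Map step: summarise one partition as (intensity sum, pixel count)."""
    total = sum(sum(frame) for frame in partition)
    count = sum(len(frame) for frame in partition)
    return total, count


def merge(a, b):
    """Reduce step: combine two partial (sum, count) results."""
    return a[0] + b[0], a[1] + b[1]


def mean_intensity(frames, num_partitions=4):
    """Compute the mean pixel intensity over all frames, partition by partition."""
    partitions = split_into_partitions(frames, num_partitions)
    # Local threads are a stand-in (assumption) for distributed workers.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(process_partition, partitions))
    total, count = reduce(merge, partials, (0, 0))
    return total / count if count else 0.0


if __name__ == "__main__":
    # Three tiny hypothetical "frames", each a flat list of pixel values.
    frames = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(mean_intensity(frames, num_partitions=2))
```

Because each partition is summarised independently and the partial results are combined with an associative merge, the same structure carries over to a cluster, where a scheduler assigns partitions to worker nodes instead of threads.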

Related Organizations

Wuhan University
China (People's Republic of)
Paul Scherrer Institute
Switzerland

Keywords

distributed data processing, Crystallography, big data, QD901-999, Nuclear and particle physics. Atomic energy. Radioactivity, microservice architecture, Apache Hadoop, QC770-798, Computer Programs

Impact by BIP!

selected citations: 0 (citations derived from selected sources; an alternative to the "influence" indicator)
popularity: Average (the "current" impact or attention of the article in the research community, based on the underlying citation network)
influence: Average (the overall impact of the article in the research community, based on the underlying citation network, diachronically)
impulse: Average (the initial momentum of the article directly after its publication, based on the underlying citation network)