
Abstract

Background: Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to the lack of accessible tools that scale to hundreds of millions of data points.

Results: We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake to automatically schedule jobs on a high-performance computing cluster. We demonstrate the utility of the software by analysing image-based RNA interference data of 150 million single cells.

Conclusion: PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used entirely through simple command line calls, making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.
Keywords: Big data; Data analysis; Command line; Pipeline; Computing cluster; Grid engine; Machine learning
Subjects: Biology (General) (QH301-705.5); Computer applications to medicine. Medical informatics (R858-859.7); Computational Biology; Computing Methodologies; Automation; Image Processing, Computer-Assisted; Humans; Software; Algorithms; HeLa Cells
