Powered by OpenAIRE graph

Software . 2024

License: http://www.apache.org/licenses/LICENSE-2.0

Data sources: ZENODO

DataTrove: large scale data processing

Research software. Software, 28 Aug 2024. Publisher: Zenodo

Authors: Penedo, Guilherme; Kydlíček, Hynek; Cappelli, Alessandro; Wolf, Thomas; Sasko, Mario;

Code repository: https://github.com/huggingface/datatrove/tree/v0.4.0

Abstract

DataTrove is a library to process, filter, and deduplicate text data at very large scale. It provides a set of prebuilt, commonly used processing blocks, along with a framework for easily adding custom functionality. DataTrove processing pipelines are platform-agnostic and run out of the box either locally or on a Slurm cluster. Its (relatively) low memory usage and multi-step design make it well suited to large workloads, such as processing an LLM's training data.

If you use this software, please cite it using the metadata from this file.
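To make the idea of chaining processing blocks concrete, the following is a minimal, self-contained sketch of a filter step and a deduplication step applied to a stream of documents. It does not use DataTrove's API: the names `Document`, `min_length_filter`, `exact_dedup`, and `run_pipeline` are hypothetical and chosen only for illustration, and the exact-match hashing shown here is a simple stand-in, not necessarily the deduplication method the library implements.

```python
import hashlib
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical record shape for this sketch: {"id": str, "text": str}
Document = Dict[str, str]
Step = Callable[[Iterable[Document]], Iterator[Document]]


def min_length_filter(min_chars: int) -> Step:
    """Build a step that drops documents shorter than min_chars characters."""

    def step(docs: Iterable[Document]) -> Iterator[Document]:
        for doc in docs:
            if len(doc["text"]) >= min_chars:
                yield doc

    return step


def exact_dedup(docs: Iterable[Document]) -> Iterator[Document]:
    """Keep only the first document for each normalized text.

    Normalization here is lowercasing plus whitespace collapsing,
    an assumption made for this example.
    """
    seen = set()
    for doc in docs:
        normalized = " ".join(doc["text"].lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc


def run_pipeline(docs: Iterable[Document], steps: List[Step]) -> List[Document]:
    """Chain the steps lazily so documents stream through one at a time."""
    stream: Iterable[Document] = iter(docs)
    for step in steps:
        stream = step(stream)
    return list(stream)


docs = [
    {"id": "a", "text": "Hello   world, this is a document."},
    {"id": "b", "text": "hello world, this is a document."},
    {"id": "c", "text": "too short"},
    {"id": "d", "text": "Another sufficiently long document."},
]

kept = run_pipeline(docs, [min_length_filter(20), exact_dedup])
print([doc["id"] for doc in kept])
```

Because each step is a generator, only one document is held in flight per step, which loosely mirrors the low-memory, multi-step design the abstract describes. A real deployment would use the library's own readers, filters, and executors instead of these stand-ins.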

Keywords

scale, data, pytorch, transformers, llms, deep-learning

Impact by BIP!

- selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
