
ProteinTensor is a Python library and file format (.ptt) that eliminates redundant preprocessing in structural biology machine learning pipelines. It converts mmCIF/PDB structures, or raw protein sequences, once into a Zarr-backed, LZ4-compressed, memory-mappable store that provides zero-parse access to atomic coordinates, backbone geometry, covalent bond graphs, MSA tokens, pairwise distance features, and protein language model embeddings. Sequence-only entries serve as direct input to AlphaFold- and Boltz-style predictors. Round-trip conversion is lossless, and in benchmarks structure loading is 2x to 95x faster than mmCIF parsing for proteins ranging from 74 to 3,525 residues.
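
The "convert once, then read without parsing" idea can be illustrated with a minimal sketch that uses only the Python standard library. It does not use Zarr, LZ4, or the ProteinTensor API: the flat binary layout, the function names (`parse_coords`, `write_store`, `read_atom`), and the two ATOM records are hypothetical stand-ins chosen for illustration.

```python
import mmap
import os
import struct
import tempfile

# Two illustrative ATOM records in fixed-column PDB format (made-up values).
PDB_TEXT = (
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N\n"
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C\n"
)


def parse_coords(pdb_text):
    """Slow path: parse x, y, z from the fixed-width columns of ATOM records."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # PDB columns 31-38, 39-46, and 47-54 hold x, y, and z.
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


def write_store(coords, path):
    """Convert once: a uint32 atom count followed by little-endian float32 triples."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(coords)))
        for xyz in coords:
            f.write(struct.pack("<3f", *xyz))


def read_atom(path, index):
    """Fast path: memory-map the file and decode one atom with no text parsing."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            (n_atoms,) = struct.unpack_from("<I", mm, 0)
            if not 0 <= index < n_atoms:
                raise IndexError(index)
            # 4-byte header, then 12 bytes (3 x float32) per atom.
            return struct.unpack_from("<3f", mm, 4 + 12 * index)


# Pay the parsing cost once, then read any atom directly by offset.
path = os.path.join(tempfile.mkdtemp(), "demo_coords.bin")
write_store(parse_coords(PDB_TEXT), path)
ca = read_atom(path, 1)
```

Because coordinates are stored as float32, values read back match the parsed text only to single precision. A chunked, compressed array store adds a decompression step on read, but it follows the same principle of replacing repeated text parsing with direct binary access.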
