
ProteinTensor is a Python library and file format (.ptt) that eliminates redundant preprocessing in structural biology machine learning pipelines. It converts each mmCIF/PDB structure once into a Zarr-backed, LZ4-compressed, memory-mappable store, which then provides zero-parse access to atomic coordinates, backbone geometry, covalent bond graphs, MSA tokens, pairwise distance features, and protein language model embeddings. In benchmarks on proteins ranging from 76 to 3,525 residues, full feature assembly is on average 34x faster than with traditional mmCIF-based pipelines.
