Pretrained Transformer Encoder for SMILES Strings

Pretrained Transformer Encoder Parameters

This component provides the pretrained parameters for a transformer encoder designed to extract feature representations from SMILES strings. The model was trained using masked token prediction to capture intricate patterns and long-range dependencies within molecular sequences.

The transformer architecture includes:
- 10 sequential transformer blocks
- multi-head self-attention for contextualized token embeddings
- position-wise feed-forward layers with Gaussian Error Linear Unit (GELU) activation
- residual connections and layer normalization

The global molecular representation is derived from the start token embedding, which aggregates sequence-wide information during self-attention computations.

Pretraining Dataset

This component provides the dataset used to pretrain the transformer encoder. It integrates SMILES strings from the following sources:
- ChEMBL 33: ~2.4 million bioactive molecules with drug-like properties
- GuacaMol v1: ~1.6 million molecules derived from ChEMBL 24
- MOSES: ~1.8 million molecules selected from ZINC 15 for diversity and medicinal chemistry suitability
- BindingDB: ~1.2 million unique small molecules bound to proteins
- PDBbind v2020: ~15,710 unique small molecules bound to proteins

This model has been optimized for drug discovery applications, including protein-ligand binding affinity prediction, and can serve as a foundational tool for researchers working on cheminformatics, computational biology, and medicinal chemistry.
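To make the masked token prediction objective and the role of the start token more concrete, the following is a minimal sketch of how SMILES strings could be prepared for this kind of pretraining. It is not the authors' pipeline: the record does not state the tokenizer, the special-token names, or the masking rate, so the regular expression, the names `<start>` and `<mask>`, and the 15% default below are all assumptions chosen for illustration.

```python
import random
import re

# Assumption: a regex-based SMILES tokenizer. The record does not say which
# tokenizer the pretrained encoder uses; this pattern is illustrative only.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%[0-9]{2}|[BCNOSPFIbcnosp]|[0-9]|[()=#\-+\\/:~@?>*$.])"
)

# Hypothetical special-token names (not taken from the record).
START, MASK = "<start>", "<mask>"


def tokenize(smiles):
    """Split a SMILES string into tokens and prepend the start token."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so verify full coverage.
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return [START] + tokens


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace a random subset of tokens with MASK.

    Returns (inputs, labels). labels holds the original token at masked
    positions and None elsewhere. The start token at index 0 is never
    masked, since its embedding is what the encoder would use as the
    global molecular representation.
    mask_prob=0.15 is an assumed, BERT-style default.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for i, tok in enumerate(tokens):
        if i > 0 and rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels
```

For example, `mask_tokens(tokenize("CC(=O)Oc1ccccc1C(=O)O"), rng=random.Random(0))` yields a masked copy of the aspirin token sequence together with the labels a model would be trained to recover. In an actual encoder, the hidden state at position 0 (the start token) would then be read out as the molecule-level feature vector.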

Related Organizations

Yale University
United States
