Materials Informatics

Materials informatics, or data-enabled investigation, is often described as a "fourth paradigm" of materials science research, following the conventional empirical approach, theoretical science, and computational research. It has two essential ingredients: fingerprinting of materials properties, and the theory of statistical inference and learning. We have investigated open questions about organic semiconductors through the materials informatics approach. By applying diverse neural network topologies, logical axioms, and inference methods from information science, we have developed data-driven procedures for discovering novel organic semiconductors for the semiconductor industry and for extracting knowledge for the materials science community. We have reviewed and compared various algorithms for designing neural network topologies for materials informatics datasets. In this research notebook, we have used four chemical compound space databases for model training and validation. The first is QM9, a general quantum chemistry dataset of computed geometric, energetic, electronic, and thermodynamic properties for about 134,000 stable small organic molecules made up of C, H, O, N, and F, intended for the design of new drugs and materials. The second covers molecular organic light-emitting diode (OLED) materials for high-throughput virtual screening and efficient design. The third concerns sustainable energy storage: quantum chemistry data on redox flow battery compounds for accelerated design and discovery. The fourth is a statistical study of 51,000 organic photovoltaic solar cell molecules designed with non-fullerene acceptors. We have used a variety of regression analysis techniques.
We have trained models on the materials informatics datasets with the following regressors: linear, kernel ridge, Keras, Gaussian process, random forest, multi-layer perceptron, bagging, extreme gradient boosting, extreme gradient boosting multi-layer perceptron, extreme gradient boosting Keras, and extreme gradient boosting kernel ridge. For dimensionality reduction, projection, and classification tasks on these datasets, we have employed principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Further, we have trained models based on convolutional neural networks (CNN), recurrent neural networks (RNN), radial basis function networks, variational autoencoders (VAE), graph neural networks (GNN), message-passing neural networks (MPNN), directed message-passing neural networks, a materials graph network-based variational autoencoder, attention networks (AN), geometric learning networks, active learning networks, Bayesian optimization networks, evolutionary algorithm-based neural networks, and genetic algorithm networks, together with multi-fidelity batch reification, model correlation estimation, fusion optimization, and optimal design of experiments. We have investigated a deep learning design that predicts desirable material properties with quantitative accuracy by learning the relationship between molecular structure and property through a materials graph-based neural network. We have used various encoding and descriptor algorithms in this work.
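As a rough illustration of how a message-passing network relates molecular structure to a property, the sketch below runs one round of message passing on a small molecular graph and pools the result into a molecule-level vector. It is not the model used in this work: the sum aggregation has no learned weights, and the feature layout, the function names `message_passing_step` and `readout`, and the ethanol example are all assumptions made for illustration.

```python
# Minimal message-passing sketch on a molecular graph (illustrative only).
# Assumptions: toy one-hot atom features, untrained sum aggregation, sum readout.


def message_passing_step(features, neighbors):
    """One round: each atom adds the sum of its neighbors' vectors to its own."""
    updated = []
    for atom, own in enumerate(features):
        message = [0.0] * len(own)
        for nbr in neighbors[atom]:
            for k, value in enumerate(features[nbr]):
                message[k] += value
        updated.append([o + m for o, m in zip(own, message)])
    return updated


def readout(features):
    """Sum the atom vectors into one fixed-length molecule-level vector."""
    return [sum(column) for column in zip(*features)]


# Heavy-atom graph of ethanol (SMILES "CCO"): C0 - C1 - O2.
# Hypothetical feature layout: [is_carbon, is_oxygen].
atom_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}

hidden = message_passing_step(atom_features, neighbors)
molecule_vector = readout(hidden)
print(hidden)
print(molecule_vector)
```

In a trained network, the fixed sum would be replaced by learned message and update functions, and the pooled vector would feed a regression head that predicts the target property.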
In the one-hot encoding scheme, we convert simplified molecular-input line-entry system (SMILES) and self-referencing embedded strings (SELFIES) strings to 2-D pixel images so that convolutional neural networks (CNN), recurrent neural networks (RNN), and variational autoencoders (VAE) can take advantage of image-based learning. For organic semiconductor molecular design, the VAE combines a CNN as the encoder with an RNN as the decoder. We have also used RDKit 2-D and 3-D descriptors, the 166-bit MACCS (Molecular ACCess System) keys for cheminformatics molecular similarity, Morgan extended-connectivity fingerprints (ECFP6), extended reduced graph approach pharmacophore-type 2-D node descriptions, and the breaking of retrosynthetically interesting chemical substructures (BRICS) algorithm to describe molecules to the network. For data preparation before training the various network topologies, we have used the scikit-learn min-max and standard-scaler preprocessors. We have used several data-splitting techniques to train, validate, and test the models, chiefly no-split, no-select, repeated 5-fold cross-validation, leave-one-group-out, and leave-out-percentage. For feature extraction and feature engineering tasks, we have used a learning curve, an ensemble model feature selector, the scikit-learn feature selector, and standard scaler algorithms.
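The one-hot step described above can be sketched as follows: each character of a SMILES string becomes one row of a binary matrix, which can then be treated as a 2-D image. This is a minimal sketch under stated assumptions, not the encoder used in this work. It tokenizes character by character (so multi-character atoms such as "Cl" or "Br" would be split, which a production tokenizer would avoid), pads with a space, and the names `build_vocabulary` and `one_hot_encode` are hypothetical.

```python
# Character-level one-hot encoding of SMILES strings (illustrative sketch).
# Assumption: a space is used as the padding symbol; it does not occur in SMILES.
PAD = " "


def build_vocabulary(smiles_list):
    """Collect every character seen in the strings, with padding first."""
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    return [PAD] + chars


def one_hot_encode(smiles, vocabulary, max_length):
    """Return a max_length x len(vocabulary) binary matrix for one string."""
    if len(smiles) > max_length:
        raise ValueError("SMILES string is longer than max_length")
    index = {ch: i for i, ch in enumerate(vocabulary)}
    padded = smiles.ljust(max_length, PAD)
    matrix = []
    for ch in padded:
        row = [0] * len(vocabulary)
        row[index[ch]] = 1  # exactly one "hot" pixel per position
        matrix.append(row)
    return matrix


molecules = ["CCO", "C=O"]  # ethanol and formaldehyde
vocabulary = build_vocabulary(molecules)
image = one_hot_encode("CCO", vocabulary, max_length=4)

for row in image:
    print(row)
```

Each row holds a single 1, so a batch of such matrices has a fixed shape and can be passed to a convolutional encoder like any single-channel image. The same scheme applies to SELFIES if the string is first split into its bracketed symbols rather than into characters.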
