MATERIALS INFORMATICS
STUDENT NAME: PRAVEEN M
STUDENT ROLL NO: CB.SC.P2PHY20013
PROJECT GUIDE: Dr. M. DHARANI
INTRODUCTION
Materials informatics is a field of study that applies the principles of informatics to materials science and engineering to improve the understanding, use, selection, development, and discovery of materials. It is an emerging field whose goal is the high-speed and robust acquisition, management, analysis, and dissemination of diverse materials data, greatly reducing the time and risk required to develop, produce, and deploy a new material, which generally takes longer than 20 years.
OBJECTIVE
To extract data from materials database repositories, perform data analysis, and build a machine learning model that predicts the bulk modulus of a given material using a Composition-Based Feature Vector (CBFV).
METHODOLOGY
Extracting data: The data is accessed through the matminer interface, from "A complete copy of the Materials Project database", which consists of 83,989 compositions.
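A minimal sketch of this step, assuming matminer's load_dataset interface; the dataset key "mp_all_20181018" is an assumption and should be replaced with the repository entry actually used.

# Hypothetical retrieval of the Materials Project snapshot through matminer.
from matminer.datasets import get_available_datasets, load_dataset

print(get_available_datasets())           # inspect the available repository entries
df_raw = load_dataset("mp_all_20181018")  # assumed key for "A complete copy of the Materials Project database"
print(df_raw.shape)                       # expect on the order of ~84k compositions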
Loading and examining the data: Using Pandas, we read the dataset into a DataFrame, examine a few rows, and look at the data's basic statistics.
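A small sketch of the loading and inspection step, assuming the raw data has been saved to a CSV file (the filename is illustrative).

import pandas as pd

# Read the raw dataset into a DataFrame (filename is illustrative).
df = pd.read_csv("mp_bulk_modulus_raw.csv")

print(df.head())      # look at the first few rows
print(df.describe())  # basic statistics of the numeric columns
print(df.dtypes)      # column data types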
Data Cleaning: We use the built-in Pandas methods to check for NaN (missing) values in the dataset. After cleaning and processing the data, we save it to disk in a cleaned state for later use; Pandas allows us to save the data as a comma-separated value (.csv) file.
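A sketch of the cleaning step using built-in Pandas methods; the column names "formula" and "K_VRH" are assumptions.

# Count missing (NaN) values per column.
print(df.isna().sum())

# Keep only the columns needed for modeling and drop rows with missing targets.
# "formula" and "K_VRH" (bulk modulus) are assumed column names.
df_clean = df[["formula", "K_VRH"]].dropna()

# Optionally drop non-physical entries, e.g. negative bulk moduli.
df_clean = df_clean[df_clean["K_VRH"] > 0]

# Save the cleaned data as a comma-separated value file for later use.
df_clean.to_csv("bulk_modulus_cleaned.csv", index=False)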
Splitting data into train/validation/test datasets: It is important to split the full dataset into train/validation/test datasets and to reliably use the same datasets for the modeling tasks later. By saving these splits to files, we can reproducibly use the exact same splits during subsequent model training and comparison steps. Using the same datasets for all models ensures a fair comparison.
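A sketch of the splitting step with scikit-learn's train_test_split; the 70/20/10 proportions and the random seed are illustrative choices, not the exact splits used.

from sklearn.model_selection import train_test_split

# First carve out a held-out test set, then split the remainder into train/validation.
df_trainval, df_test = train_test_split(df_clean, test_size=0.10, random_state=42)
df_train, df_val = train_test_split(df_trainval, test_size=0.22, random_state=42)

# Save the exact splits so every model later sees identical data.
df_train.to_csv("train.csv", index=False)
df_val.to_csv("val.csv", index=False)
df_test.to_csv("test.csv", index=False)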
Data Featurization: We featurize the composition data using so-called composition-based feature vectors (CBFVs). This method represents a single chemical formula as one vector based on its constituent atoms' chemical properties.
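A sketch of the featurization step, assuming the CBFV package's composition.generate_features helper; the column names and the "oliynyk" element-property set are assumptions.

from CBFV import composition

# generate_features expects columns named "formula" and "target".
df_feat = df_train.rename(columns={"K_VRH": "target"})

# Each formula becomes one fixed-length vector built from its elements' properties.
X_train, y_train, formulae, skipped = composition.generate_features(
    df_feat, elem_prop="oliynyk")
print(X_train.shape)   # (n_samples, n_CBFV_features)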
Modeling using "classical" machine learning models: We implement several classical ML models from sklearn: Ridge regression, support vector machine, linear support vector machine, random forest, extra trees, adaptive boosting, gradient boosting, k-nearest neighbors, and dummy regression. Since our target variable is a continuous value (bulk modulus), this is a regression task.
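A sketch of the classical modeling step with scikit-learn regressors; hyperparameters are left at their defaults for brevity.

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR, LinearSVR
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              AdaBoostRegressor, GradientBoostingRegressor)
from sklearn.neighbors import KNeighborsRegressor

# One instance of each classical regressor; default hyperparameters for brevity.
models = {
    "dummy": DummyRegressor(),
    "ridge": Ridge(),
    "svr": SVR(),
    "linear_svr": LinearSVR(),
    "random_forest": RandomForestRegressor(),
    "extra_trees": ExtraTreesRegressor(),
    "adaboost": AdaBoostRegressor(),
    "gradient_boosting": GradientBoostingRegressor(),
    "knn": KNeighborsRegressor(),
}

# Fit every model on the same featurized training data (regression on bulk modulus).
for name, model in models.items():
    model.fit(X_train, y_train)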
Evaluating model performance on the validation dataset: We use the same validation set to evaluate all models, which ensures a fair comparison, and we plot predicted vs. actual values using the predictions made by each trained model on that validation set. After the model is finalized, it can be re-trained on the combined train + validation data and evaluated once on the held-out test dataset to obtain an unbiased estimate of its performance. Because a single split can be misleading, the evaluation is repeated over several different splits and the average of all the scores is reported, as this gives a much more accurate estimate of how well the model actually performs.
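A sketch of the evaluation step on the shared validation set; X_val and y_val are assumed to have been featurized with the same CBFV scheme as the training data.

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# Score every fitted model on the same validation set for a fair comparison.
# Repeating this over several train/validation splits and averaging the scores
# gives a more reliable estimate of model performance.
for name, model in models.items():
    y_pred = model.predict(X_val)
    r2 = r2_score(y_val, y_pred)
    mae = mean_absolute_error(y_val, y_pred)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    print(f"{name:>18}  r2={r2:.3f}  MAE={mae:.2f}  RMSE={rmse:.2f}")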
Modeling using a neural network: We define a simple dense, fully connected neural network which we call DenseNet. The input layer of DenseNet accepts input with the dimension of each row of the input data, which is equal to the number of features in our CBFV featurization scheme. The output layer dimension of DenseNet is 1, because we want to predict one value (bulk modulus). Finally, we train the network, evaluate it on the validation dataset using val_loader, and plot the predicted vs. actual values.
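A minimal PyTorch sketch of a DenseNet-style fully connected network; the hidden-layer sizes, batch size, learning rate, and loss function are assumptions, not the exact architecture used in this work.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class DenseNet(nn.Module):
    """Fully connected network: CBFV features in, one bulk-modulus value out."""
    def __init__(self, input_dim, hidden_dim=256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),   # output dimension 1: predicted bulk modulus
        )

    def forward(self, x):
        return self.layers(x)

def to_loader(X, y, shuffle):
    # Wrap the featurized split (DataFrame/Series) as float tensors.
    ds = TensorDataset(torch.tensor(X.values, dtype=torch.float32),
                       torch.tensor(y.values, dtype=torch.float32))
    return DataLoader(ds, batch_size=128, shuffle=shuffle)

train_loader = to_loader(X_train, y_train, shuffle=True)
val_loader = to_loader(X_val, y_val, shuffle=False)

# Input dimension equals the number of CBFV features per row.
model = DenseNet(input_dim=X_train.shape[1])
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb).squeeze(-1), yb)
        loss.backward()
        optimizer.step()

# Validation pass: collect predictions for the predicted-vs-actual plot.
model.eval()
with torch.no_grad():
    preds = torch.cat([model(xb).squeeze(-1) for xb, _ in val_loader])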
RESULT AND ANALYSIS
The figures in this section of the poster show: the raw data from the matminer data repository; the data after cleaning; the classical ML models fitted with various parameters; the best fitted model after evaluation; the performance of the models over 10 different splits; and the evaluated DenseNet neural network model.
CONCLUSION
In this study, we were able to predict the bulk modulus of 83,989 compositions just by specifying the chemical formula, with the help of Composition-Based Feature Vectors (CBFVs).
We obtained an appropriate model (Extra Trees regression) and fitted it in just 42 seconds.
We also successfully implemented the DenseNet neural network model on the extracted data and evaluated its performance.
Finally, we have successfully created a linkage between the composition and the property to be predicted.
REFERENCE
Sterling G. Baird, Marianne Liu, Hasan M. Sayeed, and Taylor D. Sparks, "Data-Driven Materials Discovery and Synthesis Using Machine Learning Methods," Comprehensive Inorganic Chemistry III.
Taylor D. Sparks, "Inaugural Congress to Focus on Artificial Intelligence," JOM, 73, 3679-3680 (2021).
Debanshu Banerjee and Taylor D. Sparks, "Comparing Transfer Learning to Feature Optimization in Microstructure Classification," Computational Materials Science, 195, 110452 (2021).