Project 1: Band Gap Prediction
Introduction


A code that can predict the band gaps of a large number of materials is desirable for novel materials design. Currently, first-principles calculations based on density functional theory (DFT) are the dominant method for band gap prediction. However, DFT-LDA and DFT-GGA are well known to underestimate band gaps, while more advanced methods such as hybrid functionals and GW can give decent results but are too expensive to apply to a large number of materials.


Here a new band gap prediction code, based on machine learning techniques, is presented. Using this code, we achieved a metal/nonmetal classification accuracy of ~90% and a root-mean-square error (RMSE) of ~0.7 eV for band gap prediction. This is an encouraging result, especially considering that the error of standard DFT calculations is more than 1.5 eV. In the following, I will first describe how the code works and then show the detailed results.

 

You may find the source code on my GitHub repository: https://github.com/tianshi-wang/Band_Gap_Machine_Learning

Workflow

The code comprises two modules. The first, /grep_features_python, contains a Python script that extracts data from several databases and generates the training set, cross-validation set, and test set. To run it, use Python >= 3.5 with NumPy and Pymatgen installed. The generated sets are then used to train support vector machine (SVM) models with the second module (written in MATLAB) in /SVM_MATLAB. In the MATLAB module, the on-screen messages and code comments will guide you.
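As a rough sketch of what the data-extraction module produces, the snippet below builds a random train / cross-validation / test split over a feature matrix of the sizes quoted in this article (~4100 materials, 44 features). The 60/20/20 split ratio and the synthetic placeholder data are assumptions for illustration, not the repository's actual settings.

```python
# Hypothetical sketch of the train / CV / test split produced by the
# data-extraction module. The 60/20/20 ratio is an assumption; the
# feature values here are random placeholders, not real material data.
import numpy as np

rng = np.random.default_rng(0)
n_materials, n_features = 4100, 44            # sizes quoted in the text
X = rng.normal(size=(n_materials, n_features))
y = rng.uniform(0.0, 6.0, size=n_materials)   # placeholder band gaps (eV)

idx = rng.permutation(n_materials)            # shuffle before splitting
n_train = int(0.6 * n_materials)
n_cv = int(0.2 * n_materials)
train_idx = idx[:n_train]
cv_idx = idx[n_train:n_train + n_cv]
test_idx = idx[n_train + n_cv:]

X_train, y_train = X[train_idx], y[train_idx]
X_cv, y_cv = X[cv_idx], y[cv_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

Shuffling before splitting keeps the three sets statistically similar, which matters later when the cross-validation set is used to pick hyperparameters.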


The workflow below shows how the developed code classifies metals/nonmetals and predicts band gaps. The data ingestion module collects 44 features for each of the ~4100 materials from databases such as the Materials Project. The collected data are then used to train an SVM classifier and an SVM regression model. Finally, we test both models on the test set.
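The two-stage flow (classify metal vs. nonmetal, then regress a gap only for predicted nonmetals) can be sketched as below. The real models here are SVMs trained in MATLAB; this numpy-only stand-in uses least-squares fits purely to illustrate the pipeline shape, and all data is synthetic.

```python
# Toy sketch of the two-stage workflow: classify metal vs. nonmetal first,
# then regress a band gap only for predicted nonmetals. Least-squares fits
# stand in for the SVM models; all data below is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 44
X = rng.normal(size=(n, d))
w_cls = rng.normal(size=d)
is_nonmetal = X @ w_cls > 0                       # toy ground-truth labels
gap = np.where(is_nonmetal, np.abs(X @ rng.normal(size=d)) * 0.3, 0.0)

# Stage 1: linear classifier (stand-in for the SVM classifier).
y_cls = np.where(is_nonmetal, 1.0, -1.0)
w1, *_ = np.linalg.lstsq(X, y_cls, rcond=None)
pred_nonmetal = X @ w1 > 0

# Stage 2: regression on predicted nonmetals only (stand-in for SVM
# regression); predicted metals keep a gap of exactly 0 eV.
Xn, gn = X[pred_nonmetal], gap[pred_nonmetal]
w2, *_ = np.linalg.lstsq(Xn, gn, rcond=None)
pred_gap = np.zeros(n)
pred_gap[pred_nonmetal] = Xn @ w2

accuracy = float((pred_nonmetal == is_nonmetal).mean())
rmse = float(np.sqrt(np.mean((pred_gap - gap) ** 2)))
```

Running the regression only on predicted nonmetals mirrors the workflow diagram: metals get a gap of 0 eV by construction, so the regression model never has to learn the metal/nonmetal boundary itself.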

Workflow of the developed code for metal/nonmetal classification and band gap prediction

Metal/Nonmetal Classification Results

Before showing the test result, the table on the right shows how the prediction errors for the training and cross-validation (CV) sets change with the box constraint C. An error value of 0.1 corresponds to a classification accuracy of 90%. The box constraint C is a way to prevent over-fitting the data; however, if it is too small, the model may under-fit the data. From the data, C = 1.28 may be a good choice, since the model trained with it gives a decent result on the CV set.
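The C sweep described above can be sketched as follows. The repository's SVM training is done in MATLAB; here a minimal numpy subgradient-descent linear SVM stands in for it, and the list of C values is illustrative (1.28 is taken from the text, the others are assumptions).

```python
# Sketch of choosing the box constraint C from cross-validation: train one
# model per C value and keep the C with the lowest CV error. The linear SVM
# below (subgradient descent on the hinge loss) is a toy stand-in for the
# MATLAB SVM; the data and most C values are assumptions.
import numpy as np

rng = np.random.default_rng(2)

def train_linear_svm(X, y, C, epochs=200, lr=0.01):
    """Soft-margin linear SVM: minimize 0.5*||w||^2 + (C/n)*sum(hinge)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        mask = y * (X @ w + b) < 1            # points violating the margin
        w -= lr * (w - C * (y[mask, None] * X[mask]).sum(axis=0) / n)
        b -= lr * (-C * y[mask].sum() / n)
    return w, b

def error_rate(w, b, X, y):
    return float((np.sign(X @ w + b) != y).mean())

# Synthetic stand-in for the 44-feature data set, with slightly noisy labels.
n, d = 600, 44
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.where(X @ w_true + 0.3 * rng.normal(size=n) > 0, 1.0, -1.0)
X_train, y_train = X[:400], y[:400]
X_cv, y_cv = X[400:], y[400:]

results = {}
for C in [0.01, 0.08, 0.64, 1.28, 5.12]:
    w, b = train_linear_svm(X_train, y_train, C)
    results[C] = error_rate(w, b, X_cv, y_cv)

best_C = min(results, key=results.get)        # lowest CV error wins
```

The training error alone would keep dropping as C grows, which is exactly why the CV error, not the training error, is used to pick C.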

Next we show the results on the test set. None of the compounds in the test set were used in model training, in order to obtain an objective result. The pie chart shows that the trained SVM classifier correctly separates ~90% of metals and nonmetals. In the figure on the right, we further provide the prediction confidence for each compound. It is clear that when the model is confident that a compound is a metal or a nonmetal (prediction confidence near 100%), the prediction is probably correct. When the model is less sure (prediction confidence below ~75%), the prediction is quite likely to be wrong.
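One common way to obtain such a per-compound confidence, sketched below, is to squash the classifier's raw decision value through a sigmoid (a Platt-style calibration). The decision values and the unit sigmoid scale here are assumptions for illustration, not fitted from the article's model.

```python
# Platt-style confidence sketch: points far from the decision boundary get
# confidence near 100%, points near it get confidence near 50%. The margins
# and the sigmoid scale below are illustrative assumptions.
import numpy as np

decision_values = np.array([-3.1, -0.4, 0.05, 1.2, 4.0])   # toy SVM margins

prob_nonmetal = 1.0 / (1.0 + np.exp(-decision_values))     # sigmoid squash
# Confidence in whichever class was actually predicted (always >= 50%).
confidence = np.maximum(prob_nonmetal, 1.0 - prob_nonmetal)
labels = np.where(decision_values > 0, "nonmetal", "metal")
```

This matches the behavior described above: predictions with confidence well above ~75% are usually right, while near-boundary predictions are the ones most likely to flip.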

Metal/nonmetal classification results (90% accuracy)

Choosing the box constraint C from cross-validation

Prediction confidence for each prediction result

TrueMetal: prediction = metal and material = metal
TrueNonmetal: prediction = nonmetal and material = nonmetal
FalseMetal: prediction = nonmetal and material = metal
FalseNonmetal: prediction = metal and material = nonmetal

Band Gap Prediction Results

As mentioned, the root-mean-square error (RMSE) of the developed SVM regression model is ~0.7 eV for band gap prediction, an encouraging result considering that the error of standard DFT calculations is more than 1.5 eV. In the following diagram, the predicted band gap is compared to the experimental band gap. Ideally, all the dots would fall on the line; the deviation from the line represents the error of our predictions. It would be interesting to know which features, such as element group number and compound symmetry, dominate a material's band gap. We are still working on it.
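For reference, the RMSE quoted above is computed as the square root of the mean squared difference between predicted and experimental gaps. The values below are illustrative placeholders, not the article's actual predictions.

```python
# RMSE between predicted and experimental band gaps. In a parity plot like
# the one described, each (experimental, predicted) pair is a dot and the
# diagonal is the ideal prediction. All values here are illustrative.
import numpy as np

experimental = np.array([1.12, 3.40, 0.67, 5.47, 2.26])  # eV, placeholders
predicted = np.array([1.30, 2.90, 0.90, 4.80, 2.60])     # eV, placeholders

rmse = float(np.sqrt(np.mean((predicted - experimental) ** 2)))
```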

Download
Project 2: Learn from the past: find undervalued stocks and earn


This algorithm assumes that stocks which performed similarly in the past will likely continue to do so in the near future. Therefore, a stock that performed worse than the others in its cluster is regarded as a buy signal.

S&P 500 stocks are clustered by their daily performance from 2014-01 to 2018-01 using the k-means method. Based on their performance during 2018-02 to 2018-05, ~40 underperforming (to-buy) and outperforming (to-sell) stocks are selected. The selected to-buy stocks returned 2.2% in 2018-06 (or 26% annualized), compared to 0.5% for the market and -0.6% for the to-sell stocks.
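The selection procedure can be sketched as below: cluster the return histories with k-means, then flag stocks trailing their own cluster's recent mean as to-buy. Everything here is synthetic; real tickers, S&P 500 membership, and the actual date windows are omitted, and the minimal k-means loop stands in for whatever library implementation the project used.

```python
# Toy sketch of the stock-selection idea: cluster stocks by daily-return
# history with k-means, then flag stocks that trail their cluster's mean
# recent return as "to-buy". All data below is synthetic.
import numpy as np

rng = np.random.default_rng(42)
n_stocks, n_days, k = 60, 250, 4

# Synthetic daily returns: each stock follows one of k hidden group trends.
group = rng.integers(0, k, size=n_stocks)
trend = rng.normal(size=(k, n_days)) * 0.01
returns = trend[group] + rng.normal(size=(n_stocks, n_days)) * 0.005

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: assign to nearest center, recompute means."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(returns[:, :200], k)          # cluster on the "past" window

# In the recent window, buy stocks that underperform their cluster's mean.
recent = returns[:, 200:].sum(axis=1)         # cumulative recent return
to_buy = []
for j in range(k):
    members = np.flatnonzero(labels == j)
    lagging = members[recent[members] < recent[members].mean()]
    to_buy.extend(lagging.tolist())
```

Comparing each stock only against its own cluster is the key design choice: it filters out sector-wide moves, so a lagging stock is flagged only relative to peers that historically moved with it.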

Introduction

Price chart of ten randomly selected stocks in S&P500

Diagram of stock-selection procedure using this algorithm

Download