Monte Carlo Simulations of Au38(SCH3)24 Nanocluster Using Distance-Based Machine Learning Methods

We present an implementation of distance-based machine learning (ML) methods to create a realistic atomistic interaction potential to be used in Monte Carlo simulations of thermal dynamics of thiolate (SR) protected gold nanoclusters. The ML potential is trained for Au38(SR)24 by using previously published, density functional theory (DFT) -based, molecular dynamics (MD) simulation data on two experimentally characterised structural isomers of the cluster, and validated against independent DFT MD simulations. This method opens a door to efficient probing of the configuration space for further investigations of thermal-dependent electronic and optical properties of Au38(SR)24. Our ML implementation strategy allows for generalisation and accuracy control of distance-based ML models for complex nanostructures having several chemical elements and interactions of varying strength.


Introduction
Monolayer protected clusters (MPCs) are small metal nanoparticles that have a metal core with size ranging from a few atoms to a few hundred atoms, and a protecting surface layer of organic molecules such as thiols, phosphines, alkynyls, or carbenes. 1MPCs are synthesised via wet chemistry by reducing metal salts in presence of the protecting molecules.A variety of synthesis recipes and combination of metals and protecting molecules yields a rich chemistry and a large array of products in terms of size, shape, and composition of metal cores and the molecular overlayer.The wide range of synthetic parameters gives a unique possibility to study the fundamental structure-stability-property relations, and to engineer the properties for applications such as catalysis, plasmonics, biosensing, and drug delivery.
The first crystallographically resolved MPCs were reported already over 50 years ago (such as the so-called undecagold Au 11 cluster protected by phosphines 2 ), and first advances in synthesis and structural characterization produced a series of mostly noble metal clusters protected by L-type (such as phosphine) and mixed L-X type (X being an electronegative ligand such as halide or thiolate) ligands.The largest such known cluster was the phosphinehalide protected Au 39 , reported in 1992. 3nsiderable steps forward were taken when Brust and coworkers 4 reported a synthesis that produced all-thiolate protected gold clusters for an average size of two nanometers.
Several new chemical compositions of both organo-soluble and water-soluble clusters were reported soon after, [5][6][7][8] culminating to the breakthroughs of the first crystal structure of a large Water-soluble all-thiol protected cluster Au 102 (pMBA) 44 (pMBA = para mercapto benzoic acid) by the Kornberg group in 2007 9 as well as the organo-soluble Au 25 (PET) - 18 [10][11][12] in 2008 and Au 38 (PET) 24 (PET = phenyl ethyl thiolate) 13,14 clusters in 2008-2010.Up to date, atomic structures of at least 150 different compounds are crystallographically known, which facilitates detailed theoretical computations and dynamical simulations of the properties of MPCs and greatly helps to correlate structures to measured properties in experimental data.
Density functional theory (DFT) methods are the cornerstone for all computations that need to deal with details of the electronic structure, such as studies of optical absorption, optical excitation, fluorescence, and magnetism.However, while giving the most accurate and detailed information, DFT methods are also numerically the most demanding.DFT computations of some of the largest structurally known MPCs like the thiolate protected Ag 374 15,16 have to deal with up to 13 000 valence electrons, and even a single-point DFT energy calculation can take minutes and use hundreds or even thousands of CPU cores in a supercomputer.Force fields describing gold-thiolate MPCs have been developed to be used in molecular dynamics (MD) simulations , e.g., in the context of ReaxFF 17 and AMBER-GROMACS. 18[21] However, developing such force fields may be time-consuming, system-or problem-specific and suffer from poor transferability.
Machine learning (ML) and data-driven methods are emerging as a promising alternative to analyse structure-property correlations and make systematic predictions of physicochemical properties in materials science. 22,235][26][27][28] A few homogeneous systems such as bulk water 29,30 or pure metal nanoparticles 31,32 have been studied as well.There has been very few studies of applying ML to MPCs.Recently deep neural networks and support vector machines were applied successfully to predict formation of MPCs in varying synthesis conditions. 33,34stems with diverse chemical environments, such as MPCs, possess a large number of degrees of freedom, a range of chemical interactions of varying strength, and may require large training sets in order to cover the chemical space thoroughly enough.The most popular ML methods include neural networks, kernel ridge regression and Gaussian processes. 35Neural networks have a great potential to learn very complicated data, because of their large number of parameters, weights and network shapes to be adjusted during training.On the other hand, this flexibility also makes the method prone to overfitting.Kernel ridge regression and Gaussian processes are versatile tools, since one can define different kernel functions suiting a problem at hand.These kernels can easily transform the method to a complex one.
Here we demonstrate that even simple distance-based methods are applicable to complex systems such as MPCs.We use two methods, the so-called Minimal Learning Machine (MLM) 36 and the Extreme Minimal Learning Machine (EMLM) 37

Theoretical methods
Here we discuss the necessary components of the development of the ML method to deal with dynamical simulations of thiolate protected gold nanoclusters.We introduce the used descriptor for the cluster structures, the general principles of the distance-based machine learning, and the Monte Carlo method to probe the configuration space.

Many-Body Tensor Representation
The Cartesian coordinates of atomic positions include the whole structural information about a single nanostructure, however one cannot use them to describe the system for a machine learning method.If even a small rotation or translation is applied to the system, the coordinates would change but physically the situation is still the same.In order to overcome this problem one needs to use suitable structural descriptor, which are required to be invariant to translation, rotation and permutation.Cartesian coordinates are not fulfilling any of these requirements.In addition to these requirements it is desirable that description would be continuous, unique in the sense of description-property correlation and fast to be computed. 41There have been several different approaches with a varying level of complexity to describe nanostructures for machine learning methods.Frequently used descriptors in the field are atom-centered symmetry functions, 42 Coulomb matrices, 43 Ewald sum and sine matrices, 44 Bag of Bonds, 45 Zernike functions, 46,47 Smooth Overlap of Atomic Positions (SOAP), 48 to name a few.These descriptions can be divided to local and global ones de-pending on whether they describe the environment around a single atom or the whole system as relationships between atoms.In this study, we used global descriptor called Many-Body Tensor Representation (MBTR), 41 which is implemented in the DScribe package. 49e basic idea of the MBTR is based on Bag of Bonds description.There, the system is first divided into the contributions of different element pairs and then described with pairwise distances between the atoms belonging to the elements of interest.Huo and Rupp used this as a starting point and formalized the basis of MBTR. 41Afterwards Jäger et al.
simplified the theoretical presentation 50 and Himanen et al. implemented it into the DScribe package. 49The backbone of the description is where In equation (1) summations are going through atoms with atomic (element) number of z 1 and z 2 .Function D(x, g) introduces broadening, which can be controlled by changing the parameter .Here x is sweeping variable, which probes the values produced by the function Parameter k is the one defining the properties, that are used to describe the system.In the theory there is no limits for k, therefore in principle one can freely define suitable property.Usually choices are k = 1 for atomic numbers, k = 2 for pairwise atomic distances (or the inverse of the distance) and k = 3 for angles formed by three different atoms.In this study, we chose to set k = 2 in order to use pairwise distances, therefore the weights are w 2 (i, j) = exp( dR ij ) and the property measure is defined as Here d is a parameter, which is used to define the amount of weight for the contributions of atoms i and j if they are R i,j apart from each other.As the name suggest, MBTR is a tensor with dimensions of N elements ⇥ N elements ⇥ n x , when k = 2. N elements is the number of different elements in the system and n x is the number of points that variable x can probe.Every element pair is described with their own summation but all pairs are using the same set of parameters.We list parameters as sets of {min,max,n x , ,d,cut-off}.First there are minimum and maximum values of the variable x. n x is the number of points for x.As mentioned earlier controls the broadening and d is used in weighting.DScribe package has also its own parameter to define cut-off.Only the values of the equation ( 1), which are greater than the cut-off, are used in summation for every value of x.This can affect to the speed of computations and also the sensitivity of the descriptor.

Distance-based machine learning methods
Minimal Learning Machine MLM.Here we briefly introduce the theoretical background of the utilized distance-based machine learning methods.First we go through Minimal Learning Machine (MLM) formalized by de Souza Júnior et al. 36 In general, we assume that , are given with the corresponding output points , y i 2 R p , to be predicted.We restrict here to univariate (nonlinear) regression problems.In supervised machine learning one usually trains a model to map input points to certain output directly or through some kernel space.In that case the mapping f : X !Y between input and output spaces would be used to make regression model as where E denotes residuals.MLM, on the other hand, determines the Euclidean distances between input and reference points and then uses these distances to construct a linear regression model to predict the Euclidean distances in the output space.These predicted distances with respect to the output space reference points form a multilateration problem from which the actual output is computed.
Reference points are defined as M = {m k } K k=1 with M ✓ X and corresponding outputs are naturally In the notation Greek letters are used for output space distances in order to distinguish them from input space notations.Next the mapping g is used to create regression model between distances in input and output spaces as Next, de Souza Júnior et al. assume that the mapping g has a linear structure for each response.The model simplifies into a matrix product 36 In order to get the matrix B containing the coefficients for the K responses some approximations are needed.B is estimated from training data through minimizing the multivariate residual sum of squares.This provides a least squares estimate of the matrix Solving the B corresponds to training of the model.Now the last task is the multilateration problem in the output space.There is no single definite way to approach this problem but many approaches can be applied. 51The idea is to minimize the objective function of single output regression problem where d(x, M) 2 R 1⇥K is a vector containing distances between a new input x and all reference points M .The task is to find suitable output y, which minimizes the objective function.In our case we adopted cubic equation introduced by Mesquita et al. 52 The minimum or minima are found where the derivative equals zero.Differentiation yields This can be thought as a cubic equation ay 3 + by 2 + cy + d = 0. From three possible roots we choose the one that yields the smallest value of the objective function.
Extreme Minimal Learning Machine EMLM.Another distance-based machine learning method, which was used in this study, is the Extreme Minimal Learning Machine (EMLM).The origin of the method lies in the so-called Extreme Learning Machine (ELM), which are single-layer perceptrons with special training and optimization methods. is the number of reference points, therefore the elements of H are defined as This is just the Euclidean distance between a reference point and an input point.We simplify the notation by writing h j ⌘ h(x j ).Now h j 2 R K⇥1 and H 2 R K⇥N d .Then as Kärkkäinen states, the training of the model is done through regularized least-squares (RLS) optimization problem 37 min The parameter ↵ is a small positive real number (square root of machine ✏ by default) used for regularization.V is a matrix containing the coefficients used for the actual regression and V 2 R p⇥K .One could say, that V and reference points together form the actual machine learning model.The minimum of the optimization problem lies on the zero point of the matrix derivative.The optimal solution W ⌘ V optimal satisfies After getting the optimal solution for the RLS problem one can use W to predict output for a new arbitary input x.This is done as where h is the same vector valued kernel function as before.With input vector x it yields K ⇥ 1 vector.The elements of this vector are defined to be Euclidean distances as |m

Monte Carlo
We used Monte Carlo to simulate the dynamics of the Au 38 (SCH 3 ) 24 clusters with simplified methyl ligands.Clusters are divided to three different moving parts: gold, sulfur and methyl.
Gold atoms are moved into a random direction according to the step size.Sulfur is moved in a similar fashion but in order to preserve the orientation of sulfur-carbon bond the methyl group is rotated making it to face the sulfur atom.The same principle is applied for the movement of the methyl groups.When methyl is moved according to the step size, the S-C bond orientation is preserved.In addition to this we allowed methyl group to rotate around the sulfur-carbon bond.The way how the alignment of sulfur-carbon bond is preserved is visualized in Figures 1C and 1D.The stretching of carbon-hydrogen bond does not have a significant contribution to the total potential energy of the system, therefore we decided to fix these bonds.
The acceptance of every move is decided according to the Metropolis question.The probability of the move to be accepted is defined as E i is the potential energy of the ith configuration and E i+1 is the potential energy of the configuration after a proposed move.Going downhill in energy landscape is always permitted but going uphill is accepted with certain probability defined by the energy difference and simulation temperature T .In the exponent k B is the Boltzmann constant.The step size of a single move is adjusted during the simulations so that the acceptance of the moves is between 40% and 60%.This step size is the same throughout the whole cluster and it is not affected by the type of the moved block.During a MC step, all moving parts are sampled randomly and every one of them has an opportunity to move.This means that one MC step consists of 38+24+24=86 trial moves.

Generating training data and training the models
The training data from the Au 38 (SCH 3 ) 24 clusters was generated using density functional theory (DFT) run with GPAW code. 61,62The major training data was published earlier We used two different sets of MBTR parameters {min,max,n x , ,d,cut-off}.The first set was {0, 1.4, 100, 0.1, 0.5, 10 3 } and the second set was {0, 1.2, 100, 0.045, 0.8, 10 5 } ( for discussion on choosing the parameters, see Supporting Information text and Figures S1 and   S2).In the beginning, we trained MLM for the MBTR data corresponding to the first set of parameters.Minmax scaling was applied to the training data so that descriptor values belonged to interval [0, 1].Overfitting is rarely an issue for MLM and EMLM.Therefore, we used the Full MLM and EMLM variants meaning that all data points were selected as reference points.We used MLM to predict potential energies during the Monte Carlo simulations in various simulation temperatures and with different starting structures taken from the training data.Monte Carlo frequently found the outer boundaries of the reference points pushing itself out of the working range of MLM.This resulted in erroneous potential energy values and non-physical structures.In the Supporting Information text and Figure S3 we show that the MLM, which was trained only with the initial MD data, 38 is not able to handle configurations produced by the Monte Carlo.However, it can still find clear structure-energy correlation within the training data.
To cope with the erroneous behaviour, we expanded the MLM training set including the MC-generated "unrealistic" configurations and their energies from DFT.The training set was expanded with 1580 new configurations for Q and 2124 for T isomer.After this we used the second set of MBTR parameters, which had improved descriptive possibilities (see Supporting Information).With the expanded training set and improved descriptor we trained both MLM and EMLM.In Figure 2

Validation: potential energy MLM/EMLM vs. DFT-MD
For validation, we created new independent DFT MD reference data sets both for Q and T isomers.For the Q isomer we ran 2000 steps at 269 K , 2000 steps at 475 K, and 3653 steps at 795 K.For the T isomer we ran 2000 steps at 273 K and 2049 steps at 486 K.
Potential energies were predicted for every configuration using both MLM and EMLM and compared to the actual DFT values from the MD run.The performance is seen in Figure 3.
Generally, the predicted values correlate clearly with the DFT values, with the root-meansquared error (RMSE) being 2.98 eV for MLM and 2.67 eV for EMLM.The corresponding average relative errors are only 0.38% and 0.33%, respectively.The predicted energies are somewhat higher (less negative) than those from DFT.Our training set contains a lot of high energy configurations of Au 38 (SCH 3 ) 24 , therefore the set might be biased.The visualization of PCA in Figure 4

MC simulations with EMLM-predicted energies
As the most stringent test, we performed MC simulations of both Q and T isomers at temperatures of 200 K, 250 K, and 300 K, using the EMLM-predicted potential energy in the Metropolis criterion while advancing the dynamics.Typical simulations were run for 9000 to 10000 MC steps, one MC step consisting of 86 independent trial moves of the atoms (hence 86 EMLM energy evaluations per MC step).PCA of the runs at 300 K is shown in Figure 5(A) indicating that the MC dynamics of both isomers is concentrated on a quite small region close to the T = 0 K local potential energy minimum, as expected for this rather low temperature.Figure 5(B) shows the evolution of the potential energy of both isomers at 300 K indicating that the potential energy of the Q isomer is consistently lower by about 1 eV than that of the T isomer.This result is consistent with the energetics known from DFT.
We analysed the statistics of selected bond distances and bond angles for both isomers from the MC runs at 200 K, 250 K, and 300 K.Last 500 MC steps from each simulations were used for the analysis.Figure 6 shows the statistics for the nearest neighbour Au-Au bonds in the metal core as well as for the S-Au and S-C bonds, and compares them to the statistics obtained from DFT MD runs at 268 K and 474 K for Q isomer and 272 K and 486 K for T isomer.We observe that the EMLM-MC runs generally slightly overestimate the Au-Au bonds in both isomers as compared to DFT MD.The peaks of the distributions are at 2.862 Å(MC) and 2.805 Å(MD) for Q isomer, and 2.845 Å(MC) and 2.805 Å(MD) for T isomer.
For S-Au and S-C bonds, EMLM-MC and DFT-MD produce very similar distributions both regarding the peak position and width.This analysis shows that the EMLM-MC runs indeed are able to simulate the bond dynamics of the atoms in the harmonic vibration regime.
Figure 7 shows the corresponding comparison between EMLM-MC and DFT-MD data for Au-S-Au and S-Au-S angles.In the crystal structures of these isomers the Au-S-Au angle is close to 90 and S-Au-S angle close to 170 (Figure 1).We observe that the maxima of Au-S-Au angles produced by EMLM-MC are slightly smaller than 90 , with a small side peak around 130 for the T isomer.We see a wider scatter in describing the S-Au-S angles in EMLM-MC as compared to DFT-MD, with the distributions having a maximum around 150 and tail extending close to 100 .MD simulations shows distributions peaked around 170 .We assign these slight discrepancies to the k2 description of the MBTR which does not take into account any angular information.

Conclusion
Distance-based machine learning methods discussed in this study are conceptually straightforward and very simple to implement.We have shown here that they are suitable to simulate complex systems such as MPCs that have a number of chemical interactions with varying strength, while resulting in significantly reduced computational cost as compared to DFT.
The CPU time to predict the energy by using MLM or EMLM with MBTR k2-level descriptors for the atomic structure is several magnitudes smaller than for DFT.For a comparison, MLM/EMLM energy predictions were run on a single core of Intel Xeon CPU E5-2680 v3 @ 2.50GHz with 8GB memory.Computing MBTR k2 with our parameters took about 0.07 seconds for one atomic structure.Prediction of the potential energy using MBTR k2 took Excluding all angular information and using only pairwise distances to describe atomic structures with MBTR k2-level further helps to make these methods computationally light.
The lack of angular information in MBTR k2 description does not mean, that our methods would not be able to reproduce reasonable bond angles.As shown in the SI, we could improve the description of the angles of protecting RS(AuSR) n=1,2 units by tuning the parameters, although the MC simulations showed that the energy landscape produced by EMLM slightly differed from the one that DFT would yield.
Monte Carlo showed to be an efficient strategy to study the energy landscape learned by MLM and EMLM.The method is not bound by any assumptions, therefore it freely explores the feature space and gives useful insight of possible weaknesses of the machine learning method.An important lesson learned in this work was that the initial MC simulations showed that our initial DFT-MD training set 38 was not extensive enough to train a comprehensive machine learning method, since the DFT-MD produced atomistic configurations that were all "physical".By enlarging the training data with the structures corresponding to the DFT energies of the "unphysical" configurations predicted by MLM/EMLM-MC back to the training data, we were able to teach the methods to avoid the unphysical regions of the configurational phase space.
Our future work involves further development of the models and descriptors for MPCs and other heterogeneous nanostructures.Here we used a global descriptor and predicted the potential energy of the system as a property of a whole system.Dividing the potential energy into atomic or molecular contributions creates in principle a way to get spatial insight into the energetics. 26Fabrizio et al. have pointed out that it is reasonable to use global description when predicting global properties but it might cause size-dependence, which sometimes can be overcome with usage of local descriptions. 64Our method is currently trained solely for Au 38 (SCH 3 ) 24 with the goal to demonstrate that distance-based machine learning methods can be used to handle complex systems such as MPCs.We aim to generalize the methods by including other MPCs (other metals and ligands) and other sizes of gold-thiolate clusters in the training set.

Supporting Information
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/Additional discussion on testing of the MBTR parameters (Figures S1 and S2).Performance of the MLM with the initial MD training data (Figure S3).

For Table of Contents only
parameter set is used.Especially the S-Au-S angle improves when the method uses the latter parameter set.
2 Pitfalls of using molecular dynamics as a training data Molecular dynamics (MD) simulations are always deterministic.They create a path in a configuration space as a function of time or simulation steps.This creates a pitfall for machine learning methods, especially for those whose construction relies on the actual observations, like the reference points with the distance-based methods.
In the beginning we used the first set of MBTR parameters to describe the structures of Au 38 (SCH 3 ) 24 clusters.The structures were from the publication of Juarez-Mosqueda et al. 3 The data set contained 25060 configurations in total.This data was then used to train the Minimal Learning Machine (MLM).From the whole data set 80% was randomly chosen to the training set.All training data points were selected as reference points during the training process.Finally the remaining 20% of the MD data was used for testing.The test results can be seen in Figure S3.It seems that the predicted potential energies of MLM follow accurately the results of DFT.In other words, the MLM could find a clear structureproperty correlation from the data set.Unfortunately the MLM is greatly restricted to the path that MD had made.It is not a difficult task to predict the property (in our case potential energy) between two similar data points along the path but predicting what is outside the path is much more difficult.This was seen in Monte Carlo simulations.They frequently broke the structures and made non-physical configurations, when this MLM was used to predict potential energies.In the main article principal component analysis (PCA) of the MBTR descriptors in Figure 2 shows how much the predicted structures differ from the original training data.In order to improve the generalization capability of the method we used configurations from these Monte Carlo simulations to expand the training set.

ix|.
We can see that the EMLM framework is fundamentally a Kernel Ridge Regression with the Euclidean distance basis as a kernel.Because of the structural similarity to the linear radial basis function network, the EMLM model possesses the universal approximation capability.[58][59][60]

Figure 1 :
Figure 1: The initial structures of Au 38 (SCH 3 ) 24 are visualized for Q and T isomers in (A) and (B) respectively.While moving sulfur atoms and methyls the orientation of the S-C bond has to be preserved.(C) shows how alignment is preserved if methyl is moved.(D) show the same when sulfur atom is moved.Long protecting unit is visualized in Figure (E) and short unit in (F).In (E) and (F) methyls are omitted for the sake of clarity.Orange: gold, yellow: sulfur, gray: carbon, white: hydrogen.
the principal component analysis (PCA) of the MBTR shows that the training set contains a large variety of configurations of both isomers spanning a large area of the feature space.Due to the fact that MLM/EMLM methods are using the Euclidean distances to measure the similarity of input point it is educative to visualize how the datapoints are arranged in the feature space.

Figure 2 :
Figure 2: PCA visualisation of MBTR descriptors of the training data.For the sake of clarity only 25% of the points are present in the graph.(i) the initial structures and (ii) high-temperature structures of the original MD simulations 38 (iii) snapshots from Monte Carlo simulations, where S-Au bonds have been broken.In (ii) and (iii) left/right structures originate from Q/T isomers.Orange: gold, yellow: sulfur, gray: carbon, white: hydrogen.
indicates that the new MD simulations are rather far away from the points in the original training set.However, they are not outside of the working region of the MLM and EMLM like the first Monte Carlo simulations, which were used to expand the training set.This enables distance-based methods to predict well the potential energy values.

Figure 3 :
Figure 3: Correlation between the predicted potential energy from (A) MLM and (B) EMLM to the DFT energy from the MD calculations for Q and T isomers.

Figure 4 :
Figure 4: Visualization for PCA from training data and test MD data.Potential energies on z axis are computed with DFT.The graph is rotated with respect to Figure 2. In order to keep visualization clear, only 25% of the points are included

Figure 5 :
Figure 5: (A) Same as Figure 2, but including also the PCA analysis of EMLM MC runs at 300 K for isomers Q and T. The arrow highlights the region of the MC data.The analysis indicates that both of the isomers are vibrating close to their minima.Only 25% of the points are included into the Figure and PC1 values are multiplied with 1 to produce a comparable graph.(B) shows the evolution of potential energies of both isomers predicted by EMLM during MC.

Figure 6 :
Figure 6: Top row: bond distance distributions from EMLM MC simulations at the indicted temperatures.Bottom row: the same data from DFT MD simulations at indicated temperatures.Labels on the top indicate the isomer and bond type.The statistics is summed from gaussian-smoothened ( = 0.05 Å) data points.

Figure 7 :
Figure 7: Top row: Selected bond angles distributions from EMLM MC simulations at the indicted temperatures.Bottom row: the same data from DFT MD simulations at indicated temperatures.Labels on the top indicate the isomer and type of the angle.The statistics is summed from gaussian-smoothened ( = 1.75 ) data points.

Figure S2 :
Figure S2: Here we visualize the effect of MBTR k2 to the angles of protecting units during the Monte Carlo simulations of Au 38 (SCH 3 ) 24 T. In the top row MBTR k2 is shown for different element pairs.Left side shows the results for the first parameter set and right side for the second set.The statistics of angles are summed from gaussian-smoothened ( = 1.75 ) data points.

Figure S3 :
Figure S3: Here the potential energies predicted by MLM are shown as a function of real DFT level potential energies.Data set contains both Q and T isomers of Au 38 (SCH 3 ) 24 .From the data 80% is used as reference structures and the remaining 20% are used for testing.When MLM is interpolating within the set its predictive power is excellent.RMSE = root mean squared error, MAE = mean absolute error