Using Machine Learning to predict analysis solve times from local data: Part 2

Prediction

This is part 2 of my data parsing and prediction example and assumes you successfully gathered data and saved it locally, as explained in Part 1.

Once we have all the data from our solve.out files (Nodes, Elements, DOF and time), we can use that data set and some python AI/ML (Artificial Intelligence/Machine Learning) tools to create a method to predict the time based on our own inputs. This means once we have a tool built to do the prediction, we can use our analysis preliminary data to predict the time to solve.

First we need to get a few packages for our python virtual environment installation.

Step 0: Python venv activation

If you are restarting, you can just activate the virtual environment (bob) you created in Part 1.

C:/bob/Scripts/Activate.ps1
# Note non VS Code users might not need the .ps1

There is a short list of required packages for the prediction part of the example. Either add these lines to the existing requirements.txt file in the C:\bob folder or create a new requirements.txt there:

scikit-learn
matplotlib
joblib

These will be installed with the pip command:

C:\bob\Scripts\pip install -r C:\bob\requirements.txt

It will take a few seconds to install the packages.

Step 1: Importing the data, cleaning and sorting

We will again use pandas to import the json data file we created in Part 1. Some of the data in the file does not appear to be needed in this iteration of the predictor tool, so we delete those columns: memory and version are not used in this analysis.

# imports
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
import pandas as pd

path2result = r'path to my json file'
df = pd.read_json(path2result)
print(df.columns)
df_features = df.copy()
del df_features['Version']
del df_features['memory_available']
del df_features['memory_used_old']
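As a side note, the same column cleanup can be written as a single drop call. This is only a sketch on a small made-up DataFrame (the values are invented; only the column names follow this example). The errors='ignore' option keeps the call from failing if one of the columns is already missing.

```python
import pandas as pd

# Made-up rows for illustration; only the column names follow the example above
df = pd.DataFrame({
    'Nodes': [1000, 2000],
    'Elements': [500, 900],
    'Time': [12.0, 30.5],
    'Version': ['A', 'B'],
    'memory_available': [64, 64],
    'memory_used_old': [8, 12],
})

unused_columns = ['Version', 'memory_available', 'memory_used_old']
# drop returns a new DataFrame, so df itself is left untouched
df_features = df.drop(columns=unused_columns, errors='ignore')
print(list(df_features.columns))
```

Because drop returns a copy, the separate df.copy() step is not needed in this variant.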

Data preparation is a constant challenge in data analysis. In this case, we have some holes in our data: some of the node and element counts came in as zero. My solution was to determine which of the columns (Nodes, Elements, DOF) has the most data, use that as a key feature, and remove the column with too many gaps. The rows that still contain zeros are then removed.

# determine which column has the most data!
sort_results = []
columns_to_filter = ['Nodes','Elements','DOF','Time']
for column in columns_to_filter:
    count = (df_features[column]!=0).astype(int).sum()
    print(count)
    sort_results.append([column,count])
print(sort_results)

My quick analysis showed that the DOF column is not consistent (it has far fewer entries than Nodes or Elements). So I delete it and then remove the rows containing zeros from the pandas dataframe.

[['Nodes', np.int64(835)], ['Elements', np.int64(835)], ['DOF', np.int64(300)], ['Time', np.int64(832)]]

# Either nodes or elements have the max number of data points, delete DOF
del df_features['DOF']

# sort for nodes/elements/time ==0 and remove by index
index_count = []
for index,row in df_features.iterrows():
    if row['Nodes'] == 0 or row['Elements'] == 0 or row['Time'] == 0:
        index_count.append(index)

df_features.drop(axis = 0, index = index_count,inplace = True)
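The row loop above works, but the same filter can also be expressed as a boolean mask, which is usually faster on large dataframes. A minimal sketch on made-up data (the values and the names required, keep and df_clean are my own, chosen for illustration):

```python
import pandas as pd

# Made-up data: rows 1 and 3 contain a zero and should be removed
df_features = pd.DataFrame({
    'Nodes': [1000, 0, 3000, 4000],
    'Elements': [500, 700, 1500, 2000],
    'Time': [12.0, 20.0, 45.0, 0.0],
})

required = ['Nodes', 'Elements', 'Time']
# True for rows where every required column is nonzero
keep = (df_features[required] != 0).all(axis=1)
df_clean = df_features[keep]
print(len(df_features) - len(df_clean), "rows removed")
```

The mask keeps the original index labels, so the remaining rows can still be traced back to the source data.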

Step 2: Setup the data for using scikit-learn

The scikit ML tools are set up to take a set of data and divide it into two categories, testing and training data. The training data should be the majority, commonly 70% is used. The testing data is then used to test the accuracy of the predictions from the training data.

For more information about scikit-learn and what it does (and you will need more information to really understand it), see: scikit-learn.

The standard procedure is to break out the data you are hoping to predict into a y parameter and remove it from the data set. Then train_test_split is used to split the features and the target into training and testing sets. No model is built or fitted at this point.

import matplotlib.pyplot as plt
# Quick plot to see data
x_feature = 'Elements'
y_feature = 'Time'
df_features.plot(kind = 'scatter', x = x_feature, y = y_feature,grid = True)
plt.show()

# scikit learn data set up
y = df_features['Time'].values
del df_features['Time']
X = df_features.values

test_size = 0.30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=0)
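To make it clearer what train_test_split returns, here is a small sketch on made-up arrays (10 samples with 2 features; the data is invented for illustration). It only shuffles and splits the rows, keeping each feature row paired with its target value.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: sample i has features [2*i, 2*i + 1] and target i
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# 70% of the rows end up in the training set, 30% in the test set
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, which helps when comparing models later.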

Step 3: Baseline Regression model fitting (Prediction)

The prediction model has hyperparameters that allow the users to tune the fitting to the data. The extensive list of scikit hyperparameters can be found at: hyperparameters.

Here I use some baseline values as suggested in another online training. Later we can refine them. ensemble.GradientBoostingRegressor creates the regression model, and its fit method runs the data fitting.

# Fit regression model
model = ensemble.GradientBoostingRegressor(
    n_estimators = 1000,
    max_depth = 6,
    min_samples_leaf = 9,
    learning_rate = 0.1,
    max_features = 0.1,
    loss = 'huber')

model.fit(X_train,y_train)
# Find the error rate on the training set
mae = mean_absolute_error(y_train,model.predict(X_train))
print("Training Set Mean Absolute Error: %.4f" % mae)

# Find the error rate on the test set
mae = mean_absolute_error(y_test,model.predict(X_test))
print("Test Set Mean Absolute Error: %.4f" % mae)

The error comparison allows you to see the quality of the data fitting.

Training Set Mean Absolute Error: 312.3432
Test Set Mean Absolute Error: 265.2018

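The goal is to have a low error on both the training and the test data. If the two values are wildly different, with the training error much lower than the test error, the model is probably overfit. If both errors are high, it is probably underfit. Naturally there are ways to correct these issues.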
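To see what under- and overfitting can look like in the two error values, here is a rough sketch on synthetic data. Everything in it is an assumption made for illustration: the linear "time grows with elements" relation, the noise level, and the two parameter sets labeled small and large. It is not the data or the model from this example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: "time" grows with "elements", plus random noise
rng = np.random.default_rng(0)
elements = rng.uniform(1e4, 1e6, size=300)
time_s = 1e-3 * elements + rng.normal(0.0, 50.0, size=300)
X = elements.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, time_s, test_size=0.30, random_state=0)

# A very small model and a very large one, to exaggerate both effects
candidates = {
    'small': dict(n_estimators=5, max_depth=1),
    'large': dict(n_estimators=300, max_depth=6),
}
results = {}
for name, params in candidates.items():
    m = GradientBoostingRegressor(random_state=0, **params)
    m.fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, m.predict(X_train))
    test_mae = mean_absolute_error(y_test, m.predict(X_test))
    results[name] = (train_mae, test_mae)
    print("%s: train MAE %.1f, test MAE %.1f" % (name, train_mae, test_mae))
```

On data like this, the small model should show a high error on both sets (underfit), while the large model should show a training error far below its test error (overfit). The useful part is the comparison between the two numbers, not the absolute values.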

Step 4: Refine the Regression model fitting (parameter optimization)

However, being short of time, I decided to use the GridSearchCV capability of scikit-learn to determine a best fit from ranges of the hyperparameters. Computers are there to do these tasks for us. To set up the ranges I just used 1/2 and 2 times the original values; check the documentation for the valid ranges of each hyperparameter. The parameter search is run only on the training data. Using more CPUs will speed up the process, and a timer is inserted here to see how long it takes. Note: the run time depends on how much data you have, so once you launch this you might want to go get a coffee.

import time
n_cpus = 6
param_grid = {
    'n_estimators': [500, 1000, 2000 ],
    'max_depth': [3, 6, 12],
    'min_samples_leaf': [4, 9, 18],
    'learning_rate': [ 0.05, 0.1, 0.2],
    'max_features': [0.05, 0.1, 0.2],
    'loss': [ 'huber']
}

t1=time.time()
# Define the grid search we want to run. Run it on n_cpus cores in parallel
gs_cv = GridSearchCV(model, param_grid, n_jobs = n_cpus)

# Run the grid search - on only the training data!
gs_cv.fit(X_train, y_train)
## Note, you will get a lot of error messages as the data has issues. Scroll down for results.
t2 = time.time()

# Print the parameters that gave us the best fit
print(gs_cv.best_params_)
print('time : ' + str(round((t2-t1)/60, 1)) + ' minutes' )

# Find the error rate on the training set using the best parameters
mae = mean_absolute_error(y_train, gs_cv.predict(X_train))
print("Training Set Mean Absolute Error: %.4f" % mae)
# Find the error rate on the test set using the best parameters
mae = mean_absolute_error(y_test, gs_cv.predict(X_test))
print("Test Set Mean Absolute Error: %.4f" % mae)
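Before launching a grid search, it can help to count how many model fits it will trigger. The sketch below does that for the grid used above with scikit-learn's ParameterGrid. The value cv_folds = 5 is an assumption based on the GridSearchCV default when no cv argument is passed (scikit-learn 0.22 and later).

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [500, 1000, 2000],
    'max_depth': [3, 6, 12],
    'min_samples_leaf': [4, 9, 18],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_features': [0.05, 0.1, 0.2],
    'loss': ['huber'],
}

# Every combination of the listed values is one candidate model
n_combinations = len(ParameterGrid(param_grid))
cv_folds = 5  # assumed default number of cross-validation folds
n_fits = n_combinations * cv_folds
print(n_combinations, "combinations,", n_fits, "model fits")
```

This prints 243 combinations and 1215 model fits, plus one final refit on the full training set. One thing worth checking in the documentation: as far as I can tell, a fractional max_features is converted to max(1, int(fraction * n_features)), so with only five features the values 0.05, 0.1 and 0.2 would all resolve to one feature per split. If that holds for your version, trimming that list would cut the search time to a third.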

We can save the best set to file if we import the joblib package.

import joblib
# Save the best model found by the grid search so we can use it in other instances
joblib.dump(gs_cv.best_estimator_, 'ansys_data_best_parameters.pkl')
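A quick way to gain confidence in the saved file is to reload it right away and compare predictions. This is a sketch with a tiny synthetic model and a temporary folder; the file name demo_model.pkl and the data are made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Tiny synthetic model, only to demonstrate the save/load round trip
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 3))
y = X.sum(axis=1)
model = GradientBoostingRegressor(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'demo_model.pkl')  # hypothetical file name
    joblib.dump(model, pkl_path)
    reloaded = joblib.load(pkl_path)

# The reloaded model should reproduce the original predictions
same = np.allclose(model.predict(X), reloaded.predict(X))
print("Predictions match after reload:", same)
```

Keep in mind that a pkl file is generally only safe to load with the same (or a compatible) scikit-learn version that wrote it, so it is worth noting the version next to the file.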

We can later import just the ansys_data_best_parameters pkl file to do the prediction (coming up).

Step 5: Important! as in, what is?

Not all inputs or features have the same importance. We can use scikit-learn to determine the importance of the different features. The higher the value, the more relevant that feature is to the prediction (here, time). We could actually remove low-value features with very little loss of prediction accuracy.

# Create a numpy array based on the model's feature importances
importance = model.feature_importances_

# Sort the feature labels based on the feature importance rankings from the model
feature_indexes_by_importance = importance.argsort()

# These are the feature labels from our data set
feature_labels = df_features.columns
print(feature_labels)
# Print each feature label, from least important to most important (argsort is ascending)
for index in feature_indexes_by_importance:
    print("{} - {:.2f}%".format(feature_labels[index], (importance[index] * 100.0)))
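If you prefer the ranking from most to least important, a pandas Series keeps the labels and values together and sorts them in one step. The sketch below uses synthetic data in which the target depends strongly on Elements, weakly on Cores and not at all on Solver. That setup is invented for illustration and is not the solve data from this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data with a known structure (illustration only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Elements': rng.uniform(1e4, 1e6, 200),
    'Cores': rng.integers(1, 17, 200),
    'Solver': rng.integers(0, 2, 200),
})
y = 1e-3 * df['Elements'] - 2.0 * df['Cores']

model = GradientBoostingRegressor(n_estimators=50, random_state=0)
model.fit(df.values, y)

# Pair each importance with its column label, then sort descending
importances = pd.Series(model.feature_importances_, index=df.columns)
ranked = importances.sort_values(ascending=False)
print((ranked * 100).round(2))
```

The importances always sum to 1, so they describe relative influence within this model, not how accurate the model is.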

The refinement did not improve the prediction much. And we see, some of the features or inputs have very little effect on the prediction. If we want to further refine the prediction, some of the inputs can be dropped.

Training Set Mean Absolute Error: 336.0002
Test Set Mean Absolute Error: 270.7797

Solver - 3.03%
Cores - 10.51%
Steps - 17.98%
Nodes - 29.21%
Elements - 39.26%

Step 6: Predict using your best values

Let's use the best values from our grid search, that we saved, by reimporting the pkl.

import joblib
# Load the model we trained previously
model = joblib.load('ansys_data_best_parameters.pkl')

To run a prediction, we need to input our values or features in the exact same order as we did when we set up the training data set. Note the prediction wants lists of lists, so package appropriately.

Any Ansys run of respectable size should give you the required data you need to run a prediction. For comparison I use a local run of a box model with 1e6 elements and 8 cores.

### predict
nodes = 4e6
elements = 1e6
solver = 1
cores = 8
steps = 1

analysis_inputs = [
    # exact same order as training pandas column headers/features!
    nodes, # nodes
    elements, # Elements
    solver, # solver
    cores, # cores
    steps # Steps
    ]

analyses_inputs = [analysis_inputs]
predicted_time = model.predict(analyses_inputs)
predicted_value = predicted_time[0]

print("This analysis of {:,d} nodes, {:,d} elements, {:,d} cores has an estimated time of {:,.2f} seconds"
      .format(int(nodes),int(elements),cores,predicted_value))
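Since the model was trained on a plain array, it has no way to notice if the inputs arrive in the wrong order. One possible safeguard is a small helper that builds the list of lists from named values. The helper build_model_input and the FEATURE_ORDER list are hypothetical; the order shown is assumed to match the listing above and should be checked against the columns of your own training dataframe.

```python
# Column order used for training; assumed here to match the listing above
FEATURE_ORDER = ['Nodes', 'Elements', 'Solver', 'Cores', 'Steps']

def build_model_input(run, feature_order=FEATURE_ORDER):
    """Return a list of lists in training-column order (hypothetical helper)."""
    missing = [name for name in feature_order if name not in run]
    if missing:
        raise KeyError("missing features: %s" % ", ".join(missing))
    return [[run[name] for name in feature_order]]

# Keys may be given in any order; the helper restores the training order
run = {'Cores': 8, 'Steps': 1, 'Nodes': 4e6, 'Elements': 1e6, 'Solver': 1}
analyses_inputs = build_model_input(run)
print(analyses_inputs)
```

The result can be passed straight to model.predict, and a forgotten input raises an error instead of silently shifting the remaining values.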

The results are within reason:

This analysis of 4,000,000 nodes, 1,000,000 elements, 8 cores has an estimated time of 138.57 seconds

Summation

And that is how easy it could be. The python modules (and more thorough trainings!) are there for users to learn and use. Use the links in the text to find out more about the respective packages.

This example demonstrated that users can gather their own solve data, which is always the most representative, and use scikit-learn to create a (relatively) accurate prediction tool for the time to solve. This could be set up on local Ansys installations as a Mechanical ACT button that uses the pkl to give users a prediction for their current model!

Special Thanx to JD who, in spite of his best efforts, is super helpful in all things python and bringing form and order to all things scripting.

The author, Mark Capellaro, is an Ansys Application Engineer and data enthusiast and always looking for interesting ways to turn data into information.

The full code can be viewed on GitHub here.

If you missed the first part of this series you can read it here: Using Machine Learning to predict analysis solve times from local data: Part 1