Data Parsing
Finite Element (FE) engineers are data engineers. Every FE solve involves organizing data into certain formats, sending it to a solver, and then post-processing the outputs. It is, quite literally, all working with and manipulating data. And that takes time.
Naturally, any engineer would like to know, in advance, how long it will take to run their model. Using some basic AI/ML (Artificial Intelligence/Machine Learning) tools from scikit-learn, a freely available Python package, we can determine the relationships between a model's dimensions (nodes, elements, DOFs, cores, steps...) and the time it takes to solve. (Better said, we let Python and scikit-learn automatically determine those relationships and optimize the prediction.) We can do this using our own local data (or any data we have available) to build a scikit-learn ML estimator for the time to solve a model. What the ML part does is determine the weighting of the inputs and their effect on the output (time).
This short example will show you (1) how to gather your local data using a parser and, in Part 2, (2) how to use scikit-learn to build an estimator of the time to solve for your FE models.
I will use VS Code here to demo how to run this code. If you use a different IDE (good for you) it will be easier and more straightforward.
Step 0: Python venv setup
Create a virtual environment to store the required modules. Here I create a directory called 'bob' because I call everything bob when I start a project.
python -m venv C:\bob  # Note here I set my new venv directly in C:!
Activate that venv:
C:\bob\Scripts\Activate.ps1
# Note: the .ps1 activation script is for PowerShell (the default VS Code terminal); in a plain cmd shell use activate.bat instead
And update pip in that directory:
C:\bob\Scripts\python.exe -m pip install --upgrade pip
Most of the packages required are default Python packages, but there is a short list of other packages required for the data grab. Create a requirements.txt file in the C:\bob directory with these lines:
pandas
parsimonious
These will be installed with this pip command:
C:\bob\Scripts\pip install -r C:\bob\requirements.txt
It will take a few seconds to install the packages.
Step 1: Get Data
Getting the data itself is a bit of a procedure. Parsing can be done very efficiently if the source(s) are uniform and the search well organized.
Get Data: Steps
- Determine a few parameters that are likely to be key to estimating solve times
- Locate the solve.out files in the local directory
- Then parse the individual solve.out files
- And naturally, get the time to solve out of the .out file.
- Write these data points to a file for later use
Parameters that are likely to be key to estimating solve times
These are my best guess as to what is required to estimate the solve time. Note the solve time itself is included as the last dictionary key, 'Time'. Each dictionary item has the name of the parameter ('Version', 'Nodes', ...), the text to search for in the solve.out files, the separator character to split on, and the type of data it should find. Some of this makes more sense when we see the parsing function.
required_data = {'Version': ['BUILD=', '=', 'str'],
                 'Nodes': ['Total number of nodes', '=', 'float'],
                 'Nodes < v20': ['Number of total nodes', '=', 'float'],
                 'Elements': ['Total number of elements', '=', 'float'],
                 'Elements < v20': ['Number of total elements', '=', 'float'],
                 'DOF': ['Number of DOF', '=', 'float'],
                 'Solver': ['Equation solver used', ':', 'str'],
                 'Cores': ['Total number of cores requested', ':', 'float'],
                 'Steps': ['SOLVE FOR LS', 'OF', 'float'],
                 'memory_available': ['physical memory available', ':', 'float'],
                 'memory_used_old': ['Sum of memory used on all processes', ':', 'float'],
                 'memory_used': ['Maximum total memory used', ':', 'float'],
                 'Time': ['Total CPU time summed for all threads', ':', 'float']}
Note I made some duplicate dictionary keys since the format has changed slightly over the years. The parser needs an exact match to pull out the data, so upper/lower case is important (if you don't use Python to correct it later).
Example:
* 'Elements': ['Total number of elements','=', 'float'],
* 'Elements < v20':['Number of total elements', '=', 'float'],
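If you would rather not rely on exact upper/lower case, one option is to lower-case both the search term and the text before matching. This is an assumption on my part, not part of the script below, and the sample line is made up:

# Hypothetical alternative (not used in this article's parser): match the
# search term regardless of case by lower-casing both sides first.
sample_line = "TOTAL NUMBER OF ELEMENTS =          12"
search_term = 'Total number of elements'
position = sample_line.lower().index(search_term.lower())  # finds the term despite the case difference
print(position)  # -> 0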
How do we locate the solve.out files?
We use the os module to walk through the directory or directories and look for a file named solve.out since they are consistently named across Ansys versions.
import os
import time
import pandas as pd

directory = 'D:/'        # top-level directory to search; set this to wherever your solves live
data = []
d2 = []
df = pd.DataFrame()
t1 = time.time()         # start a counter so we can time the search
for root, dirs, files in os.walk(directory):
    for file in files:
        if file == "solve.out":
            path2file = os.path.join(root, file)
            print(path2file)
Parse the individual solve.out files
The parsimonious package parses text files. We will use parsimonious to find, in our .out files, the data points I decided were relevant to the time to solve. Since we have the path to each file (path2file) we can import and parse it. The solve.out files are not so massive that this can't be done quickly.
from parsimonious.grammar import Grammar

# interim containers
temp_df = {}
return_data = []

with open(path2file, 'r') as file:
    text = file.read()

# a grammar that simply matches every line of the file
grammar = Grammar(r"""
    file = line+
    line = ~".*?\n"
""")
tree = grammar.parse(text)
all_text = tree.text   # all_text now holds the full text of the imported file
Here I just grabbed the method from some online example (what'd you expect?) but what we end up with is all_text containing all the text from the imported file.
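Since every rule in this grammar just matches a whole line, tree.text ends up being the entire file, so a plain read gives the same string if you would rather skip the grammar:

# Equivalent for this use case: just read the whole file into one string
with open(path2file, 'r') as file:
    all_text = file.read()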
Then we can use the required_data to build a loop to search through all_text.
for key in required_data:
    search_term = required_data[key][0]
    try:
        s1 = all_text.index(search_term)
        line = all_text[s1:all_text.index('\n', s1)]                  # complete line containing the search term
        datapoint = line.split(required_data[key][1])[1].split()[0]   # data point parsed from that complete line
        temp_df[key] = [datapoint]
    except (ValueError, IndexError):
        temp_df[key] = [0]                                            # put in a zero if no data is found
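To make that split a little more concrete, here is what it does to one line (the line text below is illustrative, not copied from a real solve.out):

# illustrative line; the real solve.out wording/spacing may differ
line = "   Total number of nodes                     =              81"
datapoint = line.split('=')[1].split()[0]   # take the text after '=', then the first token
print(datapoint)                            # -> '81' (still a string at this point)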
At the end of this loop, we write what we found (the try/except handles any missing entries) to a pandas DataFrame.
import pandas as pd
data_frame = pd.DataFrame(temp_df)
This will create the dataframe with the column headers and values in the order of the keys in the required_data dictionary. An example data set from a single solve.out is shown below.
>>data_frame.loc[0]
Version 23.2
Nodes 81
Nodes < v20 81
Elements 12
Elements < v20 12
DOF 180
Solver Sparse
Cores 4
Steps 2
memory_used_old 180.0
Time 0.8
That is basically it for parsing; the rest is putting it all together in a loop that runs all the steps. Here I have changed a few parameters, added some functions, and moved things around a bit. I also added a counter to check how long the search takes.
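For reference, a minimal sketch of how the pieces might be assembled into one loop, reusing the required_data dictionary from above (the full version on GitHub differs in the details; parse_solve_out and the directory value are placeholders, not the article's exact code):

import os
import time
import pandas as pd

def parse_solve_out(path2file, required_data):
    """Placeholder wrapper around the parsing code above: returns a one-row DataFrame."""
    temp_df = {}
    with open(path2file, 'r') as file:
        all_text = file.read()
    for key in required_data:
        try:
            s1 = all_text.index(required_data[key][0])
            line = all_text[s1:all_text.index('\n', s1)]
            temp_df[key] = [line.split(required_data[key][1])[1].split()[0]]
        except (ValueError, IndexError):
            temp_df[key] = [0]
    return pd.DataFrame(temp_df)

directory = 'D:/'   # root directory to search; point this at your own solve archive
data = []           # one single-row DataFrame per solve.out found
t1 = time.time()
for root, dirs, files in os.walk(directory):
    for file in files:
        if file == "solve.out":
            data.append(parse_solve_out(os.path.join(root, file), required_data))
df = pd.concat(data, ignore_index=True)
print('Time to get all data for', len(data), 'files =', round(time.time() - t1), 'seconds')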
Click here to see the full code on GitHub
Time to get all data for 945 files = 9 seconds
My local example shows the program can search my D:/ drive, find 900+ solve.out files, and get the data in 9 seconds. If you got this far, congratulations. There is still some uninteresting data clean-up (always part of data projects) that I am not going to document in detail; I will simply place it here. It is mostly removing rows that had no data and consolidating the elements and nodes from the various versions into single columns.
# clean up dataframe
solver_dict = {'Sparse': 0, 'PCG': 1, 0: -1}  # simplified numeric codes for the solver (0, 1, -1)

# drop rows where no element count was found
a = df['Elements'].str.isnumeric() != False
df = df[a]

# convert the numeric columns to float
for key in required_data.keys():
    data_type = required_data[key][-1]
    if data_type == 'float':
        df[key] = df[key].astype(float)

# get elements, nodes and memory from the various versions to one column each
df['Elements'] = df[['Elements', 'Elements < v20']].max(axis=1)
df['Nodes'] = df[['Nodes', 'Nodes < v20']].max(axis=1)
df['memory_used'] = df[['memory_used', 'memory_used_old']].max(axis=1)

# map the solver name to a numeric code
solver_data = [solver_dict[i] for i in df['Solver']]
df['Solver'] = solver_data
df['Solver'] = df['Solver'].astype(int)

# drop the duplicate columns now that everything is consolidated
df = df.drop(['Elements < v20', 'Nodes < v20', 'memory_used_old'], axis=1)
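After the clean-up it is worth a quick look to confirm the columns are numeric and the row count is sensible, for example:

# quick sanity check on the cleaned data
print(df.dtypes)                      # inputs should now be float/int (Version stays a string)
print(len(df), 'rows of solve data')
print(df.describe())                  # ranges of nodes, elements, DOF, cores, time, etc.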
We could use this data directly in Part 2 (the interesting part), but to be safe, let's save it to a .json file. Pandas can write .json files directly. The tempfile package is used to write the data to that always inconvenient but always available temp directory.
import os
import tempfile

user = os.getlogin()
out_file_name = 'solves_out_' + user + '.json'
path2result = os.path.join(tempfile.gettempdir(), out_file_name)
df.to_json(path2result)
print('Data saved to: ' + path2result)
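In Part 2 (or any later session) the saved file can be read straight back into a DataFrame with pandas:

import pandas as pd

df = pd.read_json(path2result)   # reload the parsed solve data from the temp directory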
Summation
The parser built here in Part 1 can be used to rapidly pick out the required data points from the solve.out files. Once we have our data saved to a .json file or available locally as a pandas DataFrame, we can go on to Part 2 and start building our tool to estimate the time to solve using scikit-learn. Note this script can easily be adapted and run on larger archives of data if you happen to have access to a cluster where your colleagues might neglect to delete their solve information.
Part 2 will deal with how to use scikit-learn and our data to predict the time to solve.