Working with text files in Python

One of the most important skills in any programming language is knowing how to open text files and get the data in them into your program. When working with Ansys, you may want to parse the information out of a material card (XML) from a Granta MI database, or perhaps you're working with a JSON configuration file for your application. Or maybe you're just trying to work with some Excel data with the aim of doing some simple statistical analysis. In all these examples you need to know how to open and read files. It is usually quite straightforward, but several aspects make it trickier than you might initially suspect, and many tutorials overlook the basic recipes as a result.

This guide is not intended to be comprehensive. It gives readers a few recipes that cover the most common situations and that they can build on, and it points out some of the common pitfalls you can expect. It uses the standard library where possible and mentions other libraries where relevant.

What is a text file?

A text file is a computer file that is structured as lines of electronic text. For the purposes of programming and Python, it is a file containing a single string-type data object. The text is stored on disk as encoded bytes and must be decoded before it can be parsed by a program. However, many different encodings can be used, and you can't always tell which encoding has been used just by looking at the file.

Encoding is quite complicated, and if you are new to it, all you need to know for this article is that the most common encoding is "UTF-8". It is a reasonable default to assume for any file you encounter, although this is not guaranteed. You can read more about encodings in this Real Python encodings guide.
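To make the point about decoding concrete, here is a minimal sketch of what happens when the encoding used to read a file does not match the one used to write it. The sample text and the temporary files are made up for illustration.

```python
import os
import tempfile

text = "café: 25 °C"  # hypothetical sample containing non-ASCII characters

# Write the text to a temporary file as UTF-8.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write(text)
    utf8_path = tmp.name

# Reading with the matching encoding gives back the original text.
with open(utf8_path, encoding="utf-8") as file_object:
    decoded_correctly = file_object.read()

# Reading the same bytes as Latin-1 raises no error, but the
# non-ASCII characters come out garbled.
with open(utf8_path, encoding="latin-1") as file_object:
    decoded_wrongly = file_object.read()

# The reverse mismatch fails loudly: Latin-1 bytes are not valid UTF-8 here.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(text.encode("latin-1"))
    latin1_path = tmp.name

try:
    with open(latin1_path, encoding="utf-8") as file_object:
        file_object.read()
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

os.remove(utf8_path)
os.remove(latin1_path)

print(decoded_correctly)
print(decoded_wrongly)
print(decode_failed)
```

A silent mismatch like the second read is often harder to spot than the exception, which is one reason to pass `encoding` explicitly rather than rely on the platform default.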

Text file types

In both your professional and personal life, you encounter four common types of text files. These files contain plain text, but their distinct extensions indicate the various conventions they use to structure their data.

From this point on, "text files" refers to the general category of file, and "txt files" refers specifically to files that have the .txt extension.

.txt files have no set convention; this is the "generic" extension.
.csv files are "comma-separated values" files and typically contain data in columns, separated by a comma (,) delimiter (or sometimes a semicolon (;) delimiter).
- .tsv files are a close relation, but the "t" here stands for "tab".
.json files ("JavaScript Object Notation") structure data as key-value pairs contained within pairs of {}. They look like Python dictionaries.
.xml files ("eXtensible Markup Language") structure data using tags in angle brackets <> and are the hardest to work with in Python. HTML looks very similar, but isn't the same.

Note on interpreting text files: file extensions, like .txt, provide the computer with information about the type of file it is processing and how to interpret it. This is normally hidden from the user. When you are writing code, however, you are explicitly taking control of how the file is handled. This means you can treat ANY file as a text file if you want to, regardless of its extension. And the reverse applies too: a file doesn't have to have the .txt extension to be a text file.

We can treat all four of the types above as pure text files and simply process their content in different ways. In addition, some files may not have the expected extension, but can still be interpreted on the basis of their internal structure. For example, Jupyter notebook (.ipynb) files are JSON files and can be treated like JSON files.
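As a small illustration of the note above, the sketch below reads one file with an .ipynb extension twice: once as plain text and once as JSON. The notebook content is a made-up minimal example, not a real notebook, and is written to a temporary file so the snippet is self-contained.

```python
import json
import os
import tempfile

# Hypothetical, minimal notebook content (an empty notebook).
notebook_text = '{"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}'

with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".ipynb", delete=False) as tmp:
    tmp.write(notebook_text)
    path = tmp.name

# The extension does not stop us from reading the file as plain text...
with open(path, encoding="utf-8") as file_object:
    as_text = file_object.read()

# ...or from parsing it according to its internal structure, which is JSON.
with open(path, encoding="utf-8") as file_object:
    as_json = json.load(file_object)

os.remove(path)

print(type(as_text).__name__, type(as_json).__name__)
```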

The first step when working with any file in Python is to open it (and the last step should be to close it!). Working with files is like working with boxes: you have to open the box before you can take anything out of it. Text files can be opened in Python using one of these two methods:

Explicitly open the file object, read the file, then close the file again afterwards.
Use the with statement, which opens the file when the program enters the indented block and automatically closes it when the program leaves the block.
- This makes it less likely that you will forget the closing statement, because closing is built into the opening statement.
- Learn more about the with pattern at Real Python: Python's "with open() as" Pattern.

Method 1

# Method 1 (not recommended)
file_object = open('my_file.txt', encoding='utf8')
data = file_object.read()
file_object.close()

Method 2

# Method 2 (recommended)
with open('my_file.txt', encoding='utf8') as file_object:
    data = file_object.read()

Method 2 is the recommended approach in Python.
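The automatic closing behaviour can be checked with the file object's `closed` attribute. The following is a minimal sketch using a temporary file created only for the demonstration; it also shows that the file is closed when an exception interrupts the block.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("some text\n")
    path = tmp.name

with open(path, encoding="utf-8") as file_object:
    data = file_object.read()
    closed_inside = file_object.closed  # still open inside the block

closed_after = file_object.closed  # closed once the block is left

# The file is also closed if an error occurs inside the block.
try:
    with open(path, encoding="utf-8") as file_object:
        raise ValueError("something went wrong while processing")
except ValueError:
    pass

closed_after_error = file_object.closed

os.remove(path)

print(closed_inside, closed_after, closed_after_error)
```

With Method 1, the same guarantee would need an explicit try/finally around the read, which is easy to forget.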

Note: This may not appear to be true when using some libraries like numpy or pandas. However, the same process is still going on behind the scenes; it is just hidden inside the functions you use.

These methods open the file and read the entire file as a single string. This can be useful, but often you want to split the file into lines or process it in some other way. The file object has several methods available to cover common use cases.

file_object.read() - reads the whole file and returns it as a single string.
file_object.readlines() - reads all the lines in the file and returns them as a list of strings.
file_object.readline() - reads ONE line of the file and returns it as a string. It can be called repeatedly to read lines in sequence.

For example:

with open('my_file.txt', encoding='utf8') as file_object:
    data = file_object.readlines()

with open('my_file.txt', encoding='utf8') as file_object:
    first_ten_lines = []
    for i in range(10):
        line = file_object.readline()
        first_ten_lines.append(line)
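In addition to the three methods listed above, a file object can be iterated over directly, yielding one line at a time. The sketch below is one possible alternative to the readline loop, using `itertools.islice` from the standard library to take at most ten lines; the four-line file is a made-up example.

```python
import itertools
import os
import tempfile

# Hypothetical four-line file, created here so the example is self-contained.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("My text file\nline 2\nline 3\nline 4\n")
    path = tmp.name

with open(path, encoding="utf-8") as file_object:
    # islice stops after ten lines, or earlier if the file runs out.
    first_ten_lines = list(itertools.islice(file_object, 10))

with open(path, encoding="utf-8") as file_object:
    # Iterating reads one line at a time instead of loading the whole file.
    line_lengths = [len(line) for line in file_object]

os.remove(path)

print(first_ten_lines)
print(line_lengths)
```

Unlike the readline loop, this version does not pad the list with empty strings when the file has fewer than ten lines.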

Keep in mind that a file object tracks its current position in the file, so each line is read only once as you move through it. Once the end of the file is reached, subsequent calls to read lines return an empty string, unless you reopen the file or move back to the start. For example, if my_txt.txt contains:

My text file
line 2
line 3
line 4

This is what we see when we try to read lines after line 4.

In [11]: with open("my_txt.txt") as file:
    ...:     line = ' '
    ...:     for i in range(10):
    ...:         # See the next section for information on the `strip` method
    ...:         line = file.readline().strip('\n')
    ...:         print(f'line num: {i+1} - line: "{line}"')
    ...:
line num: 1 - line: "My text file"
line num: 2 - line: "line 2"
line num: 3 - line: "line 3"
line num: 4 - line: "line 4"
line num: 5 - line: ""
line num: 6 - line: ""
line num: 7 - line: ""
line num: 8 - line: ""
line num: 9 - line: ""
line num: 10 - line: ""
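If you do need to read the same file again without reopening it, the file object's `seek` method moves the read position. The following is a minimal sketch with a temporary file standing in for my_txt.txt.

```python
import os
import tempfile

# Temporary stand-in for my_txt.txt.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("My text file\nline 2\nline 3\nline 4\n")
    path = tmp.name

with open(path, encoding="utf-8") as file_object:
    first_pass = file_object.read()   # reads to the end of the file
    second_pass = file_object.read()  # nothing left to read
    file_object.seek(0)               # move back to the start of the file
    third_pass = file_object.read()   # the full content again

os.remove(path)

print(repr(second_pass))
print(third_pass == first_pass)
```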

Python reads everything from a text file as a string, so you need to be familiar with string operations to get the most out of working with files. Two fundamental methods you should be familiar with are split and strip.

So, for example, if you want to parse a text file into a list of strings (one per line) and strip out all the newline characters (\n) at the end of each line (a common thing to do), the following recipe will do just that.

# Read every line and remove the trailing newline character from each one
with open('my_file.txt', encoding='utf8') as file_object:
    data = [line.strip('\n') for line in file_object]

Note: When you read a file like this, Python reads everything in it, including the characters that are there but can't be seen, such as the newline character \n. "Strings and Character Data in Python" is a great Real Python article that covers the intricacies of characters and strings in Python and is worth a read for further information.
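As a short sketch of what split and strip do in practice, the example below takes one line as it might come out of a file and turns it into typed values. The line content (a material name followed by two numbers) is invented for illustration.

```python
# A hypothetical line as it would be returned by readline(), newline included.
raw_line = "steel, 210, 7850\n"

# strip('\n') removes only newline characters from the ends of the string.
without_newline = raw_line.strip("\n")

# split(',') cuts the string at every comma; surrounding spaces are kept.
fields = without_newline.split(",")

# strip() with no argument removes leading and trailing whitespace.
clean_fields = [field.strip() for field in fields]

# Everything is still a string, so numbers must be converted explicitly.
name = clean_fields[0]
values = [float(field) for field in clean_fields[1:]]

print(fields)
print(clean_fields)
print(name, values)
```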

Working with CSV files in Python

CSV files are a bit easier to work with, since there are several good libraries you can use to help. In particular there's the built-in library csv and the external, but common, libraries numpy and pandas.

`csv`

The csv library is fairly simple to use and will do the stripping of unwanted characters and the splitting up of the strings for you. However, it will not convert the values to the correct types for you: every field is returned as a string, so you still have to do the conversion yourself. For example, if your CSV file (my_csv.csv) looks like:

number 1, number 2
1,2
3,4
5,6

The following code would parse that into two lists of integer values stored in the variables column1 and column2:

import csv

# The csv documentation recommends opening files with newline=''
with open('my_csv.csv', newline='', encoding='utf8') as file_object:
    reader = csv.reader(file_object, delimiter=',')
    data = list(reader)

# Skip the header row by iterating over a slice of `data` that
# doesn't include it, and convert each field from str to int
column1 = [int(row[0]) for row in data[1:]]
column2 = [int(row[1]) for row in data[1:]]
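A possible alternative, sketched here, is `csv.DictReader`, which uses the header row as keys so that columns can be accessed by name. Note that the header in the example file has a space after the comma; passing `skipinitialspace=True` is one way to keep that space out of the second column name. The file is recreated as a temporary file so the snippet runs on its own.

```python
import csv
import os
import tempfile

# Recreate the example CSV (including the space after the comma in the header).
with tempfile.NamedTemporaryFile("w", encoding="utf-8", newline="", suffix=".csv", delete=False) as tmp:
    tmp.write("number 1, number 2\n1,2\n3,4\n5,6\n")
    path = tmp.name

with open(path, newline="", encoding="utf-8") as file_object:
    # skipinitialspace=True ignores whitespace directly after each delimiter.
    reader = csv.DictReader(file_object, skipinitialspace=True)
    rows = list(reader)
    fieldnames = reader.fieldnames

os.remove(path)

# The header row is consumed as keys, so there is nothing to skip,
# but the values are still strings and need converting.
column1 = [int(row["number 1"]) for row in rows]
column2 = [int(row["number 2"]) for row in rows]

print(fieldnames)
print(column1, column2)
```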

`numpy`

The numpy library is even easier and does even more for you, but you need to be careful about its limitations, because its methods are less flexible. For example, all elements of a numpy array must have the same type, and CSV files parsed by numpy must have a consistent shape: lines with different numbers of columns will cause problems. However, CSV data doesn't usually have these issues.

import numpy

data = numpy.genfromtxt('my_csv.csv', delimiter=',', skip_header=1)

This example parses our csv into a 2D numpy array containing the data in the csv as floats, whilst skipping the header.

`pandas`

The pandas library is even easier and parses the CSV file into a dataframe. The datatype now only needs to be consistent within each column, rather than across the whole file, and you can use the headers as labels in the dataframe. The downside is that you still need rectangular data; uneven numbers of fields per line will cause issues. You also need to learn how to use dataframes, which are useful to know about but are another layer to learn.

This makes our code look even more concise.

import pandas

data = pandas.read_csv('my_csv.csv')

Here, data is a dataframe containing the CSV data.

Working with JSON files

JSON files are very similar to dictionaries in Python. A JSON object is a set of key:value pairs, with no predefined schema, enclosed in a pair of {}. Any JSON object can be interpreted as a Python dictionary, but not all Python dictionaries can be turned into ("serialized" as) JSON objects. You can use the built-in library json to work with JSON files in Python.

For example, if your JSON file (my_json.json) looks like:

{
    "foo": 0,
    "bar": ["baz", null, 1.0, 2]
}

You can turn that into a Python dictionary:

import json

with open('my_json.json', encoding='utf8') as file_object:
    data = json.load(file_object)

If you run this and print the value of data to screen, you can see:

>>> data
{'foo': 0, 'bar': ['baz', None, 1.0, 2]}
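The claim that not every Python dictionary can be serialized as JSON can be illustrated with the json module's `dumps` and `loads` functions, which work on strings instead of files. The dictionaries below are made-up examples.

```python
import json

# This dictionary only uses types that JSON supports, so it round-trips.
serializable = {"foo": 0, "bar": ["baz", None, 1.0, 2]}
as_json_text = json.dumps(serializable)
round_tripped = json.loads(as_json_text)

# A set has no JSON equivalent, so serializing it raises a TypeError.
not_serializable = {"tags": {"a", "b"}}
try:
    json.dumps(not_serializable)
    set_failed = False
except TypeError:
    set_failed = True

# Non-string keys such as integers are converted to strings on the way out,
# so the round trip does not give back exactly the same dictionary.
int_keys_round_trip = json.loads(json.dumps({1: "one"}))

print(as_json_text)
print(set_failed)
print(int_keys_round_trip)
```

The integer-key case is easy to miss because no error is raised; the keys simply come back as strings.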

Working with XML files

XML is the trickiest to work with as XML schemas can vary enormously and they don't translate seamlessly into Python like JSON does. People sometimes try to do it with regular expressions. Don't do this.

XML is hierarchical. You have a root node, and that node has child nodes, which can have their own children, and so on. Each node can have attributes (key-value pairs) and a text value. For example, a simple XML file (my_xml.xml) looks like:

<data>
    <number name="1">
        <point>1</point>
        <point>3</point>
        <point>5</point>
    </number>
    <number name="2">
        <point>2</point>
        <point>4</point>
        <point>6</point>
    </number>
</data>

There are many libraries designed to work with XML, but in this example, we use the built-in ElementTree module. The following Python code parses the XML data into two lists named number1 and number2. Note that you access the children of a node by iterating over the parent node.

import xml.etree.ElementTree as ET

tree = ET.parse('my_xml.xml')
root = tree.getroot()

numbers = [child for child in root]
number1 = [int(child.text) for child in numbers[0]]
number2 = [int(child.text) for child in numbers[1]]

Inspecting the results shows the two lists. You can also access the attributes on the number nodes through the attrib property.

In [36]: number1
Out[36]: [1, 3, 5]

In [37]: number2
Out[37]: [2, 4, 6]

In [38]: numbers[0].attrib
Out[38]: {'name': '1'}
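Indexing into the list of children relies on the nodes appearing in a known order. A possible alternative, sketched below, is to select nodes by tag with `findall` and key the results on the name attribute. The XML is parsed from a string with `ET.fromstring` so that the snippet is self-contained; the variable name `points_by_name` is just an illustrative choice.

```python
import xml.etree.ElementTree as ET

# Same content as my_xml.xml, embedded as a string for the example.
xml_text = """
<data>
    <number name="1">
        <point>1</point>
        <point>3</point>
        <point>5</point>
    </number>
    <number name="2">
        <point>2</point>
        <point>4</point>
        <point>6</point>
    </number>
</data>
"""

# fromstring returns the root element directly.
root = ET.fromstring(xml_text)

# findall('number') returns the direct children of root with that tag;
# get('name') reads the attribute, and text holds the node's value as a string.
points_by_name = {
    number.get("name"): [int(point.text) for point in number.findall("point")]
    for number in root.findall("number")
}

print(points_by_name)
```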
