CSci 150: Foundations of computer science

Text files

A text file is simply a sequence of characters stored on disk, typically represented using either ASCII or Unicode, with no formatting information. This differs from many other files, such as one saved by a word processor, which contain formatting information alongside the text.

Tab-separated values

A text file might represent an e-mail message or a novel, but it is also often used for representing tabular data. For example, we might want to represent a table of county populations.

county     state pop.
Arkansas   AR    18777
Ashley     AR    21283
Baxter     AR    40957

One common way to store a table like this in a text file is tab-separated values, where rows of the table are separated by line breaks (typically ASCII character code 10), and where columns within each row are separated by tab characters (ASCII character code 9). So the first three rows of the above table might be represented as the following character sequence.

Arkansas   AR    18777
Ashley     AR    21283
Baxter     AR    40957

Or more literally (though this will display well only if your browser supports the character “⇥” to represent a tab and “↵” to represent a line break):

Arkansas⇥AR⇥18777↵Ashley⇥AR⇥21283↵Baxter⇥AR⇥40957↵
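To make this literal form concrete, the same three rows can be written as a Python string, where \t stands for the tab character and \n for the line break. This is only an illustrative sketch; the variable name table_text is my own choice.

```python
# The three rows as one character sequence: \t is a tab, \n is a line break.
table_text = 'Arkansas\tAR\t18777\nAshley\tAR\t21283\nBaxter\tAR\t40957\n'

# The separators are ASCII character codes 9 and 10.
print(ord('\t'))   # 9
print(ord('\n'))   # 10

# Printing the string shows the table as a text editor would display it.
print(table_text, end='')
```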

Processing text files

To access a text file in Python, you should use the built-in open function, which takes a string representing a file's name and returns a “file object” that can be used to access the file's contents. In the example below, open creates a file object corresponding to the text in the file named data.txt.

infile = open('data.txt')

(Of course, the word infile is just a variable name I chose. You could choose a different variable to reference the file object returned by open.)

Once you have an object referencing the file, you can iterate through the file by using a regular for loop:

for x in infile:
    # code to process line x in the file

The body of the loop will be processed for each line of the file. For example, if the file had the three lines mentioned above, we would go through the loop three times, with x being each of the following.

  1. Arkansas⇥AR⇥18777↵
  2. Ashley⇥AR⇥21283↵
  3. Baxter⇥AR⇥40957↵
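To see exactly what x holds on each pass, it can help to print its repr, which shows tabs as \t and line breaks as \n. The sketch below first writes the three sample rows to a temporary file so that it runs on its own; the file name sample_counties.txt and the list seen are names I made up for illustration.

```python
import os
import tempfile

# Write the three sample rows to a temporary file (hypothetical name).
path = os.path.join(tempfile.gettempdir(), 'sample_counties.txt')
outfile = open(path, 'w')
outfile.write('Arkansas\tAR\t18777\nAshley\tAR\t21283\nBaxter\tAR\t40957\n')
outfile.close()

# Iterate through the file; each x is one line, newline included.
seen = []
infile = open(path)
for x in infile:
    print(repr(x))
    seen.append(x)
infile.close()
```

Each printed value ends in \n, which is why the loop body usually needs to remove it before doing anything else.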

In practice, the body of the loop will almost always want to take off the newline character at the end, so you would want to use the rstrip method. Then, to divide the line into its component parts, you'd want to use the split method, passing it the tab character ('\t') as a parameter.
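As a small illustration of these two steps applied to a single line (the variable names stripped, data, and pop are just ones chosen for this sketch):

```python
line = 'Arkansas\tAR\t18777\n'

# Step 1: remove the trailing newline.
stripped = line.rstrip()

# Step 2: divide the line at each tab character.
data = stripped.split('\t')
print(data)

# The pieces are still strings; convert with int before doing arithmetic.
pop = int(data[2])
print(pop)
```

Note that rstrip with no argument removes all trailing whitespace, not only the newline, so a line whose last column is empty would lose its final tab as well.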

Here is a complete program that goes through the file and displays all counties where the population is 50,000 or above.

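infile = open('data.txt')
for line in infile:
    # remove the trailing newline, then divide the line at each tab
    data = line.rstrip().split('\t')
    # the population is read as a string, so convert it to an integer
    pop = int(data[2])
    if pop >= 50000:
        print('{0}, {1}'.format(data[0], data[1]))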

Notice how the first part of each iteration is to use rstrip and then split to take a line and divide it into its component parts, assuming that we're working with a tab-separated line.
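The same logic could also be packaged as a function that takes the file name and the population cutoff as parameters, which makes it easy to try different cutoffs. This is one possible sketch, not the only way to organize the program; the function name counties_at_least, its parameters, and the sample file name are all hypothetical.

```python
import os
import tempfile

def counties_at_least(filename, threshold):
    """Return (county, state) pairs whose population is >= threshold."""
    matches = []
    infile = open(filename)
    for line in infile:
        data = line.rstrip().split('\t')
        if int(data[2]) >= threshold:
            matches.append((data[0], data[1]))
    infile.close()
    return matches

# Write the three sample rows to a temporary file so the sketch runs alone.
path = os.path.join(tempfile.gettempdir(), 'sample_counties.txt')
outfile = open(path, 'w')
outfile.write('Arkansas\tAR\t18777\nAshley\tAR\t21283\nBaxter\tAR\t40957\n')
outfile.close()

result = counties_at_least(path, 20000)
for county, state in result:
    print('{0}, {1}'.format(county, state))
```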

As another, more complex example, suppose we want to compute the total population of each state, computed as the total of its counties' populations. In this case, we want a dictionary that maps state abbreviations to the total population found so far for that state. Only after reading the whole file do we go through the dictionary and display the total for each state.

infile = open('data.txt')
pops = {}
for line in infile:
    data = line.rstrip().split('\t')
    state = data[1]
    pop = int(data[2])
    # add this county's population to its state's running total
    if state in pops:
        pops[state] += pop
    else:
        pops[state] = pop
# after the whole file is read, display each state's total
for state in pops:
    print('{0:32s}{1:9d}'.format(state, pops[state]))
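The if/else that updates the dictionary can also be written with the dictionary's get method, which returns a default value when the key is absent. The sketch below shows that alternative on an in-memory list of lines rather than a file, so it runs on its own. The first three rows come from the table above; the two rows for the fictional state 'XX' are made up purely so that more than one key appears.

```python
# Three rows from the table above, plus two made-up rows for a
# fictional state 'XX' (not real data).
lines = [
    'Arkansas\tAR\t18777\n',
    'Ashley\tAR\t21283\n',
    'Baxter\tAR\t40957\n',
    'First\tXX\t100\n',
    'Second\tXX\t250\n',
]

pops = {}
for line in lines:
    data = line.rstrip().split('\t')
    state = data[1]
    # get returns 0 when state is not yet a key in pops
    pops[state] = pops.get(state, 0) + int(data[2])

for state in pops:
    print('{0:32s}{1:9d}'.format(state, pops[state]))
```

Both forms build the same dictionary; the get version is shorter, while the explicit if/else makes the two cases easier to see.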