How to Think Like a Computer Scientist: Learning with Python 2nd Edition/Strings

Strings

7.1 A compound data type

So far we have seen five types: int, float, bool, NoneType and str. Strings are qualitatively different from the other four because they are made up of smaller pieces --- characters.

Types that comprise smaller pieces are called compound data types. Depending on what we are doing, we may want to treat a compound data type as a single thing, or we may want to access its parts. This ambiguity is useful.

The bracket operator selects a single character from a string:

>>> fruit = "banana"
>>> letter = fruit[1]
>>> print letter

The expression fruit[1] selects character number 1 from fruit. The variable letter refers to the result. When we display letter, we get a surprise:

a

The first letter of "banana" is not a, unless you are a computer scientist. For perverse reasons, computer scientists always start counting from zero. The 0th letter ( zero-eth ) of "banana" is b. The 1th letter ( one-eth ) is a, and the 2th ( two-eth ) letter is n.

If you want the zero-eth letter of a string, you just put 0, or any expression with the value 0, in the brackets:

>>> letter = fruit[0]
>>> print letter
b

The expression in brackets is called an index. An index specifies a member of an ordered set, in this case the set of characters in the string. The index indicates which one you want, hence the name. It can be any integer expression.

7.2 Length

The len function returns the number of characters in a string:

>>> fruit = "banana"
>>> len(fruit)
6

To get the last letter of a string, you might be tempted to try something like this:

length = len(fruit)
last = fruit[length]       # ERROR!

That won't work. It causes the runtime error IndexError: string index out of range. The reason is that there is no 6th letter in "banana". Since we started counting at zero, the six letters are numbered 0 to 5. To get the last character, we have to subtract 1 from length:

length = len(fruit)
last = fruit[length-1]

Alternatively, we can use negative indices, which count backward from the end of the string. The expression fruit[-1] yields the last letter, fruit[-2] yields the second to last, and so on.
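As a quick sketch of the two approaches (the names also_last, second_to_last, and same_character are illustrative additions, not part of the original text), negative indices pick out the same characters as indices computed from len:

```python
fruit = "banana"
length = len(fruit)

# Counting from the front: the last character sits at index length - 1.
last = fruit[length - 1]

# Counting from the back: -1 is the last character, -2 the one before it.
also_last = fruit[-1]
second_to_last = fruit[-2]

# For a valid negative index k, fruit[k] is the same as fruit[length + k].
same_character = (fruit[-2] == fruit[length - 2])
```

Here last and also_last both refer to "a", and second_to_last refers to "n".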

7.3 Traversal and the for loop

A lot of computations involve processing a string one character at a time. Often they start at the beginning, select each character in turn, do something to it, and continue until the end. This pattern of processing is called a traversal. One way to encode a traversal is with a while statement:

index = 0
while index < len(fruit):
    letter = fruit[index]
    print letter
    index += 1
# Trace, assuming fruit = "banana":
# len(fruit) is 6, so the loop runs while index is less than 6.
# On the first pass index is 0, so letter is "b", which is printed.
# Then 1 is added to index.
# The loop stops once index reaches 6.

This loop traverses the string and displays each letter on a line by itself. The loop condition is index < len(fruit), so when index is equal to the length of the string, the condition is false, and the body of the loop is not executed. The last character accessed is the one with the index len(fruit)-1, which is the last character in the string.

Using an index to traverse a set of values is so common that Python provides an alternative, simpler syntax --- the for loop:

for letter in fruit:
    print letter

Each time through the loop, the next character in the string is assigned to the variable letter. The loop continues until no characters are left.

The following example shows how to use concatenation and a for loop to generate an abecedarian series. Abecedarian refers to a series or list in which the elements appear in alphabetical order. For example, in Robert McCloskey's book Make Way for Ducklings, the names of the ducklings are Jack, Kack, Lack, Mack, Nack, Ouack, Pack, and Quack. This loop outputs these names in order:

prefixes = "JKLMNOPQ"
suffix = "ack"

for letter in prefixes:
    print letter + suffix

The output of this program is:

Jack
Kack
Lack
Mack
Nack
Oack
Pack
Qack

Of course, that's not quite right because Ouack and Quack are misspelled. You'll fix this as an exercise below.

7.4 String slices

A substring of a string is called a slice. Selecting a slice is similar to selecting a character:

>>> s = "Peter, Paul, and Mary"
>>> print s[0:5]
Peter
>>> print s[7:11]
Paul
>>> print s[17:21]
Mary

The operator [n:m] returns the part of the string from the n-eth character to the m-eth character, including the first but excluding the last. This behavior is counterintuitive; it makes more sense if you imagine the indices pointing between the characters, with index 0 just before the first character and index len(s) just after the last.

If you omit the first index (before the colon), the slice starts at the beginning of the string. If you omit the second index, the slice goes to the end of the string. Thus:

>>> fruit = "banana"
>>> fruit[:3]
'ban'
>>> fruit[3:]
'ana'

What do you think s[:] means?

7.5 String comparison

The comparison operators work on strings. To see if two strings are equal:

Other comparison operations are useful for putting words in lexicographical order:

This is similar to the alphabetical order you would use with a dictionary, except that all the uppercase letters come before all the lowercase letters. As a result:

A common way to address this problem is to convert strings to a standard format, such as all lowercase, before performing the comparison. A more difficult problem is making the program realize that zebras are not fruit.
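The comparisons described in this section can be sketched as follows. The helper name compare_to_banana, its messages, and the sample word "Zebra" are assumptions chosen for illustration; parenthesized print is used so the sketch runs under both Python 2 and Python 3:

```python
def compare_to_banana(word):
    """Describe where word falls relative to "banana" (hypothetical helper)."""
    if word == "banana":
        return "Yes, we have no bananas!"
    elif word < "banana":
        return "Your word, " + word + ", comes before banana."
    else:
        return "Your word, " + word + ", comes after banana."

# Uppercase letters sort before lowercase ones, so "Zebra" lands before "banana".
print(compare_to_banana("Zebra"))          # Your word, Zebra, comes before banana.

# Converting to lowercase first gives the expected dictionary order.
print(compare_to_banana("Zebra".lower()))  # Your word, zebra, comes after banana.
```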

7.6 Strings are immutable

It is tempting to use the [] operator on the left side of an assignment, with the intention of changing a character in a string. For example:

Instead of producing the output Jello, world!, this code produces the runtime error TypeError: 'str' object does not support item assignment.

Strings are immutable, which means you can't change an existing string. The best you can do is create a new string that is a variation on the original:

The solution here is to concatenate a new first letter onto a slice of greeting. This operation has no effect on the original string.
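A minimal sketch of both steps, assuming greeting holds "Hello, world!" (implied by the intended output Jello, world!). The try/except wrapper and the name assignment_worked are additions so that the failing assignment does not stop the script:

```python
greeting = "Hello, world!"

# Item assignment on a string is not allowed and raises TypeError.
try:
    greeting[0] = "J"
    assignment_worked = True
except TypeError:
    assignment_worked = False

# Build a new string instead: a new first letter plus a slice of the old one.
new_greeting = "J" + greeting[1:]

print(new_greeting)   # Jello, world!
print(greeting)       # Hello, world!  (the original is unchanged)
```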

7.7 The in operator

The in operator tests if one string is a substring of another:

Note that a string is a substring of itself:

Combining the in operator with string concatenation using +, we can write a function that removes all the vowels from a string:

Test this function to confirm that it does what we wanted it to do.
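One way such a function might look is sketched below. The name remove_vowels, the choice to treat both uppercase and lowercase a, e, i, o, u as vowels, and the sample strings are all assumptions:

```python
def remove_vowels(s):
    """Return a copy of s with the vowels a, e, i, o, u (either case) removed."""
    vowels = "aeiouAEIOU"
    s_without_vowels = ""
    for letter in s:
        if letter not in vowels:
            # Concatenate each non-vowel onto the result built so far.
            s_without_vowels = s_without_vowels + letter
    return s_without_vowels

# The in operator on its own: substring tests evaluate to True or False.
p_in_apple = "p" in "apple"          # True
i_in_apple = "i" in "apple"          # False
apple_in_apple = "apple" in "apple"  # True: a string is a substring of itself
```

For example, remove_vowels("compsci") returns "cmpsc".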

7.8 A find function

What does the following function do?

def find(strng, ch):
    index = 0
    while index < len(strng):
        if strng[index] == ch:
            return index
        index += 1
    return -1

# Trace, assuming strng is "banana" and ch is "a":
# On the first pass index is 0 and strng[0] is "b", which is not "a",
# so 1 is added to index.
# On the second pass index is 1 and strng[1] is "a",
# so the function returns 1 immediately, ending the loop.
# If ch never appears in strng, the loop runs to the end and -1 is returned.

In a sense, find is the opposite of the [] operator. Instead of taking an index and extracting the corresponding character, it takes a character and finds the index where that character appears. If the character is not found, the function returns -1.

This is the first example we have seen of a return statement inside a loop. If strng[index] == ch, the function returns immediately, breaking out of the loop prematurely.

If the character doesn't appear in the string, then the program exits the loop normally and returns -1.

This pattern of computation is sometimes called a eureka traversal because as soon as we find what we are looking for, we can cry Eureka! and stop looking.

7.9 Looping and counting

The following program counts the number of times the letter a appears in a string, and is another example of the counter pattern introduced earlier in the book:
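A minimal sketch of such a counting loop, matching the program quoted in Question 2 of the exercises except that print is parenthesized so it runs under both Python 2 and Python 3:

```python
fruit = "banana"
count = 0
for char in fruit:
    if char == 'a':
        count += 1    # one more 'a' seen so far
print(count)          # 3
```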

7.10 Optional parameters

To find the locations of the second or third occurrence of a character in a string, we can modify the find function, adding a third parameter for the starting position in the search string:

def find2(strng, ch, start):
    index = start
    while index < len(strng):
        if strng[index] == ch:
            return index
        index += 1
    return -1

The call find2('banana', 'a', 2) now returns 3, the index of the first occurrence of 'a' in 'banana' at or after index 2. What does find2('banana', 'n', 3) return? If you said 4, there is a good chance you understand how find2 works.

Better still, we can combine find and find2 using an optional parameter:

def find(strng, ch, start=0):
    index = start
    while index < len(strng):
        if strng[index] == ch:
            return index
        index += 1
    return -1
# index starts at start, which is 0 by default.
# While index is less than the length of strng:
#   if strng[index] equals ch, return index, the location of ch in strng
#   (note that return also ends the loop);
#   otherwise add 1 to index and continue.
# If the loop finishes without a match, return -1.

The call find('banana', 'a', 2) to this version of find behaves just like find2, while in the call find('banana', 'a'), start will be set to the default value of 0.

Adding another optional parameter to find makes it search both forward and backward:

def find(strng, ch, start=0, step=1):
    index = start
    while 0 <= index < len(strng):
        if strng[index] == ch:
            return index
        index += step
    return -1

Passing in a value of len(strng)-1 for start and -1 for step will make it search toward the beginning of the string instead of the end. Note that we needed to check for a lower bound for index in the while loop as well as an upper bound to accommodate this change.
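To illustrate, the sketch below repeats the four-parameter find from above and calls it in both directions. The variable names and the particular calls are illustrative assumptions:

```python
def find(strng, ch, start=0, step=1):
    index = start
    while 0 <= index < len(strng):
        if strng[index] == ch:
            return index
        index += step
    return -1

word = "banana"

# First 'a' searching from the left: 1.
forward = find(word, "a")

# First 'a' searching from the right: 5.
backward = find(word, "a", len(word) - 1, -1)

# A step of 2 checks indices 0, 2, 4 and finds 'n' at 2.
skipping = find(word, "n", 0, 2)

# Checks indices 5, 3, 1; the next index is -1, so the lower bound
# in the loop condition stops the search and -1 is returned.
missing = find(word, "b", len(word) - 1, -2)
```

Without the 0 <= index check, the last call would continue with index -1, which Python treats as a valid index counting from the end of the string.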

7.11 The string module

The string module contains useful functions that manipulate strings. As usual, we have to import the module before we can use it:

To see what is inside it, use the dir function with the module name as an argument.

This returns the list of items inside the string module:

['Template', '_TemplateMetaclass', '__builtins__', '__doc__', '__file__', '__name__', '_float', '_idmap', '_idmapL', '_int', '_long', '_multimap', '_re', 'ascii_letters', 'ascii_lowercase', 'ascii_uppercase', 'atof', 'atof_error', 'atoi', 'atoi_error', 'atol', 'atol_error', 'capitalize', 'capwords', 'center', 'count', 'digits', 'expandtabs', 'find', 'hexdigits', 'index', 'index_error', 'join', 'joinfields', 'letters', 'ljust', 'lower', 'lowercase', 'lstrip', 'maketrans', 'octdigits', 'printable', 'punctuation', 'replace', 'rfind', 'rindex', 'rjust', 'rsplit', 'rstrip', 'split', 'splitfields', 'strip', 'swapcase', 'translate', 'upper', 'uppercase', 'whitespace', 'zfill']

To find out more about an item in this list, we can use the type function. We need to specify the module name followed by the item using dot notation.

Since string.digits is a string, we can print it to see what it contains:

Not surprisingly, it contains each of the decimal digits.
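The steps so far can be sketched as a short script. The variable names are illustrative, and the exact list returned by dir depends on the Python version, so the sketch only checks that digits is among the names:

```python
import string

# dir lists the names defined in the module.
names = dir(string)
has_digits = "digits" in names

# type reports what kind of object a name refers to; digits is a string.
digits_is_str = type(string.digits) is str

print(string.digits)   # 0123456789
```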

string.find is a function which does much the same thing as the function we wrote. To find out more about it, we can print out its docstring, __doc__, which contains documentation on the function:

The parameters in square brackets are optional parameters. We can use string.find much as we did our own find:

This example demonstrates one of the benefits of modules --- they help avoid collisions between the names of built-in functions and user-defined functions. By using dot notation we can specify which version of find we want.

Actually, string.find is more general than our version. It can find substrings, not just characters:

Like ours, it takes an additional argument that specifies the index at which it should start:

Unlike ours, its second optional parameter specifies the index at which the search should end:

In this example, the search fails because the letter b does not appear in the index range from 1 to 2 (not including 2).
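The module-level function string.find is available only in Python 2. As a sketch that runs under both Python 2 and Python 3, the same searches can be written with the string method find, which accepts the same optional start and end arguments. The sample strings here are assumptions:

```python
fruit = "banana"

# Searching for a single character or for a longer substring.
first_a = fruit.find("a")           # 1
first_na = fruit.find("na")         # 2

# An optional start index: begin looking at index 3.
next_na = fruit.find("na", 3)       # 4

# Optional start and end indices: only index 1 of "bob" is searched,
# and it holds "o", so the search fails.
not_found = "bob".find("b", 1, 2)   # -1
```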

7.12 Character classification

It is often helpful to examine a character and test whether it is upper- or lowercase, or whether it is a letter or a digit. The string module provides several constants that are useful for these purposes. One of these, string.digits, we have already seen.

The string string.lowercase contains all of the letters that the system considers to be lowercase. Similarly, string.uppercase contains all of the uppercase letters. Try the following and see what you get:

We can use these constants and find to classify characters. For example, if find(lowercase, ch) returns a value other than -1, then ch must be lowercase:

Alternatively, we can take advantage of the in operator:

As yet another alternative, we can use the comparison operator:

If ch is between a and z, it must be a lowercase letter.
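The three approaches might be sketched as follows. The function names are hypothetical, and the sketch swaps in string.ascii_lowercase (which appears in the module listing above and exists in both Python 2 and Python 3) for string.lowercase, and the string method find for the find function. Each version assumes ch is a single character:

```python
import string

# string.ascii_lowercase is "abcdefghijklmnopqrstuvwxyz".
LOWERCASE = string.ascii_lowercase

def is_lower_find(ch):
    """Version 1: search the constant for ch; -1 means not found."""
    return LOWERCASE.find(ch) != -1

def is_lower_in(ch):
    """Version 2: use the in operator."""
    return ch in LOWERCASE

def is_lower_compare(ch):
    """Version 3: check that ch falls between 'a' and 'z'."""
    return 'a' <= ch <= 'z'
```

All three return True for "k" and False for "K" or "5".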

Another constant defined in the string module may surprise you when you print it:

Whitespace characters move the cursor without printing anything. They create the white space between visible characters (at least on white paper). The constant string.whitespace contains all the whitespace characters, including space, tab (\t), and newline (\n).
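Because printing whitespace shows nothing visible, a membership test or repr is a more informative way to inspect it. This is a small sketch with illustrative variable names:

```python
import string

# Check that the three characters named above are in the constant.
has_space = " " in string.whitespace
has_tab = "\t" in string.whitespace
has_newline = "\n" in string.whitespace

# repr makes invisible characters visible as escape sequences.
visible = repr("a\tb\n")
print(visible)   # 'a\tb\n'
```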

There are other useful functions in the string module, but this book isn't intended to be a reference manual. On the other hand, the Python Library Reference is. Along with a wealth of other documentation, it's available from the Python website, http://www.python.org.

7.13 String formatting

The most concise and powerful way to format a string in Python is to use the string formatting operator, %, together with Python's string formatting operations. To see how this works, let's start with a few examples:

The syntax for the string formatting operation looks like this:

It begins with a format which contains a sequence of characters and conversion specifications. Conversion specifications start with a % character. Following the format string is a single % and then a sequence of values, one per conversion specification, separated by commas and enclosed in parentheses. The parentheses are optional if there is only a single value.

In the first example above, there is a single conversion specification, %s, which indicates a string. The single value, "Arthur", maps to it, and is not enclosed in parentheses.

In the second example, name has string value, "Alice", and age has integer value, 10. These map to the two conversion specifications, %s and %d. The d in the second conversion specification indicates that the value is a decimal integer.

In the third example variables n1 and n2 have integer values 4 and 5 respectively. There are four conversion specifications in the format string: three %d's and a %f. The f indicates that the value should be represented as a floating point number. The four values that map to the four conversion specifications are: 2**10, n1, n2, and n1 * n2.
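The three examples described above can be sketched like this. The values and conversion specifications follow the description; the surrounding wording of each format string and the names first, second, and third are assumptions:

```python
# One value: parentheses around it are optional.
first = "His name is %s." % "Arthur"

# Two values: a string for %s and an integer for %d.
name = "Alice"
age = 10
second = "I am %s and I am %d years old." % (name, age)

# Four values: three integers for %d and one value shown as a float with %f.
n1 = 4
n2 = 5
third = "2**10 = %d and %d * %d = %f" % (2**10, n1, n2, n1 * n2)

print(first)    # His name is Arthur.
print(second)   # I am Alice and I am 10 years old.
print(third)    # 2**10 = 1024 and 4 * 5 = 20.000000
```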

s, d, and f are all the conversion types we will need for this book. To see a complete list, see the String Formatting Operations section of the Python Library Reference.

The following example illustrates the real utility of string formatting:

This program prints out a table of various powers of the numbers from 1 to 10. In its current form it relies on the tab character (\t) to align the columns of values, but this breaks down when the values in the table get larger than the 8-character tab width:

i       i**2    i**3    i**5    i**10   i**20
1       1       1       1       1       1
2       4       8       32      1024    1048576
3       9       27      243     59049   3486784401
4       16      64      1024    1048576         1099511627776
5       25      125     3125    9765625         95367431640625
6       36      216     7776    60466176        3656158440062976
7       49      343     16807   282475249       79792266297612001
8       64      512     32768   1073741824      1152921504606846976
9       81      729     59049   3486784401      12157665459056928801
10      100     1000    100000  10000000000     100000000000000000000

One possible solution would be to change the tab width, but the first column already has more space than it needs. The best solution would be to set the width of each column independently. As you may have guessed by now, string formatting provides the solution:

Running this version produces the following output:

i   i**2 i**3  i**5    i**10        i**20          
1   1    1     1       1            1              
2   4    8     32      1024         1048576        
3   9    27    243     59049        3486784401     
4   16   64    1024    1048576      1099511627776  
5   25   125   3125    9765625      95367431640625 
6   36   216   7776    60466176     3656158440062976
7   49   343   16807   282475249    79792266297612001
8   64   512   32768   1073741824   1152921504606846976
9   81   729   59049   3486784401   12157665459056928801
10  100  1000  100000  10000000000  100000000000000000000

The - after each % in the conversion specifications indicates left justification. The numerical values specify the minimum width, so %-13d is a left justified number at least 13 characters wide.
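A sketch of a program that produces a table like the one above follows. The column widths 4, 5, 6, 8, 13, and 15 are inferred from the spacing of that output, and the helper name format_row is an assumption:

```python
def format_row(i):
    """Format one table row with a fixed minimum width per column."""
    return "%-4d%-5d%-6d%-8d%-13d%-15d" % (i, i**2, i**3, i**5, i**10, i**20)

# The header uses %s with the same widths so the titles line up with the numbers.
header = "%-4s%-5s%-6s%-8s%-13s%-15s" % (
    "i", "i**2", "i**3", "i**5", "i**10", "i**20")

print(header)
for i in range(1, 11):
    print(format_row(i))
```

Values wider than their column, such as the last column of the rows for 6 through 10, are printed in full rather than truncated, because the number gives only a minimum width.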

7.14 Summary and First Exercises

This chapter introduced a lot of new ideas. The following summary and set of exercises may prove helpful in remembering what you learned:

Exercises:

  1. Write the Python interpreter's evaluation of each of the following expressions:

    • >>> 'Python'[1]
    • >>> "Strings are sequences of characters."[5]
    • >>> len("wonderful")
    • >>> 'Mystery'[:4]
    • >>> 'p' in 'Pinapple'
    • >>> 'apple' in 'Pinapple'
    • >>> 'pear' in 'Pinapple'
    • >>> 'apple' > 'pinapple'
    • >>> 'pinapple' < 'Peach'
  2. Write Python code to make each of the following doctests pass:
    *
    *
    *

7.15 Glossary

7.16 Exercises

Question 1

Modify:

prefixes = "JKLMNOPQ"
suffix = "ack"

for letter in prefixes:
    print letter + suffix

so that Ouack and Quack are spelled correctly.

Question 2

Encapsulate:

fruit = "banana"
count = 0
for char in fruit:
    if char == 'a':
        count += 1
print count

in a function named count_letters, and generalize it so that it accepts the string and the letter as arguments.

Question 3

Now rewrite the count_letters function so that instead of traversing the string, it repeatedly calls find (the version from Optional parameters), with the optional third parameter to locate new occurrences of the letter being counted.

Question 4

Which version of is_lower do you think will be fastest? Can you think of other reasons besides speed to prefer one version or the other?

Question 5

  • Create a file named stringtools.py and put the following in it:

    Add a function body to reverse to make the doctests pass.
  • Add mirror to stringtools.py.

    Write a function body for it that will make it work as indicated by the doctests.
  • Include remove_letter in stringtools.py.

    Write a function body for it that will make it work as indicated by the doctests.
  • Finally, add bodies to each of the following functions, one at a time, until all the doctests pass.
  • Try each of the following formatted string operations in a Python shell and record the results:

    1. "%s %d %f" % (5, 5, 5)
    2. "%-.2f" % 3
    3. "%-10.2f%-10.2f" % (7, 1.0/2)
    4. print " $%5.2f\n $%5.2f\n $%5.2f" % (3, 4.5, 11.2)
  • The following formatted strings have errors. Fix them:

    1. "%s %s %s %s" % ('this', 'that', 'something')
    2. "%s %s %s" % ('yes', 'no', 'up', 'down')
    3. "%d %f %f" % (3, 3, 'three')