Python 3 String-Processing Causing Problems?

by David Bolton Oct 29, 2014 7 min read

Widely known as a general-purpose programming language, Python is excellent at string handling—but a few things have changed between Python 2 and Python 3. This article is a reminder of what Python strings can (still) do for you, as well as a look at what you need to know about Python 3 strings.

We discussed some of these Python 3 changes in a previous article. Python 3 strings are Unicode, and UTF-8 is the default source encoding. Encoded as UTF-8, a character takes one to four bytes, with the character codes 0-127 being the same as ASCII. While many U.S. programmers may never need to go outside the ASCII range, you should definitely know something about Unicode.
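To make the "one to four bytes" point concrete, here is a small sketch that encodes four characters as UTF-8 and counts the bytes each one needs. The sample characters are my own picks for illustration, and ascii() is used for printing only so the output survives a non-Unicode console.

```python
# Sample characters chosen for illustration: A, the pound sign,
# the euro sign, and an emoji from outside the Basic Multilingual Plane.
samples = ["A", "\u00a3", "\u20ac", "\U0001F600"]

byte_lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")        # str -> bytes
    byte_lengths.append(len(encoded))
    # Each sample is one character (len(ch) == 1) however many bytes it needs.
    print(ascii(ch), "U+%04X" % ord(ch), len(ch), "char,", len(encoded), "byte(s)")
```

The character count stays at one in every case; only the encoded size changes.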

For the examples here, I’m using Python 3.4.2, the most current version at the time of writing. (This discussion also touches on the str methods, which means we’ll look at join() further on; join() itself is not new in Python 3 and works the same way in Python 2.)

Unlike C#, which distinguishes strings from chars by using double quotes for strings and single quotes for chars, Python uses both single quotes and double quotes for strings. The key is consistency: If it starts with a single quote, it must end with a single quote.

To have either type of quotation mark as part of the string, use a backslash prefix to escape it, so:

print('\"Jackie\"')

outputs “Jackie”.

If you want the raw string without any escaping, just add an r prefix:

a = '\"Fred\"'
b = r'\"Fred\"'
print(a, b)

This outputs "Fred" \"Fred\"
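As a quick check of that difference, the sketch below compares the lengths of the two literals. The Windows-style path at the end is a made-up example of where a raw string is handy; it is not from the discussion above.

```python
a = '\"Fred\"'     # escapes are processed, so the backslashes disappear
b = r'\"Fred\"'    # raw string: the backslashes stay in the string

print(len(a), len(b))   # b is longer by one character per backslash

# Hypothetical path: without the r prefix, \n and \t would be read
# as newline and tab escapes.
path = r'C:\new\table.txt'
print(path)
```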

PHP heredoc in Python

In PHP you can define a multi-line string, called a heredoc, by opening it with <<< followed by an identifier and closing it with that same identifier. In Python, this is done with three single quotes or three double quotes. If the text is split over multiple lines, the line breaks are included, unless you put a backslash at the end of each line to show that it’s continuing and not breaking:

a = """This is an
interesting piece of literature
but not quite a classic!
"""
b = """This is also an \
interesting piece of literature \
but not quite a classic…\
"""
print(a)
print(b)

The output looks like this:

This is an
interesting piece of literature
but not quite a classic!

This is also an interesting piece of literature but not quite a classic…
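If you want to see exactly where the line breaks end up, repr() makes them visible. This is a cut-down sketch using shorter strings of my own rather than the text above.

```python
a = """line one
line two
"""
b = """line one \
line two\
"""

# repr() shows embedded newlines as \n instead of breaking the line.
print(repr(a))
print(repr(b))
```

The first string keeps both newlines, while the backslash-continued one contains none at all.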

Accessing Chars in Strings

The len(string) function returns the length in characters. Here’s an example with a Unicode escape for the £ sign:

a = """This is an \
interesting piece of literature
but not quite a classic and costs \u00a3 10.00
"""
print(a)
print(len(a))

This outputs:

This is an interesting piece of literature
but not quite a classic and costs £ 10.00

85

The \u00a3 escape counts as one character, so removing it makes len(a) return 84. If you want to know more about text and UTF-8, I recommend Nick Coghlan’s excellent notes on processing text files.
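Getting at individual characters is a matter of indexing. Here is a minimal sketch using a shortened string of my own; ord() and chr() are built-ins not covered above, included only to show the code point behind the £ sign.

```python
s = "costs \u00a3 10.00"

print(s[0])     # first character
print(s[-1])    # last character; negative indexes count from the end

pound = s[6]    # the single character produced by the \u00a3 escape
print(ascii(pound), hex(ord(pound)))

# chr() goes the other way, from code point back to character.
print(chr(0xa3) == pound)
```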

Strings and Sequences

A Python 3 string is a sequence of Unicode characters, and it supports the usual sequence operators that also apply to lists, tuples, bytes and more. These include in and not in, + for concatenation, * for repetition, indexing individual items, slicing, the len, min and max functions, and comparisons.
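Here is a short sketch that runs through most of those operators on one sample word. The strings are arbitrary choices of mine.

```python
s = "literature"

print(s[0], s[-1])              # indexing
print(s[0:3])                   # slicing -> 'lit'
print("-" * 10)                 # repetition
print(len(s), min(s), max(s))   # min/max compare characters by code point
print("apple" < "banana")       # comparison works character by character
print("x" not in s)             # membership test
```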

The ‘in’ operator works like this:

a = "This is an interesting piece of literature"
if 'an' in a:
    print('in')
else:
    print('not in')

This prints “in”.

Concatenation is the same as in Python 2.

a = "A "
a = a + "word"
print(a)

This outputs “A word”. You can use the C-style += operator as well:

a += "word "

Java and C# developers will spot a potential issue with Python string concatenation; think StringBuilder. If you are concatenating a number of strings, it’s far more efficient in Java/C# to use a StringBuilder class, but Python doesn’t have one. Python strings are immutable, which means that changing a string’s value causes a new memory buffer to be allocated and the new value stored there. The original string in memory isn’t changed, which lets Python share strings safely and use them as dictionary keys.

Because of this immutability, doing a string concatenation in a loop is memory-intensive, as it has to allocate memory for the new string with every iteration.
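To see the immutability for yourself, and the usual way around the loop problem, here is a minimal sketch. The variable names are mine, and the loop is far too small to show any speed difference; it only illustrates the two patterns.

```python
word = "word"
try:
    word[0] = "W"               # strings do not support item assignment
except TypeError as exc:
    print("TypeError:", exc)

# Pattern 1: += in a loop builds a new string on each pass.
result = ""
for i in range(5):
    result += str(i)

# Pattern 2: collect the pieces in a list and join them once at the end.
pieces = []
for i in range(5):
    pieces.append(str(i))
joined = "".join(pieces)

print(result, joined)
```

Both patterns give the same text; the second simply avoids creating an intermediate string on each iteration.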

Why Is There No StringIO Function in Python 3?

A faster way of concatenating strings is to use a StringIO object, which is a text buffer held in memory. In Python 2 it came from the StringIO and cStringIO modules; those modules no longer exist in Python 3. Instead, the io package provides a StringIO class:

import io
out = io.StringIO()
for i in range(10):
    print(i, file=out)
print(out.getvalue())
out.close()

The first print sends the loop values (0-9) to the in-memory file “out,” one number per line. The second print calls out.getvalue(), which returns all the text in one go.
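If you don’t want a newline after every item, you can call write() on the buffer directly. The sketch below is my own variation, not the original example: it also uses a with block so the buffer is closed automatically.

```python
import io

with io.StringIO() as buf:      # the with block closes the buffer on exit
    for i in range(3):
        buf.write(str(i))       # write() takes a str and adds no newline
        buf.write(",")
    text = buf.getvalue()       # must be read before the buffer is closed

print(text)
```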

Yet another way to join strings is to use the str.join(iterable) method; see the Python 3 string methods. The string you call join() on is used as the separator, and the parameter is an iterable, that is, an object which can return its members one by one. In the example below, I’ve used a list of the four numbers in an IP address and combined them into the address:

import io
out = io.StringIO()
a = "."
a = a.join(['174', '129', '202', '211'])
print(a, file=out)
print(out.getvalue())
out.close()
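One thing to watch with join() is that every item must already be a string. Here is a small sketch of the failure and the usual fix, using the same address but stored as integers, which is my own variation on the example.

```python
octets = [174, 129, 202, 211]       # integers this time, not strings

try:
    ".".join(octets)                # join() refuses non-string items
except TypeError as exc:
    print("TypeError:", exc)

# Convert each item to str on the way in.
address = ".".join(str(n) for n in octets)
print(address)

# split() reverses the operation and gives back a list of strings.
print(address.split("."))
```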

If you’re converting from Python 2, you should be aware of these string differences. Most fundamental is the Unicode nature of strings now, which puts Python 3 up there with Java and C# for Unicode support.
