Working with strings in Python

Strings are a crucial class of data because they represent textual information. I don’t recall any of my projects where I didn’t encounter strings, so it’s important to become familiar with the various ways to manipulate and work with them.

Of course I won’t be able to cover all methods, but I’ll keep some important ones here for my future reference.

# We can concatenate strings by combining them.
'Hello ' + 'world'
'Hello world'
# Strings can be multiplied by integers.
danger = 'Danger! '
danger * 3
'Danger! Danger! Danger! '
# \ is an escape character that modifies the character that follows it.
quote = "\"It's good to be alive!\""
print(quote)
"It's good to be alive!"
# \n creates a newline.
greeting = "Good day,\nlady."
print(greeting)
Good day,
lady.
# We can loop over strings.
python = 'Python'
for letter in python:
    print(letter + 'ut')
Put
yut
tut
hut
out
nut

String Indexing

# The index() method returns the index of the character's first occurrence in the string.
pets = 'cats and dogs'
pets.index('s')
3
# Access the character at a given index of a string.
name = 'Respect'
name[0]
'R'
# Negative indexing begins at the end of the string.
sentence = 'That must be love!'
sentence[-1]
'!'

String Slicing

# Access a substring by using a slice.
color = 'orange'
color[1:4]
'ran'
# Omitting the first value of the slice implies a value of 0.
fruit = 'pineapple'
fruit[:4]
'pine'
# Omitting the last value of the slice implies a value of len(string).
fruit[4:]
'apple'
# The `in` keyword returns a Boolean indicating whether a substring is in the string.
'banana' in fruit
'apple' in fruit
False
True

Remember that in a slice the starting index is inclusive and the stopping index is exclusive, and that negative indices count from the end of the sequence.
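
A few more slices make the inclusive-start, exclusive-stop rule concrete. This is only a small sketch that reuses the fruit string from above; the variable names first_four, last_five, and middle are made up for illustration.

```python
fruit = 'pineapple'

# The start index (0) is included, the stop index (4) is excluded.
first_four = fruit[0:4]

# Negative indices count from the end: this takes the last five characters.
last_five = fruit[-5:]

# Both can be mixed: from index 2 up to (not including) the last two characters.
middle = fruit[2:-2]

print(first_four, last_five, middle)
```

The number of characters in a slice with non-negative bounds is simply stop minus start, which is one reason the stop index is exclusive.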

Format Strings

The format() method formats and inserts specific substrings into designated places within a larger string.

# We use format() method to insert values into our string, indicated by braces.
name = 'Samuel'
number = 7
print('Hello {}, your lucky number is {}.'.format(name, number))
Hello Samuel, your lucky number is 7.
# We can assign names to designate how we want values to be inserted.
print('Hello {name}, your lucky number is {num}.'.format(num=number, name=name))
Hello Samuel, your lucky number is 7.
# Or we can use argument indices to designate how we want values to be inserted.
print('Hello {1}, your lucky number is {0}.'.format(number, name))
Hello Samuel, your lucky number is 7.

If we have a long string, we can enclose it in triple quotes, which lets us break the string over multiple lines.

x = 'values'
y = 100

print('''String formatting lets you insert {} into strings.
They can even be numbers, like {}.'''.format(x, y))
String formatting lets you insert values into strings.
They can even be numbers, like 100.

We can also reuse the same argument index more than once.

print('{0}{1}{0}'.format('abra', 'cad'))
abracadabra

More samples with numbers:

# Example inserting prices into string
price = 7.75
with_tax = price * 1.07
print('Base price: ${} USD. \nWith tax: ${} USD.'.format(price, with_tax))
Base price: $7.75 USD. 
With tax: $8.2925 USD.
# Use :.2f to round a float value to two places beyond the decimal.
print('Base price: ${:.2f} USD. \nWith tax: ${:.2f} USD.'.format(price, with_tax))
Base price: $7.75 USD. 
With tax: $8.29 USD.

Let’s rewrite the converter code from previous posts.

# Define a function that converts Fahrenheit to Celsius.
def to_celsius(x):
    return (x-32) * 5/9

# Create a temperature conversion table using string formatting
for x in range(0, 101, 10):
    print("{:>4} F | {:>6.2f} C".format(x, to_celsius(x)))
   0 F | -17.78 C
  10 F | -12.22 C
  20 F |  -6.67 C
  30 F |  -1.11 C
  40 F |   4.44 C
  50 F |  10.00 C
  60 F |  15.56 C
  70 F |  21.11 C
  80 F |  26.67 C
  90 F |  32.22 C
 100 F |  37.78 C

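We use ‘>’ to right-align a value within its field; the number after it is the minimum field width. So ‘>4’ right-aligns the value in a field four characters wide, and ‘>6.2f’ right-aligns a float with two decimal places in a field six characters wide.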
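
To compare the alignment options side by side, here is a minimal sketch. The value and the square brackets are my own additions, used only to make the padding visible; ‘<’ and ‘^’ are the left-align and center counterparts of ‘>’ in Python's format specification.

```python
value = 3.14159

# '<' left-aligns, '>' right-aligns, '^' centers.
# The 8 is the field width and .2f keeps two decimal places.
left = '[{:<8.2f}]'.format(value)
right = '[{:>8.2f}]'.format(value)
center = '[{:^8.2f}]'.format(value)

print(left)
print(right)
print(center)
```

All three results have the same total length, because the field width only controls how much padding is added around the formatted number.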

Literal string interpolation (f-strings) 

We can use literal string interpolation, also known as f-strings (with Python version 3.6+), to further minimize the syntax required to embed expressions into strings. 

var_a = 1
var_b = 2
print(f'{var_a} + {var_b}')
print(f'{var_a + var_b}')
print(f'var_a = {var_a} \nvar_b = {var_b}')
1 + 2
3
var_a = 1
var_b = 2

The same format specifiers work in f-strings. To round a float to two decimal places:

num = 1000.987123
f'{num:.2f}'
'1000.99'

Here are some of the most common presentation types:

Type – Meaning
‘e’ – Scientific notation. For a given precision p, formats the number in scientific notation with the letter ‘e’ separating the coefficient from the exponent. The coefficient has one digit before and p digits after the decimal point, for a total of p + 1 significant digits. With no precision given, ‘e’ uses a precision of 6 digits after the decimal point for float, and shows all coefficient digits for Decimal.
‘f’ – Fixed-point notation. For a given precision p, formats the number as a decimal number with exactly p digits following the decimal point.
‘%’ – Percentage. Multiplies the number by 100 and displays it in fixed (‘f’) format, followed by a percent sign.

Here are some examples:

num = 1000.987123
print(f'{num:.3e}')
1.001e+03
decimal = 0.2497856
print(f'{decimal:.4%}')
24.9786%

String Methods

str.count(sub[, start[, end]])

Return the number of non-overlapping occurrences of substring sub in the range [start, end]. The optional arguments start and end are interpreted as in slice notation, so end is exclusive.

my_string = 'Happy birthday'
print(my_string.count('y'))
print(my_string.count('y', 2, 13))
2
1

str.find(sub)

Return the lowest index in the string where the substring sub is found. Return -1 if sub is not found.

my_string = 'Happy birthday'
my_string.find('birth')
6
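
It's worth contrasting find() with the index() method from earlier, since they differ only in how they report a missing substring. A small sketch follows; the search word 'cake' and the variable names are just illustrative.

```python
my_string = 'Happy birthday'

# find() signals "not found" with -1 instead of raising an error.
missing = my_string.find('cake')
print(missing)

# index() raises ValueError in the same situation.
try:
    my_string.index('cake')
    raised = False
except ValueError as error:
    raised = True
    print(f'index() raised ValueError: {error}')
```

Because -1 is also a valid index (the last character), it's safer to compare the result of find() with -1 explicitly than to use it directly as an index.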
str.join(iterable)

Return a string which is the concatenation of the strings in iterable. The separator between elements is the string providing this method.

separator_string = ' '
iterable_of_strings = ['Happy', 'birthday', 'to', 'you']
separator_string.join(iterable_of_strings)
'Happy birthday to you'

str.partition(sep)

Split the string at the first occurrence of sep, and return a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing the string itself, followed by two empty strings.

my_string = 'https://www.google.com/'
my_string.partition('.')
('https://www', '.', 'google.com/')
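
Here is a short sketch of both cases described above: a separator that is found and one that is not. The separators '://' and '#' are my own choices for the illustration.

```python
url = 'https://www.google.com/'

# Separator found: the 3-tuple unpacks into before, separator, after.
before, sep, after = url.partition('://')
print(before, sep, after)

# Separator not found: the whole string comes back, followed by two empty strings.
no_match = url.partition('#')
print(no_match)
```

Since partition() always returns exactly three items, the unpacking never fails, even when the separator is missing.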
str.replace(old, new[, count])

Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

my_string = 'https://www.google.com/'
my_string.replace('google', 'youtube')
'https://www.youtube.com/'
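
The optional count argument is easiest to see on a string with repeated words. This is a minimal sketch; the sample sentence and variable names are invented for the example.

```python
text = 'wheat in a wheat field of wheat'

# Without count, every occurrence is replaced.
all_replaced = text.replace('wheat', 'corn')

# With count=1, only the first occurrence is replaced.
first_replaced = text.replace('wheat', 'corn', 1)

print(all_replaced)
print(first_replaced)
print(text)  # the original string is unchanged, since replace() returns a copy
```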
str.split([sep])

Return a list of the words in the string, using sep (optional) as the delimiter string. If no sep is given, whitespace characters are used as the delimiter. Any run of consecutive whitespace counts as a single split point, so a single space splits the same way as two or more spaces in a row.

my_string = 'Do you know the muffin man?'
my_string.split()
['Do', 'you', 'know', 'the', 'muffin', 'man?']
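
The difference between calling split() with no argument and with an explicit separator shows up as soon as delimiters repeat. A small sketch, with sample strings of my own:

```python
spaced = 'a  b'  # two spaces between the letters

# No sep: consecutive whitespace is treated as one delimiter.
default_split = spaced.split()

# Explicit sep: every single space is a delimiter, so an empty string appears.
explicit_split = spaced.split(' ')

# The same applies to other separators, such as commas.
csv_line = 'red,green,,blue'
csv_fields = csv_line.split(',')

print(default_split)
print(explicit_split)
print(csv_fields)
```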

Regular expressions

Regular expressions, also known as regex, are patterns that describe sets of strings; they are used to search, match, and modify string data. Since the topic requires a more advanced level of knowledge, I’ll keep these basic notes just for future reference.

In Python, regex works by matching a pattern against a string, which lets us search for specific patterns of text within a larger text. Regex is used extensively in web scraping, text processing and cleaning, and data analysis.

The first step in working with regular expressions is to import the re module. This module provides the tools necessary for working with regular expressions. Once we have imported the module, we can start working with regular expressions.

import re
my_string = 'Three sad tigers swallowed wheat in a wheat field'
re.search('wall', my_string)
<_sre.SRE_Match object; span=(18, 22), match='wall'>

The example above returns a match object that contains information about the search. In this case, it tells us that the substring ‘wall’ occurs in the string at indices 18 through 21: the span (18, 22) includes the start index and excludes the end index.
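
The match object can also be queried directly instead of reading its printed form. Here is a minimal sketch using the group(), span(), start(), and end() methods of match objects; the word 'lion' is just an example of a pattern that does not occur.

```python
import re

my_string = 'Three sad tigers swallowed wheat in a wheat field'

match = re.search('wall', my_string)
if match:
    matched_text = match.group()   # the text that matched
    matched_span = match.span()    # (start, end), with end exclusive
    sliced = my_string[match.start():match.end()]
    print(matched_text, matched_span, sliced)

# When nothing matches, re.search returns None rather than a match object.
missing = re.search('lion', my_string)
print(missing)
```

Checking the result before calling methods on it avoids an AttributeError when the pattern is not found.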

Regex is especially useful because it allows us a very high degree of customization when performing our searches.

import re
my_string = 'Three sad tigers swallowed wheat in a wheat field'
re.search('[bms]ad', my_string)
<_sre.SRE_Match object; span=(6, 9), match='sad'>

This example matches “bad,” “mad,” or “sad.” Note that re.search returns only the first match it finds, which here is “sad.”
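
To collect every match of a character-class pattern rather than just the first, re.findall can be used. This is a small sketch; the sample sentence is invented so that all three words appear.

```python
import re

text = 'A sad cat sat on a mad bat, too bad.'

# [bms] matches exactly one character: 'b', 'm', or 's'.
first = re.search('[bms]ad', text).group()

# findall returns all non-overlapping matches as a list of strings.
all_matches = re.findall('[bms]ad', text)

print(first)
print(all_matches)
```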


Let’s practice more:

String concatenation

  • Define a function called zip_checker that accepts the following argument:
    • zipcode – a string with either four or five characters
  • Return:
    • If zipcode has five characters, and the first two characters are NOT ’00’, return zipcode as a string. Otherwise, return ‘Invalid ZIP Code.’. (ZIP Codes do not begin with 00 in the mainland U.S.)
    • If zipcode has four characters and the first character is NOT ‘0’, the function must add a zero to the beginning of the string and return the five-character zipcode as a string.
    • If zipcode has four characters and the first character is ‘0’, the function must return ‘Invalid ZIP Code.’.
Example:
[IN] zip_checker('02806')
[OUT] '02806'

[IN] zip_checker('2806')
[OUT] '02806'

[IN] zip_checker('0280')
[OUT] 'Invalid ZIP Code.'

[IN] zip_checker('00280')
[OUT] 'Invalid ZIP Code.'

def zip_checker(zipcode):
    if len(zipcode) == 5:
        if zipcode[0:2] == '00':
            return 'Invalid ZIP Code.'
        else:
            return zipcode
    elif zipcode[0] != '0':
        zipcode = '0' + zipcode
        return zipcode
    else:
        return 'Invalid ZIP Code.'

To test the code above:

print(zip_checker('02806'))     # Should return 02806.
print(zip_checker('2806'))      # Should return 02806.
print(zip_checker('0280'))      # Should return 'Invalid ZIP Code.'
print(zip_checker('00280'))     # Should return 'Invalid ZIP Code.'
02806
02806
Invalid ZIP Code.
Invalid ZIP Code.

Of course there are other ways to solve this problem. My initial solution was as follows:

def zip_checker(zipcode):
    zipcode = str(zipcode)
    if len(zipcode) == 5 and zipcode[0:2] != "00":
        return zipcode
    elif len(zipcode) == 5 and zipcode[0:2] == "00":
        return "Invalid ZIP Code."
    elif len(zipcode) == 4 and zipcode[0] != "0":
        zipcode = "0" + zipcode
        return zipcode
    elif len(zipcode) == 4 and zipcode[0] == "0":
        return "Invalid ZIP Code."
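
Yet another possible take, sketched here only as an illustration: pad the string first with str.zfill and then validate once. The function name zip_checker_compact is hypothetical, and the sketch assumes, as the exercise does, that the input is a string of four or five characters.

```python
def zip_checker_compact(zipcode):
    """Hypothetical variant: pad four-character codes, then check once."""
    # zfill(5) adds a leading zero to a four-character string
    # and leaves a five-character string unchanged.
    padded = zipcode.zfill(5)
    # A padded code starting with '00' was either '00xxx' or '0xxx':
    # both are invalid under the rules above.
    if padded.startswith('00'):
        return 'Invalid ZIP Code.'
    return padded


for code in ['02806', '2806', '0280', '00280']:
    print(zip_checker_compact(code))
```

This works because both invalid cases collapse into the same condition after padding, so only one check is needed.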

String extraction

  • The correct URL protocol is https: Anything else is invalid.
  • A valid store ID must have exactly seven characters.

Define a function called url_checker that accepts the following argument:

  • url – a URL string

Return:

  • If both the protocol and the store ID are invalid:
    • print two lines:
      ‘{protocol} is an invalid protocol.’
      ‘{store_id} is an invalid store ID.’
  • If only the protocol is invalid:
    • print:
      ‘{protocol} is an invalid protocol.’
  • If only the store ID is invalid:
    • print:
      ‘{store_id} is an invalid store ID.’
  • If both the protocol and the store ID are valid, return the store ID.
# Sample valid URL for reference while writing our function:
url = 'https://exampleURL1.com/r626c36'

def url_checker(url):
    url = url.split('/')
    protocol = url[0]
    store_id = url[-1]
    # If both protocol and store_id bad
    if protocol != 'https:' and len(store_id) != 7:
        # sep='\n' puts each message on its own line without a trailing space
        print(f'{protocol} is an invalid protocol.',
              f'{store_id} is an invalid store ID.', sep='\n')
    # If just protocol bad
    elif protocol != 'https:':
        print(f'{protocol} is an invalid protocol.')
    # If just store_id bad
    elif len(store_id) != 7:
        print(f'{store_id} is an invalid store ID.')
    # If all ok
    else:
        return store_id

To test the code above:

# The expected output is shown in the comments.
url_checker('http://exampleURL1.com/r626c3')    # 'http: is an invalid protocol.'
                                                # 'r626c3 is an invalid store ID.'
print()
url_checker('ftps://exampleURL1.com/r626c36')   # 'ftps: is an invalid protocol.'
print()
url_checker('https://exampleURL1.com/r626c3')   # 'r626c3 is an invalid store ID.'
print()
url_checker('https://exampleURL1.com/r626c36')  # 'r626c36'
http: is an invalid protocol.
r626c3 is an invalid store ID.

ftps: is an invalid protocol.

r626c3 is an invalid store ID.

'r626c36'

Once again, my initial solution was different:

def url_checker(url):
    [protocol, url_part] = url.split(":")
    [url_part, store_id] = url_part.split(".com/")

    check_protocol = None
    if len(protocol) == 4:
        check_protocol = False
    elif len(protocol) == 5:
        check_protocol = True

    check_id = None
    if len(store_id) == 7:
        check_id = True
    elif len(store_id) < 7:
        check_id = False

    if not check_protocol and not check_id:
        print(f'{protocol}: is an invalid protocol \n{store_id} is an invalid store ID.')
    elif not check_protocol and check_id:
        print(f'{protocol}: is an invalid protocol')
    elif check_protocol and not check_id:
        print(f'{store_id} is an invalid store ID.')
    elif check_protocol and check_id:
        return store_id

It’s much longer and harder to understand. In the last part I even forgot to use the ‘else’ statement 🙂 But that’s the fun part of learning, isn’t it?
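
For comparison, here is one more sketch that reuses str.partition from the String Methods notes and collects the error messages in a list, so each condition is checked only once. The name url_checker_v2 and the problems list are hypothetical; this is just one possible arrangement, not the reference solution.

```python
def url_checker_v2(url):
    """Hypothetical variant: gather messages first, then print or return."""
    # 'https://host/id' -> protocol 'https:', rest 'host/id'
    protocol, _, rest = url.partition('//')
    # The store ID is whatever follows the last '/'.
    store_id = rest.rpartition('/')[2]

    problems = []
    if protocol != 'https:':
        problems.append(f'{protocol} is an invalid protocol.')
    if len(store_id) != 7:
        problems.append(f'{store_id} is an invalid store ID.')

    if problems:
        print('\n'.join(problems))
        return None
    return store_id


url_checker_v2('http://exampleURL1.com/r626c3')
print(url_checker_v2('https://exampleURL1.com/r626c36'))
```

With two independent checks, the "both invalid" case no longer needs its own branch: both messages simply end up in the list.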


In