Regular expressions (regex) are a powerful tool for matching patterns in text. Pandas supports regular expressions in many of its .str methods, so you can extract and manipulate text based on patterns.
a. Extracting Patterns with .str.extract()
You can use .str.extract() to pull out the part of each string captured by a group in a regular expression. Only the first match in each string is used.
# Extract first letter of each city
df['City_first_letter'] = df['City'].str.extract(r'(\w)', expand=False)
print(df)
Output:
      Name           City  Occupation     City_upper  City_length  \
0    Alice       New York    Engineer       NEW YORK            8
1      Bob    Los Angeles      Artist    LOS ANGELES           11
2  Charlie  San Francisco   Scientist  SAN FRANCISCO           13
3    David        Chicago        Chef        CHICAGO            7

    City_cleaned  City_corrected     City_lower     City_title  City_first_letter
0       New York        New York       new york       New York                  N
1    Los Angeles    L.A. Angeles    los angeles    Los Angeles                  L
2  San Francisco   San Francisco  san francisco  San Francisco                  S
3        Chicago         Chicago        chicago        Chicago                  C
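In the above code, the regular expression r'(\w)' captures the first word character (a letter, digit, or underscore) in each city name. Passing expand=False returns a Series rather than a one-column DataFrame.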
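As a further illustration, the sketch below shows how .str.extract() behaves with more than one capture group. The sample Series and the group names first_word and rest are assumptions made for this example, not part of the DataFrame used above.

```python
import pandas as pd

# Hypothetical sample data (not the article's full DataFrame)
cities = pd.Series(['New York', 'Los Angeles', 'San Francisco', 'Chicago'])

# One capture group with expand=False returns a Series
first_letter = cities.str.extract(r'(\w)', expand=False)

# Two named groups return a DataFrame with one column per group.
# Rows where the whole pattern does not match are filled with NaN.
parts = cities.str.extract(r'^(?P<first_word>\w+)\s+(?P<rest>.+)$')

print(first_letter.tolist())
print(parts)
```

Because 'Chicago' contains no whitespace, the two-group pattern does not match it, and both columns are missing for that row.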
b. Finding Substrings with .str.contains()
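If you want to check whether a substring exists in a string, you can use .str.contains(). This method returns a boolean Series indicating whether each string contains the specified pattern. By default the pattern is interpreted as a regular expression; pass regex=False to match it literally.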
# Flag rows whose city name contains 'Los'
df['Has_LA'] = df['City'].str.contains('Los')
print(df)
Output:
      Name           City  Occupation     City_upper  City_length  \
0    Alice       New York    Engineer       NEW YORK            8
1      Bob    Los Angeles      Artist    LOS ANGELES           11
2  Charlie  San Francisco   Scientist  SAN FRANCISCO           13
3    David        Chicago        Chef        CHICAGO            7

    City_cleaned  City_corrected     City_lower     City_title  City_first_letter  Has_LA
0       New York        New York       new york       New York                  N   False
1    Los Angeles    L.A. Angeles    los angeles    Los Angeles                  L    True
2  San Francisco   San Francisco  san francisco  San Francisco                  S   False
3        Chicago         Chicago        chicago        Chicago                  C   False
In this example, the Has_LA column is True for every row whose City value contains ‘Los’.
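The options of .str.contains() that most often matter in practice can be sketched as follows. The sample Series, including its missing value, is an assumption made for this example; case, regex, and na are existing parameters of the method.

```python
import pandas as pd

# Hypothetical sample data with one missing value
cities = pd.Series(['New York', 'Los Angeles', None, 'St. Louis'])

# Default behaviour: regex pattern, case-sensitive.
# na=False reports missing values as False instead of propagating them.
has_los = cities.str.contains('Los', na=False)

# case=False ignores letter case, so 'los' matches 'Los Angeles'
has_los_any_case = cities.str.contains('los', case=False, na=False)

# regex=False treats the pattern literally, so '.' matches only a real dot
has_dot = cities.str.contains('.', regex=False, na=False)

print(has_los.tolist())
print(has_los_any_case.tolist())
print(has_dot.tolist())
```

Without regex=False, the pattern '.' would match any character and flag every non-missing row.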
c. Splitting Text into Separate Columns
The .str.split() method allows you to split a string into multiple substrings based on a delimiter. By default the result is a Series where each element is a list of substrings; with expand=True the substrings are returned as separate columns of a DataFrame.
# Split the 'City' column at the first space into two new columns
# (n must be passed by keyword in pandas 2.0 and later)
df[['City_part1', 'City_part2']] = df['City'].str.split(' ', n=1, expand=True)
print(df)
Output:
      Name           City  Occupation     City_upper  City_length  \
0    Alice       New York    Engineer       NEW YORK            8
1      Bob    Los Angeles      Artist    LOS ANGELES           11
2  Charlie  San Francisco   Scientist  SAN FRANCISCO           13
3    David        Chicago        Chef        CHICAGO            7

    City_cleaned  City_corrected     City_lower     City_title  City_first_letter  Has_LA  City_part1  City_part2
0       New York        New York       new york       New York                  N   False         New        York
1    Los Angeles    L.A. Angeles    los angeles    Los Angeles                  L    True         Los     Angeles
2  San Francisco   San Francisco  san francisco  San Francisco                  S   False         San   Francisco
3        Chicago         Chicago        chicago        Chicago                  C   False     Chicago        None
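Here, the .str.split() method splits the City column at the first space (n=1) into two new columns: City_part1 and City_part2. A city name with no space, such as Chicago, has no second part, so its City_part2 value is missing.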
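The difference between the list result and the expanded result can be sketched on a small standalone Series. The sample data and the choice to fill the missing second part with an empty string are assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample data (not the article's full DataFrame)
cities = pd.Series(['New York', 'Los Angeles', 'San Francisco', 'Chicago'])

# Without expand, each element of the result is a list of substrings
as_lists = cities.str.split(' ')

# n=1 limits the split to the first space; expand=True returns a DataFrame
parts = cities.str.split(' ', n=1, expand=True)
parts.columns = ['City_part1', 'City_part2']

# 'Chicago' has no space, so its second part is missing; fill it if needed
parts['City_part2'] = parts['City_part2'].fillna('')

print(as_lists.tolist())
print(parts)
```

Whether an empty string is the right replacement depends on how the column is used later; leaving the value missing is equally valid.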