Extracting Information Using Regular Expressions

Regular expressions (regex) are a powerful tool for matching patterns in text. Pandas supports them in its .str methods, so you can extract and manipulate text based on patterns.

a. Extracting Patterns with `.str.extract()`

You can use .str.extract() to extract substrings matching a given regular expression pattern.

# Extract first letter of each city
df['City_first_letter'] = df['City'].str.extract(r'(\w)', expand=False)
print(df)

Output:

      Name           City   Occupation     City_upper  City_length  \
0    Alice       New York     Engineer      NEW YORK           8   
1      Bob    Los Angeles       Artist    LOS ANGELES          11   
2  Charlie  San Francisco    Scientist  SAN FRANCISCO          13   
3    David        Chicago         Chef        CHICAGO           7   

       City_cleaned  City_corrected   City_lower    City_title City_first_letter  
0       New York        New York      new york      New York                 N  
1    Los Angeles    L.A. Angeles    los angeles    Los Angeles                 L  
2  San Francisco  San Francisco  san francisco  San Francisco                 S  
3        Chicago        Chicago      chicago      Chicago                 C

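In the above code, the regular expression r'(\w)' captures the first word character (a letter, digit, or underscore) in each city name. .str.extract() returns only the first match of the pattern, and expand=False makes it return a Series instead of a one-column DataFrame.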
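Patterns with more than one capture group produce one column per group. The following is a minimal sketch, assuming a small stand-in Series of city names rather than the lesson's full df; the group names first and rest are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical sample data mirroring the City column used in this lesson.
cities = pd.Series(["New York", "Los Angeles", "San Francisco", "Chicago"])

# Two named groups: the first word, and (optionally) everything after the space.
# Named groups become the column names of the resulting DataFrame.
parts = cities.str.extract(r"^(?P<first>\w+)(?:\s+(?P<rest>.+))?$")
print(parts)
```

A group that does not take part in the match, such as rest for the single-word name Chicago, comes back as a missing value.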

b. Finding Substrings with `.str.contains()`

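If you want to check whether a substring or pattern occurs in each string, you can use .str.contains(). This method returns a boolean Series indicating whether each string contains the pattern. By default the pattern is treated as a regular expression and the match is case-sensitive; pass regex=False to match the text literally.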

df['Has_LA'] = df['City'].str.contains('Los')
print(df)

Output:

      Name           City   Occupation     City_upper  City_length  \
0    Alice       New York     Engineer      NEW YORK           8   
1      Bob    Los Angeles       Artist    LOS ANGELES          11   
2  Charlie  San Francisco    Scientist  SAN FRANCISCO          13   
3    David        Chicago         Chef        CHICAGO           7   

       City_cleaned  City_corrected   City_lower    City_title City_first_letter  Has_LA  
0       New York        New York      new york      New York                 N    False  
1    Los Angeles    L.A. Angeles    los angeles    Los Angeles                 L     True  
2  San Francisco  San Francisco  san francisco  San Francisco                 S    False  
3        Chicago        Chicago      chicago      Chicago                 C    False

In this example, the Has_LA column checks if ‘Los’ is present in the City column.
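The optional parameters of .str.contains() change how the pattern is interpreted. The sketch below assumes a small stand-in Series (including a missing value and a name with a dot, neither of which appears in the lesson's df) to show case, na, and regex in action.

```python
import pandas as pd

# Hypothetical sample data; the None and "St. Louis" rows are added for illustration.
cities = pd.Series(["New York", "Los Angeles", None, "St. Louis"])

# Regex alternation, matched case-insensitively.
# na=False reports missing values as False instead of NaN.
has_los_or_new = cities.str.contains(r"los|new", case=False, na=False)

# regex=False treats the pattern as plain text, so "." matches only a literal dot
# rather than "any character".
has_dot = cities.str.contains(".", regex=False, na=False)

print(has_los_or_new.tolist())
print(has_dot.tolist())
```

Setting na=False is useful when the result is used as a filter, since a boolean mask containing NaN cannot be used for row selection.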

c. Splitting Text into Separate Columns

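The .str.split() method splits each string into substrings at a delimiter. By default the result is a Series in which each element is a list of substrings; with expand=True the substrings are returned as separate DataFrame columns instead. The n parameter limits the number of splits.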

# Split the 'City' column by space into two new columns
# n must be passed by keyword in pandas 2.0 and later
df[['City_part1', 'City_part2']] = df['City'].str.split(' ', n=1, expand=True)
print(df)

Output:

      Name           City   Occupation     City_upper  City_length  \
0    Alice       New York     Engineer      NEW YORK           8   
1      Bob    Los Angeles       Artist    LOS ANGELES          11   
2  Charlie  San Francisco    Scientist  SAN FRANCISCO          13   
3    David        Chicago         Chef        CHICAGO           7   

       City_cleaned  City_corrected   City_lower    City_title City_first_letter  Has_LA    City_part1   City_part2  
0       New York        New York      new york      New York                 N    False          New       York  
1    Los Angeles    L.A. Angeles    los angeles    Los Angeles                 L     True        Los   Angeles  
2  San Francisco  San Francisco  san francisco  San Francisco                 S    False        San   Francisco  
3        Chicago        Chicago      chicago      Chicago                 C    False      Chicago       None

Here, the str.split() method splits the City column at the first space into two new columns: City_part1 and City_part2. Single-word names such as Chicago have no second part, so City_part2 is empty (None) for those rows.
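If the missing second part is inconvenient, it can be filled in after the split. This is a minimal sketch on a stand-in Series of the same four city names; replacing the gap with an empty string is one possible choice, not the only one.

```python
import pandas as pd

# Hypothetical sample data mirroring the City column used in this lesson.
cities = pd.Series(["New York", "Los Angeles", "San Francisco", "Chicago"])

# n=1 allows at most one split; expand=True returns a DataFrame with two columns.
parts = cities.str.split(" ", n=1, expand=True)
parts.columns = ["City_part1", "City_part2"]

# Single-word names have no second part; replace the missing value with "".
parts["City_part2"] = parts["City_part2"].fillna("")
print(parts)
```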
