python – How to extract string from dataframe after regex matching

python – How to extract string from dataframe after regex matching

United States will be fixed string which can be stripped off as on
exact match


My solution is to remove the country name, which will leave us with the city name only.
This approach seems to be easier since country names are fixed and can be easily removed based on a list, i.e.:

  1. split() the address in two based on comma (,);
  2. replace() the country name with nothing;
  3. Use pandas apply() to apply get_city() function containing the steps above.
  4. Use pandas tolist() to convert column City to a list. This last step is optional, as it depends on what youll do with the city names.

i.e.:

import pandas as pd
addresses = [[10 rue des Treuils BP 12 33023, Bordeaux France],[Rua da Alegria 22, Lisboa Portugal],[22 Some Street, NYC United States]]
df = pd.DataFrame(addresses,columns=[Address])

countries = [Portugal, France, United States]

def get_city(address):
    city_country = address.split(,)[1]
    for i in countries: city = city_country.replace(i, )
    return city.strip()

df[City] = df[Address].apply(get_city)
print (df[City].tolist())

Output:

[Bordeaux, Lisboa, NYC]

PS:
You may want to lower() both the addresses and countries list in order to avoid case SenSitIve mismatches.

If we consider your regex to be working with French addresses (ending by France), then you can use this :

/,s([A-Z][A-Za-zs-]+)sFrance/gm

enter

Link to the online regex simulator where I tested the expression

You mentioned earlier about the United States, but actually the way adresses are written is totally different, so youll have to make another regex for it, I guess. (i.e:
4 Cross Lane
Schererville, IN 46375)

python – How to extract string from dataframe after regex matching

Yeah maybe some advanced regex could handle this but the pandas naive approach would be:

import pandas as pd
import numpy as np

col = pd.Series([10 rue des Treuils BP 12 33023, Bordeaux France,
                 10 rue des Treuils BP 12 33023, Les Deux Alpes France,
                 10 rue des Treuils BP 12 33023, New York United States])

cities = np.where(col.str.endswith(United States), 
                  col.str.split(, ).str[1].str.split().str[:-2].str.join( ), 
                  col.str.split(, ).str[1].str.split().str[:-1].str.join( ))

print(cities)
#[Bordeaux Les Deux Alpes New York]

A more general but not as effective solution (but who needs speed right?)

import pandas as pd

col = pd.Series([10 rue des Treuils BP 12 33023, Bordeaux France,
                 10 rue des Treuils BP 12 33023, New York United States,
                 10 rue des Treuils BP 12 33023, Seoul South Korea,
                 10 rue des Treuils BP 12 33023, Brazzaville Republic of Congo])

countries = {United States: 2 , South Korea: 2, Republic of Congo: 3}
n = [next((countries[k] for k,v in countries.items() if i.endswith(k)), 1) for i in col]
cities = [ .join(i.split(, )[1].split()[:-y]) for i,y in zip(col,n)]

print(cities)
# [Bordeaux, Les Deux Alpes, New York, Seoul, Brazzaville]

And then simply assign back with:

df[city] = cities

Leave a Reply

Your email address will not be published. Required fields are marked *