How to check a string for a substring in a specific format?


April 2019




I have two strings. My item's name:

Parfume name EDT 50ml

And the competitor's item name:

Parfume another name EDP 60ml

I have a long list of these names in one column and the competitors' names in another column. I want to keep only those rows in the dataframe that have the same amount of ml in both my name and the competitor's name, no matter what everything else in these strings looks like. So how do I find a substring ending with 'ml' in a bigger string? Once I have that substring from my item's name, I could simply do

"**ml" in competitors_name

to see if they both contain the same amount of ml.

Thank you


'ml' is not always at the end of the string. It might look like this:

Parfume yet another great name 60ml EDP

4 answers


You could use Python's re (regular expression) module to pull the 'xxml' value out of each name in a row and then check whether the two values match.

import re

data_rows = [["Parfume name EDT", "Parfume another name EDP 50ml"]]

for data_pairs in data_rows:
    my_ml = None
    comp_ml = None

    # Check for my ml matches and set value
    my_ml_matches = re.search(r'(\d{1,3}[Mm][Ll])', data_pairs[0])
    if my_ml_matches is not None:
        my_ml = my_ml_matches[0]
    else:
        print("my_ml has no ml")

    # Check for comp ml matches and set value
    comp_ml_matches = re.search(r'(\d{1,3}[Mm][Ll])', data_pairs[1])
    if comp_ml_matches is not None:
        comp_ml = comp_ml_matches[0]
    else:
        print("comp_ml has no ml")

    # Print outputs (only when both names contain an ml value)
    if (my_ml is not None) and (comp_ml is not None):
        if my_ml == comp_ml:
            print("my_ml: {0} == comp_ml: {1}".format(my_ml, comp_ml))
        else:
            print("my_ml: {0} != comp_ml: {1}".format(my_ml, comp_ml))

Here data_rows holds every row in the data set, and each data_pairs is a pair [your_item_name, competitor_item_name].


You could use a lambda function to do that.

import pandas as pd
import re
d = {
    'Us': ['Parfume one 50ml', 'Parfume two 100ml'],
    'Competitor': ['Parfume uno 50ml', 'Parfume dos 200ml']
}
df = pd.DataFrame(data=d)

# Compare the digits in front of 'ml' in both columns, row by row
df['Eq'] = df.apply(lambda x: 'Yes' if re.search(r'(\d+)ml', x['Us']).group(1) == re.search(r'(\d+)ml', x['Competitor']).group(1) else 'No', axis=1)



It doesn't matter whether 'ml' is at the end or in the middle of the string.
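
If some names contain no 'ml' at all, calling .group(1) on a failed search raises an AttributeError, because the search result is None in that case. Below is a rough sketch of a None-safe version of the same comparison. The helper names extract_ml and same_ml are made up for this example, the 'Us' and 'Competitor' keys are taken from the snippet above, and plain dicts stand in for DataFrame rows so the sketch runs without pandas.

```python
import re

# Digits immediately followed by "ml", in any letter case (e.g. "50ml", "60ML").
ML_PATTERN = re.compile(r"(\d+)ml", re.IGNORECASE)


def extract_ml(name):
    """Return the volume as an int, or None when the name has no '<digits>ml'."""
    match = ML_PATTERN.search(name)
    return int(match.group(1)) if match else None


def same_ml(row):
    """True only when both names carry a volume and the two volumes are equal."""
    ours = extract_ml(row["Us"])
    theirs = extract_ml(row["Competitor"])
    return ours is not None and ours == theirs


# Hypothetical sample rows; the third one has no volume on our side.
rows = [
    {"Us": "Parfume one 50ml", "Competitor": "Parfume uno 50ml"},
    {"Us": "Parfume two 100ml", "Competitor": "Parfume dos 200ml"},
    {"Us": "Parfume three", "Competitor": "Parfume tres 60ml EDP"},
]

# Keep only the rows where both volumes are present and equal.
kept = [row for row in rows if same_ml(row)]
print(kept)
```

Since a DataFrame row also supports row["Us"], something like df[df.apply(same_ml, axis=1)] should keep only the matching rows, which is what the question asks for.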


Try this:

import re

def same_measurement(my_item, competitor_item, unit="ml"):
    matcher = re.compile(r".*?(\d+){}".format(unit))
    my_match = matcher.match(my_item)
    competitor_match = matcher.match(competitor_item)
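    # Both names must contain "<digits><unit>" and the digits must be equal
    return bool(my_match and competitor_match and my_match.group(1) == competitor_match.group(1))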

my_item = "Parfume name EDT 50ml"
competitor_item = "Parfume another name EDP 50ml"
assert same_measurement(my_item, competitor_item)

my_item = "Parfume name EDT 50ml"
competitor_item = "Parfume another name EDP 60ml"
assert not same_measurement(my_item, competitor_item)

If the volume is always the last word in the name, you can use str.split for this and just select the last element (index [-1]):

>>> str1 = "Parfume name EDT 50ml"
>>> str2 = "Parfume another name EDP 60ml"
>>> volume1 = str1.split()[-1]
>>> volume2 = str2.split()[-1]
>>> volume1
'50ml'
>>> volume2
'60ml'
>>> volume1 == volume2
False

As a function:

def same_volume(str1, str2):
    return str1.split()[-1] == str2.split()[-1]
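
Since the question mentions that 'ml' is not always at the end, here is a possible variation on the split idea that scans every word instead of only the last one. It is only a sketch: volume_token and same_volume_anywhere are names invented for this example, and it assumes the volume is written as one word such as '60ml', with no space between the number and the unit.

```python
def volume_token(name):
    """Return the first word shaped like '<digits>ml' (lowercased), or None."""
    for token in name.split():
        lowered = token.lower()
        # Everything before the trailing "ml" must be digits, e.g. "60ml".
        if lowered.endswith("ml") and lowered[:-2].isdigit():
            return lowered
    return None


def same_volume_anywhere(str1, str2):
    """True when both names contain a volume word and the words are equal."""
    volume1 = volume_token(str1)
    return volume1 is not None and volume1 == volume_token(str2)


print(same_volume_anywhere("Parfume name EDT 50ml",
                           "Parfume yet another great name 50ml EDP"))
```

A name with no volume word at all never matches anything here, so two names that both lack a volume are not treated as equal.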