Issue with boundary \b in regex


Goal: Using regex (not split), I would like to take a string of numbers and return only the "properly formatted" ones. I define "properly formatted" as digits grouped in threes by commas, for example 1,234 or 6,368,745; numbers below 1,000 need no comma.

My code:

import re
numRegex = re.compile(r'\b\d{1,3}(?:,\d{3})*\b')
print(numRegex.findall('42 1,234 6,368,745 12,34,567 1234'))

When I run the code I would expect to get:

['42', '1,234', '6,368,745']

Instead I get back:

['42', '1,234', '6,368,745', '12', '34,567']

I would guess it's treating the comma (,) as a boundary (\b), but I'm not sure how to get around this elegantly.

FYI: This example is an adaptation of the problem question from "Automate the Boring Stuff with Python: Practical Programming for Total Beginners". The example problem only requested a regex to figure out if an individual number is formatted correctly and didn't expect you to parse out all "properly formatted" numbers from a long string of multiple numbers. I misinterpreted the question initially and now I'm on a mission to finish it out this way.


1 Answer


Try negative lookarounds:

numRegex = re.compile(r'\b\d{1,3}(?:,\d{3})*\b(?!,)')

There's a lookahead assertion (?!,) so that the boundary on the right side cannot be followed by a comma.
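As a quick check (a minimal sketch, reusing the sample string from the question; the variable names are only illustrative), the lookahead on its own already rejects the leading '12', but the trailing '34,567' still gets through because nothing guards the left side:

```python
import re

# Only the right-hand side is guarded: the match may not be followed by a comma.
lookahead_only = re.compile(r'\b\d{1,3}(?:,\d{3})*\b(?!,)')

sample = '42 1,234 6,368,745 12,34,567 1234'
result = lookahead_only.findall(sample)

# '12' is gone (it is followed by a comma), but '34,567' is still matched.
print(result)  # ['42', '1,234', '6,368,745', '34,567']
```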

Similarly, you can add a lookbehind assertion (?<!,) that requires the matched text not to be preceded by a comma:

numRegex = re.compile(r'(?<!,)\b\d{1,3}(?:,\d{3})*\b(?!,)')

This way, when a "number" has a comma on either side, it will not be matched.
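Putting it together on the input from the question (a small sketch; the names num_regex, sample and matches are just for illustration):

```python
import re

# (?<!,) : the match may not be preceded by a comma
# (?!,)  : the match may not be followed by a comma
num_regex = re.compile(r'(?<!,)\b\d{1,3}(?:,\d{3})*\b(?!,)')

sample = '42 1,234 6,368,745 12,34,567 1234'
matches = num_regex.findall(sample)

# Every piece of '12,34,567' touches a comma, so none of it is matched.
# '1234' is rejected by \b: there is no word boundary inside a run of digits.
print(matches)  # ['42', '1,234', '6,368,745']
```

The \b anchors are still needed here: without them, \d{1,3} could match the '234' inside '1234'.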
