Python regex: Difference between (.+) and (.+?)
.+ is greedy: it matches as much as it can, then gives back only as many characters as the rest of the pattern needs.
.+? is not: it stops at the first opportunity.
Examples:
Assume you have this HTML:
<span id=yfs_l84_sbux>foo bar</span><span id=yfs_l84_sbux2>foo bar</span>
This regex matches the whole thing:
<span id=yfs_l84_sbux>(.+)</span>
It goes all the way to the end, then gives back one </span>, but the rest of the regex matches that last </span>, so the complete regex matches the entire HTML chunk.
But this regex stops at the first </span>:
<span id=yfs_l84_sbux>(.+?)</span>
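The two patterns above can be checked with a minimal Python sketch using the standard re module. The variable names (html, greedy, lazy) are illustrative and not part of the original question.

```python
import re

html = "<span id=yfs_l84_sbux>foo bar</span><span id=yfs_l84_sbux2>foo bar</span>"

# Greedy: .+ runs to the end of the string, then backs off
# just far enough for the trailing </span> to match.
greedy = re.search(r"<span id=yfs_l84_sbux>(.+)</span>", html)

# Non-greedy: .+? grows one character at a time and stops
# as soon as the first </span> can match.
lazy = re.search(r"<span id=yfs_l84_sbux>(.+?)</span>", html)

print(greedy.group(1))  # foo bar</span><span id=yfs_l84_sbux2>foo bar
print(lazy.group(1))    # foo bar
```

Only the captured group differs here; both patterns start matching at the same opening tag.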
? is a non-greedy modifier. * by default is a greedy repetition operator: it will gobble up everything it can; when modified by ? it becomes non-greedy and will eat up only as much as will satisfy it.
Thus for
<span id=yfs_l84_sbux>want</span>text<span id=somethingelse>dontwant</span>
.*?</span> will eat up want, then hit </span>, and this satisfies the regexp with minimal repetitions of ., resulting in <span id=yfs_l84_sbux>want</span> being the match. However, .* will try to see if it can eat more: it will go and find the other </span>, with .* matching want</span>text<span id=somethingelse>dontwant, resulting in what you got, which is much more than you wanted.
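As a rough check of the description above, the following sketch runs both variants on that string. The full patterns are an assumption: the answer only shows the .*?</span> tail, so the opening tag is prepended here to reproduce the match it describes, and a capture group is added to expose what the dot-star consumed.

```python
import re

text = "<span id=yfs_l84_sbux>want</span>text<span id=somethingelse>dontwant</span>"

# Assumed patterns: opening tag, then a captured .*? or .*, then </span>.
lazy = re.search(r"<span id=yfs_l84_sbux>(.*?)</span>", text)
greedy = re.search(r"<span id=yfs_l84_sbux>(.*)</span>", text)

print(lazy.group(1))    # want
print(lazy.group(0))    # <span id=yfs_l84_sbux>want</span>
print(greedy.group(1))  # want</span>text<span id=somethingelse>dontwant
```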
(.+) is greedy. It takes what it can and gives back when needed.
(.+?) is ungreedy. It takes as few characters as possible.
See, for the input delegate (brackets mark the matched text):
[delegate] /^(.+)e/
[de]legate /^(.+?)e/
Also, comparing the regex debugger logs here and here will show more clearly what the ungreedy modifier does.
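The delegate example can also be reproduced in Python. This is a small sketch with illustrative variable names; group(0) is the bracketed text from the example and group(1) is what the parenthesised part captured.

```python
import re

word = "delegate"

# Greedy: .+ first takes the whole word, then gives back one
# character so the final "e" in the pattern can match.
greedy = re.match(r"^(.+)e", word)

# Ungreedy: .+? takes a single character, and the next "e" matches at once.
lazy = re.match(r"^(.+?)e", word)

print(greedy.group(0), greedy.group(1))  # delegate delegat
print(lazy.group(0), lazy.group(1))      # de d
```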