7 Pattern Matching with Regular Expressions

Eve knew Agent Bob was a double agent.')
A**** told C**** that E**** knew B**** was a double agent.'
Managing Complex Regexes
Regular expressions are fine if the text pattern you need to match is simple.
But matching complicated text patterns might require long, convoluted
regular expressions. You can mitigate this by telling the re.compile()
function to ignore whitespace and comments inside the regular expression
string. This “verbose mode” can be enabled by passing the variable
re.VERBOSE as the second argument to re.compile().
Now instead of a hard-to-read regular expression like this:
phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext.)\s*\d{2,5})?)')
you can spread the regular expression over multiple lines with comments
like this:
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)
Note how the previous example uses the triple-quote syntax (''') to create
a multiline string so that you can spread the regular expression definition
over many lines, making it much more legible.
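To see that the verbose layout changes only how the pattern is written, not
what it matches, here is a minimal sketch that compiles both forms and runs
them on the same text. The sample string and the names compact_regex,
verbose_regex, and sample are made up for this illustration.

```python
import re

# Compact, single-line form of the phone-number pattern.
compact_regex = re.compile(
    r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext.)\s*\d{2,5})?)'
)

# The same pattern written in verbose mode with comments.
verbose_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)

# Hypothetical sample text, not taken from the original example.
sample = 'Call 415-555-1011 or (415) 555-9999 ext 123 today.'

# group(0) is the full text of each match.
compact_hits = [m.group(0) for m in compact_regex.finditer(sample)]
verbose_hits = [m.group(0) for m in verbose_regex.finditer(sample)]

print(compact_hits == verbose_hits)  # True
print(verbose_hits)  # ['415-555-1011', '(415) 555-9999 ext 123']
```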
The comment rules inside the regular expression string are the same as in
regular Python code: the # symbol and everything after it to the end of the
line are ignored. Also, the extra spaces inside the multiline string for the
regular expression are not considered part of the text pattern to be
matched. This lets you organize the regular expression so it’s easier to read.
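One consequence of these rules is worth checking for yourself: because
verbose mode drops unescaped spaces and # symbols, a pattern that needs to
match them literally has to escape them or put them in a character class.
The short sketch below illustrates this with made-up patterns and variable
names; it is not part of the phone-number example.

```python
import re

# In verbose mode the space in the pattern is ignored, so this is just 'ab'.
ignored_space = re.compile(r'a b', re.VERBOSE)

# Escaping the space, or placing it in a character class, keeps it literal.
escaped_space = re.compile(r'a\ b', re.VERBOSE)
class_space = re.compile(r'a[ ]b', re.VERBOSE)

# The same applies to '#': escape it to match a literal hash sign.
literal_hash = re.compile(r'\#\d+  # a hash sign followed by digits', re.VERBOSE)

print(ignored_space.fullmatch('ab') is not None)   # True
print(ignored_space.fullmatch('a b') is not None)  # False
print(escaped_space.fullmatch('a b') is not None)  # True
print(class_space.fullmatch('a b') is not None)    # True
print(literal_hash.fullmatch('#42') is not None)   # True
```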
Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE
What if you want to use re.VERBOSE to write comments in your regular
expression but also want to use re.IGNORECASE to ignore capitalization?
Unfortunately, the re.compile() function takes only a single value as its
second argument. You can get around this limitation by combining the
re.IGNORECASE, re.DOTALL, and re.VERBOSE variables using the pipe character
(|), which in this context is known as the bitwise or operator.
So if you want a regular expression that’s case-insensitive and lets the
dot character match newlines, you would form your re.compile() call like
this:
>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL)
Including all three options in the second argument will look like this:
>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
This syntax is a little old-fashioned and originates from early versions
of Python. The details of the bitwise operators are beyond the scope of this
book, but check out the resources at https://nostarch.com/automatestuff2/ for
more information. You can also pass other options for the second argument;
they’re uncommon, but you can read more about them in the resources, too.
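As a rough illustration of what combining the options does, the sketch
below compiles one verbose pattern twice, once with all three options and
once with re.VERBOSE alone, and compares the results. The pattern, the
sample text, and the names combined and only_verbose are assumptions made
for this example.

```python
import re

# Hypothetical sample: uppercase 'FOO', a newline, then 'bar'.
text = 'FOO\nbar'

# A verbose pattern equivalent to 'foo.*bar'.
pattern = r'''
    foo   # matches 'FOO' only when re.IGNORECASE is set
    .*    # crosses the newline only when re.DOTALL is set
    bar
'''

combined = re.compile(pattern, re.IGNORECASE | re.DOTALL | re.VERBOSE)
only_verbose = re.compile(pattern, re.VERBOSE)

print(combined.search(text) is not None)      # True
print(only_verbose.search(text) is not None)  # False
```

With re.VERBOSE alone, the search fails because the pattern is still
case-sensitive and the dot still stops at the newline.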
