7 Pattern Matching with Regular Expressions






Eve knew Agent Bob was a double agent.')
A**** told C**** that E**** knew B**** was a double agent.'
Managing Complex Regexes
Regular expressions are fine if the text pattern you need to match is simple.
But matching complicated text patterns might require long, convoluted regular
expressions. You can mitigate this by telling the re.compile() function
to ignore whitespace and comments inside the regular expression string.
This “verbose mode” can be enabled by passing the variable re.VERBOSE as
the second argument to re.compile().
Now instead of a hard-to-read regular expression like this:
phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext.)\s*\d{2,5})?)')
you can spread the regular expression over multiple lines with comments 
like this:
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)
Note how the previous example uses the triple-quote syntax (''') to
create a multiline string so that you can spread the regular expression
definition over many lines, making it much more legible.
The comment rules inside the regular expression string are the same
as in regular Python code: the # symbol and everything after it to the end
of the line are ignored. Also, the extra spaces and newlines inside the
multiline string for the regular expression are not considered part of the
text pattern to be matched. This lets you organize the regular expression
so it’s easier to read.
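As a quick check of the claims above, the following sketch compiles the phone-number pattern both ways and confirms that the two forms find the same matches. The sample text and the names compact, verbose, loose, and strict are made up for illustration. The last few lines show one caveat worth knowing: because verbose mode drops unescaped whitespace, a literal space has to be written as \s, an escaped space, or a character class such as [ ].

```python
import re

# Compact, single-line form of the phone-number pattern from the text.
compact = re.compile(
    r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext.)\s*\d{2,5})?)'
)

# The same pattern spread over several lines in verbose mode.
verbose = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)

# Hypothetical sample text, invented for this sketch.
sample = 'Call 415-555-1011 or (415) 555-9999 ext 1234 today.'

# group(0) is the whole matched phone number.
compact_hits = [m.group(0) for m in compact.finditer(sample)]
verbose_hits = [m.group(0) for m in verbose.finditer(sample)]
print(compact_hits == verbose_hits)

# Caveat: an unescaped space is ignored in verbose mode,
# so this pattern is effectively 'AgentBob'.
loose = re.compile(r'Agent Bob', re.VERBOSE)
# A space inside a character class is kept.
strict = re.compile(r'Agent[ ]Bob', re.VERBOSE)

print(loose.search('Agent Bob') is None)
print(strict.search('Agent Bob') is not None)
```

Each of the three print() calls should display True.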
Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE
What if you want to use re.VERBOSE to write comments in your regular
expression but also want to use re.IGNORECASE to ignore capitalization?
Unfortunately, the re.compile() function takes only a single value as its
second argument. You can get around this limitation by combining the
re.IGNORECASE, re.DOTALL, and re.VERBOSE variables using the pipe character
(|), which in this context is known as the bitwise or operator.
So if you want a regular expression that’s case-insensitive and whose dot
character also matches newlines, you would form your re.compile() call
like this:
>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL)
Including all three options in the second argument will look like this:
>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
This syntax is a little old-fashioned and originates from early versions 
of Python. The details of the bitwise operators are beyond the scope of this 
book, but check out the resources at https://nostarch.com/automatestuff2/ for 
more information. You can also pass other options for the second argument; 
they’re uncommon, but you can read more about them in the resources, too.
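To see what the combined flags actually do, here is a minimal sketch. The sample text and the variable names are made up for illustration. It searches the same string with no flags, with one flag, and with two flags joined by |, and then inspects a compiled pattern's flags attribute to check which options were set.

```python
import re

# Hypothetical sample text, invented for this sketch: uppercase 'FOO',
# a newline, then 'bar'.
text = 'FOO\nbar'
pattern = r'foo.bar'

# No flags: 'foo' does not match 'FOO', so there is no match.
neither = re.search(pattern, text)

# Case-insensitive only: 'foo' matches 'FOO', but the dot still
# refuses to match the newline, so there is no match.
only_case = re.search(pattern, text, re.IGNORECASE)

# Both flags combined with the bitwise or operator: the search succeeds.
both = re.search(pattern, text, re.IGNORECASE | re.DOTALL)

print(neither is None)
print(only_case is None)
print(both is not None)

# A compiled pattern remembers its flags; the bitwise and operator (&)
# tests whether a particular option is among them.
compiled = re.compile(pattern, re.IGNORECASE | re.DOTALL | re.VERBOSE)
has_dotall = bool(compiled.flags & re.DOTALL)
has_multiline = bool(compiled.flags & re.MULTILINE)
print(has_dotall, has_multiline)
```

The first three print() calls should display True, and the last should display True False, since re.MULTILINE was never passed.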
