Combining R and Python: why, when, and how?

Regular Expression Modifiers: Option Flags

Functions in the re module accept an optional flags argument that controls various aspects of matching. You can combine multiple modifiers using bitwise OR (|). The available modifiers are:

Sr.No.  Modifier  Description
1       re.I      Performs case-insensitive matching.
2       re.L      Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B). In Python 3 it can be used only with bytes patterns.
3       re.M      Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).
4       re.S      Makes a period (dot) match any character, including a newline.
5       re.U      Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B. In Python 3 this is already the default for str patterns.
6       re.X      Permits more readable regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker.
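As a rough illustration of combining modifiers with |, the sketch below uses made-up sample text and patterns (they are assumptions for the example, not taken from the table above).

```python
import re

text = "First line\nsecond LINE\nthird line"

# re.I | re.M: ignore case, and let ^ and $ anchor at every line.
line_pattern = re.compile(r"^\w+ line$", re.I | re.M)
matches = line_pattern.findall(text)
print(matches)

# re.X: whitespace and # comments inside the pattern are ignored.
phone_pattern = re.compile(r"""
    (\d{3})   # three digits
    -         # a literal hyphen
    (\d{4})   # four digits
""", re.X)
groups = phone_pattern.search("call 555-1234").groups()
print(groups)
```

Without re.M the first pattern would find nothing here, because ^ and $ would only anchor at the start and end of the whole string.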

Search and Replace

One of the most important functions in the re module is sub.

Syntax

re.sub(pattern, repl, string, count=0, flags=0)

This function replaces occurrences of the RE pattern in string with repl. All occurrences are replaced unless count is given, in which case at most count occurrences are replaced. It returns the modified string.

Example

#!/usr/bin/env python3
import re

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num : ", num)

When the above code is executed, it produces the following result −

Phone Num :  2004-959-559
Phone Num :  2004959559

Sets

A set is a group of characters inside a pair of square brackets [] with a special meaning:

Set         Description
[arn]       Returns a match where one of the specified characters (a, r, or n) is present
[a-n]       Returns a match for any lower case character, alphabetically between a and n
[^arn]      Returns a match for any character EXCEPT a, r, and n
[0123]      Returns a match where any of the specified digits (0, 1, 2, or 3) is present
[0-9]       Returns a match for any digit between 0 and 9
[0-5][0-9]  Returns a match for any two-digit number from 00 to 59
[a-zA-Z]    Returns a match for any character alphabetically between a and z, lower case OR upper case
[+]         In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string
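The following sketch tries a few sets with re.findall. The sample strings are invented for illustration.

```python
import re

text = "Order 66 shipped at 08:45 + tax"

# Lower case letters between a and n.
lower_a_to_n = re.findall(r"[a-n]", "banana")
# Every character except a, r and n.
not_arn = re.findall(r"[^arn]", "rain")
# Two-digit numbers from 00 to 59.
minutes = re.findall(r"[0-5][0-9]", text)
# Inside a set, + is an ordinary character.
plus_signs = re.findall(r"[+]", text)

print(lower_a_to_n, not_arn, minutes, plus_signs)
```

Note that 66 is not reported by the third pattern, because its first digit falls outside [0-5].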

Regular Expression Patterns

Except for the metacharacters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can escape a metacharacter by preceding it with a backslash.

The following table lists the regular expression syntax that is available in Python:

Sr.No.  Pattern       Description
1       ^             Matches the beginning of the string; with re.M, also the beginning of each line.
2       $             Matches the end of the string (or just before a trailing newline); with re.M, also the end of each line.
3       .             Matches any single character except newline. Using the s option (re.S) allows it to match newline as well.
4       [...]         Matches any single character in brackets.
5       [^...]        Matches any single character not in brackets.
6       re*           Matches 0 or more occurrences of the preceding expression.
7       re+           Matches 1 or more occurrences of the preceding expression.
8       re?           Matches 0 or 1 occurrence of the preceding expression.
9       re{n}         Matches exactly n occurrences of the preceding expression.
10      re{n,}        Matches n or more occurrences of the preceding expression.
11      re{n,m}       Matches at least n and at most m occurrences of the preceding expression.
12      a|b           Matches either a or b.
13      (re)          Groups regular expressions and remembers matched text.
14      (?imx)        Turns on the i, m, or x options. In Python these inline flags apply to the whole pattern and must appear at its start.
15      (?-imx)       Turns off the i, m, or x options. Python accepts this only in the scoped form (?-imx:re).
16      (?:re)        Groups regular expressions without remembering matched text.
17      (?imx:re)     Turns on the i, m, or x options within the parentheses only (Python 3.6 and later).
18      (?-imx:re)    Turns off the i, m, or x options within the parentheses only (Python 3.6 and later).
19      (?#…)         Comment.
20      (?=re)        Positive lookahead: asserts that re matches at this position without consuming any text.
21      (?!re)        Negative lookahead: asserts that re does not match at this position without consuming any text.
22      (?>re)        Matches an independent pattern without backtracking (atomic group, Python 3.11 and later).
23      \w            Matches word characters.
24      \W            Matches nonword characters.
25      \s            Matches whitespace. For ASCII text, equivalent to [ \t\n\r\f\v].
26      \S            Matches nonwhitespace.
27      \d            Matches digits. For ASCII text, equivalent to [0-9].
28      \D            Matches nondigits.
29      \A            Matches the beginning of the string.
30      \Z            Matches the end of the string. In Python it matches only at the very end, not before a trailing newline.
31      \z            End-of-string anchor in Perl and Ruby; in Python, use \Z for this.
32      \G            Matches the point where the last match finished. Not supported by Python's re module.
33      \b            Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
34      \B            Matches nonword boundaries.
35      \n, \t, etc.  Matches newlines, carriage returns, tabs, etc.
36      \1…\9         Matches the nth grouped subexpression.
37      \10           Matches the 10th grouped subexpression. In Python, a number is read as an octal character code only if it starts with 0 or is three octal digits long.
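A short sketch of a few of the less obvious constructs from the table: capturing groups with a backreference, non-capturing groups, and lookahead. The sample strings are assumptions chosen for the example.

```python
import re

# (re) captures text; \1 must then match the same text again.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine")
repeated_word = doubled.group(1)

# (?:re) groups without capturing, so findall returns whole matches.
runs = re.findall(r"(?:ab)+", "ababxab")

# (?=re) requires that re follows, without consuming it.
python_files = re.findall(r"\w+(?=\.py)", "run main.py and test.txt")

# (?!re) requires that re does NOT follow.
not_percent = re.findall(r"\b\d+\b(?!%)", "10% of 250")

print(repeated_word, runs, python_files, not_percent)
```

Because lookahead does not consume text, the ".py" suffix is not part of the returned match in the third example.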

Search and Replace

Python's re module provides re.sub for replacing matches in a string.

Syntax:

re.sub(pattern, repl, string, count=0)

The returned string is obtained by replacing the leftmost non-overlapping matches of the pattern in string with repl. If the pattern is not found, the string is returned unchanged.

The optional count parameter is the maximum number of pattern occurrences to replace; it must be a non-negative integer. The default value 0 means that all occurrences are replaced.

Example:

#!/usr/bin/env python3
import re

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num : ", num)

When the above code is executed, it produces the following result:

Phone Num :  2004-959-559
Phone Num :  2004959559
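To make the effect of the count parameter concrete, here is a minimal sketch that reuses the phone number from the example; the variable names are illustrative.

```python
import re

phone = "2004-959-559"

# count=0 (the default): every hyphen is replaced.
all_removed = re.sub(r"-", "", phone)
# count=1: only the leftmost hyphen is replaced.
first_removed = re.sub(r"-", "", phone, count=1)
# No match at all: the string comes back unchanged.
unchanged = re.sub(r"x", "", phone)

print(all_removed, first_removed, unchanged)
```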

Patterns That Match a Position Rather Than Text

Some parts of a regular expression match not a piece of text but a position in that text.
In other words, such a pattern corresponds not to a substring but to a position in the text, as if "between" the characters.

Simple patterns that match a position

Pattern  Description                                                                  Example        Matching strings
^        Start of the whole text, or start of a line if re.MULTILINE is set           ^Привет
$        End of the whole text, or end of a line if re.MULTILINE is set               Будь здоров!$
\A       Strictly the start of the whole text
\Z       Strictly the end of the whole text
\b       Start or end of a word (nothing or a non-alphanumeric character on the left and an alphanumeric character on the right, or vice versa)   \bвал   matches in "вал", not in "перевал"
\B       Not a word boundary (alphanumeric characters on both sides, or non-alphanumeric characters on both sides)   \Bвал   matches in "перевал", not in "вал"
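The position patterns from the table can be checked with a short sketch. The test strings are the table's own examples extended slightly; the second line of the multi-line string is an assumption added for the demonstration.

```python
import re

text = "вал перевал"

# \b: "вал" must start at a word boundary, so only the standalone word matches.
at_boundary = [m.start() for m in re.finditer(r"\bвал", text)]
# \B: "вал" must NOT start at a word boundary, so only the one inside "перевал" matches.
inside_word = [m.start() for m in re.finditer(r"\Bвал", text)]

lines = "Привет, мир\nПривет ещё раз"
# Without re.MULTILINE, ^ matches only at the start of the whole text.
start_of_text = re.findall(r"^Привет", lines)
# With re.MULTILINE, ^ also matches at the start of every line.
start_of_lines = re.findall(r"^Привет", lines, re.MULTILINE)

print(at_boundary, inside_word, len(start_of_text), len(start_of_lines))
```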

Special Sequences

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

Character  Description                                                                                   Example
\A         Returns a match if the specified characters are at the beginning of the string                "\AThe"
\b         Returns a match where the specified characters are at the beginning or at the end of a word (the "r" in the beginning makes sure that the string is treated as a "raw string")   r"\bain", r"ain\b"
\B         Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the "r" in the beginning makes sure that the string is treated as a "raw string")   r"\Bain", r"ain\B"
\d         Returns a match where the string contains digits (numbers from 0-9)                           "\d"
\D         Returns a match where the string DOES NOT contain digits                                      "\D"
\s         Returns a match where the string contains a white space character                            "\s"
\S         Returns a match where the string DOES NOT contain a white space character                    "\S"
\w         Returns a match where the string contains any word characters (characters from a to Z, digits from 0-9, and the underscore _ character)   "\w"
\W         Returns a match where the string DOES NOT contain any word characters                        "\W"
\Z         Returns a match if the specified characters are at the end of the string                     "Spain\Z"
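The remark about raw strings matters most for \b. A small sketch, with an assumed sample sentence, shows the difference:

```python
import re

text = "The rain in Spain"

# In an ordinary string literal, "\b" is a single backspace character (0x08).
plain_length = len("\b")
# In a raw string, the backslash and the letter b are kept as two characters.
raw_length = len(r"\b")

# The raw pattern reaches the regex engine as a word-boundary assertion.
with_raw = re.findall(r"ain\b", text)
# The plain pattern looks for "ain" followed by a literal backspace.
without_raw = re.findall("ain\b", text)

print(plain_length, raw_length, with_raw, without_raw)
```

Writing every pattern as a raw string avoids having to remember which escapes Python itself interprets.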

re.compile

With the re.compile() function we can compile a pattern into a pattern object,
which has methods for various operations such as searching for pattern matches
or performing string substitutions.

Let’s see two examples, using the re.compile() function.

The first example checks if the input from the user contains only letters,
spaces or . (no digits)

Any other character is not allowed.

The second example checks if the input from the user contains only numbers,
parentheses, spaces or hyphen (no letters)

Any other character is not allowed

The output of the above script will be:

Please, enter your phone: s

Please enter your phone correctly!

It will continue to ask until you put in numbers only.
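The two checks described above might be sketched as follows. This is not the original script: the character sets, the function names, and the ask helper are all assumptions made for illustration.

```python
import re

# Hypothetical patterns, compiled once and reused.
NAME_RE = re.compile(r"[A-Za-z .]+")    # letters, spaces, periods
PHONE_RE = re.compile(r"[0-9() -]+")    # digits, parentheses, spaces, hyphens


def is_valid_name(value):
    """Return True if value consists only of letters, spaces or periods."""
    return NAME_RE.fullmatch(value) is not None


def is_valid_phone(value):
    """Return True if value consists only of digits, parentheses, spaces or hyphens."""
    return PHONE_RE.fullmatch(value) is not None


def ask(prompt, is_valid, retry_message, read=input):
    """Keep asking until the answer passes the is_valid check."""
    while True:
        value = read(prompt)
        if is_valid(value):
            return value
        print(retry_message)


if __name__ == "__main__":
    print(is_valid_phone("s"))               # a letter is rejected
    print(is_valid_phone("(555) 123-4567"))  # digits and punctuation are accepted
```

Calling ask("Please, enter your phone: ", is_valid_phone, "Please enter your phone correctly!") would repeat the prompt in the way the sample output shows.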

Specify Pattern Using RegEx

To specify regular expressions, metacharacters are used. For example, ^ and $ are metacharacters.

MetaCharacters

Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here’s a list of metacharacters:

[] . ^ $ * + ? {} () \ |

[] - Square brackets

Square brackets specify a set of characters you wish to match.

Expression  String     Matched?
[abc]       a          1 match
            ac         2 matches
            Hey Jude   No match
            abc de ca  5 matches

Here, [abc] will match if the string you are trying to match contains any of a, b, or c.

You can also specify a range of characters using - inside square brackets.

  • [a-e] is the same as [abcde].
  • [1-4] is the same as [1234].
  • [0-39] is the same as [01239].

You can complement (invert) the character set by using the caret symbol ^ at the start of a square bracket.

  • [^abc] means any character except a or b or c.
  • [^0-9] means any non-digit character.

. - Period

A period matches any single character (except newline '\n').

Expression  String  Matched?
..          a       No match
            ac      1 match
            acd     1 match
            acde    2 matches (contains 4 characters)

^ - Caret

The caret symbol ^ is used to check if a string starts with a certain character.

Expression  String  Matched?
^a          a       1 match
            abc     1 match
            bac     No match
^ab         abc     1 match
            acb     No match (starts with a but not followed by b)

$ - Dollar

The dollar symbol $ is used to check if a string ends with a certain character.

Expression  String   Matched?
a$          a        1 match
            formula  1 match
            cab      No match

* - Star

The star symbol * matches zero or more occurrences of the pattern to its left.

Expression  String  Matched?
ma*n        mn      1 match
            man     1 match
            maaan   1 match
            main    No match (a is not followed by n)
            woman   1 match

+ - Plus

The plus symbol + matches one or more occurrences of the pattern to its left.

Expression  String  Matched?
ma+n        mn      No match (no a character)
            man     1 match
            maaan   1 match
            main    No match (a is not followed by n)
            woman   1 match

? - Question Mark

The question mark symbol ? matches zero or one occurrence of the pattern to its left.

Expression  String  Matched?
ma?n        mn      1 match
            man     1 match
            maaan   No match (more than one a character)
            main    No match (a is not followed by n)
            woman   1 match
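To compare the star, plus, and question mark quantifiers side by side, the sketch below runs three patterns over the same list of strings. The patterns and sample strings are chosen for illustration.

```python
import re

samples = ["mn", "man", "maaan", "main", "woman"]

# For each quantifier, collect the sample strings in which the pattern is found.
results = {
    pattern: [s for s in samples if re.search(pattern, s)]
    for pattern in (r"ma*n", r"ma+n", r"ma?n")
}

for pattern, hits in results.items():
    print(pattern, hits)
```

Only "mn" and "maaan" tell the three quantifiers apart: "mn" needs zero occurrences of a, and "maaan" needs more than one.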

{} - Braces

Consider this code: {n,m}. This means at least n, and at most m repetitions of the pattern to its left.

Expression  String       Matched?
a{2,3}      abc dat      No match
            abc daat     1 match (at daat)
            aabc daaat   2 matches (at aabc and daaat)
            aabc daaaat  2 matches (at aabc and daaaat)

Let's try one more example. The RegEx [0-9]{2,4} matches at least 2 digits but not more than 4 digits.

Expression  String         Matched?
[0-9]{2,4}  ab123csde      1 match (match at ab123csde)
            12 and 345673  3 matches (12, 3456, 73)
            1 and 2        No match
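A brief sketch of bounded repetition with braces; the patterns and input strings are illustrative assumptions. It shows that the engine takes as many repetitions as the upper bound allows before starting the next match.

```python
import re

# Two or three consecutive a's: the run of four is split into "aaa" plus a leftover "a".
a_runs = re.findall(r"a{2,3}", "aabc daaaat")

# Two to four consecutive digits: six digits become one run of four and one of two.
digit_runs = re.findall(r"[0-9]{2,4}", "12 and 345673")

# Single digits never satisfy the lower bound of two.
no_runs = re.findall(r"[0-9]{2,4}", "1 and 2")

print(a_runs, digit_runs, no_runs)
```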

| - Alternation

Vertical bar | is used for alternation (or operator).

Expression  String  Matched?
a|b         cde     No match
            ade     1 match (match at ade)
            acdbea  3 matches (at acdbea)

Here, a|b matches any string that contains either a or b.

() - Group

Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains either a or b or c followed by xz.

Expression  String     Matched?
(a|b|c)xz   ab xz      No match
            abxz       1 match (match at abxz)
            axz cabxz  2 matches (axz and bxz)
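One practical detail when combining alternation with groups in Python: re.findall returns the captured group rather than the whole match when the pattern contains exactly one group. A minimal sketch, with assumed sample strings:

```python
import re

# Plain alternation: each a or b is a separate match.
either = re.findall(r"a|b", "acdbea")

# With one capturing group, findall returns only what the group captured.
captured = re.findall(r"(a|b|c)xz", "axz cabxz")

# finditer gives access to the full matched text through group(0).
whole = [m.group(0) for m in re.finditer(r"(a|b|c)xz", "axz cabxz")]

print(either, captured, whole)
```

Using a non-capturing group, (?:a|b|c)xz, would make findall return the whole matches directly.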

\ - Backslash

Backslash \ is used to escape various characters, including all metacharacters. For example,

\$a matches if a string contains $ followed by a. Here, $ is not interpreted by a RegEx engine in a special way.

If you are unsure whether a character has a special meaning or not, you can put \ in front of it. This makes sure the character is not treated in a special way.
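Escaping can be done by hand or with re.escape, which adds the backslashes for you. The strings below are made up for the example.

```python
import re

# \$ matches a literal dollar sign instead of "end of string".
dollar_a = re.findall(r"\$a", "price: $a and $b")

formula = "1+1=2"
sentence = "so 1+1=2 holds"

# Unescaped, + means "one or more 1s", so the literal text is not found.
unescaped_hit = re.search(formula, sentence)
# re.escape backslash-escapes the special characters in the text.
escaped_hit = re.search(re.escape(formula), sentence)

print(dollar_a, unescaped_hit is None, escaped_hit.group(0))
```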

Special Sequences

Special sequences make commonly used patterns easier to write. Here’s a list of special sequences:

\A - Matches if the specified characters are at the start of a string.

Expression  String      Matched?
\Athe       the sun     Match
            In the sun  No match

\b - Matches if the specified characters are at the beginning or end of a word.

Expression  String         Matched?
\bfoo       football       Match
            a football     Match
            afootball      No match
foo\b       the foo        Match
            the afoo test  Match
            the afootest   No match

\B - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.

Expression  String         Matched?
\Bfoo       football       No match
            a football     No match
            afootball      Match
foo\B       the foo        No match
            the afoo test  No match
            the afootest   Match

\d - Matches any decimal digit. Equivalent to [0-9]

Expression  String  Matched?
\d          12abc3  3 matches (at 12abc3)
            Python  No match

\D - Matches any character that is not a decimal digit. Equivalent to [^0-9]

Expression  String    Matched?
\D          1ab34"50  3 matches (at 1ab34"50)
            1345      No match

\s - Matches where a string contains any whitespace character. Equivalent to [ \t\n\r\f\v].

Expression  String        Matched?
\s          Python RegEx  1 match
            PythonRegEx   No match

\S - Matches where a string contains any non-whitespace character. Equivalent to [^ \t\n\r\f\v].

Expression  String  Matched?
\S          "a b"   2 matches (at a b)
            "   "   No match

\w - Matches any alphanumeric character (digits and letters). Equivalent to [a-zA-Z0-9_]. By the way, underscore _ is also considered an alphanumeric character.

Expression  String    Matched?
\w          12&": ;c  3 matches (at 12&": ;c)
            %"> !     No match

\W - Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_]

Expression  String  Matched?
\W          1a2%c   1 match (at 1a2%c)
            Python  No match

\Z - Matches if the specified characters are at the end of a string.

Expression  String                     Matched?
Python\Z    I like Python              1 match
            I like Python Programming  No match
            Python is fun.             No match
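The special sequences can be tried out together in a short sketch. The sample sentence is an assumption for illustration.

```python
import re

text = "Room 42, floor 3"

# \d matches one digit at a time; \d+ groups consecutive digits.
single_digits = re.findall(r"\d", text)
numbers = re.findall(r"\d+", text)

# \W picks out everything that is not a letter, digit or underscore.
non_word = re.findall(r"\W", text)

# \s+ is a convenient separator for splitting on runs of whitespace.
parts = re.split(r"\s+", text)

# \A and \Z anchor to the very start and very end of the string.
starts_with_room = re.search(r"\ARoom", text) is not None
ends_with_floor = re.search(r"floor\Z", text) is not None

print(single_digits, numbers, non_word, parts, starts_with_room, ends_with_floor)
```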

Tip: To build and test regular expressions, you can use RegEx tester tools such as regex101. This tool not only helps you create regular expressions, but also helps you learn them.

Now that you understand the basics of RegEx, let's discuss how to use RegEx in your Python code.
