Regular expressions are regular
Marek Pawelec
[email protected]
Outline
1. Regex vocabulary
2. Segmentation rules
3. Regex tagger
4. Regex text filter
5. Auto-translatables
(?<!(,|\.|\d|\d\s|\d'|\d’))
([-|\u2212]?[\d]{2,3})
(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,)
([\d]{1,2}|[\d]{4,})
(?!(,\d|\.\d|\d|\s\d|'\d|’\d))
Wildcards...
Wildcards used in regular search:
• * – any text string
• ? – any single character
...but somewhat different.
Regular expressions
• . – any character (or symbol, digit...)
• [ ] – a range
[123] – digit 1 or 2 or 3
[1-3] – any digit from 1 to 3
[A-Za-z] – any unaccented Latin letter
[^A] – any character except "A"
• | – or
1|2|3 – 1 or 2 or 3
Ranges
• Both [ ] and | mean "or".
What is the difference?
• [USDEUR]
matches a single character: U or S or D or E or U or R
• USD|EUR
matches the whole string USD or EUR
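The difference can be checked with a few lines of Python. This is only an illustrative sketch: Python's `re` module is used as a stand-in, and the sample text and variable names are made up for the example.

```python
import re

# A character class matches exactly one character from the set.
char_class = re.compile(r"[USDEUR]")
# Alternation matches one of the complete alternatives.
alternation = re.compile(r"USD|EUR")

text = "10 USD"  # sample text invented for this sketch
class_hits = char_class.findall(text)  # one hit per single character
alt_hits = alternation.findall(text)   # one hit per whole currency code

print(class_hits)
print(alt_hits)
```

The class reports three separate one-character matches, while the alternation reports the currency code as a single match.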
Special symbols
• \ – modifier ("escape" character)
. matches any character, but \. means a literal dot
\\ matches a backslash
• \d – digit [0-9]
• \s – white space
• \w – any "word" character [A-Za-z0-9_]
• \u#### – Unicode character, e.g. \u2212: − (minus sign)
Quantifiers
• ? – 0 or 1
\d? means zero or one digit
• * – 0 or more
\d* means zero or more digits
• + – 1 or more
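\d+ means at least one digit
• The quantifiers above are greedy: they match as much as possible
• Adding ? makes them lazy: they match as little as possible
*? – zero or more, as few as possible
+? – one or more, as few as possible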
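A short Python sketch of greedy versus lazy matching. The tagged sample string is an assumption chosen for the example, not taken from the slides.

```python
import re

text = "<b>bold</b>"  # sample string invented for this sketch

# Greedy: .+ grabs as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", text).group()
# Lazy: .+? stops at the first ">" that lets the match succeed.
lazy = re.search(r"<.+?>", text).group()

print(greedy)  # <b>bold</b>
print(lazy)    # <b>
```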
Quantifiers cont.
• {num} – value or range
\d{4} = 4 digits,
\d{2,4} = 2, 3 or 4 digits
\d{1,4} = from 1 to 4 digits
\d{4,} = 4 or more
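The counted quantifiers can be tried out as follows. This is a minimal sketch in Python; the helper name `fully_matches` and the test strings are assumptions made for the illustration.

```python
import re

def fully_matches(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Keep only the strings that consist of exactly the allowed number of digits.
exactly_four = [s for s in ["123", "1234", "12345"] if fully_matches(r"\d{4}", s)]
two_to_four = [s for s in ["1", "12", "1234", "12345"] if fully_matches(r"\d{2,4}", s)]
four_or_more = [s for s in ["123", "1234", "123456"] if fully_matches(r"\d{4,}", s)]

print(exactly_four, two_to_four, four_or_more)
```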
Groups
• ( ) – creates a group ($num recalls it)
• (?: ) – passive (non-capturing) group (not numbered)
Assertions
• (?= ) – lookahead assertion
memo(?=Q) will match "memo" in memoQ, but not in memory
• (?! ) – negative lookahead assertion
memo(?!Q) will match "memo" in memory, but not in memoQ
• (?<! ) – negative lookbehind assertion
(?<!s)and will match "and" in band, but not in sand
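The three assertion examples can be verified with a small Python sketch (Python's `re` is used here only as a convenient test bench; the variable names are illustrative).

```python
import re

words = ["memoQ", "memory"]

# Lookahead: "memo" only when followed by "Q".
followed_by_q = [w for w in words if re.search(r"memo(?=Q)", w)]
# Negative lookahead: "memo" only when NOT followed by "Q".
not_followed_by_q = [w for w in words if re.search(r"memo(?!Q)", w)]
# Negative lookbehind: "and" only when NOT preceded by "s".
not_after_s = [w for w in ["band", "sand"] if re.search(r"(?<!s)and", w)]

print(followed_by_q, not_followed_by_q, not_after_s)
```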
#lists#
A list contains variables:
#currency#
(EUR|USD|GBP|HUF)
#cap#
(A|B|C|D) = [ABCD]
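Outside memoQ, the effect of a #list# placeholder can be imitated by substituting an alternation before compiling the pattern. The sketch below is an assumption about how such an expansion could work, not a description of memoQ internals; `LISTS` and `expand` are hypothetical names, and a non-capturing group is used so that explicit groups keep their numbers.

```python
import re

# Hypothetical list definitions, copied from the examples above.
LISTS = {
    "currency": ["EUR", "USD", "GBP", "HUF"],
    "cap": ["A", "B", "C", "D"],
}

def expand(pattern):
    """Replace every #name# placeholder with an alternation of the list items."""
    def to_alternation(match):
        items = LISTS[match.group(1)]
        return "(?:" + "|".join(re.escape(item) for item in items) + ")"
    return re.sub(r"#(\w+)#", to_alternation, pattern)

expanded = expand(r"\d+\s#currency#")
found = re.search(expanded, "costs 25 USD").group()

print(expanded)
print(found)
```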
Regular expressions in memoQ
• Segmentation rules
• Regexp tagger
• Regexp text filter
• Auto-translatables
Segmentation rules
• #end##!#[\s]+#cap#
• #end##!#[\s]+[\d]
• #end##!#[\s]+#lpar#[\s]*#cap#
• #end##!#[\s]+#lpar#[\s]*[\d]
• #end#[\s]*#rpar##!#[\s]+#cap#
• #end#[\s]*#rpar##!#[\s]+[\d]
#end##!#[\s]+#cap# = [:\!\?\.]#!#\s+[A-Z]
• #end##!#[\s]+#cap#
Unless:
• #abbr_long##!#[\s]+#cap#
• [\s]#abbr_short##!#[\s]+#cap#
• \s#cap#\.#!#[\s]+#cap#
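A rough Python imitation of the basic break rule shows why the exceptions are needed. The sketch assumes that #end# stands for [:!?.] and #cap# for [A-Z], as in the expanded rule above; the break marker #!# is emulated with a lookbehind, the whitespace after the break is discarded, and the "Unless" rules are deliberately not modelled. The function name and sample sentence are invented.

```python
import re

# Break after sentence-ending punctuation when whitespace and a capital follow.
BASIC_RULE = re.compile(r"(?<=[:!?.])\s+(?=[A-Z])")

def segment(text):
    """Split text with the basic rule only (no abbreviation exceptions)."""
    return BASIC_RULE.split(text)

segments = segment("See Dr. Smith today. He is in.")
print(segments)  # ['See Dr.', 'Smith today.', 'He is in.']
```

The unwanted break after "Dr." is exactly the kind of case that the abbreviation exceptions are meant to suppress.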
Regex tagger
<c:0xFF00FFFF>
\<C:.*\>
0990-4905 / N537-0392
\d{4}-\d{4}
[A-Z]\d{3}-\d{4}
ERR_GRP_NO_SAMPLE
[A-Z]+(_[A-Z]+)+
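The code patterns can be checked against the sample strings in Python. This sketch assumes the patterns are meant to be read without the spaces introduced by the slide layout; the square brackets merely stand in for tags, and `mark_codes` is a hypothetical helper.

```python
import re

# The three code patterns combined; the group is non-capturing here
# because the sketch only needs the whole match.
CODE_PATTERN = re.compile(r"[A-Z]+(?:_[A-Z]+)+|[A-Z]\d{3}-\d{4}|\d{4}-\d{4}")

def mark_codes(text):
    """Wrap every recognised code in square brackets (a stand-in for a tag)."""
    return CODE_PATTERN.sub(lambda m: "[" + m.group() + "]", text)

marked = mark_codes("ERR_GRP_NO_SAMPLE in part N537-0392 or 0990-4905")
print(marked)
```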
Tip: Regex tagger without regex
Regexp text filter
*Popup "Putty" "c:\util\putty.exe"
\s*\*(.*)
*Popup .icon="$IconDir$\Fav_Star.ico" "Quick" "!DynamicFolder:$QuickLaunch$*.lnk"
"\w+(\s+\w+)*"
\w = [A-Za-z0-9_]
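What the two expressions pick out of the sample lines can be seen in a short Python sketch. It assumes the expressions are read without the layout spaces, and it says nothing about how the memoQ filter itself applies them; the inner group is made non-capturing only because Python's `findall` would otherwise return the group instead of the whole match.

```python
import re

LINE = re.compile(r"\s*\*(.*)")            # a line starting with "*"; group 1 is the rest
QUOTED = re.compile(r'"\w+(?:\s+\w+)*"')   # quoted text made of plain words only

lines = [
    r'*Popup "Putty" "c:\util\putty.exe"',
    r'*Popup .icon="$IconDir$\Fav_Star.ico" "Quick" "!DynamicFolder:$QuickLaunch$*.lnk"',
]

# For each line, take the part after "*" and list the quoted plain-word strings.
quoted = [QUOTED.findall(LINE.match(line).group(1)) for line in lines]
print(quoted)
```

Paths and commands are skipped because they contain characters outside \w, so only the menu captions remain.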
Auto-translatables
Rule for EN/DE/FRHU number
format conversion
(?<!(,|\.|\d|\d\s|\d'|\d’))
([-|\u2212]?[\d]{2,3})
(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,)
([\d]{1,2}|[\d]{4,})
(?!(,\d|\.\d|\d|\s\d|'\d|’\d))
$2 $3,$4
12,345,67 → 12 345,67
12,345.67 → 12 345,67
12.345,67 → 12 345,67
12.345.67 → 12 345,67
12 345,67 → 12 345,67
12 345.67 → 12 345,67
12’345,67 → 12 345,67
12’345.67 → 12 345,67
Not converted (rejected by the look back or look ahead assertion):
.12,345,67
,12,345.67
0 12.345,67
0’12.345.67
12 345,67,0
12 345.67.0
12’345,67 0
12’345.67’0
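The rule can be tried on these examples in Python, with two adaptations that are assumptions of this sketch: Python's `re` only accepts fixed-width lookbehinds, so the negative lookbehind is split into two, and the assertions carry no capturing parentheses, so the replacement uses groups 1 to 3. `NUMBER` and `to_hu` are hypothetical names.

```python
import re

NUMBER = re.compile(
    r"(?<![,.\d])(?<!\d[\s'’])"   # not preceded by , . digit, or digit + space/apostrophe
    r"([-|\u2212]?\d{2,3})"       # optional sign and leading 2-3 digits
    r"[.,\s'’](\d{3})[.,]"        # thousands separator, 3 digits, decimal separator
    r"(\d{1,2}|\d{4,})"           # decimals: 1-2 digits or 4 and more
    r"(?![,.\s'’]?\d)"            # not followed by (separator +) digit
)

def to_hu(text):
    """Rewrite matching numbers as '12 345,67'; leave everything else alone."""
    return NUMBER.sub(r"\1 \2,\3", text)

accepted = ["12,345,67", "12,345.67", "12.345,67", "12.345.67",
            "12 345,67", "12 345.67", "12’345,67", "12’345.67"]
rejected = [".12,345,67", ",12,345.67", "0 12.345,67", "0’12.345.67",
            "12 345,67,0", "12 345.67.0", "12’345,67 0", "12’345.67’0"]

converted = [to_hu(s) for s in accepted]
untouched = [to_hu(s) for s in rejected]
print(converted)
print(untouched)
```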
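Red elements are not necessary (for example, the parentheses inside the assertions; without them the groups are renumbered and the replacement becomes $1 $2,$3):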
(?<!(,|\.|\d|\d\s|\d'|\d’))
([-|\u2212]?[\d]{2,3})
(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,)
([\d]{1,2}|[\d]{4,})
(?!(,\d|\.\d|\d|\s\d|'\d|’\d))
$1 $2,$3
The same rule for EN → HU only
(?<!\d,|\d\.|\d)
([-–]?\d{2,3}),(\d{3})\.(\d+)
(?!,\d|\.\d|\d)
12,345.67 → 12 345,67
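A Python sketch of the EN → HU rule. Two things are assumptions here: the lookbehind is split into fixed-width parts because Python requires that, and the replacement `$1 $2,$3` (written `\1 \2,\3` in Python) is inferred from the example conversion. `EN_NUMBER` and `en_to_hu` are hypothetical names.

```python
import re

EN_NUMBER = re.compile(
    r"(?<!\d)(?<!\d[,.])"              # not inside a longer number
    r"([-–]?\d{2,3}),(\d{3})\.(\d+)"   # e.g. 12,345.67
    r"(?!,\d|\.\d|\d)"                 # number must end here
)

def en_to_hu(text):
    """Convert English-formatted numbers such as 12,345.67 to 12 345,67."""
    return EN_NUMBER.sub(r"\1 \2,\3", text)

simple = en_to_hu("Total: 12,345.67 EUR")
longer = en_to_hu("1,234,567.89")  # outside the rule's scope, stays as it is
print(simple)
print(longer)
```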
Source: Day of the week, Month Day number (st, nd, rd, th) Year
Target: day of the week day number. month year
(#day#),?\s(#month#)\s
(\d{1,2})(?:st|nd|rd|th)?
\s(\d{4})
$1 $3. $2 $4
(#day#),?\s(#month#)\s(\d{1,2})(?:st|nd|rd|th)?\s(\d{4})
$1 $3. $2 $4
#day#: Friday → piątek ($1)
#month#: May → maja ($2)
11th → 11 ($3)
2012 → 2012 ($4)
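The date rule can be imitated in Python as below. This is a sketch under stated assumptions: the #day# and #month# lists are replaced by tiny dictionaries that hold only the one pair shown in the example, and the translation lookup is done in code rather than by a translation list; `DAYS`, `MONTHS` and `convert_date` are hypothetical names.

```python
import re

# Deliberately partial lookup tables: only the pair from the example.
DAYS = {"Friday": "piątek"}
MONTHS = {"May": "maja"}

DATE = re.compile(
    r"(" + "|".join(DAYS) + r"),?\s(" + "|".join(MONTHS) + r")\s"
    r"(\d{1,2})(?:st|nd|rd|th)?"   # day number, ordinal suffix dropped
    r"\s(\d{4})"                   # year
)

def convert_date(text):
    """Reorder to 'weekday day. month year', i.e. the pattern $1 $3. $2 $4."""
    def reorder(m):
        return f"{DAYS[m.group(1)]} {m.group(3)}. {MONTHS[m.group(2)]} {m.group(4)}"
    return DATE.sub(reorder, text)

result = convert_date("Friday, May 11th 2012")
print(result)  # piątek 11. maja 2012
```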
• http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
• http://www.regularexpressions.info/tutorial.html
• http://regexlib.com
