Regular Expression Syntax
Easy GREP regular expressions are written in the syntax used by JavaScript
and VBScript, which covers most of the commonly used features.
Essential Grammar & Syntax
Key Words
Metacharacters
| Key | Description | Example | Explanation |
| ^ | Start of a line (or of the string*) | ^On | "On" must be at the start of a line (or string) |
| $ | End of a line (or of the string*) | \d$ | A digit at the end of a line (or string) |
| . | Any character except a line feed (or any character at all*) | | One character, similar to ? in Windows wildcards |
| X* | Preceding expression occurs zero or more times | be* | b, be, bee, beee, etc. |
| X+ | Preceding expression occurs one or more times | be+ | be, bee, beee, etc. |
| X? | Preceding expression occurs once or not at all | be? | b or be |
| (XX) | Groups characters, so that another metacharacter can follow the group or the group can be referenced in the result | (long )+ago | long ago, long long ago, long long long ago, etc. |
| [XX] | Character class; matches any one of the characters inside | [AD]C | AC or DC |
| | | Alternation; either side can match | A|DC | A or DC |
| (A|B) | Alternation inside a group, so that A and B can each be of any length | (AC|DC|XYZ) | AC, DC or XYZ |
| [^X] | Negative class; any character except the ones listed | [^a] | Any character except the letter "a" |
| {n} | Preceding expression is repeated exactly n times | 20{3} | Only 2000 matches |
| {n,m} | Preceding expression is repeated from n to m times | 20{1,3} | 20, 200, 2000 |
| ? | After a quantifier: match as little as possible, stopping at the nearest occurrence of what follows | .+?> | Up to the first ">" |
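The rows above can be tried in any JavaScript engine. The following is a minimal sketch, assuming Easy GREP behaves like the standard JavaScript `RegExp` object; the helper `fullMatch` and the sample strings are illustrative and not part of Easy GREP.

```javascript
// Hypothetical helper: true when the pattern matches the whole string.
function fullMatch(pattern, text) {
  return new RegExp("^(?:" + pattern + ")$").test(text);
}

const quantifierChecks = {
  // X*: zero or more
  star: ["b", "be", "bee"].every((s) => fullMatch("be*", s)),
  // X+: one or more
  plus: fullMatch("be+", "bee") && !fullMatch("be+", "b"),
  // X?: once or not at all
  optional: fullMatch("be?", "b") && !fullMatch("be?", "bee"),
  // (XX): the quantifier applies to the whole group
  group: fullMatch("(long )+ago", "long long ago"),
  // [XX] and [^X]: one character from (or not from) the set
  charClass: fullMatch("[AD]C", "DC") && !fullMatch("[AD]C", "BC"),
  negated: fullMatch("[^a]", "b") && !fullMatch("[^a]", "a"),
  // {n} and {n,m}: explicit repeat counts
  exact: fullMatch("20{3}", "2000") && !fullMatch("20{3}", "200"),
  range: ["20", "200", "2000"].every((s) => fullMatch("20{1,3}", s)),
};

// Greedy vs. non-greedy: ".+>" runs to the last ">", ".+?>" stops at the first.
const tagText = "<b>bold</b>";
const greedy = tagText.match(/.+>/)[0];
const lazy = tagText.match(/.+?>/)[0];

console.log(quantifierChecks, greedy, lazy);
```

Every entry of `quantifierChecks` should be `true`, and the last two values show how the trailing ? shortens the match.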
- * Some metacharacters have different meanings under different conditions.
^ and $ match the beginning and end of a line when "multiline" is turned on,
and the beginning and end of the whole string (possibly an entire text file) when "multiline" is off.
. matches any character when "singleline" is turned on; otherwise it matches any character except \n.
- Some metacharacters have different meanings in different positions, such as ^ and ?.
^ means the beginning of a line or string, but as the first character inside [] it means negation.
[^X] is the reverse of [X], so here ^ is not actually an
independent metacharacter but only a negation sign: the expression is a combination of
[^] and X, not of [] and ^X.
? means the preceding character or group is optional, but after + or * it means
a non-greedy match (match as little as possible).
- .|\n can match any character: . matches any character except a
newline (\n), so . or \n together mean anything. Thus (.|\n)+ can match text
of any length, which is very useful when you want to match across
lines. (In other regular expression flavors, .|\n may not work. Some have a
switch that makes . match any character, including \n.)
- *, + and ? are all quantifiers, and they can all be written
in {n,m} form, so you can think of them as shorthand for some frequently used counts: *
equals {0,}, + equals {1,} and ? equals {0,1}. Note the
omitted number after the comma, which stands for ∞: in a regular expression, inside
{}, if the second number is omitted, there is no upper limit.
- If the alternatives are all single characters (including escaped characters), the
group can be written as a class: (a|b|c|d) is equivalent to [abcd].
- Alternation with | has the lowest precedence, so word1|word2|word3 already
matches any one of the three words and is equivalent to (word1)|(word2)|(word3).
Wrap the alternatives in a group when they are part of a longer expression:
gr(a|e)y matches gray or grey, while gra|ey matches gra or ey.
- Everything inside [ and ] is read as single characters; [ ] can never
contain a word or a compound expression.
For example, [you,me] equals [eoumy,], which is interpreted as (y|o|u|,|m|e).
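Several of the notes above can be verified directly. The sketch below assumes standard JavaScript `RegExp` behavior, where the `m` flag plays the role of "multiline"; the `s` flag (ES2018) is standard JavaScript's counterpart of a "singleline" switch, and whether Easy GREP's own option maps to it is an assumption. All variable names and sample strings are illustrative.

```javascript
const sampleText = "On Monday\nOn Tuesday";

// Without the m flag, ^ anchors only at the start of the whole string.
const withoutMultiline = sampleText.match(/^On/g).length;
// With the m flag, ^ also anchors after every line break.
const withMultiline = sampleText.match(/^On/gm).length;

// . stops at a newline, while (.|\n)+ runs across lines.
const dotOnly = sampleText.match(/.+/)[0];
const acrossLines = sampleText.match(/(.|\n)+/)[0];
// The s flag makes . itself match newlines (assumed analogue of "singleline").
const dotAll = sampleText.match(/.+/s)[0];

// *, + and ? are shorthand for {0,}, {1,} and {0,1}.
const sameAsStar = "b".match(/be{0,}/)[0] === "b".match(/be*/)[0];
const sameAsPlus = "bee".match(/be{1,}/)[0] === "bee".match(/be+/)[0];
const sameAsOptional = "bee".match(/be{0,1}/)[0] === "bee".match(/be?/)[0];

// A group of single-character alternatives behaves like a class.
const mixed = "a1b2c3d4";
const viaGroup = mixed.match(/(a|b|c|d)/g).join("");
const viaClass = mixed.match(/[abcd]/g).join("");

// [you,me] is a set of single characters, not the words "you" and "me".
const classIsCharacters = /^[you,me]$/.test("y") && !/^[you,me]$/.test("you");

console.log(withoutMultiline, withMultiline, dotOnly, viaGroup, viaClass);
```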
Escaped Characters
| Key | Abbr. of | Description | Example | Explanation |
| \d | Digit | A digit | | Such as 1, 2, 3 |
| \b | Boundary | A word boundary (some other RegEx flavors use \< and \> for the same purpose) | \bsome\b | Only the word "some" matches; "something" and "handsome" do not |
| \r | Return | Carriage return | | Line break on classic Mac OS |
| \n | Newline | Newline (line feed) | | Line break on Unix and Linux systems |
| \r\n | | A whole line break | | Mostly for Windows systems; also used in most cross-platform files |
| \w | Word | Latin letters, digits and _ | | Such as a, b, c, A, B, C, 1, 2, 3, _ |
| \s | Space | White space: space, tab and line breaks | | =[ \t\r\n] |
| \t | Tab | | | |
| \X | Negative class | An uppercase letter means the negation of the class written with the lowercase letter | \S | =[^\s], any character except \s |
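The escaped characters in the table can be exercised as follows. This is a minimal sketch, assuming the standard JavaScript `RegExp` semantics for these escapes; the sample strings and variable names are made up for illustration.

```javascript
const phrase = "handsome is as handsome does, some say 42 times";

// \b keeps "some" from matching inside "handsome".
const wholeWordCount = phrase.match(/\bsome\b/g).length;
const anywhereCount = phrase.match(/some/g).length;

// \d matches digits; \w matches Latin letters, digits and underscore.
const firstNumber = phrase.match(/\d+/)[0];
const firstWord = "snake_case1 rest".match(/\w+/)[0];

// \s covers space, tab and line breaks, so one split handles all of them.
const pieces = "a b\tc\r\nd".split(/\s+/);

// \S is the negation of \s: the first character that is not white space.
const firstNonSpace = " \t x".match(/\S/)[0];

console.log(wholeWordCount, anywhereCount, firstNumber, firstWord, pieces, firstNonSpace);
```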
- In regular expressions, all escaped characters are case sensitive. A class
is generally written as a backslash followed by the lowercase first
letter of the class name, while the same expression with an uppercase letter
means the reverse: everything except the class denoted by the lowercase one.
- Be careful: although it is written with two characters, a backslash followed by a Latin
letter represents only one character. A metacharacter that follows it therefore
applies not to the Latin letter alone but to the whole escaped
class. For example, \s+ should be interpreted as (\s)+ instead of \(s+).
- Not every \x has a corresponding \X: \W means anything except \w, but \N does not mean anything except \n.
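The notes above can be illustrated with a short sketch, again assuming standard JavaScript `RegExp` behavior. Using the negated class [^\n] as a stand-in for the missing \N is a suggestion added here, not something the original notes prescribe.

```javascript
// \s+ repeats the whole class \s; it is not a backslash followed by "s+".
const whitespaceRun = "a \t\n b".match(/\s+/)[0];

// \W is the negation of \w: the first character that is not a word character.
const firstNonWord = "ab_1-cd".match(/\W/)[0];

// There is no \N class, so a negated class expresses "anything except \n".
const firstLine = "one\ntwo".match(/[^\n]+/)[0];

// Escapes are case sensitive: \d and \D select opposite sets of characters.
const digitsKept = "a1b22".replace(/\D/g, "");
const lettersKept = "a1b22".replace(/\d/g, "");

console.log(whitespaceRun.length, firstNonWord, firstLine, digitsKept, lettersKept);
```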