
ELA NOTES ON REGULAR EXPRESSIONS


CONTENTS
Backslash ( \ ) Escape | Character Classes
Escape Sequences
GNU.RegExp | jEdit Search Examples
Metacharacters
Positional Operators | Positional Parameters
Repetitions | Special Sequences
Whitespace



jEDIT SEARCH EXAMPLES

Search for..
Commented Out Links: <!-- [a-zA-Z0-9]

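Each entry lists the search expression (FIND), sample text it matches (EXAMPLE), the replacement expression (REPLACE), and the resulting text (RESULT). In the EXAMPLE and RESULT lines, (newline) marks a line break.

FIND:    ((<br />)+)\n<!\s*(\*)+>
EXAMPLE: <br />(newline)<! *******************>
REPLACE: <!*******************>$1
RESULT:  <!*******************><br />

FIND:    <p />\n<!\s*(\*)+>
EXAMPLE: <p />(newline)<! *******************>
REPLACE: \n<!*****************><br /><br />
RESULT:  <!*******************><br /><br />

FIND:    <!([^-]{2,2})([^>]+)>
EXAMPLE: <!text>
REPLACE: <!--$1$2-->
RESULT:  <!--text-->

FIND:    <font color=\"(\w)
EXAMPLE: <font color="aNumrc
REPLACE: <font color=\"#$1
RESULT:  <font color="#aNumrc

FIND:    <img src="([^/,>]+)>
EXAMPLE: <img src="text">
REPLACE: <img src="$1 />
RESULT:  <img src="text" />

FIND:    \n\n<br /><a
EXAMPLE: (newline)(newline)<br /><a
REPLACE: <br />\n\n<a
RESULT:  <br />(newline)(newline)<a

FIND:    </a>\s*(\w|\()
EXAMPLE: </a>(newline)a    or    </a>(newline)(
REPLACE: </a> $1
RESULT:  </a> a

FIND:    </a>\n<li \/>
EXAMPLE: </a>(newline)<li />
REPLACE: </a><br />\n&#149;
RESULT:  </a><br />(newline)&#149;

FIND:    ([a-z])\n([a-z])
EXAMPLE: a(newline)b
REPLACE: $1 $2
RESULT:  a b

FIND:    ([a-zA-Z0-9,=])&([a-zA-Z0-9,=])
EXAMPLE: a&b
REPLACE: $1&#38;$2
RESULT:  a&#38;b

FIND:    " & "    (space, ampersand, space; the quotes are not part of the pattern)
EXAMPLE: " & "
REPLACE: " &#38; "
RESULT:  " &#38; "

FIND:    ([a-z])\s+</a>
EXAMPLE: a </a>
REPLACE: $1</a>
RESULT:  a</a>

FIND:    :\s*(\w)
EXAMPLE: :(newline)a
REPLACE: : $1
RESULT:  : a

FIND:    &#153;
REPLACE: &#8482;

FIND:    &#147;
REPLACE: "

FIND:    &#186;(\w)
EXAMPLE: &#186;a
REPLACE: &#186; $1
RESULT:  &#186; a

FIND:    <br />&#149;
EXAMPLE: <br />&#149;
REPLACE: <br />\n&#149;
RESULT:  <br />(newline)&#149;

FIND:    &#149;(\w)
EXAMPLE: &#149;a
REPLACE: &#149; $1
RESULT:  &#149; a

FIND:    &#149;
EXAMPLE: &#149;
REPLACE: &#8226;
RESULT:  &#8226;

FIND:    <a HREF=
EXAMPLE: <a HREF=
REPLACE: <a href=
RESULT:  <a href=

FIND:    No original content.\"\) />
EXAMPLE: No original content.") />
REPLACE: No original content." />
RESULT:  No original content." />

FIND:    &AMP;
EXAMPLE: &AMP;
REPLACE: &#38;
RESULT:  &#38;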
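
A minimal Java sketch of two of the search examples above, assuming java.util.regex (through String.replaceAll) as a stand-in for jEdit's own search-and-replace engine. The class and method names are hypothetical. Both use $1-style references in the replacement, but other details of the two engines may differ.

```java
public class JEditExamplesDemo {
    // FIND ([a-z])\n([a-z])  REPLACE $1 $2
    // Joins two lowercase letters separated by a line break with a space.
    public static String joinBrokenLine(String text) {
        return text.replaceAll("([a-z])\\n([a-z])", "$1 $2");
    }

    // FIND ([a-z])\s+</a>  REPLACE $1</a>
    // Removes whitespace between a lowercase letter and a closing anchor tag.
    public static String trimBeforeAnchorClose(String text) {
        return text.replaceAll("([a-z])\\s+</a>", "$1</a>");
    }

    public static void main(String[] args) {
        System.out.println(joinBrokenLine("a\nb"));
        System.out.println(trimBeforeAnchorClose("a </a>"));
    }
}
```

Note that each backslash in the search expression is doubled in the Java string literal; jEdit's search dialog takes the expression as typed.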
METACHARACTERS

Complete list of the metacharacters:
.  ^  $  *  +  ?  {  }  [  ]  \  |  (  )

.   period or dot - matches any single character except a newline character.
(gnu.regexp note: but see the REG_MULTILINE flag)

|  vertical bar - the OR (alternation) operator: matches either the expression on its left or the expression on its right.
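
As a small illustration of the dot and the vertical bar, here is a sketch in Java, assuming java.util.regex rather than gnu.regexp (the class and method names are hypothetical). With default flags, the dot in java.util.regex also refuses to match a newline.

```java
import java.util.regex.Pattern;

public class DotAndBarDemo {
    // "a.c": an a, then any single character except a newline, then a c.
    public static boolean dotMatches(String input) {
        return Pattern.matches("a.c", input);
    }

    // "cat|dog": either the whole word cat or the whole word dog.
    public static boolean eitherMatches(String input) {
        return Pattern.matches("cat|dog", input);
    }

    public static void main(String[] args) {
        System.out.println(dotMatches("abc"));     // true
        System.out.println(dotMatches("a\nc"));    // false
        System.out.println(eitherMatches("dog"));  // true
        System.out.println(eitherMatches("cow"));  // false
    }
}
```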

POSITIONAL PARAMETERS

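A regular expression enclosed in parentheses forms a group. The text matched by each group is associated with a numbered positional parameter ($1 for the first group, $2 for the second, and so on) that can be used in the replacement, e.g.,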

FIND: </a>\s*(\w)
REPLACE: </a> $1
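
The FIND/REPLACE pair above can be tried out in Java. This is a sketch that assumes java.util.regex, which also writes group references as $1 in replacement strings; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PositionalParameterDemo {
    // FIND: </a>\s*(\w)   (backslashes doubled for the Java string literal)
    private static final Pattern AFTER_ANCHOR = Pattern.compile("</a>\\s*(\\w)");

    // REPLACE: </a> $1   ($1 is whatever the first parenthesized group matched)
    public static String normalizeSpacing(String html) {
        return AFTER_ANCHOR.matcher(html).replaceAll("</a> $1");
    }

    // Returns the text captured by group 1, or null when nothing matches.
    public static String firstCaptured(String html) {
        Matcher m = AFTER_ANCHOR.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(normalizeSpacing("</a>\n   Next"));  // </a> Next
        System.out.println(firstCaptured("</a>\n   Next"));     // N
    }
}
```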

REPETITIONS

*   asterisk - (multiplier) matches the preceding character, class, or group 0 or more times.

+   plus - matches the preceding character, class, or group 1 or more times.

?   question mark - matches the preceding character, class, or group either 0 or 1 time.

REPETITION RANGE

{m,n}   matches the preceding character, class, or group at least m and at most n times, where m and n are decimal numbers.

Default m = 0 (in engines that allow m to be omitted).
Default n = infinity (in practice, limited by available memory).

Note that the following are equivalent:
{0,} and *
{1,} and +
{0,1} and ?
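
The equivalences can be spot-checked with a short Java sketch. It assumes java.util.regex instead of gnu.regexp, uses hypothetical names, and tests only a handful of sample strings, so it illustrates the rule rather than proving it.

```java
import java.util.regex.Pattern;

public class RepetitionDemo {
    // True when the pattern "ab<quantifier>c" matches the whole input.
    public static boolean matches(String quantifier, String input) {
        return Pattern.matches("ab" + quantifier + "c", input);
    }

    // True when both quantifiers accept and reject the same sample strings.
    public static boolean agree(String q1, String q2) {
        String[] samples = {"ac", "abc", "abbc", "abbbbc", "abx"};
        for (String s : samples) {
            if (matches(q1, s) != matches(q2, s)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agree("{0,}", "*"));        // true
        System.out.println(agree("{1,}", "+"));        // true
        System.out.println(agree("{0,1}", "?"));       // true
        System.out.println(matches("{2,3}", "abbc"));  // true: two b's
        System.out.println(matches("{2,3}", "abbbbc")); // false: four b's
    }
}
```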



CHARACTER CLASSES

[ ] square brackets - enclose the set (character class) of characters to match.

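^  caret - as the first character inside a class, indicates the complementary set (any character not listed).

Most metacharacters are not active inside classes and simply match themselves. The exceptions are the backslash, the closing bracket, a leading caret, and a hyphen between two characters (which denotes a range). The special sequences in Table 1 below also remain active inside classes.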
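
A short Java sketch of these class rules, assuming java.util.regex rather than gnu.regexp (the class and method names are hypothetical):

```java
import java.util.regex.Pattern;

public class CharacterClassDemo {
    // [aeiou] matches exactly one character from the listed set.
    public static boolean isVowel(String s) {
        return Pattern.matches("[aeiou]", s);
    }

    // [^0-9] matches exactly one character that is NOT in the range 0-9.
    public static boolean isNonDigit(String s) {
        return Pattern.matches("[^0-9]", s);
    }

    // Inside a class, . * + ? are ordinary characters, not metacharacters.
    public static boolean isLiteralMeta(String s) {
        return Pattern.matches("[.*+?]", s);
    }

    public static void main(String[] args) {
        System.out.println(isVowel("e"));        // true
        System.out.println(isNonDigit("5"));     // false
        System.out.println(isLiteralMeta("*"));  // true
        System.out.println(isLiteralMeta("x"));  // false: the dot is literal here
    }
}
```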



BACKSLASH ( \ ) ESCAPE

\   backslash - can be used in two ways:

1) as an escape character
\   placed before a metacharacter in a pattern makes it match that character literally.

2) in special sequences
\   followed by a particular character stands for one of the class equivalents shown in Table 1.

SPECIAL SEQUENCES

Table 1 - Special Sequences
Sequence   Class Equivalent    Matches
\d         [0-9]               any decimal digit
\D         [^0-9]              any non-digit character
\s         [ \t\n\r\f\v]       any whitespace character
\S         [^ \t\n\r\f\v]      any non-whitespace character
\w         [a-zA-Z0-9_]        any alphanumeric character or underscore
\W         [^a-zA-Z0-9_]       any character other than an alphanumeric or underscore

The Table 1 sequences can also be included inside of a character class.
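
The sequences in Table 1 can be exercised with a brief Java sketch. It assumes java.util.regex, whose default definitions of \d, \s and \w agree with the class equivalents listed in the table; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class SpecialSequenceDemo {
    // \D matches any non-digit, so removing every \D leaves only the digits.
    public static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // \w+ matches one or more letters, digits, or underscores.
    public static boolean isWord(String s) {
        return Pattern.matches("\\w+", s);
    }

    // Special sequences inside a class: any run of digits and whitespace.
    public static boolean isDigitsAndSpaces(String s) {
        return Pattern.matches("[\\d\\s]+", s);
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("tel: 555-0100"));    // 5550100
        System.out.println(isWord("snake_case1"));          // true
        System.out.println(isWord("two words"));            // false
        System.out.println(isDigitsAndSpaces("1 2\t3"));    // true
    }
}
```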

from... Regular Expressions for Java



ESCAPE SEQUENCES

Since Java string processing takes care of certain escape sequences, gnu.regexp does not implement them itself. The following escape sequences are handled by the Java compiler when they appear in a string literal in Java source:

Table 2 - Escape Sequences
\b        backspace
\f        form feed
\n        newline
\r        carriage return
\t        horizontal tab
\"        double quote
\'        single quote
\\        backslash
\xxx      character, in octal (000-377)
\uxxxx    Unicode character, in hexadecimal (0000-FFFF)
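
One practical consequence is that a regular expression written inside a Java string literal needs each of its own backslashes doubled, because the compiler consumes one level of escaping first. The sketch below illustrates this with java.util.regex (an assumption; the notes concern gnu.regexp) and hypothetical names.

```java
import java.util.regex.Pattern;

public class JavaEscapeDemo {
    // The compiler turns the two-character escape for tab into ONE character.
    public static int tabLength() {
        return "\t".length();
    }

    // "\\d" reaches the regex engine as two characters: a backslash and a d.
    public static int digitSequenceLength() {
        return "\\d".length();
    }

    // Octal escape 101 is decimal 65, the letter A.
    public static String octalA() {
        return "\101";
    }

    // The doubled backslashes give the engine the special sequence for a digit.
    public static boolean isTwoDigits(String s) {
        return Pattern.matches("\\d\\d", s);
    }

    public static void main(String[] args) {
        System.out.println(tabLength());            // 1
        System.out.println(digitSequenceLength());  // 2
        System.out.println(octalA());               // A
        System.out.println(isTwoDigits("42"));      // true
    }
}
```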

also from... Regular Expressions for Java

POSITIONAL OPERATORS

Table 3 - Positional Operators
^     matches at the beginning of a line [1]
$     matches at the end of a line [2]
\A    matches the start of the entire string
\Z    matches the end of the entire string
\b    matches at a word break (Perl5 syntax only)
\B    matches at a non-word break, the opposite of \b (Perl5 syntax only)
\<    matches at the start of a word (egrep syntax only)
\>    matches at the end of a word (egrep syntax only)

[1] but see the REG_NOTBOL and REG_MULTILINE flags
[2] but see the REG_NOTEOL and REG_MULTILINE flags
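
The line-versus-string distinction can be seen in a short Java sketch. It assumes java.util.regex, where Pattern.MULTILINE plays a role comparable to the REG_MULTILINE flag mentioned in the footnotes; the egrep-style \< and \> operators are not shown because java.util.regex does not provide them. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PositionalOperatorDemo {
    // Counts how many times the pattern is found in the input.
    public static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // \b keeps the match from firing inside a longer word such as "concat".
    public static String replaceWholeWord(String text) {
        return text.replaceAll("\\bcat\\b", "dog");
    }

    public static void main(String[] args) {
        String lines = "one\ntwo\nthree";
        System.out.println(count("^\\w+", 0, lines));                  // 1
        System.out.println(count("^\\w+", Pattern.MULTILINE, lines));  // 3
        // \A anchors to the start of the whole string even in multiline mode.
        System.out.println(count("\\Aone", Pattern.MULTILINE, "one\none")); // 1
        System.out.println(count("^one", Pattern.MULTILINE, "one\none"));   // 2
        System.out.println(replaceWholeWord("cat concat cat")); // dog concat dog
    }
}
```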

WHITESPACE

from... ASID Working Group, Patrik Faltstrom :

whitespace = 1*(" " / <tab> / <CR> / <LF> / "@")

from Appendix A, XML Reference Material:

Whitespace [3] S ::= (#x20 | #x9 | #xD | #xA)+

Whitespace is a run of one or more

space characters (#x20)
horizontal tab (#x9)
carriage return (#xD)
linefeed (#xA)

Because of the +, 20 of these characters in a row are treated exactly the same as one.

Other ASCII whitespace characters, such as the vertical tab (#xB), are prohibited by production [2].

Other non-ASCII Unicode whitespace characters, such as the non-breaking space (#xA0), are not considered whitespace for the purposes of XML.
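
Production [3] translates directly into a character class with a + repetition. The Java sketch below assumes java.util.regex and uses hypothetical names; it also shows that the vertical tab and the non-breaking space fall outside the class.

```java
public class XmlWhitespaceDemo {
    // S ::= (#x20 | #x9 | #xD | #xA)+  written as a character class.
    private static final String XML_S = "[ \\t\\r\\n]+";

    // True when the whole string is a run of XML whitespace characters.
    public static boolean isXmlWhitespace(String s) {
        return s.matches(XML_S);
    }

    // Because of the +, any run of whitespace collapses to a single space.
    public static String collapse(String s) {
        return s.replaceAll(XML_S, " ");
    }

    public static void main(String[] args) {
        String verticalTab = String.valueOf((char) 0x0B);
        String nonBreakingSpace = String.valueOf((char) 0xA0);
        System.out.println(isXmlWhitespace(" \t\r\n"));         // true
        System.out.println(isXmlWhitespace(verticalTab));       // false
        System.out.println(isXmlWhitespace(nonBreakingSpace));  // false
        System.out.println(collapse("a \t\n   b"));             // a b
    }
}
```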