Search in Eddie offers the full power of regular expressions. Eddie implements the Perl regular expression syntax (via the outstanding Boost.Regex library). While extremely powerful, regex searches can be dense and intimidating. Eddie Find tries to help by adding a few simple touches to the standard regex search environment. Switching the Find window to regex mode enables syntax coloring in both the find and replace text fields.
With syntax coloring it is easy to spot, for instance, unbalanced parentheses and brackets, because they show up in red.
If you go ahead and hit the Find button, you will get a more comprehensive error message if your regular expression or replace format string is incorrect.
The Perl regular expression syntax extends grep, POSIX and other traditional regular expression engines. If you are familiar with any of these, you will feel right at home. We will briefly cover a subset of the rich syntax here.
With the exception of the characters listed below, all characters in a Perl regular expression match themselves in the general case. The following are considered metacharacters:
.()[{\*+?|^$
The '^' character matches the start of a line.
The '$' character matches the end of a line.
The '.' character matches any single character except for newline.
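The three anchors and wildcards above can be tried out with a short sketch. It uses Python's built-in re module, which is an assumption made for illustration only: Eddie itself uses Boost.Regex, and the re.MULTILINE flag is Python's way of getting the per-line behavior of ^ and $ described here.

```python
import re

text = "alpha beta\ngamma delta"

# re.MULTILINE makes ^ and $ match at every line, not just at the
# start and end of the whole string.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

# '.' does not match a newline, so ".+" stops at the end of the first line.
first_line = re.match(r".+", text).group(0)

print(line_starts)  # ['alpha', 'gamma']
print(line_ends)    # ['beta', 'delta']
print(first_line)   # alpha beta
```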
The *, +, ? and {} operators can be used to express the repetition of a single character, a marked sub-expression, a non-marking group or a character class.
The * operator will match the preceding item zero or more times. A commonly used pattern is:
.*
that matches "anything and everything that is left up until the end of a line".
The + operator will match the preceding item one or more times.
The ? operator will match the preceding item zero or one times. This is particularly useful for optional sub-expression matches.
The {} operator lets you specify the number of repeated matches, including a range:
a{n} Matches a exactly n times
a{n,} Matches a n or more times
a{n,m} Matches a between n and m times
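A quick way to get a feel for the repeat operators is to test a few strings against them. The sketch below uses Python's re module as a stand-in (an assumption; Eddie uses Boost.Regex, but these operators behave the same way in both). The helper name full is made up for this example.

```python
import re

def full(pattern, s):
    """Return True when the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(full(r"ab*c", "ac"), full(r"ab*c", "abbbc"))            # True True
print(full(r"ab+c", "ac"), full(r"ab+c", "abc"))              # False True
print(full(r"colou?r", "color"), full(r"colou?r", "colour"))  # True True
print(full(r"a{3}", "aaa"), full(r"a{3}", "aa"))              # True False
print(full(r"a{2,}", "aaaaa"))                                # True
print(full(r"a{2,4}", "aaaaa"))                               # False

# Repeats also apply to a marked sub-expression or a character class.
print(full(r"(ab){2}", "abab"))                               # True
print(full(r"[0-9]{4}", "2024"))                              # True
```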
The default behavior of repeats is "greedy", making them consume as much input as possible. The non-greedy repeat variants consume as little input as possible while still producing a match.
The *? operator will match the preceding item zero or more times, while consuming as little as possible. A good example of a non-greedy pattern is
/\*.*?\*/
matching C comments.
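The difference between the greedy and the non-greedy form of the C comment pattern shows up as soon as a line contains two comments. The following is a small sketch in Python's re module (an assumption used for illustration; the behavior of .* versus .*? is the same in Boost.Regex). The sample line is invented.

```python
import re

code = "x = 1; /* first */ y = 2; /* second */"

# Greedy: .* runs to the last "*/" on the line and swallows everything between.
greedy = re.findall(r"/\*.*\*/", code)

# Non-greedy: .*? stops at the first "*/" it can, so each comment matches separately.
lazy = re.findall(r"/\*.*?\*/", code)

print(greedy)  # ['/* first */ y = 2; /* second */']
print(lazy)    # ['/* first */', '/* second */']
```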
The +? operator will match the preceding item one or more times, while consuming as little as possible.
The ?? operator will match the preceding item zero or one times, while consuming as little as possible.
The {}? operator lets you specify the number of repeated matches, including a range, while consuming as little as possible:
a{n}? Matches a exactly n times, while consuming as little as possible
a{n,}? Matches a n or more times, while consuming as little as possible
a{n,m}? Matches a between n and m times, while consuming as little as possible
The | operator will match either of its arguments. For example, the expression
foo|bar|baz
matches any of the strings "foo", "bar" or "baz".
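As a small illustration, the alternation above can be run over a sample string. The sketch uses Python's re module (an assumption; Eddie uses Boost.Regex, where | works the same way), and the sample text is made up.

```python
import re

# Each alternative is tried at every position; "qux" matches none of them.
found = re.findall(r"foo|bar|baz", "foo, bar, qux, baz")

print(found)  # ['foo', 'bar', 'baz']
```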
A section of characters bracketed in a pair of ( and ) is a marked sub-expression. Marked sub-expressions may be referred to, for instance in the replace format string. Often parenthesized sub-expressions are created to help with repetition and alternation without ever referring to them as marked.
A section of characters bracketed in a pair of (?<name> and ) is a named sub-expression.
To bracket a group of expressions without creating a marked sub-expression (one that can be referenced separately in the replace format string), use the bracketed pair (?: and ). This is useful, for instance, to create a repeat group without assigning it a mark. In simple expressions, using marked sub-expressions for grouping is more common because of the simpler syntax.
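The effect of a non-marking group is easiest to see by counting the marks a match produces. This sketch uses Python's re module as an assumption for illustration; the ( ) and (?: ) forms shown here are shared with Boost.Regex.

```python
import re

# Both patterns match the same text; only the number of marks differs.
marked = re.match(r"(ab)+(c)", "ababc")
unmarked = re.match(r"(?:ab)+(c)", "ababc")

print(marked.groups())    # ('ab', 'c')  -> "c" is sub-expression 2
print(unmarked.groups())  # ('c',)       -> "c" is sub-expression 1
```

With the non-marking group, the "c" sub-expression keeps number 1 in the replace string, no matter how many grouping parentheses precede it.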
Back references may be used to refer to previously defined marked or named sub-expressions. They are particularly useful in replace format strings but can also be used in search expressions. A back reference to a marked sub-expression is written in the form \digit, where digit is in the range 1-9 and corresponds to the position of the sub-expression in the pattern. For instance, the pattern:
([a-z]).\1
can be used to match the strings "aka", "ege", etc.
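The back reference pattern above can be checked against a few candidate strings. The sketch below uses Python's re module (an assumption for illustration; \1 means the same thing in Boost.Regex).

```python
import re

# \1 must repeat whatever the first marked sub-expression matched.
pattern = re.compile(r"([a-z]).\1")

results = {word: bool(pattern.fullmatch(word)) for word in ["aka", "ege", "abc"]}

print(results)  # {'aka': True, 'ege': True, 'abc': False}
```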
Named sub-expressions can be back-referenced using the syntax \g{name}, which matches whatever the named sub-expression (?<name>...) matched.
A set of characters to be matched can be specified within a [ and ] pair. Any single character within the set can become a match. A character set can contain any of the following:
Single characters, e.g. [abcd].
Character ranges, e.g. [a-d].
Negation using the ^ operator, e.g. [^z] (anything but 'z') or [^a-z] (anything but a through z).
Character classes, e.g. [[:lower:]].
Negated character classes, e.g. [[:^lower:]].
Escaped characters, e.g. [\[] will match '['.
A character set can utilize a combination of the above, e.g.
[a-z0-f_[:cntrl:]]
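Most of the character set forms listed above can be tried out directly. The sketch below uses Python's re module, which is an assumption for illustration: it covers single characters, ranges, negation and escapes, but it does not support the [[:lower:]] style character classes that Eddie's Boost.Regex engine offers, so those are left out.

```python
import re

singles = re.findall(r"[abcd]", "a1b2z")     # single characters
ranges = re.findall(r"[a-d]+", "abcdefg")    # a character range, repeated
negated = re.findall(r"[^a-z]", "ab1c_")     # anything but a through z
escaped = re.findall(r"[\[\]]", "x[0]")      # escaped '[' and ']'

print(singles)  # ['a', 'b']
print(ranges)   # ['abcd']
print(negated)  # ['1', '_']
print(escaped)  # ['[', ']']
```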
The complete list of special characters (metacharacters) within the scope of a character set pattern is:
\^-[]
Any special character preceded by an escape \ matches itself. The following escape sequences stand for single characters.
\t | tab |
\n | newline |
\r | return |
\f | form feed |
\a | alarm |
\e | escape |
\cX | An ASCII escape sequence |
\xdd | A hexadecimal escape sequence |
\x{dddd} | A hexadecimal escape sequence |
\0ddd | An octal escape sequence |
\N{name} | A named character |
The following escapes represent generic character types.
\d | any decimal digit |
\D | any character that is not a decimal digit |
\h | any horizontal white space character |
\H | any character that is not a horizontal white space |
\s | any white space |
\S | any character that is not a white space |
\v | any vertical white space character |
\V | any character that is not a vertical white space character |
\w | any "word" character |
\W | any character that is not a "word" character |
\u | any uppercase character |
\U | any character that is not an uppercase character |
\l | any lowercase character |
\L | any character that is not a lowercase character |
\N | any character that is not a newline |
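A few of these generic character types in action, as a sketch in Python's re module. That choice is an assumption for illustration: \d, \D, \w and \s behave the same way on this sample, but Python does not provide \h, \v, \u, \l or \N with the meanings listed in the table, so only the shared ones are shown.

```python
import re

line = "id42 = 3.14\t# pi"

digits = re.findall(r"\d+", line)          # runs of decimal digits
words = re.findall(r"\w+", line)           # runs of "word" characters
fields = re.split(r"\s+", "a  b\tc")       # split on any white space
non_digits = re.findall(r"\D+", "a1b22c")  # runs of anything but digits

print(digits)      # ['42', '3', '14']
print(words)       # ['id42', '3', '14', 'pi']
print(fields)      # ['a', 'b', 'c']
print(non_digits)  # ['a', 'b', 'c']
```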
The following escapes match boundaries in the text.
\< | Matches the start of a word |
\> | Matches the end of a word |
\b | Matches at word boundary (start or end of a word) |
\B | Matches only when not a word boundary |
\` | Matches at start of a buffer only |
\' | Matches end of a buffer only |
\A | Matches at start of a buffer only (same as \`) |
\z | Matches end of a buffer only (same as \') |
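Boundary escapes are handy for "whole word" searches. The sketch below uses Python's re module as an assumption for illustration. It shows \b, \B and \A, which Python shares with Boost.Regex; the \< and \> forms and the backtick and quote forms from the table are not available there and are left out.

```python
import re

text = "cat concat cats"

whole_word = re.findall(r"\bcat\b", text)  # "cat" as a complete word only
word_start = re.findall(r"\bcat\w*", text) # words that begin with "cat"
word_tail = re.findall(r"\Bcat\b", text)   # "cat" ending a longer word ("concat")

# \A matches at the start of the buffer only, even in multi-line mode,
# whereas ^ matches at the start of every line.
buffer_start = re.findall(r"\A\w+", "one\ntwo", flags=re.MULTILINE)
every_line = re.findall(r"^\w+", "one\ntwo", flags=re.MULTILINE)

print(whole_word)    # ['cat']
print(word_start)    # ['cat', 'cats']
print(word_tail)     # ['cat']
print(buffer_start)  # ['one']
print(every_line)    # ['one', 'two']
```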
Instead of generic character classes one can use the more verbose POSIX character classes. For instance, the expression
[[:alpha:]_][[:alnum:]_]*
can be used to match any C identifier.
[:alnum:] | letters and digits |
[:alpha:] | letters |
[:ascii:] | lower ASCII characters |
[:blank:] | space and tab |
[:cntrl:] | control characters |
[:digit:] | decimal digits |
[:graph:] | printable characters excluding space |
[:lower:] | lower case letters |
[:print:] | printable characters including space |
[:punct:] | printable characters excluding letters, digits and space |
[:space:] | white space |
[:upper:] | upper case letters |
[:word:] | word characters |
[:xdigit:] | hexadecimal digits |
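To experiment with the C identifier expression outside of Eddie, here is a sketch in Python's re module. Python is an assumption made for illustration and does not support POSIX character classes, so [[:alpha:]] and [[:alnum:]] are spelled out as explicit ASCII ranges; in Eddie you would type the POSIX form shown earlier.

```python
import re

# ASCII-only equivalent of [[:alpha:]_][[:alnum:]_]*
identifier = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

candidates = ["_count", "x1", "9lives", "my-var"]
valid = [name for name in candidates if identifier.fullmatch(name)]

print(valid)  # ['_count', 'x1']
```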
Format strings are used in the Replace field when performing regex Search/Replace operations. They are mostly the same replace strings you would use in regular (non-regex) Search/Replace, with some notable exceptions. All characters in a format string are used as literals except for '$' and '\', which mark the start of placeholder and escape sequences, respectively.
Placeholder | Meaning |
$& | Entire expression match |
$+ | The last marked sub-expression in a match |
$$ | The literal '$' |
$n | The n-th marked subexpression in a match. (n is in the range 1-9) |
$+{name} | The subexpression named "name" in a match |
Escape | Meaning |
\a | alarm |
\e | escape |
\f | form feed |
\n | newline |
\r | return |
\t | tab |
\v | vertical tab |
\xdd | A hexadecimal escape sequence |
\x{dddd} | A hexadecimal escape sequence |
\n | The n-th marked subexpression in a match, where n is a digit in the range 1-9 (not the letter n, which is the newline escape above) |
\l | Causes the next emitted character to be converted to lower case. |
\u | Causes the next emitted character to be converted to upper case. |
\L | Causes all subsequent emitted characters to be converted to lower case (up until \E). |
\U | Causes all subsequent emitted characters to be converted to upper case (up until \E). |
\E | Terminates the \L or \U sequence. |
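The idea behind format strings can be sketched with Python's re.sub. This is an assumption for illustration only, and the replace syntax differs from Eddie's in places: the \1 style back reference is shared, but Python spells the entire match as \g<0> instead of $&, and it has no \U ... \E case conversion, so a small function emulates that here.

```python
import re

# \2 and \1 refer to the second and first marked sub-expressions.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Doe, Jane")

# Python's spelling of "entire expression match" ($& in Eddie).
bracketed = re.sub(r"\d+", r"<\g<0>>", "a1b22")

# Emulating \U$1\E: upper-case the marked sub-expression via a callable.
shouted = re.sub(r"\b(todo)\b", lambda m: m.group(1).upper(), "todo: fix todo list")

print(swapped)    # Jane Doe
print(bracketed)  # a<1>b<22>
print(shouted)    # TODO: fix TODO list
```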
There is more to it than what is covered here. For more detailed documentation consult, for instance, the PCRE docs.