PCRE –– tutorial part 1
written by: admin


Date Written: 8/18/07 Last Updated: 4/30/20

This tutorial starts with the basics of PCRE (Perl Compatible Regular Expressions) to help you get started writing your own patterns.  First we will go over some definitions.

\d = a digit 0–9
\D = anything that is not a digit
\w = any word character.  a–z, A–Z, 0–9, _ (all the letters, numbers and the underscore)
\W = anything that is not a word character
\s = any whitespace such as "space" , "tab" , "newline"
\S = anything that is not whitespace
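\b = a word boundary: the position between a word character and a non-word character (or the start or end of the string).
\B = any position that is not a word boundary, such as between two word characters.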

There are several others, but these are the ones we will start with.
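
To see the definitions above in action, here is a minimal sketch using PHP's preg_replace(), which is introduced in the next section.  The sample text and the variable names are made up for this illustration.

```php
<?php
// Sample text invented for this illustration.
$text = 'Room 42, floor_7!';

// \d matches each digit; every digit becomes '#'.
$digits = preg_replace('/\d/', '#', $text);      // "Room ##, floor_#!"

// \D is the opposite: everything that is not a digit is removed.
$onlyDigits = preg_replace('/\D/', '', $text);   // "427"

// \w matches letters, digits and the underscore; all of them are removed.
$nonWords = preg_replace('/\w/', '', $text);     // " , !"

// \s matches whitespace; each space becomes '_'.
$spaces = preg_replace('/\s/', '_', $text);      // "Room_42,_floor_7!"

// \b matches a position, not a character; mark every word boundary with '|'.
$bounds = preg_replace('/\b/', '|', $text);      // "|Room| |42|, |floor_7|!"

// \B matches the positions that are NOT word boundaries.
$inside = preg_replace('/\B/', '-', 'ab c');     // "a-b c"

echo $digits, "\n", $onlyDigits, "\n", $nonWords, "\n";
echo $spaces, "\n", $bounds, "\n", $inside, "\n";
```

Note that \b and \B do not consume any characters, which is why nothing is lost from the text in the last two calls.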

a simple pattern

Let's start with a basic function, preg_replace().  This PCRE function looks for a pattern within a string and replaces every match with a replacement string (a plain string, not a pattern), which can be written as a literal or held in a variable.

To identify and remove all of the word characters in the string we would use the following command:
$string = '56*#643%g3';
$string=preg_replace('/\w/','',$string);
//$string is now '*#%'
$string is now '*#%'.  The 'g' and the digits are gone because they are all word characters.  Each match was replaced with whatever is in the second pair of quotes, which in this case is nothing.

replacing a word

If we want to get more specific and remove all of the "blue" from the sentence "the big blue moon has blue rocks and blue craters." and replace it with the word "white" we could set up a preg_replace() command as follows:
$string = 'the big blue moon has blue rocks and blue craters.';
$string = preg_replace('/blue/', 'white', $string);
// this produces: the big white moon has white rocks and white craters.
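
As a small sketch of two points made so far, the replacement can come from a variable, and the \b from the definitions can keep the pattern from touching longer words.  The sentence with "bluebird" and the variable names are assumptions made up for this example.

```php
<?php
// Made-up sentence: "bluebird" contains "blue" but is a different word.
$string = 'the big blue moon has blue rocks and a bluebird.';

// The replacement is a plain string, here held in a variable.
$replacement = 'white';

// Without boundaries, "blue" is replaced wherever it appears.
$anywhere = preg_replace('/blue/', $replacement, $string);
// "the big white moon has white rocks and a whitebird."

// With \b on both sides, only the standalone word "blue" is replaced.
$wholeWord = preg_replace('/\bblue\b/', $replacement, $string);
// "the big white moon has white rocks and a bluebird."

echo $anywhere, "\n", $wholeWord, "\n";
```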

Round brackets group the letters so that they are matched in that exact order as a single unit: the pattern looks for the string "blue" as opposed to the individual letters b, l, u, and e.
$string=preg_replace('/(blue)/' ,'white', $string); does the same thing as the command above (the slashes are still needed as delimiters).  The significance of the group is that I can now put a quantifier next to it, so that the pattern matches "blue" only when it is immediately followed by another "blue", as in this case:

"the big blueblue moon has blueblue rocks and white craters."
$string=preg_replace('/(blue){2}/' ,'white', $string);
// this produces: the big white moon has white rocks and white craters.

Notice that the quantifier is in curly brackets and sits right after the group in round brackets, inside the delimiters and the quotes.  We can also ask for two or more repetitions with {2,}, or for at least 2 but not more than 5 repetitions with {2,5}.
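
The three forms of the curly-bracket quantifier can be compared side by side.  This is only a sketch; the test strings and the 'X' replacement are made up so that the differences show.

```php
<?php
// One, two and three copies of "blue" (made-up test string).
$text = 'blue blueblue blueblueblue';

// {2}: exactly two copies in a row.  The triple keeps its third "blue".
$exactlyTwo = preg_replace('/(blue){2}/', 'X', $text);   // "blue X Xblue"

// {2,}: two or more copies, as many as possible.
$twoOrMore = preg_replace('/(blue){2,}/', 'X', $text);   // "blue X X"

// {2,5}: at least two but no more than five copies.
// With six copies in a row, five are matched and one is left over.
$six = str_repeat('blue', 6);
$twoToFive = preg_replace('/(blue){2,5}/', 'X', $six);   // "Xblue"

echo $exactlyTwo, "\n", $twoOrMore, "\n", $twoToFive, "\n";
```

In every case the single "blue" is left alone, because one copy is below the minimum of two.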

using square brackets

If we wanted to remove several different letters irrespective of location like the 4 letters: "rats" we would put them in square brackets like this: '/[rats]/'.  This is known as a character class.  For example:
$string = 'the big blue moon has blue rocks and blue craters.';
$string = preg_replace('/[rats]/', '', $string);
// this produces "he big blue moon h blue ock nd blue ce."
Not very impressive I know, but it shows how square brackets work.  We can also add a quantifier such as {2}, giving '/[rats]{2}/'.  In this case the pattern only matches where any two of the letters r, a, t, or s occur next to each other.
$string = "The big white moon has white rocks and white craters.";
$string = preg_replace('/[rats]{2}/', '+', $string);
// This produces: The big white moon h+ white rocks and white c+te+.



Within the square brackets the only meta-characters are \, ^, and - (a closing ] ends the class).  Meta-characters give special meaning to the other characters in the class.
^ = negates the class so that it matches anything except the characters in the square brackets.  It only has this meaning when it is the first character of the class, as in '/[^rats]/'.
- = a range, as in 0-9 or a-z or A-Z or a-h.
\ = an escape character that gives the next character a literal meaning, so that \^ matches a literal ^ as opposed to ^ acting as a negator.
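
The three meta-characters can be tried out with a short sketch like the following.  The sample text is invented for the example.

```php
<?php
// Made-up sample containing letters, digits, '&' and '^'.
$text = 'cats & rats: 2^3 = 8';

// ^ first in the class negates it: remove everything that is NOT r, a, t or s.
$negated = preg_replace('/[^rats]/', '', $text);   // "atsrats"

// - between two characters makes a range: replace every digit 0 to 9.
$range = preg_replace('/[0-9]/', '#', $text);      // "cats & rats: #^# = #"

// \ escapes the ^ so the class matches a literal '^' or '&'.
$literal = preg_replace('/[\^&]/', '*', $text);    // "cats * rats: 2*3 = 8"

echo $negated, "\n", $range, "\n", $literal, "\n";
```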

Quantifiers

Quantifiers are interesting.  So far we have used curly brackets to say how many times a pattern may occur in a string, but there are a few other ways, some of which I am just beginning to understand.

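* = 0 or more of the preceding item
. = matches any single character except a newline (not a quantifier itself, but it is often combined with one)
? = 0 or 1 of the preceding item; placed after another quantifier it makes that quantifier match as few characters as possible
+ = one or more of the preceding item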

There are a few others, but these four are complicated enough, because they can also be combined to give further meanings.  To start, '/[rats]*/' matches 0 or more of the letters found in the character class at every position in the string, as in the following:
$string="The big white moon has white rocks and white craters.";
$string=preg_replace('/[rats]*/','+',$string);
// This produces +T+h+e+ +b+i+g+ +w+h+i++e+ +m+o+o+n+ +h++ +w+h+i++e+ ++o+c+k++ ++n+d+ +w+h+i++e+ +c++e++.+
How did this happen?  The first match occurred before the first letter: "T" is not in the class, so the pattern matched 0 times there, and that empty match was replaced with a "+".  The "T" itself was then copied unchanged and the search moved on to the next spot, the border between "T" and "h", where the pattern matched 0 times again.  Moving ahead to the word "has", the letters "a" and "s" sit next to each other and both are in the class, so the pattern matches them together and the pair is replaced with a single "+".  The search then resumes right after the "s", where it matches 0 times before the space, which is why two "+" signs appear in a row.  This paragraph may need several re-readings, but there are two main concepts, "0" and "or more", where "or more" is tried first.

A "0" match can be thought of as the border between one character and the next.  Remember that spaces are also characters.  "or more" is tried first: at each position the pattern takes the longest run of class letters it can, and settles for a 0 match only when there is none.  If we wanted there to be one or more matches we would use '/[rats]+/'.  Note that '/[rats]./' is different: it matches one letter from the class followed by any one character.
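
The difference between "0 or more" and "one or more" is easiest to see on a very short string.  The following is a small sketch; the one-word subject and the variable names are chosen just for this illustration.

```php
<?php
// Short subject so the empty matches are easy to count.
$short = 'has';

// * allows a 0 match: one before "h", then "as", then one at the end of the string.
$star = preg_replace('/[rats]*/', '+', $short);   // "+h++"

// + needs at least one letter from the class, so only "as" is replaced.
$plus = preg_replace('/[rats]+/', '+', $short);   // "h+"

// The full sentence from the tutorial, with + instead of *.
$sentence = 'The big white moon has white rocks and white craters.';
$plusSentence = preg_replace('/[rats]+/', '+', $sentence);
// "The big whi+e moon h+ whi+e +ock+ +nd whi+e c+e+."

echo $star, "\n", $plus, "\n", $plusSentence, "\n";
```

With + there are no empty matches, so each run of class letters collapses into a single "+" and nothing else in the sentence changes.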

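Here is what each combination does when it follows a character class:

.*   = matches everything from the first class letter to the end of the line, so the replacement cuts off the rest of the string.
.*?  = the lazy .* takes as little as possible, which here is nothing, so only the class letter is matched, one at a time.
*?   = the lazy * tries the 0 match first and then a single letter, so every border and every class letter is matched separately.
+?   = the lazy + takes just one letter, so it matches one at a time.
.    = matches one class letter plus the character after it.
     = if nothing is used it matches one class letter at a time.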

         .*
$string="The big white moon has white rocks and white craters.";
$string=preg_replace('/[rats].*/','+',$string);
//The big whi+


         .*?
$string="The big white moon has white rocks and white craters.";
$string=preg_replace('/[rats].*?/','+',$string);
//The big whi+e moon h++ whi+e +ock+ +nd whi+e c+++e++.


         *?
$string="The big white moon has white rocks and white craters.";
$string=preg_replace('/[rats]*?/','+',$string);
//+T+h+e+ +b+i+g+ +w+h+i+++e+ +m+o+o+n+ +h+++++ +w+h+i+++e+ +++o+c+k+++ +++n+d+ +w+h+i+++e+ +c+++++++e+++++.+


         +?
$string="The big white moon has white rocks and white craters.";
$string=preg_replace('/[rats]+?/','+',$string);
//The big whi+e moon h++ whi+e +ock+ +nd whi+e c+++e++.


         .
$string="The big white moon has white rocks and white craters.";
$string=preg_replace('/[rats]./','+',$string);
//The big whi+ moon h+ whi+ +ck++d whi+ c+++.


$string="The big white moon has white rocks and white craters.";
$string=preg_replace('/[rats]/','+',$string);
//The big whi+e moon h++ whi+e +ock+ +nd whi+e c+++e++.
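
The replaced strings above show where something matched, but not what was matched.  As a rough sketch, preg_match_all() can list the actual matches for each combination; the one-word subject and the variable names are assumptions picked to keep the output short, and *? is left out because most of its matches are empty.

```php
<?php
// One word from the example sentence keeps the lists short.
$subject = 'craters.';

$patterns = [
    '/[rats].*/',   // greedy: class letter plus everything after it
    '/[rats].*?/',  // lazy: class letter plus as little as possible
    '/[rats]+/',    // greedy: longest run of class letters
    '/[rats]+?/',   // lazy: a single class letter
    '/[rats]./',    // class letter plus the next character
    '/[rats]/',     // a single class letter
];

$found = [];
foreach ($patterns as $pattern) {
    // $m[0] holds every full match, in order of appearance.
    preg_match_all($pattern, $subject, $m);
    $found[$pattern] = $m[0];
    echo str_pad($pattern, 14), implode(' | ', $m[0]), "\n";
}
// /[rats].*/    raters.
// /[rats].*?/   r | a | t | r | s
// /[rats]+/     rat | rs
// /[rats]+?/    r | a | t | r | s
// /[rats]./     ra | te | rs
// /[rats]/      r | a | t | r | s
```

This makes it visible that '.*?' and '+?' end up matching the same single letters as the bare character class, which is why those three examples produce identical results.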



