Lesson 04 – Regular Expressions

Regular expressions describe text patterns.

Characters that have a special meaning inside a regular expression, instead of matching themselves, are called metacharacters.

=~ is the binding operator: it applies the regular expression on its right to the string on its left.

Characters written between [ and ] form a character class. Exactly one character from the class must match at that position before the rest of the regular expression is evaluated.

Inside a character class, - between two characters indicates a range, and ^ as the first character indicates negation.

Perl has shortcuts for the most common character classes.

[a-zA-Z0-9_] can be written as \w and [^a-zA-Z0-9_] as \W.
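As a minimal sketch of the character classes described above, the snippet below tests single characters against a listed class, a range, a negated class and the \w shortcut. The helper name labels_for and the sample characters are made up for illustration.

```perl
use strict;
use warnings;

# labels_for is a hypothetical helper, not part of the lesson.
# It returns one label for every character class the string matches.
sub labels_for {
    my ($char) = @_;
    my @labels;
    push @labels, 'vowel'          if $char =~ /[aeiou]/;  # any one of the listed characters
    push @labels, 'lowercase'      if $char =~ /[a-z]/;    # - gives a range
    push @labels, 'not a digit'    if $char =~ /[^0-9]/;   # leading ^ negates the class
    push @labels, 'word character' if $char =~ /\w/;       # shortcut class
    return @labels;
}

for my $char ('e', 'x', '7', '_', '!') {
    printf "%s: %s\n", $char, join(', ', labels_for($char));
}
```

Note that 7 is reported only as a word character, while _ is both "not a digit" and a word character, because \w includes the underscore.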

Metacharacters.

  • . means match any character except a newline
  • \w means match any alphanumeric character or the underscore
  • \W means match any character that is not alphanumeric or the underscore
  • \d means match any character that is a digit
  • \D means match any character that is not a digit
  • \s means match any character that is a whitespace such as a space, newline or a tab
  • \S means match any character that is not a whitespace
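  • ^ means match the beginning of the string (or of each line when the m modifier is used)
  • $ means match the end of the string, or just before a final newline (or the end of each line when the m modifier is used)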

^ and $ are called anchor metacharacters. They’re also sometimes called assertions.
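The effect of the anchors is easiest to see next to an unanchored pattern. This is a small sketch; the helper names and sample strings are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if the whole string is exactly five digits.
sub is_five_digits {
    my ($text) = @_;
    return $text =~ /^\d\d\d\d\d$/ ? 1 : 0;
}

# Hypothetical helper: true if five digits appear anywhere in the string.
sub contains_five_digits {
    my ($text) = @_;
    return $text =~ /\d\d\d\d\d/ ? 1 : 0;
}

# Hypothetical helper: true if the string begins with whitespace.
sub has_leading_space {
    my ($text) = @_;
    return $text =~ /^\s/ ? 1 : 0;
}

for my $text ('90210', 'x90210', '902101') {
    printf "%-7s anchored=%d unanchored=%d\n",
        $text, is_five_digits($text), contains_five_digits($text);
}
print "leading space: ", has_leading_space(' pizza'), "\n";
```

Only 90210 passes the anchored test; the other two strings contain five digits in a row but have extra characters before or after them.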

Quantifiers describe how many times the preceding character or group may repeat in a match.

  • * means zero or more
  • + means one or more
  • ? means zero or one time
  • {n} means n times where n is an integer
  • {n,m} means any number of times between n and m
  • {n,} means n or more times
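To compare the quantifiers side by side, the sketch below applies each one to the letter a in a few made-up words. The word list and the helper matching_words are assumptions for illustration; qr// is used only to store the patterns in a list.

```perl
use strict;
use warnings;

# Sample words with 0, 1, 2, 4 and 5 copies of the letter a.
my @candidates = ('ct', 'cat', 'caat', 'caaaat', 'caaaaat');

# matching_words is a hypothetical helper: it returns the candidates
# that match the compiled pattern it is given.
sub matching_words {
    my ($pattern) = @_;
    return grep { $_ =~ $pattern } @candidates;
}

# Each entry pairs a printable label with the anchored pattern.
my @patterns = (
    [ 'a*',     qr/^ca*t$/     ],   # zero or more
    [ 'a+',     qr/^ca+t$/     ],   # one or more
    [ 'a?',     qr/^ca?t$/     ],   # zero or one
    [ 'a{2}',   qr/^ca{2}t$/   ],   # exactly two
    [ 'a{2,4}', qr/^ca{2,4}t$/ ],   # between two and four
    [ 'a{2,}',  qr/^ca{2,}t$/  ],   # two or more
);

for my $entry (@patterns) {
    my ($label, $pattern) = @$entry;
    printf "%-7s %s\n", $label, join(' ', matching_words($pattern));
}
```

The five-a word separates the last two rows: it is matched by a{2,} but not by a{2,4}.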

Modifiers.

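  • i (Ignore case: letters match regardless of case)
  • s (Single line: . also matches a newline)
  • u (Unicode: apply Unicode rules when matching)
  • m (Multiline: ^ and $ match at the start and end of each line)
  • x (Verbose: whitespace and comments inside the pattern are ignored)
  • l (Locale: apply the current locale's rules when matching)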
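Here is a rough sketch of how four of these modifiers change the outcome of a match, using a made-up two-line string. Each pattern is tried with and without its modifier; the variable names are assumptions for the example.

```perl
use strict;
use warnings;

# Two lines of sample text, invented for this example.
my $menu = "Pizza\nspaghetti";

my %matched = (
    without_i => ($menu =~ /pizza/            ? 1 : 0),  # case differs, no match
    with_i    => ($menu =~ /pizza/i           ? 1 : 0),  # i ignores case
    without_s => ($menu =~ /Pizza.spaghetti/  ? 1 : 0),  # . stops at the newline
    with_s    => ($menu =~ /Pizza.spaghetti/s ? 1 : 0),  # s lets . match the newline
    without_m => ($menu =~ /^spaghetti/       ? 1 : 0),  # ^ is only the string start
    with_m    => ($menu =~ /^spaghetti/m      ? 1 : 0),  # m lets ^ match each line start
);

# With x, whitespace and comments in the pattern are ignored.
my $with_x = "3 slices" =~ /
    ^ \d+       # a count
    \s+         # whitespace
    slices $    # the literal word, then the end of the string
/x ? 1 : 0;

for my $name (sort keys %matched) {
    printf "%-10s %d\n", $name, $matched{$name};
}
printf "%-10s %d\n", 'with_x', $with_x;
```

Every "without" entry is 0 and every "with" entry is 1, so each modifier is the only thing that makes its pattern match.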

m/regular expression here/ is the same as /regular expression here/. Used with =~, it checks whether the left operand matches the pattern and returns true or false.
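A short sketch of the match operator in use, written both with and without the leading m. The sample sentence is borrowed from the substitution example in this lesson; the variable names are made up.

```perl
use strict;
use warnings;

my $sentence = "I love eating spaghetti.";

# Both spellings run the same match against $sentence.
my $with_m    = ($sentence =~ m/spaghetti/) ? 1 : 0;
my $without_m = ($sentence =~ /spaghetti/)  ? 1 : 0;

# A pattern that does not occur in the string gives a false result.
my $has_pizza = ($sentence =~ /pizza/) ? 1 : 0;

# The usual place for a match is the condition of an if statement.
if ($sentence =~ /spaghetti/) {
    print "Found spaghetti\n";
}
else {
    print "No spaghetti here\n";
}

print "m//: $with_m, //: $without_m, pizza: $has_pizza\n";
```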

s/find this regular expression/replace with this text/

Regex can be used to find a certain text and substitute it with another text.

The following example substitutes spaghetti with pizza:

#!/usr/bin/perl

use strict;
use warnings;

my $sentence = "I love eating spaghetti.";

$sentence =~ s/spaghetti/pizza/;

print $sentence, "\n";

This example changes the number of slices on every line to 4:

my $order = "3 slices of plain pizza
5 slices of pepperoni pizza";

$order =~ s/\d+/4/g;
print "Your order has been changed to:\n", $order, "\n";

The g modifier applies the regex globally, so the substitution replaces every run of digits with 4 instead of only the first one.

The program prints this on the screen:

Your order has been changed to:
4 slices of plain pizza
4 slices of pepperoni pizza

To extract part of a string, put parentheses around each part of the regular expression whose match you want to keep. The text matched by the first pair of parentheses is stored in $1, the text matched by the second pair in $2, and so on. This is called capturing.

perlrequick has this example:

($hours, $minutes, $second) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

The right-hand side is the match that does the capturing:

($time =~ /(\d\d):(\d\d):(\d\d)/) # in list context, returns ($1, $2, $3)

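The values are assigned to ($hours, $minutes, $second).

The parentheses around the match are optional: =~ binds more tightly than =, so the match is always evaluated before the assignment, and the parentheses only make that grouping explicit. What matters is the list on the left-hand side. It puts the match in list context, so the match returns the captured values instead of a plain true or false.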
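A runnable version of the perlrequick line might look like the sketch below. The sample time string and the "no time here" string are invented for illustration.

```perl
use strict;
use warnings;

my $time = "09:41:27";   # sample value, chosen for illustration

# In a condition, a successful match fills $1, $2 and $3.
if ($time =~ /(\d\d):(\d\d):(\d\d)/) {
    print "hours=$1 minutes=$2 seconds=$3\n";
}

# In list context, the same match returns the captured values directly.
my ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
print "$hours h, $minutes min, $seconds s\n";

# A failed match returns an empty list, so nothing is captured.
my @parts = ("no time here" =~ /(\d\d):(\d\d):(\d\d)/);
print "captured ", scalar(@parts), " values\n";
```

Checking that the match succeeded before using $1, $2 and $3 matters, because after a failed match they still hold whatever an earlier successful match left in them.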

Note: Programming Perl points out that an easy mistake is to think \w matches a word. It matches a single word character; use \w+ to match a whole word.
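The difference between \w and \w+ shows up as soon as the match is captured. This is a minimal sketch with a made-up sentence; the last line also assumes the g modifier on a match in list context, which returns every captured word.

```perl
use strict;
use warnings;

my $text = "pizza is great";   # sample text, invented for this example

# \w matches exactly one word character.
my ($single) = ($text =~ /(\w)/);

# \w+ matches a run of word characters, which is a whole word.
my ($word) = ($text =~ /(\w+)/);

# With g in list context, the match returns every captured word.
my @all_words = ($text =~ /(\w+)/g);

print "single: $single\n";
print "word: $word\n";
print "all: ", join(', ', @all_words), "\n";
```

Here \w captures only the letter p, while \w+ captures pizza.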

While learning to write regular expressions, I found this tool very useful: http://gskinner.com/RegExr/
