Lesson Two: Regular Expressions

Contact:[email protected]

One of the coolest things in Perl is regular expressions. Essentially, a
regular expression is a very sophisticated way to search through a string
and find a match. While regular expressions may look cryptic, it doesn't
take long to start understanding them. Check this out:

#!/usr/bin/perl

$line = "Love is blindness, I don't want to see";

print "blind is in the phrase\n" if ( $line =~ /blind/ );
print "love isn't in the phrase\n" unless ( $line =~ /love/ );
print "ignoring case, love is in the phrase\n" if ( $line =~ /love/i );

You'll want to run the above example to understand exactly what is going on
here. First, we establish a variable with a phrase in it. Next, we print
"blind is in the phrase" only if $line contains the string "blind", which of
course it does. But in the next line, "love" is not in the phrase because
"love" and "Love" are different strings. Notice the use of "unless", which
runs the print only when the match fails, the opposite of an "if". In the
last line, we add an "i" after the pattern to make the match case
insensitive and, hence, get a match.
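
Here is a small sketch of the same idea wrapped up in a subroutine, in case
you want to reuse it. The name "contains" is made up for this example, and
the \Q...\E pair (which tells Perl to treat the search word as plain text
rather than as a pattern) is an addition that the lesson above does not use.
It also shows !~, which is true when the match fails.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "Love is blindness, I don't want to see";

# Hypothetical helper: returns 1 if $word appears in $text, 0 otherwise.
# Pass a true third argument to ignore case.
sub contains {
    my ($text, $word, $ignore_case) = @_;
    # \Q...\E quotes any special characters in $word (an assumption of
    # this sketch: we want a literal search, not a pattern).
    my $re = $ignore_case ? qr/\Q$word\E/i : qr/\Q$word\E/;
    return $text =~ $re ? 1 : 0;
}

print "blind is in the phrase\n" if contains($line, "blind");
print "ignoring case, love is in the phrase\n" if contains($line, "love", 1);

# !~ is the negated match operator, so this reads like the "unless" line.
print "love isn't in the phrase\n" if $line !~ /love/;
```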

We can also selectively replace things in a variable:

#!/usr/bin/perl

$line = "Love is blindness, I don't want to see";

$line =~ s/want to/wanna/;
print "$line\n";

In this case, we are changing "want to" to "wanna." Preceding the regular
expression with an "s" turns the match into a substitution: the first
instance of "want to" in $line is replaced with what's on the other side of
the replace, which is the word "wanna."

What would happen if we had another occurrence of the string "want to" in
our sample? Only the first one would be replaced, unless we added a "g"
(for "global") to the end of the replace statement. For instance, running
the following line on the original phrase:

$line =~ s/s/z/g;

replaces all "s" characters with "z" characters, and the result looks like
this:

Love iz blindnezz, I don't want to zee
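
To see the difference the "g" makes, here is a short sketch that runs the
same replace with and without it. The sample phrase is made up for this
example so that "want to" shows up twice. The sketch also relies on the
fact that s/// returns the number of replacements it made.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample with two occurrences of "want to".
my $line = "I want to run, I want to hide";

# Copy $line, then replace only the first occurrence in the copy.
(my $first_only = $line) =~ s/want to/wanna/;

# Copy $line again, replace every occurrence, and keep the count
# that s///g returns.
my $count = (my $all = $line) =~ s/want to/wanna/g;

print "without g: $first_only\n";
print "with g:    $all\n";
print "replacements made with g: $count\n";
```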

How about a more practical example? Let's say that you have a whole pile of
MP3 files in a directory, and you want to get rid of the spaces in the
names.

#!/usr/bin/perl

@files = qx {ls *.mp3};

foreach $original ( @files ) {
  chomp $original;
  $modified = $original;
  $modified =~ s/ /_/g;
  print "renaming '$original' to '$modified'\n";
  qx {mv '$original' '$modified'};
}

First, we create an array of all the files ending in .mp3 with the line
@files = qx {ls *.mp3};. The qx operator executes everything between the {
and } marks as a system command and puts each line of the output, trailing
newline included, into @files. Then we do a foreach loop through all the
elements in @files. Let's say that the first element of the array is
"love is blindness.mp3\n". In the first iteration of the loop, the string
$original is set to "love is blindness.mp3\n" and then a chomp operation is
done on the string, killing the \n. Then $modified is set to the contents of
$original and has all its spaces replaced with underscores (s/ /_/g). Next
we print what we are going to do and then execute a move command with the
qx line. Because we are inside a foreach loop, the process will repeat until
all the mp3 files are renamed with underscores for spaces. Voila!
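
As an alternative sketch, the same job can be done without running ls and
mv at all, using Perl's built-in opendir, readdir, and rename. This is not
the approach taken above, just one possible variation; it avoids trouble
with filenames that contain quote marks. The subroutine names
spaces_to_underscores and rename_mp3s are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $name with spaces turned
# into underscores.
sub spaces_to_underscores {
    my ($name) = @_;
    (my $new = $name) =~ s/ /_/g;
    return $new;
}

# Hypothetical helper: rename every .mp3 file in $dir that has spaces
# in its name. Returns the number of files actually renamed.
sub rename_mp3s {
    my ($dir) = @_;

    opendir(my $dh, $dir) or die "cannot open $dir: $!";
    my @files = grep { /\.mp3$/ } readdir($dh);
    closedir($dh);

    my $count = 0;
    foreach my $original (@files) {
        my $modified = spaces_to_underscores($original);
        next if $modified eq $original;    # nothing to change

        print "renaming '$original' to '$modified'\n";
        if (rename("$dir/$original", "$dir/$modified")) {
            $count++;
        } else {
            warn "could not rename '$original': $!\n";
        }
    }
    return $count;
}

# Example call (commented out so nothing is renamed by accident):
# rename_mp3s(".");
```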

What happens if we have a bunch of awkward characters such as ( and ) in an
mp3 name, and we want to convert those to underscores? With regular
expressions come a whole slew of characters with special meanings. Here are
a few:

\w - Word character (a letter, a digit, or an underscore)
\W - Non-word character (anything \w does not match)
\t - Tab character
\d - A digit (0-9)
\D - A non-digit
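
To get a feel for the list above, here is a short sketch that counts how
many characters in a string each one matches. The sample string and the
subroutine name count_matches are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count how many times the pattern $re
# matches in $text.
sub count_matches {
    my ($text, $re) = @_;
    # Assigning the list of matches to () and then to a scalar
    # yields the number of matches.
    my $count = () = $text =~ /$re/g;
    return $count;
}

# Made-up sample: contains letters, digits, a space, a tab, and parentheses.
my $sample = "Track 07\t(live)";

printf "%-2s matches %2d characters\n", '\w', count_matches($sample, qr/\w/);
printf "%-2s matches %2d characters\n", '\W', count_matches($sample, qr/\W/);
printf "%-2s matches %2d characters\n", '\t', count_matches($sample, qr/\t/);
printf "%-2s matches %2d characters\n", '\d', count_matches($sample, qr/\d/);
printf "%-2s matches %2d characters\n", '\D', count_matches($sample, qr/\D/);
```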

So if we do this:

#!/usr/bin/perl

@files = qx {ls *.mp3};

foreach $original ( @files ) {
  chomp $original;
  $modified = $original;
  $modified =~ s/\.mp3$//;
  $modified =~ s/\W/_/g;
  print "renaming '$original' to '$modified.mp3'\n";
  qx {mv '$original' '$modified.mp3'};
}

Note that we use \W to match every non-word character and replace each one
with an underscore. But the dot in ".mp3" is also a non-word character, so
we strip the extension off first and add it back in the print and qx
commands. Obviously this little script could become as complicated as you
want, for example by adding numbers to filenames or collapsing runs of
underscores into one. Another common use...
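
As one possible take on the "deleting multiple underscore marks" idea, here
is a sketch of a subroutine that does the whole cleanup on a single name.
The name clean_name is made up for this example, and it assumes every
filename passed in ends in .mp3.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: tidy up one .mp3 filename.
sub clean_name {
    my ($name) = @_;

    (my $base = $name) =~ s/\.mp3$//;   # strip the extension (dot escaped, anchored at the end)
    $base =~ s/\W/_/g;                  # every non-word character becomes an underscore
    $base =~ s/_+/_/g;                  # collapse runs of underscores into one

    return "$base.mp3";                 # put the extension back
}

print clean_name("love (is) blindness.mp3"), "\n";
```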

