What is a Regular Expression?

A regular expression (regex) describes a set of possible input strings. In simple terms, regular expressions let us search for text patterns in files. They descend from a fundamental area of computer science called “finite automata theory”.

The simplest regular expression is a string of literal characters to match. A string matches the regular expression if it contains that substring.

Regular Expression Meta Characters

Meta characters are characters that have a special meaning during pattern processing. The following are some of the most common meta characters used in regular expressions.

  • \d – Matches any digit character (0-9). This is Perl-style shorthand; GNU grep accepts it only with the -P option, so [0-9] or [[:digit:]] is the portable form.
  • \w – Matches any word character: alphanumeric characters and the underscore (0-9)(A-Z)(a-z)(_)
  • \W – Matches any character that is not a word character (alphanumeric & underscore).
  • | (alternation) – Acts like a Boolean “OR”. Matches the expression before or after the |
  • ? (quantifier) – Matches 0 or 1 of the preceding token, effectively making it optional.
  • * (quantifier) – Matches 0 or more of the preceding token.
  • . (period) – Matches any single character.
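To make a few of these meta characters concrete, here is a minimal sketch that runs grep over a handful of made-up strings instead of a real file. It assumes GNU grep (\w is a GNU extension), and the variable names are only for illustration.

```shell
# Sample input: one candidate string per line (made-up data for illustration).
words=$(printf '%s\n' color colour cat dog 'a_1' '!!')

# ? makes the preceding "u" optional, so both spellings match.
optional=$(printf '%s\n' "$words" | grep -E 'colou?r')

# | matches either alternative.
either=$(printf '%s\n' "$words" | grep -E 'cat|dog')

# . matches any single character, so c.t matches "cat".
anychar=$(printf '%s\n' "$words" | grep -E 'c.t')

# \w matches a word character; -c counts matching lines ("!!" has none).
wordlines=$(printf '%s\n' "$words" | grep -c '\w')

printf '%s\n' "$optional" "$either" "$anychar" "$wordlines"
```

The -E option switches grep to extended syntax, where | and ? act as meta characters without a backslash.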

Regular Expression Character Classes (Character Sets)

Character classes match any one character from a specific set of characters. The order of the characters inside the character class does not matter.

For example, [aeiou] will match any one of the characters “a”, “e”, “i”, “o”, “u”.

Negated Character Classes

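Character classes can be negated by placing “^” right after the opening bracket. For example, [^aeiou] matches any single character that is not one of those vowels.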

Named Character Classes

Commonly used character classes can be referred to by name.

  • [a-zA-Z] = [[:alpha:]]
  • [a-zA-Z0-9] = [[:alnum:]]
  • [0-9] = [[:digit:]]
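As a small sketch of how the named classes are used with grep, the following filters a few made-up lines. Note the double brackets: the name goes inside an ordinary bracket expression.

```shell
# Made-up sample lines for illustration.
lines=$(printf '%s\n' 'room 101' 'lobby' '42' '---')

# Lines containing at least one digit.
with_digit=$(printf '%s\n' "$lines" | grep '[[:digit:]]')

# Lines with no alphanumeric character at all (-v inverts the match).
no_alnum=$(printf '%s\n' "$lines" | grep -v '[[:alnum:]]')

printf '%s\n' "$with_digit" "$no_alnum"
```

Here the first search keeps “room 101” and “42”, and the second keeps only “---”.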

Regular Expression Anchors

Anchors will match a position within a string, not a character.

  • ^ – Matches the beginning of a line.
  • $ – Matches the end of a line.
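The two anchors can also be combined to match a whole line. A minimal sketch on a few made-up words (one per line):

```shell
# Made-up sample words, one per line.
lines=$(printf '%s\n' cat catalog bobcat concatenate)

# ^ pins the match to the start of the line.
starts=$(printf '%s\n' "$lines" | grep '^cat')

# $ pins the match to the end of the line.
ends=$(printf '%s\n' "$lines" | grep 'cat$')

# Both anchors together: the line must be exactly "cat".
whole=$(printf '%s\n' "$lines" | grep '^cat$')

printf '%s\n' "$starts" "$ends" "$whole"
```

“concatenate” matches none of the three patterns, because its “cat” sits in the middle of the line.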

Regular Expression Match Length

A match will be the longest string that satisfies the regular expression, starting at the earliest possible position in the line.
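One way to see the longest-match behaviour is the -o option, which prints only the matched part of each line. It is not part of POSIX, but GNU and BSD grep both provide it. A minimal sketch:

```shell
# "a.*c" could stop at the first "c", but the match extends to the last one.
longest=$(printf '%s\n' 'abcabc' | grep -o 'a.*c')

printf '%s\n' "$longest"
```

This prints “abcabc” rather than “abc”.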

Regular Expression Repetition Ranges

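We can also specify repetition counts in regular expressions. The “{ }” notation specifies a range of repetitions for the immediately preceding token. With grep, use the -E option (or escape the braces as \{ \} in the basic syntax) for this notation to take effect.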

  • {n} – Repeat the previous symbol exactly n times
  • {n,} – Repeat the previous symbol n or more times
  • {n,m} – Repeat the previous symbol at least n but no more than m times.

For example, n{1,3} will match between 1 and 3 consecutive occurrences of the letter “n”.
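The sketch below tries the repetition syntax on made-up input. It assumes a grep that supports -o (GNU and BSD grep do) so that each individual match is printed on its own line.

```shell
# n{1,3} matches one to three consecutive "n" characters.
# -o prints each match separately; a run of four is split into 3 + 1.
pieces=$(printf '%s\n' 'n nn nnnn' | grep -oE 'n{1,3}')

# Exactly three digits on the whole line, extended syntax.
exact=$(printf '%s\n' 12 123 1234 | grep -E '^[0-9]{3}$')

# The same pattern in basic syntax needs escaped braces.
exact_bre=$(printf '%s\n' 12 123 1234 | grep '^[0-9]\{3\}$')

printf '%s\n' "$pieces" "$exact" "$exact_bre"
```

The first command prints “n”, “nn”, “nnn” and “n”; the other two print only “123”.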

How to use grep in Linux

The name grep comes from the ed (Unix text editor) search command g/re/p, “global regular expression print”. The command was so useful that it was written as a standalone utility. Two other variants, egrep and fgrep, make up the rest of the grep family.

grep is the answer to the moment when you know you want the file that contains a specific phrase but you can’t remember its name.

Family Differences

  • grep – uses basic regular expressions for pattern matching.
  • fgrep – fixed-string grep (equivalent to grep -F), does not use regular expressions, only matches fixed strings but can get search strings from a file.
  • egrep – extended grep (equivalent to grep -E), uses a more powerful set of regular expressions. The POSIX extended syntax does not define backreferences, although GNU grep supports them as an extension.
  • agrep – approximate grep, not standard.
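To see the difference between the family members in practice, here is a minimal sketch using the -F and -E options of grep, which correspond to fgrep and egrep. The input lines are made up for illustration.

```shell
# Made-up sample lines for illustration.
lines=$(printf '%s\n' 'a.c' 'abc' 'a+c')

# Basic syntax: . is a meta character, so all three lines match.
basic=$(printf '%s\n' "$lines" | grep 'a.c')

# Fixed strings: the pattern is taken literally, so only "a.c" matches.
fixed=$(printf '%s\n' "$lines" | grep -F 'a.c')

# Extended syntax: | works as alternation without a backslash.
extended=$(printf '%s\n' "$lines" | grep -E 'abc|xyz')

printf '%s\n' "$basic" "$fixed" "$extended"
```

Recent GNU grep releases print a warning when egrep or fgrep is invoked by name, so the -E and -F forms are the safer choice in scripts.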

grep Example

Now let’s see some uses of regular expressions in the terminal. For demonstration purposes I’m going to use the standard American English dictionary as the search file.

So let’s say you want to find every word that has the substring “cat” in it. You can find them easily using the grep command with the help of regex.

grep "cat" /usr/share/dict/american-english

Let’s say you want to find every word containing a substring that starts with “c” and ends with “t”, where the character between those two letters is any one of “a”, “e”, “i”, “o”, “u”. You can do this using character classes in regex.

grep "c[aeiou]t" /usr/share/dict/american-english

You can also negate the above character class using a “^”.

grep "c[^aeiou]t" /usr/share/dict/american-english

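If you want to find words that have the substring “cat” only at the beginning of the word, you can use anchors. Each line of the dictionary holds one word, so the beginning of a line is the beginning of a word.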

grep "^cat" /usr/share/dict/american-english

In the same way, if you want the substring “cat” at the end of a line, you can use the “$” anchor.

grep "cat$" /usr/share/dict/american-english
