ruk·si

Regular Expressions

Updated at 2014-03-12 11:00

This note contains a regular expression (regex) cheatsheet and a few commonly used regex snippets.

General tips:

  • Regex is really good for working with natural language, searching logs, refactoring a code base and validating user input.
  • Do not use regex to validate names. It is impossible.
  • Do not use regex to validate emails. The standard is so permissive that about the only thing an address must contain is an @. Validate by sending an email.
  • Do not use regex to parse any common, structured data format like XML or JSON. You can, but you really shouldn't. Use libraries for that.
  • You can use regex to validate input like credit card numbers, but you should not be too strict about it. Allow spaces and hyphens, and replace them with empty strings using regex.
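The credit card tip above can be sketched like this in Python. The function name `normalize_card_number` and the 12 to 19 digit length limit are assumptions made for this example, not rules from the note.

```python
import re

def normalize_card_number(raw):
    """Strip spaces and hyphens, then accept the input only if digits remain."""
    digits = re.sub(r"[ -]", "", raw)
    # The 12-19 digit range is an assumption for this sketch.
    if re.fullmatch(r"[0-9]{12,19}", digits):
        return digits
    return None

print(normalize_card_number("4111 1111-1111 1111"))  # 4111111111111111
print(normalize_card_number("4111-11x1"))            # None
```

The regex only cleans up and checks the shape of the input; it says nothing about whether the card is real.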

Literal matching. Literal matching is the most basic type of regex search. Searches are usually case-sensitive by default, but the behaviour can be configured.

    dog  # match dog, nothing else
    cat  # match cat, nothing else
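A small sketch of literal matching with Python's `re` module, where case-insensitive search is switched on with the `re.IGNORECASE` flag. The sample text is made up for illustration.

```python
import re

text = "Dog, dog, hotdog and cat"

# Case-sensitive by default: "Dog" is skipped, "hotdog" contains a match.
sensitive = re.findall(r"dog", text)
# The flag makes the same pattern match regardless of case.
insensitive = re.findall(r"dog", text, re.IGNORECASE)

print(sensitive)    # ['dog', 'dog']
print(insensitive)  # ['Dog', 'dog', 'dog']
```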

Character class. A group of possible characters.

    c[auo]t  # match cat, cut or cot

    # Character class notation may change the meaning of special symbols.
    [.]      # match only .
    [\\[\]]  # match only \, [ or ]

Character class range.

    [a-d]     # match a, b, c or d
    [0-9a-c]  # match 0, 5, b, c, etc.

Character class negation.

    [^a]    # match anything but a, e.g. b, c, 0...
    [^a-z]  # match anything but letters a to z
    [^^]    # match anything but character ^
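The character class, range and negation rules above can be checked with a few lines of Python. The test strings are made up for this sketch.

```python
import re

words = ["cat", "cut", "cot", "cit"]

# Plain character class: one of a, u or o between c and t.
class_hits = [w for w in words if re.fullmatch(r"c[auo]t", w)]
# Range: digits 0-9 plus letters a-c.
range_hits = re.findall(r"[0-9a-c]", "a5dz0")
# Negation: everything except lowercase a-z.
negated_hits = re.findall(r"[^a-z]", "ab1 C")
# A caret that is not first in the class is a literal caret.
no_caret = re.findall(r"[^^]", "a^b")

print(class_hits)    # ['cat', 'cut', 'cot']
print(range_hits)    # ['a', '5', '0']
print(negated_hits)  # ['1', ' ', 'C']
print(no_caret)      # ['a', 'b']
```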

Shorthand character classes.

    .   # match any character (usually except a line break)
    \d  # match a digit character, [0-9]
    \D  # match a non-digit character, [^0-9]
    \w  # match a word character, [0-9A-Za-z_]
    \W  # match a non-word character, [^0-9A-Za-z_]
    \s  # match a whitespace character
    \S  # match a non-whitespace character

    c.t   # match cat, cot, czt, c.t, etc., but not ceet or ct
    c\.t  # match c.t
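A short Python sketch of the shorthand classes, using made-up sample strings. Note that in Python 3 `\d` and `\w` also match non-ASCII digits and letters unless the `re.ASCII` flag is given, so the bracket equivalents listed above hold only for ASCII input.

```python
import re

words = ["cat", "c.t", "ct", "ceet"]

# The dot stands for exactly one character of any kind.
any_char = [w for w in words if re.fullmatch(r"c.t", w)]
# An escaped dot matches only a literal dot.
literal_dot = [w for w in words if re.fullmatch(r"c\.t", w)]

digits = re.findall(r"\d", "Room 42b")
word_runs = re.findall(r"\w+", "snake_case, 2 words")
non_space_runs = re.findall(r"\S+", "a  b\tc")

print(any_char)        # ['cat', 'c.t']
print(literal_dot)     # ['c.t']
print(digits)          # ['4', '2']
print(word_runs)       # ['snake_case', '2', 'words']
print(non_space_runs)  # ['a', 'b', 'c']
```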

Quantifiers.

    y{3}      # match yyy
    [abc]{2}  # match aa, ab, ac, ba, etc.

Quantifier ranges.

    y{3,5}       # match yyy, yyyy or yyyyy
    colou{0,1}r  # find colour or color
    y{1,}        # match one or more ys, e.g. y, yy, yyy, etc.
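The quantifier forms above, tried out in Python on a few made-up strings:

```python
import re

# Exact count: seven ys contain two non-overlapping runs of three.
exact = re.findall(r"y{3}", "yyyyyyy")
# Range: two ys are too few, six ys yield one run of five.
ranged = re.findall(r"y{3,5}", "yy yyy yyyyyy")
# {0,1} makes the u optional.
spelling = re.findall(r"colou{0,1}r", "color colour colouur")
# Open-ended range: one or more.
at_least_one = re.findall(r"y{1,}", "y yy")

print(exact)         # ['yyy', 'yyy']
print(ranged)        # ['yyy', 'yyyyy']
print(spelling)      # ['color', 'colour']
print(at_least_one)  # ['y', 'yy']
```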

Quantifiers are greedy.

    It was aaaaawful
    a{3,5}  # match aaaaa, not just aaa

To make a quantifier ungreedy, append ? to it:

    ".*?"  # match "word", but in chunks as small as possible
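A minimal Python sketch of greedy versus ungreedy matching. The quoted sample text is an assumption made for illustration; the `aaaaawful` input comes from the example above.

```python
import re

text = 'say "hello" and "bye"'

# Greedy: .* runs to the last quote in the text.
greedy = re.findall(r'".*"', text)
# Ungreedy: .*? stops at the first closing quote it can.
lazy = re.findall(r'".*?"', text)

# The same applies to counted quantifiers.
greedy_run = re.search(r"a{3,5}", "It was aaaaawful").group(0)
lazy_run = re.search(r"a{3,5}?", "It was aaaaawful").group(0)

print(greedy)      # ['"hello" and "bye"']
print(lazy)        # ['"hello"', '"bye"']
print(greedy_run)  # aaaaa
print(lazy_run)    # aaa
```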

Shorthand quantifiers.

    mo?t  # ? is same as {0,1}, match mt or mot.
    mo*t  # * is same as {0,}, match mt, mot, moot, etc.
    mo+t  # + is same as {1,}, match mot, moot, etc.
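The three shorthand quantifiers compared side by side in Python. The helper `full_matches` is a made-up name for this sketch.

```python
import re

candidates = ["mt", "mot", "moot"]

def full_matches(pattern):
    """Return the candidates that the pattern matches in full."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

optional = full_matches(r"mo?t")     # zero or one o
zero_or_more = full_matches(r"mo*t") # any number of o, including none
one_or_more = full_matches(r"mo+t")  # at least one o

print(optional)      # ['mt', 'mot']
print(zero_or_more)  # ['mt', 'mot', 'moot']
print(one_or_more)   # ['mot', 'moot']
```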

Alternation.

    cat|dog        # match cat or dog
    [cat|dog]      # match a, c, d, g, o, t or |, not alternation.
    H(ä|ae)ndel    # match Händel, or Haendel
    H(ä|ae)?ndel   # match Hndel, Händel, or Haendel
    H(ä|ae?)ndel   # match Händel, Handel, or Haendel
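The Händel variants above can be verified with a short Python sketch. The helper `matching` is a made-up name used only here.

```python
import re

names = ["Händel", "Haendel", "Handel", "Hndel"]

def matching(pattern):
    """Return the names that the pattern matches in full."""
    return [n for n in names if re.fullmatch(pattern, n)]

required = matching(r"H(ä|ae)ndel")    # the group must match
optional = matching(r"H(ä|ae)?ndel")   # the whole group is optional
optional_e = matching(r"H(ä|ae?)ndel") # only the e is optional

# Inside brackets | is an ordinary character, not alternation.
bracket_hits = re.findall(r"[cat|dog]", "c|x")

print(required)      # ['Händel', 'Haendel']
print(optional)      # ['Händel', 'Haendel', 'Hndel']
print(optional_e)    # ['Händel', 'Haendel', 'Handel']
print(bracket_hits)  # ['c', '|']
```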

Word boundaries. There is a word boundary between every word character \w and non-word character \W. The start and the end of the input also count as word boundaries when the adjacent character is a word character. Word boundaries are not characters and have no width.

    it's a cat          # there are 8 word boundaries in this.
    _it_'_s_ _a_ _cat_  # now marked with underscores.

    \b         # match a word boundary.
    \b\w{3}\b  # match a three letter word
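To see where the boundaries fall, this Python sketch lists the position of every zero-width `\b` match in `it's a cat`. The second sample sentence is made up for illustration.

```python
import re

text = "it's a cat"

# Each \b match is empty, so only its position is interesting.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

# Whole three letter words only; "quietly" has no boundary after 3 letters.
three_letter = re.findall(r"\b\w{3}\b", "the cat sat on a mat quietly")

print(boundaries)       # [0, 2, 3, 4, 5, 6, 7, 10]
print(len(boundaries))  # 8
print(three_letter)     # ['the', 'cat', 'sat', 'mat']
```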

Line boundaries.

    # How regular expressions see lines:
    start-of-line, line characters, end-of-line, line-break
    start-of-line, line characters, end-of-line, line-break
    start-of-line, line characters, end-of-line, line-break
    start-of-line, line characters, end-of-line

    ^cat       # match cat starting from start-of-line
    cat$       # match cat ending in end-of-line
    ^cat$      # match a line with only cat
    ^Once      # match start-of-line, followed by Once
    prince\.$  # match prince., followed by end-of-line
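A Python sketch of line boundaries. One implementation detail to be aware of: in Python's `re`, `^` and `$` anchor to the whole input by default and only match at every line when `re.MULTILINE` is set. The sample story is made up for this example.

```python
import re

story = "Once upon a time\nthere was a cat\ncat\nand a prince."

# Without the flag the anchors refer to the whole input.
whole_input = re.findall(r"^cat$", story)
# With the flag they match at every line start and line end.
only_cat_lines = re.findall(r"^cat$", story, re.MULTILINE)
lines_ending_in_cat = re.findall(r"cat$", story, re.MULTILINE)

starts_with_once = re.search(r"^Once", story) is not None
ends_with_prince = re.search(r"prince\.$", story) is not None

print(whole_input)          # []
print(only_cat_lines)       # ['cat']
print(lines_ending_in_cat)  # ['cat', 'cat']
print(starts_with_once)     # True
print(ends_with_prince)     # True
```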

Groups. Groups are primarily used to indicate order of operations inside regex and to map results to these groups.

    (red|blue)?                              # match red, blue or empty string
    (Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day  # match Monday, Tuesday, etc.

Capture groups. Regular expressions return their results using capture groups. Capture group 0 is the whole match; the defined groups are then numbered from 1 in the order of their opening parentheses.

    ([0-9]{2})\\([0-9]{2})\\([0-9]{2})
    Today is 03\22\99.
    -> capture group 0 is 03\22\99, the whole match
    -> capture group 1 is 03
    -> capture group 2 is 22
    -> capture group 3 is 99

    (\w*)ility  # match a word ending with ility
    accessibility
    -> capture group 0 is accessibility
    -> capture group 1 is accessib

    (\w+) had a ((\w+) \w+)
    I had a nice day
    -> capture group 0 is I had a nice day
    -> capture group 1 is I
    -> capture group 2 is nice day
    -> capture group 3 is nice

    ((cat)|dog)
    dog
    -> capture group 0 is dog
    -> capture group 1 is dog
    -> capture group 2 is empty
    cat
    -> capture group 0 is cat
    -> capture group 1 is cat
    -> capture group 2 is cat

    a(\w)*
    avocado
    -> capture group 0 is avocado
    -> capture group 1 is o, a repeated group keeps only its last repetition

    a(\w*)
    avocado
    -> capture group 0 is avocado
    -> capture group 1 is vocado
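The capture group examples can be reproduced in Python, where `group(n)` returns the text of group n and a group that did not take part in the match is `None`. This is a sketch of how Python's `re` behaves; other engines may report empty groups differently. A repeated group such as `(\w)*` keeps only its last repetition in Python.

```python
import re

date = re.search(r"([0-9]{2})\\([0-9]{2})\\([0-9]{2})", r"Today is 03\22\99.")
date_groups = date.groups()

nested = re.search(r"(\w+) had a ((\w+) \w+)", "I had a nice day")
nested_groups = [nested.group(i) for i in range(4)]

# Group 2, (cat), takes no part in matching "dog".
unused_group = re.fullmatch(r"((cat)|dog)", "dog").group(2)

# Quantifier outside the group: only the last repetition is kept.
repeated = re.match(r"a(\w)*", "avocado").group(1)
# Quantifier inside the group: the whole run is captured.
inner = re.match(r"a(\w*)", "avocado").group(1)

print(date_groups)    # ('03', '22', '99')
print(nested_groups)  # ['I had a nice day', 'I', 'nice day', 'nice']
print(unused_group)   # None
print(repeated)       # o
print(inner)          # vocado
```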

Replacement. Replacement syntax depends on the implementation, but you usually refer to capture groups inside the replacement expression, e.g. \1 means "contents of group 1".

    ([0-9]{2})\\([0-9]{2})\\([0-9]{2})  # Regular expression.
    \2-\1-19\3                          # Replacement expression, standard.
    ${2}-${1}-19${3}                    # Replacement expression, PHP style.
    Today is 03\22\99. => Today is 22-03-1999.
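In Python the same replacement can be sketched with `re.sub`, which accepts the `\1` style and also the more explicit `\g<1>` form. The `${1}` form is not supported by Python's `re`.

```python
import re

pattern = r"([0-9]{2})\\([0-9]{2})\\([0-9]{2})"
text = r"Today is 03\22\99."

# Backreference style: swap groups 1 and 2, prefix the year with 19.
short_form = re.sub(pattern, r"\2-\1-19\3", text)
# Explicit style: avoids ambiguity when a digit follows the reference.
explicit_form = re.sub(pattern, r"\g<2>-\g<1>-19\g<3>", text)

print(short_form)     # Today is 22-03-1999.
print(explicit_form)  # Today is 22-03-1999.
```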

Searching between delimiters.

    .*?  # ? is used to make the search ungreedy
    Tarzan likes Jane => Tarzan

    [^"]+
    "Hello" "World" => "Hello" "World"
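A Python sketch of both approaches to searching between delimiters, using double quotes as the delimiter. Wrapping the inner part in a capture group is a choice made for this example, so that `findall` returns the text without the quotes.

```python
import re

text = '"Hello" "World"'

# Ungreedy: stop at the first closing quote.
lazy = re.findall(r'"(.*?)"', text)
# Negated class: anything that is not a quote cannot run past one.
negated = re.findall(r'"([^"]+)"', text)
# Greedy, for comparison: runs to the last quote in the text.
greedy = re.findall(r'"(.*)"', text)

print(lazy)     # ['Hello', 'World']
print(negated)  # ['Hello', 'World']
print(greedy)   # ['Hello" "World']
```

The negated class version gives the same result here without relying on ungreedy matching, as long as the delimiter is a single character.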

Sources