Regular Expression Tester

Test and debug regular expressions instantly. Validate patterns, view matches, and understand regex behavior with real-time feedback.

Regex Quick Reference

Character Classes

. = any character
\d = digit [0-9]
\w = word [a-zA-Z0-9_]
\s = whitespace
[abc] = a, b, or c

Quantifiers

* = 0 or more
+ = 1 or more
? = 0 or 1
{n} = exactly n
{n,m} = n to m

Anchors & Groups

^ = start of string
$ = end of string
\b = word boundary
(...) = capture group
(?:...) = non-capture

What are Regular Expressions?

Regular expressions (regex) are powerful patterns used to match character combinations in strings. They provide a concise and flexible means of searching, matching, and manipulating text.

Regex is supported in virtually all programming languages and text editors, making it an essential tool for developers, data analysts, and anyone working with text processing.

Common Use Cases

• Email Validation: Validate email address formats
• Phone Numbers: Extract and validate phone number formats
• URL Parsing: Extract URLs and validate web addresses
• Data Extraction: Parse log files and structured text
• Input Validation: Validate user input in forms

Quick Reference

Basic Metacharacters

• . (any character)
• * (zero or more)
• + (one or more)
• ? (zero or one)
• ^ (start of string)
• $ (end of string)

Character Classes

• [abc] (a, b, or c)
• [a-z] (any lowercase)
• [A-Z] (any uppercase)
• [0-9] (any digit)
• \d (digit)
• \w (word character)
• \s (whitespace)

Quantifiers

• {n} (exactly n times)
• {n,} (n or more times)
• {n,m} (n to m times)

Grouping, Alternation & Escaping

• () (grouping)
• | (alternation)
• \ (escape character)

Common Patterns

Email Pattern

[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

Matches most common email formats

Phone Number (US)

^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$

Matches US phone number formats

URL Pattern

https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)

Matches HTTP and HTTPS URLs

Date (MM/DD/YYYY)

^(0[1-9]|1[0-2])\/(0[1-9]|[12]\d|3[01])\/(19|20)\d{2}$

Matches MM/DD/YYYY date format
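
As a rough sketch of how two of the patterns above might be applied in Python, the snippet below checks a few sample strings. The names PHONE_US, DATE_MDY, phone_parts and is_mdy_date are illustrative assumptions, and the phone pattern is written with escaped parentheses around the optional area code.

```python
import re

# Hypothetical names; the patterns follow the list above.
PHONE_US = re.compile(r"^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$")
DATE_MDY = re.compile(r"^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/(19|20)\d{2}$")

def phone_parts(text):
    """Return (area, prefix, line) if text looks like a US number, else None."""
    m = PHONE_US.match(text)
    return m.groups() if m else None

def is_mdy_date(text):
    """True if text has the MM/DD/YYYY shape; calendar validity is not checked."""
    return DATE_MDY.match(text) is not None

print(phone_parts("(555) 123-4567"))
print(phone_parts("555.123.4567"))
print(is_mdy_date("02/30/2024"))  # right shape, even though the date does not exist
print(is_mdy_date("13/01/2024"))
```

Note that the date pattern only checks the shape of the string, so a separate calendar check is still needed to reject dates such as 02/30.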

How Regular Expressions Work

The Theory Behind Regular Expressions

Regular expressions are based on formal language theory, specifically finite automata. A regex pattern can be converted into a state machine that processes input one character at a time. The engine maintains a current state and transitions between states based on the input characters and pattern rules.
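
To make the state-machine idea concrete, here is a minimal hand-built automaton for the pattern ab*c, written in Python. The state names and the TRANSITIONS table are illustrative assumptions; real engines construct such tables automatically and are far more elaborate.

```python
# Hand-built DFA for the pattern ab*c (illustrative sketch only).
TRANSITIONS = {
    ("start", "a"): "seen_a",   # the leading a
    ("seen_a", "b"): "seen_a",  # any number of b characters
    ("seen_a", "c"): "accept",  # the closing c
}

def dfa_fullmatch(text):
    """Process text one character at a time; True if we end in the accept state."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: the input is rejected
            return False
    return state == "accept"

print(dfa_fullmatch("abbbc"))
print(dfa_fullmatch("abcb"))
```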

There are two main matching approaches: DFA (Deterministic Finite Automaton) and backtracking NFA (Nondeterministic Finite Automaton). DFA engines are faster and more predictable: they examine each input character once, so matching time grows linearly with input length. Backtracking NFA engines, which most programming languages use, support more features like backreferences and lookarounds but can be slower, potentially exhibiting catastrophic backtracking on certain patterns.

The dot metacharacter (.) matches any single character except newline by default. Character classes [abc] match any one character from the set. Negated classes [^abc] match any character not in the set. Ranges like [a-z] specify all characters between a and z. The caret ^ and dollar $ anchor patterns to the start and end of the string respectively (or of each line in multiline mode).
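
The following Python sketch illustrates these building blocks on a small sample string. The variable names and the sample text are assumptions chosen for illustration.

```python
import re

text = "cat cot c\nt cut"

dot_default = re.findall(r"c.t", text)                # dot does not match the newline
dot_all = re.findall(r"c.t", text, flags=re.DOTALL)   # DOTALL lets dot match "\n" too
in_class = re.findall(r"c[au]t", text)                # middle character must be a or u
negated = re.findall(r"c[^a\n]t", text)               # anything except a or newline
anchored = re.fullmatch(r"^[a-z]+$", "regex") is not None  # whole string is lowercase

print(dot_default)
print(dot_all)
print(in_class)
print(negated)
print(anchored)
```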

Quantifiers specify how many times an element should repeat. The asterisk * means zero or more times, plus + means one or more times, question mark ? means zero or one time. Specific counts use curly braces: {3} matches exactly 3 times, {2,5} matches 2 to 5 times, {3,} matches 3 or more times.
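
A small Python sketch can show which strings each quantifier accepts. The sample list and the accepted helper are hypothetical, and re.fullmatch is used so that the whole string has to match.

```python
import re

SAMPLES = ["", "a", "aa", "aaa", "aaaaaa"]

def accepted(pattern):
    """Return the sample strings that the pattern matches in full."""
    return [s for s in SAMPLES if re.fullmatch(pattern, s)]

for pattern in ("a*", "a+", "a?", "a{3}", "a{2,5}", "a{3,}"):
    print(pattern, accepted(pattern))
```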

Greedy vs. lazy quantifiers affect matching behavior. By default, quantifiers are greedy: they match as much as possible. For example, the pattern <.*> applied to "<div>Hello</div>" matches the entire string, because .* runs on to the last > it can reach. Lazy quantifiers (*?, +?, ??) match as little as possible: <.*?> matches <div> and then </div> separately.
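
The difference is easy to see in Python with re.findall; the variable names here are illustrative.

```python
import re

html = "<div>Hello</div>"

greedy = re.findall(r"<.*>", html)   # one match spanning the whole string
lazy = re.findall(r"<.*?>", html)    # two short matches, one per tag

print(greedy)
print(lazy)
```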

History and Evolution

Regular expressions originated in the 1950s when mathematician Stephen Kleene described regular languages using mathematical notation. The Kleene star (*) is named after him. In 1968, Ken Thompson implemented the first computational regex support in the text editor QED, later bringing it to ed and then grep in Unix. The name grep comes from the ed command g/re/p (globally search for a regular expression and print matching lines).

Early Unix tools used simple regular expressions with basic metacharacters. In the 1980s, Henry Spencer created a regex library for Unix that became the foundation for many implementations. Perl, first released in 1987, significantly extended regex capabilities, and Perl 5 added features like lookaheads, lazy quantifiers, and non-capturing groups. The PCRE (Perl Compatible Regular Expressions) library later reimplemented this syntax for use from C and other languages.

Perl-style syntax became an unofficial standard, influencing implementations in PHP, Python, Java, JavaScript, and other languages. Each language added its own extensions: Python's named groups, Java's Unicode support, JavaScript's sticky flag. The result is a family of similar but not identical regex flavors, each with unique capabilities and quirks.

Modern regex engines have evolved to handle Unicode properly. Early implementations assumed ASCII—each character was one byte. Unicode introduced multibyte characters, combining characters, and complex scripts. Engines now support Unicode properties like \p{Letter} for any letter in any language, and \p{Script=Arabic} for Arabic characters.

The ECMAScript 2018 specification added several features to JavaScript regex: lookbehind assertions, named capture groups, Unicode property escapes, and the dotAll (s) flag. These brought JavaScript closer to other modern regex implementations, though differences remain—JavaScript lacks possessive quantifiers and atomic groups that prevent backtracking.

Advanced Pattern Features

Capturing groups use parentheses to remember matched substrings: (\d{4})-(\d{2})-(\d{2}) captures year, month, and day separately. You can reference these captures: \1 refers to the first captured group. Non-capturing groups (?:...) group elements without creating a capture, improving performance when you don't need the captured value.
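
As an illustration in Python, the sketch below reads the captured groups from a date, shows that non-capturing groups leave no capture behind, and uses \1 to find a doubled word. All names and sample strings are assumptions.

```python
import re

ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
MONTH_ONLY = re.compile(r"(?:\d{4})-(\d{2})-(?:\d{2})")  # only the month is captured
DOUBLED = re.compile(r"\b(\w+) \1\b")                    # \1 repeats whatever group 1 matched

year, month, day = ISO_DATE.fullmatch("2024-03-15").groups()
month_only = MONTH_ONLY.fullmatch("2024-03-15").groups()
doubled = DOUBLED.search("it was the the best")

print(year, month, day)
print(month_only)
print(doubled.group(1))
```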

Lookaheads and lookbehinds are zero-width assertions—they match a position without consuming characters. Positive lookahead (?=...) succeeds if the pattern matches what follows. Negative lookahead (?!...) succeeds if the pattern doesn't match. Lookbehind (?<=...) and (?<!...) check what comes before. For example, \d+(?= dollars) matches numbers followed by " dollars" without including " dollars" in the match.
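
The four assertion types can be sketched in Python as follows. The sample text and variable names are hypothetical; the \b anchors in the negative cases are there to stop the engine from settling for part of a number.

```python
import re

text = "Price: 100 dollars, 200 euros, $300"

followed_by_dollars = re.findall(r"\d+(?= dollars)", text)      # positive lookahead
not_followed = re.findall(r"\b\d+\b(?! dollars)", text)         # negative lookahead
after_dollar_sign = re.findall(r"(?<=\$)\d+", text)             # positive lookbehind
not_after_dollar_sign = re.findall(r"(?<!\$)\b\d+", text)       # negative lookbehind

print(followed_by_dollars)
print(not_followed)
print(after_dollar_sign)
print(not_after_dollar_sign)
```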

Word boundaries \b match positions between word and non-word characters. The pattern \bcat\b matches "cat" but not the "cat" in "category" or "concatenate". This is crucial for matching whole words. \B matches non-boundary positions: \Bcat\B would match the "cat" in "concatenate" but not standalone "cat".
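
A short Python check of both cases, reporting the position of each match; the sample sentence is an assumption.

```python
import re

text = "cat category concatenate"

whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]   # standalone "cat" only
inside_word = [m.start() for m in re.finditer(r"\Bcat\B", text)]  # "cat" buried in a word

print(whole_word)
print(inside_word)
```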

Character class shorthands simplify common patterns: \d matches digits [0-9], \w matches word characters [a-zA-Z0-9_], \s matches whitespace. Uppercase versions are negated: \D matches non-digits, \W matches non-word characters, \S matches non-whitespace. The exact sets vary by engine and flags: in Python 3, for example, \d and \w match Unicode digits and letters beyond ASCII by default, while in JavaScript \d and \w are ASCII-only.
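
In Python 3, for instance, the shorthands are Unicode-aware for str patterns unless the re.ASCII flag is set. The sketch below uses Arabic-Indic digits and an accented word as sample inputs; the variable names are illustrative.

```python
import re

arabic_digits = "\u0661\u0662\u0663"  # Arabic-Indic digits one, two, three

digits_unicode = re.fullmatch(r"\d+", arabic_digits) is not None
digits_ascii = re.fullmatch(r"\d+", arabic_digits, flags=re.ASCII) is not None
word_unicode = re.fullmatch(r"\w+", "caf\u00e9") is not None
word_ascii = re.fullmatch(r"\w+", "caf\u00e9", flags=re.ASCII) is not None

print(digits_unicode, digits_ascii)
print(word_unicode, word_ascii)
```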

Backreferences allow matching repeated patterns: (['"])(.*?)\1 matches quoted strings with the same quote character at start and end. The \1 must match whatever the first group captured. If the first group matched a single quote, \1 matches only a single quote. This enables sophisticated pattern matching but can cause performance issues in NFA engines.
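
Here is a minimal Python sketch of the quoted-string pattern. The sample text is an assumption; it ends with a mismatched pair of quotes to show that \1 insists on the same quote character.

```python
import re

QUOTED = re.compile(r"""(['"])(.*?)\1""")

text = "say \"it's fine\" or 'plain' but not \"mixed'"

# group(1) is the quote character, group(2) is the content between the quotes
contents = [m.group(2) for m in QUOTED.finditer(text)]

print(contents)
```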

Practical Applications

Input validation is a primary regex use case. Email validation requires checking for valid characters before and after @ with a proper domain. However, perfectly validating emails per RFC 5322 requires an extremely complex regex—practical patterns balance strictness with usability. Phone number validation similarly varies by region and format requirements.

Text parsing and extraction leverage capturing groups. Parse log files with patterns like (\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+): (.*) to extract timestamps, log levels, and messages. Extract URLs from text with patterns matching protocol, domain, and path components. Process CSV data respecting quoted fields containing commas.
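
The log pattern above might be used in Python roughly like this. The log line, the LOG_LINE name and the parse_log_line helper are hypothetical.

```python
import re

LOG_LINE = re.compile(r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+): (.*)")

def parse_log_line(line):
    """Return a dict of date, time, level and message, or None if the line does not fit."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    date, time_of_day, level, message = m.groups()
    return {"date": date, "time": time_of_day, "level": level, "message": message}

print(parse_log_line("2024-03-15 08:30:00 ERROR: disk full"))
print(parse_log_line("not a log line"))
```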

Search and replace operations become powerful with regex. Find-and-replace with captured groups: replace (\d{3})-(\d{2})-(\d{4}) with XXX-XX-$3 to mask SSNs except last four digits. Reformat dates, normalize whitespace, or restructure code. Many text editors and IDEs support regex in find-and-replace, making bulk refactoring efficient.
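
The SSN masking example could look like this in Python. Note that Python's re.sub writes group references in the replacement as \3 (or \g<3>) rather than $3; the mask_ssn helper and sample text are assumptions.

```python
import re

SSN = re.compile(r"(\d{3})-(\d{2})-(\d{4})")

def mask_ssn(text):
    """Replace the first five digits of each SSN-shaped number, keeping the last four."""
    return SSN.sub(r"XXX-XX-\3", text)

print(mask_ssn("SSN: 123-45-6789, alt: 987-65-4321"))
```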

Syntax highlighting and code analysis use regex to identify tokens. Lexers tokenize source code using regex patterns for keywords, identifiers, strings, and operators. Markdown parsers use regex to find headers, links, and formatting. While complex parsing often requires more sophisticated tools like parsers and ASTs, regex handles many lightweight parsing tasks effectively.

Performance and Optimization

Catastrophic backtracking occurs when a backtracking (NFA) engine explores exponentially many paths. The pattern ^(a+)+$ tested against "aaaaaaaaab" causes massive backtracking: the final b prevents a match, so the engine tries every possible way to distribute the a characters between the nested quantifiers before giving up. The solution is to rewrite patterns to avoid ambiguous quantifier nesting, or to use possessive quantifiers (a++)+ or atomic groups (?>a+)+ when available.
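
One rough way to observe this in Python is to time an anchored nested pattern, ^(a+)+$, against its flat rewrite ^a+$ on inputs that cannot match. The input sizes are kept small on purpose, the helper name is hypothetical, and the timings will vary by machine and Python version.

```python
import re
import time

NESTED = re.compile(r"^(a+)+$")  # ambiguous nested quantifiers
FLAT = re.compile(r"^a+$")       # accepts the same strings without nesting

def timed_search(pattern, text):
    """Return (matched, seconds) for a single search."""
    start = time.perf_counter()
    matched = pattern.search(text) is not None
    return matched, time.perf_counter() - start

results = []
for n in (8, 12, 16):
    text = "a" * n + "b"  # the trailing b forces the match to fail
    results.append((n, timed_search(NESTED, text), timed_search(FLAT, text)))

for n, (_, nested_seconds), (_, flat_seconds) in results:
    print(f"n={n}: nested {nested_seconds:.5f}s, flat {flat_seconds:.5f}s")
```

On a backtracking engine the nested timing is expected to grow much faster than the flat one as n increases, so avoid raising n far beyond these values.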

Anchor patterns when possible to reduce unnecessary matching attempts. If you know text must start with specific content, begin the pattern with ^. If validating an entire string, use ^...$. Without anchors, the engine might try matching at every position. The pattern \d{3}-\d{4} tested against long text tries matching at every character position until it finds a match.
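
In Python, re.fullmatch has a similar effect to wrapping a pattern in ^...$, while re.search scans for a match at any position. A small sketch with hypothetical helper names:

```python
import re

FRAGMENT = re.compile(r"\d{3}-\d{4}")

def contains_fragment(text):
    """Unanchored: succeeds if the pattern occurs anywhere in text."""
    return FRAGMENT.search(text) is not None

def is_exact_fragment(text):
    """Anchored at both ends: the whole string must fit the pattern."""
    return FRAGMENT.fullmatch(text) is not None

print(contains_fragment("call 555-1234 now"))
print(is_exact_fragment("call 555-1234 now"))
print(is_exact_fragment("555-1234"))
```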

Character classes are more efficient than alternation. Use [abc] instead of (a|b|c). The character class is evaluated in one step, while alternation creates multiple paths for the engine to explore. Similarly, [0-9] is optimized in most engines compared to (0|1|2|3|4|5|6|7|8|9). Use specific patterns instead of overly general ones when possible.

Compile regex patterns once and reuse them. Most languages offer a compile step that processes the pattern into an internal representation. In loops or repeated operations, compile once before the loop rather than on each iteration. In JavaScript, create a RegExp object outside the loop. In Python, use re.compile(). This preprocessing can provide significant performance improvements.
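
A minimal Python sketch of the compile-once approach; WORD and count_words are illustrative names.

```python
import re

# Compiled once at import time and reused for every line.
WORD = re.compile(r"\b\w+\b")

def count_words(lines):
    """Count word tokens across all lines using the precompiled pattern."""
    return sum(len(WORD.findall(line)) for line in lines)

print(count_words(["hello world", "one two three", ""]))
```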

Common Pitfalls and Best Practices

Escaping special characters correctly is crucial. Metacharacters like . * + ? [ ] ( ) { } | ^ $ \ have special meanings and must be escaped with a backslash to match literally. In string literals, you often need double escaping: "\\d" in many languages. Use raw strings (r"" in Python), regex literals (/\d/ in JavaScript), or String.raw template literals to avoid escaping confusion.
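
In Python, re.escape builds a pattern that matches a string literally, and raw strings remove one layer of backslashes. The sample strings below are assumptions.

```python
import re

literal = "1+1=2 (really?)"

unescaped = re.search("1+1", literal)                   # + is a quantifier here, so this needs "11"
escaped = re.search(re.escape("1+1"), literal)          # matches the literal text 1+1
whole = re.fullmatch(re.escape(literal), literal)       # every metacharacter is escaped
same_pattern = r"\d" == "\\d"                           # raw string equals the double-escaped form

print(unescaped)
print(escaped.group())
print(whole is not None)
print(same_pattern)
```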

Don't use regex when simpler alternatives exist. Checking if a string starts with "http" is better done with startsWith() than regex. Simple substring matching with indexOf() or includes() is faster and clearer than regex. Reserve regex for pattern matching where you need its power—wildcards, character classes, quantifiers, or alternation.

Test edge cases thoroughly. Empty strings, very long strings, strings with special characters, Unicode characters, and strings that almost match can reveal bugs. Regex bugs are often subtle—a pattern that works for typical inputs may fail on edge cases. Use a comprehensive test suite and consider fuzz testing with random input generation.

Document complex patterns with comments and examples. Many regex flavors support verbose mode (x flag) allowing whitespace and comments within patterns. Even without verbose mode, add comments explaining what each pattern does and provide example matches and non-matches. Your future self and teammates will thank you when maintaining the code.
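
As an example of verbose mode, here is the MM/DD/YYYY date pattern rewritten with Python's re.VERBOSE flag. The group names are assumptions added for readability.

```python
import re

DATE = re.compile(
    r"""
    ^(?P<month>0[1-9]|1[0-2])      # month: 01-12
    /
    (?P<day>0[1-9]|[12]\d|3[01])   # day: 01-31
    /
    (?P<year>(?:19|20)\d{2})       # year: 1900-2099
    $
    """,
    re.VERBOSE,
)

# Example match:     12/25/2023
# Example non-match: 2023-12-25
match = DATE.match("12/25/2023")
print(match.groupdict())
print(DATE.match("2023-12-25"))
```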

FAQ

Why does my regex work in one language but not another?

Different programming languages implement different regex flavors with varying features. JavaScript lacked lookbehinds until ES2018 and still lacks possessive quantifiers. In Python, \A and \Z always anchor to the start and end of the whole string, while ^ and $ also match at line boundaries in multiline mode. Java requires double-escaping backslashes in string literals. Test patterns in your target language and consult its specific documentation. Tools like regex101.com let you select different flavors to test compatibility.
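
One flavor-specific detail, sketched in Python: with re.MULTILINE, ^ and $ match at every line boundary, while \A and \Z still refer to the whole string. The sample text is an assumption.

```python
import re

text = "first\nsecond"

per_line = re.findall(r"^\w+$", text, flags=re.MULTILINE)        # one match per line
whole_string = re.findall(r"\A\w+\Z", text, flags=re.MULTILINE)  # the newline blocks a match
string_start = re.findall(r"\A\w+", text, flags=re.MULTILINE)    # only the very first word

print(per_line)
print(whole_string)
print(string_start)
```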

What's the difference between greedy and lazy quantifiers?

Greedy quantifiers (* + {n,}) match as much as possible while still allowing the overall pattern to match. Lazy quantifiers (*? +? {n,}?) match as little as possible. In <div>test</div>, the pattern <.*> (greedy) matches the entire string, while <.*?> (lazy) matches <div> and </div> separately. Use lazy quantifiers when you want minimal matches or need to capture content between delimiters.

How can I match Unicode characters and emoji?

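Modern regex engines support Unicode through property escapes and Unicode-aware modes. In JavaScript (ES2018+), use \p{Letter} to match any letter in any language, \p{Emoji} for emoji, or \p{Script=Han} for Han (Chinese) characters. Property escapes require the u flag: /\p{Letter}+/u. Note that \p{Emoji} also matches digits, # and *, because they carry the Emoji property. In Python 3, str patterns are Unicode-aware by default, so \w and \d match letters and digits from any script, but the built-in re module does not support \p{...}; the third-party regex module does. For engines without property escapes, you may need to specify Unicode code point ranges like [\u4E00-\u9FFF] for common Chinese characters.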
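
A brief Python sketch using only the built-in re module: \w is Unicode-aware by default, and an explicit code point range picks out the Chinese characters. The sample text and variable names are illustrative.

```python
import re

text = "Hello 世界 мир"

all_words = re.findall(r"\w+", text)                   # letters from any script
ascii_words = re.findall(r"\w+", text, flags=re.ASCII) # ASCII letters, digits and _ only
han_runs = re.findall(r"[\u4E00-\u9FFF]+", text)       # explicit code point range

print(all_words)
print(ascii_words)
print(han_runs)
```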

What is catastrophic backtracking and how do I avoid it?

Catastrophic backtracking occurs when a regex with nested quantifiers tries exponentially many match combinations. Patterns like (a+)+ or (a|a)* cause this when the rest of the pattern then fails to match. The engine backtracks through every possible way to distribute characters, taking seconds or minutes for short inputs. Avoid it by: not nesting quantifiers, using possessive quantifiers or atomic groups when available, being specific instead of using .*, and testing patterns with increasingly long inputs. Rewrite (a+)+ as a+ (usually what you meant anyway).