Finding the Bad Apple in Your Regular Expressions

Code Security · March 13th 2024 · 30:00

Deep dive into how SonarQube's security rules detect dangerous, malformed, or ReDoS-vulnerable regular expressions hiding in your codebase before they reach production.

Introduction to Regular Expression Vulnerabilities

During a Sonar webinar, Johann Bilites, a Java and Kotlin software engineer at Sonar, demonstrated how poorly written regular expressions can introduce critical security vulnerabilities. The session showed that regular expressions, despite their usefulness for pattern matching, become significant liabilities when they are not carefully constructed. The discussion was motivated by high-profile infrastructure outages at Stack Overflow in 2016 and Cloudflare in 2019, both traced back to the same root cause: catastrophic backtracking, also called a runaway regular expression. When an attacker triggers this behavior deliberately, the result is a ReDoS (Regular Expression Denial of Service) attack.

Understanding Catastrophic Backtracking

To comprehend catastrophic backtracking, one must first understand how backtracking works within regex engines. Many regular expression engines match input by making choices: whenever a pattern offers several ways to proceed (a quantifier that could consume more or fewer characters, or an alternation between sub-expressions), the engine picks one option and continues. When that choice fails to produce a match, the engine backtracks and attempts the alternatives. The problem emerges when backtracking becomes excessive and the number of steps grows faster than linearly, anywhere from quadratic to exponential in the input length.
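The growth described above can be made visible with a toy matcher that counts its own steps. This is a minimal sketch and not how any production engine is implemented: the function count_match_steps and its tiny pattern syntax (literal characters, a dot for any character, and a star for greedy zero-or-more) are assumptions made for illustration.

```python
def count_match_steps(pattern, text):
    """Match text fully against a toy pattern and return (matched, steps).

    Supported syntax (illustrative only): literal characters, "." for any
    character, and "*" after a character for greedy zero-or-more.
    """
    steps = 0

    def match_here(p, t):
        nonlocal steps
        steps += 1  # one step = one attempt to continue from position (p, t)
        if p == len(pattern):
            return t == len(text)
        char = pattern[p]
        if p + 1 < len(pattern) and pattern[p + 1] == "*":
            # Greedy: find the longest run, then give back one character at a time.
            longest = t
            while longest < len(text) and char in (".", text[longest]):
                longest += 1
            for stop in range(longest, t - 1, -1):
                if match_here(p + 2, stop):
                    return True
            return False  # every choice failed, so the caller has to backtrack
        if t < len(text) and char in (".", text[t]):
            return match_here(p + 1, t + 1)
        return False

    return match_here(0, 0), steps


# Two adjacent stars compete for the same characters; the missing "b" forces
# the matcher to try every way of dividing the run of "a" between them.
for length in (10, 20, 40):
    matched, steps = count_match_steps("a*a*b", "a" * length)
    print(f"length={length:3} matched={matched} steps={steps}")
```

With two competing quantifiers the step count grows roughly quadratically on failing input, while a matching input such as a run of "a" followed by "b" finishes in a handful of steps. Nesting one quantifier inside another is what pushes the growth toward exponential.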

Bilites illustrated this problem with a practical example: a regex that matches email domains in order to filter for students from the UK and the US (.ac.uk and .edu domains). When matching a valid email like "name@cam.ac.uk," the regex engine needed approximately 30 to 40 steps to process the input successfully. However, when presented with a malformed input ending in ".com" instead of one of the expected domains, the engine needed over 200,000 steps to determine that the input did not match. This dramatic increase in processing steps shows how seemingly innocent regex patterns can become performance nightmares when they encounter unexpected or adversarial input.
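The gap between valid and malformed input can be reproduced in a few lines. The webinar's exact regex is not reproduced in this summary, so the pattern below is a hypothetical stand-in with the same ingredients (a lazy wildcard inside a repeated group, followed by an alternation); the names STUDENT_EMAIL and timed_match are likewise assumptions. Python's re module is used because it is also a backtracking engine.

```python
import re
import time

# Hypothetical pattern (an assumption, not the webinar's exact regex):
# a lazy wildcard inside a repeated group, followed by an alternation.
STUDENT_EMAIL = re.compile(r"^[^@]+@(.*?\.)+(edu|ac\.uk)$")


def timed_match(address):
    """Return (matched, elapsed_seconds) for a single match attempt."""
    start = time.perf_counter()
    matched = STUDENT_EMAIL.match(address) is not None
    return matched, time.perf_counter() - start


valid = "name@cam.ac.uk"
# 14 dot-terminated labels and a ".com" ending, so the match has to fail.
malformed = "name@" + "a." * 14 + "com"

for label, address in (("valid", valid), ("malformed", malformed)):
    matched, seconds = timed_match(address)
    print(f"{label:9} matched={matched} elapsed={seconds * 1000:.3f} ms")
```

Absolute timings depend on the machine and the engine, so only the relative difference between the two lines is meaningful. Adding more dots to the malformed address makes the failing case slower very quickly, so keep the count small when experimenting.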

The Mechanism of Regex Backtracking Issues

The root cause of this blow-up lies in how the regex engine handles quantifiers combined with alternation. In the email domain example, the pattern applied a lazy (non-greedy) quantifier to a wildcard and followed it with a set of alternatives. When the engine encounters input that partially matches but ultimately fails, it must explore the many combinations in which the preceding wildcards could consume characters. With 14 literal dots in the malformed input, the engine tries every possible way the dot wildcard could consume those dots before it reaches the final alternation, and the number of combinations grows exponentially with the number of dots.
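To see why the number of dots matters, the combinations can be counted directly. The sketch below assumes a hypothetical repeated group of the form (.*?\.)+, since the webinar's exact pattern is not given here, and counts how many distinct ways that group can match a prefix of the input. The function name count_group_arrangements is an illustrative assumption.

```python
def count_group_arrangements(text):
    """Count the ways a repeated "lazy wildcard plus dot" group can match a prefix."""
    # Hypothetical group under discussion: (.*?\.)+
    ways_ending_at_dot = []
    for ch in text:
        if ch == ".":
            # The iteration that ends on this dot is either the first one (1 way)
            # or continues from any arrangement that ended on an earlier dot.
            ways_ending_at_dot.append(1 + sum(ways_ending_at_dot))
    return sum(ways_ending_at_dot)


for dots in (2, 7, 14):
    domain = "a." * dots + "com"
    print(f"dots={dots:2} arrangements={count_group_arrangements(domain)}")
```

Every dot can either end an iteration of the group or be swallowed by the wildcard, so k dots give 2^k - 1 arrangements: 3 for two dots and 16,383 for fourteen. When the rest of the pattern fails after each of them, a backtracking engine has to visit them all, and each visit costs several steps of its own.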

Static Analysis and Sonar's Approach

Recognizing the severity of these issues, Sonar began building static analysis rules that detect and flag problematic regular expressions two years before the webinar. Rather than waiting for these vulnerabilities to cause runtime issues or outages, Sonar analyzes code during development to identify patterns that could lead to catastrophic backtracking. This proactive detection lets developers refactor their regex patterns before deployment, preventing potential denial-of-service vulnerabilities in production environments.
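Sonar's actual rules are not described in this summary. As a rough illustration of what it means to inspect a pattern without running it, the sketch below flags one well-known risk: an unbounded quantifier nested inside another unbounded quantifier. It relies on Python's internal regex parser (re._parser, or sre_parse on older versions), which is undocumented and may change; the function name has_nested_quantifier is an assumption, and the heuristic is far cruder than a real analyzer.

```python
try:
    from re import _parser as regex_parser  # Python 3.11 and later (internal API)
except ImportError:
    import sre_parse as regex_parser  # older Python versions

UNBOUNDED = regex_parser.MAXREPEAT
REPEAT_OPS = (regex_parser.MAX_REPEAT, regex_parser.MIN_REPEAT)


def _child_patterns(op, arg):
    """Yield the sub-patterns nested inside one parsed node."""
    if op in REPEAT_OPS:
        yield arg[2]  # (min, max, body)
    elif op == regex_parser.SUBPATTERN:
        yield arg[-1]  # the group body is the last element
    elif op == regex_parser.BRANCH:
        yield from arg[1]  # (None, [alternatives])


def _contains_unbounded_repeat(nodes):
    for op, arg in nodes:
        if op in REPEAT_OPS and arg[1] == UNBOUNDED:
            return True
        if any(_contains_unbounded_repeat(c) for c in _child_patterns(op, arg)):
            return True
    return False


def has_nested_quantifier(pattern):
    """Heuristic: is an unbounded repeat nested inside another unbounded repeat?"""
    def walk(nodes):
        for op, arg in nodes:
            if (op in REPEAT_OPS and arg[1] == UNBOUNDED
                    and _contains_unbounded_repeat(arg[2])):
                return True
            if any(walk(c) for c in _child_patterns(op, arg)):
                return True
        return False

    return walk(regex_parser.parse(pattern))


for candidate in (r"(a+)+$", r"(.*?\.)+(edu|ac\.uk)$", r"[a-z]+@[a-z]+\.edu"):
    print(f"{candidate!r:28} flagged={has_nested_quantifier(candidate)}")
```

The check over-approximates: it also flags nested quantifiers that are unambiguous and therefore harmless, and it misses other sources of backtracking such as overlapping alternatives. It only shows that the shape of a pattern can be examined before the code ever runs.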

Key Takeaways

  • Catastrophic backtracking occurs when regex engines perform exponential-time computations due to excessive backtracking on ambiguous quantifier patterns, particularly when input fails to match expected patterns
  • High-profile infrastructure outages at major companies have been traced directly to ReDoS vulnerabilities, making regex security a critical concern for production systems
  • Static code analysis tools like Sonar can detect problematic regex patterns during development, enabling developers to refactor expressions before they cause performance issues
  • Regex engine behavior is non-intuitive: patterns that work efficiently on valid input can require orders of magnitude more processing steps on invalid or malformed input
  • Developers must carefully consider quantifier combinations and alternation patterns when constructing regular expressions to avoid creating potential denial-of-service vulnerabilities
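As an illustration of the last point, one common refactoring is to make each repeated piece unambiguous, so that the engine has only one way to divide the input. The sketch below is a hypothetical rewrite of a student-domain check, not the fix presented in the webinar; the pattern and the names SAFER_STUDENT_EMAIL and is_student_address are assumptions.

```python
import re

# Hypothetical rewrite: each domain label is one or more characters that are
# neither "." nor "@", so a label can only end at the next literal dot.
SAFER_STUDENT_EMAIL = re.compile(r"^[^@]+@([^.@]+\.)+(edu|ac\.uk)$")


def is_student_address(address):
    """Return True if the address ends in a .edu or .ac.uk domain."""
    return SAFER_STUDENT_EMAIL.match(address) is not None


print(is_student_address("name@cam.ac.uk"))
print(is_student_address("name@mit.edu"))
# 40 labels and a ".com" ending: rejected without exploring alternative splits.
print(is_student_address("name@" + "a." * 40 + "com"))
```

Because the character class excludes the dot, the wildcard can no longer swallow a separator, and a failing input is rejected in time roughly proportional to its length. The rewrite is slightly stricter than a wildcard-based pattern (empty labels are no longer accepted), which is worth checking against the intended behavior before swapping patterns.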