Regular Expressions | Quiz 1 | #WinjaCTF2021

Dec 22, 2020

Introduction

To promote the upcoming Winja CTF 2021 competition, which is to be held during Nullcon 2021, Winja released its first online quiz challenge as part of its promotional events. Participants were expected to read a set of regular expressions and choose the most appropriate answer.

This blog post is an attempt to explain regular expressions by taking the example from this quiz.

The Question

To extract absolute URLs from source code, which of the following regular expressions cannot be used?

http(s)*://[^"]*
http[s]?://[^"']+
[\w]+[s]?[/:]{3}[^"']+
[\w]+[s]?[^"']+
[\w]+[s]?://[^"']+

Defining Regular Expressions

A regular expression is a search pattern that can be used in search engines, in search and replace dialogues of word processors and text editors, in text processing utilities such as grep, sed and AWK, etc. It is also referred to as regex or regexp.

Decoding Regex Elements

If you are new to regular expressions, these strings might appear too cryptic. But, as you start using them and once you understand the elements that are used in building a regular expression, these same cryptic texts would leave you in awe.

For beginners, here is a quick explanation of the elements used to build the 5 regular expressions that were part of the quiz.

()
*
[]
^
?
+
\w
{}

Parentheses are used for grouping. They can be used to create (capturing) or (?:non-capturing) groups. An asterisk * is a quantifier that matches 0 or more occurrences of the preceding character or group. Together, they can be used to match string variations. For example,

http(s)*: matches both http and https (and, because * has no upper limit, also strings such as httpss)
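The grouping and asterisk behaviour can be checked with a few lines of Python. This is only a minimal sketch using Python's built-in re module (an assumption on my part, since the quiz itself is not tied to any language), and the test words are made up for illustration.

```python
import re

# (s)* means zero or more repetitions of the group (s)
pattern = re.compile(r"http(s)*")

# fullmatch() succeeds only if the whole string fits the pattern
results = {word: bool(pattern.fullmatch(word))
           for word in ["http", "https", "httpss", "htt"]}
print(results)

# A capturing group remembers what it matched; a non-capturing group does not
captured = re.fullmatch(r"http(s)*", "https").group(1)
uncaptured = re.fullmatch(r"http(?:s)*", "https").groups()
print(captured, uncaptured)
```

Notice that httpss is accepted as well, because the asterisk places no upper limit on the number of repetitions.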

A character class matches exactly one character out of a set. It is written by placing the characters (or ranges of characters) that we want to match between square brackets.

[fh]t(t)*p: matches both ftp and http, since the first character must be either f or h
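To see which strings this pattern accepts, here is a small sketch in Python's re module (my choice of tool, not something the quiz prescribes). The candidate words are invented for the demonstration.

```python
import re

# [fh] matches exactly one character: either f or h
pattern = re.compile(r"[fh]t(t)*p")

candidates = ["ftp", "http", "htp", "fttp", "smtp"]
matched = [word for word in candidates if pattern.fullmatch(word)]
print(matched)
```

The pattern accepts ftp and http as intended, but also htp and fttp, which is a reminder that a loose pattern can match more than we had in mind. smtp is rejected because s is not in the character class.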

Typing a caret ^ immediately after the opening square bracket negates the character class, so it matches any single character that is not in the class.

https://[^"']*: matches the text https:// and then all following characters up to, but not including, the next double quote or single quote.
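A negated character class is handy for stopping a match at a delimiter. The sketch below uses Python's re module with a made-up HTML fragment (example.com is just a placeholder domain) to show where the match ends.

```python
import re

# Hypothetical markup: one double-quoted and one single-quoted attribute
html = """<a href="https://example.com/a">A</a> <img src='https://example.com/b.png'>"""

# [^"']* consumes characters until it reaches a double or single quote
urls = re.findall(r"""https://[^"']*""", html)
print(urls)
```

Both URLs are extracted cleanly because the match stops right before the closing quote of each attribute.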

A character class followed by a question mark ? matches 0 or 1 occurrence of the specified characters, i.e., one of the specified characters may be present, or none at all. Without a question mark, exactly one character from the class must be matched.

[fh]t[t]?p: matches both ftp and http
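The difference between ? and * is easy to miss, so here is a quick comparison. It is a sketch in Python's re module with invented test words: ? allows at most one occurrence, while * allows any number.

```python
import re

optional = re.compile(r"[fh]t[t]?p")   # at most one extra t
repeated = re.compile(r"[fh]t(t)*p")   # any number of extra t's

words = ["ftp", "http", "htttp"]
with_question_mark = [w for w in words if optional.fullmatch(w)]
with_asterisk = [w for w in words if repeated.fullmatch(w)]
print(with_question_mark)
print(with_asterisk)
```

Only the asterisk version accepts htttp, since the question mark stops after a single optional t.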

A plus sign + matches one or more occurrences of the preceding character, character class, or group (if enclosed in a set of parentheses).

https://[^"]+: matches the text https:// only if at least one character that is not a double quote follows it, and then continues through all following characters up to the next double quote.
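The practical difference between * and + shows up when nothing follows the prefix. The following sketch (Python's re module, with a made-up input string) runs both variants on the same text.

```python
import re

# Hypothetical input: the first attribute holds a bare "https://" with nothing after it
text = 'href="https://" src="https://example.com/x"'

star = re.findall(r"""https://[^"]*""", text)   # zero or more characters after the prefix
plus = re.findall(r"""https://[^"]+""", text)   # at least one character after the prefix
print(star)
print(plus)
```

With *, the empty https:// is reported as a match. With +, it is skipped because there is no character between the prefix and the closing quote.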

The \w meta-character matches word characters, i.e., it is equivalent to the character class [a-zA-Z0-9_] in the ASCII character set.

[\w]+: matches http and https, but not http://, because : and / are not word characters
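Here is a short sketch of \w in action, again using Python's re module as an assumed tool. One detail specific to Python 3: \w also matches non-ASCII letters by default, and the re.ASCII flag restricts it to [a-zA-Z0-9_].

```python
import re

# Every run of word characters is a separate match; ':', '/' and '.' break the runs
words = re.findall(r"[\w]+", "https://www.winja.site/")
print(words)

# The whole string http:// cannot match because ':' and '/' are not word characters
whole = re.fullmatch(r"[\w]+", "http://")
print(whole)

# In Python 3, \w is Unicode-aware unless re.ASCII is passed
unicode_default = re.findall(r"[\w]+", "caf\u00e9")
ascii_only = re.findall(r"[\w]+", "caf\u00e9", flags=re.ASCII)
print(unicode_default, ascii_only)
```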

Finally, curly braces are occurrence indicators: {m,n} matches the preceding item at least m times but not more than n times, and {n} matches it exactly n times.

http[:/]{3}: matches http followed by any three characters drawn from : and /, for example http://, http:::, http///, http//: or http/:/
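Both forms of the curly-brace quantifier can be tried out as follows. This is a minimal sketch in Python's re module, and the ht{1,2}p pattern is a hypothetical one I added only to illustrate the {m,n} form.

```python
import re

# {3} requires exactly three characters from the class [:/]
exact = re.compile(r"http[:/]{3}")
candidates = ["http://", "http:::", "http///", "http//:", "http/:/", "http:/", "https://"]
matched = [c for c in candidates if exact.fullmatch(c)]
print(matched)

# {1,2} requires at least one and at most two occurrences of t
bounded = re.compile(r"ht{1,2}p")
in_range = [w for w in ["hp", "htp", "http", "htttp"] if bounded.fullmatch(w)]
print(in_range)
```

http:/ fails because only two characters follow http, and https:// fails because s is not part of the character class.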

Now that you have learned how to read regular expressions, let’s interpret the 5 listed quiz options.

Hands-On Exercise

Download the source code of any random website. I am choosing view-source:https://www.winja.site/
Open the downloaded text content in Visual Studio Code (or any other text editor that supports regular expressions)
Enable regular expression search

Try the first regular expression, and analyze the results:

http(s)*://[^"]*

Try the second regular expression, and analyze the results:

http[s]?://[^"']+

Try the third regular expression, and analyze the results:

[\w]+[s]?[/:]{3}[^"']+

Try the fourth regular expression, and analyze the results. Does the output look different?

[\w]+[s]?[^"']+

And now, try the final option:

[\w]+[s]?://[^"']+
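If you prefer a script over an editor, the same exercise can be approximated in Python. The sketch below is only an illustration: the sample markup is a made-up stand-in for a downloaded page, and Python's re module may differ in small ways from the regex engine in your editor.

```python
import re

# Hypothetical stand-in for a downloaded page source
sample = """<a href="https://www.winja.site/">Winja</a>
<script src='http://example.com/app.js'></script>
<p class="intro">Hello</p>"""

# The five quiz options, keyed by option number
options = {
    1: r"""http(s)*://[^"]*""",
    2: r"""http[s]?://[^"']+""",
    3: r"""[\w]+[s]?[/:]{3}[^"']+""",
    4: r"""[\w]+[s]?[^"']+""",
    5: r"""[\w]+[s]?://[^"']+""",
}

# finditer() with group(0) returns the whole match,
# even when the pattern contains a capturing group such as (s)
results = {
    number: [m.group(0) for m in re.finditer(pattern, sample)]
    for number, pattern in options.items()
}

for number, found in results.items():
    print(number, found)
```

On this sample, options 2, 3 and 5 return exactly the two URLs. Option 1 finds both URLs too, but runs past the single-quoted one because it stops only at a double quote. Option 4 returns the URLs mixed in with fragments of ordinary markup.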

The Answer

Clearly, the fourth option did not meet our expectation of identifying absolute URLs correctly. Because it does not require :// (or any other URL-specific text), it matches almost any run of text that starts with a word character, so URLs cannot be told apart from the rest of the source. The corresponding regular expression cannot be used to extract absolute URLs from any source code.
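The weakness of the fourth option can be confirmed by feeding it strings that are obviously not URLs. This is a small sketch in Python's re module, with sample strings I made up, comparing it against the fifth option.

```python
import re

option_4 = re.compile(r"""[\w]+[s]?[^"']+""")
option_5 = re.compile(r"""[\w]+[s]?://[^"']+""")

# One real URL and two strings that are clearly not URLs
samples = ["https://www.winja.site/", "hello world", "div class="]

accepted_by_4 = [s for s in samples if option_4.fullmatch(s)]
accepted_by_5 = [s for s in samples if option_5.fullmatch(s)]
print(accepted_by_4)
print(accepted_by_5)
```

The fourth option accepts all three strings, while the fifth accepts only the URL, because it insists on the literal :// after the scheme.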

Thus, the correct answer is option number 4.