C# Regex: Regular Expression Usage and Syntax

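This post takes a look at regular expressions (regex) in C#. As the name Regular Expression (REGEX) suggests, a regular expression is a way of describing a specific, formalized rule for text. Regular expressions are used in nearly every programming language, so this material is worth reading through at least once.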

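The two key concepts are the pattern and the match. A pattern is the string that defines the expression to look for, and a match is the result of applying that pattern to an input. In C#, matches usually appear as Match (a single result) and as MatchCollection, the collection returned by Matches.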

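Beyond the Regex constructor, regular expressions can also be used through Replace and Split, to substitute the text that matches or to split the input on it. This post covers the expressions available in C# in the order listed below.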
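As a minimal sketch of the four entry points mentioned above (Match, Matches, Replace, Split), the following may help. The class name, method names, and sample strings are made up for illustration; only the Regex members themselves come from System.Text.RegularExpressions.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class RegexBasics
{
    // Match returns the first match; Success is false when nothing matched.
    public static string FirstNumber(string input)
    {
        var digits = new Regex(@"\d+");   // pattern passed to the Regex constructor
        Match m = digits.Match(input);
        return m.Success ? m.Value : "";
    }

    // Matches returns a MatchCollection holding every non-overlapping match.
    public static string[] AllNumbers(string input)
    {
        return Regex.Matches(input, @"\d+").Cast<Match>().Select(m => m.Value).ToArray();
    }

    // Replace substitutes every match with the replacement text.
    public static string MaskNumbers(string input)
    {
        return Regex.Replace(input, @"\d+", "#");
    }

    // Split cuts the input at every match of the pattern.
    public static string[] SplitOnCommas(string input)
    {
        return Regex.Split(input, @"\s*,\s*");
    }

    public static void Main()
    {
        Console.WriteLine(FirstNumber("order 66 of 100"));                  // 66
        Console.WriteLine(string.Join("|", AllNumbers("order 66 of 100"))); // 66|100
        Console.WriteLine(MaskNumbers("order 66 of 100"));                  // order # of #
        Console.WriteLine(string.Join("|", SplitOnCommas("a , b,c")));      // a|b|c
    }
}
```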

Control Single Characters
Control Characters
Non-ascii codes
Character Classes
Quantifiers
Anchors
Groups
Inline Options
Backreferences
Alternation
Substitution
Comments
Supported Unicode Categories

Control Single Characters

Sets, negated sets, and ranges are the core of regular expressions. They are used very often, so it pays to know them well.

Expression	Meaning
[set]	Any character in the set
[^set]	Any character not in the set
[a-z]	Any character in the a-z range
[^a-z]	Any character not in the a-z range
.	Any character except \n (new line)
\char	Escaped special character, matched literally
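The rows above can be tried out with a short sketch like the one below. The class and method names are hypothetical, and the sample inputs are arbitrary.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class SingleCharacterDemo
{
    // [aeiou] matches any one character in the set.
    public static int CountVowels(string input)
    {
        return Regex.Matches(input, "[aeiou]").Count;
    }

    // [^0-9] matches any one character outside the 0-9 range.
    public static string KeepDigits(string input)
    {
        return Regex.Replace(input, "[^0-9]", "");
    }

    // . matches any character except \n (when the s option is not set).
    public static int CountDotMatches(string input)
    {
        return Regex.Matches(input, ".").Count;
    }

    // \. matches a literal dot instead of "any character".
    public static bool IsLiteralPi(string input)
    {
        return Regex.IsMatch(input, @"^3\.14$");
    }

    public static void Main()
    {
        Console.WriteLine(CountVowels("regular"));   // 3
        Console.WriteLine(KeepDigits("a1b2c3"));     // 123
        Console.WriteLine(CountDotMatches("a\nb"));  // 2
        Console.WriteLine(IsLiteralPi("3.14"));      // True
        Console.WriteLine(IsLiteralPi("3x14"));      // False
    }
}
```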

Control Characters

These expressions match control characters and other special characters.

Expression	Matches	Unicode
\t	Horizontal tab	\u0009
\v	Vertical tab	\u000B
\b	Backspace (only inside a character class such as [\b]; elsewhere \b is a word boundary)	\u0008
\e	Escape	\u001B
\r	Carriage return	\u000D
\f	Form feed	\u000C
\n	New line	\u000A
\a	Bell (alarm)	\u0007
\cX	ASCII control character, where X is the control letter (e.g. \cC is Ctrl-C)	-
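A small sketch of a few of these escapes follows. The class and method names are invented for the example. Note that in the C# string literals used as input, "\t" and "\b" are the C# escapes for tab and backspace, while the verbatim (@"...") strings carry the regex escapes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class ControlCharacterDemo
{
    // \t in the pattern matches a horizontal tab in the input.
    public static string[] SplitOnTabs(string input)
    {
        return Regex.Split(input, @"\t");
    }

    // \r?\n matches both Windows (CR LF) and Unix (LF) line endings.
    public static string[] SplitLines(string input)
    {
        return Regex.Split(input, @"\r?\n");
    }

    // Inside a character class, \b is a backspace (\u0008).
    public static bool HasBackspace(string input)
    {
        return Regex.IsMatch(input, @"[\b]");
    }

    // \cC is the ASCII control character Ctrl-C (\u0003).
    public static bool HasCtrlC(string input)
    {
        return Regex.IsMatch(input, @"\cC");
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", SplitOnTabs("a\tb\tc")));   // a|b|c
        Console.WriteLine(string.Join("|", SplitLines("a\r\nb\nc")));  // a|b|c
        Console.WriteLine(HasBackspace("x\by"));                       // True
        Console.WriteLine(Regex.IsMatch("\b", @"\b"));                 // False: outside a class, \b is a word boundary
        Console.WriteLine(HasCtrlC("\u0003"));                         // True
    }
}
```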

Non-ascii codes

Expression	Matches the character with this code
\octal	2-3 digit octal character code (e.g. \040)
\xhh	2-digit hex character code (e.g. \x20)
\uhhhh	4-digit hex character code (e.g. \u0020)
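To illustrate, the sketch below names the same character (a space, code point 32) in octal, 2-digit hex, and 4-digit hex. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class CharacterCodeDemo
{
    // True when the pattern matches a string that is exactly one space.
    public static bool MatchesSingleSpace(string pattern)
    {
        return Regex.IsMatch(" ", "^" + pattern + "$");
    }

    // \u00E9 is the 4-digit hex code for the letter e with an acute accent.
    public static string ReplaceEAcute(string input)
    {
        return Regex.Replace(input, @"\u00E9", "e");
    }

    public static void Main()
    {
        Console.WriteLine(MatchesSingleSpace(@"\040"));    // True (octal 40 = 32)
        Console.WriteLine(MatchesSingleSpace(@"\x20"));    // True (hex 20 = 32)
        Console.WriteLine(MatchesSingleSpace(@"\u0020"));  // True
        Console.WriteLine(MatchesSingleSpace(@"\x41"));    // False (hex 41 is "A")
        Console.WriteLine(ReplaceEAcute("caf\u00E9"));     // cafe
    }
}
```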

Character Classes

Note the convention in this table: the lowercase form matches the class, and the uppercase form matches everything that is not in it.

Expression	Matches
\p{ctgry}	In that Unicode category or block
\P{ctgry}	Not in that Unicode category or block
\w	Word character
\W	Non-word character
\d	Decimal digit
\D	Not a decimal digit
\s	White-space character
\S	Non-white-space char
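The lowercase/uppercase pairing can be seen in a sketch like this one. The class and method names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class CharacterClassDemo
{
    // \D is the negation of \d, so removing \D leaves only digits.
    public static string StripNonDigits(string input)
    {
        return Regex.Replace(input, @"\D", "");
    }

    // \s+ matches a run of white space (spaces, tabs, line breaks).
    public static string CollapseWhitespace(string input)
    {
        return Regex.Replace(input, @"\s+", " ");
    }

    // \w matches letters, digits, and the underscore.
    public static int CountWords(string input)
    {
        return Regex.Matches(input, @"\w+").Count;
    }

    // \p{Lu} matches one character in the Unicode category "Letter, uppercase".
    public static int CountUppercase(string input)
    {
        return Regex.Matches(input, @"\p{Lu}").Count;
    }

    public static void Main()
    {
        Console.WriteLine(StripNonDigits("010-1234-5678"));    // 01012345678
        Console.WriteLine(CollapseWhitespace("a \t b\n\nc"));  // a b c
        Console.WriteLine(CountWords("snake_case and two"));   // 3
        Console.WriteLine(CountUppercase("Hello World"));      // 2
    }
}
```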

Quantifiers

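These are the quantifiers. If the single-character expressions (the first section of this post) are where regular expressions begin, quantifiers are both the highlight and the finishing touch. They are among the most important expressions in this document.

Quantifiers come in two forms: greedy quantifiers and lazy quantifiers.
A greedy quantifier, as the name suggests, matches as many repetitions as possible.
A lazy quantifier, by contrast, matches as few repetitions as possible.

The lazy form is always the greedy form followed by ?, which makes the table below easy to remember.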

Greedy	Lazy	Matches
*	*?	0 or more times
+	+?	1 or more times
?	??	0 or 1 time
{n}	{n}?	Exactly n times
{n,}	{n,}?	At least n times
{n,m}	{n,m}?	From n to m times
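The difference between greedy and lazy matching is easiest to see by running the same input through both forms. The following is a minimal sketch; the class name, the First helper, and the sample strings are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class QuantifierDemo
{
    // Returns the text of the first match, or "" when nothing matches.
    public static string First(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string html = "<b>bold</b>";

        // Greedy: .+ takes as much as it can, so the match runs to the last '>'.
        Console.WriteLine(First(html, "<.+>"));           // <b>bold</b>

        // Lazy: .+? takes as little as it can, so the match stops at the first '>'.
        Console.WriteLine(First(html, "<.+?>"));          // <b>

        // The same contrast with a counted quantifier.
        Console.WriteLine(First("123456", @"\d{2,4}"));   // 1234
        Console.WriteLine(First("123456", @"\d{2,4}?"));  // 12
        Console.WriteLine(First("123456", @"\d{3}"));     // 123
    }
}
```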

Anchors

You have probably come across ^, which marks the start of a string or line, and $, which marks the end.

Expression	Position
^	At start of string, or of each line in multiline mode
\A	At start of string
\z	At end of string
\Z	At end (or before \n at end) of string
$	At end (or before \n at end) of string, or of each line in multiline mode
\G	Where previous match ended
\b	On word boundary
\B	Not on word boundary
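The sketch below shows how a few of these anchors behave, including the difference between $ and \z when the input ends with a new line. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class AnchorDemo
{
    // \A and \z pin the match to the very start and very end of the string.
    public static bool IsAllDigits(string input)
    {
        return Regex.IsMatch(input, @"\A\d+\z");
    }

    // With RegexOptions.Multiline, ^ matches at the start of every line.
    public static int CountLineStarts(string input)
    {
        return Regex.Matches(input, @"^\w+", RegexOptions.Multiline).Count;
    }

    // \b on both sides restricts the match to whole words.
    public static int CountWholeWord(string input, string word)
    {
        return Regex.Matches(input, @"\b" + Regex.Escape(word) + @"\b").Count;
    }

    public static void Main()
    {
        Console.WriteLine(IsAllDigits("123"));                     // True
        Console.WriteLine(IsAllDigits("123\n"));                   // False: \z does not allow a trailing \n
        Console.WriteLine(Regex.IsMatch("123\n", @"^\d+$"));       // True: $ also matches before a final \n
        Console.WriteLine(CountLineStarts("one\ntwo\nthree"));     // 3
        Console.WriteLine(CountWholeWord("cat concat cat", "cat")); // 2
    }
}
```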

Groups

Indexed groups and named groups are among the more frequently used constructs, so they are worth skimming.

Expression	Definition
(exp)	Indexed group
(?<name>exp)	Named group
(?<name1-name2>exp)	Balancing group
(?:exp)	Noncapturing group
(?=exp)	Zero-width positive lookahead
(?!exp)	Zero-width negative lookahead
(?<=exp)	Zero-width positive lookbehind
(?<!exp)	Zero-width negative lookbehind
(?>exp)	Atomic (non-backtracking, greedy) group
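As a rough sketch of how several of these group constructs are read back in C#, consider the following. The class name, method names, and sample inputs are invented for illustration; Groups and GetGroupNumbers are members of the Regex API.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class GroupDemo
{
    // Indexed groups are numbered from 1; Groups[0] is the whole match.
    public static string YearIndexed(string date)
    {
        return Regex.Match(date, @"(\d{4})-(\d{2})").Groups[1].Value;
    }

    // Named groups are read back by name.
    public static string MonthNamed(string date)
    {
        return Regex.Match(date, @"(?<year>\d{4})-(?<month>\d{2})").Groups["month"].Value;
    }

    // Number of groups the pattern defines, including group 0.
    public static int GroupCount(string pattern)
    {
        return new Regex(pattern).GetGroupNumbers().Length;
    }

    // (?=px) is a lookahead: "px" must follow, but is not part of the match.
    public static string PixelValue(string css)
    {
        return Regex.Match(css, @"\d+(?=px)").Value;
    }

    // (?<=\$) is a lookbehind: "$" must precede, but is not part of the match.
    public static string DollarAmount(string text)
    {
        return Regex.Match(text, @"(?<=\$)\d+").Value;
    }

    public static void Main()
    {
        Console.WriteLine(YearIndexed("2024-05"));                  // 2024
        Console.WriteLine(MonthNamed("2024-05"));                   // 05
        Console.WriteLine(GroupCount("(ab)+"));                     // 2
        Console.WriteLine(GroupCount("(?:ab)+"));                   // 1: noncapturing, only group 0
        Console.WriteLine(PixelValue("width: 12px; margin: 5em"));  // 12
        Console.WriteLine(DollarAmount("cost $42"));                // 42
    }
}
```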

Inline Options

Options such as i, which makes matching case-insensitive, show up from time to time.

Option	Meaning
i	Case-insensitive
m	Multiline mode
n	Explicit capture (only named groups are captured)
s	Single-line mode
x	Ignore white space

Expression	Purpose
(?imnsx-imnsx)	Set or disable the specified options
(?imnsx-imnsx:exp)	Set or disable the specified options within the expression
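The following sketch exercises each option inline. The class and its two helper methods are hypothetical; the inputs are arbitrary.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class InlineOptionDemo
{
    public static bool Test(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    // Number of groups the pattern defines, including group 0.
    public static int GroupCount(string pattern)
    {
        return new Regex(pattern).GetGroupNumbers().Length;
    }

    public static void Main()
    {
        // i: (?i) applies from that point on; (?i:...) applies only inside the group.
        Console.WriteLine(Test("HELLO", "(?i)hello"));     // True
        Console.WriteLine(Test("Hello", "(?i:h)ello"));    // True
        Console.WriteLine(Test("HELLO", "(?i:h)ello"));    // False: "ello" is still case-sensitive

        // m: ^ and $ match at line boundaries.
        Console.WriteLine(Test("a\nb", "^b$"));            // False
        Console.WriteLine(Test("a\nb", "(?m)^b$"));        // True

        // s: . also matches \n.
        Console.WriteLine(Test("a\nb", "a.b"));            // False
        Console.WriteLine(Test("a\nb", "(?s)a.b"));        // True

        // x: unescaped white space in the pattern is ignored.
        Console.WriteLine(Test("ab 12", @"(?x) [a-z]+ \s \d+ "));  // True

        // n: unnamed parentheses no longer capture.
        Console.WriteLine(GroupCount(@"(\d+)-(?<tail>\d+)"));      // 3
        Console.WriteLine(GroupCount(@"(?n)(\d+)-(?<tail>\d+)"));  // 2
    }
}
```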

Backreferences

Backreferences refer back to the indexed groups and named groups described in the Groups section.

Indexed group: (exp)
Named group: (?<name>exp)

Expression	Matched group
\n	Indexed group (n is the group number, e.g. \1)
\k<name>	Named group
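A typical use of backreferences is finding repeated text. The sketch below is one possible illustration; the class and method names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class BackreferenceDemo
{
    // \1 must match the same text that group 1 captured: a word repeated twice.
    public static string FirstRepeatedWord(string input)
    {
        return Regex.Match(input, @"\b(\w+)\s+\1\b").Value;
    }

    // \k<ch> refers back to the named group "ch": the same character twice in a row.
    public static string FirstDoubledLetter(string input)
    {
        return Regex.Match(input, @"(?<ch>\w)\k<ch>").Value;
    }

    public static void Main()
    {
        Console.WriteLine(FirstRepeatedWord("this is is a test"));  // is is
        Console.WriteLine(FirstDoubledLetter("hello"));             // ll
    }
}
```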

Alternation

To match either a or b, write a|b. Alternation is used often, so it is worth a look.

Expression	Matches
a|b	Either a or b
(?(exp)yes|no)	yes if exp is matched, no if exp isn't matched
(?(name)yes|no)	yes if the group name has matched, no if it hasn't
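The sketch below covers plain alternation and both conditional forms. The class, the method names, and the patterns themselves are assumptions chosen for the example. The expression-based condition is written as an explicit lookahead, (?(?=...)yes|no), to keep it from being read as a group name.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class AlternationDemo
{
    // cat|dog, grouped so that ^ and $ apply to both alternatives.
    public static bool IsPet(string input)
    {
        return Regex.IsMatch(input, @"^(?:cat|dog)$");
    }

    // (?(open)\)) requires a closing parenthesis only if the "open" group captured one.
    public static bool IsOptionallyParenthesized(string input)
    {
        return Regex.IsMatch(input, @"^(?<open>\()?\d+(?(open)\))$");
    }

    // If a digit comes next, expect three digits; otherwise expect three lowercase letters.
    public static bool IsCode(string input)
    {
        return Regex.IsMatch(input, @"^(?(?=\d)\d{3}|[a-z]{3})$");
    }

    public static void Main()
    {
        Console.WriteLine(IsPet("dog"));                        // True
        Console.WriteLine(IsPet("cow"));                        // False
        Console.WriteLine(IsOptionallyParenthesized("(123)"));  // True
        Console.WriteLine(IsOptionallyParenthesized("123"));    // True
        Console.WriteLine(IsOptionallyParenthesized("(123"));   // False
        Console.WriteLine(IsCode("123"));                       // True
        Console.WriteLine(IsCode("abc"));                       // True
        Console.WriteLine(IsCode("12a"));                       // False
    }
}
```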

Substitution

Expression	Replaced with
$n	Substring matched by group number n
${name}	Substring matched by group name
$$	Literal $ character
$&	Copy of whole match
$`	Text before the match
$'	Text after the match
$+	Last captured group
$_	Entire input string
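These substitutions are used in the replacement string passed to Regex.Replace. The following sketch tries a few of them; the class name, method names, and inputs are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class SubstitutionDemo
{
    // $1, $2, $3 insert the text captured by the numbered groups.
    public static string ToDayFirst(string isoDate)
    {
        return Regex.Replace(isoDate, @"(\d{4})-(\d{2})-(\d{2})", "$3/$2/$1");
    }

    // ${name} inserts the text captured by a named group.
    public static string LastNameFirst(string name)
    {
        return Regex.Replace(name, @"(?<first>\w+) (?<last>\w+)", "${last}, ${first}");
    }

    // $& inserts a copy of the whole match.
    public static string BracketDigits(string input)
    {
        return Regex.Replace(input, @"\d", "[$&]");
    }

    // $$ inserts a literal dollar sign; here it is followed by $& (the match).
    public static string AddDollarSign(string input)
    {
        return Regex.Replace(input, @"\d+", "$$$&");
    }

    public static void Main()
    {
        Console.WriteLine(ToDayFirst("2024-05-17"));     // 17/05/2024
        Console.WriteLine(LastNameFirst("Ada Lovelace")); // Lovelace, Ada
        Console.WriteLine(BracketDigits("a1b2"));         // a[1]b[2]
        Console.WriteLine(AddDollarSign("costs 5"));      // costs $5
    }
}
```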

Comments

Expression	Purpose
(?# comment)	Add inline comment
#	Add x-mode comment
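Both comment styles can be sketched as follows. The x-mode comment needs the x option, set here through RegexOptions.IgnorePatternWhitespace. The class name, the phone-number pattern, and the inputs are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class CommentDemo
{
    // In x mode, white space in the pattern is ignored and # starts a
    // comment that runs to the end of the line.
    private const string PhonePattern = @"
        (?<area>\d{3})    # area code
        -                 # separator
        (?<number>\d{4})  # subscriber number
    ";

    public static string AreaCode(string input)
    {
        Match m = Regex.Match(input, PhonePattern, RegexOptions.IgnorePatternWhitespace);
        return m.Groups["area"].Value;
    }

    // (?# ...) is an inline comment and works without the x option.
    public static bool IsPixelValue(string input)
    {
        return Regex.IsMatch(input, @"^\d+(?#digits)px(?#unit)$");
    }

    public static void Main()
    {
        Console.WriteLine(AreaCode("call 555-1234"));  // 555
        Console.WriteLine(IsPixelValue("12px"));       // True
        Console.WriteLine(IsPixelValue("12em"));       // False
    }
}
```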

Supported Unicode Categories

Category	Description
Lu	Letter, uppercase
Ll	Letter, lowercase
Lt	Letter, title case
Lm	Letter, modifier
Lo	Letter, other
L	Letter, all
Mn	Mark, nonspacing combining
Mc	Mark, spacing combining
Me	Mark, enclosing combining
M	Mark, all diacritic
Nd	Number, decimal digit
Nl	Number, letterlike
No	Number, other
N	Number, all
Pc	Punctuation, connector
Pd	Punctuation, dash
Ps	Punctuation, opening mark
Pe	Punctuation, closing mark
Pi	Punctuation, initial quote mark
Pf	Punctuation, final quote mark
Po	Punctuation, other
P	Punctuation, all
Sm	Symbol, math
Sc	Symbol, currency
Sk	Symbol, modifier
So	Symbol, other
S	Symbol, all
Zs	Separator, space
Zl	Separator, line
Zp	Separator, paragraph
Z	Separator, all
Cc	Control code
Cf	Format control character
Cs	Surrogate code point
Co	Private-use character
Cn	Unassigned
C	Control characters, all
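These category names are used inside \p{...} and \P{...}. As a minimal sketch, the helper below counts the characters of one category in a sample string. The class, the method names, and the sample text are hypothetical; \u20AC is the euro sign.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class UnicodeCategoryDemo
{
    // Counts the characters in the input that belong to the given category.
    public static int Count(string input, string category)
    {
        return Regex.Matches(input, @"\p{" + category + "}").Count;
    }

    // \P{L} is the negation of \p{L}: everything that is not a letter.
    public static string LettersOnly(string input)
    {
        return Regex.Replace(input, @"\P{L}", "");
    }

    public static void Main()
    {
        string text = "Price: $5, \u20AC7!";

        Console.WriteLine(Count(text, "Lu"));   // 1 (P)
        Console.WriteLine(Count(text, "Ll"));   // 4 (r, i, c, e)
        Console.WriteLine(Count(text, "Nd"));   // 2 (5, 7)
        Console.WriteLine(Count(text, "Sc"));   // 2 (dollar sign, euro sign)
        Console.WriteLine(Count(text, "P"));    // 3 (colon, comma, exclamation mark)
        Console.WriteLine(Count(text, "Zs"));   // 2 (the two spaces)
        Console.WriteLine(LettersOnly(text));   // Price
    }
}
```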