Home
Welcome to the regexgen.js wiki!
RegexGen.js is a JavaScript regular expression generator that helps to construct complex regular expressions.
The generator is exported as the regexGen() function; everything must be referenced from it.
To generate a regular expression, pass sub-expressions as parameters to the regexGen() function.
Sub-expressions are then concatenated together to form the whole regular expression.
Sub-expressions can be a string, a number, a RegExp object, or any combination of calls to the methods (i.e., the sub-generators) of the regexGen() function object.
Strings passed to the regexGen(), text(), maybe(), anyCharOf() and anyCharBut() functions are always escaped as necessary, so you don't have to worry about which characters to escape.
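To illustrate what "escaped as necessary" means, here is a minimal sketch of literal escaping in plain JavaScript. The helper name escapeForRegExp is hypothetical and is not part of regexgen.js; the library's own escaping rules may differ in detail.

```javascript
// Hypothetical helper (not the regexgen.js implementation): prefix every
// character that is special in a regular expression with a backslash.
function escapeForRegExp(text) {
  return String(text).replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// The escaped string can then be used as a literal pattern.
const literal = new RegExp(escapeForRegExp('1+1=2 (really?)'));
console.log(literal.source);
console.log(literal.test('I know 1+1=2 (really?)')); // true
```

Without escaping, characters such as + and ( would be interpreted as a quantifier and a group rather than matched literally.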
The result of calling the regexGen() function is a RegExp object. See The RegExp Object section for details.
Since everything must be referenced from the regexGen() function, assigning it to a short variable keeps the code simpler.
Usage:
var _ = regexGen;
var regex = regexGen(
_.startOfLine(),
_.capture( 'http', _.maybe( 's' ) ), '://',
_.capture( _.anyCharBut( ':/' ).repeat() ),
_.group( ':', _.capture( _.digital().multiple(2,4) ) ).maybe(), '/',
_.capture( _.anything() ),
_.endOfLine()
);
var matches = regex.exec( url );
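For comparison, the following is a hand-written native pattern that is roughly what the example above describes. It is an approximation only: the exact source emitted by regexGen() may differ (for instance in grouping), and the sample URL is a placeholder chosen for illustration.

```javascript
// Hand-written approximation of the generated expression (assumption).
const urlPattern = /^(https?):\/\/([^:\/]+)(?::(\d{2,4}))?\/(.*)$/;

// Placeholder input, not taken from the wiki.
const sampleUrl = 'https://example.com:8080/docs/index.html';
const parts = urlPattern.exec(sampleUrl);

console.log(parts[1]); // "https"
console.log(parts[2]); // "example.com"
console.log(parts[3]); // "8080"
console.log(parts[4]); // "docs/index.html"
```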
The mixin() function is a method of the regexGen() function object. For convenience, you can use regexGen.mixin() to export all methods of the regexGen() function object to the global object. Note that this will pollute the global object.
Usage:
regexGen.mixin( window );
var regex = regexGen(
startOfLine(),
capture( 'http', maybe( 's' ) ), '://',
capture( anyCharBut( ':/' ).repeat() ),
group( ':', capture( digital().multiple(2,4) ) ).maybe(), '/',
capture( anything() ),
endOfLine()
);
var matches = regex.exec( url );
Modifiers alter the behavior of the regular expression. If specified, modifiers can be any combination of the following:
Case-insensitive search. Equivalent to /.../i.
Global search. Equivalent to /.../g.
Multiline. If the input string has multiple lines, startOfLine() (^) and endOfLine() ($) match the beginning and end of each line within the string, instead of only the beginning and end of the whole string. Equivalent to /.../m.
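The effect of the three modifiers can be seen with the equivalent native flags. This sketch uses plain RegExp literals rather than regexgen.js, and the sample text is made up for illustration.

```javascript
// Two lines; only the second starts with a lower-case "foo".
const text = 'Foo\nfoo';

const results = {
  plain: /^foo/.test(text),       // false: ^ only matches at the start of the input
  ignoreCase: /^foo/i.test(text), // true: "Foo" matches case-insensitively
  multiline: /^foo/m.test(text),  // true: ^ also matches after the line break
  global: text.match(/foo/gi)     // ["Foo", "foo"]: all matches, not just the first
};

console.log(results);
```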
Sub-generators are methods of the regexGen() function object that generate parts of the whole regular expression.
Matches the beginning of input. If the multiline modifier searchMultiLine() is specified, also matches immediately after a line break character. Equivalent to /^.../.
Matches the end of input. If the multiline modifier searchMultiLine() is specified, also matches immediately before a line break character. Equivalent to /...$/.
Matches a word boundary. Equivalent to /\b/.
Matches a non-word boundary. Equivalent to /\B/.
Matches the text specified. The characters in text are properly escaped when necessary. Note that this is equivalent to passing a string literal to the regexGen() generator, except that you can't apply quantifiers to a string literal.
Usage:
text( "subject" ) // ==> /subject/
Matches the text specified 0 or 1 time. The characters in text are properly escaped when necessary.
Usage:
maybe( "subject" ) // ==> /(?:subject)?/
Matches any one of the given characters. The arguments are concatenated, and each can be any of:
- a string literal, e.g., "abcde". Equivalent to /[abcde]/.
- an array of two elements indicating a range of characters, e.g., ["a", "z"]. Equivalent to /[a-z]/.
- a character shorthand generator, including: anyChar(), ascii(), unicode(), nullChar(), controlChar(), formFeed(), lineFeed(), carriageReturn(), space(), nonSpace(), tab(), vertTab(), digital(), nonDigital(), word() and nonWord().
Usage:
anyCharOf( [ 'a', 'c' ], ['2', '6'], 'fgh', 'z', space() ) // ==> /[a-c2-6fghz\s]/
Matches any character except the given ones. See anyCharOf() for details on the arguments.
Usage:
anyCharBut( [ 'a', 'c' ], ['2', '6'], 'fgh', 'z', space() ) // ==> /[^a-c2-6fghz\s]/
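As a quick check of the two character classes shown in the usage lines above, the native equivalents can be tested against a few sample characters. This sketch uses the documented output patterns directly and does not call regexgen.js.

```javascript
// Native equivalents of the anyCharOf() and anyCharBut() usage examples.
const oneOf = /[a-c2-6fghz\s]/;
const noneOf = /[^a-c2-6fghz\s]/;

// Print, for each sample character, whether each class accepts it.
for (const ch of ['b', '4', ' ', 'd', '9']) {
  console.log(JSON.stringify(ch), oneOf.test(ch), noneOf.test(ch));
}
```

A character accepted by one class is always rejected by the other, since the second class is the negation of the first.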
Matches any single character except the newline character. Equivalent to /./.
Matches the character with the code hh (two hexadecimal digits).
Usage:
ascii( '20' ) // ==> /\x20/
Matches the character with the code hhhh (four hexadecimal digits).
Usage:
unicode( '2000' ) // ==> /\u2000/
Matches a NULL (U+0000) character. Equivalent to /\0/.
Do not follow this with another digit, because \0 followed by a digit is read as an octal escape sequence.
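The octal pitfall can be reproduced with native patterns. This is a sketch under the assumption of a standard JavaScript engine in non-Unicode mode, where legacy octal escapes are accepted in regular expressions; the variable names are illustrative.

```javascript
// A NULL character on its own is matched by \0.
const nul = /\0/;
console.log(nul.test('a\u0000b')); // true

// Intention: match NULL followed by the digit "1".
const wrong = /\01/;  // parsed as a single octal escape for U+0001
const safe = /\0[1]/; // the character class keeps the digit separate

const input = '\u0000' + '1';
console.log(wrong.test(input)); // false
console.log(safe.test(input));  // true
```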
Matches a control character in a string, where value is a letter from A to Z.
Usage:
controlChar( 'Z' ) // ==> /\cZ/
Matches a backspace (U+0008). Equivalent to /[\b]/.
Note: in a regular expression, you need square brackets to match a literal backspace character. (Not to be confused with \b, the word boundary.)
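The difference between the two forms can be checked with native patterns. In this sketch the JavaScript string escape '\b' denotes the backspace character U+0008.

```javascript
const backspace = /[\b]/; // literal backspace character
const boundary = /\b/;    // word boundary assertion

console.log(backspace.test('abc\b')); // true: the string ends with a backspace
console.log(backspace.test('abc'));   // false: no backspace present
console.log(boundary.test('\b'));     // false: no word character, so no boundary
console.log(boundary.test('abc'));    // true: boundary before "a"
```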
Matches a form feed. Equivalent to /\f/.
Matches a line feed. Equivalent to /\n/.
Matches a carriage return. Equivalent to /\r/.
Matches a single white space character, including space, tab, form feed and line feed. Equivalent to /\s/.
Matches a single character other than white space. Equivalent to /\S/.
Matches a tab (U+0009). Equivalent to /\t/.
Matches a vertical tab (U+000B). Equivalent to /\v/.
Matches a digit character. Equivalent to /\d/.
Matches any non-digit character. Equivalent to /\D/.
Matches any alphanumeric character including the underscore. Equivalent to /\w/.
Matches any non-word character. Equivalent to /\W/.
Matches any number of characters except the newline character. Equivalent to /.*/.
Matches a hexadecimal digit. Equivalent to /[0-9A-Fa-f]/.
Matches any line break, including Unix (LF) and Windows (CRLF) line endings. Equivalent to /\r\n|\r|\n/.
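The order of the alternatives in the line break pattern matters: \r\n has to be tried before \r and \n, otherwise a Windows line ending is counted as two breaks. A small native sketch, with a made-up sample string:

```javascript
// Mixed line endings: CRLF, LF, and a lone CR.
const mixed = 'one\r\ntwo\nthree\rfour';

// Trying \r\n first treats CRLF as a single break.
console.log(mixed.split(/\r\n|\r|\n/)); // ["one", "two", "three", "four"]

// Without the \r\n alternative, CRLF produces an empty line.
console.log('a\r\nb'.split(/\r|\n/));   // ["a", "", "b"]
```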
Matches a sequence of alphanumeric characters including the underscore. Equivalent to /\w+/.
Adds alternative expressions.
Usage:
either( 'first', '1st' ) // ==> /first|1st/
Matches specified terms but does not remember the match. The generated parentheses are called non-capturing parentheses.
Usage:
group( 'http', maybe( 's' ) ).maybe() // ==> /(?:https?)?/
Matches specified terms and remembers the match. The generated parentheses are called capturing parentheses.
Usage:
var _ = regexGen;
var regex = regexGen(
_.capture( _.label('prefix'), _.words() ),
'o',
_.sameAs( 'prefix' ),
_.searchAll()
); // ==> /(\w+)o\1/g
"lol, wow, aboab, foo, bar".match( regex ); // ["lol", "wow", "aboab" ]
See also label(), sameAs().
See extract() for extended usage.
A label is a named index to a capture group, and is allowed only as the very first argument of the capture() method.
A label can be referred to by the sameAs() generator, i.e., as a back-reference.
See also capture(), sameAs().
See extract() for extended usage.
Back-reference to a labeled capture group, matching the same text as that capture group.
See also capture(), label().
See extract() for extended usage.
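For readers familiar with native syntax, the idea behind label() and sameAs() resembles a named capture group with a named back-reference. The sketch below uses native named groups (ES2018); this is an analogy only, since the documented regexgen.js output for the capture() example uses a numbered group (\1).

```javascript
// Native analogy: (?<prefix>...) names the group, \k<prefix> refers back to it.
const repeated = /(?<prefix>\w+)o\k<prefix>/g;

const found = 'lol, wow, aboab, foo, bar'.match(repeated);
console.log(found); // ["lol", "wow", "aboab"]
```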
Uses the given regex verbatim, i.e., trust me, just put the value in as is.
Usage:
regex( /\w\d/ ) // ==> /\w\d/
regex( "\\w\\d" ) // ==> /\w\d/
Quantifiers can apply to all of the above sub-generators.
Matches the expression generated by the preceding sub-generator 0 or more times. Equivalent to /.*/ and /.{0,}/.
Usage:
anyChar().any() // ==> /.*/
Matches the expression generated by the preceding sub-generator 1 or more times. Equivalent to /.+/ and /.{1,}/.
Usage:
anyChar().many() // ==> /.+/
Matches the expression generated by the preceding sub-generator 0 or 1 time. Equivalent to /.?/ and /.{0,1}/.
Usage:
anyChar().maybe() // ==> /.?/
Matches the expression generated by the preceding sub-generator at least once, or exactly the specified number of times. Equivalent to /.+/ and /.{n}/.
Usage:
anyChar().repeat() // ==> /.+/
anyChar().repeat(5) // ==> /.{5}/
Matches the expression generated by the preceding sub-generator at least minTimes and at most maxTimes times. Equivalent to /.{min,max}/. Note that the generator tries to optimize the expression when possible.
Usage:
anyChar().multiple() // ==> /.*/
anyChar().multiple(1) // ==> /.+/
anyChar().multiple(0,1) // ==> /.?/
anyChar().multiple(5) // ==> /.{5,}/
anyChar().multiple(5,9) // ==> /.{5,9}/
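The optimization visible in the usage lines above can be sketched as a small function that picks the shortest quantifier for a given range. The function quantifierSuffix is hypothetical and only mirrors the documented outputs; it is not the regexgen.js implementation.

```javascript
// Hypothetical sketch: choose the shortest quantifier for {min,max}.
// A missing min defaults to 0 and a missing max means "no upper bound".
function quantifierSuffix(min = 0, max = Infinity) {
  if (min === 0 && max === Infinity) return '*';
  if (min === 1 && max === Infinity) return '+';
  if (min === 0 && max === 1) return '?';
  if (max === Infinity) return `{${min},}`;
  if (min === max) return `{${min}}`;
  return `{${min},${max}}`;
}

console.log('.' + quantifierSuffix());     // ".*"
console.log('.' + quantifierSuffix(1));    // ".+"
console.log('.' + quantifierSuffix(0, 1)); // ".?"
console.log('.' + quantifierSuffix(5));    // ".{5,}"
console.log('.' + quantifierSuffix(5, 9)); // ".{5,9}"
```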
Makes a quantifier greedy. Note that quantifiers are greedy by default.
Usage:
anyChar().any().greedy() // ==> /.*/
anyChar().many().greedy() // ==> /.+/
anyChar().maybe().greedy() // ==> /.?/
Makes a quantifier lazy.
Usage:
anyChar().any().lazy() // ==> /.*?/
anyChar().many().lazy() // ==> /.+?/
anyChar().maybe().lazy() // ==> /.??/
anyChar().multiple(5,9).lazy() // ==> /.{5,9}?/
reluctant() is an alias of lazy().
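The practical difference between greedy and lazy quantifiers shows up when more than one match length is possible. A native sketch with a made-up input string:

```javascript
const html = '<b>bold</b>';

// Greedy: .+ takes as much as possible, so the match runs to the last ">".
const greedyMatch = html.match(/<.+>/)[0];
// Lazy: .+? takes as little as possible, so the match stops at the first ">".
const lazyMatch = html.match(/<.+?>/)[0];

console.log(greedyMatch); // "<b>bold</b>"
console.log(lazyMatch);   // "<b>"
```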
Matches the expression generated by the preceding sub-generator only if it matches the given expression.
Usage:
// Simple Password Validation
var _ = regexGen;
var regex = regexGen(
// Anchor: the beginning of the string
_.startOfLine(),
// Match: six to ten word characters
_.word().multiple(6,10).
// Look ahead: anything, then a lower-case letter
contains( _.anything().reluctant(), _.anyCharOf(['a','z']) ).
// Look ahead: anything, then an upper-case letter
contains( _.anything().reluctant(), _.anyCharOf(['A','Z']) ).
// Look ahead: anything, then one digit
contains( _.anything().reluctant(), _.digital() ),
// Anchor: the end of the string
_.endOfLine()
);
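A native pattern expressing the same password rule can be written with lookaheads, as the comments in the example suggest. This is a hand-written approximation; the source that regexGen() actually generates for the example may arrange the lookaheads differently. The sample passwords are made up.

```javascript
// Approximation (assumption): three lookaheads, then 6 to 10 word characters.
const password = /^(?=.*?[a-z])(?=.*?[A-Z])(?=.*?\d)\w{6,10}$/;

// Print each candidate together with the validation result.
for (const candidate of ['Passw0rd', 'password', 'PASSWORD1', 'Pw0']) {
  console.log(candidate, password.test(candidate));
}
```

Only the first candidate passes: the second lacks an upper-case letter and a digit, the third lacks a lower-case letter, and the fourth is too short.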
Matches the expression generated by the preceding sub-generator only if it does not match the given expression.
Matches the expression generated by the preceding sub-generator only if it is followed by content that matches the given expression.
Matches the expression generated by the preceding sub-generator only if it is not followed by content that matches the given expression.
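The "followed by" and "not followed by" conditions correspond to native positive and negative lookahead. The sketch below uses native syntax only, with a made-up sample string, and makes no claim about the exact source regexgen.js emits.

```javascript
const sizes = '3px 4em 5px';

// Positive lookahead: digits only if followed by "px".
const pixels = sizes.match(/\d+(?=px)/g);
// Negative lookahead: digits only if not followed by "px".
const others = sizes.match(/\d+(?!px)/g);

console.log(pixels); // ["3", "5"]
console.log(others); // ["4"]
```

In both cases the text inspected by the lookahead is not part of the match itself.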
The RegExp object returned from the regexGen() function can be used directly as usual. In addition, four properties are injected into the RegExp object:
- warnings array
The warnings property is an array of strings containing the errors detected while processing and generating the final regular expression. One of the best practices of programming is to always treat warnings as errors and fix them.
- captures array
The captures property is an array of strings containing the indexes of captures and/or the labels of named captures, in the order they appear in the regular expression. The first item is always "0", the index of the whole match; the second item is either "1" or the label of the named capture passed to the label() generator, and so forth.
- extract( string text ) method
Instead of accessing the array returned by the RegExp.exec() method or the String.match() method, you can obtain a JSON object from the injected RegExp.extract() method if you are using the label() generator to capture patterns:
var sample = 'Conan: 8: Hi, there, my name is Conan.';
var _ = regexGen;
var regex = regexGen(
_.capture(_.label('name'), _.words()),
':', _.space().any(),
_.capture(_.label('age'), _.digital().many()),
':', _.space().any(),
_.capture(_.label('intro'), _.anything())
);
var result = regex.extract(sample);
expect(regex.source).to.equal(/(\w+):\s*(\d+):\s*(.*)/.source);
expect(result).to.eql({
'0': sample,
name: 'Conan',
age: '8',
intro: 'Hi, there, my name is Conan.'
});
- extractAll( string text ) method
Same as the extract() method, but returns all matches in an array. Note that this method must be used with the searchAll() modifier specified.
Usage:
var sample = 'Conan: 8, Kudo: 17';
var regex = regexGen(
capture(label('name'), words()),
':', space().any(),
capture(label('age'), digital().many()),
searchAll()
);
expect(regex.extractAll(sample)).to.eql([{
'0': 'Conan: 8',
name: 'Conan',
age: '8'
}, {
'0': 'Kudo: 17',
name: 'Kudo',
age: '17'
}]);
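The relationship between the captures array and the object returned by extract() can be sketched in plain JavaScript. The helper extractWithLabels is hypothetical and is not the library's implementation; in particular, returning null when nothing matches is an assumption.

```javascript
// Hypothetical sketch: pair each entry of a captures-style array with the
// corresponding item of the RegExp.exec() result.
function extractWithLabels(regex, labels, text) {
  const match = regex.exec(text);
  if (match === null) return null; // assumption: no match yields null
  const result = {};
  labels.forEach((label, index) => {
    result[label] = match[index];
  });
  return result;
}

// Pattern and labels taken from the extract() example above.
const pattern = /(\w+):\s*(\d+):\s*(.*)/;
const labels = ['0', 'name', 'age', 'intro'];
const extracted = extractWithLabels(pattern, labels, 'Conan: 8: Hi, there, my name is Conan.');

console.log(extracted.name);  // "Conan"
console.log(extracted.age);   // "8"
console.log(extracted.intro); // "Hi, there, my name is Conan."
```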