The procedures in this section provide access to POSIX regular expression matching. The regular expression syntax and semantics are far too complex to be described here.
Note: Because the C interface uses ASCII NUL bytes to
mark the ends of strings, patterns & strings that contain NUL
characters will not work correctly.
The first interface to regular expressions is a thin layer over the
interface that POSIX provides. It is exported by the structures
posix-regexps & posix.
Make-regexp creates a regular expression with the given string pattern. The arguments after string specify various options for the regular expression; see regexp-option below. The regular expression is not compiled until it is matched against a string, so any errors in the pattern string will not be reported until that point.
Regexp? is the disjoint type predicate for regular expression objects.
Regexp-option evaluates to a regular expression option, suitable to be passed to
make-regexp, with the given name. The possible option names are:
extended - use the extended patterns
ignore-case - ignore case differences when matching
submatches - report submatches
newline - treat newlines specially
Regexp-match matches regexp against the characters in string, starting at position start. If the string does not match the regular expression, regexp-match returns #f. If the string does match, then a list of match records is returned if submatches? is true, or #t if submatches? is false. The first match record gives the location of the substring that matched regexp. If the pattern in regexp contained submatches, then the submatches are returned in order, with match records in the positions where submatches succeeded and #f in the positions where submatches failed.
Starts-line? should be true if string starts at the beginning of a line, and ends-line? should be true if it ends one.
Match? is the disjoint type predicate for match records. Match records contain three values: the beginning & end of the substring that matched the pattern and an association list of submatch keys and corresponding match records for any named submatches that also matched. Match-start returns the index of the first character in the matching substring, and match-end gives the index of the first character after the matching substring. Match-submatches returns the alist of submatches.
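As a rough illustration of the low-level interface, the sketch below compiles an extended pattern & collects the text of each reported match. It assumes that the structure posix-regexps has been opened (for instance with ,open posix-regexps at the Scheme48 command processor), that regexp-option takes its option name unquoted, and that regexp-match takes its arguments in the order regexp, string, start, submatches?, starts-line?, ends-line?; that order is inferred from the description above, not stated by it. Matched-text is a hypothetical helper, not part of the interface.

```scheme
;; Assumes the posix-regexps structure is open:  ,open posix-regexps

;; An extended pattern with one parenthesized submatch.
(define rx
  (make-regexp "a(b+)c"
               (regexp-option extended)
               (regexp-option submatches)))

;; Hypothetical helper: the substring of STRING covered by match record M.
(define (matched-text m string)
  (substring string (match-start m) (match-end m)))

(define subject "xxabbbcxx")

;; Assumed argument order: regexp string start submatches? starts-line? ends-line?
;; The result is #f on failure; otherwise a list whose first element describes
;; the whole match and whose remaining elements are match records or #f.
(let ((matches (regexp-match rx subject 0 #t #t #t)))
  (and matches
       (map (lambda (m) (and m (matched-text m subject)))
            matches)))
```

Because the pattern is not compiled until it is first matched, an error in the pattern string would surface at the call to regexp-match, not at make-regexp.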
This section describes a functional interface for building regular
expressions and matching them against strings, higher-level than the
direct POSIX interface. The matching is done using the POSIX regular
expression package. Regular expressions constructed by procedures
listed here are compatible with those in the previous section; that is,
they satisfy the predicate regexp? from the posix-regexps
structure. These names are exported by the structure regexps.
Character sets may be defined using a list of characters and strings, using a range or ranges of characters, or by using set operations on existing character sets.
Set returns a character set that contains all of the character arguments and all of the characters in all of the string arguments. Range returns a character set that contains all characters between low-char and high-char, inclusive. Ranges returns a set that contains all of the characters in the given set of ranges. Range & ranges use the ordering imposed by char->integer. Ascii-range & ascii-ranges are like range & ranges, but they use the ASCII ordering. Ranges & ascii-ranges must be given an even number of arguments. It is an error for a high-char to be less than the preceding low-char in the appropriate ordering.
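A small sketch of the constructors just described may help. It assumes the regexps structure is open (,open regexps) and that a character set may be used wherever a regular expression is expected, as the examples at the end of this section do with sequence. The names vowels, sign, digit & hex-digit are illustrative only.

```scheme
;; Assumes the regexps structure is open:  ,open regexps

(define vowels (set "aeiou"))            ; characters taken from a string
(define sign (set #\+ #\-))              ; individual character arguments
(define digit (range #\0 #\9))           ; one inclusive range
(define hex-digit                        ; three ranges, so six arguments
  (ranges #\0 #\9 #\a #\f #\A #\F))

(exact-match? digit "7")       ; => #t
(exact-match? digit "x")       ; => #f
(exact-match? hex-digit "c")   ; => #t
(exact-match? sign "-")        ; => #t
(exact-match? vowels "b")      ; => #f
```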
Set operations on character sets.
Negate returns a character set of all characters that are not in char-set. Union returns a character set that contains all of the characters in char-seta and all of the characters in char-setb. Intersection returns a character set of all of the characters that are in both char-seta and char-setb. Subtract returns a character set of all the characters in char-seta that are not also in char-setb.
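The set operations can be sketched in the same way. This again assumes the regexps structure is open & that character sets can be matched directly as regular expressions; all of the defined names are hypothetical.

```scheme
;; Assumes the regexps structure is open:  ,open regexps

(define lower (range #\a #\z))
(define vowels (set "aeiou"))

(define consonants (subtract lower vowels))              ; in lower, not in vowels
(define vowel-or-digit (union vowels (range #\0 #\9)))   ; in either set
(define lower-hex (intersection lower (range #\a #\f)))  ; in both sets
(define non-vowel (negate vowels))                       ; everything else

(exact-match? consonants "b")       ; => #t
(exact-match? consonants "e")       ; => #f
(exact-match? vowel-or-digit "5")   ; => #t
(exact-match? lower-hex "g")        ; => #f
(exact-match? non-vowel "Z")        ; => #t
```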
Predefined character sets:
lower-case = (set "abcdefghijklmnopqrstuvwxyz")
upper-case = (set "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
alphabetic = (union lower-case upper-case)
numeric = (set "0123456789")
alphanumeric = (union alphabetic numeric)
punctuation = (set "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")
graphic = (union alphanumeric punctuation)
printing = (union graphic (set #\space))
control = (negate printing)
blank = (set #\space (ascii->char 9)) ; ASCII 9 = TAB
whitespace = (union (set #\space) (ascii-range 9 13))
hexdigit = (set "0123456789ABCDEF")
String-start returns a regular expression that matches the beginning of the string being matched against; string-end returns one that matches the end.
Sequence returns a regular expression that matches the concatenation of all of its arguments; one-of returns a regular expression that matches any one of its arguments.
Text returns a regular expression that matches exactly the characters in string, in order.
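The following sketch combines these constructors. It assumes the regexps structure is open & that string-start and string-end are called with no arguments, which the wording above suggests but does not spell out. The name greeting is illustrative.

```scheme
;; Assumes the regexps structure is open:  ,open regexps

;; Either "hello, world" or "goodbye, world", anchored at both ends.
(define greeting
  (sequence (string-start)
            (one-of (text "hello") (text "goodbye"))
            (text ", world")
            (string-end)))

(any-match? greeting "hello, world")       ; => #t
(any-match? greeting "goodbye, world")     ; => #t
(any-match? greeting "oh, hello, world")   ; => #f  (not at the start)
(any-match? greeting "hello, world!")      ; => #f  (not at the end)
```

With both anchors present, any-match? accepts only strings that consist of the whole pattern, since no proper substring can begin at the start of the string & finish at its end.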
Repeat returns a regular expression that matches zero or more occurrences of its regexp argument. With only one argument, the result will match regexp any number of times. With two arguments, i.e. one count argument, the returned regular expression will match regexp exactly that number of times. The final case will match from min to max repetitions, inclusive. Max may be #f, in which case there is no maximum number of matches. Count & min must be exact, non-negative integers; max should be either #f or an exact, non-negative integer.
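The three forms of repeat might be exercised as below. The sketch assumes the regexps structure is open & that the count or min & max arguments come before the regexp argument, i.e. (repeat regexp), (repeat count regexp) and (repeat min max regexp); the text above does not state the argument order, so treat it as an assumption.

```scheme
;; Assumes the regexps structure is open:  ,open regexps
;; Assumed argument order: counts first, regexp last.

(define digit (range #\0 #\9))

(exact-match? (repeat digit) "2024")        ; => #t  any number of digits
(exact-match? (repeat 5 digit) "90210")     ; => #t  exactly five
(exact-match? (repeat 5 digit) "9021")      ; => #f  only four
(exact-match? (repeat 2 4 digit) "123")     ; => #t  between two and four
(exact-match? (repeat 2 4 digit) "12345")   ; => #f  more than four
(exact-match? (repeat 1 #f digit) "")       ; => #f  at least one required
```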
Regular expressions are normally case-sensitive, but case sensitivity can be manipulated simply.
The regular expression returned by ignore-case is identical to its argument except that case will be ignored when matching. The value returned by use-case is protected from future applications of ignore-case. The expressions returned by use-case and ignore-case are unaffected by any enclosing uses of these procedures.
By way of example, the following matches "ab", but not "aB", "Ab", or "AB":
(text "ab")
while
(ignore-case (text "ab"))
matches all of those, and
(ignore-case (sequence (text "a") (use-case (text "b"))))
matches "ab" or "Ab", but not "aB" or "AB".
A subexpression within a larger expression can be marked as a submatch. When an expression is matched against a string, the success or failure of each submatch within that expression is reported, as well as the location of the substring matched by each successful submatch.
Submatch returns a regular expression that is equivalent to regexp in every way except that the regular expression returned by submatch will produce a submatch record in the output for the part of the string matched by regexp. No-submatches returns a regular expression that is equivalent to regexp in every respect except that all submatches generated by regexp will be ignored & removed from the output.
Any-match? returns #t if string matches regexp or contains a substring that does, or #f otherwise. Exact-match? returns #t if string matches regexp exactly, or #f if it does not.
Match returns #f if string does not match regexp, or a match record if it does, as described in the previous section. Matching occurs according to POSIX. The match returned is the one with the lowest starting index in string. If there is more than one such match, the longest is returned. Within that match, the longest possible submatches are returned.
All three matching procedures cache a compiled version of regexp. Subsequent calls with the same input regular expression will be more efficient.
Here are some examples of the high-level regular expression interface:
(define pattern (text "abc"))
(any-match? pattern "abc") => #t
(any-match? pattern "abx") => #f
(any-match? pattern "xxabcxx") => #t
(exact-match? pattern "abc") => #t
(exact-match? pattern "abx") => #f
(exact-match? pattern "xxabcxx") => #f
(let ((m (match (sequence (text "ab")
                          (submatch 'foo (text "cd"))
                          (text "ef"))
                "xxxabcdefxx")))
  (list m (match-submatches m)))
    => (#{Match 3 9} ((foo . #{Match 5 7})))
(match-submatches
 (match (sequence (set "a")
                  (one-of (submatch 'foo (text "bc"))
                          (submatch 'bar (text "BC"))))
        "xxxaBCd"))
    => ((bar . #{Match 4 6}))
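Match records give only indices, so recovering the matched text takes one more step. The sketch below shows one possible way to do that for a named submatch; submatch-text is a hypothetical helper, not part of the regexps structure, and the sketch assumes that match-start & match-end are visible alongside match-submatches once regexps is open (if they are not, opening posix-regexps as well should provide them).

```scheme
;; Assumes the regexps structure is open:  ,open regexps

;; Hypothetical helper: the text matched by the submatch named KEY in match
;; record M, or #f if that submatch did not take part in the match.
(define (submatch-text m key string)
  (let ((entry (assq key (match-submatches m))))
    (and entry
         (substring string
                    (match-start (cdr entry))
                    (match-end (cdr entry))))))

(define subject "xxxaBCd")

(define m
  (match (sequence (set "a")
                   (one-of (submatch 'foo (text "bc"))
                           (submatch 'bar (text "BC"))))
         subject))

(submatch-text m 'bar subject)   ; => "BC"
(submatch-text m 'foo subject)   ; => #f
```

Only submatches that succeeded appear in the alist, which is why looking up foo yields #f here.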