----------------------------------------------------------------
    Regexp/OCaml: Syntax Sugar for Regular Expression Match
----------------------------------------------------------------
						Yutaka Oiwa
					     yutaka@oiwa.jp

1. Introduction

This camlp4 macro provides convenient syntactic sugar for matching
strings against regular expressions using the PCRE/OCaml library. The
features of this macro package are the following:

  * Convenient syntax, which resembles the usual match-with
    construct
  * Pre-compilation of regular expression patterns
  * Binding matching substrings to variables
  * Easy-to-use type-coercion
  * Support for optional-patterns
  * Default values for optional-patterns


2. Features

2.1 Simple Pattern Match

	Regexp.match str with 
	  "^\d+$" -> "number"
	| "^\w+$" -> "alphabets"
	| _       -> "others"

A "Regexp.match" construct provides regular expression matching.
It first evaluates the target expression (str, in the above example)
once, and then tries the patterns of the following clauses in the
order in which they are written. The pattern language is basically
a subset of the Perl (PCRE) regular expression language.

	^		beginning of the string
	$		end of the string
	.		any single character

	(<regexp>)	a group, make a binding
	(?:<regexp>)	a group, does not make any binding
	[<char list>]	one of characters in the set
	[^<char list>]	one of characters not in the set
			  <char list> may contain ranges specified by -,
			  any \-escaped characters (sets) shown below

	\a,\e,\f,\r,\n,\t	ASCII characters 7, 27, 12, 13, 10, 9, respectively

	\d		a digit	[0-9]
	\s		a space-like character
	\w		a word-constructing character [0-9A-Za-z_]
	\D		a character which is not a \d
	\S		a character which is not a \s
	\W		a character which is not a \w

	\x<hh>		an ASCII character <hh> in hexadecimal
	\<ddd>		an ASCII character <ddd> in *decimal*
			   (following usual OCaml convention)
	\o<ooo>		an ASCII character <ooo> in octal

	\<symbol>	a character <symbol>, canceling any meta meaning

	<regexp>{<n>}		<n> repetitions of <regexp>
	<regexp>{<n>,<m>}	<n> .. <m> repetitions of <regexp>
				  takes the greedy (longest) match
				  <n> is assumed to be 0 if omitted
				  <m> is assumed to be infinity if omitted
	<regexp>{<n>,<m>}?	same as above, but performs a non-greedy match
	                        (usually a "shortest" match)
	<regexp>?	same as <regexp>{0,1}
	<regexp>??	same as <regexp>{0,1}?
	<regexp>*	same as <regexp>{0,}
	<regexp>*?	same as <regexp>{0,}?
	<regexp>+	same as <regexp>{1,}
	<regexp>+?	same as <regexp>{1,}?

	<re1>|<re2>	string matching either <re1> or <re2>

	(?i)<regexp>	case-insensitive match
	(?x)<regexp>	ignores blanks and comments in <regexp>
				these flags must appear at the beginning of
				the pattern

You can also use the Python/Ruby 1.9 extension for named subpatterns,
written as (?P<var>regexp) or (?<var>regexp) (the '<' and '>' appear
literally in the regular expression).  See Section 2.6 for details.
Do not double the backslashes inside a regexp (see the example above)
unless you need to match a literal backslash character.

If one of the patterns matches the target string, the expression of
the corresponding clause is evaluated.

The final clause may also be a single variable name or a wildcard
(_). The expression of that clause is evaluated if none of the given
patterns matches. If there is no such clause and none of the patterns
matches, the exception Match_failure is raised.

Patterns appearing in the source file are compiled in advance, before
any matching operation takes place, whenever possible.
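
For comparison, the example above could be written by hand against
pcre-ocaml roughly as follows. This is only a sketch of the idea
(patterns compiled once, then tried in order); the names re_number,
re_word and classify are made up for illustration, and the code the
macro actually generates may look quite different. Note that plain
OCaml strings need the backslashes doubled, which the macro syntax
avoids.

	(* Hand-written approximation, not the macro's real expansion. *)
	let re_number = Pcre.regexp "^\\d+$"   (* compiled once *)
	let re_word   = Pcre.regexp "^\\w+$"   (* compiled once *)

	(* Try each pattern in order; the last branch plays the
	   role of the "_" clause. *)
	let classify str =
	  if Pcre.pmatch ~rex:re_number str then "number"
	  else if Pcre.pmatch ~rex:re_word str then "alphabets"
	  else "others"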


2.2 Binding sub-patterns to variables

	Regexp.match str with
	  "^(.+)-(.+)$" as f, t
	      -> printf "range: from '%s' to '%s'" f t
	| "^(.+)$" as v -> printf "singular: '%s'" v

If a pattern contains one or more groups (parentheses), the substrings
matched by those groups can be captured in variables by using an "as"
clause. The substrings are bound, in order, to the variables listed in
the "as" clause. For example, if str evaluates to "abc-xyz", the above
program prints "range: from 'abc' to 'xyz'".  You can use "_" instead
of a variable name for any group that does not need to be bound.

The number of variable patterns in the "as" clause is checked
statically: if it does not match the number of groups in the
pattern, a compile-time error occurs.
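
As a further illustration, the following sketch skips the middle
group with "_". The pattern and the variable names year and day are
chosen only for this example; the "as" clause still lists three
entries because the pattern contains three groups.

	Regexp.match str with
	  "^(\d+)-(\d+)-(\d+)$" as year, _, day
	      -> printf "year %s, day %s" year day
	| _ -> printf "not a date"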


2.3 Type-coercion

	Regexp.match str with
	  "^(\d+)-(\d+)$" as f : int, t > int_of_string ->
	      for i = f to t do
	          printf "%d\n" i
	      done

If a type is specified after a variable with ":", matched substrings
will be automagically coerced to the specified type. The function used
for coercion will be selected by the following ad-hoc rules:

	Int32.of_string			for int32,
	Int64.of_string			for int64,
	<Module>.of_string		for <Module>.t,
	<Module>.<type>_of_string	for <Module>.<type>,
	<type>_of_string		for <type>.

For the above example, int_of_string will be called for the variable
f.  You can also specify any conversion function by using the ">"
symbol instead of ":" (see the specification for the variable t).
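
The conversions named by these rules are ordinary OCaml functions.
The plain OCaml lines below only show which standard function each
annotation would select according to the table above; the literal
strings stand in for matched substrings and are not produced by the
macro.

	(* "f : int" selects int_of_string (rule <type>_of_string) *)
	let f : int = int_of_string "42"
	(* "n : int32" selects Int32.of_string *)
	let n : int32 = Int32.of_string "42"
	(* "x : float" selects float_of_string (rule <type>_of_string) *)
	let x : float = float_of_string "1.5"
	(* "t > int_of_string" names the function explicitly *)
	let t : int = int_of_string "7"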


2.4 Optional Pattern

	Regexp.match str with
	  "^(\d+)?$" as f ->
	      match f with
	          None   -> printf "value not specified"
		| Some v -> printf "value is %s" v

If a group is qualified by "?", the corresponding variable will
automatically have an "option" type. In the above example, the
variable "f" will have the type "string option". You can combine this
feature with type-coercion by specifying a type such as
"f : int option".  (Note: the keyword "option" must be explicit: alias
names created by type declarations cannot be used here.)

If some groups are placed inside another optional group (as in the
example in Section 2.5), they are also treated as optional patterns.

A group qualified by "*" or "+" cannot be bound to a variable. The
corresponding pattern in the "as" clause must be "_" (wildcard);
otherwise a compile-time error is signalled.
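
One way to read "f : int option" is that the conversion is applied
only when the group has matched. The helper below is a plain OCaml
sketch of that reading; coerce_opt is a hypothetical name and is not
part of this package, and the generated code may be organized
differently.

	(* Apply conv to the captured substring if there is one. *)
	let coerce_opt conv captured =
	  match captured with
	    None   -> None
	  | Some s -> Some (conv s)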


2.5 Default Value

	Regexp.match str with
	  "^(\d+)(-(\d+))?$" as f : int, _, t : int = f ->
             for i = f to t do
               Printf.printf "%d\n" i
             done

Default values can be attached to any optional pattern.  If a
subgroup does not match any substring, and an expression for the
default value is specified after the corresponding pattern (with "="),
the value of that expression is used as the value of the corresponding
variable. The expression is evaluated in a context in which all
variables appearing earlier in the pattern are bound. In the above
example, the variable "f" is visible from the default value expression
for t (that is, "f").
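
The binding order described above can be pictured in plain OCaml as
follows. This is a sketch under the assumption that the default
expression is evaluated only when the group did not match; bind_range
is a hypothetical name, and g1 and g3 stand for the substrings
captured by the first and third groups of the example pattern.

	let bind_range (g1 : string) (g3 : string option) =
	  (* f is bound first, so it is visible below *)
	  let f = int_of_string g1 in
	  (* t falls back to the default "f" if group 3 is absent *)
	  let t =
	    match g3 with
	      Some s -> int_of_string s
	    | None   -> f
	  in
	  (f, t)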


2.6 Named Subpatterns (new in version 1.0.pre1)

	Regexp.match str with
	  "(?x)^
               (?P<s> \w+) : 
               ((?P<f> \d+) -)?
               (?P<t> \d+)
               $"
           as t : int, f : int = t ->
             Printf.printf "%s\n" s;
             for i = f to t do
               Printf.printf "%d\n" i
             done

From version 1.0.pre1, Regexp/OCaml supports named subpatterns,
which were first introduced by the Python scripting language.
From version 1.0, it also accepts the Ruby 1.9 syntax ("(?<...>...)").
If one or more named subpatterns appear in the regular expression,
the substrings matched by those subpatterns are bound to the
variables specified inside the < >'s.

Conversion specifications are still allowed in the "as" clause, but
their interpretation changes as follows:

  1) No wildcard pattern ("_" pattern) is allowed.
  2) The order of variables in the "as" clause may differ from
     the order of the corresponding subpatterns:
     these clauses merely specify conversions.

It is an error to specify a conversion for a variable that does not
appear in a named subpattern (mixing positional and named subpatterns
is not allowed).  A variable that appears in a named subpattern but
has no conversion specified will have either the string or the string
option type, depending on whether its subpattern is optional.

Values to be bound are converted in the order of the specifications in
the "as" clause, and the values of previously bound variables can be
used in default expressions for variables appearing later in the "as"
clause.  In the above example, the value of "t" can be used as a
default value of "f", even though it appears later in the regular
expression.  Values of subpatterns which do not appear in the "as"
clause cannot be used in default expressions for other bound variables
(but they are visible in a "when" clause, if any).
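
A smaller sketch using the Ruby 1.9 form is shown below. The pattern
and the names key and value are chosen only for illustration. Only
value is listed in the "as" clause because only it needs a
conversion; following the rules above, key should be bound as a plain
string.

	Regexp.match str with
	  "^(?<key>\w+)=(?<value>\d+)$" as value : int ->
	      Printf.printf "%s = %d\n" key value
	| _ -> print_endline "no match"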


3. Usage

The code generated by this macro requires the "pcre-ocaml" library
implemented by Markus Mottl.  Get it from
  http://www.ai.univie.ac.at/~markus/home/ocaml_sources.html
and install it. The translator itself does not depend on
pcre-ocaml, either when it is compiled or when it translates sources.

If findlib and OCamlMakefile are available, simply invoking 'make all'
should create all required binaries.

Note: 1) Current pcre-ocaml and ocaml-make require findlib
      (http://www.ocaml-programming.de). However, this package
      itself does not depend on findlib if it is compiled by hand.
      Give it a try if you want.


To use Regexp/OCaml, load "pa_regexp_match.cma" into the camlp4
preprocessor. There are several ways to do this:

  1) pass the option "-pp 'camlp4o ./pa_regexp_match.cma'" to
     ocamlc/ocamlopt.

  2) pass the option "-pp 'camlp4o -I .'" to ocamlc/ocamlopt, and
     put '#load "pa_regexp_match.cma"' at the top of each source file.

  3) pass the options "-syntax camlp4o -package pa_regexp_match" to
     ocamlfind (if you use findlib)

The path "." should be replaced with the actual path where the .cma
file is placed.

To use Regexp/OCaml with the toplevel, load "camlp4o.cma", "pcre.cma"
and "pa_regexp_match.cmo" into the ocaml toplevel.  Some internal
parameters are automatically adjusted if the package is loaded into
the toplevel environment. For example, sharing of pre-compiled
patterns between several declarations is disabled in the toplevel
environment.
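
In a toplevel session this might look like the lines below. This is
only a sketch using the file names given above; depending on the
installation, #directory directives for the library paths, or other
libraries that pcre.cma depends on, may have to come first.

	#load "camlp4o.cma";;
	#load "pcre.cma";;
	#load "pa_regexp_match.cmo";;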


4. Additional packages

4.1. declare_once

Declare_once is a generic module for placing value declarations at the
top of the current structure item. See declare_once.mli for the
interface if you are a camlp4 macro programmer and are interested in
this module.


4.2. pa_once

pa_once is a small module which adds a "once" construct.  You can
write "(once <body>)" at any place where an expression is required.
The value of the once expression is the value of the body, but the
body is evaluated only once, when the enclosing module is constructed.
(That is, once per program execution, unless either a functor or
"let module" is used.)

Any local variable which is bound between the top of the current
structure item and the place of the once expression may not be visible
from the body of the once expression. (But do not try to exploit this
in confusing ways: the scoping may depend on the surrounding
statements, and may differ in future versions.)
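
The effect can be approximated by hand by hoisting the body to a
top-level binding, as in the plain OCaml sketch below. The names
evaluations, names and name_of are made up for this example, and the
declaration that pa_once really generates may differ.

	(* With pa_once one might write, inside the function:
	     (once (incr evaluations; [| "zero"; "one"; "two" |])).(i)
	   Hand-written approximation of the same behaviour: *)
	let evaluations = ref 0
	(* The body runs once, when the module is initialized. *)
	let names = (incr evaluations; [| "zero"; "one"; "two" |])
	(* Later calls reuse the value without re-evaluating it. *)
	let name_of i = names.(i)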


4.3. pa_pragma

The pa_pragma module adds the compiler directive "#pragma".
You can write

  #pragma "some-option" "argument value";;
  #pragma "some-option";;

anywhere in a source file (as a top-level structure item), and
the corresponding camlp4 user-defined option (e.g. -some-option) is
processed.  Note that not every option supports this directive.  The
options defined in the pa_regexp_match and declare_once modules
support #pragma, at least when it is used prior to any declarations
other than #pragma and #load.


5. Future Work

Planned:

 * Improve syntax
 * Support binding to list type (for *- or +-quantified groups)
 * Support for other back-end libraries
	(especially Str module in to-be ocaml-3.07)


Planned only if someone needs it:

 * Support camlp4r syntax (I don't need at all: does anyone need this?)


Maybe:

 * Some soundness checking for type coercion and the corresponding
   pattern (see XPerl)


6. Acknowledgement


The idea of pre-compiling regular expressions was first implemented by
Francois Pottier [http://caml.inria.fr/archives/200107/msg00187.html].
I took his code as the basis of the declare_once module and then
modified it.

When writing the regexp parser, I used code from Jerome Vouillon's re
library for reference and hints.

This project was inspired by the above syntax sugar by Francois, and
also by the XPerl project [Naoshi Tabuchi et al].

The idea of binding ?-quantified groups to the option type was
suggested by Eijiro Sumii et al.


7. License

(c) 2002-2005 Yutaka Oiwa.  All rights are reserved.

The package is distributed under the same license terms as the OCaml
system.  In particular, the declare_once module is distributed under
the terms of the LGPL2 with exception, for reusability. See the file
LICENSE for details.
