From xemacs-m  Thu Aug 21 01:12:11 1997
Message-Id: <m0x1QRE-00006tC@turnbull.sk.tsukuba.ac.jp>
To: xemacs-beta@xemacs.org
Subject: Re: New regex syntax 
In-reply-to: Your message of "21 Aug 1997 02:36:26 GMT."
             <m2rabogt3p.fsf@haruspex.demon.co.uk> 
Date: Thu, 21 Aug 1997 15:09:44 +0900
From: "Stephen J. Turnbull" <turnbull@turnbull.sk.tsukuba.ac.jp>

>>>>> "Karl" == Karl M Hegbloom <karlheg@inetarena.com> writes:

 Karl> If I want to match any character, I use ".*".  But that doesn't
 Karl> match newline.

>>>>> "Len" == Leonard Blanks <ltb@haruspex.demon.co.uk> writes:

    Len> You mistyped "."; but considering ".*" for the moment, that
    Len> would match quite a bit more in most circumstances than you
    Len> might wish if the match is allowed to extend beyond line
    Len> boundaries.  It is enough of a noddy trap disallowing its
    Len> greed to extend past the line's end.

    Len> There are good reasons for requiring explicit matches with
    Len> "$" (well *implicitly* as a match anchor, but you know what I
    Len> mean).

Indeed.  I don't use regexps that much in Emacs Lisp, but I do in
Perl.  Maybe it's sloppy, but using, e.g., ";\\(.*\\)" to grab Lisp
comments is something I do a lot.  This would break, _especially_ if
the variable Karl asks for below got set somehow behind my back.  I
don't want to have to protect all my regexps by making that variable
local....

To get ".*" to cross line boundaries only requires "\\(.\\|\n\\)*".
You can also insert quoted newlines in the minibuffer for interactive
searches.  (I thought that you could use `$' instead of `\n', but it
doesn't seem to work after all.  I guess this isn't a portability
problem, since `\r' and so on get caught by the `.'.)  I've used this
idiom for a long time to match addresses buried deep in continued
RFC 822 headers.
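
A minimal sketch of the idiom above, run against a string rather than
a buffer.  The header text and the name `my-sample-header' are made up
for illustration; only `string-match' is a real Emacs Lisp function.

```elisp
;; Hypothetical sample: a header continued onto a second line.
(defconst my-sample-header
  "To: alice@example.org,\n\tbob@example.org\nSubject: test\n"
  "Made-up RFC 822-style header text used only in this example.")

;; "." does not match newline, so ".*" cannot reach the continuation line.
(string-match "To:.*bob" my-sample-header)              ; => nil

;; The explicit "or newline" alternative lets the match cross the boundary.
(string-match "To:\\(.\\|\n\\)*bob" my-sample-header)   ; => 0
```

The second call returns 0 because the match starts at the beginning of
the string.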

Note that Emacs Lisp already partially supports Perl 5's "literate
regexp" style (as did Perl 4, for that matter :-).  Here's a regexp
that gets C comments, using the "minimal matching" extension:

(concat "/\\*"            ; match "open C comment"
        "\\(.\\|\n\\)*?"  ; match newlines, too, but only until ...
        "\\*/")           ; match "close C comment"

Takes a minor efficiency hit, of course, but it's nice and readable.
It warns the naive about the line-crossing behavior and the use of an
extension.  In most of my applications, the regexp gets assigned to a
variable whose value is used, rather than being hardcoded, so it's
only computed at load time.  Except for the fact that it's in a
dialect of LIDBS (Lots of Irritating Doubled BackSlashes), not Perl,
I'm happy with that....
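
As a sketch of the "assign it to a variable" usage, assuming an Emacs
that supports the minimal-matching `*?' operator: the names
`my-c-comment-regexp' and `my-first-c-comment' are hypothetical, chosen
only for this example.

```elisp
;; Built once, when this form is loaded, not on every search.
(defconst my-c-comment-regexp
  (concat "/\\*"            ; match "open C comment"
          "\\(.\\|\n\\)*?"  ; anything, newlines included, minimally
          "\\*/")           ; match "close C comment"
  "Regexp matching one C comment (hypothetical name, for illustration).")

(defun my-first-c-comment (string)
  "Return the first C comment in STRING, or nil if there is none."
  (when (string-match my-c-comment-regexp string)
    (match-string 0 string)))

(my-first-c-comment "int x; /* one */ int y; /* two */")
;; => "/* one */"
```

With a greedy `*' in place of `*?', the same call would run on to the
end of the second comment.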

Now there's something I'd like to see: a group of functions that take
regexps in various dialects and translate them to Emacs Lisp syntax.
(Just kidding; fixing the IDBS syndrome means breaking the reader, of
course.)  Although egrep/Perl style lexing (where unquoted punctuation
marks are operators and quoted punctuation is literal) would be a big help,
for my uses anyway.  In the presence of "exploited child labor coal
miners" I have to say that this would screw up anybody who's writing
code that operates on Lisp, so let's not make it the default.  It
probably wouldn't help code that operates on C very much either.

But this could be done either with a helper function (as above, in my
usage this is not a big efficiency hit ... hmmm, I guess I should
start coding RSN :-) or with a ("yet another"---see below) optional
argument to the regexp matchers.
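
A rough sketch of what such a helper function might look like.  The
name `my-egrep-to-emacs-regexp' is hypothetical, and this is not an
existing Emacs function.  It only swaps the quoting of the grouping,
alternation and interval operators; it assumes the input contains no
bracket expressions like "[(|)]" (which it would mangle) and it
translates nothing else that differs between the dialects.

```elisp
(defun my-egrep-to-emacs-regexp (regexp)
  "Translate REGEXP from egrep-style quoting to Emacs Lisp quoting.
Hypothetical helper: only ( ) | { } are handled, and bracket
expressions are not recognized."
  (let ((i 0)
        (len (length regexp))
        (out '()))
    (while (< i len)
      (let ((ch (aref regexp i)))
        (cond
         ;; A backslash followed by another character.
         ((and (eq ch ?\\) (< (1+ i) len))
          (let ((next (aref regexp (1+ i))))
            (if (memq next '(?\( ?\) ?| ?{ ?}))
                ;; Quoted operator: literal in egrep, written bare in Emacs.
                (push (string next) out)
              ;; Any other escape is copied unchanged.
              (push (string ch next) out)))
          (setq i (+ i 2)))
         ;; Bare operator in egrep needs a backslash in Emacs.
         ((memq ch '(?\( ?\) ?| ?{ ?}))
          (push (string ?\\ ch) out)
          (setq i (1+ i)))
         ;; Everything else passes through.
         (t
          (push (string ch) out)
          (setq i (1+ i))))))
    (apply #'concat (nreverse out))))

(my-egrep-to-emacs-regexp ";(.*)")   ; => ";\\(.*\\)"
```

The input string still has to get past the Lisp reader, so a literal
backslash in the egrep-style source is doubled there too; the helper
only removes the extra backslashes on the operators.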

 Karl> Is there a lisp variable I can bind to make the emacs'
 Karl> include newline in that?

    Len> I hope not.

I see your "hope not," and raise one: I hope never.  The advantage of
the "or newline" idiom is that you have visible indication that this
regexp might read to the end of your buffer; if you must have a switch
to make "." match newlines too (although I can't remember ever using
more than one line-crossing construct in a regexp, so it's not really
necessary for my purposes), it should be an optional argument to the
regexp-matching function, like Perl's "m//s".  But I wouldn't use
this, either, because I often catenate up regexps from pieces stored
in variables.  Heaven help me if I used a `.' in a regexp I first
wrote in 1980.  I'm so near-sighted I'd never see back from the end of
the buffer to where the bug was....
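
To make the point about catenated pieces concrete, here is a small
sketch.  All three names are hypothetical.  The idea is that whether a
piece may cross lines is visible in the piece itself, so gluing pieces
together cannot change what an old `.' means.

```elisp
(defconst my-rest-of-line ".*"
  "Piece that stops at the end of the current line.")

(defconst my-across-lines "\\(.\\|\n\\)*"
  "Piece that may run across line boundaries, visibly so.")

(defun my-match-end-after (prefix piece text)
  "Match literal PREFIX followed by regexp PIECE against TEXT.
Return the end position of the match, or nil if there is none."
  (when (string-match (concat (regexp-quote prefix) piece) text)
    (match-end 0)))

;; The sample text is 8 characters long; the first newline is at index 5.
(my-match-end-after ";" my-rest-of-line "a ; b\nc\n")   ; => 5
(my-match-end-after ";" my-across-lines "a ; b\nc\n")   ; => 8
```

The first call stops just before the newline; the second runs to the
end of the string, which is exactly the behavior a global switch would
impose on every piece at once.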

Steve

-- 
                            Stephen J. Turnbull
Institute of Policy and Planning Sciences                    Yaseppochi-Gumi
University of Tsukuba                      http://turnbull.sk.tsukuba.ac.jp/
Tel: +81 (298) 53-5091;  Fax: 55-3849              turnbull@sk.tsukuba.ac.jp

