1. Description
This small application implements an extension of
Definite Clause Grammars (DCGs) that introduces lookahead symbols into the
compiled code. Ordinary DCGs add two arguments to each
compiled clause: one for the input list to parse and the other for the
list that remains to be parsed after the predicate (production) has
executed. Our compilation method adds four arguments instead, placed
around the DCG predicate's own arguments:
- The current lookahead symbol, i.e. the first symbol in the input, in
the 1st predicate argument.
- The rest of the input list, in the 2nd predicate argument.
- The lookahead symbol of the string remaining to be parsed, in the
penultimate argument.
- The list remaining to be parsed, in the last argument.
The DCG predicate's own arguments appear after the 2nd argument and
before the penultimate one.
This technique allows the lookup DCG code to exploit the indexing
facilities of most Prolog implementations, and lets the user write
grammars in a more natural way, with significant performance improvements.
However, in order to be able to use lookahead information, the input
string must be terminated with a special symbol (usually -1). To support
the development of large applications we've introduced additional
syntactic sugar.
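The argument layout and the indexing benefit described above can be sketched by hand. The clauses below are an illustration written for this text, not output of the generator; the predicate names digit_dcg/3 and digit_la/5 are hypothetical, and the actual generated code may differ.

```prolog
% Ordinary DCG translation of  digit(0'0) --> "0".  and  digit(0'1) --> "1".
% Two extra arguments: the input list and the list left after parsing.
digit_dcg(0'0, [0'0|Remaining], Remaining).
digit_dcg(0'1, [0'1|Remaining], Remaining).

% Hand-written lookahead-style counterpart (assumed shape, for illustration).
% Argument order: Lookahead, RestInput, DCG argument, NextLookahead, Remaining.
% The lookahead is a constant in the 1st argument, so clause selection can
% use the indexing of the Prolog system instead of trying each clause.
digit_la(0'0, [Next|Remaining], 0'0, Next, Remaining).
digit_la(0'1, [Next|Remaining], 0'1, Next, Remaining).
```

For the input "10" terminated by -1, the goal digit_la(0'1, [0'0,-1], D, Next, Remaining) binds D to 0'1, Next to 0'0 and Remaining to [-1].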
To simplify the determination of lookahead symbol information, the
lookup DCG compiler resorts to the tabling features of XSB Prolog and
therefore is not portable to other Prolog systems. However, the generated
code is fully standard and can be used in any Prolog system. This parser
generator has been used for the implementation of a full non-validating
XML parser.
2. Lookup DCG syntax
Productions can have two forms:
- Head --> Body. These behave
as ordinary DCG productions, except that the extra arguments for
lookahead symbol propagation are introduced.
- Head ::= Body. These
productions obtain lookahead symbol information from their bodies, and
use it to optimize the execution of the grammar. This form must be used
with care, since a large number of rules might be generated from a single
production; as a rule of thumb, one rule is generated for each lookahead
symbol in the Body.
The bodies of productions have a syntax similar to that of ordinary DCGs,
except that we introduced additional syntax to represent terminal symbols,
permitting the specification of (unions of) interval ranges. Regarding
non-terminals, we allow the inline expansion of a non-terminal by its
rules. Cuts are allowed in production bodies, as well as actions with the
usual { Prolog Code } syntax. The full
syntax is described next:
Non-Terminal symbols in the body:
- + NonTerminal, indicates that
NonTerminal rules are expanded inline.
- NonTerminal, where NonTerminal is
an atom specifying a non-terminal symbol
Terminal symbols in the body:
- [], the empty list is used to
represent the empty string.
- [S1,S2,...,Sn], recognizes the
sequence obtained from recognizing S1, S2, ..., Sn.
- [S1,S2,...,Sn]/[C1,C2,...,Cn], as
before, but C1 is the symbol in the input recognized by S1, C2 is the
symbol in the input recognized by S2, ..., and Cn is the symbol in the
input recognized by Sn.
The third case above is an extension to Prolog DCGs, since we allow
the use of ranges in any of the Si symbol expressions above. A symbol
expression might be:
- An atom or character code, as in ordinary DCGs
- Min-Max, recognizing any character
code between Min and Max, and thus these must be integer numbers such
that Min <= Max.
- [Min1-Max1,Min2-Max2,...,MinN-MaxN],
recognizing any character code between Min1 and Max1, or Min2 and
Max2, ... or MinN and MaxN.
The parser generator does not take ranges into account when generating
optimized code for productions of the form ::=, so
these must be used with care (the behaviour is the same as that of
ordinary DCGs).
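The meaning of symbol expressions and of the [S1,...,Sn]/[C1,...,Cn] form can be sketched with a small interpreter. This is only a reading of the description above, not the code the generator produces; the predicates in_range/3, sym_matches/2 and match_seq/4 are hypothetical names introduced here.

```prolog
% in_range(+Min, +Max, +Code): Code is an integer with Min =< Code =< Max.
in_range(Min, Max, Code) :-
    integer(Code), Min =< Code, Code =< Max.

% sym_matches(+SymbolExpr, +Code): Code is accepted by one symbol expression.
sym_matches(Sym, Code) :-                 % an atom or a single character code
    atomic(Sym), !, Sym == Code.
sym_matches(Min-Max, Code) :- !,          % a single range
    in_range(Min, Max, Code).
sym_matches([Min-Max|Ranges], Code) :-    % a union of ranges
    (   in_range(Min, Max, Code)
    ->  true
    ;   sym_matches(Ranges, Code)
    ).

% match_seq(+Exprs, +Input, -Captured, -Rest): the [S1,...,Sn]/[C1,...,Cn]
% form; Captured holds the input symbol recognized by each expression.
match_seq([], Rest, [], Rest).
match_seq([S|Ss], [C|Cs], [C|Caps], Rest) :-
    sym_matches(S, C),
    match_seq(Ss, Cs, Caps, Rest).
```

For example, match_seq([[0'A-0'Z,0'a-0'z], 0'0-0'9], [0'x,0'7,-1], Cs, Rest) binds Cs to [0'x,0'7] and Rest to [-1].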
Production control
The following constructions are allowed in the bodies of productions
in order to control the execution of the parser:
- !, as in ordinary DCGs
- { Prolog Code }, actions as in
ordinary DCGs
- ? [C1,...,Cn], tests whether the input
starts with [C1,...,Cn], where C1, ...,
Cn are character codes. This does not consume input. This construction
is an extension and is mostly used in the form
Head --> ? "test", !. which allows the
programmer to state base conditions without consuming input.
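The behaviour of the ? test can be sketched as a prefix check that returns no remaining input. This is a hand-written illustration under the lookahead representation described in section 1, not the generator's output; prefix_of/2 and peek/3 are hypothetical names.

```prolog
% prefix_of(+Codes, +Input): Input begins with Codes. Nothing is consumed,
% because no "remaining input" argument is returned.
prefix_of([], _).
prefix_of([C|Cs], [C|Input]) :-
    prefix_of(Cs, Input).

% peek(+Codes, +Lookahead, +Rest): the same test when the input is split
% into its first symbol (the lookahead) and the rest of the list.
peek(Codes, Lookahead, Rest) :-
    prefix_of(Codes, [Lookahead|Rest]).
```

With this sketch, peek([0't,0'e], 0't, [0'e,0's,0't,-1]) succeeds and leaves the input untouched, while peek([0'x], 0't, [0'e,-1]) fails.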
3. Installation and usage of the Lookup DCG parser generator
- Construct your parser according to the previous syntax. The parser
may be split across several files and may contain auxiliary Prolog code
and declarations. We suggest using the extension .G for these
files.
- Declare in the parser file the start non-terminal symbol with the
declaration :- start( Name/Arity).
- Declare in the parser file the end terminal symbol with the
declaration :- end( Symbol ), usually -1
if parsing lists of character codes.
- The generation of parser code for some productions can be prevented
by adding the declaration :- - Name/Arity.
This is used, for instance, for removing all the code for fully expanded
non-terminal symbols.
- The parser generator code must be
extracted to a directory and compiled with the goal
?- [lookupdcg].
- Generate the parser with the call
?- gen_parser( ['File1.G', 'File2.G',...,'FileN.G'], 'OutFile.P').
The first argument contains the list of files of the parser to be
generated. The compiled code is put in a single file, given in the 2nd
argument of the gen_parser/2 predicate. This file must then be
compiled.
4. Example
The following grammar
parses lists of natural numbers and names separated by line terminators,
either 0xA (line feed) or 0xD (carriage return).
% An example Lookup DCG
:- start( example/1 ).
:- end( -1 ).
:- - digit/1.
example( Is ) ::= lf, !, example( Is ).
example( [] ) ::= [].
example( [I|Is] ) --> item( I ), !, lf, example( Is ).
item( I ) ::= !, number( I ).
item( I ) ::= name( I ).
number( N ) --> + digit(D), !, rest_digits( Ds ), { number_codes( N, [D|Ds] ) }.
rest_digits( [D|Ds] ) --> + digit( D ), !, rest_digits( Ds ).
rest_digits( [] ) ::= [].
digit( 0'0 ) --> "0".
digit( 0'1 ) --> "1".
digit( 0'2 ) --> "2".
digit( 0'3 ) --> "3".
digit( 0'4 ) --> "4".
digit( 0'5 ) --> "5".
digit( 0'6 ) --> "6".
digit( 0'7 ) --> "7".
digit( 0'8 ) --> "8".
digit( 0'9 ) --> "9".
name( N ) --> startchar(C), !, rest_name( Cs ), { atom_codes( N, [C|Cs] ) }.
rest_name( [C|Cs] ) --> namechar( C ), !, rest_name( Cs ).
rest_name( [] ) --> [].
startchar( C ) --> [[0'A-0'Z,0'a-0'z]]/[C], !.
namechar( D ) ::= + digit(D), !.
namechar( C ) ::= startchar(C).
lf --> [16'A].
lf --> [16'D].
To generate the parser for this grammar, consult the parser generator
file and then call gen_parser/2:
| ?- [lookupdcg].
[lookupdcg loaded]
[readgram loaded]
[predparserint loaded]
[parserexp loaded]
yes
| ?- gen_parser( ['example.G'], 'example.P' ).
example / 1
item / 1
number / 1
rest_digits / 1
name / 1
rest_name / 1
startchar / 1
namechar / 1
lf / 0
yes
The generated code is stored in example.P. Users are
encouraged to read it and work through how it behaves. Notice that no
rules for digit/1 are generated, since all
occurrences of digit in the grammar are expanded inline using the
+ digit(D) facility. The use of cuts can
be very subtle, as the rules for
item/1 and
startchar/1 show.
To use the parser, invoke a goal of the form
example( FirstSymbol, RestSymbols,
Items, -1, [] ), as in the example below:
| ?- example( 10, [0'a,0'0,0'Z,10,0'1,0'0,10,10,0'1,10,-1], Is, -1, [] ).
Is = [a0Z,10,1]
yes
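Splitting the input into its first symbol and the rest, and appending the end symbol, can be wrapped in a helper. The sketch below is an assumption made for illustration and is not part of the tool: parse_with/3 and append_codes/3 are hypothetical names, call/6 is assumed to be available as in ISO-conforming systems, and all_codes/5 is a hand-written toy stand-in for a generated start predicate so that the block can be tried on its own.

```prolog
% append_codes(?Xs, ?Ys, ?Zs): local list append, defined here so the sketch
% does not depend on a particular library.
append_codes([], Ys, Ys).
append_codes([X|Xs], Ys, [X|Zs]) :-
    append_codes(Xs, Ys, Zs).

% parse_with(+Start, +Codes, -Result): appends the end symbol -1, splits off
% the first symbol as lookahead, and calls a start predicate that has one
% DCG argument and is expected to consume the whole input.
parse_with(Start, Codes, Result) :-
    append_codes(Codes, [-1], [First|Rest]),
    call(Start, First, Rest, Result, -1, []).

% all_codes/5: toy stand-in written in the lookahead argument layout; it
% returns the codes that precede the end symbol.
all_codes(-1, [], [], -1, []).
all_codes(C, [Next|Rest], [C|Cs], EndLookahead, Remaining) :-
    C \== -1,
    all_codes(Next, Rest, Cs, EndLookahead, Remaining).
```

Here parse_with(all_codes, [0'a,0'b], R) binds R to [0'a,0'b]. With the generated example.P loaded, parse_with(example, [10,0'a,0'0,0'Z,10,0'1,0'0,10,10,0'1,10], Is) would issue the same goal as the session above.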
5. Copyright
This is an academic and experimental tool. It
cannot be used for commercial purposes without the explicit consent of the
author.
6. Disclaimer
This is an academic and experimental tool. I
give no guarantee of any kind regarding the use of this tool.
Last update: October 28th, 2003