Copyright © 1992, 1997 International Organization for Standardization. All rights reserved. This electronic document is for use during development and review of International Standards. Official printed copies of International Standards can be purchased from the ISO and the national standards organization of your country.
A.4 Property Set Definition Requirements (PSDR)
This clause defines grove construction processes of general utility to applications.
One of the intended uses for groves is as an abstraction for locating information. Each node in a grove represents a piece of information; the property set according to which the grove was constructed defines the types of information that can be located.
Each node can be further broken down into a list of properties and their values. List property values can be broken down into their constituent members. These also represent pieces of information that may need to be located. Nodal property values present no problem; as the values are nodes, they are already valid locations. However, non-nodal property values are not independent of the nodes that exhibit them, and therefore present the problem that they cannot be located directly.
The "Value-To-Node grove construction process" solves this problem by defining a way to construct groves derived from property values. Nodes thus constructed are considered to be the locations of their source property values.
Each class in the VTN property set corresponds to a primitive non-nodal datatype. Each class has a "source property" property of datatype compname, which in combination with the intrinsic "source" property is used to identify the property value a particular VTN grove represents.
Each class also has a "value" property. If the name of the class is that of a simple datatype, then the datatype of the value property of the class is that simple datatype, with the exception of the "enum" class, whose value property is of datatype compname. If the name of the class is that of a list datatype, then the datatype of the value property of the class is node list (with node relationship "subnode"), and the allowed class for the property is the class corresponding to the simple datatype of which the list datatype is composed. The value properties of the char, string, and list classes are also the content properties of those classes.
The VTN property set is:
<!-- Value-To-Node (VTN) Property Set -->
<!DOCTYPE propset
PUBLIC "ISO/IEC 10744:1997//DTD Property Set//EN"
[
<!NOTATION VTN
PUBLIC "ISO/IEC 10744:1997//NOTATION
Value-To-Node Grove Construction Process//EN"
>
]>
<propset gcsd="VTN">
<classdef rcsnm=enum fullnm="enumeration">
<propdef rcsnm=srcprop appnm="source property" compname>
<propdef rcsnm=value compname>
<desc>
The name of the enumdef corresponding to the value of the source
property.
<classdef rcsnm=char fullnm="character" conprop=value>
<propdef rcsnm=srcprop appnm="source property" compname>
<propdef rcsnm=value char>
<classdef rcsnm=string conprop=value>
<propdef rcsnm=srcprop appnm="source property" compname>
<when>
The string is not a member of a string list.
<propdef rcsnm=value string>
<classdef rcsnm=strlist fullnm="string list" conprop=value>
<propdef rcsnm=srcprop appnm="source property" compname>
<propdef rcsnm=value nodelist subnode ac="string">
<classdef rcsnm=integer>
<propdef rcsnm=srcprop appnm="source property" compname>
<when>
The integer is not a member of an integer list.
<propdef rcsnm=value integer>
<classdef rcsnm=intlist fullnm="integer list" conprop=value>
<propdef rcsnm=srcprop appnm="source property" compname>
<propdef rcsnm=value nodelist subnode ac="integer">
<classdef rcsnm=boolean>
<propdef rcsnm=srcprop appnm="source property" compname>
<propdef rcsnm=value boolean>
<classdef rcsnm=compname fullnm="component name">
<propdef rcsnm=srcprop appnm="source property" compname>
<when>
The component name is not a member of a component name list.
<propdef rcsnm=value compname>
<classdef rcsnm=cnmlist fullnm="component name list" conprop=value>
<propdef rcsnm=srcprop appnm="source property" compname>
<propdef rcsnm=value nodelist subnode ac="compname">
VTN groves are constructed from the values of non-nodal properties. The class of the grove root is determined by the datatype of the "source property" property (not to be confused with the intrinsic "source" property); that is, if the datatype of the source property is "integer", then the class of the grove root is "integer".
The VTN grove constructed from the value of a property with a simple datatype consists of a single node. The value of the VTN node's intrinsic "source" property is the node that exhibits the source value. The value of the VTN node's "source property" property is the component name of the property for which the source node exhibited the source value. The value of the VTN node's "value" property is the same as the value exhibited by the source node for the source property, except when the source property's datatype is an enumeration, in which case the value of the VTN node's "value" property is the component name of the enumerated value.
Multiple VTN nodes are constructed from the value of a property with a list datatype. They consist of a grove root and a list of its child nodes. The grove root's intrinsic "source" and (not intrinsic) "source property" properties have values as described for simple datatypes above. The value of the grove root's "value" property is a list of child nodes, whose class names are the same as the name of the simple datatype of which the list datatype is composed. Each child node corresponds to one member of the list property value, and the child nodes appear in the same order as the members to which they correspond. The value of each child node's "value" property is identical to the source node's corresponding list property value member.
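The construction rules above can be illustrated with a minimal Python sketch. It is not part of the standard: the names `VTNNode`, `build_vtn`, and `LIST_MEMBER_TYPE` are hypothetical, the source node is represented by an arbitrary Python object, and for the "enum" class the caller is assumed to pass the component name of the enumerated value already. Child nodes are created without a "source property" value, following the `<when>` conditions in the property set; what their intrinsic "source" should be is not stated in the text, so it is left unset here.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical lookup: list class name -> class name of its members,
# taken from the ac= values in the VTN property set.
LIST_MEMBER_TYPE = {"strlist": "string", "intlist": "integer", "cnmlist": "compname"}


@dataclass
class VTNNode:
    cls: str                       # VTN class name, e.g. "integer" or "strlist"
    value: Any                     # simple value, or a list of child VTNNode objects
    source: Optional[Any] = None   # intrinsic "source": node exhibiting the value
    srcprop: Optional[str] = None  # "source property": component name of the property


def build_vtn(source_node: Any, prop_name: str, datatype: str, value: Any) -> VTNNode:
    """Build the VTN grove root for one non-nodal property value."""
    if datatype in LIST_MEMBER_TYPE:
        member_cls = LIST_MEMBER_TYPE[datatype]
        # One child per list member, in the same order as the members.
        children = [VTNNode(cls=member_cls, value=member) for member in value]
        return VTNNode(datatype, children, source_node, prop_name)
    # Simple datatype: the grove is a single node.
    return VTNNode(datatype, value, source_node, prop_name)
```

For example, `build_vtn(elem, "names", "strlist", ["a", "b"])` would yield a root of class "strlist" whose "value" property holds two nodes of class "string".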
A datatok grove construction process applies a lexical model with subordinate "match token models" to data in a grove to produce a datatok grove. The datatok process is defined by the following steps:
1. A grove root (a node of class "tokroot") is created. The value of its "source" property is set to the list of nodes that is the grove source.
2. A list of character strings is created by extracting the data from each of the source nodes.
3. If a "source concatenation separator" is specified, the strings are concatenated together, separated by the specified separator, resulting in a list containing a single string.
4. The lexical model is applied to each string in the list. For each string, a list of the sub-strings that satisfy a match token model within the lexical model is created. The result is a list of sub-lists of these strings. Each sub-string retains any lexicographic reordering performed during the application of the lexical model.
5. If a "token concatenation separator" is specified, the strings of each sub-list are concatenated together, separated by the specified separator, resulting in a list of sub-lists, each containing a single string.
6. Each sub-list of strings is replaced by its contained strings, resulting in a single list of strings.
7. If a "result concatenation separator" is specified, the strings are concatenated together, separated by the specified separator, resulting in a list containing a single string.
8. A list of "tokenstr" nodes is created where each node corresponds to one of the strings in the list and the nodes occur in the order of their corresponding strings. The value of the "string" property of each node is set to the node's corresponding string, and the value of the "source" property of each node is set to the list of nodes in the grove source containing all or part of the node's corresponding string.
9. The value of the "strings" property of the grove root is set to the list of tokenstr nodes.
The following example shows the result of applying a "word" tokenizer with different concatenation options to the data contained by the three sibling nodes resulting from this document fragment (when using the default HyTime grove plan):
<state>Oregon</state> <animal>river otters</animal> are cute
The list of strings resulting from step 2 is: ("Oregon" " " "river otters" " are cute"). (Note that because this example is mixed content, the second string in the list is the space between the State end-tag and the Animal start-tag.)
The results when performing concatenation at different points in the process are shown below (parentheses indicate lists, square brackets indicate result tokenstr nodes):
No concatenation:
Step 4 result: (("Oregon") () ("river" "otters") ("are" "cute"))
Step 6 result: ("Oregon" "river" "otters" "are" "cute")
Step 8 Tokenstr nodes: ["Oregon"] ["river"] ["otters"] ["are"] ["cute"]
Source concatenation, separation character="X":
Step 3 result: ("OregonX Xriver ottersX are cute")
Step 4 result: (("OregonX" "Xriver" "ottersX" "are" "cute"))
Step 6 result: ("OregonX" "Xriver" "ottersX" "are" "cute")
Step 8 Tokenstr nodes: ["OregonX"] ["Xriver"] ["ottersX"] ["are"] ["cute"]
Token concatenation, separation character="X":
Step 4 result: (("Oregon") () ("river" "otters") ("are" "cute"))
Step 5 result: (("Oregon") () ("riverXotters") ("areXcute"))
Step 6 result: ("Oregon" "riverXotters" "areXcute")
Step 8 Tokenstr nodes: ["Oregon"] ["riverXotters"] ["areXcute"]
Result concatenation, separation character="X":
Step 4 result: (("Oregon") () ("river" "otters") ("are" "cute"))
Step 6 result: ("Oregon" "river" "otters" "are" "cute")
Step 7 result: ("OregonXriverXottersXareXcute")
Step 8 Tokenstr nodes: ["OregonXriverXottersXareXcute"]
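The string-handling steps of the process (steps 2 through 8) and the example above can be sketched in Python as follows. This is an illustration under stated assumptions, not a definition from the standard: the function name `datatok` and its parameters are hypothetical, the "word" tokenizer is assumed to match runs of non-whitespace characters, and the sketch returns only the token strings, omitting the grove root and the "source" bookkeeping of steps 1, 8, and 9.

```python
import re
from typing import List, Optional

# Assumed "word" lexical model: a token is a run of non-whitespace characters.
WORD = re.compile(r"\S+")


def datatok(sources: List[str],
            catsrc: Optional[str] = None,
            cattoken: Optional[str] = None,
            catres: Optional[str] = None) -> List[str]:
    """Return the strings that would become tokenstr nodes."""
    strings = list(sources)                        # step 2: data of each source node
    if catsrc is not None:                         # step 3: source concatenation
        strings = [catsrc.join(strings)]
    sublists = [WORD.findall(s) for s in strings]  # step 4: one sub-list per string
    if cattoken is not None:                       # step 5: token concatenation
        # An empty sub-list stays empty, as in the example above.
        sublists = [[cattoken.join(sub)] if sub else [] for sub in sublists]
    tokens = [t for sub in sublists for t in sub]  # step 6: flatten the sub-lists
    if catres is not None:                         # step 7: result concatenation
        tokens = [catres.join(tokens)]
    return tokens                                  # step 8: one tokenstr per string


data = ["Oregon", " ", "river otters", " are cute"]
print(datatok(data))                # no concatenation
print(datatok(data, cattoken="X"))  # token concatenation
```

Passing `catsrc="X"` or `catres="X"` instead reproduces the other two cases of the example.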
The datatok property set is:
<!-- Data Tokenizer Property Set -->
<!DOCTYPE propset
PUBLIC "ISO/IEC 10744:1997//DTD Property Set//EN"
[
<!NOTATION datatok
PUBLIC "ISO/IEC 10744:1997//NOTATION
Data Tokenizer Grove Construction Process//EN"
>
]>
<propset gcsd=datatok>
<classdef rcsnm=tokroot appnm="tokenized root" conprop=strings>
<desc>
The grove root resulting from a data tokenizer process.
<propdef rcsnm=strings nodelist subnode ac=tokenstr>
<desc>
A node list of token strings
<classdef rcsnm=tokenstr appnm="tokenized string" conprop=string>
<desc>
A string of one or more tokens.
<propdef rcsnm=string string>
Data tokenizer grove construction processes are declared as data content notations derived from the data tokenizer (datatok) notation form.
The lexical model used by a data tokenizer process may be inherent to the process (for example, a process that performs lexical analysis of a natural language), or may be specified separately as data in a notation whose definition is included in the definition of the data tokenizer process itself. The lexical model is specified as the content of elements conforming to data tokenizer notations.
NOTE 465 The NotNames attribute of the data attributes for elements facility of the General Architecture may be used to map notation content to an element attribute (see A.5.3 Data Attributes for Elements (DAFE)).
The attribute concatenate source (catsrc) specifies that source concatenation be performed using the specified separator string.
The attribute concatenate tokens (cattoken) specifies that token concatenation be performed using the specified separator string.
The attribute concatenate result (catres) specifies that result concatenation be performed using the specified separator string.
The attribute hit boundary constraint (boundary) specifies the boundary constraints, within the match domain, of a hit that satisfies the model, as follows:
sodeod: The hit must extend from the start of domain to the end.
sodiec: The hit must begin at the start of domain but need not extend to the end of domain.
isceod: The hit must end at the end of domain but need not begin at the start of domain.
isciec: The hit can occur anywhere in the match domain.
inmodel: Boundary requirements are expressed within the model.
NOTE 466 The boundary constraints correspond to the "^" (caret) and "$" (dollar) operators in POSIX regular expressions. E.g., "sodeod" corresponds to a regular expression that starts with caret and ends with dollar.
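The correspondence drawn in the note can be sketched with Python's `re` module. This is only an illustration: the helper `anchored` and the table `ANCHORS` are hypothetical names, the lexical model is assumed to be expressible as a regular expression, and the five attribute values are assumed to map, in the order declared for the boundary attribute, to the five descriptions above. Python's `\A` and `\Z` are used as strict counterparts of caret and dollar, since `$` in Python also matches before a trailing newline.

```python
import re

# Assumed mapping from boundary value to (start anchor, end anchor).
ANCHORS = {
    "sodeod": (r"\A", r"\Z"),  # start of domain to end of domain
    "sodiec": (r"\A", ""),     # start of domain, ending context ignored
    "isceod": ("", r"\Z"),     # starting context ignored, end of domain
    "isciec": ("", ""),        # hit may occur anywhere (the declared default)
}


def anchored(model: str, boundary: str = "isciec") -> re.Pattern:
    """Compile a model with the anchors implied by a boundary value."""
    if boundary == "inmodel":
        # The model carries its own boundary requirements.
        return re.compile(model)
    start, end = ANCHORS[boundary]
    # Group the model so the anchors apply to the whole expression.
    return re.compile(f"{start}(?:{model}){end}")
```

With this sketch, `anchored("ab+", "sodiec").search("abbx")` finds a hit, while `anchored("ab+", "sodiec").search("xabb")` does not.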
The attribute maximum token size (maxtoksz) specifies the maximum number of characters that can satisfy a single match token group. It is an error if a match token group for which matching is attempted cannot be matched within the maxtoksz limit.
<!-- Data Tokenizer Grove Construction Process -->
<![ %datatok; [
<!notation
datatok -- Data tokenizer grove construction process --
-- Clause: A.4.4.2.2 --
PUBLIC "ISO/IEC 10744:1997//NOTATION
Data Tokenizer Grove Construction Process//EN"
-- Attributes [locs]: datatok --
-- CommonAttributes [GenArc]: altreps, included, superdcn --
-- CommonAttributes [base]: bosdatt --
-- CommonAttributes [locs]: egrvplan --
>
<!attlist #NOTATION
datatok -- Data tokenizer grove construction process --
-- Clause: A.4.4.2.2 --
catsrc -- Source concatenation separator --
CDATA
#IMPLIED -- Default: no source concatenation --
cattoken -- Token concatenation separator --
CDATA
#IMPLIED -- Default: no token concatenation --
catres -- Result concatenation separator --
CDATA
#IMPLIED -- Default: no result concatenation --
boundary -- Hit boundary constraint --
(sodeod|sodiec|isceod|isciec|inmodel)
isciec
maxtoksz -- Maximum token size --
NUMBER
#IMPLIED -- Default: system defined --
>
]]><!-- datatok -->
The plain text grove construction process builds groves from data about which all that is known is that it is composed of characters.
A plain text grove consists of a grove root of class plain text document (document), whose text property has as its value a list of nodes of class data character (datachar). One datachar node is created for each character recognized in the source data, and the value of each datachar node's character (char) property is set to the corresponding recognized character. The order of the datachar nodes is determined by the order of their corresponding characters in the source data.
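A minimal Python sketch of this construction is given below. The class and function names are hypothetical, and the sketch assumes that each code point of a Python string counts as one recognized character; how characters are actually recognized in the source data is outside what this text specifies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataChar:
    char: str  # the "char" property: one recognized character


@dataclass
class Document:
    text: List[DataChar]  # the "text" property: datachar nodes in source order


def build_plaintext_grove(data: str) -> Document:
    """Create one datachar node per character, preserving source order."""
    return Document(text=[DataChar(c) for c in data])


grove = build_plaintext_grove("otter")
print(len(grove.text))
```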
The plaintxt property set is:
<!-- Plain Text Property Set -->
<!DOCTYPE propset
PUBLIC "ISO/IEC 10744:1997//DTD Property Set//EN"
[
<!NOTATION plaintxt
PUBLIC "ISO/IEC 10744:1997//NOTATION Plain Text//EN"
>
]>
<propset nsd=plaintxt>
<classdef rcsnm=document fullnm="plain text document" conprop=text>
<propdef rcsnm=text nodelist subnode ac="datachar">
<classdef rcsnm=datachar appnm="data char" fullnm="data character" conprop=char>
<propdef rcsnm=char fullnm="character" char>
HTML generated from the original SGML source using a DSSSL style specification and the SGML output back-end of the JADE DSSSL engine.