module Pxp_types: sig .. end
This module defines and exports all the types listed in Pxp_core_types_type.CORE_TYPES:
type ext_id
type private_id
val allocate_private_id
type resolver_id
val resolver_id_of_ext_id
type dtd_id
type content_model_type
type mixed_spec
type regexp_spec
type att_type
type att_default
type att_value
class type collect_warnings
class drop_warnings
class type symbolic_warnings
type warning
type encoding
type rep_encoding
exception Validation_error
exception WF_error
exception Namespace_error
exception Error
exception Character_not_supported
exception At
exception Undeclared
exception Method_not_applicable
exception Namespace_method_not_applicable
val string_of_exn
type output_stream
val write
type pool
val make_probabilistic_pool
val pool_string
See the file pxp_core_types_type.mli for the exact definitions of
these types/values.
include Pxp_core_types_type.CORE_TYPES
type config = {
   warner :
     (* An object that collects warnings. *)
   swarner :
     (* Another object getting warnings expressed as polymorphic
        variants. This is especially useful to turn warnings into
        errors. If defined, the swarner gets the warning
        first before it is sent to the classic warner. *)
   enable_pinstr_nodes :
   enable_super_root_node :
   enable_comment_nodes :
     (* When enabled, comments are represented as nodes with type =
        T_comment.
        To access the contents of comments, use the method "comment"
        for the comment nodes.
        These nodes behave like elements; however, they are normally
        empty and do not have attributes. Note that it is possible to
        add children to comment nodes and to set attributes, but it is
        strongly recommended not to do so. There are no checks on
        such abnormal use, because they would cost too
        much time, even when no comment nodes are generated at all.

        Comment nodes should be disabled unless you must parse a
        third-party XML text which uses comments as another data
        container.

        The nodes of type T_comment are created from the comment
        exemplars in your spec.

        Event-based parser: This flag controls whether E_comment events
        are generated. *)
   drop_ignorable_whitespace :
   encoding :
     (* Specifies the encoding used for the *internal* representation
        of any character data.
        Note that the default is still Enc_iso88591. *)
   recognize_standalone_declaration :
     (* Whether the "standalone" declaration is recognized or not.
        This option does not have an effect on well-formedness parsing:
        in this case such declarations are never recognized.

        Recognizing the "standalone" declaration means that the value
        of the declaration is scanned and passed to the DTD, and that
        the "standalone-check" is performed.

        Standalone-check: If a document is flagged standalone='yes',
        some additional constraints apply. The idea is that a parser
        without access to any external document subsets can still parse
        the document, and will still return the same values as the
        parser with such access. For example, if the DTD is external
        and if there are attributes with default values, it is checked
        that there is no element instance where these attributes are
        omitted - the parser would return the default value but this
        requires access to the external DTD subset.

        Event-based parser: The option has an effect if the
        `Parse_xml_decl entry flag is set. In this case, it is passed
        to the DTD whether there is a standalone declaration, ... and
        the rest is unclear. *)
   store_element_positions :
     (* Whether the file name, the line and the column of the
        beginning of elements are stored in the element nodes.
        This option may be useful to generate error messages.

        Positions are only stored for:

        Event-based parser: If true, the E_position events will be
        generated. *)
   idref_pass :
     (* Whether the parser does a second pass and checks that all
        IDREF and IDREFS attributes contain valid references.
        This option works only if an ID index is available. To create
        an ID index, pass an index object as id_index argument to the
        parsing functions (such as parse_document_entity; see below).

        "Second pass" does not mean that the XML text is again parsed;
        only the existing document tree is traversed, and the check on
        bad IDREF/IDREFS attributes is performed for every node.

        Event-based parser: this option is ignored. *)
   validate_by_dfa :
     (* If true, and if DFAs are available for validation, the DFAs will
        actually be used for validation.
        If false, or if no DFAs are available, the standard backtracking
        algorithm will be used.
        DFA = deterministic finite automaton.

        DFAs are only available if accept_only_deterministic_models is
        "true" (because in this case, it is relatively cheap to
        construct the DFAs). DFAs are a data structure which ensures
        that validation can always be performed in linear time.

        I strongly recommend using DFAs; however, there are examples
        for which validation by backtracking is faster.

        Event-based parser: this option is ignored. *)
   accept_only_deterministic_models :
     (* Whether only deterministic content models are accepted in DTDs.
        Event-based parser: this option is ignored. *)
   disable_content_validation :
     (* When set to 'true', content validation is disabled; however,
        other validation checks remain activated.
        This option is intended to save time when a validated document
        is parsed and it can be assumed that it is valid.

        Do not forget to set accept_only_deterministic_models to false
        to save maximum time (or DFAs will be computed which is rather
        expensive).

        Event-based parser: this option is ignored. *)
   name_pool :
   enable_name_pool_for_element_types :
   enable_name_pool_for_attribute_names :
   enable_name_pool_for_attribute_values :
     (* enable_name_pool_for_notation_names : bool; *)
   enable_name_pool_for_pinstr_targets :
     (* The name pool maps strings to pool strings such that strings with
        the same value share the same block of memory.
        Enabling the name pool saves memory, but makes the parser
        slower.

        Event-based parser: As far as I remember, some of the pool
        options are honoured, but not all. *)
   enable_namespace_processing :
     (* Setting this option to a namespace_manager enables namespace
        processing. This works only if the namespace-aware implementation
        namespace_element_impl of element nodes is used in the spec;
        otherwise you will get error messages complaining about missing
        methods.

        Note that PXP uses a technique called "prefix normalization" to
        implement namespaces on top of the plain document model. This
        means that the namespace prefixes of elements and attributes
        are changed to unique prefixes if they are ambiguous, and that
        these "normprefixes" are actually stored in the document tree.
        Furthermore, the normprefixes are used for validation.

        Every normprefix corresponds uniquely to a namespace URI, and
        this mapping is controlled by the namespace_manager. It is
        possible to fill the namespace_manager before parsing starts
        such that the programmer knows which normprefix is used for
        which namespace URI. Example:

          let mng = new namespace_manager in
          mng # add_namespace "html" "http://www.w3.org/1999/xhtml";
          ...

        This forces elements with the mentioned URI to be rewritten to
        a form using the normprefix "html". For instance, "html:table"
        always refers to the HTML table construct, independently of the
        prefix used in the parsed XML text.

        By default, namespace processing is turned off.

        Event-based parser: If true, the events E_ns_start_tag and
        E_ns_end_tag are generated instead of E_start_tag and
        E_end_tag, respectively. *)
   escape_contents :
   escape_attributes :
   debugging_mode :
}
val default_config : config
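A common way to obtain a config is to start from default_config and override only the fields of interest with OCaml's record-update syntax. The fragment below is a minimal sketch, assuming the PXP library is installed and linked; the particular choice of fields and values is only an illustration, and `Enc_utf8 is assumed to be one of the rep_encoding variants.

```ocaml
(* Sketch only: requires the PXP library. The selection of fields is
   illustrative, not a recommended configuration. *)
let my_config : Pxp_types.config =
  { Pxp_types.default_config with
    (* Internal representation of character data; `Enc_utf8 is assumed
       to be a valid rep_encoding value. *)
    Pxp_types.encoding = `Enc_utf8;
    (* Keep the file/line/column of elements for error messages. *)
    store_element_positions = true;
    (* Only effective when an ID index is passed to the parsing function. *)
    idref_pass = true;
  }
```

All fields not mentioned keep the values they have in default_config.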
val default_namespace_config : config
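The name_pool fields of config rely on the idea that equal strings can share one block of memory. The following is a self-contained sketch of that idea using a plain hash table; it is not PXP's implementation (the pool created by make_probabilistic_pool may behave differently), and the names intern_table, make_intern_table and intern are hypothetical.

```ocaml
(* A tiny string-interning table: equal strings map to one shared value.
   This only illustrates the principle behind a name pool. *)
type intern_table = (string, string) Hashtbl.t

let make_intern_table () : intern_table = Hashtbl.create 64

(* Return the shared copy of [s], registering [s] itself on first use. *)
let intern (tbl : intern_table) (s : string) : string =
  match Hashtbl.find_opt tbl s with
  | Some shared -> shared
  | None ->
      Hashtbl.add tbl s s;
      s

let () =
  let tbl = make_intern_table () in
  (* Two separately allocated strings with the same contents. *)
  let a = intern tbl (String.concat "" [ "ta"; "ble" ]) in
  let b = intern tbl (String.concat "" [ "tab"; "le" ]) in
  (* Structurally equal and, after interning, physically the same block. *)
  assert (a = b);
  assert (a == b)
```

The extra table lookup on every name is the kind of cost behind the remark that enabling the name pool saves memory but makes the parser slower.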
The type source is often not used directly; instead, sources are constructed
with the help of the functions from_channel, from_obj_channel,
from_file, and from_string (see below).
The type source is an abstraction on top of resolver (defined in
module Pxp_reader). The resolver is a configurable object that knows
how to access files; furthermore, it knows a lot about the character
encoding of the files. See Pxp_reader for details.
A source is a resolver that is applied to a certain ID that should
be initially opened.
type source = Pxp_dtd.source =
  | Entity of
  | ExtID of
  | XExtID of
The Entity form of source is intended to implement customized versions
of the entity classes. Use it only if there is a strong need to do so.
For ExtID and XExtID, xid is opened by using the resolver r.
For XExtID, sys_base is the base URI to assume if xid is a relative URI
(i.e. a SYSTEM ID).
val from_channel : ?alt:Pxp_reader.resolver list ->
?system_id:string ->
?fixenc:encoding ->
?id:ext_id ->
?system_encoding:encoding -> Pervasives.in_channel -> source
Creates a source that reads the XML text from the passed in_channel.
By default, this source is not able to read
XML text from any other location (you cannot read from files etc.).
The optional arguments allow you to modify this behaviour.
Keep the following in mind:
~alt: A list of further resolvers. For example, pass
new Pxp_reader.resolve_as_file() to enable resolving of
file names found in SYSTEM IDs.
~system_id: By default, the XML text found in the in_channel does not
have any ID (to be exact, the in_channel has a private ID, but
this is hidden). Because of this, it is not possible to open
a second file by using a relative SYSTEM ID. The parameter ~system_id
assigns the channel a SYSTEM ID that is only used to resolve
further relative SYSTEM IDs.
This parameter must be encoded as a UTF-8 string.
~fixenc: By default, the character encoding of the XML text is
determined by looking at the XML declaration. Setting ~fixenc
forces a certain character encoding. Useful if you can assume
that the XML text has been recoded by the transmission media.
THE FOLLOWING OPTIONS ARE DEPRECATED:
~id: This parameter assigns the channel an arbitrary ID (like ~system_id,
but PUBLIC, anonymous, and private IDs are also possible - although
not reasonable). Furthermore, setting ~id also enables resolving
of file names.
~id has higher precedence than ~system_id.
~system_encoding: (Only useful together with ~id.) The character encoding
used for file names. (UTF-8 by default.)
val from_obj_channel : ?alt:Pxp_reader.resolver list ->
?system_id:string ->
?fixenc:encoding ->
?id:ext_id ->
?system_encoding:encoding -> Netchannels.in_obj_channel -> source
Like from_channel, but reads from a netchannel instead.
val from_string : ?alt:Pxp_reader.resolver list ->
?system_id:string -> ?fixenc:encoding -> string -> source
Like from_channel, but reads from a string.
Of course, it is possible to parse this source several times, unlike
the channel-based sources.
val from_file : ?alt:Pxp_reader.resolver list ->
?system_encoding:encoding -> ?enc:encoding -> string -> source
This source can open further files by default, and relative URLs work.
~alt: A list of further resolvers, especially useful to open
non-SYSTEM IDs, and non-file entities.
~system_encoding: The character encoding the system uses to represent
filenames. By default, UTF-8 is assumed.
~enc: The character encoding of the string argument. As mentioned, this
is UTF-8 by default.
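The constructor functions above can be combined with the optional arguments as in the following sketch. It assumes the PXP library is installed and linked; the XML text, the SYSTEM ID "file:///tmp/inline.xml" and the file name "sample.xml" are hypothetical values chosen only for illustration.

```ocaml
(* Sketch only: requires the PXP library. *)

(* A source reading from a string. ~system_id gives the text a SYSTEM ID
   so that relative SYSTEM IDs can be resolved against it, and ~alt adds
   a resolver so that file names found in SYSTEM IDs can be opened. *)
let inline_source () : Pxp_types.source =
  Pxp_types.from_string
    ~alt:[ new Pxp_reader.resolve_as_file () ]
    ~system_id:"file:///tmp/inline.xml"
    "<?xml version='1.0'?><doc>hello</doc>"

(* A source reading from a file; by default it can open further files,
   and relative URLs work. *)
let file_source () : Pxp_types.source =
  Pxp_types.from_file "sample.xml"
```

Both values are wrapped in functions so that nothing is constructed until a source is actually needed.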
val open_source : config ->
source ->
bool -> Pxp_dtd.dtd -> Pxp_reader.resolver * Pxp_entity.entity
type entry =
[ `Entry_content of [ `Dummy ] list
| `Entry_declarations of [ `Extend_dtd_fully | `Val_mode_dtd ] list
| `Entry_document of
[ `Extend_dtd_fully | `Parse_xml_decl | `Val_mode_dtd ] list
| `Entry_expr of [ `Dummy ] list ]
process_entity
:
type event =
  | E_start_doc of
  | E_end_doc of
  | E_start_tag of
  | E_end_tag of
  | E_char_data of
  | E_pinstr of
  | E_pinstr_member of
  | E_comment of
  | E_start_super
  | E_end_super
  | E_position of
  | E_error of
  | E_end_of_stream
  (* may be extended in the future *)
E_start_tag (name, attlist, scope_opt, entid): <name attlist> scope_opt is None in non-namespace mode, and the namespace scope object in namespace mode.
E_end_tag (name, entid): </name>
E_char_data data: The parser usually generates several E_char_data events for a longer section of character data.
E_pinstr (target,value): <?target value?> as node
E_pinstr_member (target,value): <?target value?> as member of the parent element (add_pinstr)
E_comment value: <!--value-->
E_start_super, E_end_super: Indicates where the "super root node" is. Only generated when enable_super_root_node is on.
E_position(entity,line,col): these events are only created if the next event will be E_start_tag, E_pinstr, or E_comment, and if the configuration option store_element_positions is true.
E_end_of_stream: this last event indicates that the parser has terminated without error
E_error(exn): this last event indicates that the parser has terminated with error
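A consumer of such events is typically a pattern match over the stream. The sketch below is self-contained and therefore uses a simplified local stand-in for the event type, not Pxp_types.event itself: only a few constructors are modelled, their payload shapes follow the descriptions above, and scope_opt and entid are replaced by the placeholder types unit option and unit. The names summary and summarize are hypothetical.

```ocaml
(* Simplified local stand-in for the event type (assumption: the real
   type carries more constructors and richer payloads). *)
type event =
  | E_start_tag of string * (string * string) list * unit option * unit
  | E_end_tag of string * unit
  | E_char_data of string
  | E_comment of string
  | E_error of exn
  | E_end_of_stream

type summary = { max_depth : int; text : string }

(* Walk the events once, tracking element nesting depth and joining the
   (possibly several) E_char_data chunks into one string. *)
let summarize (events : event list) : (summary, exn) result =
  let buf = Buffer.create 64 in
  let rec loop depth deepest = function
    | [] | E_end_of_stream :: _ ->
        Ok { max_depth = deepest; text = Buffer.contents buf }
    | E_error e :: _ -> Error e
    | E_start_tag _ :: rest ->
        let depth = depth + 1 in
        loop depth (max depth deepest) rest
    | E_end_tag _ :: rest -> loop (depth - 1) deepest rest
    | E_char_data s :: rest ->
        Buffer.add_string buf s;
        loop depth deepest rest
    | E_comment _ :: rest -> loop depth deepest rest
  in
  loop 0 0 events

let () =
  let events =
    [ E_start_tag ("a", [], None, ());
      E_char_data "hel";
      E_start_tag ("b", [ ("x", "1") ], None, ());
      E_end_tag ("b", ());
      E_char_data "lo";
      E_end_tag ("a", ());
      E_end_of_stream ]
  in
  match summarize events with
  | Ok s -> Printf.printf "depth=%d text=%s\n" s.max_depth s.text
  | Error e -> prerr_endline (Printexc.to_string e)
```

Because E_end_of_stream and E_error are both final events, the loop stops at whichever comes first and reports either the summary or the exception.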
Note Pxp_lexer_types.entity_id: currently, this is just < >, i.e. the class type without properties. It is planned, however, that one can at least query the base URI of the entity. The best way of dealing with this parameter for now: