XMLParser Configuration

Published: January 15, 2024

Overview

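The oxml package in python-docx loads, manipulates, and serializes ElementTree elements. For example, every XML document inside a docx package can be loaded as an ElementTree. To build an ElementTree from an XML file in the package, oxml relies on XMLParser, which is defined in the lxml.etree module.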

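Two points matter when configuring XMLParser: 1. follow the namespace conventions of Office Open XML; 2. register the custom element classes defined in oxml under the corresponding Office Open XML namespaces. Only then does XMLParser produce element objects of the expected custom classes when it creates Element nodes.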

XML Namespace¹

XML namespaces exist to resolve naming collisions between elements. In XML, a namespace is identified by a URI:

Definition: An XML namespace is identified by a URI reference

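A namespace and an element's local name together form a qualified name, which should be unique within its namespace. A fully qualified name is generally written in the form {namespace_uri}local_name: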

Definition: A qualified name is a name subject to namespace interpretation.

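Qualified names solve the collision problem, but namespace URIs tend to be long, so in practice a namespace prefix stands in for the URI once the namespace has been declared. The declaration is usually made with the special attribute xmlns, using the syntax xmlns:prefix="namespace_uri". After that, an element name can be written as namespace_prefix:tag. Note that a declaration made this way is in scope for the declaring element and all of its descendants.²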
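To make the relation between prefixed names and qualified names concrete, here is a minimal sketch. It uses the standard-library xml.etree.ElementTree rather than lxml only to stay dependency-free; lxml.etree uses the same {namespace_uri}local_name notation. The sample document and the W_NS constant are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The prefix "w" is declared once, on the root element.
xml = f'<w:document xmlns:w="{W_NS}"><w:body><w:p/></w:body></w:document>'
root = ET.fromstring(xml)

# The parser expands each prefixed name to {namespace_uri}local_name.
print(root.tag)

# The declaration on the root is in scope for its descendants, so w:p resolves too.
paragraph = root.find("w:body/w:p", {"w": W_NS})
print(paragraph.tag)
```

The prefix itself carries no meaning; only the URI it is bound to identifies the namespace.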

lxml.etree.XMLParser

XMLParser parses XML text: given an XML string, it builds the element tree and returns the root element.

# The code below is taken from the docx.oxml.parser module
from typing import cast

from lxml import etree

oxml_parser = etree.XMLParser(remove_blank_text=True, resolve_entities=False)

def parse_xml(xml: str) -> "BaseOxmlElement":
    """Root lxml element obtained by parsing XML character string `xml`.

    The custom parser is used, so custom element classes are produced for elements in
    `xml` that have them.
    """
    return cast("BaseOxmlElement", etree.fromstring(xml, oxml_parser))

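In the parser constructor, remove_blank_text=True discards whitespace-only text nodes between elements, and resolve_entities=False tells the parser not to replace entity references with their text values, so they are left unexpanded. The cast call in parse_xml does not convert anything at runtime; it only tells static type checkers to treat the element returned by etree as a BaseOxmlElement. Two XMLParser methods matter for what follows: set_element_class_lookup and makeelement.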
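As a quick check of the remove_blank_text option, the sketch below parses the same string with lxml's default parser and with a parser that strips blank text. The sample XML is an illustrative assumption and is not taken from python-docx.

```python
from lxml import etree

xml = "<root>\n  <a>1</a>\n  <b/>\n</root>"

# Default parser: whitespace-only text between tags is kept.
default_root = etree.fromstring(xml)

# Parser configured like oxml_parser with respect to blank text.
stripped_root = etree.fromstring(xml, etree.XMLParser(remove_blank_text=True))

print(etree.tostring(default_root))   # indentation and newlines are preserved
print(etree.tostring(stripped_root))  # b'<root><a>1</a><b/></root>'
```

Dropping these blank nodes keeps them from appearing as text or tail values when the tree is later modified and serialized.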

# Run `help(etree.XMLParser)` in a Python console
def set_element_class_lookup(self, lookup=None):
    """Set a lookup scheme for element classes generated from this parser."""
    ...

def makeelement(self, _tag, attrib=None, nsmap=None, **_extra):
    """Creates a new element associated with this parser."""
    ...

set_element_class_lookup sets the scheme the parser uses to look up element classes. makeelement creates an element bound to the parser from a tag, attributes, and a namespace map.

# The code below is taken from the docx.oxml.parser module
element_class_lookup = etree.ElementNamespaceClassLookup()
oxml_parser.set_element_class_lookup(element_class_lookup)

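The first statement instantiates an element class lookup scheme; the second installs it on the parser in place of the default lookup. If nothing more is done, oxml_parser still returns plain lxml elements (etree._Element) when it parses an XML string, because no custom class has been registered yet.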
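The fallback to plain elements can be observed with lxml alone. This is a minimal sketch that mirrors the parser setup above; the names lookup, parser and W_NS are local to the example.

```python
from lxml import etree

# A namespace-based lookup with nothing registered in it yet.
lookup = etree.ElementNamespaceClassLookup()
parser = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
parser.set_element_class_lookup(lookup)

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
root = etree.fromstring(f'<w:p xmlns:w="{W_NS}"/>', parser)

# No class is registered for {W_NS}p, so lxml falls back to its default class.
print(type(root) is etree._Element)  # True
```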

[Important] Registering Namespaces and Element Classes

First, look at how the source registers namespaces and element classes:

from typing import Type

from docx.oxml.ns import NamespacePrefixedTag, nsmap

def register_element_cls(tag: str, cls: Type["BaseOxmlElement"]):
    """Register an lxml custom element-class to use for `tag`.

    An instance of `cls` is constructed when the oxml parser encounters an element
    with matching `tag`. `tag` is a string of the form `nspfx:tagroot`, e.g.
    `'w:document'`.
    """
    nspfx, tagroot = tag.split(":")
    namespace = element_class_lookup.get_namespace(nsmap[nspfx])
    namespace[tagroot] = cls

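The tag argument of register_element_cls has two parts separated by a colon: the namespace prefix and the element's local tag name (tagroot). In nsmap, each prefix maps to exactly one namespace URI.
The second statement, get_namespace, fetches or creates the registry for a namespace: if the namespace already exists it is returned as is, otherwise it is created and then returned. nsmap is a dict; the common prefix "w" maps to the namespace 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'.
The third statement registers the element class in the returned namespace, for example CT_P under 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'. When the parser then encounters '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p', it creates a CT_P element.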

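# Assumes oxml_parser, register_element_cls and nsmap from the code above are in scope.
# With no class registered for w:p, makeelement returns a plain lxml element.
oxml_parser.makeelement(r"{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p", nsmap={"w": nsmap["w"]})

from docx.oxml.text.paragraph import CT_P
register_element_cls("w:p", CT_P)
# Once CT_P is registered for w:p, the same call returns a CT_P instance.
# (python-docx performs this registration itself when docx.oxml is imported.)
oxml_parser.makeelement(r"{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p", nsmap={"w": nsmap["w"]})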
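The same before-and-after behavior can be reproduced without python-docx. The sketch below is only an illustration under stated assumptions: SketchParagraph and its run_count property are hypothetical stand-ins for CT_P and are not part of python-docx.

```python
from lxml import etree

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


class SketchParagraph(etree.ElementBase):
    """Hypothetical stand-in for CT_P; not part of python-docx."""

    @property
    def run_count(self):
        # Count the direct <w:r> children of this paragraph.
        return len(self.findall(f"{{{W_NS}}}r"))


lookup = etree.ElementNamespaceClassLookup()
parser = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
parser.set_element_class_lookup(lookup)

# Nothing registered yet: a plain lxml element comes back.
before = parser.makeelement(f"{{{W_NS}}}p", nsmap={"w": W_NS})

# The same two steps as register_element_cls: fetch the namespace registry,
# then map the local tag name to the class.
lookup.get_namespace(W_NS)["p"] = SketchParagraph

after = parser.makeelement(f"{{{W_NS}}}p", nsmap={"w": W_NS})
parsed = etree.fromstring(f'<w:p xmlns:w="{W_NS}"><w:r/><w:r/></w:p>', parser)

print(type(before).__name__, type(after).__name__, parsed.run_count)
# _Element SketchParagraph 2
```

Both makeelement and fromstring go through the parser's lookup, which is why elements created either way end up as instances of the registered class.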

Creating an OxmlElement

With XMLParser configured, custom elements can be created through the parser. The source does it as follows:

from __future__ import annotations

from typing import Dict

from docx.oxml.ns import NamespacePrefixedTag, nsmap

def OxmlElement(
    nsptag_str: str,
    attrs: Dict[str, str] | None = None,
    nsdecls: Dict[str, str] | None = None,
) -> BaseOxmlElement:
    """Return a 'loose' lxml element having the tag specified by `nsptag_str`.

    The tag in `nsptag_str` must contain the standard namespace prefix, e.g. `a:tbl`.
    The resulting element is an instance of the custom element class for this tag name
    if one is defined. A dictionary of attribute values may be provided as `attrs`; they
    are set if present. All namespaces defined in the dict `nsdecls` are declared in the
    element using the key as the prefix and the value as the namespace name. If
    `nsdecls` is not provided, a single namespace declaration is added based on the
    prefix on `nsptag_str`.
    """
    nsptag = NamespacePrefixedTag(nsptag_str)
    if nsdecls is None:
        nsdecls = nsptag.nsmap
    return oxml_parser.makeelement(nsptag.clark_name, attrib=attrs, nsmap=nsdecls)

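The first statement builds a NamespacePrefixedTag from the prefixed tag string.
The second statement falls back to the tag's own namespace declaration when nsdecls is not given.
The third statement creates the element: the tag is the qualified name that corresponds to the prefixed tag, and nsmap carries the namespace declarations, which are written to the new element as xmlns declarations.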
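The translation from a prefixed tag to a qualified name and a default namespace declaration can be sketched in plain Python. This is not the NamespacePrefixedTag implementation: resolve_nsptag and the one-entry NSMAP are hypothetical names introduced here, and only the "w" URI comes from the text above.

```python
from typing import Dict, Tuple

# Small excerpt of a prefix-to-URI map; only the "w" entry is taken from the text.
NSMAP: Dict[str, str] = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
}


def resolve_nsptag(nsptag: str) -> Tuple[str, Dict[str, str]]:
    """Return (qualified name, single-entry nsmap) for a tag such as 'w:p'."""
    prefix, local_name = nsptag.split(":")
    uri = NSMAP[prefix]
    return f"{{{uri}}}{local_name}", {prefix: uri}


clark_name, nsdecls = resolve_nsptag("w:p")
print(clark_name)  # {http://schemas.openxmlformats.org/wordprocessingml/2006/main}p
print(nsdecls)     # {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
```

These two values play the roles of nsptag.clark_name and nsptag.nsmap in the makeelement call shown in OxmlElement.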

Source: https://blog.csdn.net/weixin_44815943/article/details/135555880