Type hinting bonanza (#1)

tobywf · Feb 19, 2020
commit d077d28 · 1 parent 1515954

Showing 10 changed files with 185 additions and 89 deletions.
diff --git a/README.md b/README.md
@@ -2,70 +2,15 @@
 
 [![License: MPL 2.0](https://img.shields.io/badge/License-MPL%202.0-brightgreen.svg)](https://opensource.org/licenses/MPL-2.0)
 
-This is a prototype of how a library might look like for (de)serialising XML into  Python dataclasses. XML dataclasses build on normal dataclasses from the standard library and [`lxml`](https://pypi.org/project/lxml/) elements. Loading and saving these elements is left to the consumer for flexibility of the desired output.
+This library enables (de)serialising XML into Python dataclasses. XML dataclasses build on normal dataclasses from the standard library and [`lxml`](https://pypi.org/project/lxml/) elements. Loading and saving these elements is left to the consumer for flexibility of the desired output.
 
-It isn't ready for production if you aren't willing to do your own evaluation/quality assurance. I don't recommend using this library with untrusted content. It inherits all of `lxml`'s flaws with regards to XML attacks, and recursively resolves data structures. Because deserialisation is driven from the dataclass definitions, it shouldn't be possible to execute arbitrary Python code. But denial of service attacks would very likely be feasible.
+It's currently in alpha. It isn't ready for production if you aren't willing to do your own evaluation/quality assurance. I don't recommend using this library with untrusted content. It inherits all of `lxml`'s flaws with regards to XML attacks, and recursively resolves data structures. Because deserialisation is driven from the dataclass definitions, it shouldn't be possible to execute arbitrary Python code (not a guarantee, see license). Denial of service attacks would very likely be feasible. One workaround may be to [use `lxml` to validate](https://lxml.de/validation.html) untrusted content with a strict schema.
 
 Requires Python 3.7 or higher.
 
-## Example
-
-(This is a simplified real world example - the container can also include optional `links` child elements.)
-
-```xml
-<?xml version="1.0"?>
-<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
-  <rootfiles>
-    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml" />
-  </rootfiles>
-</container>
-```
-
-```python
-from lxml import etree
-from typing import List
-from xml_dataclasses import xml_dataclass, rename, load, dump
-
-CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"
-
-@xml_dataclass
-class RootFile:
-    __ns__ = CONTAINER_NS
-    full_path: str = rename(name="full-path")
-    media_type: str = rename(name="media-type")
-
-
-@xml_dataclass
-class RootFiles:
-    __ns__ = CONTAINER_NS
-    rootfile: List[RootFile]
-
-
-@xml_dataclass
-class Container:
-    __ns__ = CONTAINER_NS
-    version: str
-    rootfiles: RootFiles
-    # WARNING: this is an incomplete implementation of an OPF container
-
-    def xml_validate(self):
-        if self.version != "1.0":
-            raise ValueError(f"Unknown container version '{self.version}'")
-
-
-if __name__ == "__main__":
-    nsmap = {None: CONTAINER_NS}
-    # see Gotchas, stripping whitespace is highly recommended
-    parser = etree.XMLParser(remove_blank_text=True)
-    lxml_el_in = etree.parse("container.xml", parser).getroot()
-    container = load(Container, lxml_el_in, "container")
-    lxml_el_out = dump(container, "container", nsmap)
-    print(etree.tostring(lxml_el_out, encoding="unicode", pretty_print=True))
-```
-
 ## Features
 
-* XML dataclasses are also dataclasses, and only require a single decorator
+* XML dataclasses are also dataclasses, and only require a single decorator to work (but see type hinting section for issues)
 * Convert XML documents to well-defined dataclasses, which should work with IDE auto-completion
 * Loading and dumping of attributes, child elements, and text content
 * Required and optional attributes and child elements
@@ -91,7 +36,7 @@ class Foo:
     existing_field: str = rename(field(...), name="existing-field")
 ```
 
-I would like to add support for validation in future, which might also make it easier to support other types. For now, you can work around this limitation with properties that do the conversion.
+For now, you can work around this limitation with properties that do the conversion, and perform post-load validation.
 
 ### Defining text
 
@@ -122,10 +67,10 @@ Children must ultimately be other XML dataclasses. However, they can also be `Op
 * Next, `List` should be defined (if multiple child elements are allowed). Valid: `List[Union[XmlDataclass1, XmlDataclass2]]`. Invalid: `Union[List[XmlDataclass1], XmlDataclass2]`
 * Finally, if `Optional` or `List` were used, a union type should be the inner-most (again, if needed)
 
-Children can be renamed via the `rename` function, however attempting to set a namespace is invalid, since the namespace is provided by the child type's XML dataclass. Also, unions of XML dataclasses must have the same namespace (you can use different fields if they have different namespaces).
-
 If a class has children, it cannot have text content.
 
+Children can be renamed via the `rename` function. However, attempting to set a namespace is invalid, since the namespace is provided by the child type's XML dataclass. Also, unions of XML dataclasses must have the same namespace (you can use different fields with renaming if they have different namespaces, since the XML names will be resolved as a combination of namespace and name).
+
 ### Defining post-load validation
 
 Simply implement an instance method called `xml_validate` with no parameters, and no return value (if you're using type hints):
@@ -137,8 +82,89 @@ def xml_validate(self) -> None:
 
 If defined, the `load` function will call it after all values have been loaded and assigned to the XML dataclass. You can validate the fields you want inside this method. Return values are ignored; instead raise and catch exceptions.
 
+## Example (fully type hinted)
+
+(This is a simplified real world example - the container can also include optional `links` child elements.)
+
+```xml
+<?xml version="1.0"?>
+<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
+  <rootfiles>
+    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml" />
+  </rootfiles>
+</container>
+```
+
+```python
+from dataclasses import dataclass
+from typing import List
+from lxml import etree  # type: ignore
+from xml_dataclasses import xml_dataclass, rename, load, dump, NsMap, XmlDataclass
+
+CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"
+
+
+@xml_dataclass
+@dataclass
+class RootFile:
+    __ns__ = CONTAINER_NS
+    full_path: str = rename(name="full-path")
+    media_type: str = rename(name="media-type")
+
+
+@xml_dataclass
+@dataclass
+class RootFiles:
+    __ns__ = CONTAINER_NS
+    rootfile: List[RootFile]
+
+
+# see Gotchas, this workaround is required for type hinting
+@xml_dataclass
+@dataclass
+class Container(XmlDataclass):
+    __ns__ = CONTAINER_NS
+    version: str
+    rootfiles: RootFiles
+    # WARNING: this is an incomplete implementation of an OPF container
+
+    def xml_validate(self) -> None:
+        if self.version != "1.0":
+            raise ValueError(f"Unknown container version '{self.version}'")
+
+
+if __name__ == "__main__":
+    nsmap: NsMap = {None: CONTAINER_NS}
+    # see Gotchas, stripping whitespace is highly recommended
+    parser = etree.XMLParser(remove_blank_text=True)
+    lxml_el_in = etree.parse("container.xml", parser).getroot()
+    container = load(Container, lxml_el_in, "container")
+    lxml_el_out = dump(container, "container", nsmap)
+    print(etree.tostring(lxml_el_out, encoding="unicode", pretty_print=True))
+```
+
 ## Gotchas
 
+### Type hinting
+
+This can be a real pain to get right. Unfortunately, if you need this, you may have to resort to:
+
+```python
+@xml_dataclass
+@dataclass
+class Child:
+    __ns__ = None
+    pass
+
+@xml_dataclass
+@dataclass
+class Parent(XmlDataclass):
+    __ns__ = None
+    children: Child
+```
+
+It's important that `@dataclass` be the *last* decorator, i.e. the closest to the class definition (and so the first to be applied). Luckily, only the root class you intend to pass to `load`/`dump` has to inherit from `XmlDataclass`, but all classes should have the `@dataclass` decorator applied.
+
 ### Whitespace
 
 If you are able to, it is strongly recommended you strip whitespace from the input via `lxml`:
@@ -151,7 +177,7 @@ By default, `lxml` preserves whitespace. This can cause a problem when checking
 
 ### Optional vs required
 
-On dataclasses, optional fields also usually have a default value to be useful. But this isn't required; `Optional` is just a type hint to say `None` is allowed.
+On dataclasses, optional fields also usually have a default value to be useful. But this isn't required; `Optional` is just a type hint to say `None` is allowed. This would occur e.g. if an element has no children.
 
 For XML dataclasses, on loading/deserialisation, whether or not a field is required is determined by if it has a `default`/`default_factory` defined. If so, and it's missing, that default is used. Otherwise, an error is raised.
 
@@ -163,8 +189,8 @@ This makes sense in many cases, but possibly not every case.
 
 Most of these limitations/assumptions are enforced. They may make this project unsuitable for your use-case.
 
-* It isn't possible to pass any parameters to the wrapped `@dataclass` decorator
-* Setting the `init` parameter of a dataclass' `field` will lead to bad things happening, this isn't supported
+* If you need to pass any parameters to the wrapped `@dataclass` decorator, apply it before the `@xml_dataclass` decorator
+* Setting the `init` parameter of a dataclass' `field` will lead to bad things happening, this isn't supported.
 * Deserialisation is strict; missing required attributes and child elements will cause an error. I want this to be the default behaviour, but it should be straightforward to add a parameter to `load` for lenient operation
 * Dataclasses must be written by hand, no tools are provided to generate these from, DTDs, XML schema definitions, or RELAX NG schemas
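
The "Optional vs required" rule in the README above (a field is required on load unless it has a `default`/`default_factory`) can be sketched with the standard library alone. This is a minimal illustration, not the library's actual implementation; `is_required` and the `Example` class are hypothetical names introduced here.

```python
from dataclasses import MISSING, Field, dataclass, field, fields
from typing import Any, List, Optional


def is_required(f: "Field[Any]") -> bool:
    """Hypothetical helper: required means neither default nor default_factory is set."""
    return f.default is MISSING and f.default_factory is MISSING


@dataclass
class Example:
    version: str                                     # no default: required
    title: Optional[str]                             # Optional is only a type hint: still required
    lang: Optional[str] = None                       # has a default: may be missing
    items: List[str] = field(default_factory=list)   # has a default_factory: may be missing


# Map each field name to whether it would be treated as required.
required = {f.name: is_required(f) for f in fields(Example)}
print(required)
```

Note that `title` is `Optional` yet still counts as required, which is the distinction the README draws between the type hint and the presence of a default.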
 

diff --git a/functional/container_test.py b/functional/container_test.py
@@ -1,10 +1,11 @@
+from dataclasses import dataclass
 from pathlib import Path
 from typing import List
 
-import pytest
-from lxml import etree
+import pytest  # type: ignore
+from lxml import etree  # type: ignore
 
-from xml_dataclasses import dump, load, rename, xml_dataclass
+from xml_dataclasses import NsMap, XmlDataclass, dump, load, rename, xml_dataclass
 
 from .utils import lmxl_dump
 
@@ -14,32 +15,36 @@
 
 
 @xml_dataclass
+@dataclass
 class RootFile:
     __ns__ = CONTAINER_NS
     full_path: str = rename(name="full-path")
     media_type: str = rename(name="media-type")
 
 
 @xml_dataclass
+@dataclass
 class RootFiles:
     __ns__ = CONTAINER_NS
     rootfile: List[RootFile]
 
 
 @xml_dataclass
-class Container:
+@dataclass
+class Container(XmlDataclass):
     __ns__ = CONTAINER_NS
     version: str
     rootfiles: RootFiles
     # WARNING: this is an incomplete implementation of an OPF container
 
-    def xml_validate(self):
+    def xml_validate(self) -> None:
         if self.version != "1.0":
             raise ValueError(f"Unknown container version '{self.version}'")
 
 
 @pytest.mark.parametrize("remove_blank_text", [True, False])
-def test_functional_container_no_whitespace(remove_blank_text):
+def test_functional_container_no_whitespace(remove_blank_text):  # type: ignore
+    nsmap: NsMap = {None: CONTAINER_NS}
     parser = etree.XMLParser(remove_blank_text=remove_blank_text)
     el = etree.parse(str(BASE / "container.xml"), parser).getroot()
     original = lmxl_dump(el)
@@ -55,6 +60,6 @@ def test_functional_container_no_whitespace(remove_blank_text):
             ],
         ),
     )
-    el = dump(container, "container", {None: CONTAINER_NS})
+    el = dump(container, "container", nsmap)
     roundtrip = lmxl_dump(el)
     assert original == roundtrip
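
The decorator stacking in this test follows the README's note that `@dataclass` must be closest to the class so that it is applied first. A small sketch of why order matters, using only the standard library; `record_dataclass_state` is a hypothetical stand-in for an outer decorator and is not part of `xml_dataclasses`.

```python
from dataclasses import dataclass, is_dataclass
from typing import Dict

# Records whether each class was already a dataclass when the
# stand-in decorator ran.
seen: Dict[str, bool] = {}


def record_dataclass_state(cls: type) -> type:
    """Hypothetical outer decorator: notes whether cls is already a dataclass."""
    seen[cls.__name__] = is_dataclass(cls)
    return cls


# Decorators apply bottom-up: @dataclass runs first here.
@record_dataclass_state
@dataclass
class DataclassFirst:
    value: str


# Here the stand-in runs first and sees a plain class.
@dataclass
@record_dataclass_state
class DataclassLast:
    value: str


print(seen)
```

In the first ordering the outer decorator receives a fully built dataclass; in the second it receives a plain class, which is the situation the README warns against.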
diff --git a/functional/utils.py b/functional/utils.py
@@ -1,8 +1,10 @@
-from lxml import etree
+from typing import Any
 
+from lxml import etree  # type: ignore
 
-def lmxl_dump(el):
-    encoded = etree.tostring(
+
+def lmxl_dump(el: Any) -> str:
+    encoded: bytes = etree.tostring(
         el, encoding="utf-8", pretty_print=True, xml_declaration=True
     )
     return encoded.decode("utf-8")
diff --git a/lint b/lint
@@ -22,4 +22,6 @@ else
   coverage html
   exit 1
 fi
+
+mypy functional/container_test.py --strict
 pytest functional/ --random-order $PYTEST_DEBUG
diff --git a/pyproject.toml b/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "xml_dataclasses"
-version = "0.0.4"
+version = "0.0.5"
 description = "(De)serialize XML documents into specially-annotated dataclasses"
 authors = ["Toby Fleming <[email protected]>"]
 license = "MPL-2.0"

diff --git a/src/xml_dataclasses/__init__.py b/src/xml_dataclasses/__init__.py
@@ -3,5 +3,24 @@
 logging.getLogger(__name__).addHandler(logging.NullHandler())
 
 from .modifiers import rename, text  # isort:skip
-from .resolve_types import xml_dataclass  # isort:skip
+from .resolve_types import (  # isort:skip
+    is_xml_dataclass,
+    xml_dataclass,
+    NsMap,
+    XmlDataclass,
+)
 from .serde import dump, load  # isort:skip
+
+
+# __all__ is required for mypy to pick up the imports
+# for errors, use `from xml_dataclasses.errors import ...`
+__all__ = [
+    "rename",
+    "text",
+    "dump",
+    "load",
+    "is_xml_dataclass",
+    "xml_dataclass",
+    "NsMap",
+    "XmlDataclass",
+]
diff --git a/src/xml_dataclasses/modifiers.py b/src/xml_dataclasses/modifiers.py
@@ -14,12 +14,15 @@ def make_field(default: Union[_T, _MISSING_TYPE]) -> Field[_T]:
     return field(default=default)
 
 
+# NOTE: Actual return type is 'Field[_T]', but we want to help type checkers
+# to understand the magic that happens at runtime.
+# see https://github.com/python/typeshed/blob/master/stdlib/3.7/dataclasses.pyi
 def rename(
     f: Optional[Field[_T]] = None,
     default: Union[_T, _MISSING_TYPE] = MISSING,
     name: Optional[str] = None,
     ns: Optional[str] = None,
-) -> Field[_T]:
+) -> _T:
     if f is None:
         f = make_field(default=default)
     metadata = dict(f.metadata)
@@ -28,15 +31,18 @@ def rename(
     if ns:
         metadata["xml:ns"] = ns
     f.metadata = metadata
-    return f
+    return f  # type: ignore
 
 
+# NOTE: Actual return type is 'Field[_T]', but we want to help type checkers
+# to understand the magic that happens at runtime.
+# see https://github.com/python/typeshed/blob/master/stdlib/3.7/dataclasses.pyi
 def text(
     f: Optional[Field[_T]] = None, default: Union[_T, _MISSING_TYPE] = MISSING
-) -> Field[_T]:
+) -> _T:
     if f is None:
         f = make_field(default=default)
     metadata = dict(f.metadata)
     metadata["xml:text"] = True
     f.metadata = metadata
-    return f
+    return f  # type: ignore
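
The return-type change to `rename` and `text` above mirrors how `dataclasses.field` is annotated in typeshed: the function returns a `Field` at runtime but is declared as returning the field's value type, so that `name: str = rename(...)` type-checks. A minimal standalone sketch of the same idea follows; `with_alias` and the `"alias"` metadata key are hypothetical and not taken from the library.

```python
from dataclasses import dataclass, field, fields
from typing import TypeVar, cast

_T = TypeVar("_T")


def with_alias(default: _T, alias: str) -> _T:
    """Hypothetical helper: returns a Field at runtime, annotated as _T."""
    f = field(default=default, metadata={"alias": alias})
    # The cast tells the type checker to treat the Field as the value type,
    # so the assignment in the class body below is accepted.
    return cast(_T, f)


@dataclass
class RootFile:
    # Checked as `str = str`, although a Field object is assigned at class
    # creation time and consumed by the @dataclass decorator.
    full_path: str = with_alias("", alias="full-path")


instance = RootFile()
aliases = {f.name: f.metadata["alias"] for f in fields(RootFile)}
print(instance.full_path, aliases)
```

The diff uses `# type: ignore` on the `return` instead of `cast`; either way the annotation deliberately differs from the runtime type.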