From 7ccb2f315100954bb89f93f9e9aa43a7dad0e49c Mon Sep 17 00:00:00 2001 From: Marcel Hellkamp Date: Sat, 12 Oct 2024 17:50:04 +0200 Subject: [PATCH 1/2] perf: Avoid unnecessary header search --- multipart.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/multipart.py b/multipart.py index b201ddc..7da7dc3 100644 --- a/multipart.py +++ b/multipart.py @@ -509,8 +509,8 @@ def _close_headers(self): elif h == "Content-Type": self.content_type, args = parse_options_header(v) self.charset = args.get("charset") - elif h == "Content-Length": - self._clen = int(self.header("Content-Length", -1)) + elif h == "Content-Length" and v.isdecimal(): + self._clen = int(v) if self.name is None: raise self._fail("Missing Content-Disposition segment header") From 1969677dbf2811c5ceb979be99ce472f19230fcd Mon Sep 17 00:00:00 2001 From: Marcel Hellkamp Date: Tue, 8 Oct 2024 19:25:16 +0200 Subject: [PATCH 2/2] docs: Split and improve readme --- CHANGELOG.rst | 77 ++++++++++++++++++++++ README.rst | 172 +++++++++++++++++++++++++------------------------- 2 files changed, 163 insertions(+), 86 deletions(-) create mode 100644 CHANGELOG.rst diff --git a/CHANGELOG.rst b/CHANGELOG.rst new file mode 100644 index 0000000..9fa9861 --- /dev/null +++ b/CHANGELOG.rst @@ -0,0 +1,77 @@ +========= +Changelog +========= + +Release 1.1 +=========== + +This release could have been a patch release, but some of the fixes include +change in behavior to match documentation or specification. None of them should +be a surprise or have an impact on real-world clients, though. Existing apps +should be able to upgrade without issues. + +* fix: Fail faster on input with invalid line breaks (#55) +* fix: Allow empty segment names (#56) +* fix: Avoid ResourceWarning when using parse_form_data (#57) +* fix: MultipartPart now always has a sensible content type. +* fix: Actually check parser state on context manager exit. +* fix: Honor Content-Length header, if present. +* perf: Reduce overhead for small segments (-21%) +* perf: Reduce write overhead for large uploads (-2%) + +Release 1.0 +=========== + +This release introduces a completely new, fast, non-blocking ``PushMultipartParser`` +parser, which now serves as the basis for all other parsers. + +* feat: new ``PushMultipartParser`` parser. +* change: Parser is stricter by default and now rejects clearly broken input. + This should not affect data sent by actual browsers or HTTP clients, but may break some artificial unit tests. + * Fail on invalid line-breaks in headers or around boundaries. + * Fail on invalid header names. +* change: Default charset for segment headers and text fields changed to ``utf8``, as recommended by W3C HTTP. +* change: Default disk and memory limits for ``MultipartParser`` increased, but multiple other limits were introduced to allow finer control. Check if the new defaults still fit your needs. +* change: Several undocumented APIs were deprecated or removed, some of which were not strictly private but should only be used by the parser itself, not by applications. + +Release 0.2 +=========== + +This release dropped support for Python versions below ``3.6``. Stay on ``multipart-0.1`` if you need Python 2.5+ support. + +Patch 0.2.5 +----------- + +* security: Don't test semicolon separators in urlencoded data (#33) +* build: Add python-requires directive, indicating Python 3.5 or later is required and preventing older Pythons from attempting to download this version (#32) +* fix: Add official support for Python 3.10-3.12 (#38, #48) +* fix: Default value of ``copy_file`` should be ``2 ** 16``, not ``2 * 16`` (#41) +* docs: Update URL for Bottle (#42) + +Patch 0.2.4 +----------- + +* fix: Consistently decode non-utf8 URL-encoded form-data + +Patch 0.2.3 +----------- + +* fix: Import MutableMapping from collections.abc (#23) +* fix: Allow stream to contain data before first boundary (#25) +* tests: Fix a few more ResourceWarnings in the test suite (#24) + +Patch 0.2.2 +----------- + +* fix: ResourceWarnings on Python 3 (#21) + +Patch 0.2.1 +----------- + +* fix: empty payload (#20) + + +Release 0.1 +=========== + +First release diff --git a/README.rst b/README.rst index 5262bf2..2649989 100644 --- a/README.rst +++ b/README.rst @@ -1,3 +1,4 @@ +============================== Parser for multipart/form-data ============================== @@ -25,13 +26,14 @@ low-level for framework authors and high-level for WSGI application developers: and ``application/x-www-form-urlencoded`` form submissions from a `WSGI `_ environment. + Installation ------------- +============ ``pip install multipart`` Features --------- +======== * Pure python single file module with no dependencies. * 100% test coverage. Tested with inputs as seen from actual browsers and HTTP clients. @@ -48,8 +50,17 @@ actual modern browsers and HTTP clients, which means: * No ``encoded-word`` or ``name=_charset_`` encoding markers (discouraged in RFC 7578). * No support for clearly broken input (e.g. invalid line breaks or header names). -Usage and examples ------------------- + +Usage and Examples +================== + +Here are some basic examples for the most common use cases. There are more +parameters and features available than shown here, so check out the docstrings +(or your IDEs built-in help) to get a full picture. + + +Helper function for WSGI or CGI +------------------------------- For WSGI application developers we strongly suggest using the ``parse_form_data`` helper function. It accepts a WSGI ``environ`` dictionary and parses both types @@ -64,16 +75,54 @@ instances in return, one for text fields and the other for file uploads: def wsgi(environ, start_response): if environ["REQUEST_METHOD"] == "POST": forms, files = parse_form_data(environ) - - title = forms["title"] # string - upload = files["upload"] # MultipartPart + + title = forms["title"] # string + upload = files["upload"] # MultipartPart upload.save_as(...) -The ``parse_form_data`` helper function internally uses ``MultipartParser``, a -streaming parser that reads from a ``multipart/form-data`` encoded binary data -stream and emits ``MultipartPart`` instances as soon as a part is fully parsed. -This is most useful if you want to consume the individual parts as soon as they -arrive, instead of waiting for the entire request to be parsed: +Note that form fields that are too large to fit into memory will end up as +``MultipartPart`` instances in the ``files`` dict instead. This is to protect +your app from running out of memory or crashing. ``MultipartPart`` instances are +buffered to temporary files on disk if they exceed a certain size. The default +limits should be fine for most use cases, but can be configured if you need to. +See ``MultipartParser`` for details. + +Flask, Bottle & Co +^^^^^^^^^^^^^^^^^^ + +Most WSGI web frameworks already have multipart functionality built in, but +you may still get better throughput for large files (or better limits control) +by switching parsers: + +.. code-block:: python + + forms, files = multipart.parse_form_data(flask.request.environ) + +Legacy CGI +^^^^^^^^^^ + +If you are in the unfortunate position to have to rely on CGI, but can't use +``cgi.FieldStorage`` anymore, it's possible to build a minimal WSGI environment +from a CGI environment and use that with ``parse_form_data``. This is not a real +WSGI environment, but it contains enough information for ``parse_form_data`` +to do its job. Do not forget to add proper error handling. + +.. code-block:: python + + import sys, os, multipart + + environ = dict(os.environ.items()) + environ['wsgi.input'] = sys.stdin.buffer + forms, files = multipart.parse_form_data(environ) + + +Stream parser: ``MultipartParser`` +---------------------------------- + +The ``parse_form_data`` helper may be convenient, but it expects a WSGI +environment and parses the entire request in one go before it returns any +results. Using ``MultipartParser`` directly gives you more control and also +allows you to process ``MultipartPart`` instances as soon as they arrive: .. code-block:: python @@ -81,12 +130,16 @@ arrive, instead of waiting for the entire request to be parsed: def wsgi(environ, start_response): assert environ["REQUEST_METHOD"] == "POST" - ctype, copts = parse_options_header(environ.get("CONTENT_TYPE", "")) - boundary = copts.get("boundary") - charset = copts.get("charset", "utf8") - assert ctype == "multipart/form-data" - - parser = MultipartParser(environ["wsgi.input"], boundary, charset) + + content_type = environ["CONTENT_TYPE"] + content_type, content_params = parse_options_header(content_type) + assert content_type == "multipart/form-data" + + stream = environ["wsgi.input"] + boundary = content_params.get("boundary") + charset = content_params.get("charset", "utf8") + + parser = MultipartParser(stream, boundary, charset) for part in parser: if part.filename: print(f"{part.name}: File upload ({part.size} bytes)") @@ -96,11 +149,15 @@ arrive, instead of waiting for the entire request to be parsed: else: print(f"{part.name}: Test field, but too big to print :/") -The ``MultipartParser`` handles IO and file buffering for you, but does so using + +Non-blocking parser: ``PushMultipartParser`` +-------------------------------------------- + +The ``MultipartParser`` handles IO and file buffering for you, but relies on blocking APIs. If you need absolute control over the parsing process and want to avoid blocking IO at all cost, then have a look at ``PushMultipartParser``, the -low-level non-blocking incremental ``multipart/form-data`` parser that powers all -the other parsers in this library: +low-level non-blocking incremental ``multipart/form-data`` parser that powers +all the other parsers in this library: .. code-block:: python @@ -109,10 +166,14 @@ the other parsers in this library: async def process_multipart(reader: asyncio.StreamReader, boundary: str): with PushMultipartParser(boundary) as parser: while not parser.closed: + chunk = await reader.read(1024*64) for result in parser.parse(chunk): + if isinstance(result, MultipartSegment): print(f"== Start of segment: {result.name}") + if result.filename: + print(f"== Client-side filename: {result.filename}") for header, value in result.headerlist: print(f"{header}: {value}") elif result: # Result is a non-empty bytearray @@ -121,70 +182,9 @@ the other parsers in this library: print(f"== End of segment") -Changelog ---------- - -* **1.1** - - * Some of these fixes changed behavior to match documentation or specification, - none of them should be a surprise. Existing apps should be able to upgrade - without change. - * fix: Fail faster on input with invalid line breaks (#55) - * fix: Allow empty segment names (#56) - * fix: Avoid ResourceWarning when using parse_form_data (#57) - * fix: MultipartPart now always has a sensible content type. - * fix: Actually check parser state on context manager exit. - * fix: Honor Content-Length header, if present. - * perf: Reduce overhead for small segments (-21%) - * perf: Reduce write overhead for large uploads (-2%) - -* **1.0** - - * A completely new, fast, non-blocking ``PushMultipartParser`` parser, which - now serves as the basis for all other parsers. - * The new parser is stricter and rejects clearly broken input quicker, even in - non-strict mode (e.g. invalid line breaks or header names). This should not - affect data sent by actual browsers or HTTP clients. - * Default charset for ``MultipartParser`` headers and text fields changed to - ``utf8``, as recommended by W3C HTTP. - * Default disk and memory limits for ``MultipartParser`` increased, but - multiple other limits added for finer control. Check if the the new defaults - still fit your needs. - * Undocumented APIs deprecated or removed, some of which were not strictly - private. This includes parameters for ``MultipartParser`` and some - ``MultipartPart`` methods, but those should not be used by anyone but the - parser itself. - -* **0.2.5** - - * Don't test semicolon separators in urlencoded data (#33) - * Add python-requires directive, indicating Python 3.5 or later is required and preventing older Pythons from attempting to download this version (#32) - * Add official support for Python 3.10-3.12 (#38, #48) - * Default value of ``copy_file`` should be ``2 ** 16``, not ``2 * 16`` (#41) - * Update URL for Bottle (#42) - -* **0.2.4** - - * Consistently decode non-utf8 URL-encoded form-data - -* **0.2.3** - - * Import MutableMapping from collections.abc (#23) - * Fix a few more ResourceWarnings in the test suite (#24) - * Allow stream to contain data before first boundary (#25) - -* **0.2.2** - - * Fix #21 ResourceWarnings on Python 3 - -* **0.2.1** - - * Fix #20 empty payload - -* **0.2** - - * Dropped support for Python versions below 3.6. Stay on 0.1 if you need Python 2.5+ support. +License +======= -* **0.1** +.. __: https://github.com/defnull/multipart/raw/master/LICENSE - * First release +Code and documentation are available under MIT License (see LICENSE__).