\r by itself is mapped to \n, which causes values such as AWS S3 key names to lose information when parsed by SweetXml #92

Open
cmarkle opened this issue May 7, 2023 · 0 comments

Can SweetXml be configured to not map single \r characters to \n? This mapping causes problems when parsing XML from the AWS S3 storage service when object names contain \r (which, although not a smart choice, is legal).

Working with this XML returned from Amazon AWS S3 storage service:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>obfuscated_bucket_name</Name>
  <Prefix/>
  <NextContinuationToken>1ILUW_obfuscated_continuation_token_1yMUM</NextContinuationToken>
  <KeyCount>1000</KeyCount>
  <MaxKeys>1000</MaxKeys>
  <Delimiter/>
  <IsTruncated>true</IsTruncated>
  <Contents>
    <Key>workspaces/51109/packages/ND-7H46rSQ.asp-package/contents/WHALE -&gt; MH/Whale_TTB_DOCUMENTS/Aspera Inboxes/Icon&#13;</Key>
    <LastModified>2022-03-25T23:52:22.122Z</LastModified>
    <ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag>
    <Size>0</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
  <Contents>
    <Key>workspaces/51109/packages/ND-7H46rSQ.asp-package/contents/WHALE -&gt; MH/Whale_TTB_DOCUMENTS/Aspera Inboxes/Icon&#13;inmiddle</Key>
    <LastModified>2022-03-26T23:52:22.122Z</LastModified>
    <ETag>"c41d8cd98f00b204e9800998ecf8427e"</ETag>
    <Size>0</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>

I used the SweetXml xpath snippet from lib/ex_aws/s3/parsers.ex in the ex_aws/ex_aws_s3 project, as-is:

iex(19)>  xml
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n<ListBucketResult xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\">\n  <Name>obfuscated_bucket_name</Name>\n  <Prefix></Prefix>\n  <NextContinuationToken>1ILUW_obfuscated_continuation_token_1yMUM</NextContinuationToken>\n  <KeyCount>1000</KeyCount>\n  <MaxKeys>1000</MaxKeys>\n  <Delimiter></Delimiter>\n  <IsTruncated>true</IsTruncated>\n  <Contents>\n    <Key>workspaces/51109/packages/ND-7H46rSQ.asp-package/contents/WHALE -&gt; MH/Whale_TTB_DOCUMENTS/Aspera Inboxes/Icon&#13;</Key>\n    <LastModified>2022-03-25T23:52:22.122Z</LastModified>\n    <ETag>&quot;d41d8cd98f00b204e9800998ecf8427e&quot;</ETag>\n    <Size>0</Size>\n    <StorageClass>STANDARD</StorageClass>\n  </Contents>\n  <Contents>\n    <Key>workspaces/51109/packages/ND-7H46rSQ.asp-package/contents/WHALE -&gt; MH/Whale_TTB_DOCUMENTS/Aspera Inboxes/Icon&#13;inmiddle</Key>\n    <LastModified>2022-03-26T23:52:22.122Z</LastModified>\n    <ETag>&quot;c41d8cd98f00b204e9800998ecf8427e&quot;</ETag>\n    <Size>0</Size>\n    <StorageClass>STANDARD</StorageClass>\n  </Contents>\n</ListBucketResult>"

iex(22)> parsed_body =
...(22)>         xml |> SweetXml.xpath(~x"//ListBucketResult",
...(22)>           name: ~x"./Name/text()"s,
...(22)>           is_truncated: ~x"./IsTruncated/text()"s,
...(22)>           prefix: ~x"./Prefix/text()"s,
...(22)>           marker: ~x"./Marker/text()"s,
...(22)>           next_continuation_token: ~x"./NextContinuationToken/text()"s,
...(22)>           key_count: ~x"./KeyCount/text()"s,
...(22)>           max_keys: ~x"./MaxKeys/text()"s,
...(22)>           next_marker: ~x"./NextMarker/text()"s,
...(22)>           contents: [
...(22)>             ~x"./Contents"l,
...(22)>             key: ~x"./Key/text()"s,
...(22)>             last_modified: ~x"./LastModified/text()"s,
...(22)>             e_tag: ~x"./ETag/text()"s,
...(22)>             size: ~x"./Size/text()"s,
...(22)>             storage_class: ~x"./StorageClass/text()"s,
...(22)>             owner: [
...(22)>               ~x"./Owner"o,
...(22)>               id: ~x"./ID/text()"s,
...(22)>               display_name: ~x"./DisplayName/text()"s
...(22)>             ]
...(22)>           ],
...(22)>           common_prefixes: [
...(22)>             ~x"./CommonPrefixes"l,
...(22)>             prefix: ~x"./Prefix/text()"s
...(22)>           ]
...(22)>         )
%{
  common_prefixes: [],
  contents: [
    %{
      e_tag: "\"d41d8cd98f00b204e9800998ecf8427e\"",
      key: "workspaces/51109/packages/ND-7H46rSQ.asp-package/contents/WHALE -> MH/Whale_TTB_DOCUMENTS/Aspera Inboxes/Icon\n",
      last_modified: "2022-03-25T23:52:22.122Z",
      owner: nil,
      size: "0",
      storage_class: "STANDARD"
    },
    %{
      e_tag: "\"c41d8cd98f00b204e9800998ecf8427e\"",
      key: "workspaces/51109/packages/ND-7H46rSQ.asp-package/contents/WHALE -> MH/Whale_TTB_DOCUMENTS/Aspera Inboxes/Icon\ninmiddle",
      last_modified: "2022-03-26T23:52:22.122Z",
      owner: nil,
      size: "0",
      storage_class: "STANDARD"
    }
  ],
  is_truncated: "true",
  key_count: "1000",
  marker: "",
  max_keys: "1000",
  name: "obfuscated_bucket_name",
  next_continuation_token: "1ILUW_obfuscated_continuation_token_1yMUM",
  next_marker: "",
  prefix: ""
}

Note that the S3 keys containing \r have their \r characters mapped to \n. In general this is probably "right" and in conformance with the XML spec's end-of-line handling, but ideally we should be able to use SweetXml against this type of input in a way that preserves \r and similar characters, for example by mapping them to their XML escape equivalents (&#13; in the \r case).
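
Since SweetXml is a wrapper around Erlang's xmerl, one way to narrow this down is to scan a reduced document with xmerl directly and inspect what comes back for the key. The sketch below is only a diagnostic aid: the module name XmerlProbe is hypothetical, and it assumes xmerl is available in the running OTP installation. No expected output is given here, because the point is to observe whether \r or \n is returned.

```elixir
defmodule XmerlProbe do
  # Hypothetical diagnostic helper; it calls xmerl directly and does not use SweetXml.

  # Returns the concatenated text content of every <Key> element in `xml`.
  def key_text(xml) when is_binary(xml) do
    {doc, _rest} = xml |> String.to_charlist() |> :xmerl_scan.string()

    "//Key/text()"
    |> String.to_charlist()
    |> :xmerl_xpath.string(doc)
    # Each match is an xmlText record: {:xmlText, parents, pos, language, value, type}.
    |> Enum.map_join(fn text_node -> text_node |> elem(4) |> List.to_string() end)
  end
end

# Inspect the result to see which line-ending character survives the scan.
"<Key>Icon&#13;inmiddle</Key>" |> XmerlProbe.key_text() |> IO.inspect()
```

If the inspected string ends up containing \n, the mapping would appear to happen below SweetXml, in the scanner itself.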

Can SweetXml be used in a way in which this mapping of \r to \n is not done?
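
Until there is such an option, a possible stopgap is to hide the &#13; references from the parser and put the \r back afterwards. This is a sketch of a workaround, not a SweetXml feature: the module CrGuard and its sentinel string are hypothetical, and it assumes the sentinel never occurs in real key names. It also rewrites any &#13; that appears inside comments or CDATA sections, which may or may not matter for a given input.

```elixir
defmodule CrGuard do
  # Hypothetical helper, not part of SweetXml or ex_aws_s3.
  # Assumption: the sentinel text never occurs in the real data.
  @sentinel "__CR_SENTINEL__"

  # Matches decimal (&#13;) and hexadecimal (&#xD;) character references for U+000D.
  defp cr_reference, do: ~r/&#(?:0*13|x0*[dD]);/

  # Replace carriage-return references with the sentinel before parsing.
  def protect(xml) when is_binary(xml) do
    Regex.replace(cr_reference(), xml, @sentinel)
  end

  # Turn the sentinel back into a real \r in a parsed string value.
  def restore(value) when is_binary(value) do
    String.replace(value, @sentinel, "\r")
  end
end

protected = CrGuard.protect("<Key>Icon&#13;inmiddle</Key>")
IO.inspect(protected)
# "<Key>Icon__CR_SENTINEL__inmiddle</Key>"

IO.inspect(CrGuard.restore("Icon__CR_SENTINEL__inmiddle"))
# "Icon\rinmiddle"
```

With SweetXml this would be used as xml |> CrGuard.protect() |> SweetXml.xpath(~x"//Key/text()"s) |> CrGuard.restore(). For the nested map produced by the ListBucketResult snippet above, restore would have to be applied to each key value in contents.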
