Excessive memory usage #57

Open · lpil opened this issue Jan 30, 2018 · 15 comments
lpil commented Jan 30, 2018

Hello! We're using SweetXml in production and we've been seeing excessive memory usage that we've not been able to debug.

In our latest test, an 80MB XML file uses >9GB of memory, which causes the VM to crash.

We are parsing XML in a format like this:

<?xml version='1.0' encoding='utf-8'?>
<ArrayOfCommercialDetail xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <CommercialDetail>
    <Status>text</Status>
    <ActionDate>text</ActionDate>
    <MusicDetails>
      <Title>text</Title>
      <Arranger></Arranger>
      <Composer>text</Composer>
      <Duration>text</Duration>
    </MusicDetails>
    <ActionedBy>text</ActionedBy>
    <ClockNumber>text</ClockNumber>
    <VODFinalAction>text</VODFinalAction>
    <FinalAction>text</FinalAction>
    <CommercialRestrictions>
      <Restriction>
        <Comment>text</Comment>
        <Code>text</Code>
        <Text>text</Text>
        <ID>text</ID>
      </Restriction>
    </CommercialRestrictions>
    <VODFinalActionId>text</VODFinalActionId>
    <CommercialPresentations>
      <Presentation>
        <Comment>text</Comment>
        <Code>text</Code>
        <Text>text</Text>
        <ID>text</ID>
      </Presentation>
      <Presentation>
        <Comment>text</Comment>
        <Code>text</Code>
        <Text>text</Text>
        <ID>text</ID>
      </Presentation>
    </CommercialPresentations>
    <CommercialArtists>
      <Artist>
        <Name>text</Name>
        <Type>text</Type>
        <ID>text</ID>
      </Artist>
      <Artist>
        <Name>text</Name>
        <Type>text</Type>
        <ID>text</ID>
      </Artist>
    </CommercialArtists>
    <FinalActionId>text</FinalActionId>
    <StatusId>text</StatusId>
  </CommercialDetail>

  <!-- Many more CommercialDetail here... -->

</ArrayOfCommercialDetail>

We parse this XML like so:

    # assumes `import SweetXml` for xpath/3 and the ~x sigil
    data =
      xpath(
        xml,
        ~x"//CommercialDetail"l,
        clock_number: ~x"./ClockNumber/text()"s,
        actioned_at: ~x"./ActionDate/text()"s,
        presentation_codes: ~x"./CommercialPresentations/Presentation/Code/text()"sl,
        restriction_codes: ~x"./CommercialRestrictions/Restriction/Code/text()"sl,
        status: ~x"./Status/text()"s,
        vod_final_action_id: ~x"./VODFinalActionId/text()"s,
        final_action_id: ~x"./FinalActionId/text()"s
      )

[screenshot: memory load charts]

Here are the load charts while iterating over the XML files, parsing each, and then discarding the result. Memory spikes each time a file is parsed.

What are we doing wrong here?

Extra note: we rewrote this code to use the streaming API, which used slightly less memory. Most of our XML will not contain newlines, so this did not seem to be the right path to a solution, and we would expect lower memory usage from both the eager and the streaming APIs.

After digging into the source it seems that memory spikes when :xmerl_scan.string/1 is called.
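
For reference, this is roughly what the non-streaming path boils down to (a sketch; the file name is illustrative):

    # :xmerl_scan.string/1 takes a charlist and builds the entire xmerl
    # record tree in memory at once, which is where the spike shows up.
    xml = File.read!("commercials.xml") |> String.to_charlist()
    {_doc, _rest} = :xmerl_scan.string(xml)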

Thanks,
Louis

awetzel (Collaborator) commented Jan 30, 2018

Hello Louis,

Can you show me the code you use for streaming? When you do not use the streaming API, the memory footprint is mostly determined by Erlang's xmerl library (:xmerl_scan.string/1). With streaming, you should be able to avoid constructing the whole tree by using the discard option.

lpil (Author) commented Jan 31, 2018

Hi @awetzel, thanks for looking into this.

Here's the streaming code. We've not used discard, so I suspect it could be optimised.

data =
  xml_stream
  |> stream_tags(:CommercialDetail)
  |> Stream.map(fn {_, doc} ->
    %{
      clock_number: xpath(doc, ~x"./ClockNumber/text()"s),
      actioned_at: xpath(doc, ~x"./ActionDate/text()"s),
      presentation_codes: xpath(doc, ~x"./CommercialPresentations/Presentation/Code/text()"sl),
      restriction_codes: xpath(doc, ~x"./CommercialRestrictions/Restriction/Code/text()"sl),
      status: xpath(doc, ~x"./Status/text()"s),
      vod_final_action_id: xpath(doc, ~x"./VODFinalActionId/text()"s),
      final_action_id: xpath(doc, ~x"./FinalActionId/text()"s)
    }
  end)
  |> Enum.to_list()

Cheers,
Louis

antoinereyt (Contributor) commented

Hi,

Thanks for raising this issue. I've just updated the documentation to mention the discard option.

For your specific case, you should use the discard option this way:

|> stream_tags(:"CommercialDetail", discard: [:"CommercialDetail"])
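
Applied to the streaming pipeline above, it would look something like this (a sketch):

    data =
      xml_stream
      |> stream_tags(:CommercialDetail, discard: [:CommercialDetail])
      |> Stream.map(fn {_, doc} ->
        # extract fields with xpath/2 as before; each emitted
        # CommercialDetail subtree is dropped from the accumulated
        # document instead of being kept in memory
        %{clock_number: xpath(doc, ~x"./ClockNumber/text()"s)}
      end)
      |> Enum.to_list()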

Can you keep me posted, so that I can close the issue?

Thanks.

steffkes commented

@antoinereyt would you mind explaining why this is?

From reading the code, I'd guess that it's meant to free the memory used by the current iteration after that one is done and we're on our way to the next one. Is that somewhat correct?

antoinereyt (Contributor) commented

@awetzel do you have any details from your investigation?

lpil (Author) commented Feb 26, 2018

Hi @antoinereyt, I'm afraid we were unable to find a solution with the streaming code before your message, so we rewrote our code to use another XML library in order to meet our deadline.

gmalkas commented Apr 16, 2018

Hi @antoinereyt, just wanted to let you know we managed to reduce our memory consumption while parsing ~10MB XML files by up to 800MB thanks to the discard option.

We were really surprised that this option was needed, since bounded memory consumption was the reason we used the streaming interface in the first place (we only found out about it thanks to this issue; we shipped our code months ago, before the docs mentioned the option).

I understand there is a trade-off and it's hard to avoid surprising behavior: either you discard tags by default, in which case the streaming output might differ from the non-streaming output for no obvious reason, or you do not discard by default, which keeps the output consistent but lets memory usage blow up.

Assuming the main reason people use the streaming API is to get bounded memory usage, wouldn't it be fair to discard by default, with documentation warning about the consequences for the output?
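
Concretely, I'd imagine something like this wrapper (hypothetical, not a real SweetXml API, just a sketch of the proposed default):

    # Discard the streamed tag by default so memory stays bounded,
    # unless the caller explicitly opts out via :discard.
    defmodule StreamingDefaults do
      def stream_tags_discarding(enum, tag, opts \\ []) do
        SweetXml.stream_tags(enum, tag, Keyword.put_new(opts, :discard, [tag]))
      end
    end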

Thanks for the work.

9mm commented Nov 29, 2018

@lpil which one did you choose that supports xpath?

lpil (Author) commented Nov 29, 2018

@9mm we used https://github.com/processone/fast_xml/

9mm commented Nov 29, 2018

@lpil I looked into that, but how did you get it to support xpath?

lpil (Author) commented Nov 29, 2018

I don't believe we used xpath, though I've left the company so I can't be sure.

augnustin commented

I'm also having memory issues...

I don't understand the comment:

the streaming output might differ from the non-streaming output for no obvious reason

The :discard option is essentially undocumented. What does it do?

I read there:

It would be an interesting improvement if we could get :xmerl_xpath.string/2 to return an Elixir stream instead of a list. Maybe someday ...

I'm guessing this still hasn't been implemented, has it?

As far as I'm concerned, this doesn't feel that complicated: I have a huge list of <FICHE /> elements, each of which needs to be processed independently. That means all memory could be reclaimed at each iteration ... but it obviously isn't. 😢

    # File.stream!/1 emits the file line by line by default.
    File.stream!(filename)
    |> SweetXml.stream_tags(:FICHE, discard: [:FICHE])
    |> Stream.map(fn {_, fiche} -> transform.(fiche) end)
    # Enum.to_list/1 still accumulates every transformed result.
    |> Enum.to_list()
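
One thing I haven't verified: since File.stream!/1 streams lines, a file without newlines arrives as one huge chunk. Streaming fixed-size chunks instead might keep the input side bounded (an untested sketch):

    # Read 2048-byte chunks rather than lines, so the input stream stays
    # bounded even when the XML contains no newlines.
    File.stream!(filename, [], 2048)
    |> SweetXml.stream_tags(:FICHE, discard: [:FICHE])
    |> Stream.map(fn {_, fiche} -> transform.(fiche) end)
    |> Enum.to_list()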

Any ideas? Suggestions? Should I also move to another library?

lpil (Author) commented Oct 7, 2020

Since I opened this issue, an excellent XML pull parser that uses a very small amount of memory has been released. It worked really well for us. I've been searching for 15 minutes but I can't find it now (I've forgotten the name), but it exists!

thbar (Contributor) commented Oct 7, 2020

@lpil by any chance, is it this one? https://github.com/zadean/yaccety_sax (I haven't tested it yet, but it came across my radar.)

lpil (Author) commented Oct 7, 2020

Yes! It was head and shoulders better than the rest memory-wise. A shame it's not better known.

Shakadak self-assigned this Mar 9, 2021