
Reduce memory footprint of evaluated values in LazyList #10937

Status: Open. Wants to merge 9 commits into base: 2.13.x

Conversation

@lrytz (Member) commented Nov 25, 2024

The LazyList implementation uses two objects per evaluated value,
a LazyList and one State instance. So the overhead for evaluated
values is 2x compared to Stream.

This PR reduces the overhead to one LazyList instance with two
fields (head and tail) per evaluated value.

Thread safety is implemented by checking the volatile head
field and synchronizing on this for initialization. This is
equivalent to the previous implementation, which uses a lazy val.

Breaking changes:

Serialization is not compatible. Non-evaluated (tails of) streams are
serialized using plain object serialization. The internals are
different and serialization is incompatible in both directions.

toString of cyclic streams displays one more element: <cycle>
is only shown starting from the second element. For example,
in val x: LL[Int] = 1 #:: 2 #:: x, the x.tail.tail instance
is different from x; only the next tail is a shared instance.
Before, the cycle could be detected earlier because the State
instance of x and x.tail.tail was the same, even when the
LazyList instances were different.

Fixes scala/bug#13062
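A minimal sketch of this encoding in Java (mirroring the lazy val desugaring quoted below in the thread; LazyCell, Pair, and UNDEFINED are invented names for illustration, not the PR's actual internals):

```java
import java.util.function.Supplier;

// Sketch of the single-object encoding: one cell stores head and tail
// directly; the volatile head field doubles as the "initialized" flag,
// giving the same double-checked locking a Scala lazy val desugars to.
// All names here are invented for illustration.
final class LazyCell {
    private static final Object UNDEFINED = new Object();

    static final class Pair {
        final Object head;
        final LazyCell tail;
        Pair(Object head, LazyCell tail) { this.head = head; this.tail = tail; }
    }

    private volatile Object head = UNDEFINED;
    private LazyCell tail;        // plain field, published by the volatile head write
    private Supplier<Pair> thunk; // nulled out once evaluated

    LazyCell(Supplier<Pair> thunk) { this.thunk = thunk; }

    private void force() {
        if (head == UNDEFINED) {
            synchronized (this) {
                if (head == UNDEFINED) {
                    Pair p = thunk.get();
                    tail = p.tail;  // plain writes first...
                    thunk = null;   // ...and let the thunk be collected
                    head = p.head;  // volatile write last acts as the release fence
                }
            }
        }
    }

    Object head() { force(); return head; }
    LazyCell tail() { force(); return tail; }
}
```

A reader that observes head != UNDEFINED is guaranteed by the happens-before rule for volatile fields to also see the tail written just before it.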

@lrytz (Member, Author) commented Nov 25, 2024

JOL output before this change:

scala.collection.immutable.LazyList@2d2e5f00d footprint:
     COUNT       AVG       SUM   DESCRIPTION
       100        16      1600   java.lang.Object
       101        24      2424   scala.collection.immutable.LazyList
       100        24      2400   scala.collection.immutable.LazyList$State$Cons
         1        16        16   scala.collection.immutable.LazyList$State$Empty$
       302                6440   (total)

After:

scala.collection.immutable.LazyList@2ea6137d footprint:
     COUNT       AVG       SUM   DESCRIPTION
       100        16      1600   java.lang.Object
       101        24      2424   scala.collection.immutable.LazyList
       201                4024   (total)

@sjrd (Member) commented Nov 25, 2024

That looks like a reasonable encoding. However it's going to be tricky to make thread-safe while staying portable for the non-JVM platforms. The straightforward approach would be to use CAS'es on the fields, but that's not supported on JS/Native other than through ju.concurrent.AtomicX classes. Using the latter would of course defeat the purpose of the whole PR.

@He-Pin (Contributor) commented Nov 25, 2024

Do other platforms support AtomicXFieldUpdater? That would help save more bytes.

@sjrd (Member) commented Nov 25, 2024

No, they don't. These classes are also reflection-based.

@Ichoran (Contributor) commented Nov 25, 2024

Is space or runtime performance more important for LazyList? If space isn't clearly more important, we'd better have a benchmark before/after also. synchronized is an obvious way to fix the thread safety; it's just slower under some scenarios.

@SethTisue SethTisue requested a review from NthPortal November 25, 2024 21:22
@NthPortal (Contributor) commented

I'll try to have a look at this. I've had another idea for a while: boxing all the evaluation into its own class, and then just swapping the val from the box to the result after the evaluation is done, which I believe sidesteps the concurrency issues.
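One way the boxing idea could look, as a rough Java sketch (BoxedLazy and Box are invented names; it assumes the computed result is never itself a Box):

```java
import java.util.function.Supplier;

// Hypothetical sketch of the "box the evaluation" idea: a single field
// starts out holding the evaluator box and is swapped to the finished
// result exactly once, so a reader sees either the box or a fully
// constructed value, never partially initialized state.
final class BoxedLazy {
    private static final class Box {
        final Supplier<Object> compute;
        Box(Supplier<Object> compute) { this.compute = compute; }
    }

    private volatile Object contents; // Box before evaluation, result after

    BoxedLazy(Supplier<Object> compute) { this.contents = new Box(compute); }

    Object get() {
        Object c = contents;
        if (!(c instanceof Box)) return c;  // fast path: already evaluated
        synchronized (this) {
            Object c2 = contents;
            if (!(c2 instanceof Box)) return c2;
            Object v = ((Box) c2).compute.get();
            contents = v;                   // single swap: box -> result
            return v;
        }
    }
}
```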

@lrytz (Member, Author) commented Nov 25, 2024

👍 I also think this can probably be cleaned up further. def state now creates a new State.Cons on each invocation, which is unnecessary. I just did the minimal change to see what can be done.

@lrytz lrytz modified the milestones: 2.13.16, 2.13.17 Nov 26, 2024
@SethTisue SethTisue added the library:collections and performance labels Nov 27, 2024
@NthPortal (Contributor) commented

@lrytz I haven't fixed this enough for it to compile, but thoughts on this design? https://github.com/scala/scala/compare/2.13.x...NthPortal:scala:ll-memory/R1?expand=1

@lrytz (Member, Author) commented Nov 28, 2024

@NthPortal I see, the idea is using a lazy val in the Evaluator to implement thread safety, and make sure that multiple threads assigning to _head / _tail is idempotent.

I'm not sure which one is easier, your design or implementing thread safety manually.

Inspired by the desugaring for class C { lazy val x = 42 }:

public class C {
    private int x;
    private volatile boolean bitmap$0;

    private int x$lzycompute() {
        C c = this;
        synchronized (c) {
            if (!this.bitmap$0) {
                this.x = 42;
                this.bitmap$0 = true;
            }
        }
        return this.x;
    }

    public int x() {
        if (!this.bitmap$0) {
            return this.x$lzycompute();
        }
        return this.x;
    }
}

We should be able to do it as:

class LazyList {
  @volatile var _head = Undefined
  var _state = lazyState

  def state =
    if (_head ne Undefined) sCons(_head, _state)
    else {
      synchronized {
        if (_head eq Undefined) {
          // evaluate lazy state
          _state = ...
          _head = ...
        }
      }
      state
    }
}

I pushed that, wdyt? I would need to brush up my java memory model knowledge to be sure. If a thread observes _head ne Undefined, is it guaranteed to also see the changed _state? It seems so from the lazy val desugaring, but I don't have the knowledge to explain it.
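For what it's worth, the guarantee in question can be illustrated (not proven) with a small Java sketch of the same publication pattern; Publication, state, and head below are illustrative stand-ins for the PR's _state and _head fields:

```java
// JMM illustration: a plain write made before a volatile write is visible
// to any thread whose volatile read observes that write (happens-before).
// This is exactly the pattern of writing _state before _head.
final class Publication {
    static int state = 0;                 // plain field (stands in for _state)
    static volatile boolean head = false; // volatile field (stands in for _head)

    static int demo() {
        Thread writer = new Thread(() -> {
            state = 42;  // plain write first...
            head = true; // ...then the volatile write publishes it
        });
        writer.start();
        while (!head) { /* spin on the volatile read */ }
        // This thread's volatile read saw the volatile write, so the plain
        // write to state that preceded it must be visible here as well.
        try { writer.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return state; // guaranteed 42 under the Java memory model
    }
}
```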

@Ichoran (Contributor) commented Nov 28, 2024

Aren't you creating a new object on every access now? Are we sure this is a win?

@NthPortal (Contributor) commented

@lrytz I'm pretty sure you're correct that setting @volatile var _head second acts as a fence for _state. I think my biggest issue is that it no longer detects self-reference, which means you get a stack overflow instead, which is bad UX. I do like the idea of less boxing, though. On the other hand, this makes it more difficult (maybe impossible?) to share the state when prepending without double-evaluating.

@lrytz lrytz force-pushed the t13062 branch 2 times, most recently from 5fc4205 to 799ff04 Compare November 29, 2024 12:03
@lrytz (Member, Author) commented Nov 29, 2024

Aren't you creating a new object on every access now? Are we sure this is a win?

@Ichoran I changed this now (see commit "remove State"). WDYT?


it no longer detects self-reference, which means that you get a stack overflow instead, which is bad UX

@NthPortal could you explain this in more detail? I don't know what you mean. The selfReferentialFailure JUnit test passes.

I adjusted the tests for toString of cyclic lazy lists. The underlying reason:

scala> lazy val ll: LazyList[Int] = 1 #:: 2 #:: ll
lazy val ll: LazyList[Int] // unevaluated

scala> ll.tail.tail eq ll
val res0: Boolean = false

scala> ll.tail.tail.tail eq ll.tail
val res1: Boolean = true

Up to here it's the same before and after my change. But ll.toString is now different: before, LazyList(1, 2, <cycle>); after, LazyList(1, 2, 1, <cycle>).

Before the change, ll.tail.tail.state eq ll.state is true:

import scala.util.chaining._
def getMethodAccessible[A: reflect.ClassTag](name: String): java.lang.reflect.Method =
  implicitly[reflect.ClassTag[A]]
    .runtimeClass.getDeclaredMethods
    .find(_.getName == name).get.tap(_.setAccessible(true))

val state = getMethodAccessible[LazyList[_]]("scala$collection$immutable$LazyList$$state")

scala> state.invoke(ll.tail.tail) eq state.invoke(ll)
val res6: Boolean = true

But we no longer have the state objects, so cycles are detected one element later.
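The one-element delay can be reproduced with a standalone Java sketch (CycleDemo and Cell are invented names, not the library's actual addStringNoForce): detect the cycle by reference identity of the cells, with the third cell distinct from the first, as in the forced ll above.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Illustration of cycle detection by reference identity: because the cell
// holding the third element is a distinct instance (only the *next* tail
// is shared), the cycle-start element is printed once more before the
// revisited cell is recognized.
final class CycleDemo {
    static final class Cell {
        final int head;
        Cell tail;
        Cell(int head) { this.head = head; }
    }

    static String render(Cell xs) {
        Set<Cell> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        StringBuilder sb = new StringBuilder("LazyList(");
        Cell cur = xs;
        boolean first = true;
        while (cur != null && seen.add(cur)) { // add() is false on a revisit
            if (!first) sb.append(", ");
            sb.append(cur.head);
            first = false;
            cur = cur.tail;
        }
        if (cur != null) sb.append(", <cycle>");
        return sb.append(")").toString();
    }
}
```

Building the same shape as ll (cells 1 -> 2 -> 1, with the third cell's tail pointing back at the second) renders "LazyList(1, 2, 1, <cycle>)".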


Serialization is actually not compatible, because un-evaluated lazy lists are serialized using plain object serialization. I don't think we can work around this. But we don't guarantee serialization compatibility between releases, so 🤷‍♂️.

@Ichoran (Contributor) commented Nov 29, 2024

@lrytz - I guess this is about the best that can be hoped for given the binary compatibility requirements. There is the one extra creation of a LazyList that you destructure and throw away, but reworking everything to use a mutating visitor instead of a regular lambda is probably (1) hard to get right and (2) hard to make binary compatible (because you inherently need visibility into structure that used to not exist).

So I think the performance overhead, if it's an issue, would be the only thing remaining (not counting correctness, of course; I can't eyeball that, but there are a fair number of unit tests, plus some collection-laws).

I'm pretty sure if two different threads are building up erroneously mutually referential lists, you can get a deadlock instead of an error condition here. Maybe we don't care.

If thread 1 is trying to get the next item in xs, and to do so it starts traversing ys; but meanwhile, thread 2 is trying to extend ys and to do that it's looking into xs; you can have thread 1 enter the synchronized block of the next uninitialized xs, then run down ys and wait for ys's synch lock to be free. But thread 2 has it, runs down xs, hits the sync block of xs...and...no more progress can be made.

Now, this is an error condition. If ys(i) depends on xs(j) and xs(j) depends on ys(i), there's no way to perform the initialization. But you won't get the friendly "hey, re-entrancy means you messed up" exception; you won't even get a stack overflow.

I don't know if robustness to this kind of mess-up is big enough to worry about. To do it, you'd need to drop synchronization while executing the lambda, then try to pick it up again. This would allow one (and eventually both) threads to crash out with exceptions. But under contention, this might increase the cost considerably. Maybe with the volatile var guard, it wouldn't actually matter much. But I wouldn't bet on it without a test, and it's really hard to microbenchmark multi-threaded code.

@lrytz (Member, Author) commented Dec 2, 2024

@Ichoran you're right that a lot of short lived objects are allocated while evaluating lazy lists.

For example

  • LazyList.from(1).take(N).toList allocates N*4 LazyList instances during evaluation
  • LazyList.from(1).take(N+1).take(N).toList allocates N*6
  • (1 #:: 2 ... #:: N #:: LazyList.empty).toList allocates N*3

There are the Function0 allocations on top of that (though not one for every LazyList allocation).

I compared the state of this PR with the current LazyList; it's exactly the same (when adding up LazyList and State instances in the current implementation).

Maybe that situation can also be improved..? It doesn't seem to hold up this PR though.

Comparing allocation to Stream:

  • Stream.from(1).take(N).toList allocates N*2 Stream instances
  • Stream.from(1).take(N+1).take(N).toList allocates N*3
  • (1 #:: 2 ... #:: N #:: Stream.empty).toList allocates N

For the deadlock scenario, I think it's the same with the current LazyList implementation, which uses val state: State[A] = { ... lazyState() ... }, no?

@Ichoran (Contributor) commented Dec 2, 2024

@lrytz - I think the deadlock situation is different because lazy val doesn't lock, does it? It uses a CAS-like mechanism, right? (Multiple attempts are cool; whoever finishes first wins and gets to set the cached value.) If that's correct, I think the lambda goes through, tries to initialize the lazy val a second time, and is caught by the re-entrant detection code.

Regarding the temporary object overhead--yeah, I wasn't thinking it was particularly worse. Just that it's a shame, in general. But no reason to fix it with this PR; the memory savings are the key part.

@lrytz (Member, Author) commented Dec 2, 2024

@Ichoran lazy vals are evaluated at most once; I posted the desugaring in a comment above: #10937 (comment)

@Ichoran (Contributor) commented Dec 2, 2024

@lrytz - Oh, oops. Then I agree: the deadlock issue is the same.

@lrytz (Member, Author) commented Dec 2, 2024

lazy vals are evaluated at most once

modulo exceptions, that is...

scala> lazy val x = { println("x"); throw new RuntimeException(); 1 }

scala> x
x
java.lang.RuntimeException

scala> x
x
java.lang.RuntimeException

@Ichoran (Contributor) commented Dec 2, 2024

@lrytz - Yes, that was my mistake: I was incorrectly inferring the general behavior from the behavior with exceptions.

@NthPortal (Contributor) commented Dec 4, 2024

could you explain this in more detail, I don't know what you mean. The selfReferentialFailure JUnit test passes.

@lrytz my comment was about your code snippet that didn't have MidEvaluation, so it's moot. I was concerned about the following:

lazy val ll: LazyList[Int] = 1 #:: ll.drop(1)
ll.tail.head // oops

@lrytz lrytz force-pushed the t13062 branch 3 times, most recently from 250431f to 770b377 Compare December 4, 2024 15:39
@NthPortal (Contributor) left a review:

I love this—thank you for all the work you've put into this Lukas! 💜

I just have a few nitpicks, some bikeshedding, and a single stack safety question that I'm very unsure about, but I'd be in no way unhappy to see it merged as is

Review threads (outdated, resolved) on:

  • src/library/scala/collection/immutable/SeqMap.scala
  • src/library/scala/collection/immutable/LazyList.scala (2 threads)
  • test/junit/scala/collection/immutable/LazyListTest.scala (3 threads)
@NthPortal (Contributor) commented

hey Lukas, have you signed the CLA? the bot isn't sure 😜

Commit: "Also simplify a couple of operation implementations and align the helper method names with the current design."
@lrytz (Member, Author) commented Dec 5, 2024

Thanks a lot for the sharp-eyed (and -brained) review @NthPortal!

@NthPortal (Contributor) left a review:

this popped into my mind late last night (and thankfully I remembered it in the morning), and it might be a blocker =/

@lrytz lrytz force-pushed the t13062 branch 3 times, most recently from cc7fe7a to 58e886e Compare December 10, 2024 20:31
@@ -349,13 +396,13 @@ class LazyListTest {

// Check forced cyclic and non-cyclic streams
assertEquals("LazyList(-2, -1)", pre(2).force.toString)
assertEquals("LazyList(0, 1, <cycle>)", cyc(2).force.toString)
assertEquals("LazyList(-2, -1, 0, 1, <cycle>)", precyc(2,2).force.toString)
assertEquals("LazyList(0, 1, 0, <cycle>)", cyc(2).force.toString)
A contributor commented on this diff:

it occurred to me that we could almost certainly prevent the duplicate element by buffering the last element in a var and appending it "late" (and skipping that append if we detect a cycle)
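A rough Java sketch of that suggestion (BufferedRender and Cell are invented names, not the actual addStringNoForce; it assumes, as in this PR, that the revisited instance is reached one element after the cycle's first element, so the buffered element is exactly the duplicate to skip):

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Sketch of "buffer the last element and append it late": each element is
// held back one step, and the held element is dropped when the next cell
// closes a cycle, so the cycle-start element is not printed twice.
final class BufferedRender {
    static final class Cell {
        final int head;
        Cell tail;
        Cell(int head) { this.head = head; }
    }

    static String render(Cell xs) {
        Set<Cell> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        StringBuilder sb = new StringBuilder("LazyList(");
        Cell cur = xs;
        Integer buffered = null; // last element, appended "late"
        boolean cycle = false;
        while (cur != null) {
            if (!seen.add(cur)) { cycle = true; break; }
            if (buffered != null) sb.append(buffered).append(", ");
            buffered = cur.head;
            cur = cur.tail;
        }
        if (cycle) sb.append("<cycle>");         // skip the buffered duplicate
        else if (buffered != null) sb.append(buffered);
        return sb.append(")").toString();
    }
}
```

On the cyclic shape from earlier in the thread (cells 1 -> 2 -> 1, third cell's tail pointing back at the second) this yields "LazyList(1, 2, <cycle>)" instead of "LazyList(1, 2, 1, <cycle>)", while non-cyclic lists render unchanged.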

@lrytz (Member, Author) replied:

Right, thank you for the idea. I cleaned up addStringNoForce and implemented your suggestion.

Successfully merging this pull request may close these issues.

LazyList uses 2x memory for evaluated values compared to Stream
7 participants