
Atomics vs Load Linked & Store Conditional (LL & SC)

AndyGlew edited this page Aug 19, 2020 · 2 revisions

Reviving a page originally posted to http://semipublic.comp-arch.net/wiki/Load-linked/store-conditional_(LL/SC)

The load-linked/store-conditional instruction pair provides a RISC flavor of atomic RMW synchronization, emphasizing primitives that can be flexibly composed. It can be viewed as a minimal form of transactional memory.

Part of the RISC philosophy was to espouse load/store architecture - instruction sets that separated load and store memory operations from computational or ALU operations such as add or increment. This works fine for single processor operations, but runs into problems for multiprocessor synchronization.

Table of Contents

Pre-LL/SC

After years of evolution, and prior to the so-called RISC revolution, multiprocessor instruction sets were converging on simple atomic RMW instructions such as compare-and-swap, atomic increment memory, fetch-and-add, and other fetch-and-ops. These atomic RMWs can be seen as composed of fairly simple primitives, for example:

     [[locked increment memory]]
          begin-atomic
              tmp := load(MEM)
              tmp := tmp+1
              store( MEM, tmp )
          end-atomic

However, the implementation of begin-atomic/end-atomic is not necessarily simple.

The atomicity can be provided in a simple way:

     [[locked increment memory]]
              tmp := [[load-locked]](MEM)
              tmp := tmp+1
              [[store-unlock]]( MEM, tmp )

Where load-locked may be

  • implemented at the memory module
    • locking the entire module
    • or a limited number of memory locations at the module
    • or potentially an arbitrary number of memory locations, using per-location lock-bit memory metadata, e.g. stolen from ECC
  • or implemented via a bus lock
  • or implemented via a cache protocol
    • e.g. an address lock: acquiring ownership of the cache line, and then refusing to respond to snoops or probes until the address lock is released
    • or, more primitive: acquiring ownership of the cache line, and then acquiring a cache lock, locking the entire cache or a fraction thereof

Interestingly, implementing locking at the memory module quite possibly came first, since many early multiprocessor systems were not snoopy or bus-based.

So far, so good: load-locked and store-unlock are somewhat RISCy.

But they have a problem: as separate instructions, load-locked and store-unlock raise security, performance, and reliability problems in some (but not necessarily all) implementations.

E.g. what happens if a user uses load-locked to lock the bus, and never unlocks it? That *might* be interpreted as giving one user exclusive access to the bus.

Obviously, one can eliminate these problems by architecture and microarchitecture. But doing so is a source of complexity. Many, probably most, implementations have found it easier to package the atomic RMW so that the primitive operations are not exposed to the user.

The basic idea behind LL/SC is that, instead of guaranteeing forward progress with a load-locked that always succeeds but may deadlock, load-linked employs optimistic concurrency control. Software assumes the link held, performs the operation it wants to be atomic, and then tries to commit that operation using store-conditional.

If the operation was atomic, i.e. if nobody else had accessed the memory location load-link'ed, the store-conditional would proceed.

But if the operation was not atomic, i.e. if somebody else had accessed the memory location, the store-conditional would fail. And the user software would be required to handle that failure, e.g. by looping.

      fetch-and-add:
          L:  oldval := load-linked MEM
              newval := oldval+a
              store-conditional( MEM, newval )
              if SC_failed goto L
              return oldval
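The same retry structure can be written in portable C11. C has no way to express LL/SC directly, but on LL/SC machines (ARM, Power, RISC-V) compilers typically expand `atomic_compare_exchange_weak` into exactly such a loop; the "weak" form is even permitted to fail spuriously, just as a store-conditional can. This is a sketch of the fetch-and-add above, with the CAS standing in for the SC:

```c
#include <stdatomic.h>

/* Fetch-and-add with the same retry structure as the LL/SC loop
 * above. On LL/SC architectures, atomic_compare_exchange_weak
 * typically compiles down to a load-linked/store-conditional loop;
 * "weak" may fail spuriously, just as an SC can. */
int fetch_and_add(_Atomic int *mem, int a) {
    int oldval = atomic_load(mem);
    /* On failure, oldval is refreshed with the current value and we
     * retry -- the "goto L" of the pseudocode above. */
    while (!atomic_compare_exchange_weak(mem, &oldval, oldval + a))
        ;
    return oldval;
}
```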

Typically LL/SC are implemented via a snoopy bus protocol: a memory location is "linked" by putting its address into a snooper. If another processor writes to that location, the link is broken so that the SC will fail.

Some implementations did not implement an address snooper - they might fail if there was ANY other activity on the bus. Needless to say, this does not exhibit good performance on contended systems.

Non-snoopy implementations of LL/SC are possible. E.g. implementing them at a memory module is possible. TBD.

As with more extensive forms of transactional memory, LL/SC has issues with fairness and forward progress. It is not desirable to loop forever in the LL/SC code. This is solvable - through much the same mechanisms as one uses to implement a locked atomic RMW with load-locked.

One way amounts to converting a load-linked into a load-locked after a certain number of failures.
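In software terms, that conversion can be sketched as a bounded optimistic loop with a lock fallback. This is only an analogue (the retry bound, names, and the spinlock are illustrative; real hardware would do the equivalent below the ISA):

```c
#include <stdatomic.h>

/* Software analogue of "convert load-linked into load-locked after
 * N failures": bound the optimistic retries, then fall back to a
 * path that is guaranteed to complete. The bound and the names are
 * illustrative, not from any particular architecture. */
#define MAX_TRIES 8
static atomic_flag fallback_lock = ATOMIC_FLAG_INIT;

int fetch_and_add_bounded(_Atomic int *mem, int a) {
    int oldval = atomic_load(mem);
    for (int i = 0; i < MAX_TRIES; i++)
        if (atomic_compare_exchange_weak(mem, &oldval, oldval + a))
            return oldval;                 /* optimistic path won */
    /* Pessimistic path: serialize fallback users behind a lock. */
    while (atomic_flag_test_and_set(&fallback_lock))
        ;                                  /* spin */
    /* Must still be atomic: other threads may be in the CAS loop. */
    oldval = atomic_fetch_add(mem, a);
    atomic_flag_clear(&fallback_lock);
    return oldval;
}
```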

Synthesizing Complex RMWs

The nice thing about LL/SC is that it can be used to implement many different forms of synchronization, e.g. almost any form of fetch-and-op. Say you want floating-point fetch-and-add: you can build that with LL/SC. Whereas if you don't have LL/SC, and just have a fixed repertoire of integer atomic RMWs, you may be out of luck.

This *is* one of the big advantages of RISC: provide primitives, rather than full solutions.
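As a concrete instance, here is a floating-point fetch-and-add composed from the generic retry loop, sketched in C11 with a compare-exchange standing in for the SC (the function name is illustrative). The conditional store operates on the float's raw bits, while the "op" in the middle is ordinary FP addition:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* A floating-point fetch-and-add composed from the generic retry
 * loop. No common ISA offers this as a single fixed atomic RMW, but
 * the primitive structure lets software build it: the conditional
 * store works on the float's raw 32-bit pattern. */
float fetch_and_add_float(_Atomic uint32_t *mem, float a) {
    uint32_t oldbits = atomic_load(mem);
    for (;;) {
        float oldval, newval;
        uint32_t newbits;
        memcpy(&oldval, &oldbits, sizeof oldval);  /* bits -> float */
        newval = oldval + a;                       /* arbitrary FP op */
        memcpy(&newbits, &newval, sizeof newbits); /* float -> bits */
        if (atomic_compare_exchange_weak(mem, &oldbits, newbits))
            return oldval;                         /* "SC" succeeded */
        /* on failure oldbits now holds the current value; retry */
    }
}
```

With only a fixed integer fetch-and-add in hardware, this operation could not be expressed at all; with the LL/SC-style primitive it is a few lines.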

The problem, of course, is what was mentioned above: "plain" LL/SC may have problems with fairness and forward progress. Mechanisms that solve these problems for LL/SC may be more complicated than they would be for non-LL/SC instructions.

See list of possible atomic RMW operations.

The "most natural" implementation of LL/SC is to pull the data into registers and/or cache of the processor on which the instructions are executing. By "the instructions" I mean those before the LL, the LL itself, between the LL and SC, the SC itself, and after the SC.

This is also the most natural implementation for locked implementations of atomic RMWs:

     [[LOCK INC mem]]
          tmp := [[load-locked]] M[addr]
          dstreg := tmp+1
          [[store-unlock]] M[addr] := dstreg

I call this the "local implementation" of the atomic RMW.

However, many systems have found it useful to create "remote implementations" of the atomic RMW.

For example, special bus or interconnect transactions could be created that indicate that the memory controller should do the atomic RMW operation itself.

For example, the NYU Ultracomputer allowed many fetch-and-add operations to be exported. They could be performed at the ultimate destination, the memory controller. But they could also be handled specially at intermediate nodes in the interconnection fabric, where conflicting atomic RMWs to the same location could be combined, forwarding only their net effect on towards the destination.

You can imagine such remote atomic RMWs as being performed by

     [[LOCK INC mem]]
          send the command "fetch-and-add M[addr],1" to the outside system

This is the advantage of CISCy atomic RMW instructions: they can hide the implementation. If the interconnect fabric supports only fetch-and-add but not fetch-and-OR, then the two respective microoperation expansions might be:

     [[LOCK INC mem]]
          send the command "fetch-and-add M[addr],1" to the outside system
     [[LOCK FETCH-AND-OR mem]]
          oldval := [[load-locked]] M[addr]
          newtmp := oldval OR src1
          [[store-unlock]] M[addr] := newtmp

For that matter, the CISC implementation might well use LL/SC optimistic concurrency control within its microflow.

It is more work to create such a remote atomic RMW with LL/SC, since the very strength of LL/SC - that the user can place almost arbitrarily complicated instruction sequences between the LL and SC - is also its weakness. If you wanted to create a remote atomic RMW implementation that supported just, say, fetch-and-add, then you would have to create an idiom recognizer that recognized code sequences such as:

      fetch-and-add:
          L:  oldval := load-linked MEM
              newval := oldval+a
              store-conditional( MEM, newval )
              if SC_failed goto L
              return oldval

Actually, you might not have to recognize the full sequence, complete with the retry loop. You would only need to recognize the sequence

              oldval := load-linked MEM
              newval := oldval+a
              store-conditional( MEM, newval )

and convert that into whatever bus operations create the remote atomic RMW.

Recognizing a 3-instruction idiom is not THAT hard. But in general it is harder to combine separate instructions than to break up complex instructions in an instruction decoder.

One might even imagine creating a fairly complete set of instructions executable on the interconnect.

The best of both worlds?

Let me end this section by describing a hybrid implementation of the CISC atomic RMWs that combines the best of both worlds wrt local and remote atomic RMWs.

     [[LOCK FETCH-AND-OP mem]]
          oldval := [[load-locked/fetch-and-op]] M[addr]
          newtmp := oldval OP src1
          [[store-unlock/fetch-and-op]] M[addr] := newtmp
          return oldval

If M[addr] is exclusive in the local cache, then the load-locked/fetch-and-op and store-unlock/fetch-and-op are equivalent to ordinary load-locked and store-unlock.

If M[addr] is not present in the local cache, then load-locked/fetch-and-op is equivalent to sending a remote atomic RMW fetch-and-op command to the external network. The store-unlock/fetch-and-op may then become a NOP (although it may be used for bookkeeping purposes).

If M[addr] is present in the local cache in a shared state, then we *could* perform the atomic RMW both locally and remotely. The remote version might invalidate or update other copies. However, if the operation is supposed to be serializing, then the local update cannot proceed without coordinating with the remote.

Extending the semantics

PAC:

(Paul Clayton added this.)

Because existing LL/SC definitions either fail or have undefined semantics if other memory operations are performed, the semantics can be safely extended without sacrificing compatibility. (While it may be possible in some architectures (TBD: check whether this is the case for Alpha, MIPS, ARM, Power, etc.) that a non-SC memory operation was used by software to cancel an atomic section, such use is generally discouraged and unlikely to have been done.) Such extensibility could allow increasingly extensive forms of transactional memory to be implemented without requiring additional instructions. While an implementation of an older architecture would not generate an illegal instruction, an architecture which guaranteed failure on additional memory accesses would guarantee either deadlock (which would simplify debugging and fixing) or the use of an already in-place software fallback (which, while slower, would maintain correctness). In architectures which use a full-sized register to return the success condition and define a single success condition, that register could be used to communicate failure information.

A natural progression for such extensions might be: to initially constrain the transaction to an aligned block of 32 or 64 bytes or the implementation-defined unit of coherence (This last could cause a significant performance incompatibility but would maintain the semantics.), perhaps with a single write, then to expand the semantics to something like those presented in Cliff Click's "IWannaBit!" where any cache miss or eviction cancels the reservation (For many uses, this would require that the cache be warmed up before the transaction can succeed. If the reservation mechanism is expensive, prefetching data outside of the atomic section might be preferred over retrying the transaction.), perhaps extending writes to an entire cache line, and then to an arbitrary number of memory accesses.

Providing non-transactional memory accesses within the atomic section would be more difficult. It would be possible to define register numbers which when used as base pointers for memory accesses have non-transactional semantics (See Semantic register numbers). This definition would have to be made at the first extension of LL/SC and it might be difficult to establish compiler support.

A non-transactional extension of LL/SC would be to guarantee the success of the store-conditional under certain limited conditions. E.g., a simple register-register operation might be guaranteed to succeed, providing the semantics of most RMW instructions. Such a guaranteed transaction could itself be extended to allow more complex operations, though it might be desirable to allow a transaction that can be guaranteed to fail if lock semantics are not desired under contention. E.g., a thread might work on other tasks rather than wait for the completion of a heavily contended locked operation.

TechPubs / Wiki Limitations

This page uses & in its title rather than /, e.g. LL & SC rather than the typical LL/SC, because of limitations in the GitHub wiki:

It does not urlencode slashes in wiki page names. Slashes can be forced into page names, but they cause problems.

Furthermore, I am editing this wiki by git cloning to cygwin on Windows. Even if / were urlencoded on the web, it is not in the file system, and it causes confusion with subdirectories.

See Also
