Improve Ruby string interning #3185

Closed
wants to merge 5 commits into from
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -6,7 +6,7 @@ New features:
Bug fixes:

* Fix `Dir.glob` returning blank string entry with leading `**/` in glob and `base:` argument (@rwstauner).
* Fix class lookup after an object's class has been replaced by `IO#reopen` (@itarato, @eregon).
* Fix class lookup after an object's class has been replaced by `IO#reopen` (@itarato, @nirvdrum, @eregon).
* Fix `Marshal.load` and raise `ArgumentError` when dump is broken and is too short (#3108, @andrykonchin).
* Fix `super` method lookup for unbounded attached methods (#3131, @itarato).
* Fix `Module#define_method(name, Method)` to respect `module_function` visibility (#3181, @andrykonchin).
@@ -41,6 +41,7 @@ Compatibility:
Performance:

* Improve `Truffle::FeatureLoader.loaded_feature_path` by removing expensive string ops from a loop. Speeds up feature lookup time (#3010, @itarato).
* Improve `String#-@` performance by reducing unnecessary data copying and supporting substring lookups (@nirvdrum).

Changes:

2 changes: 1 addition & 1 deletion doc/user/truffleruby-additions.md
@@ -60,7 +60,7 @@ TruffleRuby provides these non-standard methods and classes that provide additio

### Concurrent Maps

`TruffleRuby::ConcurrentMap` is a key-value data structure, like a `Hash` and using `#hash` and `#eql?` to compare keys and identity to compare values. Unlike `Hash` it is unordered. All methods on `TruffleRuby::ConcurrentMap` are thread-safe but should have higher concurrency than a fully syncronized implementation. It is intended to be used by gems such as [`concurrent-ruby`](https://github.com/ruby-concurrency/concurrent-ruby) - please use via this gem rather than using directly.
`TruffleRuby::ConcurrentMap` is a key-value data structure, like a `Hash` and using `#hash` and `#eql?` to compare keys and identity to compare values. Unlike `Hash` it is unordered. All methods on `TruffleRuby::ConcurrentMap` are thread-safe but should have higher concurrency than a fully synchronized implementation. It is intended to be used by gems such as [`concurrent-ruby`](https://github.com/ruby-concurrency/concurrent-ruby) - please use via this gem rather than using directly.

* `map = TruffleRuby::ConcurrentMap.new([initial_capacity: ...], [load_factor: ...])`

2 changes: 1 addition & 1 deletion spec/ruby/core/io/new_spec.rb
@@ -1,7 +1,7 @@
require_relative '../../spec_helper'
require_relative 'shared/new'

# NOTE: should be syncronized with library/stringio/initialize_spec.rb
# NOTE: should be synchronized with library/stringio/initialize_spec.rb

describe "IO.new" do
it_behaves_like :io_new, :new
2 changes: 1 addition & 1 deletion spec/ruby/core/io/shared/new.rb
@@ -1,6 +1,6 @@
require_relative '../fixtures/classes'

# NOTE: should be syncronized with library/stringio/initialize_spec.rb
# NOTE: should be synchronized with library/stringio/initialize_spec.rb

# This group of specs may ONLY contain specs that do successfully create
# an IO instance from the file descriptor returned by #new_fd helper.
6 changes: 6 additions & 0 deletions src/main/java/org/truffleruby/RubyLanguage.java
@@ -33,6 +33,7 @@
import com.oracle.truffle.api.source.Source;
import com.oracle.truffle.api.source.SourceSection;
import com.oracle.truffle.api.strings.AbstractTruffleString;
import com.oracle.truffle.api.strings.InternalByteArray;
import com.oracle.truffle.api.strings.TruffleString;
import org.graalvm.options.OptionDescriptors;
import org.truffleruby.annotations.SuppressFBWarnings;
@@ -788,6 +789,11 @@ public ImmutableRubyString getFrozenStringLiteral(TruffleString tstring, RubyEnc
return frozenStringLiterals.getFrozenStringLiteral(tstring, encoding);
}

public ImmutableRubyString getFrozenStringLiteral(InternalByteArray byteArray, boolean isImmutable,
RubyEncoding encoding) {
return frozenStringLiterals.getFrozenStringLiteral(byteArray, isImmutable, encoding);
}

public long getNextObjectID() {
final long id = nextObjectID.getAndAdd(ObjectSpaceManager.OBJECT_ID_INCREMENT_BY);

19 changes: 14 additions & 5 deletions src/main/java/org/truffleruby/core/encoding/TStringUtils.java
@@ -44,9 +44,13 @@ public static TruffleString.Encoding jcodingToTEncoding(Encoding jcoding) {
}

public static TruffleString fromByteArray(byte[] bytes, TruffleString.Encoding tencoding) {
return fromByteArray(bytes, 0, bytes.length, tencoding);
}

public static TruffleString fromByteArray(byte[] bytes, int offset, int length, TruffleString.Encoding tencoding) {
CompilerAsserts.neverPartOfCompilation(
"Use createString(TruffleString.FromByteArrayNode, byte[], RubyEncoding) instead");
return TruffleString.fromByteArrayUncached(bytes, 0, bytes.length, tencoding, false);
return TruffleString.fromByteArrayUncached(bytes, offset, length, tencoding, false);
}

public static TruffleString fromByteArray(byte[] bytes, RubyEncoding rubyEncoding) {
@@ -75,8 +79,7 @@ public static TruffleString fromJavaString(String javaString, RubyEncoding encod
public static byte[] getBytesOrCopy(AbstractTruffleString tstring, RubyEncoding encoding) {
CompilerAsserts.neverPartOfCompilation("uncached");
var bytes = tstring.getInternalByteArrayUncached(encoding.tencoding);
if (tstring instanceof TruffleString && bytes.getOffset() == 0 &&
bytes.getLength() == bytes.getArray().length) {
if (tstring.isImmutable() && bytes.getOffset() == 0 && bytes.getLength() == bytes.getArray().length) {
return bytes.getArray();
} else {
return ArrayUtils.extractRange(bytes.getArray(), bytes.getOffset(), bytes.getEnd());
@@ -88,8 +91,8 @@ public static byte[] getBytesOrCopy(Node node, AbstractTruffleString tstring, Tr
TruffleString.GetInternalByteArrayNode getInternalByteArrayNode,
InlinedConditionProfile noCopyProfile) {
var bytes = getInternalByteArrayNode.execute(tstring, encoding);
if (noCopyProfile.profile(node, tstring instanceof TruffleString && bytes.getOffset() == 0 &&
bytes.getLength() == bytes.getArray().length)) {
if (noCopyProfile.profile(node,
tstring.isImmutable() && bytes.getOffset() == 0 && bytes.getLength() == bytes.getArray().length)) {
return bytes.getArray();
} else {
return ArrayUtils.extractRange(bytes.getArray(), bytes.getOffset(), bytes.getEnd());
@@ -149,4 +152,10 @@ public static String toJavaStringOrThrow(AbstractTruffleString tstring, RubyEnco
return tstring.toJavaStringUncached();
}
}

public static boolean hasImmutableInternalByteArray(AbstractTruffleString string) {
// Immutable strings trivially have immutable byte arrays.
// Native strings also have immutable byte arrays because we need to copy the data into Java.
return string.isImmutable() || string.isNative();
}
}
Original file line number Diff line number Diff line change
@@ -11,6 +11,7 @@

import com.oracle.truffle.api.CompilerDirectives;
import com.oracle.truffle.api.CompilerDirectives.TruffleBoundary;
import com.oracle.truffle.api.strings.InternalByteArray;
import com.oracle.truffle.api.strings.TruffleString;
import org.truffleruby.collections.WeakValueCache;
import org.truffleruby.core.encoding.RubyEncoding;
@@ -37,25 +38,23 @@ public FrozenStringLiterals(TStringCache tStringCache) {

@TruffleBoundary
public ImmutableRubyString getFrozenStringLiteral(TruffleString tstring, RubyEncoding encoding) {
if (tstring.isNative()) {
throw CompilerDirectives.shouldNotReachHere();
}

return getFrozenStringLiteral(TStringUtils.getBytesOrCopy(tstring, encoding), encoding);
return getFrozenStringLiteral(tstring.getInternalByteArrayUncached(encoding.tencoding),
TStringUtils.hasImmutableInternalByteArray(tstring),
encoding);
}

@TruffleBoundary
public ImmutableRubyString getFrozenStringLiteral(byte[] bytes, RubyEncoding encoding) {
public ImmutableRubyString getFrozenStringLiteral(InternalByteArray byteArray, boolean isImmutable,
RubyEncoding encoding) {
// Ensure all ImmutableRubyString have a TruffleString from the TStringCache
var cachedTString = tstringCache.getTString(bytes, encoding);
var cachedTString = tstringCache.getTString(byteArray, isImmutable, encoding);
var tstringWithEncoding = new TStringWithEncoding(cachedTString, encoding);

final ImmutableRubyString string = values.get(tstringWithEncoding);
if (string != null) {
return string;
} else {
return values.addInCacheIfAbsent(tstringWithEncoding,
new ImmutableRubyString(cachedTString, encoding));
return values.addInCacheIfAbsent(tstringWithEncoding, new ImmutableRubyString(cachedTString, encoding));
}
}

7 changes: 4 additions & 3 deletions src/main/java/org/truffleruby/core/string/StringNodes.java
@@ -4357,10 +4357,11 @@ public abstract static class InternNode extends PrimitiveArrayArgumentsNode {
@Specialization
protected ImmutableRubyString internString(RubyString string,
@Cached RubyStringLibrary libString,
@Cached TruffleString.AsManagedNode asManagedNode) {
@Cached TruffleString.GetInternalByteArrayNode getInternalByteArrayNode) {
var encoding = libString.getEncoding(string);
TruffleString immutableManagedString = asManagedNode.execute(string.tstring, encoding.tencoding);
return getLanguage().getFrozenStringLiteral(immutableManagedString, encoding);
var byteArray = getInternalByteArrayNode.execute(string.tstring, encoding.tencoding);
return getLanguage().getFrozenStringLiteral(byteArray,
TStringUtils.hasImmutableInternalByteArray(string.tstring), encoding);
}
}

80 changes: 76 additions & 4 deletions src/main/java/org/truffleruby/core/string/TBytesKey.java
@@ -12,19 +12,44 @@
import java.util.Arrays;
import java.util.Objects;

import com.oracle.truffle.api.strings.InternalByteArray;
import com.oracle.truffle.api.strings.TruffleString;
import org.truffleruby.core.array.ArrayUtils;
import org.truffleruby.core.encoding.RubyEncoding;
import org.truffleruby.core.encoding.TStringUtils;

public final class TBytesKey {
Member

Could you document here or somewhere else why we support offset and length? (to avoid extra copying for substrings during lookup, but not when inserting a new entry, ref #makeCacheable)


private final byte[] bytes;
private final int offset;
Collaborator Author

Since we cannot construct TruffleString's InternalByteArray, we basically have to recreate it here. It'd be nice if there were a cleaner option.

Member

On the plus side this actually consumes less memory (i.e., we don't keep the InternalByteArray instance around).

private final int length;
private RubyEncoding encoding;
private final int bytesHashCode;

public TBytesKey(byte[] bytes, RubyEncoding encoding) {
public TBytesKey(
byte[] bytes,
int offset,
int length,
int bytesHashCode,
RubyEncoding encoding) {
this.bytes = bytes;
this.offset = offset;
this.length = length;
this.bytesHashCode = bytesHashCode;
this.encoding = encoding;
this.bytesHashCode = Arrays.hashCode(bytes);
}

public TBytesKey(byte[] bytes, RubyEncoding encoding) {
this(bytes, 0, bytes.length, Arrays.hashCode(bytes), encoding);
}

public TBytesKey(InternalByteArray byteArray, RubyEncoding encoding) {
this(
byteArray.getArray(),
byteArray.getOffset(),
byteArray.getLength(),
hashCode(byteArray),
encoding);
}

@Override
@@ -37,15 +62,15 @@ public boolean equals(Object o) {
if (o instanceof TBytesKey) {
final TBytesKey other = (TBytesKey) o;
if (encoding == null) {
if (Arrays.equals(bytes, other.bytes)) {
if (equalBytes(this, other)) {
// For getMatchedEncoding()
this.encoding = Objects.requireNonNull(other.encoding);
return true;
} else {
return false;
}
} else {
return encoding == other.encoding && Arrays.equals(bytes, other.bytes);
return encoding == other.encoding && equalBytes(this, other);
}
}

@@ -62,4 +87,51 @@ public String toString() {
return TruffleString.fromByteArrayUncached(bytes, encoding, false).toString();
}

private static int hashCode(InternalByteArray byteArray) {
Collaborator Author

I think it would be nice if we could upstream this into Truffle's ArrayUtils.

Member

I think InternalByteArray would be a better place; TruffleString has the byte[] intrinsics nowadays.

Collaborator Author

That'd be fine, too. There had been some concerns previously about exposing public APIs in something named "internal", but I don't really care where it goes.

return hashCode(byteArray.getArray(), byteArray.getOffset(), byteArray.getLength());
}

// A variant of <code>Arrays.hashCode</code> that allows for selecting a range within the array.
private static int hashCode(byte[] bytes, int offset, int length) {
if (bytes == null) {
return 0;
}

int result = 1;
for (int i = offset; i < offset + length; i++) {
result = 31 * result + bytes[i];
}

return result;
}
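
As a quick sanity check of the ranged hash above, here is a minimal standalone sketch. The class name `RangeHashSketch` is hypothetical and not part of this PR. It assumes the intent is that a slice of a larger array hashes to the same value as `Arrays.hashCode` applied to a fresh copy of that slice, which is what would let a slice key and a perfect-fit key find each other in a hash-based cache.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; only the hashing loop mirrors the diff above.
final class RangeHashSketch {

    // Same polynomial as Arrays.hashCode(byte[]), restricted to [offset, offset + length).
    static int hashCode(byte[] bytes, int offset, int length) {
        if (bytes == null) {
            return 0;
        }

        int result = 1;
        for (int i = offset; i < offset + length; i++) {
            result = 31 * result + bytes[i];
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] backing = "hello, world".getBytes(StandardCharsets.US_ASCII);
        int offset = 7;
        int length = 5; // the "world" slice

        // Copying the slice and hashing the copy should give the same value
        // as hashing the slice in place.
        byte[] copy = Arrays.copyOfRange(backing, offset, offset + length);
        System.out.println(hashCode(backing, offset, length) == Arrays.hashCode(copy)); // prints "true"
    }
}
```

The point of hashing in place is that a lookup never needs the intermediate copy; the copy here exists only to compare the two results.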

private boolean equalBytes(TBytesKey a, TBytesKey b) {
if (a.isPerfectFit() && b.isPerfectFit()) {
return Arrays.equals(a.bytes, b.bytes);
}
Comment on lines +109 to +111
Member

There doesn't seem to be a big perf advantage here so I'd just remove this and use the general case below.

Collaborator Author (@nirvdrum, Aug 8, 2023)

Will Graal optimize both cases? Arrays.equals(byte[], byte[]) is annotated as @IntrinsicCandidate, while the variant with specified offsets and end points is not. I thought that would be useful for the interpreter, at least. Granted, I didn't benchmark the two.

Member

They both end up in vectorizedMismatch and that's intrinsified by Graal, yes

Member

Arrays.equals(byte[], byte[]) is also intrinsified by Graal so maybe it's a tiny bit better.
I guess the only way to know for sure is to benchmark it.

Member

It's fine as it is, no need to spend too much time on it, either way is fine.


return Arrays.equals(a.bytes, a.offset, a.offset + a.length, b.bytes, b.offset, b.offset + b.length);
}
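
To illustrate the two comparison paths discussed in the thread above, here is a small sketch that works on raw arrays instead of `TBytesKey` (the class `RangeEqualsSketch` and its parameter names are hypothetical). It only shows that the whole-array call and the ranged overload of `Arrays.equals` agree on the result; it makes no claim about which one is faster, since nobody in the thread benchmarked them.

```java
import java.util.Arrays;

// Hypothetical demo class operating on raw arrays rather than TBytesKey.
final class RangeEqualsSketch {

    static boolean equalBytes(byte[] a, int aOffset, int aLength,
            byte[] b, int bOffset, int bLength) {
        boolean aPerfectFit = aOffset == 0 && aLength == a.length;
        boolean bPerfectFit = bOffset == 0 && bLength == b.length;

        // Fast path: both ranges cover their whole arrays.
        if (aPerfectFit && bPerfectFit) {
            return Arrays.equals(a, b);
        }

        // General path: ranged overload (available since Java 9), end indices are exclusive.
        return Arrays.equals(a, aOffset, aOffset + aLength, b, bOffset, bOffset + bLength);
    }

    public static void main(String[] args) {
        byte[] big = {9, 1, 2, 3, 9};
        byte[] small = {1, 2, 3};

        System.out.println(equalBytes(big, 1, 3, small, 0, 3));   // slice vs whole array
        System.out.println(equalBytes(small, 0, 3, small, 0, 3)); // both perfect fits
    }
}
```

Dropping the fast path, as suggested in the first review comment, would leave only the ranged call and should not change any result.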

private boolean isPerfectFit() {
return offset == 0 && length == bytes.length;
}

public TBytesKey makeCacheable(boolean isImmutable) {
Member

Could you document this? (it creates a perfect fit key, so it does not hold on to any extra bytes that it does not need when inserting a new entry to avoid leaking bytes outside the substring)

if (isImmutable && isPerfectFit()) {
return new TBytesKey(bytes, encoding);
}

var simplified = ArrayUtils.extractRange(this.bytes, this.offset, this.offset + this.length);
return new TBytesKey(simplified, encoding);
}
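
The decision made by `makeCacheable` can be sketched in isolation as follows. This is an illustration under assumptions: `CacheableBytesSketch` is a hypothetical class, it returns a bare `byte[]` instead of a `TBytesKey`, and `Arrays.copyOfRange` stands in for `ArrayUtils.extractRange`. The idea it tries to show is the one from the review comment: reuse the array only when it is immutable and exactly the right size, otherwise copy just the needed range so the cache does not retain a larger or mutable backing array.

```java
import java.util.Arrays;

// Hypothetical demo class; returns raw bytes instead of a TBytesKey.
final class CacheableBytesSketch {

    static byte[] makeCacheable(byte[] bytes, int offset, int length, boolean isImmutable) {
        boolean perfectFit = offset == 0 && length == bytes.length;

        // Sharing is only safe when nobody can mutate the array
        // and there are no extra bytes to keep alive.
        if (isImmutable && perfectFit) {
            return bytes;
        }

        // Otherwise store a compact private copy of the range.
        return Arrays.copyOfRange(bytes, offset, offset + length);
    }

    public static void main(String[] args) {
        byte[] whole = {1, 2, 3};
        byte[] big = new byte[1024];

        System.out.println(makeCacheable(whole, 0, 3, true) == whole);  // shared
        System.out.println(makeCacheable(whole, 0, 3, false) == whole); // copied, array is mutable
        System.out.println(makeCacheable(big, 10, 4, true).length);     // only the 4 needed bytes
    }
}
```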

public TBytesKey withNewEncoding(RubyEncoding encoding) {
return new TBytesKey(bytes, offset, length, bytesHashCode, encoding);
}

public TruffleString toTruffleString() {
return TStringUtils.fromByteArray(bytes, offset, length, encoding.tencoding);
}

}
36 changes: 27 additions & 9 deletions src/main/java/org/truffleruby/core/string/TStringCache.java
@@ -9,6 +9,7 @@
*/
package org.truffleruby.core.string;

import com.oracle.truffle.api.strings.InternalByteArray;
import com.oracle.truffle.api.strings.TruffleString;
import org.truffleruby.collections.WeakValueCache;
import org.truffleruby.core.encoding.Encodings;
@@ -69,20 +70,38 @@ private void register(TruffleString tstring, RubyEncoding encoding) {
}
}

public TruffleString getTString(TruffleString string, RubyEncoding encoding) {
return getTString(TStringUtils.getBytesOrCopy(string, encoding), encoding);
@TruffleBoundary
public TruffleString getTString(TruffleString string, RubyEncoding rubyEncoding) {
assert rubyEncoding != null;

var byteArray = string.getInternalByteArrayUncached(rubyEncoding.tencoding);
final TBytesKey key = new TBytesKey(byteArray, rubyEncoding);

return getTString(key, TStringUtils.hasImmutableInternalByteArray(string));
}

@TruffleBoundary
public TruffleString getTString(InternalByteArray byteArray, boolean isImmutable, RubyEncoding rubyEncoding) {
assert rubyEncoding != null;

return getTString(new TBytesKey(byteArray, rubyEncoding), isImmutable);
}

@TruffleBoundary
public TruffleString getTString(byte[] bytes, RubyEncoding rubyEncoding) {
Member

Should still be @TruffleBoundary, it's counter-productive to allocate the TBytesKey in PE code.

Collaborator Author

I really wish we had a reason field on @TruffleBoundary because it's easy to look at this and say "there's no reason this can't be compiled, so let's do away with the boundary". I've been operating under the assumption that anything that could run without a boundary should and let Truffle's heuristics sort out the rest.

Member

Feel free to add a code comment about it. Here it's a case of "no value to PE this code, and worse due to forcing an extra allocation that might not be needed otherwise". We should only PE what can benefit from PE, there is a warmup cost to PE too much code.

assert rubyEncoding != null;

final TBytesKey key = new TBytesKey(bytes, rubyEncoding);
return getTString(new TBytesKey(bytes, rubyEncoding), true);
}

@TruffleBoundary
private TruffleString getTString(TBytesKey lookupKey, boolean isLookupKeyImmutable) {
final TruffleString tstring = bytesToTString.get(lookupKey);
var rubyEncoding = lookupKey.getMatchedEncoding();

final TruffleString tstring = bytesToTString.get(key);
if (tstring != null) {
++tstringsReusedCount;
tstringBytesSaved += tstring.byteLength(rubyEncoding.tencoding);
tstringBytesSaved += tstring.byteLength(lookupKey.getMatchedEncoding().tencoding);

return tstring;
}
@@ -92,7 +111,7 @@ public TruffleString getTString(byte[] bytes, RubyEncoding rubyEncoding) {
// reference equality optimizations. So, do another search but with a marker encoding. The only guarantee
// we can make about the resulting TruffleString is that it would have the same logical byte[], but that's good enough
// for our purposes.
TBytesKey keyNoEncoding = new TBytesKey(bytes, null);
TBytesKey keyNoEncoding = lookupKey.withNewEncoding(null);
final TruffleString tstringWithSameBytesButDifferentEncoding = bytesToTString.get(keyNoEncoding);

final TruffleString newTString;
@@ -104,12 +123,11 @@ public TruffleString getTString(byte[] bytes, RubyEncoding rubyEncoding) {
++byteArrayReusedCount;
tstringBytesSaved += newTString.byteLength(rubyEncoding.tencoding);
} else {
newTString = TStringUtils.fromByteArray(bytes, rubyEncoding);
newTString = lookupKey.toTruffleString();
}

// Use the new TruffleString bytes in the cache, so we do not keep bytes alive unnecessarily.
final TBytesKey newKey = new TBytesKey(TStringUtils.getBytesOrCopy(newTString, rubyEncoding), rubyEncoding);
return bytesToTString.addInCacheIfAbsent(newKey, newTString);
return bytesToTString.addInCacheIfAbsent(lookupKey.makeCacheable(isLookupKeyImmutable), newTString);
}
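
The overall flow of "look up with a slice key, insert with a compact key" can be approximated outside of TruffleRuby as follows. Everything here is a stand-in chosen for illustration: `SliceKeyCacheSketch` and its nested `Key` are hypothetical, a plain `HashMap` replaces `WeakValueCache`, `String` replaces `TruffleString`, and encoding handling (including the second lookup with a `null` encoding) is left out entirely.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the cache; not the TruffleRuby implementation.
final class SliceKeyCacheSketch {

    static final class Key {
        final byte[] bytes;
        final int offset;
        final int length;
        final int hash;

        Key(byte[] bytes, int offset, int length) {
            this.bytes = bytes;
            this.offset = offset;
            this.length = length;
            int h = 1;
            for (int i = offset; i < offset + length; i++) {
                h = 31 * h + bytes[i];
            }
            this.hash = h;
        }

        // Copy only the viewed range so the stored key retains no extra bytes.
        Key compact() {
            return new Key(Arrays.copyOfRange(bytes, offset, offset + length), 0, length);
        }

        @Override
        public int hashCode() {
            return hash;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return Arrays.equals(bytes, offset, offset + length,
                    other.bytes, other.offset, other.offset + other.length);
        }
    }

    private final Map<Key, String> cache = new HashMap<>();

    String intern(byte[] bytes, int offset, int length) {
        Key lookup = new Key(bytes, offset, length); // no copy needed for a lookup
        String cached = cache.get(lookup);
        if (cached != null) {
            return cached;
        }

        String value = new String(bytes, offset, length, StandardCharsets.UTF_8);
        cache.put(lookup.compact(), value); // copy only when inserting
        return value;
    }

    int size() {
        return cache.size();
    }
}
```

With this model, interning the `abc` slice of a larger buffer and then interning a standalone `abc` array would return the same cached object, and only the first call pays for a copy.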

public boolean contains(TruffleString string, RubyEncoding encoding) {