HHVM Runtime Data Structures

This page contains a brief overview of the most important runtime data structures in HHVM. It is not meant to be an exhaustive reference for all details about these data structures, since those details change too often to reliably document so far away from the code. Rather, reading the descriptions here should leave you with a good high-level understanding of what each one is used for and where to find the code implementing it for further investigation.

Hack-visible values

DataType, Value, and TypedValue

DataType, defined in datatype.h, is an enum representing the type of a Hack value: KindOfNull means null, KindOfBoolean means bool, KindOfInt64 means int, etc. Some user-visible types are split into multiple DataTypes to keep track of whether or not a value is reference counted: KindOfPersistentFoo means "KindOfFoo that we know is not reference counted." The reverse, however, is not true: KindOfFoo does not guarantee that the value is reference counted. KindOfPersistentFoo is a subtype of KindOfFoo, so it is valid to use KindOfFoo with a non-reference-counted Foo.

TypedValue, defined in typed-value.h, represents a single Hack value, and appears in many different places in HHVM. It contains a DataType and a Value, which is a union with one member for every possible type. Primitive types (null, bool, int, and float) are stored inline in the TypedValue, while all other types heap-allocate a data object and store a pointer to it in the TypedValue.
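
The tagged-union idea described above can be illustrated with a small sketch. Every name here (`SketchDataType`, `SketchValue`, `SketchTypedValue`, `SketchHeapString`, and the helper functions) is hypothetical and chosen for illustration; the real definitions in datatype.h and typed-value.h have many more members and different details.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a heap-allocated string object.
struct SketchHeapString { int32_t refcount; };

// Hypothetical, heavily reduced type tag. The "Persistent" variant records
// that the value is known not to be reference counted.
enum class SketchDataType : int8_t {
  Null,
  Boolean,
  Int64,
  Double,
  PersistentString,  // string known not to be reference counted
  String,            // string that may or may not be reference counted
};

// One member per representation: primitives inline, everything else a pointer.
union SketchValue {
  int64_t num;
  double dbl;
  SketchHeapString* pstr;
};

struct SketchTypedValue {
  SketchValue m_data;
  SketchDataType m_type;
};

// Builds an int value; the payload is stored inline in the union.
inline SketchTypedValue makeInt(int64_t n) {
  SketchTypedValue tv;
  tv.m_data.num = n;
  tv.m_type = SketchDataType::Int64;
  return tv;
}

// Builds a string value; the pointer must never be null.
inline SketchTypedValue makeString(SketchHeapString* s, bool persistent) {
  assert(s != nullptr);
  SketchTypedValue tv;
  tv.m_data.pstr = s;
  tv.m_type = persistent ? SketchDataType::PersistentString
                         : SketchDataType::String;
  return tv;
}

// Only the non-persistent tag can possibly require refcount updates.
inline bool mayBeRefCounted(SketchDataType t) {
  return t == SketchDataType::String;
}

// Both tags describe the same user-visible type.
inline bool isStringType(SketchDataType t) {
  return t == SketchDataType::String ||
         t == SketchDataType::PersistentString;
}
```

In this sketch, code that only cares about the user-visible type would call `isStringType`, while code that manipulates reference counts can skip the work entirely when it sees the persistent tag.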

ArrayData

ArrayData is used to represent all array-like types in Hack: array, dict, vec, and keyset, though you'll never see any raw ArrayData objects created anywhere. Instead, a specific kind of array is created and tagged using one of the [several current array HeaderKind values](https://github.com/facebook/hhvm/blob/HHVM-3.27/hphp/runtime/base/header-kind.h#L46-L49). We use a custom vtable to dispatch to the appropriate implementation for all ArrayData member functions; the current implementation classes are PackedArray, MixedArray, and SetArray. Note that many of these types don't directly inherit from ArrayData, so they're only subtypes of ArrayData by convention.

Arrays in Hack have value semantics, implemented using copy-on-write. As a result, most member functions that perform mutations take a bool copy parameter to indicate whether or not the array should be copied before performing the mutation. It is up to the caller to call cowCheck() before any mutation to determine if a copy is necessary. Additionally, any mutation could cause allocation of a new ArrayData, either to grow the array or to escalate it to a different array kind. To support this, all mutation functions return a new ArrayData*; the ArrayData* that was mutated should be considered dead and you should use the new one in its place.
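
The calling convention described above can be sketched roughly as follows. `SketchArray`, its `append` member, and the `appendTo` helper are hypothetical; only the `cowCheck()` name and the `bool copy` parameter come from the text, and the real ArrayData API has different signatures and reference-counting helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical miniature of a reference-counted, copy-on-write array.
struct SketchArray {
  int32_t refcount = 1;
  std::vector<int64_t> elems;

  // True when another owner also holds this array, so an in-place
  // mutation would be visible to them.
  bool cowCheck() const { return refcount > 1; }

  // Appends v. When copy is true, the mutation is applied to a fresh copy
  // and this array is left untouched. Returns the array holding the result.
  SketchArray* append(int64_t v, bool copy) {
    SketchArray* target = this;
    if (copy) {
      target = new SketchArray;  // starts with refcount 1
      target->elems = elems;
    }
    target->elems.push_back(v);
    return target;
  }
};

// Caller-side pattern: check, mutate, then switch to the returned pointer.
inline SketchArray* appendTo(SketchArray* arr, int64_t v) {
  SketchArray* result = arr->append(v, arr->cowCheck());
  if (result != arr) {
    // Our reference moved to the copy, so drop it from the old array.
    --arr->refcount;
  }
  return result;  // the old pointer must not be used by this caller again
}
```

The important habit is the last line: the caller always continues with the returned pointer, whether or not a copy actually happened.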

StringData

StringData represents a Hack string value. Like arrays, strings have value semantics using copy-on-write, so callers are responsible for calling cowCheck() before mutating, although the copy must be done manually with StringData::Make() rather than a bool copy parameter to the mutator function.

Most StringDatas store their data in space allocated immediately after the StringData. This layout is required in USE_LOWPTR builds, so StringData::m_data is conditionally defined. For normal builds, m_data will usually point right after the StringData, but it may point elsewhere for strings from APC.
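
To make the "data allocated immediately after the header" layout concrete, here is a minimal sketch. `SketchString` and its `Make`, `Release`, and `data` members are hypothetical and only illustrate the general trailing-allocation technique; the real StringData has more fields, a different allocator, and the conditional m_data member mentioned above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <new>

// Hypothetical string header whose characters live right after it
// in the same allocation.
struct SketchString {
  uint32_t len;

  // The character data starts at the first byte past the header.
  char* data() { return reinterpret_cast<char*>(this + 1); }
  const char* data() const {
    return reinterpret_cast<const char*>(this + 1);
  }

  // Allocates header, characters, and a trailing NUL in one block.
  static SketchString* Make(const char* src, uint32_t len) {
    void* mem = std::malloc(sizeof(SketchString) + len + 1);
    if (mem == nullptr) throw std::bad_alloc();
    SketchString* s = new (mem) SketchString{len};
    std::memcpy(s->data(), src, len);
    s->data()[len] = '\0';
    return s;
  }

  // Frees the single block that holds both header and characters.
  static void Release(SketchString* s) { std::free(s); }
};
```

Because the data address can be computed from the header address, a layout like this does not need to store a separate data pointer at all, which is one reason such a field can be optional.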

ObjectData

ObjectData represents a Hack object. It contains a Class* describing its type (described below) and a series of attributes. Declared properties are stored in an array of TypedValues that is allocated after the ObjectData.

Smart pointer wrappers

All of the types described so far have smart pointer wrappers that are generally used for high-level non-performance-critical C++ code that needs to work with one of them. The wrapper type can be found by dropping the Data suffix, so StringData's wrapper is String. Note that like most smart pointer types, these wrappers can all represent a null value, but the pointer in a TypedValue representing an array, string, etc. must never be nullptr. A null value is represented using KindOfNull.
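
As a rough picture of what such a wrapper does, the sketch below pairs a reference-counted data object with a handle that adjusts the count automatically. `SketchStringData` and `SketchStringHandle` are hypothetical names; the real String, Array, and Object wrappers have much richer interfaces and different ownership rules.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reference-counted data object.
struct SketchStringData {
  int32_t refcount = 0;
};

// Hypothetical wrapper: holds one reference while alive, and may be null.
class SketchStringHandle {
 public:
  SketchStringHandle() = default;  // null wrapper
  explicit SketchStringHandle(SketchStringData* p) : m_ptr(p) { incRef(); }
  SketchStringHandle(const SketchStringHandle& o) : m_ptr(o.m_ptr) {
    incRef();
  }
  SketchStringHandle& operator=(const SketchStringHandle& o) {
    if (this != &o) {
      decRef();
      m_ptr = o.m_ptr;
      incRef();
    }
    return *this;
  }
  ~SketchStringHandle() { decRef(); }

  bool isNull() const { return m_ptr == nullptr; }
  SketchStringData* get() const { return m_ptr; }

 private:
  void incRef() {
    if (m_ptr) ++m_ptr->refcount;
  }
  // Drops our reference and frees the object when it was the last one.
  void decRef() {
    if (m_ptr && --m_ptr->refcount == 0) delete m_ptr;
    m_ptr = nullptr;
  }

  SketchStringData* m_ptr = nullptr;
};
```

Note the asymmetry the text points out: the wrapper itself may be null, whereas a raw pointer stored inside a TypedValue may not.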

Runtime-internal data structures

Unit

A Unit represents all of the information contained in one Hack file: classes, functions, constants, top-level code, etc. Entities that could be defined in another file are referenced only by name, even if they are defined in the current file. This includes, but is not limited to, function names in call expressions, parent class names, and used traits. This is to support one of HHBC's core principles: it is always possible to emit bytecode (and construct a Unit) for a single file in complete isolation. If that file references entities that are undefined at runtime, the appropriate error will be raised.

PreClass and Class

Every Unit has a list of classes that are defined in the file it came from. Each of these classes is stored as a PreClass, which represents the class definition as it is visible in the source code. Parent classes, used traits, and any other references to entities other than the class itself are stored as string names.

When a class is defined at runtime, these references are resolved to concrete entities, producing a Class. The resolved parent class is stored as a Class* rather than the string name from the PreClass.

This two-level representation of classes is necessary because the mapping between name and class can change with every request. All classes are redefined from scratch in each request<sup>1</sup>, and different definitions of a class can be selected by including different files, or by putting the definitions on different control flow paths. Properly-typed Hack code will never have multiple definitions of the same class, but HHVM still supports it.
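
The relationship between the two levels can be sketched as below. `SketchPreClass`, `SketchClass`, `SketchClassTable`, and `defineClass` are hypothetical simplifications: the real PreClass and Class carry far more state, and the real lookup and error handling work differently.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical source-level class definition: references are names only.
struct SketchPreClass {
  std::string name;
  std::string parentName;  // empty when the class has no parent
};

// Hypothetical runtime class: the parent name has been resolved to a pointer.
struct SketchClass {
  const SketchPreClass* preClass;
  const SketchClass* parent;
};

// Hypothetical per-request mapping from class name to defined class.
using SketchClassTable = std::unordered_map<std::string, SketchClass>;

// Defines `pre` in `table`, resolving its parent by name at that moment.
// Returns nullptr if the parent is not defined yet or the name is taken.
inline const SketchClass* defineClass(SketchClassTable& table,
                                      const SketchPreClass& pre) {
  const SketchClass* parent = nullptr;
  if (!pre.parentName.empty()) {
    auto it = table.find(pre.parentName);
    if (it == table.end()) return nullptr;
    parent = &it->second;
  }
  auto inserted = table.emplace(pre.name, SketchClass{&pre, parent});
  if (!inserted.second) return nullptr;
  return &inserted.first->second;
}
```

Using one table per request shows why the split matters: the same `SketchPreClass` for a child class can end up with different resolved parents depending on which definition of the parent name a given request happened to load.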

Func

All Hack functions, including methods on classes, are represented using a Func. Each Func contains a pointer to the Unit it was defined in, the Class it is a member of (if any), its location in the Unit's bytecode, information about its parameters, and various other metadata.

Every Func has a 32-bit FuncId, which is assigned in ascending order as each Func is created. There is a global table that allows mapping from this ID back to a Func*, and FuncIds are often used in places where we want to store a reference to a Func without a full 64-bit pointer<sup>2</sup>. FuncIds are also guaranteed to be unique for the lifetime of the process, unlike Func*.
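
A minimal sketch of an ID-to-pointer table of this kind is shown below. `SketchFuncId`, `SketchFunc`, and `SketchFuncTable` are hypothetical; the real table is global, must handle concurrent access, and is implemented differently.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical 32-bit identifier, half the size of a 64-bit pointer.
using SketchFuncId = uint32_t;

struct SketchFunc {
  std::string name;
  SketchFuncId id;
};

// Hypothetical table mapping ids back to function pointers.
class SketchFuncTable {
 public:
  // Registers f and assigns it the next id in ascending order.
  // Slots are never removed or reused, so ids stay unique.
  SketchFuncId add(SketchFunc* f) {
    f->id = static_cast<SketchFuncId>(m_funcs.size());
    m_funcs.push_back(f);
    return f->id;
  }

  // Maps an id back to the function registered under it.
  SketchFunc* fromId(SketchFuncId id) const {
    assert(id < m_funcs.size());
    return m_funcs[id];
  }

 private:
  std::vector<SketchFunc*> m_funcs;
};
```

Because ids in this sketch only ever grow, a stale id can never silently refer to a different function, whereas a freed pointer's address could later be reused by an unrelated allocation.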


1: This is what HHVM presents to the user, but we have mechanisms to avoid the work of actually redefining everything in every request when possible.

2: HHVM also has a USE_LOWPTR build mode that allocates certain data structures, including Funcs, in the lower 4GiB of address space, allowing us to store 32-bit pointers using HPHP::LowPtr<T>. However, LowPtr<T> is 64 bits in non-USE_LOWPTR builds, and some uses of FuncId rely on it being 32 bits in all build modes.