Proposal: Merge Template Generators and Vtables

Note

Feature awaiting further design consideration.

This is a proposal to significantly reduce the overhead of templates by merging template generators and vtables.

Introduction

Template generators make it possible to instantiate templates across API boundaries; however, in the current implementation they add overhead. For example, a reference to an interface is:

struct ref_t {
        void* this;
        void* vtable;
        void* template_generator;
        uint64_t path;
};

Likely overhead:

  • 32-bit machine: size = 20 bytes, overhead versus pointer pair = 2.5x
  • 64-bit machine: size = 32 bytes, overhead versus pointer pair = 2x

Other types, such as templated function pointers and typename_t, also need the { template_generator, path } pair. It is unfortunate to increase the size of an interface reference, since most such references refer to code that is either not templated or has been instantiated at compile-time.

This proposal removes the { template_generator, path } pair by incorporating that information into the vtable pointer:

struct ref_t {
        void* this;
        void* vtable;
};

Another issue is that templated code must perform an indirect call to its template generator to obtain its arguments. This is slow and creates more work for the optimiser to eliminate. This proposal avoids these indirect calls entirely: all templated code simply reads an array of template arguments provided to it.
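To make the difference concrete, here is a minimal C sketch of the two schemes. The tpl_generator_t shape and the cut-down callinfo_t are illustrative assumptions, not the actual ABI:

#include <stdint.h>

typedef struct { void* vtable; } typename_t;

/* Assumed shape of the current scheme's generator reference. */
typedef struct {
        void (*fn)(typename_t* types, uint64_t path);
        uint64_t path;
} tpl_generator_t;

typedef struct { typename_t types[8]; } callinfo_t;

/* Current scheme: an indirect call materialises the arguments. */
void f_current(tpl_generator_t gen) {
        typename_t types[8];
        gen.fn(types, gen.path);        /* extra indirect call */
        typename_t first_arg = types[0];
        (void)first_arg;
}

/* Proposed scheme: the arguments are already in the provided array. */
void f_proposed(const callinfo_t* callinfo) {
        typename_t first_arg = callinfo->types[0];
        (void)first_arg;
}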

Unifying template generators and vtables

Looking at the interface reference type above, we can make the following observations:

  • The language allows any type, including primitives like int, to be used polymorphically, so we can’t get any type information into/out of the this pointer. We therefore need at least an extra pointer.
  • The vtable is generated by the compiler at a point where the template arguments are unknown. For example, we know the vtable of std::varray<T>, but type T may come from another module.
  • There is a chain of template generators from the point where template arguments are known to where the vtable is known.

This means that the template generator can be used to perform the call into the vtable on dynamic dispatch's behalf, turning the interface reference type into:

struct ref_t {
        void* this;
        void* vtable;
        uint64_t path;
};

Accessing vtable via template generator

The vtable pointer would actually point to a vtable generated for the root template generator:

const vtable_t ROOT_vtable = { ROOT_fn, ROOT_fn, ROOT_fn, ... };

The root template generator simply sets up the template arguments and then calls the intermediate template generator:

<return arg> ROOT_fn(hidden callinfo_t* callinfo, ...<call args>...) {
        callinfo->types[0] = { VTABLE_int };
        return tailcall TPLGEN_f(callinfo, ...<call args>...);
}

At some point the template graph reaches the actual function/method to be called, which is then called directly (selected based on the method hash). This is facilitated by all template generators perfectly forwarding their arguments via tail calls.

Calls through template generator

Dynamic dispatch now calls the template generator directly. This means template generators no longer just produce template arguments; instead they call into the target function/method and hand it the arguments directly.

Dynamic dispatch doesn't know whether it is calling a normal vtable or a template generator vtable, since both now work identically. It does, however, have to allocate a callinfo_t on its stack in case it is calling a template generator vtable.

Dynamic dispatch calls therefore might look like:

int callMethod(ref_t reference, int arg0, int arg1) {
        typedef int (*method_type)(hidden callinfo_t*, int, int);
        method_type ptr = ((method_type*)reference.vtable)[METHOD_HASH_INDEX];

        callinfo_t callinfo;

        // Must set this to disambiguate which method should be called.
        callinfo.method_hash = METHOD_HASH;

        // Must set this to disambiguate which route to take down the template generator graph.
        callinfo.path = reference.path;

        return ptr(&callinfo, arg0, arg1);
}

By calling through the template generator we also eliminate the extra indirect call when the function/method tries to access its own template arguments.

Passing template arguments on stack

Currently all templated functions/methods receive a template generator function pointer and have to call that to generate their arguments. However, in most cases we can pass template arguments on the stack:

template <typename T>
void f() {
        g<T, short>();
}

In this case it seems like f() could pass its template arguments to g() on the stack:

void f(hidden const callinfo_t* callinfo) {
        callinfo_t g_callinfo = *callinfo;
        g_callinfo.types[1] = VTABLE_short;
        g(&g_callinfo);
}

Captured Templated References

While it is possible to use the stack to pass template arguments in many cases, it doesn’t work for cases where a reference to a templated function/method is captured:

template <typename T>
Interface& cast_to_interface(Class<T>& object) {
        return object;
}

Any method of Class<T> that is called via Interface& is likely to want to access its template arguments, but by that point the stack frame of cast_to_interface() will have unwound, so we can’t just put the template arguments on the stack.

Setting template arguments on stack in template generators

Template generators are the solution here, because they don’t rely on the program stack. However, calling them for every templated function/method is a significant overhead.

Instead, this proposal has every templated function/method receive:

  • Its template arguments.
  • A template generator (in the form of a vtable pointer).

The template arguments are set either by:

  • Direct call: The templated function’s/method’s caller.
  • Captured references: A template generator (which then calls the templated function/method).

This means that templated functions and methods no longer have to call a template generator to get their arguments.

Callee Captured References

For the call from f() to g() shown above, we still need to give g() a template generator, because it is possible that g() (or something it calls) captures a templated reference:

template <typename T>
void g() {
        Class<T>& v = get_class_ref();
        Interface& i = v;
        set_interface_ref(i);
}

The cast from Class<T>& to Interface& will take the template generator pointer given and insert it as the vtable field of the interface reference (ref_t).
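As a hedged sketch (names here are illustrative, and 'this' is renamed because it is reserved in C++), the lowering of this cast might look like:

#include <stdint.h>

typedef struct {
        void* this_ptr;
        void* vtable;
        uint64_t path;
} ref_t;

/* The template generator's vtable pointer (and the current path)
 * become the reference's dispatch information. */
ref_t lower_interface_cast(void* object, void* tplgen_vtable,
                           uint64_t path) {
        ref_t ref;
        ref.this_ptr = object;
        ref.vtable = tplgen_vtable;  /* template generator acts as vtable */
        ref.path = path;
        return ref;
}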

Template Generators Only Call Captured References

An important observation is that template generators now only call templated functions/methods captured by reference. If a templated function is only ever called directly (i.e. we don’t try to get a pointer to it and pass that around), it can never be called from a template generator and therefore template generators don’t need to include calls to it (since these calls would be dead code).

For example:

template <typename T>
void f() {
        g<T>();
        h<T>();
}

Initially from the perspective of the template generator of f(), there are four possibilities:

  • Call g
  • Call template generator of g
  • Call h
  • Call template generator of h

However we never capture a reference to g or h so there are only two possibilities:

  • Call template generator of g
  • Call template generator of h

Template Generators Don’t Call Their Own Function

Currently, template generators detect the end of the path corresponding to their own templated function/method and return at that point. However, now that this proposal has template generators call functions/methods directly, it makes more sense for the calling template generator to call the templated function/method:

(*<>)(void)() f() {
        return g<int>;
}

template <typename T>
void g() {
        h<T>();
}

The template generator for g has one possibility to proceed: call the template generator for h. It doesn’t call h since it never captured a reference to h. It also doesn’t call its own function (g) since it doesn’t capture a reference to that.

This means that the caller is responsible for calling g if a reference is captured. f does capture a reference to g, but the template arguments are fully known at that point so it just generates a stub:

void g_int_ref() {
        callinfo_t callinfo;
        callinfo.vtable = &g_int_ROOT_VTABLE;
        callinfo.types[0] = VTABLE_int;
        g(&callinfo);
}

(*<>)(void)() f() {
        return g_int_ref;
}

So no template generators in the chain call g because f can generate a stub.

Only the caller knows whether a reference was captured to the templated function/method, so it makes sense for it to be responsible for calling the function/method, since in cases such as that shown it can avoid adding the call to its template generator.

Encoding path into vtable pointer

The solution described above reduces an interface reference down to:

struct ref_t {
        void* this;
        void* vtable;
        uint64_t path;
};

We would like to remove the path element; this can only be achieved by encoding the path into the vtable pointer.

Available Bits from Vtable Alignment

There are some available bits in the vtable pointer due to the alignment of vtables. If a vtable contained 16 pointers then we would have:

  • 32-bit machine: vtable size = 64 bytes, (assume align=size), available bits = 6 bits
  • 64-bit machine: vtable size = 128 bytes, (assume align=size), available bits = 7 bits

These bits can be used to encode the path value. We do, however, need to clear the bits when performing dynamic dispatch:

int callMethod(ref_t reference, int arg0, int arg1) {
        typedef int (*method_type)(hidden callinfo_t*, int, int);

        void** fixed_vtable = (void**)((uintptr_t)reference.vtable & ~(uintptr_t)127);
        method_type ptr = (method_type)fixed_vtable[METHOD_HASH_INDEX];

        callinfo_t callinfo;

        // Must set this to disambiguate which method should be called.
        callinfo.method_hash = METHOD_HASH;

        // Must set this to disambiguate which route to take down the template graph.
        callinfo.vtable = reference.vtable;

        return ptr(&callinfo, arg0, arg1);
}
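The construction side is then the mirror image. A minimal sketch (assuming a 128-byte-aligned vtable, i.e. 7 available bits; 'this' renamed as it is reserved in C++) of building a reference with the path folded into the pointer's low bits:

#include <assert.h>
#include <stdint.h>

typedef struct {
        void* this_ptr;
        void* vtable;
} ref_t;

ref_t make_ref(void* object, void* vtable, uint64_t path) {
        assert(((uintptr_t)vtable & 127) == 0);  /* alignment bits are free */
        assert(path < 128);                      /* path must fit in them */
        ref_t ref;
        ref.this_ptr = object;
        ref.vtable = (void*)((uintptr_t)vtable | path);
        return ref;
}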

Encoding larger paths into vtable pointer

A simple solution to encode larger paths into the vtable pointer is to produce a larger vtable for the root template generator. For example, a vtable could be repeated a power of two number of times in memory to get extra available bits:

  • 2 contiguous copies of vtable: extra 1 bit
  • 4 contiguous copies of vtable: extra 2 bits
  • 8 contiguous copies of vtable: extra 3 bits
  • etc.

Increasing the size of the allocated space for the vtable effectively means allocating more bits in the address. The copies must all be valid vtables because dynamic dispatch clears only the low alignment bits, so a pointer carrying higher path bits still lands on one of the repeated copies. Regardless of the machine you have:

  • 8 available bits: requires 256 bytes of vtable data
  • 9 available bits: requires 512 bytes of vtable data
  • 10 available bits: requires 1024 bytes of vtable data
  • etc.

Clearly, the memory required is exponential in the number of path bits, so it's important to keep the path size as small as possible. Another issue is that the root template generator must know when it needs to allocate more vtable data, since very few cases will need this.
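As a concrete illustration, here is a hedged C sketch of a root template generator vtable replicated four times (gaining two extra path bits); support for extended alignment of this size is implementation-defined:

#include <stdalign.h>

#define VTABLE_SIZE 16
typedef void (*slot_fn_t)(void);

void ROOT_fn(void);  /* root template generator function */

#define ROOT_COPY { ROOT_fn, ROOT_fn, ROOT_fn, ROOT_fn, \
                    ROOT_fn, ROOT_fn, ROOT_fn, ROOT_fn, \
                    ROOT_fn, ROOT_fn, ROOT_fn, ROOT_fn, \
                    ROOT_fn, ROOT_fn, ROOT_fn, ROOT_fn }

/* Four contiguous, identical copies, aligned to the whole region, so
 * clearing only the base alignment bits still lands on a valid copy. */
alignas(4 * VTABLE_SIZE * sizeof(slot_fn_t))
const slot_fn_t ROOT_vtable_copies[4][VTABLE_SIZE] = {
        ROOT_COPY, ROOT_COPY, ROOT_COPY, ROOT_COPY
};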

Reducing path size using modules

Path bits are currently allocated for every templated function/method call within a module, but this is unnecessary. We can perform template substitution within modules, so we only need to allocate path bits for calls from one module to another.

// ---- Module 1.
void a() {
        b<int>();
}

// ---- Module 2.
template <typename T>
void b() {
        c<T, float>();
}

template <typename S, typename T>
void c() {
        d<T, S>();
}

// ---- Module 3.
template <typename S, typename T>
void d();

We can partially substitute c() to produce:

// ---- Module 1.
void a() {
        b<int>();
}

// ---- Module 2.
template <typename T>
void b() {
        c_SUBSTITUTED<T>();
}

template <typename T>
void c_SUBSTITUTED() {
        d<float, T>();
}

// ---- Module 3.
template <typename S, typename T>
void d();

Effectively, the substitutions route the template arguments, exactly as our module receives them, around our module's code and on to other modules unmodified. For example:

// ---- Our module.
template <typename T>
export void f(T value) {
        g<wrapper<T>>(wrapper<T>(value));
}

template <typename T>
void g(T value) {
        h<T>(value);
}

template <typename T>
import void h(T value);

This becomes:

template <typename T>
export void f(T value) {
        g_SUBSTITUTED<T>(wrapper<T>(value));
}

template <typename T>
void g_SUBSTITUTED(wrapper<T> value) {
        h<wrapper<T>>(value);
}

template <typename T>
import void h(T value);

These substitutions mean that all code in our module can use the same path value (for a given template generator graph).

Determining path size at compile-time

One possible approach is to add a depth attribute to imported templates, which indicates how many bits they require in their path:

template(depth 2) <typename T>
import void f(T value);

Not specifying the depth means that it is zero, and hence either:

  • The module does not pass the template variables it is given to any other modules.
  • The module passes template variables in the same form to other modules as it is given them, and those modules have depth=0.

(The second case is the result of the pass-through optimisation.)

This has the following advantages:

  • A known depth means root template generators know how many bits must be available and hence can allocate vtable sizes accordingly.
  • The compiler can warn when the depth becomes large enough that the template generator vtable is huge (at 12+ bits it starts taking 4+KiB).
  • We can prevent template cycles between modules, because they would end up with infinite depth.
  • We can remove path_position from callinfo_t, because each intermediate template generator knows exactly its offset within the path.

Determining path size at run-time

The depth attribute exposes an implementation detail of the module, so it would be preferable to avoid it. Hence an alternative is to compute the depth at run-time.

Doing this at run-time means root template generator vtables can't be pre-allocated. We can call down the chain at load-time to determine the depth, but we can't allocate storage in the data segment this way. We can allocate a one-vtable-sized global, but the required depth may exceed this space (i.e. when more than 6/7 bits are needed in the path). There are two approaches to this (a sketch of the second follows the list):

  • Terminate/report error. With this approach we can omit the load-time call for non-debug builds.
  • Allocate (suitably aligned) space on the heap (and copy from the vtable global).
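A hedged sketch of the heap-allocation approach, assuming a hypothetical query_depth_bits() entry point that walks the template generator chain at load-time:

#include <stdlib.h>
#include <string.h>

#define VTABLE_SIZE 16
extern const void* ROOT_vtable[VTABLE_SIZE];  /* one-vtable-sized global */

unsigned query_depth_bits(void);  /* hypothetical load-time walk */

const void** get_root_vtable(void) {
        const size_t one_vtable = VTABLE_SIZE * sizeof(void*);
        const size_t region = (size_t)1 << query_depth_bits();

        if (region <= one_vtable) {
                return ROOT_vtable;  /* the global suffices */
        }

        /* aligned_alloc requires size to be a multiple of alignment;
         * a power-of-two region trivially satisfies this. */
        void* mem = aligned_alloc(region, region);
        if (mem == NULL) abort();

        /* Fill the region with identical copies of the vtable. */
        for (size_t off = 0; off < region; off += one_vtable) {
                memcpy((char*)mem + off, ROOT_vtable, one_vtable);
        }
        return (const void**)mem;
}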

Using vtable slots to reduce path size

Consider the following template generator graph:

Module 1      |   Module 2     |   Module 3

    ->   TPLGEN(f)  -> a (slot: 2)
                    ->     TPLGEN(a)  -> x (slot: 5)
                    -> b (slot: 3)
                    ->     TPLGEN(b)  -> y (slot: 6)
                    -> c (slot: 4)
                    ->     TPLGEN(c)  -> z (slot: 6)

Here a, b and c are templated functions or methods for which an indirect call reference is captured by f. x is a templated function or method for which an indirect call reference is captured by a, and similarly for y (captured by b) and z (captured by c).

The compiler of module 2 assumes there are six choices for the template generator of f:

  • Call a
  • Call b
  • Call c
  • Call template generator for a
  • Call template generator for b
  • Call template generator for c

However the template generator of f is also given the 64-bit hash of the name of the function or method it will eventually call. We can therefore divide these choices based on the vtable slot they fall into, which is shown in the graph. Looking at the whole graph and taking the vtable slots into account, there is only one conflict: between the template generators of b and c (i.e. between y and z), when the slot index is 6. In other words the template generator for f can have logic such as:

switch (path_value) {
case 0:
        switch (slot) {
        case 2: ...call a...
        case 3: ...call b...
        case 4: ...call c...
        case 5: ...call a template generator...
        case 6: ...call b template generator...
        }
case 1:
        assert(slot == 6);
        ...call c template generator...
}

This works because we know that we can only be calling y and z when the slot index is 6; if the slot index is anything else the path is unambiguous.

To compute this, we can define reachable(N) for node N where:

  • If N is a templated function/method (i.e. a leaf in the graph), then reachable(N) = { P }, where P is its vtable slot.
  • If N is a template generator (i.e. not a leaf), then reachable(N) = union(child C of N) { reachable(C) }.

This gives:

  • reachable(a) = { 2 }
  • reachable(TPLGEN(a)) = reachable(x) = { 5 }
  • reachable(b) = { 3 }
  • reachable(TPLGEN(b)) = reachable(y) = { 6 }
  • reachable(c) = { 4 }
  • reachable(TPLGEN(c)) = reachable(z) = { 6 }
  • reachable(TPLGEN(f)) = reachable(a) | reachable(TPLGEN(a)) | ... = { 2, 3, 4, 5, 6 }

You can see where there are conflicts by determining conflict(N), which is union(child A of N, child B of N, A != B) { reachable(A) & reachable(B) }:

  • conflict(TPLGEN(a)) = { }
  • conflict(TPLGEN(b)) = { }
  • conflict(TPLGEN(c)) = { }
  • conflict(TPLGEN(f)) = (reachable(a) & reachable(b)) | (reachable(a) & reachable(c)) | ... = { 6 }

By representing reachable(N) as a bit field (one bit per vtable slot to indicate whether there is a reachable function/method for that slot) we can write efficient code to perform this computation, since intersection and union map neatly onto bitwise AND and bitwise OR.
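A short C sketch of that computation, assuming a hypothetical node_t for the template generator graph and at most 16 vtable slots:

#include <stddef.h>
#include <stdint.h>

typedef uint16_t slot_set_t;  /* one bit per vtable slot */

typedef struct node {
        int is_leaf;                   /* templated function/method? */
        unsigned vtable_slot;          /* valid if is_leaf */
        size_t num_children;
        const struct node** children;  /* valid if !is_leaf */
} node_t;

slot_set_t reachable(const node_t* n) {
        if (n->is_leaf) {
                return (slot_set_t)(1u << n->vtable_slot);
        }
        slot_set_t set = 0;
        for (size_t i = 0; i < n->num_children; i++) {
                set |= reachable(n->children[i]);  /* union over children */
        }
        return set;
}

slot_set_t conflict(const node_t* n) {
        slot_set_t conflicts = 0, seen = 0;
        for (size_t i = 0; i < n->num_children; i++) {
                const slot_set_t r = reachable(n->children[i]);
                conflicts |= (slot_set_t)(seen & r);  /* slot reachable via two children */
                seen |= r;
        }
        return conflicts;
}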

The template generator can effectively be made programmable by creating a slot action table:

struct template_child_info_t {
        // The value we put into our position in the path to identify
        // this child.
        uint16_t path_id;

        // Reachability set of this child represented as bitfield.
        uint16_t reachable;

        // Pointer to child's template_info_t; only applies for template
        // generators.
        template_info_t* info;
};

struct slot_action_t {
        // The action to take for each path ID, for this slot.
        uint8_t id_to_child_map[NUM_CHILDREN];
};

struct template_info_t {
        // The offset within the path of our component.
        uint8_t offset;

        // The mask of the path to get our component.
        uint16_t mask;

        // The reachability of each path ID.
        uint16_t path_id_reachability[NUM_CHILDREN];

        // The action table for each slot.
        slot_action_t slot_actions[VTABLE_SIZE];

        // Information about child templates.
        template_child_info_t children[NUM_CHILDREN];
};

The template generator for f would then look like:

template_info_t f_info = ...;

<return arg> TPLGEN_f(hidden callinfo_t* callinfo, ...<call args>...) {
        const auto slot = callinfo->method_hash % VTABLE_SIZE;
        const auto path_id = ((uintptr_t)callinfo->vtable >> f_info.offset) & f_info.mask;
        const auto action = f_info.slot_actions[slot].id_to_child_map[path_id];

        switch (action) {
        case 0:
                [...modify types for templated function call...]
                return tailcall a(callinfo, ...<call args>...);
        case 1:
                [...modify types for templated function call...]
                return tailcall b(callinfo, ...<call args>...);
        case 2:
                [...modify types for templated function call...]
                return tailcall c(callinfo, ...<call args>...);
        case 3:
                [...modify types for templated function call...]
                return tailcall TPLGEN_a(callinfo, ...<call args>...);
        case 4:
                [...modify types for templated function call...]
                return tailcall TPLGEN_b(callinfo, ...<call args>...);
        case 5:
                [...modify types for templated function call...]
                return tailcall TPLGEN_c(callinfo, ...<call args>...);
        }
}

The algorithm to generate the table would look something like:

// Allocate a path ID for the given reachability set.
uint8_t allocate_path_id(template_info_t* info, uint16_t reachable) {
        uint8_t id = 0;

        while (true) {
                if (info->path_id_reachability[id] & reachable) {
                        // Conflicts with this reachability set, try
                        // next.
                        id++;
                        continue;
                }

                // Union the reachability set given; this means the path
                // ID will only be re-used for other genuinely
                // non-conflicting reachability sets.
                info->path_id_reachability[id] |= reachable;

                return id;
        }
}

size_t generate_slotactions(template_info_t* info) {
        size_t max_offset = 0;

        // Recursive call to children.
        for (size_t i = 0; i < NUM_CHILDREN; i++) {
                template_info_t* child_info = info->children[i].info;
                if (child_info == NULL) continue;

                const auto offset = generate_slotactions(child_info);
                if (offset > max_offset) max_offset = offset;
        }

        size_t max_path_id = 0;

        // Allocate a path ID for each child based on their reachability sets.
        for (size_t i = 0; i < NUM_CHILDREN; i++) {
                const auto path_id = allocate_path_id(info, info->children[i].reachable);

                // Set the path ID we'll use for this child.
                info->children[i].path_id = path_id;

                if (path_id > max_path_id) max_path_id = path_id;
        }

        // Determine offset and mask.
        info->offset = max_offset;
        info->mask = round_up_to_power_of_2(max_path_id + 1) - 1;

        // Fill in the slot action tables.
        for (unsigned slot = 0; slot < VTABLE_SIZE; slot++) {
                for (size_t i = 0; i < NUM_CHILDREN; i++) {
                        if (!(info->children[i].reachable & (1 << slot))) {
                                // Child isn't reachable for this slot.
                                continue;
                        }

                        const auto child_path_id = info->children[i].path_id;
                        info->slot_actions[slot].id_to_child_map[child_path_id] = i;
                }
        }

        return max_offset + log_2(info->mask + 1);
}

Note

We can go even further than this and use the complete 64-bit name hashes, which in general shouldn't conflict unless the names are identical. This would require determining whether any name hashes are identical between children (only in such cases do we need to allocate bits in the path). However, doing this is costly, so the approximation of using only 4 bits of the hash value (corresponding to 16 possible slots) is more practical.
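For reference, the exact check in that costlier variant reduces to a set intersection. A hedged sketch, assuming each child can enumerate the sorted 64-bit name hashes it reaches:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Two children only force path bits if they can reach the same hash. */
bool hash_sets_intersect(const uint64_t* a, size_t na,
                         const uint64_t* b, size_t nb) {
        size_t i = 0, j = 0;
        while (i < na && j < nb) {
                if (a[i] == b[j]) return true;  /* identical name hash */
                if (a[i] < b[j]) i++; else j++;
        }
        return false;
}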

Summary

This section describes the result of implementing all the ideas in the proposal.

callinfo_t

struct callinfo_t {
        uint64_t method_hash;
        void* vtable;
        size_t position;
        typename_t types[8];
};

Interface reference

struct ref_t {
        void* this;
        void* vtable;
};

<return arg> call(ref_t ref, ...<call args>...) {
        typedef <return arg> (*method_type)(hidden callinfo_t*, ...);

        void** fixed_vtable = (void**)((uintptr_t)ref.vtable & ~(uintptr_t)127);
        method_type ptr = (method_type)fixed_vtable[METHOD_HASH % VTABLE_SIZE];

        callinfo_t callinfo;
        callinfo.method_hash = METHOD_HASH;
        callinfo.vtable = ref.vtable;
        return ptr(&callinfo, ref.this, ...<call args>...);
}

typename_t

struct typename_t {
        void* vtable;
};

<return arg> call(typename_t ref, ...<call args>...) {
        typedef <return arg> (*static_method_type)(hidden callinfo_t*, ...);

        void** fixed_vtable = (void**)((uintptr_t)ref.vtable & ~(uintptr_t)127);
        static_method_type ptr = (static_method_type)fixed_vtable[METHOD_HASH % VTABLE_SIZE];

        callinfo_t callinfo;
        callinfo.method_hash = METHOD_HASH;
        callinfo.vtable = ref.vtable;
        return ptr(&callinfo, ...<call args>...);
}

Root template generator vtable

A root template generator allocates space for one or more identical copies of its vtable. Every slot of the root template generator vtable points at the root template generator function:

const vtable_t ROOT_vtable = { ROOT_fn, ROOT_fn, ROOT_fn, ... };

Root template generator

<return arg> ROOT_fn(hidden callinfo_t* callinfo, ...<call args>...) {
        callinfo->types[0] = VTABLE_int;
        callinfo->position = 0;
        return tailcall TPLGEN_f(callinfo, ...<call args>...);
}

Intermediate template generator

<return arg> TPLGEN_f(hidden callinfo_t* callinfo, ...<call args>...) {
        const auto id = ((uintptr_t)callinfo->vtable >> callinfo->position) & 0x3;
        callinfo->position += 2;

        switch (id) {
        case 0:
                // Call function 'a' for which 'f' captured a reference.
                return tailcall a(callinfo, ...<call args>...);
        case 1:
                // Pass the types to 'a' template generator.
                callinfo->types[1] = VTABLE_byte;
                return tailcall TPLGEN_a(callinfo, ...<call args>...);
        case 2:
                // Pass the types to 'b' template generator.
                callinfo->types[1] = VTABLE_short;
                return tailcall TPLGEN_b(callinfo, ...<call args>...);
        default:
                unreachable;
        }
}

Receiving template arguments

<return arg> f(hidden const callinfo_t* callinfo, ...<call args>...) {
        typename_t first_arg = callinfo->types[0];
        // etc.
}

Sending template arguments

<return arg> f(hidden const callinfo_t* callinfo, ...<call args>...) {
        // ...
        callinfo_t callinfo_copy = *callinfo;
        callinfo_copy.vtable = (void*)((uintptr_t)callinfo_copy.vtable | ((uintptr_t)0x2 << callinfo_copy.position));
        callinfo_copy.position += 2;
        callinfo_copy.types[1] = VTABLE_short;
        a(&callinfo_copy, ...<call args>...);
        // ...
}

Conflict resolution stub

<return arg> conflict_resolution_stub(hidden const callinfo_t* callinfo, ...<call args>...) {
        switch (callinfo->method_hash) {
                case HASH_METHOD_0:
                        return tailcall method0(callinfo, ...<call args>...);
                case HASH_METHOD_1:
                        return tailcall method1(callinfo, ...<call args>...);
                default:
                        unreachable;
        }
}