forked from randomascii/SizeBench
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathwasteful-virtuals.html
225 lines (191 loc) · 12.6 KB
/
wasteful-virtuals.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
<html>
<head>
<title>Wasteful Virtuals</title>
<link rel="stylesheet" type="text/css" href="styles.css?v1">
</head>
<body>
<h1 style="border-bottom: 0;">SizeBench - a binary analysis tool for Windows</h1>
<h2>Wasteful Virtuals</h2>
<p>
One of the heuristic analyses that SizeBench can do is to look for virtual functions that appear to be wasting space.
</p>
<h3>What is a Wasteful Virtual?</h3>
<p>
To understand why some virtual functions can be statically determined to be wasteful, let's start by recalling how virtual functions work.
</p>
<p>
Virtual functions are implemented in all the major compilers using vtables - each type that contains at least one virtual function, or derives
from a type that contains at least one virtual function, will have one or more vtables associated with it, with one vfptr per vtable, but that's
another topic on memory layout for another day. For the purposes of binary size on disk, the important thing is the vtables themselves, not the
vfptrs. A vtable is just an array of function pointers that the compiler constructs. So imagine you have some types like these:
</p>
<pre>
class Base {
public:
void RegularNonVirtual() { }
virtual void VirtualFunctionWithManyOverrides() { }
virtual int VirtualFunctionWithNoOverrides() { return 3; };
};
class Base_Derived1 : public Base {
public:
void VirtualFunctionWithManyOverrides() override { };
virtual void PureVirtualFunctionWithOneOverride() = 0;
virtual void VirtualFunctionWithNoOverrides2() { };
virtual int VirtualFunctionWithNoOverrides(int x) { return x; }; // Note this has an argument, so it's not an override for the one in Base1
};
class Base_Derived1_MoreDerived : public Base_Derived1 {
public:
void VirtualFunctionWithManyOverrides() override final { };
virtual int VirtualFunctionWithNoOverrides() const { return 5; }; // Note this is const, so it's not an override for the one in Base1
virtual int VirtualFunctionWithNoOverrides(float y) { return (int)y; }; // Note this has a float argument so it's not an override for the int version in Base1_Derived1 or the argument-less version in Base1
void PureVirtualFunctionWithOneOverride() override { };
};
class Base_Derived2 : public Base {
public:
void VirtualFunctionWithManyOverrides() override final { };
};
</pre>
<p>
Then you'll end up with vtables that look like these:
</p>
<pre>
Base vtable:
-------------------------------------------
| &Base::VirtualFunctionWithManyOverrides |
| &Base::VirtualFunctionWithNoOverrides |
-------------------------------------------
Base_Derived1 vtable:
-------------------------------------------------------
| &Base::VirtualFunctionWithManyOverrides |
| &Base::VirtualFunctionWithNoOverrides |
| &_purecall | <-- this is PureVirtualFunctionWithOneOverride
| &Base_Derived1::VirtualFunctionWithNoOverrides2 |
| &Base_Derived1::VirtualFunctionWithNoOverrides(int) |
-------------------------------------------------------
Base_Derived1_MoreDerived vtable:
---------------------------------------------------------------------
| &Base_Derived1_MoreDerived::VirtualFunctionWithManyOverrides |
| &Base::VirtualFunctionWithNoOverrides |
| &Base_Derived1_MoreDerived::PureVirtualFunctionWithOneOverride |
| &Base_Derived1::VirtualFunctionWithNoOverrides2 |
| &Base_Derived1::VirtualFunctionWithNoOverrides(int) |
| &Base_Derived1_MoreDerived::VirtualFunctionWithNoOverrides const |
| &Base_Derived1_MoreDerived::VirtualFunctionWithNoOverrides(float) |
---------------------------------------------------------------------
Base_Derived2 vtable:
----------------------------------------------------
| &Base_Derived2::VirtualFunctionWithManyOverrides |
| &Base::VirtualFunctionWithNoOverrides |
----------------------------------------------------
</pre>
<p>
So, a couple of things are interesting here.
</p>
<ol>
<li>
<tt>VirtualFunctionWithNoOverrides</tt> in <tt>Base</tt> is using up a vtable slot in all 4 of these types, but it's
never overridden so this is pointless.
</li>
<li>
<tt>PureVirtualFunctionWithOneOverride</tt> in <tt>Base_Derived1</tt> is using
up a vtable slot in two types even though it has just one implementation. Imagine the class hierarchy were much deeper,
it would use up a vtable slot in all the types in the hierarchy even if there's just one implementation.<br />
</li>
</ol>
<p>
These are the two classes of things SizeBench tries to detect. But why bother? What costs are there, really? There's a few:
</p>
<ol>
<li>
These vtable slots end up on disk as function pointers. They're encoded as Relative Virtual Addresses (RVAs) so they take
up 4 bytes each. In very large class hierarchies, this can add up to a surprising amount.
</li>
<li>
Each vtable slot is a function pointer, and function pointers require an entry in te .reloc Binary Section. Each reloc entry
is exactly 4 bytes, regardless of architecture, as they are also RVAs (Relative Virtual Addresses, offsets into the binary image).
Worse, every reloc entry is parsed and relocated immediately on module load by the loader.
</li>
<li>
Callsites for virtuals almost always have <a href="https://docs.microsoft.com/windows/win32/secbp/control-flow-guard">Control Flow Guard (CFG)</a>
enforced, which generates several additional assembly instructions, bloating code size at each callsite. There are some exceptions,
but generally this is true.
</li>
<li>Virtual calls often can't be devirtualized by the compiler so it can't inline things for optimization purposes.</li>
<li>
Profile Guided Optimization can cause even further code size bloat for virtuals, if they're in hot/warm code, as it will try
to "speculatively devirtualize" - detecting if the vfptr is a well-known vtable entry in the binary to convert the virtual call to
direct, but it does this essentially with "if (well known vfptr) { direct call } else { virtual call }" - so the code size is larger
than a direct call or an indirect call, it's both!
</li>
</ol>
<p>
There may be more costs, but those are the ones I know of at the moment. Thus, reducing virtual usage can benefit binary size, CPU speed,
reference set for disk I/O to not need to load so much code/data.
</p>
<h3>Example in SizeBench UI</h3>
<img class="screenshot" src="Images/WastefulVirtuals_Overview.png" width=800 />
<p>
If you click the "Wasteful Virtuals" button after loading a binary in the tool, you'll get a screenshot similar to the right. This example
happens to be from a binary called OpenConsole.exe in <a href="https://github.com/Microsoft/Terminal">Windows Terminal</a>, from commit
<tt>10222a2b</tt> if you'd like to replicate this exactly.
</p>
<p>
The top row shows a type called <tt>Microsoft::Console::Render::VtEngine</tt> that has some wasteful virtual functions detected on it. Some
of the functions are named <tt>InvalidateSelection</tt>, <tt>StartPaint</tt>, and <tt>UpdateFont</tt> among many others. Each of these functions
wastes 24 bytes (the "Waste Per Slot" column), and collectively across all the derived types this wastes 600 bytes of binary size.
</p>
<p>
In some cases you may get a better return on investment by focusing on things that are high on the "Waste Per Slot" metric as that means fixing just
one virtual function improves the most. In other cases you may want to target an entire type and prefer to sort by "Wasted Size" which is across
the entire type. You can click the column headers to sort either one, depending on which way you want to approach this.
</p>
<br style="clear:right" />
<br/>
<br/>
<img class="screenshot" src="Images/WastefulVirtuals_VtEngine.png" width=800 />
<p>
If you click on a specific Wasteful Virtual (the link in the Name column), you'll get a screen like this.
</p>
<p>
This screenshot shows the <tt>VtEngine</tt> from the first screenshot, it's just an expanded view. Here, each of the wasteful functions
is a link you can use to see that symbol, and eventually this view should also show more about the total waste of the function (vtable slot entries,
reloc entries, and so on) - but for now it just shows the vtable slot waste so it underestimates how much you'll save.
</p>
<br style="clear:right" />
<h3>How To Fix These</h3>
<p>
There are a few ways to go about removing the waste from these.
</p>
<ol>
<li>If the function is virtual, but has no overrides, of course it's as simple as removing the virtual keyword. You may be surprised how often
this is the case - code changes a lot, and something that used to be polymorphic may no longer be, but the keyword lingers with all its costs
noted above.</li>
<li>If the function is pure virtual, with just one override, then perhaps the function can be removed from the base type and the virtual/override
removed from the single implementation. In some cases, I've seen code where nobody actually needs to call it on the base type, so again this is
pretty simple to do.</li>
<li>If the function is pure virtual, with just one override, but folks need to call it on the base class, then there are a few strategies you can use:
<ul>
<li>If your binary already requires RTTI (which has a lot of costs - be careful!), you could remove the pure virtual function, make it non-virtual
on both the base type and the derived type containing the implementation. Then, in the base type, dynamic_cast to the derived type - if
successful, directly dispatch to the derived type. If the dynamic_cast fails, no-op/crash/assert/whatever is appropriate for your codebase.</li>
<li>If your binary does not require RTTI, perhaps you already have a hand-rolled RTTI-esque system (many binaries do). In this case, use your
own "<tt>GetTypeIndex()</tt>" or whatever you may have, to implement the same idea as the dynamic_cast above, but without all the costs of C++
RTTI. This is what the Windows XAML framework does internally, and Chromium has a similar mechanism.</li>
<li>If your binary does not require RTTI and does not have a hand-rolled RTTI-esque system, perhaps with a little analysis the callers that have a
pointer to a base type could be plumbed to have the derived type with the implementation.</li>
<li>If all those fail, maybe this function isn't worth devirtualizing.</li>
</ul>
</li>
</ol>
<h3>Does This Really Add Up?</h3>
<p>
Not for every binary - but for some this can matter a lot. As an example, the XAML framework in Windows extensively uses virtual functions, and by using
this analysis to identify problems and a few techniques to implement solutions, has been able to save >10% of their binary size (over 1MB). XAML may be
pathologically bad at using virtuals, but it's an example of how it can add up in some cases.
</p>
<p>
If you run this analysis on your binary and the "Wasted Size" column in measured in bytes, probably this is a bad use of your time to bother with. But if
it is measured in 10s or hundreds of KBs then that can really matter for certain workloads.
</p>
</body>
</html>