Starling and fast ByteArrays: worth the trouble?

Posted April 12, 2013 by Arnaud Gatouillat & filed under 12=12, Blog.

Hey, here’s another annoying post for non-developers 🙂

Fast memory access has been in demand among AS3 developers since Alchemy was born 5 years ago. Hacks have made it possible to access fast memory from AS3 or Haxe for years; Philippe Elsass wrote a recap of these methods in 2010.

Last week the official method for accessing the fast memory opcodes from AS3 was quietly announced in a blog post: Making ByteArray faster.

This method needs the ASC2 compiler (included with AIR 3.6/3.7) and is easy to use from FlashDevelop 4.4 (in development). After a default installation, select “AIR 3.6” in the project properties.

Like the other hacks, one of its main performance drawbacks is the very slow ApplicationDomain.currentDomain.domainMemory = myFastByteArray call. Don’t bother using this approach if you’ve got lots of different ByteArrays to manipulate.
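
In practice this means selecting one ByteArray once and then doing all reads and writes through the opcodes. Here is a minimal sketch of that pattern, assuming the ASC2 intrinsics in avm2.intrinsics.memory; the FastMemoryExample class and its method names are hypothetical and only there for illustration:

```actionscript
package
{
	import flash.system.ApplicationDomain;
	import flash.utils.ByteArray;
	import flash.utils.Endian;

	import avm2.intrinsics.memory.lf32;
	import avm2.intrinsics.memory.sf32;

	// Hypothetical helper, not part of 12=12 or Starling.
	public class FastMemoryExample
	{
		private var bytes:ByteArray = new ByteArray();

		public function FastMemoryExample(floatCount:int)
		{
			// Keep ByteArray methods (readFloat, etc.) consistent with the
			// little-endian layout used by the opcodes.
			bytes.endian = Endian.LITTLE_ENDIAN;
			// domainMemory rejects ByteArrays shorter than MIN_DOMAIN_MEMORY_LENGTH.
			bytes.length = Math.max(floatCount * 4, ApplicationDomain.MIN_DOMAIN_MEMORY_LENGTH);
			// The slow call: do it once here, not once per write.
			ApplicationDomain.currentDomain.domainMemory = bytes;
		}

		// Multiplies the first floatCount floats in place; addresses are byte offsets.
		public function scaleAll(floatCount:int, factor:Number):void
		{
			for (var addr:int = 0, end:int = floatCount * 4; addr < end; addr += 4)
				sf32(lf32(addr) * factor, addr);
		}
	}
}
```

If several objects each own a ByteArray and take turns writing, every switch pays for the domainMemory assignment again, which is where the gain disappears.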

For 12=12 we’ve crafted custom 3D objects to reduce the number of draw calls (draw calls can be one of the main performance sinkholes). All the cubes are actually a single mesh/surface containing 216 subcubes. We then modify the surface vertices to change each cube’s transform, texture and color. It’s an ideal use case for fast ByteArrays, so the code looks like this:

// sf32(value, address) comes from avm2.intrinsics.memory (ASC2)
var scale:Number = visible ? element.scale * .999 : 0;

// byte offset of the first vertex to write (4 bytes per float)
var pos:int = vertexOffset * sizePerVertex * 4;
while (vj < vm)
{
	sf32(vv[vj++] * scale + x,  pos); // vertex-x
	sf32(vv[vj++] * scale + y,  pos + 4); // vertex-y
	sf32(vv[vj++] * scale + z,  pos + 8); // vertex-z
	sf32(r, pos + 32); // particle/color.r
	sf32(g, pos + 36); // particle/color.g
	sf32(b, pos + 40); // particle/color.b
	sf32(a, pos + 44); // particle/color.a
	vj += 9; // skip the 9 source values not rewritten here
	pos += 48; // next vertex: 12 floats * 4 bytes
}

needUploadVertices = true;

The framerate improved by about 20% going from a simple Vector.<Number> to the fast ByteArray implementation, not bad!

As we also use Starling/Feathers for all the 2D/GUI stuff, I wanted to check whether the same approach could improve performance there as well. I've converted the VertexData class, carefully checking for each method whether the cost of the domainMemory call made the fast opcodes worth using, or whether plain ByteArray methods were the better choice.

You can download the resulting class here: http://iq12.com/files/fu/starling/fast_ba/VertexData.as

Classes that upload VertexData.rawData must also be modified. Something like:
mVertexBuffer.uploadFromVector(mVertexData.rawData, 0, numVertices);
becomes:
mVertexBuffer.uploadFromByteArray(mVertexData.rawData, 0, 0, numVertices);
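
The extra argument in the second call is the byte offset into the ByteArray; the remaining two are the start vertex and the vertex count, as before. A short sketch of the call in context follows; the UploadExample class and its parameter names are hypothetical, and the endianness check reflects the Stage3D requirement that uploaded ByteArrays be little-endian:

```actionscript
package
{
	import flash.display3D.Context3D;
	import flash.display3D.VertexBuffer3D;
	import flash.utils.ByteArray;
	import flash.utils.Endian;

	// Hypothetical helper illustrating the call, not Starling code.
	public class UploadExample
	{
		public static function upload(context:Context3D, data:ByteArray,
		                              numVertices:int, floatsPerVertex:int):VertexBuffer3D
		{
			if (data.endian != Endian.LITTLE_ENDIAN)
				throw new ArgumentError("vertex data must be little-endian");

			var buffer:VertexBuffer3D = context.createVertexBuffer(numVertices, floatsPerVertex);
			// Arguments: data, byteArrayOffset (in bytes), startVertex, numVertices
			buffer.uploadFromByteArray(data, 0, 0, numVertices);
			return buffer;
		}
	}
}
```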

Also, if any code in your project uses domainMemory, you must set VertexData.instanceInDomainMemory = null before starling.nextFrame().
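
As a sketch of where that reset goes, assuming your own code selects its own ByteArray once per frame (the FrameLoopExample class and the ownBytes parameter are hypothetical; VertexData is the modified class linked above):

```actionscript
package
{
	import flash.system.ApplicationDomain;
	import flash.utils.ByteArray;

	import starling.core.Starling;
	import starling.utils.VertexData; // the modified class from this post

	// Hypothetical frame loop, for illustration only.
	public class FrameLoopExample
	{
		public static function step(starling:Starling, ownBytes:ByteArray):void
		{
			// ownBytes must be at least ApplicationDomain.MIN_DOMAIN_MEMORY_LENGTH long.
			ApplicationDomain.currentDomain.domainMemory = ownBytes;
			// ... your own sf32()/lf32() work on ownBytes goes here ...

			// Assumption: the reset makes VertexData select its own ByteArray
			// again instead of writing into ownBytes.
			VertexData.instanceInDomainMemory = null;
			starling.nextFrame();
		}
	}
}
```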

Now what are the results? With the Starling benchmark (from the Starling demo project):
- iPhone3GS: 850 => 920 (+8%)
- HTC Desire (~= NexusOne): 950 => 990 (+4%)
- iPad1: 1380 => 1480 (+7%)
- iPhone4: 1090 => 1260 (+15%)
- iPad3: 2760 => 3290 (+19%)
- Nexus4: 3680 => 3520 (-4%)

Neither bad nor great. I would very much welcome a real-world Starling benchmark though. Ideas, anyone?

[update 15/05/13]
Following @makc's comment below, I reran the benchmark with my standard ByteArray class and got decent results too:

device        base score    standard ByteArray    fast ByteArray
iPhone3GS     850           890 (+4%)             920 (+8%)
HTC Desire    950           970 (+2%)             990 (+4%)
iPad1         1380          1440 (+4%)            1480 (+7%)
iPhone4       1090          1210 (+11%)           1260 (+15%)
iPad3         2760          2820 (+2%)            3290 (+19%)
Nexus4        3680          3320 (-10%)           3520 (-4%)

There's probably a pattern: older devices have slow GPU upload speeds and benefit a lot from uploadFromByteArray(), while current devices gain less from it and are hit harder by the CPU cost of going from Vector.<Number> to ByteArray methods.

I also fixed a bug in my code. You can find the updated classes here:
- with "fast ByteArrays": http://iq12.com/files/fu/starling/fast_ba/VertexData.as
- with standard ByteArray access: http://iq12.com/files/fu/starling/bytearray/VertexData.as
