Hey, here’s another annoying post for non-developers 🙂
Fast memory access has been a demand of AS3 developers since Alchemy was born 5 years ago. Hacks have allowed access to fast memory from AS3 or Haxe for years; Philippe Elsass wrote a recap of these methods in 2010.
Last week the official method for accessing the opcodes from AS3 was quietly announced in a blog post: Making ByteArray faster.
This method needs the ASC2 compiler (included with AIR 3.6/3.7) and is easily usable from FlashDevelop 4.4 (in development). After a default installation select “AIR 3.6” in the project properties.
Like the other hacks, one of the main performance drawbacks is the very slow ApplicationDomain.currentDomain.domainMemory = myFastByteArray assignment. Don’t bother using this approach if you’ve got lots of different ByteArrays to manipulate.
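To make the trade-off concrete, here is a minimal sketch of the pattern: pay for the domainMemory assignment once, then do many cheap opcode writes against the same buffer. The class name, buffer size and values are made up for illustration, and the note about byte order is an assumption on my part rather than something stated in the announcement.

```actionscript
package {
    import flash.display.Sprite;
    import flash.system.ApplicationDomain;
    import flash.utils.ByteArray;
    import flash.utils.Endian;

    // Intrinsics provided by the ASC2 compiler.
    import avm2.intrinsics.memory.lf32;
    import avm2.intrinsics.memory.sf32;

    // Hypothetical demo class, not part of any library.
    public class FastMemorySketch extends Sprite {
        public function FastMemorySketch() {
            var buffer:ByteArray = new ByteArray();
            // Assumption: the opcodes use little-endian order, so matching it
            // keeps ByteArray.readFloat()/writeFloat() consistent with them.
            buffer.endian = Endian.LITTLE_ENDIAN;
            // Domain memory must be at least this many bytes long.
            buffer.length = ApplicationDomain.MIN_DOMAIN_MEMORY_LENGTH;

            // The slow part: do it once, not once per write.
            ApplicationDomain.currentDomain.domainMemory = buffer;

            // Many cheap writes against the same buffer.
            for (var i:int = 0; i < 16; i++) {
                sf32(i * 0.5, i * 4); // (value, byte address)
            }

            // Read back the float stored at byte address 8.
            trace(lf32(8));
        }
    }
}
```

If every write needed a different ByteArray, the assignment would end up inside the loop, which is exactly the case where this approach stops paying off.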
For 12=12 we’ve crafted custom 3D objects to reduce the number of draw calls (draw calls can be one of the main performance sinkholes). All cubes are actually only one mesh/surface with 216 subcubes in it. We then rewrite the surface vertices to change each cube’s transform, texture and color. It’s an ideal use case for fast ByteArrays, so the code looks like this:
    var scale:Number = visible ? element.scale * .999 : 0;
    var pos:int = vertexOffset * sizePerVertex * 4;
    while (vj < vm) {
        sf32(vv[vj++] * scale + x, pos);     // vertex-x
        sf32(vv[vj++] * scale + y, pos + 4); // vertex-y
        sf32(vv[vj++] * scale + z, pos + 8); // vertex-z
        sf32(r, pos + 32); // particle/color.r
        sf32(g, pos + 36); // particle/color.g
        sf32(b, pos + 40); // particle/color.b
        sf32(a, pos + 44); // particle/color.a
        vj += 9;
        pos += 48;
    }
    needUploadVertices = true;
The framerate improved by about 20% going from a simple Vector.<Number> to the fast ByteArray implementation. Not bad!
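The byte offsets in the loop above can be read as a vertex layout. The sketch below spells out that arithmetic; the class and constant names are hypothetical, and the layout (12 floats per vertex, color starting at float 8) is only inferred from the `pos += 48` and `pos + 32` offsets, not taken from the actual engine code.

```actionscript
package {
    // Hypothetical helper, only meant to illustrate the offset arithmetic.
    public class VertexLayoutSketch {
        public static const BYTES_PER_FLOAT:int = 4;
        // Inferred: pos += 48 bytes per vertex, and 48 / 4 = 12 floats.
        public static const FLOATS_PER_VERTEX:int = 12;
        // Inferred: color.r is written at pos + 32, and 32 / 4 = float 8.
        public static const COLOR_FLOAT_INDEX:int = 8;

        // Byte address of a vertex (its x coordinate) in domain memory.
        public static function vertexAddress(vertexIndex:int):int {
            return vertexIndex * FLOATS_PER_VERTEX * BYTES_PER_FLOAT;
        }

        // Byte address of the red channel; g, b, a follow at +4, +8, +12.
        public static function colorAddress(vertexIndex:int):int {
            return vertexAddress(vertexIndex) + COLOR_FLOAT_INDEX * BYTES_PER_FLOAT;
        }
    }
}
```

Under these assumptions, vertexAddress(vertexOffset) matches the initial value of pos in the loop when sizePerVertex is 12.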
As we also use Starling/Feathers for all the 2D/GUI stuff, I wanted to check if the same approach could improve performance there as well. I've transformed the VertexData class, carefully checking for each method whether the cost of the domainMemory assignment made the fast opcodes worthwhile, or whether plain ByteArray methods were the better choice.
You can download the resulting class here: http://iq12.com/files/fu/starling/fast_ba/VertexData.as
Classes uploading VertexData.rawData must also be modified. Something like:
mVertexBuffer.uploadFromVector(mVertexData.rawData, 0, numVertices);
becomes:
mVertexBuffer.uploadFromByteArray(mVertexData.rawData, 0, 0, numVertices);
Also, if any code in your project uses domainMemory, you must set VertexData.instanceInDomainMemory = null before starling.nextFrame().
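Presumably the modified class remembers which instance is currently mapped to domainMemory so it can skip the slow assignment, and that memory goes stale when other code remaps domain memory. A rough sketch of the hand-off, assuming the modified VertexData replaces starling.utils.VertexData and exposes instanceInDomainMemory as a static property; the class name, scratch buffer and written value are hypothetical:

```actionscript
package {
    import flash.system.ApplicationDomain;
    import flash.utils.ByteArray;
    import flash.utils.Endian;

    import avm2.intrinsics.memory.sf32;

    import starling.core.Starling;
    import starling.utils.VertexData; // the modified class linked above

    // Hypothetical example of project code that also uses domainMemory.
    public class DomainMemoryHandoffSketch {
        private var scratch:ByteArray = new ByteArray();

        public function DomainMemoryHandoffSketch() {
            scratch.endian = Endian.LITTLE_ENDIAN;
            scratch.length = ApplicationDomain.MIN_DOMAIN_MEMORY_LENGTH;
        }

        public function update(starling:Starling):void {
            // Our own code takes over domain memory...
            ApplicationDomain.currentDomain.domainMemory = scratch;
            sf32(1.0, 0);

            // ...so tell VertexData that none of its instances is mapped anymore,
            // forcing it to reassign domainMemory on its next fast write.
            VertexData.instanceInDomainMemory = null;

            starling.nextFrame();
        }
    }
}
```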
Now what are the results? With the Starling benchmark (from the Starling demo project):
- iPhone3GS: 850 => 920 (+8%)
- HTC Desire (~= NexusOne): 950 => 990 (+4%)
- iPad1: 1380 => 1480 (+7%)
- iPhone4: 1090 => 1260 (+15%)
- iPad3: 2760 => 3290 (+19%)
- Nexus4: 3680 => 3520 (-4%)
Neither bad nor formidable. I would very much welcome a real world Starling benchmark though, ideas anyone?
[update 15/05/13]
Following @makc's comment below, I reran the benchmark with my standard ByteArray class and also got decent results:
device: base score | standard ByteArray | fast ByteArray
- iPhone3GS: 850 | 890 (+4%) | 920 (+8%)
- HTC Desire: 950 | 970 (+2%) | 990 (+4%)
- iPad1: 1380 | 1440 (+4%) | 1480 (+7%)
- iPhone4: 1090 | 1210 (+11%) | 1260 (+15%)
- iPad3: 2760 | 2820 (+2%) | 3290 (+19%)
- Nexus4: 3680 | 3320 (-10%) | 3520 (-4%)
There's probably a pattern: older devices have slow GPU upload speeds and benefit a lot from uploadFromByteArray(), while current devices gain less from it and are hit harder by the CPU cost of going from Vector.<Number> access to ByteArray methods.
I also corrected a bug in my code. You can find both versions here:
- with "fast ByteArrays": http://iq12.com/files/fu/starling/fast_ba/VertexData.as
- with standard ByteArray access: http://iq12.com/files/fu/starling/bytearray/VertexData.as