I took the code from Intermediate Lesson 2 and applied SSE to it, keeping the same approach as the non-SSE version. Unpacking and packing might be faster, but I thought this was a pretty good attempt. SSE gave a speedup of roughly 1-1.5 ms: the original code took ~4.85 ms and this version takes ~3.5 ms.
Code:
void D3DGraphics::SSEDrawSpriteAlpha(const int X, const int Y, const Sprite *sprite)
{
    // Sprite width in 128-bit units (4 pixels each); any remainder columns are not drawn
    const int sseSpriteWidth = sprite->width / 4;
    DQWORD *pSprite = (DQWORD *)sprite->surface;
    // X position in 4-pixel units, so the sprite is rounded down to a multiple of 4 pixels
    const int xPos = X / 4;
    const DQWORD maxAlpha = _mm_set1_epi32(255);
    for (int y = 0; y < sprite->height; y++)
    {
        const int rowOffset = y * sseSpriteWidth;
        for (int x = 0; x < sseSpriteWidth; x++)
        {
            // 1-1.5 ms faster using sse for rendering
            const int xOffset = x + xPos, yOffset = y + Y;
            const DQWORD src = pSprite[x + rowOffset];
            const DQWORD dst = GetSSEPixel(xOffset, yOffset);
            // Split each source pixel into channels, one 32-bit lane per pixel
            const DQWORD sa = _mm_srli_epi32(src & alphaMask, 24);
            const DQWORD sr = _mm_srli_epi32(src & redMask, 16);
            const DQWORD sg = _mm_srli_epi32(src & greenMask, 8);
            const DQWORD sb = src & blueMask;
            // Destination weight is the inverse of the source alpha
            const DQWORD da = _mm_sub_epi32(maxAlpha, sa);
            const DQWORD dr = _mm_srli_epi32(dst & redMask, 16);
            const DQWORD dg = _mm_srli_epi32(dst & greenMask, 8);
            const DQWORD db = dst & blueMask;
            // result = (s * a + d * (255 - a)) >> 8; every value is <= 255,
            // so the 16-bit multiply cannot overflow within a 32-bit lane
            const DQWORD rr = _mm_srli_epi32(_mm_add_epi32(_mm_mullo_epi16(sr, sa), _mm_mullo_epi16(dr, da)), 8);
            const DQWORD rg = _mm_srli_epi32(_mm_add_epi32(_mm_mullo_epi16(sg, sa), _mm_mullo_epi16(dg, da)), 8);
            const DQWORD rb = _mm_srli_epi32(_mm_add_epi32(_mm_mullo_epi16(sb, sa), _mm_mullo_epi16(db, da)), 8);
            // Recombine the channels (the alpha byte of the result is left at 0)
            const DQWORD result =
                _mm_or_si128(
                    _mm_or_si128(
                        _mm_slli_epi32(rr, 16),
                        _mm_slli_epi32(rg, 8)),
                    rb);
            PutPixel(xOffset, yOffset, result);
        }
    }
}
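For comparison, here is a rough sketch of the unpack/pack variant mentioned at the top. It is not from the lesson and not timed; `BlendFourPixels` is a made-up name, and it assumes 32-bit ARGB pixels with alpha in the top byte. It widens the 8-bit channels to 16 bits so all channels of two pixels are multiplied at once, then packs back down. The alpha byte of the result is cleared so the output matches the function above.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: blends four ARGB pixels (src over dst) with
// (s * a + d * (255 - a)) >> 8, working on 16-bit lanes.
inline __m128i BlendFourPixels(__m128i src, __m128i dst)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i c255 = _mm_set1_epi16(255);

    // Widen 8-bit channels to 16 bits: each register now holds two pixels (B, G, R, A each).
    const __m128i srcLo = _mm_unpacklo_epi8(src, zero);
    const __m128i srcHi = _mm_unpackhi_epi8(src, zero);
    const __m128i dstLo = _mm_unpacklo_epi8(dst, zero);
    const __m128i dstHi = _mm_unpackhi_epi8(dst, zero);

    // Broadcast each pixel's alpha (16-bit lanes 3 and 7) across its four channels.
    const __m128i aLo = _mm_shufflehi_epi16(
        _mm_shufflelo_epi16(srcLo, _MM_SHUFFLE(3, 3, 3, 3)), _MM_SHUFFLE(3, 3, 3, 3));
    const __m128i aHi = _mm_shufflehi_epi16(
        _mm_shufflelo_epi16(srcHi, _MM_SHUFFLE(3, 3, 3, 3)), _MM_SHUFFLE(3, 3, 3, 3));

    // s * a + d * (255 - a) never exceeds 255 * 255, so it fits in an unsigned 16-bit lane.
    const __m128i lo = _mm_srli_epi16(
        _mm_add_epi16(_mm_mullo_epi16(srcLo, aLo),
                      _mm_mullo_epi16(dstLo, _mm_sub_epi16(c255, aLo))), 8);
    const __m128i hi = _mm_srli_epi16(
        _mm_add_epi16(_mm_mullo_epi16(srcHi, aHi),
                      _mm_mullo_epi16(dstHi, _mm_sub_epi16(c255, aHi))), 8);

    // Pack back to 8-bit channels and clear the alpha byte.
    return _mm_and_si128(_mm_packus_epi16(lo, hi), _mm_set1_epi32(0x00FFFFFF));
}
```

Inside the inner loop this would replace everything between reading `src`/`dst` and the `PutPixel` call. Whether it is actually faster would need to be measured.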
If I can get the walkin dudes' sprites to align correctly, I've also figured out a good way of handling the color key with SSE: use cmpeq to build a mask, AND the mask with the background to keep the background wherever the sprite pixel matches the key, ANDNOT the mask with the sprite surface to keep the sprite pixels everywhere else, and OR the two results to get the final pixels to draw. The walkin dude sprites are still off, but this gives me another speedup of about 1 ms, so I'm down to ~2.5 ms for the 100 walkin dudes.
Code:
DQWORD keyMask = _mm_cmpeq_epi32(pSprite[index], key);
// Where a keyMask element is FFFFFFFF (sprite pixel == key), priColor takes the background pixel; otherwise it is 0.
DQWORD priColor = _mm_and_si128(bg, keyMask);
// andnot inverts the mask first, so altColor takes the sprite pixel wherever it is not the key; otherwise it is 0.
DQWORD altColor = _mm_andnot_si128(keyMask, pSprite[index]);
// Each element is 0 in at least one of priColor/altColor, so OR-ing them picks the background for keyed pixels and the sprite pixel for the rest.
DQWORD color = _mm_or_si128(priColor, altColor);
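The snippet above can be turned into a self-contained row blit. This is only a sketch: `BlitRowColorKey` and its parameters are made-up names, not part of the lesson's `D3DGraphics` class. It uses unaligned loads and stores (`_mm_loadu_si128` / `_mm_storeu_si128`), so the destination can start at any pixel. If the misplaced sprites come from `X / 4` snapping the position to a 4-pixel boundary (an assumption, since `GetSSEPixel` and `PutPixel` are not shown), this might be one way around it.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: copies `count` sprite pixels from `src` onto `dst`,
// leaving the destination untouched wherever the sprite pixel equals `key`.
inline void BlitRowColorKey(uint32_t* dst, const uint32_t* src, int count, uint32_t key)
{
    const __m128i keyVec = _mm_set1_epi32(static_cast<int>(key));
    int x = 0;
    // Four pixels per iteration; unaligned access lets dst/src start anywhere.
    for (; x + 4 <= count; x += 4)
    {
        const __m128i s  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + x));
        const __m128i bg = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst + x));
        const __m128i keyMask = _mm_cmpeq_epi32(s, keyVec);      // FFFFFFFF where s == key
        const __m128i color = _mm_or_si128(
            _mm_and_si128(keyMask, bg),                          // background where keyed
            _mm_andnot_si128(keyMask, s));                       // sprite everywhere else
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + x), color);
    }
    // Scalar tail for widths that are not a multiple of 4.
    for (; x < count; ++x)
    {
        if (src[x] != key)
            dst[x] = src[x];
    }
}
```

It would be called once per sprite row, with `dst` pointing at the first destination pixel of that row. Unaligned loads can be slower than aligned ones on older CPUs, so the timing may differ from the numbers above.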
The lesson was done at 800 x 600, so not really pushing anything.