How do YOU multithread?

Post by **albinopapa** » July 20th, 2015, 7:18 pm

void Graphics::Darken(BYTE alpha)
{
	const int sseWidth = width / 4;
	BYTE bAlpha[]{alpha, alpha, alpha, 255};
	DQWORD mAlpha = _mm_set1_epi32(*((int*)bAlpha));
	mAlpha = _mm_unpacklo_epi8(mAlpha, _mm_setzero_si128());
	

	DQWORD *pPixel = (DQWORD *)sysBuffer;

	for (int y = 0; y < height; y++)
	{
		int rowOffset = y * sseWidth;

		for (int x = 0; x < sseWidth; x++)
		{
			const int index = x + rowOffset;
			// Unpack the bytes to words
			DQWORD A = _mm_unpacklo_epi8(pPixel[index], _mm_setzero_si128());
			DQWORD B = _mm_unpackhi_epi8(pPixel[index], _mm_setzero_si128());

			// Multiply words by unpacked alpha
			DQWORD C = _mm_mullo_epi16(A, mAlpha);
			DQWORD D = _mm_mullo_epi16(B, mAlpha);

			// Divide by 256 by shifting right 8 times
			DQWORD E = _mm_srli_epi16(C, 8);
			DQWORD F = _mm_srli_epi16(D, 8);

			// pack the new values 
			DQWORD G = _mm_packus_epi16(E, F);

			// assign G to pPixel[index]
			pPixel[index] = G;
		}
	}
}

It fades! In debug mode only

Post by **chili** » July 21st, 2015, 2:02 am

You're almost there! The hard part is done; everything inside the loop is fine. You just need to fix the setup of your mAlpha variable. Remember, you're going to multiply this against epi16 values. Use the debugger to pay close attention to what happens to mAlpha when you initialize it and what results you get from multiplying it with the loaded and unpacked pixel channel values.

Post by **chili** » July 21st, 2015, 2:03 am

Also, though you can do it with just the intrinsics you're already using, there is another intrinsic that will help you set up mAlpha more easily. It is just a variation of one you're already using.

Post by **albinopapa** » July 21st, 2015, 5:34 am


void Graphics::Darken(BYTE alpha)
{
	const int sseWidth = width / 4;
	DQWORD mAlpha = _mm_set1_epi16(alpha);
	DQWORD *pPixel = (DQWORD *)sysBuffer;

	for (int y = 0; y < height; y++)
	{
		int rowOffset = y * sseWidth;

		for (int x = 0; x < sseWidth; x++)
		{
			const int index = x + rowOffset;
			DQWORD A = _mm_unpacklo_epi8(pPixel[index], _mm_setzero_si128());
			DQWORD B = _mm_unpackhi_epi8(pPixel[index], _mm_setzero_si128());
			A = _mm_mullo_epi16(A, mAlpha);
			B = _mm_mullo_epi16(B, mAlpha);
			A = _mm_srli_epi16(A, 8);
			B = _mm_srli_epi16(B, 8);
			pPixel[index] = _mm_packus_epi16(A, B);
		}
	}
}

Arrrgh! Well, not sure why it made a difference, but using _mm_set1_epi16 instead of _mm_set1_epi32 on the byte array plus unpacking seems to have fixed the problem with it not working in Release mode. Really, that's the only change I made. In debug mode the values were correct. In Release mode I couldn't check the values, and I would get just the base color no matter what value I passed in to this function.
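Since the values can't easily be inspected in a Release build, one option is to compare the SSE result against a plain scalar loop on a small test buffer. This is only a sketch: DarkenScalar, DarkenSse and DarkenMatchesReference are hypothetical names that are not part of the project, the buffer is a bare byte array rather than sysBuffer, and the byte count is assumed to be a multiple of 16.

```cpp
#include <emmintrin.h>
#include <cstddef>
#include <cstdint>

// Scalar reference: every byte becomes (byte * alpha) >> 8.
inline void DarkenScalar(std::uint8_t* bytes, std::size_t count, std::uint8_t alpha)
{
    for (std::size_t i = 0; i < count; ++i)
        bytes[i] = static_cast<std::uint8_t>((bytes[i] * alpha) >> 8);
}

// SSE2 version of the same arithmetic, 16 bytes (four pixels) per step.
// Assumes count is a multiple of 16. Unaligned loads/stores are used so
// the test buffer does not need any particular alignment.
inline void DarkenSse(std::uint8_t* bytes, std::size_t count, std::uint8_t alpha)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i mAlpha = _mm_set1_epi16(alpha);
    for (std::size_t i = 0; i < count; i += 16)
    {
        __m128i* p = reinterpret_cast<__m128i*>(bytes + i);
        const __m128i px = _mm_loadu_si128(p);
        // Widen bytes to words, multiply by alpha, divide by 256
        const __m128i lo = _mm_srli_epi16(_mm_mullo_epi16(_mm_unpacklo_epi8(px, zero), mAlpha), 8);
        const __m128i hi = _mm_srli_epi16(_mm_mullo_epi16(_mm_unpackhi_epi8(px, zero), mAlpha), 8);
        _mm_storeu_si128(p, _mm_packus_epi16(lo, hi));
    }
}

// Returns true when both versions agree on a simple test pattern.
inline bool DarkenMatchesReference(std::uint8_t alpha)
{
    std::uint8_t a[64], b[64];
    for (int i = 0; i < 64; ++i)
        a[i] = b[i] = static_cast<std::uint8_t>(i * 4 + 3);
    DarkenScalar(a, 64, alpha);
    DarkenSse(b, 64, alpha);
    for (int i = 0; i < 64; ++i)
        if (a[i] != b[i]) return false;
    return true;
}
```

Calling DarkenMatchesReference with a few alpha values in both Debug and Release would show whether the two builds disagree without needing the debugger.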

At first I thought it was the fact I was using Direct2D as the "renderer", but then I switched to a DXGI swap chain and was still getting the same result. And to think just changing the set function fixed it.

Ok, what's next lol.

Post by **albinopapa** » July 21st, 2015, 5:36 am

I hope this is an entertaining distraction from HUGS. I gather from your previous videos you have been looking for a distraction lol.

Post by **chili** » July 21st, 2015, 9:44 am

Hmm, I just tried your original code in debug, and it does indeed seem to work, while release is broken.

weird.

Post by **chili** » July 21st, 2015, 10:36 am

At first glance I figured your stuff wouldn't quite work, but it seems to be sound. It seems that the optimizer can't pick up on the pointer shenanigans that you pull off here: it re-arranges the loading of the xmm2 register (mAlpha) to before the array is initialized, cuts out a bunch of other stuff, and just directly loads what it thinks should be the end result of the set1 intrinsic.


		DQWORD mAlpha = _mm_set1_epi32( *( (int*)bAlpha ) );
000000013FFE2E30  movdqa      xmm2,xmmword ptr [__xmm@000000ff000000ff000000ff000000ff (013FFEBA80h)]  


		DQWORD *pPixel = (DQWORD *)sysBuffer.GetBuffer();
000000013FFE2E38  mov         rax,qword ptr [rcx+30h]  
000000013FFE2E3C  xorps       xmm3,xmm3  
#define DQWORD __m128i
		const int sseWidth = SCREENWIDTH / 4;
		__declspec( align( 16 ) ) BYTE bAlpha[]{alpha,alpha,alpha,255};
000000013FFE2E3F  mov         byte ptr [this],dl  
000000013FFE2E43  mov         byte ptr [rsp+9],dl  
000000013FFE2E47  mov         byte ptr [rsp+0Ah],dl  
		mAlpha = _mm_unpacklo_epi8( mAlpha,_mm_setzero_si128() );
000000013FFE2E4B  punpcklbw   xmm2,xmm3

Here, dl holds the alpha value to be copied into the array. I added the alignment specifier because I thought it was just a problem with loading from unaligned memory or something, but that wasn't the issue.
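If the goal of the original setup was to scale the color channels by alpha while leaving the fourth channel multiplied by 255, one way to avoid the byte array and the int* cast altogether is to build the eight words directly. This is just a sketch, not a diagnosis of what the optimizer did: MakeAlphaWords is a made-up helper name, and it assumes the same channel order as the bAlpha array in the original post.

```cpp
#include <emmintrin.h>
#include <cstdint>

// Hypothetical helper: produces the words {a, a, a, 255, a, a, a, 255},
// which is the layout the original code gets from _mm_set1_epi32 on the
// byte array followed by _mm_unpacklo_epi8 with zero.
// No memory is read through a cast pointer here.
inline __m128i MakeAlphaWords(std::uint8_t alpha)
{
    return _mm_setr_epi16(alpha, alpha, alpha, 255,
                          alpha, alpha, alpha, 255);
}
```

The result could be dropped in for mAlpha in the original Darken loop, since the multiply, shift and pack steps only care about the 16-bit lanes.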

Post by **albinopapa** » July 22nd, 2015, 8:06 am

Used the code from the Intermediate Lesson 2 and just applied SSE to it in the same fashion as without. Unpacking and packing might be faster, but I thought this was a pretty good attempt. Got a speed up of 1-1.5 ms using SSE. Original code was ~4.85 ms, this version ~3.5 ms.



void D3DGraphics::SSEDrawSpriteAlpha(const int X, const int Y, const Sprite *sprite)
{
	const UINT sseSpriteWidth = sprite->width / 4;
	DQWORD *pSprite = (DQWORD *)sprite->surface;
	const int xPos = X / 4;
	DQWORD zero = SetZero();

	for (int y = 0; y < sprite->height; y++)
	{
		const int rowOffset = y * sseSpriteWidth;
		for (int x = 0; x < sseSpriteWidth; x++)
		{
			// 1-1.5 ms faster using sse for rendering

			const int xOffset = x + xPos, yOffset = y + Y;
			const DQWORD src = pSprite[x + rowOffset];
			const DQWORD dst = GetSSEPixel(xOffset, yOffset);
			
			const DQWORD sa = _mm_srli_epi32(src & alphaMask, 24);
			const DQWORD sr = _mm_srli_epi32(src & redMask, 16);
			const DQWORD sg = _mm_srli_epi32(src & greenMask, 8);
			const DQWORD sb = src & blueMask;

			const DQWORD da = _mm_sub_epi32(_mm_set1_epi32(255), sa);
			const DQWORD dr = _mm_srli_epi32(dst & redMask, 16);
			const DQWORD dg = _mm_srli_epi32(dst & greenMask, 8);
			const DQWORD db = dst & blueMask;

			const DQWORD rr = _mm_srli_epi32(_mm_add_epi32(_mm_mullo_epi16(sr, sa),_mm_mullo_epi16(dr, da)), 8);
			const DQWORD rg = _mm_srli_epi32(_mm_add_epi32(_mm_mullo_epi16(sg, sa), _mm_mullo_epi16(dg, da)), 8);
			const DQWORD rb = _mm_srli_epi32(_mm_add_epi32(_mm_mullo_epi16(sb, sa), _mm_mullo_epi16(db, da)), 8);

			const DQWORD result = 
				_mm_or_si128(
					_mm_or_si128(
						_mm_slli_epi32(rr, 16), 
							_mm_slli_epi32(rg, 8)), 
								rb);
			
			PutPixel(xOffset, yOffset, result);
		}
	}
}
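For comparison, here is a rough sketch of the unpack/pack approach applied to the same blend, in the style of the Darken function earlier in the thread. BlendUnpacked is a hypothetical name, the pixels are assumed to be stored as B, G, R, A bytes (a 32-bit 0xAARRGGBB value on a little-endian machine), and no timing claims are made for it. One difference from the function above: the alpha byte goes through the same math as the colors instead of being cleared.

```cpp
#include <emmintrin.h>
#include <cstdint>

// Blends two pixels that have already been widened to 16-bit words:
// result = (src * a + dst * (255 - a)) >> 8, where a is the source alpha
// of the pixel each word belongs to.
inline __m128i BlendWords(__m128i s, __m128i d)
{
    // Broadcast word 3 (alpha of pixel 0) across the low four words and
    // word 7 (alpha of pixel 1) across the high four words.
    const __m128i a = _mm_shufflehi_epi16(
        _mm_shufflelo_epi16(s, _MM_SHUFFLE(3, 3, 3, 3)),
        _MM_SHUFFLE(3, 3, 3, 3));
    const __m128i invA = _mm_sub_epi16(_mm_set1_epi16(255), a);
    // s*a + d*(255-a) is at most 255*255, so it fits in 16 bits.
    const __m128i sum = _mm_add_epi16(_mm_mullo_epi16(s, a),
                                      _mm_mullo_epi16(d, invA));
    return _mm_srli_epi16(sum, 8);
}

// Blends four source pixels over four destination pixels.
inline __m128i BlendUnpacked(__m128i src, __m128i dst)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i lo = BlendWords(_mm_unpacklo_epi8(src, zero),
                                  _mm_unpacklo_epi8(dst, zero));
    const __m128i hi = BlendWords(_mm_unpackhi_epi8(src, zero),
                                  _mm_unpackhi_epi8(dst, zero));
    return _mm_packus_epi16(lo, hi);
}
```

This handles all four channels of two pixels per multiply instead of one channel of four pixels, which avoids the per-channel masks and shifts.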

If I can get the walkin dudes' sprites to align correctly, I also figured out a good way of handling the color key with SSE: cmpeq makes a mask, andnot with the mask keeps the sprite surface pixels, and with the mask keeps the background pixels, and or-ing the two results gives the correct pixels to draw. The walkin dude sprites are off, but this gives me another speed up of 1 ms, so I am down to 2.5 ms for the 100 walkin dudes.


	DQWORD keyMask = _mm_cmpeq_epi32(pSprite[index], key);
	// if element in keyMask is FFFFFFFF then priColor = bg, else is 0;
	DQWORD priColor = _mm_and_si128(bg, keyMask); 
	// Invert the mask, now if element in inverted keyMask is FFFFFFFF, altColor = pSprite[index]
	DQWORD altColor = _mm_andnot_si128(keyMask, pSprite[index]);
	// Or priColor and altColor, same as normal bit or, if element 1 is a color then it is used, if not, then the altColor's element 1 is used.
	DQWORD color = _mm_or_si128(priColor, altColor);

The lesson was done at 800 x 600, so not really pushing anything.

Post by **chili** » July 22nd, 2015, 11:22 am

That's not bad. Kinda what I had in mind for you to tackle next. How did you handle the memory alignment issues? Post your solution so I can take a look at it.

Post by **albinopapa** » July 22nd, 2015, 7:26 pm

I didn't load or store anything explicitly. Everything was either created on the fly or cast to __m128i * / DQWORD *. I will post my project as soon as I get it working again. Forgot to add a new branch like always and started moving sprite to its own class instead of the Sprite struct and free roaming functions. For the most part though, the alpha blending function I posted, with the exception of the SSEGetPixel, a PutPixel overload for the DQWORD data type and the masks, is all I changed.

Even in the PutPixel overload, I cast the pSysBuffer to a DQWORD *.
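Dereferencing a cast __m128i * generally behaves like an aligned access, so it relies on the address being a multiple of 16 bytes, and a sprite drawn at an arbitrary X won't guarantee that (xPos = X / 4 also rounds X down to a multiple of four pixels). A possible alternative, sketched here with the color key idea from earlier, is to use explicit unaligned loads and stores plus a scalar loop for leftover pixels. BlitRowColorKey is a hypothetical helper, not code from the project.

```cpp
#include <emmintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: copies one sprite row onto dst, leaving dst
// untouched wherever the sprite pixel equals the color key.
// Neither pointer needs to be 16-byte aligned.
inline void BlitRowColorKey(const std::uint32_t* sprite, std::uint32_t* dst,
                            std::size_t width, std::uint32_t key)
{
    const __m128i vKey = _mm_set1_epi32(static_cast<int>(key));
    std::size_t x = 0;

    // Four pixels at a time with unaligned loads and stores
    for (; x + 4 <= width; x += 4)
    {
        const __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i*>(sprite + x));
        const __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst + x));
        // Lanes are all ones where the sprite pixel matches the key
        const __m128i mask = _mm_cmpeq_epi32(s, vKey);
        // Background where the mask is set, sprite everywhere else
        const __m128i out = _mm_or_si128(_mm_and_si128(mask, d),
                                         _mm_andnot_si128(mask, s));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + x), out);
    }

    // Scalar tail for widths that are not a multiple of four
    for (; x < width; ++x)
        if (sprite[x] != key)
            dst[x] = sprite[x];
}
```

With this shape the destination pointer can be computed from the exact X, so the sprite does not have to snap to a four pixel boundary.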
