How do YOU multithread?

Post by **albinopapa** » July 22nd, 2015, 11:50 pm

1920 x 1080 and different alpha image. WalkinDude sprites are now drawing correctly. Three sprite classes to handle alpha, solid and color keyed.

100 WalkinDudes and 30 randomly placed alpha sprites get me around 14 ms. 1 alpha sprite and ~600 WalkinDudes gets me there also. That's just to render; I know that doesn't include the rest of the game code, like updating all 600 positions and curFrames. Don't know what the speed would be w/o SSE. I accidentally deleted the non-SSE versions of the functions and am supposed to be doing some honey-dos at the moment.

Post by **chili** » July 23rd, 2015, 4:11 am

Hmm, it crashes on the draw function for the walkin dudes for me. If I comment that out, it crashes in release and hangs in debug (I can see the bg and alpha images, but the frame times don't update).

Looks like here you just divided the sprite position by 4, eliminating positions that are not multiples of 4 pixels (16-byte aligned). The same is done for the sprite width. The trick is to get it to work for any position. With clipping.

I'd be interested to see how this compares to the alpha blitting code I wrote 3 years ago. In that code I used unpack and shuffle. It takes 75 us to draw a 400x300 image if the image is aligned, otherwise about 95 us, whereas the unoptimized code does it in 1000 us (i.e. 1 ms). So that's about 10 to 13x speedup right there.

By the way, comparing the SSE fade routine (both yours and mine, they are equivalent) with the C routine, you get 700 us vs. 11 ms to fade an entire 1920x1080 buffer on my machine. That's a speedup of almost 16x. Nothing to sneeze at.
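For reference, a minimal sketch of the kind of fade being compared here. Neither routine is shown in the thread, so the fade formula (each 8-bit channel scaled by fade/256), the function names, and the requirement that the pixel count be a multiple of 4 are all assumptions:

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference (assumed formula): channel = (channel * fade) >> 8.
void FadeScalar(uint32_t* pixels, size_t count, uint8_t fade)
{
    for (size_t i = 0; i < count; ++i)
    {
        const uint32_t p = pixels[i];
        // Process two channels at a time: bits 0-7 and 16-23, then 8-15 and 24-31.
        const uint32_t rb = (((p & 0x00FF00FFu) * fade) >> 8) & 0x00FF00FFu;
        const uint32_t ag = (((p >> 8) & 0x00FF00FFu) * fade) & 0xFF00FF00u;
        pixels[i] = rb | ag;
    }
}

// SSE2 version: widen bytes to 16 bits, multiply, shift, pack back to bytes.
void FadeSSE(uint32_t* pixels, size_t count, uint8_t fade)
{
    assert(count % 4 == 0); // assumption: no tail handling in this sketch
    const __m128i zero = _mm_setzero_si128();
    const __m128i f = _mm_set1_epi16(fade);
    for (size_t i = 0; i < count; i += 4)
    {
        __m128i* p = reinterpret_cast<__m128i*>(pixels + i);
        const __m128i src = _mm_loadu_si128(p); // unaligned load keeps the sketch safe for any buffer
        __m128i lo = _mm_unpacklo_epi8(src, zero);
        __m128i hi = _mm_unpackhi_epi8(src, zero);
        lo = _mm_srli_epi16(_mm_mullo_epi16(lo, f), 8);
        hi = _mm_srli_epi16(_mm_mullo_epi16(hi, f), 8);
        _mm_storeu_si128(p, _mm_packus_epi16(lo, hi));
    }
}
```

Both functions should produce identical pixels, which makes the scalar one a handy check while timing the SSE one.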

Post by **albinopapa** » July 23rd, 2015, 5:09 am

As far as clipping goes, would I draw 4 pixels at a time, then if there are 3 at the end, backup 1 pixel?

I guess that would also apply to drawing sprites whose width isn't a multiple of 4?
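The "back up" idea from the question above can be sketched for a plain row copy. This is a hypothetical helper, not code from the project, and it assumes the row is at least 4 pixels wide and that source and destination do not overlap:

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Copies `width` pixels, 4 at a time. If 1 to 3 pixels are left over,
// the last group is re-done starting at width - 4 so it ends exactly
// on the final pixel, overlapping pixels that were already copied.
void CopyRowOverlapTail(uint32_t* dst, const uint32_t* src, int width)
{
    assert(width >= 4);
    int i = 0;
    for (; i + 4 <= width; i += 4)
    {
        const __m128i px = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), px);
    }
    if (i < width)
    {
        i = width - 4; // back up so the final store stays inside the row
        const __m128i px = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), px);
    }
}
```

The overlap is harmless for a straight copy. For alpha blending it would blend the overlapped pixels twice, so a blended version would need a mask or a scalar tail instead.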

Post by **chili** » July 23rd, 2015, 5:23 am

Something like that. Clipping is the easy part, handling the misalignment is the hard part.
You can use unaligned loads to make your life a lot easier, but they carry performance penalties on a lot of architectures. I would suggest writing a very simple blt function (no alpha or clipping or anything fancy) that works at any position using standard loads, and then write another using unaligned loads and compare the difference in performance.
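A minimal sketch of the unaligned-load variant of that experiment. The function name and parameters are hypothetical; it assumes 32-bit pixels, a sprite width that is a multiple of 4, and a sprite that lies fully inside the destination buffer:

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Copies a srcWidth x srcHeight sprite to (x, y) in dst.
// No alpha, no clipping: just unaligned 16-byte loads and stores.
void BltUnaligned(uint32_t* dst, int dstWidth,
                  const uint32_t* src, int srcWidth, int srcHeight,
                  int x, int y)
{
    assert(srcWidth % 4 == 0);
    assert(x >= 0 && y >= 0 && x + srcWidth <= dstWidth);
    for (int row = 0; row < srcHeight; ++row)
    {
        const uint32_t* s = src + row * srcWidth;
        uint32_t* d = dst + (y + row) * dstWidth + x;
        for (int col = 0; col < srcWidth; col += 4)
        {
            const __m128i px = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + col));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(d + col), px);
        }
    }
}
```

Timing this against a version that only uses aligned loads and stores is the comparison suggested above.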

Post by **albinopapa** » July 23rd, 2015, 5:38 am

What do you mean by misalignment? Are you strictly speaking of the 16-byte memory boundaries, or of 2D arrays/sprites that don't have a width that is a multiple of 4? The reason I ask is that, since I haven't explicitly called _aligned_malloc on my pixel buffer, it might be that when I cast to DQWORD *, it's having to use 2+ cycles to reach across boundaries.

Post by **chili** » July 23rd, 2015, 6:24 am

Nah, _aligned_malloc isn't the problem. I mean like when you have a sprite at position x=1,2,3,5,6,7,..., you're going to be storing to misaligned memory. It looks to me like you're dividing the sprite position by 4 (const int xPos = X / 4;), and thus discarding the remainders. You can avoid misalignment that way to be sure, but you're also limiting yourself from using 75% of the possible x coordinates!
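To make the discarded remainder concrete, here is a small sketch that splits an x coordinate into the 4-pixel group index (what `X / 4` keeps) and the offset inside that group (what it throws away). The struct and function names are made up for illustration:

```cpp
#include <cassert>

// group:  index of the 16-byte (4-pixel) block, i.e. X / 4
// offset: position inside that block, 0..3, i.e. the lost remainder
struct PixelGroupPos
{
    int group;
    int offset;
};

inline PixelGroupPos SplitPixelX(int x)
{
    assert(x >= 0); // assumption: no negative (clipped) coordinates here
    return { x / 4, x % 4 };
}
```

With this split, x = 4, 5, 6 and 7 all land in group 1. Only the offset tells them apart, and that offset is what a blit at an arbitrary position has to handle.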

Post by **albinopapa** » July 27th, 2015, 7:14 pm

K, my hamster wheel is about to fall off lol.

1) Alignment.
You have memory addresses that are in multiples of 2, 4, 8, 16... 16 for SSE. The compiler is apparently emitting MOVDQA (the aligned move) both for buffers I allocate with new and for pointers I cast. So, if I try to access memory that isn't on a 16-byte boundary, I get access violations. Not sure why it wasn't working on yours though, chili, because as you saw I was only retrieving data from memory in multiples of 16 bytes. Anyway, here's what I have in my head.

The WalkinDude sprite has a width of 50, which means that all is fine until you get to element 48. The only way I have thought of to work around this is to let it load 51 and 52 (wrap around) and overwrite those with background colors. That would be fine up until the last row, in which there is no wrap around possible if the total size isn't a multiple of 16.

That then brings me to clipping.

2) Clipping
I'm more interested in screen clipping and preventing wrapping around. I've only one idea how to prevent wrap around on sprites, and that's to allocate memory as if it was a multiple of 16. I decreased the number of pixels of the WalkinDude sprite to 48 when I allocated memory for them. This, at least for me, drew the sprites correctly.
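One way to sketch the "allocate as if it was a multiple of 16" idea is to round the row width up to the next multiple of 4 pixels and fill the extra pixels with zero (fully transparent in an alpha sprite). The helper names are hypothetical. Note that `std::vector` does not guarantee 16-byte alignment, so this only addresses the row width, not the alignment of the buffer itself:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Rounds a width in pixels up to the next multiple of 4 (16 bytes of 32-bit pixels).
inline int PaddedWidth(int width)
{
    return (width + 3) & ~3;
}

// Copies a width x height sprite into rows of PaddedWidth(width) pixels.
// The padding pixels are left at 0.
std::vector<uint32_t> PadSprite(const uint32_t* src, int width, int height)
{
    const int padded = PaddedWidth(width);
    std::vector<uint32_t> out(static_cast<std::size_t>(padded) * height, 0u);
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            out[static_cast<std::size_t>(y) * padded + x] = src[y * width + x];
        }
    }
    return out;
}
```

For the 50-pixel-wide WalkinDude this gives rows of 52 pixels, so every row starts on a 4-pixel boundary relative to the start of the buffer and the last row has room for a full 4-pixel load.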

I've spent most of my time thinking of a "quick" way of shifting elements around to get the sprite to "float" across the 16-byte register. The best I can come up with is an if/else if block for the 4 positions of the sprite.

----------------------
Dammit! Unpackhi seems to pick the elements to the right while unpack low picks elements to the left. Are SSE registers different than memory storage? I figured HI meant most significant so should be on left...as in _mm_unpackhi_epi32 should get the left 2 elements, but it isn't. Although it is even more confusing when you create a __m128i data type for example using _mm_set_epi32(1,2,3,4) and it shows up in the debugger as (4,3,2,1) and then unpackhi you get the 2 and 1. So yeah, I'm guessing SSE registers are big endian?

Post by **albinopapa** » July 27th, 2015, 9:15 pm

I see I can use shuffle_epi32 to rearrange to get the order I want, but I still can't see a way around the if/else if/else blocks to get the shuffle order correct, as well as an if to determine whether I need to handle overlap with the background or the previous portion of the sprite.

Shuffling is weird.

Code:

struct PackedData
{
	int a, b, g, r;
};
// Masks for _mm_shuffle_epi32, with a as element 0 (lowest) and r as element 3.
// Results are listed from the highest 32-bit lane down to the lowest.

// Binary mask   |   Hex
// 00 00 00 00   |   0x00
// result: a, a, a, a

// 00 00 00 01   |   0x01
// result: a, a, a, b

// 00 00 00 10   |   0x02
// result: a, a, a, g

// 00 00 00 11   |   0x03
// result: a, a, a, r

// 11 00 00 00   |   0xC0
// result: r, a, a, a

// 11 11 10 01   |   0xF9
// result: r, r, g, b

// 11 11 11 10   |   0xFE
// result: r, r, r, g

I use the Windows calculator to convert from binary to hex, then put in the hex values.

I would have to do the same for both left and right sides and then mask the pixels together (and/andnot). That's my idea for getting the sprite to move across the registers and avoid unaligned moves. This still doesn't help with data widths not being a multiple of 16 bytes though. So that part I'm lost.

Post by **chili** » July 28th, 2015, 3:46 am

I don't think memory alignment is the cause of the crash/hang on my computer. Hard to tell though, because it was behaving differently in debug and release so it's hard to diagnose.

Having the pixels 'float' across the xmm register is basically similar to what I did (if I'm understanding you). You can use shuffle for this purpose because the pixels are 32-bits. I think I used shift si128; I'm not sure if there is any speed difference between the two.
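A rough sketch of the "float" idea using the si128 byte shifts and and/andnot masking. This is not the code from either poster. The helper names are hypothetical, and it assumes `row` is 16-byte aligned, x is non-negative, and the row extends to the end of every aligned group that gets touched. The shift counts must be compile-time constants, which is why the switch (the if/else block mentioned earlier) is still there:

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Moves pixels toward higher lanes (higher addresses) by 0..3 pixels, zero-filling.
inline __m128i ShiftPixelsUp(__m128i v, int pixels)
{
    switch (pixels)
    {
    case 0:  return v;
    case 1:  return _mm_slli_si128(v, 4);
    case 2:  return _mm_slli_si128(v, 8);
    case 3:  return _mm_slli_si128(v, 12);
    default: return _mm_setzero_si128();
    }
}

// Moves pixels toward lower lanes by 0..4 pixels, zero-filling.
inline __m128i ShiftPixelsDown(__m128i v, int pixels)
{
    switch (pixels)
    {
    case 0:  return v;
    case 1:  return _mm_srli_si128(v, 4);
    case 2:  return _mm_srli_si128(v, 8);
    case 3:  return _mm_srli_si128(v, 12);
    default: return _mm_setzero_si128(); // 4 or more: everything shifted out
    }
}

// Writes 4 source pixels to row[x..x+3] using only aligned loads and stores.
// Pixels outside that range keep their old (background) value.
void StoreFloating(uint32_t* row, int x, __m128i src)
{
    const int base = x & ~3;  // first pixel of the aligned group containing x
    const int offset = x & 3; // position of x inside that group
    const __m128i ones = _mm_set1_epi32(-1);

    __m128i* d0 = reinterpret_cast<__m128i*>(row + base);
    const __m128i mask0 = ShiftPixelsUp(ones, offset);
    const __m128i part0 = ShiftPixelsUp(src, offset);
    _mm_store_si128(d0, _mm_or_si128(part0, _mm_andnot_si128(mask0, _mm_load_si128(d0))));

    if (offset != 0)
    {
        // The pixels shifted out of the first group spill into the next one.
        __m128i* d1 = d0 + 1;
        const __m128i mask1 = ShiftPixelsDown(ones, 4 - offset);
        const __m128i part1 = ShiftPixelsDown(src, 4 - offset);
        _mm_store_si128(d1, _mm_or_si128(part1, _mm_andnot_si128(mask1, _mm_load_si128(d1))));
    }
}
```

Note that the "left" byte shift moves pixels toward higher addresses, which is to the right on screen.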

My suggestion is: forget about clipping or any other corner cases/complications. Just focus on getting a sprite of aligned width rendered to the buffer (also of aligned width) without clipping, but on any arbitrary x-coordinate to start (again, as long as it doesn't cause the sprite to be rendered outside of the buffer).

I can show you my code if you're still stuck. Also, use the _MM_SHUFFLE(a,b,c,d) macro to generate the shuffle mask.
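As a small illustration, the macro reproduces the hand-converted hex masks from the earlier post. The arguments go from the highest result lane to the lowest, and each one is the index of the source element to pick. The function below is a hypothetical example, not part of anyone's blitter:

```cpp
#include <emmintrin.h> // SSE2 intrinsics; also provides _MM_SHUFFLE
#include <array>
#include <cstdint>

// The macro yields the same constants as the binary-to-hex conversion.
static_assert(_MM_SHUFFLE(3, 3, 2, 1) == 0xF9, "r, r, g, b");
static_assert(_MM_SHUFFLE(3, 0, 0, 0) == 0xC0, "r, a, a, a");
static_assert(_MM_SHUFFLE(0, 0, 0, 1) == 0x01, "a, a, a, b");

// Applies the 0xF9 shuffle to four ints given in memory order (element 0 first).
std::array<int32_t, 4> ShuffleRRGB(const std::array<int32_t, 4>& in)
{
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in.data()));
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 2, 1));
    std::array<int32_t, 4> out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), v);
    return out;
}
```

With the input {a, b, g, r} in memory order, the output in memory order is {b, g, r, r}, which reads "r, r, g, b" when written from the highest lane down.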

P.S. You can get around the branching if with the pshufb instruction I think.

Also, you might want to look into XOP instructions. They are only on AMD processors, and I think they're being discontinued, but they look pretty sexy anyways.

Post by **chili** » July 28th, 2015, 3:53 am

albinopapa wrote:K, my hamster wheel is about to fall off lol.

Dammit! Unpackhi seems to pick the elements to the right while unpack low picks elements to the left. Are SSE registers different than memory storage? I figured HI meant most significant so should be on left...as in _mm_unpackhi_epi32 should get the left 2 elements, but it isn't. Although it is even more confusing when you create a __m128i data type for example using _mm_set_epi32(1,2,3,4) and it shows up in the debugger as (4,3,2,1) and then unpackhi you get the 2 and 1. So yeah, I'm guessing SSE registers are big endian?

The big-endian/little-endian disconnect messed me up quite a bit while writing the bit-reorganizing code for the Conway simulation. Shift left moves cells to the right with respect to the layout on the screen, neighboring cells are on opposite ends of the 128-bit dqwords (when you write them out as numbers) and so on. Numbers should be written left to right if you ask me :/
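A short sketch that reproduces the ordering described in the quoted post. `_mm_set_epi32` takes its arguments from the highest element to the lowest, while memory (and the debugger view) lists the lowest element first, which is little-endian ordering. The helper names are made up for illustration:

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <array>
#include <cstdint>

// Dumps a register to an array in memory order (lowest element first).
std::array<int32_t, 4> ToMemoryOrder(__m128i v)
{
    std::array<int32_t, 4> out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), v);
    return out;
}

// _mm_set_epi32(1, 2, 3, 4): 1 is the highest element, 4 the lowest.
std::array<int32_t, 4> SetExample()
{
    return ToMemoryOrder(_mm_set_epi32(1, 2, 3, 4));
}

// unpackhi interleaves the two highest elements of each operand,
// which are the last two in memory order.
std::array<int32_t, 4> UnpackHiExample()
{
    const __m128i v = _mm_set_epi32(1, 2, 3, 4);
    return ToMemoryOrder(_mm_unpackhi_epi32(v, v));
}
```

`SetExample()` gives {4, 3, 2, 1} in memory order, and `UnpackHiExample()` gives {2, 2, 1, 1}: the "hi" half is the most significant one, it just sits at the highest addresses rather than on the left.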

Planet Chili
