How do YOU multithread?

The Partridge Family were neither partridges nor a family. Discuss.
MrGodin
Posts: 721
Joined: November 30th, 2013, 7:40 pm
Location: Merville, British Columbia Canada

Re: How do YOU multithread?

Post by MrGodin » July 29th, 2015, 12:41 am

Derp, what is all this calamity about? Looks interesting. I too had it crash on drawing... something amiss here?
Curiosity killed the cat, satisfaction brought him back

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: How do YOU multithread?

Post by albinopapa » July 30th, 2015, 4:00 am

chili wrote:
...

P.S. You can get around the branching if with the pshufb instruction I think. :lol:
Also, you might want to look into XOP instructions. They are only on AMD processors, and I think they're being discontinued, but they look pretty sexy anyways.

Phenom II processors don't have SSSE3, nor do they have XOP apparently. Neither was implemented until the FX chips with the Bulldozer architecture. So I'm stuck with SSE/SSE2 integer functions until I upgrade sometime around Christmas.

I've been having trouble concentrating the past few days; I still think I drank too much over last weekend and haven't fully recovered. I started using the shuffle method only because I wasn't aware (or forgot) that there is a shift that moves all 128 bits as a single element rather than as smaller data types. I'll still have to use if blocks, since the shift counts have to be immediate constants and not an array of constants, which would have been super cool.
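To illustrate the immediate-constant limitation mentioned above, here is a minimal sketch of how a runtime byte count can be dispatched to the whole-register shift. It assumes SSE2 and the `_mm_srli_si128` intrinsic (psrldq); the helper name `ShiftRightBytes` is made up for this example and is not from anyone's code in this thread.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>

// Shift a whole 128-bit register right by `bytes` bytes.
// _mm_srli_si128 (psrldq) only accepts a compile-time constant count,
// so a runtime value has to go through a switch (or a chain of ifs).
// ShiftRightBytes is a hypothetical helper name.
inline __m128i ShiftRightBytes(__m128i v, int bytes)
{
    assert(bytes >= 0);
    switch (bytes)
    {
    case 0:  return v;
    case 1:  return _mm_srli_si128(v, 1);
    case 2:  return _mm_srli_si128(v, 2);
    case 3:  return _mm_srli_si128(v, 3);
    case 4:  return _mm_srli_si128(v, 4);
    case 5:  return _mm_srli_si128(v, 5);
    case 6:  return _mm_srli_si128(v, 6);
    case 7:  return _mm_srli_si128(v, 7);
    case 8:  return _mm_srli_si128(v, 8);
    case 9:  return _mm_srli_si128(v, 9);
    case 10: return _mm_srli_si128(v, 10);
    case 11: return _mm_srli_si128(v, 11);
    case 12: return _mm_srli_si128(v, 12);
    case 13: return _mm_srli_si128(v, 13);
    case 14: return _mm_srli_si128(v, 14);
    case 15: return _mm_srli_si128(v, 15);
    default: return _mm_setzero_si128();  // 16 or more: every byte is shifted out
    }
}
```

Calling `ShiftRightBytes(v, 4)` moves byte 4 of `v` into byte 0 and zero-fills the top four bytes. The branch is still there, but it sits in one place instead of being repeated at every call site.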

Thanks for the offer about your code, but I'm bound and determined to work on it a bit longer. I appreciate the mini lessons here.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

chili
Site Admin
Posts: 3948
Joined: December 31st, 2011, 4:53 pm
Location: Japan

Re: How do YOU multithread?

Post by chili » July 30th, 2015, 12:44 pm

MrGodin wrote:Derp, what is all this calamity about ?, looks interesting.
SSE stuff. Fun stuff. :)
albinopapa wrote:Thanks for the offer about your code, but I'm bound and determined to work on it a bit longer. I appreciate the mini lessons here.
Awesome, that's what I like to hear, but don't knock yourself out; sometimes you need to take a break for a few days/weeks/months. It'll always be there when the interest rekindles.

By the way, could you test this out for me on your machine? Press the space bar to activate unaligned mode. I need data for aligned and unaligned mode for the cases of POS=0 and POS=1 (use the arrow keys to change position).
Attachments
yes.zip
(174.33 KiB) Downloaded 107 times
Chili

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: How do YOU multithread?

Post by albinopapa » July 30th, 2015, 6:32 pm

Dead on run. The debugger shows the error at movntdqa.

Code: Select all

00171C16  add         eax,dword ptr [ebx]  
00171C18  ret         0F8C1h  
00171C1B  add         al,byte ptr [ebx]  
00171C1D  enter       0E1C1h,4  
00171C21  mov         dword ptr [ebp+8],ecx  
00171C24  mov         edx,dword ptr [edi]  
00171C26  mov         eax,dword ptr [ebp-0Ch]  
00171C29  mov         ecx,dword ptr [edi+0Ch]  
00171C2C  imul        edx,esi  
00171C2F  add         eax,edx  
00171C31  lea         edi,[ecx+eax*4]  
00171C34  mov         eax,dword ptr [ebp-10h]  
00171C37  add         eax,edx  
00171C39  lea         ecx,[ecx+eax*4]  
00171C3C  mov         eax,dword ptr [ebp-14h]  
00171C3F  mov         eax,dword ptr [eax+14h]  
00171C42  add         eax,dword ptr [ebp+8]  
00171C45  cmp         dword ptr [ebp-0Ch],0  
00171C49  je          00171C58  
00171C4B  movntdqa    xmm4,xmmword ptr [edi-10h]  
00171C51  psrldq      xmm4,8  
00171C56  jmp         00171C5C  
00171C58  movdqa      xmm4,xmm7  
00171C5C  cmp         edi,ecx  
00171C5E  jae         00171D08  
00171C64  movntdqa    xmm6,xmmword ptr [edi]  // arrow is here
00171C69  movdqa      xmm5,xmm6  
00171C6D  add         edi,10h  
00171C70  pslldq      xmm5,8  
00171C75  por         xmm5,xmm4  
00171C79  movntdqa    xmm4,xmmword ptr [eax]  
00171C7E  movdqa      xmm2,xmm5  
00171C82  punpcklbw   xmm5,xmm7  
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

chili
Site Admin
Posts: 3948
Joined: December 31st, 2011, 4:53 pm
Location: Japan

Re: How do YOU multithread?

Post by chili » July 31st, 2015, 3:43 am

It also crashes on my Core 2 Duo (Vista, 32-bit), but runs on my i7 (Win7, 64-bit) and on my i5 (Win7, 32-bit) at work. It could be that the sprite isn't allocated with _aligned_malloc, but then it should fail on the first mov, not the second. So maybe it's running past the end of the memory block, and some OSes are fine with that while others get pissed... Gonna have to look into this.
Chili

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: How do YOU multithread?

Post by albinopapa » July 31st, 2015, 4:07 am

Just wondering,

Code: Select all

je
jae
Isn't that where you have the if == and if >= checks? Perhaps the first one is just being skipped. Also, as far as I can tell from the MSDN docs, movntdqa is what _mm_stream_load_si128 compiles to, and it's SSE4.1, which Phenom doesn't support. Your Core 2 would only have it if it's one of the 45 nm Penryn chips; the earlier 65 nm ones stop at SSSE3. Just tried on a Core 2 Duo with Windows 10 64-bit: it crashes on open and won't let me go into debug.
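Since the crash comes down to whether the CPU has SSE4.1, a runtime check before taking the movntdqa path would avoid the illegal-instruction fault. This is only a sketch: `HasSse41` is a made-up name, and it relies on the documented CPUID layout (leaf 1, ECX bit 19 reports SSE4.1), using `__cpuid` from `<intrin.h>` on MSVC and `__get_cpuid` from `<cpuid.h>` on GCC/Clang.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid
#else
#include <cpuid.h>    // __get_cpuid (GCC/Clang)
#endif

// Returns true when the CPU reports SSE4.1 (CPUID leaf 1, ECX bit 19),
// the extension that movntdqa belongs to.
// HasSse41 is a hypothetical helper name.
inline bool HasSse41()
{
#if defined(_MSC_VER)
    int regs[4] = {};            // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return (regs[2] & (1 << 19)) != 0;
#else
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;            // leaf 1 not available
    return (ecx & (1u << 19)) != 0;
#endif
}
```

The idea would be to call this once at startup and pick the streaming-load routine only when it returns true, falling back to plain movdqa/movdqu loads otherwise.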
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

chili
Site Admin
Posts: 3948
Joined: December 31st, 2011, 4:53 pm
Location: Japan

Re: How do YOU multithread?

Post by chili » July 31st, 2015, 12:17 pm

Alright, replaced new with aligned malloc for the sysBuffer and got rid of the nt moves. It works on my laptop now, so hopefully it'll work on your machine too.
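For reference, swapping new for an aligned allocation could look roughly like the sketch below. The function names are invented for illustration and this is not the code in the attachment. It assumes `_aligned_malloc`/`_aligned_free` on MSVC and C++17 `std::aligned_alloc` elsewhere.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#if defined(_MSC_VER)
#include <malloc.h>   // _aligned_malloc, _aligned_free
#endif

// Allocate `bytes` bytes on a 16-byte boundary so movdqa-style
// aligned loads and stores are legal on the buffer.
// AllocAligned16 / FreeAligned16 / IsAligned16 are hypothetical names.
inline void* AllocAligned16(std::size_t bytes)
{
#if defined(_MSC_VER)
    return _aligned_malloc(bytes, 16);
#else
    // std::aligned_alloc wants the size to be a multiple of the alignment.
    const std::size_t rounded = (bytes + 15) & ~static_cast<std::size_t>(15);
    return std::aligned_alloc(16, rounded);
#endif
}

// Memory from AllocAligned16 must be released with the matching call.
inline void FreeAligned16(void* p)
{
#if defined(_MSC_VER)
    _aligned_free(p);
#else
    std::free(p);
#endif
}

// True when the pointer sits on a 16-byte boundary.
inline bool IsAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15) == 0;
}
```

A buffer for an 800x600 32-bit frame would then be `AllocAligned16(800 * 600 * 4)`, and `IsAligned16` makes a cheap debug check before any aligned load touches it.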

Interesting results for me: on my Haswell I get no speed difference at pos0 (the aligned case), whereas there is about a 17% performance hit in the aligned routine at pos1. Unaligned access is strictly better there.

On the Core 2, there is a 33% slowdown for unaligned at pos0, and the difference grows to a 2x slowdown at pos1. Aligned access is strictly better for performance there (unaligned still wins for ease/elegance of code, of course).

So which instructions/routine you should choose really does depend on the processor. I'm interested in seeing how your Phenom II (and FX, if you can access one) compares.
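For anyone following along, the two approaches being compared boil down to something like the sketch below. It is not the actual routine from the attachment: the function names are made up, and the fixed 8-byte offset simply mirrors the psrldq/pslldq by 8 visible in the disassembly posted earlier. SSE2 only.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Unaligned route: a single movdqu from any address.
inline __m128i LoadUnaligned(const std::uint8_t* p)
{
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
}

// Aligned route: read the 16 bytes starting at alignedBase + 8 using
// two aligned loads, two whole-register byte shifts and an OR.
// alignedBase must be 16-byte aligned with 32 readable bytes behind it.
inline __m128i LoadOffset8Aligned(const std::uint8_t* alignedBase)
{
    assert((reinterpret_cast<std::uintptr_t>(alignedBase) & 15) == 0);
    const __m128i lo = _mm_load_si128(reinterpret_cast<const __m128i*>(alignedBase));
    const __m128i hi = _mm_load_si128(reinterpret_cast<const __m128i*>(alignedBase + 16));
    // Upper half of lo drops into bytes 0..7, lower half of hi rises into bytes 8..15.
    return _mm_or_si128(_mm_srli_si128(lo, 8), _mm_slli_si128(hi, 8));
}
```

Both functions return the same 16 bytes for `LoadUnaligned(base + 8)` and `LoadOffset8Aligned(base)`. Which one is faster is exactly the processor-dependent part measured above.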
Attachments
yes.zip
(173.87 KiB) Downloaded 106 times
Chili

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: How do YOU multithread?

Post by albinopapa » July 31st, 2015, 7:06 pm

Not sure what the pos thing is about, but Phenom II 955 3.2GHz quad core:

Code: Select all

aligned pos 0: 0.184 - 0.187
unaligned pos 0: 0.185 - 0.190

// frustrating to get to pos 1
aligned pos 1: 0.210 - 0.214
unaligned pos 1: 0.198 - 0.201

If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: How do YOU multithread?

Post by albinopapa » August 2nd, 2015, 10:18 pm

I might have a solution to my original code acting funny on your machines. Check D3DGraphics::EndFrame. I don't account for any padding; I only assume the pitch will be a multiple of 16. That may not be the case at 1920x1080, but on my Core 2 Duo's onboard graphics at 800x600, the pitch was 4096 instead of 3200. The computer I code on has a GTX 560, and there it's always been 4 * screen width. It's possible that the pitch is what's causing the problem.

Anyway, I know we've moved on, but just thought of it and figured I'd at least put it out there.
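A pitch-aware copy for the case described above might look like this sketch. It assumes the destination pointer and pitch come from a locked surface (for example the `pBits` and `Pitch` members of a `D3DLOCKED_RECT`); the function name and parameters are invented for illustration and are not the actual EndFrame code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy a tightly packed 32-bit system buffer into a surface whose rows
// start `dstPitchBytes` apart. The pitch can be larger than width * 4
// (e.g. 4096 instead of 3200 at 800x600), so each row is copied on its own.
// CopyRowsWithPitch is a hypothetical helper name.
inline void CopyRowsWithPitch(const std::uint32_t* src,
                              std::size_t width, std::size_t height,
                              std::uint8_t* dstBits, std::size_t dstPitchBytes)
{
    const std::size_t rowBytes = width * sizeof(std::uint32_t);
    assert(dstPitchBytes >= rowBytes);
    for (std::size_t y = 0; y < height; ++y)
    {
        // Source rows are packed; destination rows follow the pitch.
        std::memcpy(dstBits + y * dstPitchBytes, src + y * width, rowBytes);
    }
}
```

When the pitch happens to equal width * 4 this degenerates into one contiguous copy, so the same routine covers both the GTX 560 and the onboard-graphics case.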
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: How do YOU multithread?

Post by albinopapa » August 3rd, 2015, 4:40 am

AMD FX 8350
Aligned POS 0: 0.104
Unaligned POS 0: 0.096
Aligned POS 1: 0.110
Unaligned POS 1: 0.100
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com
