
Looks like I’m up

So, I couldn’t think of anything cool and advanced (that isn’t covered by my NDA) to talk about on such short notice, so I figured I’d start with something easy.  My apologies to all the veterans on the list, since it’s basic stuff you already know.

We all know that when it comes to programming, there aren’t many languages as fun and exciting as assembly.  Unfortunately, in this crazy world of power lunches and tight deadlines, we don’t really get as many chances to write in assembly as we’d like.  However, being able to read and understand that alien language in your debugger’s disassembly tab is something every programmer needs to be able to do.  It’s essential for debugging crashes that occur only in release builds, diagnosing optimization-related compiler bugs, and better understanding what the compiler is thinking so you can make more informed optimization decisions.

Since all the non-handheld gaming platforms are based around PowerPC, I’ll be focusing on that.  Maybe I’ll update this to include ARM/NEON or MIPS someday.  VFPU would be awesome, but I’m not sure if that’s supposed to be secret (can anyone verify?).

Basic Calling Convention

The first thing you can do is familiarize yourself with the PowerPC calling convention and ABI.  If you know the calling convention and some basic instructions, you can extract almost any information you need. 

Let’s start with something simple that comes up often.  You step into a function and want to see the arguments that were passed in.  Mousing over a variable either gives you something like 0xFFFFFFFF or no value at all.  What can you do?  Well, let’s see what the generated code looks like for an update function:

void Doooooosh( Bag* bag,  DooooooshLevel level)
{  
mflr             r12  
bl               __savegprlr + 0034h (82eea8c4h)  
stfd             fr31,-38h(r1)  
stwu             r1,-90h(r1)  
mr               r31,r3  
mr               r30,r4

There are a few things you need to know about the PowerPC calling convention.

1) for non-member functions or static member functions, small non-float arguments ( int, bool, pointers, etc… ) are passed in r3 through r10

2) for C++ member functions, r3 is always the this pointer, and the remaining arguments are passed in r4 through r10

3) more often than not, float arguments are passed in floating-point registers ( fr1, for example )

So, knowing this is a C-style standalone function, all you have to do is set a breakpoint early in the function and look at r3 and r4.  Later on you’ll see why it has to be early.  To get the real values, all we have to do is open up a watch and cast each register to its expected type:

( Bag * )r3
( DooooooshLevel )r4

When working with a C++ member function we’d use r4 and r5 instead, and we could also get the this pointer using:

( SomeClassName * )r3

As a side note, if you’re wondering how the proper values end up in the right registers to make a function call, it’s set up like this:

Doooooosh( bag, level );
mr               r4,r30
mr               r3,r31
bl               Doooooosh (8293d0e8h)

mr is the “move register” instruction.  In the above example, mr r4, r30 takes the contents of r30 and copies it to r4.  We can assume that r31 and r30 contain the bag pointer and level, respectively.  Since C function calls expect their arguments to be in r3 and up, we have to copy all our arguments into those registers.  bl stands for “branch and link”: it branches to the function and saves the return address in the link register, and it’s how we usually make function calls.

Now there is a catch.  Remember when I said we have to look at these registers early in the function?  At the very beginning of Doooooosh( ) we can assume that the bag pointer and level will be in r3 and r4, respectively.  That’s just how function arguments are passed in.  But what if Doooooosh( ) calls another function?  Won’t that called function also need its argument in r3?  The point is that just because your function arguments are originally in r3 and r4 doesn’t mean you can expect them to stay there for long.  Taking a look back at the original example, you’ll see

mr               r31,r3  
mr               r30,r4

Basically, this is the code saying “I understand that r3 is probably going to get overwritten very soon, so I’m going to back up its value in r31.”  Any time after these two register moves are executed, we can get the function arguments ( more safely ) like this:

( Bag * )r31
( DooooooshLevel )r30

Remember, on the PS3 and Xenon, r3 through r10 are considered volatile and r14 through r31 are considered general-use non-volatile.  Non-volatile means that if you stick a value in r30 and then make a function call, r30 will be just as you left it when that call returns.  That is why, at the beginning of Doooooosh( ), we save all the argument registers ( r3 and r4 ) into safe non-volatile registers ( r31 and r30 ).

Some More Debugging Tips

Don’t be afraid to go back in the call stack if the info you need can’t be found by the above method.  For example, I once wanted to examine a string that only existed in a function further up the call stack.  The solution was to go up one call in the call stack and look for the bl function call.  A few lines above that, we were copying the function argument from r30 to r4 ( like we always do for function arguments ).  I moused over r30, cast it to a char *, and that gave me the string.  Remember that this usually only works for the non-volatile registers r14 to r31 ( this is because those registers are “spilled,” or copied into the stack frame; Visual Studio and the SN debugger are usually able to look in the stack frame to retrieve the saved register values ).

Getting local variables stored in registers can be a little tricky.  While I don’t think there is any one way that works 100% of the time, there are a couple of tricks that may help you through.  I’m sure with a little imagination, you’ll figure it out:

1) If the local variable is passed as the first argument of a function, look for it in the corresponding register ( r3 for a C function or r4 for a C++ member function ) right before the function call ( bl ).  If you need to catch it a little earlier, start at the function call and work backwards: if you know that the value in r31 is moved into r3 right before the call, work your way up the code and see where r31 is being set.  The lesson is, don’t be afraid to work backwards.

2) look for landmarks.  Often, the generated assembly won’t match the code very well.  Sometimes in mixed view, you’ll have what looks like 10 lines of perfectly good C++ code that seem to have no assembly code generated.  That’s when landmarks come in handy.  If you have something like this

float x;
x = sqrt( y );

manually scan through the function and look for some assembly opcode that looks like it could correspond to a floating point square root.  From there, you can see what the code does with the result and better trace through the assembly.  Some other good landmarks include incrementing, trig functions, floating point multiplies, loop conditions, NULL checks, and any other function that would have some stand out opcodes.

3) look for constant initializers.  If you have something like this

int x = 123;

and you see some assembly in the function that looks like

li r30, 123

You may have found a hint that r30 corresponds to x at this point in time.  By the way, in case you didn’t already know, li stands for “load immediate” and it loads an immediate value into a register.  Note that you can only load 16-bit constants this way.  32-bit constants take two instructions: one to load the upper 16 bits into place ( lis, “load immediate shifted” ), and one to OR in the lower 16.
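For example, loading a 32-bit constant usually shows up in the disassembly as a pair like this (the values here are illustrative, not taken from the example above):

lis              r30,1234h        // r30 = 0x12340000 (load immediate shifted)
ori              r30,r30,5678h    // OR in the lower 16 bits: r30 = 0x12345678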

4) if the local variable is used in a conditional, see what is being compared.  Compares look something like this

if( player_controller < 8 )  
cmpwi            cr6,r3,8  
bge              cr6, CPlusPlusSucks::AndSoDoesThisFunc + 0064h (8283457ch)

Most compare instructions begin with cmp.  Here you are comparing r3 with 8 and setting some result flags in cr6.  bge means “branch if greater than or equal.”  It checks the cr6 result flags that were set by the compare and then branches if appropriate.  The point is that we know for sure that at this point r3 is player_controller.  If needed, we can work our way backwards from here and look for useful information.

Stack Frame: When All Else Fails

(The original post included a diagram here of what the stack frame can look like on Xenon.)  If there is some weird bug you have to track down and all else fails, including good old-fashioned logical thinking about the problem at a higher level, you can draw out one of these stack diagrams and extract more information than you could get using some of the previous techniques.

PPC sets up its stack all at once at the beginning of the function, unlike LoseTel, which seems to do it as you push and pop.  The code will look something like this

stwu   r1, -96(r1)

Obviously r1 is the stack pointer, and stwu is a clever way of telling people to shut up.  Errr… I mean it’s “store word and update.”  In a single instruction, it stores r1 at the new address and then updates r1 with that address.  The update direction is negative because the stack grows towards low memory.  Since the caller’s SP is saved exactly at the top of the new stack frame, this is exactly what we want.

This can get you a few things.  First, it enables you to get a call stack in some cases where the debugger goes nuts.  It allows you to get the value of params that are too big or too numerous to pass in registers.  It also leads to your religious coworkers calling you a witch and trying to burn you for your black magic.
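To make the black magic a little more concrete, here is a minimal sketch of walking the back chain by hand.  Where exactly the saved LR sits within a frame is ABI-dependent, so treat LR_SAVE_SLOT as a placeholder to check against your platform docs:

#include <stdio.h>

// Walk the stack by following the back chain: per the convention above, the
// word at the top of each frame holds the caller's stack pointer.
#define LR_SAVE_SLOT 1  // word offset of the saved LR in a frame (ABI-dependent!)

void dump_callstack(void** sp)
{
    while (sp)
    {
        void** caller_sp = (void**)*sp;  // back-chain word at the top of the frame
        if (!caller_sp)
            break;                       // outermost frame reached
        printf("return address: %p\n", caller_sp[LR_SAVE_SLOT]);
        sp = caller_sp;
    }
}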

Here is a quick way to decipher instructions you may not know:

if it starts with “L”, it’s probably a load

if it starts with “S”, it’s probably a store ( instructions starting with “sl” and “sr” are bit-shifting operations )

if it starts with “F”, it’s probably a floating point math instruction

if it starts with a “B” it’s a branch.

if it has an “i” at the end of it, it’s probably taking input from an immediate rather than a register.

That’s the very basics.  Hopefully that should be enough to get you started reading and understanding your code’s disassembly.  Real understanding only comes with practice, so when you have free time (during rebuilds?), look at random bits of code in optimized and unoptimized builds and see how they differ.  Don’t just look at the code and see a bunch of instructions, one of which may or may not be a bl with a function name.  Instead, try to really understand every instruction and what the code is doing.  It’s not easy, but someday you’ll be a hero to your unenlightened coworkers who truly believe that optimized builds cannot be debugged by humans.

Love,
    Jaymin Kessler

Vectiquette


vect·i·quette

[vect-i-kit, -ket]
–noun
1. the code of ethical behavior regarding professional practice or action among programmers in their dealings with vector hardware: vector etiquette.
2. a prescribed or accepted code of usage in matters of SIMD programming, or a set of formal rules observed by programmers that actually care about performance.
3. How not to be a total douche and disrespect your friend the vector processor, who wants so badly to make your game faster.

A few months back I was talking to a friend who was doing his B.S. in Computer Science at a respectable school.  The conversation happened to drift towards SIMD.  My jaw hit the floor when he told me he had no idea what that was.  I gave him a basic rundown and you could see the excitement in his eyes.  I mean how could you not get excited? Even explained in the most basic oversimplified terms, the concept of doing 2, 4, 8, or 16 things at once instead of doing just one is universally appealing.  He went off all full of excitement and hope, ready to do some vector programming.  He came back one week later, beaten, dejected, and confused.  His vector code was orders of magnitude slower than his scalar code.

The problem isn’t just with college students.  We often get programming tests back from professionals that are quite good in many ways, but that show a complete lack of understanding with respect to good vector programming practices.  Mistakes tend to fall into two categories: lack of general vector knowledge, and assuming that what works on one CPU is also best practice on a different CPU.

While there are certain good guidelines to follow, be aware that different things carry different penalties on different CPUs, and the only way to write genuinely fast code is to know the details of the hardware you are targeting and the compiler you are using.  I know they probably told you in school not to code for quirks of a specific compiler, but by not doing so you miss out on tremendous opportunities (see techniques for splitting basic blocks in GCC, and rearranging conditionals to take advantage of GCC’s forward and backward branch prediction assumptions, as simple examples).

OK, let’s get started on the journey to efficiency! One of the biggest offenders in slow vector code is moving data in and out of vectors too much.  Often people calculate something into a bunch of scalar floats, move them into a vector, do a vector add, and then extract back to scalars.  What these people don’t realize is that moving in and out of vectors is rarely free, and is often one of the most expensive things you can do.  The main problem lies in the fact that on most systems, float registers and vector registers are two completely separate register sets.  More specifically, to get from float registers to vector registers, you often first have to store four floats out to memory, and then read them back into a vector register.  After your vector operation, you have to reverse the process to get the values back into scalars.  You basically took what could have been 4 consecutively issued adds (assuming your CPU/FPU has pipelined float operations, or a non-IEEE-compatible mode) and turned it into 4 scalar stores, 1 vector load, 1 vector add, 1 vector store, 4 scalar loads, and who knows how many stalls/hazards!  As Steven Tovey rightfully pointed out, if the alignment of the vector is bad, the number of vector loads could be 2, plus a bunch of permutes and permute-mask-generation instructions. Awesome!  As a general rule, you don’t want to mix scalar and vector calculations, and if you do, make damn sure that you aren’t just doing one or two vector operations.  You have to do enough in vectorland to justify the cost of getting in and out of the registers.
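Spelled out with SPU intrinsics, the round trip looks something like this (a sketch for illustration only; the exact instruction counts will vary by compiler and platform):

#include <spu_intrinsics.h>

// 4 scalar stores, 2 vector loads, one add, a vector store, scalar loads...
// all to replace what could have been a few pipelined scalar adds.
float sum_of_first_two_lanes(float x0, float x1, float x2, float x3,
                             float y0, float y1, float y2, float y3)
{
    float a[4] __attribute__((aligned(16))) = { x0, x1, x2, x3 }; // 4 scalar stores
    float b[4] __attribute__((aligned(16))) = { y0, y1, y2, y3 }; // 4 more
    const vec_float4 va  = *(const vec_float4*)a;                 // vector load
    const vec_float4 vb  = *(const vec_float4*)b;                 // vector load
    const vec_float4 sum = spu_add(va, vb);                       // the one op we came for
    float out[4] __attribute__((aligned(16)));
    *(vec_float4*)out = sum;                                      // vector store
    return out[0] + out[1];                                       // scalar loads to get back out
}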

Even if you are on a platform like NEON, where the vector registers and float registers alias each other, you still have to be careful.  On NEON, switching between scalar and vector mode requires a pipeline flush on newer Cortexes, and that can be semi-costly.  The problem here is almost the opposite of what we described before: instead of moving things in and out of vector registers and calling only vector instructions, you are keeping things in the same registers but mixing scalar and vector instructions.  If you are going from general-purpose registers to NEON, it’s just as bad.  Unlike the PS3’s PPU, which needs to go through memory, ARM<->NEON actually has forwarding paths between register sets, but there is still an asymmetrical cost associated with the transfer.  It’s just something to think about when you think you have a free pass to mix scalar and vector code.

Building whole vectors isn’t the only way to screw yourself.  Unfortunately, one of the most common things people do with vectors often results in horrific performance!  Take a look at this:

// this makes baby altivec cry
some_vec.w = some_float;

See what I did there?  We are inserting a non-literal stored in a float register into a vector.  I don’t mean to sound idealistic, but if you are wrapping built-in vector types in structs, I think it’s best not to define functions for inserting/extracting scalars (depending on the CPU).  If they are there, people will use them.  The least you could do is name them something horrific like

inline void by_using_this_insert_x_function_I_the_undersigned_state_that_i_know_and_understand_the_costs_associated_with_said_action_and_take_full_responsibility_for_the_crappy_code_that_will_undoubtedly_result_from_my_selfishness_and_reckless_disregard_for_good_code( float x );

There, that ought to teach ’em a lesson!

There is a clever way to get around some of the above hassles, and it’s lovingly referred to as “float in vector.”  The concept is simple enough.  Instead of using floats all over the place, you make a struct that acts like a float but is internally a vector.  This lets you write code that looks like a mix of vector and scalar, but actually lives entirely in vector registers.  While some_vec * some_float could be disastrous in some cases, if some_float is secretly a vector, this will compile to a single vector multiply.  Hot tip: duplicate your scalar to all lanes of the float-in-vec’s internal vector, because that allows code like the previous example to work unaltered.
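A minimal float-in-vec sketch using SPU intrinsics (the name FloatInVec and the details here are mine, not any particular library’s):

#include <spu_intrinsics.h>

struct FloatInVec
{
    vec_float4 v;  // the scalar, duplicated to all four lanes

    explicit FloatInVec(vec_float4 splatted) : v(splatted) {}
    explicit FloatInVec(float x) : v(spu_splats(x)) {}  // the only place we pay to get in
};

// "vector * scalar" stays entirely in vector registers: one multiply, no moving around
inline vec_float4 operator*(vec_float4 a, FloatInVec s)
{
    return spu_mul(a, s.v);
}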

One last thing I want to quickly mention before moving on to code-writing tricks.  Aside from the PS2 VUs, most vector units don’t have cross-vector math operations (very useful for dot products).  Therefore, while code like vec.x * vec.x + vec.y * vec.y + vec.z * vec.z can technically be done completely in vector registers, it takes a lot more work to move stuff around.  For a way around this, see point 7 below.

Giving GCC What It Wants

Another important point is to understand the personality of the compiler you are using.  Don’t take the attitude that the compiler should do something for you.  As a programmer, it is your job to help out the compiler as much as possible (best case) and not make the compiler’s job harder (worst case).  So, what does good vector code look like on GCC?  The list below is in no way exhaustive, but it contains a few semi-useful tips that can make a big difference.  I’ll try reeeeeally hard to keep each item brief so it serves as a good introduction, but if you want more details, feel free to ask me (or Google).

1) If possible, use lots of const temporaries.  Storing the results of vector operations in lots of const temporaries helps GCC track the lifetime of values in more complex code, and therefore helps the compiler keep stuff in registers.
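Something like this, say (a trivial sketch using SPU intrinsics; the point is that every intermediate result gets its own const name):

vec_float4 transform(vec_float4 pos, vec_float4 scale, vec_float4 bias, vec_float4 gain)
{
    const vec_float4 scaled = spu_mul(pos, scale);
    const vec_float4 biased = spu_add(scaled, bias);   // each temporary has one
    return spu_mul(biased, gain);                      // clear, short lifetime
}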

2) If a type fits in a register, pass it by value.  DO NOT PASS VECTOR TYPES BY REFERENCE, ESPECIALLY CONST REFERENCE.  If the function ends up getting inlined, GCC occasionally will go to memory when it hits the reference.  I’ll say it again: if the type you are using fits in registers (float, int, or vector), do not pass it to a function by anything but value.  In the case of non-sane compilers like Visual Studio for x86, which can’t maintain the alignment of objects on the stack, objects that have align directives must be passed to functions by reference.  This may be fixed on the Xbox 360.  If you are multiplatform, the best thing you can do is make a parameter-passing typedef to avoid having to cater to the lowest common denominator.
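A hedged sketch of such a typedef (the names are placeholders; key the #if off however you detect the offending compiler):

// Pass by value where the ABI keeps vectors in registers; by const reference
// only where the compiler can't keep stack objects aligned (e.g. VS on x86).
#if defined(_MSC_VER) && defined(_M_IX86)
    typedef const Vector4& Vector4Param;
#else
    typedef Vector4 Vector4Param;
#endif

Vector4 Add(Vector4Param a, Vector4Param b);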

3) On a related note, always prefer returning a value to returning something by reference.  For example

// bad
void Add(Vector4 a, Vector4 b, Vector4& result);

//good-er
Vector4 Add(Vector4 a, Vector4 b);

The above code shows standalone (non-member) functions, but this applies to member functions as well.  Remember that this is a very C/C++ thing.  If you are writing in a nutso language like C#, it can be over 40x faster to return by reference, because of the compiler’s inability to optimize simple struct constructors and copies.

4) When wrapping vector stuff in a struct, make as many member functions const as possible.  Avoid modifying this as much as you can.  For example

// bad, it sets a member in this
void X(FloatInVec scalar);

// good, it creates a temporary vec and returns it in registers
Vector4 X(FloatInVec scalar) const;

Not only does this help out the compiler, but it also allows you to chain stuff in longer expressions.  For example: some_vec.w(some_val).normalize().your_mom();
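For the chain above to make sense, the const “setter” might look something like this sketch (Vector4’s m_quad member and the shuffle pattern s_ABCa are hypothetical names of mine):

// returns a new Vector4 with the w lane replaced; *this is never modified
inline Vector4 Vector4::w(FloatInVec scalar) const
{
    // s_ABCa: a shuffle pattern taking lanes 0,1,2 from the first operand
    // and lane 0 from the second (pattern constant not shown)
    return Vector4(spu_shuffle(m_quad, scalar.v, s_ABCa));
}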

5) For math operations on built-in vector types, using intrinsics is not always the same as using operators.  Let’s say you have two vectors.  There are two ways to add them:

vec_float4 a;
vec_float4 b;
vec_float4 c = a + b;
vec_float4 d = spu_add(a, b);  // I like si intrinsics better but…

Which is better depends greatly on the compiler you are using and the version.  For example, in older versions of GCC, using functions instead of operators meant that the compiler wasn’t able to do mathematical expression simplification: it had semantic information about the operators that it didn’t have for the intrinsics.  However, I have heard from a few compiler guys that I should avoid using operators, because most of the optimization work has gone into intrinsics, since that is the most used path.  I’m not sure if this is still true, but it’s definitely worth knowing that the two paths aren’t necessarily equal, and you should look out for what your compiler does in different situations.

6) When not writing directly in assembly, separate loads from calculations.  It’s often a good idea to load all the data you need into vector registers before using the data in actual calculations.  You may even want to include a basic-block splitter between the loads and the calculations.  This can help scheduling in a few ways.
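For example (the empty asm with a memory clobber is one common way of splitting a basic block in GCC; whether it actually helps is something to verify against your own compiler’s output):

void add_batch(const vec_float4* in, vec_float4* out)
{
    // all the loads up front...
    const vec_float4 a0 = in[0];
    const vec_float4 b0 = in[1];
    const vec_float4 a1 = in[2];
    const vec_float4 b1 = in[3];

    asm volatile("" ::: "memory");  // basic-block split between loads and math

    // ...then all the calculations
    out[0] = spu_add(a0, b0);
    out[1] = spu_add(a1, b1);
}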

7) Depending on what you plan to do with your data, consider using SoA (structure of arrays) instead of AoS (array of structures).  I won’t go too far into the details of SoA, but it basically boils down to having 4 vectors containing {x0, x1, x2, x3}, {y0, y1, y2, y3}, {z0, z1, z2, z3}, {w0, w1, w2, w3} instead of the more “traditional” {x, y, z, w}.  There are a few reasons for using this.  First of all, if the code you are writing looks and feels something like this

FloatInVec dist = vec.x * vec.x + vec.y * vec.y + vec.z * vec.z

it can be a bit of a pain when your vectors are in {x, y, z, w} form.  There is a lot of painful shifting and moving things around, and a lot of stalls, because you can’t add the x, y, and z products until you line them up.  Now let’s look at this as SoA

Vector4 x_vals, y_vals, z_vals;
Vector4 distances = x_vals * x_vals + y_vals * y_vals …

(image from slide 49 of Steven Tovey’s excellent presentation: http://www.spuify.co.uk/?p=323)

Now you can freely write code that looks kinda scalar, but you don’t have to extract and move around the x, y, and z values.  Everything is already where it needs to be.  Also, unlike the first example, you get four for free!  If you are doing cross-vector operations or using x, y, and z independently in calculations, and you have many to do at once, it might be a good idea to use SoA.  Depending on the latency of the instructions involved, you might even want to consider unrolling to fill in any gaps caused by stalls.  Speaking of which…

8) Depending on how many registers you have, consider unrolling.  Don’t just randomly do it: first look at your code in a pipeline analyzer to see if it would even help, and to check register usage.  If there aren’t many gaps, or you are already using most of the available registers, unrolling probably won’t help, and it may even end up making your code slower due to spilling or increased pressure on the i-cache.

9) On the SPUs (or any other hardware that has no scalar support), be very wary of consecutive writes through scalar pointers.  There is no way to know at compile time whether consecutive writes of values already in registers will go to addresses in the same 16-byte vector, so the compiler must be very conservative.  In this case, restrict won’t help.
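For example, something as innocent-looking as this forces the compiler into whole-quadword read-modify-write sequences (illustrative only):

void store_two(float* p, float a, float b)
{
    // each line becomes: load the containing quadword, shuffle the scalar in,
    // store the quadword back -- and the compiler can't prove whether p[0] and
    // p[1] share a quadword, so it can't merge the two round trips
    p[0] = a;
    p[1] = b;
}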

10) Know your alignment requirements: what constitutes an unaligned load, and what the penalties are for different alignments.

Tacked On Advanced Topic: Odd and Even Pipelines

Jonathan Adamczewski rightfully pointed out that this section felt a little out of place, bolted on, and not as fleshed out as some of the sections above.  Also, it made my blog post a little too long, so I cut it.  But don’t worry: those of you who were just dying to hear me drone on and on about the art of balancing odd and even pipelines will get the chance in my next post.  It works out well for me, because I was almost completely out of ideas as to what to write next.  So please wait for it.

Conclusion

Here is the lesson: I don’t care how smart you think you are, use your perf tools and look at the disassembly.  It’s never enough to look at your source code and say it looks faster.  It’s a bad idea to time your stuff, notice that your new optimized version is slower, and then not try to find out why.  Also, it’s a terrible idea to take things I say in this as absolute fact (or even remotely correct) without verifying for yourself.

Beware of simple tests.  Optimizers are complicated beasts, and just because X is inlined in your test or Y is scheduled nicely doesn’t mean it will be so in the real world.  Whenever possible, test your code where it will be used.

Shoutouts to

  • @nonchaotic the low level ninja, editing buddy, and the guy who hopefully stops me from saying anything too horribly stupid.
  • @Matt_D_ for reminding me why I could never be an English major (or speaker)
  • @twoscomplement for verifying what I already knew to be true: that I can drone on and on and forget what my own point was
  • @CarterBen for reminding me that while he may know what I mean when I use vague language, others may totally misinterpret my words
  • @DylanCuthbert for helping me suck at life less

Remember http://6cycles.maisonikkoku.com for all your stupid SPU tricks, especially if you are doing 2D stuff

On Demos and Programming Tests : Rant From A Q-Games Test Reviewer


by Jaymin Kessler (of Q-Games)

http://6cycles.maisonikkoku.com

@okonomiyonda

This one is for the kids, and I mean that in the least pedo way possible.

I like hanging out in the GDC career pavilion, and not just because it’s filled with companies giving away free shirts ( *ahem* like those awesome Insomniac shirts I would looooooove to get my hands on ).  The career pavilion just has this energy to it that comes from being filled with soon-to-be grads doing everything in their power to achieve their dream and break into a notoriously difficult-to-enter industry.  It’s human drama at its finest, and it’s all there: the exhilaration of getting your first industry job, the heartbreak of being rejected, the uncertainty, the camaraderie… It’s the hopes and dreams of the next generation of game makers, and it’s all packed into one room at the Moscone Center.

But are these future gamemakers /really/ doing all they can to get hired?  I dare say no.  It all starts with one simple question: “so, got any demos?”

We get a lot of applications from new-to-the-industry applicants here at Q, but we very rarely see any demos.  If we do see demos, often they are the kind of demos that hurt our eyes and make us wish we hadn’t seen them.  So what’s going on?  Personally, I don’t think everyone fully understands 1) what a demo is and isn’t, and 2) why they are so damn important, especially for new grads with no professional experience.

Let’s get this out of the way.  A demo is _not_ proof that you meet some minimum eligibility requirement.  Yeah, I am talking to you, guy who submitted the rotating cube.  Good lookin’ out, B.  Maybe you thought you were allaying some deep-rooted fear we have that you might not know how to use GL immediate mode to draw a quad, but really all you did was make yourself look bad.

A demo doesn’t have to be a full game.  In other words, please don’t feel the need to go out of your way to prove that you can do a wide range of things in a mediocre way.  Sometimes it’s better to focus your attention on one or two things and put a lot of love into them.  For example, you could write a really interesting multicore job manager and memory allocation system.  There is no rule that says all demos have to be graphics demos.  Because not many college students are thinking about memory allocation or Data-Oriented Design, doing stuff like that instead of the same bump mapping demo all your classmates are doing can be a great way to stand out!  All my pre-industry demos were various non-graphics things written in VU assembly on PS2 Linux, and I think that ended up helping me.  Also keep in mind that code quality matters.  Be meticulous.  Definitely more meticulous than I am with my grammar and spelling.  Even if your demo looks gorgeous, things like memory leaks all over the place, OOP, random unnecessary virtuals, and other horrific abominations might make the guy reviewing your demo think twice.  It might mean the difference between hiring you or the other guy in some cases.

Takeaway: for me at least, the point of a demo is to make you stand out from other candidates (in a good way).  We can’t hire everyone, so make us think you’re the one we’ve been waiting for.  Show us something cool that other people can’t do, wouldn’t normally think about, or don’t have the time to do.

That brings me to the subject of timing.  An hour after the company you are applying to asks you for a demo is not the right time to start making one.  I’d even go as far as to say a month before you start applying at places still isn’t good enough.  It’s hard to make a technically interesting demo and polish it to a high degree in a short amount of time.  Really, you should have “labor of love” home projects that you work on, obsess over, refine over long periods of time, and update/rewrite as you become a better programmer and learn new stuff.  To put it more bluntly, I don’t care if you went to MIT, Caltech, Bergen Community College of Paramus NJ, or some crappy game specialty “school.”  Reading books and articles, and trying stuff at home (and failing), is how you improve.  If you’re about to graduate and you haven’t been doing this, there is a higher probability you are screwed.  Of course, if you are reading #AltDevBlog, you already are the kind of person who reads and learns on their own.  Good for you.

A lot of people consider their school team projects to be demos.  There is nothing wrong with that, per se, but it always raises doubt in my mind as to how much of the work was really done by the candidate.  You can throw around impressive-sounding phrases like “oh, I was the gameplay lead on that project,” but what does that really mean?  It’s hard to look at a piece of code that was touched by 5 programmers and infer some idea of a candidate’s skill from it.  Also, having school projects only and no personal projects is another big red flag for me.

EDIT: One last annoyance I forgot to mention in my original post.  Of the demos that we do receive, a large quantity don’t actually work.  We just can’t run them.  Either they crash on boot for some unknown reason, some lib is missing that keeps them from starting, or the candidate assumes everyone must be running super-uber-premium-happy-lucky-family-Windows-2032-better-than-you edition.  Here at Q-Games, like many people in the world, we use 32-bit Windows XP and only use OpenGL.  So please please please test your stuff not only on your own box, but also on friends’ computers running different OSs.  Or, as an alternative, include HD screenshots and videos of your demo.  Since you can never guarantee it works for everyone, at least we can have an idea of how cool your demo would have looked if we were actually able to run it!

Some Quick Notes On Programming Tests

Cool, so you have an ill demo, but you still have to take the dreaded programming test.  These range from brainteasers (which I think are an absolutely atrocious idea), to 30-minute quick tests, to 1-week tests, to non-technical general knowledge tests.  I’ll focus mainly on longer tests, although some of this applies to shorter tests as well.

When it comes to programming tests, there is no need to freak.  Either you know the stuff or you don’t.  Going back to home projects: the more you do at home, the more you increase the likelihood of having solved, or at least seen, one of the problems on the test.  Maybe you can apply something cool you discovered at home in some cool new way for the test.

For me, the golden rule is “there are no simple questions.”  When I say that, I don’t necessarily mean that my skill level is so low that even simple stuff is difficult for me, although there may be some truth to that.  What I really mean is that an easy question is actually a golden opportunity to show off and do something different.  Just because you /can/ answer it in a few minutes doesn’t mean you /should/.

I’ll say it again: a simple question is never a simple question.  One of the things I love the most about the Q-Games programming test is that many of the questions are simple enough that anyone can answer them, but open enough to let people really go to town and do some interesting stuff.  Many of the questions are of the type “let’s say you want to do X; think up an algorithm that does it within these constraints” or “come up with a data structure that can represent Y.”

One of my main reasons for writing this rant is that I got a little tired of people just not putting in the effort.  Let’s say the test asked you to write a doubly linked list.  It’s a very basic concept, and we all saw them in college (except maybe for the game specialty school students).  You could probably answer in under 10 minutes with the “standard” implementation.  The question seems simple on the surface, but like I said before, it’s really just an excuse to go crazy and consider lots of possibilities.

Do you use sentinel nodes or not?  Is the list ordered or not?  When you add a node, do you add it to the front or the back?  Why?  What does your node look like?  Does it contain all the data, a pointer to the data, or some search key plus a pointer to the data?  Which is more cache friendly, and in which cases?  Do you use pointers or indices into a node array?  Does it even matter?  What can you assume about how it will be used?  If you don’t use intrusively linked lists, will having to write a copy constructor or operator= be an issue?  Do you preload the next node when searching?  If you do preload, how do you deal with the possibility of the next pointer being null?  How do you allocate nodes?  Malloc?  Fixed block allocator?  All at once, or one at a time as needed?  Does it have to be safe to add from multiple threads?  Does it have to be safe to remove from multiple threads?  Should you throw mutex locks or lwmutexes around every function just to be safe, or can you come up with something more parallel?  How about stating the problems with linked lists and coming up with a better solution that meets the problem constraints?  What are the problem constraints?

See what I mean?  Those are just a few of the things you can think about when answering.  You can be the candidate who quickly regurgitates the standard implementation learned in school, or you can be the candidate who takes the extra time to really do something special and interesting.  I highly recommend the latter.

I look forward to reviewing your test :)

note:

the opinions expressed in this rant are entirely my own, and may not reflect those of the other test-reviewing tech team members, who are undoubtedly shaking their heads in disbelief as they read this

I promise I will try to do a proper technical post next time.  I just didn’t have the time to finish the post I wanted to do.

Put This In Your Pipe And Execute It


Jaymin Kessler (Q-Games)

http://6cycles.maisonikkoku.com

@okonomiyonda

When we last left off two articles ago ( vectiquette, or alternatively “What GCC Wants” ), I was just about to get around to talking about dual-issue pipelines and the things you can do to raise that all-important indicator of your worth as a human: the dual-issue rate.  Since then I have been thinking a bit about some of the other little things you can do that help out.  I figured it would be fun to look at some actual examples and see the effects in the assembly.

This is by no means an exhaustive list.  Quite the opposite: this is going to have to be relatively short, as I am off to GDC soon.  As a result, some of it may be disjointed, out of place, spelled wrong, grammatically weird, jumping all over the place, nonsensical, or just weird.  Sorry.

(note to readers: I will try to point out things that can be different among various CPUs where applicable, but I will mainly be talking about one CPU in particular.)

Now Here’s a Funky Introduction

As you can probably guess, dual-issue is all about issuing two instructions in the same cycle.  It is also referred to as superscalar issue, where scalar comes from the Latin word meaning “to suck” and super probably means “to do something less”?  Types of dual-issue CPUs have been around since the Crays of the 1960s, and it’s fairly common in most modern CPUs these days.

(very very very simplified SPU diagram that doesn’t show pipeline stage details (borrowed from IBM))

On the SPUs, 1-word instructions are fetched in groups of 32 at a time (128 bytes), eventually making their way to the instruction line buffer.*  From the instruction line buffer, the instruction issue unit fetches and decodes a doubleword-aligned pair of instructions.  This introduces some limitations: you can’t just dual-issue any two consecutive instructions starting at an arbitrary address.  Because instructions are fetched in pairs, two at a time from a doubleword boundary, you can only dual-issue if the first instruction in the pair is an even-pipe instruction and the second is odd (note: loads and stores have to wait for free LS cycles).

Don’t freak if you’ve never heard the terms odd and even instructions.  On the SPUs, all instructions have a type, and each type usually has to be routed to one of seven execution units via a specific pipeline, although I think some chips are looser with this restriction.  These pipelines are usually called something like Odd and Even, A and B, 1 and 2, Fry and Laurie, or something like that.  In the diagram above, you see that permutes, loads/stores, channel instructions, and branches are odd-pipeline instructions, while everything else is even.

Finally, the instructions can’t have dependencies on each other’s results.  This makes sense: if the second instruction needed the first one’s result, the first would have to write that result back to the register file (or through some fancy result-forwarding thingy) before the second fetched its input operands, so issuing the two together would give you garbage.

Note that dual-issue doesn’t guarantee simultaneously issued instructions will finish at the same time.
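So a pair like this can dual-issue, while swapping the two instructions, or making the second consume the first’s result, would kill it (a hand-wavy illustration that ignores the fetch alignment details above):

fa     sum, a, b          // even pipe: floating-point add
lqd    next, 16(ptr)      // odd pipe: independent load -- both can issue this cycle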

* cool but rarely useful: 3.5 fetched lines are stored in the instruction line buffer.  One line is for the software-managed branch target buffer, two lines are used for inline prefetching, and that 0.5 line holds instructions while they are sequenced into the issue logic.  I didn’t actually know that before.  I accidentally found it while looking for articles to verify that what I thought I knew about instruction issue wasn’t total lies.  Pretty cool, right?

Dopplegangers

Let’s get started by looking at an SPU relocatable in a static pipeline analyzer (the analyzer screenshot from the original post is omitted here).

The left side (red) is the even pipeline and the right side (blue) is odd.  Instructions that occur on the same line are issued the same cycle, and cycle numbers are estimates that don’t always reflect the actual time your program will take (blah blah blah branch mispredicts, consecutive loads and stores from the same address, etc…).  The first thing you should take note of is that giant glaring hole in the even pipeline from cycles 15 to 28.  Those are cycles where the odd pipeline is working its butt off while the even pipeline lazily loafs about like some new-age hippie freeloader.  That’s what we want to prevent.

Let’s say we wanted to do something that consists of 4 odd instructions.  Since the odd pipeline is pretty much full, adding a bunch more odd instructions would definitely increase the length of our loop.  However, if we could find a way to do what we want in even instructions, we might be able to schedule them into that big gap, essentially getting them for free.  Even if the even-pipeline version consisted of 12 instructions, it may still be a win.  Mind you, all of this is a gross oversimplification.  First of all, I mention instruction count but don’t say whether the instructions depend on previous instructions (in their own pipeline or in the other) or how many cycles each instruction takes.  Also, pure odd <-> even swaps are rare.  Usually you make a version that reduces, rather than eliminates, instructions on one pipe at the cost of adding a bunch of instructions on the other.

So let’s see a concrete example, completely unrelated to the odd-pipeline-bound example above.  This little nugget of wisdom comes from Mark Cerny.  The straightforward way to calculate the absolute value of a floating-point register is to AND the register with 0x7FFFFFFF (1 even instruction, 2 cycles).  If your code is even-bound, it is possible to calculate fabs() on a pair of floating-point registers (8 floats total) using 6 odd instructions:

// Shuffle most significant bytes into a single qword, separated by zero bytes
shufb temp, input1, input2, s_0A0E0I0M0a0e0i0m

// Left-shift by one bit.  The sign bits are now in their own bytes.
shlqbii temp, temp, 1

// Shuffle in-place to mask off the sign bits.
shufb temp, temp, temp, s_0B0D0F0H0J0L0N0P

// Shuffle the MSBs back into their original places
shufb output1, input1, temp, s_bBCDdFGHfJKLhNOP
shufb output2, input2, temp, s_jBCDlFGHnJKLpNOP
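For reference, here is the even-pipe version the trick above replaces.  The mask setup is two more even instructions, but it’s loop-invariant, so in practice you’d hoist it out (a sketch of mine, not compiler output):

ilhu   mask, 0x7fff     // even: mask = 0x7FFF0000 in each word
iohl   mask, 0xffff     // even: mask = 0x7FFFFFFF in each word (hoist these two)
and    out, in, mask    // even: clear the sign bits -- fabs() on 4 floats at once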

If you have *four* registers worth of floats, you can obviously do 4 even, or 12 odd, or 2 even / 6 odd.  Another option is 1 even / 7 odd: 3 shufbs to get the sign bytes into one register, a single xorbi 0x80, and then four shuffles to get the sign bytes back where they belong.

So that was an example of how to take something that is normally 1 even instruction and rewrite it using 6 odd instructions.  This next example comes from the Naughty Dog ICE team, and it was passed along to me by Cort Stratton.  All these examples take four three-element vectors and transpose them from mixed to channel format.  Some use entirely odd instructions, some use equal amounts of even and odd, and some are even-pipe heavy.  The original document contains a huge number of variations; I am pasting three random ones here.

(0 even/7 odd) Latency: 11 cycles

—————

shufb temp1, in1, in2, s_AaBb     // temp1 = x1 x2 y1 y2
shufb temp2, in3, in4, s_BbAa     // temp2 = y3 y4 x3 x4
shufb temp3, in1, in2, s_CcCc     // temp3 = z1 z2 z1 z2
shufb temp4, in3, in4, s_CcCc     // temp4 = z3 z4 z3 z4
shufb out1, temp1, temp2, s_ABcd  // out1 = x1 x2 x3 x4
shufb out2, temp2, temp1, s_cdAB  // out2 = y1 y2 y3 y4
shufb out3, temp3, temp4, s_ABcd  // out3 = z1 z2 z3 z4

(4 even/4 odd) Latency: 10 cycles

—————

selb temp1, in1, in3, m_0FF0       // temp1 = x1 y3 z3 ??
selb temp2, in2, in4, m_0FF0       // temp2 = x2 y4 z4 ??
shufb temp3, in3, in4, s_CcAa      // temp3 = z3 z4 x3 x4
shufb temp4, in1, in2, s_BbCc      // temp4 = y1 y2 z1 z2
shufb temp1, temp1, temp2, s_AaBb  // temp1 = x1 x2 y3 y4
selb out1, temp1, temp3, m_00FF    // out1 = x1 x2 x3 x4
selb out2, temp4, temp1, m_00FF    // out2 = y1 y2 y3 y4
shufb out3, temp4, temp3, s_CDab   // out3 = z1 z2 z3 z4

(8 even/3 odd) Latency: 14 cycles

—————-

shufb temp1, in3, in4, s_AcBa       // temp1 = x3 z4 y3 x4
selb temp2, in1, in2, m_0FF0        // temp2 = x1 y2 z2 ??
selb temp3, in1, in2, m_FF00        // temp3 = x2 y2 z1 ??
selb temp6, temp1, temp2, m_FF00    // temp6 = x1 y2 y3 x4
shufb temp7, temp1, temp3, s_caAB   // temp7 = z1 x2 x3 z4
selb temp4, in2, in1, m_FF00        // temp4 = x1 y1 z2 ??
selb temp5, in3, in4, m_FF00        // temp5 = x4 y4 z3 ??
shufb temp8, temp4, temp5, s_BCcb   // temp8 = y1 z2 z3 y4
selb out1, temp6, temp7, m_0FF0     // out1 = x1 x2 x3 x4
selb out2, temp8, temp6, m_0FF0     // out2 = y1 y2 y3 y4
selb out3, temp7, temp8, m_0FF0     // out3 = z1 z2 z3 z4

Which version would you want to use?  Hopefully by now you won’t automatically respond with “the one with the fewest instructions” or “the one that takes the least cycles.”  The correct answer is: it depends entirely on where you want to schedule it in!  So do the work, and check your pipeline analyzer.

Unrolling

Since everyone already knows what unrolling is, I thought it would be cool to see its slot-filling effect in action.  Below is a function that loads two float vectors, adds them, and stores out the result (the analyzer screenshots from the original post are omitted here, but the cycle counts survive in the text).

The add is done on cycle 10 and takes 6 cycles to complete, so we can’t do the write until cycle 16.  Since there is no other work we can do in between, we waste a ton of time.  However, maybe we can do two vector adds at once.  The second vector’s read can go between the first vector’s read and the first vector’s add.  The second vector’s add can go between the first vector’s add and the first vector’s store.  Let’s see if it works!
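In source form, the 2x unroll is just this kind of interleaving (a sketch of mine; the compiler does the actual slot-filling shown in the analyzer):

void add_two(const vec_float4* a, const vec_float4* b, vec_float4* out)
{
    const vec_float4 a0 = a[0];
    const vec_float4 b0 = b[0];
    const vec_float4 a1 = a[1];              // the second vector's loads slot in
    const vec_float4 b1 = b[1];              // before the first add
    const vec_float4 s0 = spu_add(a0, b0);
    const vec_float4 s1 = spu_add(a1, b1);   // and its add hides the 6-cycle
    out[0] = s0;                             // latency before the first store
    out[1] = s1;
}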

See what I did there?  The loop is still exactly 17 cycles, but now we are processing two vectors instead of one.  Should we press our luck and try 4x unrolling?  Big speedup, big speedup, no whammies!

That one wasn’t free.  Our loop is now 21 cycles, but we’re processing 4 vectors at a time.  OK, one last experiment.  8X UNROLL, GO!!1!

30 cycles for 8 vectors.  That’s less than 2x the cost of the original, but it processes 8x as much data.  Is it worth it?  Maybe; that’s for you to decide.  Mind you, I am trying to limit myself to unrolling only.  There are a number of wasteful things the compiler is doing in the above example that, if fixed, could bring the cost way down, closer to the level of the original example.

Obviously there are some things to beware of.  First of all, unrolling increases code size and therefore can be hell on your I-cache on certain platforms.  Another SPU-specific problem is that unrolling can place too much pressure on the local store: there is already so little memory to share between code, data, and job kernels that tripling your memory usage just isn’t feasible sometimes.  You also have to be wary of your register file size.  Do you have 16, 32, 64, or 128 registers?  Any performance benefit you get from unrolling is pretty much right out the door as soon as you start spilling to stack.  Finally, be aware that diminishing returns are in full effect.  Unroll too much and not only will things stop getting faster, they will really start getting much slower.

The SoA<->AoS Example I Should Have Included Last Time

I should have included this in the vectiquette article.  But I didn’t. So I will. Now.

A great, semi-problem-independent example of the benefits of SoA is the dot product.  For those who may not yet have discovered Khan Academy as a great way of sucking less at life, the dot product of two vectors returns a scalar and works like this:

V1 = {X, Y, Z, W}

V2 = {A, B, C, D}

R = V1 dot V2 = (X * A) + (Y * B) + (Z * C) + (W * D)

The first part is trivial.  All we have to do is multiply V1 and V2 and store the result in some register.  What happens next is the truly horrific part: we have to access each component of the result and add them together.  Think back to vectiquette and you’ll remember that individual component access is one of the most horribly mean and cruel things you can do to your poor, poor vectors.  Let’s see what this single dot product looks like on the SPU.
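The analyzer screenshot didn’t survive, so here is roughly the instruction sequence it showed, reconstructed from the description below (take the register names with a grain of salt):

fm       prod, v1, v2        // x*a | y*b | z*c | w*d
rotqbyi  rot1, prod, 4       // rotate left by 1 word
rotqbyi  rot2, prod, 8       // rotate left by 2 words
rotqbyi  rot3, prod, 12      // rotate left by 3 words
fa       sum, prod, rot1     // three dependent adds...
fa       sum, sum, rot2
fa       sum, sum, rot3      // ...leave the full dot product in every lane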

The fm (cycle 7) does the floating-point multiply, and that is followed by three rotates (cycle 13).  We rotate by 1, 2, and 3 words respectively so that the first element of each vector contains something we want to add.  Let’s say you have const vec_float4 jenny = {8.6, 7.5, 3.0, 9}.  The whole-vector rotations would look like this:

{8.6, 7.5, 3.0, 9} // rotated 0 words

{7.5, 3.0, 9, 8.6} // rotated 1 word (4 bytes)

{3.0, 9, 8.6, 7.5} // rotated 2 words (8 bytes)

{9, 8.6, 7.5, 3.0} // rotated 3 words (12 bytes)

That’s how we get the separate vector components to line up so we can add them.  We then do three adds (cycles 17, 23, 29) to sum the results, and then shuffle to splat the value across the vector.  Note that the shuffle is completely unneeded, since the rotate-and-add dance already leaves the correct value in each vector element.  Mind you, this is also wasteful because you are using a whole vector to store a single floating-point dot product.  If we were to rewrite this in SoA format, instead of having two vectors each containing X, Y, Z, W values, we would have 8 vectors where each contains only X values, only Y values, only Z values, or only W values.  It would look something like this:
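In intrinsics, the SoA version is just this (a sketch of the idea; x1 here means “the X values of the first four vectors,” and so on):

vec_float4 dot4(vec_float4 x1, vec_float4 y1, vec_float4 z1, vec_float4 w1,
                vec_float4 x2, vec_float4 y2, vec_float4 z2, vec_float4 w2)
{
    const vec_float4 xx  = spu_mul(x1, x2);        // no rotates anywhere,
    const vec_float4 xy  = spu_madd(y1, y2, xx);   // so fma chains freely
    const vec_float4 xyz = spu_madd(z1, z2, xy);
    return spu_madd(w1, w2, xyz);                  // four dot products at once
}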

There are a couple of cool things here.  First of all, since we didn’t have to rotate the results of the multiply, we were able to use fma, or floating-point multiply-add.  Second, notice that despite having to load more vectors, this version takes fewer cycles than the previous one.  That information is about to become all the more impressive when I tell you that we are now simultaneously doing four dot products.  Let that sink in.  Not only is the code itself faster to execute, but it does four dot products at once instead of one.  FTW!  This also means the output now contains 4 dot products instead of one dot product.

Now some of you will look at this and think… “but I have to do four dot products, I can’t do just one.”  To this I have two responses.  First of all, if your data is in SoA format, you’d better damn well have more than one thing you are processing at once.  Second, unless you’re Doing It Wrong, if you have one thing to process then you have many.  Allow me to clarify: if your game loop is set up right and you are properly processing things in batches, this won’t be a problem.

One last thing.  The code above kinda bothered me in that it seems to waste a lot of time.  As a fun exercise, I tried unrolling 2x and then using software pipelining to eliminate the dependencies on loads.  The results below show the 2x unrolled version (it processes 8 dot products at a time),

and here is the pseudo-software-pipelined version.  Cycles 17 through 34 represent the loop prologue (the initial loads) and 0 through 28 is the actual dot product loop.  The compiler is doing some inefficient stuff that could be remedied by going to asm, but you see my point.  Maybe even without going to asm, using the X-form loads could help the compiler get rid of all those add immediates.  Either way, the result is way faster than the original and processes 8x the data.

Coming Soon: _Real_ Software Pipelining

Originally I had planned to do a section on software pipelining, a subject very near and dear to my heart.  However, quickly including it here just didn’t do justice to such an awesome technique, so I decided to save it for my next post.  I’ll cover basic ideas, ways of doing it, and maybe even talk a little about pipelining assemblers like SPA and SPASM.  Please look forward to it!

“Take these words home and think ‘em through, or the next post I write might be about you.”

Mobb Deep (paraphrased)


Jaymin Kessler

Software Pipelining (failed video experiment)


Jaymin Kessler (Q-Games)

http://6cycles.maisonikkoku.com
@okonomiyonda

Before watching these, you might want to go back and review my pipeline post.  Or not. It’s up to you.

So, I consider this experiment a failure, but I’d definitely be willing to try again with a shorter topic.  Due to a bug in Keynote, I had to record all 24 minutes of the video at once instead of slide by slide.  As a result, I forgot to mention things, said some things that are slightly wrong, and may have screwed up in other ways, but I couldn’t go back and fix it.  Oh well, I hope you get some useful info out of this anyway.  Oh, and I had to break it up into three parts, since YouTube rejected all my previous attempts at uploading because the video was too “long.”  And the audio is messed up.  I’ll do better next time.

One last point.  I recorded these in glorious 1080p, so if you can’t read the text… crank up the res!

Intro to software pipelining: what it is, why it works, and why you might want to try it

Some algorithms and terms/concepts used in /real/ software pipelining.

I actually want to apologize for this one.  I ran out of time and couldn’t make any cool animations, so it’s very, very wordy.

An easy practical trick you can try at home.  Seriously.  Try it!

The Radical Optimizationist’s War on Abstraction and Patterns


Jaymin Kessler

@okonomiyonda

2011 April 1st

In this rant, I would like to address certain misconceptions regarding program performance on modern hardware.  Many programmers, not just amateurs but also professionals, do not properly understand the true responsibilities of the console game programmer.  This is mainly the fault of the so-called twitter gurus such as Mike “Acton like a damn fool” Acton, Tony “cow blood” Albrecht, and Steven “nonconversant” Tovey.  In this article, I would like to provide a counterbalance to their rhetoric, and explain why no programmer should ever be thinking about low-level hardware details.

Don Knuth, circa 1987

I want to begin with a quote from Donald Knuth, the father of theoretical computer science, that I believe we should all be familiar with.  Please open your book of Hoare to page 671, chapter 5, verse 11.  Was it not Donald Knuth, father of the C programming language, that said “optimization is evil, and therefore you shouldn’t do it”? Even more true today than it was back then.

Who are we to argue with Knuth, the man who discovered boolean logic?  Optimizer is one of those jobs on the way to extinction, like gas street lamp lighter, town crier, and Haskell programmer.  And when Donald Knuth, creator of the electrical force described in Ohm’s Law, tells you something is useless, you better believe it is!

Assembly, or as I like to call it, the dumbest most pointless language evar

“Peer review!”  – Christer Ericson

The radical optimizationist will tell you that a major tool for optimizing any piece of code is to rewrite it in assembly.  If anyone tells you that they got some speedup from rewriting code in assembly, they are telling you a story from 60 years ago, or they are genuinely stupid.  The very idea of believing that you can make something run faster by writing it in assembly is rooted in ignorance.  People who make such audacious statements probably aren’t aware that since the mid 1940s, compilers have been outputting assembly from your C++ code.  That’s right.  Converting your C++ to assembly is exactly like replacing your integer divisions with shifts: the compiler is already doing it for you, and it will result in nothing more than you wasting your precious time.

Also, in this day and age of multicore CPUs and giant PS3 clusters, you really don’t get much of a speedup from micro-optimizations anymore.  The real speedup comes from using parallel resources efficiently.  I would argue that assembly is a woefully inadequate language for an increasingly parallel world, and here is why.  In order to do any kind of parallel programming, you need library support.  This is a fact.  Not only does assembly not have any mutex and semaphore support in its standard library, but it also doesn’t even have any job manager API.  I’m sorry, but that right there kills any illusion of usefulness the language may have once had.

Furthermore, instruction scheduling is a hard problem.  In fact, finding an optimal schedule isn’t just hard, it’s NP-Complete.  If you have never heard that term before, NP-Complete is mathematician shorthand for No Person but a Complete idiot would try to solve this.  Think about it.  Grab a pen and paper and start writing some code.  How long would it take you to write out a whole program?  A minute?  5 minutes?  If there is one thing computers are good at (aside from responding to commands like ZOOM IN and ENHANCE RESOLUTION), it’s trying lots of things really fast.  A compiler can loop through every program in existence until it finds one that matches your C++.  You can’t compete with that, so why not just let the compiler do its job?

So what about the Fox Mulders of the computer world, with their paranoid conspiracy theories about compilers being untrustable?  Maybe when they were young they had a fortran compiler they were close with inappropriately touch their “special” file, and they now are traumatized.  Continuing with the X-Files analogy I’d say that Acton, Albrecht, and Tovey are more like the Lone Gunmen, where Acton is the guy with the blonde hair and glasses that everyone liked, Tovey is the guy with the short hair and beard, and I can only assume Albrecht looks like that third guy because no one has ever seen a photo of him.

But my point is this.  Compilers are expert systems, and as such always do the right thing.  In fact, compilers are so good at what they do that the term “compiling” was named after them!!  But really, I wouldn’t expect you to believe any of this without proof.  Only the radical optimizationist forms half-assed opinions without first verifying them to be true.  Take for example the following code:

int RealWorldFuncExample(void)
{
    return 5;
}

int main(void)
{
    int complex_deep_nested_expression =
        RealWorldFuncExample();
}

Lets take a look at what the compiler does with this by LOOKING AT THE ASSEMBLY, something a programmer should never ever have to do!

il r4, 5

Admittedly, I have no idea what that means, but its ONE SINGLE LINE OF CODE!  This sentiment that compilers can’t be trusted is obviously a carryover from the dark ages when compilers didn’t optimize things for you.  Provably the compiler always Does The Right Thing.  Actually, speaking of inlining, this brings up another good point I wanted to mention.  Did you ever notice how C# has no inline keyword? That is because the incredibly forward thinking engineers at Microsoft (a company headed straight to the top if there ever was one) realized that compilers are way better at deciding what should be inlined than a programmer.  Come on, look what happened in C! They let the “C-partiers” decide what to inline, and being of questionable intelligence, everyone ended up inlining everything.  Do you really want the power to decide what gets inlined in the hands of idiot programmers like yourself?  No, only Donald X. Knuth, shaolin grandmaster and Mortal Kombat thunder god, should have that power.

Move Forward, Not Back

As a society, we tend not to invent things that are worse than things we already have.  Think about it.  We invented artificial flavors because they are better than natural flavors.  If natural flavors were better, then we would never need to invent new chemicals.  Its the same thing with languages.  We invented C++ because it was superior to assembly.  If assembly was better, by definition C++ wouldn’t exist, and C++ most certainly exists.  We have all kinds of new algorithms and fast computers with lots of memory that make C++ practical now, and there really is no reason to go back to the dark days of register allocation (unless you mean allocating callbacks to register with your awesome new callback design pattern!)

Object Oriented Programming, or OOP, represents a new paradigm that revolutionized computer programming to the extreme.  Its main benefit was the ability to allow programmers to think about their program using analogies that made sense to humans, instead of thinking about boring hardware details which always just change all the time anyway.  The same strategy of forcing the human way of doing things on machines was deployed by computer vision researchers with phenomenal success.  It turns out that taking something non-human like a computer and trying to force it to see in the same way humans see was the best possible way for computers to make sense of the world around them.

Now some people will whine and moan about the performance effects of OOP, but I don’t see it that way at all.  When you change the way you write code for some specific piece of hardware, all you are really doing is rewarding the incompetence of hardware designers.  Do you really want some ivy league fat cat hardware designer in his cushy tokyo skyscraper corner office dictating to you what makes your code good or bad? Hells no, you don’t!  Its not our job to tailor our code to their hardware, but rather it is they who must tailor their hardware to our code!  We didn’t land on the PPU, the PPU landed on us!

Its time we as programmers stood up for the abstractions that set us free and make our code so wonderfully unreadable.  If we wanted to think about machine architecture, we would have become electrical engineers.  No, we must not let the IBM nanny state tell us what we should and shouldn’t do.  In the end, there must only be freeeeeeeedom!!1!

In Conclusion

If you should ever find yourself thinking you want to be a low level programmer, try this.  The next time your mom asks what it is you do for a living, try telling her “I’m just a low level programmer.” I’m willing to bet she responds with something like “Oh no, hunny, thats not true. I’m sure lots of people think your work is important.”  See, even your mom knows low level is a bad thing!

Finally, I wanted to provide an example of just the kind of idiotic trickery that no programmer should be engaging in.  The best example I could find was this pointless garbage

Rate Me, My Friend


Jaymin Kessler

Q-Games

http://6cycles.maisonikkoku.com

It’s officially review season! That wonderful time of year when managers and bosses have programmers and artists jumping through hoops to prove they are worth that $20,000 bonus in the best case, and not worthy of firing in the worst case.  Some reviews are fun, some torturous, some boring, but everyone seems to have a different way of doing it.

 

I used to live in Florida and work for a company specializing in… well let’s just say particular arts that involve electronics.  I won’t mention the specific company for fear of revealing any secret info, but I can tell you that the games I worked on included Maddening and Lion Forest GPA 2009.

 

The PACE review process bordered on absurd.  Its been three years since I worked there, and I may be a little fuzzy on the details so hopefully Szymon Swistun, Ben Carter, or Jon Sterman can jump in and correct me if I screw up.  There were three (that I remember) components to the review process.

 

First of all, you needed to find coworkers to fill out evaluation forms on your behalf.  This part of the process was a bit of a joke because people always chose their friends who would of course give them a glowing review.  If you didn’t have any friends or none of your coworkers liked you, you were officially screwed.  I understand there is some value in being notorious among your coworkers, but when 400 people each collect 3 – 5 peer reviews, and every single one starts with the sentence “X is irreplaceable and an asset to the company, capable of such amazing feats of engineering”, then what’s the point?  Luckily for me, I had a crew (Mike, Jason, Ser-geon, and Arturo) and we helped each other out, but others weren’t so lucky.

 

Then there was the job matrix.  This was the matrix that made The Matrix 2 and 3 look sane in comparison.  This was a chart where columns were the specific job and rows were things you should be doing.  For example, to be an SSE1, you have to create some kind of tool, and then force an entire game team or studio to use it.  Of course this meant that every aspiring SSE1 out there was writing tools and packages that do things existing tools and packages already did, and then tried to force entire teams to use it.  Some of the job criteria were even more ridiculous than that.  Pretty early on, I realized that stuff in the job matrix was complete and utter crap, and that I didn’t have any interest in doing the stuff that an SSE1 does.  Don’t get me wrong, I wanted to become an SSE1, but the stupid crap in the job matrix actually demotivated me from trying for real.

 

Finally, the ACTION values.  You can tell from the fact that ACTION is an acronym that it is some new-age hippy manager crap.

 

The A.C.T.I.O.N. Values are:

ACHIEVEMENT

. Meritocracy

. Individual Accountability

. Reward for Success

CUSTOMER SATISFACTION

Everyone has a “customer,” whether it’s the customer who buys the games, a supplier, or a co-worker in another EA department. We find out what our customers want and use their input to measure our performance.

. Identify Key Customers

. Build Relationships

. Get Feedback

TEAMWORK

. Play Our Positions

. Execute Our Key Assignments

. Communicate

. Offer Enthusiasm and Support

. Think EA World — See the BIG picture

INTEGRITY

. Openness and Honesty

. Keeping Commitments

. Equality: We are a one class society

OWNERSHIP

. Responsibility

. Innovative/Work Smart

. Manage Our Own and Other People’s Time Effectively

. Self-Expression/Good Citizenship

. Express Our Views

NOW

. Urgency — Do It Now!

. Priorities — “Live with the Hot Ones!”

. Be the Values, Make the Culture Real

 

Sorry, I just need a minute to stop laughing.  “Live with the hot ones” always makes me lose it.  Deep breath… and… Anyway, they provided this list of ACTION values, along with a thick booklet containing examples of what “proper” examples of each value look like.  So we would have to look at the ACTION values list, look up what good examples of each value look like in the booklet, and then copy/modify them to contain our names.  I like to tell myself that Insomniac has something like the ACTON values, where ACTON stands for

 

Asynchronous

Code is less important than data

Three big lies

OOP solves nothing

Nothing taken for granted

 

but I don’t think they would do anything that stupid.  Be advised I have similar acronyms for other programmers I talk with on Twitter.

 

I also seem to remember there being surveys to fill out, but maybe they were part of some career direction initiative.  The first question was always “what is your goal for this year” to which I replied “become an SSE1”.  Next question was “how will you know you achieved this goal” to which I replied “I’ll be an SSE1”.  I never really took them seriously but then again neither did a lot of people I worked with.

 

I’m not saying that I know of a better process.  I realize EA is a huge company where not everyone knows everyone else well, but still.  To further complicate matters, as soon as the review process was finished and the few available promotions were given out, emails were sent out to the whole company.  Everyone knew who got promoted and who didn’t, which instantaneously unleashed a wave of angry programmers upon the offices of their managers demanding to know why the incompetent Mr. X got a promotion while they did not.

 

Currently I am working at Q-Games which has a far more sane process.  This year we had to fill out a short 10 question survey with questions like “what is the coolest thing you did all year” and “what are you doing to improve and get better as a game maker”.  Of course Q is a smaller company where it’s easier to know how your coworkers are doing, and we have a flatter structure (no SE0, SE1, SE2, SE3, SSE1, SSE2, SSE3, TD1, TD2, TD3, etc) so fretting over promotions and the jealousy that goes with it is pretty much nonexistent.

 

So, that brings me to the point of my post.  My parents are visiting me in Kyoto and I didn’t have time to write another post thats up to the <sarcasm> high standards of technical excellence </sarcasm> of my previous posts.  Therefore, I figured it would be fun to allow readers to post their evaluation processes, the good, the bad, and the f–king insane, for the rest of us to enjoy.  So, go for it.  Reviewed people of the world, you are among friends.  Managers GTFO.

 

simple SPU ray length counting trick that everyone probably already knew


This is a very early variation of part of the PixelJunk Shooter 2 lighting system. It can process up to 16 pixels at a time, and you can easily unroll to do many rays at once.  The version shown does 4 pixels at a time for simplicity.

 

Consider this a crappy port of an old blog entry to video. It almost feels like a badly narrated audiobook read by a voice actor that only kinda knows what he is saying. Nothing new or interesting, and with no useful explanatory animations to simplify things.  Enjoy ;)

 


Initial D(ebugger) 3rd Stage


Humans are natural pattern matching machines.  Evolution of this ability enabled us to better survive on the plains of Africa, and lack of it is what makes non-primates suck so hard at sudoku.  Of course our drive to look for patterns even where there are none also has its dark side.  It enables the deluded beliefs of the religious, got me addicted to Rubicon before the show was cancelled, adds fuel to the fire of conspiracy theories, and has caused me to see patterns in the way people use debuggers that may or may not actually be there.

 

“Thats a nice score for a girl”

debugger usage: stage 1

 

Almost without exception, all the beginner programmers I’ve known universally have no interest in debuggers.  After talking to many of them, it turns out they all had a few complaints in common.  Because UNIX is what is taught in college, people’s first introduction to the debugger tends to be GDB, which can be a little overwhelming for people not used to it.  Combine this with the fact that the programs people tend to make towards the beginning are easily debugged by sprinkling printfs around, and there isn’t much motivation to bother learning how to use a debugger.  Finally, and this is particularly shocking, they were never taught about debuggers in school and just didn’t understand what they do or why they are useful.

 

Why it’s dangerous: The primary danger is that while you can debug very simple programs using printf, you’re being incredibly shortsighted and not learning a skill that will become absolutely essential once you start working on larger games and more complicated problems.  Also, adding printfs as you discover problems forces you to recompile, makes your program run slow, and just plain doesn’t work for many problems.  Finally, printf is buggy or unavailable on some systems, and its a really annoying way to debug multithreaded programs.

 

Advice: do yourself a favor and learn to use a debugger.  Any one will do, you just have to learn the basic concepts and how to use it to track down various types of issues.  As an additional bonus, since many schools don’t teach debugger usage, knowing how to use one is a huge plus when applying for a job.

debugger usage: stage 2

 

So the student finally gets around to figuring debuggers out and they get a taste of the power that comes with them.  Sure your program /looked/ like it was functioning correctly before but now you can know for sure! You can set breakpoints, step over every line, into every function, and check the value of each result.  You can set hardware breakpoints to find memory stomps, view the data in regions of memory, check asm and registers to follow optimized code, and all kinds of other great things.  With all that information available, the stage 2 debugger user never wants to run unattached ever again.  It gives them a very nice sense of security that eventually becomes a crutch.

 

Why it’s dangerous: With some people, this debugger dependence can cause them to become lazy with their logic, and a bit careless.  While before, you had to really think about what you were doing because you couldn’t check every result, now you can write any old code that kinda looks correct-ish, single step until something goes wrong, edit, recompile, and debug again.  Putting aside the obvious danger of believing that haphazard logic is correct just because it worked that one time you stepped into your function, this is a huge time drain.  It takes a tremendous amount of time to copy your elf to the target, start up the debugger, reset the target, load back into game, get to the point where you hit your breakpoint, and check all the results.  It’s even slower if you don’t know how to debug optimized builds and therefore are forced to always run in debug.  Some problems either can’t be debugged with a debugger, only occur when you run unattached, or maybe there isn’t even a debugger yet for that new platform you’re working on.  When faced with issues like these, many stage 2 debugger users tend to throw up their hands and give up.

 

Advice: Some problems require debuggers, but if you are firing up the debugger to find some problem in gameplay logic that you could have easily figured out by eyeballing the code, you’re Doing It Wrong.  Also, don’t depend on debuggers too heavily because there will come a day where you won’t be able to use one to get you out of that mess you’re in.

debugger usage: stage 3 (logic renaissance)

 

Confession: I have spent most of my life as stage 2 user, and debugger addict.  I didn’t use it as a substitute for carefully writing code, but I pretty much never ran unattached.  I could debug anything, even final builds, so there was no reason for me to ever run unattached.  I didn’t even know there were people that ran unattached until I came to Q-Games.  I noticed my boss never ran in the debugger so one day I asked him about it.

 

me: “why are you not running attached”

boss: “because it’s slow”

me: “but how can you be sure it works?!”

boss: “I look at the screen and if it looks right, it works”

me: *blinks Zorak-style*

 

Now mind you my boss is a graphics guy so he really can tell if something is working by looking at the screen, but that didn’t stop me from calling him insane.   It took me a year or so but eventually I figured out he was right.  A good majority of problems can be found (or avoided in the first place) just by using your brain instead of a debugger.  Stage 3 debugging is really all about knowing when to reason something out and when to turn to the debugger.  Sometimes, even problems the debugger was designed to find (like DMA errors) are more easily and quickly found just by taking what you know about the problem and working backwards.

 

Why it’s dangerous: there is nothing particularly dangerous about relying on your own powers of reason, or only using a tool when its needed, but don’t let it swing too far in the opposite direction.  I know a few people who refuse to fire up the debugger because they think they should be able to figure out any problem without it.  Ego is the main danger here.

 

Advice: there are a lot of stage 1 and 2 n00bs out there that could really benefit from your experience.  Take the time to show them a few tricks and help the generation that grew up without the Spectrum, Amiga, Micro, and C64 suck just a little bit less.

 

Seven Year Review


Jaymin Kessler

Q-Games

 

 

if( strcmp(your_name, "Jaymin Kessler") )
{
    exit(0);
}

 

 

Seriously, I’m writing this one for me and if you read it, I won’t be responsible for the 10 minutes of your life that you can never get back.  You know that thing they do on some gaming news sites where instead of posting an interesting game-related article, some guy writes some other guy a letter about how he had to go out and buy a randoseru (school backpack) that day because the school year was starting? Yeah, well its kinda like that but the guy writing the letter is me and the intended recipient is me 7 years from now.  Again, last warning, unless you want to read a summary of the life of some guy you never met and really don’t care about, click on over to Wolfgang Engel’s excellent article (http://altdevblogaday.com/2011/06/13/screen-space-rules-for-designing-graphics-sub-systems-part-i/)

 

Side note: I am also writing this because Mike Acton challenged the AltDevBlogADay authors to try writing about what we suck at.  Sometimes I suck at life.  Not sure if this is what he meant…

 

Background, or more specifically, ZZZZZZZZZZZZZZZZ

kinda looked like this, but different

I started programming in Kindergarten or 1st grade when my mom bought me a Commodore 64 and made me sit with her and type in programs out of the manual.  My first program ever was the hot air balloon sprite animation, and I remember it vividly for two reasons:  First because it was the exact moment that I completely and totally fell in love with programming, and second because its the last time I ever wrote any graphics code until 2009.  Shortly after, my mom became quite aware that she had probably done a bad thing when I stopped hanging out outside with friends (or having friends), and so she sent me to Camp Ramaquois for nature re-education.  Luckily for me, Ramaquois had a computer lab in which I spent precisely 100% of my time teaching the other kids C64 BASIC. (side note, my mom _wished_ I was sitting back at home programming in the basement once I got into phone phreaking)

 

In 5th grade a classmate called Monsoor Zaidi and his cousin Ursil Kidwai mentioned to me that there was a language called C that magically allowed you to just make up your own commands and the computer would run them.  I’m still to this day not 100% sure what they were talking about but right around then I became a C programmer.  Assembly, PASCAL, and Lisp followed.

 

College, or more specifically the first and last time in my life I thought I was talented

uhhh yeah, I was /that/ guy in college

True story: I actually went to film school for 2 years because I wasn’t aware that programming could be a job that people did professionally.  Undergrad was a weird time for me.  Many of the people I went to school with just started programming when they got to school, which I found quite unbelievable.  Of course working with people so new to programming, I started to wonder if maybe I was pretty good at this programming stuff.  I had been doing assembly for years and kids in my class were really struggling with it, so maybe that was somehow indicative of natural talent (spoiler alert: no, no it wasn’t!)

 

I did my masters degree in AI because I was so cyberpunk and thats what people who read William Gibson books and want to live in computers do.  You know who else goes to school for AI? People who design optical systems for recognizing flaws in chicken eggs.  In other words, it was fun but not exactly what I imagined.  Grad school is a world where people spend very little time writing code and a lot of time writing reports and papers.  F-that.

 

Skip here for crap about working in the industry

both a screenshot from the first game I worked on, and an approximation of what I used to look like

Jon Sterman, an old friend from 2600 was working at Hypnotix when I was graduating, and got me hired by recommending me.  I had some PS2 Linux VU1 assembly demos but nothing too too insane.  I remember 2 things about my first day of work (2004 June 1st @ 10:14AM).  First of all, Andrew Grabish (programmer and dress wearer) picked up on my excitement and enthusiasm and told me that in 7 years I wouldn’t be so damn excited and enthusiastic.  Those of you who follow me on twitter may have noticed my tweet from 2011 June 1st @ 10:14AM when I finally after all these years got my revenge and got to tell Andrew how I am even more enthusiastic and passionate than ever, and that he was wrong.

 

The other thing I remember about my first day is that its the first time in my life that I realized just how much I sucked.  Exchange with lead programmer is as follows:

 

Josh: do you… enjoy… working in the game industry?

me: oh yes, of course I do!

Josh: do you intend to… oh, I don’t know… /stay/ in the game industry?

 

To this day, I still get really excited on the 1st and depressed around E3, since I sit on my couch watching the coverage thinking how jealous I’ll be of all those game makers when I am forcibly ejected from the industry.  So I had a choice: get better or continue sucking.  I thought I chose the former but I’m sure it looked like I went with the latter.  Maybe I sucked too much to know how to suck less?  It sure didn’t help that I often went into panic loops where I would try something that didn’t work, start to panic, as a result try the same thing again which didn’t work again, and that made me panic even more.

 

In 2005 EA Tiburon bought Hypnotix and we were all off to the magical world of Orlando.  Tom Kirchner had me convinced that everyone at EA, even their herbs, were powerful wizards capable of amazing feats of engineering that normal mortals could never comprehend.  It was right about then that I decided it was time to work my ass off, and really step up my game.  I probably haven’t rested since.  Never underestimate the power of someone with minimal talent, an inferiority complex with something to prove, and an insane obsession with programming and learning new stuff.

you never really fully get over your cyberpunk phase

In 2008 I applied at Q-Games.  I was a fan of Monsters, but that had nothing to do with it.  I learned at EA that I could have fun working on stuff that I didn’t play if I was learning new things.  My primary reasons for wanting to go to Q were

 

  1. Wanted to work with the guy who did the PS2 duck in a bath demo
  2. Wanted to work on PS3 OS projects like the XMB and gaia earth visualizer
  3. Wanted to live in Japan
  4. Wanted to avoid writing cross platform code like the plague that it is

 

I remember thinking how fun the programming test was, and taking days off from work to work on it.  I really worked my ass off, trying to find different approaches to the problems.  In the end, they were nice enough to hire me, but 3 years later I checked my programming test again and there is no way I would have hired me.  I hope that means I suck a little less than I did 3 years ago.  Q gives me a lot of interesting problems to work on and affords me a lot of time to do free research, so for someone like me whose only goal in the world is to learn new stuff and improve, its an ideal place for me.

 

Factsheet for later comparison, aka time capsule of embarrassment

To future me, I just want to talk about where I am currently at, because its probably amusing how much my perspective will change in 7 years.  Also, I am really looking forward to looking back at how ridiculously little I knew back in 2011

 

  1. making games is nice and all, but I am in it for the interesting hardware and the chance to work on really cool problems.  Luckily I can do that in such a way as to maintain the illusion of usefulness to a game company :) If I ever get kicked out of the industry, I am becoming an embedded programmer
  2. Because for me, most of the fun of programming is finding interesting uses for weird instructions, I tend to focus a little too much on the low level.  I’m starting to get better  about that and improve my lateral thinking, but I still tend to immediately think low level
  3. I used to suck at math, but I have now made my way through the khan academy series on Calculus and Differential Equations.  I still know nothing about Linear Algebra (you know, the math thats actually useful at work) but I just can’t get into it
  4. I am physically unable to do stuff in a non-optimal way if I can think of a better way, and when I start obsessing over a problem I forget to eat, can’t sleep, and can’t think of anything else.  Its useful but it can also be a little destructive.
  5. three things that would mean the end of my life: having kids, becoming a manager-type, not being able to code anymore
  6. I am still strongly atheist, and this is unlikely to change because I am not an idiot.
  7. I’m not saying that good data design and optimal code means having to give up readability or ease of understanding, but if I had to choose then nothing is more important than performance.  Thats one of the reasons I’m not a C# fan.  You sacrifice too much performance for a little syntactic sugar.
  8. the main purpose of programming is not really to actually make stuff, but rather to improve as a programmer, learn new things, and challenge myself
  9. I spoke at GDC (which I didn’t deserve), and I’ll be speaking at SIGGRAPH (which I really don’t deserve) and through that I have learned that I really suck at, and am really really terrified of public speaking.  Although anything that helps me get permanent residency (永住権) is good, right?
  10. Embarrassment provides a strong motivation to improve, and so here I am saying embarrassing stuff about lack of talent
  11. Finally, related to point 8, the most important thing in life is being keenly aware of your weaknesses and the things you suck at, and being able to take concrete steps to obliterate them.  There is no shame in not understanding something or not being good at it, as long as you have a plan to remedy the situation

 

So thats where I am at in 2011 with a mere 7 years experience.  Seven years *sounds* like a long time to be doing something, but compared to some of the people I follow on twitter 7 years puts me squarely in n00b territory.  Honestly, I should probably be better than I am by now, but I’m sure I’d say the exact same thing no matter what my current skill level.  Anyway, I promise I won’t do another one of these until 2018, by which point I will possibly have learned to write coherently, but probably not.

 

PixelJunk Shooter 2 lighting : My one (so far) regret


Disclaimer: I am not on the PixelJunk team, so if you liked the game I probably had zero to do with it! Conversely, if there were parts you disliked, I’m probably not the person to talk to :)

By now, everyone on the planet should have played PixelJunk Shooter 2, but in case the PSN outage stopped you from downloading it, it looks something like this:

One of the more interesting things in the sequel was the introduction of dark stages, something I think added an interesting new dimension to the game. I won’t do an in-depth description of how the SPU lighting in those stages works, but basically we calculate a real-time dynamic distance field that is also used for fluid particle collisions, and use that to get the length of light rays for occlusion and shadow casting. The lighting itself consists of three stages: get ray length, rasterize out, merge with other lights. The second stage was by far the slowest due to inefficient memory accesses, but I will save my ideas for that for another day. Its the first stage I want to talk about today, but first we need some background.

Distance Field: it slices, it dices, it makes thousands of julienne fries!

Distance fields are one of those things that, to me at least, seem to have endless interesting uses. They are related to the skeleton transform, which I believe is the process in which girls become models. Lets start off with a simple 2D world, in which there are two objects: an oval and a square. You start with a “binary” image (that doesn’t have to be binary at all) where certain pixel values denote objects and others free space. The end result of the distance transform should be another image where each pixel’s value gives the distance to the closest object. As you get closer and closer to an object, the distance gets smaller and smaller, but what happens inside an object? Well, that depends. In an unsigned distance field, pixels that represent an object tend to be all zero, since the distance to the closest object (itself) is zero. In a signed distance transform, the distances become zero at object edges, and then go negative as you move towards the center of an object. Its actually quite useful, for example, when you want to know how far an object penetrates a wall and figure out how much you need to back up.

There are many methods used to calculate them on CPUs, GPUs, and some mythical fake-sounding possibly theoretical machines like SIMD Hypercube Computers (Parallel Computation of the 3-D Euclidean Distance Transform on the SIMD Hypercube Computer), LARPBS (Fast and Scalable Algorithms for the Euclidean Distance Transform on the LARPBS), and EREW PRAM (Parallel Computation of the Euclidean Distance Transform on a Three-Dimensional Image Array). GPU algorithms tend to be multi-pass and slow to converge, while CPU algorithms tend to be very parallel-unfriendly, and the algorithms that are parallel tend to be for the weird architectures mentioned above.

I’ll now briefly go over the Chamfer signed distance transform. For a handy reference, be sure to check out the excellent “The Dead Reckoning Signed Distance Transform” by George J. Grevera. First (obviously) is initialization. Go through every pixel in your texture and if that pixel is inside an object, give it a value of -1.f, otherwise give it a value of FLT_MAX. Then pass over one more time looking for object edges, and assign them values of zero. The second pass is a forward sweep from left to right and top to bottom. You have a sliding window that looks something like this


√2 1 √2
1 C -
- - -

where C is the center of the window (and the pixel whose value we want to fill in). So for each pixel in the surrounding 8 neighbors, we take its distance value and add the corresponding offset in the window (1 for the pixels directly above and to the left, √2 for the pixels to the upper right and upper left, skip pixels marked with a -). Out of those 4 values (along with C’s current value), find the min and make that the distance for C. You can see we are starting with known distances to objects and then propagating. The third pass is almost identical, except we start at the bottom right corner and go right to left, bottom to top. This time the window looks something like this


- - -
- C 1
√2 1 √2

Thats it. By now you’ll have a lovely approximation of the distance from any pixel in your map to the closest object. If you check out Grevera’s paper you can see the results from experimenting with different window sizes and distance offsets, and read about dead reckoning which is useful for keeping track of real distances.
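
For the code-minded, the two sweeps boil down to something like the following.  To be clear, this is my own minimal sketch of the unsigned case only (the grid size, the AT() macro, and the weights are stand-in names, and the signed/interior handling from the text is left out), not Grevera’s code and not anything that shipped:

#include <algorithm>
#include <cfloat>

// minimal sketch of the two Chamfer sweeps described above, for the
// unsigned case: free space starts at FLT_MAX, object edges at 0.0f.
// border pixels are skipped for brevity.
const int W = 256, H = 256;
#define AT(img, x, y) img[(y) * W + (x)]

void chamfer_sweeps(float *dist)
{
    const float ORTH = 1.0f, DIAG = 1.41421356f;

    // forward sweep: left to right, top to bottom, upper-left window
    for (int y = 1; y < H; ++y)
        for (int x = 1; x < W - 1; ++x)
        {
            float d = AT(dist, x, y);
            d = std::min(d, AT(dist, x - 1, y    ) + ORTH);
            d = std::min(d, AT(dist, x - 1, y - 1) + DIAG);
            d = std::min(d, AT(dist, x,     y - 1) + ORTH);
            d = std::min(d, AT(dist, x + 1, y - 1) + DIAG);
            AT(dist, x, y) = d;
        }

    // backward sweep: right to left, bottom to top, mirrored window
    for (int y = H - 2; y >= 0; --y)
        for (int x = W - 2; x >= 1; --x)
        {
            float d = AT(dist, x, y);
            d = std::min(d, AT(dist, x + 1, y    ) + ORTH);
            d = std::min(d, AT(dist, x + 1, y + 1) + DIAG);
            d = std::min(d, AT(dist, x,     y + 1) + ORTH);
            d = std::min(d, AT(dist, x - 1, y + 1) + DIAG);
            AT(dist, x, y) = d;
        }
}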

that one regret

One fine day, Jerome (my boss) sent me a copy of a PDF he thought I’d be interested in. It was called “Rendering Worlds with Two Triangles with raytracing on the GPU in 4096 bytes” by Inigo Quilez (http://www.iquilezles.org/www/material/nvscene2008/rwwtt.pdf). Its the paper that introduced me to raymarching, and kicked off my obsession with procedurally generated on the fly distance fields. The really obvious thing he mentions is that for any point in the distance field, its “guaranteed” that there won’t be any objects closer than the distance field value at that point. So if you’re marching along a ray, you no longer have to check every single pixel for occluders, but rather can just jump ahead by the distance to the closest object. It would have been absolutely perfect for the first pass of the Shooter 2 lighting system… if only I had actually used it! The only drawback is when you have a ray running parallel to a close by wall. Because the closest object is always right next to you, you can’t jump ahead so far.
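
The marching loop itself is tiny.  Here is a minimal 2D sketch of the idea, where sample_distance, the 64-step cap, and the half-pixel epsilon are all stand-ins of mine, not Shooter 2 code and not Inigo’s:

// a minimal 2D ray-march sketch: jump by whatever the field says is safe
float sample_distance(float x, float y);  // reads the distance field

float march_ray(float ox, float oy,       // ray origin
                float dx, float dy,       // unit-length direction
                float max_len)
{
    float t = 0.0f;
    for (int i = 0; i < 64 && t < max_len; ++i)
    {
        float d = sample_distance(ox + dx * t, oy + dy * t);
        if (d < 0.5f)
            return t;   // close enough to an occluder: ray ends here
        t += d;         // guaranteed safe: nothing is closer than d
    }
    return max_len;     // ray escaped (or we hit the step cap)
}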

The approach I took in Shooter 2 was slightly more… um… low level. I decided to calculate light ray length by loading between 4 and 16 pixels at a time into a vec_uchar16, and then in parallel check for the earliest occluder giving me the total ray length (see http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/4/17_n00b_tip__Psychic_computing.html). Of course I was too busy unrolling and software pipelining and microoptimizing to care about the insane cost of loading sparse pixels along a ray and packing them into a vector. Actually, thats not entirely accurate. I put a lot of work into coming up with an offline-generated table of ordered indices that would minimize the redundant loads and packing, but the overall cost of the first stage was still dominated by inefficient (some would say unnecessary and avoidable) data transforms. (note: I experimented with ways to get around this like http://6cycles.maisonikkoku.com/6Cycles/6cycles/Entries/2010/11/27_n00b_tip__Combining_writes.html but none ended up shipping with Shooter 2)
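
For the curious, the parallel earliest-occluder check might look very roughly like this.  I want to stress this is a guess at the shape of the trick, not the shipped code: the threshold and the pixel-0-in-byte-0 packing convention are assumptions on my part:

#include <spu_intrinsics.h>

// rough sketch: given 16 ray pixels packed into a vec_uchar16, return
// the index of the first occluder along the ray, or 16 if there is none
int first_occluder(vec_uchar16 pixels)
{
    const vec_uchar16 OCCLUDER = spu_splats((unsigned char)127);

    // 0xFF in every byte whose pixel is above the threshold, 0x00 elsewhere
    vec_uchar16 hits = spu_cmpgt(pixels, OCCLUDER);

    // gather one bit per byte into bits 15..0 of the preferred word
    // (byte 0 lands in bit 15), then count leading zeros: 16 leading
    // zeros means byte 0 hit, 32 means nothing hit at all
    unsigned int lz = spu_extract(spu_cntlz(spu_gather(hits)), 0);
    return (int)lz - 16;   // 0..15 = first occluder index, 16 = none
}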


testing light occlusion against the oval and box defined by the distance field

So, as a joke I decided to hack together a particle free Shooter 2-like lighting demo on a platform far less powerful than the PS3 and the results were pretty amazing. Not only was I able to get a large number of lights going, but I was also able to add reflection and refraction, something I must admit would have looked insanely sexy with the Shooter fluid system :)

There is no such thing in life as normal

Even if you’re a Johnny Marr fan, you have to admit Morrissey has a point. The geometry for the objects used to define my distance field doesn’t exist, and there are times I want the normals. For example, when doing the reflection and refraction mentioned above. I thought back to basic calculus and remembered how to calculate gradients

http://www.khanacademy.org/video/gradient-1?playlist=Calculus

http://www.khanacademy.org/video/gradient-of-a-scalar-field?playlist=Calculus

Testing my newfound normals, I found something disturbing. When bouncing off the oval, there were certain points when the reflected ray would totally go nuts (see below where bouncing off two very close points gives two different results).

Interesting. I tried rendering some of the normals and suddenly the problem became clear

OK. So the distance field itself is a low resolution noisy approximation of the true distance, and calculating the normals is an approximation from the distance field, so I’d expect it to be crap but we should be able to do better. I researched all kinds of interesting ways of improving the normals, things like edge curve fitting and bilinear filtering, but in the end I was able to get close enough but still maintain acceptable performance by a combination of blurring the distance field values and increasing the distance between the current pixel and the samples used to get the gradient. Below are some things I tried and the results
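
The gradient itself is just central differences.  A hedged sketch (W and the step parameter are my names; step = 2 corresponds to the increased gradient distance mentioned above):

#include <cmath>

// pull a normal out of the distance field with central differences;
// wider footprints (step = 2) smooth out the noise
const int W = 256;   // field width, same convention as the sweep sketch

void field_normal(const float *dist, int x, int y, int step,
                  float &nx, float &ny)
{
    float gx = dist[y * W + (x + step)] - dist[y * W + (x - step)];
    float gy = dist[(y + step) * W + x] - dist[(y - step) * W + x];

    float len = std::sqrt(gx * gx + gy * gy);
    if (len > 0.0f) { gx /= len; gy /= len; }

    nx = gx;   // unit gradient: points away from the nearest surface
    ny = gy;
}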


averaging the normals


averaging the normals and increasing the gradient distance from 1 to 2

Additional unrelated topic: moving raymarching into 3D

One last thing. Ray marching is an insanely cool technique that has uses in dimensions other than 2. It can also be used to do stuff in 3D! Since I’m not a graphics programmer and I suck at making stuff look good, I won’t waste too too much time talking about the cute little guy I was able to make

It took me about 15 minutes to get that first little demo up and running. I’m still experimenting with procedural on the fly distance fields, and I might post again after a bit more math research. By the way, here is what it looks like when someone who knows what they are doing uses raymarching

When you can’t SPASM, TREMBLE


If you somehow end up enjoying any part of this article, please consider following me on Twitter. I’m currently locked in mortal combat with Q-Games’s Ariel to be the first one to 1000 followers and my pride couldn’t handle her winning

“I find both you and your project worthless” – guy on twitter (paraphrased)

 

I have always loved acronyms.  Ever since writing an elf relocatable instrumenting profiler called Non Intrusive Profiling Platform for Loadable Elfes, I have strived to find “clever” acronyms for all my projects.  Even my job title is an acronym! Debugging and Optimizing Unified Cache Hierarchies Engineer, or at least thats what I assume it stands for when my coworkers call me the company douche.  It is in this spirit that I am pleased to present TREMBLE.

 

Thats me, the company D.O.U.C.H.E

 

 

TREMBLE, or Trivially Reordered Evaluation Multicore Bi-pipeline Loop Engine, is a tool for scheduling, optimizing, and software pipelining loops written in an SPU assembly-like language.  Actually, it /is/ SPU assembly with some extra directives to give various hints to the scheduler.  Its somewhat similar to another tool only available to licensed PS3 developers but is public and developed using 100% publicly available algorithms and information.  Once the code gets cleaned up a bit, and becomes a little more feature complete, I’ll throw it up on git hub (whatever that is) and hopefully it will become something like the Jaymin Kessler center for kids who can’t pipeline good :)

 

Disclaimers

 

I am not a compiler writer, nor an assembler writer.  Sure I did one back in college but nothing that interesting.  As a result, I’m sure the Codeplay guys will look at my tool and be horrified at how I missed obvious data structures and how my “optimizations” break everything but the two cases I tested.

 

Next, let it be known that I am doing this for fun and for my own education.  Yes I can optimize my own code, but writing a program to replace myself that works with programs I didn’t write and have never seen seems like an incredibly interesting challenge.  As such, I’m sure some of the things I say in this article may be a little off or will betray my status as an optimizing assembler writing n00b. But hey, you never know what interesting problems come up until you try!  Anyway, the point is that even though initial results are quite promising, I am only interested in having fun / experimenting / learning and don’t care much about beating tool X or compiler Y.

 

Finally, I wrote this tool in C#, a language I spend much of my time bashing for its atrocious performance.  So why would I do something so out of character? Well, I noticed that I do a lot of my regex processing in PERL already which isn’t exactly a language I’d do a game engine in, so why discriminate against C# (aside from it being a M$ language)?  Besides, I don’t care if the codegen is crap. I only care that the generated code’s generated code is fast, if you follow my meaning.

 

Before getting started, familiarize yourself with some stuff

 

Because reading is for nerds, I made some youtube videos.  The first is an introduction to what software pipelining is and why we may want to use it, the second is a boring discussion of algorithms that few people but me care about, and the last one is a lively animation that demonstrates the scheduling I am currently doing.  They can be found here:

You don’t have to watch them in 1080p, but I suggest at least 720p!

 

Cheat sheets are a great way to pass tests

 

I know things about the SPU ISA.  That doesn’t mean I want to manually enter all that stuff in.  In a very rare display of laziness, I decided to begin with a txt file version of the Insomniac SPU cheat sheet that I had laying around.  From there, its trivial to read in the instruction mnemonics, registers, descriptions, pipeline, latencies, and functional units.  I did have to make a few changes, though.  My program parser assumes that rt means destination register and ra is an input operand.  However, instructions like branches and stores use rt despite not actually writing values to that register.  So I went through and changed some instructions to use ra instead

 

stqd  ra, s14(ra) store quadword (d-form)  odd   6

I am also storing off other info, like whether an instruction is a branch or a store.  The only other modification was the ability to add comments.  This allowed me to not only document my changes, but also comment out instructions that aren’t allowed.

 

Parsing and building program info

 

I will now try to explain my crapfest of branches and for loops in the simplest way possible.  I have a 128 element array (one per register) of the following struct

 

 

struct RegOwner
{
    int inst_num;
    int times_used;
};

 

Every time an instruction writes to a register, it becomes that register’s owner.  Every time another instruction uses the result, times_used is incremented.

 

We start by looping over all instructions, and then each input operand for the current instruction.  Going through each non-immediate input operand, we look at the register number and look it up in the RegOwner table.  If no one owns the register, either its an input coming into the loop or its written to / updated by a later instruction to be used when we loop around again.  If the register does have an owner, we do the following (assuming the nth input operand):

 

  1. increment reg_owners[reg_num].times_used
  2. set current instruction register n dependency to the reg_owners[reg_num].inst_num
  3. add the current instruction to original_program[reg_owners[reg_num].inst_num].reg_info[0].dependencies

 

Its actually a bit more involved than that and I am storing more data in the register info struct, but you get the idea.  We have some kind of weird messed up dependency graph where each instruction’s output operand has a list of things that are dependents and each instruction’s input operands have exactly one instruction as a dependency.  Below is some of the insanely redundant info I am storing off

 

 


 

public class DepPair
{
    public int m_inst_num = -1;
    public int m_oper_num = -1;
}

public class RegisterDependencyNode
{
    public List<DepPair> m_dependents = new List<DepPair>();    // if output, things that depend on our result (can be many)
    public DepPair m_dependency;  // if input, the thing we depend on (can only be one)
    public Program.OperandInfo.OperandType m_operand_type;
    public int m_allocated_reg = -1;
    public int m_immediate_arg = -1;
    public bool m_is_locked = false;
}

public class InstructionNode
{
    public string m_opcode;
    public Program.Pipe m_pipeline;
    public int m_num_cycles;
    public bool m_include_in_schedule = true;
    public List<RegisterDependencyNode> m_reg_info = new List<RegisterDependencyNode>();  // all regs
}

 

Once the input operands are done, we do the output operand (if any).  This is as simple as saying the current instruction is the register’s new owner by doing

 

reg_owners[reg_num].inst_num = current_line;

reg_owners[reg_num].times_used = 0;

 

Lots of other little things are done in the main loop, but the only one really worth mentioning is tracking labels.  While technically I only allow branching at the end of the loop to the start of the loop, and loop counters that are incremented/decremented by immediates, I decided to go ahead anyway and associate labels with lines of code, of course remembering to adjust for comment lines!

 

The “parser” itself is done with regexes.  I’m using it to collect operands, opcodes, and comments, and it turned out to be much easier than writing a real parser!
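
To give a flavor of what that looks like, here is a made-up C++ approximation (the real pattern lives in TREMBLE and is C#, so don’t take this regex as gospel; labels like .L4: would need their own pattern):

#include <regex>
#include <string>

// split an assembly line into opcode, operand soup, and optional comment
void parse_line(const std::string &line)
{
    static const std::regex re(R"(^\s*([a-z]+)\s+([^;]*?)\s*(?:;(.*))?$)");
    std::smatch m;
    if (std::regex_match(line, m, re))
    {
        std::string opcode   = m[1];   // e.g. "shufb" or "lqd"
        std::string operands = m[2];   // e.g. "$0,$0,$0,$0" or "0($4)"
        std::string comment  = m[3];   // whatever followed the ';'
        // operands then get split on ',' and classified as
        // registers ($n) or immediates
    }
}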

 

Scheduling pass

 

For now, I am doing this:

I schedule the loop branch in the last slot of the odd pipe, and the first instruction in the first slot (first iteration) of whichever pipe it belongs to.  Now I loop over the rest of the instructions to schedule.  For each instruction, I look at its dependencies, where they are scheduled, and when they finish.  The last of those to finish is the first slot I am allowed to schedule the next instruction in.  If its not empty, I keep moving down the schedule looking for a free slot, and wrapping around and continuing when I fall off the edge of the schedule (the schedule length is the minimum initiation interval, or max(odd_instruction_count, even_instruction_count)).  Really, there isn’t much I can say that is more clear than the animation in the video.
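
For the video-averse, here is a rough sketch of that slot search with stand-in structures.  It’s not TREMBLE’s actual code (that is C#, and tracks far more state), just the core of the idea:

#include <algorithm>
#include <vector>

struct SketchOp {
    int pipe;                  // 0 = even, 1 = odd
    int latency;               // cycles until the result is ready
    std::vector<int> deps;     // instructions whose results we read
};

// issue_slot[i] is where instruction i was already placed; taken[pipe]
// marks occupied slots in the II-cycle window
int find_slot(const std::vector<SketchOp> &ops, const int *issue_slot,
              std::vector<bool> taken[2], int inst, int II)
{
    const SketchOp &op = ops[inst];

    // earliest cycle at which every input value is ready
    int earliest = 0;
    for (int dep : op.deps)
        earliest = std::max(earliest, issue_slot[dep] + ops[dep].latency);

    // walk forward from there, wrapping around the schedule, until the
    // instruction's pipe has a free slot
    for (int tries = 0; tries < II; ++tries)
    {
        int slot = (earliest + tries) % II;
        if (!taken[op.pipe][slot])
            return slot;
    }
    return -1;   // window full: a real scheduler would grow II and retry
}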

 

Assigning registers

 

For obvious reasons, I can’t use all the same registers used in the original program.  Anti-dependencies (for example) make it dangerous when instructions start getting rescheduled or wrapping around the schedule.  My initial process of assigning registers was so ghetto that its almost embarrassing.  It basically “worked” like this.  In the original loop, input operand registers that have no dependencies and output operands that have no dependents are untouchable.  This is because they are either inputs or outputs to the loop and should be left alone.  All other registers are fair game.  So we start out with a range of usable registers and remove ones on the no-touch list.  Then for each instruction, if the output operand is not in the do-not-touch list, we assign it a register from the pool available to instructions in that iteration.  Then for every instruction that depends on the output operand, we change the register it reads to the one we just assigned.  Thats it.  No SSA, no fancy anything.  It was something quick and dirty to get me up and running.  It was also horribly broken.

 

The concept of no-touch registers still seems semi-valid, but it doesn’t fix the antidependency problem.  Because I am doing register allocation after scheduling, I ended up having to try and insert register copies in some cases to make sure the registers read in iteration n+1 are different than the ones written in n.  I only have to do this for some cases, though.  If a result has no dependents in later iterations, or the dependents are used in instruction slots earlier than where we write the value, then we don’t need to copy (I think…).   Imagine this hypothetical situation.  We have the following program

 

a) add r32, r31, r31

b) add r33, r32, 4

c) add r34, r32, 4

which then gets scheduled as:

0) add r33, r32, 4 (instruction b, iteration n + 1)  ; uses r32 from previous iteration

1) add r32, r31, r31  (instruction a, iteration n)     ; writes a new value in r32

2) add r34, r32, 4 (instruction c, iteration n + 1)  ;  error: _should_ use r32 from previous iteration

 

In general, to be safe at the end of iteration n we have to copy the value in r32 to another reg, however notice that its only a problem if we have an instruction that reads r32 AFTER schedule slot 1 (aka instruction c).  If c didn’t exist, we wouldn’t need to copy

 

Contrived example: Lets say you have the following program

 

.L4:
    shufb   $0,$0,$0,$0
    shufb   $1,$1,$1,$1
    shufb   $2,$2,$2,$2
    shufb   $3,$3,$3,$3
    a       $5,$0,$0
    lqd	    $4,0($4)
    lqd	    $8,0($8)
    a       $7,$5,$8
    brz $4, .L4

 

The minimum initiation interval is 7, so the second add won’t be scheduled until the second iteration.  Actually, it will try to schedule it in the schedule slot where the first add already is, and so it gets placed in the next even slot after the first add.  Long story short, we have a second iteration instruction later in the schedule than the first iteration instruction it depends on, so a copy is required to fix the antidependency and stop the register from living too long.  Here is the TREMBLE output

 

scheduling logic:
scheduling shufb in slot 0 iteration 0
scheduling shufb in slot 1 iteration 0
scheduling shufb in slot 2 iteration 0
scheduling shufb in slot 3 iteration 0
scheduling a in slot 4 iteration 0
scheduling lqd in slot 4 iteration 0
scheduling lqd in slot 5 iteration 0
scheduling a in slot 5 iteration 1
register move required, inserting slot 6 pipe EVEN
 
best schedule found: 7 cycles with 2 iterations in flight
 
Prologue:
shufb r0 r0 r0 r0
shufb r1 r1 r1 r1
shufb r2 r2 r2 r2
shufb r3 r3 r3 r3
a r16 r0 r0
lqd r4 0(r4)
lqd r8 0(r8)
ori r18 r16 0
 
Schedule:
 
0) nop (iter 0)
0) shufb r0 r0 r0 r0 (iter 0 )
1) nop (iter 0)
1) shufb r1 r1 r1 r1 (iter 0 )
2) nop (iter 0)
2) shufb r2 r2 r2 r2 (iter 0 )
3) nop (iter 0)
3) shufb r3 r3 r3 r3 (iter 0 )
4) a r16 r0 r0 (iter 0)
4) lqd r4 0(r4) (iter 0 )
5) a r7 r18 r8 (iter 1)
5) lqd r8 0(r8) (iter 0 )
6) ori r18 r16 0 (iter 0) live too long: reg copy
6) brz r4 0 (iter 0 )

 

Prologue

 

The prologue is basically stepping through each instruction in the new schedule until you execute the last instruction whose result is needed for the instructions scheduled in the last iteration.  Its just that simple.

 

Removing unused instructions

 

This is surprisingly easy because I have an array of structs mapping register numbers to the instructions that own them.  Lets look at the following example

0: add r32, r31, r31

1: add r33, r32, r32     ; r33 owned by instruction 1, usage count = 0

2: add r33, r100, r101 ; previous r33 owner has a usage count of 0

 

Look at instruction 2.  It wants to write to r33 so it checks the previous owner’s usage count.  Instruction 1 wrote to r33 but that value was never used (count = 0).  Therefore we can remove instruction 1.  Going further, we can look at instruction 1’s input operands and remove them from the list of r32’s dependents.  After removing instruction 1, instruction 0 has no dependents anymore and can also be removed.

 

The reason I am not just removing instructions with zero dependents is because an instruction that seems to have no dependents can also be either a loop output (needed after the loop) or something needed at the beginning of the next loop iteration.
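
In code, the removal chain looks something like this sketch (again stand-in structures of mine, not TREMBLE’s):

#include <vector>

// when an unused instruction goes away, everything it read loses one
// user, and anything that drops to zero users is removed in turn
struct SketchInst {
    std::vector<int> reads;    // ids of the instructions we depend on
    int times_used;            // how many instructions read our result
    bool removed;
};

void remove_chain(std::vector<SketchInst> &prog, int dead)
{
    prog[dead].removed = true;
    for (int dep : prog[dead].reads)
    {
        // a real version must first check that dep isn't a loop output
        // or a next-iteration input, as noted above
        if (--prog[dep].times_used == 0 && !prog[dep].removed)
            remove_chain(prog, dep);
    }
}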

 

Contrived example:

 

again:
   a $32, $32, $32
   a $31, $31, $31
   a $34, $31, $32 ; removable because the only instruction depending on r34 is removable
   a $33, $31, $32 ; removable because the only instruction depending on r33 is removable
   a $32, $33, $34 ; unused instruction setting off the removal chain
   a $32, $35, $36
   brz $127, again

 

TREMBLE output:

 

unused instruction found on line 4: a r32 r33 r34
unused instruction found on line 3: a r33 r31 r32
unused instruction found on line 2: a r34 r31 r32
unused instruction found on line 1: a r31 r31 r31
unused instruction found on line 0: a r32 r32 r32
 
scheduling logic:
scheduling a in slot 0 iteration 0
 
best schedule found: 1 cycles with 1 iterations in flight
 
Schedule:
 
0) a r32 r35 r36 (iter 0)
0) brz r127 0 (iter 0 )

 

Limitations and future work

 

Here is a list of stuff I plan to do / fix / make suck less

 

  1. I am really interested in automatically unrolling loops and am experimenting with this pretty heavily at the moment.  Supposedly some types of modulo scheduling can benefit greatly from pre-unrolling loops.  However, it seems to be tricky in some cases determining what to unroll and how to unroll with complex logic updating the registers that serve as base addresses for stores.  If there is a real “official” way to do it, please don’t tell me.  I want to see what I can come up with on my own.
  2. When scheduling instructions, I am currently looking at them in order.  Evidence suggests there is some benefit to pre-ordering them by the number of dependents, with instructions having many dependents scheduled first.
  3. Register allocation isn’t as efficient as it could be. For example if you write a value to r32 in iteration 1, slot 0 and that value only lives until iteration 1 slot 2, technically we should be able to reuse that register for calculations that don’t carry over to the next iteration.  Also, I am assigning registers in a range with no preference for volatile regs and no spilling non-volatile regs.  That is straight up broken and needs to be fixed.  Speaking of registers…
  4. My fix for registers living too long can be a bit more clever.  Maybe if I take registers into account when scheduling…
  5. I would like to implement some other more interesting scheduling algorithms, and I need to back up and unschedule instructions when instruction scheduling fails.  Actually, at the moment scheduling never fails which is pretty horrific
  6. I’d like to do pipeline balancing in the case where one pipeline is used much more heavily than the other
  7. Lots of other optimizations I’d like to try
  8. I am generating prologues but not epilogues (yet).  Also, my logic to try and collapse epilogues and prologues is a little… incomplete.  s/incomplete/nonexistent
  9. The only branches that are allowed are branches back to the beginning of the loop, and loop counters have to be updated by adding/subtracting an immediate for unrolling to work
  10. DMA: I am not even thinking about this right now
  11. Massive redundancy, inefficiency, and redundancy.  Clearly my mood and direction changed every day I worked on this.  Needs cleaning up…
  12. It might be cool to eventually make some kind of iPad game or educational tool out of this.  The game could be things like “beat this schedule” and the educational tool would be more interactive and visually show you how the code was scheduled.  Maybe even serve as a reference where you could mouse over instructions to get info on them, or search for instructions in drop down menus when entering in programs interactively

KHAAAAAAAAAAAAAAAAN (academy)


Its the best website you’ve never heard of.

 

Originally I had planned to do an update on TREMBLE, my SPU loop pipeliner and optimizer. I added some cool new features like multi-implementation odd/even pipeline balancing macros and a GUI for people to play around with.  However, since I am pretty tied up with SIGGRAPH presentation issues at the moment, I am going to take the easy way out and write a short article on something very near and dear to my heart.

 

We all know some programmer that knows nothing about math.  If past experiences are indicators, we know him extremely well because he is us.  Most programmers I have known (myself included) just didn’t take a lot of math in school, or took some math class and 3 seconds after the class ended they instantly forgot all the meaningless formulas they memorized.  The way its taught in school, math just didn’t seem that interesting / relevant to programming.  Unless you’re bob, alice, or some dude talking about global illumination and raymarching fractals, you can actually get quite far in the industry not knowing how to factor polynomials, convert log bases, or do algebraic simplification.  However, just because you can get by without it, doesn’t mean you should strive to continue your ignorant ways, or miss out on the benefits of training yourself to think about things in alternate ways.

 

Enter Khan Academy. I don’t know whether to call it a video site, a school, a system, or a way of life.  At its core, it’s one person making thousands and thousands of videos on subjects like biology, chemistry, cosmology, physics, economics, math, and programming.  The thing that sets the videos apart from other educational videos is Khan’s focus on intuition, on understanding the why more than the how.  Anyone can give you a formula to memorize, but Khan gives you the understanding of why the formula is what it is, so that if you ever forget it you can reason out what it should be.  This gives Khan Academy videos that sense of wonder that only comes from a really cool concept suddenly snapping into place.

 

Furthermore, there is a really cool practice system that ties into the videos.  It’s a Google-Maps-style navigable tree of exercises you can make your way down, and a sidebar that suggests exercises for you to try based on videos you have watched.  If you have trouble answering a question, it provides a link to the relevant video so you can brush up without breaking your question-answering streak.  It even has a little scratchpad area off to the side that you can scribble notes on.  There is also a coaching section where teachers can add students and students can add coaches, so teachers can track their students’ progress and adjust accordingly in the actual classroom.

 

And of course, nearest and dearest to our hearts is the metagame.  You get points for watching videos and doing exercises, and there is even a trophy system (called “badges,” because trophies is too PS3-like?) where you are given various awards for doing things like watching a certain number of videos, answering a consecutive number of questions correctly, and other fun stuff.

[screenshots: my Khan Academy profile and badges]

I’ll now close with a list of stuff I love about Khan Academy:

0) If there is ever something you’ve been curious about or wanted to learn, this is the way to go! Everything is explained in a very easy-to-understand way, and if there is some prerequisite concept you need in order to understand a video, you can go watch that video as well.

1) Sometimes people lose concentration in class. If you nodded off for a second and missed some important piece of information, it’s embarrassing to ask the teacher to go back and repeat it. With Khan Academy, you can just back up a little and rewatch, or pause while thinking about what’s being said to make sure you understand.

2) Khan is human. He makes up examples on the fly, screws things up, makes mistakes, and later on has to make correction videos. It’s really an amazing example of someone who is really smart and really good at math, and yet isn’t perfect.  It makes you feel less discouraged when you, yourself, screw up.

3) The metagame is kinda addictive. If you look at the screenshots above, I have badges for doing addition and subtraction exercises.  It’s not because I was unsure of how to add negative numbers, but rather because I was trying to catch up to @kalin_t’s score.

4) Even if there were no practice exercises, badges, or anything else, the intuition you get from the videos is priceless.  It is really empowering to understand the why instead of just memorizing the how, and that feeling you get when everything snaps into place is just indescribably cool!

 

So, that’s it. Get your ass to www.khanacademy.org and start sucking less (or use your newfound knowledge of cosmology to impress girls at parties!)

 

PS… don’t judge me for the small number of watched videos listed in my profile.  I watch them every day at lunch and I’m not signed in 90% of the time :)

PixelJunk Shooter 2 SIGGRAPH talk

Here is a semi-accurate reproduction of my SIGGRAPH 2011 talk on PixelJunk Shooter 2. I worked really hard to make these videos, and all I ask in return is that you follow me on Twitter.  I need to hit 1000 followers and I need to do it soon…  or something very bad will happen.  Please show your appreciation for open knowledge sharing and help me out by following me on Twitter.  That’s all I ask.

 

http://twitter.com/okonomiyonda

 

And if you’re feeling especially helpful, please also get your friends to follow me :)  Now, without further ado, here’s the presentation, brought to you without commercial interruption in glorious 1080p!

 

(edit: at Naty Hoffman’s request, I have added the slides in PDF format for people who may not want to listen to me drone on in a 30-minute video.  The temporary home for the slides is here until they (probably) move to their permanent home on the Q-Games fumufumu website.  I am also adding a link to my website, which I reference throughout the video, here)

 

Part 1:

http://www.youtube.com/watch?v=7q8s7DMOOD4

 

Part 2:

http://www.youtube.com/watch?v=HBPQ7GRPTEw

 

Basic register allocation, for n00bs by n00bs

 

I won’t bother with an intro since I already know how many ADBAD readers care about register allocation.  My only request is that the one person who will end up watching these watches them in HD; otherwise my uploading of a 3.2GB file will have been for nothing.

 

video 1 part A

 

and part B

 

I HAVE CURED INSOMNIA (with this video on vector-scalar mixing)

disclaimer: sorry about my voice, but this was recorded during a 24-hour food-poisoning marathon and my throat was really raw and scratchy.

A medium-length video about the pain of mixing scalars and vectors in computation. I’ll show examples on all three CPUs you will encounter as a game developer: PowerPC, SPU, and ARM (although technically there are some people still doing PS2 MIPS, VFPU, or maybe even 68000). I’ll explain how particularities of each architecture make it harder (or easier) to do things like horizontal add and vector insert, and point out where the performance problems are. I will also open with one of the illest demos I ever heard on Stretch and Bobbito.
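To make the horizontal add pain concrete, here is a small sketch (my example, not taken from the video; it assumes ARM NEON intrinsics and a compiler that provides arm_neon.h).  Summing the four lanes of a vector takes a chain of data-movement/add steps, because there is no single “sum all lanes” instruction on ARMv7 NEON:

// Minimal 4-lane horizontal add with NEON intrinsics
#include <arm_neon.h>

float hsum4( float32x4_t v )
{
    // add the high half to the low half: { a+c, b+d }
    float32x2_t sum2 = vadd_f32( vget_low_f32( v ), vget_high_f32( v ) );
    // pairwise add: { a+b+c+d, a+b+c+d }
    float32x2_t sum1 = vpadd_f32( sum2, sum2 );
    // finally pull the scalar out of the vector register
    return vget_lane_f32( sum1, 0 );
}

The other architectures hurt in their own ways: on PowerPC, moving a value between a VMX register and a scalar register generally means a round trip through memory, and on SPU a “scalar” lives in the preferred slot of a 128-bit register, so inserts and extracts turn into shuffles.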


Strattonian Gambit

After a one-year hiatus from the site (including a bit of crunch), I thought I’d try to start making videos again.  Hopefully I can also finish a bunch of the almost-dones and post them as well. I apologize in advance for how hastily I threw this together.

Implementing absolute value using only odd instructions on the SPUs.

To me, the most fun part of programming is finding creative ways to use an ISA. There are few joys greater than exploiting really weird instructions, in ways they may or may not have been designed for, to save a few cycles. While the solution to this one didn’t really involve anything too crazy, it was still quite a fun puzzle.
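For context on why this makes a good puzzle, here is the boring solution the challenge takes away from you (my own sketch using the standard SPU C intrinsics — not anyone’s actual answer).  Absolute value on floats is just clearing the IEEE sign bit, but the bitwise and that does it issues on the even (arithmetic) pipeline:

// The obvious SPU absolute value: AND away the sign bit.
// spu_and maps to an even-pipeline instruction, which is exactly
// what the odd-instructions-only rule forbids.
#include <spu_intrinsics.h>

vec_float4 abs_even_pipeline( vec_float4 v )
{
    return (vec_float4)spu_and( (vec_uint4)v, spu_splats( 0x7FFFFFFFu ) );
}

The odd pipeline gets you loads, stores, shuffles, and branches, so an odd-only abs has to build the same bit-clear out of data movement instead.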

This one was inspired by a challenge posed to me by Cort Stratton of Naughty Dog ICE team fame. We were at a Sony dev con and everyone was going out drinking, so of course we decided to break off from the group and talk about programming. There he challenged me to find a way to implement absolute value using only odd instructions. The answer I gave him was, in his own words, a little “floaty” and “not too concrete.” I mumbled some lame excuse about arriving in NJ the day of Superstorm Sandy and spending 3 days jetlagged with no electricity or internet before catching a flight to California for the conference. I also promised him that I would send him my (more comprehensible) answer as soon as I got back to Japan.

One and a half years later, my answer still isn’t concrete. It’s still floaty and inexact. It may not even be correct. But this time it has nice pictures and animations to go along with it.