Long answer: I think that the more minds there are working on this (or any other interesting) problem, the more progress will be made. And we shouldn’t let the perfect be the enemy of the good enough.
The field of robotics, and especially soft robotics like this, is so new and uncharted that individuals tinkering in their garage can and have made real advances for the field.
That said, really good control systems for these kinds of actuators are still very much lacking. The human nervous system, and in particular the vestibular system, is so absurdly effective, efficient and complex all at the same time that it boggles my mind. Nothing anyone has designed has ever come close to that level of functionality. But to make a robot that can do human things, we need that level of functionality. We need really good proprioception integrated with force and feedback sensing.
To give an example: open a bag of potato chips and pull out a decent handful. Now eat the chips one by one out of your hand without dropping any and without using your other hand. I dare say you can probably do this, and without crushing the chips in your hand. You can feel just how much force to apply, you can feel where to place your fingers so that all the chips are held just so. You can make subtle adjustments after each chip is taken away.
There is no robot hand in the world that can do that simple task. Ask any arm amputee. And to get to that point, a lot of really difficult stuff has to be invented and worked out. I suspect this is a problem which will be solved piecemeal, over time, but much of the fundamental work in things like the mathematics of motion control, material science and compliant mechanisms will come from very well funded labs. If I had to guess, I give a 60% chance a robot can do the chip-eating task within 5 years, and 85% within 10 years. (Especially if the work of my team pays off.)
I run http://www.visaok.in/ - a job board listing curated visa-sponsored tech jobs in over 20 countries around the world. A while back, I painstakingly created Visa Guides for the 20 most popular of those countries (most of them in Europe and Asia). These Visa Guides detail the visa requirements for various categories of 'Skilled Tech Workers'. All Visa Guides are listed here => http://www.visaok.in/work-permit/blog/
I found that in almost all the countries that have some sort of a 'Startup Visa', there is no requirement for having a degree.
I've listed the Guides for European employers below. It shouldn't take you too long to identify a country of choice and review the short, but detailed Visa Guide to see if you fit in without a college degree. You can then search the main job site after selecting the Visa from the dropdown and entering your skill / job title.
My email is in bio. Feel free to write back for more info or if you have any feedback on the job board.
OK so basic background here: CUDA processing usually looks like some dimensional array of data (1d, 2d, 3d, etc). Then you have a series of "warps" which tessellate their way through your data space, processing a chunk of elements at a time. The warps can be organized into larger "blocks" whose threads can share data with each other. Many blocks make up a "grid", which is more or less synonymous with "the processing elements of a kernel". A kernel is a GPU program.
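For concreteness, here's a minimal sketch of that hierarchy (names and numbers are mine, purely illustrative): a kernel launched over a grid of blocks, where each block is split by the hardware into 32-thread warps.

    #include <cuda_runtime.h>

    // The GPU program: each thread handles one element of a 1-D data space.
    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // flat thread index
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        int threads = 256;                          // 8 warps per block
        int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
        scale<<<blocks, threads>>>(d_data, n, 2.0f);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }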
Blocks can't communicate with each other, since they may be on different SMX processor engines (SIMD units). Also, kernels can't communicate either, according to spec. CUDA doesn't guarantee the order of kernel scheduling - but it is possible via undefined behavior with spinlocks.
Generally speaking - larger problem sizes should be better for you. GPUs suck at small individual tasks, starting and stopping the kernels [from the CPU] is expensive. They are good when they are doing as big a task as possible (asymptotically to a limit). Memory size will limit how big a data set you can work on, which will limit your total speedup. So overall, less memory usage = better speed.
You run lots and lots of threads at any time. GPUs are designed around the idea of massive threading, easily run dozens of threads per actual core. This covers up the massive latency when you need to go off-chip to load from global memory. You might run 10,000 threads in a program, and most of them will be sleeping while waiting for their data to load. When all threads in a warp are in READY state, the warp is scheduled and will execute.
As you note, GPUs don't work well when the threads are doing different stuff. For example, any threads in a warp that don't take an "if" branch will just idle while the others execute it - because all threads in a warp execute in lockstep. They are masked off, so the instructions have no effect on their registers. If there are N different paths through the code, you will run it N times.
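A toy illustration of what that masking looks like in source (hypothetical kernel):

    // Both paths are issued for the warp; threads on the untaken path are
    // masked off and idle, so the two branches effectively run back to back.
    __global__ void clamp_sqrt(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (x[i] > 0.0f)
            x[i] = sqrtf(x[i]);   // part of the warp runs this...
        else
            x[i] = 0.0f;          // ...then the rest runs this
    }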
Architecture is critical to understand because this is actually bare-metal programming, like a microcontroller. There are very few niceties here. Memory is not zeroed between runs (actually not even during a soft PC restart). There is no virtual memory segmentation. Illegal accesses may not even throw, or they may trash your OS's viewport, crash the drivers, etc. And if you don't code around the architecture's limitations, your performance will suck balls.
-------
In terms of general advice: a lot of times, scanning your data to pre-process and select "active" areas of the problem is a viable strategy. Streaming data sequentially across a warp is a pretty efficient operation thanks to warp coalescing; you have mega amounts of bandwidth, etc.
Think real heavily about your data layout. Structure of arrays is often really good because it gives you an efficient stride of 1 as much as is possible when reading/writing. That maximizes your efficiency when coalescing warps. If you are having every thread fire off its own request with no coalescing - your IOPS will trash the memory controller's performance.
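A hypothetical before/after of that layout change:

    // Array-of-structures: thread i reads aos[i].x, so neighbouring threads
    // are sizeof(ParticleAoS) bytes apart -- poor coalescing.
    struct ParticleAoS { float x, y, z, mass; };

    // Structure-of-arrays: neighbouring threads read neighbouring floats
    // (stride 1), so a warp's 32 loads coalesce into a few wide transactions.
    struct ParticlesSoA { float *x, *y, *z, *mass; };

    __global__ void advance_x(ParticlesSoA p, const float *vx, float dt, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p.x[i] += vx[i] * dt;   // coalesced read-modify-write
    }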
As an extremely broad stroke, the best general-purpose approach to GPU programming is to convert your task into a sorting or searching task. GPUs are really, really good at sorting, and there's many good algorithms out there, so you don't have to handle the low-level stuff until you get up to a big problem size (i.e. you are maxing out GPU memory). Pay very close attention to the Thrust "histogram.cu" example because it demonstrates these techniques.
So, one good approach is to find your active elements first. You can sort the active elements to the front of the array. Or, you can use something like a prefix scan/sum or a thrust::copy_if to pull out indexes of "active" elements efficiently, and then scatter your operations across the indexes. If your indexes are sequential, then you will get the maximum amount of warp coalescing that is possible. That may not be much if your "active" elements are very sparse and widely distributed, but at least you're trying, and you're ensuring that every thread is working on an active element as much as possible.
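A sketch of that index-gathering step with Thrust -- the flags array and predicate are placeholders for whatever marks an element "active" in your problem:

    #include <thrust/device_vector.h>
    #include <thrust/copy.h>
    #include <thrust/iterator/counting_iterator.h>

    struct is_active {
        __host__ __device__ bool operator()(int flag) const { return flag != 0; }
    };

    // Returns the indexes of all elements whose flag is nonzero, in order, so
    // later kernels can iterate over a dense, mostly-sequential index list.
    thrust::device_vector<int> active_indexes(const thrust::device_vector<int> &flags)
    {
        thrust::device_vector<int> idx(flags.size());
        auto end = thrust::copy_if(thrust::counting_iterator<int>(0),
                                   thrust::counting_iterator<int>((int)flags.size()),
                                   flags.begin(),    // stencil
                                   idx.begin(),
                                   is_active());
        idx.resize(end - idx.begin());
        return idx;
    }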
Obviously, wherever possible you want to avoid redundant operations, just like on CPUs. Structure your data to avoid redundant sorting, consider whether you want in-place or stable-sorts, etc. But overall sorting is very efficient on GPUs. You avoid thread divergence, you align memory access, etc.
Another approach is "dynamic parallelism". So you scan your data, figure out where "hot spots" are that have a lot of data that needs processing, and you launch more compute resources there (your kernel can launch additional kernel instances where needed). Also, in some situations you may be able to do the above approach of picking out indexes that need processing and doing them all at once - but you do it into registers or shared RAM. That way you are still keeping your cores processing instead of idling, but you avoid the round-trip to global RAM. The downside is you increase pressure on your registers/SRAM, which are very very limited resources.
If a thread can't find an element to process in a particular place - there's actually no problem with having some of your threads continue on to the next area that the warp was going to process. Assuming a random distribution - on average most of your elements will be in approximately the same area, so you still get some coalescing, and there is really no reason to have the rest of the threads halt/diverge and wait for the active elements.
Another cute dynamic parallelism trick - most of the overhead from starting/stopping kernels is the overhead of syncing the device up to the CPU. Put your main loop in a kernel by itself, and have the kernel launch more processing kernels. Overhead gone, now the GPU is running 100% on its own. However - if you really do need to talk to the CPU, then you will have to spinlock and poll, which is undefined behavior. Again, possible but iffy.
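Roughly what that looks like (needs compute capability 3.5+ and relocatable device code; step_kernel stands in for your real per-iteration work):

    __global__ void step_kernel(float *data, int n);   // your per-step work

    // The outer loop lives on the GPU; each iteration launches a child grid
    // with no host round-trip. Device-side cudaDeviceSynchronize() was the
    // standard way to wait for children, though newer CUDA versions deprecate it.
    __global__ void main_loop(float *data, int n, int steps) {
        for (int s = 0; s < steps; ++s) {
            step_kernel<<<(n + 255) / 256, 256>>>(data, n);
            cudaDeviceSynchronize();   // wait for the child grid to finish
        }
    }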
I really fucking hate CURAND. It's absolute garbage to use, it eats tons of global memory, it eats tons of SRAM, it is very not good. Instead, I really like Random123. Essentially, instead of a "stateful" generator like Mersenne Twister, it's based on encryption algorithms. If you accept the concept that the output of an encryption algorithm is uncorrelated with its input, then essentially the encryption key becomes your "seed", encrypting the value 0 becomes the first output from the RNG, encrypting 1 the second, etc.
The advantage of doing this is that you don't waste your precious memory bandwidth and SRAM on CURAND, and instead you spend compute cycles. Paradoxically, GPUs have absolutely insane bandwidth, but bandwidth is the second most precious resource. The only thing more important is SRAM, because you get like 100 bytes per core (note: not per thread, per core, for all threads) or something like that, for all your registers, cache, and shared variables. Compute cycles are cheaper than dirt. If you can possibly compute something from some data you already have loaded, that will usually be more efficient than loading it from global memory.
Use some property of your data (say, an index, or a uid) as your key value for Random123 and you get essentially infinite RNGs for free. If you need to have different results across different runs (stochastic simulations) then just add the actual seed to the uid-key-value. By storing a single counter (the max counter value any single element has taken) you can maintain the individual states for every single generator in your set. Not only that, but you can seek to arbitrary places in your RNG sequence. Let's say you generate some property of your data randomly. You don't actually need to store that for each element - you can just store the counter value you used to generate that, you have the index of the data element you're working on, just re-generate it in place wherever you need it. It's free money. Wait no, free global memory, which means you can scale your program up, which means it runs faster. So basically free money. Even better, you can force it to be cached in every SRAM bank using the __constant__ keyword.
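A sketch of what this looks like with Random123's Philox generator (the element uid doubles as the key, the per-element draw number is the counter; names are mine):

    #include <Random123/philox.h>

    // No per-generator state in global memory: the (uid, seed) pair is the key
    // and the draw number is the counter, so any draw can be regenerated
    // anywhere, at any time, from just those integers.
    __device__ float uniform_draw(unsigned int uid, unsigned int seed,
                                  unsigned int draw) {
        r123::Philox4x32 rng;
        r123::Philox4x32::ctr_type ctr = {{draw, 0, 0, 0}};
        r123::Philox4x32::key_type key = {{uid, seed}};
        r123::Philox4x32::ctr_type out = rng(ctr, key);
        return out[0] * (1.0f / 4294967296.0f);   // map one 32-bit word to [0, 1)
    }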
I have a really idiosyncratic style for CUDA. I typically start with Thrust (basically the C++ STL for CUDA), writing high-level functional operations. Then I figure out where I can squish operations together, which I move into functors (pass them the index of elements they're working on, plus the array head pointers, they do operations on memory). Functors are nice because Thrust will auto-balance the grid for you for good occupancy. You can then start porting stuff into raw __device__ functions, and then finally translate it to a __global__ function that allows warp and grid level collective operations.
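As a small example of that functor style (a hypothetical fused scale-and-add over two device arrays):

    #include <thrust/device_vector.h>
    #include <thrust/for_each.h>
    #include <thrust/iterator/counting_iterator.h>

    struct saxpy_at {
        float a;
        const float *x;   // raw array head pointers
        float *y;
        __host__ __device__ void operator()(int i) const {
            y[i] = a * x[i] + y[i];   // operate on memory by element index
        }
    };

    void run(thrust::device_vector<float> &x, thrust::device_vector<float> &y, float a)
    {
        saxpy_at f{a, thrust::raw_pointer_cast(x.data()),
                      thrust::raw_pointer_cast(y.data())};
        // Thrust chooses the launch configuration (occupancy) for you.
        thrust::for_each(thrust::counting_iterator<int>(0),
                         thrust::counting_iterator<int>((int)x.size()),
                         f);
    }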
Once you've got the high-level stuff done, you need to tune the low-level kernel behavior. As much as possible - avoid global-atomic operations, since they kill your performance (you bypass cache and operate directly on global memory, incurring latency with every call, and CAS updates tend to cause contention/spinning). Pre-process in your shared RAM as much as possible. CUB (CUDA UnBound) provides warp-level and block-level collective operations that are useful - for example, a prefix-sum can give you the output targets for each thread in a warp that has variable amounts of data (0, 1, many) that it needs to output, which replaces a whole bunch of atomic operations, etc.
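A sketch of that prefix-sum-for-output-slots pattern using CUB's block-level scan (the counts/offsets arrays are illustrative):

    #include <cub/cub.cuh>

    // Each thread has a variable number of items to emit. An exclusive sum over
    // those counts gives every thread its private write offset, so a whole block
    // needs at most one atomic to reserve its slice of the output array.
    template <int BLOCK_THREADS>
    __global__ void compute_offsets(const int *counts, int *offsets, int *block_totals)
    {
        typedef cub::BlockScan<int, BLOCK_THREADS> BlockScan;
        __shared__ typename BlockScan::TempStorage temp;

        int tid = blockIdx.x * BLOCK_THREADS + threadIdx.x;
        int my_count = counts[tid];
        int my_offset, block_total;
        BlockScan(temp).ExclusiveSum(my_count, my_offset, block_total);

        offsets[tid] = my_offset;
        if (threadIdx.x == 0) block_totals[blockIdx.x] = block_total;
    }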
However, again a caveat: writing these collective operations can often involve "sync points", like thread fences. These warp/block/global sync points are really expensive in terms of processing, since you will have a bunch of cores idling while they wait for the stragglers. In some cases it's again possible to avoid an explicit sync operation by clever exploitation of the CUDA scheduler (as above, with inter-grid communication: it's not really that smart). But this is obviously very much undefined behavior too.
Texture cache can sometimes also be helpful. Basically it lets you align data in multiple dimensions rather than just one - so you can have a 3D kernel reading values, and from the GPU perspective it looks like they're all aligned, even though you're reading chunks that are hugely off in flat memory space. But there are some caveats: IIRC you really need to set it up before you run a kernel (can't do it on the fly), and IIRC it's read-only.
Also, you can cleverly abuse the texture interpolation for free math sometimes. That's typically the best gains you'll get out of texture memory, but it comes at the cost of lots of extra latency.
In newer revisions of CUDA you can transparently page stuff from host memory and it will kinda try to keep the two memory spaces synced up or whatever. This is a really bad idea, you should think real carefully before using that feature (basically never). Your 300 GB/s memory system is suddenly limited to 16 GB/s over PCIe, and memory bandwidth is precious. Explicitly manage your device memory, explicitly say when you want stuff copied and fsync'd, and don't let the autopilot handle it.
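In other words, prefer the boring explicit path (sketch, illustrative names):

    #include <cuda_runtime.h>

    // One deliberate allocation and one deliberate copy over PCIe, instead of
    // letting managed memory page data back and forth on demand mid-kernel.
    float *upload(const float *host_data, size_t n) {
        float *d = nullptr;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, host_data, n * sizeof(float), cudaMemcpyHostToDevice);
        return d;
    }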
-------
As for your specific problem of tree searching: this is really bad for GPUs. As you noticed, naive tree algorithms are pretty much the worst case; they lead to lots of divergence, which GPUs suck at. As much as possible - you want to convert things into internal "while" loops that can keep moving across your dataset if they don't find something in a specific place. Don't recurse, loop. But generally - the structures which work well for CPUs don't necessarily work well for GPUs, especially if you insist on doing one operation at a time. Searching for one element in a tree sucks. Doing range queries or searching a couple hundred values is going to be a lot better.
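To make the "loop, don't recurse" point concrete, here's a toy batched lookup in a binary search tree stored flat in an array (implicit children at 2i+1 / 2i+2) -- a hypothetical layout, not a recommendation of any particular structure:

    // One query per thread; each thread walks the tree with a while loop.
    // Batching a few hundred queries keeps warps full; a single lookup would
    // leave almost the whole GPU idle.
    __global__ void lookup(const int *tree, int n_nodes,
                           const int *queries, int *result, int n_queries) {
        int q = blockIdx.x * blockDim.x + threadIdx.x;
        if (q >= n_queries) return;
        int target = queries[q];
        int i = 0, found = -1;
        while (i < n_nodes) {                        // iterate, don't recurse
            int v = tree[i];
            if (v == target) { found = i; break; }
            i = (target < v) ? 2 * i + 1 : 2 * i + 2;
        }
        result[q] = found;
    }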
I have always been fascinated with the idea of probabilistic data structures and GPUs. Maybe you don't know for sure where an element is stored, but with 2000 cores you can probably find it even if there's a few dozen places it might be. That avoids some of the traditional problems of lock contention/etc on traditional data structures. And when you need to rebalance - GPUs are good at that sort of thing, since it's more or less sorting.
Also, I feel like GPUs could be an interesting model for Erlang. Lots of threads idling with low overhead? That's Erlang. Efficient message passing would be a trick though, and the use-cases would be diametrically opposite. You would have high latency and efficient numerical processing.
I also think I should be able to implement EpiSimdemics with a similar model to this one, but that model isn't open source and Keith Bissett, the guy at Virginia Tech who runs that program, refused to return my calls when I asked for disease model parameters to validate against. Ah, academia.
-------
Ton of words here, and it's been years since I touched any of this stuff (couldn't find a job in my specialty and ended up programming Java - ugh) but you've inspired me to actually finally put the code for my grad thesis on github. It might be a worthwhile example of a real-world problem for you. Be gentle, it's my first time. I haven't touched it in years and there are a few minor things I know I screwed up (noted in the readme.md).
Quick-start docs for the Thrust library, the easiest introduction to CUDA that you will ever find, literally 10 lines for a hello-world program: https://thrust.github.io/
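From memory, the front-page example is roughly this (fill a vector on the host, ship it to the GPU, sort it there):

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main() {
        thrust::host_vector<int> h(1 << 20);
        thrust::generate(h.begin(), h.end(), rand);    // random data on the host
        thrust::device_vector<int> d = h;              // one copy to the GPU
        thrust::sort(d.begin(), d.end());              // sorted on the device
        thrust::copy(d.begin(), d.end(), h.begin());   // and copied back
        return 0;
    }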
Co-author here -- thanks for linking to our paper. The basic theme is that (a) purely functional implementations of things like video codecs can allow finer-granularity parallelism than previously realized [e.g. smaller than the interval between key frames], and (b) lambda can be used to invoke thousands of threads very quickly, running an arbitrary statically linked Linux binary, to do big jobs interactively. (Subsequent work like PyWren, by my old MIT roommate, has now done (b) more generally than we did.)
Daniel Horn, Ken Elkabany, Chris Lesniewski, and I used the same idea of finer-granularity parallelism at inconvenient boundaries, and initially some of the same code (a purely functional implementation of the VP8 entropy codec, and a purely functional implementation of the JPEG DC-predicted Huffman coder that can resume encoding from midstream) in our Lepton project at Dropbox to compress 200+ PB of their JPEG files, splitting each transcoded JPEG at an arbitrary byte boundary to match the filesystem blocks: https://www.usenix.org/conference/nsdi17/technical-sessions/...
My students and I also have an upcoming paper at NSDI 2018 that uses the same purely functional codec to do better videoconferencing (compared with Facetime, Hangouts, Skype, and WebRTC with and without VP9-SVC). The idea is that having a purely functional codec lets you explore execution paths without committing to them, so the videoconferencing system can try different quantizers for each frame and look at the resulting compressed sizes (or no quantizer, just skipping the frame even after encoding) until it finds one that matches its estimate of the network's capacity at that moment.
One conclusion is that video codecs (and JPEG codecs...) should support a save/restore interface to give a succinct summary of their internal state that can be thawed out elsewhere to let you resume from that point. We've shown now that if you have that (at least for VP8, but probably also for VP9/H.264/265 as these are very similar) you can do a lot of cool tricks, even trouncing H.264/265-based videoconferencing apps on quality and delay. (Code here: https://github.com/excamera/alfalfa). Mosh (mobile shell) is really about the same thing and is sort of videoconferencing for terminals -- if you have a purely functional ANSI terminal emulator with a save/restore interface that expresses its state succinctly (relative to an arbitrary prior state), you can do the same tricks of syncing the server-to-client state at whatever interval you want and recovering efficiently from dropped updates.
For lambda, it's fun to imagine that every compute-intensive job you might run (video filters or search, machine learning, data visualization, ray tracing) could show the user a button that says, "Do it locally [1 hour]" and a button next to it that says, "Do it in 10,000 cores on lambda, one second each, and you'll pay 25 cents and it will take one second."