Neural Magic interview question

Speeding up an existing CUDA kernel and proposing some optimizations.