Nikita - a story of obsessed developers

FADE IN: IBM MANCHESTER LAB - DAY
Enterprise-y office setting.
                MANUEL
        Coffee? Tea?
                TOBY
        Sure.
                MANUEL
        By the way, I found this awesome
        problem on HackerRank - it's called
        Nikita and the Game.
                TOBY
        It's... hmm. Interesting. Maybe if
        you start with... No, wait. Hmm...
Some time later:
                JOE
        What are you guys looking at?

And that’s how we got nerd sniped.

There’s the obvious solution, which happens to be the quickest to implement. This is the route I chose, in Python. You simply start on each end of the array segment, and look for the mid-point by moving the position of the smaller sum:

  array: 4 1 1 3 1
l/r pos: ^       ^
l/r sum: 4   >   1

  array: 4 1 1 3 1
l/r pos: ^     ^
l/r sum: 4  =  4

  array: 4 1 1 3 1
l/r pos: ^   ^
l/r sum: 5 > 4

  array: 4 1 1 3 1
l/r pos:   ^ ^
l/r sum:   5=5

=> [4, 1] [1, 3, 1]

The performance of this is satisfactory. In fact, the only issue I hit was recursion depth. While I could bump this up on my machine, on HackerRank this didn’t work. No worries though, just use a list as a stack for array segments and iterate.

Joe however figured out that you can use a prefix sum to express the problem, at which point you have an ordered list which you can search really quickly:

array: _ 4 1 1 3  1
scan:  0 4 5 6 9 10
hi/lo: ^          ^, mid = 5, index(5) = 2
=> [0, 4, 5] [6, 9, 10]

This works because you only need the depth. It’s also extremely efficient because you’re only summing once. Additionally, as soon as you get an array segment with an odd sum, you can stop because you’ll never be able to split it into equal parts in this representation.

Problem solved, right? Wrong. Now we have two solutions (both Python), and one is quicker. It’s officially a competition!

There’s one more (easy) optimisation we can make. Zeros don’t matter, because if you have a zero, it acts like a wildcard for any side of the sum. It’s not completely obvious in the simple approach, but the prefix sum approach obviously doesn’t change by adding zeros. There’s one edge case: if all your values are zeros. Say you have n zeros, then you can split n - 1 times:

0 0 0 0 0
0 0 0 0
0 0 0
0 0
0 <- depth = 4

It’s worth handling this edge case, as one of the test cases has a fairly large n.

Joe and I did some profiling on the Python code. Turns out that the majority of time is spent on int(), parsing the input!

arr = [int(s) for s in text.split(' ') if s != '0']

At this point, Manuel moved to C and Joe moved to Golang. Both of these will execute faster than Python (but take longer to write). Manuel found out the biggest speed-up came from implementing your own atoi, which is super simplistic because the input is very limited. I implemented it like this:

uint64_t _atoi() {
    uint64_t num = 0;
    int b = getchar();
    while (b != ' ' && b != '\n') {
        num = num * 10 + (b - '0');
        b = getchar();
    }
    return num;
}

This does make sense if the problem is I/O bound. I also put this into Joe’s go program. Surprisingly, the go version was quicker than Manuel’s C version! Now that’s interesting.

By using gcc -O3, I managed to shave a few milliseconds off. Then I used clang, which I prefer because the assembly output is much cleaner. The clang -O3 version was also faster than gcc -O3 - llvm is simply better at rearranging the code at a higher level. clang can output the assembly directly. For comparing it with the go version, objdump works on both. ¹

$ clang -O3 -S -masm=intel nikita.c -o nikita.s
# or
$ clang -O3 -Wall -Wextra nikita.c -o nikita
$ objdump -M intel -S nikita > nikita.s

Let’s see what we have:

; Intel-syntax 64-bit assembly
; prologue omitted
  xor ebx, ebx              ; uint64_t num = 0; *
  jmp getchar               ; goto getchar;
  .align  4, 0x90           ; 6 bytes of nop padding for alignment
loop_start:                 ; do {
  lea rcx, [rbx+4*rbx]      ;   tmp = num + num * 4 = num * 5;
  cdqe                      ;   b = (uint64_t)b;
  lea rbx, [rax+2*rcx-48]   ;   num = tmp * 2 + b - '0';
; so the last 3 lines are:   num = num * 10 + (uint64_t)b - '0';
getchar:
  call    _getchar          ;   b = getchar();
; return value is in rax, but since it's an int use eax.
  cmp eax, 10               ;   if (b == '\n')
  je  loop_end              ;     break;
  cmp eax, 32               ; } while (b != ' ')
  jne loop_start
loop_end:
  mov rax, rbx              ; return num;
; epilogue omitted

* for amd64, zeroing the 32-bit register implicitly zeros the top 32-bits. Doing this as a 32-bit op takes one less byte than the corresponding 64-bit op: 48 31 db xor rbx,rbx vs 31 db xor ebx,ebx.

Pretty tight - not much optimisation there, except for getchar. I tried using fread on stdin instead, but that was about the same as getchar. In the go disassembly, I didn’t find either getchar or fread, but I did find mmap! Aha, so that’s how the go version is faster.

Using mmap, the C version was now 3x as fast as the go version. Now we could all get on with our lives. Nikita, it’s been fun!

[1]

There are several other ways to output assembly from a go program:

go tool compile -S [-N] nikita.go > nikita.s - This produces some weird output. -N for unoptimised.
go tool objdump $GOPATH/bin/nikita > nikita.s - The same as compile, but with all the statically linked stuff.

I prefer the objdump -S $GOPATH/bin/nikita route. ↩