FADE IN: IBM MANCHESTER LAB - DAY Enterprise-y office setting. MANUEL Coffee? Tea? TOBY Sure. MANUEL By the way, I found this awesome problem on HackerRank - it's called Nikita and the Game. TOBY It's... hmm. Interesting. Maybe if you start with... No, wait. Hmm... Some time later: JOE What are you guys looking at?

And that’s how we got nerd sniped.

There’s the obvious solution, which happens to be the quickest to implement. This is the route I chose, in Python. You simply start on each end of the array segment, and look for the mid-point by moving the position of the smaller sum:

```
array: 4 1 1 3 1
l/r pos: ^ ^
l/r sum: 4 > 1
array: 4 1 1 3 1
l/r pos: ^ ^
l/r sum: 4 = 4
array: 4 1 1 3 1
l/r pos: ^ ^
l/r sum: 5 > 4
array: 4 1 1 3 1
l/r pos: ^ ^
l/r sum: 5=5
=> [4, 1] [1, 3, 1]
```

The performance of this is satisfactory. In fact, the only issue I hit was recursion depth. While I could bump this up on my machine, on HackerRank this didn’t work. No worries though, just use a list as a stack for array segments and iterate.

Joe however figured out that you can use a prefix sum to express the problem, at which point you have an ordered list which you can search really quickly:

```
array: _ 4 1 1 3 1
scan: 0 4 5 6 9 10
hi/lo: ^ ^, mid = 5, index(5) = 2
=> [0, 4, 5] [6, 9, 10]
```

This works because you only need the depth. It’s also extremely efficient because you’re only summing once. Additionally, as soon as you get an array segment with an odd sum, you can stop because you’ll never be able to split it into equal parts in this representation.

Problem solved, right? Wrong. Now we have two solutions (both Python), and one is quicker. It’s officially a competition!

There’s one more (easy) optimisation we can make. Zeros don’t matter, because if you have a zero, it acts like a wildcard for any side of the sum. It’s not completely obvious in the simple approach, but the prefix sum approach obviously doesn’t change by adding zeros. There’s one edge case: if all your values are zeros. Say you have `n`

zeros, then you can split `n - 1`

times:

```
0 0 0 0 0
0 0 0 0
0 0 0
0 0
0 <- depth = 4
```

It’s worth handling this edge case, as one of the test cases has a fairly large `n`

.

Joe and I did some profiling on the Python code. Turns out that the majority of time is spent on `int()`

, parsing the input!

```
arr = [int(s) for s in text.split(' ') if s != '0']
```

At this point, Manuel moved to C and Joe moved to Golang. Both of these will execute faster than Python (but take longer to write). Manuel found out the biggest speed-up came from implementing your own `atoi`

, which is super simplistic because the input is very limited. I implemented it like this:

```
uint64_t _atoi() {
uint64_t num = 0;
int b = getchar();
while (b != ' ' && b != '\n') {
num = num * 10 + (b - '0');
b = getchar();
}
return num;
}
```

This does make sense if the problem is I/O bound. I also put this into Joe’s go program. Surprisingly, the go version was quicker than Manuel’s C version! Now that’s interesting.

By using `gcc -O3`

, I managed to shave a few milliseconds off. Then I used `clang`

, which I prefer because the assembly output is much cleaner. The `clang -O3`

version was also faster than `gcc -O3`

- llvm is simply better at rearranging the code at a higher level. `clang`

can output the assembly directly. For comparing it with the go version, `objdump`

works on both. ^{1}

```
$ clang -O3 -S -masm=intel nikita.c -o nikita.s
# or
$ clang -O3 -Wall -Wextra nikita.c -o nikita
$ objdump -M intel -S nikita > nikita.s
```

Let’s see what we have:

```
; Intel-syntax 64-bit assembly
; prologue omitted
xor ebx, ebx ; uint64_t num = 0; *
jmp getchar ; goto getchar;
.align 4, 0x90 ; 6 bytes of nop padding for alignment
loop_start: ; do {
lea rcx, [rbx+4*rbx] ; tmp = num + num * 4 = num * 5;
cdqe ; b = (uint64_t)b;
lea rbx, [rax+2*rcx-48] ; num = tmp * 2 + b - '0';
; so the last 3 lines are: num = num * 10 + (uint64_t)b - '0';
getchar:
call _getchar ; b = getchar();
; return value is in rax, but since it's an int use eax.
cmp eax, 10 ; if (b == '\n')
je loop_end ; break;
cmp eax, 32 ; } while (b != ' ')
jne loop_start
loop_end:
mov rax, rbx ; return num;
; epilogue omitted
```

* for amd64, zeroing the 32-bit register implicitly zeros the top 32-bits. Doing this as a 32-bit op takes one less byte than the corresponding 64-bit op: `48 31 db xor rbx,rbx`

vs `31 db xor ebx,ebx`

.

Pretty tight - not much optimisation there, except for `getchar`

. I tried using `fread`

on `stdin`

instead, but that was about the same as `getchar`

. In the go disassembly, I didn’t find either `getchar`

or `fread`

, but I did find `mmap`

! Aha, so that’s how the go version is faster.

Using mmap, the C version was now 3x as fast as the go version. Now we could all get on with our lives. Nikita, it’s been fun!

There are several other ways to output assembly from a go program:

`go tool compile -S [-N] nikita.go > nikita.s`

- This produces some weird output.`-N`

for unoptimised.`go tool objdump $GOPATH/bin/nikita > nikita.s`

- The same as`compile`

, but with all the statically linked stuff.

I prefer the `objdump -S $GOPATH/bin/nikita`

route. ↩