# Google Interview

## Coding Interview 1

[['a1', 'a2', 'a1b'], ['b1', 'b2'], ['c1', 'c2']]

### Problem 1

Print all combinations, like (sketch at the end of these notes):

```
a1b1c1
a1b1c2
a1b2c1
a1b2c2
a2b1c1
a2b1c2
a2b2c1
a2b2c2
a1bb1c1
a1bb1c2
a1bb2c1
a1bb2c2
```

### Problem 2

Given a string 'a2b2c1', is it one of the combinations from above?

Solved by dynamic programming: greedy matching can fail because 'a1' and 'a1b' share a prefix, so track every position reachable after consuming one token per group (sketch at the end of these notes).

## Coding Interview 2

### Problem 1

Given a tree (not a binary one):

```
      o
     /|\
    / | \
   o  o  o
  / \   / \
 o   o o   o
 |     |
 o     o
```

Find nodes with similar structure; for example, all leaf nodes are similar.

My solution was to come up with a signature for each node built from its children's signatures, e.g. sig(node) = "(" + sig(c1) + "," + sig(c2) + "," + ... + ")"; then create a map signature -> set of similar nodes. If we know that the structures are diverse enough, we could actually hash the signature instead of keeping the full string. (Sketch at the end of these notes.)

### Problem 2

Represent a number in base (-2), like

a*(-2)^4 + b*(-2)^3 + c*(-2)^2 + d*(-2)^1 + e*(-2)^0

Place values (from the lowest bit): 1, -2, 4, -8, 16, -32, ...

```
 0 = 0000
 1 = 0001
-1 = 0011
 2 = 0110
-2 = 0010
 3 = 0111
-3 = 1101
...
```

(Sketch at the end of these notes.)

## System Design 1

800 billion query log entries on 8 machines; find the top million queries by frequency.

- How many unique queries? Take samples to estimate; assume uniqueness shrinks the data by a factor of ten.
- Distribute queries evenly across the machines (hash by query), so each query's full count ends up on exactly one machine.
- Go over the logs on each machine and create 10B key-value pairs (assuming the 10x reduction).
- Average query length = 50 bytes, so that is ~4TB of key-value data across all machines.
- Data per machine = 500GB (produced from ~5TB of raw logs per machine).
- Use a hash map; split the data into chunks of ~0.5-1TB, producing about 50GB of counts per chunk; repeat 5-10 times to cover all the data. (Counting sketch at the end of these notes.)
- Reading from disk would be ~20h at 200MB/s, but with 6 disks in parallel it is ~3.5h to read everything.
- Another way to do this is to sort each chunk (by query) and merge them, producing all 500GB of data.
- Idea: merge the data between machines in pairs (~1.5h to send over the wire).
- Sort the 500GB of data on each machine somehow and produce the top 1M queries per machine (~50MB of data). That is about 10x50GB chunks; create another hash count -> set of queries.
- Do chunks have overlapping queries? Yes, they will.
- Send all of that data to one machine (~400MB in total).
- Apply some sort of heap sort and merge the presorted arrays. (Merge sketch at the end of these notes.)

Correctness of data.

How big are the machines?

- 32 cores at 3GHz
- 120GB of RAM
- 6 x 2TB spinning disks
- 10Gbps network connection
- you can store ~5TB on each machine

How to do the key-value pairs? Use a hash map.

Run time:

- 2M queries
- 100MB of data per machine (1GB of data transferred in total)
- maybe another 1GB for the second phase
- ~4-5h to process this data
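
---

For Coding Interview 1, Problem 1, a minimal sketch of printing all combinations. It leans on `itertools.product` rather than a hand-rolled recursion, which is my choice, not necessarily what the interviewer expected; the output order matches the listing above.

```python
from itertools import product

groups = [['a1', 'a2', 'a1b'], ['b1', 'b2'], ['c1', 'c2']]

def all_combinations(groups):
    # product yields one token from each group, last group varying fastest,
    # which reproduces the order of the listing in the notes.
    for combo in product(*groups):
        yield ''.join(combo)

for s in all_combinations(groups):
    print(s)
```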
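For Coding Interview 1, Problem 2, a small dynamic-programming sketch. Because 'a1' and 'a1b' overlap as prefixes, a greedy scan can commit to the wrong token, so the sketch tracks every string position reachable after consuming one token from each group in order. Function and variable names are my own.

```python
def is_combination(s, groups):
    # reachable = set of positions in s we can be at after consuming
    # exactly one token from each of the groups processed so far.
    reachable = {0}
    for group in groups:
        nxt = set()
        for pos in reachable:
            for token in group:
                if s.startswith(token, pos):
                    nxt.add(pos + len(token))
        reachable = nxt
    return len(s) in reachable

groups = [['a1', 'a2', 'a1b'], ['b1', 'b2'], ['c1', 'c2']]
print(is_combination('a2b2c1', groups))   # True
print(is_combination('a1bb1c1', groups))  # True ('a1b' + 'b1' + 'c1')
print(is_combination('a1b1c3', groups))   # False
```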
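For Coding Interview 2, Problem 1, a sketch of the signature idea: each node's signature is "(" plus the sorted signatures of its children plus ")", so nodes with the same subtree shape end up in the same bucket and all leaves share the signature "()". The `Node` class, the exact signature format, and the sample tree construction are my assumptions based on the notes.

```python
from collections import defaultdict

class Node:
    def __init__(self, children=None):
        self.children = children or []

def group_similar(root):
    # Map signature -> list of nodes with that subtree shape.
    buckets = defaultdict(list)

    def sig(node):
        # Sort child signatures so that child order does not matter.
        s = '(' + ','.join(sorted(sig(c) for c in node.children)) + ')'
        buckets[s].append(node)
        return s

    sig(root)
    return buckets

# The tree from the notes: a root with three children; the first and third
# children each have two kids, one of which has a single child of its own.
left = Node([Node([Node()]), Node()])
right = Node([Node([Node()]), Node()])
root = Node([left, Node(), right])

for signature, nodes in group_similar(root).items():
    print(signature, len(nodes))
```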
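For Coding Interview 2, Problem 2, a sketch of converting an integer to base (-2) by repeated division by -2, forcing each remainder into {0, 1}. This is the standard negabinary conversion, not necessarily the derivation discussed in the interview; it reproduces the table above.

```python
def to_negabinary(n):
    # Repeatedly divide by -2, always keeping a remainder of 0 or 1.
    if n == 0:
        return '0'
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:        # push the remainder back into {0, 1}
            r += 2
            n += 1
        digits.append(str(r))
    return ''.join(reversed(digits))

for x in (0, 1, -1, 2, -2, 3, -3):
    print(x, to_negabinary(x))
```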
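For System Design 1, a minimal per-machine counting sketch. It assumes queries are already hash-partitioned, so this machine's counts are the global totals for the queries it owns, and it keeps the whole hash map in memory, skipping the 50GB-per-chunk spill-to-disk step described above. The function name and the chunk interface are hypothetical.

```python
from collections import Counter
from heapq import nlargest

def local_top_queries(log_chunks, k=1_000_000):
    """Count this machine's queries chunk by chunk, then keep the local top-k."""
    counts = Counter()
    for chunk in log_chunks:        # e.g. slices of the logs read off the 6 disks
        for query in chunk:         # each query is a short string (~50 bytes)
            counts[query] += 1
    # Hash partitioning means these counts are final for the queries seen here.
    return nlargest(k, counts.items(), key=lambda kv: kv[1])
```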
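Also for System Design 1, a sketch of the final "merge presorted arrays" step on the single collecting machine. It assumes each machine sends its local top list sorted by count in descending order; since each query lives on exactly one machine, a k-way merge of those lists yields the global ranking directly. The tiny example data is made up.

```python
from heapq import merge
from itertools import islice

def global_top(per_machine_tops, k=1_000_000):
    # k-way merge of lists that are already sorted by count, descending.
    ranked = merge(*per_machine_tops, key=lambda kv: kv[1], reverse=True)
    return list(islice(ranked, k))

# Hypothetical (query, count) lists from two machines, sorted by count descending.
m1 = [('q_a', 900), ('q_b', 400), ('q_c', 10)]
m2 = [('q_d', 700), ('q_e', 300)]
print(global_top([m1, m2], k=3))   # [('q_a', 900), ('q_d', 700), ('q_b', 400)]
```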