# Google Interview

## Coding Interview 1

[['a1', 'a2', 'a1b'], ['b1', 'b2'], ['c1', 'c2']]

### Problem 1

Print all combinations, like (sketch at the end of these notes):

```
a1b1c1
a1b1c2
a1b2c1
a1b2c2
a2b1c1
a2b1c2
a2b2c1
a2b2c2
a1bb1c1
a1bb1c2
a1bb2c1
a1bb2c2
```

### Problem 2

Given a string 'a2b2c1', is it one of the combinations from above?

Solved by dynamic programming: greedy matching can fail because 'a1' and 'a1b' share a prefix, so track every position reachable after consuming one token per group (sketch at the end of these notes).

## Coding Interview 2

### Problem 1

Given a tree (not a binary one):

```
      o
     /|\
    / | \
   o  o  o
  / \   / \
 o   o o   o
 |     |
 o     o
```

Find nodes with similar structure; for example, all leaf nodes are similar.

My solution was to come up with a signature for each node built from its children's signatures, e.g. sig(node) = "(" + sig(c1) + "," + sig(c2) + "," + ... + ")"; then create a map signature -> set of similar nodes. If we know that the structures are diverse enough, we could actually hash the signature instead of keeping the full string. (Sketch at the end of these notes.)

### Problem 2

Represent a number in base (-2), like

a*(-2)^4 + b*(-2)^3 + c*(-2)^2 + d*(-2)^1 + e*(-2)^0

Place values (from the lowest bit): 1, -2, 4, -8, 16, -32, ...

```
 0 = 0000
 1 = 0001
-1 = 0011
 2 = 0110
-2 = 0010
 3 = 0111
-3 = 1101
...
```

(Sketch at the end of these notes.)

## System Design 1

800 billion query log entries on 8 machines; find the top million queries by frequency.

- How many unique queries? Take samples to estimate; assume uniqueness shrinks the data by a factor of ten.
- Distribute queries evenly across the machines (hash by query), so each query's full count ends up on exactly one machine.
- Go over the logs on each machine and create 10B key-value pairs (assuming the 10x reduction).
- Average query length = 50 bytes, so that is ~4TB of key-value data across all machines.
- Data per machine = 500GB (produced from ~5TB of raw logs per machine).
- Use a hash map; split the data into chunks of ~0.5-1TB, producing about 50GB of counts per chunk; repeat 5-10 times to cover all the data. (Counting sketch at the end of these notes.)
- Reading from disk would be ~20h at 200MB/s, but with 6 disks in parallel it is ~3.5h to read everything.
- Another way to do this is to sort each chunk (by query) and merge them, producing all 500GB of data.
- Idea: merge the data between machines in pairs (~1.5h to send over the wire).
- Sort the 500GB of data on each machine somehow and produce the top 1M queries per machine (~50MB of data). That is about 10x50GB chunks; create another hash count -> set of queries.
- Do chunks have overlapping queries? Yes, they will.
- Send all of that data to one machine (~400MB in total).
- Apply some sort of heap sort and merge the presorted arrays. (Merge sketch at the end of these notes.)

Correctness of data.

How big are the machines?

- 32 cores at 3GHz
- 120GB of RAM
- 6 x 2TB spinning disks
- 10Gbps network connection
- you can store ~5TB on each machine

How to do the key-value pairs? Use a hash map.

Run time:

- 2M queries
- 100MB of data per machine (1GB of data transferred in total)
- maybe another 1GB for the second phase
- ~4-5h to process this data
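
---

For Coding Interview 1, Problem 1, a minimal sketch of printing all combinations. It leans on `itertools.product` rather than a hand-rolled recursion, which is my choice, not necessarily what the interviewer expected; the output order matches the listing above.

```python
from itertools import product

groups = [['a1', 'a2', 'a1b'], ['b1', 'b2'], ['c1', 'c2']]

def all_combinations(groups):
    # product yields one token from each group, last group varying fastest,
    # which reproduces the order of the listing in the notes.
    for combo in product(*groups):
        yield ''.join(combo)

for s in all_combinations(groups):
    print(s)
```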
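For Coding Interview 1, Problem 2, a small dynamic-programming sketch. Because 'a1' and 'a1b' overlap as prefixes, a greedy scan can commit to the wrong token, so the sketch tracks every string position reachable after consuming one token from each group in order. Function and variable names are my own.

```python
def is_combination(s, groups):
    # reachable = set of positions in s we can be at after consuming
    # exactly one token from each of the groups processed so far.
    reachable = {0}
    for group in groups:
        nxt = set()
        for pos in reachable:
            for token in group:
                if s.startswith(token, pos):
                    nxt.add(pos + len(token))
        reachable = nxt
    return len(s) in reachable

groups = [['a1', 'a2', 'a1b'], ['b1', 'b2'], ['c1', 'c2']]
print(is_combination('a2b2c1', groups))   # True
print(is_combination('a1bb1c1', groups))  # True ('a1b' + 'b1' + 'c1')
print(is_combination('a1b1c3', groups))   # False
```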
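For Coding Interview 2, Problem 1, a sketch of the signature idea: each node's signature is "(" plus the sorted signatures of its children plus ")", so nodes with the same subtree shape end up in the same bucket and all leaves share the signature "()". The `Node` class, the exact signature format, and the sample tree construction are my assumptions based on the notes.

```python
from collections import defaultdict

class Node:
    def __init__(self, children=None):
        self.children = children or []

def group_similar(root):
    # Map signature -> list of nodes with that subtree shape.
    buckets = defaultdict(list)

    def sig(node):
        # Sort child signatures so that child order does not matter.
        s = '(' + ','.join(sorted(sig(c) for c in node.children)) + ')'
        buckets[s].append(node)
        return s

    sig(root)
    return buckets

# The tree from the notes: a root with three children; the first and third
# children each have two kids, one of which has a single child of its own.
left = Node([Node([Node()]), Node()])
right = Node([Node([Node()]), Node()])
root = Node([left, Node(), right])

for signature, nodes in group_similar(root).items():
    print(signature, len(nodes))
```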
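For Coding Interview 2, Problem 2, a sketch of converting an integer to base (-2) by repeated division by -2, forcing each remainder into {0, 1}. This is the standard negabinary conversion, not necessarily the derivation discussed in the interview; it reproduces the table above.

```python
def to_negabinary(n):
    # Repeatedly divide by -2, always keeping a remainder of 0 or 1.
    if n == 0:
        return '0'
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:        # push the remainder back into {0, 1}
            r += 2
            n += 1
        digits.append(str(r))
    return ''.join(reversed(digits))

for x in (0, 1, -1, 2, -2, 3, -3):
    print(x, to_negabinary(x))
```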
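For System Design 1, a minimal per-machine counting sketch. It assumes queries are already hash-partitioned, so this machine's counts are the global totals for the queries it owns, and it keeps the whole hash map in memory, skipping the 50GB-per-chunk spill-to-disk step described above. The function name and the chunk interface are hypothetical.

```python
from collections import Counter
from heapq import nlargest

def local_top_queries(log_chunks, k=1_000_000):
    """Count this machine's queries chunk by chunk, then keep the local top-k."""
    counts = Counter()
    for chunk in log_chunks:        # e.g. slices of the logs read off the 6 disks
        for query in chunk:         # each query is a short string (~50 bytes)
            counts[query] += 1
    # Hash partitioning means these counts are final for the queries seen here.
    return nlargest(k, counts.items(), key=lambda kv: kv[1])
```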
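Also for System Design 1, a sketch of the final "merge presorted arrays" step on the single collecting machine. It assumes each machine sends its local top list sorted by count in descending order; since each query lives on exactly one machine, a k-way merge of those lists yields the global ranking directly. The tiny example data is made up.

```python
from heapq import merge
from itertools import islice

def global_top(per_machine_tops, k=1_000_000):
    # k-way merge of lists that are already sorted by count, descending.
    ranked = merge(*per_machine_tops, key=lambda kv: kv[1], reverse=True)
    return list(islice(ranked, k))

# Hypothetical (query, count) lists from two machines, sorted by count descending.
m1 = [('q_a', 900), ('q_b', 400), ('q_c', 10)]
m2 = [('q_d', 700), ('q_e', 300)]
print(global_top([m1, m2], k=3))   # [('q_a', 900), ('q_d', 700), ('q_b', 400)]
```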