A 45-Minute Google Phone Interview: A First-Hand Account

benniaozhuiri 2013-01-29


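This was my first formal phone interview since I started job hunting. I had planned to practice on a few small companies first, but my boss unexpectedly offered a referral. The chance was too good to pass up, so I went for it. After I submitted my résumé, a recruiter got in touch, and after some back-and-forth over email we settled on this afternoon. The interview lasted 45 minutes and covered two algorithm questions and one design question. The questions were not especially hard, but I felt only so-so about it and did not perform at my best.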

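After some small talk with the interviewer, I briefly introduced my research as a warm-up, and then the interview proper began. The first question was on string matching: given n strings of equal length, output their longest common substring. Note: there are n strings, and all of them may be assumed to have the same length (written x below).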

example: abcabb, xyabcd, dcabcc

output: abc

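a. First attempt: suffix tree. This was my first reaction, because Knuth once conjectured that the longest common substring of two strings could not be found in linear time, yet with a suffix tree it can be. So I proposed the following: build the suffix trees of strings 1 and 2, then merge them to find the longest shared substring.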

I then realized this did not work, because the longest substring shared by strings 1 and 2 is not necessarily the longest one shared by all the strings. I changed my approach immediately.

Note: after thinking it over carefully later, I believe a suffix tree is still viable. Just build one big suffix tree over all the strings. For each non-leaf node, collect the string ids in its subtree, i.e. the indices of all the strings that show up below it. If those indices make up the whole set {1...n}, then the label of that internal node is shared by all the strings. We then just need the node with the largest label length. The total time would be n * nx = xn^2.

b. I proposed k-grams combined with inverted indexes. Unfortunately the interviewer had not heard of this approach :< so I had to explain k-grams with an example.

Idea of k-grams: build all the 2-grams, 3-grams, ..., x-grams of the n input strings. There are nx^2 grams in total. For each gram, check whether the set of strings containing it covers all of 1..n; if so, that gram is shared by all the strings. The longest such gram is the answer.

REALIZATION: set up a hash table over all the grams, so that each tuple (gramwords, stringNo, idx) is mapped to a bucket. There are nx^2 mapped entries in total. For each gram, we can use a length-n bit vector to record which strings it appears in. Note that if most grams appear in far fewer than n strings, a list is a better choice than a bit vector.
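The hash-table realization above can be sketched roughly as follows. This is a minimal sketch, not the code from the interview: the name `longest_common_gram` is made up for illustration, a `std::set` of string indices stands in for the length-n bit vector, and grams of length 1 are indexed as well. Each gram is hashed as a full `std::string`, so hashing an l-gram costs l time here.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (name chosen for illustration): index every gram of
// every string, then return the longest gram that occurs in all of them.
std::string longest_common_gram(const std::vector<std::string>& strs) {
    if (strs.empty()) return "";
    // gram -> indices of the strings containing it (stand-in for the bit vector)
    std::unordered_map<std::string, std::set<std::size_t>> index;
    for (std::size_t s = 0; s < strs.size(); ++s) {
        const std::string& str = strs[s];
        for (std::size_t i = 0; i < str.size(); ++i)
            for (std::size_t len = 1; i + len <= str.size(); ++len)
                index[str.substr(i, len)].insert(s);
    }
    std::string best;
    for (const auto& entry : index) {
        const std::string& gram = entry.first;
        if (entry.second.size() != strs.size()) continue;  // not shared by all strings
        // Prefer longer grams; break ties alphabetically so the result is deterministic.
        if (gram.size() > best.size() || (gram.size() == best.size() && gram < best))
            best = gram;
    }
    return best;
}

// Usage on the example from the question.
void example_kgram() {
    std::vector<std::string> input = {"abcabb", "xyabcd", "dcabcc"};
    assert(longest_common_gram(input) == "abc");
}
```

The map plays the role of the inverted index: each gram points back to the strings that contain it.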

The interviewer's comment: the time may not be exactly nx^2, because we need to compute the hash value, which takes l time for an l-gram. Hence the final time may still be nx^3, which matches his brute-force solution. (The enemy is very, very cunning!)

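MY ANSWER: the hash values are not computed separately. We use the Karp-Rabin algorithm to compute a signature of each gram. He did not know the Karp-Rabin algorithm :< so I suspect he did not follow what I meant. I also added that my algorithm is extensible: any new string can be added to the existing data structure without redoing everything. He hummed and hawed and more or less accepted my algorithm. Then he described a simple algorithm and said he had expected me to start with that one...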
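A minimal sketch of the Karp-Rabin idea mentioned above: the hash of each window is derived from the previous one in constant time instead of being recomputed from scratch. The function name `rolling_hashes`, the base 257, and the use of unsigned 64-bit wrap-around as the modulus are all assumptions made for illustration. Equal hashes do not prove equal grams, so a real implementation would still compare the characters on a hit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: polynomial rolling hashes of every length-k window of s.
// Arithmetic is done in std::uint64_t, so it wraps modulo 2^64.
std::vector<std::uint64_t> rolling_hashes(const std::string& s, std::size_t k) {
    const std::uint64_t base = 257;  // assumed base, larger than any byte value
    std::vector<std::uint64_t> out;
    if (k == 0 || k > s.size()) return out;

    std::uint64_t high = 1;  // base^(k-1), the weight of the leftmost character
    for (std::size_t i = 1; i < k; ++i) high *= base;

    std::uint64_t h = 0;  // hash of the first window, computed directly
    for (std::size_t i = 0; i < k; ++i)
        h = h * base + static_cast<unsigned char>(s[i]);
    out.push_back(h);

    for (std::size_t i = k; i < s.size(); ++i) {
        h -= static_cast<unsigned char>(s[i - k]) * high;  // drop the leftmost character
        h = h * base + static_cast<unsigned char>(s[i]);   // shift and append the new one
        out.push_back(h);
    }
    return out;
}

// Usage: the first and last windows of "abcabc" are both "abc".
void example_rolling() {
    std::vector<std::uint64_t> h = rolling_hashes("abcabc", 3);
    assert(h.size() == 4);
    assert(h[0] == h[3]);
}
```

With this, all windows of one length in a string of length x cost x time in total, which is the point of the reply to the interviewer.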

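All in all, I don't think I answered this question well. I should have given the simplest brute-force solution right away, which is what he then wanted from me. Without advanced data structures such as suffix trees, suffix arrays, or k-grams, it is hard to do any better. But the interviewer did not seem very familiar with string matching, and I suspect the brute-force algorithm would have satisfied him. Although I had read plenty of interview write-ups long before and had settled on an easy-first, hard-later strategy, I got excited and forgot all about it.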

Note: a friend observed that there is no need to build the k-grams of every string; the k-grams of the first string alone are enough.

c. His algorithm: brute force. Take each substring s[i:j] of one string and run n-1 string matches against the others. The time is x^2 * nx = nx^3.
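The brute-force approach can be sketched as below. This is a minimal sketch under stated assumptions: the name `longest_common_brute` is made up, candidates are taken from the first string only, and `std::string::find` stands in for the string matcher. The nx^3 figure assumes a linear-time matcher, which `find` is not guaranteed to be.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: try every substring of the first string, longest first,
// and return the first one that occurs in all the other strings.
std::string longest_common_brute(const std::vector<std::string>& strs) {
    if (strs.empty()) return "";
    const std::string& first = strs[0];
    for (std::size_t len = first.size(); len > 0; --len) {
        for (std::size_t i = 0; i + len <= first.size(); ++i) {
            const std::string cand = first.substr(i, len);
            bool in_all = true;
            // n-1 string matches, stopping at the first string that lacks cand
            for (std::size_t s = 1; s < strs.size() && in_all; ++s)
                in_all = strs[s].find(cand) != std::string::npos;
            if (in_all) return cand;
        }
    }
    return "";  // no common substring at all
}

// Usage on the example from the question.
void example_brute() {
    std::vector<std::string> input = {"abcabb", "xyabcd", "dcabcc"};
    assert(longest_common_brute(input) == "abc");
}
```

Trying the longest candidates first lets the search stop at the first hit instead of tracking a running maximum.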

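The second question was a design question. I had basically no feel for it. I don't know whether I was right or wrong, the interviewer gave no feedback, and for all my wild guessing I got nowhere. Partway through, the interviewer even got kicked out of his meeting room, and the line dropped for two minutes...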

The question: Gmail is very slow while loading. What do you do? (Apparently plenty of people use a five-year-old computer like I do :)

a. Gather a range of machines and run comparison experiments to see whether the hardware is the cause.

INTERVIEWER: it is not a hardware problem, nor a network problem. He suggested JavaScript.

b. Give the user the option of a stripped-down version, for example with JavaScript disabled, or load the page in two steps: incoming messages first, then everything else. Rewrite the JavaScript, and so on.

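The thing is, I had no idea how JavaScript works :< I did not know it could make a page load slowly. Actually, I had been tormented by scripts before. For a long time, when I visited 西西河, some pages often took ages to open and then popped up a dialog saying that some script of the Google Toolbar had a problem. I never looked into it at the time, and so... sigh!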

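This was a problem-solving question. The obvious order is hardware first, then software. How did I manage to answer hardware and then forget to ask whether it was a software problem? A real blunder, and my worst answer of the three.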

Algorithm question

Background:

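Below is an error log; pay attention to the times. The task: at any moment I may ask you for statistics such as the number of errors received in the last minute or the last hour.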

Example:

404_errors_last_minute: 10

404_errors_last_hour: 100

404_errors_total: 1234

00:01: Event

00:05: Event

00:06: get_total -> 2

01:00: get_last_minute -> 2

01:02: get_last_minute -> 1

01:06: get_last_minute -> 0

The following functions are given:

time_t current_time();

void event();

We want to obtain the following statistics. Fill in the corresponding functions.

// Get statistics.

int get_last_minute();

int get_last_hour();

int get_total();

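Answer: use queues to hold the events of the last minute and of the last hour separately. When a new error entry arrives, delete the expired entries from the minute queue (and do the same for the hour queue). The idea should be right, but I did not get the best result on the first try when writing the code. First, expired entries must be deleted in a loop until none remain, not just once; I spotted that bug myself and fixed it. Second, the deletion should check the element at the front of the queue, whereas my code actually checked the back. Embarrassing! The interviewer did not notice at the time, but if he goes back over the saved code he will certainly find it. What a tragedy.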
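A minimal sketch of the two-queue answer, with both bugs described above avoided. Several things here are assumptions rather than part of the original problem: the class name `ErrorStats`, passing the time in explicitly instead of calling `current_time()` (so the sketch can be exercised without a real clock), treating an event exactly 60 seconds old as expired, and also evicting on queries so that counts stay correct when no new events arrive.

```cpp
#include <cassert>
#include <ctime>
#include <deque>

// Hypothetical class: timestamps of recent events are kept in two queues,
// one per window. Times are seconds, passed in explicitly for testability.
class ErrorStats {
public:
    void event(std::time_t now) {
        minute_.push_back(now);
        hour_.push_back(now);
        ++total_;
        evict(now);
    }
    int get_last_minute(std::time_t now) {
        evict(now);
        return static_cast<int>(minute_.size());
    }
    int get_last_hour(std::time_t now) {
        evict(now);
        return static_cast<int>(hour_.size());
    }
    int get_total() const { return total_; }

private:
    // Check the FRONT (oldest entry) and keep popping until it is inside the window.
    static void drop_old(std::deque<std::time_t>& q, std::time_t now, std::time_t window) {
        while (!q.empty() && now - q.front() >= window) q.pop_front();
    }
    void evict(std::time_t now) {
        drop_old(minute_, now, 60);
        drop_old(hour_, now, 3600);
    }
    std::deque<std::time_t> minute_, hour_;
    int total_ = 0;
};

// Usage: replays the example log, with mm:ss converted to seconds.
void example_error_stats() {
    ErrorStats stats;
    stats.event(1);                          // 00:01
    stats.event(5);                          // 00:05
    assert(stats.get_total() == 2);          // 00:06
    assert(stats.get_last_minute(60) == 2);  // 01:00
    assert(stats.get_last_minute(62) == 1);  // 01:02
    assert(stats.get_last_minute(66) == 0);  // 01:06
}
```

Because timestamps arrive in non-decreasing order, the oldest entry is always at the front, which is why only the front needs checking.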

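Follow-up question: if there are far too many entries, hundreds of millions of them, storing every timestamp would exhaust memory. What then? I proposed splitting the data into 1000 intervals of 0.001 seconds each; for a query at time s, we only need to check the exact values in the 1000th interval counting back from s. That way memory drops to 1/1000, though of course we pay for it with extra disk reads and writes. The interviewer did not probe further; once I mentioned interval partitioning he moved on without much comment. We were probably about to run out of time, since I had not answered the first two questions well and had spent too long on them.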
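One common in-memory variant of the interval idea is sketched below. It is not the scheme described above: it assumes one-second buckets held in a fixed ring of 60 slots, leaves the disk part out entirely, and covers only the last-minute query. The class name `BucketedMinuteCounter` is made up for illustration. Memory stays constant no matter how many events arrive.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <ctime>

// Hypothetical class: one counter per second, stored in a ring of 60 slots.
// Assumes non-negative timestamps in whole seconds.
class BucketedMinuteCounter {
private:
    struct Bucket {
        std::time_t second = -1;  // which second this slot currently represents
        long count = 0;           // number of events seen in that second
    };
    std::array<Bucket, 60> buckets_{};

public:
    void event(std::time_t now) {
        assert(now >= 0);
        Bucket& b = buckets_[static_cast<std::size_t>(now % 60)];
        if (b.second != now) {  // slot still holds an older second: recycle it
            b.second = now;
            b.count = 0;
        }
        ++b.count;
    }
    long get_last_minute(std::time_t now) const {
        long sum = 0;
        for (const Bucket& b : buckets_) {
            const bool fresh = b.second >= 0 && now >= b.second && now - b.second < 60;
            if (fresh) sum += b.count;  // stale slots are skipped, not cleared
        }
        return sum;
    }
};

// Usage: same example log as before, mm:ss converted to seconds.
void example_buckets() {
    BucketedMinuteCounter c;
    c.event(1);
    c.event(5);
    assert(c.get_last_minute(60) == 2);
    assert(c.get_last_minute(62) == 1);
    assert(c.get_last_minute(66) == 0);
}
```

The same layout with 3600 slots, or 60 one-minute slots at coarser precision, would cover the last-hour query.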

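All in all, my performance was average and exposed quite a few problems. String matching is something I know fairly well, but unfortunately the interviewer did not. My personal impression is that Google's algorithm questions are somewhat harder than those at other companies.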