Saturday, December 6, 2014

PI Design

Scalable Web Architecture and Distributed Systems

http://www.aosabook.org/en/distsys.html#fig.distsys.20
http://katemats.com/distributed-systems-basics-handling-failure-fault-tolerance-and-monitoring/

Big Table
http://en.wikipedia.org/wiki/BigTable

Consistent Hashing
http://n00tc0d3r.blogspot.com/2013/09/big-data-consistent-hashing.html

Distributed Database
http://en.wikipedia.org/wiki/Distributed_database

Summary
http://chuansongme.com/n/290160

FB Design Question
http://blog.csdn.net/sigh1988/article/details/9790337

FB store photo
https://www.facebook.com/note.php?note_id=76191543919

Newsfeed
http://www.quora.com/What-are-best-practices-for-building-something-like-a-News-Feed

Reddit ranking
http://amix.dk/blog/post/19588

Visibility position
https://about.pinterest.com/en/careers/software-engineer-visibility_27138

Making Pinterest
http://engineering.pinterest.com/

Google: Google File System, MapReduce, BigTable
Facebook: Cassandra
Amazon: Dynamo

LinkedIn Prep

social graph engine
Read through the interview site
Do LeetCode problems
Read and summarize LinkedIn design posts

https://engineering.linkedin.com/recommender-systems/browsemap-collaborative-filtering-linkedin
Browsemap Collaborative filtering
Browsemap platform: offline / online system
    offline: Hadoop batch computation, with results loaded into a distributed k-v store for queries
    online: queries go to the k-v store
    activity data is loaded into HDFS
"people also viewed" features for people / jobs / companies
Also supports A/B testing; user experience / presentation is the most important part

LinkedIn Search:
Shard, index(term-posting list), early termination, query rewriting
spell check - n-grams, edit distance, metaphone, co-occurrence
query tagging
vertical intent
query expansion - related terms
Training with A/B tests, model selection -> regression decision tree

Profile frontend architecture
https://engineering.linkedin.com/profile/engineering-new-linkedin-profile
JSON mapper
We created classes called Mappers in our web app that combine, adapt, or modify data from our backend services into JSON
can be called in batch or individually; mapper endpoints can also be hit directly via AJAX when rendering dynamically
Fizzy-UI aggregator
Serverside: apache traffic server redirecting web responses
Clientside: Define the frame with html base page with segmentations; fire parallel request to endpoints specified; inject returned data and its own markup into the base page; flush it to browser; use Fizzy client side code to render page progressively (only render viewable part firstly above fold)
dust.js
edit form: re-render and replace!

Unifying the LinkedIn Search Experience
https://engineering.linkedin.com/search/unifying-linkedin-search-experience
Query Auto-complete and Content Type Suggestion
Most frequent queries + likelihood of a successful search (since the most frequent queries are generic and query volume has a long, heavy tail)
Classify query into right vertical (job/people/...)
Unified Search Result Page
eliminate the drop-down to select content type
Intent prediction: order verticals by probability distribution and limit the number of verticals the query is sent to, eliminating unnecessary fan-out
use data on pre-unified search behavior to build intent models
Page Optimization
Rank primary results by relevance, with secondary results after; the ranking can be feature-based and optimized

Personalized Navigation
https://engineering.linkedin.com/mobile/linkedin-mobile-introducing-personalized-navigation
View/action events -> daily Kafka -> ETL -> HDFS -> weekly Hadoop workflow -> Voldemort k-v storage -> member-id-based lookup -> frontend
For each branch (category) / leaf (specific page), count 1) the number of days visited in the past week and 2) the number of weeks visited
When counting, apply a scaling factor: continuous visits across weeks score higher than spiky visits within one week
Start from the full history of all members, then keep personalized thresholds and update the entries that pass them
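The counting scheme above might be sketched as follows. The exact scoring formula, the 0.5 streak factor, and all names are assumptions for illustration, not LinkedIn's actual logic:

```java
import java.util.List;

// Hypothetical per-category score: breadth of weeks visited plus recency,
// boosted when the active weeks form an unbroken streak, so steady
// cross-week usage beats a one-week spike of many visits.
class NavScore {
    public static double score(List<Integer> visitDaysPerWeek) {
        if (visitDaysPerWeek.isEmpty()) return 0.0;
        int activeWeeks = 0;
        int trailingStreak = 0;       // consecutive active weeks, newest backwards
        boolean streakAlive = true;
        for (int i = visitDaysPerWeek.size() - 1; i >= 0; i--) {
            if (visitDaysPerWeek.get(i) > 0) {
                activeWeeks++;
                if (streakAlive) trailingStreak++;
            } else {
                streakAlive = false;
            }
        }
        int daysLastWeek = visitDaysPerWeek.get(visitDaysPerWeek.size() - 1);
        double streakBoost = 1.0 + 0.5 * Math.max(0, trailingStreak - 1); // assumed factor
        return (daysLastWeek + activeWeeks) * streakBoost;
    }
}
```

With this shape, four weeks of one visit each outranks four visits packed into a single week, which matches the "continuous beats spiky" rule.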

Decoupling translation from source code
https://engineering.linkedin.com/language-packs/decoupling-translation-source-code
property file that maps from english string to translations
i18n to retrieve data from property file
old way: properties bundled into the project (war), so every change required a redeploy
new way: use a separate jar; the language pack and server code are managed and deployed separately
Backwards compatibility: handles the case where a language pack ships before the code that uses it

Databus
https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system
source database -> in-memory log stores (events, for fast-moving clients) -> snapshot stores (for slow-moving clients or data copies)

Publishing Platform
https://engineering.linkedin.com/publishing-platform/maximizing-our-publishing-platform-reach-network-distribution
Feed-Mixer supports the most important network distribution channel for member published articles - the LinkedIn Feed on Mobile and Desktop. These feeds help LinkedIn members keep up with their connections’ activities by consuming updates on the LinkedIn.com homepage feed or on LinkedIn mobile applications. When an author publishes an article on LinkedIn, a “member published” activity is distributed to all of the author's 1st degree connections and followers. Viral activity on that update - comments, likes, and reshares - gets the poster additional distribution beyond their 1st degree network, and gets them distribution to additional people who can follow them. For many of our authors, over time, the followership base that they build on LinkedIn will dwarf their 1st degree network.

Creating this publishing notification is very straightforward. On the server side, first we define a new notification type for the publishing event. The creation of these notifications are done asynchronously after a member publishes a post, allowing us to implement custom business logic on how widely and when to distribute it. Publishing service invokes notification system’s rest.li API to send out the notifications. On the client side, we created templates for both desktop and mobile clients to render the new notification type.

content filter for spam and low-quality content

How to find connection degree easily?
Graph concepts(edge?), algorithms
RESTful API?


Galene
https://engineering.linkedin.com/search/did-you-mean-galene
Provides deeply personalized search results based on each member’s identity and relationships
Lucene
The search index has two primary components:
- The inverted index – a mapping from search terms to the list of entities that contain them; and
- The forward index – a mapping from entities to metadata about them.
Federator (rewrites and restructures the query, adds metadata) -> broker (for each vertical: adds metadata, fans out to multiple searchers [shards], merges & ranks results from the searchers)

A rewriter is made up of multiple rewriter modules each of which performs a specific function. Specific functions can be synonym expansion, spelling correction, or graph proximity personalization.

Indexing on Hadoop:

We first run map reduce jobs with relevance algorithms embedded that enrich the raw data – resulting in the derived data. Some examples of relevance algorithms that may be applied here are spell correction, standardization of concepts (for example, unifying “software engineer” and “computer programmer”), and graph analysis.

Live Updates:

In Galene, live updates are performed at the granularity of single fields.  We have built a new kind of index segment – the term partitioned segment.  The inverted index and forward index of each entity may be split up across these segments.  The same posting list can be present in multiple segments and a traversal of a single posting list becomes the traversal of a disjunction of the posting lists in each of the segments.  For this to work properly, the entities in each segment have to be ordered in the same manner - given that we order entities by static rank in all segments, we satisfy the ordering constraint.  The forward index becomes the union of the forward indices in each of the segments.
In Galene, we maintain three such segments:
- The base index – this is the one built offline on Hadoop.  This is rebuilt periodically (say every week).  Once built, it is never modified, only discarded after the next base index is built.
- The live update buffer – which is maintained in memory.  All live updates are applied to this segment.  This segment is designed to accept incremental updates and augment itself to retain the entities in the correct static rank order.
- The snapshot index – given that the live update buffer is only in memory, we periodically (every few hours) flush it to the snapshot index on disk to make it persistent.  If the snapshot index already exists, a new one is built that combines the contents of the previous snapshot index and the live update buffer.  After each flush, the live update buffer is reset.

Indexers generate and ship snapshots; searchers (read-only) parse queries and return results
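The disjunction traversal described above can be sketched as a k-way merge of the same term's posting lists from each segment. This is an illustrative assumption about the mechanics, not Galene's actual code; plain int doc IDs stand in for the shared static-rank order, and duplicates are dropped because an entity may appear in several segments:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Merge one term's posting lists from several segments (base index,
// snapshot index, live buffer), each already sorted in static-rank order.
class PostingDisjunction {
    public static List<Integer> merge(List<int[]> segmentPostings) {
        // Min-heap entries: {docId, segmentIndex, offsetWithinSegment}.
        PriorityQueue<int[]> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int s = 0; s < segmentPostings.size(); s++)
            if (segmentPostings.get(s).length > 0)
                heap.add(new int[]{segmentPostings.get(s)[0], s, 0});
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            // Skip duplicates: the same entity may live in multiple segments.
            if (out.isEmpty() || out.get(out.size() - 1) != top[0]) out.add(top[0]);
            int[] postings = segmentPostings.get(top[1]);
            if (top[2] + 1 < postings.length)
                heap.add(new int[]{postings[top[2] + 1], top[1], top[2] + 1});
        }
        return out;
    }
}
```

The shared ordering constraint from the text is what makes this merge valid: if segments ranked entities differently, the heads of the lists could not be compared directly.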


LinkedIn Connected: Engineering Pre-Meeting Intelligence

we delved into the details of how we achieved decoupling of the generation and ranking of relationship opportunities, scheduling based on the member's calendar and delivery of timely notifications by Opportunist through asynchronous Kafka messaging.


Mobile A/B Testing
Rendering based on view types on the client side allows us to dynamically modify client UI for any subset of the clients with the entire control on the server. The XLNT framework allows us to perform targeted experiments that can also leverage contextual information provided by the client.

LinkedIn University Pages
should first talk about our work standardizing LinkedIn’s school data.

Using set cover algorithm to optimize query latency for a large scale distributed graph

NCS, the caching layer, calculates and stores a member's second-degree set. With this cache, graph distance queries originating from a member can be converted to set intersections, avoiding further remote calls. For example, if we have member X's second degree calculated, to decide whether member Y is three degree apart from member X, we can simply fetch Y's connections and intersect X's second degree cache with Y's first degree connections.
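A minimal sketch of that intersection check, assuming the cached second-degree set and the fetched first-degree connections are plain hash sets of member IDs (names are made up):

```java
import java.util.Collections;
import java.util.Set;

// X and Y are within three degrees when X's cached second-degree set
// intersects Y's first-degree connections (assuming Y is not already
// within two degrees of X, which would be checked first).
class GraphDistance {
    public static boolean withinThreeDegrees(Set<Long> secondDegreeOfX,
                                             Set<Long> firstDegreeOfY) {
        // Collections.disjoint short-circuits on the first common element.
        return !Collections.disjoint(secondDegreeOfX, firstDegreeOfY);
    }
}
```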

We decided to apply a greedy set cover algorithm to address this query optimization problem. Greedy set cover algorithms are used to find the smallest subset that covers the maximum number of uncovered points in a large set. In this case, we would apply the algorithm to the set of partitions that stored a member's first-degree connections. Partitions stored on each GraphDB node would be the elements in a family of sets. We wanted to find the smallest number of elements from this set family that covered the input set.
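The greedy step above could look like this. The integer partition IDs and Map-based layout are illustrative assumptions, not LinkedIn's GraphDB interface:

```java
import java.util.*;

// Greedy set cover: repeatedly pick the partition that covers the most
// still-uncovered connection IDs, until everything is covered.
class GreedySetCover {
    // universe: the member's first-degree connections to fetch;
    // partitions: which connection IDs each GraphDB partition holds.
    public static List<Integer> cover(Set<Long> universe,
                                      Map<Integer, Set<Long>> partitions) {
        Set<Long> uncovered = new HashSet<>(universe);
        List<Integer> chosen = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            int bestId = -1;
            int bestGain = 0;
            for (Map.Entry<Integer, Set<Long>> e : partitions.entrySet()) {
                Set<Long> gain = new HashSet<>(e.getValue());
                gain.retainAll(uncovered);        // how many new IDs this adds
                if (gain.size() > bestGain) { bestGain = gain.size(); bestId = e.getKey(); }
            }
            if (bestId == -1) break;              // some IDs live in no partition
            uncovered.removeAll(partitions.get(bestId));
            chosen.add(bestId);
        }
        return chosen;
    }
}
```

Fewer chosen partitions means fewer GraphDB nodes touched per query, which is the latency win the paper is after.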

Read the rest of the paper for implementation & optimization details




Tuesday, November 4, 2014

Lessons Learned

Make sure you can explain your own projects thoroughly!

Linkedin

merging linked lists
paint houses with lowest cost
URL-shortening service

Monday, November 3, 2014

PI Interview Questions

What is the purpose of a database index? What is the tradeoff? 

Given an n-length array of n-length words, find the longest word that can be chained down to a 3-letter word in the array by removing one letter at a time, where each step yields an anagram of a shorter word in the array.

Enumerate all NxN grids that can be composed of valid English words. 
Architect a chess program. 
An amusing question about finding "happy" numbers. 

write a function that returns the minimum edit distance between two strings and analyze its complexity.
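A standard dynamic-programming answer to this question, shown here in Java with two rolling rows. Time is O(m*n) for strings of lengths m and n, space is O(n):

```java
// Levenshtein distance: minimum number of single-character insertions,
// deletions, and substitutions to turn string a into string b.
class EditDistance {
    public static int editDistance(String a, String b) {
        int n = b.length();
        int[] prev = new int[n + 1];
        int[] cur = new int[n + 1];
        for (int j = 0; j <= n; j++) prev[j] = j;    // "" -> b[0..j) costs j inserts
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;                              // a[0..i) -> "" costs i deletes
            for (int j = 1; j <= n; j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] tmp = prev; prev = cur; cur = tmp; // roll the rows
        }
        return prev[n];
    }
}
```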

How many characters in the front of a string would need to be added in order to make it a palindrome.
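One answer sketch: the count equals the string's length minus its longest palindromic prefix, since everything after that prefix must be mirrored in front. A simple O(n^2) version (a KMP-based trick brings this to O(n)):

```java
// Minimum number of characters to prepend to s to make it a palindrome.
class PalindromePrepend {
    public static int charsToPrepend(String s) {
        // Find the longest prefix s[0..end) that is itself a palindrome.
        for (int end = s.length(); end > 0; end--) {
            if (isPalindrome(s, 0, end - 1)) return s.length() - end;
        }
        return s.length();
    }

    private static boolean isPalindrome(String s, int lo, int hi) {
        while (lo < hi) if (s.charAt(lo++) != s.charAt(hi--)) return false;
        return true;
    }
}
```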

1. Discussion on past projects. One algorithm question related to a data structures.
2. Object oriented code design question
3. Algorithm questions on arrays and trees.
4. System design question.

explain hash tables (insertion/deletion/search time, how rehashing affects these times)

Split a sentence into words (even though Pinterest is written in Python, they don't want you to just use Python's split function, but use a lower level language like C to take less memory)

countandsay
duplicate removal algorithm
k most frequently used words in a string
first missing number
implement a hashmap and all its methods

read through string and output number of times each digit occurs, reverse words in sentence, etc.

sentence partitioning
reading matrix diagonally

how would you design Pinterest’s architecture
given an array of words, find the longest common substring between any two words in the array and its length (prefix tree)
given an array of numbers, determine whether you can reach an index containing 0 from a given starting index by jumping through the array using the values as jump distances. For example, with [1,2,1,0,3] you can reach the 0 from index 0 (jump 1 to index 1, then 2 to index 3), and from index 4 (jump 3 down to index 1, then 2 up to index 3).
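This reachability question can be answered with a BFS over indices; a sketch, assuming non-negative values:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// From `start`, jump left or right by the value at the current index;
// return whether any index holding 0 is reachable.
class JumpToZero {
    public static boolean canReachZero(int[] arr, int start) {
        boolean[] seen = new boolean[arr.length];
        Queue<Integer> queue = new ArrayDeque<>();
        queue.add(start);
        seen[start] = true;
        while (!queue.isEmpty()) {
            int i = queue.poll();
            if (arr[i] == 0) return true;
            for (int next : new int[]{i + arr[i], i - arr[i]}) {
                if (next >= 0 && next < arr.length && !seen[next]) {
                    seen[next] = true;      // mark on enqueue to avoid revisits
                    queue.add(next);
                }
            }
        }
        return false;
    }
}
```

Each index is enqueued at most once, so this runs in O(n) time and space.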

Write a job scheduler. Input : list of jobs with dependencies. Output: list of jobs to be executed according to dependencies.
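A Kahn's-algorithm (topological sort) sketch for the scheduler question; the map-of-dependency-lists input shape is an assumption about the unstated interface:

```java
import java.util.*;

// deps.get(j) lists the jobs that must run before job j.
// Returns an execution order, or null if the dependencies contain a cycle.
class JobScheduler {
    public static List<String> schedule(Map<String, List<String>> deps) {
        Map<String, Integer> indegree = new HashMap<>();
        Map<String, List<String>> dependents = new HashMap<>();
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            indegree.merge(e.getKey(), e.getValue().size(), Integer::sum);
            for (String pre : e.getValue()) {
                indegree.putIfAbsent(pre, 0);
                dependents.computeIfAbsent(pre, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();     // jobs with no pending deps
        for (Map.Entry<String, Integer> e : indegree.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String job = ready.poll();
            order.add(job);
            for (String d : dependents.getOrDefault(job, List.of()))
                if (indegree.merge(d, -1, Integer::sum) == 0) ready.add(d);
        }
        return order.size() == indegree.size() ? order : null;  // null => cycle
    }
}
```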
Count calendar days. Input: a list of ranges [start, end], where a range represents days, i.e. [1,4] means day 1 through day 4. Output: total days included in the list of ranges. Answer: 1) sort the list, 2) merge overlapping ranges, and 3) count/sum the days of the resulting non-overlapping ranges.
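The three-step answer can be sketched as follows, treating ranges as inclusive:

```java
import java.util.Arrays;

// Sort by start day, merge overlapping inclusive ranges, sum the lengths.
class CalendarDays {
    public static long countDays(int[][] ranges) {
        if (ranges.length == 0) return 0;
        int[][] sorted = ranges.clone();
        Arrays.sort(sorted, (a, b) -> Integer.compare(a[0], b[0]));
        long total = 0;
        int curStart = sorted[0][0], curEnd = sorted[0][1];
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i][0] <= curEnd) {             // overlap: extend current range
                curEnd = Math.max(curEnd, sorted[i][1]);
            } else {                                  // disjoint: close out current
                total += curEnd - curStart + 1;
                curStart = sorted[i][0];
                curEnd = sorted[i][1];
            }
        }
        return total + (curEnd - curStart + 1);       // last open range
    }
}
```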

http://www.mitbbs.com/article_t/JobHunting/32570751.html
If every Pin got a price tag, how would you evaluate whether this works?
(1) A/B testing -> several possible designs; discuss their pros and cons
(2) metrics to monitor -> click rate, impressions, return-user ratio, etc.
Onsite:
1. Suppose Pinterest's update system can only show 3 updates - how would you design it? An update can be a user comment, a new pin, a repin, etc.; there are over a thousand possible types. Discuss the pros and cons of different approaches.
Answer: a ranking problem...

2. Given data in the following format:
<start_time, end_time, value>
For example:
1, 3, 100
2, 4, 200
5, 6, 300
...
The time intervals may overlap. Between times 2 and 3, the sum of values is 100 + 200 = 300. Find the highest sum of values over this data.
Answer 1: use an array where each cell represents a timestamp; scan the data and increment the corresponding cells. -> Interviewer: what if time is continuous? Is there a better way?
Answer: transform the data into <type, time, value> events and sort by time. On a start_time add the value, otherwise subtract it:
int sum = 0;
int max = 0;

// sort by time

while(have more lines) {
    if(type is start) sum += value;
    else sum -= value;
    if(sum > max) max = sum;
}
return max;

3. Design a data structure that supports O(1) insert, remove, and find-random (classic question)
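The classic answer to this one pairs an ArrayList with a HashMap; removal swaps the victim with the last element so the list never leaves a hole:

```java
import java.util.*;

// O(1) average insert/remove/getRandom: the list gives uniform random
// access by index, the map gives O(1) lookup of each value's index.
class RandomizedSet {
    private final List<Integer> values = new ArrayList<>();
    private final Map<Integer, Integer> indexOf = new HashMap<>();
    private final Random rng = new Random();

    public boolean insert(int v) {
        if (indexOf.containsKey(v)) return false;
        indexOf.put(v, values.size());
        values.add(v);
        return true;
    }

    public boolean remove(int v) {
        Integer i = indexOf.remove(v);
        if (i == null) return false;
        int last = values.remove(values.size() - 1);   // O(1): pops the tail
        if (i < values.size()) {                       // v was not the tail:
            values.set(i, last);                       // move tail into the hole
            indexOf.put(last, i);                      // and fix its index
        }
        return true;
    }

    public int getRandom() {
        return values.get(rng.nextInt(values.size()));
    }
}
```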

4. What does a Java ArrayList do when it is full? Why?
Answer: allocate a new backing array of double the size and copy the elements over. Doubling means the O(n) copy of n elements is paid for by the at least n insertions needed before the new 2n-length array fills up again, so the amortized cost per insertion is O(1).

5. How do you do weighted random sampling? (classic question)
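One standard answer is prefix-sum (roulette-wheel) sampling: draw a uniform point in [0, totalWeight) and walk the weights until it is exhausted. A sketch:

```java
import java.util.Random;

// Sample index i with probability weights[i] / sum(weights).
class WeightedSampler {
    public static int sample(double[] weights, Random rng) {
        double total = 0;
        for (double w : weights) total += w;
        double r = rng.nextDouble() * total;       // uniform in [0, total)
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r < 0) return i;                   // landed inside weight i's slice
        }
        return weights.length - 1;                 // guard against float drift
    }
}
```

For repeated draws from fixed weights, precompute the prefix sums once and binary-search them, giving O(log n) per sample.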

6. Given a uniform random generator for 1-5, how do you generate a uniform 1-7? (classic question)
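The usual answer is rejection sampling: two rand5 draws form a uniform value in 0..24, and the values 0..20 map evenly onto 7 buckets (21 = 3 * 7); anything else is redrawn. A sketch, with rand5 passed in as a supplier:

```java
import java.util.function.IntSupplier;

class Rand7FromRand5 {
    // rand5 must return a uniform integer in 1..5.
    public static int rand7(IntSupplier rand5) {
        while (true) {
            // Two independent draws give a uniform value in 0..24.
            int v = (rand5.getAsInt() - 1) * 5 + (rand5.getAsInt() - 1);
            if (v < 21) return v % 7 + 1;   // 0..20 splits evenly into 7 buckets
            // otherwise reject and redraw (happens with probability 4/25)
        }
    }
}
```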

7. How do you find anomalies in logs? Outlier detection

Tuesday, October 28, 2014

You cannot instantiate an array of a generic type parameter, like:
    E[] array = new E[2];
It has to be done like this instead:
    Object[] array = new Object[2];
(or with an unchecked cast: E[] array = (E[]) new Object[2];)