social graph engine
Go through the interview-prep sites once
Do LeetCode problems
Read and summarize LinkedIn's engineering design posts
https://engineering.linkedin.com/recommender-systems/browsemap-collaborative-filtering-linkedin
Browsemap Collaborative filtering
Browsemap platform: offline / online system
offline uses hadoop for batch computation and then loads into distributed k-v store for queries
online query goes to k-v store
activity data loads to HDFS
"people also viewed" features for people / jobs / companies
May also support A/B testing; user experience / presentation is the most important
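A minimal sketch of the offline half of Browsemap, assuming session-level co-view data; the entity names and the `top_k` cutoff are invented for illustration (the real system does this as a Hadoop batch job and bulk-loads the result into the k-v store):

```python
from collections import Counter, defaultdict

def build_browsemap(sessions, top_k=3):
    """Offline batch step: count co-views within each browsing session,
    then keep the top-k co-viewed entities per entity. The resulting
    dict plays the role of the distributed k-v store's contents."""
    co_views = defaultdict(Counter)
    for session in sessions:
        for a in session:
            for b in session:
                if a != b:
                    co_views[a][b] += 1
    # k-v store: entity -> ranked "people also viewed" list
    return {e: [p for p, _ in c.most_common(top_k)]
            for e, c in co_views.items()}

sessions = [
    ["alice", "bob", "carol"],   # one member's profile-view session
    ["alice", "bob"],
    ["bob", "dave"],
]
browsemap = build_browsemap(sessions)
```

The online query path is then a single key lookup, which is why the heavy lifting can stay offline.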
LinkedIn Search:
Shard, index(term-posting list), early termination, query rewriting
spell check - n-grams, edit distance, metaphone, term co-occurrence
query tagging
vertical intent
query expansion - related terms
Training with A/B tests, model selection -> regression decision tree
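A sketch of the edit-distance ingredient of spell check (the n-gram, metaphone, and co-occurrence signals are omitted); the dictionary and the `max_dist` cutoff are illustrative:

```python
def edit_distance(a, b):
    """Levenshtein distance via a single-row DP table."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # deletion, insertion, substitution (or match)
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def suggest(query, dictionary, max_dist=2):
    """Rank dictionary terms by edit distance to the misspelled query."""
    candidates = [(edit_distance(query, w), w) for w in dictionary]
    return [w for d, w in sorted(candidates) if d <= max_dist]
```

In a real spell checker the n-gram index would first narrow the candidate set so the DP only runs against a handful of terms.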
Profile frontend architecture
https://engineering.linkedin.com/profile/engineering-new-linkedin-profile
JSON mapper
We created classes called Mappers in our web app that combine, adapt, or modify data from our backend services into JSON
Mappers can be called in batch or individually; mapper endpoints can also be hit directly via AJAX when loading data dynamically
Fizzy-UI aggregator
Server side: Apache Traffic Server assembles and redirects web responses
Client side: define the frame with an HTML base page split into segments; fire parallel requests to the specified endpoints; inject the returned data, wrapped in its own markup, into the base page; flush it to the browser; use Fizzy client-side code to render the page progressively (render only the viewable part above the fold first)
dust.js
edit form: re-render and replace!
Unifying the LinkedIn Search Experience
https://engineering.linkedin.com/search/unifying-linkedin-search-experience
Query Auto-complete and Content Type Suggestion
Most frequent queries + likelihood of a successful search (because the most frequent queries are generic and the distribution has a long, heavy tail)
Classify the query into the right vertical (jobs/people/...)
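A toy scoring sketch of the idea above, assuming per-query (frequency, success-rate) statistics are available; the numbers are invented:

```python
def rank_suggestions(prefix, query_stats, top_k=2):
    """Score completions by frequency weighted by the fraction of
    searches that succeeded, so generic head queries with poor
    outcomes don't dominate the autocomplete suggestions."""
    matches = [(q, count * success_rate)
               for q, (count, success_rate) in query_stats.items()
               if q.startswith(prefix)]
    matches.sort(key=lambda x: -x[1])
    return [q for q, _ in matches[:top_k]]

query_stats = {
    "software engineer":  (1000, 0.60),  # (frequency, success rate)
    "software":           (5000, 0.05),  # frequent but generic
    "software developer": (400,  0.50),
}
```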
Unified Search Result Page
eliminate the drop-down to select content type
Intent prediction: order verticals by the predicted probability distribution and limit the number of verticals the query is sent to, eliminating unnecessary fan-out
use data on pre-unified search behavior to build intent models
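A sketch of intent-based fan-out limiting, assuming a calibrated probability per vertical from the intent model; the 0.9 probability-mass cutoff is an invented parameter:

```python
def verticals_to_query(intent_probs, mass=0.9):
    """Order verticals by predicted intent probability and fan the
    query out only to the smallest prefix covering `mass` of the
    probability, instead of querying every vertical backend."""
    chosen, total = [], 0.0
    for vertical, p in sorted(intent_probs.items(), key=lambda kv: -kv[1]):
        chosen.append(vertical)
        total += p
        if total >= mass:
            break
    return chosen

intent_probs = {"people": 0.70, "jobs": 0.25, "companies": 0.04, "groups": 0.01}
```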
Page Optimization
Rank the results by relevance, plus secondary results; ranking can be feature-optimized
Personalized Navigation
https://engineering.linkedin.com/mobile/linkedin-mobile-introducing-personalized-navigation
View/action events -> daily Kafka -> ETL -> HDFS -> weekly Hadoop workflow -> Voldemort k-v storage -> member-id-based lookup -> frontend
For each branch (category) / leaf (specific page), count 1) the number of days visited in the past week, 2) the number of weeks visited
When counting, apply a scaling factor: continuous visits across weeks score higher than spiky visits within one week
Start from the history of all members, then keep personalized thresholds and update the entries passing the thresholds
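A sketch of the counting with a streak scaling factor; the `streak_bonus` value and the input shape are assumptions, not the production formula:

```python
def nav_score(weekly_visit_days, streak_bonus=1.5):
    """weekly_visit_days: visit-day counts per week, oldest first.
    Weeks that continue a streak get scaled up, so continuous visits
    across weeks outscore spiky visits concentrated in one week."""
    score, prev_active = 0.0, False
    for days in weekly_visit_days:
        active = days > 0
        weight = streak_bonus if (active and prev_active) else 1.0
        score += weight * days
        prev_active = active
    return score
```

With these numbers a member visiting 2 days a week for 3 weeks outscores one who visited 6 days in a single week, despite the equal total.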
Decoupling translation from source code
https://engineering.linkedin.com/language-packs/decoupling-translation-source-code
Property files map English strings to their translations
i18n code retrieves data from the property files
Old way: property files were bundled into the project (WAR), requiring a redeploy for every change
New way: with separate JARs, language packs and server code are managed and deployed separately
Backwards compatibility: handle the situation where the language pack system ships earlier than the corresponding code
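A sketch of the lookup with a fallback locale, which is one way to get the backwards compatibility described above; the pack contents and key names are made up:

```python
def translate(key, locale, packs, default_locale="en_US"):
    """Look up `key` in the locale's language pack; fall back to the
    default locale when a pack and the code are out of sync, so a
    new string never renders as a missing-key error."""
    for loc in (locale, default_locale):
        value = packs.get(loc, {}).get(key)
        if value is not None:
            return value
    return key  # last resort: show the key itself

packs = {
    "en_US": {"profile.edit": "Edit profile", "profile.save": "Save"},
    "de_DE": {"profile.edit": "Profil bearbeiten"},  # older pack, missing a key
}
```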
Databus
https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system
source database -> in-memory log stores (events, for fast-moving clients) -> snapshot stores (for slow-moving clients or data copies)
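A toy model of the log-plus-snapshot split, assuming a bounded in-memory log and consumers identified by offset (the real Databus protocol is far richer than this):

```python
class Relay:
    """Fast consumers read recent change events from a bounded
    in-memory log; consumers that fall behind the log's retention
    window bootstrap from the snapshot store instead."""
    def __init__(self, log_capacity=3):
        self.log, self.snapshot = [], {}
        self.capacity, self.first_offset = log_capacity, 0

    def publish(self, key, value):
        self.log.append((key, value))
        self.snapshot[key] = value          # snapshot holds latest state
        if len(self.log) > self.capacity:   # trim the oldest event
            self.log.pop(0)
            self.first_offset += 1

    def catch_up(self, consumer_offset):
        if consumer_offset >= self.first_offset:
            return ("log", self.log[consumer_offset - self.first_offset:])
        return ("snapshot", dict(self.snapshot))  # fell too far behind

relay = Relay(log_capacity=2)
relay.publish("a", 1)
relay.publish("b", 2)
relay.publish("c", 3)   # evicts ("a", 1) from the log
```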
Publishing Platform
https://engineering.linkedin.com/publishing-platform/maximizing-our-publishing-platform-reach-network-distribution
Feed-Mixer supports the most important network distribution channel for member published articles - the LinkedIn Feed on Mobile and Desktop. These feeds help LinkedIn members keep up with their connections’ activities by consuming updates on the LinkedIn.com homepage feed or on LinkedIn mobile applications. When an author publishes an article on LinkedIn, a “member published” activity is distributed to all of the author's 1st degree connections and followers. Viral activity on that update - comments, likes, and reshares - gets the poster additional distribution beyond their 1st degree network, and gets them distribution to additional people who can follow them. For many of our authors, over time, the followership base that they build on LinkedIn will dwarf their 1st degree network.
Creating this publishing notification is very straightforward. On the server side, we first define a new notification type for the publishing event. The creation of these notifications is done asynchronously after a member publishes a post, allowing us to implement custom business logic on how widely and when to distribute it. The publishing service invokes the notification system's rest.li API to send out the notifications. On the client side, we created templates for both desktop and mobile clients to render the new notification type.
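A sketch of the asynchronous notification fan-out described above, using an in-process queue in place of the real messaging system; the names and data are hypothetical:

```python
from collections import deque

followers_of = {"alice": ["bob", "carol"]}  # 1st-degree connections + followers
inboxes = {}
queue = deque()

def publish_article(author):
    """The publish call returns immediately; notification fan-out is
    queued and handled asynchronously, leaving room for business
    logic on how widely and when to distribute."""
    queue.append(("member_published", author))

def drain_queue():
    """Async worker: fan the event out to the author's audience
    (a spam / low-quality content filter would sit here too)."""
    while queue:
        kind, author = queue.popleft()
        for member in followers_of.get(author, []):
            inboxes.setdefault(member, []).append((kind, author))

publish_article("alice")
drain_queue()
```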
content filter for spam and low-quality content
How to find connection degree easily?
Graph concepts (edges?), algorithms
RESTful API?
Galene
https://engineering.linkedin.com/search/did-you-mean-galene
Provide deeply personalized search results based on each member's identity and relationships
Lucene
The search index has two primary components:
· The inverted index – a mapping from search terms to the list of entities that contain them; and
· The forward index – a mapping from entities to metadata about them.
Federator (rewrite, restructure, add metadata) -> broker (for each vertical: add metadata, send to multiple searchers [shards], merge & rank results from the searchers)
A rewriter is made up of multiple rewriter modules, each of which performs a specific function. Specific functions can be synonym expansion, spelling correction, or graph proximity personalization.
Indexing on Hadoop:
We first run MapReduce jobs with relevance algorithms embedded that enrich the raw data – resulting in the derived data. Some examples of relevance algorithms that may be applied here are spell correction, standardization of concepts (for example, unifying "software engineer" and "computer programmer"), and graph analysis.
Live Updates:
In Galene, live updates are performed at the granularity of single fields. We have built a new kind of index segment – the term partitioned segment. The inverted index and forward index of each entity may be split up across these segments. The same posting list can be present in multiple segments, and a traversal of a single posting list becomes the traversal of a disjunction of the posting lists in each of the segments. For this to work properly, the entities in each segment have to be ordered in the same manner – given that we order entities by static rank in all segments, we satisfy the ordering constraint. The forward index becomes the union of the forward indices in each of the segments.
In Galene, we maintain three such segments:
· The base index – this is the one built offline on Hadoop. This is rebuilt periodically (say every week). Once built, it is never modified, only discarded after the next base index is built.
· The live update buffer – which is maintained in memory. All live updates are applied to this segment. This segment is designed to accept incremental updates and augment itself to retain the entities in the correct static rank order.
· The snapshot index – given that the live update buffer is only in memory, we periodically (every few hours) flush it to the snapshot index on disk to make it persistent. If the snapshot index already exists, a new one is built that combines the contents of the previous snapshot index and the live update buffer. After each flush, the live update buffer is reset.
Indexers generate and ship snapshots; searchers (read-only) parse queries and return results
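A sketch of posting-list traversal as a disjunction over segments: because every segment orders entities by the same static rank, a k-way merge with de-duplication suffices. The segment contents here are invented:

```python
import heapq

def traverse_posting_list(term, segments):
    """Merge a term's per-segment posting lists. Each list is already
    in static-rank order, so heapq.merge preserves global order; the
    same posting may appear in more than one segment, so dedupe."""
    lists = [seg.get(term, []) for seg in segments]
    merged = []
    for doc in heapq.merge(*lists):
        if not merged or merged[-1] != doc:
            merged.append(doc)
    return merged

# Entity ids in static-rank order within each segment.
base_index   = {"engineer": [1, 4, 9]}
snapshot_idx = {"engineer": [2, 9]}   # 9 also present in the base index
live_buffer  = {"engineer": [3]}
```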
LinkedIn Connected: Engineering Pre-Meeting Intelligence
We delved into the details of how we achieved decoupling of the generation and ranking of relationship opportunities, scheduling based on the member's calendar, and delivery of timely notifications by Opportunist through asynchronous Kafka messaging.
Mobile A/B Testing
Rendering based on view types on the client side allows us to dynamically modify the client UI for any subset of the clients with the entire control on the server. The XLNT framework allows us to perform targeted experiments that can also leverage contextual information provided by the client.
LinkedIn University Pages
Should first talk about our work standardizing LinkedIn's school data.
Using a set cover algorithm to optimize query latency for a large-scale distributed graph
NCS, the caching layer, calculates and stores a member's second-degree set. With this cache, graph distance queries originating from a member can be converted to set intersections, avoiding further remote calls. For example, if we have member X's second degree calculated, to decide whether member Y is three degrees apart from member X, we can simply fetch Y's connections and intersect X's second-degree cache with Y's first-degree connections.
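A sketch of the distance check against the NCS cache; the member names are made up, and the cache is assumed to hold everyone within two degrees of X:

```python
def degree_from_x(y, x_second_degree, y_connections):
    """x_second_degree: NCS cache of members within two degrees of X.
    Only Y's 1st-degree connections need a remote fetch; the distance
    check then reduces to membership and set intersection."""
    if y in x_second_degree:
        return 2        # within two degrees (the cache answers <= 2)
    if x_second_degree & y_connections:
        return 3        # some connection of Y is within two degrees of X
    return None         # farther than three degrees

x_cache = {"bob", "carol", "dave"}  # members within two degrees of X
```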
We decided to apply a greedy set cover algorithm to address this query optimization problem. Greedy set cover algorithms are used to find the smallest subset that covers the maximum number of uncovered points in a large set. In this case, we would apply the algorithm to the set of partitions that stored a member's first-degree connections. Partitions stored on each GraphDB node would be the elements in a family of sets. We wanted to find the smallest number of elements from this set family that covered the input set.
Read the rest of the paper for implementation & optimization details
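A sketch of the greedy heuristic on invented data; `partitions` maps a partition name to the first-degree connections it stores:

```python
def greedy_set_cover(universe, family):
    """Repeatedly pick the set covering the most still-uncovered
    elements until the universe is covered; here, querying fewer
    partitions means touching fewer GraphDB nodes per request."""
    uncovered, chosen = set(universe), []
    while uncovered:
        name, members = max(family.items(),
                            key=lambda kv: len(kv[1] & uncovered))
        if not members & uncovered:
            break       # remaining connections live on no known partition
        chosen.append(name)
        uncovered -= members
    return chosen

connections = {1, 2, 3, 4, 5}  # member X's first-degree connections
partitions = {"p0": {1, 2, 3}, "p1": {3, 4}, "p2": {4, 5}, "p3": {5}}
```

Greedy gives only an approximate cover in general (exact set cover is NP-hard), but the ln(n) approximation is enough to cut fan-out in practice.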