
Networking Social Media LinkedIn

Ahmad Nazri Bin Ibrahim s818472
Alaa Mohammad s818342
Lenorita Ramas Ngelambai Anak Leban s822648
Nurul Asyikin Binti Abd Rashid s818395

Awang Had Salleh Graduate School of Arts & Sciences, Universiti Utara Malaysia (UUM)



Abstract: LinkedIn is among the largest social networking sites in the world, a business- and employment-oriented social networking service that operates via websites. The company has built distributed systems over its core data sets and request-processing requirements, and it runs large-scale data infrastructure projects that accommodate ever-growing data volumes every day. The backbone of LinkedIn is the architecture of these distributed systems. The discussion will include: (1) Voldemort, a scalable and fault-tolerant key-value store; (2) Databus, a framework for delivering database changes to downstream applications; (3) Espresso, a distributed data store that supports flexible schemas and secondary indexing; and (4) Kafka, a scalable and efficient messaging system for collecting various user activity events and log data.

Keywords: LinkedIn; infrastructure; Voldemort; Databus; Helix; bootstrap; Kafka; Espresso; HDFS

I. INTRODUCTION
LinkedIn's founders are Reid Hoffman, Allen Blue, Konstantin Guericke, Eric Ly, and Jean-Luc Vaillant. The site officially launched on May 5, 2003, and it is one of the world's largest professional networks on the Internet, with more than 467 million members covering almost 200 countries and territories. Standing as a highly structured set of data, it gives data scientists and researchers the ability to conduct applied research and to build network-data-driven products, including search, the social graph, and machine learning systems. LinkedIn makes the privacy and security of its membership a priority in all of its research.
LinkedIn provides its service through four different types of clients:
- Business Plus: members can find and contact the right people, and promote and grow their business potential.
- Job Seeker: stand out, get in touch with hiring managers, and see how you compare to other applicants.
- Sales Navigator: find leads and accounts in your target market.
- Recruiter Lite: find great candidates faster.
LinkedIn focuses on two products of its distributed strategy.

Figure 1: LinkedIn product service

II. DISTRIBUTED SYSTEM FEATURES USED BY LINKEDIN
LinkedIn logically divides into three fairly traditional tiers of services: a data tier that maintains persistent state, including all user data; a service tier that implements an API layer capturing the logical operations of the website (shown in Figure 2); and a display tier responsible for translating APIs into user interfaces, including web pages and other user-facing applications such as sharing widgets on external sites or mobile application content. These tiers physically run on separate hardware and communicate via RPC.
The service tier and display tier are both largely stateless: all data and state come only from transient caches and the underlying data systems. As a result, the request capacity of these stateless services can be increased by randomly balancing the load over an enlarged hardware pool. Likewise, failures are handled trivially by removing the failed machines from the live pool. The key is that, since there is no state in these tiers, all machines are interchangeable.
This is a common strategy for simplifying the design of a large website: state is pushed down into a small number of general data systems, which allows the harder problem of scaling stateful systems to be solved in as few places as possible. The details of these systems differ, but they share some common characteristics. Data volume generally requires partitioning data over multiple machines, and requests must be intelligently routed to the machine where the data resides. Availability requirements call for replicating data onto multiple machines to allow fail-over without data loss. Replicated state requires reasoning about the consistency of data across these machines. Expansion requires redistributing data over new machines without downtime or interruption. LinkedIn's core data systems are (a) live storage, (b) stream systems, (c) search, (d) social graph, (e) recommendations, and (f) batch computing. A high-level architecture of these systems is shown in Figure 2. We will give a brief overview of each of these areas and provide details on a few selected systems.
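To make the partitioning and routing characteristic concrete, the following is a minimal, self-contained Java sketch, not LinkedIn's actual router; the class name and the idea of passing a liveness predicate are our own illustration. It hash-partitions a key and fails over across that partition's replicas.

Listing 1: A conceptual partition-aware router (hypothetical example).

    import java.util.List;
    import java.util.function.Predicate;

    public class PartitionedRouter {
        // partition id -> ordered list of replica hosts
        private final List<List<String>> replicasByPartition;

        public PartitionedRouter(List<List<String>> replicasByPartition) {
            this.replicasByPartition = replicasByPartition;
        }

        // Hash the key to a partition, then pick the first live replica (fail-over).
        public String route(String key, Predicate<String> isLive) {
            int partition = Math.floorMod(key.hashCode(), replicasByPartition.size());
            for (String host : replicasByPartition.get(partition)) {
                if (isLive.test(host)) {
                    return host;
                }
            }
            throw new IllegalStateException("no live replica for partition " + partition);
        }
    }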

Figure 2: A very high-level overview of LinkedIn's architecture, focusing on the core data systems

III. LINKEDIN DISTRIBUTED DESIGN AND INFRASTRUCTURE
LinkedIn uses many components to achieve smooth and responsive production data services without delay or complications.

A. VOLDEMORT
Project Voldemort is a highly available, low-latency distributed data store. It was initially developed in 2008 to provide key-value storage for read-write data products such as "Who viewed my profile".

Figure 3: Project Voldemort

A Voldemort cluster can contain multiple nodes, each with a unique id. Stores correspond to database tables. Each store maps to a single cluster, with the store partitioned over all nodes in the cluster. Every store has its own set of configurations, including a replication factor (N), the required number of nodes that must participate in reads (R) and writes (W), and finally a schema (the key and value serialization formats). With N = 3, R = 2, and W = 2, for example, R + W > N, so every read quorum overlaps every write quorum and a successful read observes the latest successful write. A client-side sketch follows.
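As a rough illustration of how an application uses such a store, here is a minimal sketch against Voldemort's open-source Java client API; the bootstrap URL and the store name "profile-views" are hypothetical, and the quorum behavior in the comments assumes the store configuration described above.

Listing 2: Reading and writing a Voldemort store (hypothetical store name and URL).

    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.client.StoreClientFactory;

    public class ProfileViewsClient {
        public static void main(String[] args) {
            // Bootstrap cluster metadata from any node (hypothetical URL).
            StoreClientFactory factory = new SocketStoreClientFactory(
                    new ClientConfig().setBootstrapUrls("tcp://voldemort.example.com:6666"));

            // A store behaves like a table partitioned over all cluster nodes.
            StoreClient<String, Integer> store = factory.getStoreClient("profile-views");

            // Writes go to W replicas and reads consult R replicas,
            // per the store's configured quorum settings.
            store.put("member:42", 128);
            Integer views = store.getValue("member:42");
            System.out.println("views = " + views);
        }
    }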
B. DATABUS
At LinkedIn, we have built Databus, a system for change data capture (CDC), which is used to enable complex online and near-line computations under strict latency bounds. It provides a common pipeline for transporting CDC events from LinkedIn's primary databases to various applications. Among these applications are the Social Graph Index, the People Search Index, read replicas, and near-line processors such as Company Name and Position Standardization. Databus contains adapters written for Oracle and MySQL, the two primary database technologies in use at LinkedIn, and a subscription API that allows applications to subscribe to changes from one or more data sources. It is easy to extend Databus to support other kinds of data sources.

Figure 4: Databus architecture at LinkedIn

The Databus pipeline consists of three main components: the Relay, the Bootstrap Server, and the Databus Client Library. The Relay captures changes in the source database, serializes them to a common binary format, and buffers them. Each change is represented by a Databus CDC event, which contains a sequence number in the commit order of the source database, metadata, and a payload with the serialized change. Relays serve Databus events to both clients and the Bootstrap Servers. Bootstrap Servers provide clients with arbitrarily long look-back queries into the Databus event stream and isolate the source database from having to handle these queries. The main components shown in Figure 4 are detailed below.

Relay: As mentioned above, the first task of the relay is to capture changes from the source database. At LinkedIn, we employ two capture approaches: triggers, or consuming from the database replication log. Once captured, the changes are serialized to a data-source-independent binary format. We chose Avro because it is an open format with multiple language bindings. Further, unlike other formats such as Protocol Buffers, Avro allows serialization in the relay without generating source-schema-specific code (see the serialization sketch after the list below). The serialized events are stored in a circular in-memory buffer that is used to serve events to the Databus clients. We typically run multiple shared-nothing relays that are either connected directly to the database or to other relays to provide replicated availability of the change stream. The relay with the in-memory circular buffer provides:
- a default serving path with very low latency (< 1 ms);
- efficient buffering of tens of GB of data with hundreds of millions of Databus events;
- index structures to efficiently serve Databus clients events from a given sequence number S;
- server-side filtering in support of multiple client partitioning schemes; and
- support for hundreds of consumers per relay with no additional impact on the source database.
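To illustrate why Avro needs no schema-specific code generation, the following sketch serializes a change record with Avro's generic (runtime-schema) Java API. The record schema here is a made-up example, not an actual LinkedIn source schema.

Listing 3: Schema-driven binary serialization with Avro's generic API.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class RelaySerializationSketch {
        public static void main(String[] args) throws IOException {
            // The schema is parsed at runtime -- no source-schema-specific
            // code generation, which is why Avro suits the relay here.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"MemberChange\",\"fields\":[" +
                "{\"name\":\"memberId\",\"type\":\"long\"}," +
                "{\"name\":\"headline\",\"type\":\"string\"}]}");

            GenericRecord change = new GenericData.Record(schema);
            change.put("memberId", 42L);
            change.put("headline", "Engineer");

            // Serialize to the compact binary form buffered by the relay.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(change, encoder);
            encoder.flush();
            System.out.println("serialized " + out.size() + " bytes");
        }
    }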
C. BOOTSTRAP SERVER
Bootstrap Server: The main task of the bootstrap
server is to listen to the stream of Databus events and
provide long-term storage for them. It serves all requests
from clients that cannot be processed by the relay
because of its limited memory. Thus, the bootstrap
server isolates the primary database from having to serve
those clients.
There are two types of queries supported:
- a consolidated delta since a sequence number (say T); and
- a consistent snapshot at a sequence number (say U).
A sketch of how a client combines these two query types with the relay appears after Figure 5.

Figure 5: Bootstrap Server Design
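The division of labor between relay and bootstrap server can be summarized in code. This self-contained Java sketch uses hypothetical interfaces (not the real Databus client API) to show how a lagging consumer first drains the long look-back from the bootstrap server and only then switches to the relay's low-latency path.

Listing 4: Conceptual catch-up logic for a Databus consumer (hypothetical interfaces).

    import java.util.List;
    import java.util.function.Consumer;

    public class CatchUpSketch {
        // Hypothetical interfaces standing in for the Databus client API.
        interface Relay {
            long oldestBufferedSeq();            // start of the circular buffer
            List<Event> eventsSince(long seq);   // default low-latency serving path
        }
        interface BootstrapServer {
            List<Event> deltaSince(long seq);    // consolidated delta since T
            List<Event> snapshotAt(long seq);    // consistent snapshot at U
        }
        record Event(long seq, byte[] payload) {}

        static void catchUp(Relay relay, BootstrapServer bootstrap,
                            long lastSeenSeq, Consumer<Event> apply) {
            if (lastSeenSeq < relay.oldestBufferedSeq()) {
                // Too far behind for the relay's in-memory buffer: the bootstrap
                // server, not the primary database, serves the long look-back.
                for (Event e : bootstrap.deltaSince(lastSeenSeq)) {
                    apply.accept(e);
                    lastSeenSeq = e.seq();
                }
            }
            // Caught up enough: switch back to the relay's default path.
            for (Event e : relay.eventsSince(lastSeenSeq)) {
                apply.accept(e);
            }
        }
    }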

Related Systems: The related systems can be classified into three categories: generic pub/sub systems, database replication systems, and other CDC systems. Generic pub/sub systems such as JMS implementations, Kafka, and Hedwig generally provide message-level or, at best, topic-level semantics.
D. HELIX GENERAL CLUSTER MANAGER
Helix is a generic cluster management framework used for automatic management of partitioned, replicated, and distributed resources hosted on a cluster of nodes.

Figure 6: Helix General Cluster Manager

Helix provides the following features (a minimal assignment sketch follows the list):
1. Automatic assignment of resources/partitions to nodes
2. Node failure detection and recovery
3. Dynamic addition of resources
4. Dynamic addition of nodes to the cluster
5. Pluggable distributed state machine to manage the state of a resource via state transitions
6. Automatic load balancing and throttling of transitions
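Feature 1 can be illustrated with a toy version of the placement problem Helix solves. This is not Helix's actual algorithm, just a minimal sketch that spreads each partition's replicas across distinct nodes, assuming the replica count does not exceed the node count.

Listing 5: A toy partition-to-node assignment (not Helix's real algorithm).

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class PartitionAssignmentSketch {
        public static Map<String, List<Integer>> assign(List<String> nodes,
                                                        int partitions, int replicas) {
            Map<String, List<Integer>> byNode = new LinkedHashMap<>();
            for (String n : nodes) {
                byNode.put(n, new ArrayList<>());
            }
            for (int p = 0; p < partitions; p++) {
                for (int r = 0; r < replicas; r++) {
                    // Offset by the replica index so copies of a partition
                    // land on distinct nodes (requires replicas <= nodes).
                    String node = nodes.get((p + r) % nodes.size());
                    byNode.get(node).add(p);
                }
            }
            return byNode;
        }

        public static void main(String[] args) {
            // 6 partitions, 2 replicas each, over 3 nodes.
            System.out.println(assign(List.of("node1", "node2", "node3"), 6, 2));
        }
    }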
E. KAFKA
Kafka is distributed in nature; a Kafka cluster typically consists of multiple brokers. To balance load, a topic is divided into multiple partitions, and each broker stores one or more of those partitions. A minimal producer sketch follows Figure 7.

Figure 7: Kafka Architecture
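As an illustration of how an activity event reaches such a cluster, here is a minimal sketch using the open-source Kafka Java producer client; the broker addresses, topic name, and event payload are hypothetical.

Listing 6: Publishing a user-activity event to a partitioned Kafka topic.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ActivityEventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Any brokers in the cluster (hypothetical addresses).
            props.put("bootstrap.servers", "broker1:9092,broker2:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Events with the same key hash to the same partition, so load
                // spreads over brokers while per-key ordering is preserved.
                producer.send(new ProducerRecord<>(
                        "user-activity", "member-42", "page_view:/in/some-profile"));
            }
        }
    }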
F. ESPRESSO
ESPRESSO is a distributed, timeline-consistent, scalable document store that supports local secondary indexing and local transactions. ESPRESSO relies on Databus for internal replication and therefore provides a Change Data Capture pipeline to downstream consumers. The development of Espresso was motivated by our desire to migrate LinkedIn's online serving infrastructure from monolithic, commercial RDBMS systems running on high-cost specialized hardware to elastic clusters of commodity servers running free software, and to improve agility by enabling rapid development: simplifying the programming model and separating scalability, routing, and caching from business logic.

Figure 8: Espresso system components

New storage nodes can be added as the data size or the request rate approaches the capacity limit of a cluster. A warm standby is located in a geographically remote location to assume responsibility in the event of a disaster.
- Espresso is a distributed document-oriented database: it is timeline consistent, provides rich operations on documents, and supports seamless integration with near-line and online environments.
- Espresso uses a master-slave architecture.
- Cluster management is handled by Apache Helix, which uses ZooKeeper as the coordination service to store cluster metadata.
- Espresso has been developed in Java and has been in production since June 2012.

Each component in Espresso is fault tolerant:
- Storage node failure: Helix calculates a new set of master partitions from the existing slaves.
- Failure detection: Helix uses ZooKeeper heartbeats for hard failures, and monitors performance metrics reported by the router and the storage nodes.
- Databus is also fault tolerant.

IV. HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster. In this way, the map and reduce functions can be executed on smaller subsets of the larger data sets, and this provides the scalability that is needed for big data processing.

Figure 9: Hadoop architecture

The cluster is the set of host machines (nodes). Nodes may be partitioned into racks. This is the hardware part of the infrastructure. The YARN infrastructure (Yet Another Resource Negotiator) is the framework responsible for providing the computational resources (e.g., CPUs, memory) needed for application execution. Two important elements are:
- The Resource Manager (one per cluster) is the master. It knows where the slaves are located (rack awareness) and how many resources they have. It runs several services; the most important is the Resource Scheduler, which decides how to assign the resources.

Figure 10: Resource Manager

- The Node Manager (many per cluster) is the slave of the infrastructure. When it starts, it announces itself to the Resource Manager, and it periodically sends the Resource Manager a heartbeat. Each Node Manager offers some resources to the cluster; its resource capacity is the amount of memory and the number of vcores. At run time, the Resource Scheduler decides how to use this capacity: a Container is a fraction of the Node Manager's capacity, and it is used by the client for running a program. A short sketch of the HDFS client-side view of blocks follows.
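The block abstraction is visible through the HDFS Java client API. The following minimal sketch (the NameNode URL and file path are hypothetical) writes a file and reads back its replication factor and block size, the two settings behind the block distribution described above.

Listing 7: Writing a file to HDFS and inspecting its blocks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlocksSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/data/activity/events.log"); // hypothetical path

            // The file is transparently split into blocks and replicated
            // across DataNodes as it is written.
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("member-42 page_view");
            }

            FileStatus status = fs.getFileStatus(path);
            System.out.println("replication = " + status.getReplication());
            System.out.println("block size  = " + status.getBlockSize());
        }
    }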
V. COMPARISON BETWEEN LINKEDIN, FACEBOOK AND TWITTER
The following comparison shows the differences between the other social media networks and the LinkedIn service.

Feature                   LinkedIn      Twitter   Facebook
In-depth service          Yes           No        Yes
Amount of content         Not limited   Limited   Not limited
Better user experience    Yes           No        Yes
Interact with others      Yes           Yes       Yes
Add feeds to other apps   Yes           No        No

VI. FUTURE WORKS
Inspired by Facebook's "social graph", LinkedIn also wants to try to create an "economic graph" within a decade. The goal is to create a comprehensive digital map of the world economy and the connections within it. The economic graph is to be built on the company's current platform, with data nodes including companies, jobs, skills, volunteer opportunities, educational institutions, and content. They have been hoping to include all the job listings in the world, all the skills required to get those jobs, all the professionals who could fill them, and all the companies (nonprofit and for-profit) at which they work. The ultimate goal is to make the world economy and job market more efficient through increased transparency. In June 2014, the company announced its "Galene" search architecture to give users access to the economic graph's data with more thorough filtering, via user searches like "Engineers with Hadoop experience in Brazil".
LinkedIn has used economic graph data to research several topics on the job market, including popular destination cities of recent college graduates, areas with high concentrations of technology skills, and common career transitions. LinkedIn provided the City of New York with data from the economic graph showing "in-demand" tech skills for the city's "Tech Talent Pipeline" project.

VII. REFERENCES
[1] Salem, Saeed, Shadi Banitaan, Ibrahim Aljarah, James E. Brewer, and Rami Alroobi. "Discovering Communities in Social Networks Using Topology and Attributes." In Machine Learning and Applications and Workshops (ICMLA), 2011 10th International Conference on, vol. 1, pp. 40-43. IEEE, 2011.
[2] Zhenghe Wang, Wei Dai, Feng Wang, Hui Deng, Shoulin Wei, Xiaoli Zhang, and Bo Liang. "Kafka and Its Using in High-throughput and Reliable Message Distribution." In Intelligent Networks and Intelligent Systems (ICINIS), 2015 8th International Conference on, pp. 117-120, 2015.
[3] Jun Rao and Sam Shah. LinkedIn, 2013.
[4] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. "The Hadoop Distributed File System." In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), May 2010, pp. 1-10. IEEE.
[5] Shadi A. Noghabi, Sriram Subramanian, Priyesh Narayanan, Sivabalan Narayanan, Gopalakrishna Holla, Mammad Zadeh, Tianwei Li, Indranil Gupta, and Roy H. Campbell. "Ambry: LinkedIn's Scalable Geo-Distributed Object Store." In Proceedings of the 2016 International Conference on Management of Data, June 26-July 1, 2016, San Francisco, California, USA.
[6] http://hadoop.apache.org/
[7] Finkle, Jim, and Jennifer Saba. "LinkedIn suffers data breach." Reuters, June 6, 2012. Retrieved June 7, 2012.
