Distributed System
( shared-nothing systems )
characteristics of shared-nothing systems
bunch of machines connected by a network
each machine has its own memory and disk
cannot access other machines' memory or disk
dominant approach for building internet services
cheap because it does not require special hardware
can make use of commoditized cloud computing services
can achieve high reliability
Cloud Computing and Supercomputing
how to build large-scale computing systems
HPC ( high performance computing )
supercomputer with thousands of CPUs
computationally intensive computing task
weather forecasting
molecular dynamics
cloud computing
multi-tenant datacenters
commodity computers connected with an IP network ( often Ethernet )
elastic/on-demand resource allocation
metered billing
traditional enterprise datacenters
handling faults
supercomputer
more like a single-node computer than a distributed system
typically checkpoints the state of its computation to durable storage
if one node faults
simply stop the entire cluster workload
after the faulty node is repaired, restart from the last checkpoint
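The checkpoint/restart strategy above can be sketched in a few lines of Python. This is a toy illustration, not a real HPC scheduler: the file name, state shape, and `run` loop are all hypothetical, and the "computation" is just a running sum.

```python
import os
import pickle

CHECKPOINT = "checkpoint.pkl"  # hypothetical checkpoint file

def save_checkpoint(state):
    # write atomically: dump to a temp file, then rename,
    # so a crash mid-write never corrupts the last checkpoint
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def load_checkpoint():
    # resume from the last checkpoint if one exists
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}  # fresh start

def run(steps=10, checkpoint_every=3):
    state = load_checkpoint()
    while state["step"] < steps:
        state["total"] += state["step"]   # the "computation"
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state)        # a restart resumes from here
    return state["total"]
```

If the process is killed and restarted, `run` picks up from the last saved `step` rather than from zero, which is exactly the supercomputer recovery model: stop everything, repair, restart from the checkpoint.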
systems for implementing internet services
in this book, we will focus on these systems
differences from supercomputers
internet-related applications are
online
need
to serve users with low latency at any time
not acceptable to make the service unavailable
the difference between nodes in supercomputers and cloud services
supercomputers are built from specialized hardware
each node is quite reliable
nodes communicate through shared memory
remote direct memory access (RDMA)
nodes in cloud service built from commodity machines
equivalent performance at lower cost
have higher failure rates
topologies in large datacenter networks and supercomputers
large datacenter
topologies based on IP and Ethernet
arranged in Clos topologies
provide high bisection bandwidth
bisection bandwidth: the bandwidth available between two halves of the network
supercomputer
specialized network topologies
( multi-dimensional meshes and toruses )
better performance for HPC workloads with known communication patterns
failure rates in big systems
the bigger the system, the more likely one of its components is broken
broken things get fixed and new things break
reasonable to assume that something is always broken
in a large system, if we simply give up whenever something is broken
time spent recovering from faults > time spent doing useful work
if the system can tolerate failed nodes,
then keep working
very useful for operations and maintenance
example: if a request to virtual machine A fails, just kill A and send the request to a new virtual machine B
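The idea of tolerating failed nodes can be sketched as a client that retries across replicas. This is a simplified illustration under assumed names (`NodeDown`, `make_node`, `fault_tolerant_request` are all hypothetical); real systems would add timeouts, backoff, and health checks.

```python
class NodeDown(Exception):
    """Raised when a node cannot serve the request."""

# hypothetical replicas: each either answers or is down
def make_node(name, healthy):
    def handle(request):
        if not healthy:
            raise NodeDown(name)
        return f"{name} handled {request}"
    return handle

def fault_tolerant_request(request, nodes):
    # try each replica in turn; the system as a whole keeps
    # working as long as at least one node is up
    for node in nodes:
        try:
            return node(request)
        except NodeDown:
            continue  # this node failed; try the next one
    raise RuntimeError("all replicas are down")

nodes = [make_node("A", healthy=False),  # A has failed
         make_node("B", healthy=True)]
print(fault_tolerant_request("GET /user/42", nodes))
# the request succeeds via node B even though node A is down
```

This is the operational benefit mentioned above: because a failed node can simply be skipped or replaced, a machine can be taken out for repair without making the whole service unavailable.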
communication in distributed systems vs supercomputers
distributed systems
keep data geographically close to user to reduce access latency
communication over internet
slow and unreliable compared to local network
supercomputers
all of the nodes are close together
making distributed systems work
accept the possibility of partial failure
build fault-tolerance mechanisms
build a
reliable system
from
unreliable components
Distributed System
Chapter 8: The Trouble with Distributed Systems
what challenges of distributed systems are we up against?
Chapter 9: Consistency and Consensus
how to build systems that do their job, in spite of everything going wrong
Faults and Partial Failures
Single computer ( individual computer )
hardware working correctly
always produces the same result ( deterministic )
hardware problem
example
memory corruption
loose connector
consequence
total system failure
kernel panic
blue screen of death
failure to start up
when an internal fault occurs
preferable to crash completely
rather than returning a wrong result
wrong results are difficult and confusing to deal with
Several computer, connected by a network
( Distributed System )
remarkably wide range of things can go wrong
PDU ( power distribution unit ) failures
switch failures
accidental power cycle of a whole rack
whole-DC backbone failures
whole-DC power failures
a hypoglycemic driver crashing into the datacenter
some parts broken
some parts working fine
partial failure
nondeterministic
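The nondeterminism of partial failure can be sketched with a simulated remote call. In this hypothetical example (`flaky_remote_write` and its failure probabilities are invented for illustration), a write may succeed, fail outright, or, worst of all, be applied while the acknowledgment is lost, so the caller cannot tell whether it happened.

```python
import random

random.seed(7)  # fixed seed so the demo is reproducible

# a toy remote call in a distributed system: it may succeed,
# fail, or time out without the caller learning whether the
# operation actually took effect
def flaky_remote_write(key, value, store):
    outcome = random.random()
    if outcome < 0.6:
        store[key] = value
        return "ok"                                 # succeeded, and we know it
    elif outcome < 0.8:
        raise ConnectionError("node unreachable")   # failed, write lost
    else:
        store[key] = value                          # write was applied...
        raise TimeoutError("no reply")              # ...but the ack was lost

store = {}
results = []
for i in range(5):
    try:
        results.append(flaky_remote_write(f"k{i}", i, store))
    except (ConnectionError, TimeoutError) as e:
        results.append(type(e).__name__)
```

After a `TimeoutError` the key is actually in `store` even though the client saw an error: the same observable outcome (an exception) corresponds to different true states of the system, which is what makes partial failures nondeterministic and hard to reason about.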