COMP26120 Algorithms and Data Structures
Introducing Algorithms
What is an algorithm?
Computational Problem vs Algorithm
computational problem - a desired relationship between input(s) and output(s)
algorithm - a well-defined set of steps transforming input(s) into output(s), solving a computational problem
Algorithm Properties
Does it solve the computational problem?
Speed
Memory usage
Does performance scale with the size of inputs?
Secure
Good memory locality
Best/worst/average cases?
Deterministic
Does it always terminate?
How good is the output?
Correctness
Does it scale with computational resources?
What is a data structure?
Abstract data type and Data Structure
abstract data type - defines the behaviour of a set of possible operations on data of a certain type
data structure - way to store and organise data in order to facilitate access and modifications
Data Structure Properties
Implements a given abstract data type
How do operations perform in terms of time and memory, and in relation to data size and layout?
Is it deterministic?
What are the trade-offs?
Correctness
Describing algorithms and data structures
Pseudocode
careful English, abstract, no syntax errors
We tend to present algorithms using pseudocode because it allows us to abstract from the details of a particular programming language and be more concise and general.
Abstract machine
sequential execution, constant-time operations, infinite memory
Real code
fixed syntax, low level details, handling memory, compiler/interpreter, virtual machine, OS, memory, architecture
Introducing Algorithmic Complexity
Asymptotic Performance
We focus on the infinite set of large n, ignoring small values of n. As the problem size grows, the properties affected are:
running time
memory/storage requirements
bandwidth/power requirements / logic gates
Asymptotic notation = big-O
Search problems
For an unsorted sequence, we must use linear search, with worst case O(n) and average case n/2 comparisons
For a sorted sequence, we can use binary search. At each iteration, the number of candidate positions is cut in half, so we have worst case O(log n).
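A minimal sketch of binary search in Python on a sorted list (the function name and the example call are illustrative, not part of the notes):

def binary_search(items, target):
    # Return the index of target in the sorted list items, or -1 if absent.
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # the number of candidate positions halves each iteration
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1              # discard the left half
        else:
            hi = mid - 1              # discard the right half
    return -1

# e.g. binary_search([1, 3, 5, 7, 9], 7) == 3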
Analysis of Algorithms
We usually use a generic uniprocessor RAM model: memory is equally expensive to access, instructions are executed sequentially, reasonable instructions take unit time, constant word size
Input size
Time and space complexity is generally a function of the input size
How we characterize input size depends on:
sorting: number of input items
multiplication: total number of bits
graph algorithms: number of nodes and edges
Running time
Number of primitive steps that are executed
Most statements require the same amount of time
Analysis
Worst case
provides an upper bound for running time, is a guarantee
Average case
provides the expected running time
useful, but depends on how you define 'average', e.g. random, equally likely inputs or real-life inputs
Upper and Lower Bounds
Upper Bound Notation
Insertion sort's runtime is O(n^2)
In general, a function f(n) is O(g(n)) if there exist positive constants c and n0 such that f(n) <= c.g(n) for all n >= n0
Lower Bound Notation
Insertion sort's runtime is Ω(n)
In general, a function f(n) is Ω(g(n)) if there exist positive constants c and n0 such that 0 <= c.g(n) <= f(n) for all n >= n0.
Asymptotic Tight Bound
A function f(n) is ϴ(g(n)) if there exist positive constants c1, c2 and n0 such that 0 <= c1.g(n) <= f(n) <= c2.g(n) for all n >= n0
Other Asymptotic Notation
A function f(n) is o(g(n)) if for any positive constant c there exists a constant n0 > 0 such that 0 <= f(n) < c.g(n) for all n >= n0
A function f(n) is ω(g(n)) if for any positive constant c there exists a constant n0 > 0 such that 0 <= c.g(n) < f(n) for all n >= n0
Asymptotic Comparison
f(n) = O(g(n)) is like a <= b
f(n) = Ω(g(n)) is like a >= b
f(n) = ϴ(g(n)) is like a = b
f(n) = o(g(n)) is like a < b
f(n) = ω(g(n)) is like a > b
Abuse of notation: f(n) = O(g(n)) indicates that f(n) ∈ O(g(n))
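A worked instance of these bound definitions (the function is chosen purely for illustration): f(n) = 3n^2 + 10n satisfies 3n^2 + 10n <= 13n^2 for all n >= 1, so f(n) = O(n^2) with c = 13 and n0 = 1; it also satisfies f(n) >= 3n^2 for all n >= 1, so f(n) = Ω(n^2), and hence f(n) = ϴ(n^2).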
Divide and Conquer
Divide and Conquer
Divide the problem into subproblems that are smaller instances of the same problem (D(n))
Conquer the subproblems by solving them recursively (aT(n/b))
Combine the solutions to the subproblems into the solution for the original problem (C(n))
T(n) = ϴ(1) if n <= c, T(n) = D(n) + aT(n/b) + C(n) otherwise
Recurrences
The expression T(n) is a recurrence: an equation or inequality that describes a function in terms of its value on smaller inputs
You can solve recurrences using the:
substitution method
iteration method
master method
Proof by Induction
Review
Show that the property P is true for n>= k
base case
step case or inductive step
Suppose that
P(k) is true for a fixed constant k
P(n) implies P(n+1) for all n >= k
Then P(n) is true for all n >= k
The Substitution Method for Solving Recurrences
Guess the form of the answer, then use induction to find the constants and show that the solution works.
The Substitution Method
The Substitution Method
Induction requires us to show that the solution remains valid for the boundary conditions
Base case: shows that the inequality holds for some n sufficiently small. It holds with respect to the asymptotic notation: T(n) <= an log n for n >= n0. Hint: extend the boundary conditions to make the inductive hypothesis hold for small values of n.
Inductive hypothesis: assume that T(m) <= cm log m for some constant c > 0 holds for all positive m < n (for the specific example)
Inductive step: show that the inequality holds for n
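As a sketch, for the standard textbook example T(n) = 2T(⌊n/2⌋) + n with the guess T(n) = O(n log n) (an assumed example, not taken from the notes), the inductive step runs:

$$T(n) \le 2c\lfloor n/2\rfloor \log\lfloor n/2\rfloor + n \le cn\log(n/2) + n = cn\log n - cn + n \le cn\log n \quad \text{for } c \ge 1.$$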
Making a good guess
Takes experience/creativity
Prove loose upper and lower bounds, then tighten them to home in on the answer
Changing variables
A little algebraic manipulation can make an unknown recurrence look like one you have seen before
Complexity of Recursive Programs
Math and the Perturbation Method
The Perturbation Method
Step 1: Write S(n + 1) in two ways, by splitting off its first term and by splitting off its last term, and equate the two expressions
Step 2: Express both sides in terms of S(n)
Step 3: Isolate S(n) to find the formula for the sum
(Math notes omitted)
Approximation by Integrals
When a summation has the form (the sum of f(k)), where f(k) is a monotonically increasing function, we can approximate it by integrals.
When f(k) is a monotonically decreasing function, we can use a similar method to provide the bounds.
The Iteration Method
Solving Recurrences
Iteration method
expand the recurrence
work some algebra to express as a summation
evaluate the summation
For recurrences of the form T(n) = aT(n/b) + cn, T(n) =
ϴ(n) if a < b
ϴ(n log_b n) if a = b
ϴ(n^(log_b a)) if a > b
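A sketch of the expansion for T(n) = 2T(n/2) + cn (the a = b case), assuming n is a power of 2:

$$T(n) = cn + 2T(n/2) = 2cn + 4T(n/4) = \cdots = kcn + 2^{k}T(n/2^{k}) \;=\; cn\log_2 n + nT(1) = \Theta(n\log n) \quad \text{at } k = \log_2 n.$$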
The master method
The Master Method
Given a divide and conquer algorithm:
An algorithm that divides a problem of size n into a subproblems, each of size n/b
Let the cost of dividing the problem and combining the solved subproblems be described by the function f(n) = D(n) + C(n)
The master method provides a "cookbook" method for solving recurrences of the form T(n) = aT(n/b) + f(n); then T(n) =
ϴ(n^(log_b a)) if f(n) = O(n^(log_b a − ε))
ϴ(n^(log_b a) log n) if f(n) = ϴ(n^(log_b a))
ϴ(f(n)) if f(n) = Ω(n^(log_b a + ε)) AND af(n/b) <= cf(n) for large n
for some constants ε > 0 and c < 1
a >= 1, b > 1, and f(n) is asymptotically positive
Cookbook Method
Step 1: Identify a, b and f(n)
Step 2: Determine n^(logb(a))
Step 3: Compare f(n) to n^(logb(a))
Step 4: According to the case, apply the corresponding rule
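Two standard worked instances (assumed examples, not from the notes): merge sort's recurrence T(n) = 2T(n/2) + n has a = 2, b = 2, so n^(log_2 2) = n and f(n) = ϴ(n), which is case 2, giving T(n) = ϴ(n log n). For T(n) = 9T(n/3) + n, n^(log_3 9) = n^2 and f(n) = n = O(n^(2 − ε)) with ε = 1, which is case 1, giving T(n) = ϴ(n^2).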
Amortized Analysis
General Strategy of amortisation and dynamic tables
Amortised analysis
We average the time required to perform a sequence of data structure operations over all the operations performed.
Amortised analysis differs from average-case analysis in that probability is not involved. It guarantees the average performance of each operation in the worst case.
Types of Amortised Analysis
Aggregate method - obtain the average cost per operation
Accounting method - determine the amortised cost of each operation
Potential method - determine the amortised cost of each operation; it may overcharge operations early on to compensate for undercharges later.
Hash table (Large Universe)
The universe contains many elements
The number of keys to be inserted is much less than the universe
It makes no sense to have a position for each element of the universe.
How large should a hash table be?
As small as possible, but large enough not to overflow
If we do not know the size in advance we have dynamic tables
Whenever the table overflows we make a new larger table and re-allocate each value.
Typical implementations are ArrayList in Java and std::vector in C++
Worst-case analysis
When we insert n + 1 values into a hash table of size n, all but one insertion takes constant time, but one would take O(n) time.
We can take the average of this, finding that pushing elements onto the dynamic array takes ϴ(n)/n = ϴ(1)
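A minimal sketch of a doubling dynamic table in Python, illustrating why the amortised cost of append is ϴ(1) (the class and method names are illustrative):

class DynamicTable:
    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.slots = [None] * self.capacity

    def append(self, value):
        if self.size == self.capacity:          # table has overflowed: make a larger table
            self.capacity *= 2
            new_slots = [None] * self.capacity
            new_slots[:self.size] = self.slots  # re-allocate each existing value, O(n)
            self.slots = new_slots
        self.slots[self.size] = value           # the cheap O(1) case
        self.size += 1

# n appends cost at most n + (1 + 2 + 4 + ... + n) < 3n primitive steps in total,
# so the amortised cost per append is O(1).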
The aggregate method
Aggregate Analysis
For all n, a sequence of n operations takes worst-case time T(n) in total. In the worst case, the average cost, or amortised cost, per operation is therefore T(n)/n.
This amortised cost applies to each operation. The accounting method and the potential method may assign different amortised costs to different types of operations.
The running time of each stack operation is O(1); we did not use probabilistic reasoning.
This gives a worst-case bound of O(n) on a sequence of n operations.
Dividing this total cost by n yields the average cost per operation, or the amortised cost.
Summary
In an amortised analysis, we average the time required to perform a sequence of data-structure operations over all the operations performed
Hash Tables
Data Structures vs Data Types
Abstract Data Types
For a given data type, defines the possible values, operations and behaviour
Data Structure
A way to store and organise data to facilitate access and modifications. They implement ADTs.
Dynamic Arrays vs Linked Lists
Memory footprint
Dynamic arrays have O(n) wasted space
Linked Lists have space overhead for pointer
Access
Searching for a value is O(n) in both; dynamic arrays offer O(1) access by index, while linked lists require O(n)
Sets and Dictionaries
Sets are unordered collections
Dictionaries are a generalisation of arrays to non-integer indices
Priority Queues and Disjoint Sets
Priority Queue is a queue that is ordered by priority
Disjoint sets are a collection of sets that can be merged
Hash Maps
Look-up table
A value can be put in a look-up table at its key's position, but nothing can be done if that position is not part of the table.
In a hash map, a key can be made to fit into the table by hashing it and taking the result modulo the table size, but this can result in collisions.
Collisions can be kept in a list at table entry (separate chaining)
Hash Functions
Hash functions map the universe to a set of integers
Uniformity in compression
N could be prime, making it less likely for patterns to occur
N could be a power of 2, but more collisions can occur due to indexing on the least significant bits
We can XOR the key with a shifted copy of itself to mix in the higher-order bits
Hash codes are used to change a sequence of integers into one integer
We must avoid symmetry, so we use polynomial hash codes
These weight each value by its position in the sequence, so reorderings of the same values produce different codes
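A sketch of a polynomial hash code followed by modular compression (the base a = 31 and table size m = 101 are illustrative choices, not from the notes):

def polynomial_hash(key, a=31, m=101):
    # Weight each character by its position using Horner's rule, then compress mod m.
    code = 0
    for ch in key:
        code = code * a + ord(ch)   # c0*a^(k-1) + c1*a^(k-2) + ... + ck
    return code % m

# polynomial_hash("stop") != polynomial_hash("pots"): position matters, so
# permutations of the same characters usually receive different codes.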
Hash Map Collisions
Open addressing
One entry per bucket, with probing schemes for when a bucket is occupied
Linear probing - creates clusters
Quadratic probing - creates secondary clusters as it fills in the quadratically available space
Double hashing - uses a second hash function to choose the probe step when the first causes a collision, avoiding clustering
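A sketch of the probe sequences these schemes generate for a table of m buckets (h1 and h2 are assumed hash functions supplied by the caller; the double-hashing step is forced to be non-zero):

def probe_sequence(key, m, h1, h2=None, scheme="linear"):
    # Yield the bucket indices to try, in order, for the chosen scheme.
    start = h1(key) % m
    for i in range(m):
        if scheme == "linear":
            yield (start + i) % m                              # consecutive buckets: primary clustering
        elif scheme == "quadratic":
            yield (start + i * i) % m                          # quadratic jumps: secondary clustering
        elif scheme == "double":
            yield (start + i * (h2(key) % (m - 1) + 1)) % m    # step size depends on the key itself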
Complexity and Rehashing
Load Factor
The load factor of a hash map of capacity m containing n objects is n/m
It tells us how well utilised our hash map is
greater than 1 : a bucket has more than one entry
0.5: half full
→ 0: a lot of space is wasted
Complexity
In the worst case, collision mechanisms degenerate into linear search
Average case makes assumptions:
Space is O(n), insert, find and remove are O(1)
All of their worst cases are O(m)
Average Case Complexity
Assumptions
hash function is uniform
load factor less than 1
operations constant time
Separate chaining and open addressing average case simplifies to O(1)
Rehashing
Create a new table that is a fixed percentage larger, reinsert all existing elements and delete the old table
This happens when the load factor reaches between 0.5 and 0.75 for open addressing, but for separate chaining it can be pushed higher
The cost of rehashing is amortised across all insertions
Search Trees
What are trees?
Binary tree - nodes have at most 2 children
General tree - At least one node has more than 2 children
Proper binary tree - Nodes have either 0 or 2 children
Complete binary tree - All levels are full, except the last where all nodes are to the left
Perfect binary tree - tree is proper and all external nodes have the same depth
Representation of trees
Array representations
The left child for each node i is at 2i + 1, while the right is at 2i + 2
This wastes a lot of space for deep trees that resemble linear lists, as we have to leave many blank positions
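The index arithmetic as a tiny sketch (0-based indices, matching the formula above):

def left(i):   return 2 * i + 1
def right(i):  return 2 * i + 2
def parent(i): return (i - 1) // 2   # the root sits at index 0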
Linked representation
Each node is made up of four parts - the element, a parent pointer, and left and right child pointers
For a general tree, we use a three part node with a children pointer to a container of pointers to the children
This reduces wasted space
Searching and Traversing trees
Pre-Order Traversal - current, left and right
In-Order Traversal - left, current right
Post-Order Traversal - left, right, current
Depth First Search - looks at the current node, then the right and left children, pushing them onto a stack until the target is discovered
Breadth First Traversal - starts with the root, then looks at the left and right children, then the children of those nodes, and so on
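Minimal sketches of these traversals on a linked binary tree (the Node class is an illustrative assumption, not from the notes):

from collections import deque

class Node:
    def __init__(self, element, left=None, right=None):
        self.element, self.left, self.right = element, left, right

def pre_order(node):                  # current, left, right
    if node:
        print(node.element)
        pre_order(node.left)
        pre_order(node.right)

def in_order(node):                   # left, current, right
    if node:
        in_order(node.left)
        print(node.element)
        in_order(node.right)

def post_order(node):                 # left, right, current
    if node:
        post_order(node.left)
        post_order(node.right)
        print(node.element)

def breadth_first(root):              # level by level, using a queue
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node:
            print(node.element)
            queue.append(node.left)
            queue.append(node.right)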
Binary Search Trees
BSTs
Invariant: leftChild.element() < node.element() < rightChild.element()
In-order traversal of BST gives us the elements in ascending order
Search and insert have worst case O(h) where h is the height of the tree
Remove in BST
leaves can be removed directly
a node with a single child can be replaced by elevating that child
for a node with two children, take the in-order successor and place it, together with its subtree, in the removed node's position
if the in-order successor lies deeper within a subtree, splice it out by replacing it with its right child, then put it in the removed node's place
Average complexities of BST
Assuming uniform distribution of elements, find, insert and delete have average complexities of O(logn) with worst cases of O(n)
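A sketch of BST find and insert, reusing the illustrative Node class from the traversal sketch (worst case O(h), average O(log n) under the uniform-distribution assumption):

def bst_find(node, key):
    # Follow the BST invariant at each node until key is found or a leaf is passed.
    while node is not None:
        if key == node.element:
            return node
        node = node.left if key < node.element else node.right
    return None

def bst_insert(node, key):
    # Insert key into the subtree rooted at node and return the (possibly new) subtree root.
    if node is None:
        return Node(key)
    if key < node.element:
        node.left = bst_insert(node.left, key)
    elif key > node.element:
        node.right = bst_insert(node.right, key)
    return node   # duplicates are ignored in this sketch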
AVL Trees: self balancing BST
Height Balancing Property
B(n) = H(n.leftChild()) - H(n.rightChild())
> 0 = left heavy
< 0 = right heavy
0 = just right
AVL Tree
a special type of BST with another invariant: balance at all nodes is |B(n)| <= 1
The Height Balance Property ensures that subtrees are of roughly equal height
Rotation operations occur after insert/delete to maintain the height balance property
Rotation
a tree manipulation operation
Moves nodes around whilst keeping the invariant, changing the root of a subtree
Right rotation - shifts nodes to the right of the root
Left rotation - shifts nodes to the left of the root
Double rotations
When a single rotation is not good enough it is because a right/left heavy tree has a left/right heavy subtree
Left-right Rotation - for a left-heavy tree whose left subtree is right-heavy: left rotate the left subtree, then right rotate the original root
Right-left Rotation - the opposite
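A sketch of the two single rotations on the same illustrative Node shape (a full AVL implementation would also store and update heights, omitted here):

def rotate_right(root):
    # Promote the left child: it becomes the new root of this subtree.
    pivot = root.left
    root.left = pivot.right    # pivot's right subtree moves across
    pivot.right = root
    return pivot               # caller re-attaches the returned new root

def rotate_left(root):
    # Mirror image: promote the right child.
    pivot = root.right
    root.right = pivot.left
    pivot.left = root
    return pivot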
Complexities
The height of an AVL tree with n nodes is O(log n)
Binary Heaps, Priority Queues and Skip Lists
Binary Heaps
A binary heap is a special type of Binary Tree
It is a complete binary tree
It satisfies the heap order property
MIN Heap Order Property
Each node's value is greater than or equal to that of its parent
MAX Heap Order Property
Each node's value is less than or equal to that of its parent
Maintaining a Max Heap
Ensure that the property has not been violated
assume the left and right children are roots of valid max heaps
check the node against those values
ensure the largest value is at the top
recursively call on the left or right child if a swap occurred
Building a Max Heap
Given that we can represent a max heap as an array, we can build an existing array into a max heap
We do this by allowing values to bubble up/down through the heap
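A sketch of bubble-down (max-heapify) and bottom-up construction on the array representation, with children at 2i + 1 and 2i + 2 (function names illustrative):

def max_heapify(heap, i, n):
    # Restore the max heap order property at index i, assuming both subtrees are valid max heaps.
    while True:
        left, right, largest = 2 * i + 1, 2 * i + 2, i
        if left < n and heap[left] > heap[largest]:
            largest = left
        if right < n and heap[right] > heap[largest]:
            largest = right
        if largest == i:
            return                                           # heap order property holds
        heap[i], heap[largest] = heap[largest], heap[i]      # swap the larger child upwards
        i = largest                                          # keep bubbling down

def build_max_heap(values):
    # Turn an arbitrary list into a max heap in place.
    n = len(values)
    for i in range(n // 2 - 1, -1, -1):   # internal nodes, bottom-up
        max_heapify(values, i, n)
    return values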
Complexities
Space = O(n)
Building = O(n) overall, even though each individual bubble operation is O(log n)
Height = O(logn)
Priority Queues
A priority queue is a queue that has some priority associated with each element
Elements with the highest (or lowest) priority are exited first
Remove from Max Heap
Remove from the root, swap the last element into the root's place and bubble-down the swapped element, maintaining the heap order property
Insert to Max Heap
Insert at the next available insertion point and check against the parent. If there is a violation of the heap order property, swap upwards until there is no longer a violation
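A sketch of these two priority-queue operations on the array representation, reusing max_heapify from the heap sketch above (names illustrative):

def heap_insert(heap, value):
    heap.append(value)                      # next available insertion point
    i = len(heap) - 1
    while i > 0 and heap[(i - 1) // 2] < heap[i]:
        heap[i], heap[(i - 1) // 2] = heap[(i - 1) // 2], heap[i]   # swap with parent
        i = (i - 1) // 2

def heap_extract_max(heap):
    heap[0], heap[-1] = heap[-1], heap[0]   # swap the last element into the root's place
    top = heap.pop()                        # the old root, i.e. the maximum
    max_heapify(heap, 0, len(heap))         # bubble down the swapped element
    return top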
Skip Lists
Skip List
A randomised data structure: a sorted list in which each node has a variable-height tower of forward pointers
Easy to implement as it is a generalisation of a sorted linked list
They allow us to skip over many items
Searching
Skip over as much as possible; if the next key is less than or equal to my search key, I can skip forward
If not then I have to go down a level
Inserting
Search for the position where the item is to be inserted, then insert it, flipping a coin to decide the height of its tower and fixing the pointers
How large is h?
Skip lists are probabilistic, so we can talk about the probability that h reaches a certain level
Probability of tower reaching level i is 1/2^i
Probability that level i has at least one item is at most n/2^i
Complexities
Search
Worst case: all towers are h high, giving O(n + h), where h is O(log n) with high probability
Average O(logn)
Space O(n)
Disjoint Sets
Disjoint Sets
A collection of disjoint sets that can be merged, operations add, find, union
Connected components problem
Given a set of edges between components, find sets of connected components
Path Compression
Path compression involves moving the parent pointers so they all point to the root along a specific path. It involves two passes over the tree.
Path splitting splits the path into two, shortening them both.
Path halving involves moving items straight to their grandparents, making union and find operations efficient
Fast Find and Fast Union
Linked structures give us pointers to each item, giving find a worst case of O(1); however, union would involve updating up to n pointers, so O(n)
Array based structures give the same
Amortised analyis of union(X, Y)
m union operations on disjoint sets of size n
one set will at least double in size
This doubling can happen logn times
Worst case O(m+n log n)
Amortised over m operations, O(log n)
Rooted Trees
Contains parent pointers only, making union's time complexity Find + O(1); but find is O(n), making union O(n)
We can make this smaller with path compression
Array based structure
Value stored at index [i] is the parent of node i in the rooted tree
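A minimal sketch of this array-based rooted-tree representation with union by size and path compression (the class and method names are illustrative):

class DisjointSets:
    def __init__(self, n):
        self.parent = list(range(n))   # parent[i] is the parent of node i; roots point to themselves
        self.size = [1] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:          # path compression: repoint the whole path at the root
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.size[rx] < self.size[ry]:      # attach the smaller tree under the larger
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]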
Algorithmic Techniques I
Introducing Algorithmic Techniques
How do we devise an algorithm to solve a problem
Common solution: turn it into something we know
Does it look similar to a known problem?
Does the problem have properties that allow us to apply a known algorithmic technique?
Kinds of Problems
Decision problems - yes/no
Functional problem - unique solution
Search problem - unique solution from a feasible set
Counting problem - count the number of solutions in a feasible set
Optimisation problem - best known solution from a feasible set
Algorithmic Techniques or Paradigms: Decomposition View
Problem Decomposition
Divide-and-conquer
Dynamic programming
Solution Space Search
Brute Force (Enumeration, Backtracking)
Branch-and-Bound
Greedy Search
Heuristic
The Solution Space View
Considering what possible solutions to the problem might look like and how to explore those solutions
e.g. searching through brute force or binary search (a form of greedy search with divide and conquer)
Enumerative Search with Pruning
Solution Space Enumeration
Exploring all solutions to a problem is called brute-force
If we are generating the set of solutions incrementally and searching this then we call this enumeration
Often an enumeration involves some choice which leads naturally to the solution space having a structure
The leaves of the tree are the set of all possible solutions; with n solutions, the complexity of brute force is O(n)
Depth-First Search + Pruning
To improve the efficiency of searching, we can use backtracking and branch-and-bound
Explore a tree of partial solutions
Prune branches and backtrack
In backtracking we prune a branch if it cannot be extended to a solution at all
In branch-and-bound we prune a branch if it cannot produce a better solution than one that exists on another branch
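A minimal backtracking sketch for the subset-sum decision problem (a standard illustration, assuming non-negative values; not taken from the notes): partial solutions are extended one choice at a time, and a branch is pruned as soon as it cannot be extended to a solution.

def subset_sum(values, target, index=0, current=0):
    # Return True if some subset of values[index:] added to current reaches target.
    if current == target:
        return True                        # a complete solution has been found
    if index == len(values) or current > target:
        return False                       # prune: this branch cannot reach the target (non-negative values)
    # branch on including or skipping values[index], backtracking if neither works
    return (subset_sum(values, target, index + 1, current + values[index]) or
            subset_sum(values, target, index + 1, current))

# e.g. subset_sum([3, 34, 4, 12, 5, 2], 9) is True (4 + 5)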
Greedy Algorithms
A greedy algorithm always makes the choice that looks best at each step
This is not always guaranteed to find the best solution
Greedy choice property
We can find a global optimum by making a series of locally optimal choices
OR
We can make a series of choices such that we never need to go back and change our choice later
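A sketch of a greedy algorithm where the greedy choice property holds: interval scheduling, in which repeatedly taking the compatible activity that finishes earliest never needs to be revisited (a standard example, not from the notes).

def max_non_overlapping(intervals):
    # Greedily pick a maximum-size set of pairwise non-overlapping (start, finish) intervals.
    chosen, last_finish = [], float("-inf")
    for start, finish in sorted(intervals, key=lambda iv: iv[1]):   # earliest finish time first
        if start >= last_finish:            # locally optimal: compatible and finishes soonest
            chosen.append((start, finish))
            last_finish = finish
    return chosen

# e.g. max_non_overlapping([(1, 4), (3, 5), (0, 6), (5, 7)]) -> [(1, 4), (5, 7)]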
Algorithmic Techniques II
Divide and Conquer
This is a top-down approach - break a large problem into smaller ones
A bottom-up approach would combine smaller solutions to get a whole one
Memoization
Memoization: saving results that you can reuse, e.g. in Fibonacci you may have to calculate some numbers multiple times, so we can save them in a lookup table (like a map)
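A minimal memoised Fibonacci sketch, using a dictionary as the lookup table:

def fib(n, memo=None):
    # Top-down: each result is saved so it is computed only once.
    if memo is None:
        memo = {}
    if n <= 1:
        return n
    if n not in memo:
        memo[n] = fib(n - 1, memo) + fib(n - 2, memo)
    return memo[n]

# Without the memo the same subproblems are recomputed exponentially many times;
# with it, fib(n) takes O(n) additions.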
Introduction to Dynamic Programming
Bottom-up approach
Dynamic Programming
Dynamic Programming Requirements
Optimal Substructure
Simple/similar subproblems
Subproblem optimality - we can put them together to produce an optimal solution
Overlapping subproblems
Unrelated subproblems contain subproblems in common. This is the key difference from divide-and-conquer.
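For contrast with the top-down memoised version, a bottom-up (dynamic programming) Fibonacci sketch that fills the table of subproblem results in order:

def fib_bottom_up(n):
    # Solve the smallest subproblems first and combine them, with no recursion.
    if n <= 1:
        return n
    table = [0] * (n + 1)
    table[1] = 1
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]   # overlapping subproblems are read from the table
    return table[n]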