# Complexity

A general way to describe the efficiency of algorithms (e.g. linear vs. exponential), independent of the computer's architecture/speed.

## The RAM - random-access machine

Model of computer used in this course. Has random-access memory.

### Basic types and basic operations

Has basic types (like int, float, 64-bit words). A basic step is an operation on a basic type (load, store, add, sub, ...). A branch is a basic step. Invoking a function and returning is a basic step as well, but the entire execution of the function takes longer.

Complexity is not measured in terms of the input value but in terms of the input size in bits. `Fibonacci(n)` is linear in `n` (the value of the input) but exponential in `l` (the number of bits of `n`, i.e. the size of the input).

By default, WORST-case complexity is considered.

## Donald Knuth's A-notation

A(c) indicates a quantity that is absolutely at most c.

Antonio's weight = A(100) (the "=" is pronounced "is").

## (big-) O-notation

f(n) = O(g(n))

*Definition:* if f(n) is such that f(n) = k * A(g(n)) for all _n_ sufficiently large and for some constant k > 0, then we say that f(n) = O(g(n)).

# Complexity notations (lecture 2019-02-26)

## Characterizing unknown functions

pi(n) = number of primes less than n

## First approximation

- *Upper bound:* linear function: pi(n) = O(n)
- *Lower bound:* constant function: pi(n) = Omega(1)
- *Non-trivial tight bound:* pi(n) = theta(n / log n)

## Theta notation

Given a function g(n), we define the __family__ of functions theta(g(n)): a function f(n) belongs to theta(g(n)) if there exist constants c_1, c_2 > 0 and an n_0 such that for all n >= n_0, f(n) is sandwiched between c_1*g(n) and c_2*g(n), i.e. c_1*g(n) <= f(n) <= c_2*g(n).

## Big omega notation

Omega(g(n)) is the family of functions f(n) such that there exist a c > 0 and an n_0 such that for all n >= n_0, f(n) dominates c*g(n), i.e. f(n) >= c*g(n).

## Big "oh" notation

O(g(n)) is the family of functions f(n) such that there exist a c > 0 and an n_0 such that for all n >= n_0, f(n) is dominated by c*g(n), i.e. f(n) <= c*g(n).

## Small "oh" notation

o(g(n)) is the family of functions O(g(n)) excluding all the functions in theta(g(n)).

## Small omega notation

omega(g(n)) is the family of functions Omega(g(n)) excluding all the functions in theta(g(n)).

## Recap

- *asymptotically* =  <=> theta(g(n))
- *asymptotically* <  <=> o(g(n))
- *asymptotically* >  <=> omega(g(n))
- *asymptotically* <= <=> O(g(n))
- *asymptotically* >= <=> Omega(g(n))

# Insertion sort

## Complexity

- *Best case:* linear (theta(n))
- *Worst case:* number of swaps = 1 + 2 + ... + (n-1) = (n-1)n/2 = theta(n^2)
- *Average case:* number of swaps is half the worst case = n(n-1)/4 = theta(n^2)

## Correctness

The proof is, in essence, by induction. An algorithm is correct if, for every input, the output satisfies the stated conditions. The algorithm must also terminate.

### The loop invariant

An invariant condition that makes a loop equivalent to a straight path in an execution graph: it holds before the first iteration and is preserved by every iteration, so at termination it gives the desired property.

# Heaps and Heapsort

A data structure is a way to structure data. A binary heap is stored in an array and can be of two types: max-heap and min-heap.

## Interface of a heap

- `Build_max_heap(A)` rearranges `A` into a max-heap;
- `Heap_insert(H, key)` inserts `key` in the heap;
- `Heap_extract_max(H)` extracts the maximum `key`;
- `H.heap_size` returns the size of the heap.
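As a rough usage sketch (not the course's implementation): Python's standard `heapq` module exposes the analogous operations for a *min*-heap, so the shape of the interface above can be tried out like this:

```python
import heapq

A = [5, 1, 7, 3, 9]
heapq.heapify(A)             # like Build_max_heap(A), but builds a min-heap in place

heapq.heappush(A, 0)         # like Heap_insert(H, key)

smallest = heapq.heappop(A)  # like Heap_extract_max(H), but extracts the minimum

print(smallest, len(A))      # the heap is a plain list, so its size is len(A); prints: 0 5
```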
A binary heap is like a binary tree mapped on an array:

```
       1
      / \
     /   \
    2     3
   / \   / \
  4   5 6   7      =>   [1 2 3 4 5 6 7]
```

The parent position of `n` is the integer division of `n` by 2:

```python
def parent(x):
    return x // 2
```

The left child of `n` is `n` times 2, and the right child is `n` times 2 plus 1:

```python
def left(x):
    return x * 2

def right(x):
    return x * 2 + 1
```

**Max-heap property**: for all i > 1, A[parent(i)] >= A[i]

# Some data structures

A data structure is a way to organize information.
A data structure has an interface (functions to work with the DS).
A data structure has data and meta-data (like size, length).

## Stack (LIFO)

### Operations:

- `push(S, x)` (put one element, move TOS)
- `pop(S)` (remove the element at TOS, move TOS)
- `stack-empty(S)` (returns TRUE if the stack is empty)

## Queue (FIFO)

### Structure

- Based on an array
- `length`
- `head`
- `tail`

The queue always keeps one cell free, to avoid confusing a full queue with an empty one.

## Dictionary

Data structure for fast search.
A way to implement a dictionary is a *direct-access table*.

### API of a dictionary

- `Insert(D, k)` inserts a key `k` into dictionary `D`
- `Delete(D, k)` removes key `k`
- `Search(D, k)` tells whether `D` contains a key `k`

Many different implementations exist.

# Direct-access tables

- universe of keys = {1, 2, ..., M}
- array `T` of size M
- each key has its own position in T

## The 'dumb' approach

```python
def Insert(D, k):
    D[k] = True

def Delete(D, k):
    D[k] = False

def Search(D, k):
    return D[k]
```

Everything is O(1), but memory complexity is awful. Solve this by mapping keys to smaller keys using a hash function. Table T is many times smaller than the universe, and its size differs only by a small constant factor from the size of the set S of keys actually stored. Instead of using keys directly as indices, the index returned by the hash function is used.

**Pigeonhole problem:** there are more keys (pigeons) in the universe than holes (indices of table T), so collisions are unavoidable.

## The solution: the *chained hash table*

For every cell of table T, don't store True/False but a linked list of the keys that hash to that index. This is called a *chained hash table*.

```python
def Chained_hash_insert(T, k):
    return List_insert(T[hash(k)], k)

def Chained_hash_search(T, k):
    return List_search(T[hash(k)], k)
```

Elements are spread evenly across the table if the hash function is good. *alpha* = *n / |T|* is the average length of the linked lists inside the table (where *n* is the number of elements in the table and *|T|* is the size of the table).

A good hash table implementation keeps *alpha* at O(1): if *n*, the number of elements that we want to store in the hash table, grows, then *|T|* must also grow. *alpha* determines the average-case time complexity of both insertion and search.

## Growing a *chained hash table*

In order to grow a table, a new table must be created. The hash function (or its range parameters) must be changed as well.

### Rehashing

*Rehashing* is the process of putting all the elements of the old table into the new table according to the new hash function. The complexity is O(n), since `Chained-hash-insert` is constant.

### Growing the table by a constant amount every time

If the table is grown by a constant number of cells every time it *overflows*, then the total cost of n insertions is O(n^2) due to all the *rehashing* needed.

### Growing the table by doubling the size

If the table size is doubled when *overflowing*, then the total cost of n insertions becomes linear again (amortized O(1) per insertion), at the price of some extra memory.
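A minimal sketch of these ideas, assuming Python lists as the per-slot chains and the built-in `hash` as the hash function (the class name, methods, and growth policy are illustrative, not the course's code):

```python
class ChainedHashTable:
    def __init__(self, size=8):
        self.slots = [[] for _ in range(size)]   # one chain per index of T
        self.n = 0                               # number of stored keys

    def _index(self, k):
        return hash(k) % len(self.slots)         # map a key to a slot of T

    def insert(self, k):
        self.slots[self._index(k)].append(k)
        self.n += 1
        if self.n > len(self.slots):             # alpha would exceed 1: grow by doubling
            self._rehash(2 * len(self.slots))

    def delete(self, k):
        chain = self.slots[self._index(k)]
        if k in chain:
            chain.remove(k)
            self.n -= 1

    def search(self, k):
        return k in self.slots[self._index(k)]   # scan one chain: O(1 + alpha) on average

    def _rehash(self, new_size):
        old = self.slots
        self.slots = [[] for _ in range(new_size)]
        for chain in old:                        # reinsert every key under the new size: O(n)
            for key in chain:
                self.slots[self._index(key)].append(key)
```

For example, `t = ChainedHashTable(); t.insert(42); t.search(42)` returns `True`.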
# Binary search trees

Implementation of a dynamic set over a *totally ordered* universe of keys (an order relation that is transitive, antisymmetric, and total, i.e. any two keys are comparable).

## Interface

- `Tree-Insert(T, k)` adds key `k` to tree `T`;
- `Tree-Delete(T, k)` deletes key `k` from tree `T`;
- `Tree-Search(T, k)` returns whether key `k` is in tree `T`;
- `Tree-Minimum(T)` finds the smallest element in the tree;
- `Tree-Maximum(T)` finds the biggest element in the tree;
- `Tree-Successor(T, k)` and `Tree-Predecessor(T, k)` find the next and previous keys (in sorted order) in the tree.

## Height of a tree

The height of a tree is the maximum number of edges traversed from parent to child in order to reach a leaf from the root of the tree.

## Rotation

A rotation exchanges the roles of a parent `b` and its child `a` while preserving the ordering of the three subtrees:

```
        b
       / \
      a   \
     / \   \
    /   \   \
k <= a  a <= k <= b  k >= b
```

```
      a
     / \
    /   b
   /   / \
  /   /   \
k <= a  a <= k <= b  k >= b
```

# Red-Black trees

1. Every node has a color: red or black
2. The root is black
3. Every NULL leaf node is black
4. If a node is red, both of its children are black
5. The black height (the number of black nodes on a root-to-leaf path) is the same for every branch of the tree

A red node is a way to stretch the tree, but you cannot stretch it too far.

# B-trees

Disk access sucks (a billion times slower than registers). So, ignore the RAM model and come up with a better solution.

If the number of keys per node is increased to k, the complexity of tree search becomes log_{k+1}(n) disk accesses. We can afford to search among the keys of a node linearly, because we have to wait for the disk anyway. And log_1000(n) is much better than log_2(n).

## Implementation

- Minimum degree (min. number of children in each node) defined as `t` (`t >= 2`)
- Every node other than the root must have at least `t - 1` keys
- Every node must contain at most `2t - 1` keys
- Every node has a boolean flag saying whether it is a leaf

### Insertion

- Read the node from disk
- If it is full, split it
- If it is a leaf, insert the key in the right position
- If it is not a leaf, recurse on the correct branch based on the keys

### Search

- Read the node from disk
- Linearly search among the keys
- If the key is found, return True
- Otherwise, if the node is a leaf, return False
- Otherwise, recurse on the correct branch
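A minimal sketch of this search procedure, assuming a node is just a Python object holding a sorted list of keys, a list of children, and a leaf flag ("reading a node from disk" is plain attribute access here; all names are illustrative):

```python
class BTreeNode:
    def __init__(self, keys=None, children=None, leaf=True):
        self.keys = keys or []          # sorted keys stored in this node
        self.children = children or []  # for internal nodes: len(children) == len(keys) + 1
        self.leaf = leaf

def btree_search(node, k):
    # Linear scan among the keys (cheap compared to a disk access).
    i = 0
    while i < len(node.keys) and k > node.keys[i]:
        i += 1
    if i < len(node.keys) and node.keys[i] == k:
        return True                     # key found in this node
    if node.leaf:
        return False                    # reached a leaf without finding the key
    return btree_search(node.children[i], k)   # recurse on the correct branch

# Example: a small tree with root keys [10, 20] and three leaf children.
root = BTreeNode(
    keys=[10, 20],
    children=[BTreeNode([2, 5]), BTreeNode([12, 17]), BTreeNode([25, 30])],
    leaf=False,
)
print(btree_search(root, 17), btree_search(root, 21))  # -> True False
```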