
Complexity

General way to describe the efficiency of algorithms (e.g. linear vs. exponential), independent of the computer's architecture/speed.

The RAM - random-access machine

Model of computer used in this course.

Has random-access memory.

Basic types and basic operations

Has basic types (like int, float, 64-bit words). A basic step is an operation on a basic type (load, store, add, sub, ...). A branch is a basic step. Invoking a function and returning from it are basic steps as well, but executing the whole function takes longer.

Complexity is not measured as a function of the input value but of the input size in bits. Fibonacci(n) is linear in n (the value) but exponential in l (the number of bits of n, i.e. the size of the input).
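For instance, a minimal sketch of the iterative algorithm (an illustration, not the lecture's code):

def fibonacci(n):
  # theta(n) additions: linear in the value n, but n can be as large as
  # 2^l for an l-bit input, so this is exponential in the input size l
  a, b = 0, 1
  for _ in range(n):
    a, b = b, a + b
  return a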

By default, the worst-case complexity is considered.

Donald Knuth's A-notation

A(c) indicates a quantity that is absolutely at most c

Antonio's weight = A(100) (the "=" is pronounced "is")

(big-) O-notation

f(n) = O(g(n))

Definition: if f(n) is such that f(n) = k * A(g(n)) for all n sufficiently large and for some constant k > 0, then we say that f(n) = O(g(n)).

Complexity notations (lecture 2019-02-26)

Characterizing unknown functions

pi(n) = number of primes less than or equal to n

First approximation

Upper bound: linear function

pi(n) = O(n)

Lower bound: constant function

pi(n) = Omega(1)

Non-trivial tight bound:

pi(n) = theta(n / log n) (by the prime number theorem)

Theta notation

Given a function g(n), we define theta(g(n)) as the family of functions f(n) for which there exist constants c_1, c_2 > 0 and an n_0 such that for all n >= n_0, f(n) is sandwiched between c_1*g(n) and c_2*g(n).

Big omega notation

Omega(g(n)) is the family of functions f(n) for which there exist a c > 0 and an n_0 such that for all n >= n_0, f(n) dominates c*g(n).

Big "oh" notation

O(g(n)) is the family of functions f(n) for which there exist a c > 0 and an n_0 such that for all n >= n_0, f(n) is dominated by c*g(n).

Small "oh" notation

o(g(n)) is the family of functions O(g(n)) excluding all the functions in theta(g(n))

Small omega notation

omega(g(n)) is the family of functions Omega(g(n)) excluding all the functions in theta(g(n))

Recap

  • asymptotically =  <=> theta(g(n))
  • asymptotically <  <=> o(g(n))
  • asymptotically >  <=> omega(g(n))
  • asymptotically <= <=> O(g(n))
  • asymptotically >= <=> Omega(g(n))

Insertion sort

Complexity

  • Best case: Linear (theta(n))
  • Worst case: Number of swaps = 1 + 2 + ... + n-1 = (n-1)n/2 = theta(n^2)
  • Average case: Number of swaps half of worst case = n(n-1)/4 = theta(n^2)
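A minimal sketch of insertion sort (a standard formulation, not necessarily the lecture's pseudocode):

def insertion_sort(A):
  for j in range(1, len(A)):
    key = A[j]
    i = j - 1
    # shift elements bigger than key one slot to the right
    while i >= 0 and A[i] > key:
      A[i + 1] = A[i]
      i -= 1
    A[i + 1] = key

On sorted input the while loop never runs, giving the best-case theta(n); on reversed input every pair is shifted, giving the theta(n^2) worst case.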

Correctness

The proof is essentially by induction.

An algorithm is correct if given an input the output satisfies the conditions stated. The algorithm must terminate.

The loop invariant

An invariant condition that makes a loop equivalent to a straight path in the execution graph: if it holds before an iteration, it still holds after it.

Heaps and Heapsort

A data structure is a way to structure data. A binary heap is a binary tree laid out in an array, and it can be of two types: max-heap and min-heap.

Interface of a heap

  • Build_max_heap(A) rearranges A into a max-heap;
  • Heap_insert(H, key) inserts key in the heap;
  • Heap_extract_max(H) extracts the maximum key;
  • H.heap_size returns the size of the heap.

A binary heap is like a binary tree mapped on an array:

    1
   / \
  /   \
 2     3
/ \   / \
4 5   6 7

=> [1, 2, 3, 4, 5, 6, 7]

The parent position of n is the integer division of n by 2:

def parent(x):
  # 1-based indexing: the root is at index 1
  return x // 2

The left of n is n times 2, and the right is n times 2 plus 1:

def left(x):
  return x * 2      # left child, 1-based indexing

def right(x):
  return x * 2 + 1  # right child, 1-based indexing

Max heap property: for all i > 1 A[parent(i)] >= A[i]
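The heap operations work by restoring this property one index at a time. A sketch of the standard max-heapify procedure, using the 1-based parent/left/right above (passing heap_size explicitly is an assumption about the layout; A[0] is unused):

def max_heapify(A, i, heap_size):
  # assumes the subtrees rooted at left(i) and right(i) are max-heaps
  l, r = left(i), right(i)
  largest = i
  if l <= heap_size and A[l] > A[largest]:
    largest = l
  if r <= heap_size and A[r] > A[largest]:
    largest = r
  if largest != i:
    A[i], A[largest] = A[largest], A[i]  # move the larger child up
    max_heapify(A, largest, heap_size)   # continue down the affected subtree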

Some data structures

Way to organize information

A data structure has an interface (functions to work with the DS)

A data structure has data and meta-data (like size, length).

Stack (LIFO)

Operations:

  • push(S, x) (put one element, move TOS)
  • pop(S) (remove element in TOS, move TOS)
  • stack-empty(S) (returns TRUE if stack is empty)
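A minimal sketch backed by a Python list, with the end of the list as TOS:

def push(S, x):
  S.append(x)        # x becomes the new top of stack

def pop(S):
  return S.pop()     # removes and returns the top of stack

def stack_empty(S):
  return len(S) == 0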

Queue (FIFO)

Structure

  • Based on array
  • length
  • head
  • tail

The queue always keeps one cell free, to distinguish a full queue from an empty one.
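A sketch of such a circular queue, assuming head is the index of the first element and tail the index of the first free cell (so head == tail means empty):

class Queue:
  def __init__(self, length):
    self.Q = [None] * length   # one cell always stays unused
    self.head = 0              # index of the first element
    self.tail = 0              # index of the first free cell

  def is_empty(self):
    return self.head == self.tail

  def is_full(self):
    # full when advancing tail would collide with head
    return (self.tail + 1) % len(self.Q) == self.head

  def enqueue(self, x):
    assert not self.is_full()
    self.Q[self.tail] = x
    self.tail = (self.tail + 1) % len(self.Q)

  def dequeue(self):
    assert not self.is_empty()
    x = self.Q[self.head]
    self.head = (self.head + 1) % len(self.Q)
    return x

Keeping one cell free lets is_full be detected without a separate counter.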

Dictionary

Data structure for fast search

A way to implement a dictionary is a Direct-access table

API of dictionary

  • Insert(D, k) inserts a key k into dictionary D
  • Delete(D, k) removes key k
  • Search(D, k) tells whether D contains a key k

Many different implementations

Direct-access tables

  • universe of keys = {1,2,...,M}
  • array T of size M
  • each key has its own position in T

The 'dumb' approach


def Insert(D, x):
  D[x] = True    # mark key x as present

def Delete(D, x):
  D[x] = False   # mark key x as absent

def Search(D, x):
  return D[x]

Everything is O(1), but memory complexity is awful.

Solve this by mapping keys to smaller keys using a hash function.

Table T is many times smaller than the universe; its size differs only by a small constant factor from the size of the set S of keys actually stored.

Instead of using keys directly, the index returned by the hash function is used.

Pigeonhole problem: there are more keys (pigeons) in the universe than holes (indexes in table T), so distinct keys can hash to the same index.

The solution: the chained hash table

For every cell of table T, don't store True/False but a linked list of the keys that hash to that index. This is called a chained hash table.


def Chained_hash_insert(T, k):
  # insert k into the chain of the slot selected by the hash
  return List_insert(T[hash(k)], k)

def Chained_hash_search(T, k):
  # only the chain at T[hash(k)] has to be scanned
  return List_search(T[hash(k)], k)

Elements are spread evenly across the table if the hash function is good.

alpha = n / |T|

is the load factor: the average-case length of the linked lists in the table (where n is the number of elements in the table and |T| is the size of the table).

A good hash table implementation keeps alpha at O(1): if n, the number of elements we want to store in the hash table, grows, then |T| must grow as well. alpha determines the average complexity of both insertion and search.

Growing a chained hash table

In order to grow a table, a new table must be created. The hash function (or its range parameters) must be changed as well.

Rehashing

Rehashing is the process of putting all the elements of the old table in the new table according to the new hash function. The complexity is O(n), since Chained-hash-insert is constant.
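A sketch of the idea, assuming chains are stored as Python lists and using a hypothetical division-method hash hash_fn(k, m) for integer keys:

def hash_fn(k, m):
  # hypothetical division-method hash for integer keys
  return k % m

def rehash(T, new_size):
  new_T = [[] for _ in range(new_size)]      # fresh, larger table
  for chain in T:                            # visit all n stored keys
    for k in chain:
      new_T[hash_fn(k, new_size)].append(k)  # O(1) insert per key
  return new_T                               # O(n) in total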

Growing the table every time

If the table is grown by a constant number of cells every time it overflows, then the total cost of n insertions is O(n^2) due to all the rehashing needed.

Growing the table by doubling the size

If the table size is doubled when it overflows, then the total cost of n insertions becomes linear again (amortized O(1) per insertion), sacrificing some memory.

Binary search trees

Implementation of a dynamic set over a totally ordered universe (with an order relation that has ...)

Interface

  • Tree-Insert(T, k) adds key k to tree T;
  • Tree-Delete(T, k) deletes key k from tree T;
  • Tree-Search(T, k) returns whether key k is in tree T;
  • Tree-Minimum(T) finds the smallest key in the tree;
  • Tree-Maximum(T) finds the biggest key in the tree;
  • Tree-Successor(T, k) and Tree-Predecessor(T, k) find the next and previous keys in sorted order.
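A recursive sketch of Tree-Search, assuming each node carries key, left, and right attributes (a hypothetical node layout) and that an empty subtree is None:

def tree_search(node, k):
  if node is None:
    return False                      # fell off the tree: k is absent
  if k == node.key:
    return True
  if k < node.key:
    return tree_search(node.left, k)  # k can only be in the left subtree
  return tree_search(node.right, k)   # k can only be in the right subtree

Each step descends one level, so the cost is O(h), where h is the height of the tree (defined below).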

Height of a tree

The height of a tree is the maximum number of edges traversed from parent to child in order to reach a leaf from the root of the tree.

Rotation

A right rotation turns the tree rooted at b (with left child a) into the tree rooted at a (with right child b); a left rotation is the inverse. The three subtrees keep their in-order positions:

          b                              a
        /   \                          /   \
       a     k >= b       <=>     k <= a    b
     /   \                                /   \
 k <= a   a <= k <= b            a <= k <= b   k >= b
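A sketch of the right rotation above, reusing the hypothetical node layout from the Tree-Search sketch and ignoring parent pointers:

def rotate_right(b):
  a = b.left
  b.left = a.right   # the middle subtree (a <= k <= b) moves under b
  a.right = b        # b becomes the right child of a
  return a           # a is the new root of this subtree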

Red-Black trees

  1. Every node has a color: red or black
  2. The root is black
  3. Every NULL leaf node is black
  4. If a node is red, both of its children are black
  5. Every path from the root to a leaf contains the same number of black nodes (the black height is the same for every branch)

A red node is a way to stretch the tree, but you cannot stretch it too far.

B-trees

Disk access sucks (orders of magnitude slower than registers). So, ignore the RAM model and come up with a better solution.

If the number of keys per node is increased to k, each node gets k + 1 children and the complexity of tree search becomes O(log_{k+1}(n)). We can afford to search among the keys inside a node linearly, because we have to wait for the disk anyway. And log_1000(n) is much better than log_2(n).

Implementation

  • Minimum degree (min. num. of children in each node) defined as t (t >= 2)
  • Every node other than the root must have at least t - 1 keys
  • Every node must contain at most 2t - 1 keys
  • Every node has a boolean flag for leaf

Insertion

  • Read node from disk
  • If full, split
  • If leaf, insert in the right position
  • If not leaf, recurse on the right branch based on the keys

Search

  • Read node from disk
  • Linearly search between keys
    • If key found, return True
    • otherwise, if leaf, return False
    • otherwise, if not leaf, recurse on the right branch
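A sketch of the search just described, assuming each node stores a sorted list keys, a list children, and a boolean leaf flag, and that a hypothetical disk_read(child) fetches a child node from disk:

def btree_search(node, k):
  i = 0
  # linear scan between the keys: cheap next to a disk access
  while i < len(node.keys) and k > node.keys[i]:
    i += 1
  if i < len(node.keys) and node.keys[i] == k:
    return True                          # key found in this node
  if node.leaf:
    return False                         # no children left to descend into
  child = disk_read(node.children[i])    # hypothetical disk fetch
  return btree_search(child, k)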