DSA/notes.md

<!-- vim: set ts=2 sw=2 et tw=80: -->

# Complexity
General way to describe efficiency algorithms (linear vs exponential)
indipendent from the computer architecture/speed.

## The RAM - random-access machine
Model of computer used in this course.

Has random-access memory.

### Basic types and basic operations
Has basic types (like int, float, 64bit words). A basic step is an operation on
a basic type (load, store, add, sub, ...). A branch is a basic step. Invoking a
function and returning is a basic step as well, but the entire execution takes
longer.

Complexity is not measured by the input value but by the input size in bits.
`Fibonacci(10)` in linear in `n` (size of the value) but exponential in `l`
(number of bits in `n`, or size of the input).

By default, WORST complexity is considered.

## Donald Knuth's A-notation
A(c) indicates a quantity that is absolutely at most c

Antonio's weight = (pronounced "is") A(100)

## (big-) O-notation
f(n) = O(g(n))

*Definition:* if f(n) is such that f(n) = k * A(g(n)) for all _n_ sufficiently
large and for some constant k > 0, then we say that

# Complexity notations (lecture 2019-02-26)

## Characterizing unknown functions

pi(n) = number of primes less than n

## First approximation

*Upper bound:* linear function

pi(n) = O(n)

*Lower bound:* constant function

pi(n) = omega(1)

*Non-trivial tight bound*:

pi(n) = theta(n/log n)

## Theta notation
Given a functio ng(n), we define the __family__ of functions theta(g(n)) such
that given a c_1, c_2 and an n_0, for all n >= n_0 g(n) is sandwiched between
c_1g(n) and c_2g(n)

## Big omega notation
Omega(g(n)) is a family of functions such that there exists a c and an n_0 such
that for all n>= n_0 g(n) dominates c\*g(n)

## Big "oh" notation
O(g(n)) is a family of functions such that there exists a c and an n_0 such
that for all n>= n_0 g(n) is dominated by c\*g(n)

## Small "oh" notation
o(g(n)) is the family of functions O(g(n)) excluding all the functions in
theta(g(n))

## Small omega notation
omega(g(n)) is the family of functions Omega(g(n)) excluding all the functions
in theta(g(n))

## Recap
*asymptotically* = <=> theta(g(n))
*asymptotically* < <=> o(g(n))
*asymptotically* > <=> omega(g(n))
*asymptotically* <= <=> O(g(n))
*asymptotically* >= <=> Omega(g(n))

# Insertion sort

## Complexity

- *Best case:* Linear (theta(n))
- *Worst case:* Number of swaps = 1 + 2 + ... + n-1 = (n-1)n/2 = theta(n^2)
- *Average case:* Number of swaps half of worst case = n(n-1)/4 = theta(n^2)

## Correctness

Proof sort of by induction.

An algorithm is correct if given an input the output satisfies the conditions
stated. The algorithm must terminate.

### The loop invariant

Invariant condition able to make a loop equivalent to a straight path in an
execution graph.

# Heaps and Heapsort

A data structure is a way to structure data.
A binary heap is like an array and can be of two types: max heap and min heap.

## Interface of an heap
- `Build_max_heap(A)` and rearranges a into a max-heap;
- `Heap_insert(H, key)` inserts `key` in the heap;
- `Heap_extract_max(H)` extracts the maximum `key`;
- `H.heap_size` returns the size of the heap.

A binary heap is like a binary tree mapped on an array:

```
    1
   / \
  /   \
 2     3
/ \   / \
4 5   6 7

=> [1234567]
```

The parent position of `n` is the integer division of n by 2:

```python
def parent(x):
  return x // 2
```

The left of `n` is `n` times 2, and the right is `n` times 2 plus 1:

```python
def left(x):
  return x * 2

def right(x):
  return x * 2 + 1
```

**Max heap property**: for all i > 1 A[parent(i)] >= A[i]

# Some data structures

Way to organize information

A data structure has an interface (functions to work with the DS)

A data structure has data and meta-data (like size, length).

## Stack (LIFO)

### Operations:
- `push(S, x)` (put one element, move TOS)
- `pop(S)` (remove element in TOS, move TOS)
- `stack-empty(S)` (returns TRUE if stack is empty)

## Queue (FIFO)

### Structure

- Based on array
- `length`
- `head`
- `tail`

Queue has always 1 cell free to avoid confusion with full/empty

## Dictionary

Data structure for fast search

A way to implement a dictionary is a *Direct-access table*

### API of dictionary

- `Insert(D, k)` insert a ket `k` to dictionary `D`
- `Delete(D, k)` removes key `k`
- `Search(D, k)` tells whether `D` contains a key `k`

Many different implementations

# Direct-access tables

- universe of keys = {1,2,...,M}
- array `T` of size M
- each key has its own position in T

## The 'dumb' approach

```python

def Insert(D, x):
  D[x] == True

def Delete(D, x):
  D[x] == False

def Search(D, k):
  return D[x]

```

Everything is O(1), but memory complexity is awful.

Solve this by mapping keys to smaller keys using an hash function.

Table T is a table many times smaller than the universe and different by a small
constant factor from the set of keys stored S.

Instead of using keys directly, the index returned by the hash function is used.

**Pigeon hole problem:** More keys (pigeons) in the universe than holes (indexes
for table t)

## The solution: the *chained hash table*

For every cell in table T don't store True/False but a linked list of keys for
each index in the table. This is called *Chained hash table*.

```python

def Chained_hash_insert(T, k):
  return List_insert(T[hash(k)], k)

def Chained_hash_search(T, k):
  return List_search(T[hash(k)], k)

```

Elements are spreaded evenly across the table if the hash function is good.

*alpha* = *n / |T|*

is the average-case average length of the linked lists inside
the table (where n is the number of elements in the table and *|T|* is the size
of the table.

A good hash table implementation makes the complexity of *alpha* O(1).
If *n*, the number of elements that we want to store in the hash table, grows,
then *|T|* must also grow. *alpha* represents the time complexity of both
insertion and search.

## Growing a *chained hash table*

In order to grow a table, a new table must be created. The hash function (or
its range parameters) must be changed as well.

### Rehashing

*Rehashing* is the process of putting all the elements of the old table in the
new table according to the new hash function. The complexity is O(n), since
`Chained-hash-insert` is constant.

### Growing the table every time

If the table is grown by a constant factor every time the table *overflows*,
then the complexity of insertion is O(n^2) due to all the *rehashing* needed.

### Growing the table by doubling the size

If the table size is doubled when *overfloiwng*, then the complexity for
insertion becomes linear again by sacrificing some memory complexity.

# Binary search trees

Implementation of a dynamic set over a *totally ordered* (with an order relation
that has ...)

## Interface

- `Tree-Insert(T, k)` adds a key K to tree T;
- `Tree-Delete(T, k)` deletes the key K from tree T;
- `Tree-Search(T, k)` returns if key K is in the tree T;
- `Tree-Minimum(T)` finds the smallest element in the tree;
- `Tree-Maximum(T)` finds the biggest element in the tree;
- `Tree-successor(T, k)` and `Tree-predecessor(T, k)` find the next and
  previous position in the tree

## Height of a tree

The height of a tree is the maximum number of edges traversed from parent to
child in order to reach a leaf from the root of the tree.

## Rotation

```
                b
            /       \
          a           \
       /      \         \
     /          \         \
k <= a     a <= k <= b     k >= b
```

```
               a
           /       \
         /            b
       /          /       \
     /          /           \
k <= a     a <= k <= b     k >= b
```

# Red-Black trees

1. Every node has a color: red or black
2. The root is black
3. Every NULL leaf node is black
4. If a node is red, both of its children are black
5. The black height is the same for every branch in the tree

A red node is a way to strech the tree, but you cannot stretch it too far.

# B-trees

Disk access sucks (a billion times slower than registers). So, ignore the RAM
model and come up with a better solution.

If the number of keys per node is increased to k keys, the complexity of tree
search will be log_k+1(n). We can search between the keys linearly because we
have to wait for the disks. And log_1000() is much better than log_2().

## Implementation

- Minimum degree (min. num. of children in each node) defined as `t` (`t >= 2`)
- Every node other than the root must have at least `t - 1` keys
- Every node must contain at most `2t - 1` keys
- Every node has a boolean flag for leaf

### Insertion

- Read node from disk
- If full, split
- If leaf, insert in the right position
- If not leaf, recurse on the right branch based on the keys

### Search

- Read node from disk
- Linearly search between keys
  - If key found, return True
  - otherwise, if leaf, return False
  - otherwise, if not leaf, recurse on the right branch
2019-02-21 in class 2019-02-21 11:17:07 +00:00			`<!-- vim: set ts=2 sw=2 et tw=80: -->`

			`# Complexity`
			`General way to describe efficiency algorithms (linear vs exponential)`
			`indipendent from the computer architecture/speed.`

			`## The RAM - random-access machine`
			`Model of computer used in this course.`

			`Has random-access memory.`

			`### Basic types and basic operations`
			`Has basic types (like int, float, 64bit words). A basic step is an operation on`
			`a basic type (load, store, add, sub, ...). A branch is a basic step. Invoking a`
			`function and returning is a basic step as well, but the entire execution takes`
			`longer.`

			`Complexity is not measured by the input value but by the input size in bits.`
			`Fibonacci(10)` in linear in `n` (size of the value) but exponential in `l`
			(number of bits in `n`, or size of the input).

			`By default, WORST complexity is considered.`

			`## Donald Knuth's A-notation`
			`A(c) indicates a quantity that is absolutely at most c`

			`Antonio's weight = (pronounced "is") A(100)`

			`## (big-) O-notation`
			`f(n) = O(g(n))`

			`Definition: if f(n) is such that f(n) = k * A(g(n)) for all _n_ sufficiently`
			`large and for some constant k > 0, then we say that`

2019-02-26 in class 2019-02-26 11:25:18 +00:00			`# Complexity notations (lecture 2019-02-26)`

			`## Characterizing unknown functions`

			`pi(n) = number of primes less than n`

			`## First approximation`

			`Upper bound: linear function`

			`pi(n) = O(n)`

			`Lower bound: constant function`

			`pi(n) = omega(1)`

			`Non-trivial tight bound:`

			`pi(n) = theta(n/log n)`

			`## Theta notation`
			`Given a functio ng(n), we define the __family__ of functions theta(g(n)) such`
			`that given a c_1, c_2 and an n_0, for all n >= n_0 g(n) is sandwiched between`
			`c_1g(n) and c_2g(n)`

			`## Big omega notation`
			`Omega(g(n)) is a family of functions such that there exists a c and an n_0 such`
			`that for all n>= n_0 g(n) dominates c\*g(n)`

			`## Big "oh" notation`
			`O(g(n)) is a family of functions such that there exists a c and an n_0 such`
			`that for all n>= n_0 g(n) is dominated by c\*g(n)`

			`## Small "oh" notation`
			`o(g(n)) is the family of functions O(g(n)) excluding all the functions in`
			`theta(g(n))`

			`## Small omega notation`
2019-02-28 in class 2019-02-28 12:29:34 +00:00			`omega(g(n)) is the family of functions Omega(g(n)) excluding all the functions`
			`in theta(g(n))`
2019-02-26 in class 2019-02-26 11:25:18 +00:00
			`## Recap`
			`asymptotically = <=> theta(g(n))`
			`asymptotically < <=> o(g(n))`
			`asymptotically > <=> omega(g(n))`
			`asymptotically <= <=> O(g(n))`
			`asymptotically >= <=> Omega(g(n))`
2019-02-28 in class 2019-02-28 12:29:34 +00:00
			`# Insertion sort`

			`## Complexity`

			`- Best case: Linear (theta(n))`
			`- Worst case: Number of swaps = 1 + 2 + ... + n-1 = (n-1)n/2 = theta(n^2)`
			`- Average case: Number of swaps half of worst case = n(n-1)/4 = theta(n^2)`

			`## Correctness`

			`Proof sort of by induction.`

			`An algorithm is correct if given an input the output satisfies the conditions`
			`stated. The algorithm must terminate.`
2019-03-05 in class 2019-03-05 12:23:46 +00:00
			`### The loop invariant`

			`Invariant condition able to make a loop equivalent to a straight path in an`
			`execution graph.`
2019-03-21 in class 2019-03-21 14:53:57 +00:00
			`# Heaps and Heapsort`

			`A data structure is a way to structure data.`
			`A binary heap is like an array and can be of two types: max heap and min heap.`

			`## Interface of an heap`
			- `Build_max_heap(A)` and rearranges a into a max-heap;
			- `Heap_insert(H, key)` inserts `key` in the heap;
			- `Heap_extract_max(H)` extracts the maximum `key`;
			- `H.heap_size` returns the size of the heap.

			`A binary heap is like a binary tree mapped on an array:`

			```
			`1`
			`/ \`
			`/ \`
			`2 3`
			`/ \ / \`
			`4 5 6 7`

			`=> [1234567]`
			```

			The parent position of `n` is the integer division of n by 2:

			```python
			`def parent(x):`
			`return x // 2`
			```

			The left of `n` is `n` times 2, and the right is `n` times 2 plus 1:

			```python
			`def left(x):`
			`return x * 2`

			`def right(x):`
			`return x * 2 + 1`
			```

			`Max heap property: for all i > 1 A[parent(i)] >= A[i]`

Partial lesson 2019-04-09 2019-04-09 09:20:37 +00:00			`# Some data structures`
2019-03-26 in class 2019-03-27 22:01:13 +00:00
			`Way to organize information`

			`A data structure has an interface (functions to work with the DS)`

			`A data structure has data and meta-data (like size, length).`

			`## Stack (LIFO)`

			`### Operations:`
			- `push(S, x)` (put one element, move TOS)
			- `pop(S)` (remove element in TOS, move TOS)
			- `stack-empty(S)` (returns TRUE if stack is empty)

			`## Queue (FIFO)`

Partial lesson 2019-04-09 2019-04-09 09:20:37 +00:00			`### Structure`
2019-03-26 in class 2019-03-27 22:01:13 +00:00
			`- Based on array`
			- `length`
			- `head`
			- `tail`

			`Queue has always 1 cell free to avoid confusion with full/empty`

			`## Dictionary`

			`Data structure for fast search`

2019-04-04 in class 2019-04-04 11:38:48 +00:00			`A way to implement a dictionary is a Direct-access table`

Partial lesson 2019-04-09 2019-04-09 09:20:37 +00:00			`### API of dictionary`
2019-04-04 in class 2019-04-04 11:38:48 +00:00
			- `Insert(D, k)` insert a ket `k` to dictionary `D`
			- `Delete(D, k)` removes key `k`
			- `Search(D, k)` tells whether `D` contains a key `k`

			`Many different implementations`

Partial lesson 2019-04-09 2019-04-09 09:20:37 +00:00			`# Direct-access tables`
2019-03-26 in class 2019-03-27 22:01:13 +00:00
2019-04-04 in class 2019-04-04 11:38:48 +00:00			`- universe of keys = {1,2,...,M}`
			- array `T` of size M
			`- each key has its own position in T`

Partial lesson 2019-04-09 2019-04-09 09:20:37 +00:00			`## The 'dumb' approach`

2019-04-04 in class 2019-04-04 11:38:48 +00:00			```python

			`def Insert(D, x):`
			`D[x] == True`

			`def Delete(D, x):`
			`D[x] == False`

			`def Search(D, k):`
			`return D[x]`

			```

			`Everything is O(1), but memory complexity is awful.`
2019-03-26 in class 2019-03-27 22:01:13 +00:00
2019-04-04 in class 2019-04-04 11:38:48 +00:00			`Solve this by mapping keys to smaller keys using an hash function.`

			`Table T is a table many times smaller than the universe and different by a small`
			`constant factor from the set of keys stored S.`

			`Instead of using keys directly, the index returned by the hash function is used.`

			`Pigeon hole problem: More keys (pigeons) in the universe than holes (indexes`
			`for table t)`

2019-04-09 in class 2019-04-09 11:38:46 +00:00			`## The solution: the chained hash table`
2019-04-04 in class 2019-04-04 11:38:48 +00:00
			`For every cell in table T don't store True/False but a linked list of keys for`
			`each index in the table. This is called Chained hash table.`

			```python

			`def Chained_hash_insert(T, k):`
			`return List_insert(T[hash(k)], k)`

			`def Chained_hash_search(T, k):`
			`return List_search(T[hash(k)], k)`

			```
Partial lesson 2019-04-09 2019-04-09 09:20:37 +00:00
			`Elements are spreaded evenly across the table if the hash function is good.`

			`alpha = n / \|T\|`

			`is the average-case average length of the linked lists inside`
			`the table (where n is the number of elements in the table and \|T\| is the size`
			`of the table.`

			`A good hash table implementation makes the complexity of alpha O(1).`
			`If n, the number of elements that we want to store in the hash table, grows,`
			`then \|T\| must also grow. alpha represents the time complexity of both`
			`insertion and search.`

2019-04-09 in class 2019-04-09 11:38:46 +00:00			`## Growing a chained hash table`
Partial lesson 2019-04-09 2019-04-09 09:20:37 +00:00
Stuff before easter break 2019-04-23 11:37:26 +00:00			`In order to grow a table, a new table must be created. The hash function (or`
			`its range parameters) must be changed as well.`
Partial lesson 2019-04-09 2019-04-09 09:20:37 +00:00
			`### Rehashing`

2019-04-09 in class 2019-04-09 11:38:46 +00:00			`Rehashing is the process of putting all the elements of the old table in the`
Stuff before easter break 2019-04-23 11:37:26 +00:00			`new table according to the new hash function. The complexity is O(n), since`
Partial lesson 2019-04-09 2019-04-09 09:20:37 +00:00			`Chained-hash-insert` is constant.

			`### Growing the table every time`

			`If the table is grown by a constant factor every time the table overflows,`
			`then the complexity of insertion is O(n^2) due to all the rehashing needed.`

			`### Growing the table by doubling the size`

			`If the table size is doubled when overfloiwng, then the complexity for`
			`insertion becomes linear again by sacrificing some memory complexity.`

2019-04-09 in class 2019-04-09 11:38:46 +00:00			`# Binary search trees`

			`Implementation of a dynamic set over a totally ordered (with an order relation`
			`that has ...)`

			`## Interface`

			- `Tree-Insert(T, k)` adds a key K to tree T;
			- `Tree-Delete(T, k)` deletes the key K from tree T;
			- `Tree-Search(T, k)` returns if key K is in the tree T;
			- `Tree-Minimum(T)` finds the smallest element in the tree;
			- `Tree-Maximum(T)` finds the biggest element in the tree;
Stuff before easter break 2019-04-23 11:37:26 +00:00			- `Tree-successor(T, k)` and `Tree-predecessor(T, k)` find the next and
			`previous position in the tree`

			`## Height of a tree`

			`The height of a tree is the maximum number of edges traversed from parent to`
			`child in order to reach a leaf from the root of the tree.`

			`## Rotation`

			```
			`b`
			`/ \`
			`a \`
			`/ \ \`
			`/ \ \`
			`k <= a a <= k <= b k >= b`
			```

			```
			`a`
			`/ \`
			`/ b`
			`/ / \`
			`/ / \`
			`k <= a a <= k <= b k >= b`
			```

			`# Red-Black trees`

			`1. Every node has a color: red or black`
			`2. The root is black`
			`3. Every NULL leaf node is black`
			`4. If a node is red, both of its children are black`
			`5. The black height is the same for every branch in the tree`

2019-05-03 in class 2019-05-03 06:41:37 +00:00			`A red node is a way to strech the tree, but you cannot stretch it too far.`

			`# B-trees`

			`Disk access sucks (a billion times slower than registers). So, ignore the RAM`
			`model and come up with a better solution.`

			`If the number of keys per node is increased to k keys, the complexity of tree`
			`search will be log_k+1(n). We can search between the keys linearly because we`
			`have to wait for the disks. And log_1000() is much better than log_2().`

			`## Implementation`

			- Minimum degree (min. num. of children in each node) defined as `t` (`t >= 2`)
			- Every node other than the root must have at least `t - 1` keys
			- Every node must contain at most `2t - 1` keys
			`- Every node has a boolean flag for leaf`

			`### Insertion`

			`- Read node from disk`
			`- If full, split`
			`- If leaf, insert in the right position`
			`- If not leaf, recurse on the right branch based on the keys`

			`### Search`

			`- Read node from disk`
			`- Linearly search between keys`
			`- If key found, return True`
			`- otherwise, if leaf, return False`
			`- otherwise, if not leaf, recurse on the right branch`