- Exam Review

Exam Review

May 6, 2021

$$\texttt{WHEN } \mathcal D \leftarrow \mathcal{D}+\Delta\mathcal D \texttt{ DO:}~~~~\\ \texttt{VIEW} \leftarrow \texttt{VIEW} + \Delta Q(\mathcal D,\Delta\mathcal D)$$

$\Delta Q$	(ideally) Small & fast query
$+$	(ideally) Fast "merge" operation

Intuition

$$\mathcal{R} = \{\ \textbf{A, B, C}\ \},\ \mathcal S = \{\ \textbf{X, Y}\ \} ~~~\Delta\mathcal{R} = \{\ \textbf{D}\ \}$$ $$Q(\mathcal R, \mathcal S) = \texttt{COUNT}(\mathcal R \times \mathcal S)$$

$$ \texttt{COUNT}(\textbf{AX, AY, BX, BY, CX, CY, }\underline{\textbf{DX, DY}}) $$

$$ 6 + \texttt{COUNT}(\underline{\textbf{DX, DY}}) $$

$$\texttt{VIEW} + \texttt{COUNT}(\Delta\mathcal R \times \mathcal S) \sim O(|\Delta\mathcal R| \cdot |\mathcal S|)$$

$\sigma(\mathcal R) \rightarrow \sigma(\mathcal R \uplus \Delta \mathcal R)$

$ \equiv $ $\sigma(\mathcal R)$ $ \uplus $ $\sigma(\Delta \mathcal R)$

$Q(\mathcal D) = \sigma(\mathcal R)$

$\Delta Q(\mathcal D, \Delta \mathcal D) = \sigma(\Delta \mathcal R)$

Set/Bag difference also commutes through selection

$\pi(\mathcal R) \rightarrow \pi(\mathcal R \uplus \Delta \mathcal R)$

$ \equiv $ $\pi(\mathcal R)$ $ \uplus $ $\pi(\Delta \mathcal R)$

$Q(\mathcal D) = \pi(\mathcal R)$

$\Delta Q(\mathcal D, \Delta \mathcal D) = \pi(\Delta \mathcal R)$

Does this work under set semantics?

$\mathcal R_1 \uplus \mathcal R_2 \rightarrow \mathcal R_1 \uplus \Delta \mathcal R_1 \uplus \mathcal R_2 \uplus \Delta \mathcal R_2$

$ \equiv $ $\mathcal R_1 \uplus \mathcal R_2$ $ \uplus $ $\Delta \mathcal R_1 \uplus \Delta \mathcal R_2$

$Q(\mathcal D) = \mathcal R_1 \uplus \mathcal R_2$

$\Delta Q(\mathcal D, \Delta \mathcal D) = \Delta \mathcal R_1 \uplus \Delta \mathcal R_2$

$$(\mathcal R_1 \uplus \Delta \mathcal R_1) \times (\mathcal R_2 \uplus \Delta \mathcal R_2)$$

$$\left(\mathcal R_1 \times (\mathcal R_2 \uplus \Delta \mathcal R_2)\right) \uplus \left(\Delta \mathcal R_1 \times (\mathcal R_2 \uplus \Delta \mathcal R_2)\right)$$

$$\left(\mathcal R_1 \times \mathcal R_2\right) \uplus \left(\mathcal R_1 \times \Delta \mathcal R_2\right) \uplus \left(\Delta \mathcal R_1 \times (\mathcal R_2 \uplus \Delta \mathcal R_2)\right)$$

$$\left(\mathcal R_1 \times \mathcal R_2\right) \uplus \left(\mathcal R_1 \times \Delta \mathcal R_2\right) \uplus \left(\Delta \mathcal R_1 \times \mathcal R_2\right) \uplus \left(\Delta \mathcal R_1 \times \Delta \mathcal R_2\right)$$

Atomicity: My command gets executed fully, or not at all.
Consistency: The final state of the data is "sane".
Isolation: No concurrency glitches.
Durability: If things break, my data is still safe.

Transactions

A "batch" of operations that should execute together

START/BEGIN TRANSACTION: Start batching
COMMIT: Conclude the transaction and apply changes
ABORT/ROLLBACK: Undo all changes applied as part of the transaction

Atomicity: A transaction is either applied fully (COMMITed) or not at all (ABORTed).
Consistency: Constraints are enforced on the final state of the transaction (and the transaction is ABORTed if they fail).
Isolation: Two transactions running in parallel don't interfere with each other
Durability: Once the database returns from a COMMIT successfully, the data is safe from marmots

What does it mean for a transaction to be Isolated?

Alice and Bob submit transactions at the same time!

Option 1: Alice's transaction executes to completion.; Bob's transaction executes to completion.
Option 2: Bob's transaction executes to completion.; Alice's transaction executes to completion.

Schedule: A sequence of read and writes from one or more transactions to objects
Serial Schedule: A schedule with no interleaving; ... but with an arbitrary transaction order
Serializable Schedule: A schedule that produces the same output as a serial schedule

Conflict Equivalence

Two schedules are conflict equivalent if there is a sequence of pairwise "flips" (of reads, or operations on different objects) that gets you from one schedule to the other.

Example 1

Time	T1	T2
\|		`W(B)`
\|	`R(B)`	↑
\|	↓	`W(A)`
↓	`W(A)`

Example 1

Time	T1	T2
\|		`W(B)`
\|		`W(A)`
\|	`R(B)`
↓	`W(A)`

Conflict equivalent to a serial schedule!

Example 2

Time	T1	T2
\|		`W(B)`
\|	`R(B)`
\|	`W(A)`
↓		`W(A)`

Can't rewrite!

Conflict Serializability

A schedule is conflict serializable if it is conflict equivalent to a serial schedule.

How do we determine if a schedule is conflict-serializable?

Example 2

Time	T1	T2
\|		`W(B)`
\|	`R(B)`
\|	`W(A)`
↓		`W(A)`

T2's write to B "happens before" T1's read

T1's write to A "happens before" T2's write

Cycle! No equivalent serial schedule!

An acyclic "Happens Before" or Dependency Graph is conflict serializable.

2-Phase Locking

Create one lock for each object.

Each transaction operates in two "phases".

Acquire Phase: Before accessing an object, the transaction must acquire the object's lock.; The transaction does not release locks.
Release Phase: A transaction can release locks; A transaction can never again access an object it doesn't have a lock for.

In practice, the release phase happens all at once at the end

All schedules created by 2PL are conflict-serializable

Not all conflict-serializable schedules can be created by 2PL

$2PL \subset CS$

Locking is...

... expensive: Costs are still incurred even if there are no problematic conflicts.
... restrictive: We don't know what the transaction will do, so we can't allow all schedules.

Snapshot Isolation

Phase 1: Read: Transaction executes on a private copy of all accessed objects
Phase 2: Validate: Check if applying the transaction would break isolation
Phase 3: Write: Write the transaction's updates out to the main database

Validation Phase

Pick a serial order (e.g., the order in which transactions reach the validation phase)

Make sure the transaction's operations follow this order

T1 and T2 read the same object: OK!
T1 reads an object written by T2 (read-write): Ok. (T1's validation phase started before T2's)
T2 reads an object written by T1 (write-read): Ok if T1-Write finishes before T2-Read starts.
T1 and T2 write the same object (write-write): Ok if T1-Write finishes before T2-Write starts.

All schedules created by Snapshot Isolation are conflict-serializable

... but Snapshot Isolation only checks for equivalence to ONE serial schedule. There might be a different, conflict-equivalent serial schedule.

$SI \subset CS$

Timestamp Concurrency Control

(Snapshot Isolation as seen "in practice")

Each object $A$ gets a read timestamp ($RTS(A)$) and a write timestamp ($WTS(A)$)

Each transaction $\mathcal T$ gets a timestamp ($TS(\mathcal T)$).

(note that these can be logical timestamps like sequence numbers)

(also note that real DBs don't use read timestamps... which creates problems)

View Serializability

View Equivalence

Two schedules are view-equivalent when you can transform one into the other by reordering any pair of operations that...

operate on different objects
OR, are both reads
OR, are both writes to the same object, but ONLY IF the object is later overwritten by another write.

View Serializability

A schedule is view serializable if it is view-equivalent to some serial schedule

Timestamp concurrency control is guaranteed to produce view-serializable schedules.

View Serializability

On the happens-before graph, throw away edges created by "hidden" write-write conflicts.

If the resulting graph is acyclic, the schedule is view serializable

By definition there are view serializable schedules that are not conflict serializable

$CS \subset VS$

$2PL, SI \subset CS \subset VS \subset S$

A long write...

What if the DB fails during a write.

IOs aren't atomic

Atomicity and Durability might be violated!

Buffer memory is limited...

What if we need to page out some pages modified by a live transaction?

If the transaction aborts, the page state needs to be reverted.

Atomicity might be violated

Write-Ahead Logging (WAL)

"Log" the transaction's actions
Wait for the log to be safely on-disk
Write the transaction's effects to disk

Idea: Periodically mark down the index of the earliest log entry still needed

WAL Recovery

Scan backwards through the log to find the checkpoint.
Scan forward starting with the checkpointed 'first log entry' to find all transactions with a COMMIT entry.
Replay log entries starting with the checkpointed 'first log entry', ignoring non-COMMITed transactions.

Undo-Logging

Log

Timestamp	Transaction	Object	Value	Prev
10	T1	Page 5	1010...	00101...
11	T2	Page 3	1000...	0111...
12	T1	Page 1	0011...	0001...
13	T3	Page 5	1100...	1010

Idea: Record the page's previous value.

Recovering after a crash

Rebuild in-memory state (Analyze)
Rebuild buffer manager (Redo)
Abort uncommitted transactions (Undo)

ARIES Recovery

Datalog

[head] :- [body]

$$Q(A) :-~~ R(A, B), S(B, C)$$

like SELECT A FROM R NATURAL JOIN S

A: Head Variable (appears in the head and body)
B, C: Existential Variables (appear only in the body)

Stop thinking about relations as collections of records, and instead think of them as collections of facts

R	A	B
1	1	2
2	1	3
3	2	3
4	2	4

The fact $R(1, 2)$ is true.

The fact $R(2, 1)$ is false (or unknown).

A table contains all facts that are provably true.

$$Q(A) :-~~ R(A, B), S(B, C)$$

For any $A$, the fact $Q(A)$ is true if...

there is some $B$ and $C$ for which...
the fact $R(A, B)$ is true, and...
the fact $S(B, C)$ is true.

$\forall A : \big( \exists B, C : R(A, B) \wedge S(B, C) \big) \rightarrow Q(A)$

$$Q(A) :-~~ R(A, B), S(B, C)$$ $$Q(A) :-~~ R(A, B), R(B, C)$$

Treat multiple rules as a disjunction.
($Q(A)$ is true if any rule is satisfied)

As powerful as Set-RA

Projection: $Q := \pi_A(R)$; $Q(A) :-~~ R(A, \ldots)$
Union: $Q := R \cup S$; $Q(\ldots) :-~~ R(\ldots)$; $Q(\ldots) :-~~ S(\ldots)$
Join: $Q := R \bowtie S$; $Q(\ldots) :-~~ R(\ldots), S(\ldots)$

Selection (Equality): $Q := \sigma_{R.A = R.B}(R)$; $Q(A) :-~~ R(A, A)$
Selection (Equality'): $Q := \sigma_{R.A = 1}(R)$; $Q(B) :-~~ R(1, B)$
Selection (Other): $Q := \sigma_{A > B}(R)$; $Q(A,B) :-~~ R(A, B), [[ A > B ]]$

$[[ A > B ]]$	A	B
	1	0
	2	0
	3	0
	...
	2	1
	...

Relations are Sets of Facts. We can have a relation consisting of all pairs $A, B$ where $A$ is bigger.

Why is this tuple in my query result?
Why is this tuple not in my query result?
Which datasets were used to create this value?
How does this input affect my query output?

Provenance

How does the input data relate to a query output.

Types of Provenance

Why Provenance (Lineage): What's the smallest fragment of my input needed to produce some row
Why-Not Provenance: What's the least I can add to my input to get a desired row
How Provenance: An execution trace of the result; How were the tuples combined?
Where Provenance: Which cell(s) was a given output value taken from
Taint: Was the output affected by any "tainted" input cell/row

Idea: Arithmetic models how a tuple was derived pretty well.

$[[ R(a, b) ]] \rightarrow \texttt{constant}$

$[[R(a, b) \cup S(a, b)]] \rightarrow [[R(a, b)]] \oplus [[S(a, b)]] $

$[[R(a, b) \times S(c, d)]] \rightarrow [[R(a, b)]] \otimes [[S(c, d)]] $

$[[\pi_a R(a, b)]] \rightarrow \sum_b [[R(a, b)]]]] $

$\oplus$	$\otimes$	Effect
$+$	$\times$	Bag multiplicity
$\vee$	$\wedge$	Set existence
$\cup$	$\times$	Why provenance
min	max	Access control