CSE-4/562 Spring 2021 - Exam Review

Exam Review

CSE-4/562 Spring 2021

May 6, 2021

$$\texttt{WHEN } \mathcal D \leftarrow \mathcal{D}+\Delta\mathcal D \texttt{ DO:}~~~~\\ \texttt{VIEW} \leftarrow \texttt{VIEW} + \Delta Q(\mathcal D,\Delta\mathcal D)$$
$\Delta Q$ (ideally) Small & fast query
$+$ (ideally) Fast "merge" operation

Intuition

$$\mathcal{R} = \{\ \textbf{A, B, C}\ \},\ \mathcal S = \{\ \textbf{X, Y}\ \} ~~~\Delta\mathcal{R} = \{\ \textbf{D}\ \}$$ $$Q(\mathcal R, \mathcal S) = \texttt{COUNT}(\mathcal R \times \mathcal S)$$
$$ \texttt{COUNT}(\textbf{AX, AY, BX, BY, CX, CY, }\underline{\textbf{DX, DY}}) $$
$$Q(\mathcal R+\Delta\mathcal R, \mathcal S) \sim O( (|\mathcal R| + |\Delta\mathcal D|) \cdot |\mathcal S|)$$
$$ 6 + \texttt{COUNT}(\underline{\textbf{DX, DY}}) $$
$$\texttt{VIEW} + \texttt{COUNT}(\Delta\mathcal R \times \mathcal S) \sim O(|\Delta\mathcal R| \cdot |\mathcal S|)$$

$\sigma(\mathcal R) \rightarrow \sigma(\mathcal R \uplus \Delta \mathcal R)$

$ \equiv $ $\sigma(\mathcal R)$ $ \uplus $ $\sigma(\Delta \mathcal R)$

$Q(\mathcal D) = \sigma(\mathcal R)$

$\Delta Q(\mathcal D, \Delta \mathcal D) = \sigma(\Delta \mathcal R)$

Set/Bag difference also commutes through selection

$\pi(\mathcal R) \rightarrow \pi(\mathcal R \uplus \Delta \mathcal R)$

$ \equiv $ $\pi(\mathcal R)$ $ \uplus $ $\pi(\Delta \mathcal R)$

$Q(\mathcal D) = \pi(\mathcal R)$

$\Delta Q(\mathcal D, \Delta \mathcal D) = \pi(\Delta \mathcal R)$

Does this work under set semantics?

$\mathcal R_1 \uplus \mathcal R_2 \rightarrow \mathcal R_1 \uplus \Delta \mathcal R_1 \uplus \mathcal R_2 \uplus \Delta \mathcal R_2$

$ \equiv $ $\mathcal R_1 \uplus \mathcal R_2$ $ \uplus $ $\Delta \mathcal R_1 \uplus \Delta \mathcal R_2$

$Q(\mathcal D) = \mathcal R_1 \uplus \mathcal R_2$

$\Delta Q(\mathcal D, \Delta \mathcal D) = \Delta \mathcal R_1 \uplus \Delta \mathcal R_2$

$$(\mathcal R_1 \uplus \Delta \mathcal R_1) \times (\mathcal R_2 \uplus \Delta \mathcal R_2)$$
$$\left(\mathcal R_1 \times (\mathcal R_2 \uplus \Delta \mathcal R_2)\right) \uplus \left(\Delta \mathcal R_1 \times (\mathcal R_2 \uplus \Delta \mathcal R_2)\right)$$
$$\left(\mathcal R_1 \times \mathcal R_2\right) \uplus \left(\mathcal R_1 \times \Delta \mathcal R_2\right) \uplus \left(\Delta \mathcal R_1 \times (\mathcal R_2 \uplus \Delta \mathcal R_2)\right)$$
$$\left(\mathcal R_1 \times \mathcal R_2\right) \uplus \left(\mathcal R_1 \times \Delta \mathcal R_2\right) \uplus \left(\Delta \mathcal R_1 \times \mathcal R_2\right) \uplus \left(\Delta \mathcal R_1 \times \Delta \mathcal R_2\right)$$
Atomicity
My command gets executed fully, or not at all.
Consistency
The final state of the data is "sane".
Isolation
No concurrency glitches.
Durability
If things break, my data is still safe.

Transactions

A "batch" of operations that should execute together

START/BEGIN TRANSACTION
Start batching
COMMIT
Conclude the transaction and apply changes
ABORT/ROLLBACK
Undo all changes applied as part of the transaction
Atomicity
A transaction is either applied fully (COMMITed) or not at all (ABORTed).
Consistency
Constraints are enforced on the final state of the transaction (and the transaction is ABORTed if they fail).
Isolation
Two transactions running in parallel don't interfere with each other
Durability
Once the database returns from a COMMIT successfully, the data is safe from marmots

What does it mean for a transaction to be Isolated?

Alice and Bob submit transactions at the same time!

Option 1
Alice's transaction executes to completion.
Bob's transaction executes to completion.
Option 2
Bob's transaction executes to completion.
Alice's transaction executes to completion.
Schedule
A sequence of read and writes from one or more transactions to objects
Serial Schedule
A schedule with no interleaving
... but with an arbitrary transaction order
Serializable Schedule
A schedule that produces the same output as a serial schedule

Conflict Equivalence

Two schedules are conflict equivalent if there is a sequence of pairwise "flips" (of reads, or operations on different objects) that gets you from one schedule to the other.

Example 1

TimeT1T2
| W(B)
| R(B)
| W(A)
W(A)

 

Example 1

TimeT1T2
| W(B)
| W(A)
| R(B)
W(A)

Conflict equivalent to a serial schedule!

Example 2

TimeT1T2
| W(B)
| R(B)
| W(A)
W(A)

Can't rewrite!

Conflict Serializability

A schedule is conflict serializable if it is conflict equivalent to a serial schedule.

How do we determine if a schedule is conflict-serializable?

Example 2

TimeT1T2
| W(B)
| R(B)
| W(A)
W(A)

T2's write to B "happens before" T1's read

T1's write to A "happens before" T2's write

Cycle! No equivalent serial schedule!

An acyclic "Happens Before" or Dependency Graph is conflict serializable.

2-Phase Locking

Create one lock for each object.

Each transaction operates in two "phases".

Acquire Phase
Before accessing an object, the transaction must acquire the object's lock.
The transaction does not release locks.
Release Phase
A transaction can release locks
A transaction can never again access an object it doesn't have a lock for.

In practice, the release phase happens all at once at the end

All schedules created by 2PL are conflict-serializable

Not all conflict-serializable schedules can be created by 2PL

$2PL \subset CS$

Locking is...

... expensive
Costs are still incurred even if there are no problematic conflicts.
... restrictive
We don't know what the transaction will do, so we can't allow all schedules.

Snapshot Isolation

Phase 1: Read
Transaction executes on a private copy of all accessed objects
Phase 2: Validate
Check if applying the transaction would break isolation
Phase 3: Write
Write the transaction's updates out to the main database

Validation Phase

Pick a serial order (e.g., the order in which transactions reach the validation phase)

Make sure the transaction's operations follow this order

T1 and T2 read the same object
OK!
T1 reads an object written by T2 (read-write)
Ok. (T1's validation phase started before T2's)
T2 reads an object written by T1 (write-read)
Ok if T1-Write finishes before T2-Read starts.
T1 and T2 write the same object (write-write)
Ok if T1-Write finishes before T2-Write starts.

All schedules created by Snapshot Isolation are conflict-serializable

... but Snapshot Isolation only checks for equivalence to ONE serial schedule. There might be a different, conflict-equivalent serial schedule.

$SI \subset CS$

Timestamp Concurrency Control

(Snapshot Isolation as seen "in practice")

Each object $A$ gets a read timestamp ($RTS(A)$) and a write timestamp ($WTS(A)$)

Each transaction $\mathcal T$ gets a timestamp ($TS(\mathcal T)$).

(note that these can be logical timestamps like sequence numbers)

(also note that real DBs don't use read timestamps... which creates problems)

View Serializability

View Equivalence

Two schedules are view-equivalent when you can transform one into the other by reordering any pair of operations that...

  • operate on different objects
  • OR, are both reads
  • OR, are both writes to the same object, but ONLY IF the object is later overwritten by another write.

View Serializability

A schedule is view serializable if it is view-equivalent to some serial schedule

Timestamp concurrency control is guaranteed to produce view-serializable schedules.

View Serializability

On the happens-before graph, throw away edges created by "hidden" write-write conflicts.

If the resulting graph is acyclic, the schedule is view serializable

By definition there are view serializable schedules that are not conflict serializable

$CS \subset VS$

$2PL, SI \subset CS \subset VS \subset S$

A long write...

What if the DB fails during a write.

IOs aren't atomic

Atomicity and Durability might be violated!

Buffer memory is limited...

What if we need to page out some pages modified by a live transaction?

If the transaction aborts, the page state needs to be reverted.

Atomicity might be violated

Write-Ahead Logging (WAL)

  1. "Log" the transaction's actions
  2. Wait for the log to be safely on-disk
  3. Write the transaction's effects to disk

Idea: Periodically mark down the index of the earliest log entry still needed

WAL Recovery

  1. Scan backwards through the log to find the checkpoint.
  2. Scan forward starting with the checkpointed 'first log entry' to find all transactions with a COMMIT entry.
  3. Replay log entries starting with the checkpointed 'first log entry', ignoring non-COMMITed transactions.

Undo-Logging

Log

Timestamp Transaction Object Value Prev
10T1Page 51010...00101...
11T2Page 31000...0111...
12T1Page 10011...0001...
13T3Page 51100...1010

Idea: Record the page's previous value.

Recovering after a crash

  1. Rebuild in-memory state (Analyze)
  2. Rebuild buffer manager (Redo)
  3. Abort uncommitted transactions (Undo)

ARIES Recovery

Datalog

[head] :- [body]

$$Q(A) :-~~ R(A, B), S(B, C)$$

like SELECT A FROM R NATURAL JOIN S

A
Head Variable (appears in the head and body)
B, C
Existential Variables (appear only in the body)

Stop thinking about relations as collections of records, and instead think of them as collections of facts

RAB
112
213
323
424

The fact $R(1, 2)$ is true.

The fact $R(2, 1)$ is false (or unknown).

A table contains all facts that are provably true.

$$Q(A) :-~~ R(A, B), S(B, C)$$

For any $A$, the fact $Q(A)$ is true if...
  • there is some $B$ and $C$ for which...
  • the fact $R(A, B)$ is true, and...
  • the fact $S(B, C)$ is true.

$\forall A : \big( \exists B, C : R(A, B) \wedge S(B, C) \big) \rightarrow Q(A)$

$$Q(A) :-~~ R(A, B), S(B, C)$$ $$Q(A) :-~~ R(A, B), R(B, C)$$

Treat multiple rules as a disjunction.
($Q(A)$ is true if any rule is satisfied)

As powerful as Set-RA

Projection
$Q := \pi_A(R)$
$Q(A) :-~~ R(A, \ldots)$
Union
$Q := R \cup S$
$Q(\ldots) :-~~ R(\ldots)$
$Q(\ldots) :-~~ S(\ldots)$
Join
$Q := R \bowtie S$
$Q(\ldots) :-~~ R(\ldots), S(\ldots)$
Selection (Equality)
$Q := \sigma_{R.A = R.B}(R)$
$Q(A) :-~~ R(A, A)$
Selection (Equality')
$Q := \sigma_{R.A = 1}(R)$
$Q(B) :-~~ R(1, B)$
Selection (Other)
$Q := \sigma_{A > B}(R)$
$Q(A,B) :-~~ R(A, B), [[ A > B ]]$
$[[ A > B ]]$AB
10
20
30
...
21
...

Relations are Sets of Facts. We can have a relation consisting of all pairs $A, B$ where $A$ is bigger.

  • Why is this tuple in my query result?
  • Why is this tuple not in my query result?
  • Which datasets were used to create this value?
  • How does this input affect my query output?

Provenance

How does the input data relate to a query output.

Types of Provenance

Why Provenance (Lineage)
What's the smallest fragment of my input needed to produce some row
Why-Not Provenance
What's the least I can add to my input to get a desired row
How Provenance
An execution trace of the result; How were the tuples combined?
Where Provenance
Which cell(s) was a given output value taken from
Taint
Was the output affected by any "tainted" input cell/row

Idea: Arithmetic models how a tuple was derived pretty well.

$[[ R(a, b) ]] \rightarrow \texttt{constant}$

$[[R(a, b) \cup S(a, b)]] \rightarrow [[R(a, b)]] \oplus [[S(a, b)]] $

$[[R(a, b) \times S(c, d)]] \rightarrow [[R(a, b)]] \otimes [[S(c, d)]] $

$[[\pi_a R(a, b)]] \rightarrow \sum_b [[R(a, b)]]]] $

$\oplus$$\otimes$Effect
$+$$\times$Bag multiplicity
$\vee$$\wedge$Set existence
$\cup$$\times$Why provenance
minmaxAccess control