CSE 662 Fall 2019 - Cayuga

Cayuga

CSE 662 Fall 2019

October 24

Non-Standard Database Workloads

Stock Markets
Alert me when a stock reverses a downward trend.
Manufacturing IoT
Alert me when two adjacent process steps both signal non-critical errors.
Cloud Computing
Alert me when the number of errors is more than twice as high as the 2-week average.
Classical DB
These Problems        Classical DB
Expressive Queries    Expressive Queries
Changing Data         Static Data         🗶
Static Queries        Ad-Hoc Queries
Latency: Msec         Latency: Sec/Min    🗶
🗶
Publish/Subscribe
These Problems        Pub/Sub
Expressive Queries    Filter Queries      🗶
Changing Data         Changing Data
Static Queries        Static Queries
Latency: Msec         Latency: Msec

[Figure: a spectrum running from Trivial to Expressive, with Performant Expressiveness marked between the two ends]

Cayuga

Language
Maximize Expressiveness w/o Compromising Performance
Compiler
Emit a tight, optimized program representation
Runtime
Necessary support for concurrent, asynchronous execution

Language

Start with something familiar

Projection, Selection, Union
Single-pass operators: Easy to do efficiently
Join
Multi-pass operator: Will need to revisit
Aggregate
Single-pass operator: Probably ok
Blocking operator: Not ok

Projection


                          SELECT A, B, C, ... 
                          FROM [Query]
    

Emit tuples emitted by [Query] with only columns A, B, C

Selection


                    FILTER { [Condition] } [Query]
    

Emit only tuples emitted by [Query] that pass [Condition]

Union


                        [Query1] UNION [Query2]
    

Emit any tuples emitted by either [Query1] or [Query2]

Join

  1. $O(N^2)$ complexity doesn't work when $N = \infty$
  2. Storage requirements grow without bound
  3. Work per tuple grows with every insertion

How to fix?

RHS tuple has to arrive after LHS tuple
Storage requirement scales only with the number of LHS tuples still waiting for a match
Each LHS tuple joins at most one RHS tuple
$O(N^2) \rightarrow O(N)$
Better chance of work staying constant

Join (Next)


                 [Query1] NEXT { [Condition] } [Query2]
    
  1. For each tuple emitted by [Query1],
  2. wait until [Query2] emits a tuple that passes [Condition]
  3. and emit the cartesian product of the tuples
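
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cayuga's implementation: both inputs are modelled as one merged, time-ordered stream of `(source, tuple)` pairs, and the names `next_join`, `pending`, and the `"lhs"`/`"rhs"` tags are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Event = Tuple[str, dict]  # (source, tuple); source is "lhs" or "rhs" (assumed tagging)


def next_join(events: Iterable[Event],
              condition: Callable[[dict, dict], bool]) -> Iterator[Tuple[dict, dict]]:
    """Pair each LHS tuple with the first later RHS tuple that passes condition."""
    pending: List[dict] = []  # LHS tuples still waiting for their match
    for source, tup in events:
        if source == "lhs":
            pending.append(tup)
            continue
        still_waiting = []
        for lhs in pending:
            if condition(lhs, tup):
                yield (lhs, tup)           # matched once, then forgotten
            else:
                still_waiting.append(lhs)  # keep waiting for a later RHS tuple
        pending = still_waiting
```

Only unmatched LHS tuples are stored, and an RHS tuple that arrives before any LHS tuple is simply dropped, which is what keeps storage tied to the LHS side.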

Aggregate

Blocking operators are not ok. Need semantics that allow tuples to be emitted sooner.

Group-by...ish Aggregates

  • When do you create a new group?
  • Which tuples go into the group?
  • When does the group get emitted?

Aggregate (Fold)


      [Query1] FOLD { [Condition1], [Condition2], [Agg] } [Query2]
    
  1. For each tuple emitted by [Query1]
  2. Wait until [Query2] emits a tuple that passes [Condition1]
  3. Update [Agg]
  4. Emit the cartesian product of the [Query1] tuple, the first [Query2] tuple, and the [Agg] value
  5. If the [Query2] tuple ALSO passes [Condition2] repeat from 2
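
A rough Python sketch of these five steps follows. It is an illustration, not Cayuga's implementation, and it makes several assumptions: the two inputs arrive as one merged stream of `("lhs" | "rhs", tuple)` pairs, both conditions receive the running instance plus the new tuple, and [Condition2] is evaluated against the previous iteration before the instance is updated. `Instance`, `fold`, `init`, and `step` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, List, Tuple


@dataclass
class Instance:
    lhs: dict   # the [Query1] tuple that opened this group
    last: dict  # most recent tuple folded in (initially the LHS tuple)
    agg: Any    # running [Agg] value


def fold(events: Iterable[Tuple[str, dict]],
         cond1: Callable[[Instance, dict], bool],
         cond2: Callable[[Instance, dict], bool],
         init: Callable[[dict], Any],
         step: Callable[[Any, dict], Any]) -> Iterator[Tuple[dict, dict, Any]]:
    active: List[Instance] = []
    for source, tup in events:
        if source == "lhs":                   # step 1: open a new group
            active.append(Instance(lhs=tup, last=tup, agg=init(tup)))
            continue
        survivors = []
        for inst in active:
            if not cond1(inst, tup):          # step 2: not for this group, keep waiting
                survivors.append(inst)
                continue
            keep_going = cond2(inst, tup)     # compared with the previous iteration
            inst.agg = step(inst.agg, tup)    # step 3: update the aggregate
            yield (inst.lhs, tup, inst.agg)   # step 4: emit immediately
            inst.last = tup
            if keep_going:                    # step 5: continue or close the group
                survivors.append(inst)
        active = survivors
```

Because a result is emitted on every iteration, nothing downstream has to wait for the group to close, which is the point of replacing a blocking aggregate.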

Analogous to...


      [Query1] NEXT { [Condition1] } [Query2]
               NEXT { [Condition1] } [Query2] 
               NEXT { [Condition1] } [Query2] 
               NEXT { [Condition1] } [Query2] 
               ... until [Condition2] fails
    

Cayuga

Language
Maximize Expressiveness w/o Compromising Performance
Compiler
Emit a tight, optimized program representation
Runtime
Necessary support for concurrent, asynchronous execution

Deterministic Finite Automata

Model a program by a directed graph

  • Each node is a state
  • Each edge is a transition with a rule
  • One node is a start state
  • One node is an end state

Deterministic Finite Automata

The program accepts an input: A string.

  1. Start at the start state.
  2. Find the transition edge corresponding to the next character and follow it.
  3. Repeat from 2 until the end state or end of string
  4. Accept the string if the final state is the end state

/Hi+!/ ↣ "Hi!" "OHiiiiii!" "Ha!"
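
As a minimal sketch, the `/Hi+!/` automaton can be written as a transition table plus a loop. The state names are hypothetical, and the sketch assumes the whole string must match (no searching inside a longer string).

```python
# Transition table for /Hi+!/ ; state names are illustrative only.
TRANSITIONS = {
    ("start", "H"): "h",
    ("h", "i"): "i",
    ("i", "i"): "i",      # i+ : stay here on further i's
    ("i", "!"): "end",
}


def dfa_accepts(text: str) -> bool:
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no edge for this character: reject
            return False
    return state == "end"
```

Under that whole-string assumption, `dfa_accepts("Hi!")` is `True`, while `"OHiiiiii!"` and `"Ha!"` are rejected at the first character that has no outgoing edge.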

Deterministic Finite Automata

  • Simple
  • Easy to implement efficiently
  • Expressive (Regular Expressions)

... but what if we don't know which edge to take?

Nondeterministic Finite Automata

The program state is a set of active states

  1. Start in state $\{\texttt{start}\}$
  2. Initialize the next state to $\{\}$
  3. For each active state, follow each transition edge with a matching letter and add the destination to the active states in the next step
  4. Replace the current state with the next state.
  5. Repeat from 2 until the end state is active or there are no active states
  6. Accept the string if the end state is active

/Ha?i!+/ ↣ "Hi!" "OHai!" "HaHai!" "HiHaH!"

Letter Start $S_1$ $S_2$ $S_3$ End
H
i
H
a
H
!
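
The set-of-active-states procedure can be sketched as below. The edge list is a reconstruction for `/Ha?i!+/` using the five states named in the table, not something given on the slide, and it assumes a wildcard self-loop on the start state so that a match may begin anywhere in the input.

```python
ANY = None  # wildcard label (assumption: lets a match start at any position)

# Hypothetical edge list for /Ha?i!+/ : state -> [(label, destination)]
EDGES = {
    "start": [(ANY, "start"), ("H", "s1")],
    "s1": [("a", "s2"), ("i", "s3")],
    "s2": [("i", "s3")],
    "s3": [("!", "end")],
    "end": [],
}


def nfa_accepts(text: str) -> bool:
    active = {"start"}
    for ch in text:
        nxt = set()
        for state in active:
            for label, dest in EDGES[state]:
                if label is ANY or label == ch:
                    nxt.add(dest)
        active = nxt              # replace the current set with the next one
        if "end" in active:       # stop as soon as the end state is active
            return True
        if not active:
            return False
    return "end" in active
```

With these assumed edges, `"Hi!"`, `"OHai!"`, and `"HaHai!"` are accepted, and `"HiHaH!"` is rejected because the final `!` never follows an `i`.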

Nondeterministic Finite Automata

  • Nearly as simple
  • Almost as easy to implement efficiently
  • Expressive (Full Regular Expressions)

NDFAs can be compiled down to DFAs

Cayuga

Language
Maximize Expressiveness w/o Compromising Performance
Compiler
Emit a tight, optimized program representation
Runtime
Necessary support for concurrent, asynchronous execution

Cayuga Automata

Each node of the NDFA is a relation.

Each transition of the NDFA is a join condition + projection

Example

  1. Look for high-volume (10,000 or more) trades
  2. When one happens, check if it's followed by a 10 minute sequence of trades with dropping prices
  3. Wait for the stock to rally (5% higher than its lowest point) and alert me

  SELECT Name, MaxPrice, MinPrice, Price as FinalPrice
      -- Only consider aggregates spanning 10 minutes or more
  FROM FILTER { dur ≥ 10 min } (
    ( 
      -- Trigger aggregate when a Stock w/ Volume > 10000 sells
      SELECT Name, Price_1 AS MaxPrice, Price as MinPrice
      FROM FILTER { Volume > 10000 } Stock
    ) FOLD { 
        $2.Name = $.Name,   -- Grouping Condition
        $2.Price < $.Price  -- Continue Condition
    } Stock -- Fold over any stock
  ) NEXT { 
      -- Find the next upturn after a 10 minute descending run
      $2.Name = $1.Name AND $2.Price > 1.05 * $1.MinPrice
  } Stock
    
  • A: Sequences started by a 10k trade
  • B: >10 min runs

      CREATE TABLE A(
        Name_l STRING,    -- From LHS
        MaxPrice DECIMAL, -- From LHS
        MinPrice DECIMAL, -- From LHS
        Name_r STRING,    -- From RHS
        Price DECIMAL,    -- From RHS
        Start INT,        -- From LHS
        End INT           -- From RHS
      )
    

      CREATE TABLE B(
        Name STRING,      
        MaxPrice DECIMAL, 
        MinPrice DECIMAL, 
        Price DECIMAL,    
        Start INT,        
        End INT           
      )
    
  • A: Sequences started by a 10k trade
  • B: >10 min runs
Input events on the Stock stream:

  Name  Price  Volume  Time
  IBM   90     15,000  9:10
  IBM   85     7,000   9:15
  Dell  40     11,000  9:17
  IBM   81     8,000   9:21
  MSFT  25     6,000   9:23
  IBM   91     9,000   9:24

Automaton contents after each event. State A rows are (Name_l, MinPrice, Name_r, Price); State B rows are (Name, MinPrice, Price).

  Event             State A                                   State B        Emitted
  (none yet)        -                                         -              -
  IBM 90 at 9:10    (IBM, 90, IBM, 90)                        -              -
  IBM 85 at 9:15    (IBM, 90, IBM, 85)                        -              -
  Dell 40 at 9:17   (IBM, 90, IBM, 85), (Dell, 40, Dell, 40)  -              -
  IBM 81 at 9:21    (IBM, 90, IBM, 81), (Dell, 40, Dell, 40)  (IBM, 90, 81)  -
  MSFT 25 at 9:23   (IBM, 90, IBM, 81), (Dell, 40, Dell, 40)  (IBM, 90, 81)  -
  IBM 91 at 9:24    (IBM, 90, IBM, 81), (Dell, 40, Dell, 40)  -              IBM!

Cayuga

Language
Maximize Expressiveness w/o Compromising Performance
Compiler
Emit a tight, optimized program representation
Runtime
Necessary support for concurrent, asynchronous execution

Challenges

Asynchronous Arrival
Updates may arrive out of order
Threading
Make sure each thread sees a consistent view of the state
Shallow Copies
Need to keep track of which threads are using which state
Relational State
Lots of work for each event!
String Comparisons
Expensive!

Asynchronous Arrival

Simple Solution: Add a delay to event processing to buffer for out-of-order arrival.
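
One way to picture the delay is a small reorder buffer keyed on event timestamp. This is a sketch under assumptions, not Cayuga's mechanism: the class name `ReorderBuffer`, the `delay` parameter, and the explicit `now` argument are all hypothetical, and events that arrive later than the delay allows are not handled.

```python
import heapq
from typing import Any, List, Tuple


class ReorderBuffer:
    """Hold events for `delay` time units, then release them in timestamp order."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self._heap: List[Tuple[float, int, Any]] = []
        self._seq = 0  # tie-breaker so equal timestamps keep arrival order

    def push(self, timestamp: float, event: Any, now: float) -> List[Any]:
        heapq.heappush(self._heap, (timestamp, self._seq, event))
        self._seq += 1
        return self.release(now)

    def release(self, now: float) -> List[Any]:
        ready = []
        # Anything older than now - delay can no longer be overtaken.
        while self._heap and self._heap[0][0] <= now - self.delay:
            _, _, event = heapq.heappop(self._heap)
            ready.append(event)
        return ready
```

The trade-off is direct: a longer delay tolerates more disorder but adds the same amount to every alert's latency.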

Threading

Mostly Simple Solution: Process one event at a time, in parallel across threads, to build a new state; swap in the new state; repeat.
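
A small sketch of that loop, assuming the state is an immutable tuple of automaton instances and that a hypothetical `step(instance, event)` function returns the successor instances for one instance. `advance` and `parts` are illustrative names, and the thread pool only shows the structure; it says nothing about how Cayuga schedules work.

```python
from concurrent.futures import ThreadPoolExecutor
from math import ceil
from typing import Callable, List, Tuple


def advance(state: Tuple, event, step: Callable[[object, object], List],
            pool: ThreadPoolExecutor, parts: int = 4) -> Tuple:
    """Build the next state from `state` and one event; `state` is never mutated."""
    size = max(1, ceil(len(state) / parts))
    chunks = [state[i:i + size] for i in range(0, len(state), size)]

    def run(chunk):
        # Each worker reads only the old state and returns fresh successors.
        return [succ for inst in chunk for succ in step(inst, event)]

    # pool.map preserves chunk order, so the new state is deterministic.
    return tuple(succ for part in pool.map(run, chunks) for succ in part)
```

The caller swaps the result in with a single assignment, for example `state = advance(state, event, step, pool)`, so readers holding the old tuple keep seeing a consistent snapshot.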

Shallow Copies

Not so Simple Solution: Add an epoch-based garbage collector to detect when no thread can still reference an object.

(Reference counting creates points of contention on every refcount update)

Relational State

Simple Solution: Index the states to make it easier to discover which states a new event interacts with.
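
As an illustration, a hash index on the join attribute lets an incoming event look up only the instances it could match. The sketch assumes instances and events are dictionaries with a `Name` field, as in the stock example; `InstanceIndex` and its methods are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List


class InstanceIndex:
    """Group automaton instances by Name so an event probes one bucket."""

    def __init__(self) -> None:
        self._by_name: Dict[str, List[dict]] = defaultdict(list)

    def add(self, instance: dict) -> None:
        self._by_name[instance["Name"]].append(instance)

    def candidates(self, event: dict) -> List[dict]:
        # .get avoids creating empty buckets for names that were never stored
        return self._by_name.get(event["Name"], [])
```

An MSFT event then touches no IBM or Dell instances at all, rather than being tested against every row of the state.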

String Comparisons

Simple Solution: Build a dictionary of strings (can be done asynchronously while the event is waiting to be processed).
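
A minimal sketch of such a dictionary: each distinct string is assigned a small integer once, and later comparisons use the integers. `StringDictionary`, `encode`, and `decode` are hypothetical names, and the sketch ignores thread safety.

```python
from typing import Dict, List


class StringDictionary:
    """Map each distinct string to a stable integer id."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._strings: List[str] = []

    def encode(self, s: str) -> int:
        ident = self._ids.get(s)
        if ident is None:              # first time this string is seen
            ident = len(self._strings)
            self._ids[s] = ident
            self._strings.append(s)
        return ident

    def decode(self, ident: int) -> str:
        return self._strings[ident]
```

Encoding `Name` while the event waits in the queue means a condition such as `$2.Name = $1.Name` reduces to one integer comparison at processing time.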