Checkpoint 3

April 13, 2021

Goal

Support Aggregation

Catalyzer provides...

Fold operators for all standard aggregation functions.
Aggregate logical plan operator

So where's the challenge?

Test queries will be posted tonight...

Try them!

Sparkisms

New Placeholders
Interpreting the Aggregate operator
Interfacing with Spark's fold operators

New Placeholders

UnresolvedFunction

(e.g., REGEXP_EXTRACT(target, "1(3{2})7", 1))


      UnresolvedFunction(
        name = FunctionIdentifier("REGEXP_EXTRACT"), 
        arguments = Seq(
          UnresolvedAttribute(Seq("target")),
          Literal("1(3{2})7", StringType),
          Literal(1, IntegerType)
        ),
        distinct = false,
        filter = None,
        ignoreNulls = false
      )

Replace like UnresolvedAlias, UnresolvedAttribute

... but with what?

FunctionRegistry


      case UnresolvedFunction(name, arguments, isDistinct, filter, ignoreNulls) =>
      {
        val builder = 
          FunctionRegistry.builtin
            .lookupFunctionBuilder(name)
            .getOrElse {
              throw new RuntimeException(
                s"Unable to resolve function `${name}`"
              )
            }
        builder(arguments) // returns the replacement expression node.
      }

Functions


      val builder = FunctionRegistry.builtin
                      .lookupFunctionBuilder(FunctionIdentifier("REGEXP_EXTRACT")).get
      builder(Seq(
        Attribute("target"),
        Literal("1(3{2})7", StringType),
        Literal(1, IntegerType)
      ))

↓


      RegExpExtract(
        Attribute("target"),
        Literal("1(3{2})7", StringType),
        Literal(1, IntegerType)
      )

RegExpExtract

Functions


      val builder = FunctionRegistry.builtin
                      .lookupFunctionBuilder(FunctionIdentifier("SUM")).get
      builder(Seq(
        Attribute("target")
      ))

↓


      Sum(
        Attribute("target")
      )

Sum

Aggregates

An expression subclassing:

AggregateFunction
- DeclarativeAggregate
- ImperativeAggregate

New Placeholders

Parsing SQL

`SELECT ... FROM R WHERE ...`	`SELECT ... FROM R GROUP BY ...`
↓	↓
Project (or Aggregate?)	Aggregate


      SELECT REGEXP_EXTRACT(...) FROM R


      SELECT SUM(...) FROM R

How does the parser distinguish these cases?

It doesn't


      SELECT SUM(A) FROM R

↓


      Project(Seq(
        UnresolvedFunction("SUM", Seq(
          UnresolvedAttribute("A")
        ))
      ), ...)


↓


      Project(Seq(
        Sum(Attribute("A"))
      ), ...)

Now you can tell it's an aggregate.

Basic Guideline: If any expression is an AggregateFunction, the entire Project node should be an Aggregate instead.


      // Only when some target contains an AggregateFunction
      case Project(targets, child) =>
        Aggregate(Seq(), targets, child)
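
A minimal, Spark-free sketch of the guideline above. Every type here (`Expr`, `Attr`, `Alias`, `Upper`, `SumAgg`, `Plan`, `Table`) is a hypothetical stand-in for the corresponding Catalyst class, and `containsAggregate` and `rewrite` are illustrative helper names, not part of Spark.

```scala
// Toy stand-ins for Catalyst classes; none of these are Spark's real types.
object ProjectToAggregateSketch {
  sealed trait Expr { def children: Seq[Expr] }
  case class Attr(name: String) extends Expr { def children: Seq[Expr] = Seq.empty }
  case class Alias(child: Expr, name: String) extends Expr { def children: Seq[Expr] = Seq(child) }
  // An ordinary scalar function.
  case class Upper(child: Expr) extends Expr { def children: Seq[Expr] = Seq(child) }
  // Plays the role of an AggregateFunction.
  case class SumAgg(child: Expr) extends Expr { def children: Seq[Expr] = Seq(child) }

  sealed trait Plan
  case class Table(name: String) extends Plan
  case class Project(targets: Seq[Expr], child: Plan) extends Plan
  case class Aggregate(grouping: Seq[Expr], targets: Seq[Expr], child: Plan) extends Plan

  // True if the expression or any of its descendants is an aggregate function.
  def containsAggregate(e: Expr): Boolean = e match {
    case _: SumAgg => true
    case other     => other.children.exists(containsAggregate)
  }

  // Apply the guideline to a single plan node; other nodes pass through unchanged.
  def rewrite(plan: Plan): Plan = plan match {
    case Project(targets, child) if targets.exists(containsAggregate) =>
      Aggregate(Seq.empty, targets, child)
    case other => other
  }
}
```

The check has to run after function resolution: before that, both `SUM` and `REGEXP_EXTRACT` are just UnresolvedFunction nodes.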


      Aggregate(
        groupingExpressions: Seq[Expression], 
        aggregateExpressions: Seq[NamedExpression], 
        child: LogicalPlan
      )

groupingExpressions: GROUP BY expressions
aggregateExpressions: SELECT target expressions (may include GROUP BY)
child: as usual

Field	Spark	This Project
`groupingExpressions`	Any Expression	Just Attributes
`aggregateExpressions`	Any Expression	Attribute OR Alias(AggregateFunction(...))

Supporting everything Spark supports
will be a lot more work.

Assign input tuple to a group based on groupingExpressions
Accumulate for each aggregate in aggregateExpressions
Repeat for all input tuples
"Render" result based on aggregateExpressions
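
The four steps above can be sketched as a hash aggregation loop. This is a simplified illustration, assuming a hard-coded query (`SELECT key, COUNT(*), SUM(value) ... GROUP BY key`) and rows modeled as plain `(key, value)` pairs rather than Spark rows; `Accum` and `aggregate` are hypothetical names.

```scala
import scala.collection.mutable

object HashAggregateSketch {
  // One accumulator per group, holding the state for COUNT(*) and SUM(value).
  final case class Accum(var count: Long, var sum: Long)

  def aggregate(input: Iterator[(String, Long)]): Map[String, (Long, Long)] = {
    val groups = mutable.LinkedHashMap.empty[String, Accum]
    for ((key, value) <- input) {
      // 1. Assign the tuple to a group, creating its accumulator on first sight.
      val acc = groups.getOrElseUpdate(key, Accum(0L, 0L))
      // 2. Accumulate for each aggregate.
      acc.count += 1
      acc.sum += value
    } // 3. Repeat for all input tuples.
    // 4. "Render" one output tuple per group.
    groups.map { case (key, acc) => (key, (acc.count, acc.sum)) }.toMap
  }
}
```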

AggregateFunction

COUNT(*)

Init: $0$
Fold(Accum, New): $Accum + 1$

SUM(A)

Init: $0$
Fold(Accum, New): $Accum + New$

AVG(A)

Init: $\{ sum = 0, count = 0 \}$
Fold(Accum, New): $\{ sum = Accum.sum + New,\; count = Accum.count + 1 \}$
Finalize(Accum): $\frac{Accum.sum}{Accum.count}$

Basic Aggregate Pattern

Init: Define a starting value for the accumulator
Fold(Accum, New): Merge a new value into the accumulator
Finalize(Accum): Extract the aggregate from the accumulator.
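
One way to write the Init / Fold / Finalize pattern down in Scala. The `Agg` trait is a hypothetical interface for illustration, not Spark's AggregateFunction; Finalize is named `finish` here to avoid clashing with `java.lang.Object.finalize`, and inputs are assumed to be plain `Double` values.

```scala
object FoldPatternSketch {
  // Hypothetical interface; A is the accumulator type.
  trait Agg[A] {
    def init: A                          // Init
    def fold(accum: A, next: Double): A  // Fold(Accum, New)
    def finish(accum: A): Double         // Finalize(Accum)
  }

  object Count extends Agg[Long] {
    def init: Long = 0L
    def fold(accum: Long, next: Double): Long = accum + 1
    def finish(accum: Long): Double = accum.toDouble
  }

  object Sum extends Agg[Double] {
    def init: Double = 0.0
    def fold(accum: Double, next: Double): Double = accum + next
    def finish(accum: Double): Double = accum
  }

  // AVG needs a two-field accumulator: (sum, count).
  object Avg extends Agg[(Double, Long)] {
    def init: (Double, Long) = (0.0, 0L)
    def fold(accum: (Double, Long), next: Double): (Double, Long) =
      (accum._1 + next, accum._2 + 1)
    def finish(accum: (Double, Long)): Double = accum._1 / accum._2
  }

  // Run any aggregate over a sequence of input values.
  def run[A](agg: Agg[A], input: Seq[Double]): Double =
    agg.finish(input.foldLeft(agg.init)((accum, next) => agg.fold(accum, next)))
}
```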

What does the accumulator look like for each aggregate?

Aggregation Buffers

AggregateFunction.aggBufferAttributes

The attributes that the aggregation function is requesting.

Allocate an InternalRow
with this schema for each function.

Aggregation Buffers

One Buffer Per Aggregate, Per Group
One Buffer Per Group (Aggregates Share)

↙ To be Discussed ↘

↖ What Spark Does ↗

DeclarativeAggregates

Everything is an Expression

initialValues: Seq[Expression]: Evaluate these expressions without a row to get values for the buffer
updateExpressions: Seq[Expression]: Evaluate these expressions on the buffer and input together to get new buffer values
evaluateExpression: Expression: Evaluate this expression on the buffer to get the final aggregate result
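
A toy illustration of "everything is an Expression", using AVG. The tiny expression language (`Lit`, `Ref`, `Add`, `Div`) and the name-keyed `Map` standing in for a row are assumptions made for this sketch; only the field names `initialValues`, `updateExpressions`, and `evaluateExpression` come from the slide.

```scala
object DeclarativeAvgSketch {
  // Hypothetical mini expression language, not Catalyst's.
  sealed trait Expr
  case class Lit(value: Double) extends Expr
  case class Ref(name: String) extends Expr
  case class Add(lhs: Expr, rhs: Expr) extends Expr
  case class Div(lhs: Expr, rhs: Expr) extends Expr

  // Evaluate an expression against a row modeled as name -> value.
  def eval(e: Expr, row: Map[String, Double]): Double = e match {
    case Lit(v)    => v
    case Ref(n)    => row(n)
    case Add(a, b) => eval(a, row) + eval(b, row)
    case Div(a, b) => eval(a, row) / eval(b, row)
  }

  // AVG(A), described entirely with expressions.
  val bufferAttributes: Seq[String] = Seq("sum", "count")
  val initialValues: Seq[Expr] = Seq(Lit(0.0), Lit(0.0))
  val updateExpressions: Seq[Expr] =
    Seq(Add(Ref("sum"), Ref("A")), Add(Ref("count"), Lit(1.0)))
  val evaluateExpression: Expr = Div(Ref("sum"), Ref("count"))

  def avg(input: Seq[Double]): Double = {
    // initialValues are evaluated without any input row.
    val noRow = Map.empty[String, Double]
    val start: Map[String, Double] =
      bufferAttributes.zip(initialValues.map(e => eval(e, noRow))).toMap
    val finalBuffer = input.foldLeft(start) { (buffer, a) =>
      // updateExpressions see the buffer and the input row together.
      val scope = buffer + ("A" -> a)
      bufferAttributes.zip(updateExpressions.map(e => eval(e, scope))).toMap
    }
    // evaluateExpression sees only the buffer.
    eval(evaluateExpression, finalBuffer)
  }
}
```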

DeclarativeAggregates

updateExpressions

How to manage Unresolved Attributes?

Input: [Buffer, InputRow]
Schema (for resolution): agg.aggBufferAttributes ++ child.output

Adjust based on how you implemented your aggregation buffer.
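
A small sketch of what resolving against `agg.aggBufferAttributes ++ child.output` amounts to, assuming one buffer per aggregate and attributes identified by plain names. `bind` and `joined` are hypothetical helpers; Spark's own attributes carry an expression ID rather than relying on names alone, so name collisions between buffer and input are ignored here.

```scala
object BufferBindingSketch {
  // Position of each attribute in the combined [Buffer, InputRow] layout.
  def bind(bufferAttrs: Seq[String], childOutput: Seq[String]): Map[String, Int] =
    (bufferAttrs ++ childOutput).zipWithIndex.toMap

  // The combined row that update expressions would be evaluated against.
  def joined(buffer: Seq[Any], inputRow: Seq[Any]): Seq[Any] = buffer ++ inputRow
}
```

With buffer attributes `sum, count` and child output `A, B`, a reference to `A` in an update expression would read position 2 of the joined row.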

DeclarativeAggregates

evaluateExpression

Input: Buffer
Schema (for resolution): agg.aggBufferAttributes

Next Class

Return to Transactions