Checkpoint 2

In this project, you'll extend your SQL runtime to support larger queries. It is worth 12 points

Requirements

Grading Rubric

All tests will be run on dedicated hardware equipped with an Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz with a standard 5400 RPM HDD. Queries will have a per-query timeout as listed below. Grading will be based on total runtime for each batch of queries.
30 randomly-generated Select-Project queries
Under 20 seconds total: 4 of 4 points
Under 10 seconds per query: 2 of 4 points
3 randomly-generated queries based on the TPC-H benchmark, with a scale factor of 0.1 (100MB) and templates listed below
Under 30 seconds total: 8 of 8 points + leaderboard ranking
Under 50 seconds total: 8 of 8 points
Under 120 seconds total: 4 of 8 points
Under 120 seconds per query: 2 of 8 points
Note in particular that these queries make extensive use of equi-joins, order-by, and limit clauses, which will now need to be supported.

Selection Pushdown and Join Conversion

Equi-joins, such as those in the TPC-H queries will be too expensive to compute as cartesian products. You will need to implement some form of equi-join algorithm: Hash, Sort-Merge, or similar. You will also need to implement some form of selection pushdown and join conversion. Both of these can be implemented almost exactly as described in class, via The easiest way to do this is to use pattern matching on the logical plan. For example you can find filters sitting over joins as so:
  plan match {
    case Filter(condition, Join(lhs, rhs, joinType, joinCondition, hint)) => 
      ???
  }
You may find the following methods helpful: You may also find it helpful to decompose a boolean Expression into its component conjunctive atoms:
def splitAnds(expr: Expression): Seq[Expression] =
{
  expr match { 
    case And(a, b) => splitAnds(a) ++ splitAnds(b)
    case _ => Seq(expr)
  }
}

Example Queries

TPC-H is a standard database benchmark. The benchmark consists of a dataset generator and 22 standard query templates. This checkpoint uses three queries based on TPC-H Queries 3, 5, and 6. The dataset generator and template values can be found at the TPC-H website, and is run at scaling factor (SF) 0.1. Minor variations in the queries may be made.

SELECT
  LINEITEM.ORDERKEY,
  LINEITEM.EXTENDEDPRICE*(CAST(1.0 as float)-LINEITEM.DISCOUNT) AS REVENUE,
  ORDERS.ORDERDATE,
  ORDERS.SHIPPRIORITY
FROM
  CUSTOMER,
  ORDERS,
  LINEITEM 
WHERE CUSTOMER.MKTSEGMENT = '[[SEGMENT]]' AND CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
  AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY 
  AND ORDERS.ORDERDATE < DATE '[[DATE]]'
  AND LINEITEM.SHIPDATE > DATE '[[DATE]]'
ORDER BY REVENUE DESC, ORDERDATE
LIMIT 10
SELECT
  NATION.NAME,
  LINEITEM.EXTENDEDPRICE * (CAST(1.0 as float) - LINEITEM.DISCOUNT) AS REVENUE
FROM
  REGION, NATION, CUSTOMER, ORDERS, LINEITEM, SUPPLIER
WHERE CUSTOMER.CUSTKEY = ORDERS.CUSTKEY
  AND LINEITEM.ORDERKEY = ORDERS.ORDERKEY
  AND LINEITEM.SUPPKEY = SUPPLIER.SUPPKEY
  AND CUSTOMER.NATIONKEY = NATION.NATIONKEY 
  AND SUPPLIER.NATIONKEY = NATION.NATIONKEY
  AND NATION.REGIONKEY = REGION.REGIONKEY
  AND REGION.NAME = '[[REGION]]'
  AND ORDERS.ORDERDATE >= DATE '[[DATE]]'
  AND ORDERS.ORDERDATE < DATE '[[DATE_PLUS_ONE]]'
ORDER BY REVENUE DESC
LIMIT 40
SELECT
  LINEITEM.EXTENDEDPRICE*LINEITEM.DISCOUNT AS REVENUE
FROM
  LINEITEM
WHERE LINEITEM.SHIPDATE >= DATE '[[DATE]]'
  AND LINEITEM.SHIPDATE < DATE '[[DATE_PLUS_ONE]]'
  AND LINEITEM.DISCOUNT > CAST([[DISCOUNT_LOW]] as float) AND LINEITEM.DISCOUNT < CAST([[DISCOUNT_HIGH]] as float)
  AND LINEITEM.QUANTITY < CAST([[QUANTITY]] as float)
ORDER BY REVENUE DESC
LIMIT 10

This page last updated 2024-12-09 13:28:25 -0500