Addressing the challenges of big data requires a combination of human intuition and automation. Rather than tackling these challenges head-on with build-from-scratch solutions, or through general-purpose database systems, developer and analyst communities are turning to building blocks: specialized languages, runtimes, data structures, services, compilers, and frameworks that simplify the task of creating systems that are powerful enough to handle terabytes of data or more, or efficient enough to run on your smartphone. In this class, we will explore these fundamental building blocks and how they relate to the constraints imposed by workloads and the platforms they run on.
Coursework consists of lectures and a multi-stage final project. Students are expected to attend all lectures and present at least one paper related to their project. Projects may be performed individually or in groups. Projects will be evaluated in three stages through code deliverables, reports, and group meetings with the instructor (or in rare cases a designated project supervisor). During these meetings, instructors will question the entire group extensively about the group's report, deliverables, and any related tools and technology.
After taking the course, students should be able to:
Design domain-specific query languages, by first developing an understanding of the common tropes of a target domain, exploring ways of allowing users to efficiently express those tropes, and developing ways of mapping the resulting programs to an efficient evaluation strategy.
Identify concurrency challenges in data-intensive computing tasks, and address them through locking, code associativity, and correctness analysis.
Understand a variety of index data structures, as well as their application and use in data management systems for high velocity, volume, veracity, and/or variety data.
Understand query and program compilation techniques, including the design of intermediate representations, subexpression equivalence, cost estimation, and the construction of target-representation code.
(Linear Algebra on Spark)
(Simulating Data from SQL Logs)
(Replicate Learned Index Structures)
(Filesystem Schema Synthesis)
Paper: The case for learned index structures
A few years ago, a group at Google/MIT/Brown developed a new form of data structure for read-heavy workloads based on the observation that the goal of an index structure was to minimize random IOs by getting you as close to a target value as possible.
If your keys are uniformly distributed between $low$ and $high$, you can get reasonably close to any target key among $N$ records by taking $position = N \cdot \frac{target - low}{high - low}$.
This type of trick works on any dataset where you can devise a magic function $f(target)$ that returns a position reasonably close to your target (i.e., mirroring the CDF).
The paper's insight is that a small neural network, trained on the CDF of the keys, could be used to implement $f$.
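The position-prediction idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's method: it swaps the neural network for a single straight-line fit between the smallest and largest key, and the class name `LinearLearnedIndex` and its methods are hypothetical. It assumes a non-empty collection of numeric keys. The maximum prediction error measured at build time bounds the window that a final binary search has to cover.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: a linear key-to-position model plus a bounded local search."""

    def __init__(self, keys):
        self.keys = sorted(keys)  # assumes at least one numeric key
        n = len(self.keys)
        low, high = self.keys[0], self.keys[-1]
        self.low = low
        # Straight-line stand-in for the CDF, scaled to array positions.
        self.slope = (n - 1) / (high - low) if high != low else 0.0
        # Worst-case distance between predicted and true position of any stored key.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round((key - self.low) * self.slope))

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        # Only positions within max_err of the guess can hold a stored key.
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None
```

On skewed keys such as `[1, 3, 4, 10, 50, 51, 1000]` the straight line is a poor fit, so `max_err` grows and the search window approaches the whole array. That is the kind of workload where a richer model, or a conventional index, may do better.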
The overarching goal of this project is to replicate the authors' results, implementing components as needed, across a wide range of datasets and workloads, and comparing learned data structures to alternative indexes.
For comparison points, see Trevor Brown's homepage, which includes a catalog of competitive data structures.
A successful project would ideally confirm results in the paper, and identify workloads not explored in the paper where learned data structures do not perform as well as other options.
Papers: CrimsonDB Website
Log-Structured Merge Trees (LSM-trees) are a family of write-favored index structures that maintain data in a set of internally sorted runs. Runs are organized into "levels" or "tiers", where runs within a level are of comparable size and typically a constant factor larger than those in the prior level. As data is added, runs are merged into larger runs, moving data to progressively lower (larger) levels. Although the basic principles are similar across LSM-trees, there is a wide range of specific implementations, each making different design decisions. The researchers developing CrimsonDB have developed a generic LSM-tree framework that generalizes and builds on the most popular LSM-tree implementations. The overarching goal of this project is to replicate the authors' results, implementing components as needed, across a wide range of datasets and workloads, and comparing the resulting LSM-tree designs to alternative indexes. For comparison points, see Trevor Brown's homepage, which includes a catalog of competitive data structures. A successful project would ideally confirm specific results from CrimsonDB publications, and identify workloads or configurations where the CrimsonDB approach does not perform as well as other options.
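To make the run and level mechanics concrete, here is a minimal Python sketch of a leveled LSM-tree. It is an illustration under simplifying assumptions, not CrimsonDB's design: everything lives in memory, each level holds at most one sorted run, deletes are not supported, and the names `ToyLSM`, `merge_runs`, `memtable_limit`, and `size_ratio` are hypothetical.

```python
import bisect


def merge_runs(newer, older):
    """Merge two runs sorted by key; on duplicate keys the newer entry wins."""
    out, i, j = [], 0, 0
    while i < len(newer) and j < len(older):
        if newer[i][0] < older[j][0]:
            out.append(newer[i])
            i += 1
        elif newer[i][0] > older[j][0]:
            out.append(older[j])
            j += 1
        else:
            out.append(newer[i])
            i += 1
            j += 1
    out.extend(newer[i:])
    out.extend(older[j:])
    return out


class ToyLSM:
    def __init__(self, memtable_limit=4, size_ratio=2):
        self.memtable = {}   # unsorted in-memory write buffer
        self.levels = []     # levels[i] is one sorted run of (key, value) pairs
        self.memtable_limit = memtable_limit
        self.size_ratio = size_ratio

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        run = sorted(self.memtable.items())
        self.memtable = {}
        level = 0
        while True:
            if level == len(self.levels):
                self.levels.append([])
            run = merge_runs(run, self.levels[level])
            capacity = self.memtable_limit * self.size_ratio ** (level + 1)
            if len(run) <= capacity:
                self.levels[level] = run
                return
            # Level overflowed: empty it and push the merged run one level down.
            self.levels[level] = []
            level += 1

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.levels:  # shallower levels hold newer data
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Writes touch only the memtable until a flush, which is what makes the structure write-favored. Reads may have to probe every level, which is the cost that real implementations reduce with filters and fence pointers.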
Papers: Scalable Linear Algebra on a Relational Database System; Solving All-Pairs Shortest-Paths Problem in Large Graphs Using Apache Spark; and APSpark (https://gitlab.com/SCoRe-Group/APSPark)
Linear algebra and relational algebra share much in common.
Both deal with computationally straightforward operations replicated over large data.
Both have standard equivalence rules that can be used to re-organize computation into a more efficient, yet equivalent form.
The key difference is that linear algebra expressions operate over dense, heavily structured data, while relational algebra targets sparse, heavily structured data.
There have been several efforts to bridge the two (the paper above being one such effort), which, if successful, could create a much-needed bridge between databases and tools for machine learning.
The goal of this project is to integrate some of these ideas into a distributed relational data processing framework: Spark.
A successful project would demonstrate an efficient tool for computing standard linear algebra operations (at a minimum: matrix/vector addition/multiplication) through Spark's dataset infrastructure.
Hacking on Spark's Catalyst optimizer will likely be required to be successful.
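The overlap between the two algebras shows up in how matrix multiplication can be phrased as a join followed by an aggregate. The sketch below is only an illustration of that encoding. It uses Python's built-in `sqlite3` module rather than Spark, stores each matrix as a table of `(i, j, v)` triples for its nonzero entries, and the helper names `to_triples` and `matmul_sql` are hypothetical.

```python
import sqlite3


def to_triples(matrix):
    """Encode a list-of-lists matrix as (row, column, value) triples, skipping zeros."""
    return [(i, j, v) for i, row in enumerate(matrix) for j, v in enumerate(row) if v != 0]


def matmul_sql(a, b):
    """Compute a x b with a relational join on the shared dimension and a SUM aggregate."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE A (i INTEGER, j INTEGER, v REAL)")
    conn.execute("CREATE TABLE B (i INTEGER, j INTEGER, v REAL)")
    conn.executemany("INSERT INTO A VALUES (?, ?, ?)", to_triples(a))
    conn.executemany("INSERT INTO B VALUES (?, ?, ?)", to_triples(b))
    rows = conn.execute(
        "SELECT A.i, B.j, SUM(A.v * B.v) "
        "FROM A JOIN B ON A.j = B.i "
        "GROUP BY A.i, B.j"
    ).fetchall()
    conn.close()
    # Entries missing from the query result are zeros in the product.
    result = [[0.0] * len(b[0]) for _ in a]
    for i, j, v in rows:
        result[i][j] = v
    return result
```

For example, `matmul_sql([[1, 2], [0, 3]], [[4, 0], [5, 6]])` yields the product of the two 2x2 matrices. The triple encoding is a natural fit for sparse matrices; for dense ones, the per-entry overhead is one reason a dedicated representation inside the optimizer is attractive.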
Paper: PocketData: The Need for TPC-Mobile
Systems work in databases requires effective benchmarks, which in turn require two things: realistic query/update workloads and realistic datasets. Datasets are plentiful, and query/update logs can also be found if one looks hard enough (e.g., SDSS, PhoneLab), but datasets with accompanying query/update logs are extremely rare. The goal of this project is to work backwards: given a log of SQL queries, DDL operations, and optionally accompanying statistics (e.g., how many records are returned for a query), can you develop a model of the data that can be used to synthesize a dataset for the workload? For example, a SQL query constrains the schema of all tables it accesses, mandating that they contain specific fields. Similarly, an equi-join between two columns suggests that they share a domain, if not a distribution of data. A successful project would develop a model of constraints recoverable from a SQL log, use something like a constraint solver to synthesize a dataset that satisfies these constraints, and evaluate its performance on one or more logs.
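The two example constraints above (required fields and shared domains) can be recovered with a very small amount of code. The following Python sketch is a rough starting point only: it uses regular expressions instead of a real SQL parser, assumes every column reference is written as `table.column` with no aliases, and the function name `infer_constraints` is hypothetical.

```python
import re
from collections import defaultdict

# A qualified column reference such as users.id (assumes no table aliases).
QUALIFIED = re.compile(r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b")
# An equality between two qualified column references.
EQUI_JOIN = re.compile(
    r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\s*=\s*([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b"
)


def infer_constraints(queries):
    """Return (schema, same_domain) constraints recovered from a list of SQL strings.

    schema maps each table name to the set of columns it must contain.
    same_domain is a set of column pairs that appear together in an equality.
    """
    schema = defaultdict(set)
    same_domain = set()
    for query in queries:
        for table, column in QUALIFIED.findall(query):
            schema[table].add(column)
        for t1, c1, t2, c2 in EQUI_JOIN.findall(query):
            same_domain.add(frozenset([(t1, c1), (t2, c2)]))
    return dict(schema), same_domain
```

A real log will contain aliases, subqueries, and string literals that defeat this pattern matching, so a proper parser is the likely next step. The output shape, a set of facts about tables and columns, is the kind of input a constraint solver could consume.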
Papers: Using Reenactment to Retroactively Capture Provenance for Transactions and Graceful database schema evolution: the PRISM workbench. Also see: Reenactment-Style Updates
Although classical databases support SQL DML operations, there are a wide variety of reasons why one might wish to keep data immutable.
(1) Most data sources (CSV files, Hive, URLs) are either unfriendly to point updates or outright don't support writes.
(2) Changing data in-place loses track of older versions of the data, which may be useful for some applications.
Reenactment is a technique for translating SQL DML operations (INSERT, UPDATE, DELETE) into equivalent queries.
Similar techniques have been developed for SQL DDL operations (CREATE, DROP, ALTER), as in the PRISM workbench paper above.
Combining both, we can simulate DDL/DML operations through view definitions: each version of the database is a new view defined relative to the previous one.
Unfortunately, the complexity of this approach grows with the number of DDL/DML operations applied.
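The view-per-version idea, and the way its cost stacks up, can be seen in a small example. The sketch below uses Python's built-in `sqlite3` module and shows one plausible encoding of each kind of operation as a view over the previous version. It is an illustration in the spirit of reenactment, not the rewriting used in the papers above, and the table and view names (`items_v0` through `items_v4`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Version 0: the immutable base data.
conn.execute("CREATE TABLE items_v0 (id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO items_v0 VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)]
)

# Reenacts: UPDATE items SET price = price * 2 WHERE id = 1
conn.execute(
    "CREATE VIEW items_v1 AS "
    "SELECT id, CASE WHEN id = 1 THEN price * 2 ELSE price END AS price "
    "FROM items_v0"
)

# Reenacts: DELETE FROM items WHERE price > 25
# COALESCE keeps rows whose condition is NULL, matching DELETE semantics.
conn.execute(
    "CREATE VIEW items_v2 AS "
    "SELECT id, price FROM items_v1 WHERE COALESCE(price > 25, 0) = 0"
)

# Reenacts: INSERT INTO items VALUES (4, 5.0)
conn.execute(
    "CREATE VIEW items_v3 AS "
    "SELECT id, price FROM items_v2 UNION ALL SELECT 4, 5.0"
)

# Reenacts a schema change: ALTER TABLE items ADD COLUMN currency DEFAULT 'USD'
conn.execute(
    "CREATE VIEW items_v4 AS SELECT id, price, 'USD' AS currency FROM items_v3"
)

# Querying the latest version evaluates the whole chain of views over items_v0.
latest = conn.execute("SELECT * FROM items_v4 ORDER BY id").fetchall()
```

Every earlier version remains queryable and `items_v0` is never modified, but a query against `items_v4` expands into four nested subqueries. After thousands of operations, that nesting is the cost this project needs to tame.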
The core challenge of this project is to work out ways to make efficient query processing possible on reenactment-style data tables.
You may find it convenient to work with an existing relational-algebra-based query engine such as Mimir.
(Supervised by Will Spoth)
Papers: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases
Researchers, scientific tools, and data repositories often provide datasets not as individual data files, but rather as directory hierarchies of files (e.g. execution traces).
Ingesting such datasets into a standard data analytics tool like Spark, a spreadsheet, or a relational database is challenging, as it requires users to explore and understand the structure of each individual file, as well as how all of the files relate to one another.
The goal of this project is to automate the process, or at least streamline it with a human in the loop.
A successful project would develop a tool (graphical or textual) that targets a directory, infers a structure over it, and proposes a set of relational tables through which files in the directory may be queried.
As a bonus goal, a project could also develop a query optimizer that allows efficient in-situ querying of the data.
For this project, you may wish to build on existing work on schema detection in JSON and XML data (as in DataGuides above).
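As a very rough starting point for the structure-inference step, the Python sketch below walks a directory and groups CSV files by their header row, treating each distinct header as a candidate relational table. It rests on strong assumptions: only `.csv` files are considered, the first row of each file is a header, and files belong together only when their headers match exactly. The function name `propose_tables` is hypothetical.

```python
import csv
import os
from collections import defaultdict


def propose_tables(root):
    """Group CSV files under root by header row.

    Returns a dict mapping each distinct header (a tuple of column names)
    to the sorted list of file paths, relative to root, that share it.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".csv"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, newline="") as f:
                header = next(csv.reader(f), None)  # None for an empty file
            if header:
                groups[tuple(header)].append(os.path.relpath(path, root))
    return {columns: sorted(paths) for columns, paths in groups.items()}
```

A fuller tool would also want to use the directory path itself as data (for example, a run identifier encoded in a folder name), tolerate headers that differ slightly, and handle formats such as JSON, where techniques like DataGuides apply.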
This page last updated 2025-08-14 19:13:26 -0400