Active Projects


Good data science requires constantly documenting everything, and that is just a pain in the butt. Vizier is a data science notebook that keeps track of what you've done, where your data comes from, and when you changed what. Plus, it's multi-modal, so users can easily manipulate data with a variety of different languages, as well as with a user-friendly spreadsheet interface.

Many analytics tasks are based on information that is initially incomplete, inconsistent, or simply used incorrectly. Existing strategies to help people cope with these sources of uncertainty often require heavyweight upfront organizational tasks (e.g., tagging, data cleaning, or modeling). The Mimir project aims to streamline this process, making it more on-demand and intuitive.

What are we thinking about?

Micro-Kernel Notebooks [Vizier]

Each cell in Vizier is executed in an isolated environment, a "micro-kernel". Micro-kernels open up a lot of cool possibilities, but they are slow.

We're looking for tricks we can play to keep communication overhead low and extract the best parallelism from the micro-kernels.

Multi-View Code [Vizier]

It's common to use multiple tools (spreadsheets, editors, databases) in a single data science workflow, but what happens when you want to view the same code in different ways? For example, Vizier already provides multiple views of a workflow, optimized for writing code, reading code, tracing data flow, and more.

We're thinking about ways to represent computation abstractly in a form that's easily presented through a variety of different views.

Improving Spreadsheets [Vizier]

Spreadsheets suck at "big data", and aren't great for working with changing data.

We're thinking about ways to take spreadsheets and make them into first-class tools for developing data science workflows.

Re-using Data Science Patterns [Vizier]

Simple things like unit conversions, distance computations, or zip code lookups occur everywhere, but aren't worth the hassle of pulling out into special functions.

We're thinking about heuristics for extracting and labeling simple data transformations from Python code, and ways to leverage these transformations in data search problems.
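To make "simple data transformations" concrete, here's a minimal Python sketch of two of the patterns named above: a unit conversion and a distance computation. The function names and the Earth-radius constant are our own illustrative choices, not part of Vizier.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def miles_to_km(miles: float) -> float:
    """Unit conversion: statute miles to kilometers."""
    return miles * 1.609344


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Snippets like these tend to get re-typed inline in notebook after notebook, which is what makes them candidates for automatic extraction and labeling.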

Declarative Compilers

The optimizer is the part of a compiler that rewrites code into faster code. Optimizer construction is a bit of an art, particularly when it comes to making the optimizer run fast. However, if you think of the optimizer as a specialized database, common database optimizations can be used to optimize the optimizer.

We're thinking about ways to re-engineer compilers using tricks developed by the database community, like incremental view maintenance and/or multi-query optimization.
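For readers unfamiliar with incremental view maintenance, here's a toy Python sketch of the idea: a grouped count that is updated on each insert or delete rather than recomputed from scratch. The class name, method names, and the example keys (standing in for node kinds in a program) are hypothetical; this is not how any particular compiler or database implements it.

```python
from collections import Counter


class CountView:
    """Toy materialized view for: SELECT key, COUNT(*) FROM rows GROUP BY key."""

    def __init__(self):
        self.counts = Counter()

    def insert(self, key):
        # Apply the delta for one inserted row.
        self.counts[key] += 1

    def delete(self, key):
        # Apply the delta for one deleted row; drop empty groups.
        self.counts[key] -= 1
        if self.counts[key] == 0:
            del self.counts[key]


view = CountView()
for kind in ["add", "mul", "add"]:
    view.insert(kind)
view.delete("mul")  # e.g., a rewrite removed the only "mul" node
```

Each update costs constant time no matter how many rows the view summarizes, which is the property an optimizer would want when it repeatedly rewrites a program and re-checks which rules still apply.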

Machine Learning on Small Data [Mimir]

A small dataset might be enough to train a simple model to make predictions about some subpopulations, but not all of them (the "curse of small data"). Unfortunately, once the model is trained, there's often no way to tell which subpopulations are viable.

We're thinking about ways to incorporate completeness information into machine learning models, so that the model can answer with "I don't know".
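As a rough illustration of a model that can answer "I don't know", here's a minimal Python sketch: a per-subpopulation majority-vote classifier that abstains when it has seen too few training examples for a group. The class name and the `min_support` threshold are assumptions made for this example; real completeness information would be richer than a simple count.

```python
from collections import Counter, defaultdict


class AbstainingClassifier:
    """Majority vote per subpopulation; abstains below a support threshold."""

    def __init__(self, min_support=5):
        self.min_support = min_support
        self.labels = defaultdict(Counter)  # group -> label counts

    def fit(self, groups, labels):
        for group, label in zip(groups, labels):
            self.labels[group][label] += 1
        return self

    def predict(self, group):
        seen = self.labels.get(group)
        if seen is None or sum(seen.values()) < self.min_support:
            return None  # "I don't know": too little data for this group
        return seen.most_common(1)[0][0]
```

For example, after `AbstainingClassifier(min_support=3).fit(["a", "a", "a", "b"], [1, 1, 0, 1])`, a prediction for group `"a"` returns a label, while groups `"b"` and `"c"` return `None`.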

Mobile Governors Suck [PocketData]

Mobile phone CPUs can slow down to save power. An OS component called the 'governor' chooses when and how this happens. Turns out most governors on Android are over-engineered, making predictions based on useless data.

We're thinking about how to make simpler and smarter mobile governors.
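To give a sense of what a very simple governor could look like, here's a hedged Python sketch that steps the CPU frequency up or down based only on recent utilization. The frequency table and thresholds are made up for illustration, and this is not the policy of any actual Android governor.

```python
FREQS_MHZ = [300, 600, 1200, 1800]  # hypothetical frequency steps


def next_freq(current_mhz, utilization, up=0.8, down=0.3):
    """Step up one level when busy, down one level when idle, else hold."""
    i = FREQS_MHZ.index(current_mhz)
    if utilization > up:
        i = min(i + 1, len(FREQS_MHZ) - 1)
    elif utilization < down:
        i = max(i - 1, 0)
    return FREQS_MHZ[i]
```

The only input is the fraction of time the CPU was busy over the last interval; there is no prediction step at all.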

Older Projects

ASTral / Just In Time Data Structures

ASTral is a database that uses a combination of programming language, program optimization, and data structure techniques to create and maintain self-adapting physical layouts that rapidly react to changing workloads.


The PocketData project explores how smartphones make use of embedded databases in the interest of designing new energy-efficient, low-latency, developer-friendly data management tools for pocket-scale data.

Ettu / Insider Threat Detection

One of the greatest threats to the security of a database system comes from within: users who have been granted access to data and use it in a malicious or illegitimate way. Our goal is to develop new types of statistical signatures for a user's or role's behavior as they access a database. Using these signatures, we can identify non-standard behavior that could be evidence of malicious activity.


DBToaster is an SQL-to-native-code compiler. It generates lightweight, specialized, embeddable query engines for applications that require real-time, low-latency data processing and monitoring capabilities. The DBToaster compiler generates code that can be easily incorporated into any C++ or JVM-based (Java, Scala, ...) project.

Since 2009, DBToaster has spearheaded the ongoing database compilers revolution. If you are looking for the fastest possible execution of continuous analytical queries, DBToaster is the answer. DBToaster code is 3-6 orders of magnitude faster than all other systems known to us.

DBToaster was started at Cornell by the research group of Christoph Koch (now at EPFL). Development on DBToaster continues at the DATA lab at EPFL.


The MayBMS system (note: MayBMS is read as “maybe-MS”, like DBMS) is a complete probabilistic database management system that leverages robust relational database technology: MayBMS is an extension of the Postgres server backend. MayBMS is open source and the source code is available under the BSD license.

MayBMS stands alone as a complete probabilistic database management system: it supports a powerful, compositional query language that nevertheless admits worst-case efficiency and result-quality guarantees. The MayBMS backend is accessible through several APIs, with efficient internal operators for computing and managing probabilistic data.

MayBMS was started at Saarland University by the research group of Christoph Koch (now at EPFL). MayBMS has turned into Sprout, and is being developed at Oxford.

This page last updated 2024-05-06 11:22:18 -0400