# Readings for Autonomous Database

[TOC]

Traditional database systems are designed for general-purpose use, so they can handle a wide variety of schemas, data types, and data distributions. They are built on experts' prior knowledge and tend to perform poorly compared with specialized data systems. To improve the efficiency of data access, these systems add extra components such as query optimizers and cost models that exploit information about the hardware, the characteristics of the workload, and the data distribution. These components can be replaced with learned models of the hardware, the underlying data, and the workload, so that a database system can automatically generate the best data structures and algorithms for a given data processing application. We call this approach the model-driven database.

Research topics in autonomous databases include:

  - database tuning
  - physical design
  - query optimization
  - query formulation
  - query execution

Topics can be divided into five categories:

  1. query optimization
  2. data access
  3. query execution
  4. data analytics
  5. query formulation

#### I. General

  1. SageDB: A Learned Database System (CIDR '19)

    Summary: The key idea of SageDB is “to build or select (synthesize) the best implementation of each component of the data processing engine for a particular application”. The authors argue that different types of customization (customization through configuration, algorithm picking, self-design, and learning) can be composed via SageDB. A key challenge mentioned in the paper is “choosing the right model for SageDB’s brain”. The synthesis works by first learning one or more models of the data distribution, workload, and hardware, which together form the “brain” of SageDB; as the authors note, individual components can often share previously learned models. SageDB then composes a database along four dimensions: (1) data access, (2) query execution, (3) query optimization, and (4) advanced analytics.
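
To make the synthesis idea concrete, here is a minimal, self-contained sketch (not SageDB's actual machinery): a crude model of the data distribution is learned from a sample and then used to choose between two sort implementations. The uniformity score and the 0.95 threshold are invented for illustration.

```python
import numpy as np

def learn_distribution(sample):
    """Fit a crude 'model' of the data: its range plus a uniformity score that says
    how close the sample CDF is to a straight line."""
    s = np.sort(sample)
    lo, hi = s[0], s[-1]
    ideal = np.linspace(lo, hi, len(s))                 # CDF of a perfectly uniform sample
    uniformity = 1.0 - np.mean(np.abs(s - ideal)) / (hi - lo + 1e-9)
    return {"lo": lo, "hi": hi, "uniformity": uniformity}

def synthesize_sort(model):
    """Pick a sort implementation based on the learned model
    (the threshold is made up for illustration)."""
    if model["uniformity"] > 0.95:
        def model_based_sort(data):
            # Near-uniform keys: place each key into a bucket predicted by the model,
            # then finish with a cheap per-bucket sort.
            k = max(1, len(data) // 16)
            span = model["hi"] - model["lo"] + 1e-9
            buckets = [[] for _ in range(k)]
            for x in data:
                buckets[min(k - 1, int((x - model["lo"]) / span * k))].append(x)
            return [x for b in buckets for x in sorted(b)]
        return model_based_sort
    return sorted                                        # fall back to a comparison sort

rng = np.random.default_rng(0)
data = rng.uniform(0, 1_000, 100_000)
sort_fn = synthesize_sort(learn_distribution(rng.choice(data, 1_000)))
assert sort_fn(list(data)) == sorted(data)
```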

#### II. Physical design and self-tuning

  1. COLT: Continuous On-Line Database Tuning

    Summary: This demonstration introduces COLT (Continuous On-Line Tuning), a novel self-tuning framework that continuously monitors the incoming queries and adjusts the system configuration in order to maximize query performance.

  2. Automatic Database Management System Tuning Through Large-scale Machine Learning (SIGMOD '17)

    Summary: Database management system (DBMS) configuration tuning is an essential aspect of any data-intensive application effort. But this is historically a difficult task because DBMSs have hundreds of configuration "knobs" that control everything in the system, such as the amount of memory to use for caches and how often data is written to storage. The problem with these knobs is that they are not standardized (i.e., two DBMSs use a different name for the same knob), not independent (i.e., changing one knob can impact others), and not universal (i.e., what works for one application may be sub-optimal for another). Worse, information about the effects of the knobs typically comes only from (expensive) experience.

    To overcome these challenges, we present an automated approach that leverages past experience and collects new information to tune DBMS configurations: we use a combination of supervised and unsupervised machine learning methods to (1) select the most impactful knobs, (2) map unseen database workloads to previous workloads from which we can transfer experience, and (3) recommend knob settings. We implemented our techniques in a new tool called OtterTune and tested it on two DBMSs. Our evaluation shows that OtterTune recommends configurations that are as good as or better than ones generated by existing tools or a human expert.
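
A rough sketch of two of the three steps above on synthetic data, using scikit-learn: Lasso to rank knob importance and a Gaussian-process regressor to score candidate configurations. The knob names, the synthetic latency function, and the hyperparameters are made up, and the workload-mapping step (2) is omitted; this is one plausible instantiation, not OtterTune's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)

# Synthetic observations: each row is a knob configuration, y is the observed latency.
knob_names = ["buffer_pool_mb", "log_flush_interval", "checkpoint_segments"]  # hypothetical
X = rng.uniform(0, 1, size=(200, len(knob_names)))
y = 3.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.05, 200)      # knob 0 dominates

# Step (1): rank knobs by impact with Lasso.
lasso = Lasso(alpha=0.01).fit(X, y)
ranking = sorted(zip(knob_names, np.abs(lasso.coef_)), key=lambda kv: -kv[1])
print("knob importance:", ranking)

# Step (3): recommend a setting by scoring random candidate configurations with a
# Gaussian-process model and keeping the one with the best predicted latency.
gp = GaussianProcessRegressor().fit(X, y)
candidates = rng.uniform(0, 1, size=(1_000, len(knob_names)))
best = candidates[np.argmin(gp.predict(candidates))]
print("recommended configuration:", dict(zip(knob_names, best.round(3))))
```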

  3. An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning (SIGMOD '19)

    Summary: We design an end-to-end automatic CDB tuning system, CDBTune, using deep reinforcement learning (RL). CDBTune utilizes the deep deterministic policy gradient method to find the optimal configurations in high-dimensional continuous space. CDBTune adopts a try-and-error strategy to learn knob settings with a limited number of samples to accomplish the initial training, which alleviates the difficulty of collecting massive high-quality samples. CDBTune adopts the reward-feedback mechanism in RL instead of traditional regression, which enables end-to-end learning, accelerates the convergence speed of our model, and improves the efficiency of online tuning. We conducted extensive experiments under 6 different workloads on real cloud databases to demonstrate the superiority of CDBTune. Experimental results showed that CDBTune had good adaptability and significantly outperformed the state-of-the-art tuning tools and DBA experts.

  4. The Case for Learned Index Structures (SIGMOD '18)

    Summary: This work proposes a learned index, trained as a neural-network model, to complement existing index structures such as B-trees. The authors theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures.
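
A toy illustration of the core idea, assuming sorted numeric keys: fit a model of the key-to-position mapping (the CDF), remember its maximum error, and correct mispredictions with a bounded search. The paper's recursive model index and its B-tree comparison are not reproduced here.

```python
import numpy as np

class ToyLearnedIndex:
    """A linear model of the key -> position mapping (the CDF), plus a bounded
    correction search. A toy version of the idea, not the paper's recursive model index."""

    def __init__(self, keys):
        self.keys = np.sort(np.asarray(keys))
        positions = np.arange(len(self.keys))
        # "Train" the model: least-squares fit of position as a function of key.
        self.slope, self.intercept = np.polyfit(self.keys, positions, deg=1)
        predicted = self.slope * self.keys + self.intercept
        self.max_err = int(np.ceil(np.max(np.abs(predicted - positions))))

    def lookup(self, key):
        guess = int(self.slope * key + self.intercept)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        # Search only within the model's worst-case error bound.
        pos = lo + np.searchsorted(self.keys[lo:hi], key)
        return pos if pos < len(self.keys) and self.keys[pos] == key else None

keys = np.cumsum(np.random.default_rng(0).integers(1, 10, 10_000))
index = ToyLearnedIndex(keys)
assert index.lookup(keys[1234]) == 1234
assert index.lookup(-1) is None
```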

  5. Self-Driving Database Management Systems (CIDR '17)

    Summary: This paper presents the architecture of Peloton, the first self-driving DBMS for system tuning and physical design. Peloton is a relational database management system that is designed for autonomous operation. The system’s integrated planning component not only optimizes the system for the current workload, but also predicts future workload trends before they occur so that the system can prepare itself accordingly. It also enables new optimizations that are not possible today because the complexity of managing these systems has surpassed the abilities of human experts.

  6. Design Continuums and the Path Toward Self-Designing Key-Value Stores that Know and Learn (CIDR 2019)

    Summary: We introduce the concept of design continuums for the data layout of key-value stores. A design continuum unifies major distinct data structure designs under the same model. The critical insight and potential long-term impact is that such unifying models 1) render what we consider up to now as fundamentally different data structures to be seen as “views” of the very same overall design space, and 2) allow “seeing” new data structure designs with performance properties that are not feasible by existing designs.

  7. Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models (SIGMOD 2018)

    Summary: We present a design engine, the Data Calculator, which enables interactive and semi-automated design of data structures. It brings two innovations. First, it offers a set of fine-grained design primitives that capture the first principles of data layout design. Second, it computes performance using learned cost models. These models are trained on diverse hardware and data profiles and capture the cost properties of fundamental data access primitives.
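
A minimal example of the learned-cost-model ingredient: profile one access primitive (a sequential scan) at a few sizes and fit a linear cost model that can then price larger design points without running them. The Data Calculator itself trains far richer models across hardware and data profiles; this is only a sketch.

```python
import time
import numpy as np

def scan_once(column):
    """One execution of the profiled primitive: a sequential scan (here, a sum)."""
    t0 = time.perf_counter()
    column.sum()
    return time.perf_counter() - t0

def measure_scan_cost(n, trials=3):
    column = np.random.default_rng(0).random(n)
    return min(scan_once(column) for _ in range(trials))

# Profile the primitive at a few sizes, then learn a cost model cost(n) = a * n + b.
sizes = np.array([10_000, 100_000, 1_000_000, 5_000_000])
costs = np.array([measure_scan_cost(n) for n in sizes])
a, b = np.polyfit(sizes, costs, deg=1)

# The learned model can now price a hypothetical design point without running it.
print(f"predicted scan cost for 2M elements: {(a * 2_000_000 + b) * 1e3:.3f} ms")
```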

  8. Self-tuning database systems: a decade of progress (VLDB '07)

    Summary: In this paper we discuss advances in self-tuning database systems over the past decade, based on our experience in the AutoAdmin project at Microsoft Research. This paper primarily focuses on the problem of automated physical database design. We also highlight other areas where research on self-tuning database technology has made significant progress. We conclude with our thoughts on opportunities and open issues.

#### III. Data access (storage layout, index, data compression, etc.)

  1. AutoAdmin “what-if” index analysis utility (SIGMOD '98)

    Summary: As databases get widely deployed, it becomes increasingly important to reduce the overhead of database administration. An important aspect of data administration that critically influences performance is the ability to select indexes for a database. In order to decide the right indexes for a database, it is crucial for the database administrator (DBA) to be able to perform a quantitative analysis of the existing indexes. Furthermore, the DBA should have the ability to propose hypothetical (“what-if”) indexes and quantitatively analyze their impact on performance of the system. Such impact analysis may consist of analyzing workloads over the database, estimating changes in the cost of a workload, and studying index usage while taking into account projected changes in the sizes of the database tables. In this paper we describe a novel index analysis utility that we have prototyped for Microsoft SQL Server 7.0. We describe the interfaces exposed by this utility that can be leveraged by a variety of front-end tools and sketch important aspects of the user interfaces enabled by the utility. We also discuss the implementation techniques for efficiently supporting “what-if” indexes. Our framework can be extended to incorporate analysis of other aspects of physical database design.
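
The flavor of a "what-if" analysis can be shown with a toy cost model that prices a workload with and without a hypothetical index, without building anything. The cost formulas, table size, and selectivities below are invented; the real utility obtains such estimates from the optimizer itself.

```python
TABLE_ROWS = 1_000_000

def query_cost(selectivity, has_index):
    if has_index:
        return 50 + selectivity * TABLE_ROWS      # index seek: proportional to matching rows
    return TABLE_ROWS                             # full scan: proportional to the table size

workload = [0.001, 0.002, 0.0005, 0.3]            # per-query selectivities on the indexed column

cost_without = sum(query_cost(s, has_index=False) for s in workload)
cost_with = sum(query_cost(s, has_index=True) for s in workload)
print(f"estimated cost without index: {cost_without:,.0f}")
print(f"estimated cost with hypothetical index: {cost_with:,.0f}")
print(f"estimated improvement: {1 - cost_with / cost_without:.1%}")
```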

  2. H2O: A Hands-free Adaptive Store (SIGMOD '14)

    Summary: In this paper, we present the H2O system, which introduces two novel concepts. First, it is flexible enough to support multiple storage layouts and data access patterns in a single engine. Second, and most importantly, it decides on-the-fly, i.e., during query processing, which design is best for classes of queries and the respective data parts. At any given point in time, parts of the data might be materialized in various patterns purely depending on the query workload; as the workload changes and with every single query, the storage and access patterns continuously adapt. In this way, H2O makes no a priori and fixed decisions on how data should be stored, allowing each single query to enjoy a storage and access pattern which is tailored to its specific properties.
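
A toy version of the per-query layout decision, assuming a hot data part is kept in both a row-major and a column-major copy: each query is routed to the layout a simple heuristic prefers. The 25% threshold is invented; H2O makes this decision adaptively during query processing and materializes layouts lazily.

```python
import numpy as np

n_rows, n_cols = 100_000, 20
table = np.random.default_rng(0).random((n_rows, n_cols))
row_store = table                        # row-major copy
col_store = np.asfortranarray(table)     # column-major copy of the same data

def run_query(accessed_cols, predicate_col, threshold=0.25):
    # Route the query to the layout a simple heuristic prefers:
    # few columns touched -> columnar scan, wide projection -> row layout.
    if len(accessed_cols) / n_cols <= threshold:
        store, layout = col_store, "columnar"
    else:
        store, layout = row_store, "row"
    mask = store[:, predicate_col] > 0.99
    result = store[np.ix_(mask.nonzero()[0], accessed_cols)]
    return layout, result

print(run_query(accessed_cols=[1, 3], predicate_col=1)[0])             # -> columnar
print(run_query(accessed_cols=list(range(15)), predicate_col=1)[0])    # -> row
```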

#### IV. Query optimization (cardinality estimation, cost model, join ordering, etc.)

  1. Eddies: continuously adaptive query processing (SIGMOD '00)

    Summary: In this paper we introduce a query processing mechanism called an eddy, which continuously reorders operators in a query plan as it runs. We characterize the moments of symmetry during which pipelined joins can be easily reordered, and the synchronization barriers that require inputs from different sources to be coordinated. By combining eddies with appropriate join algorithms, we merge the optimization and execution phases of query processing, allowing each tuple to have a flexible ordering of the query operators. This flexibility is controlled by a combination of fluid dynamics and a simple learning algorithm. Our initial implementation demonstrates promising results, with eddies performing nearly as well as a static optimizer/executor in static scenarios, and providing dramatic improvements in dynamic execution environments.
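
A bare-bones sketch of eddy-style routing over two filter operators: each tuple is routed through the remaining predicates in an order chosen by a ticket-weighted lottery, and operators that filter tuples out earn tickets. The predicates and reward amounts are invented, and the paper's actual lottery scheme is more refined.

```python
import random

predicates = {
    "p_selective":   lambda t: t % 97 == 0,     # passes about 1% of tuples
    "p_unselective": lambda t: t % 2 == 0,      # passes about 50% of tuples
}
tickets = {name: 1.0 for name in predicates}    # adaptive routing weights

def route(tuple_value):
    remaining = set(predicates)
    while remaining:
        # Pick the next operator with probability proportional to its tickets.
        names = list(remaining)
        name = random.choices(names, weights=[tickets[n] for n in names])[0]
        if not predicates[name](tuple_value):
            tickets[name] += 0.1                # reward operators that filter tuples early
            return False
        remaining.discard(name)
    return True

survivors = [t for t in range(10_000) if route(t)]
print(len(survivors), "tuples passed;", {k: round(v, 1) for k, v in tickets.items()})
```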

  2. How good are query optimizers, really? (PVLDB '15)

    Summary: Finding a good join order is crucial for query performance. In this paper, we introduce the Join Order Benchmark (JOB) and experimentally revisit the main components in the classic query optimizer architecture using a complex, real-world data set and realistic multi-join queries. We investigate the quality of industrial-strength cardinality estimators and find that all estimators routinely produce large errors. We further show that while estimates are essential for finding a good join order, query performance is unsatisfactory if the query engine relies too heavily on these estimates. Using another set of experiments that measure the impact of the cost model, we find that it has much less influence on query performance than the cardinality estimates. Finally, we investigate plan enumeration techniques comparing exhaustive dynamic programming with heuristic algorithms and find that exhaustive enumeration improves performance despite the sub-optimal cardinality estimates.
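
The estimation errors the paper studies are usually summarized as a q-error, the factor by which an estimate is off regardless of direction; a small helper with made-up numbers:

```python
def q_error(estimate, actual):
    """Factor by which a cardinality estimate is off, regardless of direction."""
    estimate, actual = max(estimate, 1), max(actual, 1)   # guard against zeros
    return max(estimate / actual, actual / estimate)

# Made-up estimate/true-cardinality pairs for a few joins:
pairs = [(1_000, 950), (20, 8_000), (3_000_000, 41_000)]
print([round(q_error(e, a), 1) for e, a in pairs])        # -> [1.1, 400.0, 73.2]
```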

  3. Query-based Workload Forecasting for Self-Driving Database Management Systems (SIGMOD '18)

    Summary: This paper presents a robust forecasting framework called QueryBot 5000 that allows a DBMS to predict the expected arrival rate of queries in the future based on historical data. In particular, they propose a clustering-based technique to categorize the queries and use ensemble models combining linear regression and RNN to forecast the future workload.
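
Only the simplest ingredient of the framework is sketched here: a linear-regression forecast of one query cluster's arrival rate, on synthetic counts with a daily pattern. QueryBot 5000 combines linear regression with an RNN in an ensemble; the data and features below are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
hours = np.arange(24 * 14)                                 # two weeks of history
arrivals = (500 + 5 * hours
            + 200 * np.sin(2 * np.pi * hours / 24)         # daily pattern
            + rng.normal(0, 20, hours.size))

def features(h):
    # trend plus hour-of-day seasonality encoded with sine/cosine
    return np.column_stack([h, np.sin(2 * np.pi * h / 24), np.cos(2 * np.pi * h / 24)])

model = LinearRegression().fit(features(hours), arrivals)

future = np.arange(hours[-1] + 1, hours[-1] + 25)          # the next 24 hours
forecast = model.predict(features(future))
print("predicted peak arrival rate over the next day:", int(forecast.max()), "queries/hour")
```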

  4. Cost-Model Oblivious Database Tuning with Reinforcement Learning (DEXA 2015)

    Summary: In this paper, we propose a learning approach to adaptive performance tuning of database applications. The objective is to validate the opportunity to devise a tuning strategy that does not need prior knowledge of a cost model. Instead, the cost model is learned through reinforcement learning. We instantiate our approach to the use case of index tuning. We model the execution of queries and updates as a Markov decision process whose states are database configurations, actions are configuration changes, and rewards are functions of the cost of configuration change and query and update evaluation. During the reinforcement learning process, we face two important challenges: not only the unavailability of a cost model, but also the size of the state space. To address the latter, we devise strategies to prune the state space, both in the general case and for the use case of index tuning. We empirically and comparatively evaluate our approach on a standard OLTP dataset. We show that our approach is competitive with state-of-the-art adaptive index tuning, which is dependent on a cost model.
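
A minimal tabular Q-learning sketch of the kind of MDP the paper describes: states are index configurations, actions toggle one index, and the reward is the negative of the configuration-change cost plus the workload cost. The costs and the tiny workload are invented; the paper learns costs from actual query execution and prunes a much larger state space.

```python
import random
from collections import defaultdict

COLUMNS = ["a", "b", "c"]
WORKLOAD = ["a", "a", "b", "a", "c"]               # the column each query filters on

def workload_cost(config):
    return sum(1.0 if col in config else 10.0 for col in WORKLOAD)

def step(config, action):
    new = config ^ {action}                         # toggle the index on that column
    return new, -(5.0 + workload_cost(new))         # 5.0 = cost of changing the configuration

Q = defaultdict(float)
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1
for episode in range(2_000):
    config = frozenset()                            # each episode starts with no indexes
    for _ in range(4):
        if random.random() < EPS:
            action = random.choice(COLUMNS)
        else:
            action = max(COLUMNS, key=lambda a: Q[(config, a)])
        new_config, reward = step(config, action)
        best_next = max(Q[(new_config, a)] for a in COLUMNS)
        Q[(config, action)] += ALPHA * (reward + GAMMA * best_next - Q[(config, action)])
        config = new_config

best_first = max(COLUMNS, key=lambda a: Q[(frozenset(), a)])
print("from an empty configuration, the learned policy first indexes column:", best_first)
```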

  5. Learning to Optimize Join Queries With Deep Reinforcement Learning (arXiv 2019)

    Summary: Exhaustive enumeration of all possible join orders is often avoided, and most optimizers leverage heuristics to prune the search space. The design and implementation of heuristics are well-understood when the cost model is roughly linear, and we find that these heuristics can be significantly suboptimal when there are non-linearities in cost. Ideally, instead of a fixed heuristic, we would want a strategy to guide the search space in a more data-driven way---tailoring the search to a specific dataset and query workload. Recognizing the link between classical Dynamic Programming enumeration methods and recent results in Reinforcement Learning (RL), we propose a new method for learning optimized join search strategies. We present our RL-based DQ optimizer, which currently optimizes select-project-join blocks. We implement three versions of DQ to illustrate the ease of integration into existing DBMSes: (1) A version built on top of Apache Calcite, (2) a version integrated into PostgreSQL, and (3) a version integrated into SparkSQL. Our extensive evaluation shows that DQ achieves plans with optimization costs and query execution times competitive with the native query optimizer in each system, but can execute significantly faster after learning (often by orders of magnitude).
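
For context, the classical enumeration the paper connects to RL is a Selinger-style dynamic program over relation subsets, where the DP table plays the role of a value function. A compact version with invented cardinalities and selectivities follows; DQ instead learns a neural cost-to-go model and uses it to guide the search.

```python
from itertools import combinations

CARD = {"A": 10_000, "B": 500, "C": 2_000, "D": 50}                    # base-table cardinalities
SEL = {frozenset(p): s for p, s in [(("A", "B"), 0.001), (("B", "C"), 0.01),
                                    (("C", "D"), 0.05), (("A", "D"), 0.002)]}

def join_card(left, right, left_card, right_card):
    sel = 1.0
    for l in left:
        for r in right:
            sel *= SEL.get(frozenset((l, r)), 1.0)     # cross product if no join predicate
    return left_card * right_card * sel

# best[subset] = (cost, cardinality, plan); the DP table is the "value function".
best = {frozenset([r]): (0.0, CARD[r], r) for r in CARD}
relations = list(CARD)
for size in range(2, len(relations) + 1):
    for subset in map(frozenset, combinations(relations, size)):
        for k in range(1, size):
            for left in map(frozenset, combinations(subset, k)):
                right = subset - left
                lcost, lcard, lplan = best[left]
                rcost, rcard, rplan = best[right]
                out_card = join_card(left, right, lcard, rcard)
                cost = lcost + rcost + out_card        # a simple C_out-style cost model
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, out_card, f"({lplan} ⋈ {rplan})")

cost, card, plan = best[frozenset(relations)]
print(f"best plan: {plan}   estimated cost: {cost:,.0f}")
```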

  6. Learned Cardinalities: Estimating Correlated Joins with Deep Learning

    Summary: We describe a new deep learning approach to cardinality estimation. MSCN is a multi-set convolutional network, tailored to representing relational query plans, that employs set semantics to capture query features and true cardinalities. MSCN builds on sampling-based estimation, addressing its weaknesses when no sampled tuples qualify a predicate, and in capturing join-crossing correlations. Our evaluation of MSCN using a real-world dataset shows that deep learning significantly enhances the quality of cardinality estimation, which is the core problem in query optimization.

#### V. Query execution (sorting, join, aggregation, scheduling, etc.)

#### VI. Query formulation (query-by-example, human-in-the-loop)

  1. Optimization for active learning-based interactive database exploration (PVLDB '18)

    Summary: There is an increasing gap between fast growth of data and limited human ability to comprehend data. Consequently, there has been a growing demand of data management tools that can bridge this gap and help the user retrieve high-value content from data more effectively. In this work, we aim to build interactive data exploration as a new database service, using an approach called "explore-by-example". In particular, we cast the explore-by-example problem in a principled "active learning" framework, and bring the properties of important classes of database queries to bear on the design of new algorithms and optimizations for active learning-based database exploration. These new techniques allow the database system to overcome a fundamental limitation of traditional active learning, i.e., the slow convergence problem. Evaluation results using real-world datasets and user interest patterns show that our new system significantly outperforms state-of-the-art active learning techniques and data exploration systems in accuracy while achieving desired efficiency for interactive performance.
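
A bare-bones explore-by-example loop, assuming a synthetic two-attribute table and a hidden rectangular interest region: fit a classifier to the tuples labeled so far and repeatedly ask the "user" to label the tuple the model is least certain about (uncertainty sampling). The paper's contribution is the query-aware optimizations that make such a loop converge quickly; none of those are reproduced here.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
tuples = rng.uniform(0, 10, size=(5_000, 2))                    # a two-attribute table
user_interest = lambda x: float(2 < x[0] < 4 and 5 < x[1] < 8)  # hidden target region

labeled_idx = [int(i) for i in rng.choice(len(tuples), 30, replace=False)]
labels = [user_interest(tuples[i]) for i in labeled_idx]
while len(set(labels)) < 2:                                     # ensure both classes are present
    i = int(rng.integers(len(tuples)))
    labeled_idx.append(i)
    labels.append(user_interest(tuples[i]))

for _ in range(25):
    clf = SVC(kernel="rbf", probability=True).fit(tuples[labeled_idx], labels)
    proba = clf.predict_proba(tuples)[:, 1]
    # Ask the "user" to label the tuple the model is least certain about.
    for i in np.argsort(np.abs(proba - 0.5)):
        if int(i) not in labeled_idx:
            labeled_idx.append(int(i))
            labels.append(user_interest(tuples[i]))
            break

truth = np.array([user_interest(t) for t in tuples])
accuracy = np.mean(clf.predict(tuples) == truth)
print(f"agreement with the hidden interest region after 25 rounds: {accuracy:.2%}")
```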

  2. Explore-by-example: an automatic query steering framework for interactive data exploration (SIGMOD '14)

    Summary: Interactive Data Exploration (IDE) is a key ingredient of a diverse set of discovery-oriented applications, including ones from scientific computing and evidence-based medicine. In these applications, data discovery is a highly ad hoc interactive process where users execute numerous exploration queries using varying predicates, aiming to balance the trade-off between collecting all relevant information and reducing the size of returned data. Therefore, there is a strong need to support these human-in-the-loop applications by assisting their navigation in the data to find interesting objects. In this paper, we introduce AIDE, an Automatic Interactive Data Exploration framework that iteratively steers the user towards interesting data areas and “predicts” a query that retrieves his objects of interest.

  3. AIDE: an automatic user navigation system for interactive data exploration (PVLDB '15)

    Summary: Data analysts often engage in data exploration tasks to discover interesting data patterns, without knowing exactly what they are looking for. Such exploration tasks can be very labor-intensive because they often require the user to review many results of ad-hoc queries and adjust the predicates of subsequent queries to balance the trade-off between collecting all interesting information and reducing the size of returned data. In this demonstration we introduce AIDE, a system that automates these exploration tasks. AIDE steers the user towards interesting data areas based on her relevance feedback on database samples, aiming to achieve the goal of identifying all database objects that match the user interest with high efficiency. In our demonstration, conference attendees will see AIDE in action for a variety of exploration tasks on real-world datasets.

  4. Exemplar queries: Give me an example of what you need (PVLDB '14)

    Summary: Search engines are continuously employing advanced techniques that aim to capture user intentions and provide results that go beyond the data that simply satisfy the query conditions. Examples include personalized results, related searches, similarity search, and popular and relaxed queries. In this work we introduce a novel query paradigm that considers a user query as an example of the data in which the user is interested. We call these queries “exemplar queries” and claim that they can play an important role in dealing with the information deluge. We provide a formal specification of the semantics of such queries and show that they are fundamentally different from notions like queries by example, approximate and related queries.

    We provide an implementation of these semantics for graph-based data and present an exact solution with a number of optimizations that improve performance without compromising the quality of the answers. We study two different similarity functions, isomorphism and strong simulation, for retrieving the answers to an exemplar query, and we provide solutions for both. We also provide an approximate solution that prunes the search space and achieves considerably better time-performance with minimal or no impact on effectiveness. We experimentally evaluate the effectiveness and efficiency of these solutions with synthetic and real datasets, and illustrate the usefulness of exemplar queries in practice.
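
A small illustration of the exemplar-query idea on a toy knowledge graph, using networkx subgraph isomorphism as the similarity function (the paper also studies strong simulation and adds substantial pruning). The graph, entities, and edge labels are invented for this sketch.

```python
import networkx as nx
from networkx.algorithms import isomorphism

data = nx.DiGraph()
data.add_edges_from([
    ("Google", "YouTube", {"rel": "acquired"}),
    ("Google", "Mountain View", {"rel": "headquartered_in"}),
    ("Facebook", "Instagram", {"rel": "acquired"}),
    ("Facebook", "Menlo Park", {"rel": "headquartered_in"}),
])

# The exemplar "Google acquired YouTube": its structure, not its constants, is the query.
example = nx.DiGraph()
example.add_edge("Google", "YouTube", rel="acquired")

matcher = isomorphism.DiGraphMatcher(
    data, example, edge_match=isomorphism.categorical_edge_match("rel", None))
answers = []
for mapping in matcher.subgraph_isomorphisms_iter():    # data-graph node -> example node
    role = {v: k for k, v in mapping.items()}
    answers.append((role["Google"], role["YouTube"]))
print(answers)   # the exemplar itself plus analogous pairs such as ('Facebook', 'Instagram')
```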

  5. Querying Knowledge Graphs by Example Entity Tuples (TKDE, Volume 27, 2015)

    Summary: We witness an unprecedented proliferation of knowledge graphs that record millions of entities and their relationships. While knowledge graphs are structure-flexible and content-rich, they are difficult to use. The challenge lies in the gap between their overwhelming complexity and the limited database knowledge of non-professional users. If writing structured queries over “simple” tables is difficult, complex graphs are only harder to query. As an initial step toward improving the usability of knowledge graphs, we propose to query such data by example entity tuples, without requiring users to form complex graph queries. Our system, Graph Query By Example (GQBE), automatically discovers a weighted hidden maximum query graph based on input query tuples, to capture a user's query intent. It then efficiently finds and ranks the top approximate matching answer graphs and answer tuples. We conducted experiments and user studies on the large Freebase and DBpedia datasets and observed appealing accuracy and efficiency. Our system provides a complementary approach to the existing keyword-based methods, facilitating user-friendly graph querying. To the best of our knowledge, there was no such proposal in the past in the context of graphs.

#### VII. Data analytics (data cubes, AQP, machine learning)

  1. Database Learning: Toward a Database that Becomes Smarter Every Time

    Summary: In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their answers stem from the same underlying distribution that has produced the entire dataset. Exploiting and refining this knowledge should allow us to answer queries more analytically, rather than by reading enormous amounts of raw data. Also, processing more queries should continuously enhance our knowledge of the underlying distribution, and hence lead to increasingly faster response times for future queries.

    We call this novel idea---learning from past query answers---Database Learning. We exploit the principle of maximum entropy to produce answers, which are in expectation guaranteed to be more accurate than existing sample-based approximations. Empowered by this idea, we build a query engine on top of Spark SQL, called Verdict. We conduct extensive experiments on real-world query traces from a large customer of a major database vendor. Our results demonstrate that database learning supports 73.7% of these queries, speeding them up by up to 23.0x for the same accuracy level compared to existing AQP systems.

  2. Data Synthesis based on Generative Adversarial Networks (PVLDB '18)

    Summary: In this paper, we propose a method that meets both requirements. Our method, called table-GAN, uses generative adversarial networks (GANs) to synthesize fake tables that are statistically similar to the original table yet do not incur information leakage. We show that the machine learning models trained using our synthetic tables exhibit performance that is similar to that of models trained using the original table for unknown testing cases. We call this property model compatibility.

#### VIII. AI (generative adversarial networks, reinforcement learning, active learning, etc.)

  1. Generative Adversarial Nets (NIPS '14)

    Summary: This paper proposes a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
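
A minimal GAN on one-dimensional data, written in PyTorch, just to make the minimax game concrete: D learns to separate real samples from G's output, and G learns to fool D (using the non-saturating generator loss). The architectures and hyperparameters are arbitrary illustrative choices, not from the paper.

```python
import torch
import torch.nn as nn

real_data = lambda n: torch.randn(n, 1) * 1.25 + 4.0          # target distribution N(4, 1.25)
noise = lambda n: torch.randn(n, 64)

G = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(3_000):
    # Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
    real, fake = real_data(128), G(noise(128)).detach()
    d_loss = bce(D(real), torch.ones(128, 1)) + bce(D(fake), torch.zeros(128, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: push D(G(z)) toward 1 (the non-saturating generator loss).
    fake = G(noise(128))
    g_loss = bce(D(fake), torch.ones(128, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

samples = G(noise(10_000)).detach()
print(f"generated mean/std: {samples.mean().item():.2f} / {samples.std().item():.2f} "
      f"(target 4.00 / 1.25)")
```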

  2. NIPS 2016 Tutorial: Generative Adversarial Networks

    Summary: This report summarizes the tutorial presented by the author at NIPS 2016 on generative adversarial networks (GANs). The tutorial describes: (1) Why generative modeling is a topic worth studying, (2) how generative models work, and how GANs compare to other generative models, (3) the details of how GANs work, (4) research frontiers in GANs, and (5) state-of-the-art image models that combine GANs with other methods. Finally, the tutorial contains three exercises for readers to complete, and the solutions to these exercises.

  3. How Generative Adversarial Networks and Their Variants Work: An Overview (ACM Computing Surveys 2019)

    Summary: Generative Adversarial Networks (GANs) have received wide attention in the machine learning field for their potential to learn high-dimensional, complex real data distributions. Specifically, they do not rely on any assumptions about the distribution and can generate real-like samples from latent space in a simple manner. This powerful property allows GANs to be applied to various applications such as image synthesis, image attribute editing, image translation, domain adaptation, and other academic fields. In this article, we discuss the details of GANs for those readers who are familiar with, but do not comprehend GANs deeply or who wish to view GANs from various perspectives. In addition, we explain how GANs operate and the fundamental meaning of various objective functions that have been suggested recently. We then focus on how the GAN can be combined with an autoencoder framework. Finally, we enumerate the GAN variants that are applied to various tasks and other fields for those who are interested in exploiting GANs for their research.

  4. Wasserstein GAN (arXiv preprint 2017)

    Summary: We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we show that the corresponding optimization problem is sound, and provide extensive theoretical work highlighting the deep connections to other distances between distributions.

  5. A Brief Survey of Deep Reinforcement Learning (IEEE Signal Processing Magazine 2017)

    Summary: Deep reinforcement learning (DRL) is poised to revolutionize the field of artificial intelligence (AI) and represents a step toward building autonomous systems with a higher-level understanding of the visual world. Currently, deep learning is enabling reinforcement learning (RL) to scale to problems that were previously intractable, such as learning to play video games directly from pixels. DRL algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of RL, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep RL, including the deep Q-network (DQN), trust region policy optimization (TRPO), and asynchronous advantage actor critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via RL. To conclude, we describe several current areas of research within the field.