zhangyuan edited this page Jul 30, 2019 · 1 revision

Socrates: The New SQL Server in the Cloud

KEYWORDS

Database as a Service, Cloud Database Architecture, High Availability

log --> durability --> does not require copies in fast storage

storage --> availability --> does not require a fixed number of replicas

Concretely, Socrates requires

  • less expensive copies of data in fast local storage,

  • fewer copies of data overall,

  • less network bandwidth,

  • less compute resources to keep copies up-to-date

than other database architectures currently on the market.

2 Four example systems

  • SQL DB - HADR

  • Google Spanner

  • Amazon Aurora

  • Oracle Exadata/RAC

3 Important SQL Server features

3.1 Page Version Store

Compute nodes must also share row versions in the shared storage tier.

3.2 Accelerated Database Recovery

// the undo phase is skipped

ADR eliminates the undo phase in many cases, so the database becomes available immediately after the analysis and redo phases, a constant-time operation bounded by the checkpointing interval.

3.3 Resilient Buffer Pool Extension

The buffer pool is extended onto local SSD; reading a page from it is faster than fetching the page from a remote server.

3.4 RBIO protocol

A new stateless network protocol.

3.5 Snapshot Backup/Restore

Relies on the XStore blob snapshot mechanism; backup time is independent of data size.

3.6 I/O stack Virtualization

An I/O abstraction layer, comparable to the NCDB client.

4 Socrates architecture

4.1 Design Goals and Principles

4.1.1 Local Fast Storage vs. Cheap, Scalable, Durable Storage.

Space can be extended dynamically in the Socrates storage tier.

4.1.2 Bounded-time Operations.

In a traditional database, the time taken by maintenance operations depends on the data volume.

4.1.3 From Shared-nothing to Shared-disk

Saves storage resources and uses CPU effectively.

4.1.4 Low Log Latency, Separation of Log

A separate log service makes reading and shipping the log more flexible.

Therefore, Socrates keeps recent log records in main memory and distributes them in a scalable way (potentially to hundreds of machines) whereas old log records are destaged and made available only upon demand.

4.1.5 Pushdown Storage Functions.

Most importantly, every database function that can be offloaded to storage (whether backup, checkpoint, IO filtering, etc.) relieves the Primary Compute node and the log, the two bottlenecks of the system.

4.1.6 Reuse Components, Tuning, Optimization.

The query optimizer, the query runtime, security, transaction management and recovery, etc. are unchanged.

4.2 Socrates Architecture Overview

Four tiers: compute nodes -> XLOG service -> storage (Page Servers) -> XStore service

Compute nodes and Page Servers are stateless.

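Secondaries consume the log asynchronously.

A Page Server manages one partition of the database on local SSD and has two roles:

  1. Serve pages to the compute nodes

  2. Checkpoint and back up to XStore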

The XLOG service achieves low commit latencies and good scalability at the storage tier (scale-out).

XStore is a highly scalable, durable, and cheap storage service based on hard disks.

4.3 XLOG service

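XIO keeps 3 replicas.

The Primary writes log synchronously to the LZ (landing zone).

The log format is backward compatible, and the new format lets readers read the log while it is still being written.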

One way to think about this scheme is that Socrates writes synchronously and reliably into the LZ for durability and asynchronously to the XLOG process for availability.

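Are blocks hardened in the LZ safe? Have they not been persisted???

The LZ is a circular buffer.

Only blocks that have been written to the LZ are moved out of the pending blocks to the Log Broker. The Log Broker first places them in the local SSD cache and eventually writes them to XStore.

XStore retains 30 days of log by default.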

This local SSD cache is another circular buffer of the tail of the log.
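A circular buffer over the tail of the log can be pictured with the minimal sketch below. It is an illustration only, not the XLOG implementation: the class name `LogTailCache`, the use of block sequence numbers as keys, and the capacity are all assumptions made for the example.

```python
class LogTailCache:
    """Fixed-capacity circular buffer holding the most recent log blocks.

    Hypothetical sketch: block sequence numbers start at 0 and grow by 1.
    """

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._next_seq = 0  # sequence number the next appended block will get

    def append(self, block):
        """Store a block, overwriting the oldest one once the buffer is full."""
        seq = self._next_seq
        self._slots[seq % len(self._slots)] = block
        self._next_seq += 1
        return seq

    def get(self, seq):
        """Return the block if it is still in the tail, else None."""
        oldest = max(0, self._next_seq - len(self._slots))
        if oldest <= seq < self._next_seq:
            return self._slots[seq % len(self._slots)]
        return None  # overwritten or not written yet; the caller tries another tier
```

A `None` result is the signal to fall back to a slower tier, which is what makes a small tail cache usable in front of a much larger log.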

Lookup order: first the sequence map, then the local SSD cache, then the LT (XStore), and finally the LZ???


4.4 Primary Compute Node and GetPage@LSN

Differences between the Socrates Primary and a traditional SQL Server:

• Storage level operations such as checkpoint, backup/restore, page repair, etc. are delegated to the Page Servers and lower storage tiers.

• The Socrates Primary writes log to the LZ using the virtualized filesystem mechanism of Section 3.6. This mechanism produces an I/O pattern that is compatible with the LZ concept described in Section 4.3.

• The Socrates Primary makes use of the RBPEX cache (Section 3.3). RBPEX is integrated transparently as a layer just above the I/O virtualization layer.

• Arguably, the biggest difference is that a Socrates Primary does not keep a full copy of the database. It merely caches a hot portion of the database that fits into its main memory buffers and SSD (RBPEX).

getPage(pageId, LSN)

To guarantee freshness, the Page Server handles a getPage(X, X-LSN) request in the following way:

(1) Wait until it has applied all log records from XLOG up to X-LSN.

(2) Return Page X.
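The two steps above can be sketched as a wait on the applied LSN followed by a lookup. This is a toy model under stated assumptions, not the Socrates code: `PageServer`, `apply_log_record`, the in-memory dict of pages and the `timeout` parameter are hypothetical names introduced for the example.

```python
import threading


class PageServer:
    """Toy Page Server: applies log in LSN order and serves getPage(X, X-LSN)."""

    def __init__(self):
        self._pages = {}        # pageId -> page contents (hypothetical representation)
        self._applied_lsn = 0   # highest LSN applied so far
        self._cond = threading.Condition()

    def apply_log_record(self, lsn, page_id, contents):
        """Apply one log record and wake readers waiting for this LSN."""
        with self._cond:
            self._pages[page_id] = contents
            self._applied_lsn = lsn
            self._cond.notify_all()

    def get_page(self, page_id, x_lsn, timeout=None):
        """Step (1): wait until log up to x_lsn is applied. Step (2): return the page."""
        with self._cond:
            caught_up = self._cond.wait_for(
                lambda: self._applied_lsn >= x_lsn, timeout
            )
            if not caught_up:
                raise TimeoutError(f"log not applied up to LSN {x_lsn}")
            return self._pages.get(page_id)
```

The request carries the LSN so that the Page Server never returns a version of the page older than the one the caller needs; a page that is newer than X-LSN is acceptable.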

// the following strategy is the same as NCDB LELs

Instead, the Primary builds a hash map (on pageId) which stores in each bucket the highest LSN for every page evicted from the Primary keyed by pageId. Given that Page X was evicted at some point from the Primary, this mechanism will guarantee to give an X-LSN value that is at least as large as the largest LSN for Page X and is, thus, safe.
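A minimal sketch of such a map, assuming a fixed number of buckets that each remember only the highest LSN evicted into them. The names `EvictedLsnMap`, `record_eviction` and `lsn_for`, the bucket count, and the use of Python's built-in `hash` are assumptions for illustration, not the actual data structure.

```python
class EvictedLsnMap:
    """Bounded map from pageId to a safe X-LSN for GetPage@LSN requests."""

    def __init__(self, num_buckets=1024):
        # Each bucket keeps the highest LSN of any page evicted into it.
        self._buckets = [0] * num_buckets

    def _bucket(self, page_id):
        return hash(page_id) % len(self._buckets)

    def record_eviction(self, page_id, page_lsn):
        """Called when the Primary evicts a page whose latest LSN is page_lsn."""
        b = self._bucket(page_id)
        self._buckets[b] = max(self._buckets[b], page_lsn)

    def lsn_for(self, page_id):
        """Return an LSN at least as large as the page's LSN at eviction time."""
        return self._buckets[self._bucket(page_id)]
```

Pages that collide in a bucket share one value, so the returned LSN may be larger than strictly necessary. That costs some extra waiting at the Page Server but never returns a stale page, and the memory footprint stays fixed.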

// the Primary does not need to care about the Secondaries???

A Socrates Primary Compute node behaves almost identically to a standalone process in an on-premise SQL Server installation. The database instance itself is unaware of the presence of other replicas.

4.5 Secondary Compute Node

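A Secondary does not store the log; the log is the responsibility of the XLOG service.

A Secondary does not care who produces the log.

A Secondary applies log records only to pages that are in its buffer.

A Secondary first registers the GetPage@LSN request of a read-only transaction. When the log apply thread needs to check whether a page is in the buffer pool, it must wait for a pending GetPage@LSN request on that page to complete.

Socrates reads the page at the batch start LSN. If it read the page at the batch end LSN instead, no update would be lost, but a page brought into the buffer ahead of time must then not have those log records applied again; this is decided by comparing against the page's LSN. If they were applied anyway, the following problem could arise: suppose the batch contains several log records for the same page, log1 (reorganize) and log2 (insert). Applying log1 and log2 again would leave the page inconsistent.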
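The two rules in these notes (apply only to cached pages, and use the page's LSN to avoid applying a record twice) can be sketched as follows. This is an illustrative model only: `CachedPage`, `apply_record`, the dict used as a buffer pool and the string return values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CachedPage:
    page_lsn: int   # LSN of the last log record reflected in this page
    rows: list      # stand-in for the page contents


def apply_record(buffer_pool, page_id, lsn, row):
    """Apply one log record on a Secondary and report what happened."""
    page = buffer_pool.get(page_id)
    if page is None:
        # Page is not cached: nothing to do, it will be fetched on demand.
        return "skipped: not cached"
    if lsn <= page.page_lsn:
        # The page was read at a newer LSN and already contains this change.
        return "skipped: already reflected"
    page.rows.append(row)
    page.page_lsn = lsn
    return "applied"
```

The LSN comparison is what makes the apply step idempotent, so a page fetched ahead of the apply thread is not modified a second time by records it already contains.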

Also, if this page is an ordinary page, the undo pages it involves may not have been applied yet; undo pages should be applied before ordinary pages.

// How is a B-tree split handled??? If the Secondary detects such an inconsistency during an index traversal, it will pause to give the log apply thread some time to consume more log to refresh the stale index pages (i.e., Page P). After that pause, it will restart the B-tree traversal, hoping that the index structure is now consistent.
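The pause-and-restart behaviour reads like a bounded retry loop. The sketch below assumes the traversal signals an inconsistency by raising an exception; `StaleIndexError`, `traverse_with_retry`, the attempt limit and the pause length are all hypothetical choices made for the example.

```python
import time


class StaleIndexError(Exception):
    """Raised by a traversal that notices an inconsistent (stale) index page."""


def traverse_with_retry(traverse, max_attempts=5, pause_seconds=0.01):
    """Run `traverse`, pausing and restarting whenever it reports a stale page."""
    for _ in range(max_attempts):
        try:
            return traverse()
        except StaleIndexError:
            # Give the log apply thread time to consume more log, then start over.
            time.sleep(pause_seconds)
    raise StaleIndexError(f"index still inconsistent after {max_attempts} attempts")
```

The attempt limit is an addition of this sketch; the notes only say the Secondary pauses and restarts in the hope that the index is consistent by then.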

4.6 Page Servers

128GB

A Page Server is responsible for (i) maintaining a partition of the database by applying log to it, (ii) responding to GetPage@LSN requests from Compute nodes and (iii) performing distributed checkpoints and taking backups.

A Page Server processes only the log for its own partition.
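As a rough sketch of per-partition filtering, assume each log record names the page it touches and each Page Server owns a contiguous range of page IDs. Both are assumptions for the example (the notes do not say how partitions are defined), as are the names `LogRecord` and `records_for_partition`.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    lsn: int
    page_id: int


def records_for_partition(records, first_page_id, last_page_id):
    """Keep only records whose page falls in [first_page_id, last_page_id]."""
    return [r for r in records if first_page_id <= r.page_id <= last_page_id]
```

For example, a Page Server owning pages 0 to 99 would keep the records for pages 5 and 99 and ignore a record for page 100, while preserving LSN order among the records it keeps.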

Does a log block include the IDs of the pages it touches???

Compute nodes cache the hottest pages for best performance; their caches are sparse. In contrast, Page Servers cache less hot pages, those pages that are not hot enough to make it in the Compute node’s cache.

A Page Server caches all pages of its partition, so it is not afraid of read amplification.

5 SOCRATES AT WORK

The Socrates mini-services like Primary, Secondaries, XLOG, and Page Servers are autonomous and decoupled and communication is asynchronous whenever possible.

6 DISCUSSION & SOCRATES DEPLOYMENTS

7 PERFORMANCE EXPERIMENTS AND RESULTS

7.1 Software and Services Used

Tested with CDB (Cloud Database Benchmark). Compared with HADR, throughput drops by 5%, caused by remote I/O and remote XLOG.

0%

7.2 Experiment 1: CDB Default Mix, Throughput, Production Cluster

Compared with HADR, throughput drops by 5%, caused by remote I/O and remote XLOG.

7.3 Experiment 2: Caching Behavior

The SSD cache hit rate is reasonably good for its size.

Table 4 shows that even though the cache is only about 1% of the size of the database, Socrates had a 32% hit rate.

7.4 Experiment 3: Update-heavy CDB, Log Throughput

In the scenario where the log is the bottleneck, Socrates wins, because Socrates performs backup in the storage tier (XStore).

8 CONCLUSION

Socrates relies on the well established principle of separating Compute and Storage to achieve better availability and elasticity. Furthermore, Socrates separates durability and availability.

The big advantage of this separation is that it allows customer requirements regarding the cost / performance / availability tradeoff to be met flexibly.