Socrates
Socrates: The New SQL Server in the Cloud
KEYWORDS
Database as a Service, Cloud Database Architecture, High Availability
log --> durability --> does not require copies in fast storage
storage --> availability --> does not require a fixed number of replicas
Concretely, Socrates requires
- less expensive copies of data in fast local storage,
- fewer copies of data overall,
- less network bandwidth,
- less compute resources to keep copies up-to-date

than other database architectures currently on the market.
- Four examples for comparison:
  - SQL DB (HADR)
  - Google Spanner
  - Amazon Aurora
  - Oracle Exadata/RAC
- Important SQL Server features
3.1 page version store
Compute nodes must also share row versions in the shared storage tier.
3.2 Accelerated Database Recovery
// The undo phase is skipped.
ADR can eliminate the undo phase in many cases, so the database becomes available immediately after the analysis and redo phases, a constant-time operation bounded by the checkpointing interval.
3.3 Resilient Buffer Pool Extension
The buffer pool is extended onto local SSD; reading a page from there is faster than fetching it from a remote server.
3.4 RBIO protocol
A new stateless network protocol.
3.5 Snapshot Backup/Restore
Relies on the XStore blob snapshot mechanism; backup time is independent of data size.
3.6 I/O stack Virtualization
An I/O abstraction layer, comparable to the NCDB client.
4 Socrates architecture
4.1 Design Goals and Principles
4.1.1 Local Fast Storage vs. Cheap, Scalable, Durable Storage.
Space can be expanded dynamically in the Socrates storage tier.
4.1.2 Bounded-time Operations.
In traditional databases, the time taken by maintenance operations depends on the amount of data.
4.1.3 From Shared-nothing to Shared-disk
Saves storage resources and uses CPU effectively.
4.1.4 Low Log Latency, Separation of Log
A separate log service makes reading and shipping log more flexible.
Therefore, Socrates keeps recent log records in main memory and distributes them in a scalable way (potentially to hundreds of machines) whereas old log records are destaged and made available only upon demand.
4.1.5 Pushdown Storage Functions.
Most importantly, every database function that can be offloaded to storage (whether backup, checkpoint, IO filtering, etc.) relieves the Primary Compute node and the log, the two bottlenecks of the system.
4.1.6 Reuse Components, Tuning, Optimization.
The query optimizer, the query runtime, security, transaction management and recovery, etc. are unchanged.
4.2 Socrates Architecture Overview
Four tiers: compute nodes -> XLOG service -> storage (Page Servers) -> XStore service
Compute nodes and Page Servers are stateless.
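Secondaries consume the log asynchronously.
Each Page Server manages one partition of the database on local SSD and has two roles:
- serving pages to compute nodes
- checkpointing and taking backups in XStore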
The XLOG service achieves low commit latencies and good scalability at the storage tier (scale-out)
XStore is a highly scalable, durable, and cheap storage service based on hard disks.
4.3 XLOG service
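XIO keeps 3 replicas.
The Primary writes log synchronously to the landing zone (LZ).
The log format is backward compatible; the new format allows log to be read while it is still being written.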
One way to think about this scheme is that Socrates writes synchronously and reliably into the LZ for durability and asynchronously to the XLOG process for availability.
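Open question: is it safe to treat blocks as hardened once they are in the LZ? Are they really persisted?
The LZ is a circular buffer.
A block is moved from the pending blocks to the Log Broker only after it has been written to the LZ. The Log Broker first stages it in the local SSD cache and eventually writes it to XStore.
XStore retains 30 days of log by default.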
This local SSD cache is another circular buffer of the tail of the log.
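Both the LZ and the local SSD cache are described above as circular buffers over the tail of the log. The following is a minimal sketch of that idea only, not the XLOG implementation; the class `LogTailBuffer`, its methods, and the use of block sequence numbers are hypothetical names chosen for illustration.

```python
class LogTailBuffer:
    """Toy circular buffer that keeps only the most recent log blocks.

    Hypothetical illustration: real log blocks are addressed by LSN and
    live on durable storage, not in a Python list.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.next_seq = 0  # sequence number the next appended block will get

    def append(self, block):
        # Overwrite the oldest slot once the buffer has wrapped around.
        seq = self.next_seq
        self.slots[seq % self.capacity] = block
        self.next_seq += 1
        return seq

    def read(self, seq):
        # Only the last `capacity` blocks are still present.
        oldest = max(0, self.next_seq - self.capacity)
        if oldest <= seq < self.next_seq:
            return self.slots[seq % self.capacity]
        return None  # overwritten or not yet written: fall back to a slower tier


buf = LogTailBuffer(capacity=3)
for name in ["b0", "b1", "b2", "b3", "b4"]:
    buf.append(name)
print(buf.read(4), buf.read(1))
```

A reader that gets `None` back has to fetch the block from another tier, which is why older log can be destaged while the tail stays fast.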
Lookup order: the sequence map first, then the local SSD cache, then the LT (XStore), and finally the LZ???
4.4 Primary Compute Node and GetPage@LSN
Differences between the Socrates Primary and a traditional SQL Server:
• Storage level operations such as checkpoint, backup/restore, page repair, etc. are delegated to the Page Servers and lower storage tiers.
• The Socrates Primary writes log to the LZ using the virtualized filesystem mechanism of Section 3.6. This mechanism produces an I/O pattern that is compatible with the LZ concept described in Section 4.3.
• The Socrates Primary makes use of the RBPEX cache (Section 3.3). RBPEX is integrated transparently as a layer just above the I/O virtualization layer.
• Arguably, the biggest difference is that a Socrates Primary does not keep a full copy of the database. It merely caches a hot portion of the database that fits into its main memory buffers and SSD (RBPEX).
getPage(pageId, LSN)
To guarantee freshness, the Page Server handles a getPage(X, X-LSN) request in the following way:
(1) Wait until it has applied all log records from XLOG up to X-LSN.
(2) Return Page X.
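The two steps above can be sketched as a blocking read guarded by a condition variable. This is a toy model under stated assumptions, not the Socrates code: `PageServer`, `apply_log_record`, `get_page`, and the in-memory `pages` dictionary are hypothetical names, and log records are assumed to arrive in LSN order.

```python
import threading


class PageServer:
    """Toy Page Server: applies log in LSN order and serves GetPage@LSN."""

    def __init__(self):
        self.pages = {}        # page_id -> page contents (hypothetical store)
        self.applied_lsn = 0   # highest LSN applied so far
        self._cond = threading.Condition()

    def apply_log_record(self, lsn, page_id, contents):
        # Called by the log-apply loop as records arrive from XLOG.
        with self._cond:
            self.pages[page_id] = contents
            self.applied_lsn = lsn
            self._cond.notify_all()  # wake readers waiting for this LSN

    def get_page(self, page_id, x_lsn, timeout=None):
        with self._cond:
            # (1) Wait until all log records up to x_lsn have been applied.
            caught_up = self._cond.wait_for(
                lambda: self.applied_lsn >= x_lsn, timeout
            )
            if not caught_up:
                raise TimeoutError("log apply has not reached the requested LSN")
            # (2) Return the page.
            return self.pages.get(page_id)


server = PageServer()
server.apply_log_record(1, "X", "version 1")
print(server.get_page("X", 1))
```

A request for an LSN the server has not reached yet simply blocks until the apply loop catches up, which is what makes the returned page fresh enough for the caller.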
// The strategy below is the same as NCDB LELs.
Instead, the Primary builds a hash map (on pageId) which stores in each bucket the highest LSN for every page evicted from the Primary keyed by pageId. Given that Page X was evicted at some point from the Primary, this mechanism will guarantee to give an X-LSN value that is at least as large as the largest LSN for Page X and is, thus, safe.
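A minimal sketch of such a map, assuming a fixed number of buckets so that several pages may share one: each bucket remembers only the highest LSN of any page evicted into it. The class `EvictedLsnMap`, its method names, and the bucket count are hypothetical. A collision can only raise the returned LSN, which stays safe but may make the Page Server wait slightly longer.

```python
class EvictedLsnMap:
    """Toy per-bucket record of the highest LSN among evicted pages."""

    def __init__(self, num_buckets=1024):
        self.buckets = [0] * num_buckets

    def _bucket(self, page_id):
        return hash(page_id) % len(self.buckets)

    def on_evict(self, page_id, page_lsn):
        # Keep the maximum LSN seen for this bucket.
        b = self._bucket(page_id)
        self.buckets[b] = max(self.buckets[b], page_lsn)

    def lsn_for_get_page(self, page_id):
        # At least as large as the last LSN of page_id at eviction time.
        return self.buckets[self._bucket(page_id)]


m = EvictedLsnMap(num_buckets=4)
m.on_evict(1, 100)
m.on_evict(5, 250)  # page 5 shares a bucket with page 1 in this toy setup
print(m.lsn_for_get_page(1))
```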
// Does the Primary not need to care about Secondaries?
A Socrates Primary Compute node behaves almost identically to a standalone process in an on-premise SQL Server installation. The database instance itself is unaware of the presence of other replicas.
4.5 Secondary Compute Node
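Secondaries do not store log; the XLOG service is responsible for it.
Secondaries do not care who produces the log.
Secondaries apply log only to pages that are in their buffer.
A Secondary first registers the GetPage@LSN requests issued by read-only transactions. When the apply thread needs to check whether a page is in the buffer pool and such a request is pending, it must wait for the GetPage@LSN request to complete.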
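Socrates reads pages at the batch start LSN. If pages were read at the batch end LSN instead, no update would be lost, but a page that was brought into the buffer ahead of the apply thread must not have the same log applied again; the page's LSN can be used to decide this. Applying the log anyway could cause the following problem: suppose the batch contains several log records for the same page, log1 (reorganize) and log2 (insert). Re-applying log1 and then log2 would leave the page inconsistent.
Also, if this is an ordinary page, the undo page it depends on may not have been applied yet; undo pages should be applied before ordinary pages.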
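The page-LSN check described in these notes can be sketched as follows. This illustrates the note-taker's reasoning rather than the paper's code; `Page`, `apply_batch`, and the `(lsn, page_id, row)` record shape are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    lsn: int = 0                      # LSN of the last change reflected in the page
    rows: list = field(default_factory=list)


def apply_batch(buffer_pool, batch):
    """Apply (lsn, page_id, row) records in LSN order; return how many were applied."""
    applied = 0
    for lsn, page_id, row in batch:
        page = buffer_pool.get(page_id)
        if page is None:
            continue  # page is not cached on this Secondary: nothing to apply
        if lsn <= page.lsn:
            continue  # page already reflects this record: do not apply it twice
        page.rows.append(row)
        page.lsn = lsn
        applied += 1
    return applied


pool = {1: Page(lsn=10, rows=["a"])}
print(apply_batch(pool, [(9, 1, "old"), (11, 1, "b"), (12, 2, "c")]))
```

Here the record with LSN 9 is skipped because the cached page is already at LSN 10, and the record for page 2 is skipped because that page is not in the buffer.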
// How is a B-tree split handled?
If the Secondary detects such an inconsistency during an index traversal, it will pause to give the log apply thread some time to consume more log to refresh the stale index pages (i.e., Page P). After that pause, it will restart the B-tree traversal, hoping that the index structure is now consistent.
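The pause-and-restart behaviour can be sketched as a small retry loop. This is an assumption-laden illustration: `StaleIndexError`, `traverse_with_retry`, the retry limit, and the pause length are all hypothetical, and the source does not say how many retries are attempted.

```python
import time


class StaleIndexError(Exception):
    """Raised by a traversal that sees an inconsistent (stale) index page."""


def traverse_with_retry(traverse, max_retries=5, pause_seconds=0.01):
    """Run traverse(); on a stale page, pause so log apply can catch up, then restart."""
    for attempt in range(max_retries + 1):
        try:
            return traverse()
        except StaleIndexError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            time.sleep(pause_seconds)  # let the log apply thread refresh pages


attempts = {"n": 0}


def flaky_traversal():
    # Pretend the index is stale for the first two attempts.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise StaleIndexError()
    return "leaf page found"


print(traverse_with_retry(flaky_traversal))
```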
4.6 Page Servers
128GB
A Page Server is responsible for (i) maintaining a partition of the database by applying log to it, (ii) responding to GetPage@LSN requests from Compute nodes and (iii) performing distributed checkpoints and taking backups.
A Page Server only processes the log that belongs to its own partition.
Open question: do log blocks carry the IDs of the pages they touch?
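As a rough sketch of per-partition filtering, assume each log record carries the ID of the page it modifies and that a partition is a contiguous page ID range. Both are assumptions made for illustration; the function `records_for_partition` and the `(lsn, page_id, payload)` record shape are hypothetical.

```python
def records_for_partition(log_records, first_page_id, last_page_id):
    """Keep only the (lsn, page_id, payload) records whose page is in this partition."""
    return [
        record
        for record in log_records
        if first_page_id <= record[1] <= last_page_id
    ]


log = [(1, 5, "insert"), (2, 150, "update"), (3, 99, "delete")]
print(records_for_partition(log, 0, 99))
```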
Compute nodes cache the hottest pages for best performance; their caches are sparse. In contrast, Page Servers cache less hot pages, those pages that are not hot enough to make it in the Compute node’s cache.
A Page Server caches all of its pages, so read amplification is not a concern.
5 SOCRATES AT WORK
The Socrates mini-services like Primary, Secondaries, XLOG, and Page Servers are autonomous and decoupled and communication is asynchronous whenever possible.
6 DISCUSSION & SOCRATES DEPLOYMENTS
7 PERFORMANCE EXPERIMENTS AND RESULTS
7.1 Software and Services Used
Tested with CDB (Cloud Database Benchmark). Compared with HADR, throughput drops by 5%; the cause is remote I/O and the remote XLOG.
0%
7.2 Experiment 1: CDB Default Mix, Throughput, Production Cluster
Compared with HADR, throughput drops by 5%; the cause is remote I/O and the remote XLOG.
7.3 Experiment 2: Caching Behavior
The SSD cache hit rate is reasonably good relative to the cache size.
Table 4 shows that even though the cache is only about 1% of the size of the database, Socrates had a 32% hit rate.
7.4 Experiment 3: Update-heavy CDB, Log Throughput
When the log is the bottleneck, Socrates wins, because its backups are taken in the storage tier (XStore).
8 CONCLUSION
Socrates relies on the well established principle of separating Compute and Storage to achieve better availability and elasticity. Furthermore, Socrates separates durability and availability.
The big advantage of this separation is that it allows to flexibly meet customer requirements regarding the cost / performance / availability tradeoff.