diff --git a/editions/1/zh/api.html b/editions/1/zh/api.html new file mode 100644 index 0000000..5b080b8 --- /dev/null +++ b/editions/1/zh/api.html @@ -0,0 +1,491 @@ +The Core API + + + + + + + + + + + +

The Core API

+ +

本章节将仔细的来探索CouchDB. 我们会讲解所有的关于CouchDB的重要的话题以及一些明智的解决方法. 我们会讲解一些最佳实践并在一些常见问题上进行指导. + +

让我们从回顾在前几个章节的操作开始, 看看这些操作的背后在做什么. 我们还会讲解Futon在它的底层需要做些什么来提供给我们先前我们看到的那些美妙的特性. + +

这个章节同时是一个关于核心CouchDB API的介绍和参考. 如果你记不起来如何执行一个特殊请求或者忘记了为什么需要某个参数, 你总是可以回到这里来查找(我们自己可能是使用这一章节最多的用户). + +

当我们在探索API时, 有时候需要绕个弯路来解释某个特定请求的原因. 这对我们来说, 是一个告诉你为什么CouchDB这么工作的好机会. + +

API可以被分成下面的几个部分. 我们会分别来看它们: + +

+ +

服务器

+ +

这部分既基础又简单, 可以用来检查CouchDB是否正在运行. 对于需要特定CouchDB版本的软件库, 也可以把它作为一个安全检查, 用来确认CouchDB的版本. 我们会再一次用到curl这个工具. + +

+curl http://127.0.0.1:5984/
+
+ +

CouchDB响应:

+ +
+{"couchdb":"Welcome","version":"0.10.1"}
+
+ +

你会得到一个JSON字符串. 如果把它解析成你所使用的编程语言里的原生对象或数据结构, 你会得到一个welcome字符串和版本信息. + +

这些并不是非常有用, 但是它很好的展示了与CouchDB交互的一种方法. 发送一个HTTP请求然后就会在HTTP响应里收到一个JSON字符串作为结果. + +
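如果你想在命令行里直接取出响应中的某个字段, 可以把它交给一个JSON处理工具. 下面是一个小示例(假设你已经安装了jq, 它并不是CouchDB自带的工具): + +

+# 用jq从欢迎信息里取出版本号(假设本机已安装jq)
+curl -s http://127.0.0.1:5984/ | jq -r '.version'
+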

数据库

+ +

现在让我们做些更加有用的: 创建数据库. 严格地说, CouchDB是一个数据库管理系统(DMS), 这意味着它可以管理多个数据库. 一个数据库相当于一个桶, 这个桶里保存着一些"相互联系的数据". 我们会在后面解释这到底意味着什么. 而在实践中, 这个术语的用法有些重叠: 人们经常把DMS称为数据库, 也把DMS里的单个数据库称为DMS. 我们也可能会沿用这种不太严谨的说法, 请不要被它搞混了. 通常情况下, 通过上下文还是能分清楚我们讲的到底是整个CouchDB, 还是CouchDB里的某一个数据库. + +

现在让我们来创建一个! 我们想要存储我们最喜欢的音乐专辑, 把数据库命名为albums. 注意, 我们又一次使用了-X这个选项来告诉curl来发送一个PUT请求而不是默认的GET请求: + +

+curl -X PUT http://127.0.0.1:5984/albums
+
+ +

CouchDB响应: + +

+{"ok":true}
+
+ +

这样. 你创建了一个数据库, CouchDB告诉你一切正常. 如果你试图创建一个已经存在的数据库会发生什么事? 我们来试试再创建同一个数据库: + +

+curl -X PUT http://127.0.0.1:5984/albums
+
+ +

CouchDB响应:

+ +
+{"error":"file_exists","reason":"The database could not be created, the file already exists."}
+
+ +

我们得到了一个错误. 这相当的方便. 我们同时学到了一点关于CouchDB是如何工作的知识. CouchDB里每个数据库都存储在一个单一的文件里. 非常简单, 虽然这样做会产生一些影响, 但现在请先忽略这些细节, 我们会在附录F, B-Trees的威力中再详细介绍底层的存储系统. + +

让我们来创建另一个数据库, 这次带上curl的-v选项("verbose"的简写). verbose选项告诉curl不仅仅只显示必要的信息(HTTP的响应体), 还要显示请求与响应的全部细节: + +

+curl -vX PUT http://127.0.0.1:5984/albums-backup
+
+ +

curl详细显示: + +

+* About to connect() to 127.0.0.1 port 5984 (#0)
+*   Trying 127.0.0.1... connected
+* Connected to 127.0.0.1 (127.0.0.1) port 5984 (#0)
+> PUT /albums-backup HTTP/1.1
+> User-Agent: curl/7.16.3 (powerpc-apple-darwin9.0) libcurl/7.16.3 OpenSSL/0.9.7l zlib/1.2.3
+> Host: 127.0.0.1:5984
+> Accept: */*
+>
+< HTTP/1.1 201 Created
+< Server: CouchDB/0.9.0 (Erlang OTP/R12B)
+< Date: Sun, 05 Jul 2009 22:48:28 GMT
+< Content-Type: text/plain;charset=utf-8
+< Content-Length: 12
+< Cache-Control: must-revalidate
+<
+{"ok":true}
+* Connection #0 to host 127.0.0.1 left intact
+* Closing connection #0
+
+ +

满满的一屏幕. 让我们一行行的来看, 搞明白具体是在做什么并且找出哪些是重要的. 当看过几次这样的输出后, 你会更加容易的找出哪些是重要的. + +

+* About to connect() to 127.0.0.1 port 5984 (#0)
+
+ +

curl告诉我们它正在向我们的请求URI中的CouchDB服务器建立一个TCP连接. 这里没什么重要的东西, 只在调试网络问题的时候有用. + +

+*   Trying 127.0.0.1... connected
+* Connected to 127.0.0.1 (127.0.0.1) port 5984 (#0)
+
+ +

curl告诉我们成功的连接到了CouchDB. 这些也不重要, 如果没有发现什么网络问题的话. + +

下面的几行有一个>或者<的前缀. >的意思是这几行被逐字发送到CouchDB(不包括>). <的意思是这些是CouchDB发送回给curl的. + +

+> PUT /albums-backup HTTP/1.1
+
+ +

这行发起一个HTTP请求: 它的方法是PUT, URI是/albums-backup, HTTP版本是HTTP/1.1. 还有一个HTTP/1.0的版本, 它在某些情况下更简单, 但是因为各种现实原因, 我们应该使用HTTP/1.1. + +

接下来, 我们看到一些请求头. 这些是用来提供到CouchDB的请求的附加细节的. + +

+> User-Agent: curl/7.16.3 (powerpc-apple-darwin9.0) libcurl/7.16.3 OpenSSL/0.9.7l zlib/1.2.3
+
+ +

User-Agent头告诉CouchDB是哪种客户端软件在发起HTTP请求. 这里没什么新奇的东西, 我们用的就是curl. 在web开发中, 当某个客户端实现的请求在服务器端出现问题时, 这个头经常很有用. 它也可以帮助我们了解用户所在的平台, 这个信息可以用于一些针对性处理和数据统计. 对于CouchDB来说, User-Agent头是无关紧要的. + +

+> Host: 127.0.0.1:5984
+
+ +

Host头是HTTP/1.1要求必须携带的, 它告诉服务器请求的主机名. + +

+> Accept: */*
+
+ +

Accept头告诉CouchDB, curl接受任何媒体类型. 我们会在后面来深入了解为什么这很有用. + +

+>
+
+ +

一个空行表示请求头已经结束了, 剩下的请求包含我们要发送给数据库的数据. 在这个例子里, 我们不发送任何数据, 所以剩下的curl输出是HTTP响应的了. + +

+< HTTP/1.1 201 Created
+
+ +

CouchDB的HTTP响应的第一行包含了HTTP版本信息(也让我们知道, 服务器可以处理我们所使用的HTTP版本), HTTP状态码以及状态码信息. 不同的请求会触发不同的返回状态码. 有一系列的状态码用来告诉客户端(这个例子里是curl), 它发出的请求在服务器上起了什么作用; 或者, 如果有错误发生, 告诉客户端发生了什么错误. RFC 2616(HTTP 1.1规范)清楚地定义了各个状态码的含义, CouchDB完全遵守这个RFC. + +

201 Created状态码告诉客户端, 请求创建的资源被成功创建了. 这没有什么值得惊奇的, 但如果你还记得, 当我们第二次试图创建这个数据库时得到了一个错误, 就会明白那时返回的是一个不同的状态码. 根据返回码做出相应处理是一种常见的做法. 比如, 所有400或400以上的返回码都表示发生了某种错误. 如果你想简化逻辑并及时处理错误, 可以只检查>=400的返回码. + +
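如果你在脚本里只关心状态码本身, 可以让curl把状态码单独打印出来. 下面是一个简单的示意(写法不止这一种): + +

+# 只打印HTTP状态码, 丢弃响应体
+curl -s -o /dev/null -w '%{http_code}\n' -X PUT http://127.0.0.1:5984/albums
+# 第一次创建会得到201; 再次创建同一个数据库则会得到一个>=400的错误码
+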

+< Server: CouchDB/0.10.1 (Erlang OTP/R13B)
+
+ +

Server头对于诊断很有用, 它告诉我们是在和哪个版本的CouchDB以及底层哪个版本的Erlang打交道. 通常来说可以忽略这个头, 但当你需要它的时候, 要知道它在哪里. + +

+< Date: Sun, 05 Jul 2009 22:48:28 GMT
+
+ +

Date头告诉你服务器的时间. 因为客户端和服务器端的时间并不要求保持同步, 这个头只是纯粹告诉你服务器时间而已. 你不应该基于这个信息构建任何关键的应用逻辑. + +

+< Content-Type: text/plain;charset=utf-8
+
+ +

这个头告诉你HTTP响应体的Content-Type和编码. 我们已经知道CouchDB返回的是JSON字符串, 合适的Content-Type应该是application/json. 为什么我们看到的是text/plain呢? 这就是实践战胜纯粹理论的地方了: 发送application/json的Content-Type头会使浏览器把返回的JSON作为文件下载, 而不是直接显示它. 因为能在浏览器里测试CouchDB非常重要, CouchDB发送了text/plain的Content-Type, 这样浏览器就会把JSON以文本的形式显示出来. + +

有一些浏览器插件可以让你的浏览器认得出JSON, 但是它们并不是默认安装的. + +

你还记得Accept请求头吗? 它被设置成*/*, 表示curl接受任何的Content-Type. 如果你在请求里发送Accept: application/json, CouchDB就认为你可以处理纯JSON响应, 从而返回正确的Content-Type头(application/json), 而不是text/plain. + +
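你可以自己试一下, 用-H选项带上一个Accept头(这里用根URL来演示, -i让curl把响应头也打印出来): + +

+curl -i -H "Accept: application/json" http://127.0.0.1:5984/
+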

+< Content-Length: 12
+
+ +

这个Content-Length仅仅告诉我们响应体有多少字节. + +

+< Cache-Control: must-revalidate
+
+ +

这个Cache-Control告诉你, 或者任何在CouchDB和你之间的代理服务器, 不要缓存这个响应.

+ +
+<
+
+ +

这个空行告诉我们响应头已经完了, 接下来的是响应体了.

+ +
+{"ok":true}
+
+ +

我们以前已经看到过这个了.

+ +
+* Connection #0 to host 127.0.0.1 left intact
+* Closing connection #0
+
+ +

最后两行是curl告诉我们它会保持TCP连接打开一会, 但是在接收完整个响应后会关闭它. + +

贯穿于整书中, 我们会讲解更多的带-v选项的请求, 但会忽略掉一些我们已经在这里看过的头, 只讲解那些对于某个特定请求来说重要的头. + +

我们已经知道如何创建数据库了, 但是如何删除一个呢? 简单, 只要改变HTTP方法: + +

+curl -vX DELETE http://127.0.0.1:5984/albums-backup
+
+ +

这会删除一个CouchDB数据库. 这个请求会删除存储数据库内容的文件. 删除数据库时, 没有"你确定吗"这样的提醒或者"清空垃圾箱"之类的魔法. 请谨慎的使用这个命令. 你的数据会被删除, 并且如果你没有做复制, 就没有机会再轻易的恢复回来了. + +

这部分深入的讲解了HTTP并且为讲解剩下的CouchDB API建立了基础. 下一站: 文档. + +

文档

+ +

文档是CouchDB的核心数据结构. 文档背后的观念, 毫不意外, 就是真实世界里的文档: 像帐单, 食谱或者名片一样的小纸片. 我们已经知道了, CouchDB使用JSON格式存储文档. 让我们看看这种存储在底层是如何工作的. + +

CouchDB里的每个文档都有一个ID, 这个ID在每个数据库里都是唯一的. 你可以选择任何字符串作为ID, 但是我们推荐使用UUID(或者GUID), 即Universally (or Globally) Unique IDentifier. UUID是一些重复概率极小的随机数, 小到即使每个人每分钟产生成千上万个UUID, 持续几百万年都几乎不会有重复产生. 这是一种非常好的方法, 用来保证两个不同的人不会创建出相同ID的文档. 为什么你要关心其他人在干什么? 第一, 那个"其他人"可能是以后某个时间在另一台电脑上的你自己; 第二, CouchDB让你可以和其他人分享文档, 它使用UUID来保证这一切正常工作. 待会我们再详细解释, 现在先来创建几个文档. + +

+curl -X PUT http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af -d '{"title":"There is Nothing Left to Lose","artist":"Foo Fighters"}'
+
+ +

CouchDB响应: + +

+{"ok":true,"id":"6e1295ed6c29495e54cc05947f18c8af","rev":"1-2902191555"}
+
+ +

这个curl命令看起来有些复杂, 我们来分解一下. 首先-X PUT告诉curl发起一个PUT请求. 它后面跟着一个URL, 指定你的CouchDB的IP地址和端口. URL的资源部分/albums/6e1295ed6c29495e54cc05947f18c8af指定了文档在我们的albums数据库中的位置. 那串乱七八糟的数字和字母是一个UUID, 也就是你的文档的ID. 最后, -d标志告诉curl用后面跟着的字符串来做PUT请求的请求体. 这个字符串是一个简单的JSON结构, 包含了title和artist两个域以及它们相应的值. + +

+ +

如果你手头上没有UUID, 你可以让CouchDB给你一个(实际上, 这正是我们刚才做的, 只不过没有向你展示出来). 仅仅需要发送一个GET请求到 /_uuids: + +

+curl -X GET http://127.0.0.1:5984/_uuids
+
+ +

CouchDB响应: + +

+{"uuids":["6e1295ed6c29495e54cc05947f18c8af"]}
+
+ +

如果你需要多于一个的UUID, 你可以传入?count=10 的HTTP参数来请求10个UUID, 或者任何你想要的数字.
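比如(URL里带有?时最好加上引号, 防止shell做通配符展开):

+curl -X GET 'http://127.0.0.1:5984/_uuids?count=10'
+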

+ +
+ +

为了确认CouchDB没有撒谎说它已经保存了你的文档, 实际上却并没有(通常它不会撒谎的), 试着用一个GET请求来得到这个文档. + +

+curl -X GET http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af
+
+ +

我们希望你能看出来这种模式. CouchDB里的一切东西都有一个地址, 一个URI; 你使用不同的HTTP方法来操作这些URI. + +

CouchDB响应: + +

+{"_id":"6e1295ed6c29495e54cc05947f18c8af","_rev":"1-2902191555","title":"There is Nothing Left to Lose","artist":"Foo Fighters"}
+
+ +

这和你要CouchDB保存的文档很相似, 很好. 但是你应该注意到了, CouchDB在JSON结构中加了两个域. 第一个是_id, 它的值是我们要求CouchDB保存的文档的UUID. 请求一个文档时总是能得到文档的ID, 这很方便. + +

第二个是_rev. 它代表修订号. + +

修订号

+ +

如果你想更改CouchDB里的一个文档, 不是去找那个文档中的某个域然后插入一个新值. 而是从CouchDB载入整个文档, 在得到的JSON结构里作改变(或者是一个对象, 如果你在使用某个编程语言), 然后把整个新修订的文档存回CouchDB. 每个修订由一个新的_rev值标识. + +
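这个"读出-修改-存回"的流程在命令行里大致是下面这个样子(只是一个模式示意, 并不是后面例子的一部分; 这里借助jq取出_rev, jq并不是必需的): + +

+# 读出文档, 记下它当前的_rev(假设已安装jq)
+REV=$(curl -s http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af | jq -r '._rev')
+echo "当前修订号: $REV"
+# 之后把带有这个_rev的完整新文档PUT回同一个URL即可
+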

如果你想要更新或者删除一个文档, CouchDB会期望你提供一个_rev域来标识你要改变的那个修订. 当CouchDB接受了一次更改以后, 它会产生一个新的修订号. 这种机制保证了, 万一有人在你之前对文档做了一次你并不知情的更新, CouchDB不会接受你的更新, 因为那样你会覆盖你并不知道已经存在的数据. 或者简单点说: 谁先保存了对一个文档的改变, 谁就赢了. 让我们来看看如果不提供_rev域会发生什么(这和提供一个过时的_rev值效果是一样的). + +

+curl -X PUT http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af -d '{"title":"There is Nothing Left to Lose","artist":"Foo Fighters","year":"1997"}'
+
+ +

CouchDB响应: + +

+{"error":"conflict","reason":"Document update conflict."}
+
+ +

如果你看到了这个, 在JSON结构里加上你的文档的最新修订号: + +

+curl -X PUT http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af -d '{"_rev":"1-2902191555","title":"There is Nothing Left to Lose", "artist":"Foo Fighters","year":"1997"}'
+
+ +

现在你发现为什么在作初始请求时CouchDB会返回_rev是件很方便的事了吧. CouchDB响应: + +

+{"ok":true,"id":"6e1295ed6c29495e54cc05947f18c8af","rev":"2-2739352689"}
+
+ +

CouchDB接受了你的写请求并且它也产生了一个新的修订号. 修订号是文档的md5散列, 加上一个N-的前缀表示文档被更新的次数. 这对复制很有用. 具体查看第17章, 冲突管理. + +

为什么CouchDB使用这种修订系统, 也被叫作多版本并发控制(MVCC), 有多个原因. 我们来解释其中的一些. + +

CouchDB使用的HTTP协议的一个特性便是它的无状态性. 这是什么意思? 要和CouchDB交流, 你需要发出请求. 发出一个请求包括打开一个到CouchDB的网络连接, 交换字节, 然后关闭连接. 每做一个请求, 这些步骤都会重复一遍. 其他协议允许你打开一个连接, 交换字节, 保持连接打开, 然后在此后交换更多的字节--可能是根据你一开始交换的内容--最后再关闭连接. 但为今后的使用保持一个连接打开, 要求服务器做额外的工作. 一种常见的模式是, 在一个连接的生命周期里, 客户端拥有一个持久的, 静态的服务器端数据视图, 而管理大量这样的并行连接需要极大的工作量. HTTP连接通常是短生命周期的, 在这种前提下做出同样的保证要轻松很多. 结果就是, CouchDB可以处理更多的并发连接. + +

另外一个原因是这个模型在概念上更简单, 因此更加容易编程. CouchDB使用了更少的代码来达到目标, 而使用更少的代码总是好事, 因为固定行数代码的缺陷比例是固定的. + +

修订系统对于复制和存储机制也有积极的作用, 但我们将在本书的后面章节来讲到它们. + +

+ +

术语版本(version)和修订(revision)听起来似乎很熟悉(如果你编程时不使用版本控制, 现在就赶紧放下本书, 先去找个流行的版本控制系统学习一下). 使用文档的新修订看起来很像版本控制, 但是它们有一个很重要的区别: CouchDB并不保证老版本会一直保留下来. + +

+ +

文档的细节

+ +

现在让我们用curl-v选项来仔细的看看文档创建请求, 这在之前我们探索数据库API的时候很有用. 这也是一个创建更多文档的好机会, 以便我们在今后的例子中使用. + +

我们会增加一些喜欢的音乐专辑. 从/_uuids这个URI资源得到一个新的UUID. 如果你不记得这是怎么做了, 把书翻回去几页找找. + +

+curl -vX PUT http://127.0.0.1:5984/albums/70b50bfa0a4b3aed1f8aff9e92dc16a0 -d '{"title":"Blackened Sky","artist":"Biffy Clyro","year":2002}'
+
+ +
+ +

顺便提一下, 如果你正好知道这些专辑的更多信息, 请不要犹豫, 把这些属性都加上. 如果你不知道所有专辑的全部信息, 也不用担心, CouchDB的无模式文档可以只包含你知道的那部分. 总之, 放轻松, 不用为数据的结构操心. + +

+ +

带着-v选项, CouchDB响应的重要部分看起来应该像是这样: + +

+> PUT /albums/70b50bfa0a4b3aed1f8aff9e92dc16a0 HTTP/1.1
+>
+< HTTP/1.1 201 Created
+< Location: http://127.0.0.1:5984/albums/70b50bfa0a4b3aed1f8aff9e92dc16a0
+< Etag: "1-2248288203"
+<
+{"ok":true,"id":"70b50bfa0a4b3aed1f8aff9e92dc16a0","rev":"1-2248288203"}
+
+ +

在返回头中, 我们得到了一个201 Created的HTTP状态码, 这在之前我们创建数据库时也见过了. Location头告诉我们最新创建的文档的完整URL. 此外还有一个新的头: Etag. 在HTTP里, 一个Etag标识了一个资源的特定版本. 在这个例子里, 它标识了我们的新文档的一个特定版本. 听起来很熟悉? 是的, 从概念上讲, Etag就相当于CouchDB文档的修订号, 所以CouchDB直接使用修订号作为Etag也没有什么可惊讶的了. Etag在缓存系统中很有用, 我们会在第8章, 显示函数中学到如何使用它. + +
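在继续之前, 不妨再多创建几张专辑, 以便后面的例子使用. 比如(这里的UUID和专辑信息都只是随手举的例子, 换成你自己从/_uuids拿到的值即可): + +

+curl -X PUT http://127.0.0.1:5984/albums/2957f550588dcab36f0c0dcbdd0cd47e -d '{"title":"Infinity on High","artist":"Fall Out Boy","year":2007,"genre":"rock"}'
+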

附件
+ +

CouchDB文档可以有附件, 就像email可以带附件一样. 一个附件由一个名字, 它的MIME类型(或者Content-Type)以及它的字节数来标识. 附件可以是任何数据. 最简单的理解是: 附件就是附加在文档上的文件. 这些文件可以是文本, 图像, Word文档, 音乐或者电影文件. 让我们来创建一个. + +

附件有它们自己的URL, 你可以把数据上传到那. 假设我们想要把一张专辑的封面添加到文档6e1295ed6c29495e54cc05947f18c8af, 并且假设封面文档是当前目录下的artwork.jpg: + +

+> curl -vX PUT http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af/artwork.jpg?rev=2-2739352689 --data-binary @artwork.jpg -H "Content-Type: image/jpg"
+
+ +

--data-binary @ 选项告诉curl读取文件的内容放到HTTP请求体里. 我们使用-H选项告诉CouchDB我们上传的是一个JPG文件. CouchDB会保存这个信息, 并且在我们再次请求这个附件的时候返回合适的Content-Type头; 对于像这样的一张图像, 浏览器会直接显示它, 而不是让你下载数据. 这在今后会变得很方便. 注意, 你需要提供想要附加到的那个文档的当前修订号, 就和更新文档时一样, 因为添加附件本质上也是在修改这个文档. + +

如果你把你的浏览器指向http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af/artwork.jpg, 你应该会看到你的封面图片. + +
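在命令行里也一样, 可以直接把附件下载到本地文件:

+curl -o artwork-downloaded.jpg http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af/artwork.jpg
+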

如果你再一次请求文档, 你会看到一个新的域: + +

+curl http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af
+
+ +

CouchDB响应:

+ +
+{"_id":"6e1295ed6c29495e54cc05947f18c8af","_rev":"3-131533518","title": "There is Nothing Left to Lose","artist":"Foo Fighters","year":"1997","_attachments":{"artwork.jpg":{"stub":true,"content_type":"image/jpg","length":52450}}}
+
+ +

_attachments是一个key/value的列表, 其中value是包含附件元数据的JSON对象. stub=true告诉我们, 这个条目只是附件的元数据, 并不包含附件本身的数据. 如果我们在请求文档时加上?attachments=true这个HTTP选项, 就会得到内嵌了附件数据的base64编码内容. + +
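比如:

+# 请求文档时把附件以base64编码的形式内嵌返回
+curl 'http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af?attachments=true'
+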

在我们探索CouchDB特性时, 会看到更多的文档请求选项. 比如复制, 我们的下一个主题. + +

复制

+ +

CouchDB复制是一个用于数据库同步的机制. 很像rsync在本地或者网络上同步两个目录, 复制也在本地或者远程同步两个数据库. + +

只需要一个简单的POST请求, 告诉CouchDB复制的源和目标, CouchDB就会找出源上有哪些文档以及哪些新的文档修订是目标上没有的, 并把它们传送到目标上. + +

我们会在本书的后面深入的探索复制; 在本章节中, 我们只是展示如何使用它. + +

首先, 我们创建一个目标数据库. 注意, 如果目标数据库不存在, CouchDB不会自动为你创建, 而是会返回一个复制失败的错误(少了源数据库也一样, 不过这个错误很不容易犯:) + +

+curl -X PUT http://127.0.0.1:5984/albums-replica
+
+ +

现在我们可以使用数据库albums-replica作为复制目标了: + +

+curl -vX POST http://127.0.0.1:5984/_replicate -d '{"source":"albums","target":"albums-replica"}'
+
+ +
+ +

在版本0.11中, CouchDB在POST到_replicate URL的JSON里支持了"create_target":true这个选项. 如果目标数据库不存在, 它会隐式地创建该数据库. + +
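用法大致是在POST的JSON里多加一个字段(仅在0.11及以后的版本里有效):

+curl -X POST http://127.0.0.1:5984/_replicate -d '{"source":"albums","target":"albums-replica","create_target":true}'
+

下面还是回到刚才那个普通的复制请求. + +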

+ +

CouchDB响应(这次我们对输出做了格式化, 这样你可以更加简单的读它): + +

+{
+  "history": [
+    {
+      "start_last_seq": 0,
+      "missing_found": 2,
+      "docs_read": 2,
+      "end_last_seq": 5,
+      "missing_checked": 2,
+      "docs_written": 2,
+      "doc_write_failures": 0,
+      "end_time": "Sat, 11 Jul 2009 17:36:21 GMT",
+      "start_time": "Sat, 11 Jul 2009 17:36:20 GMT"
+    }
+  ],
+  "source_last_seq": 5,
+  "session_id": "924e75e914392343de89c99d29d06671",
+  "ok": true
+}
+
+ +

CouchDB会维护一份复制历史. 复制请求的响应里会包含本次复制会话的历史记录. 复制请求会一直保持打开, 直到复制结束; 如果你有很多文档, 把它们全部复制完会花上一点时间, 在此之前你不会收到复制的响应. 有一点很重要: 复制只会复制复制开始那一刻数据库里已有的数据, 任何在复制开始之后的添加, 更改或者删除都不会被复制. + +
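如果你希望复制开始之后的改变也能持续地同步过去, 较新版本的CouchDB支持持续复制(continuous replication), 大致的用法是在请求的JSON里加上"continuous":true(本书后面讲复制时会再详细介绍):

+curl -X POST http://127.0.0.1:5984/_replicate -d '{"source":"albums","target":"albums-replica","continuous":true}'
+

现在回到刚才那次一次性复制的响应上来. + +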

最后的"ok":true告诉我们一切顺利. 如果现在你看下albums-replica数据库, 你应该会看到所有你在albums数据库创建的文档. + +

刚才所做的在CouchDB的术语里叫做本地复制. 你创建了一个本地的数据库的副本. 这对于备份, 或者保留一份在某个特定时间的快照的数据用于日后使用来说是很有用的. 如果你在开发一个应用, 但是想在需要的时候可以返回到稳定的代码和数据版本, 你可能会想要这么做. + +

还有其他种类的复制, 适用于其他场景. 我们指定的复制source和target实际上是链接(就像HTML里的那种). 到目前为止, 我们看到的链接都是相对于当前正在使用的CouchDB服务器的(也就是本地的). 你也可以指定一个远程数据库作为目标: + +

+curl -vX POST http://127.0.0.1:5984/_replicate -d '{"source":"albums","target":"http://127.0.0.1:5984/albums-replica"}'
+
+ +

使用一个本地源和一个远程目标数据库被叫做推送复制. 我们把改变推到远程服务器. + +

+ +

这里因为我们没有第二个CouchDB服务器, 我们就使用了本地单一服务器的绝对地址来演示. 但是从这里你应该可以看出来, 一个远程的服务器也是可以这样工作的. + +

+ +

想和远程服务器或者对门的哥们共享数据, 这方法好极了. + +

你也可以使用一个远程源和一个本地目标做拉取复制. 想要拿到别人数据库上作的最新改变, 这是个好办法. + +

+curl -vX POST http://127.0.0.1:5984/_replicate -d '{"source":"http://127.0.0.1:5984/albums-replica","target":"albums"}'
+
+ +

最后, 你可以作远程复制, 这在进行管理操作时比较有用: + +

+curl -vX POST http://127.0.0.1:5984/_replicate -d '{"source":"http://127.0.0.1:5984/albums","target":"http://127.0.0.1:5984/albums-replica"}'
+
+ +
+ +

CouchDB和REST

+ +

CouchDB对于拥有一个REST化的API感到很自豪, 但是复制请求看起来并不是很REST化. 这里出了什么问题? CouchDB的核心数据库, 文档, 以及附件API是REST化的, 但并不是所有的CouchDB API都是. 复制API就是其中的一个例子. 还有更多的非REST形式的API, 我们会在本书的后面章节里看到. + +

为什么REST化的和非REST化的API会混在一起呢? 是开发人员懒得把所有API都REST化吗? 记住, REST是一种架构风格, 适合用来构建某些特定的架构(比如CouchDB的文档API), 但它不能解决所有的问题. 一种尺寸满足不了所有需求, 你懂的. 像复制这样触发一个动作的操作, 在REST的世界里并没有太大意义, 它更像是一种传统的远程过程调用. 所以CouchDB这样做并没有什么不妥. + +

我们非常相信"使用合适的工具来工作"的哲学, 而REST并不合适于所有的工作. 为了得到支持, 我们参考了Leonard Richardson和Sam Ruby的意见, 他们写了RESTful Web Services (O'Reilly)这本书, 他们和我们有着同样的观点. + +

+ +

收尾

+ +

这仍然不是完整的CouchDB API, 但是我们已经仔细讨论了必要的部分. 我们会在后面的章节里把剩下的慢慢补完. 现在我们相信你已经准备好构建CouchDB应用了. diff --git a/editions/1/zh/balancing.html b/editions/1/zh/balancing.html new file mode 100644 index 0000000..2314f30 --- /dev/null +++ b/editions/1/zh/balancing.html @@ -0,0 +1,41 @@ +Load Balancing + + + + + + + + + + + +

Load Balancing

+ +

Jill is woken up at 4:30 a.m. by her mobile phone. She receives text message after text message, one every minute. Finally, Joe calls. Joe is furious, and Jill has trouble understanding what Joe is saying. In fact, Jill has a hard time figuring out why Joe would call her in the middle of the night. Then she remembers: Joe is running an online shop selling sports gear on one of her servers, and he is furious because the server went down and now his customers in New Zealand are angry because they can’t get to the online shop. + +

This is a typical scenario, and you have probably seen many variations of it, being in the role of Jill, Joe, or both. If you are Jill, you want to sleep at night, and if you are Joe, you want your customers to buy from you whenever it pleases them. + +

Having a Backup

+ +

The problems persist: computers fail, and in many ways. There are hardware problems, power outages, bugs in the operating system or application software, etc. Only CouchDB doesn’t have any bugs. (Well, of course, that’s not true. All software has bugs, with the possible exception of things written by Daniel J. Bernstein and Donald Knuth.) + +

Whatever the cause is, you want to make sure that the service you are providing (in Jill and Joe’s case, the database for an online store) is resilient against failure. The road to resilience is a road of finding and removing single points of failure. A server’s power supply can fail. To keep the server from turning off during such an event, most come with at least two power supplies. To take this further, you could get a server where everything is duplicated (or more), but that would be a highly specialized (and expensive) piece of hardware. It is much cheaper to get two similar servers where the one can take over if the other has a problem. However, you need to make sure both servers have the same set of data in order to switch them without a user noticing. + +

Removing all single points of failure will give you a highly available or a fault-tolerant system. The order of tolerance is restrained only by your budget. If you can’t afford to lose a customer’s shopping cart in any event, you need to store it on at least two servers in at least two far apart geographical locations. + +

+ +

Amazon does this for the Amazon.com website. If one data center is the victim of an earthquake, a user will still be able to shop. + +

It is likely, though, that Amazon’s problems are not your problems and that you will have a whole set of new problems when your data center goes away. But you still want to be able to live through a server failure. + +

+ +

Before we dive into setting up a highly available CouchDB system, let’s look at another situation. Joe calls Jill during regular business hours and relays his customers’ complaints that loading the online shop takes “forever.” Jill takes a quick look at the server and concludes that this is a lucky problem to have, leaving Joe puzzled. Jill explains that Joe’s shop is suddenly attracting many more users who are buying things. Joe chimes in, “I got a great review on that blog. That’s where they must be coming from.” A quick referrer check reveals that indeed many of the new customers are coming from a single site. The blog post already includes comments from unhappy customers voicing their frustration with the slow site. Joe wants to make his customers happy and asks Jill what to do. Jill advises that they set up a second server that can take half of the load of the current server, making sure all requests get answered in a reasonable amount of time. Joe agrees, and Jill begins to set things up. + +

The solution to the outlined problem looks a lot like the earlier one for providing a fault-tolerant setup: install a second server and synchronize all data. The difference is that with fault tolerance, the second server just sits there and waits for the first one to fail. In the server-overload case, a second server helps answer all incoming requests. This case is not fault-tolerant: if one server crashes, the other will get all the requests and will likely break down, or at least provide very slow service, either of which is not acceptable. + +

Keep in mind that although the solutions look similar, high availability and fault tolerance are not the same. We’ll get back to the second scenario later on, but first we will take a look at how to set up a fault-tolerant CouchDB system. + +

We already gave it away in the previous chapters: the solution to synchronizing servers is replication. diff --git a/editions/1/zh/btree.html b/editions/1/zh/btree.html new file mode 100644 index 0000000..db02aaf --- /dev/null +++ b/editions/1/zh/btree.html @@ -0,0 +1,61 @@ +The Power of B-trees + + + + + + + + + + + +

The Power of B-trees

+ +

CouchDB uses a data structure called a B-tree to index its documents and views. We’ll look at B-trees enough to understand the types of queries they support and how they are a good fit for CouchDB. + +

This is our first foray into CouchDB internals. To use CouchDB, you don’t need to know what’s going on under the hood, but if you understand how CouchDB performs its magic, you’ll be able to pull tricks of your own. Additionally, if you understand the consequences of the ways you are using CouchDB, you will end up with smarter systems. + +

If you weren’t looking closely, CouchDB would appear to be a B-tree manager with an HTTP interface. + +

+ +

CouchDB is actually using a B+ tree, which is a slight variation of the B-tree that trades a bit of (disk) space for speed. When we say B-tree, we mean CouchDB’s B+ tree. + +

+ +

A B-tree is an excellent data structure for storing huge amounts of data for fast retrieval. When there are millions and billions of items in a B-tree, that’s when they get fun. B-trees are usually a shallow but wide data structure. While other trees can grow very high, a typical B-tree has a single-digit height, even with millions of entries. This is particularly interesting for CouchDB, where the leaves of the tree are stored on a slow medium such as a hard drive. Accessing any part of the tree for reading or writing requires visiting only a few nodes, which translates to a few head seeks (which are what make a hard drive slow), and because the operating system is likely to cache the upper tree nodes anyway, only the seek to the final leaf node is needed. + +

+ +

From a practical point of view, B-trees, therefore, guarantee an access time of less than 10 ms even for extremely large datasets. + +

—Dr. Rudolf Bayer, inventor of the B-tree + +

+ +

CouchDB’s B-tree implementation is a bit different from the original. While it maintains all of the important properties, it adds Multi-Version Concurrency Control (MVCC) and an append-only design. B-trees are used to store the main database file as well as view indexes. One database is one B-tree, and one view index is one B-tree. + +

MVCC allows concurrent reads and writes without using a locking system. Writes are serialized, allowing only one write operation at any point in time for any single database. Write operations do not block reads, and there can be any number of read operations at any time. Each read operation is guaranteed a consistent view of the database. How this is accomplished is at the core of CouchDB’s storage model. + +

The short answer is that because CouchDB uses append-only files, the B-tree root node must be rewritten every time the file is updated. However, old portions of the file will never change, so every old B-tree root, should you happen to have a pointer to it, will also point to a consistent snapshot of the database. + +

Early in the book we explained how the MVCC system uses the document’s _rev value to ensure that only one person can change a document version. The B-tree is used to look up the existing _rev value for comparison. By the time a write is accepted, the B-tree can expect it to be an authoritative version. + +

Since old versions of documents are not overwritten or deleted when new versions come in, requests that are reading a particular version do not care if new ones are written at the same time. With an often changing document, there could be readers reading three different versions at the same time. Each version was the latest one when a particular client started reading it, but new versions were being written. From the point when a new version is committed, new readers will read the new version while old readers keep reading the old version. + +

In a B-tree, data is kept only in leaf nodes. CouchDB B-trees append data only to the database file that keeps the B-tree on disk and grows only at the end. Add a new document? The file grows at the end. Delete a document? That gets recorded at the end of the file. The consequence is a robust database file. Computers fail for plenty of reasons, such as power loss or failing hardware. Since CouchDB does not overwrite any existing data, it cannot corrupt anything that has been written and committed to disk already. See Figure 1, “Flat B-tree and append-only”. + +

Committing is the process of updating the database file to reflect changes. This is done in the file footer, which is the last 4k of the database file. The footer is 2k in size and written twice in succession. First, CouchDB appends any changes to the file and then records the file’s new length in the first database footer. It then force-flushes all changes to disk. It then copies the first footer over to the second 2k of the file and force-flushes again. + +

+ + + +

Figure 1. Flat B-tree and append-only + +

+ +

If anywhere in this process a problem occurs—say, power is cut off and CouchDB is restarted later—the database file is in a consistent state and doesn’t need a checkup. CouchDB starts reading the database file backward. When it finds a footer pair, it makes some checks: if the first 2k are corrupt (a footer includes a checksum), CouchDB replaces it with the second footer and all is well. If the second footer is corrupt, CouchDB copies the first 2k over and all is well again. Only once both footers are flushed to disk successfully will CouchDB acknowledge that a write operation was successful. Data is never lost, and data on disk is never corrupted. This design is the reason for CouchDB having no off switch. You just terminate it when you are done. + +

There’s a lot more to say about B-trees in general, and if and how SSDs change the runtime behavior. The Wikipedia article on B-trees is a good starting point for further investigations. Scholarpedia includes notes by Dr. Rudolf Bayer, inventor of the B-tree. diff --git a/editions/1/zh/btree/01.png b/editions/1/zh/btree/01.png new file mode 100644 index 0000000..732b175 Binary files /dev/null and b/editions/1/zh/btree/01.png differ diff --git a/editions/1/zh/clustering.html b/editions/1/zh/clustering.html new file mode 100644 index 0000000..4de6259 --- /dev/null +++ b/editions/1/zh/clustering.html @@ -0,0 +1,117 @@ +Clustering + + + + + + + + + + + +

Clustering

+ +

OK, you’ve made it this far. I’m assuming you more or less understand what CouchDB is and how the application API works. Maybe you’ve deployed an application or two, and now you’re dealing with enough traffic that you need to think about scaling. “Scaling” is an imprecise word, but in this chapter we’ll be dealing with the aspect of putting together a partitioned or sharded cluster that will have to grow at an increasing rate over time from day one. + +

We’ll look at request and response dispatch in a CouchDB cluster with stable nodes. Then we’ll cover how to add redundant hot-failover twin nodes, so you don’t have to worry about losing machines. In a large cluster, you should plan for 5–10% of your machines to experience some sort of failure or reduced performance, so cluster design must prevent node failures from affecting reliability. Finally, we’ll look at adjusting cluster layout dynamically by splitting or merging nodes using replication. + +

Introducing CouchDB Lounge

+ +

CouchDB Lounge is a proxy-based partitioning and clustering application, originally developed for Meebo, a web-based instant messaging service. Lounge comes with two major components: one that handles simple GET and PUT requests for documents, and another that distributes view requests. + +

The dumbproxy handles simple requests for anything that isn’t a CouchDB view. This comes as a module for nginx, a high-performance reverse HTTP proxy. Because of the way reverse HTTP proxies work, this automatically allows configurable security, encryption, load distribution, compression, and, of course, aggressive caching of your database resources. + +

The smartproxy handles only CouchDB view requests, and dispatches them to all the other nodes in the cluster so as to distribute the work, making view performance a function of the cluster’s cumulative processing power. This comes as a daemon for Twisted, a popular and high-performance event-driven network programming framework for Python. + +

Consistent Hashing

+ +

CouchDB’s storage model uses unique IDs to save and retrieve documents. Sitting at the core of Lounge is a simple method of hashing your document IDs. Lounge then uses the first few characters of this hash to determine which shard to dispatch the request to. You can configure this behavior by writing a shard map for Lounge, which is just a simple text configuration file. + +

Because Lounge allocates a portion of the hash (known as a keyspace) to each node, you can add as many nodes as you like. Because the hash function produces hexadecimal strings that bear no apparent relation to your DocIDs, and because we dispatch requests based on the first few characters, we ensure that all nodes see roughly equal load. And because the hash function is consistent, Lounge will take any arbitrary DocID from an HTTP request URI and point it to the same node each time. + +

This idea of splitting a collection of shards based on a keyspace is commonly illustrated as a ring, with the hash wrapped around the outside. Each tic mark designates the boundaries in the keyspace between two partitions. The hash function maps from document IDs to positions on the ring. The ring is continuous so that you can always add more nodes by splitting a single partition into pieces. With four physical servers, you allocate the keyspace into 16 independent partitions by distributing them across the servers like so: + +

+ + + + + + + + + + + + + + + +
A0,1,2,3
B4,5,6,7
C8,9,a,b
Dc,d,e,f
+ +
+ +

If the hash of your DocID starts with 0, it would be dispatched to shard A. Similarly for 1, 2, or 3. Whereas, if the hash started with c, d, e, or f, it would be dispatched to shard D. As a full example, the hash 71db329b58378c8fa8876f0ec04c72e5 is mapped to the node B, database 7 in the table just shown. This could map to http://B.couches.local/db-7/ on your backend cluster. In this way, the hash table is just a mapping from hashes to backend database URIs. Don’t worry if this all sounds very complex; all you have to do is provide a mapping of shards to nodes and Lounge will build the hash ring appropriately—so no need to get your hands dirty if you don’t want to. + +

To frame the same concept with web architecture, because CouchDB uses HTTP, the proxy can partition documents according to the request URL, without inspecting the body. This is a core principle behind REST and is one of the many benefits using HTTP affords us. In practice, this is accomplished by running the hash function against the request URI and comparing the result to find the portion of the keyspace allocated. Lounge then looks up the associated shard for the hash in a configuration table, forwarding the HTTP request to the backend CouchDB server. + +

Consistent hashing is a simple way to ensure that you can always find the documents you saved, while balancing storage load evenly across partitions. Because the hash function is simple (it is based on CRC32), you are free to implement your own HTTP intermediaries or clients that can similarly resolve requests to the correct physical location of your data. + +
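As a rough illustration only (this is neither Lounge's actual hash function nor its shard map format), resolving a DocID to one of 16 shards could look something like this in a shell script:

+# Toy example: CRC32-hash a DocID (via cksum) and use the first hex
+# digit of the hash to pick one of 16 shards. Lounge's real code differs.
+docid="6e1295ed6c29495e54cc05947f18c8af"
+hash=$(printf '%s' "$docid" | cksum | awk '{printf "%08x", $1}')
+echo "DocID $docid maps to shard ${hash:0:1}"
+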

Redundant Storage

+ +

Consistent hashing solves the problem of how to break up a single logical database evenly across a set of partitions, which can then be distributed across multiple servers. It does not address the problem of how to ensure that data you’ve stored is safe from loss due to hardware or software failure. If you are serious about your data, you can’t consider it saved until you have at least two copies of it, preferably in different geographical locations. + +

CouchDB replication makes maintaining hot-failover redundant slaves or load-balanced multi-master databases relatively painless. The specifics of how to manage replication are covered in Chapter 16, Replication. What is important in this context is to understand that maintaining redundant copies is orthogonal to the harder task of ensuring that the cluster consistently chooses the same partition for a particular document ID. + +

For data safety, you’ll want to have at least two or three copies of everything. However, if you encapsulate redundancy, the higher layers of the cluster can treat each partition as a single unit and let the logical partitions themselves manage redundancy and failover. + +

Redundant Proxies

+ +

Just as we can’t accept the possibility of hardware failure leading to data loss, we’ll need to run multiple instances of the proxy nodes to avoid the chance that a proxy node crash could leave portions of the cluster unavailable. By running redundant proxy instances, and load balancing across them, we can increase cluster throughput as well as reliability. + +

View Merging

+ +

Consistent hashing leaves documents on the proper node, but documents can still emit() any key. The point of incremental MapReduce is to bring the function to the data, so we shouldn't redistribute the emitted keys; instead, we send the queries to the CouchDB nodes via HTTP proxy, and merge the results using the Twisted Python Smartproxy. + +

Smartproxy sends each view request to every node, so it needs to merge the responses before returning them to the client. Thankfully, this operation is not resource-intensive, as merging can be done in constant memory space no matter how many rows are returned. The Smartproxy receives the first row from each cluster node and compares them. We sort the nodes according to their row key using CouchDB’s collation rules. Smartproxy pops the top row from the first sorted node and returns it to the client. + +

This process can be repeated as long as the clients continue to send rows, but if a limit is imposed by the client, Smartproxy must end the response early, discarding any extra rows sent by the nodes. + +

This layout is simple and loosely coupled. It has the advantage that it’s simple, which helps in understanding topology and diagnosing failures. There is work underway to move the behavior to Erlang, which ought to make managing dynamic clusters possible as well as let us integrate cluster control into the CouchDB runtime. + +

Growing the Cluster

+ +

Using CouchDB at web scale likely requires CouchDB clusters that can be scaled dynamically. Growing sites must continuously add more storage capacity, so we need a strategy to increase the size of our cluster without taking it down. Some workloads can result in temporary growth in data size, in which case we’ll also need a process for shrinking the cluster without an interruption in service. + +

In this section, we’ll see how we can use CouchDB’s replication filters to split one database into several partitions, and how to use that technique to grow the cluster without downtime. There are simple steps you can take to avoid partitioning databases while growing the cluster. + +

Oversharding is a technique where you partition the cluster so that there are multiple shards on each physical machine. Moving a partition from one machine to another is simpler than splitting it into smaller partitions, as the configuration map of the cluster used by the proxy only needs to change to point to shards at their new homes, rather than adding new logical shards. It’s also less resource-intensive to move a partition than to split it into many. + +

One question we need to answer is, “How much should we overshard?” The answer depends on your application and deployment, but there are some forces that push us in one direction over another. If we get the number of shards right, we’ll end up with a cluster that can grow optimally. + +

In the section called “View Merging”, we discussed how merges can be accomplished in constant space, no matter the number of rows returned. The memory space and network resources required to merge views, as well as to map from document IDs to partitions, does, however, grow linearly with the number of partitions under a given proxy. For this reason, we’ll want to limit the number of partitions for each proxy. However, we can’t accept an upper limit on cluster size. The solution is to use a tree of proxies, where the root proxy partitions to some number of intermediate proxies, which then proxy to database nodes. + +

The factors that come into play when deciding how many partitions each proxy should manage are: the storage available to each individual server node, the projected growth rate of the data, the network and memory resources available to proxies, and the acceptable latency for requests against the cluster. + +

Assuming a conservative 64 shards per proxy, and 1 TB of data storage per node (including room for compaction, these nodes will need roughly 2 TB of drive space), we can see that with a single proxy in front of CouchDB data nodes, we’ll be able to store at maximum 64 TB of data (on 128 or perhaps 192 server nodes, depending on the level of redundancy required by the system) before we have to increase the number of partitions. + +

By replacing database nodes with another proxy, and repartitioning each of the 64 partitions into another 64 partitions, we end up with 4,096 partitions and a tree depth of 2. Just as the initial system can hold 64 partitions on just a few nodes, we can transition to the 2-layer tree without needing thousands of machines. If we assume each proxy must be run on its own node, and that at first database nodes can hold 16 partitions, we’ll see that we need 65 proxies and 256 database machines (not including redundancy factors, which should typically multiply the cluster size by two or three times). To get started with a cluster that can grow smoothly from 64 TB to 4 PB, we can begin with roughly 600 to 1,000 server nodes, adding new ones as data size grows and we move partitions to other machines. + +

We’ve seen that even a cluster with a depth of 2 can hold a vast amount of data. Basic arithmetic shows us that by applying the same process to create a cluster with three layers of proxies, we can manage 262 petabytes on thousands of machines. Conservative estimates for the latency introduced by each layer is about 100 ms, so even without performance tuning we should see overall response times of 300 ms even with a tree depth of 3, and we should be able to manage queries over exabyte datasets in less than a second. + +

By using oversharding and iteratively replacing full shards (database nodes that host only one partition) with proxy nodes that point to another set of oversharded partitions, we can grow the cluster to very large sizes while incurring a minimum of latency. + +

Now we need to look at the mechanics of the two processes that allow the cluster to grow: moving a partition from an overcrowded node to an empty node, and splitting a large partition into many subpartitions. Moving partitions is simpler, which is why it makes sense to use it when possible, running the more resource-intensive repartition process only when partitions get large enough that only one or two can fit on each database server. + +

Moving Partitions

+ +

As we mentioned earlier, each partition is made up of N redundant CouchDB databases, each stored on different physical servers. To keep things easy to conceptualize, any operations should be applied to all redundant copies automatically. For the sake of discussion, we’ll just talk about the abstract partition, but be aware that the redundant nodes will all be the same size and so should require the same operations during cluster growth. + +

The simplest way to move a partition from one node to another is to create an empty database on the target node and use CouchDB replication to fill the new node with data from the old node. When the new copy of the partition is up-to-date with the original, the proxy node can be reconfigured to point to the new machine. Once the proxy points to the new partition location, one final round of replication will bring it up-to-date, and the old partition can be retired, freeing space on the original machine. + +

Another method for moving partition databases is to rsync the files on disk from the old node to the new one. Depending on how recently the partition was compacted, this should result in efficient, low-CPU initialization of a new node. Replication can then be used to bring the rsynced file up-to-date. See more about rsync and replication in Chapter 16, Replication. + +

Splitting Partitions

+ +

The last major thing we need to run a CouchDB cluster is the capability to split an oversized partition into smaller pieces. In Chapter 16, Replication, we discussed how to do continuous replication using the _changes API. The _changes API can use filters (see Chapter 20, Change Notifications), and replication can be configured to use a filter function to replicate only a subset of a total database. Splitting partitions is accomplished by creating the target partitions and configuring them with the range of hash keys they are interested in. They then apply filtered replication to the source partition database, requesting only documents that meet their hash criteria. The result is multiple partial copies of the source database, so that each new partition has an equal share of the data. In total, they have a complete copy of the original data. Once the replication is complete and the new partitions have also brought their redundant backups up-to-date, a proxy for the new set of partitions is brought online and the top-level proxy is pointed at it instead of the old partition. Just like with moving a partition, we should do one final round of replication after the old partition is no longer reachable by the cluster, so that any last second updates are not lost. Once that is done, we can retire the old partition so that its hardware can be reused elsewhere in the cluster. diff --git a/editions/1/zh/colophon.html b/editions/1/zh/colophon.html new file mode 100644 index 0000000..0d48280 --- /dev/null +++ b/editions/1/zh/colophon.html @@ -0,0 +1,19 @@ +Colophon + + + + + + + + + +

Colophon

+ +

The animal on the cover of CouchDB: The Definitive Guide is a Pomeranian dog (Canis familiaris), a small variety of the generally larger German Spitz breed, named for the Baltic region of Pomerania (today split between northeastern Germany and northern Poland) where it was first bred. + +

Originally, Pomeranians were closer in size to their German Spitz relatives—weighing 30–50 pounds—and were bred as herding dogs because of their intelligence, energy, and loyalty. From the late 19th century, however, breeders began to favor increasingly smaller dogs, a move caused in large part by Queen Victoria’s affinity for that variety. Today, Pomeranians are classed as “toy dogs,” weighing only 4–7 pounds, and are particularly kept as small pets and show dogs. + +

The Pomeranian exhibits many of the physical and behavioral characteristics of its larger ancestors and relatives. It has a short, pointed muzzle, upright and pointed ears, a large bushy tail carried curled over the back, and is especially spirited and friendly. Pomeranians are also particularly noted for their double coat—a soft and dense undercoat and a long, straight and harshly textured outer coat—and come in a wide variety of colors, including white, black, brown, red, orange, sable, spotted, or any combination thereof. Because of their small size, Pomeranians are able to exercise sufficiently in small indoor spaces if taken for a daily walk, and consequently make excellent apartment pets. + +

The cover image is from Lydekker’s Royal Natural History. The cover font is Adobe ITC Garamond. The text font is Linotype Birka; the heading font is Adobe Myriad Condensed; and the code font is LucasFont’s TheSansMonoCondensed. diff --git a/editions/1/zh/conflicts.html b/editions/1/zh/conflicts.html new file mode 100644 index 0000000..f3f76a7 --- /dev/null +++ b/editions/1/zh/conflicts.html @@ -0,0 +1,266 @@ +Conflict Management + + + + + + + + + + + +

Conflict Management

+ +

Suppose you are sitting in a coffee shop working on your book. J. Chris comes over and tells you about his new phone. The new phone came with a new number, and you have J. Chris dictate it while you change it using your laptop’s address book application. + +

Luckily, your address book is built on CouchDB, so when you come home, all you need to do to get your home computer up-to-date with J. Chris’s number is replicate your address book from your laptop. Neat, eh? What’s more, CouchDB has a mechanism to maintain continuous replication, so you can keep a whole set of computers in sync with the same data, whenever a network connection is available. + +

Let’s change the scenario a little bit. Since J. Chris didn’t anticipate meeting you at the coffee shop, he also sent you an email with the new number. At the time you weren’t using WiFi because you wanted concentrate on your work, so you didn’t read his email until you got home. But it was a long day and by then you had forgotten that you changed the number in the address book on your laptop. When you read the email at home, you simply copy-and-pasted the number into the address book on your home computer. Now—and here’s the twist—it turns out you entered the wrong number in your laptop’s address book. + +

You now have a document in each of the databases that has different information. This situation is called a conflict. Conflicts occur in distributed systems. They are a natural state of your data. How does CouchDB’s replication system deal with conflicts? + +

When you replicate two databases in CouchDB and you have conflicting changes, CouchDB will detect this and will flag the affected document with the special attribute "_conflicts":true. Next, CouchDB determines which of the changes will be stored as the latest revision (remember, documents in CouchDB are versioned). The version that gets picked to be the latest revision is the winning revision. The losing revision gets stored as the previous revision. + +

CouchDB does not attempt to merge the conflicting revision. Your application dictates how the merging should be done. The choice of picking the winning revision is arbitrary. In the case of the phone number, there is no way for a computer to decide on the right revision. This is not specific to CouchDB; no other software can do this (ever had your phone’s sync-contacts tool ask you which contact from which source to take?). + +

Replication guarantees that conflicts are detected and that each instance of CouchDB makes the same choice regarding winners and losers, independent of all the other instances. There is no group decision made; instead, a deterministic algorithm determines the order of the conflicting revision. After replication, all instances taking part have the same data. The data set is said to be in a consistent state. If you ask any instance for a document, you will get the same answer regardless which one you ask. + +

Whether or not CouchDB picked the version that your application needs, you need to go and resolve the conflict, just as you need to resolve a conflict in a version control system like Subversion. Simply create a version that you want to be the latest by either picking the latest, or the previous, or both (by merging them) and save it as the now latest revision. Done. Replicate again and your resolution will populate over to all other instances of CouchDB. Your conflict resolving on one node could lead to further conflicts, all of which will need to be addressed, but eventually, you will end up with a conflict-free database on all nodes. + +

The Split Brain

+ +

This is an interesting conflict scenario in that we helped the BBC build a solution for it that is now in production. The basic setup is this: to guarantee that the company’s website is online 24/7, even in the event of the loss of a data center, it has multiple data centers backing up the website. The “loss” of a data center is a rare occasion, but it can be as simple as a network outage, where the data center is still alive and well but can’t be reached by anyone. + +

The “split brain” scenario is where two (for simplicity’s sake we’ll stick to two) data centers are up and well connected to end users, but the connection between the data centers—which is most likely not the same connection that end users use to talk to the computers in the data center—fails. + +

The inter data center connection is used to keep both centers in sync so that either one can take over for the other in case of a failure. If that link goes down, you end up with two halves of a system that act independently—the split brain. + +

As long as all end users can get to their data, the split brain is not scary. Resolving the split brain situation by bringing up the connection that links the data centers and starting synchronization again is where it gets hairy. Arbitrary conflict resolution, like CouchDB does by default, can lead to unwanted effects on the user’s side. Data could revert to an earlier stage and leave the impression that changes weren’t reliably saved, when in fact they were. + +

Conflict Resolution by Example

+ +

Let’s go through an illustrated example of how conflicts emerge and how to solve them in super slow motion. Figure 1, “Conflict management by example: step 1” illustrates the basic setup: we have two CouchDB databases, and we are replicating from database A to database B. To keep this simple, we assume triggered replication and not continuous replication, and we don’t replicate back from database B to A. All other replication scenarios can be reduced to this setup, so this explains everything we need to know. + +

+ + + +

Figure 1. Conflict management by example: step 1 + +

+ +

We start out by creating a document in database A (Figure 2, “Conflict management by example: step 2”). Note the clever use of imagery to identify a specific revision of a document. Since we are not using continuous replication, database B won’t know about the new document for now. + +

+ + + +

Figure 2. Conflict management by example: step 2 + +

+ +

We now trigger replication and tell it to use database A as the source and database B as the target (Figure 3, “Conflict management by example: step 3”). Our document gets copied over to database B. To be precise, the latest revision of our document gets copied over. + +

+ + + +

Figure 3. Conflict management by example: step 3 + +

+ +

Now we go to database B and update the document (Figure 4, “Conflict management by example: step 4”). We change some values and upon change, CouchDB generates a new revision for us. Note that this revision has a new image. Node A is ignorant of any activity. + +

+ + + +

Figure 4. Conflict management by example: step 4 + +

+ +

Now we make a change to our document in database A by changing some other values (Figure 5, “Conflict management by example: step 5”). See how it makes a different image for us to see the difference? It is important to note that this is still the same document. It’s just that there are two different revisions of that same document in each database. + +

+ + + +

Figure 5. Conflict management by example: step 5 + +

+ +

Now we trigger replication again from database A to database B as before (Figure 6, “Conflict management by example: step 6”). By the way, it doesn’t make a difference if the two databases live in the same CouchDB server or on different servers connected over a network. + +

+ + + +

Figure 6. Conflict management by example: step 6 + +

+ +

When replicating, CouchDB detects that there are two different revisions for the same document, and it creates a conflict (Figure 7, “Conflict management by example: step 7”). A document conflict means that there are now two latest revisions for this document. + +

+ + + +

Figure 7. Conflict management by example: step 7 + +

+ +

Finally, we tell CouchDB which version we would like to be the latest revision by resolving the conflict (Figure 8, “Conflict management by example: step 8”). Now both databases have the same data. + +

+ + + +

Figure 8. Conflict management by example: step 8 + +

+ +

Other possible outcomes include choosing the other revision and replicating that decision back to database A, or creating yet another revision in database B that includes parts of both conflicting revisions (a merge) and replicating that back to database A. + +

Working with Conflicts

+ +

Now that we’ve walked through replication with pretty pictures, let’s get our hands dirty and see what the API calls and responses for this and other scenarios look like. We’ll be continuing Chapter 4, The Core API by using curl on the command line to make raw API requests. + +

First, we create two databases that we can use for replication. These live on the same CouchDB instance, but they might as well live on a remote instance—CouchDB doesn’t care. To save us some typing, we create a shell variable for our CouchDB base URL that we want to talk to. We then create two databases, db and db-replica: + +

+HOST="http://127.0.0.1:5984"
+
+> curl -X PUT $HOST/db
+{"ok":true}
+
+> curl -X PUT $HOST/db-replica
+{"ok":true}
+
+ +

In the next step, we create a simple document {"count":1} in db and trigger replication to db-replica: + +

+> curl -X PUT $HOST/db/foo -d '{"count":1}'
+{"ok":true,"id":"foo","rev":"1-74620ecf527d29daaab9c2b465fbce66"}
+
+> curl -X POST $HOST/_replicate -d '{"source":"db","target":"http://127.0.0.1:5984/db-replica"}'
+{"ok":true,...,"docs_written":1,"doc_write_failures":0}]}
+
+ +

We skip a bit of the output of the replication session (see Chapter 16, Replication for details). If you see "docs_written":1 and "doc_write_failures":0, our document made it over to db-replica. We now update the document to {"count":2} in db-replica. Note that we now need to include the correct _rev property. + +

+> curl -X PUT $HOST/db-replica/foo -d '{"count":2,"_rev":"1-74620ecf527d29daaab9c2b465fbce66"}'
+{"ok":true,"id":"foo","rev":"2-de0ea16f8621cbac506d23a0fbbde08a"}
+
+ +

Next, we create the conflict! We change our document on db to {"count":3}. Our document is now logically in conflict, but CouchDB doesn’t know about it until we replicate again: + +

+> curl -X PUT $HOST/db/foo -d '{"count":3,"_rev":"1-74620ecf527d29daaab9c2b465fbce66"}'
+{"ok":true,"id":"foo","rev":"2-7c971bb974251ae8541b8fe045964219"}
+
+> curl -X POST $HOST/_replicate -d '{"source":"db","target":"http://127.0.0.1:5984/db-replica"}'
+{"ok":true,..."docs_written":1,"doc_write_failures":0}]}
+
+ +

To see that we have a conflict, we create a simple view in db-replica. The map function looks like this: + +

+function(doc) {
+  if(doc._conflicts) {
+    emit(doc._conflicts, null);
+  }
+}
+
+ +
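Design documents and views are covered in detail elsewhere in the book; as a quick sketch (the design document and view names used here are arbitrary), you could store and query this map function like so:

+> curl -X PUT $HOST/db-replica/_design/conflicts -d '{"views":{"all":{"map":"function(doc) { if(doc._conflicts) { emit(doc._conflicts, null); } }"}}}'
+
+> curl $HOST/db-replica/_design/conflicts/_view/all
+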

When we query this view, we get this result: + +

+{"total_rows":1,"offset":0,"rows":[
+{"id":"foo","key":["2-7c971bb974251ae8541b8fe045964219"],"value":null}
+]}
+
+ +

The key here corresponds to the doc._conflicts property of our document in db-replica. It is an array listing all conflicting revisions. We see that the revision we wrote on db ({"count":3}) is in conflict. CouchDB’s automatic promotion of one revision to be the winning revision chose our first change ({"count":2}). To verify that, we just request that document from db-replica: + +

+> curl -X GET $HOST/db-replica/foo
+{"_id":"foo","_rev":"2-de0ea16f8621cbac506d23a0fbbde08a","count":2}
+
+ +

To resolve the conflict, we need to determine which one we want to keep. + +

+ +

How Does CouchDB Decide Which Revision to Use? + +

CouchDB guarantees that each instance that sees the same conflict comes up with the same winning and losing revisions. It does so by running a deterministic algorithm to pick the winner. The application should not rely on the details of this algorithm and must always resolve conflicts. We’ll tell you how it works anyway. + +

Each revision includes a list of previous revisions. The revision with the longest revision history list becomes the winning revision. If they are the same, the _rev values are compared in ASCII sort order, and the highest wins. So, in our example, 2-de0ea16f8621cbac506d23a0fbbde08a beats 2-7c971bb974251ae8541b8fe045964219. + +

One advantage of this algorithm is that CouchDB nodes do not have to talk to each other to agree on winning revisions. We already learned that the network is prone to errors and avoiding it for conflict resolution makes CouchDB very robust. + +

+ +

Let’s say we want to keep the highest value. This means we don’t agree with CouchDB’s automatic choice. To do this, we first overwrite the target document with our value and then simply delete the revision we don’t like: + +

+> curl -X DELETE $HOST/db-replica/foo?rev=2-de0ea16f8621cbac506d23a0fbbde08a
+{"ok":true,"id":"foo","rev":"3-bfe83a296b0445c4d526ef35ef62ac14"}
+
+> curl -X PUT $HOST/db-replica/foo -d '{"count":3,"_rev":"2-7c971bb974251ae8541b8fe045964219"}'
+{"ok":true,"id":"foo","rev":"3-5d0319b075a21b095719bc561def7122"}
+
+ +

CouchDB creates yet another revision that reflects our decision. Note that the 3- didn’t get incremented this time. We didn’t create a new version of the document body; we just deleted a conflicting revision. To see that all is well, we check whether our revision ended up in the document. + +

+> curl -X GET $HOST/db-replica/foo
+{"_id":"foo","_rev":"3-5d0319b075a21b095719bc561def7122","count":3}
+
+ +

We also verify that our document is no longer in conflict by querying our conflicts view again, and we see that there are no more conflicts: + +

+{"total_rows":0,"offset":0,"rows":[
+]}
+
+ +

Finally, we replicate from db-replica back to db by simply swapping source and target in our request to _replicate: + +

+> curl -X POST $HOST/_replicate -d '{"target":"db","source":"http://127.0.0.1:5984/db-replica"}'
+
+ +

We see that our revision ends up in db, too: + +

+> curl -X GET $HOST/db/foo
+{"_id":"foo","_rev":"3-5d0319b075a21b095719bc561def7122","count":3}
+
+ +

And we’re done. + +

Deterministic Revision IDs

+ +

Let’s have a look at this revision ID: 3-5d0319b075a21b095719bc561def7122. Parts of the format might look familiar. The first part is an integer followed by a dash (3-). The integer increments for each new revision the document receives. Updates to the same document on multiple instances create their own independent increments. When replicating, CouchDB knows that there are two different revisions (like in our previous example) by looking at the second part. + +

The second part is an MD5 hash over a set of document properties: the JSON body, the attachments, and the _deleted flag. This allows CouchDB to save on replication time in case you make the same change to the same document on two instances. Earlier versions (0.9 and back) used random integers to specify revisions, and making the same change on two instances would result in two different revision IDs, creating a conflict where it was not really necessary. CouchDB 0.10 and above use deterministic revision IDs based on this MD5 hash. + +
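
As a quick illustration (plain string handling, not an official API), a revision ID can be pulled apart like this: + +

+// Split a revision ID into its generation counter and content digest.
+function parseRev(rev) {
+  var dash = rev.indexOf("-");
+  return {
+    generation: parseInt(rev.slice(0, dash), 10), // increments on every update
+    digest: rev.slice(dash + 1)                   // hash over body, attachments, _deleted
+  };
+}
+
+parseRev("3-5d0319b075a21b095719bc561def7122");
+// => {generation: 3, digest: "5d0319b075a21b095719bc561def7122"}
+
+ +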

For example, let’s create two documents, a and b, with the same contents: + +

+> curl -X PUT $HOST/db/a -d '{"a":1}'
+{"ok":true,"id":"a","rev":"1-23202479633c2b380f79507a776743d5"}
+
+> curl -X PUT $HOST/db/b -d '{"a":1}'
+{"ok":true,"id":"b","rev":"1-23202479633c2b380f79507a776743d5"}
+
+ +

Both revision IDs are the same, a consequence of the deterministic algorithm used by CouchDB. + +

Wrapping Up

+ +

This concludes our tour of the conflict management system. You should now be able to create distributed setups that deal with conflicts in a proper way. diff --git a/editions/1/zh/conflicts/01.png b/editions/1/zh/conflicts/01.png new file mode 100644 index 0000000..5e5fb60 Binary files /dev/null and b/editions/1/zh/conflicts/01.png differ diff --git a/editions/1/zh/conflicts/02.png b/editions/1/zh/conflicts/02.png new file mode 100644 index 0000000..7215230 Binary files /dev/null and b/editions/1/zh/conflicts/02.png differ diff --git a/editions/1/zh/conflicts/03.png b/editions/1/zh/conflicts/03.png new file mode 100644 index 0000000..66d7864 Binary files /dev/null and b/editions/1/zh/conflicts/03.png differ diff --git a/editions/1/zh/conflicts/04.png b/editions/1/zh/conflicts/04.png new file mode 100644 index 0000000..842827d Binary files /dev/null and b/editions/1/zh/conflicts/04.png differ diff --git a/editions/1/zh/conflicts/05.png b/editions/1/zh/conflicts/05.png new file mode 100644 index 0000000..a3c27d1 Binary files /dev/null and b/editions/1/zh/conflicts/05.png differ diff --git a/editions/1/zh/conflicts/06.png b/editions/1/zh/conflicts/06.png new file mode 100644 index 0000000..a3c27d1 Binary files /dev/null and b/editions/1/zh/conflicts/06.png differ diff --git a/editions/1/zh/conflicts/07.png b/editions/1/zh/conflicts/07.png new file mode 100644 index 0000000..3acd6d3 Binary files /dev/null and b/editions/1/zh/conflicts/07.png differ diff --git a/editions/1/zh/conflicts/08.png b/editions/1/zh/conflicts/08.png new file mode 100644 index 0000000..5e5fb60 Binary files /dev/null and b/editions/1/zh/conflicts/08.png differ diff --git a/editions/1/zh/consistency.html b/editions/1/zh/consistency.html new file mode 100644 index 0000000..2c67a32 --- /dev/null +++ b/editions/1/zh/consistency.html @@ -0,0 +1,213 @@ +最终一致性 + + + + + + + + + + + +

Eventual Consistency

+ +

In the previous chapter, we saw that CouchDB's flexibility allows us to evolve our data as our applications grow and change. In this chapter, we'll explore how working with the grain of CouchDB promotes simplicity in our applications and helps us naturally build scalable, distributed systems. + +

Working with the Grain

+ +

A distributed system is a system that operates robustly over a wide network. A particular feature of network computing is that network connections can potentially disappear, and there are various strategies for handling this sort of network failure. CouchDB differs from others by accepting eventual consistency, as opposed to putting absolute consistency ahead of raw availability, as relational database systems and the Paxos algorithm do. What these systems have in common is an awareness that data behaves inconsistently when many people access it simultaneously. Their approaches differ in the priorities they place on consistency, availability, and partition tolerance. + +

Developing distributed systems is tricky. Many of the caveats and gotchas you will face over time aren't immediately obvious. We don't have all the solutions, and CouchDB isn't a panacea, but when you work with CouchDB's grain rather than against it, the path of least resistance leads you to naturally scalable applications. + +

Of course, building a distributed system is only the beginning. A website with a database that is available only half the time is next to worthless. Unfortunately, the traditional relational database approach to consistency makes it very easy for programmers to rely on global state, global clocks, and other high-availability no-nos, without even realizing that they're doing so. Before we look at how CouchDB promotes scalability, we'll examine the constraints faced by a distributed system. After we've seen the problems that arise when an application tries to keep its nodes in constant real-time contact, we'll see that CouchDB provides an intuitive and cost-effective way to build highly available applications. + +

The CAP Theorem

+ +

The CAP theorem describes a few different strategies for distributing application logic across networks. CouchDB's solution uses replication to propagate application changes across participating nodes. This is a fundamentally different approach from consensus algorithms and relational databases, which operate at different intersections of consistency, availability, and partition tolerance. + +

The CAP theorem, shown in Figure 1, "The CAP theorem", identifies three distinct concerns: + +

+ +
Consistency
+ +
All database clients see the same data, even with concurrent updates.
+ +
Availability
+ +
All database clients are able to access some version of the data.
+ +
Partition tolerance
+ +
The database can be split over multiple servers.
+ +
+ +

Pick two. + +

+ + + +

Figure 1. The CAP theorem + +

+ +

When a system grows large enough that a single database node can no longer handle the load placed on it, the obvious solution is to add more servers. As we add nodes, we have to start thinking about how to partition data between them. Do we put identical copies of the data on each server? Do we put different slices of the data on different servers? Do we let only certain servers accept writes while others handle the reads? + +

Regardless of which approach we choose, the one problem we keep running into is keeping all these database servers in sync. If you write some data to one node, how do you make sure that a read request to another server reflects this latest information? Those events may be only milliseconds apart. Even with a modest collection of database servers, this problem can become extremely complex. + +

When it's absolutely critical that all clients see a consistent view of the database, the users of one node have to wait for the other nodes to come into agreement before they can read from or write to the database. In this instance, we see that availability takes a backseat to consistency. However, there are also situations where availability trumps consistency. + +

+ +

Each node in a system should be able to make decisions purely based on local state. If you want to run under high load with failures occurring and you still need to reach agreement, you're lost. If you're concerned about scalability, any algorithm that forces you to run agreement will eventually become your bottleneck. Take that as a given. + +

—Werner Vogels, Amazon CTO and Vice President + +

+ +

If availability is a priority, we can let clients write data to one node of the database without waiting for other nodes to come into agreement. If the database knows how to reconcile these operations between nodes, we achieve a sort of "eventual consistency" in exchange for high availability. For many applications, this is a perfectly acceptable trade-off. + +

Unlike a relational database, where every operation triggers database-wide consistency checks, CouchDB's eventual consistency makes it much simpler to build applications that trade real-time consistency for huge gains in performance. + +

Local Consistency

+ +

Before we attempt to understand how CouchDB operates in a cluster, it's important that we understand the inner workings of a single CouchDB node. The CouchDB API is designed to provide a convenient but thin wrapper around the database core. By taking a closer look at the structure of that core, we'll get a better understanding of the API that surrounds it. + +

The Key to Your Data

+ +

At the heart of CouchDB is a powerful B-tree storage engine. A B-tree is a sorted data structure that allows for searches, insertions, and deletions in logarithmic time. As Figure 2, "Anatomy of a view request" illustrates, CouchDB uses this B-tree storage engine for all of its internal data, documents, and views. If we understand one, we will understand them all. + +

+ + + +

Figure 2. Anatomy of a view request + +

+ +

CouchDB uses MapReduce to compute the results of a view. MapReduce makes use of two functions, "map" and "reduce," which are applied to each document in isolation. Being able to operate on each document in isolation means that view computation lends itself to parallel and incremental computation. More important, because these functions produce key/value pairs, CouchDB is able to insert them into the B-tree storage engine, sorted by key. Lookups by key, or by key range, are extremely efficient operations on a B-tree, described in big O notation as O(log N) and O(log N + K), respectively. + +

In CouchDB, we access documents and view results by key or by key range. This is a direct mapping onto the underlying operations of the B-tree engine. Along with document inserts and updates, this direct mapping is the reason we describe CouchDB's API as a thin wrapper around the database core. + +

Being able to access results only by key is a very important restriction, because it is what allows for the huge performance gains. As well as the massive speed improvement, we can partition our data over multiple nodes without affecting each node's ability to look up data independently. BigTable, Hadoop, SimpleDB, and memcached restrict object lookups to keys for exactly these reasons. + +
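
As a concrete illustration (the database, design document, and view names here are invented for the example), a lookup by key or by key range is a plain HTTP GET with the standard key, startkey, and endkey parameters: + +

+curl 'http://127.0.0.1:5984/albums/_design/app/_view/by-year?key=1994'
+curl 'http://127.0.0.1:5984/albums/_design/app/_view/by-year?startkey=1990&endkey=1999'
+
+ +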

No Locking

+ +

A table in a relational database is a single data structure. If you want to modify a table, say, update a row, the database system must ensure that nobody else is trying to update that row and that nobody can read from that row while it is being updated. The common way to handle this is called a lock. If multiple clients want to access a table, the first client gets the lock, making everybody else wait. When the first client's request is processed, the next client is given access while everybody else waits, and so on. This serial execution of requests, even when they arrive in parallel, wastes a significant amount of your server's processing power. Under high load, a relational database can spend more time figuring out who is allowed to do what, and in which order, than it spends doing any actual work. + +

Instead of locks, CouchDB uses Multi-Version Concurrency Control (MVCC) to manage concurrent access to the database. Figure 3, "MVCC means no locking" illustrates the differences between MVCC and traditional locking mechanisms. MVCC means that CouchDB can run at full speed, all the time, even under high load. Requests are run in parallel, making excellent use of every last drop of processing power your server has to offer. + +

+ +MVCC means no locking + +

Figure 3. MVCC means no locking + +

+ +

Documents in CouchDB are versioned, much like they would be in a regular version control system such as Subversion. If you want to change a value in a document, you create an entire new version of that document and save it over the old one. After doing this, you end up with two versions of the same document, one old and one new. + +

How does this offer an improvement over locks? Consider a set of requests wanting to access the same document. The first request reads the document. While this is being processed, a second request changes the document. Since the second request includes a completely new version of the document, CouchDB can simply append it to the database without having to wait for the read request to finish. + +

When a third request wants to read the same document, CouchDB will point it to the new version that has just been written. During this whole process, the first request could still be reading the original version. + +

A read request will always see the most recent snapshot of your database. + +

Validation

+ +

As application developers, we have to think about what sort of input we should accept and what we should reject. We'd also like the expressive power to perform this kind of validation over complex data. Fortunately, CouchDB provides a powerful way to perform per-document validation from within the database. + +

CouchDB can validate documents using JavaScript functions similar to those used for MapReduce. Each time you try to modify a document, CouchDB will pass the validation function a copy of the existing document, a copy of the new document, and a collection of additional information, such as user authentication details. The validation function then has the opportunity to approve or deny the update. + +
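
As a minimal sketch (the field names here are invented, but the function(newDoc, oldDoc, userCtx) signature and the thrown forbidden/unauthorized objects are how CouchDB validation functions signal their decision), such a function might look like this: + +

+function(newDoc, oldDoc, userCtx) {
+  // Reject documents that lack a required field.
+  if (!newDoc.title) {
+    throw({forbidden: "Documents must have a title."});
+  }
+  // Only the original author or an admin may change an existing document.
+  if (oldDoc && oldDoc.author !== userCtx.name && userCtx.roles.indexOf("_admin") === -1) {
+    throw({unauthorized: "Only the author may update this document."});
+  }
+}
+
+ +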

By working with the grain and letting CouchDB do this for us, we save ourselves a tremendous amount of CPU cycles that would otherwise have been spent serializing object graphs from SQL, converting them into domain objects, and using those objects to do application-level validation. + +

Distributed Consistency

+ +

Maintaining consistency within a single database node is relatively easy for most databases. The real problems start to surface when you try to maintain consistency between multiple database servers. If a client makes a write operation on server A, how do we make sure that this is consistent with servers B, C, or D? For relational databases, this is a very complex problem with entire books devoted to its solution. You could use multi-master, master/slave, partitioning, sharding, write-through caches, and all sorts of other complex techniques. + +

Incremental Replication

+ +

Because CouchDB operations take place within the context of a single document, if you want to use two database nodes, you no longer have to worry about them staying in constant communication. CouchDB achieves eventual consistency between databases by using incremental replication, a process where document changes are periodically copied between servers. We are able to build a so-called shared nothing cluster of databases, where each node is independent and self-sufficient, leaving no single point of contention across the system. + +

Need to scale out your CouchDB database cluster? Just throw in another server. + +

As illustrated in Figure 4, "Incremental replication between CouchDB nodes", with CouchDB's incremental replication you can synchronize your data between any two databases however you like and whenever you like. After replication, each database is able to work independently. + +

You could use this feature to synchronize database servers within a cluster or between data centers using a job scheduler such as cron, or you could use it to synchronize data with your laptop for offline work as you travel. Each database can be used normally, and changes between databases can be synchronized later in both directions. + +

+ + + +

Figure 4. Incremental replication between CouchDB nodes + +

+ +

What happens when you change the same document in two different databases and want to synchronize these with each other? CouchDB's replication system comes with automatic conflict detection and resolution. When CouchDB detects that a document has been changed in both databases, it flags this document as being in conflict, much like it would be done in a regular version control system. + +

This isn't as troublesome as it might first sound. When two versions of a document conflict during replication, the winning version is saved as the most recent version in the document's history. Instead of throwing the losing version away, as you might expect, CouchDB saves this as a previous version in the document's history, so that you can access it if you need to. This happens automatically and consistently, so both databases will make exactly the same choice. + +

It is up to you to handle conflicts in a way that makes sense for your application. You can leave the chosen document versions in place, revert to the older version, or try to merge the two versions and save the result. + +

Case Study

+ +

Greg Borenstein, a friend and coworker of ours, wrote a small library for converting Songbird playlists to JSON objects and decided to store these in CouchDB as part of a backup application. The completed software uses CouchDB's MVCC and document revisions to ensure that Songbird playlists are backed up robustly between nodes. + +

+ +

Songbird is a free software media player with an integrated web browser, based on the Mozilla XULRunner platform. Songbird is available for Microsoft Windows, Apple Mac OS X, Solaris, and Linux. + +

+ +

Let's examine the workflow of the Songbird backup application, first as a user backing up from a single computer, and then using Songbird to synchronize playlists between multiple computers. We'll see how document revisions turn what could have been a hairy problem into something that just works. + +

The first time we use the backup application, we feed it our playlists and initiate a backup. Each playlist is converted to a JSON object and handed to a CouchDB database. As illustrated in Figure 5, "Backing up to a single database", CouchDB hands back the document ID and revision of each playlist document as it is saved to the database. + +

+ + + +

Figure 5. Backing up to a single database + +

+ +

After a few days, we find that our playlists have been updated and we want to back up the changes. After we feed our playlists to the backup application, it fetches the latest versions from CouchDB, along with the corresponding document revisions. When the application submits the new playlist documents, the current document revision is included with each request. + +

CouchDB then makes sure that the revision submitted in the request matches the revision stored in the database. Because CouchDB updates the revision with every modification, if these two are out of sync it suggests that someone else has changed the document between the time we requested it and the time we sent our update. Making changes to a document after someone else has modified it, without first inspecting those changes, is usually a bad idea. + +

Forcing clients to hand back the correct document revision is the heart of CouchDB's optimistic concurrency. + +
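
In HTTP terms, an update that carries a stale revision is rejected with a conflict error. The document name and revision below are invented for illustration, but the response shows the kind of error a mismatched _rev produces: + +

+> curl -X PUT http://127.0.0.1:5984/db/playlist -d '{"name":"Tango","_rev":"1-0ba786f2d777cd9c9226a2ecee81e8a8"}'
+{"error":"conflict","reason":"Document update conflict."}
+
+ +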

We have a laptop we want to keep synchronized with our desktop computer. With all our playlists on our desktop, the first step is to "restore from backup" onto our laptop. This is the first time we've done this, so afterward our laptop should hold an exact replica of our desktop playlist collection. + +

After editing the Argentine Tango playlist on our laptop to add a few new songs we've purchased, we want to save our changes. The backup application replaces the playlist document in the laptop's CouchDB database, and a new document revision is generated. A few days later, we remember the new songs and want to copy the playlist over to our desktop computer. As illustrated in Figure 6, "Synchronizing between two databases", the backup application copies the new document and the new revision to the desktop's CouchDB database. Both CouchDB databases now have the same document revision. + +

+ + + +

Figure 6. Synchronizing between two databases + +

+ +

Because CouchDB tracks document revisions, it ensures that updates like these will work only if they are based on current information. If we had made changes to the playlist documents between synchronizations, things wouldn't go as smoothly. + +

We back up some changes on our laptop and forget to synchronize. A few days later, we edit our playlists on our desktop computer, make a backup, and want to push the changes to our laptop. As illustrated in Figure 7, "Synchronization conflicts between two databases", when our backup application tries to replicate between the two databases, CouchDB sees that the document coming from the desktop has been changed and responds that there is a conflict. + +

Recovering from this error is easy to accomplish from an application perspective. Just download CouchDB's version of the playlist and provide an opportunity to merge the changes, or save the local modifications into a new playlist. + +

+ + + +

Figure 7. Synchronization conflicts between two databases + +

+ +

Wrapping Up

+ +

CouchDB's design borrows heavily from web architecture and the lessons learned deploying massively distributed systems on that architecture. By understanding why this architecture works the way it does, and by learning to spot which parts of your application can be easily distributed and which parts cannot, you'll enhance your ability to design distributed and scalable applications, with CouchDB or without it. + +

We've covered the main issues surrounding CouchDB's consistency model and hinted at the benefits you gain by working with CouchDB rather than against it. But enough theory, let's get up and see what all of this is really about. diff --git a/editions/1/zh/consistency/01.png b/editions/1/zh/consistency/01.png new file mode 100644 index 0000000..d4c2e82 Binary files /dev/null and b/editions/1/zh/consistency/01.png differ diff --git a/editions/1/zh/consistency/02.png b/editions/1/zh/consistency/02.png new file mode 100644 index 0000000..06c23ea Binary files /dev/null and b/editions/1/zh/consistency/02.png differ diff --git a/editions/1/zh/consistency/03.png b/editions/1/zh/consistency/03.png new file mode 100644 index 0000000..2164c6c Binary files /dev/null and b/editions/1/zh/consistency/03.png differ diff --git a/editions/1/zh/consistency/04.png b/editions/1/zh/consistency/04.png new file mode 100644 index 0000000..068fa77 Binary files /dev/null and b/editions/1/zh/consistency/04.png differ diff --git a/editions/1/zh/consistency/05.png b/editions/1/zh/consistency/05.png new file mode 100644 index 0000000..a94f9c3 Binary files /dev/null and b/editions/1/zh/consistency/05.png differ diff --git a/editions/1/zh/consistency/06.png b/editions/1/zh/consistency/06.png new file mode 100644 index 0000000..af316d4 Binary files /dev/null and b/editions/1/zh/consistency/06.png differ diff --git a/editions/1/zh/consistency/07.png b/editions/1/zh/consistency/07.png new file mode 100644 index 0000000..7fb5027 Binary files /dev/null and b/editions/1/zh/consistency/07.png differ diff --git a/editions/1/zh/cookbook.html b/editions/1/zh/cookbook.html new file mode 100644 index 0000000..1008b83 --- /dev/null +++ b/editions/1/zh/cookbook.html @@ -0,0 +1,422 @@ +View Cookbook for SQL Jockeys + + + + + + + + + + + +

View Cookbook for SQL Jockeys

+ +

This is a collection of some common SQL queries and how to get the same result in CouchDB. The key to remember here is that CouchDB does not work like an SQL database at all and that best practices from the SQL world do not translate well or at all to CouchDB. This chapter’s “cookbook” assumes that you are familiar with the CouchDB basics such as creating and updating databases and documents. + +

Using Views

+ +

How you would do this in SQL: + +

+CREATE TABLE
+
+ +

or: + +

+ALTER TABLE
+
+ +

Using views is a two-step process. First you define a view; then you query it. This is analogous to defining a table structure (with indexes) using CREATE TABLE or ALTER TABLE and querying it using an SQL query. + +

Defining a View

+ +

Defining a view is done by creating a special document in a CouchDB database. The only real specialness is the _id of the document, which starts with _design/—for example, _design/application. Other than that, it is just a regular CouchDB document. To make sure CouchDB understands that you are defining a view, you need to prepare the contents of that design document in a special format. Here is an example: + +

+{
+  "_id": "_design/application",
+  "_rev": "1-C1687D17",
+  "views": {
+    "viewname": {
+      "map": "function(doc) { ... }",
+      "reduce": "function(keys, values) { ... }"
+    }
+  }
+}
+
+ +

We are defining a view viewname. The definition of the view consists of two functions: the map function and the reduce function. Specifying a reduce function is optional. We’ll look at the nature of the functions later. Note that viewname can be whatever you like: users, by-name, or by-date are just some examples. + +

A single design document can also include multiple view definitions, each identified by a unique name: + +

+{
+  "_id": "_design/application",
+  "_rev": "1-C1687D17",
+  "views": {
+    "viewname": {
+      "map": "function(doc) { ... }",
+      "reduce": "function(keys, values) { ... }"
+    },
+    "anotherview": {
+      "map": "function(doc) { ... }",
+      "reduce": "function(keys, values) { ... }"
+    }
+  }
+}
+
+ +

Querying a View

+ +

The name of the design document and the name of the view are significant for querying the view. To query the view viewname, you perform an HTTP GET request to the following URI: + +

+/database/_design/application/_view/viewname
+
+ +

database is the name of the database you created your design document in. Next up is the design document name, and then the view name prefixed with _view/. To query anotherview, replace viewname in that URI with anotherview. If you want to query a view in a different design document, adjust the design document name. + +
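
In curl terms, querying that view is just a GET against the path above (with database replaced by the name of your database): + +

+curl http://127.0.0.1:5984/database/_design/application/_view/viewname
+
+ +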

MapReduce Functions

+ +

MapReduce is a concept that solves problems by applying a two-step process, aptly named the map phase and the reduce phase. The map phase looks at all documents in CouchDB separately one after the other and creates a map result. The map result is an ordered list of key/value pairs. Both key and value can be specified by the user writing the map function. A map function may call the built-in emit(key, value) function 0 to N times per document, creating a row in the map result per invocation. + +

CouchDB is smart enough to run a map function only once for every document, even on subsequent queries on a view. Only changes to documents or new documents need to be processed anew. + +

Map functions
+ +

Map functions run in isolation for every document. They can’t modify the document, and they can’t talk to the outside world—they can’t have side effects. This is required so that CouchDB can guarantee correct results without having to recalculate a complete result when only one document gets changed. + +

The map result looks like this: + +

+{"total_rows":3,"offset":0,"rows":[
+{"id":"fc2636bf50556346f1ce46b4bc01fe30","key":"Lena","value":5},
+{"id":"1fb2449f9b9d4e466dbfa47ebe675063","key":"Lisa","value":4},
+{"id":"8ede09f6f6aeb35d948485624b28f149","key":"Sarah","value":6}
+]}
+
+ +

It is a list of rows sorted by the value of key. The id is added automatically and refers back to the document that created this row. The value is the data you’re looking for. For example purposes, it’s the girl’s age. + +

The map function that produces this result is: + +

+function(doc) {
+  if(doc.name && doc.age) {
+    emit(doc.name, doc.age);
+  }
+}
+
+ +

It includes the if statement as a sanity check to ensure that we’re operating on the right fields and calls the emit function with the name and age as the key and value. + +

Reduce functions
+ +

Reduce functions are explained in the section called “Aggregate Functions”. + +

Look Up by Key

+ +

How you would do this in SQL: + +

+SELECT field FROM table WHERE value="searchterm"
+
+ +

Use case: get a result (which can be a record or set of records) associated with a key ("searchterm"). + +

To look something up quickly, regardless of the storage mechanism, an index is needed. An index is a data structure optimized for quick search and retrieval. CouchDB’s map result is stored in such an index, which happens to be a B+ tree. + +

To look up a value by "searchterm", we need to put all values into the key of a view. All we need is a simple map function: + +

+function(doc) {
+  if(doc.value) {
+    emit(doc.value, null);
+  }
+}
+
+ +

This creates a list of documents that have a value field sorted by the data in the value field. To find all the records that match "searchterm", we query the view and specify the search term as a query parameter: + +

+/database/_design/application/_view/viewname?key="searchterm"
+
+ +

Consider the documents from the previous section, and say we’re indexing on the age field of the documents to find all the five-year-olds: + +

+function(doc) {
+  if(doc.age && doc.name) {
+    emit(doc.age, doc.name);
+  }
+}
+
+ +

Query: + +

+/ladies/_design/ladies/_view/age?key=5
+
+ +

Result: + +

+{"total_rows":3,"offset":1,"rows":[
+{"id":"fc2636bf50556346f1ce46b4bc01fe30","key":5,"value":"Lena"}
+]}
+
+ +

Easy. + +

Note that you have to emit a value. The view result includes the associated document ID in every row. We can use it to look up more data from the document itself. We can also use the ?include_docs=true parameter to have CouchDB fetch the documents individually for us. + +
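
For example, repeating the age query with that parameter makes CouchDB attach the full document to every row under a doc member: + +

+/ladies/_design/ladies/_view/age?key=5&include_docs=true
+
+ +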

Look Up by Prefix

+ +

How you would do this in SQL: + +

+SELECT field FROM table WHERE value LIKE "searchterm%"
+
+ +

Use case: find all documents that have a field value that starts with searchterm. For example, say you stored a MIME type (like text/html or image/jpg) for each document and now you want to find all documents that are images according to the MIME type. + +

The solution is very similar to the previous example: all we need is a map function that is a little more clever than the first one. But first, an example document: + +

+{
+  "_id": "Hugh Laurie",
+  "_rev": "1-9fded7deef52ac373119d05435581edf",
+  "mime-type": "image/jpg",
+  "description": "some dude"
+}
+
+ +

The clue lies in extracting the prefix that we want to search for from our document and putting it into our view index. We use a regular expression to match our prefix: + +

+function(doc) {
+  if(doc["mime-type"]) {
+    // from the start (^) match everything that is not a slash ([^\/]+) until
+    // we find a slash (\/). Slashes need to be escaped with a backslash (\/)
+    var prefix = doc["mime-type"].match(/^[^\/]+\//);
+    if(prefix) {
+      emit(prefix, null);
+    }
+  }
+}
+
+ +

We can now query this view with our desired MIME type prefix and not only find all images, but also text, video, and all other formats: + +

+/files/_design/finder/_view/by-mime-type?key="image/"
+
+ +
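
If you would rather emit the full MIME type as the key, the same prefix search can be expressed as a key range instead; startkey and endkey are standard view query parameters, and the high Unicode character at the end of endkey is the usual (illustrative) trick for marking the end of the "image/" range: + +

+/files/_design/finder/_view/by-mime-type?startkey="image/"&endkey="image/\ufff0"
+
+ +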

Aggregate Functions

+ +

How you would do this in SQL: + +

+SELECT COUNT(field) FROM table
+
+ +

Use case: calculate a derived value from your data. + +

We haven’t explained reduce functions yet. Reduce functions are similar to aggregate functions in SQL. They compute a value over multiple documents. + +

To explain the mechanics of reduce functions, we’ll create one that doesn’t make a whole lot of sense. But this example is easy to understand. We’ll explore more useful reductions later. + +

Reduce functions operate on the output of the map function (also called the map result or intermediate result). The reduce function’s job, unsurprisingly, is to reduce the list that the map function produces. + +

Here’s what our summing reduce function looks like: + +

+function(keys, values) {
+  var sum = 0;
+  for(var idx in values) {
+    sum = sum + values[idx];
+  }
+  return sum;
+}
+
+ +

Here’s an alternate, more idiomatic JavaScript version: + +

+function(keys, values) {
+  var sum = 0;
+  values.forEach(function(element) {
+    sum = sum + element;
+  });
+  return sum;
+}
+
+ +

This reduce function takes two arguments: a list of keys and a list of values. For our summing purposes we can ignore the keys list and consider only the values list. We’re looping over the list and adding each item to a running total that we return at the end of the function. + +

You’ll see one difference between the map and the reduce function. The map function uses emit() to create its result, whereas the reduce function returns a value. + +

For example, from a list of integer values that specify the age, calculate the sum of all years of life for the news headline, “786 life years present at event.” A little contrived, but very simple and thus good for demonstration purposes. Consider the documents and the map view we used earlier in this chapter. + +

The reduce function to calculate the total age of all girls is: + +

+function(keys, values) {
+  return sum(values);
+}
+
+ +

Note that, instead of the two earlier versions, we use CouchDB’s predefined sum() function. It does the same thing as the other two, but it is such a common piece of code that CouchDB has it included. + +

The result for our reduce view now looks like this: + +

+{"rows":[
+{"key":null,"value":15}
+]}
+
+ +

The total sum of all age fields in all our documents is 15. Just what we wanted. The key member of the result object is null, as we can’t know anymore which documents took part in the creation of the reduced result. We’ll cover more advanced reduce cases later on. + +

As a rule of thumb, the reduce function should reduce to a single scalar value. That is, an integer; a string; or a small, fixed-size list or object that includes an aggregated value (or values) from the values argument. It should never just return values or similar. CouchDB will give you a warning if you try to use reduce “the wrong way”: + +

+{"error":"reduce_overflow_error","message":"Reduce output must shrink more rapidly: Current output: ..."}
+
+ +

Get Unique Values

+ +

How you would do this in SQL: + +

+SELECT DISTINCT field FROM table
+
+ +

Getting unique values is not as easy as adding a keyword. But a reduce view and a special query parameter give us the same result. Let’s say you want a list of tags that your users have tagged themselves with and no duplicates. + +

First, let’s look at the source documents. We punt on _id and _rev attributes here: + +

+{
+  "name":"Chris",
+  "tags":["mustache", "music", "couchdb"]
+}
+
+{
+  "name":"Noah",
+  "tags":["hypertext", "philosophy", "couchdb"]
+}
+
+{
+  "name":"Jan",
+  "tags":["drums", "bike", "couchdb"]
+}
+
+ +

Next, we need a list of all tags. A map function will do the trick: + +

+function(dude) {
+  if(dude.name && dude.tags) {
+    dude.tags.forEach(function(tag) {
+      emit(tag, null);
+    });
+  }
+}
+
+ +

The result will look like this: + +

+{"total_rows":9,"offset":0,"rows":[
+{"id":"3525ab874bc4965fa3cda7c549e92d30","key":"bike","value":null},
+{"id":"3525ab874bc4965fa3cda7c549e92d30","key":"couchdb","value":null},
+{"id":"53f82b1f0ff49a08ac79a9dff41d7860","key":"couchdb","value":null},
+{"id":"da5ea89448a4506925823f4d985aabbd","key":"couchdb","value":null},
+{"id":"3525ab874bc4965fa3cda7c549e92d30","key":"drums","value":null},
+{"id":"53f82b1f0ff49a08ac79a9dff41d7860","key":"hypertext","value":null},
+{"id":"da5ea89448a4506925823f4d985aabbd","key":"music","value":null},
+{"id":"da5ea89448a4506925823f4d985aabbd","key":"mustache","value":null},
+{"id":"53f82b1f0ff49a08ac79a9dff41d7860","key":"philosophy","value":null}
+]}
+
+ +

As promised, these are all the tags, including duplicates. Since each document gets run through the map function in isolation, it cannot know if the same key has been emitted already. At this stage, we need to live with that. To achieve uniqueness, we need a reduce: + +

+function(keys, values) {
+  return true;
+}
+
+ +

This reduce doesn’t do anything, but it allows us to specify a special query parameter when querying the view: + +

+/dudes/_design/dude-data/_view/tags?group=true
+
+ +

CouchDB replies: + +

+{"rows":[
+{"key":"bike","value":true},
+{"key":"couchdb","value":true},
+{"key":"drums","value":true},
+{"key":"hypertext","value":true},
+{"key":"music","value":true},
+{"key":"mustache","value":true},
+{"key":"philosophy","value":true}
+]}
+
+ +

In this case, we can ignore the value part because it is always true, but the result includes a list of all our tags and no duplicates! + +

With a small change we can put the reduce function to good use, too. Let’s count how many times each of these non-unique tags occurs across all documents. To calculate the tag frequency, we just use the summing trick we already learned about. In the map function, we emit a 1 instead of null: + +

+function(dude) {
+  if(dude.name && dude.tags) {
+    dude.tags.forEach(function(tag) {
+      emit(tag, 1);
+    });
+  }
+}
+
+ +

In the reduce function, we return the sum of all values: + +

+function(keys, values) {
+  return sum(values);
+}
+
+ +

Now, if we query the view with the ?group=true parameter, we get back the count for each tag: + +

+{"rows":[
+{"key":"bike","value":1},
+{"key":"couchdb","value":3},
+{"key":"drums","value":1},
+{"key":"hypertext","value":1},
+{"key":"music","value":1},
+{"key":"mustache","value":1},
+{"key":"philosophy","value":1}
+]}
+
+ +

Enforcing Uniqueness

+ +

How you would do this in SQL: + +

+UNIQUE KEY(column)
+
+ +

Use case: your applications require that a certain value exists only once in a database. + +

This is an easy one: within a CouchDB database, each document must have a unique _id field. If you require unique values in a database, just assign them to a document’s _id field and CouchDB will enforce uniqueness for you. + +
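
For example, trying to claim an _id that is already taken fails with a conflict error on the second write (document name invented, responses abbreviated): + +

+> curl -X PUT http://127.0.0.1:5984/mydb/jan -d '{"type":"username"}'
+{"ok":true,"id":"jan","rev":"1-..."}
+
+> curl -X PUT http://127.0.0.1:5984/mydb/jan -d '{"type":"username"}'
+{"error":"conflict","reason":"Document update conflict."}
+
+ +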

There’s one caveat, though: in the distributed case, when you are running more than one CouchDB node that accepts write requests, uniqueness can be guaranteed only per node or outside of CouchDB. CouchDB will allow two identical IDs to be written to two different nodes. On replication, CouchDB will detect a conflict and flag the document accordingly. diff --git a/editions/1/zh/design.html b/editions/1/zh/design.html new file mode 100644 index 0000000..5db8942 --- /dev/null +++ b/editions/1/zh/design.html @@ -0,0 +1,143 @@ +Design Documents + + + + + + + + + + + +

Design Documents

+ +

Design documents are a special type of CouchDB document that contains application code. Because it runs inside a database, the application API is highly structured. We’ve seen JavaScript views and other functions in the previous chapters. In this section, we’ll take a look at the function APIs, and talk about how functions in a design document are related within applications. + +

This part (Part II, “Developing with CouchDB”, Chapter 5, Design Documents through Chapter 9, Transforming Views with List Functions) lays the foundation for Part III, “Example Application”, where we take what we’ve learned and build a small blog application to further develop an understanding of how CouchDB applications are built. The application is called Sofa, and on a few occasions we discuss it in this part. If you are unclear on what we are referring to, do not worry, we’ll get to it in Part III, “Example Application”. + +

Document Modeling

+ +

In our experience, there are two main kinds of documents. The first kind is like something a word processor would save, or a user profile. With that sort of data, you want to denormalize as much as you possibly can. Basically, you want to be able to load the document in one request and get something that makes sense enough to display. + +

A technique exists for creating “virtual” documents by using views to collate data together. You could use this to store each attribute of your user profiles in a different document, but I wouldn’t recommend it. Virtual documents are useful in cases where the presented view will be created by merging the work of different authors; for instance, the canonical example of a blog post and its comments fetched in one query. A blog post titled “CouchDB Joins,” by Christopher Lenz, covers this in more detail. + +

This virtual document idea takes us to the other kind of document—the event log. Use this in cases where you don’t trust user input or where you need to trigger an asynchronous job. This records the user action as an event, so only minimal validation needs to occur at save time. It’s when you load the document for further work that you’d check for complex relational-style constraints. + +

You can treat documents as state machines, with a combination of user input and background processing managing document state. You’d use a view by state to pull out the relevant document—changing its state would move it in the view. + +

This approach is also useful for logging—combined with the batch=ok performance hint, CouchDB should make a fine log store, and reduce views are ideal for finding things like average response time or highly active users. + +

The Query Server

+ +

CouchDB’s default query server (the software package that executes design document functions) is written in JavaScript, but there are view servers available for nearly any language you can imagine. Implementing a new language is a matter of handling a few JSON commands from a simple line-based program. + +

In this section, we’ll review existing functionality like MapReduce views, update validation functions, and show and list transforms. We’ll also briefly describe capabilities available on CouchDB’s roadmap, like replication filters, update handlers for parsing non-JSON input, and a rewrite handler for making application URLs more palatable. Since CouchDB is an open source project, we can’t really say when each planned feature will become available, but it’s our hope that everything described here is available by the time you read this. We’ll make it clear in the text when we’re talking about things that aren’t yet in the CouchDB trunk. + +

Applications Are Documents

+ +

CouchDB is designed to work best when there is a one-to-one correspondence between applications and design documents. + +

A design document is a CouchDB document with an id that begins with _design/. For instance, the example blog application, Sofa, is stored in a design document with the ID _design/sofa (see Figure 1, “Anatomy of our design document”). Design documents are just like any other CouchDB document—they replicate along with the other documents in their database and track edit conflicts with the rev parameter. + +

As we’ve seen, design documents are normal JSON documents, denoted by the fact that their DocID is prefixed with _design/. + +

CouchDB looks for views and other application functions here. The static HTML pages of our application are served as attachments to the design document. Views and validations, however, aren’t stored as attachments; rather, they are directly included in the design document’s JSON body. + +

+ + + +

Figure 1. Anatomy of our design document + +

+ +

CouchDB’s MapReduce queries are stored in the views field. This is how Futon displays and allows you to edit MapReduce queries. View indexes are stored on a per–design document basis, according to a fingerprint of the function’s text contents. This means that if you edit attachments, validations, or any other non-view (or language) fields on the design document, the views will not be regenerated. However, if you change a map or a reduce function, the view index will be deleted and a new index built for the new view functions. + +

CouchDB has the capability to render responses in formats other than raw JSON. The design doc fields show and list contain functions used to transform raw JSON into HTML, XML, or other Content-Types. This allows CouchDB to serve Atom feeds without any additional middleware. The show and list functions are a little like “actions” in traditional web frameworks—they run some code based on a request and render a response. However, they differ from actions in that they may not have side effects. This means that they are largely restricted to handling GET requests, but it also means they can be cached by HTTP proxies like Varnish. + +
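
A tiny show function gives the flavor; the doc fields here are invented, but function(doc, req) is the interface CouchDB gives show functions, and returning a string produces the response body: + +

+function(doc, req) {
+  // Render a document as a minimal HTML page instead of raw JSON.
+  return '<h1>' + doc.title + '</h1><p>' + doc.body + '</p>';
+}
+
+ +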

Because application logic is contained in a single document, code upgrades can be accomplished with CouchDB replication. This also opens the possibility for a single database to host multiple applications. The interface a newspaper editor needs is vastly different from what a reader desires, although the data is largely the same. They can both be hosted by the same database, in different design documents. + +

A CouchDB database can contain many design documents. Example design DocIDs are: + +

+_design/calendar
+_design/contacts
+_design/blog
+_design/admin
+
+ +

In the full CouchDB URL structure, you’d be able to GET the design document JSON at URLs like: + +

+http://localhost:5984/mydb/_design/calendar
+http://127.0.0.1:5984/mydb/_design/contacts
+http://127.0.0.1:5984/mydb/_design/blog
+http://127.0.0.1:5984/mydb/_design/admin
+
+ +

We show this to note that design documents have a special case, as they are the only documents whose URLs can be used with a literal slash. We’ve done this because nobody likes to see %2F in their browser’s location bar. In all other cases, a slash in a DocID must be escaped when used in a URL. For instance, the DocID movies/jaws would appear in the URL like this: http://127.0.0.1:5984/mydb/movies%2Fjaws. + +

We’ll build the first iteration of the example application without using show or list, because writing Ajax queries against the JSON API is a better way to teach CouchDB as a database. The APIs we explore in the first iteration are the same APIs you’d use to analyze log data, archive assets, or manage persistent queues. + +

In the second iteration, we’ll upgrade our example blog so that it can function with client-side JavaScript turned off. For now, sticking to Ajax queries gives more transparency into how CouchDB’s JSON/HTTP API works. JSON is a subset of JavaScript, so working with it in JavaScript keeps the impedance mismatch low, while the browser’s XMLHttpRequest (XHR) object handles the HTTP details for us. + +

CouchDB uses the validate_doc_update function to prevent invalid or unauthorized document updates from proceeding. We use it in the example application to ensure that blog posts can be authored only by logged-in users. CouchDB’s validation functions also can’t have any side effects, and they have the opportunity to block not only end user document saves, but also replicated documents from other nodes. We’ll talk about validation in depth in Part III, “Example Application”. + +

The raw images, JavaScript, CSS, and HTML assets needed by Sofa are stored in the _attachments field, which is interesting in that by default it shows only the stubs, rather than the full content of the files. Attachments are available on all CouchDB documents, not just design documents, so asset management applications have as much flexibility as they could need. If a set of resources is required for your application to run, they should be attached to the design document. This means that a new user can easily bootstrap your application on an empty database. + +

The other fields in the design document shown in Figure 1, “Anatomy of our design document” (and in the design documents we’ll be using) are used by CouchApp’s upload process (see Chapter 10, Standalone Applications for more information on CouchApp). The signatures field allows us to avoid updating attachments that have not changed between the disk and the database. It does this by comparing file content hashes. The lib field is used to hold additional JavaScript code and JSON data to be inserted at deploy time into view, show, and validation functions. We’ll explain CouchApp in the next chapter. + +

A Basic Design Document

+ +

In the next section we’ll get into advanced techniques for working with design documents, but before we finish here, let’s look at a very basic design document. All we’ll do is define a single view, but it should be enough to show you how design documents fit into the larger system. + +

First, add the following text (or something like it) to a text file called mydesign.json using your editor: + +

+{
+  "_id" : "_design/example",
+  "views" : {
+    "foo" : {
+      "map" : "function(doc){ emit(doc._id, doc._rev)}"
+    }
+  }
+}
+
+ +

Now use curl to PUT the file to CouchDB (we’ll create a database first for good measure): + +

+curl -X PUT http://127.0.0.1:5984/basic
+curl -X PUT http://127.0.0.1:5984/basic/_design/example --data-binary @mydesign.json
+
+ +

From the second request, you should see a response like: + +

+{"ok":true,"id":"_design/example","rev":"1-230141dfa7e07c3dbfef0789bf11773a"}
+
+ +

Now we can query the view we’ve defined, but before we do that, we should add a few documents to the database so we have something to view. Running the following command a few times will add empty documents: + +

+curl -X POST http://127.0.0.1:5984/basic -d '{}'
+
+ +

Now to query the view: + +

+curl http://127.0.0.1:5984/basic/_design/example/_view/foo
+
+ +

This should give you a list of all the documents in the database (except the design document). You’ve created and used your first design document! + +

Looking to the Future

+ +

There are other design document functions that are being introduced at the time of this writing, including _update and _filter that we aren’t covering in depth here. Filter functions are covered in Chapter 20, Change Notifications. Imagine a web service that POSTs an XML blob at a URL of your choosing when particular events occur. PayPal’s instant payment notification is one of these. With an _update handler, you can POST these directly in CouchDB and it can parse the XML into a JSON document and save it. The same goes for CSV, multi-part form, or any other format. + +
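
As a rough sketch only, assuming the function(doc, req) interface that returns the document to store plus a response body, and assuming req.uuid and req.body are provided by the server, such an update handler could look like this: + +

+function(doc, req) {
+  // Illustrative: wrap whatever was POSTed into a new JSON document.
+  var newDoc = {
+    _id: req.uuid,          // assumed: a server-supplied UUID for new documents
+    received_at: new Date(),
+    raw_payload: req.body   // assumed: the unparsed request body, e.g. an XML blob
+  };
+  return [newDoc, "stored\n"];
+}
+
+ +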

The bigger picture we’re working on is like an app server, but different in one crucial regard: rather than let the developer do whatever he wants (loop a list of DocIDs and make queries, make queries based on the results of other queries, etc.), we’re defining “safe” transformations, such as view, show, list, and update. By safe, we mean that they have well-known performance characteristics and otherwise fit into CouchDB’s architecture in a streamlined way. + +

The goal here is to provide a way to build standalone apps that can also be easily indexed by search engines and used via screen readers. Hence, the push for plain old HTML. You can pretty much rely on JavaScript getting executed (except when you can’t). Having HTML resources means CouchDB is suitable for public-facing web apps. + +

On the horizon are a rewrite handler and a database event handler, as they seem to flesh out the application capabilities nicely. A rewrite handler would allow your application to present its own URL space, which would make integration into existing systems a bit easier. An event handler would allow you to run asynchronous processes when the database changes, so that, for instance, a document update can trigger a workflow, multi-document validation, or message queue. diff --git a/editions/1/zh/design/01.png b/editions/1/zh/design/01.png new file mode 100644 index 0000000..22f1906 Binary files /dev/null and b/editions/1/zh/design/01.png differ diff --git a/editions/1/zh/documents.html b/editions/1/zh/documents.html new file mode 100644 index 0000000..580bfd0 --- /dev/null +++ b/editions/1/zh/documents.html @@ -0,0 +1,368 @@ +Storing Documents + + + + + + + + + + + +

Storing Documents

+ +

Documents are CouchDB’s central data structure. To best understand and use CouchDB, you need to think in documents. This chapter walks you through the lifecycle of designing and saving a document. We’ll follow up by reading documents and aggregating and querying them with views. In the next section, you’ll see how CouchDB can also transform documents into other formats. + +

Documents are self-contained units of data. You might have heard the term record to describe something similar. Your data is usually made up of small native types such as integers and strings. Documents are the first level of abstraction over these native types. They provide some structure and logically group the primitive data. The height of a person might be encoded as an integer (176), but this integer is usually part of a larger structure that contains a label ("height": 176) and related data ({"name":"Chris", "height": 176}). + +

How many data items you put into your documents depends on your application and a bit on how you want to use views (later), but generally, a document roughly corresponds to an object instance in your programming language. Are you running an online shop? You will have items and sales and comments for your items. They all make good candidates for objects and, subsequently, documents. + +

Documents differ subtly from garden-variety objects in that they usually have authors and CRUD operations (create, read, update, delete). Document-based software (like the word processors and spreadsheets of yore) builds its storage model around saving documents so that authors get back what they created. Similarly, in a CouchDB application you may find yourself giving greater leeway to the presentation layer. If, instead of adding timestamps to your data in a controller, you allow the user to control them, you get draft status and the ability to publish articles in the future for free (by viewing published documents using an endkey of now). + +

Validation functions are available so that you don’t have to worry about bad data causing errors in your system. Often in document-based software, the client application edits and manipulates the data, saving it back. As long as you give the user the document she asked you to save, she’ll be happy. + +

Say your users can comment on the item (“lovely book”); you have the option to store the comments as an array, on the item document. This makes it trivial to find the item’s comments, but, as they say, “it doesn’t scale.” A popular item could have tens of comments, or even hundreds or more. + +

Instead of storing a list on the item document, in this case it may be better to model comments into a collection of documents. There are patterns for accessing collections, which CouchDB makes easy. You likely want to show only 10 or 20 at a time and provide previous and next links. By handling comments as individual entities, you can group them with views. A group could be the entire collection or slices of 10 or 20, sorted by the item they apply to so that it’s easy to grab the set you need. + +

A rule of thumb: break up into documents everything that you will be handling separately in your application. Items are single, and comments are single, but you don’t need to break them into smaller pieces. Views are a convenient way to group your documents in meaningful ways. + +

Let’s go through building our example application to show you in practice how to work with documents. + +

JSON Document Format

+ +

The first step in designing any application (once you know what the program is for and have the user interaction nailed down) is deciding on the format it will use to represent and store data. Our example blog is written in JavaScript. A few lines back we said documents roughly represent your data objects. In this case, there is an exact correspondence. CouchDB borrowed the JSON data format from JavaScript; this allows us to use documents directly as native objects when programming. This is really convenient and leads to fewer problems down the road (if you ever worked with an ORM system, you might know what we are hinting at). + +

Let’s draft a JSON format for blog posts. We know we’ll need each post to have an author, a title, and a body. We know we’d like to use document IDs to find documents so that URLs are search engine–friendly, and we’d also like to list them by creation date. + +

It should be pretty straightforward to see how JSON works. Curly braces ({}) wrap objects, and objects are key/value lists. Keys are strings that are wrapped in double quotes (""). Finally, a value is a string, an integer, an object, or an array ([]). Keys and values are separated by a colon (:), and multiple keys and values by comma (,). That’s it. For a complete description of the JSON format, see Appendix E, JSON Primer. + +

Figure 1, “The JSON post format” shows a document that meets our requirements. The cool thing is we just made it up on the spot. We didn’t go and define a schema, and we didn’t define how things should look. We just created a document with whatever we needed. Now, requirements for objects change all the time during the development of an application. Coming up with a different document that meets new, evolved needs is just as easy. + +
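
Figure 1 is an image in the original; for reference, a post document along the lines it depicts, using the fields discussed later in this chapter, looks like this: + +

+{
+  "_id":"Hello-Sofa",
+  "_rev":"2-2143609722",
+  "type":"post",
+  "author":"jchris",
+  "title":"Hello Sofa",
+  "tags":["example","blog post","json"],
+  "format":"markdown",
+  "body":"some markdown text",
+  "html":"<p>the html text</p>",
+  "created_at":"2009/05/25 06:10:40 +0000"
+}
+
+ +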

+ + + +

Figure 1. The JSON post format + +

+ +
+ +

Do I really look like a guy with a plan? You know what I am? I’m a dog chasing cars. I wouldn’t know what to do with one if I caught it. You know, I just do things. The mob has plans, the cops have plans, Gordon’s got plans. You know, they’re schemers. Schemers trying to control their little worlds. I’m not a schemer. I try to show the schemers how pathetic their attempts to control things really are. + +

—The Joker, The Dark Knight + +

+ +

Let’s examine the document in a little more detail. The first two members (_id and _rev) are for CouchDB’s housekeeping and act as identification for a particular instance of a document. _id is easy: if I store something in CouchDB, it creates the _id and returns it to me. I can use the _id to build the URL where I can get my something back. + +

+ +

Your document’s _id defines the URL the document can be found under. Say you have a database movies. All documents can be found somewhere under the URL /movies, but where exactly? + +

If you store a document with the _id Jabberwocky ({"_id":"Jabberwocky"}) into your movies database, it will be available under the URL /movies/Jabberwocky. So if you send a GET request to /movies/Jabberwocky, you will get back the JSON that makes up your document ({"_id":"Jabberwocky"}). + +
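
In curl terms (assuming the movies database already exists): + +

+curl -X PUT http://127.0.0.1:5984/movies/Jabberwocky -d '{"_id":"Jabberwocky"}'
+curl http://127.0.0.1:5984/movies/Jabberwocky
+
+ +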

+ +

The _rev (or revision ID) describes a version of a document. Each change creates a new document version (that again is self-contained) and updates the _rev. This becomes useful because, when saving a document, you must provide an up-to-date _rev so that CouchDB knows you’ve been working against the latest document version. + +

We touched on this in Chapter 2, Eventual Consistency. The revision ID acts as a gatekeeper for writes to a document in CouchDB’s MVCC system. A document is a shared resource; many clients can read and write them at the same time. To make sure two writing clients don’t step on each other’s feet, each client must provide what it believes is the latest revision ID of a document along with the proposed changes. If the on-disk revision ID matches the provided _rev, CouchDB will accept the change. If it doesn’t, the update will be rejected. The client should read the latest version, integrate the changes, and try saving again. + +

This mechanism ensures two things: a client can only overwrite a version it knows, and it can’t trip over changes made by other clients. This works without CouchDB having to manage explicit locks on any document. This ensures that no client has to wait for another client to complete any work. Updates are serialized, so CouchDB will never attempt to write documents faster than your disk can spin, and it also means that two mutually conflicting writes can’t be written at the same time. + +

Beyond _id and _rev: Your Document Data

+ +

Now that you thoroughly understand the role of _id and _rev on a document, let’s look at everything else we’re storing. + +

+{
+  "_id":"Hello-Sofa",
+  "_rev":"2-2143609722",
+  "type":"post",
+
+ +

The first thing is the type of the document. Note that this is an application-level parameter, not anything particular to CouchDB. The type is just an arbitrarily named key/value pair as far as CouchDB is concerned. For us, as we’re adding blog posts to Sofa, it has a little deeper meaning. Sofa uses the type field to determine which validations to apply. It can then rely on documents of that type being valid in the views and the user interface. This removes the need to check for every field and nested JSON value before using it. This is purely by convention, and you can make up your own or infer the type of a document by its structure (“has an array with three elements”—a.k.a. duck typing). We just thought this was easy to follow and hope you agree. + +

+  "author":"jchris",
+  "title":"Hello Sofa",
+
+ +

The author and title fields are set when the post is created. The title field can be changed, but the author field is locked by the validation function for security. Only the author may edit the post. + +

+  "tags":["example","blog post","json"],
+
+ +

Sofa’s tag system just stores them as an array on the document. This kind of denormalization is a particularly good fit for CouchDB. + +

+  "format":"markdown",
+  "body":"some markdown text",
+  "html":"<p>the html text</p>",
+
+ +

Blog posts are composed in the Markdown HTML format to make them easy to author. The Markdown format as typed by the user is stored in the body field. Before the blog post is saved, Sofa converts it to HTML in the client’s browser. There is an interface for previewing the Markdown conversion, so you can be sure it will display as you like. + +

+  "created_at":"2009/05/25 06:10:40 +0000"
+}
+
+ +

The created_at field is used to order blog posts in the Atom feed and on the HTML index page. + +

The Edit Page

+ +

The first page we need to build in order to get one of these blog entries into our post is the interface for creating and editing posts. + +

Editing is more complex than just rendering posts for visitors to read, but that means once you’ve read this chapter, you’ll have seen most of the techniques we touch on in the other chapters. + +

The first thing to look at is the show function used to render the HTML page. If you haven’t already, read Chapter 8, Show Functions to learn about the details of the API. We’ll just look at this code in the context of Sofa, so you can see how it all fits together. + +

+function(doc, req) {
+  // !json templates.edit
+  // !json blog
+  // !code vendor/couchapp/path.js
+  // !code vendor/couchapp/template.js
+
+ +

Sofa’s edit page show function is very straightforward. In the previous section, we showed the important templates and libraries we’ll use. The important line is the !json macro, which loads the edit.html template from the templates directory. These macros are run by CouchApp, as Sofa is being deployed to CouchDB. For more information about the macros, see Chapter 13, Showing Documents in Custom Formats. + +

+  // we only show html
+  return template(templates.edit, {
+    doc : doc,
+    docid : toJSON((doc && doc._id) || null),
+    blog : blog,
+    assets : assetPath(),
+    index : listPath('index','recent-posts',{descending:true,limit:8})
+  });
+}
+
+ +

The rest of the function is simple. We’re just rendering the HTML template with data culled from the document. In the case where the document does not yet exist, we make sure to set the docid to null. This allows us to use the same template both for creating new blog posts as well as editing existing ones. + +

The HTML Scaffold

+ +

The only missing piece of this puzzle is the HTML that it takes to save a document like this. + +

In your browser, visit http://127.0.0.1:5984/blog/_design/sofa/_show/edit and, using your text editor, open the source file templates/edit.html (or view source in your browser). Everything is ready to go; all we have to do is wire up CouchDB using in-page JavaScript. See Figure 2, “HTML listing for edit.html”. + +

Just like any web application, the important part of the HTML is the form for accepting edits. The edit form captures a few basic data items: the post title, the body (in Markdown format), and any tags the author would like to apply. + +

+<!-- form to create a Post -->
+<form id="new-post" action="new.html" method="post">
+  <h1>Create a new post</h1>
+  <p><label>Title</label>
+    <input type="text" size="50" name="title"></p>
+  <p><label for="body">Body</label>
+    <textarea name="body" rows="28" cols="80">
+    </textarea></p>
+  <p><input id="preview" type="button" value="Preview"/>
+    <input type="submit" value="Save &rarr;"/></p>
+</form>
+
+ +

We start with just a raw HTML document, containing a normal HTML form. We use JavaScript to convert user input into a JSON document and save it to CouchDB. In the spirit of focusing on CouchDB, we won’t dwell on the JavaScript here. It’s a combination of Sofa-specific application code, CouchApp’s JavaScript helpers, and jQuery for interface elements. The basic story is that it watches for the user to click “Save,” and then applies some callbacks to the document before sending it to CouchDB. + +

+ + + +

Figure 2. HTML listing for edit.html + +

+ +

Saving a Document

+ +

The JavaScript that drives blog post creation and editing centers around the HTML form from Figure 2, “HTML listing for edit.html”. The CouchApp jQuery plug-in provides some abstraction, so we don’t have to concern ourselves with the details of how the form is converted to a JSON document when the user hits the submit button. $.CouchApp also ensures that the user is logged in and makes her information available to the application. See Figure 3, “JavaScript callbacks for edit.html”. + +

+$.CouchApp(function(app) {
+  app.loggedInNow(function(login) {
+
+ +

The first thing we do is ask the CouchApp library to make sure the user is logged in. Assuming the answer is yes, we’ll proceed to set up the page as an editor. This means we apply a JavaScript event handler to the form and specify callbacks we’d like to run on the document, both when it is loaded and when it saved. + +

+ + + +

Figure 3. JavaScript callbacks for edit.html + +

+ +
+    // w00t, we're logged in (according to the cookie)
+    $("#header").prepend('<span id="login">'+login+'</span>');
+    // setup CouchApp document/form system, adding app-specific callbacks
+    var B = new Blog(app);
+
+ +

Now that we know the user is logged in, we can render his username at the top of the page. The variable B is just a shortcut to some of the Sofa-specific blog rendering code. It contains methods for converting blog post bodies from Markdown to HTML, as well as a few other odds and ends. We pulled these functions into blog.js so we could keep them out of the way of main code. + +

+    var postForm = app.docForm("form#new-post", {
+      id : <%= docid %>,
+      fields : ["title", "body", "tags"],
+      template : {
+        type : "post",
+        format : "markdown",
+        author : login
+      },
+
+ +

CouchApp’s app.docForm() helper is a function to set up and maintain a correspondence between a CouchDB document and an HTML form. Let’s look at the first three arguments passed to it by Sofa. The id argument tells docForm() where to save the document. This can be null in the case of a new document. We set fields to an array of form elements that will correspond directly to JSON fields in the CouchDB document. Finally, the template argument is given a JavaScript object that will be used as the starting point, in the case of a new document. In this case, we ensure that the document has a type equal to “post,” and that the default format is Markdown. We also set the author to be the login name of the current user. + +

+      onLoad : function(doc) {
+        if (doc._id) {
+          B.editing(doc._id);
+          $('h1').html('Editing <a href="../post/'+doc._id+'">'+doc._id+'</a>');
+          $('#preview').before('<input type="button" id="delete" value="Delete Post"/> ');
+          $("#delete").click(function() {
+            postForm.deleteDoc({
+              success: function(resp) {
+                $("h1").text("Deleted "+resp.id);
+                $('form#new-post input').attr('disabled', true);
+              }
+            });
+            return false;
+          });
+        }
+        $('label[for=body]').append(' <em>with '+(doc.format||'html')+'</em>');
+
+ +

The onLoad callback is run when the document is loaded from CouchDB. It is useful for decorating the document before passing it to the form, or for setting up other user interface elements. In this case, we check to see if the document already has an ID. If it does, that means it’s been saved, so we create a button for deleting it and set up the callback to the delete function. It may look like a lot of code, but it’s pretty standard for Ajax applications. If there is one criticism to make of this section, it’s that the logic for creating the delete button could be moved to the blog.js file so we can keep more user-interface details out of the main flow. + +

+      },
+      beforeSave : function(doc) {
+        doc.html = B.formatBody(doc.body, doc.format);
+        if (!doc.created_at) {
+          doc.created_at = new Date();
+        }
+        if (!doc.slug) {
+          doc.slug = app.slugifyString(doc.title);
+          doc._id = doc.slug;
+        }
+        if(doc.tags) {
+          doc.tags = doc.tags.split(",");
+          for(var idx in doc.tags) {
+            doc.tags[idx] = $.trim(doc.tags[idx]);
+          }
+        }
+      },
+
+ +

The beforeSave() callback to docForm is run after the user clicks the submit button. In Sofa’s case, it manages setting the blog post’s timestamp, transforming the title into an acceptable document ID (for prettier URLs), and processing the document tags from a string into an array. It also runs the Markdown-to-HTML conversion in the browser so that once the document is saved, the rest of the application has direct access to the HTML. + +
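To make the effect of these callbacks concrete, here is roughly what a brand-new post document could look like by the time it reaches CouchDB. The field values are illustrative, but the field names match what docForm(), beforeSave(), and the validation function expect:

+{
+  "_id": "hello-sofa",
+  "slug": "hello-sofa",
+  "type": "post",
+  "format": "markdown",
+  "author": "jchris",
+  "title": "Hello Sofa",
+  "body": "Hello **world**, this is my first post.",
+  "html": "<p>Hello <strong>world</strong>, this is my first post.</p>",
+  "created_at": "2009/05/25 06:10:40 +0000",
+  "tags": ["couchdb", "hello"]
+}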

+      success : function(resp) {
+        $("#saved").text("Saved _rev: "+resp.rev).fadeIn(500).fadeOut(3000);
+        B.editing(resp.id);
+      }
+    });
+
+ +

The last callback we use in Sofa is the success callback. It is fired when the document is successfully saved. In our case, we use it to flash a message to the user that lets her know she’s succeeded, as well as to add a link to the blog post so that when you create a blog post for the first time, you can click through to see its permalink page. + +

That’s it for the docForm() callbacks. + +

+    $("#preview").click(function() {
+      var doc = postForm.localDoc();
+      var html = B.formatBody(doc.body, doc.format);
+      $('#show-preview').html(html);
+      // scroll down
+      $('body').scrollTo('#show-preview', {duration: 500});
+    });
+
+ +

Sofa has a function to preview blog posts before saving them. Since this doesn’t affect how the document is saved, the code that watches for events from the “preview” button is not applied within the docForm() callbacks. + +

+  }, function() {
+    app.go('<%= assets %>/account.html#'+document.location);
+  });
+});
+
+ +

The last bit of code here is triggered when the user is not logged in. All it does is redirect him to the account page so that he can log in and try editing again. + +

Validation

+ +

Hopefully, you can see how the previous code will send a JSON document to CouchDB when the user clicks save. That’s great for creating a user interface, but it does nothing to protect the database from unwanted updates. This is where validation functions come into play. With a proper validation function, even a determined hacker cannot get unwanted documents into your database. Let’s look at how Sofa’s validation function works. For more on validation functions, see Chapter 7, Validation Functions. + 

+function (newDoc, oldDoc, userCtx) {
+  // !code lib/validate.js
+
+ +

This line imports a library from Sofa that makes the rest of the function much more readable. It is just a wrapper around the basic ability to mark requests as either forbidden or unauthorized. In this chapter, we’ve concentrated on the business logic of the validation function. Just be aware that unless you use Sofa’s validate.js, you’ll need to work with the more primitive logic that the library abstracts. + +
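For reference, the primitive mechanism is simply to throw a JavaScript object with a forbidden or unauthorized key; something like this sketch is what validate.js saves us from writing by hand:

+function(newDoc, oldDoc, userCtx) {
+  // reject the write outright (CouchDB responds with HTTP 403)
+  if (oldDoc && newDoc.type !== oldDoc.type) {
+    throw({forbidden : "The type field may not be changed."});
+  }
+  // ask the client to authenticate as someone else (HTTP 401)
+  if (newDoc.author && newDoc.author != userCtx.name) {
+    throw({unauthorized : "Only " + newDoc.author + " may edit this document."});
+  }
+}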

+  unchanged("type");
+  unchanged("author");
+  unchanged("created_at");
+
+ +

These lines do just what they say. If the document’s type, author, or created_at fields are changed, they throw an error saying the update is forbidden. Note that these lines make no assumptions about the content of these fields. They merely state that updates must not change the content from one revision of the document to the next. + +
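The implementation of such a helper is only a few lines. This sketch captures the idea (the real validate.js may differ in details); note that newDoc and oldDoc are in scope because the library is inlined into the validation function by the !code macro:

+function unchanged(field) {
+  // only meaningful when a previous revision exists to compare against
+  if (oldDoc && toJSON(oldDoc[field]) != toJSON(newDoc[field])) {
+    throw({forbidden : "Field '" + field + "' may not be changed."});
+  }
+}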

+  if (newDoc.created_at) dateFormat("created_at");
+
+ +

The dateFormat helper makes sure that the date (if one is provided) is in the format that Sofa’s views expect. + +
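We haven’t shown dateFormat’s source, but the idea can be sketched as a simple pattern check against the timestamp layout Sofa uses; the exact format and check here are assumptions, not the library’s code:

+function dateFormat(field) {
+  // Sofa's timestamps look like "2008/12/25 23:27:17 +0000"
+  var ok = /^\d{4}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}$/.test(newDoc[field]);
+  if (!ok) {
+    throw({forbidden : field + " must look like YYYY/MM/DD hh:mm:ss +ZZZZ"});
+  }
+}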

+  // docs with authors can only be saved by their author
+  // admin can author anything...
+  if (!isAdmin(userCtx) && newDoc.author && newDoc.author != userCtx.name) {
+      unauthorized("Only "+newDoc.author+" may edit this document.");
+  }
+
+ +

If the person saving the document is an admin, let the edit proceed. Otherwise, make certain that the author and the person saving the document are the same. This ensures that authors may edit only their own posts. + +

+  // authors and admins can always delete
+  if (newDoc._deleted) return true;
+
+ +

The next block of code will check the validity of various types of documents. However, deletions will normally not be valid according to those specifications, because their content is just _deleted: true, so we short-circuit the validation function here. + 

+  if (newDoc.type == 'post') {
+    require("created_at", "author", "body", "html", "format", "title", "slug");
+    assert(newDoc.slug == newDoc._id, "Post slugs must be used as the _id.")
+  }
+}
+
+ +

Finally, we have the validation for the actual post document itself. Here we require the fields that are particular to the post document. Because we’ve validated that they are present, we can count on them in views and user interface code. + +

Save Your First Post

+ +

Let’s see how this all works together! Fill out the form with some practice data, and hit “save” to see a success response. + +

Figure 4, “JSON over HTTP to save the blog post” shows how JavaScript has used HTTP to PUT the document to a URL constructed of the database name plus the document ID. It also shows how the document is just sent as a JSON string in the body of the PUT request. If you were to GET the document URL, you’d see the same set of JSON data, with the addition of the _rev parameter as applied by CouchDB. + +

+ + + +

Figure 4. JSON over HTTP to save the blog post + +

+ +
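In code, the save boils down to a jQuery Ajax call along these lines. This is a sketch with illustrative values; in Sofa, the CouchApp docForm helper takes care of it for us:

+var doc = {
+  _id : "hello-sofa",   // database name + this ID form the request URL
+  type : "post",
+  title : "Hello Sofa"
+  // ...the remaining fields are filled in by the docForm callbacks
+};
+
+$.ajax({
+  type : "PUT",
+  url : "/blog/" + encodeURIComponent(doc._id),
+  contentType : "application/json",
+  data : JSON.stringify(doc),   // the document travels as a JSON string in the body
+  dataType : "json",
+  success : function(resp) {
+    // CouchDB answers with {"ok":true, "id":..., "rev":...}
+    alert("Saved revision " + resp.rev);
+  }
+});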

To see the JSON version of the document you’ve saved, you can also browse to it in Futon. Visit http://127.0.0.1:5984/_utils/database.html?blog/_all_docs and you should see a document with an ID corresponding to the one you just saved. Click it to see what Sofa is sending to CouchDB. + +

Wrapping Up

+ +

We’ve covered how to design JSON formats for your application, how to enforce those designs with validation functions, and the basics of how documents are saved. In the next chapter, we’ll show how to load documents from CouchDB and display them in the browser. diff --git a/editions/1/zh/documents/01.png b/editions/1/zh/documents/01.png new file mode 100644 index 0000000..4efb367 Binary files /dev/null and b/editions/1/zh/documents/01.png differ diff --git a/editions/1/zh/documents/02.png b/editions/1/zh/documents/02.png new file mode 100644 index 0000000..2441a17 Binary files /dev/null and b/editions/1/zh/documents/02.png differ diff --git a/editions/1/zh/documents/03.png b/editions/1/zh/documents/03.png new file mode 100644 index 0000000..164875b Binary files /dev/null and b/editions/1/zh/documents/03.png differ diff --git a/editions/1/zh/documents/04.png b/editions/1/zh/documents/04.png new file mode 100644 index 0000000..324a8e9 Binary files /dev/null and b/editions/1/zh/documents/04.png differ diff --git a/editions/1/zh/foreword.html b/editions/1/zh/foreword.html new file mode 100644 index 0000000..40a78e4 --- /dev/null +++ b/editions/1/zh/foreword.html @@ -0,0 +1,31 @@ +Foreword + + + + + + + + + +

Foreword

+ +

Damien Katz, Creator of CouchDB

+ +

As the creator of CouchDB, it gives me great pleasure to write this Foreword. This book has been a long time coming. I’ve worked on CouchDB since 2005, when it was only a vision in my head and only my wife Laura believed I could make it happen. + +

Now the project has taken on a life of its own, and code is literally running on millions of machines. I couldn’t stop it now if I tried. + +

A great analogy J. Chris uses is that CouchDB has felt like a boulder we’ve been pushing up a hill. Over time, it’s been moving faster and getting easier to push, and now it’s moving so fast it’s starting to feel like it could get loose and crush some unlucky villagers. Or something. Hey, remember “Tales of the Runaway Boulder” with Robert Wagner on Saturday Night Live? Good times. + +

Well, now we are trying to safely guide that boulder. Because of the villagers. You know what? This boulder analogy just isn’t working. Let’s move on. + +

The reason for this book is that CouchDB is a very different way of approaching data storage. A way that isn’t inherently better or worse than the ways before—it’s just another tool, another way of thinking about things. It’s missing some features you might be used to, but it’s gained some abilities you’ve maybe never seen. Sometimes it’s an excellent fit for your problems; sometimes it’s terrible. + +

And sometimes you may be thinking about your problems all wrong. You just need to approach them from a different angle. + +

Hopefully this book will help you understand CouchDB and the approach that it takes, and also understand how and when it can be used for the problems you face. + +

Otherwise, someday it could become a runaway boulder, being misused and causing disasters that could have been avoided. + +

And I’ll be doing my best Charlton Heston imitation, on the ground, pounding the dirt, yelling, “You maniacs! You blew it up! Ah, damn you! God damn you all to hell!” Or something like that. diff --git a/editions/1/zh/formats.html b/editions/1/zh/formats.html new file mode 100644 index 0000000..e512448 --- /dev/null +++ b/editions/1/zh/formats.html @@ -0,0 +1,133 @@ +Showing Documents in Custom Formats + + + + + + + + + + + +

Showing Documents in Custom Formats

+ +

CouchDB’s show functions are a RESTful API inspired by a similar feature in Lotus Notes. In a nutshell, they allow you to serve documents to clients, in any format you choose. + +

A show function builds an HTTP response with any Content-Type, based on a stored JSON document. For Sofa, we’ll use them to show the blog post permalink pages. This will ensure that these pages are indexable by search engines, as well as make the pages more accessible. Sofa’s show function displays each blog post as an HTML page, with links to stylesheets and other assets, which are stored as attachments to Sofa’s design document. + +

Hey, this is great—we’ve rendered a blog post! See Figure 1, “A rendered post”. + +

+ + + +

Figure 1. A rendered post + +

+ +

The complete show function and template will render a static, cacheable resource that does not depend on details about the current user or anything else aside from the requested document and Content-Type. Generating HTML from a show function will not cause any side effects in the database, which has positive implications for building simple scalable applications. + +

Rendering Documents with Show Functions

+ +

Let’s look at the source code. The first thing we’ll see is the JavaScript function body, which is very simple—it simply runs a template function to generate the HTML page. Let’s break it down: + +

+function(doc, req) {
+  // !json templates.post
+  // !json blog
+  // !code vendor/couchapp/template.js
+  // !code vendor/couchapp/path.js
+
+ +

We’re familiar with the !code and !json macros from Chapter 12, Storing Documents. In this case, we’re using them to import a template and some metadata about the blog (as JSON data), as well as to include link and template rendering functions as inline code. + +

Next, we render the template: + +

+  return template(templates.post, {
+    title : doc.title,
+    blogName : blog.title,
+    post : doc.html,
+    date : doc.created_at,
+    author : doc.author,
+
+ +

The blog post title, HTML body, author, and date are taken from the document, with the blog’s title included from its JSON value. The next three calls all use the path.js library to generate links based on the request path. This ensures that links within the application are correct. + +

+    assets : assetPath(),
+    editPostPath : showPath('edit', doc._id),
+    index : listPath('index','recent-posts',{descending:true, limit:5})
+  });
+}
+
+ +

So we’ve seen that the function body itself just calculates some values (based on the document, the request, and some deployment specifics, like the name of the database) to send to the template for rendering. The real action is in the HTML template. Let’s take a look. + +

The Post Page Template

+ +

The template defines the output HTML, with the exception of a few tags that are replaced with dynamic content. In Sofa’s case, the dynamic tags look like <%= replace_me %>, which is a common templating tag delimiter. + +

The template engine used by Sofa is adapted from John Resig’s blog post, “JavaScript Micro-Templating”. It was chosen as the simplest one that worked in the server-side context without modification. Using a different template engine would be a simple exercise. + +
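To give a flavor of what such an engine does, here is a minimal substitution-only sketch. Resig’s actual approach compiles templates into functions, and Sofa’s template.js differs in detail:

+// replace each <%= name %> tag with the matching property from values
+function template(tmpl, values) {
+  return tmpl.replace(/<%=\s*(\w+)\s*%>/g, function(match, name) {
+    return (name in values) ? values[name] : "";
+  });
+}
+
+// template("<h1><%= title %></h1>", {title : "Hello Sofa"})
+// => "<h1>Hello Sofa</h1>"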

Let’s look at the template string. Remember that it is included in the JavaScript using the CouchApp !json macro, so that CouchApp can handle escaping it and including it to be used by the templating engine. + +

+<!DOCTYPE html>
+<html>
+  <head>
+    <title><%= title %> : <%= blogName %></title>
+
+ +

This is the first time we’ve seen a template tag in action—the blog post title and the name of the blog (as defined in blog.json) are both used to craft the HTML <title> tag. + 

+    <link rel="stylesheet" href="../../screen.css" type="text/css">
+
+ +

Because show functions are served from within the design document path, we can link to attachments on the design document using relative URIs. Here we’re linking to screen.css, a file stored in the _attachments folder of the Sofa source directory. + +

+  </head>
+  <body>
+    <div id="header">
+      <a id="edit" href="<%= editPostPath %>">Edit this post</a>
+      <h2><a href="<%= index %>"><%= blogName %></a></h2>
+
+ +

Again, we’re seeing template tags used to replace content. In this case, we link to the edit page for this post, as well as to the index page of the blog. + +

+    </div>
+    <div id="content">
+      <h1><%= title %></h1>
+      <div id="post">
+        <span class="date"><%= date %></span>
+
+ +

The post title is used for the <h1> tag, and the date is rendered in a special tag with a class of date. See the section called “Dynamic Dates” for an explanation of why we output static dates in the HTML instead of rendering a user-friendly string like “3 days ago” to describe the date. + +

+        <div class="body"><%= post %></div>
+      </div>
+    </div>
+  </body>
+</html>
+
+ +

In the close of the template, we render the post HTML (as converted from Markdown and saved from the author’s browser). + +

Dynamic Dates

+ +

Show function output is static with respect to the underlying document, which means that when running CouchDB behind a caching proxy, each show function should only have to be rendered once per updated document. It also explains why the timestamp looks like 2008/12/25 23:27:17 +0000 instead of “9 days ago.” + 

It also means that for presentation items that depend on the current time, or the identity of the browsing user, we’ll need to use client-side JavaScript to make dynamic changes to the final HTML. + +

+    $('.date').each(function() {
+      $(this).text(app.prettyDate(this.innerHTML));
+    });
+
+ +

We include this detail about the browser-side JavaScript implementation not to teach you about Ajax, but because it epitomizes the kind of thinking that makes sense when you are presenting documents to client applications. CouchDB should provide the most useful format for the document, as requested by the client. But when it comes time to integrate information from other queries or bring the display up-to-date with other web services, by asking the client’s application to do the lifting, you move computing cycles and memory costs from CouchDB to the client. Since there are typically many more clients than CouchDBs, pushing the load back to the clients means each CouchDB can serve more users. diff --git a/editions/1/zh/formats/01.png b/editions/1/zh/formats/01.png new file mode 100644 index 0000000..a65bd55 Binary files /dev/null and b/editions/1/zh/formats/01.png differ diff --git a/editions/1/zh/index.html b/editions/1/zh/index.html new file mode 100644 index 0000000..bfc79e7 --- /dev/null +++ b/editions/1/zh/index.html @@ -0,0 +1,87 @@ +CouchDB: The Definitive Guide + + + + + + + +

Table of Contents

+ +

Foreword

+ +

Preface

+ +

Part I. Introduction

+ +

1. Why CouchDB?

+ +

2. Eventual Consistency

+ +

3. Getting Started

+ +

4. The Core API

+ +

Part II. Developing with CouchDB

+ +

5. Design Documents

+ +

6. Finding Your Data with Views

+ +

7. Validation Functions

+ +

8. Show Functions

+ +

9. Transforming Views with List Functions

+ +

Part III. Example Application

+ +

10. Standalone Applications

+ +

11. Managing Design Documents

+ +

12. Storing Documents

+ +

13. Showing Documents in Custom Formats

+ +

14. Viewing Lists of Blog Posts

+ +

Part IV. Deploying CouchDB

+ +

15. Scaling Basics

+ +

16. Replication

+ +

17. Conflict Management

+ +

18. Load Balancing

+ +

19. Clustering

+ +

Part V. Reference

+ +

20. Change Notifications

+ +

21. View Cookbook for SQL Jockeys

+ +

22. Security

+ +

23. High Performance

+ +

24. Recipes

+ +

Part VI. Appendix

+ +

A. Installing on Unix-like Systems

+ +

B. Installing on Mac OS X

+ +

C. Installing on Windows

+ +

D. Installing from Source

+ +

E. JSON Primer

+ +

F. The Power of B-trees

+ +

Colophon

diff --git a/editions/1/zh/json.html b/editions/1/zh/json.html new file mode 100644 index 0000000..69e2cc8 --- /dev/null +++ b/editions/1/zh/json.html @@ -0,0 +1,113 @@ +JSON Primer + + + + + + + + + + + +

JSON Primer

+ +

CouchDB uses JavaScript Object Notation (JSON) for data storage, a lightweight format based on a subset of JavaScript syntax. One of the best bits about JSON is that it’s easy to read and write by hand, much more so than something like XML. We can parse it naturally with JavaScript because it shares part of the same syntax. This really comes in handy when we’re building dynamic web applications and we want to fetch some data from the server. + 

Here’s a sample JSON document: + +

+{
+    "Subject": "I like Plankton",
+    "Author": "Rusty",
+    "PostedDate": "2006-08-15T17:30:12-04:00",
+    "Tags": [
+        "plankton",
+        "baseball",
+        "decisions"
+    ],
+    "Body": "I decided today that I don't like baseball. I like plankton."
+}
+
+ +

You can see that the general structure is based around key/value pairs and lists of things. + +
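For example, in the browser the sample above can be turned into a live object and back into a string using the JSON helpers available in modern browsers (or Douglas Crockford’s json2.js on older ones):

+var text = '{"Subject": "I like Plankton", "Author": "Rusty"}';
+
+var doc = JSON.parse(text);   // string -> object
+doc.Author;                   // "Rusty"
+
+JSON.stringify(doc);          // object -> string, ready to send back to CouchDB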

Data Types

+ +

JSON has a number of basic data types you can use. We’ll cover them all here. + +

Numbers

+ +

You can have positive integers: "Count": 253 + +

Or negative integers: "Score": -19 + +

Or floating-point numbers: "Area": 456.31 + +

+ +

There is a subtle but important difference between floating-point numbers and decimals. When you use a number like 15.7, this will be interpreted as 15.699999999999999 by most clients, which may be problematic for your application. For this reason, currency values are usually better represented as strings in JSON. A string like "15.7" will be interpreted as "15.7" by every JSON client. + +

+ +

Or scientific notation: "Density": 5.6e+24 + +
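The floating-point caveat above is easy to see in a JavaScript client (the Price field here is just an illustration):

+var doc = JSON.parse('{"Price": 15.7}');
+
+doc.Price.toPrecision(17);             // "15.699999999999999" -- binary floating point
+JSON.parse('{"Price": "15.7"}').Price; // "15.7", exactly as written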

Strings

+ +

You can use strings for values: + +

+"Author": "Rusty"
+
+ +

You have to escape some special characters, like tabs or newlines: + +

+"poem": "May I compare thee to some\n\tsalty plankton."
+
+ +

The JSON site has details on what needs to be escaped. + +

Booleans

+ +

You can have boolean true values: + +

+"Draft": true
+
+ +

Or boolean false values: + +

+"Draft": false
+
+ +

Arrays

+ +

An array is a list of values: + +

+"Tags": ["plankton", "baseball", "decisions"]
+
+ +

An array can contain any other data type, including arrays: + +

+"Context": ["dog", [1, true], {"Location": "puddle"}]
+
+ +

Objects

+ +

An object is a list of key/value pairs: + +

+{"Subject": "I like Plankton", "Author": "Rusty"}
+
+ +

Nulls

+ +

You can have null values: + +

+"Surname": null
+
diff --git a/editions/1/zh/lists.html b/editions/1/zh/lists.html new file mode 100644 index 0000000..321df04 --- /dev/null +++ b/editions/1/zh/lists.html @@ -0,0 +1,268 @@ +Viewing Lists of Blog Posts + + + + + + + + + + + +

Viewing Lists of Blog Posts

+ +

The last few chapters dealt with getting data into and out of CouchDB. You learned how to model your data into documents and retrieve it via the HTTP API. In this chapter, we’ll look at the views used to power Sofa’s index page, and the list function that renders those views as HTML or XML, depending on the client’s request. + +

Now that we’ve successfully created a blog post and rendered it as HTML, we’ll be building the front page where visitors will land when they’ve found your blog. This page will have a list of the 10 most recent blog posts, with titles and short summaries. The first step is to write the MapReduce query that constructs the index used by CouchDB at query time to find blog posts based on when they were written. + +

In Chapter 6, Finding Your Data with Views, we noted that reduce isn’t needed for many common queries. For the index page, we’re only interested in an ordering of the posts by date, so we don’t need to use a reduce function, as the map function alone is enough to order the posts by date. + +

Map of Recent Blog Posts

+ +

You’re now ready to write the map function that builds a list of all blog posts. The goals for this view are simple: sort all blog posts by date. + +

Here is the source code for the view function. I’ll call out the important bits as we encounter them. + +

+function(doc) {
+  if (doc.type == "post") {
+
+ +

The first thing we do is ensure that the document we’re dealing with is a post. We don’t want comments or anything other than blog posts getting on the front page. The expression doc.type == "post" evaluates to true for posts but no other kind of document. In Chapter 7, Validation Functions, we saw that the validation function gives us certain guarantees about posts, designed to make us comfortable about putting them on the front page of our blog. + +

+    var summary = (doc.html.replace(/<(.|\n)*?>/g, '').substring(0,350) + '...');
+
+ +

This line shortens the blog post’s HTML (generated from Markdown before saving) and strips out most tags and images, at least well enough to keep them from showing up on the index page, for brevity. + +

The next section is the crux of the view. We’re emitting for each document a key (doc.created_at) and a value. The key is used for sorting, so that we can pull out all the posts in a particular date range efficiently. + +

+    emit(doc.created_at, {
+      html : doc.html,
+      summary : summary,
+      title : doc.title,
+      author : doc.author
+    });
+
+ +

The value we’ve emitted is a JavaScript object, which copies some fields from the document (but not all), and the summary string we’ve just generated. It’s preferable to avoid emitting entire documents. As a general rule, you want to keep your views as lean as possible. Only emit data you plan to use in your application. In this case we emit the summary (for the index page), the HTML (for the Atom feed), the blog post title, and its author. + +

+  }
+};
+
+ +

You should be able to follow the definition of the previous map function just fine by now. The emit() call creates an entry for each blog post document in our view’s result set. We’ll call the view recent-posts. Our design document looks like this now: + +

+{
+  "_design/sofa",
+  "views": {
+    "recent-posts": {
+      "map": "function(doc) { if (doc.type == "post") { ... code to emit posts ... }"
+    }
+  }
+  "_attachments": {
+    ...
+  }
+}
+
+ +

CouchApp manages aggregating the filesystem files into our JSON design document, so we can edit our view in a file called views/recent-posts/map.js. Once the map function is stored on the design document, our view is ready to be queried for the latest posts. Again, this looks very similar to displaying a single post. The only real difference now is that we get back an array of JSON objects instead of just a single JSON object. + 

The GET request to the URI is: + +

+/blog/_design/sofa/_view/recent-posts
+
+ +

A view defined in the document /database/_design/designdocname in the views field ends up being callable under /database/_design/designdocname/_view/viewname. + +

You can pass in HTTP query arguments to customize your view query. In this case, we pass in: + +

+descending: true, limit: 5
+
+ +

This gets the latest post first and only the first five posts in all. + +

The actual view request URL is: + +

+/blog/_design/sofa/_view/recent-posts?descending=true&limit=5
+
+ +

Rendering the View as HTML Using a List Function

+ +

The _list function was covered in detail in Chapter 5, Design Documents. In our example application, we’ll use a JavaScript list function to render a view of recent blog posts as both XML and HTML formats. CouchDB’s JavaScript view server also ships with the ability to respond appropriately to HTTP content negotiation and Accept headers. + +

The essence of the _list API is a function that is fed one row at a time and sends the response back one chunk at a time. + +
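Before diving into Sofa’s version, here is the bare skeleton of a list function, just to show the moving parts (a minimal sketch, not Sofa’s code):

+function(head, req) {
+  // send() streams a chunk to the client immediately
+  send("<ul>");
+  var row;
+  // getRow() returns the next view row, or null when the rows run out
+  while (row = getRow()) {
+    send("<li>" + row.key + "</li>");
+  }
+  // the return value becomes the final chunk of the response
+  return "</ul>";
+}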

Sofa’s List Function

+ +

Let’s take a look at Sofa’s list function. This is a rather long listing, and it introduces a few new concepts, so we’ll take it slow and be sure to cover everything of interest. + +

+function(head, req) {
+  // !json templates.index
+  // !json blog
+  // !code vendor/couchapp/path.js
+  // !code vendor/couchapp/date.js
+  // !code vendor/couchapp/template.js
+  // !code lib/atom.js
+
+ +

The top of the function declares the arguments head and req. Our function does not use head, just req, which contains information about the request such as the headers sent by the client and a representation of the query string as sent by the client. The first lines of the function are CouchApp macros that pull in code and data from elsewhere in the design document. As we’ve described in more detail in Chapter 11, Managing Design Documents, these macros allow us to work with short, readable functions that pull in library code from elsewhere in the design document. Our list function uses the CouchApp JavaScript helpers for generating URLs (path.js) and for working with date objects (date.js), as well as the template function we’re using to render HTML (template.js). + 

+  var indexPath = listPath('index','recent-posts',{descending:true, limit:5});
+  var feedPath = listPath('index','recent-posts',{descending:true, limit:5, format:"atom"});
+
+ +

The next two lines of the function generate URLs used to link to the index page itself, as well as the XML Atom feed version of it. The listPath function is defined in path.js—the upshot is that it knows how to link to lists generated by the same design document it is run from. + +

The next section of the function is responsible for rendering the HTML output of the blog. Refer to Chapter 8, Show Functions for details about the API we use here. In short, clients can describe the format(s) they prefer in the HTTP Accept header, or in a format query parameter. On the server, we declare which formats we provide, as well as assign each format a priority. In cases where the client accepts multiple formats, the first declared format is returned. It is not uncommon for browsers to accept a wide range of formats, so take care to put HTML at the top of the list, or else you can end up with browsers receiving alternate formats when they expect HTML. + +

+  provides("html", function() {
+
+ +

The provides function takes two arguments: the name of the format (which is keyed to a list of default MIME types) and a function to execute when rendering that format. Note that when using provides, all send and getRow calls must happen within the render function. Now let’s look at how the HTML is actually generated. + +

+    send(template(templates.index.head, {
+      title : blog.title,
+      feedPath : feedPath,
+      newPostPath : showPath("edit"),
+      index : indexPath,
+      assets : assetPath()
+    }));
+
+ +

The first thing we see is a template being run with an object that contains the blog title and a few relative URLs. The template function used by Sofa is fairly simple; it just replaces some parts of the template string with passed-in values. In this case, the template string is stored in the variable templates.index.head, which was imported using a CouchApp macro at the top of the function. The second argument to the template function is the set of values that will be inserted into the template; in this case, title, feedPath, newPostPath, index, and assets. We’ll look at the template itself later in this chapter. For now, it’s sufficient to know that the template stored in templates.index.head renders the topmost portion of the HTML page, which does not change regardless of the contents of our recent posts view. + 

Now that we have rendered the top of the page, it’s time to loop over the blog posts, rendering them one at a time. The first thing we do is declare our variables and our loop: + +

+    var row, key;
+    while (row = getRow()) {
+      var post = row.value;
+      key = row.key;
+
+ +

The row variable is used to store each JSON view row as it is sent to our function. The key variable plays a different role. Because we don’t know ahead of time which of our rows will be the last row to be processed, we keep the key available in its own variable, to be used after all rows are rendered, to generate the link to the next page of results. + +

+      send(template(templates.index.row, {
+        title : post.title,
+        summary : post.summary,
+        date : post.created_at,
+        link : showPath('post', row.id)
+      }));
+    }
+
+ +

Now that we have the row and its key safely stored, we use the template engine again for rendering. This time we use the template stored in templates.index.row, with a data item that includes the blog post title, a URL for its page, the summary of the blog post we generated in our map view, and the date the post was created. + +

Once all the blog posts included in the view result have been listed, we’re ready to close the list and finish rendering the page. The last string does not need to be sent to the client using send(), but it can be returned from the HTML function. Aside from that minor detail, rendering the tail template should be familiar by now. + +

+    return template(templates.index.tail, {
+      assets : assetPath(),
+      older : olderPath(key)
+    });
+  });
+
+ +

Once the tail has been returned, we close the HTML generating function. If we didn’t care to offer an Atom feed of our blog, we’d be done here. But we know most readers are going to be accessing the blog through a feed reader or some kind of syndication, so an Atom feed is crucial. + +

+  provides("atom", function() {
+
+ +

The Atom generation function is defined in just the same way as the HTML generation function—by being passed to provides() with a label describing the format it outputs. The general pattern of the Atom function is the same as the HTML function: output the first section of the feed, then output the feed entries, and finally close the feed. + +

+    // we load the first row to find the most recent change date
+    var row = getRow();
+
+ +

One difference is that for the Atom feed, we need to know when it was last changed. This will normally be the time at which the first item in the feed was changed, so we load the first row before outputting any data to the client (other than HTTP headers, which are set when the provides function picks the format). Now that we have the first row, we can use the date from it to set the Atom feed’s last-updated field. + +

+    // generate the feed header
+    var feedHeader = Atom.header({
+      updated : (row ? new Date(row.value.created_at) : new Date()),
+      title : blog.title,
+      feed_id : makeAbsolute(req, indexPath),
+      feed_link : makeAbsolute(req, feedPath),
+    });
+
+ +

The Atom.header function is defined in lib/atom.js, which was imported by CouchApp at the top of our function. This library uses JavaScript’s E4X extension to generate feed XML. + +

+    // send the header to the client
+    send(feedHeader);
+
+ +

Once the feed header has been generated, sending it to the client uses the familiar send() call. Now that we’re done with the header, we’ll generate each Atom entry, based on a row in the view. We use a slightly different loop format in this case than in the HTML case, as we’ve already loaded the first row in order to use its timestamp in the feed header. + +

+    // loop over all rows
+    if (row) {
+      do {
+
+ +

The JavaScript do/while loop is similar to the while loop used in the HTML function, except that it’s guaranteed to run at least once, as it evaluates the conditional statement after each iteration. This means we can output an entry for the row we’ve already loaded, before calling getRow() to load the next entry. + +

+        // generate the entry for this row
+        var feedEntry = Atom.entry({
+          entry_id : makeAbsolute(req, '/' +
+            encodeURIComponent(req.info.db_name) +
+            '/' + encodeURIComponent(row.id)),
+          title : row.value.title,
+          content : row.value.html,
+          updated : new Date(row.value.created_at),
+          author : row.value.author,
+          alternate : makeAbsolute(req, showPath('post', row.id))
+        });
+        // send the entry to client
+        send(feedEntry);
+
+ +

Rendering the entries also uses the Atom library in atom.js. The big difference between the Atom entries and the list items in HTML is that for our HTML screen we only output the summary of the entry text, but for the Atom entries we output the entire entry. By changing the value of content from row.value.html to row.value.summary, you could change the Atom feed to only include shortened post summaries, forcing subscribers to click through to the actual post to read it. + 

+      } while (row = getRow());
+    }
+
+ +

As we mentioned earlier, this loop construct puts the loop condition at the end of the loop, so here is where we load the next row of the loop. + +

+    // close the loop after all rows are rendered
+    return "</feed>";
+  });
+};
+
+ +

Once all rows have been looped over, we end the feed by returning the closing XML tag to the client as the last chunk of data. + +

The Final Result

+ +

Figure 1, “The rendered index page” shows the final result. + +

+ + + +

Figure 1. The rendered index page + +

+ +

This is our final list of blog posts. That wasn’t too hard, was it? We now have the front page of the blog, we know how to query single documents as well as views, and we know how to pass arguments to views. diff --git a/editions/1/zh/lists/01.png b/editions/1/zh/lists/01.png new file mode 100644 index 0000000..2898bfb Binary files /dev/null and b/editions/1/zh/lists/01.png differ diff --git a/editions/1/zh/mac.html b/editions/1/zh/mac.html new file mode 100644 index 0000000..cae5ff4 --- /dev/null +++ b/editions/1/zh/mac.html @@ -0,0 +1,79 @@ +Installing on Mac OS X + + + + + + + + + + + +

Installing on Mac OS X

+ +

CouchDBX

+ +

The easiest way to get started with CouchDB on Mac OS X is by downloading CouchDBX. This unofficial application doesn’t install anything to your system and can be run with a single double-click. Note, however, that for more serious use, it is recommended that you do a traditional installation with something like Homebrew. + +

Homebrew

+ +

Homebrew is a recent addition to the software management tools on Mac OS X. Its premise is zero configuration, heavy optimizations, and a beer theme. Get Homebrew from http://github.com/mxcl/homebrew. The installation instructions are minimal. Once you are set up, run: + +

+brew install couchdb
+
+ +

in the Terminal and wait until it is done. To start CouchDB, simply run: + +

+couchdb
+
+ +

To see all the startup options available to you, run: + 

+couchdb -h
+
+ +

This tells you how to run CouchDB in the background, among other useful hints. + +

To verify that CouchDB is indeed running, open your browser and visit http://127.0.0.1:5984/_utils/index.html. + +

MacPorts

+ +

MacPorts is the de facto package management tool for Mac OS X. While not an official part of the operating system, it can be used to simplify the process of installing FLOSS software on your machine. Before you can install CouchDB with MacPorts, you need to download and install MacPorts. + +

Make sure your MacPorts installation is up-to-date by running: + +

+sudo port selfupdate
+
+ +

You can install CouchDB with MacPorts by running: + +

+sudo port install couchdb
+
+ +

This command will install all of the necessary dependencies for CouchDB. If a dependency was already installed, MacPorts will not take care of upgrading the dependency to the newest version. To make sure that all of the dependencies are up-to-date, you should also run: + +

+sudo port upgrade couchdb
+
+ +

Mac OS X has a service management framework called launchd that can be used to start, stop, or manage system daemons. You can use this to start CouchDB automatically when the system boots up. If you want to add CouchDB to your launchd configuration, you should run: + +

+sudo launchctl load -w /opt/local/Library/LaunchDaemons/org.apache.couchdb.plist
+
+ +

After running this command, CouchDB should be available at: + +

+ +http://127.0.0.1:5984/_utils/index.html + +
+ +

CouchDB will also be started and stopped along with the operating system. diff --git a/editions/1/zh/managing.html b/editions/1/zh/managing.html new file mode 100644 index 0000000..9dceaac --- /dev/null +++ b/editions/1/zh/managing.html @@ -0,0 +1,338 @@ +Managing Design Documents + + + + + + + + + + + +

Managing Design Documents

+ +

Applications can live in CouchDB—nice. You just attach a bunch of HTML and JavaScript files to a design document and you are good to go. Spice that up with view-powered queries and show functions that render any media type from your JSON documents, and you have all it takes to write self-contained CouchDB applications. + +

Working with the Example Application

+ +
+ +

If you want to install and hack on your own version of Sofa while you read the following chapters, we’ll be using CouchApp to upload the source code as we explore it. + +

We’re particularly excited by the prospect of deploying applications to CouchDB because depending on only a least-common-denominator environment encourages users to control not just the data but also the source code, which will let more people build personal web apps. And when the web app you’ve hacked together in your spare time hits the big time, the ability of CouchDB to scale to larger infrastructure sure doesn’t hurt. + 

+ +

In a CouchDB design document, there are a mix of development languages (HTML, JS, CSS) that go into different places like attachments and design document attributes. Ideally, you want your development environment to help you as much as possible. More important, you’re already used to proper syntax highlighting, validation, integrated documentation, macros, helpers, and whatnot. Editing HTML and JavaScript code as the string attributes of a JSON object is not exactly modern computing. + +

Lucky for you, we’ve been working on a solution. Enter CouchApp. CouchApp lets you develop CouchDB applications in a convenient directory hierarchy—views and shows are separate, neatly organized .js files; your static assets (CSS, images) have their place; and with the simplicity of a couchapp push, you save your app to a design document in CouchDB. Make a change? couchapp push and off you go. + +

This chapter guides you through the installation and moving parts of CouchApp. You will learn what other neat helpers it has in store to make your life easier. Once we have CouchApp, we’ll use it to install and deploy Sofa to a CouchDB database. + +

Installing CouchApp

+ +

The CouchApp Python script and JavaScript framework we’ll be using grew out of the work designing this example application. It’s now in use for a variety of applications, and has a mailing list, wiki, and a community of hackers. Just search the Internet for “couchapp” to find the latest information. Many thanks to Benoît Chesneau for building and maintaining the library (and contributing to CouchDB’s Erlang codebase and many of the Python libraries). + +

CouchApp is easiest to install using the Python easy_install script, which is part of the setuptools package. If you are on a Mac, easy_install should already be available. If easy_install is not installed and you are on a Debian variant, such as Ubuntu, you can use the following command to install it: + +

+sudo apt-get install python-setuptools
+
+ +

Once you have easy_install, installing CouchApp should be as easy as: + +

+sudo easy_install -U couchapp
+
+ +

Hopefully, this works and you are ready to start using CouchApp. If not, read on…. + +

The most common problem people have installing CouchApp is with old versions of dependencies, especially easy_install itself. If you got an installation error, the best next step is to attempt to upgrade setuptools and then upgrade CouchApp by running the following commands: + +

+sudo easy_install -U setuptools
+sudo easy_install -U couchapp
+
+ +

If you have other problems installing CouchApp, have a look at setuptools for Python’s easy install troubleshooting, or visit the CouchApp mailing list. + +

Using CouchApp

+ +

Installing CouchApp via easy_install should, as they say, be easy. Assuming all goes according to plan, it takes care of any dependencies and puts the couchapp utility into your system’s PATH so you can immediately begin by running the help command: + +

+couchapp --help
+
+ +

We’ll be using the clone and push commands. clone pulls an application from a running instance in the cloud, saving it as a directory structure on your filesystem. push deploys a standalone CouchDB application from your filesystem to any CouchDB over which you have administrative control. + +

Download the Sofa Source Code

+ +

There are three ways to get the Sofa source code. They are all equally valid; it’s just a matter of personal preference and how you plan to use the code once you have it. The easiest way is to use CouchApp to clone it from a running instance. If you didn’t install CouchApp in the previous section, you can read the source code (but not install and run it) by downloading and extracting the ZIP or TAR file. If you are interested in hacking on Sofa and would like to join the development community, the best way to get the source code is from the official Git repository. We’ll cover these three methods in turn. First, enjoy Figure 1, “A happy bird to ease any install-induced frustration”. + +

+ + + +

Figure 1. A happy bird to ease any install-induced frustration + +

+ +

CouchApp Clone

+ +

One of the easiest ways to get the Sofa source code is by cloning directly from J. Chris’s blog using CouchApp’s clone command to download Sofa’s design document to a collection of files on your local hard drive. The clone command operates on a design document URL, which can be hosted in any CouchDB database accessible via HTTP. To clone Sofa from the version running on J. Chris’s blog, run the following command: + +

+couchapp clone http://jchrisa.net/drl/_design/sofa
+
+ +

You should see this output: + +

+[INFO] Cloning sofa to ./sofa
+
+ +

Now that you’ve got Sofa on your local filesystem, you can skip to the section called “Deploying Sofa” to make a small local change and push it to your own CouchDB. + +

ZIP and TAR Files

+ +

If you merely want to peruse the source code while reading along with this book, it is available as standard ZIP or TAR downloads. To get the ZIP version, access the following URL from your browser, which will redirect to the latest ZIP file of Sofa: http://github.com/couchapp/couchapp/zipball/master. If you prefer, a TAR file is available as well: http://github.com/couchapp/couchapp/tarball/master. + +

Join the Sofa Development Community on GitHub

+ +

The most up-to-date version of Sofa will always be available at its public code repository. If you are interested in staying up-to-date with development efforts and contributing patches back to the source, the best way to do it is via Git and GitHub. + +

Git is a form of distributed version control that allows groups of developers to track and share changes to software. If you are familiar with Git, you’ll have no trouble using it to work on Sofa. If you’ve never used Git before, it has a bit of a learning curve, so depending on your tolerance for new software, you might want to save learning Git for another day—or you might want to dive in head first! For more information about Git and how to install it, see the official Git home page. For other hints and help using Git, see the GitHub guides. + +

To get Sofa (including all development history) using Git, run the following command: + +

+git clone git://github.com/jchris/sofa.git
+
+ +

Now that you’ve got the source, let’s take a quick tour. + +

The Sofa Source Tree

+ +

Once you’ve succeeded with any of these methods, you’ll have a copy of Sofa on your local disk. The following text is generated by running the tree command on the Sofa directory to reveal the full set of files it contains. Sections of the text are annotated to make it clear how various files and directories correspond to the Sofa design document. + +

+sofa/
+|-- README.md
+|-- THANKS.txt
+
+ +

The source tree contains some files that aren’t necessary for the application—the README and THANKS files are among those. + +

+|-- _attachments
+|   |-- LICENSE.txt
+|   |-- account.html
+|   |-- blog.js
+|   |-- jquery.scrollTo.js
+|   |-- md5.js
+|   |-- screen.css
+|   |-- showdown-licenese.txt
+|   |-- showdown.js
+|   |-- tests.js
+|   `-- textile.js
+
+ +

The _attachments directory contains files that are saved to the Sofa design document as binary attachments. CouchDB serves attachments directly (instead of including them in a JSON wrapper), so this is where we store JavaScript, CSS, and HTML files that the browser will access directly. + +

+ +

Making your first edit to the Sofa source code will show you how easy it is to modify the application. + +

+ +
+|-- blog.json
+
+ +

The blog.json file contains JSON used to configure individual installations of Sofa. Currently, it sets one value, the title of the blog. You should open this file now and personalize the title field—you probably don’t want to name your blog “Daytime Running Lights,” so now’s your chance to come up with something more fun! + +

You could add other blog configurations to this file—maybe things like how many posts to show per page and a URL for an About page for the author. Working changes like these into the application will be easy once you’ve walked through later chapters. + +

+|-- couchapp.json
+
+ +

We’ll see later that couchapp outputs a link to Sofa’s home page when couchapp push is run. The way this works is pretty simple: CouchApp looks for a JSON field on the design document at the address design_doc.couchapp.index. If it finds it, it appends the value to the location of the design document itself to build the URL. If there is no CouchApp index specified, but the design document has an attachment called index.html, then it is considered the index page. In Sofa’s case, we use the index value to point to a list of the most recent posts. + +
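Based on the index URL that couchapp push prints later in this chapter, Sofa’s couchapp.json contains something along these lines (shown as an illustration rather than a verbatim copy):

+{
+  "index": "_list/index/recent-posts?descending=true&limit=5"
+}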

+|-- helpers
+|   `-- md5.js
+
+ +

The helpers directory here is just an arbitrary choice—CouchApp will push any files and folders to the design document. In this case, the source code to md5.js is JSON-encoded and stored on the design_document.helpers.md5 element. + +

+|-- lists
+|   `-- index.js
+
+ +

The lists directory contains a JavaScript function that will be executed by CouchDB to render view rows as Sofa’s HTML and Atom indexes. You could add new list functions by creating new files within this directory. Lists are covered in depth in Chapter 14, Viewing Lists of Blog Posts. + +

+|-- shows
+|   |-- edit.js
+|   `-- post.js
+
+ +

The shows directory holds the functions CouchDB uses to generate HTML views of blog posts. There are two views: one for reading posts and the other for editing. We’ll look at these functions in the next few chapters. + +

+|-- templates
+|   |-- edit.html
+|   |-- index
+|   |   |-- head.html
+|   |   |-- row.html
+|   |   `-- tail.html
+|   `-- post.html
+
+ +

The templates directory is like the helpers directory and unlike the lists, shows, or views directories in that the code stored is not directly executed on CouchDB’s server side. Instead, the templates are included into the body of the list and show functions using macros run by CouchApp when pushing code to the server. These CouchApp macros are covered in Chapter 12, Storing Documents. The key point is that the templates name could be anything. It is not a special member of the design document; just a convenient place to store and edit our template files. + +

+|-- validate_doc_update.js
+
+ +

This file corresponds to the JavaScript validation function used by Sofa to ensure that only the blog owner can create new posts, as well as to ensure that the comments are well formed. Sofa’s validation function is covered in detail in Chapter 12, Storing Documents. + +

+|-- vendor
+|   `-- couchapp
+|       |-- README.md
+|       |-- _attachments
+|       |   `-- jquery.couchapp.js
+|       |-- couchapp.js
+|       |-- date.js
+|       |-- path.js
+|       `-- template.js
+
+ +

The vendor directory holds code that is managed independently of the Sofa application itself. In Sofa’s case, the only vendor package used is couchapp, which contains JavaScript code that knows how to do things like link between list and show URLs and render templates. + +

During couchapp push, files within a vendor/**/_attachments/* path are pushed as design document attachments. In this case, jquery.couchapp.js will be pushed to an attachment called couchapp/jquery.couchapp.js (so that multiple vendor packages can have the same attachment names without worry of collisions). + +

+`-- views
+    |-- comments
+    |   |-- map.js
+    |   `-- reduce.js
+    |-- recent-posts
+    |   `-- map.js
+    `-- tags
+        |-- map.js
+        `-- reduce.js
+
+ +

The views directory holds MapReduce view definitions, with each view represented as a directory, holding files corresponding to map and reduce functions. + +

Deploying Sofa

+ +

The source code is safely on your hard drive, and you’ve even been able to make minor edits to the blog.json file. Now it’s time to deploy the blog to a local CouchDB. The push command is simple and should work the first time, but two other steps are involved in setting up an admin account on your CouchDB and for your CouchApp deployments. By the end of this chapter you’ll have your own running copy of Sofa. + +

Pushing Sofa to Your CouchDB

+ +

Any time you make edits to the on-disk version of Sofa and want to see them in your browser, run the following command: + +

+couchapp push . sofa
+
+ +

This deploys the Sofa source code into CouchDB. You should see output like this: + +

+[INFO] Pushing CouchApp in /Users/jchris/sofa to design doc:
+http://127.0.0.1:5984/sofa/_design/sofa
+[INFO] Visit your CouchApp here:
+http://127.0.0.1:5984/sofa/_design/sofa/_list/index/recent-posts?descending=
+true&limit=5
+
+ +

If you get an error, make sure your target CouchDB instance is running by making a simple HTTP request to it: + +

+curl http://127.0.0.1:5984
+
+ +

The response should look like: + +

+{"couchdb":"Welcome","version":"0.10.1"}
+
+ +

If CouchDB is not running yet, go back to Chapter 3, Getting Started and follow the “Hello World” instructions there. + +

Visit the Application

+ +

If CouchDB was running, then couchapp push should have directed you to visit the application’s index URL. Visiting the URL should show you something like Figure 2, “Empty index page”. + +

+ + + +

Figure 2. Empty index page + +

+ +

We’re not done yet—there are a couple of steps remaining before you’ve got a fully functional Sofa instance. + +

Set Up Your Admin Account

+ +

Sofa is a single-user application. You, the author, are the administrator and the only one who can add and edit posts. To make sure no one else goes in and messes with your writing, you must create an administrator account in CouchDB. This is a straightforward task. Find your local.ini file and open it in your text editor. (By default, it’s stored at /usr/local/etc/couchdb/local.ini.) If you haven’t already, uncomment the [admins] section at the end of the file. Next, add a line right below the [admins] section with your preferred username and password: + +

+[admins]
+jchris = secretpass
+
+ +

Now that you’ve edited your local.ini configuration file, you need to restart CouchDB for changes to take effect. Depending on how you started CouchDB, there are different methods of restarting it. If you started in a console, then hitting Ctrl-C and rerunning the same command you used to start it is the simplest way. + +

If you don’t like your passwords lying around in plain-text files, don’t worry. When CouchDB starts up and reads this file, it takes your password and changes it to a secure hash, like this: + +

+[admins]
+jchris = -hashed-207b1b4f8434dc604206c2c0c2aa3aae61568d6c,96406178007181395cb72cb4e8f2e66e
+
+ +

CouchDB will now ask you for your credentials when you try to create databases or change documents—exactly the things you want to keep to yourself. + +

Deploying to a Secure CouchDB

+ +

Now that we’ve set up admin credentials, we’ll need to supply them on the command line when running couchapp push. Let’s try it: + +

+couchapp push . http://jchris:secretpass@localhost:5984/sofa
+
+ +

Make sure to replace jchris and secretpass with your actual values or you will get a “permission denied” error. If all works according to plan, everything will be set up in CouchDB and you should be able to start using your blog. + +

At this point, we are technically ready to move on, but you’ll be much happier if you make use of the .couchapprc file as documented in the next section. + +

Configuring CouchApp with .couchapprc

+ +

If you don’t want to have to put the full URL (potentially including authentication parameters) of your database onto the command line each time you push, you can use the .couchapprc file to store deployment settings. The contents of this file are not pushed along with the rest of the app, so it can be a safe place to keep credentials for uploading your app to secure servers. + +

The .couchapprc file lives in the source directory of your application, so you should look to see if it is at /path/to/the/directory/of/sofa/.couchapprc (or create it there if it is missing). Dot files (files with names that start with a period) are left out of most directory listings. Use whatever tricks your OS has to “show hidden files.” The simplest one in a standard command shell is to list the directory using ls -a, which will show all hidden files as well as normal files. Here is an example .couchapprc defining a few deployment targets: + +

+    {
+      "env": {
+        "default": {
+          "db": "http://jchris:secretpass@localhost:5984/sofa"
+        },
+        "staging": {
+          "db": "http://jchris:secretpass@jchrisa.net:5984/sofa-staging"
+        },
+        "drl": {
+          "db": "http://jchris:secretpass@jchrisa.net/drl"
+        }
+      }
+    }
+
+ +

With this file set up, you can push your CouchApp with the command couchapp push, which will push the application to the “default” database. CouchApp also supports alternate environments. To push your application to a development database, you could use couchapp push dev. In our experience, taking the time to set up a good .couchapprc is always worth it. Another benefit is that it keeps your passwords off the screen when you are working. diff --git a/editions/1/zh/managing/01.png b/editions/1/zh/managing/01.png new file mode 100644 index 0000000..2aaf319 Binary files /dev/null and b/editions/1/zh/managing/01.png differ diff --git a/editions/1/zh/managing/02.png b/editions/1/zh/managing/02.png new file mode 100644 index 0000000..d04cc03 Binary files /dev/null and b/editions/1/zh/managing/02.png differ diff --git a/editions/1/zh/notifications.html b/editions/1/zh/notifications.html new file mode 100644 index 0000000..b144169 --- /dev/null +++ b/editions/1/zh/notifications.html @@ -0,0 +1,237 @@ +Change Notifications + + + + + + + + + + + +

Change Notifications

+ +

Say you are building a message service with CouchDB. Each user has an inbox database and other users send messages by dropping them into the inbox database. When users want to read all messages received, they can just open their inbox databases and see all messages. + +

So far, so simple, but now you’ve got your users hitting the Refresh button all the time once they’ve looked at their messages to see if there are new messages. This is commonly referred to as polling. A lot of users are generating a lot of requests that, most of the time, don’t show anything new, just the list of all the messages they already know about. + +

Wouldn’t it be nice to ask CouchDB to give you notice when a new message arrives? The _changes database API does just that. + +

The scenario just described can be seen as the cache invalidation problem; that is, when do I know that what I am displaying right now is no longer an apt representation of the underlying data store? Any sort of cache invalidation, not only backend/frontend-related, can be built using _changes. + +

_changes is also designed and suited to extract an activity stream from a database, whether for simple display or, equally important, to act on a new document (or a document change) when it occurs. + +

The beauty of systems that use the changes API is that they are decoupled. A program that is interested only in latest updates doesn’t need to know about programs that create new documents and vice versa. + +

Here’s what a changes item looks like: + +

+{"seq":12,"id":"foo","changes":[{"rev":"1-23202479633c2b380f79507a776743d5"}]}
+
+ +

There are three fields: + +

+ +
seq
+ +
The update_seq value that the database assigned when the document with this id was created or changed.
+ +
id
+ +
The document ID.
+ +
changes
+ +
An array of fields, which by default includes the document’s revision ID, but can also include information about document conflicts and other things.
+ +
+ +

The changes API is available for each database. You can get changes that happen in a single database per request. But you can easily send multiple requests to multiple databases’ changes API if you need that. + +

Let’s create a database that we can use as an example later in this chapter: + +

+> HOST="http://127.0.0.1:5984"
+> curl -X PUT $HOST/db
+{"ok":true}
+
+ +

There are three ways to request notifications: polling (the default), long polling and continuous. Each is useful in a different scenario, and we’ll discuss all of them in detail. + +

Polling for Changes

+ +

In the previous example, we tried to avoid the polling method, but it is very simple and in some cases the only one suitable for a problem. Because it is the simplest case, it is the default for the changes API. + +

Let’s see what the changes for our test database look like. First, the request (we’re using curl again): + +

+curl -X GET $HOST/db/_changes
+
+ +

The result is simple: + +

+{"results":[
+
+],
+"last_seq":0}
+
+ +

There’s nothing there because we didn’t put anything in yet—no surprise. But you can guess where we’d see results—when they start to come in. Let’s create a document: + +

+curl -X PUT $HOST/db/test -d '{"name":"Anna"}'
+
+ +

CouchDB replies: + +

+{"ok":true,"id":"test","rev":"1-aaa8e2a031bca334f50b48b6682fb486"}
+
+ +

Now let’s run the changes request again: + +

+{"results":[
+{"seq":1,"id":"test","changes":[{"rev":"1-aaa8e2a031bca334f50b48b6682fb486"}]}
+],
+"last_seq":1}
+
+ +

We get a notification about our new document. This is pretty neat! But wait—when we created the document and got information like the revision ID, why would we want to make a request to the changes API to get it again? Remember that the purpose of the changes API is to allow you to build decoupled systems. The program that creates the document is very likely not the same program that requests changes for the database, since it already knows what it put in there (although this is blurry, the same program could be interested in changes made by others). + +

Behind the scenes, we created another document. Let’s see what the changes for the database look like now: + +

+{"results":[
+{"seq":1,"id":"test","changes":[{"rev":"1-aaa8e2a031bca334f50b48b6682fb486"}]},
+{"seq":2,"id":"test2","changes":[{"rev":"1-e18422e6a82d0f2157d74b5dcf457997"}]}
+],
+"last_seq":2}
+
+ +

See how we get a new line in the result that represents the new document? In addition, the first document we put in there got listed again. The default result for the changes API is the history of all changes that the database has seen. + +

We’ve already seen the change for "seq":1, and we’re no longer really interested in it. We can tell the changes API about that by using the since=1 query parameter: + +

+curl -X GET $HOST/db/_changes?since=1
+
+ +

This returns all changes after the seq specified by since: + +

+{"results":[
+{"seq":2,"id":"test2","changes":[{"rev":"1-e18422e6a82d0f2157d74b5dcf457997"}]}
+],
+"last_seq":2}
+
+ +

While we’re discussing options, use style=all_docs to get more revision and conflict information in the changes array for each result row. If you want to specify the default explicitly, the value is main_only. + +
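For example, a request for the full change history with extra revision and conflict information would look like this (whether additional revisions actually show up depends on your documents having conflicts):

+curl -X GET "$HOST/db/_changes?style=all_docs"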

Long Polling

+ +

The technique of long polling was invented for web browsers to remove one of the problems with the regular polling approach: it doesn’t issue any requests while nothing has changed. Long polling works like this: when you make a request to the long polling API, you open an HTTP connection to CouchDB, and both you and CouchDB keep that connection open until a new row appears in the changes result. As soon as a result appears, it is sent and the connection is closed. + +

This works well for low-frequency updates. If a lot of changes occur for a client, you find yourself opening many new requests, and the usefulness of this approach over regular polling declines. Another general consequence of this technique is that for each client requesting a long polling change notification, CouchDB will have to keep an HTTP connection open. CouchDB is well capable of doing so, as it is designed to handle many concurrent requests. But you need to make sure your operating system allows CouchDB to use at least as many sockets as you have long polling clients (and a few spare for regular requests, of course). + +

To make a long polling request, add the feed=longpoll query parameter. For this listing, we added timestamps to show you when things happen. + +

+00:00: > curl -X GET "$HOST/db/_changes?feed=longpoll&since=2"
+00:00: {"results":[
+00:10: {"seq":3,"id":"test3","changes":[{"rev":"1-02c6b758b08360abefc383d74ed5973d"}]}
+00:10: ],
+00:10: "last_seq":3}
+
+ +

At 00:10, we create another document behind your back again, and CouchDB promptly sends us the change. Note that we used since=2 to avoid getting any of the previous notifications. Also note that we have to use double quotes for the curl command because we are using an ampersand, which is a special character for our shell. + +

The style option works for long polling requests just like for regular polling requests. + +

Networks are a tricky beast, and sometimes you don’t know whether there are no changes coming or your network connection went stale. If you add another query parameter, heartbeat=N, where N is a number, CouchDB will send you a newline character every N milliseconds. As long as you are receiving newline characters, you know there are no new change notifications, but CouchDB is still ready to send you the next one when it occurs. + +
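As a sketch, a long polling request that also asks for a heartbeat every 10 seconds (10,000 is just an illustrative value) looks like this:

+curl -X GET "$HOST/db/_changes?feed=longpoll&since=3&heartbeat=10000"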

Continuous Changes

+ +

Long polling is great, but you still end up opening an HTTP request for each change notification. For web browsers, this is the only way to avoid the problems of regular polling. But web browsers are not the only client software that can be used to talk to CouchDB. If you are using Python, Ruby, Java, or any other language really, you have yet another option. + +

The continuous changes API allows you to receive change notifications as they come in using a single HTTP connection. You make a request to the continuous changes API and both you and CouchDB will hold the connection open “forever.” CouchDB will send you a new line for each notification as it occurs and—as opposed to long polling—will keep the HTTP connection open, waiting to send the next notification. + +

This is great for both infrequent and frequent notifications, and it has the same consequence as long polling: you’re going to have a lot of long-living HTTP connections. But again, CouchDB easily supports these. + +

Use the feed=continuous parameter to make a continuous changes API request. Following is the result, again with timestamps. At 00:10 and 00:15, we’ll create a new document each: + +

+00:00: > curl -X GET "$HOST/db/_changes?feed=continuous&since=3"
+00:10: {"seq":4,"id":"test4","changes":[{"rev":"1-02c6b758b08360abefc383d74ed5973d"}]}
+00:15: {"seq":5,"id":"test5","changes":[{"rev":"1-02c6b758b08360abefc383d74ed5973d"}]}
+
+ +

Note that the continuous changes API result doesn’t include a wrapping JSON object with a results member with the individual notification results as array items; it includes only a raw line per notification. Also note that the lines are no longer separated by a comma. Whereas the regular and long polling APIs result is a full valid JSON object when the HTTP request returns, the continuous changes API sends individual rows as valid JSON objects. The difference makes it easier for clients to parse the respective results. The style and heartbeat parameters work as expected with the continuous changes API. + +

Filters

+ +

The change notification API and its three modes of operation already give you a lot of options for requesting and processing changes in CouchDB. Filters for changes give you an additional level of flexibility. Let’s say the messages from our first scenario have priorities, and a user is interested only in notifications about messages with a high priority. + +

Enter filters. Similar to view functions, a filter is a JavaScript function that gets stored in a design document and is later executed by CouchDB. Filters live in the design document’s special filters member, each under a name of your choice. Here is an example: + +

+{
+  "_id": "_design/app",
+  "_rev": "1-b20db05077a51944afd11dcb3a6f18f1",
+  "filters": {
+    "important": "function(doc, req) { if(doc.priority == 'high') { return true; }
+    else { return false; }}"
+  }
+}
+
+ +

To query the changes API with this filter, use the filter=designdocname/filtername query parameter: + +

+curl "$HOST/db/_changes?filter=app/important"
+
+ +

The result now includes only rows for document updates for which the filter function returns true—in our case, where the priority property of our document has the value high. This is pretty neat, but CouchDB takes it up another notch. + +

Let’s take the initial example application where users can send messages to each other. Instead of having a database per user that acts as the inbox, we now use a single database as the inbox for all users. How can a user register for changes that represent a new message being put in her inbox? + +

We can make the filter function using a request parameter: + +

+function(doc, req)
+{
+  if(doc.name == req.query.name) {
+    return true;
+  }
+
+  return false;
+}
+
+ +

If you now run a request adding a ?name=Steve parameter, the filter function will only return result rows for documents that have the name field set to “Steve.” If you are running a request for a different user, just change the request parameter (name=Joe). + +
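For illustration, assuming this function is stored in the same _design/app document under the hypothetical name inbox, Steve’s request would look like this:

+curl "$HOST/db/_changes?filter=app/inbox&name=Steve"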

Now, adding a query parameter to a filtered changes request is easy. What would hinder Steve from passing in name=Joe as the parameter and seeing Joe’s inbox? Not much. Can CouchDB help with this? We wouldn’t bring this up if it couldn’t, would we? + +

The req parameter of the filter function includes a member userCtx, the user context. This includes information about the user who has already been authenticated over HTTP earlier in the request. Specifically, req.userCtx.name includes the username of the user who makes the filtered changes request. We can be sure that the user is who he says he is because he has been authenticated against one of the authentication schemes in CouchDB. With this, we don’t even need the dynamic filter parameter (although it can still be useful in other situations). + +

If you have configured CouchDB to use authentication for requests, a user will have to make an authenticated request and the result is available in our filter function: + +

+function(doc, req)
+{
+  if(doc.name) {
+    if(doc.name == req.userCtx.name) {
+      return true;
+    }
+  }
+
+  return false;
+}
+
+ +
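Here is a sketch of the corresponding request, assuming the function above is stored as a filter named mine in _design/app and that a hypothetical user steve authenticates with HTTP basic authentication:

+curl -u steve:secret "$HOST/db/_changes?filter=app/mine"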

Wrapping Up

+ +

The changes API lets you build sophisticated notification schemes useful in many scenarios with isolated and asynchronous components yet working to the same beat. In combination with replication, this API is the foundation for building distributed, highly available, and high-performance CouchDB clusters. diff --git a/editions/1/zh/performance.html b/editions/1/zh/performance.html new file mode 100644 index 0000000..9372fdc --- /dev/null +++ b/editions/1/zh/performance.html @@ -0,0 +1,221 @@ +High Performance + + + + + + + + + + + +

High Performance

+ +

This chapter will teach you the fastest ways to insert and query data with CouchDB. It will also explain why there is a wide range of performance across various techniques. + +

The take-home message: bulk operations result in lower overhead, higher throughput, and more space efficiency. If you can’t work in bulk in your application, we’ll also describe other options to get throughput and space benefits. Finally, we describe interfacing directly with CouchDB from Erlang, which can be a useful technique if you want to integrate CouchDB storage with a server for non-HTTP protocols, like SMTP (email) or XMPP (chat). + +

Good Benchmarks Are Non-Trivial

+ +

Each application is different. Performance requirements are not always obvious. Different use cases need to tune different parameters. A classic trade-off is latency versus throughput. Concurrency is another factor. Many database platforms behave very differently with 100 clients than they do with 1,000 or more concurrent clients. Some data profiles require serialized operations, which increase total time (latency) for the client, and load on the server. We think simpler data and access patterns can make a big difference in the cacheability and scalability of your app, but we’ll get to that later. + +

The upshot: real benchmarks require real-world load. Simulating load is hard. Erlang tends to perform better under load (especially on multiple cores), so we’ve often seen test rigs that can’t drive CouchDB hard enough to see where it falls over. + +

Let’s take a look at what a typical web app looks like. This is not exactly how Craigslist works (because we don’t know how Craigslist works), but it is a close enough approximation to illustrate problems with benchmarking. + +

You have a web server, some middleware, and a database. A user request comes in, and the web server takes care of the networking and parses the HTTP request. The request gets handed to the middleware layer, which figures out what to run, then it runs whatever is needed to serve the request. The middleware might talk to your database and other external resources like files or remote web services. The request bounces back to the web server, which sends out any resulting HTML. The HTML includes references to other resources living on your web server (like CSS, JS, or image files), and the process starts anew for every resource. A little different each time, but in general, all requests are similar. And along the way there are caches to store intermediate results to avoid expensive recomputation. + +

That’s a lot of moving parts. Getting a top-to-bottom profile of all components to figure out where bottlenecks lie is pretty complex (but nice to have). We start making up numbers now. The absolute values are not important; only numbers relative to each other are. Say a request takes 1.5 seconds (1,500 ms) to be fully rendered in a browser. + +

In a simple case like Craigslist, there is the initial HTML, a CSS file, a JS file, and the favicon. Except for the HTML, these are all static resources and involve reading some data from a disk (or from memory) and serving it to the browser that then renders it. The most notable things to do for performance are keeping data small (GZIP compression, high JPG compression) and avoiding requests altogether (HTTP-level caching in the browser). Making the web server any faster doesn’t buy us much (yeah, hand-wavy, but we don’t want to focus on static resources here). Let’s say all static resources take 500 ms to serve and render. + +

+ +

Read all about improving client experience with proper use of HTTP from Steve Souders, web performance guru. His YSlow tool is indispensable for tuning a website. + +

+ +

That leaves us with 1,000 ms for the initial HTML. We’ll chop off 200 ms for network latency (see Chapter 1, Why CouchDB?). Let’s pretend HTTP parsing, middleware routing and execution, and database access share equally the rest of the time, 200 ms each. + +

If you now set out to improve one part of the big puzzle that is your web app and gain 10 ms in the database access time, this is probably time not well spent (unless you have the numbers to prove it). + +

However, breaking down a single request like this and looking for how much time is spent in each component is also misleading. Even if only a small percentage of the time is spent in your database under normal load, that doesn’t teach you what will happen during traffic spikes. If all requests are hitting the same database, then any locking there could block many web requests. Your database may have minimal impact on total query time, under normal load, but under spike load it may turn into a bottleneck, magnifying the effect of the spike on the application servers. CouchDB can minimize this by dedicating an Erlang process to each connection, ensuring that all clients are handled, even if latency goes up a bit. + +

High Performance CouchDB

+ +

Now that you see database performance is only a small part of overall web performance, we’ll give you some tips to squeeze the most out of CouchDB. + +

CouchDB is designed from the ground up to service highly concurrent use cases, which make up the majority of web application load. However, sometimes we need to import a large batch of data into CouchDB or initiate transforms across an entire database. Or maybe we’re building a custom Erlang application that needs to link into CouchDB at a lower level than HTTP. + +

Hardware

+ +

Invariably people will want to know what type of disk they should use, how much RAM, what sort of CPU, etc. The real answer is that CouchDB is flexible enough to run on everything from a smart phone to a cluster, so the answers will vary. + +

More RAM is better because CouchDB makes heavy use of the filesystem cache. CPU cores are more important for building views than serving documents. Solid State Drives (SSDs) are pretty sweet because they can append to a file while loading old blocks, with a minimum of overhead. As they get faster and cheaper, they’ll be really handy for CouchDB. + +

An Implementation Note

+ +

We’re not going to rehash append-only B-trees here, but understanding CouchDB’s data format is key to gaining an intuition about which strategies yield the best performance. Each time an update is made, CouchDB loads from disk the B-tree nodes that point to the updated documents or the key range where a new document’s _id would be found. + +

This loading will normally come from the filesystem cache, except when updates are made to documents in regions of the tree that have not been touched in a long time. In those cases, the disk has to seek, which can block writing and have other ripple effects. Preventing these disk seeks is the name of the game in CouchDB performance. + +

We’ll use some numbers in this chapter that come from a JavaScript test suite. It’s not the most accurate, but the strategy it uses (counting the number of documents that can be saved in 10 seconds) makes up for the JavaScript overhead. The hardware the benchmarks were run on is modest: just an old white MacBook Intel Core 2 Duo (remember those?). + +

You can run the benchmarks yourself by changing to the bench/ directory of CouchDB’s trunk and running ./runner.sh while CouchDB is running on port 5984. + +
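In other words, something like this, assuming your shell is sitting in the root of a CouchDB trunk checkout and CouchDB is already listening on port 5984:

+cd bench/
+./runner.sh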

Bulk Inserts and Mostly Monotonic DocIDs

+ +

Bulk inserts are the best way to have seekless writes. Random IDs force seeking after the file is bigger than can be cached. Random IDs also make for a bigger file because in a large database you’ll rarely have multiple documents in one B-tree leaf. + +

Optimized Examples: Views and Replication

+ +

If you’re curious what a good performance profile is for CouchDB, look at how views and replication are done. Triggered replication applies updates to the database in large batches to minimize disk chatter. Currently the 0.11.0 development trunk boasts an additional 3–5x speed increase over 0.10’s view generation. + +

Views load a batch of updates from disk, pass them through the view engine, and then write the view rows out. Each batch is a few hundred documents, so the writer can take advantage of the bulk efficiencies we see in the next section. + +

Bulk Document Inserts

+ +

The fastest mode for importing data into CouchDB via HTTP is the _bulk_docs endpoint. The bulk documents API accepts a collection of documents in a single POST request and stores them all to CouchDB in a single index operation. + +

Bulk docs is the API to use when you are importing a corpus of data using a scripting language. It can be 10 to 100 times faster than individual document updates and is just as easy to work with from most languages. + +
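As a minimal sketch, here is a bulk insert of two tiny documents into a hypothetical db database; the wrapping docs array is what the endpoint expects:

+curl -X POST http://127.0.0.1:5984/db/_bulk_docs -H "Content-Type: application/json" -d '{"docs":[{"foo":"bar"},{"foo":"baz"}]}'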

The main factor that influences performance of bulk operations is the size of the update, both in terms of total data transferred as well as the number of documents included in an update. + +

Here are sequential bulk document inserts at four different granularities, from an array of 100 documents, up through 1,000, 5,000, and 10,000: + +

+bulk_doc_100
+4400 docs
+437.37574552683895 docs/sec
+
+ +
+bulk_doc_1000
+17000 docs
+1635.4016354016355 docs/sec
+
+ +
+bulk_doc_5000
+30000 docs
+2508.1514923501377 docs/sec
+
+ +
+bulk_doc_10000
+30000 docs
+2699.541078016737 docs/sec
+
+ +

You can see that larger batches yield better performance, with an upper limit in this test of 2,700 documents/second. With larger documents, we might see that smaller batches are more useful. For reference, all the documents look like this: {"foo":"bar"} + +

Although 2,700 documents per second is fine, we want more power! Next up, we’ll explore running bulk documents in parallel. + +

With a different script (using bash and cURL with benchbulk.sh in the same directory), we’re inserting large batches of documents in parallel to CouchDB. With batches of 1,000 docs, 10 at any given time, averaged over 10 rounds, I see about 3,650 documents per second on a MacBook Pro. Benchbulk also uses sequential IDs. + +

We see that with proper use of bulk documents and sequential IDs, we can insert more than 3,000 docs per second just using scripting languages. + +

Batch Mode

+ +

To avoid the indexing and disk sync overhead associated with individual document writes, there is an option that allows CouchDB to build up batches of documents in memory, flushing them to disk when a certain threshold has been reached or when triggered by the user. The batch option does not give the same data integrity guarantees that normal updates provide, so it should only be used when the potential loss of recent updates is acceptable. + +

Because batch mode only stores updates in memory until a flush occurs, updates that are saved to CouchDB directly preceding a crash can be lost. By default, CouchDB flushes the in-memory updates once per second, so in the worst case, data loss is still minimal. To reflect the reduced integrity guarantees when batch=ok is used, the HTTP response code is 202 Accepted, as opposed to 201 Created. + +
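For instance, a batched write is just a normal document write with the batch=ok query parameter added (the db database, document ID, and body here are only illustrative); CouchDB acknowledges it with 202 Accepted:

+curl -X PUT "http://127.0.0.1:5984/db/log-0001?batch=ok" -H "Content-Type: application/json" -d '{"event":"login","user":"jchris"}'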

The ideal use for batch mode is for logging type applications, where you have many distributed writers each storing discrete events to CouchDB. In a normal logging scenario, losing a few updates on rare occasions is worth the trade-off for increased storage throughput. + +

There is a pattern for reliable storage using batch mode. It’s the same pattern as is used when data needs to be stored reliably to multiple nodes before acknowledging success to the saving client. In a nutshell, the application server (or remote client) saves to Couch A using batch=ok, and then watches update notifications from Couch B, only considering the save successful when Couch B’s _changes stream includes the relevant update. We covered this pattern in detail in Chapter 16, Replication. + +

+batch_ok_doc_insert
+4851 docs
+485.00299940011996 docs/sec
+
+ +

This JavaScript benchmark only gets around 500 documents per second, six times slower than the bulk document API. However, it has the advantage that clients don’t need to build up bulk batches. + +

Single Document Inserts

+ +

Normal web app load for CouchDB comes in the form of single document inserts. Because each insert comes from a distinct client, and has the overhead of an entire HTTP request and response, it generally has the lowest throughput for writes. + +

Probably the slowest possible use case for CouchDB is the case of a writer that has to make many serialized writes against the database. Imagine a case where each write depends on the result of the previous write so that only one writer can run. This sounds like a bad case from the description alone. If you find yourself in this position, there are probably other problems to address as well. + +

We can write about 258 documents per second with a single writer in serial (pretty much the worst-case scenario writer). + +

+single_doc_insert
+2584 docs
+257.9357157117189 docs/sec
+
+ +

Delayed commit (along with sequential UUIDs) is probably the most important CouchDB configuration setting for performance. When it is set to true (the default), CouchDB allows operations to be run against the disk without an explicit fsync after each operation. Fsync operations take time (the disk may have to seek, on some platforms the hard disk cache buffer is flushed, etc.), so requiring an fsync for each update deeply limits CouchDB’s performance for non-bulk writers. + +

Delayed commit should be left set to true in the configuration settings, unless you are in an environment where you absolutely need to know when updates have been received (such as when CouchDB is running as part of a larger transaction). It is also possible to trigger an fsync (e.g., after a few operations) using the _ensure_full_commit API. + +
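As a sketch, an explicit flush is a simple POST, and the setting itself (delayed_commits in the [couchdb] section) can be toggled at runtime through the configuration API, here using the admin credentials set up earlier in the book:

+# ask CouchDB to fsync everything that is currently only held in memory
+curl -X POST http://127.0.0.1:5984/db/_ensure_full_commit -H "Content-Type: application/json"
+# turn delayed commit off for the whole server
+curl -X PUT http://jchris:secretpass@127.0.0.1:5984/_config/couchdb/delayed_commits -d '"false"'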

When delayed commit is disabled, CouchDB writes data to the actual disk before it responds to the client (except in batch=ok mode). It’s a simpler code path, so it has less overhead when running at high throughput levels. However, for individual clients, it can seem slow. Here’s the same benchmark in full commit mode: + +

+single_doc_insert
+46 docs
+4.583042741855135 docs/sec
+
+ +

Look at how slow single_doc_insert is with full-commit enabled—four or five documents per second! That’s 100% a result of the fact that Mac OS X has a real fsync, so be thankful! Don’t worry; the full commit story gets better as we move into bulk operations. + +

On the other hand, we’re getting better times for large bulks with delayed commit off, which lets us know that tuning for your application will always bring better results than following a cookbook. + +

Hovercraft

+ +

Hovercraft is a library for accessing CouchDB from within Erlang. Hovercraft benchmarks should show the fastest possible performance of CouchDB’s disk and index subsystems, as it avoids all HTTP connection and JSON conversion overhead. + +

Hovercraft is useful primarily when the HTTP interface doesn’t allow for enough control, or is otherwise redundant. For instance, persisting Jabber instant messages to CouchDB might use ejabberd and Hovercraft. The easiest way to create a failure-tolerant message queue is probably a combination of RabbitMQ and Hovercraft. + +

Hovercraft was extracted from a client project that used CouchDB to store massive amounts of email as document attachments. HTTP doesn’t have an easy mechanism to allow a combination of bulk updates with binary attachments, so we used Hovercraft to connect an Erlang SMTP server directly to CouchDB, to stream attachments directly to disk while maintaining the efficiency of bulk index updates. + +

Hovercraft includes a basic benchmarking feature, and we see that we can get many documents per second. + +

+> hovercraft:lightning().
+Inserted 100000 docs in 9.37 seconds with batch size of 1000.
+(10672 docs/sec)
+
+ +

Trade-Offs

+ +

Tool X might give you 5 ms response times, an order of magnitude faster than anything else on the market. Programming is all about trade-offs, and everybody is bound by the same laws. + +

On the outside, it might appear that everybody who is not using Tool X is a fool. But speed and latency are only part of the picture. We already established that going from 5 ms to 50 ms might not even be noticeable by anyone using your product. Speed may come at the expense of other things, such as: + +

+ +
Memory
+ +
Instead of doing computations over and over, Tool X might have a cute caching layer that saves recomputation by storing results in memory. If you are CPU bound, that might be good; if you are memory bound, it might not. A trade-off.
+ +
Concurrency
+ +
The clever data structures in Tool X are extremely fast when only one request at a time is processed, and because it is so fast most of the time, it appears as if it would process multiple requests in parallel. Eventually, though, a high number of concurrent requests fill up the request queue and response time suffers. A variation on this is that Tool X might work exceptionally well on a single CPU or core, but not on many, leaving your beefy servers idling.
+ +
Reliability
+ +
Making sure data is actually stored is an expensive operation. Making sure a data store is in a consistent state and not corrupted is another. There are two trade-offs here. First, buffers store data in memory before committing it to disk to ensure a higher data throughput. In the event of a power loss or crash (of hard- or software), the data is gone. This may or may not be acceptable for your application. Second, a consistency check is required to run after a failure, and if you have a lot of data, this can take days. If you can afford to be offline, that’s OK, but maybe you can’t afford it.
+ +
+ +

Make sure to understand what requirements you have and pick the tool that complies instead of picking the one that has the prettiest numbers. Who’s the fool when your web application is offline for a fixup for a day while your customers impatiently wait to get their jobs done or, worse, you lose their data? + +

But…My Boss Wants Numbers!

+ +

You want to know which one of these databases, caches, programming languages, language constructs, or tools is faster, harder, or stronger. Numbers are cool—you can draw pretty graphs that management types can compare and make decisions from. + +

But the first thing a good executive knows is that she is operating on insufficient data, as diagrams drawn from numbers are a very distilled view of reality. And graphs from numbers that are made up by bad profiling are effectively fantasies. + +

If you are going to produce numbers, make sure you understand how much information is and isn’t covered by your results. Before passing the numbers on, make sure the receiving person knows it too. Again, the best thing to do is test with something as close to real-world load as possible. And that isn’t easy. + +

A Call to Arms

+ +

We’re in the market for databases and key/value stores. Every solution has a sweet spot in terms of data, hardware, setup, and operation, and there are enough permutations that you can pick the one that is closest to your problem. But how to find out? Ideally, you download and install all possible candidates, create a profiling test suite with proper testing data, make extensive tests, and compare the results. This can easily take weeks, and you might not have that much time. + +

We would like to ask developers of storage systems to compile a set of profiling suites that simulate different usage patterns of their systems (read-heavy and write-heavy loads, fault tolerance, distributed operation, and many more). A fault-tolerance suite should include the steps necessary to get data live again, such as any rebuild or checkup time. We would like users of these systems to help their developers find out how to reliably measure different scenarios. + +

We are working on CouchDB, and we’d like very much to have such a suite! Even better, developers could agree (a far-fetched idea, to be sure) on a set of benchmarks that objectively measure performance for easy comparison. We know this is a lot of work and the results may still be questionable, but it’ll help our users a great deal when figuring out what to use. diff --git a/editions/1/zh/preface.html b/editions/1/zh/preface.html new file mode 100644 index 0000000..e79c5cb --- /dev/null +++ b/editions/1/zh/preface.html @@ -0,0 +1,53 @@ +Preface + + + + + + + + + + + +

Preface

+ +

Thanks for purchasing this book! If it was a gift, then congratulations. If, on the other hand, you downloaded it without paying, well, actually, we’re pretty happy about that too! This book is available under a free license, and that’s important because we want it to serve the community as documentation—and documentation should be free. + +

So, why pay for a free book? Well, you might like the warm fuzzy feeling you get from holding a book in your hands, as you cosy up on the couch with a cup of coffee. On the couch...get it? Bad jokes aside, whatever your reasons, buying the book helps support us, so we have more time to work on improvements for both the book and CouchDB. So thank you! + +

We set out to compile the best and most comprehensive collection of CouchDB information there is, and yet we know we failed. CouchDB is a fast-moving target and grew significantly during the time we were writing the book. We were able to adapt quickly and keep things up-to-date, but we also had to draw the line somewhere if we ever hoped to publish it. + +

At the time of this writing, CouchDB 0.10.1 is the latest release, but you might already be seeing 0.10.2 or even 0.11.0 released or being prepared—maybe even 1.0. Although we have some ideas about how future releases will look, we don’t know for certain and didn’t want to make any wild guesses. CouchDB is a community project, so ultimately it’s up to you, our readers, to help shape the project. + +

On the plus side, many people successfully run CouchDB 0.10 in production, and you will have more than enough on your hands to run a solid project. Future releases of CouchDB will make things easier in places, but the core features should remain the same. Besides, learning the core features helps you understand and appreciate the shortcuts and allows you to roll your own hand-tailored solutions. + +

Writing an open book was great fun. We’re happy O’Reilly supported our decision in every way possible. The best part—besides giving the CouchDB community early access to the material—was the commenting functionality we implemented on the book’s website. It allows anybody to comment on any paragraph in the book with a simple click. We used some simple JavaScript and Google Groups to allow painless commenting. The result was astounding. As of today, 866 people have sent more than 1,100 messages to our little group. Submissions have ranged from pointing out small typos to deep technical discussions. Feedback on our original first chapter led us to a complete rewrite in order to make sure the points we wanted to get across did, indeed, get across. This system allowed us to clearly formulate what we wanted to say in a way that worked for you, our readers. + +

Overall, the book has become so much better because of the help of hundreds of volunteers who took the time to send in their suggestions. We understand the immense value this model has, and we want to keep it up. New features in CouchDB should make it into the book without us necessarily having to do a reprint every three months. The publishing industry is not ready for that yet, but we want to continue to release new and revised content and listen closely to the feedback. The specifics of how we’ll do this are still in flux, but we’ll be posting the information to the book’s website the first moment we know it. That’s a promise! So make sure to visit the book’s website at http://books.couchdb.org/relax to keep up-to-date. + +

Before we let you dive into the book, we want to make sure you’re well prepared. CouchDB is written in Erlang, but you don’t need to know anything about Erlang to use CouchDB. CouchDB also heavily relies on web technologies like HTTP and JavaScript, and some experience with those does help when following the examples throughout the book. If you have built a website before—simple or complex—you should be ready to go. + +

If you are an experienced developer or systems architect, the introduction to CouchDB should be comforting, as you already know everything involved—all you need to learn are the ways CouchDB puts them together. Toward the end of the book, we ramp up the experience level to help you get as comfortable building large-scale CouchDB systems as you are with personal projects. + +

If you are a beginning web developer, don’t worry—by the time you get to the later parts of the book, you should be able to follow along with the harder stuff. + +

Now, sit back, relax, and enjoy the ride through the wonderful world of CouchDB. + +

Acknowledgments

+ +

J. Chris

+ +

I would like to acknowledge all the committers of CouchDB, the people sending patches, and the rest of the community. I couldn’t have done it without my wife, Amy, who helps me think about the big picture; without the patience and support of my coauthors and O’Reilly; nor without the help of everyone who helped us hammer out book content details on the mailing lists. And a shout-out to the copyeditor, who was awesome! + +

Jan

+ +

I would like to thank the CouchDB community. Special thanks go out to a number of nice people all over the place who invited me to attend or talk at a conference, who let me sleep on their couches (pun most definitely intended), and who made sure I had a good time when I was abroad presenting CouchDB. There are too many to name, but all of you in Dublin, Portland, Lisbon, London, Zurich, San Francisco, Mountain View, Dortmund, Stockholm, Hamburg, Frankfurt, Salt Lake City, Blacksburg, San Diego, and Amsterdam: you know who you are—thanks! + +

To my family, friends, and coworkers: thank you for your support and your patience with me over the last year. You won’t hear, “I’ve got to leave early, I have a book to write” from me anytime soon, promise! + +

Anna, you believe in me; I couldn’t have done this without you. + +

Noah

+ +

I would like to thank O’Reilly for their enthusiasm in CouchDB and for realizing the importance of free documentation. And of course, I’d like to thank Jan and J. Chris for being so great to work with. But a special thanks goes out to the whole CouchDB community, for making everything so fun and rewarding. Without you guys, none of this would be possible. And if you’re reading this, that means you! diff --git a/editions/1/zh/recipes.html b/editions/1/zh/recipes.html new file mode 100644 index 0000000..94a7527 --- /dev/null +++ b/editions/1/zh/recipes.html @@ -0,0 +1,464 @@ +Recipes + + + + + + + + + + + +

Recipes

+ +

This chapter shows some common tasks and how to solve them with CouchDB using best practices and easy-to-follow step-by-step instructions. + +

Banking

+ +

Banks are serious business. They need serious databases to store serious transactions and serious account information. They can’t lose any money. Ever. They also can’t create money. A bank must be in balance. All the time. + +

Conventional wisdom says a database needs to support transactions to be taken seriously. CouchDB does not support transactions in the traditional sense (although it works transactionally), so you could conclude CouchDB is not well suited to store bank data. Besides, would you trust your money to a couch? Well, we would. This chapter explains why. + +

Accountants Don’t Use Erasers

+ +

Say you want to give $100 to your cousin Paul for the New York cheesecake he sent to you. Back in the day, you had to travel all the way to New York and hand Paul the money, or you could send it via (paper) mail. Both methods were considerably inconvenient, so people started looking for alternatives. At one point, banks offered to take care of the money and make sure it arrived at Paul’s bank safely without headaches. Of course, they’d charge for the convenience, but you’d be happy to pay a little fee if it could save a trip to New York. Behind the scenes, the bank would send somebody with your money to give it to Paul’s bank—the same procedure, but another person was dealing with the trouble. Banks could also batch money transfers; instead of sending each order on its own, they could collect all transfers to New York for a week and send them all at once. In case of any problems—say, the recipient was no longer a customer of the bank (remember, it used to take weeks to travel from one coast to the other)—the money was sent back to the originating account. + +

Eventually, the modern banking system was put in place and the actual sending of money back and forth could be stopped (much to the disdain of highwaymen). Banks had money on paper, which they could send around without actually sending valuables. The old concept is stuck in our heads though. To send somebody money from our bank account, the bank needs to take the notes out of the account and bring them to the receiving account. But nowadays we’re used to things happening instantaneously. It takes just a few clicks to order goods from Amazon and have them placed into the mail, so why should a banking transaction take any longer? + +

Banks are all electronic these days (and have been for a while). When we issue a money transfer, we expect it to go through immediately, and we expect it to work in the way it worked back in the day: take money from my account, add it to Paul’s account, and if anything goes wrong, put it back in my account. While this is logically what happens, that’s not quite how it works behind the scenes, and hasn’t since way before computers were used for banking. + +

When you go to your bank and ask it to send money to Paul, the accountant will start a transaction by noting down that you ordered the sending of the money. The transaction will include the date, amount, and recipient. Remember that banks always need to be in balance. The money taken from your account cannot vanish. The accountant will move the money into an in-transit account that the bank maintains for you. Your account balance at this point is an aggregation of your current balance and the transactions in the in-transit account. Now the bank checks whether Paul’s account is what you say it is and whether the money could arrive there safely. If that’s the case, the money is moved in another single transaction from the in-transit account to Paul’s account. Everything is in balance. Notice how there are multiple independent transactions, not one big transaction that combines a number of actions. + +

Now let’s consider an error case: say Paul’s account no longer exists. The bank finds this out while performing the batch operation of all the in-transit transactions that need to be performed. A second transaction is generated that moves the money back from the in-transit account to your bank account. Note that the transaction that moved the money out of your account is not undone. Rather, a second transaction that does the reverse action is created. + +

Here’s another error case: say you don’t have sufficient funds to send $100 to Paul. This will be checked by the accountant (or software) before the bank creates any money-deducting transaction. For accountability, a bank cannot pretend an action didn’t happen; it has to record every action minutely in a log. Undoing is done explicitly by performing a reverse action, not by reverting or removing an existing transaction. “Accountants don’t use erasers” is a quote from Pat Helland, a senior architect of transactional systems who worked at Microsoft and Amazon. + +

To rehash, a transaction can succeed or fail, but nothing in between. The only operation that CouchDB guarantees to have succeed or fail is a single document write. All operations that comprise a transaction need to be combined into a single document. If business logic detects that an error occurred (e.g., not enough funds), a reverse transaction needs to be created. + +

Let’s look at a CouchDB example. We mentioned earlier that your account balance is an aggregated value. If we stick to this picture, things become downright easy. Instead of updating the balance of two accounts (yours and Paul’s, or yours and the in-transit account), we simply create a single transaction document that describes what we’re doing and use a view to aggregate your account balance. + +

Let’s consider a bunch of transactions: + +

+...
+{"from":"Jan","to":"Paul","amount":100}
+{"from":"Paul","to":"Steve","amount":20}
+{"from":"Work","to":"Jan","amount":200}
+...
+
+ +

Single document writes in CouchDB are atomic. Querying a view forces an update to the view index with all changes to all documents. The view result is always consistent with the data in our documents. This guarantees that our bank is always in balance. There are many more transactions, of course, but these will do for illustration purposes. + +
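As a sketch of what writing these looks like (assuming the transactions live in a database called bank, the one queried below), a reversal is just another small document; the note field is made up for illustration:

+# record the original transfer
+curl -X POST http://127.0.0.1:5984/bank -H "Content-Type: application/json" -d '{"from":"Jan","to":"Paul","amount":100}'
+# undo it later with a second, reversing transaction instead of erasing the first
+curl -X POST http://127.0.0.1:5984/bank -H "Content-Type: application/json" -d '{"from":"Paul","to":"Jan","amount":100,"note":"reversal"}'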

How do we read the current account balance? Easy—create a MapReduce view: + +

+function(transaction) {
+  emit(transaction.from, transaction.amount * -1);
+  emit(transaction.to, transaction.amount);
+}
+
+ +
+function(keys, values) {
+  return sum(values);
+}
+
+ +

Doesn’t look too hard, does it? We’ll store this in a view balance in a _design/account document. Let’s find out Jan’s balance: + +

+curl 'http://127.0.0.1:5984/bank/_design/account/_view/balance?key="Jan"'
+
+ +

CouchDB replies: + +

+{"rows":[
+{"key":null,"value":100}
+]}
+
+ +

Looks good! Now let’s see if our bank is actually in balance. The sum of all transactions should be zero: + +

+curl http://127.0.0.1:5984/bank/_design/account/_view/balance
+
+ +

CouchDB replies: + +

+{"rows":[
+{"key":null,"value":0}
+]}
+
+ +

Wrapping Up

+ +

This should explain that applications with strong consistency requirements can use CouchDB if it is possible to break up bigger transactions into smaller ones. A bank is a good enough approximation of a serious business, so you can be safe modeling your important business logic into small CouchDB transactions. + +

Ordering Lists

+ +

Views let you sort things by any value of your data—even complex JSON keys are possible, as we’ve seen in earlier chapters. Sorting by date is very useful for allowing users to find things quickly; a name is much easier to find in a list of names that is sorted alphabetically. Humans naturally resort to a divide-and-conquer algorithm (sound familiar?) and don’t consider a large part of the input set because they know the name won’t show up there. Likewise, sorting by number and date helps a great deal to let users manage their ever-increasing amounts of data. + +

There’s another sorting type that is a little more fuzzy. Search engines show you results in order of relevance. That relevance is what the search engine thinks is most relevant to you given your search term (and potential search and surfing history). There are other systems trying to infer from earlier data what is most relevant to you, but they have the near-to-impossible task of guessing what a user is interested in. Computers are notoriously bad at guessing. + +

The easiest way for a computer to figure out what’s most relevant for a user is to let the user prioritize things. Take a to-do application: it allows users to reorder to-do items so they know what they need to work on next. The underlying problem—keeping a user-defined sorting order—can be found in a number of other places. + +

A List of Integers

+ +

Let’s stick with the to-do application example. The naïve approach is pretty easy: with each to-do item we store an integer that specifies the location in a list. We use a view to get all to-do items in the right order. + +

First, we need some example documents: + +

+{
+  "title":"Remember the Milk",
+  "date":"2009-07-22T09:53:37",
+  "sort_order":2
+}
+
+{
+  "title":"Call Fred",
+  "date":"2009-07-21T19:41:34",
+  "sort_order":3
+}
+
+{
+  "title":"Gift for Amy",
+  "date":"2009-07-19T17:33:29",
+  "sort_order":4
+}
+
+{
+  "title":"Laundry",
+  "date":"2009-07-22T14:23:11",
+  "sort_order":1
+}
+
+ +

Next, we create a view with a simple map function that emits rows that are then sorted by the sort_order field of our documents. The view’s result looks like we’d expect: + +

+function(todo) {
+  if(todo.sort_order && todo.title) {
+    emit(todo.sort_order, todo.title);
+  }
+}
+
+ +
+{
+  "total_rows": 4,
+  "offset": 0,
+  "rows": [
+    {
+      "key":1,
+      "value":"Laundry",
+      "id":"..."
+    },
+    {
+      "key":2,
+      "value":"Remember the Milk",
+      "id":"..."
+    },
+    {
+      "key":3,
+      "value":"Call Fred",
+      "id":"..."
+    },
+    {
+      "key":4,
+      "value":"Gift for Amy",
+      "id":"..."
+    }
+  ]
+}
+
+ +

That looks reasonably easy, but can you spot the problem? Here’s a hint: what do you have to do if getting a gift for Amy becomes a higher priority than remembering the milk? Conceptually, the work required is simple: + +

  1. Assign “Gift for Amy” the sort_order of “Remember the Milk.”
  2. Increment the sort_order of “Remember the Milk” and all items that follow by one.

Under the hood, this is a lot of work. With CouchDB you’d have to load every document, increment the sort_order, and save it back. If you have a lot of to-do items (I do), then this is some significant work. Maybe there’s a better approach. + +

A List of Floats

+ +

The fix is simple: instead of using an integer to specify the sort order, we use a float: + +

+{
+  "title":"Remember the Milk",
+  "date":"2009-07-22T09:53:37",
+  "sort_order":0.2
+}
+
+{
+  "title":"Call Fred",
+  "date":"2009-07-21T19:41:34",
+  "sort_order":0.3
+}
+
+{
+  "title":"Gift for Amy",
+  "date":"2009-07-19T17:33:29",
+  "sort_order":0.4
+}
+
+{
+  "title":"Laundry",
+  "date":"2009-07-22T14:23:11",
+  "sort_order":0.1
+}
+
+ +

The view stays the same. Reading this is as easy as the previous approach. Reordering becomes much easier now. The application frontend can keep a copy of the sort_order values around, so when we move an item and store the move, we not only have available the new position, but also the sort_order value for the two new surrounding items. + +

Let’s move “Gift for Amy” so it’s above “Remember the Milk.” The surrounding sort_orders in the target position are 0.1 and 0.2. To store “Gift for Amy” with the correct sort_order, we simply use the median of the two surrounding values: (0.1 + 0.2) / 2 = 0.3 / 2 = 0.15. + +

If we query the view again, we now get the desired result: + +

+{
+  "total_rows": 4,
+  "offset": 0,
+  "rows": [
+    {
+      "key":0.1,
+      "value":"Laundry",
+      "id":"..."
+    },
+    {
+      "key":0.15,
+      "value":"Gift for Amy",
+      "id":"..."
+    },
+    {
+      "key":0.2,
+      "value":"Remember the Milk",
+      "id":"..."
+    },
+    {
+      "key":0.3,
+      "value":"Call Fred",
+      "id":"..."
+    }
+  ]
+}
+
+ +

The downside of this approach is that with an increasing number of reorderings, float precision can become an issue as digits “grow” infinitely. One solution is not to care and expect that a single user will not exceed any limits. Alternatively, an administrative task can reset the whole list to single decimals when a user is not active. + +

The advantage of this approach is that you have to touch only a single document, which is efficient for storing the new ordering of a list and updating the view that maintains the ordered index since only the changed document has to be incorporated into the index. + +

Pagination

+ +

This recipe explains how to paginate over view results. Pagination is a user interface (UI) pattern that allows the display of a large number of rows (the result set) without loading all the rows into the UI at once. A fixed-size subset, the page, is displayed along with next and previous links or buttons that can move the viewport over the result set to an adjacent page. + +

We assume you’re familiar with creating and querying documents and views as well as the multiple view query options. + +

Example Data

+ +

To have some data to work with, we’ll create a list of bands, one document per band: + +

+{ "name":"Biffy Clyro" }
+
+{ "name":"Foo Fighters" }
+
+{ "name":"Tool" }
+
+{ "name":"Nirvana" }
+
+{ "name":"Helmet" }
+
+{ "name":"Tenacious D" }
+
+{ "name":"Future of the Left" }
+
+{ "name":"A Perfect Circle" }
+
+{ "name":"Silverchair" }
+
+{ "name":"Queens of the Stone Age" }
+
+{ "name":"Kerub" }
+
+ +

A View

+ +

We need a simple map function that gives us an alphabetical list of band names. This should be easy, but we’re adding extra smarts to filter out “The” and “A” in front of band names to put them into the right position: + +

+function(doc) {
+  if(doc.name) {
+    var name = doc.name.replace(/^(A|The) /, "");
+    emit(name, null);
+  }
+}
+
+ +

The view’s result is an alphabetical list of band names. Now say we want to display band names five at a time and have a link pointing to the next five names that make up one page, and a link for the previous five, if we’re not on the first page. + +

We learned how to use the startkey, limit, and skip parameters in earlier chapters. We’ll use these again here. First, let’s have a look at the full result set: + +
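To fetch it, we query the view with no parameters, assuming (as the curl examples below do) that the documents live in a database called artists and the map function is stored as the by-name view in a _design/artists document:

+curl -X GET http://127.0.0.1:5984/artists/_design/artists/_view/by-name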

+{"total_rows":11,"offset":0,"rows":[
+  {"id":"a0746072bba60a62b01209f467ca4fe2","key":"Biffy Clyro","value":null},
+  {"id":"b47d82284969f10cd1b6ea460ad62d00","key":"Foo Fighters","value":null},
+  {"id":"45ccde324611f86ad4932555dea7fce0","key":"Tenacious D","value":null},
+  {"id":"d7ab24bb3489a9010c7d1a2087a4a9e4","key":"Future of the Left","value":null},
+  {"id":"ad2f85ef87f5a9a65db5b3a75a03cd82","key":"Helmet","value":null},
+  {"id":"a2f31cfa68118a6ae9d35444fcb1a3cf","key":"Nirvana","value":null},
+  {"id":"67373171d0f626b811bdc34e92e77901","key":"Kerub","value":null},
+  {"id":"3e1b84630c384f6aef1a5c50a81e4a34","key":"Perfect Circle","value":null},
+  {"id":"84a371a7b8414237fad1b6aaf68cd16a","key":"Queens of the Stone Age","value":null},
+  {"id":"dcdaf08242a4be7da1a36e25f4f0b022","key":"Silverchair","value":null},
+  {"id":"fd590d4ad53771db47b0406054f02243","key":"Tool","value":null}
+]}
+
+ +

Setup

+ +

The mechanics of paging are very simple: + +

  1. Display the first page.
  2. If there are more rows to show, display a “next” link.
  3. Draw the subsequent page when the link is clicked.
  4. If the current page is not the first page, display a “previous” link.

Or in a pseudo-JavaScript snippet: + +

+var result = new Result();
+var page = result.getPage();
+
+page.display();
+
+if(result.hasPrev()) {
+  page.display_link('prev');
+}
+
+if(result.hasNext()) {
+  page.display_link('next');
+}
+
+ +

Slow Paging (Do Not Use)

+ +

Don’t use this method! We just show it because it might seem natural to use, and you need to know why it is a bad idea. To get the first five rows from the view result, you use the ?limit=5 query parameter: + +

+curl -X GET http://127.0.0.1:5984/artists/_design/artists/_view/by-name?limit=5
+
+ +

The result: + +

+{"total_rows":11,"offset":0,"rows":[
+  {"id":"a0746072bba60a62b01209f467ca4fe2","key":"Biffy Clyro","value":null},
+  {"id":"b47d82284969f10cd1b6ea460ad62d00","key":"Foo Fighters","value":null},
+  {"id":"45ccde324611f86ad4932555dea7fce0","key":"Tenacious D","value":null},
+  {"id":"d7ab24bb3489a9010c7d1a2087a4a9e4","key":"Future of the Left","value":null},
+  {"id":"ad2f85ef87f5a9a65db5b3a75a03cd82","key":"Helmet","value":null}
+]}
+
+ +

By comparing the total_rows value to our limit value, we can determine if there are more pages to display. We also know by the offset member that we are on the first page. We can calculate the value for skip= to get the results for the next page: + +

+var rows_per_page = 5;
+var page = (offset / rows_per_page) + 1; // == 1
+var skip = page * rows_per_page; // == 5 for the first page, 10 for the second ...
+
+ +

So we query CouchDB with: + +

+curl -X GET 'http://127.0.0.1:5984/artists/_design/artists/_view/by-name?limit=5&skip=5'
+
+ +

Note we have to use ' (single quotes) to escape the & character that is special to the shell we execute curl in. + +

The result: + +

+{"total_rows":11,"offset":5,"rows":[
+  {"id":"a2f31cfa68118a6ae9d35444fcb1a3cf","key":"Nirvana","value":null},
+  {"id":"67373171d0f626b811bdc34e92e77901","key":"Kerub","value":null},
+  {"id":"3e1b84630c384f6aef1a5c50a81e4a34","key":"Perfect Circle","value":null},
+  {"id":"84a371a7b8414237fad1b6aaf68cd16a","key":"Queens of the Stone Age","value":null},
+  {"id":"dcdaf08242a4be7da1a36e25f4f0b022","key":"Silverchair","value":null}
+]}
+
+ +

Implementing the hasPrev() and hasNext() method is pretty straightforward: + +

+function hasPrev()
+{
+  return page > 1;
+}
+
+function hasNext()
+{
+  var last_page = Math.ceil(total_rows / rows_per_page);
+  return page != last_page;
+}
+
+ +
The dealbreaker
+ +

This all looks easy and straightforward, but it has one fatal flaw. Remember how view results are generated from the underlying B-tree index: CouchDB jumps to the first row (or the first row that matches startkey, if provided) and reads one row after the other from the index until there are no more rows (or limit or endkey match, if provided). + +

The skip argument works like this: in addition to going to the first row and starting to read, skip will skip as many rows as specified, but CouchDB will still read from the first row; it just won’t return any values for the skipped rows. If you specify skip=100, CouchDB will read 100 rows and not create output for them. This doesn’t sound too bad, but it is very bad when you use 1,000 or even 10,000 as skip values. CouchDB will have to look at a lot of rows unnecessarily.

As a rule of thumb, skip should be used only with single digit values. While it’s possible that there are legitimate use cases where you specify a larger value, they are a good indicator for potential problems with your solution. Finally, for the calculations to work, you need to add a reduce function and make two calls to the view per page to get all the numbering right, and there’s still a potential for error. + +

Fast Paging (Do Use)

+ +

The correct solution is not much harder. Instead of slicing the result set into equally sized pages, we look at 10 rows at a time and use startkey to jump to the next 10 rows. We even use skip, but only with the value 1. + +

Here is how it works: + +

+ +

The trick to finding the next page is pretty simple. Instead of requesting 10 rows for a page, you request 11 rows, but display only 10 and use the values in the 11th row as the startkey for the next page. Populating the link to the previous page is as simple as carrying the current startkey over to the next page. If there’s no previous startkey, we are on the first page. We stop displaying the link to the next page if we get rows_per_page or fewer rows back. This is called linked list pagination, as we go from page to page, or list item to list item, instead of jumping directly to a pre-computed page. There is one caveat, though. Can you spot it?

CouchDB view keys do not have to be unique; you can have multiple index entries with the same key. What if you have more index entries for a key than rows that should be on a page? startkey jumps to the first row, and you’d be screwed if CouchDB didn’t have an additional parameter for you to use. All view keys with the same value are internally sorted by docid, that is, the ID of the document that created that view row. You can use the startkey_docid and endkey_docid parameters to get subsets of these rows. For pagination, we still don’t need endkey_docid, but startkey_docid is very handy. In addition to startkey and limit, you also use startkey_docid for pagination if, and only if, the extra row you fetch to find the next page has the same key as the current startkey.

It is important to note that the *_docid parameters only work in addition to the *key parameters and are only useful to further narrow down the result set of a view for a single key. They do not work on their own (the one exception being the built-in _all_docs view that already sorts by document ID). + +
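Putting the pieces together, a sketch of the paging logic might look like this. The query(params) helper is assumed to GET the by-name view with the given parameters (taking care of JSON-encoding the key values) and return the parsed response body; it is not part of CouchDB:

+var rows_per_page = 10;
+
+function getPage(query, startkey, startkey_docid) {
+  var params = { limit: rows_per_page + 1 };      // fetch one extra row
+  if (startkey !== undefined) {
+    params.startkey = startkey;
+    params.startkey_docid = startkey_docid;       // only matters when keys repeat
+  }
+  var result = query(params);
+  var extra = result.rows[rows_per_page];         // the 11th row, if there is one
+  return {
+    rows: result.rows.slice(0, rows_per_page),    // what we display
+    next_startkey: extra ? extra.key : null,      // feeds the "next" link
+    next_startkey_docid: extra ? extra.id : null,
+    hasNext: !!extra,
+    hasPrev: startkey !== undefined               // we carried a startkey over
+  };
+}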

The advantage of this approach is that all the key operations can be performed on the super-fast B-tree index behind the view. Looking up a page doesn’t include scanning through hundreds and thousands of rows unnecessarily. + +

Jump to Page

+ +

One drawback of the linked list style pagination is that you can’t pre-compute the rows for a particular page from the page number and the rows per page. Jumping to a specific page doesn’t really work. Our gut reaction, if that concern is raised, is, “Not even Google is doing that!” and we tend to get away with it. Google always pretends on the first page to find 10 more pages of results. Only if you click on the second page (something very few people actually do) might Google display a reduced set of pages. If you page through the results, you get links for the previous and next 10 pages, but no more. Pre-computing the necessary startkey and startkey_docid for 20 pages is a feasible operation and a pragmatic optimization to know the rows for every page in a result set that is potentially tens of thousands of rows long, or more. + +

If you really do need to jump to a page over the full range of documents (we have seen applications that require that), you can still maintain an integer value index as the view index and take a hybrid approach at solving pagination. diff --git a/editions/1/zh/replication.html b/editions/1/zh/replication.html new file mode 100644 index 0000000..ea38456 --- /dev/null +++ b/editions/1/zh/replication.html @@ -0,0 +1,118 @@ +Replication + + + + + + + + + + + +

Replication

+ +

This chapter introduces CouchDB’s world-class replication system. Replication synchronizes two copies of the same database, allowing users to have low latency access data no matter where they are. These databases can live on the same server or on two different servers—CouchDB doesn’t make a distinction. If you change one copy of the database, replication will send these changes to the other copy. + +

Replication is a one-off operation: you send an HTTP request to CouchDB that includes a source and a target database, and CouchDB will send the changes from the source to the target. That is all. Granted, calling something world-class and then only needing one sentence to explain it does seem odd. But part of the reason why CouchDB’s replication is so powerful lies in its simplicity. + +

Let’s see what replication looks like: + +

+POST /_replicate HTTP/1.1
+{"source":"database","target":"http://example.org/database"}
+
+ +

This call sends all the documents in the local database database to the remote database http://example.org/database. A database is considered “local” when it is on the same CouchDB instance you send the POST /_replicate HTTP request to. All other instances of CouchDB are “remote.” + +

If you want to send changes from the target to the source database, you just make the same HTTP requests, only with source and target database swapped. That is all. + +

+POST /_replicate HTTP/1.1
+{"source":"http://example.org/database","target":"database"}
+
+ +

A remote database is identified by the same URL you use to talk to it. CouchDB replication works over HTTP using the same mechanisms that are available to you. This example shows that replication is a unidirectional process. Documents are copied from one database to another and not automatically vice versa. If you want bidirectional replication, you need to trigger two replications with source and target swapped. + +
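For illustration, the two requests could be wrapped in a few lines of JavaScript. This sketch assumes a local CouchDB on the default port and a runtime with a global fetch():

+async function syncBothWays(a, b) {
+  // two one-way replications with source and target swapped
+  for (const [source, target] of [[a, b], [b, a]]) {
+    await fetch('http://127.0.0.1:5984/_replicate', {
+      method: 'POST',
+      headers: { 'Content-Type': 'application/json' },
+      body: JSON.stringify({ source: source, target: target })
+    });
+  }
+}
+
+syncBothWays('database', 'http://example.org/database');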

The Magic

+ +

When you ask CouchDB to replicate one database to another, it will go and compare the two databases to find out which documents on the source differ from the target and then submit a batch of the changed documents to the target until all changes are transferred. Changes include new documents, changed documents, and deleted documents. Documents that already exist on the target in the same revision are not transferred; only newer revisions are. + +

Databases in CouchDB have a sequence number that gets incremented every time the database is changed. CouchDB remembers what changes came with which sequence number. That way, CouchDB can answer questions like, “What changed in database A between sequence number 212 and now?” by returning a list of new and changed documents. Finding the differences between databases this way is an efficient operation. It also adds to the robustness of replication. + +
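You can ask that question yourself through the _changes API (covered in Chapter 20, Change Notifications). A minimal sketch, assuming a local database named database and a JavaScript runtime with a global fetch():

+fetch('http://127.0.0.1:5984/database/_changes?since=212')
+  .then(function(res) { return res.json(); })
+  .then(function(changes) {
+    // changes.results lists documents created, changed, or deleted after
+    // sequence 212; changes.last_seq tells you where to continue next time
+    console.log(changes.last_seq, changes.results.length);
+  });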

+ +

CouchDB views use the same mechanism when determining when a view needs updating and which documents to incorporate into the view index. You can use this to build your own solutions as well.

+ +

You can use replication on a single CouchDB instance to create snapshots of your databases to be able to test code changes without risking data loss or to be able to refer back to older states of your database. But replication gets really fun if you use two or more different computers, potentially geographically spread out. + +

With different servers, potentially hundreds or thousands of miles apart, problems are bound to happen. Servers crash, network connections break off, things go wrong. When a replication process is interrupted, it leaves two replicating CouchDBs in an inconsistent state. Then, when the problems are gone and you trigger replication again, it continues where it left off. + +

Simple Replication with the Admin Interface

+ +

You can run replication from your web browser using Futon, CouchDB’s built-in administration interface. Start CouchDB and open your browser to http://127.0.0.1:5984/_utils/. On the righthand side, you will see a list of things to visit in Futon. Click on “Replication.” + +

Futon will show you an interface to start replication. You can specify a source and a target by either picking a database from the list of local databases or filling in the URL of a remote database. + +

Click on the Replicate button, wait a bit, and have a look at the lower half of the screen where CouchDB gives you some statistics about the replication run or, if an error occurred, an explanatory message. + +

Congratulations—you ran your first replication. + +

Replication in Detail

+ +

So far, we’ve skipped over the result from a replication request. Now is a good time to look at it in detail. Here’s a nicely formatted example: + +

+{
+  "ok": true,
+  "source_last_seq": 10,
+  "session_id": "c7a2bbbf9e4af774de3049eb86eaa447",
+  "history": [
+    {
+      "session_id": "c7a2bbbf9e4af774de3049eb86eaa447",
+      "start_time": "Mon, 24 Aug 2009 09:36:46 GMT",
+      "end_time": "Mon, 24 Aug 2009 09:36:47 GMT",
+      "start_last_seq": 0,
+      "end_last_seq": 1,
+      "recorded_seq": 1,
+      "missing_checked": 0,
+      "missing_found": 1,
+      "docs_read": 1,
+      "docs_written": 1,
+      "doc_write_failures": 0,
+    }
+  ]
+}
+
+ +

The "ok": true part, similar to other responses, tells us everything went well. source_last_seq includes the source’s update_seq value that was considered by this replication. Each replication request is assigned a session_id, which is just a UUID; you can also talk about a replication session identified by this ID. + +

The next bit is the replication history. CouchDB maintains a list of history sessions for future reference. The history array is currently capped at 50 entries. Each unique replication trigger object (the JSON string that includes the source and target databases as well as potential options) gets its own history. Let’s see what a history entry is all about. + +

The session_id is recorded here again for convenience. The start and end time for the replication session are recorded. The _last_seq denotes the update_seqs that were valid at the beginning and the end of the session. recorded_seq is the update_seq of the target again. It’s different from end_last_seq if a replication process dies in the middle and is restarted. missing_checked is the number of document revisions that were checked against the target; missing_found is the number of those revisions that were missing from the target and therefore had to be replicated.

The last three—docs_read, docs_written, and doc_write_failures—show how many documents we read from the source, wrote to the target, and how many failed. If all is well, _read and _written are identical and doc_write_failures is 0. If not, you know something went wrong during replication. Possible failures are a server crash on either side, a lost network connection, or a validate_doc_update function rejecting a document write. + +

One common scenario is triggering replication on nodes that have admin accounts enabled. Creating design documents is restricted to admins, and if the replication is triggered without admin credentials, writing the design documents during replication will fail and be recorded as doc_write_failures. If you have admins, be sure to include the credentials in the replication request: + +

+> curl -X POST http://127.0.0.1:5984/_replicate -d '{"source":"http://example.org/database", "target":"http://admin:password@127.0.0.1:5984/database"}'
+
+ +

Continuous Replication

+ +

Now that you know how replication works under the hood, we share a neat little trick. When you add "continuous":true to the replication trigger object, CouchDB will not stop after replicating all missing documents from the source to the target. It will listen on CouchDB’s _changes API (see Chapter 20, Change Notifications) and automatically replicate over any new docs as they come into the source to the target. In fact, they are not replicated right away; there’s a complex algorithm determining the ideal moment to replicate for maximum performance. The algorithm is complex and is fine-tuned every once in a while, and documenting it here wouldn’t make much sense. + +

+> curl -X POST http://127.0.0.1:5984/_replicate -d '{"source":"db", "target":"db-replica", "continuous":true}'
+
+ +

At the time of writing, CouchDB doesn’t remember continuous replications over a server restart. For the time being, you are required to trigger them again when you restart CouchDB. In the future, CouchDB will allow you to define permanent continuous replications that survive a server restart without you having to do anything. + +

That’s It?

+ +

Replication is the foundation on which the following chapters build on. Make sure you have understood this chapter. If you don’t feel comfortable yet, just read it again and play around with the replication interface in Futon. + +

We haven’t yet told you everything about replication. The next chapters show you how to manage replication conflicts (see Chapter 17, Conflict Management), how to use a set of synchronized CouchDB instances for load balancing (see Chapter 18, Load Balancing), and how to build a cluster of CouchDBs that can handle more data or write requests than a single node (see Chapter 19, Clustering). diff --git a/editions/1/zh/scaling.html b/editions/1/zh/scaling.html new file mode 100644 index 0000000..d714c9b --- /dev/null +++ b/editions/1/zh/scaling.html @@ -0,0 +1,69 @@ +Scaling Basics + + + + + + + + + + + +

Scaling Basics

+ +

Scaling is an overloaded term. Finding a discrete definition is tricky. Everyone and her grandmother have their own idea of what scaling means. Most definitions are valid, but they can be contradicting. To make things even worse, there are a lot of misconceptions about scaling. To really define it, one needs a scalpel to find out the important bits. + +

First, scaling doesn’t refer to a specific technique or technology; scaling, or scalability, is an attribute of a specific architecture. What is being scaled varies for nearly every project. + +

+ +

Scaling is specialization. + +

—Joe Stump, Lead Architect of Digg.com and SimpleGeo.com + +

+ +

Joe’s quote is the one that we find to be the most accurate description of scaling. It is also wishy-washy, but that is the nature of scaling. An example: a website like Facebook.com— with a whole lot of users and data associated with those users and with more and more users coming in every day—might want to scale over user data that typically lives in a database. In contrast, Flickr.com at its core is like Facebook with users and data for users, but in Flickr’s case, the data that grows fastest is images uploaded by users. These images do not necessarily live in a database, so scaling image storage is Flickr’s path to growth. + +

+ +

It is common to think of scaling as scaling out. This is shortsighted. Scaling can also mean scaling in—that is, being able to use fewer computers when demand declines. More on that later. + +

+ +

These are just two services. There are a lot more, and every one has different things they want to scale. CouchDB is a database; we are not going to cover every aspect of scaling any system. We concentrate on the bits that are interesting to you, the CouchDB user. We have identified three general properties that you can scale with CouchDB: + +

  1. Read requests
  2. Write requests
  3. Data

Scaling Read Requests

+ +

A read request retrieves a piece of information from the database. It passes the following stations within CouchDB. First, the HTTP server module needs to accept the request. For that, it opens a socket to send data over. The next station is the HTTP request handler module that analyzes the request and directs it to the appropriate submodule in CouchDB. For single documents, the request then gets passed to the database module where the data for the document is looked up on the filesystem and returned all the way up again.

All this takes processing time and enough sockets (or file descriptors) must be available. The storage backend of the server must be able to fulfill all read requests. There are a few more things that can limit a system to accept more read requests; the basic point here is that a single server can process only so many concurrent requests. If your applications generate more requests, you need to set up a second server that your application can read from. + +

The nice thing about read requests is that they can be cached. Often-used items can be held in memory and served from a layer above the one that is your bottleneck. Requests that can use this cache don’t ever hit your database and are thus virtually toll-free. Chapter 18, Load Balancing explains this scenario.

Scaling Write Requests

+ +

A write request is like a read request, only a little worse. It not only reads a piece of data from disk, it writes it back after modifying it. Remember, the nice thing about reads is that they’re cacheable. Writes: not so much. A cache must be notified when a write changes data, or clients must be told to not use the cache. If you have multiple servers for scaling reads, a write must occur on all servers. In any case, you need to work harder with a write. Chapter 19, Clustering covers methods for scaling write requests across servers. + +

Scaling Data

+ +

The third way of scaling is scaling data. Today’s hard drives are cheap and have a lot of capacity, and they will only get better in the future, but there is only so much data a single server can make sensible use of. It must maintain one or more indexes to that data, and those indexes use disk space as well. Creating backups will take longer, and other maintenance tasks become a pain.

The solution is to chop the data into manageable chunks and put each chunk on a separate server. All servers with a chunk now form a cluster that holds all your data. Chapter 19, Clustering takes a look at creating and using these clusters. + +

While we are taking separate looks at scaling of reads, writes, and data, these rarely occur isolated. Decisions to scale one will affect the others. We will describe individual as well as combined solutions in the following chapters. + +

Basics First

+ +

Replication is the basis for all three scaling methods. Before we get to scaling, Chapter 16, Replication will familiarize you with CouchDB’s excellent replication feature.

Security

+ +

We mentioned earlier that CouchDB is still in development and that features may have been added since the publication of this book. This is especially true for the security mechanisms in CouchDB. There is rudimentary support in the currently released versions (0.10.0), but as we’re writing these lines, additions are being discussed. + +

In this chapter, we’ll look at the basic security mechanisms in CouchDB: the Admin Party, Basic Authentication, Cookie Authentication, and OAuth. + +

The Admin Party

+ +

When you start out fresh, CouchDB allows any request to be made by anyone. Create a database? No problem, here you go. Delete some documents? Same deal. CouchDB calls this the Admin Party. Everybody has privileges to do anything. Neat. + +

While it is incredibly easy to get started with CouchDB that way, it should be obvious that putting a default installation into the wild is adventurous. Any rogue client could come along and delete a database. + +

A note of relief: by default, CouchDB will listen only on your loopback network interface (127.0.0.1 or localhost) and thus only you will be able to make requests to CouchDB, nobody else. But when you start to open up your CouchDB to the public (that is, by telling it to bind to your machine’s public IP address), you will want to think about restricting access so that the next bad guy doesn’t ruin your admin party. + +
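For reference, the relevant setting lives in the [httpd] section of the configuration; the default keeps CouchDB reachable only from the machine it runs on:

+[httpd]
+; default: only reachable locally
+bind_address = 127.0.0.1
+; listen on all interfaces -- do this only after setting up admins
+;bind_address = 0.0.0.0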

In our previous discussions, we dropped some keywords about how things work without the admin party. First, there’s admin itself, which implies some sort of super user. Then there are privileges. Let’s explore these terms a little more.

CouchDB has the idea of an admin user (e.g. an administrator, a super user, or root) that is allowed to do anything to a CouchDB installation. By default, everybody is an admin. If you don’t like that, you can create specific admin users with a username and password as their credentials. + +

CouchDB also defines a set of requests that only admin users are allowed to do. If you have defined one or more specific admin users, CouchDB will ask for identification for certain requests: + +

+ +

Creating New Admin Users

+ +

Let’s do another walk through the API using curl to see how CouchDB behaves when you add admin users. + +

+> HOST="http://127.0.0.1:5984"
+> curl -X PUT $HOST/database
+{"ok":true}
+
+ +

When starting out fresh, we can add a database. Nothing unexpected. Now let’s create an admin user. We’ll call her anna, and her password is secret. Note the double quotes in the following code; they are needed to denote a string value for the configuration API (as we learned earlier): + +

+curl -X PUT $HOST/_config/admins/anna -d '"secret"'
+""
+
+ +

As per the _config API’s behavior, we’re getting the previous value for the config item we just wrote. Since our admin user didn’t exist, we get an empty string. + +

When we now sneak over to the CouchDB log file, we find these two entries: + +

+[debug] [<0.43.0>] saving to file '/Users/jan/Work/couchdb-git/etc/couchdb/local_dev.ini', Config: '{{"admins","anna"},"secret"}'
+
+[debug] [<0.43.0>] saving to file '/Users/jan/Work/couchdb-git/etc/couchdb/local_dev.ini', Config:'{{"admins","anna"}, "-hashed-6a1cc3760b4d09c150d44edf302ff40606221526,a69a9e4f0047be899ebfe09a40b2f52c"}'
+
+ +

The first is our initial request. You see that our admin user gets written to the CouchDB configuration files. We set our CouchDB log level to debug to see exactly what is going on. We first see the request coming in with a plain-text password and then again with a hashed password. + +

Hashing Passwords

+ +

Seeing the plain-text password is scary, isn’t it? No worries; in normal operation when the log level is not set to debug, the plain-text password doesn’t show up anywhere. It gets hashed right away. The hash is that big, ugly, long string that starts out with -hashed-. How does that work? + +

    + +
  1. Creates a new 128-bit UUID. This is our salt.
  2. Creates a sha1 hash of the concatenation of the bytes of the plain-text password and the salt (sha1(password + salt)).
  3. Prefixes the result with -hashed- and appends ,salt.
+ +

To compare a plain-text password during authentication with the stored hash, the same procedure is run and the resulting hash is compared to the stored hash. The probability of two identical hashes for different passwords is too insignificant to mention (c.f. Bruce Schneier). Should the stored hash fall into the hands of an attacker, it is, by current standards, way too inconvenient (i.e., it’d take a lot of money and time) to find the plain-text password from the hash. + +
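As an illustration only, here is the same scheme sketched in JavaScript with Node.js’s crypto module (CouchDB does this internally in Erlang; the function names are ours):

+const crypto = require('crypto');
+
+// create the stored value for a new plain-text admin password
+function hashAdminPassword(password) {
+  const salt = crypto.randomUUID().replace(/-/g, '');  // 128-bit salt as hex
+  const hash = crypto.createHash('sha1').update(password + salt).digest('hex');
+  return '-hashed-' + hash + ',' + salt;
+}
+
+// check a login attempt against the stored value
+function checkAdminPassword(stored, password) {
+  const parts = stored.replace('-hashed-', '').split(',');
+  const candidate = crypto.createHash('sha1').update(password + parts[1]).digest('hex');
+  return candidate === parts[0];
+}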

But what’s with the -hashed- prefix? Well, remember how the configuration API works? When CouchDB starts up, it reads a set of .ini files with config settings. It loads these settings into an internal data store (not a database). The config API lets you read the current configuration as well as change it and create new entries. CouchDB is writing any changes back to the .ini files. + +

The .ini files can also be edited by hand when CouchDB is not running. Instead of creating the admin user as we showed previously, you could have stopped CouchDB, opened your local.ini, added anna = secret to the [admins] section, and restarted CouchDB. Upon reading the new line from local.ini, CouchDB would run the hashing algorithm and write back the hash to local.ini, replacing the plain-text password. To make sure CouchDB only hashes plain-text passwords and not an existing hash a second time, it prefixes the hash with -hashed-, to distinguish between plain-text passwords and hashed passwords. This means your plain-text password can’t start with the characters -hashed-, but that’s pretty unlikely to begin with. + +

Basic Authentication

+ +

Now that we have defined an admin, CouchDB will not allow us to create new databases unless we give the correct admin user credentials. Let’s verify: + +

+> curl -X PUT $HOST/somedatabase
+{"error":"unauthorized","reason":"You are not a server admin."}
+
+ +

That looks about right. Now we try again with the correct credentials: + +

+> HOST="http://anna:secret@127.0.0.1:5984"
+> curl -X PUT $HOST/somedatabase
+{"ok":true}
+
+ +

If you have ever accessed a website or FTP server that was password-protected, the username:password@ URL variant should look familiar. + +

If you are security conscious, the missing s in http:// will make you nervous. We’re sending our password to CouchDB in plain text. This is a bad thing, right? Yes, but consider our scenario: CouchDB listens on 127.0.0.1 on a development box that we’re the sole user of. Who could possibly sniff our password? + +

If you are in a production environment, however, you need to reconsider. Will your CouchDB instance communicate over a public network? Even a LAN shared with other colocation customers is public. There are multiple ways to secure communication between you or your application and CouchDB that exceed the scope of this book. We suggest you read up on VPNs and setting up CouchDB behind an HTTP proxy (like Apache httpd’s mod_proxy, nginx, or varnish) that will handle SSL for you. CouchDB does not support exposing its API via SSL at the moment. It can, however, replicate with other CouchDB instances that are behind an SSL proxy. (As of version 1.1.0, CouchDB comes with SSL support built in.)

Update Validations Again

+ +

Do you remember Chapter 7, Validation Functions? We had an update validation function that allowed us to verify that the claimed author of a document matched the authenticated username. + +

+function(newDoc, oldDoc, userCtx) {
+  if (newDoc.author) {
+    if(newDoc.author != userCtx.name) {
+      throw("forbidden": "You may only update documents with author " +
+        userCtx.name});
+    }
+  }
+}
+
+ +

What is this userCtx exactly? It is an object filled with information about the current request’s authentication data. Let’s have a look at what’s in there. We’ll show you a simple trick how to introspect what’s going on in all the JavaScript you are writing. + +

+> curl -X PUT $HOST/somedatabase/_design/log -d '{"validate_doc_update":"function(newDoc, oldDoc, userCtx) { log(userCtx); }"}'
+{"ok":true,"id":"_design/log","rev":"1-498bd568e17e93d247ca48439a368718"}
+
+ +

Let’s show the validate_doc_update function: + +

+function(newDoc, oldDoc, userCtx) {
+  log(userCtx);
+}
+
+ +

This gets called for every future document update and does nothing but print a log entry into CouchDB’s log file. If we now create a new document: + +

+> curl -X POST $HOST/somedatabase/ -d '{"a":1}'
+{"ok":true,"id":"36174efe5d455bd45fa1d51efbcff986","rev":"1-23202479633c2b380f79507a776743d5"}
+
+ +

we should see this in our couch.log file: + +

+[info] [<0.9973.0>] OS Process :: {"db": "somedatabase","name": "anna","roles":["_admin"]}
+
+ +

Let’s format this again: + +

+{
+  "db": "somedatabase",
+  "name": "anna",
+  "roles": ["_admin"]
+}
+
+ +

We see the current database, the name of the authenticated user, and an array of roles, with one role "_admin". We can conclude that admin users in CouchDB are really just regular users with the admin role attached to them. + +

By separating users and roles from each other, the authentication system allows for flexible extension. For now, we’ll just look at admin users. + +

Cookie Authentication

+ +

Basic authentication that uses plain-text passwords is nice and convenient, but not very secure if no extra measures are taken. It is also a very poor user experience. If you use basic authentication to identify admins, your application’s users need to deal with an ugly, unstylable browser modal dialog that says “non-professional at work” more than anything else.

To remedy some of these concerns, CouchDB supports cookie authentication. With cookie authentication your application doesn’t have to include the ugly login dialog that the users’ browsers come with. You can use a regular HTML form to submit logins to CouchDB. Upon receipt, CouchDB will generate a one-time token that the client can use in its next request to CouchDB. When CouchDB sees the token in a subsequent request, it will authenticate the user based on the token without the need to see the password again. By default, a token is valid for 10 minutes. + +

To obtain the first token and thus authenticate a user for the first time, the username and password must be sent to the _session API. The API is smart enough to decode HTML form submissions, so you don’t have to resort to any smarts in your application. + +

If you are not using HTML forms to log in, you need to send an HTTP request that looks as if an HTML form generated it. Luckily, this is super simple: + +

+> HOST="http://127.0.0.1:5984"
+> curl -vX POST $HOST/_session -H 'Content-Type: application/x-www-form-urlencoded' -d 'name=anna&password=secret'
+
+ +

CouchDB replies, and we’ll give you some more detail: + +

+< HTTP/1.1 200 OK
+< Set-Cookie: AuthSession=YW5uYTo0QUIzOTdFQjrC4ipN-D-53hw1sJepVzcVxnriEw;
+< Version=1; Path=/; HttpOnly
+> ...
+<
+{"ok":true}
+
+ +

A 200 response code tells us all is well, a Set-Cookie header includes the token we can use for the next request, and the standard JSON response tells us again that the request was successful. + +

Now we can use this token to make another request as the same user without sending the username and password again: + +

+> curl -vX PUT $HOST/mydatabase --cookie AuthSession=YW5uYTo0QUIzOTdFQjrC4ipN-D-53hw1sJepVzcVxnriEw -H "X-CouchDB-WWW-Authenticate: Cookie" -H "Content-Type: application/x-www-form-urlencoded"
+{"ok":true}
+
+ +

You can keep using this token for 10 minutes by default. After 10 minutes you need to authenticate your user again. The token lifetime can be configured with the timeout (in seconds) setting in the couch_httpd_auth configuration section. + +
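For example, to extend the token lifetime to an hour:

+[couch_httpd_auth]
+timeout = 3600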

+ +

Please note that for cookie authentication to work, you need to enable the cookie_authentication_handler in your local.ini: + +

+[httpd]
+authentication_handlers = {couch_httpd_auth, cookie_authentication_handler}, {couch_httpd_oauth, oauth_authentication_handler}, {couch_httpd_auth, default_authentication_handler}
+
+ +

In addition, you need to define a server secret: + +

+[couch_httpd_auth]
+secret = yours3cr37pr4s3
+
+ +
+ +

Network Server Security

+ +

CouchDB is a networked server, and there are best practices for securing these that are beyond the scope of this book. Appendix D, Installing from Source includes some of those best practices. Make sure to understand the implications. diff --git a/editions/1/zh/show.html b/editions/1/zh/show.html new file mode 100644 index 0000000..44060d0 --- /dev/null +++ b/editions/1/zh/show.html @@ -0,0 +1,298 @@ +Show Functions + + + + + + + + + + + +

Show Functions

+ +

CouchDB’s JSON documents are great for programmatic access in most environments. Almost all languages have HTTP and JSON libraries, and in the unlikely event that yours doesn’t, writing them is fairly simple. However, there is one important use case that JSON documents don’t cover: building plain old HTML web pages. Browsers are powerful, and it’s exciting that we can build Ajax applications using only CouchDB’s JSON and HTTP APIs, but this approach is not appropriate for most public-facing websites. + +

HTML is the lingua franca of the web, for good reasons. By rendering our JSON documents into HTML pages, we make them available and accessible for a wider variety of uses. With the pure Ajax approach, visually impaired visitors to our blog stand a chance of not seeing any useful content at all, as popular screen-reading browsers have a hard time making sense of pages when the content is changed on the fly via JavaScript. Another important concern for authors is that their writing be indexed by search engines. Maintaining a high-quality blog doesn’t do much good if readers can’t find it via a web search. Most search engines do not execute JavaScript found within a page, so to them an Ajax blog looks devoid of content. We also mustn’t forget that HTML is likely more friendly as an archive format in the long term than the platform-specific JavaScript and JSON approach we used in previous chapters. Also, by serving plain HTML, we make our site snappier, as the browser can render meaningful content with fewer round-trips to the server. These are just a few of the reasons it makes sense to provide web content as HTML. + +

The traditional way to accomplish the goal of rendering HTML from database records is by using a middle-tier application server, such as Ruby on Rails or Django, which loads the appropriate records for a user request, runs a template function using them, and returns the resulting HTML to the visitor’s browser. The basics of this don’t change in CouchDB’s case; wrapping JSON views and documents with an application server is relatively straightforward. Rather than using browser-side JavaScript to load JSON from CouchDB and rendering dynamic pages, Rails or Django (or your framework of choice) could make those same HTTP requests against CouchDB, render the output to HTML, and return it to the browser. We won’t cover this approach in this book, as it is specific to particular languages and frameworks, and surveying the different options would take more space than you want to read. + +

CouchDB includes functionality designed to make it possible to do most of what an application tier would do, without relying on additional software. The appeal of this approach is that CouchDB can serve the whole application without dependencies on a complex environment such as might be maintained on a production web server. Because CouchDB is designed to run on client computers, where the environment is out of the control of application developers, having some built-in templating capabilities greatly expands the potential uses of these applications. When your application can be served by a standard CouchDB instance, you gain deployment ease and flexibility. + +

The Show Function API

+ +

Show functions, as they are called, have a constrained API designed to ensure cacheability and side effect–free operation. This is in stark contrast to other application servers, which give the programmer the freedom to run any operation as the result of any request. Let’s look at a few example show functions. + +

The most basic show function looks something like this: + +

+function(doc, req) {
+  return '<h1>' + doc.title + '</h1>';
+}
+
+ +

When run with a document that has a field called title with the content “Hello World,” this function will send an HTTP response with the default Content-Type of text/html, the UTF-8 character encoding, and the body <h1>Hello World</h1>. + +

The simplicity of the request/response cycle of a show function is hard to overstate. The most common question we hear is, “How can I load another document so that I can render its content as well?” The short answer is that you can’t. The longer answer is that for some applications you might use a list function to render a view result as HTML, which gives you the opportunity to use more than one document as the input of your function. + +

The basic function from a document and a request to a response, with no side effects and no alternative inputs, stays the same even as we start using more advanced features. Here’s a more complex show function illustrating the ability to set custom headers: + +

+function(doc, req) {
+  return {
+    body : '<foo>' + doc.title + '</foo>',
+    headers : {
+      "Content-Type" : "application/xml",
+      "X-My-Own-Header": "you can set your own headers"
+    }
+  }
+}
+
+ +

If this function were called with the same document as we used in the previous example, the response would have a Content-Type of application/xml and the body <foo>Hello World</foo>. You should be able to see from this how you’d be able to use show functions to generate any output you need, from any of your documents. + +

Popular uses of show functions are for outputting HTML pages, CSV files, or XML needed for compatibility with a particular interface. The CouchDB test suite even illustrates using show functions to output a PNG image. To output binary data, there is the option to return a Base64-encoded string, like this:

+function(doc, req) {
+  return {
+    base64 :
+      ["iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAMAAAAoLQ9TAAAAsV",
+        "BMVEUAAAD////////////////////////5ur3rEBn////////////////wDBL/",
+        "AADuBAe9EB3IEBz/7+//X1/qBQn2AgP/f3/ilpzsDxfpChDtDhXeCA76AQH/v7",
+        "/84eLyWV/uc3bJPEf/Dw/uw8bRWmP1h4zxSlD6YGHuQ0f6g4XyQkXvCA36MDH6",
+        "wMH/z8/yAwX64ODeh47BHiv/Ly/20dLQLTj98PDXWmP/Pz//39/wGyJ7Iy9JAA",
+        "AADHRSTlMAbw8vf08/bz+Pv19jK/W3AAAAg0lEQVR4Xp3LRQ4DQRBD0QqTm4Y5",
+        "zMxw/4OleiJlHeUtv2X6RbNO1Uqj9g0RMCuQO0vBIg4vMFeOpCWIWmDOw82fZx",
+        "vaND1c8OG4vrdOqD8YwgpDYDxRgkSm5rwu0nQVBJuMg++pLXZyr5jnc1BaH4GT",
+        "LvEliY253nA3pVhQqdPt0f/erJkMGMB8xucAAAAASUVORK5CYII="].join(''),
+    headers : {
+      "Content-Type" : "image/png"
+    }
+  };
+}
+
+ +

This function outputs a 16×16 pixel version of the CouchDB logo. The JavaScript code necessary to generate images from document contents would likely be quite complex, but the ability to send Base64-encoded binary data means that query servers written in other languages like C or PHP have the ability to output any data type. + +

Side Effect–Free

+ +

We’ve mentioned that a key constraint of show functions is that they are side effect–free. This means that you can’t use them to update documents, kick off background processes, or trigger any other function. In the big picture, this is a good thing, as it allows CouchDB to give performance and reliability guarantees that standard web frameworks can’t. Because a show function will always return the same result given the same input and can’t change anything about the environment in which it runs, its output can be cached and intelligently reused. In a high-availability deployment with proper caching, this means that a given show function will be called only once for any particular document, and the CouchDB server may not even be contacted for subsequent requests. + +

Working without side effects can be a little bit disorienting for developers who are used to the anything-goes approach offered by most application servers. It’s considered best practice to ensure that actions run in response to GET requests are side effect–free and cacheable, but rarely do we have the discipline to achieve that goal. CouchDB takes a different tack: because it’s a database, not an application server, we think it’s more important to enforce best practices (and ensure that developers don’t write functions that adversely effect the database server) than offer absolute flexibility. Once you’re used to working within these constraints, they start to make a lot of sense. (There’s a reason they are considered best practices.) + +

Design Documents

+ +

Before we look into show functions themselves, we’ll quickly review how they are stored in design documents. CouchDB looks for show functions stored in a top-level field called shows, which is named like this to be parallel with views, lists, and filters. Here’s an example design document that defines two show functions: + +

+{
+  "_id" : "_design/show-function-examples",
+  "shows" : {
+    "summary" : "function(doc, req){ ... }",
+    "detail" : "function(doc, req){ ... }"
+  }
+}
+
+ +

There’s not much to note here except the fact that design documents can define multiple show functions. Now let’s see how these functions are run. + +

Querying Show Functions

+ +

We’ve described the show function API, but we haven’t yet seen how these functions are run. + +

The show function lives inside a design document, so to invoke it we append the name of the function to the design document itself, and then the ID of the document we want to render: + +

+GET /mydb/_design/mydesign/_show/myshow/72d43a93eb74b5f2
+
+ +

Because show functions (and the others like list, etc.) are available as resources within the design document path, all resources provided by a particular design document can be found under a common root, which makes custom application proxying simpler. We’ll see an example of this in Part III, “Example Application”. + +

If the document with ID 72d43a93eb74b5f2 does not exist, the request will result in an HTTP 500 Internal Server Error response. This seems a little harsh; why does it happen? If we query a show function with a document ID that doesn’t point to an existing document, the doc argument in the function is null. Then the show function tries to access it, and the JavaScript interpreter doesn’t like that. So it bails out. To secure against these errors, or to handle non-existing documents in a custom way (e.g., a wiki could display a “create new page” page), you can wrap the code in our function with if(doc !== null) { ... }. + +
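A sketch of such a guard, assuming the response object’s code field to set the HTTP status:

+function(doc, req) {
+  if (doc !== null) {
+    return '<h1>' + doc.title + '</h1>';
+  }
+  // no document with that ID
+  return { code: 404, body: 'Nothing to see here.' };
+}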

However, show functions can also be called without a document ID at all, like this: + +

+GET /mydb/_design/mydesign/_show/myshow
+
+ +

In this case, the doc argument to the function has the value null. This option is useful in cases where the show function can make sense without a document. For instance, in the example application we’ll explore in Part III, “Example Application”, we use the same show function to provide for editing existing blog posts when a DocID is given, as well as for composing new blog posts when no DocID is given. The alternative would be to maintain an alternate resource (likely a static HTML attachment) with parallel functionality. As programmers, we strive not to repeat ourselves, which motivated us to give show functions the ability to run without a document ID. + +

Design Document Resources

+ +

In addition to the ability to run show functions, other resources are available within the design document path. This combination of features within the design document resource means that applications can be deployed without exposing the full CouchDB API to visitors, with only a simple proxy to rewrite the paths. We won’t go into full detail here, but the gist of it is that end users would run the previous query from a path like this: + +

+GET /_show/myshow/72d43a93eb74b5f2
+
+ +

Under the covers, an HTTP proxy can be programmed to prepend the database and design document portion of the path (in this case, /mydb/_design/mydesign) so that CouchDB sees the standard query. With such a system in place, end users can access the application only via functions defined on the design document, so developers can enforce constraints and prevent access to raw JSON document and view data. While it doesn’t provide 100% security, using custom rewrite rules is an effective way to control the access end users have to a CouchDB application. This technique has been used in production by a few websites at the time of this writing. + +

Query Parameters

+ +

The request object (including helpfully parsed versions of query parameters) is available to show functions as well. By way of illustration, here’s a show function that returns different data based on the URL query parameters: + +

+function(doc, req) {
+  return "<p>Aye aye, " + req.parrot + "!</p>";
+}
+
+ +

Requesting this function with a query parameter will result in the query parameter being used in the output: + +

+GET /mydb/_design/mydesign/_show/myshow?parrot=Captain
+
+ +

In this case, we’ll see the output: <p>Aye aye, Captain!</p> + +

Allowing URL parameters into the function does not affect cacheability, as each unique invocation results in a distinct URL. However, making heavy use of this feature will lower your cache effectiveness. Query parameters like this are most useful for doing things like switching the mode or the format of the show function output. It’s recommended that you avoid using them for things like inserting custom content (such as requesting the user’s nickname) into the response, as that will mean each users’s data must be cached separately. + +

Accept Headers

+ +

Part of the HTTP spec allows for clients to give hints to the server about which media types they are capable of accepting. At this time, the JavaScript query server shipped with CouchDB 0.10.0 contains helpers for working with Accept headers. However, web browser support for Accept headers is very poor, which has prompted frameworks such as Ruby on Rails to remove their support for them. CouchDB may or may not follow suit here, but the fact remains that you are discouraged from relying on Accept headers for applications that will be accessed via web browsers. + +

There is a suite of helpers for Accept headers present that allow you to specify the format in a query parameter as well. For instance: + +

+GET /db/_design/app/_show/post
+Accept: application/xml
+
+ +

is equivalent to a similar URL with mismatched Accept headers. This is because browsers don’t use sensible Accept headers for feed URLs. Browsers 1, Accept headers 0. Yay browsers. + +

+GET /db/_design/app/_show/post?format=xml
+Accept: x-foo/whatever
+
+ +

The request function allows developers to switch response Content-Types based on the client’s request. The next example adds the ability to return either HTML, XML, or a developer-designated media type: x-foo/whatever. + +

CouchDB’s main.js library supplies the provides("format", render_function) function, which makes it easy for developers to handle client requests for multiple MIME types in one show function.

This function also shows off the use of registerType(name, mime_types), which adds new types to mapping objects used by respondWith. The end result is ultimate flexibility for developers, with an easy interface for handling different types of requests. main.js uses a JavaScript port of Mimeparse, an open source reference implementation, to provide this service. + +
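A sketch of what such a function might look like, assuming the provides() and registerType() helpers just described:

+function(doc, req) {
+  // teach the query server about our custom media type
+  registerType('foo', 'x-foo/whatever');
+
+  provides('html', function() {
+    return '<h1>' + doc.title + '</h1>';
+  });
+  provides('xml', function() {
+    return {
+      body: '<post><title>' + doc.title + '</title></post>',
+      headers: { 'Content-Type': 'application/xml' }
+    };
+  });
+  provides('foo', function() {
+    return 'foo? foo!';
+  });
+}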

Etags

+ +

We’ve mentioned that show function requests are side effect–free and cacheable, but we haven’t discussed the mechanism used to accomplish this. Etags are a standard HTTP mechanism for indicating whether a cached copy of an HTTP response is still current. Essentially, when the client makes its first request to a resource, the response is accompanied by an Etag, which is an opaque string token unique to the version of the resource requested. The second time the client makes a request against the same resource, it sends along the original Etag with the request. If the server determines that the Etag still matches the resource, it can avoid sending the full response, instead replying with a message that essentially says, “You have the latest version already.” + +

When implemented properly, the use of Etags can cut down significantly on server load. CouchDB provides an Etag header, so that by using an HTTP proxy cache like Squid, you’ll instantly remove load from CouchDB. + +

Functions and Templates

+ +

CouchDB’s process runner looks only at the functions stored under show, but we’ll want to keep the template HTML separate from the content negotiation logic. The couchapp script handles this for us, using the !code and !json handlers. + +

Let’s follow the show function logic through the files that Sofa splits it into. Here’s Sofa’s edit show function: + +

+function(doc, req) {
+  // !json templates.edit
+  // !json blog
+  // !code vendor/couchapp/path.js
+  // !code vendor/couchapp/template.js
+
+  // we only show html
+  return template(templates.edit, {
+    doc : doc,
+    docid : toJSON((doc && doc._id) || null),
+    blog : blog,
+    assets : assetPath(),
+    index : listPath('index','recent-posts',{descending:true,limit:8})
+  });
+}
+
+ +

This should look pretty straightforward. First, we have the function’s head, or signature, that tells us we are dealing with a function that takes two arguments: doc and req. + +

The next four lines are comments, as far as JavaScript is concerned. But these are special documents. The CouchApp upload script knows how to read these special comments on top of the show function. They include macros; a macro starts with a bang (!) and a name. Currently, CouchApp supports the two macros !json and !code. + +

The !json Macro

+ +

The !json macro takes one argument: the path to a file in the CouchApp directory hierarchy in the dot notation. Instead of a slash (/) or backslash (\), you use a dot (.). The !json macro then reads the contents of the file and puts them into a variable that has the same name as the file’s path in dot notation. + +

For example, if you use the macro like this: + +

+  // !json template.edit
+
+ +

CouchDB will read the file template/edit.* and place its contents into a variable: + +

+  var template.edit = "contents of edit.*"
+
+ +

When specifying the path, you omit the file’s extension. That way you can read .json, .js, or .html files, or any other files into variables in your functions. Because the macro matches files with any extensions, you can’t have two files with the same name but different extensions. + +

In addition, you can specify a directory and CouchApp will load all the files in this directory and any subdirectory. So this: + +

+  // !json template
+
+ +

creates: + +

+  var template.edit = "contents of edit.*"
+  var template.post = "contents of post.*"
+
+ +

Note that the macro also takes care of creating the top-level template variable. We just omitted that here for brevity. The !json macro will generate only valid JavaScript. + +

The !code Macro

+ +

The !code macro is similar to the !json macro, but it serves a slightly different purpose. Instead of making the contents of one or more files available as variables in your functions, it replaces itself with the contents of the file referenced in the argument to the macro. + +

This is useful for sharing library functions between CouchDB functions (map/reduce/show/list/validate) without having to maintain their source code in multiple places. + +

Our example shows this line: + +

+  // !code vendor/couchapp/path.js
+
+ +

If you look at the CouchApp sources, there is a file in vendor/couchapp/path.js that includes a bunch of useful functions related to the URL path of a request. In the example just shown, CouchApp will replace the line with the contents of path.js, making the functions locally available to the show function.

The !code macro can load only a single file at a time. + +

Learning Shows

+ +

Before we dig into the full code that will render the post permalink pages, let’s look at some Hello World form examples. The first one shows just the function arguments and the simplest possible return value. See Figure 1, “Basic form function”. + +


Figure 1. Basic form function + +

+ +

A show function is a JavaScript function that converts a document and some details about the HTTP request into an HTTP response. Typically it will be used to construct HTML, but it is also capable of returning Atom feeds, images, or even just filtered JSON. The document argument is just like the documents passed to map functions. + +

Using Templates

+ +

The only thing missing from the show function development experience is the ability to render HTML without ruining your eyes looking at a whole lot of string manipulation, among other unpleasantries. Most programming environments solve this problem with templates; for example, documents that look like HTML but have portions of their content filled out dynamically. + +

Dynamically combining template strings and data in JavaScript is a solved problem. However, it hasn’t caught on, partly because JavaScript doesn’t have very good support for multi-line “heredoc” strings. After all, once you get through escaping quotes and leaving out newlines, it’s not much fun to edit HTML templates inlined into JavaScript code. We’d much rather keep our templates in separate files, where we can avoid all the escaping work, and they can be syntax-highlighted by our editor. + +

The couchapp script has a couple of helpers to make working with templates and library code stored in design documents less painful. In the function shown in Figure 2, “The blog post template”, we use them to load a blog post template, as well as the JavaScript function responsible for rendering it. + +

+ + + +

Figure 2. The blog post template + +

+ +

As you can see, we take the opportunity in the function to strip JavaScript tags from the form post. That regular expression is not secure, and the blogging application is meant to be written to only by its owners, so we should probably drop the regular expression and simplify the function to avoid transforming the document, instead passing it directly to the template. Or we should port a known-good sanitization routine from another language and provide it in the templates library. + +
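
A sketch of that simplified approach might look like the following. The templates.post variable is assumed to be loaded by the !json macro, and renderTemplate() stands in for whichever template library you pull in with !code; both names are illustrative, not Sofa’s actual ones: + +

+function(doc, req) {
+  // !json templates.post
+  // !code lib/templates.js   // hypothetical library providing renderTemplate()
+  return {
+    headers : {"Content-Type" : "text/html"},
+    // pass the document straight through instead of rewriting its fields
+    body : renderTemplate(templates.post, {post : doc, req : req})
+  };
+}
+
+ +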

Writing Templates

+ +

Working with templates, instead of trying to cram all the presentation into one file, makes editing forms a little more relaxing. The templates are stored in their own file, so you don’t have to worry about JavaScript or JSON encoding, and your text editor can highlight the template’s HTML syntax. CouchDB’s JavaScript query server includes the E4X extensions for JavaScript, which can be helpful for XML templates but do not work well for HTML. We’ll explore E4X templates in Chapter 14, Viewing Lists of Blog Posts when we cover forms for views, which makes providing an Atom feed of view results easy and memory efficient. + +

Trust us when we say that looking at this HTML page is much more relaxing than trying to understand what a raw JavaScript one is trying to do. The template library we’re using in the example blog is by John Resig and was chosen for simplicity. It could easily be replaced by one of many other options, such as the Django template language, available in JavaScript. + +

This is a good time to note that CouchDB’s architecture is designed to make it simple to swap out languages for the query servers. With a query server written in Lisp, Python, or Ruby (or any language that supports JSON and stdio), you could have an even wider variety of templating options. However, the CouchDB team recommends sticking with JavaScript as it provides the highest level of support and interoperability, though other options are available. diff --git a/editions/1/zh/show/01.png b/editions/1/zh/show/01.png new file mode 100644 index 0000000..f4ca655 Binary files /dev/null and b/editions/1/zh/show/01.png differ diff --git a/editions/1/zh/show/02.png b/editions/1/zh/show/02.png new file mode 100644 index 0000000..11b369f Binary files /dev/null and b/editions/1/zh/show/02.png differ diff --git a/editions/1/zh/source.html b/editions/1/zh/source.html new file mode 100644 index 0000000..148b8f7 --- /dev/null +++ b/editions/1/zh/source.html @@ -0,0 +1,263 @@ +Installing from Source + + + + + + + + + + + +

Installing from Source

+ +

Generally speaking, you should avoid installing from source. Many operating systems provide package managers that will allow you to download and install CouchDB with a single command. These package managers usually take care of setting things up correctly, handling security, and making sure that the CouchDB database is started and stopped correctly by your system. The first few appendixes showed you how to install CouchDB packages for Unix-like, Mac OS X, and Windows operating systems. If you are unable to follow those instructions, or you need to install by hand for other reasons, this chapter is for you. + +

Dependencies

+ +

To build and install CouchDB, you will need to install a collection of other software that CouchDB depends on. Without this software properly installed on your system, CouchDB will refuse to work. You’ll need to download and install the following: + +

+ +

It is recommended that you install Erlang OTP R12B-5 or above if possible. + +

Each of these software packages should provide custom installation instructions, either on the website or in the archive you download. If you’re lucky, however, you may be able to use a package manager to install these dependencies. + +

Debian-Based (Including Ubuntu) Systems

+ +

You can install the dependencies by running: + +

+apt-get install build-essential erlang libicu-dev libmozjs-dev libcurl4-openssl-dev
+
+ +

If you get an error about any of these packages, be sure to check for the current version offered by your distribution. It may be the case that a newer version has been released and the package name has been changed. For example, you can search for the newest ICU package by running: + +

+apt-cache search libicu
+
+ +

Select and install the highest version from the list available. + +

Mac OS X

+ +

You will need to install the Xcode Tools metapackage by running: + +

+open /Applications/Installers/Xcode\ Tools/XcodeTools.mpkg
+
+ +

If this is unavailable on your system, you will need to install it from your Mac OS X installation CD. Alternatively, you can download a copy. + +

You can then install the other dependencies using MacPorts by running: + +

+port install icu erlang spidermonkey curl
+
+ +

See Appendix B, Installing on Mac OS X for more details. + +

Installing

+ +

Once you have installed all of the dependencies, you should download a copy of the CouchDB source. This should give you an archive that you’ll need to unpack. Open up a terminal and change directory to your newly unpacked archive. + +

Configure the source by running: + +

+./configure
+
+ +

We’re going to be installing CouchDB into /usr/local, which is the default location for user-installed software. A ton of options are available for this command, and you can customize everything from the installation location, such as your home directory, to the location of your Erlang or SpiderMonkey installation. + +

To see what’s available, you can run: + +

+./configure --help
+
+ +

Generally, you can ignore this step if you didn’t get any errors the first time you ran it. You’ll only need to pass extra options if your setup is a bit weird and the script is having trouble finding one of the dependencies you installed in the last section. + +
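
For instance, a configure invocation that installs into your home directory and points at a specific SpiderMonkey build might look roughly like this; the exact flag names can vary between releases, so treat this as a sketch rather than a recipe: + +

+./configure --prefix=$HOME/couchdb \
+            --with-js-lib=/usr/local/lib \
+            --with-js-include=/usr/local/include/js
+
+ +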

If everything was successful, you should see the following message: + +

+You have configured Apache CouchDB, time to relax.
+
+ +

Relax. + +

Build and install the source by running: + +

+make && sudo make install
+
+ +

If you changed the installation location to somewhere temporary, you may not want to use the sudo command here. If you are having problems running make, you may want to try running gmake if it is available on your system. More options can be found by reading the INSTALL file. + +

Security Considerations

+ +

It is not advisable to run the CouchDB server as the super user. If the CouchDB server is compromised by an attacker while it is being run by a super user, the attacker will get super user access to your entire system. That’s not what we want! + +

We strongly recommend that you create a specific user for CouchDB. This user should have as few privileges on your system as possible, preferably the bare minimum needed to run the CouchDB server, read the configuration files, and write to the data and log directories. + +

You can use whatever tool your system provides to create a new couchdb user. + +

On many Unix-like systems you can run: + +

+adduser --system --home /usr/local/var/lib/couchdb --no-create-home --shell /bin/bash --group --gecos "CouchDB" couchdb
+
+ +

Mac OS X provides the standard Accounts option from the System Preferences application, or you can use the Workgroup Manager application, which can be downloaded as part of the Server Admin Tools. + +

You should make sure that the couchdb user has a working login shell. You can test this by logging into a terminal as the couchdb user. You should also make sure to set the home directory to /usr/local/var/lib/couchdb, which is the CouchDB database directory. + +
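
One quick way to verify both the login shell and the home directory is a short session like this (adjust if you created the user differently): + +

+sudo -i -u couchdb
+pwd          # should print /usr/local/var/lib/couchdb
+exit
+
+ +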

Change the ownership of the CouchDB directories by running: + +

+chown -R couchdb:couchdb /usr/local/etc/couchdb
+chown -R couchdb:couchdb /usr/local/var/lib/couchdb
+chown -R couchdb:couchdb /usr/local/var/log/couchdb
+chown -R couchdb:couchdb /usr/local/var/run/couchdb
+
+ +

Change the permission of the CouchDB directories by running: + +

+chmod -R 0770 /usr/local/etc/couchdb
+chmod -R 0770 /usr/local/var/lib/couchdb
+chmod -R 0770 /usr/local/var/log/couchdb
+chmod -R 0770 /usr/local/var/run/couchdb
+
+ +

This isn’t the final word in securing your CouchDB setup. If you’re deploying CouchDB on the Web, or any place where untrusted parties can access your server, it behooves you to research the recommended security measures for your operating system and take any additional steps needed. Keep in mind the network security adage that the only way to properly secure a computer system is to unplug it from the network. + +

Running Manually

+ +

You can start the CouchDB server by running: + +

+sudo -i -u couchdb couchdb -b
+
+ +

This uses the sudo command to run the couchdb command as the couchdb user. + +

When CouchDB starts, it should eventually display the following message: + +

+Apache CouchDB has started, time to relax.
+
+ +

Relax. + +

To check that everything has worked, point your web browser to: + +

+http://127.0.0.1:5984/_utils/index.html
+
+ +

This is Futon, the CouchDB web administration console. We covered the basics of Futon in our early chapters. Once you have it loaded, you should select and run the CouchDB Test Suite from the righthand menu. This will make sure that everything is behaving as expected, and it may save you some serious headaches if things turn out to be a bit wonky. + +

Running As a Daemon

+ +

Once you’ve got CouchDB running nicely, you’ll probably want to run it as a daemon. A daemon is a software application that runs continually in the background, waiting to handle requests. This is how most production database servers run, and you can configure CouchDB to run like this, too. + +

When you run CouchDB as a daemon, it logs to a number of files that you’ll want to clean up from time to time. Letting your log files fill up a disk is a good way to break your server! Some operating systems come with software that does this for you, and it is important for you to research your options and take the necessary steps to make sure that this doesn’t become a problem. CouchDB ships with a logrotate configuration that may be useful. + +
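
If you prefer to roll your own, a minimal logrotate stanza along these lines should work; the log path assumes the default /usr/local layout used in this appendix, so adjust it to match your installation: + +

+/usr/local/var/log/couchdb/*.log {
+    weekly
+    rotate 10
+    copytruncate
+    compress
+    notifempty
+    missingok
+}
+
+ +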

SysV/BSD-Style Systems

+ +

Depending on your operating system, the couchdb daemon script could be installed into a directory called init.d (for SysV-style systems) or rc.d (for BSD-style systems) under the /usr/local/etc directory. The following examples use [init.d|rc.d] to indicate this choice, and you must replace it with your actual directory before running any of these commands. + +

You can start the CouchDB daemon by running: + +

+sudo /usr/local/etc/[init.d|rc.d]/couchdb start
+
+ +

You can stop the CouchDB daemon by running: + +

+sudo /usr/local/etc/[init.d|rc.d]/couchdb stop
+
+ +

You can get the status of the CouchDB daemon by running: + +

+sudo /usr/local/etc/[init.d|rc.d]/couchdb status
+
+ +

If you want to configure how the daemon script works, you will find a bunch of options you can edit in the /usr/local/etc/default/couchdb file. + +

If you want to run the script without the sudo command, you will need to remove the COUCHDB_USER setting from this file. + +

Your operating system will probably provide a way to control the CouchDB daemon automatically, starting and stopping it as a system service. To do this, you will need to copy the daemon script into your system /etc/[init.d|rc.d] directory, and run a command such as: + +

+sudo update-rc.d couchdb defaults
+
+ +

Consult your system documentation for more information. + +

Mac OS X

+ +

You can use the launchd system to control the CouchDB daemon. + +

You can load the launchd configuration by running: + +

+sudo launchctl load /usr/local/Library/LaunchDaemons/org.apache.couchdb.plist
+
+ +

You can unload the launchd configuration by running: + +

+sudo launchctl unload /usr/local/Library/LaunchDaemons/org.apache.couchdb.plist
+
+ +

You can start the CouchDB daemon by running: + +

+sudo launchctl start org.apache.couchdb
+
+ +

You can stop the CouchDB daemon by running: + +

+sudo launchctl stop org.apache.couchdb
+
+ +

The launchd system can control the CouchDB daemon automatically, starting and stopping it as a system service. To do this, you will need to copy the plist file into your system /Library/LaunchDaemons directory. + +

Consult the launchd documentation for more information. + +

Troubleshooting

+ +

Software being software, you can count on something going wrong every now and then. No need to panic; CouchDB has a great community full of people who will be able to answer your questions and help you get started. Here are a few resources to help you on your way: + +

+ +

Don’t forget to use your favorite search engine when diagnosing problems. If you look around a bit, you’re likely to find something. It’s very possible that a bunch of other people have had exactly the same problem as you and a solution has been posted somewhere on the Web. Good luck, and remember to relax! diff --git a/editions/1/zh/standalone.html b/editions/1/zh/standalone.html new file mode 100644 index 0000000..846ea5c --- /dev/null +++ b/editions/1/zh/standalone.html @@ -0,0 +1,219 @@ +Standalone Applications + + + + + + + + + + + +

Standalone Applications

+ +

CouchDB is useful for many areas of an application. Because of its incremental MapReduce and replication characteristics, it is especially well suited to online interactive document and data management tasks. These are the sort of workloads experienced by the majority of web applications. This, coupled with CouchDB’s HTTP interface, makes it a natural fit for the web. + +

In this part, we’ll tour a document-oriented web application—a basic blog implementation. As a lowest common denominator, we’ll be using plain old HTML and JavaScript. The lessons learned should apply to Django/Rails/Java-style middleware applications and even to intensive MapReduce data mining tasks. CouchDB’s API is the same, regardless of whether you’re running a small installation or an industrial cluster. + +

There is no right answer about which application development framework you should use with CouchDB. We’ve seen successful applications in almost every commonly used language and framework. For this example application, we’ll use a two-layer architecture: CouchDB as the data layer and the browser for the user interface. We think this is a viable model for many document-oriented applications, and it makes a great way to teach CouchDB, because we can easily assume that all of you have a browser at hand without having to ensure that you’re familiar with a particular server-side scripting language. + +

Use the Correct Version

+ +

This part is interactive, so be prepared to follow along with your laptop and a running CouchDB database. We’ve made the full example application and all of the source code examples available online, so you’ll start by downloading the current version of the example application and installing it on your CouchDB instance. + +

A challenge of writing this book and preparing it for production is that CouchDB is evolving at a rapid pace. The basics haven’t changed in a long time, and probably won’t change much in the future, but things around the edges are moving forward rapidly for CouchDB’s 1.0 release. + +

This book is going to press as CouchDB version 0.10.0 is about to be released. Most of the code was written against 0.9.1 and the development trunk that is becoming version 0.10.0. In this part we’ll work with two other software packages: CouchApp, which is a set of tools for editing and sharing CouchDB application code; and Sofa, the example blog itself. + +

+ +

See http://couchapp.org for the latest information about the CouchApp model. + +

+ +

As a reader, it is your responsibility to use the correct versions of these packages. For CouchApp, the correct version is always the latest. The correct version of Sofa depends on which version of CouchDB you are using. To see which version of CouchDB you are using, run the following command: + +

+curl http://127.0.0.1:5984
+
+ +

You should see something like one of these three examples: + +

+{"couchdb":"Welcome","version":"0.9.1"}
+
+{"couchdb":"Welcome","version":"0.10.0"}
+
+{"couchdb":"Welcome","version":"0.11.0a858744"}
+
+ +

These three correspond to versions 0.9.1, 0.10.0, and trunk. If the version of CouchDB you have installed is 0.9.1 or earlier, you should upgrade to at least 0.10.0, as Sofa makes use of features not present until 0.10.0. There is an older version of Sofa that will work, but this book covers features and APIs that are part of the 0.10.0 release of CouchDB. It’s conceivable that there will be a 0.9.2, 0.10.1 and even a 0.10.2 release by the time you read this. Please use the latest release of whichever version you prefer. + +

Trunk refers to the latest development version of CouchDB available in the Apache Subversion repository. We recommend that you use a released version of CouchDB, but as developers, we often use trunk. Sofa’s master branch will tend to work on trunk, so if you want to stay on the cutting edge, that’s the way to do it. + +

Portable JavaScript

+ +

If you’re not familiar with JavaScript, we hope the source examples are given with enough context and explanation so that you can keep up. If you are familiar with JavaScript, you’re probably already excited that CouchDB supports view and template rendering JavaScript functions. + +

One of the advantages of building applications that can be hosted on any standard CouchDB installation is that they are portable via replication. This means your application, if you develop it to be served directly from CouchDB, gets offline mode “for free.” Local data makes a big difference for users in a number of ways we won’t get into here. We call applications that can be hosted from a standard CouchDB installation CouchApps. + +

CouchApps are a great vehicle for teaching CouchDB because we don’t need to worry about picking a language or framework; we’ll just work directly with CouchDB so that readers get a quick overview of a familiar application pattern. Once you’ve worked through the example app, you’ll have seen enough to know how to apply CouchDB to your problem domain. If you don’t know much about Ajax development, you’ll learn a little about jQuery as well, and we hope you find the experience relaxing. + +

Applications Are Documents

+ +

Applications are stored as design documents (Figure 1, “CouchDB executes application code stored in design documents”). You can replicate design documents just like everything else in CouchDB. Because design documents can be replicated, whole CouchApps are replicated. CouchApps can be updated via replication, but they are also easily “forked” by the users, who can alter the source code at will. + +

+ + + +

Figure 1. CouchDB executes application code stored in design documents + +

+ +

Because applications are just a special kind of document, they are easy to edit and share. + +

+ +

J. Chris says: Thinking of peer-based application replication takes me back to my first year of high school, when my friends and I would share little programs between the TI-85 graphing calculators we were required to own. Two calculators could be connected via a small cable and we’d share physics cheat sheets, Hangman, some multi-player text-based adventures, and, at the height of our powers, I believe there may have been a Doom clone running. + +

The TI-85 programs were in Basic, so everyone was always hacking each other’s hacks. Perhaps the most ridiculous program was a version of Spy Hunter that you controlled with your mind. The idea was that you could influence the pseudorandom number generator by concentrating hard enough, and thereby control the game. Didn’t work. Anyway, the point is that when you give people access to the source code, there’s no telling what might happen. + +

+ +

If people don’t like the aesthetics of your application, they can tweak the CSS. If people don’t like your interface choices, they can improve the HTML. If they want to modify the functionality, they can edit the JavaScript. Taken to the extreme, they may want to completely fork your application for their own purposes. When they show the modified version to their friends and coworkers, and hopefully you, there is a chance that more people may want to make improvements. + +

As the original developer, you have the control over your version and can accept or reject changes as you see fit. If someone messes around with the source code for a local application and breaks things beyond repair, they can replicate the original copy from your server, as illustrated in Figure 2, “Replicating application changes to a group of friends”. + +

+ + + +

Figure 2. Replicating application changes to a group of friends + +

+ +

Of course, this may not be your cup of tea. Don’t worry; you can be as restrictive as you like with CouchDB. You can restrict access to data however you wish, but beware of the opportunities you might be missing. There is a middle ground between open collaboration and restricted access controls. + +

Once you’ve finished the installation procedure, you’ll be able to see the full application code for Sofa, both in your text editor and as a design document in Futon. + +

Standalone

+ +

What happens if you add an HTML file as a document attachment? Exactly the same thing. We can serve web pages directly with CouchDB. Of course, we might also need images, stylesheets, or scripts. No problem; just add these resources as document attachments and link to them using relative URIs. + +
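
For example, assuming a database named mydb and a document named site (both hypothetical), you could attach and then serve a page roughly like this, where 1-xyz stands for the document’s current revision: + +

+curl -X PUT http://127.0.0.1:5984/mydb/site/index.html?rev=1-xyz \
+     -H "Content-Type: text/html" --data-binary @index.html
+
+# the page is now served at:
+# http://127.0.0.1:5984/mydb/site/index.html
+
+ +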

Let’s take a step back. What do we have so far? A way to serve HTML documents and other static files on the Web. That means we can build and serve traditional websites using CouchDB. Fantastic! But isn’t this a little like reinventing the wheel? Well, a very important difference is that we also have a document database sitting in the background. We can talk to this database using the JavaScript served up with our web pages. Now we’re really cooking with gas! + +

CouchDB’s features are a foundation for building standalone web applications backed by a powerful database. As a proof of concept, look no further than CouchDB’s built-in administrative interface. Futon is a fully functional database management application built using HTML, CSS, and JavaScript. Nothing else. CouchDB and web applications go hand in hand. + +

In the Wild

+ +

There are plenty of examples of CouchApps in the wild. This section includes screenshots of just a few sites and applications that use a standalone CouchDB architecture. + +

Damien Katz, inventor of CouchDB and writer of this book’s Foreword, decided to see how long it would take to implement a shared calendar with real-time updates as events are changed on the server. It took about an afternoon, thanks to some amazing open source jQuery plug-ins. The calendar demo is still running on J. Chris’s server. See Figure 3, “Group calendar”. + +

+ + + +

Figure 3. Group calendar + +

+ +

Jason Davies swapped out the backend of the Ely Service website with CouchDB, without changing anything visible to the user. The technical details are covered on his blog. See Figure 4, “Ely Service”. + +

+ + + +

Figure 4. Ely Service + +

+ +

Jason also converted his mom’s ecommerce website, Bet Ha Bracha, to a CouchApp. It uses the _update handler to hook into different transaction gateways. See Figure 5, “Bet Ha Bracha”. + +

Processing JS is a toolkit for building animated art that runs in the browser. Processing JS Studio is a gallery for Processing JS sketches. See Figure 6, “Processing JS Studio”. + +

+ + + +

Figure 5. Bet Ha Bracha + +

+ +
+ + + +

Figure 6. Processing JS Studio + +

+ +

Swinger is a CouchApp for building and sharing presentations. It uses the Sammy JavaScript application framework. See Figure 7, “Swinger”. + +

+ + + +

Figure 7. Swinger + +

+ +

Nymphormation is a link sharing and tagging site by Benoît Chesneau. It uses CouchDB’s cookie authentication and also makes it possible to share links using replication. See Figure 8, “Nymphormation”. + +

Boom Amazing is a CouchApp by Alexander Lang that allows you to zoom, rotate, and pan around an SVG file, record the different positions, and then replay those for a presentation or something else (from the Boom Amazing README). See Figure 9, “Boom Amazing”. + +

+ + + +

Figure 8. Nymphormation + +

+ +
+ + + +

Figure 9. Boom Amazing + +

+ +

The CouchDB Twitter Client was one of the first standalone CouchApps to be released. It’s documented in J. Chris’s blog post, “My Couch or Yours, Shareable Apps are the Future”. The screenshot in Figure 10, “Twitter Client” shows the word cloud generated from a MapReduce view of CouchDB’s archived tweets. The cloud is normalized against the global view, so universally common words don’t dominate the chart. + +

+ + + +

Figure 10. Twitter Client + +

+ +

Toast is a chat application that allows users to create channels and then invite others to real-time chat. It was initially a demo of the _changes event loop, but it started to take off as a way to chat. See Figure 11, “Toast”. + +

Sofa is the example application for this part, and it has been deployed by a few different authors around the web. The screenshot in Figure 12, “Sofa” is from Jan’s Tumblelog. To see Sofa in action, visit J. Chris’s site, which has been running Sofa since late 2008. + +

+ + + +

Figure 11. Toast + +

+ +
+ + + +

Figure 12. Sofa + +

+ +

Wrapping Up

+ +

J. Chris decided to port his blog from Ruby on Rails to CouchDB. He started by exporting Rails ActiveRecord objects as JSON documents, paring away some features, and adding others as he converted to HTML and JavaScript. + +

The resulting blog engine features access-controlled posting, open comments with the possibility of moderation, Atom feeds, Markdown formatting, and a few other little goodies. This book is not about jQuery, so although we use this JavaScript library, we’ll refrain from dwelling on it. Readers familiar with using asynchronous XMLHttpRequest (XHR) should feel right at home with the code. Keep in mind that the figures and code samples in this part omit many of the bookkeeping details. + +

We will be studying this application and learning how it exercises all the core features of CouchDB. The skills learned in this part should be broadly applicable to any CouchDB application domain, whether you intend to build a self-hosted CouchApp or not. diff --git a/editions/1/zh/standalone/01.png b/editions/1/zh/standalone/01.png new file mode 100644 index 0000000..771753e Binary files /dev/null and b/editions/1/zh/standalone/01.png differ diff --git a/editions/1/zh/standalone/02.png b/editions/1/zh/standalone/02.png new file mode 100644 index 0000000..841e9ab Binary files /dev/null and b/editions/1/zh/standalone/02.png differ diff --git a/editions/1/zh/standalone/03.png b/editions/1/zh/standalone/03.png new file mode 100644 index 0000000..6b20a1d Binary files /dev/null and b/editions/1/zh/standalone/03.png differ diff --git a/editions/1/zh/standalone/04.png b/editions/1/zh/standalone/04.png new file mode 100644 index 0000000..0170715 Binary files /dev/null and b/editions/1/zh/standalone/04.png differ diff --git a/editions/1/zh/standalone/05.png b/editions/1/zh/standalone/05.png new file mode 100644 index 0000000..ada652c Binary files /dev/null and b/editions/1/zh/standalone/05.png differ diff --git a/editions/1/zh/standalone/06.png b/editions/1/zh/standalone/06.png new file mode 100644 index 0000000..77e2deb Binary files /dev/null and b/editions/1/zh/standalone/06.png differ diff --git a/editions/1/zh/standalone/07.png b/editions/1/zh/standalone/07.png new file mode 100644 index 0000000..07d72e5 Binary files /dev/null and b/editions/1/zh/standalone/07.png differ diff --git a/editions/1/zh/standalone/08.png b/editions/1/zh/standalone/08.png new file mode 100644 index 0000000..8279a13 Binary files /dev/null and b/editions/1/zh/standalone/08.png differ diff --git a/editions/1/zh/standalone/09.png b/editions/1/zh/standalone/09.png new file mode 100644 index 0000000..db43664 Binary files /dev/null and b/editions/1/zh/standalone/09.png differ diff --git a/editions/1/zh/standalone/10.png b/editions/1/zh/standalone/10.png new file mode 100644 index 0000000..ca59a1c Binary files /dev/null and b/editions/1/zh/standalone/10.png differ diff --git a/editions/1/zh/standalone/11.png b/editions/1/zh/standalone/11.png new file mode 100644 index 0000000..f6077e5 Binary files /dev/null and b/editions/1/zh/standalone/11.png differ diff --git a/editions/1/zh/standalone/12.png b/editions/1/zh/standalone/12.png new file mode 100644 index 0000000..56e6ece Binary files /dev/null and b/editions/1/zh/standalone/12.png differ diff --git a/editions/1/zh/tour.html b/editions/1/zh/tour.html new file mode 100644 index 0000000..eed99a1 --- /dev/null +++ b/editions/1/zh/tour.html @@ -0,0 +1,410 @@ +新手上路 + + + + + + + + + + + +

Getting Started

+ +

In this chapter, we'll take a quick tour of CouchDB's features and get familiar with Futon, CouchDB's built-in administration interface. We'll create our first document and experiment with CouchDB views. Before you start, see Appendix D, Installing from Source for the installation steps for your operating system, and follow those instructions to get CouchDB installed before moving on. + +

All Systems Are Go!

+ +

We'll use curl to take a quick look at bits and pieces of CouchDB's API. Note that this is only one way of talking to CouchDB; we'll show you more later in the book. What's interesting about curl is that it gives you control over raw HTTP requests and lets you see exactly what the database is doing under the hood. + +

Make sure CouchDB is still running, and then type: + +

+curl http://127.0.0.1:5984/
+
+ +

This issues a GET request to your newly installed CouchDB instance. + +

The reply should look something like: + +

+{"couchdb":"Welcome","version":"0.10.1"}
+
+ +

Nothing fancy: CouchDB is saying hello, along with its version number. + +

Next, we can get a list of all databases: + +

+curl -X GET http://127.0.0.1:5984/_all_dbs
+
+ +

All we added to the previous request is the _all_dbs string. + +

The response looks like this: + +

+[]
+
+ +

Oh, that's right, we haven't created any databases yet! All we see is an empty list. + +

+ +

The curl command issues GET requests by default. You can issue POST requests using curl -X POST. To make it easier to work with our terminal history, we usually use the -X option even when issuing GET requests. That way, if we want to send a POST next time, all we have to change is the method. + +

HTTP does a bit more under the hood than what appears in the examples here. If you're interested in the details, add the -v option (e.g., curl -vX GET), and curl will show you where it is connecting, the request headers it sends, and the response headers it receives back. This is great for debugging. + +

+ +

Let's create a database: + +

+curl -X PUT http://127.0.0.1:5984/baseball
+
+ +

CouchDB will reply with: + +

+{"ok":true}
+
+ +

Retrieving the list of databases again shows some useful results this time: + +

+curl -X GET http://127.0.0.1:5984/_all_dbs
+
+ +
+["baseball"]
+
+ +
+ +

We should mention JavaScript Object Notation (JSON) here, the data format CouchDB speaks. JSON is a lightweight data interchange format based on JavaScript syntax. Because JSON is natively compatible with JavaScript, the web browser is an ideal client for CouchDB. + +

Brackets ([]) represent ordered lists, and curly braces ({}) represent key/value dictionaries. Keys must be strings, delimited by quotes ("), and values can be strings, numbers, booleans, lists, or key/value dictionaries. For a more detailed description of JSON, see Appendix E, JSON Primer. + +
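
As a purely illustrative example (not one of the documents we create in this chapter), a value combining these types might look like: + +

+{
+    "item" : "apple",
+    "organic" : true,
+    "sizes" : ["small", "medium", "large"],
+    "prices" : {"Fresh Mart" : 1.59}
+}
+
+ +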

+ +

Let's create another database: + +

+curl -X PUT http://127.0.0.1:5984/baseball
+
+ +

CouchDB will reply with: + +

+{"error":"file_exists","reason":"The database could not be created, the file already exists."}
+
+ +

We already have a database with that name, so CouchDB responds with an error. Let's try again with a different database name. + +

+curl -X PUT http://127.0.0.1:5984/plankton
+
+ +

CouchDB will reply with: + +

+{"ok":true}
+
+ +

Retrieving the list of databases yet again shows some useful results: + +

+curl -X GET http://127.0.0.1:5984/_all_dbs
+
+ +

CouchDB will respond with: + +

+["baseball", "plankton"]
+
+ +

To round things off, let's delete the second database: + +

+curl -X DELETE http://127.0.0.1:5984/plankton
+
+ +

CouchDB will reply with: + +

+{"ok":true}
+
+ +

The list of databases is now the same as it was before: + +

+curl -X GET http://127.0.0.1:5984/_all_dbs
+
+ +

CouchDB will respond with: + +

+["baseball"]
+
+ +

For brevity, we'll skip working with documents here, because the next section covers a different and potentially easier way of working with CouchDB that includes handling documents. As we walk through those examples, keep in mind that under the hood an application does exactly what you are doing by hand right now: everything is a GET, PUT, POST, or DELETE against a URI. + +

Welcome to Futon

+ +

After seeing CouchDB's raw API, let's play with Futon, the administration interface that ships with CouchDB. Futon provides full access to CouchDB's features and makes it easy to work with some of the more complex ideas involved. With Futon, we can create and destroy databases, view and edit documents, compose and run MapReduce views, and trigger replication between databases. + +

To load Futon in your browser, visit: + +

+http://127.0.0.1:5984/_utils/
+
+ +

If you're running version 0.9 or later, you should see something similar to Figure 1, "The Futon welcome screen". In later chapters, we'll focus on using CouchDB from server-side languages such as Ruby and Python. For now, this chapter is a great opportunity to show an example of building a dynamic web application using nothing more than CouchDB's integrated web server. + +

The first thing we should do with a fresh installation of CouchDB is run the test suite to verify that everything is working properly. This assures us that any problems we run into later aren't caused by some bothersome issue with the installation. Red flags from the Futon test suite tell us to double-check the installation before using a potentially broken database, saving us confusion when things don't behave the way we expect.

+ +
+ + + +

Figure 1. The Futon welcome screen + +

+ +
+ +

Some network configurations can cause the replication test to fail when Futon is accessed via localhost. If this happens, access Futon at http://127.0.0.1:5984/_utils/ instead. + +

+ +

Navigate to the test suite by clicking "Test Suite" on the Futon sidebar, and then click "run all" at the top of the main page to kick off the tests. Figure 2, "The Futon test suite" shows the test suite running some tests. + +

+ + + +

Figure 2. The Futon test suite + +

+ +

Because the test suite runs in the browser, it not only tests that CouchDB is functioning properly, it also verifies that your browser's connection to the database is properly configured, which can be handy for diagnosing misbehaving proxies or other HTTP middleware. + +

+ +

If the test suite has an inordinate number of failures, see the troubleshooting section in Appendix D, Installing from Source for the steps needed to fix your installation. + +

+ +

Now that the test suite has finished and confirmed that our CouchDB installation is working, let's look at what else Futon has to offer. + +

Your First Database and Document

+ +

Creating a database in Futon is simple. From the overview page, click "Create Database". When asked for a name, enter hello-world and click the Create button. + +

After your database has been created, Futon will display a list of all its documents. This list starts out empty (Figure 3, "An empty database"), so let's create our first document. Click the Create Document link. Leave the document ID blank, and CouchDB will generate a UUID for you. + +

+ +

For demonstration purposes, it's fine to rely on CouchDB to assign the UUID. When you write your first application, though, we recommend assigning your own UUIDs. If you rely on the server to generate the UUID, you might think a first POST request was lost because no response came back, when in fact it wasn't, and so repeat the same POST. You would then end up with two identical documents, and you might never find the first one, since only the second request received a response. Generating your own UUIDs makes sure you never end up with duplicate documents. + +
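
If you want to follow that advice from the command line, CouchDB's /_uuids resource hands out fresh UUIDs that you can then use in a PUT request. A sketch, with responses omitted and the long identifier standing in for one of the values returned by the first request: + +

+curl -X GET http://127.0.0.1:5984/_uuids
+curl -X PUT http://127.0.0.1:5984/hello-world/6e1295ed6c29495e54cc05947f18c8af \
+     -d '{"hello":"world"}'
+
+ +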

+ +

Futon will display the newly created document, which has only the _id and _rev fields. To create a new field, click Add Field. We'll call the new field hello. Click the green check icon (or hit the Enter key) to finalize creating the hello field. Double-click the hello field's value (default is null) to edit it. + +

If you try to enter world as the new value, you'll get an error when you click the green check: values in CouchDB must be valid JSON. Instead, enter "world" (with the quotes), which is a valid JSON string, and it should save without a problem. You can experiment with other JSON values, e.g., [1, 2, "c"] or {"foo":"bar"}. Once you've entered your value, make a note of the _rev attribute and click Save Document. The result should look like Figure 4, "A 'hello world' document". + +

+ + + +

Figure 3. An empty database + +

+ +
+ + + +

Figure 4. A "hello world" document + +

+ +

You'll notice that the document's _rev has changed. We'll go into more detail about this in later chapters, but for now, the important thing to note is that _rev acts like a safety feature when saving a document: changes are saved only when you and CouchDB agree on the document's most recent _rev. + +

Futon also provides a way to display the underlying JSON data, which can be more compact and easier to read, depending on what sort of data you're dealing with. To see the JSON version of our "hello world" document, click the Source tab. The result should look like Figure 5, "The JSON source of a 'hello world' document". + +

+ + + +

Figure 5. The JSON source of a "hello world" document + +

+ +

Running a Query Using MapReduce

+ +

Traditional relational databases allow you to run any query you like as long as your data is structured correctly. In contrast, CouchDB uses predefined map and reduce functions in a style known as MapReduce. These functions provide great flexibility because they can adapt to variations in document structure, and the indexes for each document can be computed independently and in parallel. The combination of a map and a reduce function is called a view in CouchDB terminology. + +

+ +

For experienced relational database programmers, MapReduce can take some getting used to. Rather than declaring which rows from which tables to include in a result set, and leaving it to the database to determine the most efficient way to run the query, reduce queries are based on the index built by the map function. + +

+ +

Map functions are called once with each document as the argument. The function can choose to skip the document altogether or emit one or more key/value pairs. Map functions may not depend on any information outside of the document. This independence is what allows CouchDB views to be generated incrementally and in parallel. + +

CouchDB views are stored as rows that are kept sorted by key. This makes retrieving data from a range of keys efficient even when there are thousands or millions of rows. When writing CouchDB map functions, your primary goal is to build an index that stores related data under nearby keys. + +

Before we can run an example MapReduce view, we'll need some data to run it on. We'll create documents carrying the price of various supermarket items as found at different shops. Let's create documents for apples, oranges, and bananas. (Allow CouchDB to generate the _id and _rev fields.) Use Futon to create documents that have a final JSON structure that looks like this: + +

+{
+    "_id" : "bc2a41170621c326ec68382f846d5764",
+    "_rev" : "2612672603",
+    "item" : "apple",
+    "prices" : {
+        "Fresh Mart" : 1.59,
+        "Price Max" : 5.99,
+        "Apples Express" : 0.79
+    }
+}
+
+ +

This document should look like Figure 6, "An example document with apple prices". + +

+ + + +

Figure 6. An example document with apple prices + +

+ +

OK, now let's create the document for oranges: + +

+{
+    "_id" : "bc2a41170621c326ec68382f846d5764",
+    "_rev" : "2612672603",
+    "item" : "orange",
+    "prices" : {
+        "Fresh Mart" : 1.99,
+        "Price Max" : 3.19,
+        "Citrus Circus" : 1.09
+    }
+}
+
+ +

And finally, the document for bananas: + +

+{
+    "_id" : "bc2a41170621c326ec68382f846d5764",
+    "_rev" : "2612672603",
+    "item" : "banana",
+    "prices" : {
+        "Fresh Mart" : 1.99,
+        "Price Max" : 0.79,
+        "Banana Montana" : 4.22
+    }
+}
+
+ +

Imagine we're catering a big luncheon, but the client is very price-sensitive. To find the lowest prices, we're going to create our first view, which shows each fruit sorted by price. Click hello-world to return to the hello-world overview, and then, from the select view menu, choose Temporary view... to create a new view. The result should look like Figure 7, "A temporary view". + +

+ + + +

Figure 7. A temporary view

+ +
+ +

Edit the map function, on the left, so that it looks like this: + +

+function(doc) {
+    var store, price, value;
+    if (doc.item && doc.prices) {
+        for (store in doc.prices) {
+            price = doc.prices[store];
+            value = [doc.item, store];
+            emit(price, value);
+        }
+    }
+}
+
+ +

This is a JavaScript function that CouchDB runs for each of our documents as it computes the view. We'll leave the reduce function blank for the time being. + +

Click Run and you should see the result shown in Figure 8, "The results of running a view": a list of the various items sorted by price. This map function would be even more useful if it grouped the items by type so that, say, all the banana prices appeared together in the result set. CouchDB's key sorting system allows any valid JSON object as a key. In this case, we'll use an array of [item, price] so that CouchDB groups by item type and price. + +

+ + + +

Figure 8. The results of running a view + +

+ +

Let's modify the view function so that it looks like this: + +

+function(doc) {
+    var store, price, key;
+    if (doc.item && doc.prices) {
+        for (store in doc.prices) {
+            price = doc.prices[store];
+            key = [doc.item, price];
+            emit(key, store);
+        }
+    }
+}
+
+ +

In this function, we first check whether the document has the fields we want to use. CouchDB recovers gracefully from a few isolated map function failures, but when a map function fails regularly (because of missing required fields or other JavaScript exceptions), CouchDB shuts off its indexing to prevent any further resource usage. For this reason, it's important to check for the existence of fields before you use them. In this case, our map function skips the first "hello world" document we created, without emitting any rows or failing. The results of this query should look like Figure 9, "The results of running a view after grouping by item type and price". + +

+ + + +

Figure 9. The results of running a view after grouping by item type and price + +

+ +

Once we've verified that the document has item and price fields, we loop over the item's prices and emit key/value pairs. The key is an array of the item and the price; in this case, the value is the name of the store where the item can be found at that price. + +

View rows are sorted by their keys: in this example, first by item, then by price. This method of complex sorting is at the heart of creating useful indexes with CouchDB. + +

+ +

MapReduce can be challenging, especially if you've spent years working with relational databases. The important things to keep in mind are that map functions give you an opportunity to sort your data using any key you choose, and that CouchDB's design is focused on providing fast, efficient access to data within a range of keys. + +

+ +

Triggering Replication

+ +

Futon can trigger replication between two local databases, between a local and a remote database, or even between two remote databases. We'll show you how to replicate data from one local database to another, which is a simple way of making backups of your databases. + +

First we'll need to create an empty database to be the target of the replication. Go back to the overview and create a database called hello-replication. Now click Replicator in the sidebar and choose hello-world as the source and hello-replication as the target. Click Replicate to replicate your database. The result should look something like Figure 10, "Running database replication in Futon". + +

+ + + +

Figure 10. Running database replication in Futon + +

+ +
+ +

For larger databases, replication can take much longer. It is important to leave the browser window open while replication is taking place. As an alternative, you can trigger replication via curl or some other HTTP client that can handle long-running connections. If your client closes the connection before replication finishes, you'll have to retrigger it. Luckily, CouchDB's replication can pick up where it left off instead of starting from scratch. + +
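
For instance, the replication we just triggered in Futon could be started with curl roughly like this; on this generation of CouchDB the request blocks until replication finishes, which is why a client that tolerates long-running connections matters: + +

+curl -X POST http://127.0.0.1:5984/_replicate \
+     -H "Content-Type: application/json" \
+     -d '{"source":"hello-world","target":"hello-replication"}'
+
+ +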

+ +

Wrapping Up

+ +

Now that we've seen most of Futon's features, you'll be prepared to dive in and inspect your data as we build our example application in the following chapters. Futon's pure JavaScript approach to managing CouchDB also shows how it's possible to build a fully featured web application using only CouchDB's HTTP API and integrated web server. + +

但在我们讲到那之前, 我们用另一个角度来看看CouchDB的HTTP API; 这次我们会用放大镜--curl来看看. diff --git a/editions/1/zh/tour/01.png b/editions/1/zh/tour/01.png new file mode 100644 index 0000000..41b6420 Binary files /dev/null and b/editions/1/zh/tour/01.png differ diff --git a/editions/1/zh/tour/02.png b/editions/1/zh/tour/02.png new file mode 100644 index 0000000..f263664 Binary files /dev/null and b/editions/1/zh/tour/02.png differ diff --git a/editions/1/zh/tour/03.png b/editions/1/zh/tour/03.png new file mode 100644 index 0000000..47f013b Binary files /dev/null and b/editions/1/zh/tour/03.png differ diff --git a/editions/1/zh/tour/04.png b/editions/1/zh/tour/04.png new file mode 100644 index 0000000..33ceccc Binary files /dev/null and b/editions/1/zh/tour/04.png differ diff --git a/editions/1/zh/tour/05.png b/editions/1/zh/tour/05.png new file mode 100644 index 0000000..08042d7 Binary files /dev/null and b/editions/1/zh/tour/05.png differ diff --git a/editions/1/zh/tour/06.png b/editions/1/zh/tour/06.png new file mode 100644 index 0000000..bb9c7c8 Binary files /dev/null and b/editions/1/zh/tour/06.png differ diff --git a/editions/1/zh/tour/07.png b/editions/1/zh/tour/07.png new file mode 100644 index 0000000..dc009a8 Binary files /dev/null and b/editions/1/zh/tour/07.png differ diff --git a/editions/1/zh/tour/08.png b/editions/1/zh/tour/08.png new file mode 100644 index 0000000..6976f57 Binary files /dev/null and b/editions/1/zh/tour/08.png differ diff --git a/editions/1/zh/tour/09.png b/editions/1/zh/tour/09.png new file mode 100644 index 0000000..f41b306 Binary files /dev/null and b/editions/1/zh/tour/09.png differ diff --git a/editions/1/zh/tour/10.png b/editions/1/zh/tour/10.png new file mode 100644 index 0000000..e95bc66 Binary files /dev/null and b/editions/1/zh/tour/10.png differ diff --git a/editions/1/zh/transforming.html b/editions/1/zh/transforming.html new file mode 100644 index 0000000..541e3f3 --- /dev/null +++ b/editions/1/zh/transforming.html @@ -0,0 +1,260 @@ +Transforming Views with List Functions + + + + + + + + + + + +

Transforming Views with List Functions

+ +

Just as show functions convert documents to arbitrary output formats, CouchDB list functions allow you to render the output of view queries in any format. The powerful iterator API allows for flexibility to filter and aggregate rows on the fly, as well as output raw transformations for an easy way to make Atom feeds, HTML lists, CSV files, config files, or even just modified JSON. + +

List functions are stored under the lists field of a design document. Here’s an example design document that contains two list functions: + +

+{
+  "_id" : "_design/foo",
+  "_rev" : "1-67at7bg",
+  "lists" : {
+    "bar" : "function(head, req) { var row; while (row = getRow()) { ... } }",
+    "zoom" : "function() { return 'zoom!' }",
+  }
+}
+
+ +

Arguments to the List Function

+ +

The function is called with two arguments, which can sometimes be ignored, as the row data itself is loaded during function execution. The first argument, head, contains information about the view. Here’s what you might see looking at a JSON representation of head: + +

+{total_rows:10, offset:0}
+
+ +

The request itself is a much richer data structure. This is the same request object that is available to show, update, and filter functions. We’ll go through it in detail here as a reference. Here’s the example req object: + +

+{
+  "info": {
+    "db_name": "test_suite_db","doc_count": 11,"doc_del_count": 0,
+    "update_seq": 11,"purge_seq": 0,"compact_running": false,"disk_size": 4930,
+    "instance_start_time": "1250046852578425","disk_format_version": 4},
+
+ +

The database information, as available in an information request against a database’s URL, is included in the request parameters. This allows you to stamp rendered rows with an update sequence and know the database you are working with. + +

+  "method": "GET",
+  "path": ["test_suite_db","_design","lists","_list","basicJSON","basicView"],
+
+ +

The HTTP method and the path from the client request are useful, especially for rendering links to other resources within the application. + +

+  "query": {"foo":"bar"},
+
+ +

If there are parameters in the query string (in this case corresponding to ?foo=bar), they will be parsed and available as a JSON object at req.query. + +

+  "headers":
+    {"Accept": "text/html,application/xhtml+xml ,application/xml;q=0.9,*/*;q=0.8",
+    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7","Accept-Encoding":
+    "gzip,deflate","Accept-Language": "en-us,en;q=0.5","Connection": "keep-alive",
+    "Cookie": "_x=95252s.sd25; AuthSession=","Host": "127.0.0.1:5984",
+    "Keep-Alive": "300",
+    "Referer": "http://127.0.0.1:5984/_utils/couch_tests.html?script/couch_tests.js",
+    "User-Agent": "Mozilla/5.0 Gecko/20090729 Firefox/3.5.2"},
+  "cookie": {"_x": "95252s.sd25","AuthSession": ""},
+
+ +

Headers give list and show functions the ability to provide the Content-Type response that the client prefers, as well as other nifty things like cookies. Note that cookies are also parsed into a JSON representation. Thanks, MochiWeb! + +

+  "body": "undefined",
+  "form": {},
+
+ +

In the case where the method is POST, the request body (and a form-decoded JSON representation of it, if applicable) are available as well. + +

+  "userCtx": {"db": "test_suite_db","name": null,"roles": ["_admin"]}
+}
+
+ +

Finally, the userCtx is the same as that sent to the validation function. It provides access to the database the user is authenticated against, the user’s name, and the roles they’ve been granted. In the previous example, you see an anonymous user working with a CouchDB node that is in “admin party” mode. Unless an admin is specified, everyone is an admin. + +

That’s enough about the arguments to list functions. Now it’s time to look at the mechanics of the function itself. + +

An Example List Function

+ +

Let’s put this knowledge to use. In the chapter introduction, we mentioned using lists to generate config files. One fun thing about this is that if you keep your configuration information in CouchDB and generate it with lists, you don’t have to worry about being able to regenerate it again, because you know the config will be generated by a pure function from your database and not other sources of information. This level of isolation will ensure that your config files can be generated correctly as long as CouchDB is running. Because you can’t fetch data from other system services, files, or network sources, you can’t accidentally write a config file generator that fails due to external factors. + +

+ +

J. Chris got excited about the idea of using list functions to generate config files for the sort of services people usually configure using CouchDB, specifically via Chef, an Apache-licensed infrastructure automation tool. The key feature of infrastructure automation is that deployment scripts are idempotent—that is, running your scripts multiple times will have the same intended effect as running them once, something that becomes critical when a script fails halfway through. This encourages crash-only design, where your scripts can bomb out multiple times but your data remains consistent, because it takes the guesswork out of provisioning and updating servers in the case of previous failures. + +

Like map, reduce, and show functions, lists are pure functions, from a view query and an HTTP request to an output format. They can’t make queries against remote services or otherwise access outside data, so you know they are repeatable. Using a list function to generate an HTTP server configuration file ensures that the configuration is generated repeatably, based on only the state of the database. + +

+ +

Imagine you are running a shared hosting platform, with one name-based virtual host per user. You’ll need a config file that starts out with some node configuration (which modules to use, etc.) and is followed by one config section per user, setting things like the user’s HTTP directory, subdomain, forwarded ports, etc. + +

+function(head, req) {
+  // helper function definitions would be here...
+  var row, userConf, configHeader, configFoot;
+  configHeader = renderTopOfApacheConf(head, req.query.hostname);
+  send(configHeader);
+
+ +

In the first block of the function, we’re rendering the top of the config file using the function renderTopOfApacheConf(head, req.query.hostname). This may include information that’s posted into the function, like the internal name of the server that is being configured or the root directory in which user HTML files are organized. We won’t show the function body, but you can imagine that it would return a long multi-line string that handles all the global configuration for your server and sets the stage for the per-user configuration that will be based on view data. + +

The call to send(configHeader) is the heart of your ability to render text using list functions. Put simply, it just sends an HTTP chunk to the client, with the content of the strings passed to it. There is some batching behind the scenes, as CouchDB speaks with the JavaScript runner with a synchronous protocol, but from the perspective of a programmer, send() is how HTTP chunks are born. + +

Now that we’ve rendered and sent the file’s head, it’s time to start rendering the list itself. Each list item will be the result of converting a view row to a virtual host’s configuration element. The first thing we do is call getRow() to get a row of the view. + +

+  while (row = getRow()) {
+    var userConf = renderUserConf(row);
+    send(userConf)
+  }
+
+ +

The while loop used here will continue to run until getRow() returns null, which is how CouchDB signals to the list function that all valid rows (based on the view query parameters) have been exhausted. Before we get ahead of ourselves, let’s check out what happens when we do get a row. + +

In this case, we simply render a string based on the row and send it to the client. Once all rows have been rendered, the loop is complete. Now is a good time to note that the function has the option to return early. Perhaps it is programmed to stop iterating when it sees a particular user’s document or is based on a tally it’s been keeping of some resource allocated in the configuration. In those cases, the loop can end early with a break statement or other method. There’s no requirement for the list function to render every row that is sent to it. + +

+  configFoot = renderConfTail();
+  return configFoot;
+}
+
+ +

Finally, we close out the configuration file and return the final string value to be sent as the last HTTP chunk. The last action of a list function is always to return a string, which will be sent as the final HTTP chunk to the client. + +

To use our config file generation function in practice, we might run a command-line script that looks like: + +

+curl http://localhost:5984/config_db/_design/files/_list/apache/users?hostname=foobar > apache.conf
+
+ +

This will render our Apache config based on data in the user’s view and save it to a file. What a simple way to build a reliable configuration generator! + +

List Theory

+ +

Now that we’ve seen a complete list function, it’s worth mentioning some of the helpful properties they have. + +

The most obvious thing is the iterator-style API. Because each row is loaded independently by calling getRow(), it’s easy not to leak memory. The list function API is capable of rendering lists of arbitrary length without error, when used correctly. + +

On the other hand, this API gives you the flexibility to bundle a few rows in a single chunk of output, so if you had a view of, say, user accounts, followed by subdomains owned by that account, you could use a slightly more complex loop to build up some state in the list function for rendering more complex chunks. Let’s look at an alternate loop section: + +

+var subdomainOwnerRow, subdomainRows = [];
+while (row = getRow()) {
+
+ +

We’ve entered a loop that will continue until we have reached the endkey of the view. The view is structured so that a user profile row is emitted, followed by all of that user’s subdomains. We’ll use the profile data and the subdomain information to template the configuration for each individual user. This means we can’t render any subdomain configuration until we know we’ve received all the rows for the current user. + +

+  if (!subdomainOwnerRow) {
+    subdomainOwnerRow = row;
+
+ +

This case is true only for the first user. We’re merely setting up the initial conditions. + +

+  } else if (row.value.user != subdomainOwnerRow.value.user) {
+
+ +

This is the end case. It will be called only after all the subdomain rows for the current user have been exhausted. It is triggered by a row with a mismatched user, indicating that we have all the subdomain rows. + +

+    send(renderUserConf(subdomainOwnerRow, subdomainRows));
+
+ +

We know we are ready to render everything for the current user, so we pass the profile row and the subdomain rows to a render function (which nicely hides all the gnarly nginx config details from our fair reader). The result is sent to the HTTP client, which writes it to the config file. + +

+    subdomainRows = [];
+    subdomainOwnerRow = row;
+
+ +

We’ve finished with that user, so let’s clear the rows and start working on the next user. + +

+  } else {
+    subdomainRows.push(row);
+
+ +

Ahh, back to work, collecting rows. + +

+  }
+}
+send(renderUserConf(subdomainOwnerRow, subdomainRows));
+
+ +

This last bit is tricky—after the loop is finished (we’ve reached the end of the view query), we’ve still got to render the last user’s config. Wouldn’t want to forget that! + +

The gist of this loop section is that we collect rows that belong to a particular user until we see a row that belongs to another user, at which point we render output for the first user, clear our state, and start working with the new user. Techniques like this show how much flexibility is allowed by the list iterator API. + +

More uses along these lines include filtering rows that should be hidden from a particular result set, finding the top N grouped reduce values (e.g., to sort a tag cloud by popularity), and even writing custom reduce functions (as long as you don’t mind that reductions are not stored incrementally). + +

Querying Lists

+ +

We haven’t looked in detail at the ways list functions are queried. Just like show functions, they are resources available on the design document. The basic path to a list function is as follows: + +

+/db/_design/foo/_list/list-name/view-name
+
+ +

Because the list name and the view name are both specified, this means it is possible to render a list against more than one view. For instance, you could have a list function that renders blog comments in the Atom XML format, and then run it against both a global view of recent comments as well as a view of recent comments by blog post. This would allow you to use the same list function to provide an Atom feed for comments across an entire site, as well as individual comment feeds for each post. + +

After the path to the list comes the view query parameter. Just like a regular view, calling a list function without any query parameters results in a list that reflects every row in the view. Most of the time you’ll want to call it with query parameters to limit the returned data. + +

You’re already familiar with the view query options from Chapter 6, Finding Your Data with Views. The same query options apply to the _list query. Let’s look at URLs side by side; see Example 1, “A JSON view query”. + +

+ +
+GET /db/_design/sofa/_view/recent-posts?descending=true&limit=10
+
+ +

Example 1. A JSON view query + +

+ +

This view query is just asking for the 10 most recent blog posts. Of course, this query could include parameters like startkey or skip—we’re leaving them out for simplicity. To run the same query through a list function, we access it via the list resource, as shown in Example 2, “The HTML list query”. + +

+ +
+GET /db/_design/sofa/_list/index/recent-posts?descending=true&limit=10
+
+ +

Example 2. The HTML list query + +

+ +

The index list here is a function from JSON to HTML. Just like the preceding view query, additional query parameters can be applied to paginate through the list. As we’ll see in Part III, “Example Application”, once you have a working list, adding pagination is trivial. See Example 3, “The Atom list query”. + +

+ +
+GET /db/_design/sofa/_list/index/recent-posts?descending=true&limit=10&format=atom
+
+ +

Example 3. The Atom list query + +

+ +

The list function can also look at the query parameters and do things like switch which output format to render based on a parameter. You can even pass the username into the list using a query parameter (but it’s not recommended, as you’ll ruin cache efficiency). + +
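
A sketch of that kind of switching, using only the getRow()/send() API covered in this chapter, might look like the following; the row layout and the csv value of the format parameter are illustrative: + +

+function(head, req) {
+  var row;
+  if (req.query.format == "csv") {
+    // render one comma-separated line per row
+    while (row = getRow()) {
+      send(row.key + "," + row.value + "\n");
+    }
+    return "";
+  }
+  // default: a simple HTML list
+  send("<ul>");
+  while (row = getRow()) {
+    send("<li>" + row.value + "</li>");
+  }
+  return "</ul>";
+}
+
+ +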

Lists, Etags, and Caching

+ +

Just like show functions and view queries, lists are sent with proper HTTP Etags, which makes them cacheable by intermediate proxies. This means that if your server is starting to bog down in list-rendering code, it should be possible to relieve load by using a caching reverse proxy like Squid. We won’t go into the details of Etags and caching here, as they were covered in Chapter 8, Show Functions. diff --git a/editions/1/zh/unix.html b/editions/1/zh/unix.html new file mode 100644 index 0000000..120cfdd --- /dev/null +++ b/editions/1/zh/unix.html @@ -0,0 +1,61 @@ +Installing on Unix-like Systems + + + + + + + + + + + +

Installing on Unix-like Systems

+ +

Debian GNU/Linux

+ +

You can install the CouchDB package by running: + +

+sudo apt-get install couchdb
+
+ +

When this completes, you should have a copy of CouchDB running on your machine. Be sure to read through the Debian-specific system documentation that can be found under /usr/share/couchdb. + +

Starting with Ubuntu 9.10 (“Karmic”), CouchDB comes preinstalled with every desktop system. + +

Ubuntu

+ +

You can install the CouchDB package by running: + +

+sudo aptitude install couchdb
+
+ +

When this completes, you should have a copy of CouchDB running on your machine. Be sure to read through the Ubuntu-specific system documentation that can be found under /usr/share/couchdb. + +

Gentoo Linux

+ +

Enable the development ebuild of CouchDB by running: + +

+echo dev-db/couchdb | sudo tee -a /etc/portage/package.keywords
+
+ +

Check the CouchDB ebuild by running: + +

+emerge -pv couchdb
+
+ +

Build and install the CouchDB ebuild by running: + +

+sudo emerge couchdb
+
+ +

When this completes, you should have a copy of CouchDB running on your machine. + +

Problems

+ +

See Appendix D, Installing from Source if your distribution doesn’t have a CouchDB package. diff --git a/editions/1/zh/validation.html b/editions/1/zh/validation.html new file mode 100644 index 0000000..4ed8f81 --- /dev/null +++ b/editions/1/zh/validation.html @@ -0,0 +1,210 @@ +Validation Functions + + + + + + + + + + + +

Validation Functions

+ +

In this chapter, we look closely at the individual components of Sofa’s validation function. Sofa has the basic set of validation features you’ll want in your apps, so understanding its validation function will give you a good foundation for others you may write in the future. + +

CouchDB uses the validate_doc_update function to prevent invalid or unauthorized document updates from proceeding. We use it in the example application to ensure that blog posts can be authored only by logged-in users. CouchDB's validation functions, like map and reduce functions, can't have any side effects; they run in isolation from the request. They have the opportunity to block not only end-user document saves, but also replicated documents from other CouchDBs. + 

Document Validation Functions

+ +

To ensure that users may save only documents that provide these fields, we can validate their input by adding another member to the _design/ document: the validate_doc_update function. This is the first time you've seen CouchDB's external process in action. CouchDB sends functions and documents to a JavaScript interpreter. This mechanism is what allows us to write our document validation functions in JavaScript. The validate_doc_update function gets executed for each document you want to create or update. If the validation function raises an exception, the update is denied; when it doesn't, the update is accepted. + 

Document validation is optional. If you don’t create a validation function, no checking is done and documents with any content or structure can be written into your CouchDB database. If you have multiple design documents, each with a validate_doc_update function, all of those functions are called upon each incoming write request. Only if all of them pass does the write succeed. The order of the validation execution is not defined. Each validation function must act on its own. See Figure 1, “The JavaScript document validation function”. + +

+ + + +

Figure 1. The JavaScript document validation function + +

+ +

Validation functions can cancel document updates by throwing errors. To throw an error in such a way that the user will be asked to authenticate before retrying the request, use JavaScript code like this: + 

+throw({unauthorized : message});
+
+ +

When you’re trying to prevent an authorized user from saving invalid data, use this: + +

+throw({forbidden : message});
+
+ +

This function throws forbidden errors when a post does not contain the necessary fields. In places it uses a validate() helper to clean up the JavaScript. We also use simple JavaScript conditionals to ensure that the doc._id is set to be the same as doc.slug for the sake of pretty URLs. + +
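As a rough illustration of that conditional (not Sofa's exact code, and assuming posts carry a slug field), the check amounts to something like:

+if (newDoc.type == "post" && newDoc.slug && newDoc.slug != newDoc._id) {
+  // keep _id and slug in sync so the post URL stays pretty
+  throw({forbidden : "Post _id must match the slug: " + newDoc.slug});
+}
+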

If no exceptions are thrown, CouchDB expects the incoming document to be valid and will write it to the database. By using JavaScript to validate JSON documents, we can deal with any structure a document might have. Given that you can just make up document structure as you go, being able to validate what you come up with is pretty flexible and powerful. Validation can also be a valuable form of documentation. + +

Validation’s Context

+ +

Before we delve into the details of our validation function, let’s talk about the context in which they run and the effects they can have. + +

Validation functions are stored in design documents under the validate_doc_update field. There is only one per design document, but there can be many design documents in a database. In order for a document to be saved, it must pass validations on all design documents in the database (the order in which multiple validations are executed is left undefined). In this chapter, we’ll assume you are working in a database with only one validation function. + +

Writing One

+ +

The function declaration is simple. It takes three arguments: the proposed document update, the current version of the document on disk, and an object corresponding to the user initiating the request. + +

+function(newDoc, oldDoc, userCtx) {}
+
+ +

Above is the simplest possible validation function, which, when deployed, would allow all updates regardless of content or user roles. The converse, which never lets anyone do anything, looks like this: + +

+function(newDoc, oldDoc, userCtx) {
+  throw({forbidden : 'no way'});
+}
+
+ +

Note that if you install this function in your database, you won’t be able to perform any other document operations until you remove it from the design document or delete the design document. Admins can create and delete design documents despite the existence of this extreme validation function. + +

We can see from these examples that the return value of the function is ignored. Validation functions prevent document updates by raising errors. When the validation function passes without raising errors, the update is allowed to proceed. + +

Type

+ +

The most basic use of validation functions is to ensure that documents are properly formed to fit your application’s expectations. Without validation, you need to check for the existence of all fields on a document that your MapReduce or user-interface code needs to function. With validation, you know that any saved documents meet whatever criteria you require. + +

A common pattern in most languages, frameworks, and databases is using types to distinguish between subsets of your data. For instance, in Sofa we have a few document types, most prominently post and comment. + +

CouchDB itself has no notion of types, but they are a convenient shorthand for use in your application code, including MapReduce views, display logic, and user interface code. The convention is to use a field called type to store document types, but many frameworks use other fields, as CouchDB itself doesn’t care which field you use. (For instance, the CouchRest Ruby client uses couchrest-type). + +
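For example, a blog post document following this convention might simply carry the field alongside its other data (the field values here are illustrative):

+{
+  "_id": "hello-world",
+  "type": "post",
+  "title": "Hello World",
+  "author": "jan"
+}
+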

Here’s an example validation function that runs only on posts: + +

+function(newDoc, oldDoc, userCtx) {
+  if (newDoc.type == "post") {
+    // validation logic goes here
+  }
+}
+
+ +

Since CouchDB stores only one validation function per design document, you’ll end up validating multiple types in one function, so the overall structure becomes something like: + +

+function(newDoc, oldDoc, userCtx) {
+  if (newDoc.type == "post") {
+    // validation logic for posts
+  }
+  if (newDoc.type == "comment") {
+    // validation logic for comments
+  }
+  if (newDoc.type == "unicorn") {
+    // validation logic for unicorns
+  }
+}
+
+ +

It bears repeating that type is a completely optional field. We present it here as a helpful technique for managing validations in CouchDB, but there are other ways to write validation functions. Here’s an example that uses duck typing instead of an explicit type attribute: + +

+function(newDoc, oldDoc, userCtx) {
+  if (newDoc.title && newDoc.body) {
+    // validate that the document has an author
+  }
+}
+
+ +

This validation function ignores the type attribute altogether and instead makes the somewhat simpler requirement that any document with both a title and a body must have an author. For some applications, typeless validations are simpler. For others, it can be a pain to keep track of which sets of fields are dependent on one another. + +

In practice, many applications end up using a mix of typed and untyped validations. For instance, Sofa uses document types to track which fields are required on a given document, but it also uses duck typing to validate the structure of particular named fields. We don’t care what sort of document we’re validating. If the document has a created_at field, we ensure that the field is a properly formed timestamp. Similarly, when we validate the author of a document, we don’t care what type of document it is; we just ensure that the author matches the user who saved the document. + +

Required Fields

+ +

The most fundamental validation is ensuring that particular fields are available on a document. The proper use of required fields can make writing MapReduce views much simpler, as you don’t have to test for all the properties before using them—you know all documents will be well-formed. + +

Required fields also make display logic much simpler. Nothing says amateur like the word undefined showing up throughout your application. If you know for certain that all documents will have a field, you can avoid lengthy conditional statements to render the display differently depending on document structure. + +

Sofa requires a different set of fields on posts and comments. Here’s a subset of the Sofa validation function: + +

+function(newDoc, oldDoc, userCtx) {
+  function require(field, message) {
+    message = message || "Document must have a " + field;
+    if (!newDoc[field]) throw({forbidden : message});
+  };
+
+  if (newDoc.type == "post") {
+    require("title");
+    require("created_at");
+    require("body");
+    require("author");
+  }
+  if (newDoc.type == "comment") {
+    require("name");
+    require("created_at");
+    require("comment", "You may not leave an empty comment");
+  }
+}
+
+ +

This is our first look at actual validation logic. You can see that the actual error throwing code has been wrapped in a helper function. Helpers like the require function just shown go a long way toward making your code clean and readable. The require function is simple. It takes a field name and an optional message, and it ensures that the field is not empty or blank. + +

Once we’ve declared our helper function, we can simply use it in a type-specific way. Posts require a title, a timestamp, a body, and an author. Comments require a name, a timestamp, and the comment itself. If we wanted to require that every single document contained a created_at field, we could move that declaration outside of any type conditional logic. + +
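A sketch of that variation, reusing the same require helper (again, this is not Sofa's actual function):

+function(newDoc, oldDoc, userCtx) {
+  function require(field, message) {
+    message = message || "Document must have a " + field;
+    if (!newDoc[field]) throw({forbidden : message});
+  };
+
+  // required on every document, regardless of type
+  require("created_at");
+
+  if (newDoc.type == "post") {
+    require("title");
+    require("body");
+    require("author");
+  }
+  if (newDoc.type == "comment") {
+    require("name");
+    require("comment", "You may not leave an empty comment");
+  }
+}
+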

Timestamps

+ +

Timestamps are an interesting problem in validation functions. Because validation functions are run at replication time as well as during normal client access, we can’t require that timestamps be set close to the server’s system time. We can require two things: that timestamps do not change after they are initially set, and that they are well formed. What it means to be well formed depends on your application. We’ll look at Sofa’s particular requirements here, as well as digress a bit about other options for timestamp formats. + +

First, let’s look at a validation helper that does not allow fields, once set, to be changed on subsequent updates: + +

+function(newDoc, oldDoc, userCtx) {
+  function unchanged(field) {
+    if (oldDoc && toJSON(oldDoc[field]) != toJSON(newDoc[field]))
+      throw({forbidden : "Field can't be changed: " + field});
+  }
+  unchanged("created_at");
+}
+
+ +

The unchanged helper is a little more complex than the require helper, but not much. The first line of the function prevents it from running on initial updates. The unchanged helper doesn’t care at all what goes into a field the first time it is saved. However, if there exists an already-saved version of the document, the unchanged helper requires that whatever fields it is used on are the same between the new and the old version of the document. + +

JavaScript’s equality test is not well suited to working with deeply nested objects. We use CouchDB’s JavaScript runtime’s built-in toJSON function in our equality test, which is better than testing for raw equality. Here’s why: + +

+js> [] == []
+false
+
+ +

JavaScript considers these arrays to be different because it doesn't look at the contents of the array when making the decision. Since they are distinct objects, JavaScript must consider them not equal. We use the toJSON function to convert objects to a string representation, which makes comparisons more likely to succeed in the case where two objects have the same contents. This is not guaranteed to work for deeply nested objects, as toJSON may serialize equivalent nested objects differently, but it works well for the simple values we compare here. + 
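Inside a validation (or map) function, where CouchDB's runtime provides toJSON(), the contrast looks roughly like this:

+// plain equality compares object identity, so two distinct arrays are never equal
+[1, 2, 3] == [1, 2, 3]                    // false
+// comparing the JSON serializations compares contents instead
+toJSON([1, 2, 3]) == toJSON([1, 2, 3])    // true
+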

+ +

The js command gets installed when you install CouchDB’s SpiderMonkey dependency. It is a command-line application that lets you parse, evaluate, and run JavaScript code. js lets you quickly test JavaScript code snippets like the one previously shown. You can also run a syntax check of your JavaScript code using js file.js. In case CouchDB’s error messages are not helpful, you can resort to testing your code standalone and get a useful error report. + +

+ +

Authorship

+ +

Authorship is an interesting question in distributed systems. In some environments, you can trust the server to ascribe authorship to a document. Currently, CouchDB has a simple built-in validation system that manages node admins. There are plans to add a database admin role, as well as other roles. The authentication system is pluggable, so you can integrate with existing services to authenticate users to CouchDB using an HTTP layer, using LDAP integration, or through other means. + +

Sofa uses the built-in node admin account system and so is best suited for single or small groups of authors. Extending Sofa to store author credentials in CouchDB itself is an exercise left to the reader. + +

Sofa’s validation logic says that documents saved with an author field must be saved by the author listed on that field: + +

+function(newDoc, oldDoc, userCtx) {
+  if (newDoc.author) {
+    enforce(newDoc.author == userCtx.name,
+      "You may only update documents with author " + userCtx.name);
+  }
+}
+
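The enforce helper is not shown in this excerpt; a minimal sketch of it, mirroring the require helper from earlier, could look like this:

+function enforce(ok, message) {
+  // throw a forbidden error unless the condition holds
+  if (!ok) throw({forbidden : message});
+}
+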
+ +

Wrapping Up

+ +

Validation functions are a powerful tool to ensure that only documents you expect end up in your databases. You can test writes to your database by content, by structure, and by user who is making the document request. Together, these three angles let you build sophisticated validation routines that will stop anyone from tampering with your database. + +

Of course, validation functions are no substitute for a full security system, although they go a long way and work well with CouchDB’s other security mechanisms. Read more about CouchDB’s security in Chapter 22, Security. diff --git a/editions/1/zh/validation/01.png b/editions/1/zh/validation/01.png new file mode 100644 index 0000000..a5f1330 Binary files /dev/null and b/editions/1/zh/validation/01.png differ diff --git a/editions/1/zh/views.html b/editions/1/zh/views.html new file mode 100644 index 0000000..9182c44 --- /dev/null +++ b/editions/1/zh/views.html @@ -0,0 +1,544 @@ +Finding Your Data with Views + + + + + + + + + + + +

Finding Your Data with Views

+ +

Views are useful for many purposes:

  1. Filtering the documents in your database to find those relevant to a particular process.
  2. Extracting data from your documents and presenting it in a specific order.
  3. Building efficient indexes to find documents by any value or structure that resides in them.
  4. Using these indexes to represent relationships among documents.
  5. Making all sorts of calculations on the data in your documents. For example, if documents represent your company's financial transactions, a view can answer the question of what the spending was in the last week, month, or year.

+ +

What Is a View?

+ +

Let’s go through the different use cases. First is extracting data that you might need for a special purpose in a specific order. For a front page, we want a list of blog post titles sorted by date. We’ll work with a set of example documents as we walk through how views work: + +

+{
+  "_id":"biking",
+  "_rev":"AE19EBC7654",
+
+  "title":"Biking",
+  "body":"My biggest hobby is mountainbiking. The other day...",
+  "date":"2009/01/30 18:04:11"
+}
+
+{
+  "_id":"bought-a-cat",
+  "_rev":"4A3BBEE711",
+
+  "title":"Bought a Cat",
+  "body":"I went to the the pet store earlier and brought home a little kitty...",
+  "date":"2009/02/17 21:13:39"
+}
+
+{
+  "_id":"hello-world",
+  "_rev":"43FBA4E7AB",
+
+  "title":"Hello World",
+  "body":"Well hello and welcome to my new blog...",
+  "date":"2009/01/15 15:52:20"
+}
+
+ +

Three will do for the example. Note that the documents are sorted by "_id", which is how they are stored in the database. Now we define a view. Chapter 3, Getting Started showed you how to create a view in Futon, the CouchDB administration client. Bear with us without an explanation while we show you some code: + +

+function(doc) {
+  if(doc.date && doc.title) {
+    emit(doc.date, doc.title);
+  }
+}
+
+ +

This is a map function, and it is written in JavaScript. If you are not familiar with JavaScript but have used C or any other C-like language such as Java, PHP, or C#, this should look familiar. It is a simple function definition. + +

You provide CouchDB with view functions as strings stored inside the views field of a design document. You don’t run it yourself. Instead, when you query your view, CouchDB takes the source code and runs it for you on every document in the database your view was defined in. You query your view to retrieve the view result. + +

All map functions have a single parameter doc. This is a single document in the database. Our map function checks whether our document has a date and a title attribute—luckily, all of our documents have them—and then calls the built-in emit() function with these two attributes as arguments. + +

The emit() function always takes two arguments: the first is key, and the second is value. The emit(key, value) function creates an entry in our view result. One more thing: the emit() function can be called multiple times in the map function to create multiple entries in the view results from a single document, but we are not doing that yet. + +

CouchDB takes whatever you pass into the emit() function and puts it into a list (see Table 1, “View results”). Each row in that list includes the key and value. More importantly, the list is sorted by key (by doc.date in our case). The most important feature of a view result is that it is sorted by key. We will come back to that over and over again to do neat things. Stay tuned. + +

+ + + + + + + + + + + + + + + + + + + +
KeyValue
"2009/01/15 15:52:20""Hello World"
"2009/01/30 18:04:11""Biking"
"2009/02/17 21:13:39""Bought a Cat"
+ +

Table 1. View results + +

+ +

If you read carefully over the last few paragraphs, one part stands out: “When you query your view, CouchDB takes the source code and runs it for you on every document in the database.” If you have a lot of documents, that takes quite a bit of time and you might wonder if it is not horribly inefficient to do this. Yes, it would be, but CouchDB is designed to avoid any extra costs: it only runs through all documents once, when you first query your view. If a document is changed, the map function is only run once, to recompute the keys and values for that single document. + +

The view result is stored in a B-tree, just like the structure that is responsible for holding your documents. View B-trees are stored in their own file, so that for high-performance CouchDB usage, you can keep views on their own disk. The B-tree provides very fast lookups of rows by key, as well as efficient streaming of rows in a key range. In our example, a single view can answer all questions that involve time: “Give me all the blog posts from last week” or “last month” or “this year.” Pretty neat. Read more about how CouchDB’s B-trees work in Appendix F, The Power of B-trees. + +

When we query our view, we get back a list of all documents sorted by date. Each row also includes the post title so we can construct links to posts. Table 1 is just a graphical representation of the view result. The actual result is JSON-encoded and contains a little more metadata: + 

+{
+  "total_rows": 3,
+  "offset": 0,
+  "rows": [
+    {
+      "key": "2009/01/15 15:52:20",
+      "id": "hello-world",
+      "value": "Hello World"
+    },
+
+    {
+      "key": "2009/02/17 21:13:39",
+      "id": "bought-a-cat",
+      "value": "Bought a Cat"
+    },
+
+    {
+      "key": "2009/01/30 18:04:11",
+      "id": "biking",
+      "value": "Biking"
+    }
+  ]
+}
+
+ +

Now, the actual result is not as nicely formatted and doesn’t include any superfluous whitespace or newlines, but this is better for you (and us!) to read and understand. Where does that "id" member in the result rows come from? That wasn’t there before. That’s because we omitted it earlier to avoid confusion. CouchDB automatically includes the document ID of the document that created the entry in the view result. We’ll use this as well when constructing links to the blog post pages. + +

Efficient Lookups

+ +

Let’s move on to the second use case for views: “building efficient indexes to find documents by any value or structure that resides in them.” We already explained the efficient indexing, but we skipped a few details. This is a good time to finish this discussion as we are looking at map functions that are a little more complex. + +

First, back to the B-trees! We explained that the B-tree that backs the key-sorted view result is built only once, when you first query a view, and all subsequent queries will just read the B-tree instead of executing the map function for all documents again. What happens, though, when you change a document, add a new one, or delete one? Easy: CouchDB is smart enough to find the rows in the view result that were created by a specific document. It marks them invalid so that they no longer show up in view results. If the document was deleted, we’re good—the resulting B-tree reflects the state of the database. If a document got updated, the new document is run through the map function and the resulting new lines are inserted into the B-tree at the correct spots. New documents are handled in the same way. Appendix F, The Power of B-trees demonstrates that a B-tree is a very efficient data structure for our needs, and the crash-only design of CouchDB databases is carried over to the view indexes as well. + +

To add one more point to the efficiency discussion: usually multiple documents are updated between view queries. The mechanism explained in the previous paragraph gets applied to all changes in the database since the last time the view was queried in a batch operation, which makes things even faster and is generally a better use of your resources. + +

Find One

+ +

On to more complex map functions. We said “find documents by any value or structure that resides in them.” We already explained how to extract a value by which to sort a list of views (our date field). The same mechanism is used for fast lookups. The URI to query to get a view’s result is /database/_design/designdocname/_view/viewname. This gives you a list of all rows in the view. We have only three documents, so things are small, but with thousands of documents, this can get long. You can add view parameters to the URI to constrain the result set. Say we know the date of a blog post. To find a single document, we would use /blog/_design/docs/_view/by_date?key="2009/01/30 18:04:11" to get the “Biking” blog post. Remember that you can place whatever you like in the key parameter to the emit() function. Whatever you put in there, we can now use to look up exactly—and fast. + +

Note that in the case where multiple rows have the same key (perhaps we design a view where the key is the name of the post’s author), key queries can return more than one row. + +
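The JSON result of the single-key query above would look something like this; the offset tells you where in the sorted view result the matching row sits:

+{
+  "total_rows": 3,
+  "offset": 1,
+  "rows": [
+    {
+      "key": "2009/01/30 18:04:11",
+      "id": "biking",
+      "value": "Biking"
+    }
+  ]
+}
+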

Find Many

+ +

We talked about “getting all posts for last month.” If it’s February now, this is as easy as /blog/_design/docs/_view/by_date?startkey="2010/01/01 00:00:00"&endkey="2010/02/00 00:00:00". The startkey and endkey parameters specify an inclusive range on which we can search. + +

To make things a little nicer and to prepare for a future example, we are going to change the format of our date field. Instead of a string, we are going to use an array, where individual members are part of a timestamp in decreasing significance. This sounds fancy, but it is rather easy. Instead of: + +

+{
+  "date": "2009/01/31 00:00:00"
+}
+
+ +

we use: + +

+"date": [2009, 1, 31, 0, 0, 0]
+
+ +

Our map function does not have to change for this, but our view result looks a little different. See Table 2, “New view results”. + +

+ + + + + + + + + + + + + + + + + + + +
KeyValue
[2009, 1, 15, 15, 52, 20]"Hello World"
[2009, 1, 30, 18, 4, 11]"Biking"
[2009, 2, 17, 21, 13, 39]"Bought a Cat"
+ +

Table 2. New view results + +

+ +

And our queries change to /blog/_design/docs/_view/by_date?key=[2009, 1, 1, 0, 0, 0] and /blog/_design/docs/_view/by_date?key=[2009, 01, 31, 0, 0, 0]. For all you care, this is just a change in syntax, not meaning. But it shows you the power of views. Not only can you construct an index with scalar values like strings and integers, you can also use JSON structures as keys for your views. Say we tag our documents with a list of tags and want to see all tags, but we don’t care for documents that have not been tagged. + +

+{
+  ...
+  tags: ["cool", "freak", "plankton"],
+  ...
+}
+
+ +
+{
+  ...
+  tags: [],
+  ...
+}
+
+ +
+function(doc) {
+  if(doc.tags.length > 0) {
+    for(var idx in doc.tags) {
+      emit(doc.tags[idx], null);
+    }
+  }
+}
+
+ +

This shows a few new things. You can have conditions on structure (if(doc.tags.length > 0)) instead of just values. This is also an example of how a map function calls emit() multiple times per document. And finally, you can pass null instead of a value to the value parameter. The same is true for the key parameter. We’ll see in a bit how that is useful. + +
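Given the two example documents above, the resulting view would contain one row per emitted tag, sorted by key:

+"cool", null
+"freak", null
+"plankton", null
+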

Reversed Results

+ +

To retrieve view results in reverse order, use the descending=true query parameter. If you are using a startkey parameter, you will find that CouchDB returns different rows or no rows at all. What’s up with that? + +

It’s pretty easy to understand when you see how view query options work under the hood. A view is stored in a tree structure for fast lookups. Whenever you query a view, this is how CouchDB operates: + +

    + +
  1. Starts reading at the top, or at the position that startkey specifies, if present.
  2. Returns one row at a time until the end or until it hits endkey, if present.
+ +

If you specify descending=true, the reading direction is reversed, not the sort order of the rows in the view. In addition, the same two-step procedure is followed. + +

Say you have a view result that looks like this: + +

+ + + + + + + + + + + + + + + + + + + +
KeyValue
0"foo"
1"bar"
2"baz"
+ +
+ +

Here are potential query options: ?startkey=1&descending=true. What will CouchDB do? See #1 above: it jumps to startkey, which is the row with the key 1, and starts reading backward until it hits the end of the view. So the particular result would be: + +

+ + + + + + + + + + + + + + + + + +
KeyValue
1"bar"
0"foo"
+ +
+ +

This is very likely not what you want. To get the rows with the indexes 1 and 2 in reverse order, you need to switch the startkey to endkey: endkey=1&descending=true: + +

+ + + + + + + + + + + + + + + + + +
KeyValue
2"baz"
1"bar"
+ +
+ +

Now that looks a lot better. CouchDB started reading at the bottom of the view and went backward until it hit endkey. + +

The View to Get Comments for Posts

+ +

We use an array key here to support the group_level reduce query parameter. CouchDB’s views are stored in the B-tree file structure (which will be described in more detail later on). Because of the way B-trees are structured, we can cache the intermediate reduce results in the non-leaf nodes of the tree, so reduce queries can be computed along arbitrary key ranges in logarithmic time. See Figure 1, “Comments map function”. + +

In the blog app, we use group_level reduce queries to compute the count of comments both on a per-post and total basis, achieved by querying the same view index with different methods. With some array keys, and assuming each key has the value 1: + +

+["a","b","c"]
+["a","b","e"]
+["a","c","m"]
+["b","a","c"]
+["b","a","g"]
+
+ +

the reduce view: + +

+function(keys, values, rereduce) {
+  return sum(values)
+}
+
+ +

returns the total number of rows between the start and end key. So with startkey=["a","b"]&endkey=["b"] (which includes the first three of the above keys) the result would equal 3. The effect is to count rows. If you’d like to count rows without depending on the row value, you can switch on the rereduce parameter: + +

+function(keys, values, rereduce) {
+  if (rereduce) {
+    return sum(values);
+  } else {
+    return values.length;
+  }
+}
+
+ +
+ + + +

Figure 1. Comments map function + +

+ +

This is the reduce view used by the example app to count comments, while utilizing the map to output the comments, which are more useful than just 1 over and over. It pays to spend some time playing around with map and reduce functions. Futon is OK for this, but it doesn’t give full access to all the query parameters. Writing your own test code for views in your language of choice is a great way to explore the nuances and capabilities of CouchDB’s incremental MapReduce system. + +

Anyway, with a group_level query, you’re basically running a series of reduce range queries: one for each group that shows up at the level you query. Let’s reprint the key list from earlier, grouped at level 1: + +

+["a"]   3
+["b"]   2
+
+ +

And at group_level=2: + +

+["a","b"]   2
+["a","c"]   1
+["b","a"]   2
+
+ +

Using the parameter group=true makes it behave as though it were group_level=Exact, so in the case of our current example, it would give the number 1 for each key, as there are no exactly duplicated keys. + +
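With the keys from our example, a group=true query would therefore return:

+["a","b","c"]   1
+["a","b","e"]   1
+["a","c","m"]   1
+["b","a","c"]   1
+["b","a","g"]   1
+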

Reduce/Rereduce

+ +

We briefly talked about the rereduce parameter to your reduce function. We’ll explain what’s up with it in this section. By now, you should have learned that your view result is stored in B-tree index structure for efficiency. The existence and use of the rereduce parameter is tightly coupled to how the B-tree index works. + +

Consider the map result shown in Example 1, “Example view result (mmm, food)”. + +

+ +
+"afrikan", 1
+"afrikan", 1
+"chinese", 1
+"chinese", 1
+"chinese", 1
+"chinese", 1
+"french", 1
+"italian", 1
+"italian", 1
+"spanish", 1
+"vietnamese", 1
+"vietnamese", 1
+
+ +

Example 1. Example view result (mmm, food) + +

+ +

When we want to find out how many dishes there are per origin, we can reuse the simple reduce function shown earlier: + +

+function(keys, values, rereduce) {
+  return sum(values);
+}
+
+ +

Figure 2, “The B-tree index” shows a simplified version of what the B-tree index looks like. We abbreviated the key strings. + +

+ + + +

Figure 2. The B-tree index + +

+ +

The view result is what computer science grads call a “pre-order” walk through the tree. We look at each element in each node starting from the left. Whenever we see that there is a subnode to descend into, we descend and start reading the elements in that subnode. When we have walked through the entire tree, we’re done. + +

You can see that CouchDB stores both keys and values inside each leaf node. In our case, it is simply always 1, but you might have a value where you count other results and then all rows have a different value. What’s important is that CouchDB runs all elements that are within a node into the reduce function (setting the rereduce parameter to false) and stores the result inside the parent node along with the edge to the subnode. In our case, each edge has a 3 representing the reduce value for the node it points to. + +

In reality, nodes have more than 1,600 elements in them. CouchDB computes the result for all the elements in multiple iterations over the elements in a single node, not all at once (which would be disastrous for memory consumption). + +

Now let’s see what happens when we run a query. We want to know how many "chinese" entries we have. The query option is simple: ?key="chinese". See Figure 3, “The B-tree index reduce result”. + +

+ + + +

Figure 3. The B-tree index reduce result + +

+ +

CouchDB detects that all values in the subnode include the "chinese" key. It concludes that it can take just the 3 value associated with that node to compute the final result. It then looks at the node to the left of it and sees that it's a node with keys outside the requested range (key= requests a range where the beginning and the end are the same value). It concludes that it has to use the "chinese" element's value and the other node's value and run them through the reduce function with the rereduce parameter set to true. + 

The reduce function effectively calculates 3 + 1 on query time and returns the desired result. Example 2, “The result is 4” shows some pseudocode that shows the last invocation of the reduce function with actual values. + +

+ +
+function(null, [3, 1], true) {
+  return sum([3, 1]);
+}
+
+ +

Example 2. The result is 4 + +

+ +

Now, we said your reduce function must actually reduce your values. If you see the B-tree, it should become obvious what happens when you don’t reduce your values. Consider the following map result and reduce function. This time we want to get a list of all the unique labels in our view: + +

+"abc", "afrikan"
+"cef", "afrikan"
+"fhi", "chinese"
+"hkl", "chinese"
+"ino", "chinese"
+"lqr", "chinese"
+"mtu", "french"
+"owx", "italian"
+"qza", "italian"
+"tdx", "spanish"
+"xfg", "vietnamese"
+"zul", "vietnamese"
+
+ +

We don’t care for the key here and only list all the labels we have. Our reduce function removes duplicates; see Example 3, “Don’t use this, it’s an example broken on purpose”. + +

+ +
+function(keys, values, rereduce) {
+  var unique_labels = {};
+  values.forEach(function(label) {
+    if(!unique_labels[label]) {
+      unique_labels[label] = true;
+    }
+  });
+
+  return unique_labels;
+}
+
+ +

Example 3. Don’t use this, it’s an example broken on purpose + +

+ +

This translates to Figure 4, “An overflowing reduce index”. + +

We hope you get the picture. The way the B-tree storage works means that if you don't actually reduce your data in the reduce function, you end up having CouchDB copy huge amounts of data around that grow linearly, if not faster, with the number of rows in your view. + 

CouchDB will be able to compute the final result, but only for views with a few rows. Anything larger will experience a ridiculously slow view build time. To help with that, CouchDB since version 0.10.0 will throw an error if your reduce function does not reduce its input values. + +

See Chapter 21, View Cookbook for SQL Jockeys for an example of how to compute unique lists with views. + +

+ + + +

Figure 4. An overflowing reduce index + +

+ +

Lessons Learned

+ + + +

Wrapping Up

+ +

Map functions are side effect–free functions that take a document as argument and emit key/value pairs. CouchDB stores the emitted rows by constructing a sorted B-tree index, so row lookups by key, as well as streaming operations across a range of rows, can be accomplished in a small memory and processing footprint, while writes avoid seeks. Generating a view takes O(N), where N is the total number of rows in the view. However, querying a view is very quick, as the B-tree remains shallow even when it contains many, many keys. + +

Reduce functions operate on the sorted rows emitted by map view functions. CouchDB’s reduce functionality takes advantage of one of the fundamental properties of B-tree indexes: for every leaf node (a sorted row), there is a chain of internal nodes reaching back to the root. Each leaf node in the B-tree carries a few rows (on the order of tens, depending on row size), and each internal node may link to a few leaf nodes or other internal nodes. + +

The reduce function is run on every node in the tree in order to calculate the final reduce value. The end result is a reduce function that can be incrementally updated upon changes to the map function, while recalculating the reduction values for a minimum number of nodes. The initial reduction is calculated once per each node (inner and leaf) in the tree. + +

When run on leaf nodes (which contain actual map rows), the reduce function’s third parameter, rereduce, is false. The arguments in this case are the keys and values as output by the map function. The function has a single returned reduction value, which is stored on the inner node that a working set of leaf nodes have in common, and is used as a cache in future reduce calculations. + +

When the reduce function is run on inner nodes, the rereduce flag is true. This allows the function to account for the fact that it will be receiving its own prior output. When rereduce is true, the values passed to the function are intermediate reduction values as cached from previous calculations. When the tree is more than two levels deep, the rereduce phase is repeated, consuming chunks of the previous level’s output until the final reduce value is calculated at the root node. + +

A common mistake new CouchDB users make is attempting to construct complex aggregate values with a reduce function. Full reductions should result in a scalar value, like 5, and not, for instance, a JSON hash with a set of unique keys and the count of each. The problem with this approach is that you’ll end up with a very large final value. The number of unique keys can be nearly as large as the number of total keys, even for a large set. It is fine to combine a few scalar calculations into one reduce function; for instance, to find the total, average, and standard deviation of a set of numbers in a single function. + +
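As a sketch of that last idea (not code from the example app), a reduce function can carry a small fixed-size object of scalars such as the sum, count, and sum of squares, from which the mean and standard deviation can be derived on the client:

+function(keys, values, rereduce) {
+  var result = {sum: 0, count: 0, sumsqr: 0};
+  if (rereduce) {
+    // combine intermediate results from lower levels of the B-tree
+    for (var i = 0; i < values.length; i++) {
+      result.sum += values[i].sum;
+      result.count += values[i].count;
+      result.sumsqr += values[i].sumsqr;
+    }
+  } else {
+    // values are the raw numbers emitted by the map function
+    for (var i = 0; i < values.length; i++) {
+      result.sum += values[i];
+      result.count += 1;
+      result.sumsqr += values[i] * values[i];
+    }
+  }
+  return result;
+}
+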

If you’re interested in pushing the edge of CouchDB’s incremental reduce functionality, have a look at Google’s paper on Sawzall, which gives examples of some of the more exotic reductions that can be accomplished in a system with similar constraints. diff --git a/editions/1/zh/views/01.png b/editions/1/zh/views/01.png new file mode 100644 index 0000000..b102d5e Binary files /dev/null and b/editions/1/zh/views/01.png differ diff --git a/editions/1/zh/views/02.png b/editions/1/zh/views/02.png new file mode 100644 index 0000000..4e9f3dc Binary files /dev/null and b/editions/1/zh/views/02.png differ diff --git a/editions/1/zh/views/03.png b/editions/1/zh/views/03.png new file mode 100644 index 0000000..83929ee Binary files /dev/null and b/editions/1/zh/views/03.png differ diff --git a/editions/1/zh/views/04.png b/editions/1/zh/views/04.png new file mode 100644 index 0000000..51e3de8 Binary files /dev/null and b/editions/1/zh/views/04.png differ diff --git a/editions/1/zh/why.html b/editions/1/zh/why.html new file mode 100644 index 0000000..83d4408 --- /dev/null +++ b/editions/1/zh/why.html @@ -0,0 +1,159 @@ +为什么选择CouchDB + + + + + + + + + + + +

Why CouchDB?

+ +

Apache CouchDB is one of a new breed of database management systems. This chapter explains why there is a need for new systems and the motivations behind building CouchDB.

As CouchDB developers, we're naturally very excited to be using CouchDB. In this chapter we'll share with you the reasons for our enthusiasm. We'll show you how CouchDB's schema-free document model is a better fit for common applications, how the built-in query engine lets you use and process your data efficiently, and how CouchDB's design lends itself to modularization and scalability.

Relax

+ +

If there is one word to describe CouchDB, it is relax. It appears in the title of this book, it is the byline of CouchDB's official logo, and when you start CouchDB, you see:

+Apache CouchDB has started. Time to relax.
+
+ +

Why is relaxation important? Developer productivity has roughly doubled over the past five years. The chief reason for this acceleration is the availability of more, and more usable, tools. Take Ruby on Rails as an example. It is an immensely complex framework, yet it is remarkably easy to get started with. Rails became a success story because it made ease of use a core design goal. This is one reason CouchDB is all about relaxing: learning CouchDB and understanding its core concepts should feel natural to most people with a web development background, and explaining it to non-technical people should be fairly easy as well.

Getting out of the way when creative people want to build a specialized solution is a core feature of CouchDB and something it aims to do well. We found existing tools to be too cumbersome to work with during development or in production, so we focused on making CouchDB plainly simple, even fun, to use. Chapters 3 and 4 demonstrate the intuitive HTTP-based REST API.

Another area where CouchDB users can relax is the production setting. If you have a live application, CouchDB again goes out of its way not to trouble you. Its internal architecture is fault-tolerant: failures occur in a controlled environment and are dealt with gracefully. A single problem does not cascade through the whole serving system; it stays isolated to a single request.

CouchDB's core concepts are simple (yet powerful) and easy to understand. The operations team (if you have one; otherwise, that's you) does not have to fear random behavior and untraceable errors. If something does go wrong, you can fairly easily find out what the problem is, but such cases are rare.

CouchDB is also designed to handle varying traffic gracefully. Say your website experiences a sudden spike in traffic. CouchDB generally absorbs a lot of concurrent requests without falling over; it may take a little more time for each request to finish, but they all get answered. When the spike is over, CouchDB runs as fast as before.

The third area of relaxation is growing or shrinking the underlying hardware of your application. This is commonly referred to as scaling. CouchDB enforces a set of constraints on the programmer. At first sight, CouchDB can seem inflexible, but some features are left out by design, because if CouchDB supported them, applications built on it could not scale up or down as well. We'll cover scaling CouchDB in Part IV, "Deploying CouchDB".

In short: CouchDB doesn't let you do things that would get you into trouble later on. This sometimes means unlearning "best practices" you may have picked up in your current or past work. The "Recipes" chapter contains a list of common tasks and how to solve them in CouchDB.

A Different Way to Model Your Data

+ +

We believe CouchDB will drastically change the way you build document-based applications. CouchDB combines an intuitive document storage model with a powerful query engine in a way so simple that you might be tempted to ask, "Why has no one built something like this before?"

+ +

Django may be built for the Web, but CouchDB is built of the Web. I've never seen software that so completely embraces the philosophies behind HTTP. CouchDB makes Django look old-school, in the same way that Django makes ASP look outdated.

—Jacob Kaplan-Moss, Django developer

+ +

CouchDB's design borrows heavily from web architecture and the concepts of resources, methods, and representations. It adds powerful ways to query, match, combine, and filter your data. Add fault tolerance, extreme scalability, and incremental replication, and CouchDB defines a sweet spot for document databases.

A Better Fit for Common Applications

+ +

We write software to improve our own lives and the lives of others. Often this means dealing with mundane information such as contacts, invoices, and receipts, and processing it with computer applications. CouchDB is a great fit for applications like this, because it embraces the natural notion of evolving, self-contained documents as the very core of its data model.

Self-Contained Data

+ +

An invoice contains all the pertinent information about a single transaction: the seller, the buyer, the date, and a list of the items or services sold. As shown in Figure 1, "Self-contained documents", this little slip of paper does not hold abstract references to other slips of paper carrying the seller's name and address. Accountants appreciate the simplicity of having everything in one place, and, given the choice, programmers appreciate it too.

+ + + +

Figure 1. Self-contained documents

+ +

Yet this is exactly what we do when we model our data in a relational database! Each invoice is stored in a table as a row that refers to other rows in other tables: one row for the seller's information, one for the buyer's, one row for each item billed, and more rows still describing the item details, the manufacturer details, and so on and so forth.

This is not meant as a detraction of the relational model, which is widely accepted and very useful for a number of reasons. The point of the previous section is simply that sometimes your model does not fit the way you want to work with your data.

To illustrate a different way of modeling data, one that is closer to the real world, let's look at a contacts database: a pile of business cards. Just like the invoice example we discussed earlier, a business card carries all the important information, written right on the card. We call this "self-contained" data, and it is an important concept for understanding document databases like CouchDB.

Syntax and Semantics

+ +

Most business cards contain roughly the same information: someone's name, a job title, and some contact information. The exact form of that information varies from card to card, but the general kinds of information stay the same, and we easily recognize the result as a business card. In that sense, we can describe a business card as a real-world document.

Jan's business card might include a phone number but no fax number, whereas Chris's card has both. Jan does not need to compensate for the missing fax number by awkwardly writing "Fax: none" on his card; he simply leaves it off, thereby indicating that he has no fax number.

We can see that real-world documents of the same type, such as business cards, are usually similar in semantics, that is, in the kinds of information they carry, but differ widely in syntax, that is, in how that information is structured. As human beings, we handle this variation quite naturally.

While a traditional relational database requires you to model your data up front, CouchDB's schema-free design lets you put that burden down; it gives you a powerful way to aggregate your data after the fact, just as we do with real-world documents. We'll look in depth at how to design applications around this storage model.

+ +

Building Blocks for Larger Systems

+ +

CouchDB is a useful storage system on its own. You can build many applications with the tools CouchDB gives you. But CouchDB was designed with a bigger picture in mind. Its components can be used, in slightly different ways, as building blocks that solve storage problems for larger and more complex systems.

Whether you need a system that runs blazingly fast but is not too concerned with reliability (think logging), or one that keeps data in two or more physical locations for high reliability at the cost of some performance, CouchDB lets you build these kinds of systems.

There are many things you can do to make a system behave better in one particular area, but whatever you do will always affect some other area. Our first example of this is the CAP theorem, discussed in the next chapter. To give you an idea of how other factors affect a storage system, see Figures 2 and 3.

Reducing a system's latency (and this is not limited to storage systems) affects its concurrency and throughput.

+ + + +

Figure 2. Throughput, latency, and concurrency

+ +
+ + + +

Figure 3. Scaling: write requests, read requests, and data

+ +

When you want to scale, there are three distinct issues to deal with: scaling read requests, scaling write requests, and scaling data. Alongside these three, and the items in Figures 2 and 3, there are further attributes such as reliability and simplicity. You can draw many diagrams like these; different features or attributes pull in different directions and thereby shape the system they describe.

CouchDB is very flexible and gives you enough building blocks to shape a system that fits your particular problem. That is not to say CouchDB can be bent to solve every problem: CouchDB is no silver bullet, but in the area of data storage it can take you a long way.

CouchDB Replication

+ +

CouchDB replication is one of these building blocks. Its fundamental function is to synchronize two or more CouchDB databases. That may sound simple, but this simplicity is key to letting replication solve a number of problems: reliably synchronizing redundant data across multiple machines; synchronizing data across a cluster of CouchDB instances, where each instance serves a subset of all requests (load balancing); and synchronizing data between physically distant locations, such as an office in New York and another in Tokyo.

CouchDB replication uses the same REST API that all clients use. HTTP is ubiquitous and well understood. Replication is incremental, meaning that if anything goes wrong during replication, such as a dropped network connection, the next replication picks up where the previous one left off. Replication also transfers only the data that actually needs to be synchronized.
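For example, a one-off replication between two databases can be triggered over that same HTTP API; the database names and target host below are only illustrative:

+POST /_replicate
+{"source":"albums", "target":"http://example.org:5984/albums-backup"}
+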

A core assumption CouchDB makes is that things can go wrong, such as a network connection dropping, and it is designed to recover gracefully instead of assuming that everything will run smoothly. The incremental design of the replication system illustrates this best. The thinking behind "things can go wrong" is nicely captured by the Fallacies of Distributed Computing:

    + +
  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.
  16. + +
+ +

Existing tools often try to hide these problems by assuming that, for a particular system, the network has none (or all) of the issues above. When things finally do go wrong, such systems break down. CouchDB, in contrast, does not try to hide network problems; it simply handles errors gracefully and notifies you when action on your part is required.

Local Data Is King

+ +

CouchDB takes quite a few lessons from the Web, but there is one thing the Web does not do well: latency. Whenever you wait for an application to respond or a website to render, you are waiting on a network connection that is not fast enough at that moment. Waiting seconds instead of milliseconds greatly affects the user experience and thus user satisfaction.

Worse still: what do you do when you are offline? This happens all the time: your DSL or cable provider has an outage, or your iPhone, G1, or BlackBerry has no signal. Without a network connection, there is no way to get at your data.

CouchDB can solve this problem too, and it again shows why scalability matters, only this time it is scaling down. Imagine CouchDB installed on phones and other mobile devices that synchronize with a central CouchDB whenever they are online. Synchronization is not bound by user-interface constraints such as sub-second response times; it is easier to optimize for high bandwidth and high latency than for low bandwidth and low latency. Mobile applications can then read their data from the local CouchDB, so no remote network access is needed and latency is low by default.

But CouchDB on a phone, can that really work? Erlang, the language CouchDB is implemented in, was designed to run on embedded devices smaller and less powerful than today's phones.

Wrapping Up

+ +

The next chapter takes a deeper look at CouchDB's distributed nature. We hope we have whetted your appetite. Let's move on.

Installing on Windows

+ +

CouchDB does not officially support Windows. CouchDB intends to provide an official Windows installer at some point in the future, so this may change. At the time this book is going to print, there is, however, an unofficial binary installer. + +

This is unofficial software, so please remember to exercise additional caution when downloading or installing it, as it may damage your system. Imagine a fearsomely comprehensive disclaimer of author liability. Now fear, comprehensively. + +

We recommend that you ask on the CouchDB mailing lists for further help. + +

CouchDB will have official Windows support as part of the 1.0 release.