test
int32_t GetDaffodil() {
for( int i = 100; i < 1000; ++i ){
if( i == pow(i/100,3) + pow((i%100)/10,3) + pow(i%10,3) ){
printf( "%d\n", i );
}
}
}
int32_t GetDaffodil() {
for( int i = 100; i < 1000; ++i ){
if( i == pow(i/100,3) + pow((i%100)/10,3) + pow(i%10,3) ){
printf( "%d\n", i );
}
}
}
在前MySQL核心开发人员Yoshinori Matsunobu宣传handlerSocket之后,最近这个插件颇受关注 。它的核心思想很简单,在profiling后发现对于简单的主键查询,SQL层的消耗很大。在数据集较小,能够在内存中存放的情况下(此时随机的Read IO可以忽略),SQL层就成了最大的瓶颈。为方便不能爬墙的同学,RT一下原文给出的剖分结果:
samples % app name symbol name 259130 4.5199 mysqld MYSQLparse(void*) 196841 3.4334 mysqld my_pthread_fastmutex_lock 106439 1.8566 libc-2.5.so _int_malloc 94583 1.6498 bnx2 /bnx2 84550 1.4748 ha_innodb_plugin.so.0.0.0 ut_delay 67945 1.1851 mysqld _ZL20make_join_statistics P4JOINP10TABLE_LISTP4ItemP16st_dynamic_array 63435 1.1065 mysqld JOIN::optimize() 55825 0.9737 vmlinux wakeup_stack_begin 55054 0.9603 mysqld MYSQLlex(void*, void*) 50833 0.8867 libpthread-2.5.so pthread_mutex_trylock 49602 0.8652 ha_innodb_plugin.so.0.0.0 row_search_for_mysql 47518 0.8288 libc-2.5.so memcpy 46957 0.8190 vmlinux .text.elf_core_dump 46499 0.8111 libc-2.5.so malloc
可以看出,简单的SQL查询方式下,有很大比例的时间消耗在SQL解析、Query Plan、表锁等SQL层上。因此,如果跳过SQL层,直接与存储引擎进行交互,就可以获取很大程度的性能提升。
基于这一思想,他们的团队开发了HandlerSocket。测试结果显示,单机查询性能能够到达75W+(100w条数据,做纯内存主键查询),这个数字意味着已经超过了现在绝大多数KV存储系统、甚至缓存系统的性能。
MySQL的Vadim觉得这玩艺儿挺靠谱(enjoyed),也对它进行了测试,并在MySQL Performance Blog上给出了测试的结果,这个测试关注了数据量大到需要换入换出时handlersocket的性能表现。结论与预期的相符:当数据在内存能装下时,性能稳定在60W+ rps的水平,但当数据大到一定级别时,性能开始下降。此时主要的瓶颈就在于IO,像FusionIO这样强悍的硬件,还可以支撑到40W+,而普通的RAID10就已经惨不忍睹。 也就是说数据量大拼的就是硬件IO性能,此时HandlerSocket在SQL层节省的CPU消耗,在巨大的IO成本前不值一提。
再RT一下Yoshinori给出的结构图:

图一:HandlerSocket结构 (来源于slidershare的PPT)

图二: HandlerSocket结构 (来源于原文)
这两张图大同小异,但主旨都是在正常的SQL解析层外,HandlerSocket为我们开了一条后门,直接通过MySQL的HandlerInterface与存储引擎打交道。第二张结构图更详细,可明显看出HandlerSocket要做的事情比正常的SQL少很多。
在原文作者列出了HandlerSocket的一些特性,整理了一下相对重要的,再加上自己的一些粗浅的理解:
作为一个较新的开源项目,HandlerSocket的文档比较薄弱。幸好它的代码还是很简单的,有什么疑问翻一下代码基本都能解决。这里就不展开很细致的代码分析,主要分析一下代码层面重要的几个点。
图一告诉我们,HandlerSocket和SQL Layer在同一层,但实际上这个地方有点小trick。它以daemon plugin的形式的,在这个意义上说,它和InnoDB/MyISAM等引擎插件在同一层;但在daemon_handlersocket_init里,就自己listen端口、起worker线程、接收请求、直接与存储引擎交互。
没有插件开发经验的同学,理解这个trick可能会稍有些疑惑:它是如何被调度的?它又是如何直接访问其他存储引擎的?
worker thread的流程清晰明了,总体流程如下:

工作流程的图示中可以看出,在一次epoll_wait返回的请求,将一并commit,这也是HandlerSocket的基本事务模型:
协议
HandlerSocket使用了自定义协议进行交互。具体协议有文档说明,参考源码目录docs-en/protocol.en.txt。协议这里就不详细展开,只提一下基本语法:
两台DELL PowerEdge 2950,4核Intel Xeon 5510 @2.66G, 16G内存
Red Hat Enterprise Linux AS release 4 (Nahant Update 3)
mysql 5.1.53 Linux-generic-source
HandlerSocket a485973
MySQL:
innodb_buffer_pool_size = 8G innodb_flush_log_at_trx_commit = 2 innodb_thread_concurrency = 16 innodb_log_buffer_size = 8M innodb_log_file_size = 256M innodb_max_dirty_pages_pct = 90
HandlerSocket:
loose_handlersocket_port = 9998 loose_handlersocket_port_wr = 9999 loose_handlersocket_threads = 4 loose_handlersocket_threads_wr = 1 open_files_limit = 65535
CREATE TABLE user ( user_id INT UNSIGNED PRIMARY KEY, user_name VARCHAR(50), user_email VARCHAR(255), created DATETIME ) ENGINE=InnoDB;
seq 1000000 | sed 's/\(.*\)/INSERT INTO user set user_id=\1, user_name=\1, user_email=\1;/' > handlersocket.sql time mysql -D test < handlersocket.sql

扩展hstest程序,增加测试用例,插入与SQL相同数据。


关闭qcache mysqlslap --query="select user_name from user where user_id=1" --number-of-queries=10000000 --concurrency=30 --host=HOST --port=3306


./hstest test=11 tablesize=1000000 host=10.26.53.34 hsport=9998 num=10000000 num_threads=100 timelimit=10


在这里,我们需要修改loose_handlersocket_threads_wr,将WritePort的工作线程数为4,保持与ReadPort一致,之后再运行hstest。


简单整理分析一下:

总体来看,HandlerSocket有着很不错的性能表现。在以下case应该有不错的应用前景:
在GoDaddy上架了个dokuwiki做简单的pkm。在管理员登录后,dokuwiki总有一个安全提示,说dokuwiki的安全设定有问题。之前没去折腾它,一直忍着大红叉,直到今天无意打开了指导文章,按照上面的提示,访问了一下url: http://xiaoy.info/pkm/dokuwiki/data/pages/wiki/dokuwiki.txt,居然很顺利的就把wiki内容取出来了(别试了,现在已经不行了
),按图索骥就可得到所有wiki,就算设定了private也照样可以拉到。赶紧看解决方案,官方给出两个方法:
杯具的是,这两个方法对godaddy免费windows主机都无效。它上面跑的是IIS,而IIS想禁用关键目录访问,必须有IIS控制权限,免费主机当然是没有的。
而第2个方法,godaddy为每个用户只提供了htdocs目录,无法把文件放置于其他地方。
难道就没有别的解决方案了么?分析一下现在的目标是:让用户无法直接访问到关键目录。官方两种方法为了达到这个目的,采用了webserver配置或者htdocs的方法,但达到这个目的只有这两个方法么?显然答案是no.在godaddy上,可以用很简单的基于IIS的方案:
修改web.config,用rule匹配的方式,禁用掉data目录和conf目录的访问:
<rule name="secure"> <match url="^pkm/dokuwiki/(data|conf|bin|inc)/?(.*)?$" /> <action type="CustomResponse" statusCode="403" statusReason="Forbidded" statusDescription="Forbidden" /> </rule>
当识别到关键目录的直接访问时,直接返回403错误,这样就很简单明了的解决了问题
。
参考: http://learn.iis.net/page.aspx/557/translate-htaccess-content-to-iis-webconfig/
OPROFILE是一个开源的profiling工具,类似于vtune和gprof。但不需要像gprof一样,必须优雅退出才可以剖分。今天在看HandlerSocket文章时,想重复一下作者的oprofile结果,因此,安装试用了一下,遇到一些问题,过程记录如下。
1. 连到服务器开发机上,发现已经安装了oprofile。尝试初始化: sudo opcontrol –init 出现错误:kernel doesn’t support oprofile. 看着很吓人,以为需要重新编译内核。但仔细看了看介绍,说2.6的内核已经以module的方式支持oprofile。因此,只需要sudo modprobe oprofile一下,如果正常就可以继续,否则就说明没有安装这个module,需要重编内核,在menuconfig时选择是否添加。。
如果modprobe正确,仍出现这个错误,那说明opcontrol没有对应的kernel driver. 解决方案是重新编译安装oprofile。
2. 安装
wget http://prdownloads.sourceforge.net/oprofile/oprofile-0.9.6.tar.gz
tar xvfz oprofile.*.gz ./configure --prefix=/home/work/software/output/ --with-kernel-support
出现
/usr/bin/ld: /usr/lib/gcc/x86_64-redhat-linux/3.4.4/../../../../lib64/libbfd.a(archures.o): relocation R_X86_64_32 against `a local symbol’ can not be used when making a shared object; recompile with –fPIC
./configure --prefix=/home/work/software/output/ --with-kernel-support --enable-shared=no
再make

服务器用的是64位的系统,而它默认依赖到32位的库.因此,导出LDFLAGS=-L/usr/lib64后,make clean后再make,就成功通过了.

最后还出个waring,提示要添加用户,如果不调试JIT,直接忽略,不用添加用户。
3. 这就可以开始使用oprofile了,不过需要注意的是,需要有root权限才可以运行,请向系统管理员索要sudo权限。
4. 对mysqld进行profile为例:
sudo opcontrol --reset sudo opcontrol --separate=lib --no-vmlinux --start --image=/home/software/output/libexec/mysqld 在其他机器起压力,压力停止后再进行后续操作 sudo opcontrol --dump sudo opcontrol --shutdown opreport -l /home/software/output/libexec/mysqld opannotate -s /home/software/output/libexec/mysqld
在B树的wiki页面上,有一小段内容介绍为什么数据库使用B树,很通俗易懂,推荐。
Time to search a sorted file
Usually, sorting and searching algorithms have been characterized by the number of comparison operations that must be performed using order notation. A binary search of a sorted table with N records, for example, can be done in O(log2N) comparisons. If the table had 1,000,000 records, then a specific record could be located with about 20 comparisons: log21,000,000 = 19.931….Large databases have historically been kept on disk drives. The time to read a record on a disk drive can dominate the time needed to compare keys once the record is available. The time to read a record from a disk drive involves a seek time and a rotational delay. The seek time may be 0 to 20 or more milliseconds, and the rotational delay averages about half the rotation period. For a 7200 RPM drive, the rotation period is 8.33 milliseconds. For a drive such as the Seagate ST3500320NS, the track-to-track seek time is 0.8 milliseconds and the average reading seek time is 8.5 milliseconds.[2] For simplicity, assume reading from disk takes about 10 milliseconds.
Naively, then, the time to locate one record out of a million would take 20 disk reads times 10 milliseconds per disk read, which is 0.2 second.
The time won’t be that bad because individual records are grouped together in a disk block. A disk block might be 16 kilobytes. If each record is 160 bytes, then 100 records could be stored in each block. The disk read time above was actually for an entire block. Once the disk head is in position, one or more disk blocks can be read with little delay. With 100 records per block, the last 6 or so comparisons don’t need to do any disk reads—the comparisons are all within the last disk block read.
To speed the search further, the first 13 to 14 comparisons (which each required a disk access) must be sped up.
排序和查找算法通常使用“比较次数”衡量。举例来说,对含N条记录的有序表进行二分查询,需要做O(log2N)次比较。如果表中有100万条记录,那么查找特定记录,需要进行20次比较: log21,000,000 = 19.931。
大型数据库通常将数据保存在磁盘中。比较key时间将主要消耗在磁盘读取。磁盘读取时间包含两部分:磁头寻道时间和盘片旋转延时。寻道时间大约是0~20ms,甚至更多;平均旋转延时是一半的盘片旋转周期。7200RPM的磁盘,旋转周期大约为8.3ms。以Seagate ST3500320NS为例,trace-to-track的寻道时间大约0.8ms,平均读寻道时间为8.5ms。简单起见,假设从磁盘读需要10ms。
再简单假设在100w条记录中查找需要20次磁盘读,那么一次查询需要0.2s。
实际时间不会这么糟糕,因为记录实际上是分组分布在“磁盘块”上。假定磁盘块是16k,每条记录是160字节,那么:在一个磁盘块上可以存储100条记录。而上述分析的磁盘读时间实际上会读入一整个块。一旦磁头的位置适合,再读入数据的延时会变少。当一个块中有100条记录时,那么最后6次的比较不需要进行磁盘读(数据已在最后一次读的块中)。
为了进一步加速查找,我们需要优化前13~14次查询,这些查询都需要一次磁盘访问。
An index speeds the search
A significant improvement can be made with an index. In the example above, initial disk reads narrowed the search range by a factor of two. That can be improved substantially by creating an auxiliary index that contains the first record in each disk block (sometimes called a sparse index). This auxiliary index would be 1% of the size of the original database, but it can be searched more quickly. Finding an entry in the auxiliary index would tell us which block to search in the main database; after searching the auxiliary index, we would have to search only that one block of the main database—at a cost of one more disk read. The index would hold 10,000 entries, so it would take at most 14 comparisons. Like the main database, the last 6 or so comparisons in the aux index would be on the same disk block. The index could be searched in about 8 disk reads, and the desired record could be accessed in 9 disk reads.
The trick of creating an auxiliary index can be repeated to make an auxiliary index to the auxiliary index. That would make an aux-aux index that would need only 100 entries and would fit in one disk block.
Instead of reading 14 disk blocks to find the desired record, we only need to read 3 blocks. Reading and searching the first (and only) block of the aux-aux index identifies the relevant block in aux-index. Reading and searching that aux-index block identifies the relevant block in the main database. Instead of 150 milliseconds, we need only 30 milliseconds to get the record.
The auxiliary indices have turned the search problem from a binary search requiring roughly log2N disk reads to one requiring only logbN disk reads where b is the blocking factor (the number of entries per block: b = 100 entries per block; logb1,000,000 = 3 reads).
In practice, if the main database is being frequently searched, the aux-aux index and much of the aux index may reside in a disk cache, so they would not incur a disk read.
索引能带来显著地改善。在上面的例子中,第一次磁盘读将搜索范围缩小一半。可以创建索引优化,索引中包含每一个磁盘块的第一个记录(称之为“稀疏索引”)。这个索引只有原数据库1%的大小,但却可以进行很快的查找。在索引找到的记录会告诉我们数据在哪个块,在查找索引完成后,只需要在主数据库找到相应的块,代价是一次磁盘读。索引中有1万个记录,所以它只需要14次比较,与主数据库一样,最后6次比较只需要一次磁盘读,因为它们在同一个磁盘块上。在索引中的查找需要8次磁盘读,查找到需要记录总共需要9次磁盘读。
这种创建索引的技巧可以嵌套使用:为索引创建索引。在我们的例子中,索引的索引只有100个记录,可以放进一个磁盘块。
因此,我们实际上只需要3次磁盘读就可以找到记录,而不是14次。一次磁盘读取到索引的索引,就可以定位到相关的索引块;再通过索引块就可以找到数据块。这样我们查找一条记录的时间只需要30ms。
索引改变了查找问题:原本需要log2N次磁盘读,现在只需要logbN,b是块因子(每个磁盘块含有的记录数,上例中b=100, log100 1000000=3)。
在实际读频繁的数据库中,索引的索引以及大部分索引都会在磁盘cache中,还可以节省下来磁盘读。
Insertions and deletions cause trouble
If the database does not change, then compiling the index is simple to do, and the index need never be changed. If there are changes, then managing the database and its index becomes more complicated.
Deleting records from a database doesn’t cause much trouble. The index can stay the same, and the record can just be marked as deleted. The database stays in sorted order. If there are a lot of deletions, then the searching and storage become less efficient.
Insertions are a disaster in a sorted sequential file because room for the inserted record must be made. Inserting a record before the first record in the file requires shifting all of the records down one. Such an operation is just too expensive to be practical.
A trick is to leave some space lying around to be used for insertions. Instead of densely storing all the records in a block, the block can have some free space to allow for subsequent insertions. Those records would be marked as if they were "deleted" records.
Now, both insertions and deletions are fast as long as space is available on a block. If an insertion won’t fit on the block, then some free space on some nearby block must be found and the auxiliary indices adjusted. The hope is enough space is nearby that a lot of blocks do not need to be reorganized. Alternatively, some out-of-sequence disk blocks may be used.
如果数据库不改变,那么创建索引很简单,也不会被改变。但如果存在更改操作,数据库及索引的维护就会变得复杂的多。
删除记录不会带来太大的问题。索引可以仍然存在,只需要将记录标记为己删除。数据库仍然保持有序。但如果有大量的删除,查找效率和存储空间浪费较多。
在有序文件中插入记录会导致灾难,因为需要空间存放插入的纪录。如果在第一条记录前插入一条,那么整个文件中的记录都需要向后移动。这样的操作在实际中代价过于昂贵。
一个技巧是预留一些空间,不写满磁盘块,而是留一些空白为后续的插入准备空间。这些记录可以标记为“已删除”。
现在如果块中空间足够,插入和删除都很快了。但如果插入的记录空间不够,就需要在附近的块中找到一些空间出来,并且索引需要做相应的调整。我们只能寄希望于附近的块有足够的空间,否则就需要大量的磁盘块。有的地方使用了替换方案,一部分块可以乱序。
The B-tree uses all those ideas
The B-tree uses all the above ideas. It keeps the records in sorted order so they may be sequentially traversed. It uses a hierarchical index to minimize the number of disk reads. The index is elegantly adjusted with a recursive algorithm. The B-tree uses partially full blocks to speed insertions and deletions. In addition, a B-tree minimizes waste by making sure the interior nodes are at least 1/2 full. A B-tree can handle an arbitrary number of insertions and deletions.
B树包含了上述所有的思想: