Table of Contents
http://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/
Two kinds of flash:
Storage cell types:
MLC/TLC are cheaper; most drives today are MLC or TLC. For update-heavy workloads, SLC is still the better choice.
The numbers vendors publish are usually measured under specific conditions and do not necessarily reflect real-world performance:
In his articles about common flaws in SSD benchmarking [66], Marc Bevand mentioned that for instance it is common for the IOPS of random write workloads to be reported without any mention of the span of the LBA, and that many IOPS are also reported for queue depth of 1 instead of the maximum value for the drive being tested. There are also many cases of bugs and misuses of the benchmarking tools. Correctly assessing the performance of SSDs is not an easy task.
LBA span and queue depth matter a lot.
The required test duration depends heavily on the size of the SSD:
the performance of SSDs only drops under a sustained workload of random writes, which depending on the total size of the SSD can take just 30 minutes or up to three hours
As the chart in the article shows, performance drops noticeably after 30 minutes:
The parameters used are generally the following:
- The type of workload: can be a specific benchmark based on data collected from users, or just only sequential or random accesses of the same type (ex: only random writes)
- The percentages of reads and writes performed concurrently (ex: 30% reads and 70% writes)
- The queue length: this is the number of concurrent execution threads running commands on a drive
- The size of the data chunks being accessed (4 KB, 8 KB, etc.)
Metrics reported:
- Throughput (for sequential reads/writes)
- IOPS (for random reads/writes)
- Latency
Quite a few concepts:
Reading one page at a time is easy to understand, but how does writing one page at a time work?? (Pages cannot be overwritten in place: an update goes to a free page, and the stale page is erased later at block granularity.)
Write amplification.
Wear leveling: avoid always writing to the same cells.
The Flash Translation Layer (FTL) is an address-mapping mechanism that exposes disk-like logical addresses.
There is a mapping table from logical addresses to physical pages.
This mapping table is stored in the RAM of the SSD for speed of access, and is persisted in flash memory in case of power failure.
When the SSD powers up, the table is read from the persisted version and reconstructed into the RAM of the SSD
Page-level mapping costs too much memory. Let’s assume that an SSD drive has 256 pages per block. This means that block-level mapping requires 256 times less memory than page-level mapping.
Block-level mapping makes write amplification much worse.
garbage collection.
Few companies make SSD controllers:
seven companies are providing controllers for 90% of the solid-state drive market (many companies build SSDs, but few build the controller chips and algorithms)
Erase: 1500-3500 μs Write: 250-1500 μs
A less important reason for blocks to be moved is the read disturb. Reading can change the state of nearby cells, thus blocks need to be moved around after a certain number of reads have been reached [14].
TRIM: a command which can be sent by the operating system to notify the SSD controller that pages are no longer in use in the logical space
This helps the controller garbage-collect more effectively.
The TRIM command will only work if the SSD controller, the operating system, and the filesystem are supporting it.
Under Linux, support for the ATA TRIM was added in version 2.6.33. The ext2 and ext3 filesystems do not support TRIM; ext4 and XFS, among others, do support it.
Over-provisioning is simply having more physical blocks than logical blocks, by keeping a ratio of the physical blocks reserved for the controller and not visible to the user. Most manufacturers of professional SSDs already include some over-provisioning, generally in the order of 7 to 25%
Some SSD controllers offer the ATA Secure Erase functionality
Due to physical limitations, an asynchronous NAND-flash I/O bus cannot provide more than 32-40 MB/s of bandwidth [5].
So in practice, an SSD controller drives multiple flash chips in parallel, much like RAID.
Inside an SSD, because the mapping is dynamic, contiguous addresses in the logical space may refer to addresses that are not contiguous in the physical space.
Many SSDs merge writes through hybrid log-block mapping, which reduces write amplification.
SSDs suit read-heavy, write-light workloads.
So should we actually avoid small writes or not?
http://codecapsule.com/2012/11/07/ikvs-implementing-a-key-value-store-table-of-contents/
July 8, 2014: I am currently implementing the key-value store and will write a post about it when it’s done
Parts 1-4: the author spends ages choosing the name, the API shape, naming conventions....
Purpose of the exercise:
I am starting this project as a way to refresh my knowledge of some fundamentals of hardcore back-end engineering
Reading books and Wikipedia articles is boring and exempt of practice
Existing ones: Redis, MongoDB, memcached, BerkeleyDB, Kyoto Cabinet and LevelDB.
Features he intends to provide:
- Adapt to a specific data representation (ex: graphs, geographic data, etc.)
- Adapt to a specific operation (ex: performing very well for reads only, or writes only, etc.)
- Offer more data access options. For instance in LevelDB, data can be accessed forward or backward, with iterators
Nothing particularly distinctive, it seems..
Mainly surveys:
DBM
Berkeley DB, at the time an improved DBM
Memcached and MemcacheDB
MongoDB
Redis
OpenLDAP
SQLite
Split into these modules:
Too many... the last four shouldn't be considered.
The author finds Kyoto Cabinet's code heavily coupled, for example: the definition of the Parametrization module inside the Core.
String handling:
- LevelDB uses a specialized class called "Slice"
- Kyoto Cabinet uses std::string
Memory management:
- Kyoto Cabinet uses mmap()
- LevelDB uses an LSM-tree
Data Storage
On the code:
Kyoto Cabinet puts implementations in the .h files. Its code is much better than Tokyo Cabinet's ("The overall architecture and naming conventions have been greatly improved.") but is still quite bad:
embcomp, trhard, fmtver(), fpow()
LevelDB's naming, by contrast, is very good.
Kyoto Cabinet also has serious code duplication.
The candidates the author considers: Kyoto Cabinet, LevelDB, BDB, SQLite, all of which are libraries.
LevelDB's API has Open() but no Close(); closing is done by deleting the pointer -> asymmetric.
Iteration interface: SQLite does it through a callback, which is awkward:

/* SQLite3 */
static int callback(void *NotUsed, int argc, char **argv, char **szColName) {
    for (int i = 0; i < argc; i++) {
        printf("%s = %s\n", szColName[i], argv[i] ? argv[i] : "NULL");
    }
    printf("\n");
    return 0;
}

char *query = "SELECT * FROM table";
sqlite3_exec(db, query, callback, 0, &szErrMsg);
it's always interesting to see how different engineers solved the same problems.
Summary: the author really likes LevelDB's API, except for the missing close.
He even goes into cache-line optimizations; that seems unnecessary.
2. The sparsehash library provides two classes, sparse_hash_map and dense_hash_map; dense_hash_map can scale up or down.
Maybe we could design an engine like this:
Lots of SSD benchmark data here: http://www.storagereview.com/samsung_ssd_840_pro_review