Linux Caching Mechanisms: The Block Cache

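The Linux kernel has not always used a page-based approach to caching. Early kernel versions contained only the block cache, which was used to speed up file operations and improve system performance. This is a legacy of other UNIX-like operating systems with the same structure. Blocks from the underlying block devices are kept in buffers in memory, which speeds up read and write operations.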

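Compared with memory pages, blocks are not only smaller (in most cases) but also variable in size, depending on the block device (or file system) in use. As generic file access methods are increasingly implemented with page-based operations, the block cache has lost its importance as the central system cache. The main caching work is now done by the page cache. In addition, the standard data structure for block-based I/O is no longer the buffer but struct bio.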

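Buffers are still used for small data transfers, where the amount of data involved is generally comparable to the block size. File systems typically use this approach when handling metadata. Raw data, by contrast, is transferred page by page, and the buffer implementation itself is built on top of the page cache.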

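Structurally, the block cache consists of two parts:

1) The buffer head holds all management data relating to the state of the buffer, including the block number, block size, access counter, and so on. The buffer's data is not stored directly after the buffer head but in a separate area of physical memory, referenced by a corresponding pointer in the buffer head structure.

2) The useful data is held in specially allocated pages, which may also reside in the page cache at the same time. This subdivides the page cache further, as shown in the figure: in our example, a page is split into four parts of equal length, each described by its own buffer head. The memory areas in which the buffer heads are stored are unrelated to the areas in which the useful data is stored.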

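This allows a page to be subdivided into smaller sections that follow one another without gaps (because buffer data and buffer head data are kept separate). Because a buffer consists of at least 512 bytes, each page can hold at most MAX_BUF_PER_PAGE buffers. This constant is defined as a function of the page size.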
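The relationship between page size, block size, and the number of buffers per page can be sketched in a few lines of C. This is a user-space illustration, not kernel code: the 4 KiB page size is an assumption, and the SKETCH_ names and buffers_per_page are hypothetical helpers introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration: a 4 KiB page and the 512-byte
 * minimum buffer size mentioned in the text. */
#define SKETCH_PAGE_SIZE    4096u
#define SKETCH_MIN_BUF_SIZE  512u

/* Upper bound on the number of buffers in one page, derived from the
 * page size (the role MAX_BUF_PER_PAGE plays in the kernel). */
#define SKETCH_MAX_BUF_PER_PAGE (SKETCH_PAGE_SIZE / SKETCH_MIN_BUF_SIZE)

/* Number of buffers a page holds for a given block size. Returns 0 if
 * the block size is not a power of two between the minimum buffer size
 * and the page size. */
static unsigned int buffers_per_page(unsigned int blocksize)
{
    if (blocksize < SKETCH_MIN_BUF_SIZE || blocksize > SKETCH_PAGE_SIZE)
        return 0;
    if (blocksize & (blocksize - 1))
        return 0;
    return SKETCH_PAGE_SIZE / blocksize;
}
```

With these assumptions, a 1 KiB block size gives the four-buffers-per-page layout used as the example in the text.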

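If a buffer is modified, this immediately affects the contents of the page (and vice versa), so the two caches need no explicit synchronization; after all, they share the same data.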
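Why no synchronization is needed becomes clear when the buffer is modeled as a pointer into the page. The following user-space sketch assumes a 4 KiB page and 1 KiB blocks; struct sketch_page, struct sketch_bh, and both functions are hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/* Stand-in for the memory of one page. */
struct sketch_page {
    unsigned char data[SKETCH_PAGE_SIZE];
};

/* Stand-in for a buffer head: it owns no data, only a view. */
struct sketch_bh {
    unsigned char *b_data;   /* points into the page */
    size_t b_size;           /* length of the view */
};

/* Let bh describe the n-th block-sized slice of the page. */
static void sketch_map_buffer(struct sketch_bh *bh, struct sketch_page *page,
                              size_t n, size_t blocksize)
{
    bh->b_data = page->data + n * blocksize;
    bh->b_size = blocksize;
}

/* Write through the buffer view, then check that the page sees the
 * change at the same offsets and nowhere else. Returns 1 on success. */
static int sketch_write_via_buffer(void)
{
    static struct sketch_page page;
    struct sketch_bh bh;

    memset(page.data, 0, sizeof(page.data));
    sketch_map_buffer(&bh, &page, 2, 1024);   /* third 1 KiB slice */
    memset(bh.b_data, 0xAB, bh.b_size);

    return page.data[2048] == 0xAB && page.data[3071] == 0xAB &&
           page.data[2047] == 0 && page.data[3072] == 0;
}
```

Because the buffer stores only a pointer and a length, there is never a second copy of the data that could go out of date.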

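Of course, some applications access block devices in blocks rather than pages; reading the superblock of a file system is one example. An independent block cache is used to speed up this kind of access. This block cache operates independently of the page cache rather than being built on top of it. To this end, buffer heads (the data structure is the same for the block cache and the page cache) are grouped in an array of constant length whose entries are managed on an LRU (least recently used) basis. After an entry has been used, it is placed at index 0 and the other entries are moved down accordingly. This means that the most frequently used entries sit at the start of the array, while less frequently used entries are pushed toward the end and eventually "drop out" of the array if they go unused for long enough.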

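Because the length of the array, that is, the number of entries in the LRU list, is a fixed value that does not change while the kernel is running, the kernel does not need a separate thread to trim the cache to a reasonable size. Instead, when an entry drops out of the array, the kernel only has to remove the associated buffer from the cache to release memory for other purposes.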
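The move-to-front behaviour of such a fixed-length array can be sketched as follows. This is a simplified user-space model, not the kernel's implementation: the array length of 4, the use of plain integers as entries, and the names sketch_lru and sketch_lru_touch are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_LRU_SIZE 4   /* hypothetical length, fixed at compile time */

struct sketch_lru {
    int slot[SKETCH_LRU_SIZE];   /* 0 marks an empty slot; keys are non-zero */
};

/* Record a use of key: place it at index 0 and shift the entries in
 * front of its old position down by one. If key was not present, the
 * last slot "drops out". Returns the evicted key, or 0 if none. */
static int sketch_lru_touch(struct sketch_lru *lru, int key)
{
    int i;
    int pos = SKETCH_LRU_SIZE - 1;   /* default: overwrite the last slot */
    int evicted;

    for (i = 0; i < SKETCH_LRU_SIZE; i++) {
        if (lru->slot[i] == key) {
            pos = i;
            break;
        }
    }
    evicted = (lru->slot[pos] == key) ? 0 : lru->slot[pos];

    for (i = pos; i > 0; i--)
        lru->slot[i] = lru->slot[i - 1];
    lru->slot[0] = key;
    return evicted;
}
```

No background thread is involved: the only moment at which memory has to be released is when sketch_lru_touch returns a non-zero evicted key.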

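Implementation of the Block Cache

The block cache is not just an add-on to the page cache; for objects that are handled in blocks rather than pages, it is a cache in its own right.

Data Structures

Buffer Heads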

struct buffer_head {
    unsigned long b_state;              /* buffer state bitmap (see above) */
    struct buffer_head *b_this_page;    /* circular list of page's buffers */
    struct page *b_page;                /* the page this bh is mapped to */

    sector_t b_blocknr;                 /* start block number */
    size_t b_size;                      /* size of mapping */
    char *b_data;                       /* pointer to data within the page */

    struct block_device *b_bdev;
    bh_end_io_t *b_end_io;              /* I/O completion */
    void *b_private;                    /* reserved for b_end_io */
    struct list_head b_assoc_buffers;   /* associated with another mapping */
    struct address_space *b_assoc_map;  /* mapping this buffer is
                                           associated with */
    atomic_t b_count;                   /* users using this buffer_head */
};
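The b_state member is a bitmap in which each bit stands for one state flag, and helpers such as set_buffer_dirty or buffer_uptodate (used in the listings of this article) set and test individual bits. The following user-space sketch models that idea only. The bit numbering, the reduced set of flags, and all sk_ names are assumptions, and the real kernel helpers use atomic bit operations, which this sketch does not.

```c
#include <assert.h>

/* Reduced, illustrative set of state bits (numbering assumed). */
enum sk_state_bit {
    SK_BH_UPTODATE,   /* buffer contents match the block device */
    SK_BH_DIRTY,      /* buffer contents must be written back */
    SK_BH_MAPPED      /* buffer has a block on the device */
};

struct sk_bh {
    unsigned long b_state;
};

static void sk_set_state(struct sk_bh *bh, enum sk_state_bit bit)
{
    bh->b_state |= 1UL << bit;
}

static void sk_clear_state(struct sk_bh *bh, enum sk_state_bit bit)
{
    bh->b_state &= ~(1UL << bit);
}

static int sk_test_state(const struct sk_bh *bh, enum sk_state_bit bit)
{
    return (bh->b_state >> bit) & 1UL;
}

/* Non-atomic model of a test-and-clear helper: report whether the bit
 * was set and leave it cleared. */
static int sk_test_clear_state(struct sk_bh *bh, enum sk_state_bit bit)
{
    int was_set = sk_test_state(bh, bit);

    sk_clear_state(bh, bit);
    return was_set;
}
```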

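Operations

The kernel must provide a set of operations so that the rest of the code can use buffers easily and efficiently. Keep in mind that these mechanisms contribute nothing to the actual caching of data in memory.

Before a buffer can be used, the kernel must first create an instance of the buffer_head structure, on which the remaining functions then operate. Because creating new buffer heads is a frequently recurring task, it should be as fast as possible. This is a classic situation for which a slab cache is the solution.

The kernel sources provide functions that serve as front ends for creating and destroying buffer heads: alloc_buffer_head generates a new buffer head, and free_buffer_head destroys an existing one.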

/* Allocate a buffer_head */
struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
{
    /* Take the object from the slab cache */
    struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);

    if (ret) {
        /* Initialize the list head and update the per-CPU accounting */
        INIT_LIST_HEAD(&ret->b_assoc_buffers);
        get_cpu_var(bh_accounting).nr++;
        recalc_bh_state();
        put_cpu_var(bh_accounting);
    }
    return ret;
}
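The reason a dedicated object cache helps with frequently created objects can be illustrated with a toy free list in user space. This sketch is only an analogy for the slab cache and makes no claim about how the kernel's slab allocator works internally; struct sketch_cache, sketch_alloc_bh, and sketch_free_bh are hypothetical names, and malloc stands in for the general-purpose allocator.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_bh {
    struct sketch_bh *next_free;   /* free-list link, used only while cached */
    unsigned long b_state;
};

struct sketch_cache {
    struct sketch_bh *free_list;   /* released objects kept for reuse */
    long nr_in_use;                /* counter in the spirit of bh_accounting.nr */
};

/* Front end for creating a buffer head: reuse a cached object if one is
 * available, otherwise fall back to the general allocator. */
static struct sketch_bh *sketch_alloc_bh(struct sketch_cache *c)
{
    struct sketch_bh *bh = c->free_list;

    if (bh)
        c->free_list = bh->next_free;
    else
        bh = malloc(sizeof(*bh));

    if (bh) {
        bh->next_free = NULL;
        bh->b_state = 0;
        c->nr_in_use++;
    }
    return bh;
}

/* Front end for destroying a buffer head: the object goes back onto the
 * free list instead of being returned to the general allocator. */
static void sketch_free_bh(struct sketch_cache *c, struct sketch_bh *bh)
{
    bh->next_free = c->free_list;
    c->free_list = bh;
    c->nr_in_use--;
}
```

An allocation that follows a release is served from the free list, so the common create/destroy cycle avoids the general allocator entirely.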

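Interaction between the Page Cache and the Block Cache

A page is split into several data units, but the buffer heads are held in a separate memory area that has nothing to do with the actual data. Interacting with buffers does not alter the contents of the page; buffers merely provide a new view of the page's data.

To support the interaction between pages and buffers, the private member of struct page is used. It is of type unsigned long and can therefore serve as a pointer to any location in the virtual address space.

The private member can also be used for other purposes which, depending on how the page is used, may have nothing to do with buffer heads. Its main use, however, is to link buffers and pages. In that case, private points to the first of the buffer heads that split the page into smaller units. The buffer heads are linked into a circular list via b_this_page: each buffer head's b_this_page member points to the next buffer head, and the b_this_page member of the last buffer head points back to the first. This lets the kernel easily scan all buffer_head instances associated with a page, starting from the page structure.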
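The walk from a page to all of its buffer heads can be modeled in user space as follows. The structures are reduced stand-ins with hypothetical names; the field priv plays the role of private and is declared as uintptr_t here so that the integer-to-pointer conversion is portable, whereas the text describes an unsigned long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_bh {
    struct sketch_bh *b_this_page;   /* next buffer head in the ring */
};

struct sketch_page {
    uintptr_t priv;                  /* integer-typed, holds a pointer */
};

/* Convert the integer-typed field back into a pointer to the first
 * buffer head of the page. */
static struct sketch_bh *sketch_page_buffers(const struct sketch_page *page)
{
    return (struct sketch_bh *)page->priv;
}

/* Visit every buffer head of the page exactly once by following
 * b_this_page until the walk returns to the first element. */
static unsigned int sketch_count_buffers(const struct sketch_page *page)
{
    struct sketch_bh *head = sketch_page_buffers(page);
    struct sketch_bh *bh = head;
    unsigned int n = 0;

    if (!head)
        return 0;
    do {
        n++;
        bh = bh->b_this_page;
    } while (bh != head);
    return n;
}
```

The do/while form with the comparison against head is the same loop shape that appears throughout the kernel listings in this article.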

The kernel provides the create_empty_buffers function to set up the association between a page and its buffer_head structures:

/*
 * We attach and possibly dirty the buffers atomically wrt
 * __set_page_dirty_buffers() via private_lock. try_to_free_buffers
 * is already excluded via the page lock.
 */
void create_empty_buffers(struct page *page,
                          unsigned long blocksize, unsigned long b_state)
{
    struct buffer_head *bh, *head, *tail;

    head = alloc_page_buffers(page, blocksize, 1);
    bh = head;
    /* Walk all buffer heads, set their state, and close the list into a ring */
    do {
        bh->b_state |= b_state;
        tail = bh;
        bh = bh->b_this_page;
    } while (bh);
    tail->b_this_page = head;

    spin_lock(&page->mapping->private_lock);
    /* The state of the buffers depends on the state of the data in the page */
    if (PageUptodate(page) || PageDirty(page)) {
        bh = head;
        do {
            /* Propagate the page flags to each buffer */
            if (PageDirty(page))
                set_buffer_dirty(bh);
            if (PageUptodate(page))
                set_buffer_uptodate(bh);
            bh = bh->b_this_page;
        } while (bh != head);
    }
    /* Attach the buffers to the page */
    attach_page_buffers(page, head);
    spin_unlock(&page->mapping->private_lock);
}

static inline void attach_page_buffers(struct page *page,
                                       struct buffer_head *head)
{
    page_cache_get(page);   /* increment the reference count */
    /*
     * Set the PG_private flag to tell the rest of the kernel that the
     * private member of this page instance is in use.
     */
    SetPagePrivate(page);
    /* Make private point to the first buffer head of the circular list */
    set_page_private(page, (unsigned long)head);
}
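The steps performed by create_empty_buffers and attach_page_buffers can be replayed in a small user-space model. This is a simplified sketch under stated assumptions: a 4 KiB page, calloc instead of the slab cache, no locking and no reference counting, and the buffer heads are created from the start of the page onward. All sk_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SK_PAGE_SIZE 4096u

enum { SK_UPTODATE = 1 << 0, SK_DIRTY = 1 << 1 };

struct sk_bh {
    struct sk_bh *b_this_page;
    unsigned int b_state;
    size_t b_offset;              /* offset of the buffer's data in the page */
};

struct sk_page {
    unsigned int flags;           /* SK_UPTODATE and/or SK_DIRTY */
    int has_private;              /* models the PG_private flag */
    uintptr_t priv;               /* models the private member */
};

/* Returns the number of buffer heads attached, or 0 on failure. */
static unsigned int sk_create_empty_buffers(struct sk_page *page,
                                            size_t blocksize,
                                            unsigned int b_state)
{
    struct sk_bh *head = NULL, *tail = NULL, *bh;
    unsigned int n = 0;
    size_t off;

    if (blocksize == 0 || blocksize > SK_PAGE_SIZE)
        return 0;

    /* Step 1: build a NULL-terminated list, one buffer head per block. */
    for (off = 0; off + blocksize <= SK_PAGE_SIZE; off += blocksize) {
        bh = calloc(1, sizeof(*bh));
        if (!bh) {
            while (head) {
                bh = head->b_this_page;
                free(head);
                head = bh;
            }
            return 0;
        }
        bh->b_offset = off;
        if (tail)
            tail->b_this_page = bh;
        else
            head = bh;
        tail = bh;
        n++;
    }

    /* Step 2: apply the requested state and inherit the page's flags. */
    for (bh = head; bh; bh = bh->b_this_page)
        bh->b_state |= b_state | (page->flags & (SK_UPTODATE | SK_DIRTY));

    /* Step 3: close the list into a ring. */
    tail->b_this_page = head;

    /* Step 4: attach the ring to the page. */
    page->has_private = 1;
    page->priv = (uintptr_t)head;
    return n;
}
```

For a dirty page and a block size of 1024, the model attaches four buffer heads that are all marked dirty, which mirrors the flag propagation in the kernel function.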

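Interaction

Setting up a link between pages and buffers would serve little purpose if it brought no benefit to other parts of the kernel. Some transfer operations to and from block devices use units whose size depends on the block size of the underlying device, whereas many parts of the kernel prefer to perform I/O at page granularity because this makes other things much easier, memory management in particular. In this scenario, buffers act as intermediaries between the two sides.

Reading Whole Pages into Buffers

Let us first look at the approach the kernel adopts when it reads a whole page from a block device, taking block_read_full_page as the example. We discuss the parts that are relevant to the buffer implementation.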

/*
 * Generic "read page" function for block devices that have the normal
 * get_block functionality. This is most of the block device filesystems.
 * Reads the page asynchronously --- the unlock_buffer() and
 * set/clear_buffer_uptodate() functions propagate buffer state into the
 * page struct once IO has completed.
 */
int block_read_full_page(struct page *page, get_block_t *get_block)
{
    struct inode *inode = page->mapping->host;
    sector_t iblock, lblock;
    struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
    unsigned int blocksize;
    int nr, i;
    int fully_mapped = 1;

    BUG_ON(!PageLocked(page));
    blocksize = 1 << inode->i_blkbits;

    /* Check whether the page has associated buffers; create them if not */
    if (!page_has_buffers(page))
        create_empty_buffers(page, blocksize, 0);

    /*
     * Get the buffers, whether newly created or already present. This
     * simply converts the page's private member into a buffer_head
     * pointer, because by convention private points to the first buffer
     * head associated with the page.
     */
    head = page_buffers(page);

    iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
    lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits;
    bh = head;
    nr = 0;
    i = 0;

    /* Iterate over all buffers associated with the page */
    do {
        /*
         * If the buffer contents are up to date, continue with the next
         * buffer. The data in the page already matches the block device,
         * so no additional read is needed.
         */
        if (buffer_uptodate(bh))
            continue;

        /* The buffer has no mapping yet */
        if (!buffer_mapped(bh)) {
            int err = 0;

            fully_mapped = 0;
            if (iblock < lblock) {
                WARN_ON(bh->b_size != blocksize);
                /* Determine the position of the block on the block device */
                err = get_block(inode, iblock, bh, 0);
                if (err)
                    SetPageError(page);
            }
            if (!buffer_mapped(bh)) {
                zero_user(page, i * blocksize, blocksize);
                if (!err)
                    set_buffer_uptodate(bh);
                continue;
            }
            /*
             * get_block() might have updated the buffer
             * synchronously
             */
            if (buffer_uptodate(bh))
                continue;
        }
        /*
         * The buffer is mapped to a block but its contents are not up
         * to date: collect it in a temporary array.
         */
        arr[nr++] = bh;
    } while (i++, iblock++, (bh = bh->b_this_page) != head);

    if (fully_mapped)
        SetPageMappedToDisk(page);

    if (!nr) {
        /*
         * All buffers are uptodate - we can set the page uptodate
         * as well. But not if get_block() returned an error.
         */
        if (!PageError(page))
            SetPageUptodate(page);
        unlock_page(page);
        return 0;
    }

    /* Stage two: lock the buffers */
    for (i = 0; i < nr; i++) {
        bh = arr[i];
        lock_buffer(bh);
        /*
         * Set b_end_io to end_buffer_async_read, which is called when
         * the data transfer has finished.
         */
        mark_buffer_async_read(bh);
    }

    /*
     * Stage 3: start the IO. Check for uptodateness
     * inside the buffer lock in case another process reading
     * the underlying blockdev brought it uptodate (the sct fix).
     */
    for (i = 0; i < nr; i++) {
        bh = arr[i];
        if (buffer_uptodate(bh))
            end_buffer_async_read(bh, 1);
        else
            /*
             * Hand every buffer that needs to be read to the block
             * layer (the BIO layer), where the read operation starts.
             */
            submit_bh(READ, bh);
    }
    return 0;
}
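The two block numbers computed at the start of block_read_full_page decide which buffers of a page lie inside the file at all. The arithmetic can be tried out in user space with the sketch below, which assumes 4 KiB pages (a page shift of 12) and uses hypothetical sk_ helper names; it reproduces only the index calculation, not the I/O.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12   /* assumed: 4 KiB pages */

/* File block number of the first buffer in the page with this index
 * (corresponds to iblock in the listing). */
static uint64_t sk_first_block(uint64_t page_index, unsigned int blkbits)
{
    return page_index << (SK_PAGE_SHIFT - blkbits);
}

/* Number of blocks needed to hold i_size bytes, which is also the first
 * block number beyond the end of the file (corresponds to lblock). */
static uint64_t sk_block_past_eof(uint64_t i_size, unsigned int blkbits)
{
    uint64_t blocksize = (uint64_t)1 << blkbits;

    return (i_size + blocksize - 1) >> blkbits;
}

/* Count the buffers of a page that lie inside the file. In the listing,
 * get_block() is only consulted for these; unmapped buffers beyond the
 * end of the file are zero-filled instead. */
static unsigned int sk_blocks_in_file(uint64_t page_index, uint64_t i_size,
                                      unsigned int blkbits)
{
    unsigned int per_page = 1u << (SK_PAGE_SHIFT - blkbits);
    uint64_t iblock = sk_first_block(page_index, blkbits);
    uint64_t lblock = sk_block_past_eof(i_size, blkbits);
    unsigned int i, n = 0;

    for (i = 0; i < per_page; i++, iblock++)
        if (iblock < lblock)
            n++;
    return n;
}
```

For a 5000-byte file with 1 KiB blocks, the file occupies five blocks, so the first page is covered completely and only the first buffer of the second page lies inside the file.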

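Writing Whole Pages from Buffers

Like reads, page writes can also be split into smaller units. Only the parts of a page that have actually been modified need to be written back, not the whole page. Unfortunately, from the buffer point of view, the implementation of write operations is much more complicated than the read operation described above.

The __block_write_full_page function contains the buffer-related operations involved in writing back a dirty page.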

/*
 * NOTE! All mapped/uptodate combinations are valid:
 *
 *    Mapped    Uptodate    Meaning
 *
 *    No        No          "unknown" - must do get_block()
 *    No        Yes         "hole" - zero-filled
 *    Yes       No          "allocated" - allocated on disk, not read in
 *    Yes       Yes         "valid" - allocated and up-to-date in memory.
 *
 * "Dirty" is valid only with the last case (mapped+uptodate).
 */

/*
 * While block_write_full_page is writing back the dirty buffers under
 * the page lock, whoever dirtied the buffers may decide to clean them
 * again at any time. We handle that by only looking at the buffer
 * state inside lock_buffer().
 *
 * If block_write_full_page() is called for regular writeback
 * (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a
 * locked buffer. This only can happen if someone has written the buffer
 * directly, with submit_bh(). At the address_space level PageWriteback
 * prevents this contention from occurring.
 *
 * If block_write_full_page() is called with wbc->sync_mode ==
 * WB_SYNC_ALL, the writes are posted using WRITE_SYNC_PLUG; this
 * causes the writes to be flagged as synchronous writes, but the
 * block device queue will NOT be unplugged, since usually many pages
 * will be pushed out before the higher-level caller actually
 * waits for the writes to be completed. The various wait functions,
 * such as wait_on_writeback_range() will ultimately call sync_page()
 * which will ultimately call blk_run_backing_dev(), which will end up
 * unplugging the device queue.
 */

static int __block_write_full_page(struct inode *inode, struct page *page,
            get_block_t *get_block, struct writeback_control *wbc,
            bh_end_io_t *handler)
{
    int err;
    sector_t block;
    sector_t last_block;
    struct buffer_head *bh, *head;
    const unsigned blocksize = 1 << inode->i_blkbits;
    int nr_underway = 0;
    int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
            WRITE_SYNC_PLUG : WRITE);

    BUG_ON(!PageLocked(page));

    last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;

    /* Check whether the page has associated buffers; create them if not */
    if (!page_has_buffers(page)) {
        create_empty_buffers(page, blocksize,
                    (1 << BH_Dirty)|(1 << BH_Uptodate));
    }

    /*
     * Be very careful. We have no exclusion from __set_page_dirty_buffers
     * here, and the (potentially unmapped) buffers may become dirty at
     * any time. If a buffer becomes dirty here after we've inspected it
     * then we just miss that fact, and the page stays dirty.
     *
     * Buffers outside i_size may be dirtied by __set_page_dirty_buffers;
     * handle that here by just cleaning them.
     */

    block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
    head = page_buffers(page);
    bh = head;

    /*
     * Get all the dirty buffers mapped to disk addresses and
     * handle any aliases from the underlying blockdev's mapping.
     */
    /*
     * First pass: for every dirty buffer that is not yet mapped,
     * establish a mapping between the buffer and the block device.
     */
    do {
        if (block > last_block) {
            /*
             * mapped buffers outside i_size will occur, because
             * this page can be outside i_size when there is a
             * truncate in progress.
             */
            /*
             * The buffer was zeroed by block_write_full_page()
             */
            clear_buffer_dirty(bh);
            set_buffer_uptodate(bh);
        } else if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
               buffer_dirty(bh)) {
            WARN_ON(bh->b_size != blocksize);
            /* Find the block on the block device that matches the buffer */
            err = get_block(inode, block, bh, 1);
            if (err)
                goto recover;
            clear_buffer_delay(bh);
            if (buffer_new(bh)) {
                /* blockdev mappings never come here */
                clear_buffer_new(bh);
                unmap_underlying_metadata(bh->b_bdev,
                            bh->b_blocknr);
            }
        }
        bh = bh->b_this_page;
        block++;
    } while (bh != head);

    /* Second pass: filter out all the dirty buffers */
    do {
        if (!buffer_mapped(bh))
            continue;
        /*
         * If it's a fully non-blocking write attempt and we cannot
         * lock the buffer then redirty the page. Note that this can
         * potentially cause a busy-wait loop from writeback threads
         * and kswapd activity, but those code paths have their own
         * higher-level throttling.
         */
        if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
            lock_buffer(bh);
        } else if (!trylock_buffer(bh)) {
            redirty_page_for_writepage(wbc, page);
            continue;
        }
        /*
         * If the buffer's dirty flag was set, this call clears it,
         * because the buffer contents are about to be written back.
         */
        if (test_clear_buffer_dirty(bh)) {
            /*
             * Set the BH_Async_Write state bit and install handler
             * as the completion routine, b_end_io.
             */
            mark_buffer_async_write_endio(bh, handler);
        } else {
            unlock_buffer(bh);
        }
    } while ((bh = bh->b_this_page) != head);

    /*
     * The page and its buffers are protected by PageWriteback(), so we can
     * drop the bh refcounts early.
     */
    BUG_ON(PageWriteback(page));
    set_page_writeback(page);

    /* Final pass */
    do {
        struct buffer_head *next = bh->b_this_page;
        if (buffer_async_write(bh)) {
            /*
             * Hand every buffer that the previous pass marked with
             * BH_Async_Write to the block layer, which performs the
             * actual write; submit_bh() submits the corresponding
             * request.
             */
            submit_bh(write_op, bh);
            nr_underway++;
        }
        bh = next;
    } while (bh != head);
    unlock_page(page);

    err = 0;
done:
    if (nr_underway == 0) {
        /*
         * The page was marked dirty, but the buffers were
         * clean. Someone wrote them back by hand with
         * ll_rw_block/submit_bh. A rare case.
         */
        end_page_writeback(page);

        /*
         * The page and buffer_heads can be released at any time from
         * here on.
         */
    }
    return err;

recover:
    /*
     * ENOSPC, or some other error. We may already have added some
     * blocks to the file, so we need to write these out to avoid
     * exposing stale data.
     * The page is currently locked and not marked for writeback
     */
    bh = head;
    /* Recovery: lock and submit the mapped buffers */
    do {
        if (buffer_mapped(bh) && buffer_dirty(bh) &&
            !buffer_delay(bh)) {
            lock_buffer(bh);
            mark_buffer_async_write_endio(bh, handler);
        } else {
            /*
             * The buffer may have been set dirty during
             * attachment to a dirty page.
             */
            clear_buffer_dirty(bh);
        }
    } while ((bh = bh->b_this_page) != head);
    SetPageError(page);
    BUG_ON(PageWriteback(page));
    mapping_set_error(page->mapping, err);
    set_page_writeback(page);
    do {
        struct buffer_head *next = bh->b_this_page;
        if (buffer_async_write(bh)) {
            clear_buffer_dirty(bh);
            submit_bh(write_op, bh);
            nr_underway++;
        }
        bh = next;
    } while (bh != head);
    unlock_page(page);
    goto done;
}
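Stripped of locking, error recovery, and the end-of-file handling, the main path of __block_write_full_page consists of three passes over the ring of buffer heads. The following user-space sketch models only that control flow. The flag values, the structure, and the function name are hypothetical, pass 1 merely sets a flag where the kernel calls get_block, and "submitting" is reduced to counting.

```c
#include <assert.h>

enum {
    SK_MAPPED      = 1 << 0,   /* buffer has a block on the device */
    SK_DIRTY       = 1 << 1,   /* buffer must be written back */
    SK_ASYNC_WRITE = 1 << 2    /* buffer was selected for this writeback */
};

struct sk_bh {
    struct sk_bh *b_this_page;
    unsigned int b_state;
};

/* Returns the number of buffers that would be handed to the block
 * layer (nr_underway in the listing). */
static unsigned int sk_write_full_page(struct sk_bh *head)
{
    struct sk_bh *bh = head;
    unsigned int nr_underway = 0;

    /* Pass 1: give every dirty, unmapped buffer a mapping. */
    do {
        if (!(bh->b_state & SK_MAPPED) && (bh->b_state & SK_DIRTY))
            bh->b_state |= SK_MAPPED;
        bh = bh->b_this_page;
    } while (bh != head);

    /* Pass 2: pick out the dirty buffers, clear their dirty flag and
     * mark them for asynchronous writing. */
    do {
        if ((bh->b_state & SK_MAPPED) && (bh->b_state & SK_DIRTY)) {
            bh->b_state &= ~(unsigned int)SK_DIRTY;
            bh->b_state |= SK_ASYNC_WRITE;
        }
        bh = bh->b_this_page;
    } while (bh != head);

    /* Pass 3: submit every buffer marked in pass 2. */
    do {
        if (bh->b_state & SK_ASYNC_WRITE)
            nr_underway++;
        bh = bh->b_this_page;
    } while (bh != head);

    return nr_underway;
}
```

In this model, a page with four buffers of which two are dirty results in two submissions, so clean buffers cause no I/O even though the page as a whole was dirty.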
