#ifndef _BCACHE_JOURNAL_H
#define _BCACHE_JOURNAL_H

/*
 * THE JOURNAL:
 *
 * The journal is treated as a circular buffer of buckets - a journal entry
 * never spans two buckets. This means we could resize the journal at runtime
 * (not implemented yet); it will also be needed for bcache on raw flash
 * support.
 *
 * Journal entries contain a list of keys, ordered by the time they were
 * inserted; thus journal replay just has to reinsert the keys.
 *
 * We also keep some things in the journal header that are logically part of the
 * superblock - all the things that are frequently updated. This is for future
 * bcache on raw flash support; the superblock (which will become another
 * journal) can't be moved or wear leveled, so it contains just enough
 * information to find the main journal, and the superblock only has to be
 * rewritten when we want to move/wear level the main journal.
 *
 * Currently, we don't journal BTREE_REPLACE operations - this will hopefully be
 * fixed eventually. This isn't a bug - BTREE_REPLACE is used for insertions
 * from cache misses, which don't have to be journaled, and for writeback and
 * moving gc we work around it by flushing the btree to disk before updating the
 * gc information. But it is a potential issue with incremental garbage
 * collection, and it's fragile.
 *
 * OPEN JOURNAL ENTRIES:
 *
 * Each journal entry contains, in the header, the sequence number of the last
 * journal entry still open - i.e. that has keys that haven't been flushed to
 * disk in the btree.
 *
 * We track this by maintaining a refcount for every open journal entry, in a
 * fifo; each entry in the fifo corresponds to a particular journal
 * entry/sequence number. When the refcount at the tail of the fifo goes to
 * zero, we pop it off - thus, the size of the fifo tells us the number of open
 * journal entries.
 *
 * We take a refcount on a journal entry when we add some keys to a journal
 * entry that we're going to insert (held by struct btree_op), and then when we
 * insert those keys into the btree the btree write we're setting up takes a
 * copy of that refcount (held by struct btree_write). That refcount is dropped
 * when the btree write completes.
 *
 * A struct btree_write can only hold a refcount on a single journal entry, but
 * might contain keys for many journal entries - we handle this by making sure
 * it always has a refcount on the _oldest_ journal entry of all the journal
 * entries it has keys for.
 *
 * JOURNAL RECLAIM:
 *
 * As mentioned previously, our fifo of refcounts tells us the number of open
 * journal entries; from that and the current journal sequence number we compute
 * last_seq - the oldest journal entry we still need. We write last_seq in each
 * journal entry, and we also have to keep track of where it exists on disk so
 * we don't overwrite it when we loop around the journal.
 *
 * To do that we track, for each journal bucket, the sequence number of the
 * newest journal entry it contains - if we don't need that journal entry we
 * don't need anything in that bucket anymore. From that we track the last
 * journal bucket we still need; all this is tracked in struct journal_device
 * and updated by journal_reclaim().
 *
 * JOURNAL FILLING UP:
 *
 * There are two ways the journal could fill up; either we could run out of
 * space to write to, or we could have too many open journal entries and run out
 * of room in the fifo of refcounts. Since those refcounts are decremented
 * without any locking we can't safely resize that fifo, so we handle it the
 * same way.
 *
 * If the journal fills up, we start flushing dirty btree nodes until we can
 * allocate space for a journal write again - preferentially flushing btree
 * nodes that are pinning the oldest journal entries first.
 */

/*
 * Only used for holding the journal entries we read in bch_journal_read()
 * during cache registration.
 */
struct journal_replay {
	struct list_head	list;
	atomic_t		*pin;
	struct jset		j;
};

/*
 * We put two of these in struct journal; we use them for writes to the
 * journal that are being staged or in flight.
 */
struct journal_write {
	struct jset		*data;
#define JSET_BITS		3

	struct cache_set	*c;
	struct closure_waitlist	wait;
	bool			dirty;
	bool			need_write;
};

/* Embedded in struct cache_set */
struct journal {
	spinlock_t		lock;
	/* used when waiting because the journal was full */
	struct closure_waitlist	wait;
	struct closure		io;
	int			io_in_flight;
	struct delayed_work	work;

	/* Number of blocks free in the bucket(s) we're currently writing to */
	unsigned		blocks_free;
	uint64_t		seq;
	DECLARE_FIFO(atomic_t, pin);

	BKEY_PADDED(key);

	struct journal_write	w[2], *cur;
};

/*
 * Embedded in struct cache. First three fields refer to the array of journal
 * buckets, in cache_sb.
 */
struct journal_device {
	/*
	 * For each journal bucket, contains the max sequence number of the
	 * journal writes it contains - so we know when a bucket can be reused.
	 */
	uint64_t		seq[SB_JOURNAL_BUCKETS];

	/* Journal bucket we're currently writing to */
	unsigned		cur_idx;

	/* Last journal bucket that still contains an open journal entry */
	unsigned		last_idx;

	/* Next journal bucket to be discarded */
	unsigned		discard_idx;

#define DISCARD_READY		0
#define DISCARD_IN_FLIGHT	1
#define DISCARD_DONE		2
	/* One of DISCARD_READY, DISCARD_IN_FLIGHT or DISCARD_DONE */
	atomic_t		discard_in_flight;

	struct work_struct	discard_work;
	struct bio		discard_bio;
	struct bio_vec		discard_bv;

	/* Bio for journal reads/writes to this device */
	struct bio		bio;
	struct bio_vec		bv[8];
};

#define journal_pin_cmp(c, l, r)				\
	(fifo_idx(&(c)->journal.pin, (l)) > fifo_idx(&(c)->journal.pin, (r)))

#define JOURNAL_PIN	20000

#define journal_full(j)						\
	(!(j)->blocks_free || fifo_free(&(j)->pin) <= 1)

struct closure;
struct cache_set;
struct btree_op;
struct keylist;

atomic_t *bch_journal(struct cache_set *, struct keylist *, struct closure *);
void bch_journal_next(struct journal *);
void bch_journal_mark(struct cache_set *, struct list_head *);
void bch_journal_meta(struct cache_set *, struct closure *);
int bch_journal_read(struct cache_set *, struct list_head *);
int bch_journal_replay(struct cache_set *, struct list_head *);

void bch_journal_free(struct cache_set *);
int bch_journal_alloc(struct cache_set *);

#endif /* _BCACHE_JOURNAL_H */