| ## The design of littlefs |
| |
| A little fail-safe filesystem designed for microcontrollers. |
| |
| ``` |
| | | | .---._____ |
| .-----. | | |
| --|o |---| littlefs | |
| --| |---| | |
| '-----' '----------' |
| | | | |
| ``` |
| |
| littlefs was originally built as an experiment to learn about filesystem design |
| in the context of microcontrollers. The question was: How would you build a |
| filesystem that is resilient to power-loss and flash wear without using |
| unbounded memory? |
| |
This document covers the high-level design of littlefs, how it differs from
other filesystems, and the design decisions that got us here. For the
| low-level details covering every bit on disk, check out [SPEC.md](SPEC.md). |
| |
| ## The problem |
| |
| The embedded systems littlefs targets are usually 32-bit microcontrollers with |
| around 32 KiB of RAM and 512 KiB of ROM. These are often paired with SPI NOR |
| flash chips with about 4 MiB of flash storage. These devices are too small for |
| Linux and most existing filesystems, requiring code written specifically with |
| size in mind. |
| |
| Flash itself is an interesting piece of technology with its own quirks and |
| nuance. Unlike other forms of storage, writing to flash requires two |
| operations: erasing and programming. Programming (setting bits to 0) is |
relatively cheap and can be very granular. Erasing, however (setting bits to
1), is an expensive and destructive operation, which gives flash its name.
| [Wikipedia][wikipedia-flash] has more information on how exactly flash works. |
| |
| To make the situation more annoying, it's very common for these embedded |
| systems to lose power at any time. Usually, microcontroller code is simple and |
| reactive, with no concept of a shutdown routine. This presents a big challenge |
| for persistent storage, where an unlucky power loss can corrupt the storage and |
| leave a device unrecoverable. |
| |
| This leaves us with three major requirements for an embedded filesystem. |
| |
| 1. **Power-loss resilience** - On these systems, power can be lost at any time. |
| If a power loss corrupts any persistent data structures, this can cause the |
| device to become unrecoverable. An embedded filesystem must be designed to |
| recover from a power loss during any write operation. |
| |
| 1. **Wear leveling** - Writing to flash is destructive. If a filesystem |
| repeatedly writes to the same block, eventually that block will wear out. |
| Filesystems that don't take wear into account can easily burn through blocks |
| used to store frequently updated metadata and cause a device's early death. |
| |
| 1. **Bounded RAM/ROM** - If the above requirements weren't enough, these |
   systems also have very limited amounts of memory. This rules out many
   existing filesystem designs, which lean on relatively large amounts of
| RAM to temporarily store filesystem metadata. |
| |
   For ROM, this means we need to keep our design simple and reuse code paths
   where possible. For RAM we have a stronger requirement: all RAM usage must be
   bounded. This means RAM usage does not grow as the filesystem changes in
| size or number of files. This creates a unique challenge as even presumably |
| simple operations, such as traversing the filesystem, become surprisingly |
| difficult. |
| |
| ## Existing designs? |
| |
| So, what's already out there? There are, of course, many different filesystems, |
however they often share and borrow features from each other. If we look at
| power-loss resilience and wear leveling, we can narrow these down to a handful |
| of designs. |
| |
| 1. First we have the non-resilient, block based filesystems, such as [FAT] and |
   [ext2]. These are the earliest filesystem designs and are often the simplest.
| Here storage is divided into blocks, with each file being stored in a |
| collection of blocks. Without modifications, these filesystems are not |
   power-loss resilient, so updating a file is as simple as rewriting the blocks
| in place. |
| |
| ``` |
| .--------. |
| | root | |
| | | |
| | | |
| '--------' |
| .-' '-. |
| v v |
| .--------. .--------. |
| | A | | B | |
| | | | | |
| | | | | |
| '--------' '--------' |
| .-' .-' '-. |
| v v v |
| .--------. .--------. .--------. |
| | C | | D | | E | |
| | | | | | | |
| | | | | | | |
| '--------' '--------' '--------' |
| ``` |
| |
| Because of their simplicity, these filesystems are usually both the fastest |
   and smallest. However, the lack of power resilience is not great, and the
| binding relationship of storage location and data removes the filesystem's |
| ability to manage wear. |
| |
| 2. In a completely different direction, we have logging filesystems, such as |
   [JFFS], [YAFFS], and [SPIFFS]. Here, storage location is not bound to a
   piece of data; instead the entire storage is used for a circular log which is
| appended with every change made to the filesystem. Writing appends new |
| changes, while reading requires traversing the log to reconstruct a file. |
| Some logging filesystems cache files to avoid the read cost, but this comes |
| at a tradeoff of RAM. |
| |
| ``` |
| v |
| .--------.--------.--------.--------.--------.--------.--------.--------. |
| | C | new B | new A | | A | B | |
| | | | |-> | | | |
| | | | | | | | |
| '--------'--------'--------'--------'--------'--------'--------'--------' |
| ``` |
| |
   Logging filesystems are beautifully elegant. With a checksum, we can easily
| detect power-loss and fall back to the previous state by ignoring failed |
| appends. And if that wasn't good enough, their cyclic nature means that |
| logging filesystems distribute wear across storage perfectly. |
| |
| The main downside is performance. If we look at garbage collection, the |
| process of cleaning up outdated data from the end of the log, I've yet to |
| see a pure logging filesystem that does not have one of these two costs: |
| |
| 1. _O(n²)_ runtime |
| 2. _O(n)_ RAM |
| |
| SPIFFS is a very interesting case here, as it uses the fact that repeated |
   programs to NOR flash are both atomic and masking. This is a very neat
| solution, however it limits the type of storage you can support. |
| |
| 3. Perhaps the most common type of filesystem, a journaling filesystem is the |
| offspring that happens when you mate a block based filesystem with a logging |
| filesystem. [ext4] and [NTFS] are good examples. Here, we take a normal |
| block based filesystem and add a bounded log where we note every change |
| before it occurs. |
| |
| ``` |
| journal |
| .--------.--------. |
| .--------. | C'| D'| | E'| |
| | root |-->| | |-> | | |
| | | | | | | | |
| | | '--------'--------' |
| '--------' |
| .-' '-. |
| v v |
| .--------. .--------. |
| | A | | B | |
| | | | | |
| | | | | |
| '--------' '--------' |
| .-' .-' '-. |
| v v v |
| .--------. .--------. .--------. |
| | C | | D | | E | |
| | | | | | | |
| | | | | | | |
| '--------' '--------' '--------' |
| ``` |
| |
| |
| This sort of filesystem takes the best from both worlds. Performance can be |
| as fast as a block based filesystem (though updating the journal does have |
| a small cost), and atomic updates to the journal allow the filesystem to |
| recover in the event of a power loss. |
| |
| Unfortunately, journaling filesystems have a couple of problems. They are |
| fairly complex, since there are effectively two filesystems running in |
| parallel, which comes with a code size cost. They also offer no protection |
| against wear because of the strong relationship between storage location |
| and data. |
| |
| 4. Last but not least we have copy-on-write (COW) filesystems, such as |
| [btrfs] and [ZFS]. These are very similar to other block based filesystems, |
   but instead of updating a block in place, all updates are performed by creating
| a copy with the changes and replacing any references to the old block with |
| our new block. This recursively pushes all of our problems upwards until we |
| reach the root of our filesystem, which is often stored in a very small log. |
| |
| ``` |
| .--------. .--------. |
| | root | write |new root| |
| | | ==> | | |
| | | | | |
| '--------' '--------' |
| .-' '-. | '-. |
| | .-------|------------------' v |
| v v v .--------. |
| .--------. .--------. | new B | |
| | A | | B | | | |
| | | | | | | |
| | | | | '--------' |
| '--------' '--------' .-' | |
| .-' .-' '-. .------------|------' |
| | | | | v |
| v v v v .--------. |
| .--------. .--------. .--------. | new D | |
| | C | | D | | E | | | |
| | | | | | | | | |
| | | | | | | '--------' |
| '--------' '--------' '--------' |
| ``` |
| |
| COW filesystems are interesting. They offer very similar performance to |
| block based filesystems while managing to pull off atomic updates without |
| storing data changes directly in a log. They even disassociate the storage |
| location of data, which creates an opportunity for wear leveling. |
| |
| Well, almost. The unbounded upwards movement of updates causes some |
| problems. Because updates to a COW filesystem don't stop until they've |
| reached the root, an update can cascade into a larger set of writes than |
| would be needed for the original data. On top of this, the upward motion |
   focuses these writes into a single block, which can wear out much earlier than
| the rest of the filesystem. |
| |
| ## littlefs |
| |
| So what does littlefs do? |
| |
| If we look at existing filesystems, there are two interesting design patterns |
that stand out, but each has its own set of problems. Logging, which
| provides independent atomicity, has poor runtime performance. And COW data |
| structures, which perform well, push the atomicity problem upwards. |
| |
| Can we work around these limitations? |
| |
Consider logging. It has either an _O(n²)_ runtime or _O(n)_ RAM cost. We
can't avoid these costs, _but_ if we put an upper bound on the size we can at
least prevent the theoretical cost from becoming a problem. This relies on the
| super secret computer science hack where you can pretend any algorithmic |
| complexity is _O(1)_ by bounding the input. |
| |
| In the case of COW data structures, we can try twisting the definition a bit. |
| Let's say that our COW structure doesn't copy after a single write, but instead |
| copies after _n_ writes. This doesn't change most COW properties (assuming you |
| can write atomically!), but what it does do is prevent the upward motion of |
| wear. This sort of copy-on-bounded-writes (CObW) still focuses wear, but at |
| each level we divide the propagation of wear by _n_. With a sufficiently |
| large _n_ (> branching factor) wear propagation is no longer a problem. |
| |
See where this is going? Separately, logging and COW are imperfect solutions
that have weaknesses limiting their usefulness. But if we merge the two, they can
| mutually solve each other's limitations. |
| |
| This is the idea behind littlefs. At the sub-block level, littlefs is built |
| out of small, two block logs that provide atomic updates to metadata anywhere |
| on the filesystem. At the super-block level, littlefs is a CObW tree of blocks |
| that can be evicted on demand. |
| |
| ``` |
| root |
| .--------.--------. |
| | A'| B'| | |
| | | |-> | |
| | | | | |
| '--------'--------' |
| .----' '--------------. |
| A v B v |
| .--------.--------. .--------.--------. |
| | C'| D'| | | E'|new| | |
| | | |-> | | | E'|-> | |
| | | | | | | | | |
| '--------'--------' '--------'--------' |
| .-' '--. | '------------------. |
| v v .-' v |
| .--------. .--------. v .--------. |
| | C | | D | .--------. write | new E | |
| | | | | | E | ==> | | |
| | | | | | | | | |
| '--------' '--------' | | '--------' |
| '--------' .-' | |
| .-' '-. .-------------|------' |
| v v v v |
| .--------. .--------. .--------. |
| | F | | G | | new F | |
| | | | | | | |
| | | | | | | |
| '--------' '--------' '--------' |
| ``` |
| |
| There are still some minor issues. Small logs can be expensive in terms of |
storage; in the worst case a small log costs 4x the size of the original data.
| CObW structures require an efficient block allocator since allocation occurs |
| every _n_ writes. And there is still the challenge of keeping the RAM usage |
| constant. |
| |
| ## Metadata pairs |
| |
| Metadata pairs are the backbone of littlefs. These are small, two block logs |
| that allow atomic updates anywhere in the filesystem. |
| |
| Why two blocks? Well, logs work by appending entries to a circular buffer |
| stored on disk. But remember that flash has limited write granularity. We can |
| incrementally program new data onto erased blocks, but we need to erase a full |
| block at a time. This means that in order for our circular buffer to work, we |
| need more than one block. |
| |
| We could make our logs larger than two blocks, but the next challenge is how |
| do we store references to these logs? Because the blocks themselves are erased |
| during writes, using a data structure to track these blocks is complicated. |
The simple solution here is to store two block addresses for every metadata
| pair. This has the added advantage that we can change out blocks in the |
| metadata pair independently, and we don't reduce our block granularity for |
| other operations. |
| |
| In order to determine which metadata block is the most recent, we store a |
| revision count that we compare using [sequence arithmetic][wikipedia-sna] |
| (very handy for avoiding problems with integer overflow). Conveniently, this |
| revision count also gives us a rough idea of how many erases have occurred on |
| the block. |
| |
| ``` |
| metadata pair pointer: {block 0, block 1} |
| | '--------------------. |
| '-. | |
| disk v v |
| .--------.--------.--------.--------.--------.--------.--------.--------. |
| | | |metadata| |metadata| | |
| | | |block 0 | |block 1 | | |
| | | | | | | | |
| '--------'--------'--------'--------'--------'--------'--------'--------' |
| '--. .----' |
| v v |
| metadata pair .----------------.----------------. |
| | revision 11 | revision 12 | |
| block 1 is |----------------|----------------| |
| most recent | A | A'' | |
| |----------------|----------------| |
| | checksum | checksum | |
| |----------------|----------------| |
| | B | A''' | <- most recent A |
| |----------------|----------------| |
| | A'' | checksum | |
| |----------------|----------------| |
| | checksum | | | |
| |----------------| v | |
| '----------------'----------------' |
| ``` |
| |
| So how do we atomically update our metadata pairs? Atomicity (a type of |
| power-loss resilience) requires two parts: redundancy and error detection. |
| Error detection can be provided with a checksum, and in littlefs's case we |
| use a 32-bit [CRC][wikipedia-crc]. Maintaining redundancy, on the other hand, |
| requires multiple stages. |
| |
| 1. If our block is not full and the program size is small enough to let us |
| append more entries, we can simply append the entries to the log. Because |
| we don't overwrite the original entries (remember rewriting flash requires |
| an erase), we still have the original entries if we lose power during the |
| append. |
| |
| ``` |
| commit A |
| .----------------.----------------. .----------------.----------------. |
| | revision 1 | revision 0 | => | revision 1 | revision 0 | |
| |----------------|----------------| |----------------|----------------| |
| | | | | | A | | |
| | v | | |----------------| | |
| | | | | checksum | | |
| | | | |----------------| | |
| | | | | | | | |
| | | | | v | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| '----------------'----------------' '----------------'----------------' |
| ``` |
| |
| Note that littlefs doesn't maintain a checksum for each entry. Many logging |
| filesystems do this, but it limits what you can update in a single atomic |
| operation. What we can do instead is group multiple entries into a commit |
| that shares a single checksum. This lets us update multiple unrelated pieces |
| of metadata as long as they reside on the same metadata pair. |
| |
| ``` |
| commit B and A' |
| .----------------.----------------. .----------------.----------------. |
| | revision 1 | revision 0 | => | revision 1 | revision 0 | |
| |----------------|----------------| |----------------|----------------| |
| | A | | | A | | |
| |----------------| | |----------------| | |
| | checksum | | | checksum | | |
| |----------------| | |----------------| | |
| | | | | | B | | |
| | v | | |----------------| | |
| | | | | A' | | |
| | | | |----------------| | |
| | | | | checksum | | |
| | | | |----------------| | |
| '----------------'----------------' '----------------'----------------' |
| ``` |
| |
| 2. If our block _is_ full of entries, we need to somehow remove outdated |
| entries to make space for new ones. This process is called garbage |
| collection, but because littlefs has multiple garbage collectors, we |
| also call this specific case compaction. |
| |
| Compared to other filesystems, littlefs's garbage collector is relatively |
| simple. We want to avoid RAM consumption, so we use a sort of brute force |
| solution where for each entry we check to see if a newer entry has been |
   written. If the entry is the most recent, we append it to our new block. This
   is where having two blocks becomes important: if we lose power, we still have
| everything in our original block. |
| |
| During this compaction step we also erase the metadata block and increment |
| the revision count. Because we can commit multiple entries at once, we can |
| write all of these changes to the second block without worrying about power |
| loss. It's only when the commit's checksum is written that the compacted |
| entries and revision count become committed and readable. |
| |
| ``` |
| commit B', need to compact |
| .----------------.----------------. .----------------.----------------. |
| | revision 1 | revision 0 | => | revision 1 | revision 2 | |
| |----------------|----------------| |----------------|----------------| |
| | A | | | A | A' | |
| |----------------| | |----------------|----------------| |
| | checksum | | | checksum | B' | |
| |----------------| | |----------------|----------------| |
| | B | | | B | checksum | |
| |----------------| | |----------------|----------------| |
| | A' | | | A' | | | |
| |----------------| | |----------------| v | |
| | checksum | | | checksum | | |
| |----------------| | |----------------| | |
| '----------------'----------------' '----------------'----------------' |
| ``` |
| |
| 3. If our block is full of entries _and_ we can't find any garbage, then what? |
| At this point, most logging filesystems would return an error indicating no |
| more space is available, but because we have small logs, overflowing a log |
| isn't really an error condition. |
| |
| Instead, we split our original metadata pair into two metadata pairs, each |
| containing half of the entries, connected by a tail pointer. Instead of |
| increasing the size of the log and dealing with the scalability issues |
| associated with larger logs, we form a linked list of small bounded logs. |
   This is a tradeoff: this approach uses more storage space, but in exchange
   we get improved scalability.
| |
| Despite writing to two metadata pairs, we can still maintain power |
| resilience during this split step by first preparing the new metadata pair, |
| and then inserting the tail pointer during the commit to the original |
| metadata pair. |
| |
| ``` |
| commit C and D, need to split |
| .----------------.----------------. .----------------.----------------. |
| | revision 1 | revision 2 | => | revision 3 | revision 2 | |
| |----------------|----------------| |----------------|----------------| |
| | A | A' | | A' | A' | |
| |----------------|----------------| |----------------|----------------| |
| | checksum | B' | | B' | B' | |
| |----------------|----------------| |----------------|----------------| |
| | B | checksum | | tail ---------------------. |
| |----------------|----------------| |----------------|----------------| | |
| | A' | | | | checksum | | | |
| |----------------| v | |----------------| | | |
| | checksum | | | | | | | |
| |----------------| | | v | | | |
| '----------------'----------------' '----------------'----------------' | |
| .----------------.---------' |
| v v |
| .----------------.----------------. |
| | revision 1 | revision 0 | |
| |----------------|----------------| |
| | C | | |
| |----------------| | |
| | D | | |
| |----------------| | |
| | checksum | | |
| |----------------| | |
| | | | | |
| | v | | |
| | | | |
| | | | |
| '----------------'----------------' |
| ``` |
| |
There is another complexity that crops up when dealing with small logs. The
| amortized runtime cost of garbage collection is not only dependent on its |
| one time cost (_O(n²)_ for littlefs), but also depends on how often |
| garbage collection occurs. |
| |
| Consider two extremes: |
| |
| 1. Log is empty, garbage collection occurs once every _n_ updates |
| 2. Log is full, garbage collection occurs **every** update |
| |
| Clearly we need to be more aggressive than waiting for our metadata pair to |
be full. As the metadata pair approaches fullness, the frequency of compactions
| grows very rapidly. |
| |
| Looking at the problem generically, consider a log with ![n] bytes for each |
| entry, ![d] dynamic entries (entries that are outdated during garbage |
| collection), and ![s] static entries (entries that need to be copied during |
| garbage collection). If we look at the amortized runtime complexity of updating |
| this log we get this formula: |
| |
| ![cost = n + n (s / d+1)][metadata-formula1] |
| |
| If we let ![r] be the ratio of static space to the size of our log in bytes, we |
| find an alternative representation of the number of static and dynamic entries: |
| |
| ![s = r (size/n)][metadata-formula2] |
| |
| ![d = (1 - r) (size/n)][metadata-formula3] |
| |
| Substituting these in for ![d] and ![s] gives us a nice formula for the cost of |
| updating an entry given how full the log is: |
| |
| ![cost = n + n (r (size/n) / ((1-r) (size/n) + 1))][metadata-formula4] |
| |
| Assuming 100 byte entries in a 4 KiB log, we can graph this using the entry |
| size to find a multiplicative cost: |
| |
| ![Metadata pair update cost graph][metadata-cost-graph] |
| |
| So at 50% usage, we're seeing an average of 2x cost per update, and at 75% |
| usage, we're already at an average of 4x cost per update. |
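
These numbers fall straight out of the formula above: with 100 byte entries,
size/n ≈ 41, so at r = 0.5 the cost is 100 + 100×(20.5/21.5) ≈ 195 ≈ 2×100,
and at r = 0.75 it is 100 + 100×(30.7/11.2) ≈ 373 ≈ 4×100.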
| |
To avoid this runaway growth, instead of waiting for our metadata pair
| to be full, we split the metadata pair once we exceed 50% capacity. We do this |
| lazily, waiting until we need to compact before checking if we fit in our 50% |
| limit. This limits the overhead of garbage collection to 2x the runtime cost, |
| giving us an amortized runtime complexity of _O(1)_. |
| |
| --- |
| |
| If we look at metadata pairs and linked-lists of metadata pairs at a high |
| level, they have fairly nice runtime costs. Assuming _n_ metadata pairs, |
| each containing _m_ metadata entries, the _lookup_ cost for a specific |
| entry has a worst case runtime complexity of _O(nm)_. For _updating_ a specific |
| entry, the worst case complexity is _O(nm²)_, with an amortized complexity |
| of only _O(nm)_. |
| |
| However, splitting at 50% capacity does mean that in the best case our |
| metadata pairs will only be 1/2 full. If we include the overhead of the second |
| block in our metadata pair, each metadata entry has an effective storage cost |
| of 4x the original size. I imagine users would not be happy if they found |
| that they can only use a quarter of their original storage. Metadata pairs |
| provide a mechanism for performing atomic updates, but we need a separate |
| mechanism for storing the bulk of our data. |
| |
| ## CTZ skip-lists |
| |
| Metadata pairs provide efficient atomic updates but unfortunately have a large |
storage cost. Fortunately, we can work around this cost by only using the
| metadata pairs to store references to more dense, copy-on-write (COW) data |
| structures. |
| |
| [Copy-on-write data structures][wikipedia-cow], also called purely functional |
| data structures, are a category of data structures where the underlying |
| elements are immutable. Making changes to the data requires creating new |
| elements containing a copy of the updated data and replacing any references |
| with references to the new elements. Generally, the performance of a COW data |
| structure depends on how many old elements can be reused after replacing parts |
| of the data. |
| |
| littlefs has several requirements of its COW structures. They need to be |
| efficient to read and write, but most frustrating, they need to be traversable |
| with a constant amount of RAM. Notably this rules out |
[B-trees][wikipedia-B-tree], which cannot be traversed with constant RAM, and
| [B+-trees][wikipedia-B+-tree], which are not possible to update with COW |
| operations. |
| |
| --- |
| |
| So, what can we do? First let's consider storing files in a simple COW |
| linked-list. Appending a block, which is the basis for writing files, means we |
| have to update the last block to point to our new block. This requires a COW |
| operation, which means we need to update the second-to-last block, and then the |
| third-to-last, and so on until we've copied out the entire file. |
| |
| ``` |
| A linked-list |
| .--------. .--------. .--------. .--------. .--------. .--------. |
| | data 0 |->| data 1 |->| data 2 |->| data 4 |->| data 5 |->| data 6 | |
| | | | | | | | | | | | | |
| | | | | | | | | | | | | |
| '--------' '--------' '--------' '--------' '--------' '--------' |
| ``` |
| |
| To avoid a full copy during appends, we can store the data backwards. Appending |
| blocks just requires adding the new block and no other blocks need to be |
| updated. If we update a block in the middle, we still need to copy the |
| following blocks, but can reuse any blocks before it. Since most file writes |
| are linear, this design gambles that appends are the most common type of data |
| update. |
| |
| ``` |
| A backwards linked-list |
| .--------. .--------. .--------. .--------. .--------. .--------. |
| | data 0 |<-| data 1 |<-| data 2 |<-| data 4 |<-| data 5 |<-| data 6 | |
| | | | | | | | | | | | | |
| | | | | | | | | | | | | |
| '--------' '--------' '--------' '--------' '--------' '--------' |
| ``` |
| |
| However, a backwards linked-list does have a rather glaring problem. Iterating |
| over a file _in order_ has a runtime cost of _O(n²)_. A quadratic runtime |
| just to read a file! That's awful. |
| |
| Fortunately we can do better. Instead of a singly linked list, littlefs |
| uses a multilayered linked-list often called a |
| [skip-list][wikipedia-skip-list]. However, unlike the most common type of |
skip-list, littlefs's skip-lists are strictly deterministic, built around some
| interesting properties of the count-trailing-zeros (CTZ) instruction. |
| |
The rule CTZ skip-lists follow is that every _n_‍th block, where _n_ is
divisible by 2‍_ˣ_, contains a pointer to block _n_-2‍_ˣ_. This
means that each block contains anywhere from 1 to
| log₂_n_ pointers that skip to different preceding elements of the |
| skip-list. |
| |
| The name comes from heavy use of the [CTZ instruction][wikipedia-ctz], which |
lets us calculate the power-of-two factors efficiently. For a given block _n_,
that block contains ctz(_n_)+1 pointers. For example, block 12 (0b1100) has
ctz(12) = 2 and so stores 3 pointers.
| |
| ``` |
| A backwards CTZ skip-list |
| .--------. .--------. .--------. .--------. .--------. .--------. |
| | data 0 |<-| data 1 |<-| data 2 |<-| data 3 |<-| data 4 |<-| data 5 | |
| | |<-| |--| |<-| |--| | | | |
| | |<-| |--| |--| |--| | | | |
| '--------' '--------' '--------' '--------' '--------' '--------' |
| ``` |
| |
| The additional pointers let us navigate the data-structure on disk much more |
| efficiently than in a singly linked list. |
| |
| Consider a path from data block 5 to data block 1. You can see how data block 3 |
| was completely skipped: |
| ``` |
| .--------. .--------. .--------. .--------. .--------. .--------. |
| | data 0 | | data 1 |<-| data 2 | | data 3 | | data 4 |<-| data 5 | |
| | | | | | |<-| |--| | | | |
| | | | | | | | | | | | | |
| '--------' '--------' '--------' '--------' '--------' '--------' |
| ``` |
| |
| The path to data block 0 is even faster, requiring only two jumps: |
| ``` |
| .--------. .--------. .--------. .--------. .--------. .--------. |
| | data 0 | | data 1 | | data 2 | | data 3 | | data 4 |<-| data 5 | |
| | | | | | | | | | | | | |
| | |<-| |--| |--| |--| | | | |
| '--------' '--------' '--------' '--------' '--------' '--------' |
| ``` |
| |
| We can find the runtime complexity by looking at the path to any block from |
| the block containing the most pointers. Every step along the path divides |
| the search space for the block in half, giving us a runtime of _O(log n)_. |
| To get _to_ the block with the most pointers, we can perform the same steps |
| backwards, which puts the runtime at _O(2 log n)_ = _O(log n)_. An interesting |
| note is that this optimal path occurs naturally if we greedily choose the |
| pointer that covers the most distance without passing our target. |
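
This greedy traversal is straightforward to sketch in C. Here `bd_read32` is
an assumed block-device primitive, and we assume the 2ˢ-skip pointer of a
block is stored at word offset s; littlefs's real code differs in details:

```c
#include <stdint.h>

// Assumed primitive: read one 32-bit pointer out of a block.
int bd_read32(uint32_t block, uint32_t off, uint32_t *word);

static inline uint32_t ctz32(uint32_t x) { return __builtin_ctz(x); }
static inline uint32_t ilog2(uint32_t x) { return 31 - __builtin_clz(x); }

// Follow a CTZ skip-list from block `head` (index `current`) back to
// block index `target`, greedily taking the largest skip that does not
// pass the target. Block i stores ctz(i)+1 pointers.
int ctz_find(uint32_t head, uint32_t current, uint32_t target,
             uint32_t *found) {
    while (current > target) {
        // largest s with 2^s <= current-target, capped by the number
        // of pointers actually stored in this block
        uint32_t s = ilog2(current - target);
        if (s > ctz32(current)) {
            s = ctz32(current);
        }

        int err = bd_read32(head, 4*s, &head);
        if (err) {
            return err;
        }
        current -= (uint32_t)1 << s;
    }

    *found = head;
    return 0;
}
```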
| |
| So now we have a [COW] data structure that is cheap to append with a runtime |
| of _O(1)_, and can be read with a worst case runtime of _O(n log n)_. Given |
| that this runtime is also divided by the amount of data we can store in a |
| block, this cost is fairly reasonable. |
| |
| --- |
| |
| This is a new data structure, so we still have several questions. What is the |
| storage overhead? Can the number of pointers exceed the size of a block? How do |
| we store a CTZ skip-list in our metadata pairs? |
| |
| To find the storage overhead, we can look at the data structure as multiple |
| linked-lists. Each linked-list skips twice as many blocks as the previous, |
| or from another perspective, each linked-list uses half as much storage as |
| the previous. As we approach infinity, the storage overhead forms a geometric |
| series. Solving this tells us that on average our storage overhead is only |
| 2 pointers per block. |
| |
| ![lim,n->inf((1/n)sum,i,0->n(ctz(i)+1)) = sum,i,0->inf(1/2^i) = 2][ctz-formula1] |
| |
Because our file size is limited by the word width we use to store sizes, we can
| also solve for the maximum number of pointers we would ever need to store in a |
| block. If we set the overhead of pointers equal to the block size, we get the |
| following equation. Note that both a smaller block size (![B][bigB]) and larger |
| word width (![w]) result in more storage overhead. |
| |
| ![B = (w/8)ceil(log2(2^w / (B-2w/8)))][ctz-formula2] |
| |
| Solving the equation for ![B][bigB] gives us the minimum block size for some |
| common word widths: |
| |
| 1. 32-bit CTZ skip-list => minimum block size of 104 bytes |
| 2. 64-bit CTZ skip-list => minimum block size of 448 bytes |
| |
| littlefs uses a 32-bit word width, so our blocks can only overflow with |
| pointers if they are smaller than 104 bytes. This is an easy requirement, as |
| in practice, most block sizes start at 512 bytes. As long as our block size |
| is larger than 104 bytes, we can avoid the extra logic needed to handle |
| pointer overflow. |
| |
The last question is how we store CTZ skip-lists. We need a pointer to the
| head block, the size of the skip-list, the index of the head block, and our |
| offset in the head block. But it's worth noting that each size maps to a unique |
| index + offset pair. So in theory we can store only a single pointer and size. |
| |
| However, calculating the index + offset pair from the size is a bit |
| complicated. We can start with a summation that loops through all of the blocks |
| up until our given size. Let ![B][bigB] be the block size in bytes, ![w] be the |
| word width in bits, ![n] be the index of the block in the skip-list, and |
| ![N][bigN] be the file size in bytes: |
| |
| ![N = sum,i,0->n(B-(w/8)(ctz(i)+1))][ctz-formula3] |
| |
| This works quite well, but requires _O(n)_ to compute, which brings the full |
| runtime of reading a file up to _O(n² log n)_. Fortunately, that summation |
| doesn't need to touch the disk, so the practical impact is minimal. |
| |
| However, despite the integration of a bitwise operation, we can actually reduce |
this equation to an _O(1)_ form. While browsing the amazing resource that is
| the [On-Line Encyclopedia of Integer Sequences (OEIS)][oeis], I managed to find |
| [A001511], which matches the iteration of the CTZ instruction, |
| and [A005187], which matches its partial summation. Much to my |
| surprise, these both result from simple equations, leading us to a rather |
| unintuitive property that ties together two seemingly unrelated bitwise |
| instructions: |
| |
| ![sum,i,0->n(ctz(i)+1) = 2n-popcount(n)][ctz-formula4] |
| |
| where: |
| |
| 1. ctz(![x]) = the number of trailing bits that are 0 in ![x] |
| 2. popcount(![x]) = the number of bits that are 1 in ![x] |
| |
Initial tests suggest this surprising property holds. As ![n] approaches
infinity, we end up with an average overhead of 2 pointers, which matches our
assumption from earlier. During iteration, the popcount function seems to
handle deviations from this average. Of course, just to make sure, I wrote a
quick script that verified this property for all 32-bit integers.
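
In the same spirit as that quick script, a minimal exhaustive check might look
like the following (bounded here so the demo runs quickly; the sum starts at
i = 1 since ctz(0) is undefined):

```c
#include <assert.h>
#include <stdint.h>

// Check sum(i=1..n, ctz(i)+1) == 2n - popcount(n) over a bounded range.
int main(void) {
    uint64_t sum = 0;
    for (uint32_t n = 1; n <= 1000000; n++) {
        sum += __builtin_ctz(n) + 1;
        assert(sum == 2*(uint64_t)n - __builtin_popcount(n));
    }
    return 0;
}
```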
| |
| Now we can substitute into our original equation to find a more efficient |
| equation for file size: |
| |
| ![N = Bn - (w/8)(2n-popcount(n))][ctz-formula5] |
| |
| Unfortunately, the popcount function is non-injective, so we can't solve this |
| equation for our index. But what we can do is solve for an ![n'] index that |
| is greater than ![n] with error bounded by the range of the popcount function. |
| We can repeatedly substitute ![n'] into the original equation until the error |
| is smaller than our integer resolution. As it turns out, we only need to |
| perform this substitution once, which gives us this formula for our index: |
| |
| ![n = floor((N-(w/8)popcount(N/(B-2w/8))) / (B-2w/8))][ctz-formula6] |
| |
| Now that we have our index ![n], we can just plug it back into the above |
| equation to find the offset. We run into a bit of a problem with integer |
| overflow, but we can avoid this by rearranging the equation a bit: |
| |
| ![off = N - (B-2w/8)n - (w/8)popcount(n)][ctz-formula7] |
| |
| Our solution requires quite a bit of math, but computers are very good at math. |
| Now we can find both our block index and offset from a size in _O(1)_, letting |
| us store CTZ skip-lists with only a pointer and size. |
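
Transcribed directly into C, the two equations might look like this. This is a
sketch of the math above that glosses over edge cases such as N = 0, and
littlefs's actual implementation differs slightly:

```c
#include <stdint.h>

// Map a file size N in bytes to a block index n and an offset in that
// block, given block size B in bytes and a 32-bit word width (w/8 = 4).
// Assumes B is well above the 104 byte minimum derived above.
void ctz_index(uint32_t B, uint32_t N, uint32_t *n, uint32_t *off) {
    uint32_t b = B - 2*4;  // block bytes left after 2 words of pointers
    uint32_t i = (N - 4*__builtin_popcount(N/b)) / b;
    *n = i;
    *off = N - b*i - 4*__builtin_popcount(i);
}
```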
| |
| CTZ skip-lists give us a COW data structure that is easily traversable in |
| _O(n)_, can be appended in _O(1)_, and can be read in _O(n log n)_. All of |
| these operations work in a bounded amount of RAM and require only two words of |
| storage overhead per block. In combination with metadata pairs, CTZ skip-lists |
| provide power resilience and compact storage of data. |
| |
| ``` |
| .--------. |
| .|metadata| |
| || | |
| || | |
| |'--------' |
| '----|---' |
| v |
| .--------. .--------. .--------. .--------. |
| | data 0 |<-| data 1 |<-| data 2 |<-| data 3 | |
| | |<-| |--| | | | |
| | | | | | | | | |
| '--------' '--------' '--------' '--------' |
| |
| write data to disk, create copies |
| => |
| .--------. |
| .|metadata| |
| || | |
| || | |
| |'--------' |
| '----|---' |
| v |
| .--------. .--------. .--------. .--------. |
| | data 0 |<-| data 1 |<-| data 2 |<-| data 3 | |
| | |<-| |--| | | | |
| | | | | | | | | |
| '--------' '--------' '--------' '--------' |
| ^ ^ ^ |
| | | | .--------. .--------. .--------. .--------. |
| | | '----| new |<-| new |<-| new |<-| new | |
| | '----------------| data 2 |<-| data 3 |--| data 4 | | data 5 | |
| '------------------| |--| |--| | | | |
| '--------' '--------' '--------' '--------' |
| |
| commit to metadata pair |
| => |
| .--------. |
| .|new | |
| ||metadata| |
| || | |
| |'--------' |
| '----|---' |
| | |
| .--------. .--------. .--------. .--------. | |
| | data 0 |<-| data 1 |<-| data 2 |<-| data 3 | | |
| | |<-| |--| | | | | |
| | | | | | | | | | |
| '--------' '--------' '--------' '--------' | |
| ^ ^ ^ v |
| | | | .--------. .--------. .--------. .--------. |
| | | '----| new |<-| new |<-| new |<-| new | |
| | '----------------| data 2 |<-| data 3 |--| data 4 | | data 5 | |
| '------------------| |--| |--| | | | |
| '--------' '--------' '--------' '--------' |
| ``` |
| |
| ## The block allocator |
| |
| So we now have the framework for an atomic, wear leveling filesystem. Small two |
| block metadata pairs provide atomic updates, while CTZ skip-lists provide |
| compact storage of data in COW blocks. |
| |
| But now we need to look at the [elephant] in the room. Where do all these |
| blocks come from? |
| |
| Deciding which block to use next is the responsibility of the block allocator. |
| In filesystem design, block allocation is often a second-class citizen, but in |
| a COW filesystem its role becomes much more important as it is needed for |
| nearly every write to the filesystem. |
| |
| Normally, block allocation involves some sort of free list or bitmap stored on |
| the filesystem that is updated with free blocks. However, with power |
| resilience, keeping these structures consistent becomes difficult. It doesn't |
| help that any mistake in updating these structures can result in lost blocks |
| that are impossible to recover. |
| |
| littlefs takes a cautious approach. Instead of trusting a free list on disk, |
| littlefs relies on the fact that the filesystem on disk is a mirror image of |
| the free blocks on the disk. The block allocator operates much like a garbage |
| collector in a scripting language, scanning for unused blocks on demand. |
| |
| ``` |
| .----. |
| |root| |
| | | |
| '----' |
| v-------' '-------v |
| .----. . . .----. |
| | A | . . | B | |
| | | . . | | |
| '----' . . '----' |
| . . . . v--' '------------v---------v |
| . . . .----. . .----. .----. |
| . . . | C | . | D | | E | |
| . . . | | . | | | | |
| . . . '----' . '----' '----' |
| . . . . . . . . . . |
| .----.----.----.----.----.----.----.----.----.----.----.----. |
| | A | |root| C | B | | D | | E | | |
| | | | | | | | | | | | |
| '----'----'----'----'----'----'----'----'----'----'----'----' |
| ^ ^ ^ ^ ^ |
| '-------------------'----'-------------------'----'-- free blocks |
| ``` |
| |
| While this approach may sound complicated, the decision to not maintain a free |
| list greatly simplifies the overall design of littlefs. Unlike programming |
| languages, there are only a handful of data structures we need to traverse. |
| And block deallocation, which occurs nearly as often as block allocation, |
| is simply a noop. This "drop it on the floor" strategy greatly reduces the |
| complexity of managing on disk data structures, especially when handling |
| high-risk error conditions. |
| |
| --- |
| |
| Our block allocator needs to find free blocks efficiently. You could traverse |
| through every block on storage and check each one against our filesystem tree; |
| however, the runtime would be abhorrent. We need to somehow collect multiple |
| blocks per traversal. |
| |
| Looking at existing designs, some larger filesystems that use a similar "drop |
| it on the floor" strategy store a bitmap of the entire storage in [RAM]. This |
| works well because bitmaps are surprisingly compact. We can't use the same |
| strategy here, as it violates our constant RAM requirement, but we may be able |
| to modify the idea into a workable solution. |
| |
| ``` |
| .----.----.----.----.----.----.----.----.----.----.----.----. |
| | A | |root| C | B | | D | | E | | |
| | | | | | | | | | | | |
| '----'----'----'----'----'----'----'----'----'----'----'----' |
| 1 0 1 1 1 0 0 1 0 1 0 0 |
| \---------------------------+----------------------------/ |
| v |
| bitmap: 0xb94 (0b101110010100) |
| ``` |
| |
| The block allocator in littlefs is a compromise between a disk-sized bitmap and |
| a brute force traversal. Instead of a bitmap the size of storage, we keep track |
| of a small, fixed-size bitmap called the lookahead buffer. During block |
| allocation, we take blocks from the lookahead buffer. If the lookahead buffer |
| is empty, we scan the filesystem for more free blocks, populating our lookahead |
| buffer. In each scan we use an increasing offset, circling the storage as |
| blocks are allocated. |
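
A sketch of the allocation side of this scheme, with names and layout chosen
for illustration rather than taken from littlefs's actual structures:

```c
#include <stdint.h>

// A small bitmap window over the blocks, starting at block `start`;
// bit i set means block start+i is known to be in use.
struct lookahead {
    uint32_t start;     // first block covered by the window
    uint32_t next;      // next bit to consider
    uint32_t size;      // window size in blocks, e.g. 32
    uint8_t map[32/8];  // 1 bit per block in the window
};

// Take the next free block from the window. Once the window is
// exhausted, the caller rescans the filesystem tree to repopulate the
// map at a new offset, circling the storage.
int alloc(struct lookahead *la, uint32_t *block) {
    while (la->next < la->size) {
        uint32_t i = la->next++;
        if (!(la->map[i/8] & (1u << (i%8)))) {
            la->map[i/8] |= 1u << (i%8);  // mark as allocated
            *block = la->start + i;
            return 0;
        }
    }
    return -1;  // window empty: scan, advance la->start, try again
}
```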
| |
| Here's what it might look like to allocate 4 blocks on a decently busy |
| filesystem with a 32 bit lookahead and a total of 128 blocks (512 KiB |
| of storage if blocks are 4 KiB): |
| ``` |
| boot... lookahead: |
| fs blocks: fffff9fffffffffeffffffffffff0000 |
| scanning... lookahead: fffff9ff |
| fs blocks: fffff9fffffffffeffffffffffff0000 |
| alloc = 21 lookahead: fffffdff |
| fs blocks: fffffdfffffffffeffffffffffff0000 |
| alloc = 22 lookahead: ffffffff |
| fs blocks: fffffffffffffffeffffffffffff0000 |
| scanning... lookahead: fffffffe |
| fs blocks: fffffffffffffffeffffffffffff0000 |
| alloc = 63 lookahead: ffffffff |
| fs blocks: ffffffffffffffffffffffffffff0000 |
| scanning... lookahead: ffffffff |
| fs blocks: ffffffffffffffffffffffffffff0000 |
| scanning... lookahead: ffffffff |
| fs blocks: ffffffffffffffffffffffffffff0000 |
| scanning... lookahead: ffff0000 |
| fs blocks: ffffffffffffffffffffffffffff0000 |
| alloc = 112 lookahead: ffff8000 |
| fs blocks: ffffffffffffffffffffffffffff8000 |
| ``` |
| |
| This lookahead approach has a runtime complexity of _O(n²)_ to completely |
| scan storage; however, bitmaps are surprisingly compact, and in practice only |
| one or two passes are usually needed to find free blocks. Additionally, the |
| performance of the allocator can be optimized by adjusting the block size or |
| size of the lookahead buffer, trading either write granularity or RAM for |
| allocator performance. |
| |
| ## Wear leveling |
| |
| The block allocator has a secondary role: wear leveling. |
| |
| Wear leveling is the process of distributing wear across all blocks in the |
| storage to prevent the filesystem from experiencing an early death due to |
| wear on a single block in the storage. |
| |
| littlefs has two methods of protecting against wear: |
| 1. Detection and recovery from bad blocks |
| 2. Evenly distributing wear across dynamic blocks |
| |
| --- |
| |
| Recovery from bad blocks doesn't actually have anything to do with the block |
| allocator itself. Instead, it relies on the ability of the filesystem to detect |
| and evict bad blocks when they occur. |
| |
| In littlefs, it is fairly straightforward to detect bad blocks at write time. |
| All writes must be sourced by some form of data in RAM, so immediately after we |
| write to a block, we can read the data back and verify that it was written |
| correctly. If we find that the data on disk does not match the copy we have in |
| RAM, a write error has occurred and we most likely have a bad block. |
| |
| Once we detect a bad block, we need to recover from it. In the case of write |
| errors, we have a copy of the corrupted data in RAM, so all we need to do is |
| evict the bad block, allocate a new, hopefully good block, and repeat the write |
| that previously failed. |
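
A sketch of this write-then-verify step, with the block device operations
assumed for illustration:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Assumed block device operations.
int bd_prog(uint32_t block, uint32_t off, const void *buf, size_t size);
int bd_read(uint32_t block, uint32_t off, void *buf, size_t size);

// Program data, then read it back and compare against the copy in RAM.
// A mismatch is treated as a write error, prompting the caller to evict
// the block and retry the write on a freshly allocated block.
int prog_verify(uint32_t block, uint32_t off,
                const void *buf, size_t size) {
    int err = bd_prog(block, off, buf, size);
    if (err) {
        return err;
    }

    uint8_t check[64];
    for (size_t i = 0; i < size; i += sizeof check) {
        size_t d = size - i < sizeof check ? size - i : sizeof check;
        err = bd_read(block, off + i, check, d);
        if (err) {
            return err;
        }
        if (memcmp(check, (const uint8_t*)buf + i, d) != 0) {
            return -1;  // bad block: relocate and repeat the write
        }
    }
    return 0;
}
```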
| |
| The actual act of evicting the bad block and replacing it with a new block is |
| left up to the filesystem's copy-on-bounded-writes (CObW) data structures. One |
| property of CObW data structures is that any block can be replaced during a |
| COW operation. The bounded-writes part is normally triggered by a counter, but |
| nothing prevents us from triggering a COW operation as soon as we find a bad |
| block. |
| |
| ``` |
| .----. |
| |root| |
| | | |
| '----' |
| v--' '----------------------v |
| .----. .----. |
| | A | | B | |
| | | | | |
| '----' '----' |
| . . v---' . |
| . . .----. . |
| . . | C | . |
| . . | | . |
| . . '----' . |
| . . . . . |
| .----.----.----.----.----.----.----.----.----.----. |
| | A |root| | C | B | | |
| | | | | | | | |
| '----'----'----'----'----'----'----'----'----'----' |
| |
| update C |
| => |
| .----. |
| |root| |
| | | |
| '----' |
| v--' '----------------------v |
| .----. .----. |
| | A | | B | |
| | | | | |
| '----' '----' |
| . . v---' . |
| . . .----. . |
| . . |bad | . |
| . . |blck| . |
| . . '----' . |
| . . . . . |
| .----.----.----.----.----.----.----.----.----.----. |
| | A |root| |bad | B | | |
| | | | |blck| | | |
| '----'----'----'----'----'----'----'----'----'----' |
| |
| oh no! bad block! relocate C |
| => |
| .----. |
| |root| |
| | | |
| '----' |
| v--' '----------------------v |
| .----. .----. |
| | A | | B | |
| | | | | |
| '----' '----' |
| . . v---' . |
| . . .----. . |
| . . |bad | . |
| . . |blck| . |
| . . '----' . |
| . . . . . |
| .----.----.----.----.----.----.----.----.----.----. |
| | A |root| |bad | B |bad | | |
| | | | |blck| |blck| | |
| '----'----'----'----'----'----'----'----'----'----' |
| ---------> |
| oh no! bad block! relocate C |
| => |
| .----. |
| |root| |
| | | |
| '----' |
| v--' '----------------------v |
| .----. .----. |
| | A | | B | |
| | | | | |
| '----' '----' |
| . . v---' . |
| . . .----. . .----. |
| . . |bad | . | C' | |
| . . |blck| . | | |
| . . '----' . '----' |
| . . . . . . . |
| .----.----.----.----.----.----.----.----.----.----. |
| | A |root| |bad | B |bad | C' | | |
| | | | |blck| |blck| | | |
| '----'----'----'----'----'----'----'----'----'----' |
| --------------> |
| successfully relocated C, update B |
| => |
| .----. |
| |root| |
| | | |
| '----' |
| v--' '----------------------v |
| .----. .----. |
| | A | |bad | |
| | | |blck| |
| '----' '----' |
| . . v---' . |
| . . .----. . .----. |
| . . |bad | . | C' | |
| . . |blck| . | | |
| . . '----' . '----' |
| . . . . . . . |
| .----.----.----.----.----.----.----.----.----.----. |
| | A |root| |bad |bad |bad | C' | | |
| | | | |blck|blck|blck| | | |
| '----'----'----'----'----'----'----'----'----'----' |
| |
| oh no! bad block! relocate B |
| => |
| .----. |
| |root| |
| | | |
| '----' |
| v--' '----------------------v |
| .----. .----. .----. |
| | A | |bad | |bad | |
| | | |blck| |blck| |
| '----' '----' '----' |
| . . v---' . . . |
| . . .----. . .----. . |
| . . |bad | . | C' | . |
| . . |blck| . | | . |
| . . '----' . '----' . |
| . . . . . . . . |
| .----.----.----.----.----.----.----.----.----.----. |
| | A |root| |bad |bad |bad | C' |bad | |
| | | | |blck|blck|blck| |blck| |
| '----'----'----'----'----'----'----'----'----'----' |
| --------------> |
| oh no! bad block! relocate B |
| => |
| .----. |
| |root| |
| | | |
| '----' |
| v--' '----------------------v |
| .----. .----. .----. |
| | A | | B' | |bad | |
| | | | | |blck| |
| '----' '----' '----' |
| . . . | . .---' . |
| . . . '--------------v-------------v |
| . . . . .----. . .----. |
| . . . . |bad | . | C' | |
| . . . . |blck| . | | |
| . . . . '----' . '----' |
| . . . . . . . . . |
| .----.----.----.----.----.----.----.----.----.----. |
| | A |root| B' | |bad |bad |bad | C' |bad | |
| | | | | |blck|blck|blck| |blck| |
| '----'----'----'----'----'----'----'----'----'----' |
| ------------> ------------------ |
| successfully relocated B, update root |
| => |
| .----. |
| |root| |
| | | |
| '----' |
| v--' '--v |
| .----. .----. |
| | A | | B' | |
| | | | | |
| '----' '----' |
| . . . '---------------------------v |
| . . . . .----. |
| . . . . | C' | |
| . . . . | | |
| . . . . '----' |
| . . . . . . |
| .----.----.----.----.----.----.----.----.----.----. |
| | A |root| B' | |bad |bad |bad | C' |bad | |
| | | | | |blck|blck|blck| |blck| |
| '----'----'----'----'----'----'----'----'----'----' |
| ``` |
| |
| We may find that the new block is also bad, but hopefully after repeating this |
| cycle we'll eventually find a new block where a write succeeds. If we don't, |
| that means that all blocks in our storage are bad, and we've reached the end of |
| our device's usable life. At this point, littlefs will return an "out of space" |
| error. This is technically true, as there are no more good blocks, but as an |
| added benefit it also matches the error condition expected by users of |
| dynamically sized data. |
| |
| --- |
| |
| Read errors, on the other hand, are quite a bit more complicated. We don't have |
| a copy of the data lingering around in RAM, so we need a way to reconstruct the |
original data even after it has been corrupted. One such mechanism is
| [error-correction-codes (ECC)][wikipedia-ecc]. |
| |
| ECC is an extension to the idea of a checksum. Where a checksum such as CRC can |
| detect that an error has occurred in the data, ECC can detect and actually |
| correct some amount of errors. However, there is a limit to how many errors ECC |
| can detect: the [Hamming bound][wikipedia-hamming-bound]. As the number of |
| errors approaches the Hamming bound, we may still be able to detect errors, but |
| can no longer fix the data. If we've reached this point the block is |
| unrecoverable. |
| |
| littlefs by itself does **not** provide ECC. The block nature and relatively |
large footprint of ECC do not work well with the dynamically sized data of
| filesystems, correcting errors without RAM is complicated, and ECC fits better |
| with the geometry of block devices. In fact, several NOR flash chips have extra |
| storage intended for ECC, and many NAND chips can even calculate ECC on the |
| chip itself. |
| |
| In littlefs, ECC is entirely optional. Read errors can instead be prevented |
| proactively by wear leveling. But it's important to note that ECC can be used |
| at the block device level to modestly extend the life of a device. littlefs |
| respects any errors reported by the block device, allowing a block device to |
| provide additional aggressive error detection. |
| |
| --- |
| |
| To avoid read errors, we need to be proactive, as opposed to reactive as we |
| were with write errors. |
| |
| One way to do this is to detect when the number of errors in a block exceeds |
| some threshold, but is still recoverable. With ECC we can do this at write |
| time, and treat the error as a write error, evicting the block before fatal |
| read errors have a chance to develop. |
| |
A different, more generic strategy is to proactively distribute wear across
| all blocks in the storage, with the hope that no single block fails before the |
| rest of storage is approaching the end of its usable life. This is called |
| wear leveling. |
| |
| Generally, wear leveling algorithms fall into one of two categories: |
| |
| 1. [Dynamic wear leveling][wikipedia-dynamic-wear-leveling], where we |
   distribute wear over "dynamic" blocks. This can be accomplished by
| only considering unused blocks. |
| |
| 2. [Static wear leveling][wikipedia-static-wear-leveling], where we |
| distribute wear over both "dynamic" and "static" blocks. To make this work, |
| we need to consider all blocks, including blocks that already contain data. |
| |
| As a tradeoff for code size and complexity, littlefs (currently) only provides |
| dynamic wear leveling. This is a best effort solution. Wear is not distributed |
| perfectly, but it is distributed among the free blocks and greatly extends the |
| life of a device. |
| |
| On top of this, littlefs uses a statistical wear leveling algorithm. What this |
means is that we don't actively track wear; instead we rely on a uniform
| distribution of wear across storage to approximate a dynamic wear leveling |
| algorithm. Despite the long name, this is actually a simplification of dynamic |
| wear leveling. |
| |
| The uniform distribution of wear is left up to the block allocator, which |
| creates a uniform distribution in two parts. The easy part is when the device |
| is powered, in which case we allocate the blocks linearly, circling the device. |
| The harder part is what to do when the device loses power. We can't just |
| restart the allocator at the beginning of storage, as this would bias the wear. |
Instead, we start the allocator at a random offset every time we mount the
| filesystem. As long as this random offset is uniform, the combined allocation |
| pattern is also a uniform distribution. |
| |
| ![Cumulative wear distribution graph][wear-distribution-graph] |
| |
| Initially, this approach to wear leveling looks like it creates a difficult |
| dependency on a power-independent random number generator, which must return |
| different random numbers on each boot. However, the filesystem is in a |
relatively unique situation in that it is sitting on top of a large amount
| of entropy that persists across power loss. |
| |
| We can actually use the data on disk to directly drive our random number |
| generator. In practice, this is implemented by xoring the checksums of each |
| metadata pair, which is already calculated to fetch and mount the filesystem. |
| |
| ``` |
| .--------. \ probably random |
| .|metadata| | ^ |
| || | +-> crc ----------------------> xor |
| || | | ^ |
| |'--------' / | |
| '---|--|-' | |
| .-' '-------------------------. | |
| | | | |
| | .--------------> xor ------------> xor |
| | | ^ | ^ |
| v crc crc v crc |
| .--------. \ ^ .--------. \ ^ .--------. \ ^ |
| .|metadata|-|--|-->|metadata| | | .|metadata| | | |
| || | +--' || | +--' || | +--' |
| || | | || | | || | | |
| |'--------' / |'--------' / |'--------' / |
| '---|--|-' '----|---' '---|--|-' |
| .-' '-. | .-' '-. |
| v v v v v |
| .--------. .--------. .--------. .--------. .--------. |
| | data | | data | | data | | data | | data | |
| | | | | | | | | | | |
| | | | | | | | | | | |
| '--------' '--------' '--------' '--------' '--------' |
| ``` |
| |
| Note that this random number generator is not perfect. It only returns unique |
| random numbers when the filesystem is modified. This is exactly what we want |
| for distributing wear in the allocator, but means this random number generator |
| is not useful for general use. |
| |
| --- |
| |
| Together, bad block detection and dynamic wear leveling provide a best effort |
| solution for avoiding the early death of a filesystem due to wear. Importantly, |
| littlefs's wear leveling algorithm provides a key feature: You can increase the |
| life of a device simply by increasing the size of storage. And if more |
| aggressive wear leveling is desired, you can always combine littlefs with a |
| [flash translation layer (FTL)][wikipedia-ftl] to get a small power resilient |
| filesystem with static wear leveling. |
| |
| ## Files |
| |
| Now that we have our building blocks out of the way, we can start looking at |
| our filesystem as a whole. |
| |
| The first step: How do we actually store our files? |
| |
| We've determined that CTZ skip-lists are pretty good at storing data compactly, |
| so following the precedent found in other filesystems we could give each file |
| a skip-list stored in a metadata pair that acts as an inode for the file. |
| |
| |
| ``` |
| .--------. |
| .|metadata| |
| || | |
| || | |
| |'--------' |
| '----|---' |
| v |
| .--------. .--------. .--------. .--------. |
| | data 0 |<-| data 1 |<-| data 2 |<-| data 3 | |
| | |<-| |--| | | | |
| | | | | | | | | |
| '--------' '--------' '--------' '--------' |
| ``` |
| |
| However, this doesn't work well when files are small, which is common for |
| embedded systems. Compared to PCs, _all_ data in an embedded system is small. |
| |
| Consider a small 4-byte file. With a two block metadata-pair and one block for |
| the CTZ skip-list, we find ourselves using a full 3 blocks. On most NOR flash |
| with 4 KiB blocks, this is 12 KiB of overhead. A ridiculous 3072x increase. |
| |
| ``` |
| file stored as inode, 4 bytes costs ~12 KiB |
| |
| .----------------. \ |
| .| revision | | |
| ||----------------| \ | |
| || skiplist ---. +- metadata | |
| ||----------------| | / 4x8 bytes | |
| || checksum | | 32 bytes | |
| ||----------------| | | |
| || | | | +- metadata pair |
| || v | | | 2x4 KiB |
| || | | | 8 KiB |
| || | | | |
| || | | | |
| || | | | |
| |'----------------' | | |
| '----------------' | / |
| .--------' |
| v |
| .----------------. \ \ |
| | data | +- data | |
| |----------------| / 4 bytes | |
| | | | |
| | | | |
| | | | |
| | | +- data block |
| | | | 4 KiB |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| '----------------' / |
| ``` |
| |
| We can make several improvements. First, instead of giving each file its own |
| metadata pair, we can store multiple files in a single metadata pair. One way |
| to do this is to directly associate a directory with a metadata pair (or a |
| linked list of metadata pairs). This makes it easy for multiple files to share |
| the directory's metadata pair for logging and reduces the collective storage |
| overhead. |
| |
| The strict binding of metadata pairs and directories also gives users |
| direct control over storage utilization depending on how they organize their |
| directories. |
| |
| ``` |
| multiple files stored in metadata pair, 4 bytes costs ~4 KiB |
| |
| .----------------. |
| .| revision | |
| ||----------------| |
| || A name | |
| || A skiplist -----. |
| ||----------------| | \ |
| || B name | | +- metadata |
| || B skiplist ---. | | 4x8 bytes |
| ||----------------| | | / 32 bytes |
| || checksum | | | |
| ||----------------| | | |
| || | | | | |
| || v | | | |
| |'----------------' | | |
| '----------------' | | |
| .----------------' | |
| v v |
| .----------------. .----------------. \ \ |
| | A data | | B data | +- data | |
| | | |----------------| / 4 bytes | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | + data block |
| | | | | | 4 KiB |
| | | | | | |
| |----------------| | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| '----------------' '----------------' / |
| ``` |
| |
| The second improvement we can make is noticing that for very small files, our
| attempts to use CTZ skip-lists for compact storage backfire. Metadata pairs
| have a ~4x storage cost, so if our file is smaller than 1/4 the block size,
| there's actually no benefit in storing our file outside of our metadata pair.
| |
| In this case, we can store the file directly in our directory's metadata pair.
| We call this an inline file, and it allows a directory to store many small
| files quite efficiently. Our previous 4-byte file now only takes up a
| theoretical 16 bytes on disk.
| |
| ``` |
| inline files stored in metadata pair, 4 bytes costs ~16 bytes |
| |
| .----------------. |
| .| revision | |
| ||----------------| |
| || A name | |
| || A skiplist ---. |
| ||----------------| | \ |
| || B name | | +- data |
| || B data | | | 4x4 bytes |
| ||----------------| | / 16 bytes |
| || checksum | | |
| ||----------------| | |
| || | | | |
| || v | | |
| |'----------------' | |
| '----------------' | |
| .---------' |
| v |
| .----------------. |
| | A data | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| |----------------| |
| | | |
| | | |
| | | |
| '----------------' |
| ``` |
| |
| Once the file exceeds 1/4 the block size, we switch to a CTZ skip-list. This |
| means that our files never use more than 4x storage overhead, decreasing as |
| the file grows in size. |
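| 
| To make the tradeoff concrete, here's a rough cost model in C. The fixed 4 KiB
| block size and the names (`inline_cost`, `outline_cost`, `should_inline`) are
| illustrative assumptions, not littlefs's actual API:
| 
| ```
| #include <stdbool.h>
| #include <stddef.h>
| 
| #define BLOCK_SIZE 4096
| 
| // approximate on-disk cost of a file in bytes, metadata pairs cost
| // roughly 4x the bytes they hold, while a CTZ skip-list costs whole
| // blocks regardless of how little of them is used
| static size_t inline_cost(size_t size) {
|     return 4*size;
| }
| 
| static size_t outline_cost(size_t size) {
|     size_t blocks = (size + BLOCK_SIZE-1) / BLOCK_SIZE;
|     return (blocks ? blocks : 1) * BLOCK_SIZE;
| }
| 
| // the crossover falls at 1/4 the block size, below it inlining wins
| static bool should_inline(size_t size) {
|     return inline_cost(size) <= outline_cost(size);
| }
| ```
| 
| At exactly 1/4 the block size the two costs meet (4x1024 bytes = 4096 bytes =
| one block), which is where littlefs switches representations.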
| |
| ![File storage cost graph][file-cost-graph] |
| |
| ## Directories |
| |
| Now we just need directories to store our files. As mentioned above, we want
| a strict binding of directories and metadata pairs, but there are a few
| complications we need to sort out.
| |
| On their own, each directory is a linked-list of metadata pairs. This lets us |
| store an unlimited number of files in each directory, and we don't need to |
| worry about the runtime complexity of unbounded logs. We can store other |
| directory pointers in our metadata pairs, which gives us a directory tree, much |
| like what you find on other filesystems. |
| |
| ``` |
| .--------. |
| .| root | |
| || | |
| || | |
| |'--------' |
| '---|--|-' |
| .-' '-------------------------. |
| v v |
| .--------. .--------. .--------. |
| .| dir A |------->| dir A | .| dir B | |
| || | || | || | |
| || | || | || | |
| |'--------' |'--------' |'--------' |
| '---|--|-' '----|---' '---|--|-' |
| .-' '-. | .-' '-. |
| v v v v v |
| .--------. .--------. .--------. .--------. .--------. |
| | file C | | file D | | file E | | file F | | file G | |
| | | | | | | | | | | |
| | | | | | | | | | | |
| '--------' '--------' '--------' '--------' '--------' |
| ``` |
| |
| The main complication is, once again, traversal with a constant amount of |
| [RAM]. The directory tree is a tree, and the unfortunate fact is you can't |
| traverse a tree with constant RAM. |
| |
| Fortunately, the elements of our tree are metadata pairs, so unlike CTZ |
| skip-lists, we're not limited to strict COW operations. One thing we can do is |
| thread a linked-list through our tree, explicitly enabling cheap traversal |
| over the entire filesystem. |
| |
| ``` |
| .--------. |
| .| root |-. |
| || | | |
| .-------|| |-' |
| | |'--------' |
| | '---|--|-' |
| | .-' '-------------------------. |
| | v v |
| | .--------. .--------. .--------. |
| '->| dir A |------->| dir A |------->| dir B | |
| || | || | || | |
| || | || | || | |
| |'--------' |'--------' |'--------' |
| '---|--|-' '----|---' '---|--|-' |
| .-' '-. | .-' '-. |
| v v v v v |
| .--------. .--------. .--------. .--------. .--------. |
| | file C | | file D | | file E | | file F | | file G | |
| | | | | | | | | | | |
| | | | | | | | | | | |
| '--------' '--------' '--------' '--------' '--------' |
| ``` |
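| 
| This is what makes constant-RAM traversal possible. Here's a sketch of the
| loop, where `pair_t`, `read_thread_pointer`, and `pair_is_null` are
| hypothetical placeholders for littlefs's internals:
| 
| ```
| #include <stdbool.h>
| #include <stdint.h>
| 
| // hypothetical two-block address of a metadata pair
| typedef struct { uint32_t block[2]; } pair_t;
| 
| // placeholders: follow the pointer threaded through a metadata pair,
| // and detect the end of the list
| extern pair_t read_thread_pointer(pair_t pair);
| extern bool pair_is_null(pair_t pair);
| 
| // visit every metadata pair in the filesystem with O(1) RAM, no
| // recursion, no stack, just one cursor chasing the threaded list
| void traverse(pair_t root, void (*visit)(pair_t)) {
|     for (pair_t p = root; !pair_is_null(p); p = read_thread_pointer(p)) {
|         visit(p);
|     }
| }
| ```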
| |
| Unfortunately, not sticking to pure COW operations creates some problems. Now, |
| whenever we want to manipulate the directory tree, multiple pointers need to be |
| updated. If you're familiar with designing atomic data structures this should |
| set off a bunch of red flags. |
| |
| To work around this, our threaded linked-list has a bit of leeway. Instead of |
| only containing metadata pairs found in our filesystem, it is allowed to |
| contain metadata pairs that have no parent because of a power loss. These are |
| called orphaned metadata pairs. |
| |
| With the possibility of orphans, we can build power loss resilient operations |
| that maintain a filesystem tree threaded with a linked-list for traversal. |
| |
| Adding a directory to our tree: |
| |
| ``` |
| .--------. |
| .| root |-. |
| || | | |
| .-------|| |-' |
| | |'--------' |
| | '---|--|-' |
| | .-' '-. |
| | v v |
| | .--------. .--------. |
| '->| dir A |->| dir C | |
| || | || | |
| || | || | |
| |'--------' |'--------' |
| '--------' '--------' |
| |
| allocate dir B |
| => |
| .--------. |
| .| root |-. |
| || | | |
| .-------|| |-' |
| | |'--------' |
| | '---|--|-' |
| | .-' '-. |
| | v v |
| | .--------. .--------. |
| '->| dir A |--->| dir C | |
| || | .->| | |
| || | | || | |
| |'--------' | |'--------' |
| '--------' | '--------' |
| | |
| .--------. | |
| .| dir B |-' |
| || | |
| || | |
| |'--------' |
| '--------' |
| |
| insert dir B into threaded linked-list, creating an orphan |
| => |
| .--------. |
| .| root |-. |
| || | | |
| .-------|| |-' |
| | |'--------' |
| | '---|--|-' |
| | .-' '-------------. |
| | v v |
| | .--------. .--------. .--------. |
| '->| dir A |->| dir B |->| dir C | |
| || | || orphan!| || | |
| || | || | || | |
| |'--------' |'--------' |'--------' |
| '--------' '--------' '--------' |
| |
| add dir B to parent directory |
| => |
| .--------. |
| .| root |-. |
| || | | |
| .-------------|| |-' |
| | |'--------' |
| | '--|-|-|-' |
| | .------' | '-------. |
| | v v v |
| | .--------. .--------. .--------. |
| '->| dir A |->| dir B |->| dir C | |
| || | || | || | |
| || | || | || | |
| |'--------' |'--------' |'--------' |
| '--------' '--------' '--------' |
| ``` |
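| 
| The crucial part here is the ordering of commits, not the data itself: each
| step is a single atomic metadata-pair commit, and a power loss between any two
| steps leaves at worst an orphan. Below is a sketch under that assumption, with
| `alloc_pair`, `commit_thread`, and `commit_entry` as hypothetical stand-ins
| for littlefs's commit machinery:
| 
| ```
| #include <stdint.h>
| 
| typedef struct { uint32_t block[2]; } pair_t;       // hypothetical pair address
| extern pair_t alloc_pair(void);                     // allocate + init a pair
| extern int commit_thread(pair_t prev, pair_t next); // atomic tail-pointer commit
| extern int commit_entry(pair_t parent, const char *name, pair_t child);
| 
| int mkdir_safe(pair_t parent, pair_t prev, const char *name) {
|     // 1. allocate the new pair, it isn't reachable yet, so losing power
|     //    here only leaks free blocks until the next allocator scan
|     pair_t child = alloc_pair();
| 
|     // 2. splice child into the threaded linked-list after prev, power
|     //    loss here leaves child threaded but parentless, an orphan the
|     //    next mount can safely clean up
|     int err = commit_thread(prev, child);
|     if (err) {
|         return err;
|     }
| 
|     // 3. add the entry to the parent directory, only now is the new
|     //    directory user-visible, and the orphan state is resolved
|     return commit_entry(parent, name, child);
| }
| ```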
| |
| Removing a directory: |
| |
| ``` |
| .--------. |
| .| root |-. |
| || | | |
| .-------------|| |-' |
| | |'--------' |
| | '--|-|-|-' |
| | .------' | '-------. |
| | v v v |
| | .--------. .--------. .--------. |
| '->| dir A |->| dir B |->| dir C | |
| || | || | || | |
| || | || | || | |
| |'--------' |'--------' |'--------' |
| '--------' '--------' '--------' |
| |
| remove dir B from parent directory, creating an orphan |
| => |
| .--------. |
| .| root |-. |
| || | | |
| .-------|| |-' |
| | |'--------' |
| | '---|--|-' |
| | .-' '-------------. |
| | v v |
| | .--------. .--------. .--------. |
| '->| dir A |->| dir B |->| dir C | |
| || | || orphan!| || | |
| || | || | || | |
| |'--------' |'--------' |'--------' |
| '--------' '--------' '--------' |
| |
| remove dir B from threaded linked-list, returning dir B to free blocks |
| => |
| .--------. |
| .| root |-. |
| || | | |
| .-------|| |-' |
| | |'--------' |
| | '---|--|-' |
| | .-' '-. |
| | v v |
| | .--------. .--------. |
| '->| dir A |->| dir C | |
| || | || | |
| || | || | |
| |'--------' |'--------' |
| '--------' '--------' |
| ``` |
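| 
| Removal follows the same discipline in reverse. In the sketch below (same
| hypothetical helpers as above), the parent's entry is dropped first, so a
| power loss can only ever produce an orphan, never a live entry pointing at
| freed blocks:
| 
| ```
| #include <stdint.h>
| 
| typedef struct { uint32_t block[2]; } pair_t;       // hypothetical pair address
| extern pair_t read_thread_pointer(pair_t pair);
| extern int commit_thread(pair_t prev, pair_t next); // atomic tail-pointer commit
| extern int commit_remove(pair_t parent, const char *name);
| 
| int rmdir_safe(pair_t parent, pair_t prev, pair_t child, const char *name) {
|     // 1. remove the entry from the parent, power loss here leaves child
|     //    as an orphan, cleaned up on the next mount
|     int err = commit_remove(parent, name);
|     if (err) {
|         return err;
|     }
| 
|     // 2. unthread child from the linked-list, its blocks return to the
|     //    free pool on the next allocator scan
|     return commit_thread(prev, read_thread_pointer(child));
| }
| ```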
| |
| In addition to normal directory tree operations, we can use orphans to evict |
| blocks in a metadata pair when the block goes bad or exceeds its allocated |
| erases. If we lose power while evicting a metadata block we may end up with |
| a situation where the filesystem references the replacement block while the |
| threaded linked-list still contains the evicted block. We call this a |
| half-orphan. |
| |
| ``` |
| .--------. |
| .| root |-. |
| || | | |
| .-------------|| |-' |
| | |'--------' |
| | '--|-|-|-' |
| | .------' | '-------. |
| | v v v |
| | .--------. .--------. .--------. |
| '->| dir A |->| dir B |->| dir C | |
| || | || | || | |
| || | || | || | |
| |'--------' |'--------' |'--------' |
| '--------' '--------' '--------' |
| |
| try to write to dir B |
| => |
| .--------. |
| .| root |-. |
| || | | |
| .----------------|| |-' |
| | |'--------' |
| | '-|-||-|-' |
| | .--------' || '-----. |
| | v |v v |
| | .--------. .--------. .--------. |
| '->| dir A |---->| dir B |->| dir C | |
| || |-. | | || | |
| || | | | | || | |
| |'--------' | '--------' |'--------' |
| '--------' | v '--------' |
| | .--------. |
| '->| dir B | |
| | bad | |
| | block! | |
| '--------' |
| |
| oh no! bad block detected, allocate replacement |
| => |
| .--------. |
| .| root |-. |
| || | | |
| .----------------|| |-' |
| | |'--------' |
| | '-|-||-|-' |
| | .--------' || '-------. |
| | v |v v |
| | .--------. .--------. .--------. |
| '->| dir A |---->| dir B |--->| dir C | |
| || |-. | | .->| | |
| || | | | | | || | |
| |'--------' | '--------' | |'--------' |
| '--------' | v | '--------' |
| | .--------. | |
| '->| dir B | | |
| | bad | | |
| | block! | | |
| '--------' | |
| | |
| .--------. | |
| | dir B |--' |
| | | |
| | | |
| '--------' |
| |
| insert replacement in threaded linked-list, creating a half-orphan |
| => |
| .--------. |
| .| root |-. |
| || | | |
| .----------------|| |-' |
| | |'--------' |
| | '-|-||-|-' |
| | .--------' || '-------. |
| | v |v v |
| | .--------. .--------. .--------. |
| '->| dir A |---->| dir B |--->| dir C | |
| || |-. | | .->| | |
| || | | | | | || | |
| |'--------' | '--------' | |'--------' |
| '--------' | v | '--------' |
| | .--------. | |
| | | dir B | | |
| | | bad | | |
| | | block! | | |
| | '--------' | |
| | | |
| | .--------. | |
| '->| dir B |--' |
| | half | |
| | orphan!| |
| '--------' |
| |
| fix reference in parent directory |
| => |
| .--------. |
| .| root |-. |
| || | | |
| .-------------|| |-' |
| | |'--------' |
| | '--|-|-|-' |
| | .------' | '-------. |
| | v v v |
| | .--------. .--------. .--------. |
| '->| dir A |->| dir B |->| dir C | |
| || | || | || | |
| || | || | || | |
| |'--------' |'--------' |'--------' |
| '--------' '--------' '--------' |
| ``` |
| |
| Finding orphans and half-orphans is expensive, requiring an _O(n²)_
| comparison of every metadata pair with every directory entry. But the tradeoff
| is a power resilient filesystem that works with only a bounded amount of RAM.
| Fortunately, we only need to check for orphans on the first allocation after
| boot, and a read-only littlefs can ignore the threaded linked-list entirely.
| |
| If we only had some sort of global state, then we could also store a flag and |
| avoid searching for orphans unless we knew we were specifically interrupted |
| while manipulating the directory tree (foreshadowing!). |
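| 
| For completeness, here's a sketch of what that scan looks like, quadratic in
| time but constant in RAM. `find_parent`, `entry_matches`, and friends are
| hypothetical:
| 
| ```
| #include <stdbool.h>
| #include <stdint.h>
| 
| typedef struct { uint32_t block[2]; } pair_t;  // hypothetical pair address
| extern pair_t read_thread_pointer(pair_t pair);
| extern bool pair_is_null(pair_t pair);
| extern bool find_parent(pair_t root, pair_t child, pair_t *parent);
| extern bool entry_matches(pair_t parent, pair_t child);
| extern void unthread(pair_t pair);
| extern void fix_entry(pair_t parent, pair_t child);
| 
| // O(n^2) deorphan scan: for each pair on the threaded linked-list,
| // search the whole directory tree for a referencing entry (find_parent
| // is the O(n) inner loop), no entry means an orphan, an entry pointing
| // at a replacement block while the thread still holds the evicted one
| // means a half-orphan
| void deorphan(pair_t root) {
|     for (pair_t p = read_thread_pointer(root); !pair_is_null(p);
|             p = read_thread_pointer(p)) {
|         pair_t parent;
|         if (!find_parent(root, p, &parent)) {
|             unthread(p);            // orphan, unreferenced
|         } else if (!entry_matches(parent, p)) {
|             fix_entry(parent, p);   // half-orphan, repair the reference
|         }
|     }
| }
| ```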
| |
| ## The move problem |
| |
| We have one last challenge: the move problem. Phrasing the problem is simple: |
| |
| How do you atomically move a file between two directories? |
| |
| In littlefs we can atomically commit to directories, but we can't create |
| an atomic commit that spans multiple directories. The filesystem must go |
| through a minimum of two distinct states to complete a move. |
| |
| To make matters worse, file moves are a common form of synchronization for |
| filesystems. As a filesystem designed for power-loss, it's important we get |
| atomic moves right. |
| |
| So what can we do? |
| |
| - We definitely can't just let power-loss result in duplicated or lost files. |
| This could easily break users' code and would only reveal itself in extreme |
| cases. We were only able to be lazy about the threaded linked-list because |
| it isn't user facing and we can handle the corner cases internally. |
| |
| - Some filesystems propagate COW operations up the tree until a common parent |
| is found. Unfortunately this interacts poorly with our threaded tree and |
| brings back the issue of upward propagation of wear. |
| |
| - In a previous version of littlefs we tried to solve this problem by going |
| back and forth between the source and destination, marking and unmarking the |
| file as moving in order to make the move atomic from the user perspective. |
| This worked, but not well. Finding failed moves was expensive and required |
| a unique identifier for each file. |
| |
| In the end, solving the move problem required creating a new mechanism for |
| sharing knowledge between multiple metadata pairs. In littlefs this led to the |
| introduction of a mechanism called "global state". |
| |
| --- |
| |
| Global state is a small amount of state that can be updated from _any_
| metadata pair. Combining global state with metadata pairs' ability to update
| multiple entries in one commit gives us a powerful tool for crafting complex
| atomic operations.
| |
| How does global state work? |
| |
| Global state exists as a set of deltas that are distributed across the metadata |
| pairs in the filesystem. The actual global state can be built out of these |
| deltas by xoring together all of the deltas in the filesystem. |
| |
| ``` |
| .--------. .--------. .--------. .--------. .--------. |
| .| |->| gdelta |->| |->| gdelta |->| gdelta | |
| || | || 0x23 | || | || 0xff | || 0xce | |
| || | || | || | || | || | |
| |'--------' |'--------' |'--------' |'--------' |'--------' |
| '--------' '----|---' '--------' '----|---' '----|---' |
| v v v |
| 0x00 --> xor ------------------> xor ------> xor --> gstate 0x12 |
| ``` |
| |
| To update the global state from a metadata pair, we take the global state we |
| know and xor it with both our changes and any existing delta in the metadata |
| pair. Committing this new delta to the metadata pair commits the changes to |
| the filesystem's global state. |
| |
| ``` |
| .--------. .--------. .--------. .--------. .--------. |
| .| |->| gdelta |->| |->| gdelta |->| gdelta | |
| || | || 0x23 | || | || 0xff | || 0xce | |
| || | || | || | || | || | |
| |'--------' |'--------' |'--------' |'--------' |'--------' |
| '--------' '----|---' '--------' '--|---|-' '----|---' |
| v v | v |
| 0x00 --> xor ----------------> xor -|------> xor --> gstate = 0x12 |
| | | |
| | | |
| change gstate to 0xab --> xor <------------|--------------------------' |
| => | v |
| '------------> xor |
| | |
| v |
| .--------. .--------. .--------. .--------. .--------. |
| .| |->| gdelta |->| |->| gdelta |->| gdelta | |
| || | || 0x23 | || | || 0x46 | || 0xce | |
| || | || | || | || | || | |
| |'--------' |'--------' |'--------' |'--------' |'--------' |
| '--------' '----|---' '--------' '----|---' '----|---' |
| v v v |
| 0x00 --> xor ------------------> xor ------> xor --> gstate = 0xab |
| ``` |
| |
| To make this efficient, we always keep a copy of the global state in RAM. We |
| only need to iterate over our metadata pairs and build the global state when |
| the filesystem is mounted. |
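| 
| Both halves are just a handful of xors. Here's a sketch assuming a toy 32-bit
| gstate (littlefs's real global state is larger and structured, but the algebra
| is identical):
| 
| ```
| #include <stddef.h>
| #include <stdint.h>
| 
| typedef uint32_t gstate_t;  // toy stand-in for littlefs's global state
| 
| // at mount, fold every pair's delta together to recover the state
| gstate_t gstate_mount(const gstate_t *deltas, size_t n) {
|     gstate_t g = 0;
|     for (size_t i = 0; i < n; i++) {
|         g ^= deltas[i];
|     }
|     return g;
| }
| 
| // to move the global state from cur to next through one pair's commit,
| // fold the change into that pair's existing delta, since
| // delta' = delta ^ cur ^ next, re-xoring all deltas now yields next
| gstate_t gstate_update(gstate_t delta, gstate_t cur, gstate_t next) {
|     return delta ^ cur ^ next;
| }
| ```
| 
| Replaying the diagrams above: the deltas 0x23, 0xff, 0xce fold to a gstate of
| 0x12, and moving the state to 0xab through the second pair gives a new delta
| of 0xff ^ 0x12 ^ 0xab = 0x46, exactly as shown.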
| |
| You may have noticed that global state is very expensive. We keep a copy in |
| RAM and a delta in an unbounded number of metadata pairs. Even if we reset |
| the global state to its initial value, we can't easily clean up the deltas on |
| disk. For this reason, it's very important that we keep the size of global |
| state bounded and extremely small. But, even with a strict budget, global |
| state is incredibly valuable. |
| |
| --- |
| |
| Now we can solve the move problem. We can create global state describing our |
| move atomically with the creation of the new file, and we can clear this move |
| state atomically with the removal of the old file. |
| |
| ``` |
| .--------. gstate = no move |
| .| root |-. |
| || | | |
| .-------------|| |-' |
| | |'--------' |
| | '--|-|-|-' |
| | .------' | '-------. |
| | v v v |
| | .--------. .--------. .--------. |
| '->| dir A |->| dir B |->| dir C | |
| || | || | || | |
| || | || | || | |
| |'--------' |'--------' |'--------' |
| '----|---' '--------' '--------' |
| v |
| .--------. |
| | file D | |
| | | |
| | | |
| '--------' |
| |
| begin move, add reference in dir C, change gstate to have move |
| => |
| .--------. gstate = moving file D in dir A (m1) |
| .| root |-. |
| || | | |
| .-------------|| |-' |
| | |'--------' |
| | '--|-|-|-' |
| | .------' | '-------. |
| | v v v |
| | .--------. .--------. .--------. |
| '->| dir A |->| dir B |->| dir C | |
| || | || | || gdelta | |
| || | || | || =m1 | |
| |'--------' |'--------' |'--------' |
| '----|---' '--------' '----|---' |
| | .----------------' |
| v v |
| .--------. |
| | file D | |
| | | |
| | | |
| '--------' |
| |
| complete move, remove reference in dir A, change gstate to no move |
| => |
| .--------. gstate = no move (m1^~m1) |
| .| root |-. |
| || | | |
| .-------------|| |-' |
| | |'--------' |
| | '--|-|-|-' |
| | .------' | '-------. |
| | v v v |
| | .--------. .--------. .--------. |
| '->| dir A |->| dir B |->| dir C | |
| || gdelta | || | || gdelta | |
| || =~m1 | || | || =m1 | |
| |'--------' |'--------' |'--------' |
| '--------' '--------' '----|---' |
| v |
| .--------. |
| | file D | |
| | | |
| | | |
| '--------' |
| ``` |
| |
| |
| If, after building our global state during mount, we find information |
| describing an ongoing move, we know we lost power during a move and the file |
| is duplicated in both the source and destination directories. If this happens, |
| we can resolve the move using the information in the global state to remove |
| one of the files. |
| |
| ``` |
| .--------. gstate = moving file D in dir A (m1) |
| .| root |-. ^ |
| || |------------> xor |
| .---------------|| |-' ^ |
| | |'--------' | |
| | '--|-|-|-' | |
| | .--------' | '---------. | |
| | | | | | |
| | | .----------> xor --------> xor |
| | v | v ^ v ^ |
| | .--------. | .--------. | .--------. | |
| '->| dir A |-|->| dir B |-|->| dir C | | |
| || |-' || |-' || gdelta |-' |
| || | || | || =m1 | |
| |'--------' |'--------' |'--------' |
| '----|---' '--------' '----|---' |
| | .---------------------' |
| v v |
| .--------. |
| | file D | |
| | | |
| | | |
| '--------' |
| ``` |
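| 
| Putting the pieces together, here's a hedged sketch of the move algorithm and
| why recovery works. `commit2` is a hypothetical single atomic metadata-pair
| commit that applies a directory-entry change and xors a delta into that pair's
| gstate in the same commit:
| 
| ```
| #include <stdint.h>
| 
| typedef struct { uint32_t block[2]; } pair_t;  // hypothetical pair address
| typedef uint32_t gstate_t;                     // toy stand-in for gstate
| 
| enum { CREATE_ENTRY, REMOVE_ENTRY };
| 
| // one atomic commit: apply an entry op to pair and xor dg into the
| // pair's gstate delta, all-or-nothing
| extern int commit2(pair_t pair, int op, const char *name, gstate_t dg);
| extern gstate_t move_tag(pair_t src, const char *name); // encodes "moving"
| 
| int move(pair_t src, pair_t dst, const char *name) {
|     gstate_t m = move_tag(src, name);
| 
|     // commit 1: create the destination entry, and atomically with it,
|     // xor the move tag into gstate: "name is moving out of src"
|     int err = commit2(dst, CREATE_ENTRY, name, m);
|     if (err) {
|         return err;
|     }
| 
|     // commit 2: remove the source entry, and atomically with it, xor
|     // the same tag back out, returning gstate to "no move"
|     return commit2(src, REMOVE_ENTRY, name, m);
| }
| ```
| 
| If mount rebuilds a gstate whose move tag is still set, power was lost between
| the two commits and the file exists in both directories; recovery is simply
| redoing the second commit.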
| |
| We can also move directories the same way we move files. There is the threaded
| linked-list to consider, but leaving it unchanged works fine, as its order
| doesn't really matter.
| |
| ``` |
| .--------. gstate = no move (m1^~m1) |
| .| root |-. |
| || | | |
| .-------------|| |-' |
| | |'--------' |
| | '--|-|-|-' |
| | .------' | '-------. |
| | v v v |
| | .--------. .--------. .--------. |
| '->| dir A |->| dir B |->| dir C | |
| || gdelta | || | || gdelta | |
| || =~m1 | || | || =m1 | |
| |'--------' |'--------' |'--------' |
| '--------' '--------' '----|---' |
| v |
| .--------. |
| | file D | |
| | | |
| | | |
| '--------' |
| |
| begin move, add reference in dir C, change gstate to have move |
| => |
| .--------. gstate = moving dir B in root (m1^~m1^m2) |
| .| root |-. |
| || | | |
| .--------------|| |-' |
| | |'--------' |
| | '--|-|-|-' |
| | .-------' | '----------. |
| | v | v |
| | .--------. | .--------. |
| '->| dir A |-. | .->| dir C | |
| || gdelta | | | | || gdelta | |
| || =~m1 | | | | || =m1^m2 | |
| |'--------' | | | |'--------' |
| '--------' | | | '---|--|-' |
| | | .-------' | |
| | v v | v |
| | .--------. | .--------. |
| '->| dir B |-' | file D | |
| || | | | |
| || | | | |
| |'--------' '--------' |
| '--------' |
| |
| complete move, remove reference in root, change gstate to no move |
| => |
| .--------. gstate = no move (m1^~m1^m2^~m2) |
| .| root |-. |
| || gdelta | | |
| .-----------|| =~m2 |-' |
| | |'--------' |
| | '---|--|-' |
| | .-----' '-----. |
| | v v |
| | .--------. .--------. |
| '->| dir A |-. .->| dir C | |
| || gdelta | | | || gdelta | |
| || =~m1 | | '-|| =m1^m2 |-------. |
| |'--------' | |'--------' | |
| '--------' | '---|--|-' | |
| | .-' '-. | |
| | v v | |
| | .--------. .--------. | |
| '->| dir B |--| file D |-' |
| || | | | |
| || | | | |
| |'--------' '--------' |
| '--------' |
| ``` |
| |
| Global state gives us a powerful tool we can use to solve the move problem. |
| And the result is surprisingly performant, only needing the minimum number |
| of states and using the same number of commits as a naive move. Additionally, |
| global state gives us a bit of persistent state we can use for some other |
| small improvements. |
| |
| ## Conclusion |
| |
| And that's littlefs, thanks for reading! |
| |
| |
| [wikipedia-flash]: https://en.wikipedia.org/wiki/Flash_memory |
| [wikipedia-sna]: https://en.wikipedia.org/wiki/Serial_number_arithmetic |
| [wikipedia-crc]: https://en.wikipedia.org/wiki/Cyclic_redundancy_check |
| [wikipedia-cow]: https://en.wikipedia.org/wiki/Copy-on-write |
| [wikipedia-B-tree]: https://en.wikipedia.org/wiki/B-tree |
| [wikipedia-B+-tree]: https://en.wikipedia.org/wiki/B%2B_tree |
| [wikipedia-skip-list]: https://en.wikipedia.org/wiki/Skip_list |
| [wikipedia-ctz]: https://en.wikipedia.org/wiki/Count_trailing_zeros |
| [wikipedia-ecc]: https://en.wikipedia.org/wiki/Error_correction_code |
| [wikipedia-hamming-bound]: https://en.wikipedia.org/wiki/Hamming_bound |
| [wikipedia-dynamic-wear-leveling]: https://en.wikipedia.org/wiki/Wear_leveling#Dynamic_wear_leveling |
| [wikipedia-static-wear-leveling]: https://en.wikipedia.org/wiki/Wear_leveling#Static_wear_leveling |
| [wikipedia-ftl]: https://en.wikipedia.org/wiki/Flash_translation_layer |
| |
| [oeis]: https://oeis.org |
| [A001511]: https://oeis.org/A001511 |
| [A005187]: https://oeis.org/A005187 |
| |
| [fat]: https://en.wikipedia.org/wiki/Design_of_the_FAT_file_system |
| [ext2]: http://e2fsprogs.sourceforge.net/ext2intro.html |
| [jffs]: https://www.sourceware.org/jffs2/jffs2-html |
| [yaffs]: https://yaffs.net/documents/how-yaffs-works |
| [spiffs]: https://github.com/pellepl/spiffs/blob/master/docs/TECH_SPEC |
| [ext4]: https://ext4.wiki.kernel.org/index.php/Ext4_Design |
| [ntfs]: https://en.wikipedia.org/wiki/NTFS |
| [btrfs]: https://btrfs.wiki.kernel.org/index.php/Btrfs_design |
| [zfs]: https://en.wikipedia.org/wiki/ZFS |
| |
| [cow]: https://upload.wikimedia.org/wikipedia/commons/0/0c/Cow_female_black_white.jpg |
| [elephant]: https://upload.wikimedia.org/wikipedia/commons/3/37/African_Bush_Elephant.jpg |
| [ram]: https://upload.wikimedia.org/wikipedia/commons/9/97/New_Mexico_Bighorn_Sheep.JPG |
| |
| [metadata-formula1]: https://latex.codecogs.com/svg.latex?cost%20%3D%20n%20+%20n%20%5Cfrac%7Bs%7D%7Bd+1%7D |
| [metadata-formula2]: https://latex.codecogs.com/svg.latex?s%20%3D%20r%20%5Cfrac%7Bsize%7D%7Bn%7D |
| [metadata-formula3]: https://latex.codecogs.com/svg.latex?d%20%3D%20%281-r%29%20%5Cfrac%7Bsize%7D%7Bn%7D |
| [metadata-formula4]: https://latex.codecogs.com/svg.latex?cost%20%3D%20n%20+%20n%20%5Cfrac%7Br%5Cfrac%7Bsize%7D%7Bn%7D%7D%7B%281-r%29%5Cfrac%7Bsize%7D%7Bn%7D+1%7D |
| |
| [ctz-formula1]: https://latex.codecogs.com/svg.latex?%5Clim_%7Bn%5Cto%5Cinfty%7D%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi%3D0%7D%5E%7Bn%7D%5Cleft%28%5Ctext%7Bctz%7D%28i%29+1%5Cright%29%20%3D%20%5Csum_%7Bi%3D0%7D%5Cfrac%7B1%7D%7B2%5Ei%7D%20%3D%202 |
| [ctz-formula2]: https://latex.codecogs.com/svg.latex?B%20%3D%20%5Cfrac%7Bw%7D%7B8%7D%5Cleft%5Clceil%5Clog_2%5Cleft%28%5Cfrac%7B2%5Ew%7D%7BB-2%5Cfrac%7Bw%7D%7B8%7D%7D%5Cright%29%5Cright%5Crceil |
| [ctz-formula3]: https://latex.codecogs.com/svg.latex?N%20%3D%20%5Csum_i%5En%5Cleft%5BB-%5Cfrac%7Bw%7D%7B8%7D%5Cleft%28%5Ctext%7Bctz%7D%28i%29+1%5Cright%29%5Cright%5D |
| [ctz-formula4]: https://latex.codecogs.com/svg.latex?%5Csum_i%5En%5Cleft%28%5Ctext%7Bctz%7D%28i%29+1%5Cright%29%20%3D%202n-%5Ctext%7Bpopcount%7D%28n%29 |
| [ctz-formula5]: https://latex.codecogs.com/svg.latex?N%20%3D%20Bn%20-%20%5Cfrac%7Bw%7D%7B8%7D%5Cleft%282n-%5Ctext%7Bpopcount%7D%28n%29%5Cright%29 |
| [ctz-formula6]: https://latex.codecogs.com/svg.latex?n%20%3D%20%5Cleft%5Clfloor%5Cfrac%7BN-%5Cfrac%7Bw%7D%7B8%7D%5Cleft%28%5Ctext%7Bpopcount%7D%5Cleft%28%5Cfrac%7BN%7D%7BB-2%5Cfrac%7Bw%7D%7B8%7D%7D-1%5Cright%29+2%5Cright%29%7D%7BB-2%5Cfrac%7Bw%7D%7B8%7D%7D%5Cright%5Crfloor |
| [ctz-formula7]: https://latex.codecogs.com/svg.latex?%5Cmathit%7Boff%7D%20%3D%20N%20-%20%5Cleft%28B-2%5Cfrac%7Bw%7D%7B8%7D%5Cright%29n%20-%20%5Cfrac%7Bw%7D%7B8%7D%5Ctext%7Bpopcount%7D%28n%29 |
| |
| [bigB]: https://latex.codecogs.com/svg.latex?B |
| [d]: https://latex.codecogs.com/svg.latex?d |
| [m]: https://latex.codecogs.com/svg.latex?m |
| [bigN]: https://latex.codecogs.com/svg.latex?N |
| [n]: https://latex.codecogs.com/svg.latex?n |
| [n']: https://latex.codecogs.com/svg.latex?n%27 |
| [r]: https://latex.codecogs.com/svg.latex?r |
| [s]: https://latex.codecogs.com/svg.latex?s |
| [w]: https://latex.codecogs.com/svg.latex?w |
| [x]: https://latex.codecogs.com/svg.latex?x |
| |
| [metadata-cost-graph]: https://raw.githubusercontent.com/geky/littlefs/gh-images/metadata-cost.svg?sanitize=true |
| [wear-distribution-graph]: https://raw.githubusercontent.com/geky/littlefs/gh-images/wear-distribution.svg?sanitize=true |
| [file-cost-graph]: https://raw.githubusercontent.com/geky/littlefs/gh-images/file-cost.svg?sanitize=true |