US20120179874A1 - Scalable cloud storage architecture - Google Patents
- Publication number
- US20120179874A1 (application US 12/986,466)
- Authority
- US
- United States
- Prior art keywords
- data
- storage
- block
- virtual
- local persistent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1446—Point-in-time backing up or restoration of persistent data
- G06F11/1448—Management of the data involved in backup or backup restore
- G06F11/1453—Management of the data involved in backup or backup restore using de-duplication of the data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0808—Multiuser, multiprocessor or multiprocessing cache systems with cache invalidating means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0813—Multiuser, multiprocessor or multiprocessing cache systems with a network or matrix configuration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45579—I/O management, e.g. providing access to device drivers or storage
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45583—Memory management, e.g. access or allocation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/15—Use in a specific computing environment
- G06F2212/152—Virtualized environment, e.g. logically partitioned system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/15—Use in a specific computing environment
- G06F2212/154—Networked environment
Definitions
- the present application generally relates to computer systems and computer storage, and more particularly to virtual storage and storage architecture.
- VM Virtual Machine
- a VM host may be required to provide virtual disks for a large number of VMs. It is difficult to ascertain the largest possible storage demands and physically provision them all in the host machine.
- if the storage spaces for virtual disks are provided through remote storage servers, aggregate network traffic due to storage accesses from VMs can easily deplete the network bandwidth and cause congestion.
- a storage system and method for handling data for virtual machines, for instance, for scalable cloud storage architecture may be provided.
- the system may include a virtual storage module operable to run in a virtual machine monitor.
- the virtual storage module may include a wait-queue operable to store incoming block-level data requests from one or more virtual machines, and in-memory metadata for storing information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines.
- the data stored in local persistent storage may be replication of a subset of data in one or more virtual disks provided to the virtual machines, the virtual disks being mapped to remote storage accessible via a network connecting the virtual machines and the remote storage.
- a cache handling logic may be operable to handle the block-level data requests by obtaining the information in the in-memory metadata and making I/O requests to the local persistent storage or the remote storage or combination of the local persistent storage and the remote storage to service the block-level data requests.
- a method for handling data storage for virtual machines may include intercepting one or more incoming block-level data requests received by a virtual machine monitor from one or more virtual machines.
- the method may also include obtaining from in-memory metadata, information associated with data of the block-level data request.
- the in-memory metadata may store information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines.
- the data stored in local persistent storage may be replication of a subset of data in one or more virtual disks provided to the virtual machines.
- the virtual disks may be mapped to remote storage accessible via a network connecting the virtual machines and the remote storage.
- the method may further include making I/O requests to the local persistent storage or the remote storage or combination of the local persistent storage and the remote storage to service the block-level data requests.
- a computer readable storage medium storing a program of instructions executable by a machine to perform one or more methods described herein also may be provided.
- FIG. 1 shows the architecture of a scalable Cloud storage system in one embodiment of the present disclosure.
- FIG. 2 shows the architecture of vStore in one embodiment of the present disclosure.
- FIG. 3 illustrates structure of one cache entry in one embodiment of the present disclosure.
- FIG. 4A is a flow diagram illustrating read request handling in one embodiment of the present disclosure.
- FIG. 4B is a flow diagram illustrating write request handling in one embodiment of the present disclosure.
- FIG. 5 shows, as an example, the Xen implementation of vStore in one embodiment of the present disclosure.
- the present disclosure in one embodiment presents a system (referred to in this disclosure as vStore), which utilizes the host's (e.g., computer server hosting virtual machines) local disk space as a block-level cache for the remote storage (e.g., network attached storage), for example, in order to absorb network traffic from storage accesses.
- VMM Virtual Machine Monitor, a.k.a. hypervisor
- I/O disk input/output
- Caching virtual disks at block-level poses special challenges in achieving high performance while maintaining virtual disk semantics.
- cache handling operations in one embodiment of the present disclosure may ensure consistency between on-disk metadata and data to avoid committing incorrect data to the network attached storage (NAS) during recovery from a crash, while minimizing overheads in updating on-disk metadata.
- NAS network attached storage
- the present disclosure in one embodiment may utilize a cache placement policy that maintains a high degree of data sequentiality in the cache as in the original (i.e., remote) virtual disk.
- the destaging operation that sends dirty pages back to the remote storage server may be self-adaptive and minimize the impact on the foreground traffic.
- a scalable architecture is presented that provides reliable virtual disks (i.e., block devices as opposed to object stores) for virtual machines (VM) in a cloud environment.
- VM virtual machines
- FIG. 1 shows the architecture of a scalable Cloud storage system in one embodiment of the present disclosure.
- the architecture may include one or more VM-hosting machines (e.g., 102 , 104 , 106 ).
- a VM-hosting machine is a physical machine that hosts a large number of VMs and has limited local storage space.
- vStore 108 uses local storage 110 as a block-level cache and provides to VMs 112 the illusion of unlimited storage space.
- vStore 108 may be implemented in hypervisor 114 and provides persistent cache.
- vStore 108 performs caching at the block device level rather than the file system level.
- the hypervisor 114 executes on one or more computer processors and provides a virtual block device to VMs 112, which implies that VMs 112 see raw block devices and are free to install any file system on top of them. Thus, hypervisor 114 receives block-level requests and redirects them to the remote storage (e.g., 116, 118).
- a single cache space is provided per machine (e.g., 102).
- the cache tries to replicate the block layout of remote storage (e.g., 116 , 118 ) in the local cache space (local disk) 110 .
- Storage server clusters (e.g., 116 , 118 ) provide network attached storage to physical machines (e.g., 102 , 104 , 106 ). They (e.g., 116 , 118 ) can be either dedicated high-performance storage servers or a cluster of servers using commodity storage devices.
- the interface to the hypervisors 114 can be either block-level or file-level. If it is block-level, an iSCSI-type protocol can be used between storage servers and clients (i.e., hypervisors). If it is file-level, the hypervisor mounts a remote directory structure and keeps the virtual disks as individual files. Regardless of the protocol between hypervisors and storage servers, the interface between VMs and the hypervisor remains at the block level.
- the directory server 120 holds the location information about the storage server clusters.
- when a hypervisor 114 wants to attach a virtual disk to a VM, it consults the directory server 120 to determine the address of a specific storage server (e.g., 116, 118) that currently stores the virtual disk.
- the architecture also includes networking infrastructure. Network bandwidth within a rack is usually well-provisioned, but the cross-rack network is typically under-provisioned by a factor of 5-10 relative to the within-rack network. As a result, uncontrolled storage accesses from VMs can easily deplete the network bandwidth and cause congestion.
- An example configuration may have rack-mounted servers for hosting virtual machines and remote storage servers to provide storage services to the VMs.
- a rack may contain more than 20 servers, each with a virtual machine monitor such as the Xen-3.1.4 hypervisor installed.
- servers may have processors such as two Intel® Xeon™ 3.40 GHz CPUs and memory of, e.g., 2 gigabytes (GB). They can communicate through a 1 Gbps link within the rack.
- local storage for each server may be about 1 terabyte, and the servers have a network file system (NFS)-mounted shared storage space that is used to hold VM images for all virtual machines.
- Remote storage servers may have physical hard disks attached, e.g., through Serial Advanced Technology Attachment (SATA) interface.
- SATA Serial Advanced Technology Attachment
- VMs may use different amounts of storage space, depending on how much the user pays. If every host's local storage space is over-provisioned for the largest possible demand, the cost would be prohibitive.
- Another solution is to only use network attached storage. That is, a VM's root file system, swap area, and additional data disks are all stored on network attached storage. This solution, however, would incur a large amount of network traffic and disk I/O load on the storage servers.
- Sequential disk access can achieve a data rate of 100 MB/s. Even with pure random access, it can reach 10 MB/s. Since 1 Gbps network can sustain roughly about 13 MB/s, four uplinks to the rack-level switch are not enough to handle even one single sequential access. Note that uplinks to the rack-level network switches are limited in numbers and cannot be easily increased in commodity systems. Even for random disk access, it can only support about five VMs' disk I/O traffic. Even with 10 Gbps networks, it still can hardly support thousands of VMs running in one rack (e.g., typical numbers are 42 hosts per rack, and 32 VMs per host, i.e., 1,344 VMs per rack).
- vStore 108 takes a hybrid approach that leverages both local storage 110 and network attached storage 116 , 118 . It still relies on network attached storage 116 , 118 to provide sufficient storage space for VMs 112 , but utilizes the local storage 110 of a host 102 to cache data and avoid accessing network attached storage 116 , 118 as much as possible.
- Data integrity and performance are two main challenges in the design of vStore. After a disk write operation finishes from the VM's perspective, the data should survive even if the host immediately encounters a power failure. In vStore, system failures can compromise data integrity in several ways.
- vStore of the present disclosure in one embodiment may be designed to support data integrity.
- the second challenge is to achieve high performance, which conflicts with ensuring data integrity; vStore hence may be designed to minimize the performance penalties.
- the performance of vStore may be affected by several factors: (i) data placement within the cache, (ii) vStore metadata placement on disk, (iii) complications introduced by the vStore logic. For (i), if sequential blocks in a virtual disk are placed far apart in the cache, a sequential read of these blocks incurs a high overhead due to a long disk seek time. Therefore, in one embodiment, vStore keeps a virtual disk as sequential as possible in the limited cache space. For (ii), ideally, on-disk metadata should be small and should not require an additional disk seek to access data and metadata separately. For (iii), one potential overhead is the dependency among outstanding requests. For example, if one request is about to evict one cache entry, then all the requests on that entry must wait. All of these factors may be considered in the design of vStore.
- FIG. 2 shows the architecture of vStore in one embodiment of the present disclosure.
- the description herein is based on para-virtualized Xen as an example.
- VMs 202 generate block requests in the form of (sector address, sector count). Requests arrive at the front-end device driver within the VM 202 after passing through the guest kernel. Then they are forwarded to the back-end driver in Domain-0. The back-end driver issues actual I/O requests to the device, and sends responses to the guest VM 202 along the reverse path.
- the vStore module 204 runs in Domain-0, and extends the function of the back-end device driver. vStore 204 intercepts requests and filters them through its cache handling logic.
- vStore 204 internally may include a wait queue 206 for incoming requests, a cache handling logic 208 , and in-memory metadata 210 . Incoming requests are first put into vStore's wait queue 206 .
- the wait queue 206 is used in one embodiment because the cache entry that this request needs to use might be under eviction or update triggered by previous requests. After clearing such conflicts, the request is handled by the cache handling logic 208 .
- the in-memory metadata 210 are consulted to obtain information such as block address, dirty bit, and modification time. Depending on the current cache state, actual I/O requests are made to either the cache on local storage 212 or the network attached storage 214 .
- I/O Unit: Guest VMs usually operate on 4 KB blocks, but vStore can perform I/Os to and from the network attached storage at a configurable larger unit.
- a large I/O unit reduces the size of in-memory metadata, as it reduces the number of cache entries to manage.
- a large I/O unit works well with high-end storage servers, which are optimized for large I/O sizes (e.g., 256 KB or even 1 MB).
- reading a large unit is as efficient as reading 4 KB. This may increase the incoming network traffic, but our evaluation shows that the subsequent savings outweigh the initial cost.
- the term "block group" refers to the I/O unit used by vStore, as opposed to the (typically 4 KB) block used by the guest VMs. That is, one block group contains one or more 4 KB blocks.
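- As a rough illustration of why the block-group unit matters for metadata size, the following sketch (illustrative values only; the 1 TB cache is an assumption, and 256 KB is the group size mentioned in this disclosure) counts metadata entries per group size:

```python
# Illustrative only: estimate how the block-group size chosen as vStore's I/O
# unit affects the number of in-memory metadata entries. The 1 TB cache size
# is an assumed example; 256 KB is the group size mentioned in this disclosure.

def metadata_entries(cache_bytes: int, block_group_bytes: int) -> int:
    """One metadata entry is kept per block group in the cache."""
    return cache_bytes // block_group_bytes

CACHE_SIZE = 1 * 1024**4  # assume a 1 TB (TiB) local cache
for group_size in (4 * 1024, 64 * 1024, 256 * 1024):
    n = metadata_entries(CACHE_SIZE, group_size)
    print(f"block group {group_size // 1024:>3} KB -> {n:,} entries")
# A 256 KB group needs 64x fewer entries than tracking raw 4 KB blocks,
# which is why a large I/O unit keeps the in-memory metadata small.
```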
- Metadata holds information about cache entries on disk. Metadata are stored on disk for data integrity and cached in memory for performance. Metadata updates are done in a write-through manner. After a host crashes and recovers, vStore visits each metadata entry on disk and recovers any dirty data that have not been flushed to network attached storage. Table 1 summarizes examples of the metadata fields in one embodiment of the present disclosure.
- Virtual Disk Identifier identifies a virtual disk stored on network attached storage. When a virtual disk is detached and reconnected later, cached contents that belong to this disk are identified and reused.
- Bit Vector has one bit for each 4 KB block in a block group so that the states of 4 KB blocks in the same block group can be changed and tracked individually. Without Bit Vector, the states of 4 KB blocks in the same block group must always be changed together. As a result, when the VM writes to a 4 KB block, vStore must read the entire block group (including all 4 KB blocks in that block group) from network attached storage, merge it with the 4 KB of new data, and write the entire block group to the cache. With Bit Vector, vStore can write the 4 KB data directly without fetching the entire block group, and then only change the affected 4 KB block's state in Bit Vector. Our experiments show that Bit Vector helps reduce network traffic when using a large cache unit size.
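- A minimal sketch of the Bit Vector idea (illustrative only; field names, sizes, and bit polarity are assumptions, not the patent's layout): a 4 KB write flips only its own block's state instead of forcing a read-merge-write of the whole block group:

```python
# Illustrative sketch, not the patent's layout: per-block-group metadata with
# one bit per 4 KB block. Field names, sizes, and bit polarity are assumptions.
BLOCK_SIZE = 4 * 1024
GROUP_SIZE = 256 * 1024
BLOCKS_PER_GROUP = GROUP_SIZE // BLOCK_SIZE  # 64 blocks per group

class BlockGroupMeta:
    def __init__(self):
        # 1 = this 4 KB block is valid in the local cache, 0 = only on remote
        self.bit_vector = [0] * BLOCKS_PER_GROUP
        self.dirty = [False] * BLOCKS_PER_GROUP

    def write_4k(self, block_index: int) -> None:
        """A 4 KB VM write updates only its own block's state; the rest of the
        block group does not have to be fetched from network attached storage."""
        self.bit_vector[block_index] = 1
        self.dirty[block_index] = True

meta = BlockGroupMeta()
meta.write_4k(3)                 # write to the fourth 4 KB block of the group
print(meta.bit_vector[:8])       # -> [0, 0, 0, 1, 0, 0, 0, 0]
```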
- Maintaining metadata on disk may compromise performance.
- a naive implementation may require two disk accesses to handle one write request issued by a VM—one for metadata update and one for writing actual data.
- vStore solves this problem by putting metadata and data together and updating them in a single write. The details are described below.
- In-memory Metadata: To avoid disk I/Os for reading the on-disk metadata, vStore in one embodiment maintains a complete copy of the metadata in memory and updates them in a write-through manner.
- One embodiment of the present disclosure uses a large block group size (e.g., 256 KB) to reduce the size of the in-memory metadata.
- vStore in one embodiment of the present disclosure organizes local storage as a set-associative cache with write-back policy by default.
- the cache is a table-like structure, where a cache set is a column in the table, and a cache row is a row in the table.
- a cache row includes multiple block groups.
- a block group has contents coming from one virtual disk, but different block groups in the same cache row may have contents coming from different virtual disks.
- Block groups in the same cache row are laid out in logically contiguous disk blocks in one embodiment of the present disclosure.
- FIG. 3 illustrates structure of one cache entry in one embodiment of the present disclosure.
- a block group includes n 4-kilobyte (KB) blocks, and each 4 KB block has a trailer.
- each 4 KB block 302 in a block group 304 has a 512-byte trailer 306 shown in FIG. 3 .
- This trailer 306 in one embodiment includes metadata 308 and the hash value 310 of the 4 KB data block 302 .
- vStore computes the hash of the 4 KB block 302, and writes the 4 KB block 302 and its 512-byte trailer 306 in a single write operation. If the host crashes during the write operation, after recovery, the hash value helps detect whether the 4 KB block and the trailer are inconsistent.
- When handling a read request, vStore also reads the 512-byte trailer 306 together with the 4 KB block 302. As a result, a sequential read of two adjacent blocks issued by the VM is also sequential in the cache. If only the 4 KB data block were read without the trailer, the sequential request would be broken into two sub-requests, spaced apart by 512 bytes.
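- A hedged sketch of the trailer mechanism: the 4 KB block and its 512-byte trailer carrying metadata plus the block's hash are packed and written in one operation, so a torn write can be detected after a crash. The field layout and the hash function below are assumptions, not the patent's exact format:

```python
# Illustrative sketch of a 4 KB block + 512-byte trailer unit. The trailer
# layout (virtual disk id, virtual block address, SHA-1 hash, zero padding)
# is assumed for the example, not taken from the patent.
import hashlib
import struct

BLOCK_SIZE = 4 * 1024
TRAILER_SIZE = 512

def build_unit(block: bytes, vdisk_id: int, vblock_addr: int) -> bytes:
    """Pack a 4 KB data block with its 512-byte trailer; block and trailer
    are then written to the cache in ONE disk write."""
    assert len(block) == BLOCK_SIZE
    digest = hashlib.sha1(block).digest()               # 20-byte hash (assumed choice)
    header = struct.pack("<QQ", vdisk_id, vblock_addr)  # 16 bytes of metadata
    trailer = (header + digest).ljust(TRAILER_SIZE, b"\x00")
    return block + trailer

def check_unit(unit: bytes) -> bool:
    """After a crash, recompute the hash; a mismatch means the block and its
    trailer were not written atomically, so the entry must not be flushed to NAS."""
    block, trailer = unit[:BLOCK_SIZE], unit[BLOCK_SIZE:]
    stored = trailer[16:36]
    return hashlib.sha1(block).digest() == stored

unit = build_unit(b"\x00" * BLOCK_SIZE, vdisk_id=7, vblock_addr=1280)
print(check_unit(unit))   # True for a consistent block/trailer pair
```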
- simple policies like least recently used (LRU) and least frequently used (LFU) may not be suitable for vStore, because they are designed primarily for memory-based cache without consideration of block sequentiality on disk. If two consecutive blocks in a virtual disk are placed at two random locations in vStore's cache, sequential I/O requests issued by the VM become random accesses on the physical disk. In one embodiment, vStore's cache replacement algorithm strives to preserve the sequentiality of a virtual disk's blocks.
- the base cache row is the default cache row on which the first row of blocks of a virtual disk is placed. Subsequent blocks of the virtual disk are mapped to subsequent cache rows. For example, if there are two virtual disks Disk 1 and Disk 2 currently attached to the vStore and the cache associativity is 5 (i.e., there are 5 cache rows), then Disk 1 might be assigned 1 as its base cache row and Disk 2 might be assigned 3 to keep them reasonably far apart. If we assume one cache row is made of ten 128 KB cache groups, Disk 2's block at address 1280K will be mapped to row 4, which is the next row after Disk 2's base cache row.
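- The row mapping in this example can be reproduced with a small sketch (the ten 128 KB groups per row, five rows, and base rows 1 and 3 come from the example above; the modulo wrap-around is an assumption):

```python
# Illustrative sketch of the base-cache-row placement described above. The
# modulo wrap-around when a disk's rows run past the last cache row is an
# assumption; the text only states that subsequent blocks map to subsequent rows.
GROUP_SIZE = 128 * 1024                  # 128 KB block groups (from the example)
GROUPS_PER_ROW = 10                      # ten groups per cache row
ROW_BYTES = GROUP_SIZE * GROUPS_PER_ROW  # 1280 KB per cache row
NUM_ROWS = 5                             # cache associativity of 5

def cache_row(base_row: int, block_addr: int) -> int:
    """Map a virtual-disk byte address to a cache row, starting at the disk's base row."""
    return (base_row + block_addr // ROW_BYTES) % NUM_ROWS

print(cache_row(base_row=3, block_addr=1280 * 1024))  # Disk 2 @ 1280K -> row 4
print(cache_row(base_row=1, block_addr=0))            # Disk 1 @ 0     -> row 1
```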
- Upon arrival of a new data block, vStore in one embodiment determines the cache location in two steps. First, it looks at the state of the cache entry whose location is calculated using the base cache row and the block's address. If that entry is invalid or not dirty, the block is immediately assigned to it. If the entry is dirty, a victim entry is selected based on scores. Six criteria may be used to calculate the score in one embodiment.
- a score may be computed as a weighted sum of these criteria, as in equation (1).
- the coefficient a_i represents the weight of criterion i. If all a_i are 0 except a_5, the eviction policy becomes equivalent to LRU. Weight coefficients are adjustable according to preference. In one embodiment, this score is computed for every cache entry within the cache set, and the entry with the lowest score is chosen for eviction.
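- A hedged sketch of the weighted score of equation (1). The extract does not list the six criteria, so the criterion functions below are placeholders; the only property taken from the text is that a nonzero a_5 alone reduces the policy to LRU, with the lowest-scoring entry evicted:

```python
# Hedged sketch of the weighted eviction score. Criterion functions c_1..c_6
# are placeholders (the six criteria are not listed in this extract); c_5 is
# modeled as recency so that weights (0, 0, 0, 0, 1, 0) behave like LRU.
from dataclasses import dataclass

@dataclass
class CacheEntry:
    last_access: float   # timestamp of last access (other attributes omitted)

def score(entry: CacheEntry, weights, criteria) -> float:
    """score = sum_i a_i * c_i(entry); the entry with the LOWEST score is evicted."""
    return sum(a * c(entry) for a, c in zip(weights, criteria))

criteria = [lambda e: 0.0] * 6
criteria[4] = lambda e: e.last_access    # c_5: more recent => higher score

weights = [0, 0, 0, 0, 1, 0]             # only a_5 nonzero -> pure LRU
entries = [CacheEntry(last_access=t) for t in (10.0, 3.0, 7.0)]
victim = min(entries, key=lambda e: score(e, weights, criteria))
print(victim.last_access)                # 3.0: the least recently used entry
```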
- vStore design considers both performance and data integrity in its cache handling operations. Since vStore uses disk as a cache space, cache handling incurs more disk accesses than when no cache is used. Excessive disk accesses may degrade the overall performance and reduce the merit of using vStore. In one embodiment of the present disclosure, disk accesses are minimized to keep the performance loss tolerable.
- vStore may address data integrity in one embodiment as follows: a 512-byte trailer is added to each 4 KB block to record its hash. In order to minimize disk I/O, in one embodiment of the present disclosure the trailer is read and written together with the block.
- FIG. 4A is a flow diagram illustrating read request handling in one embodiment of the present disclosure.
- FIG. 4B is a flow diagram illustrating write request handling in one embodiment of the present disclosure.
- FIG. 4A illustrates a flow diagram for read cache handling in one embodiment of the present disclosure.
- a read request is received.
- the read request may originate from an application in a VM, for example to read data X.
- Using a virtual disk involves multiple steps: open the virtual disk, perform reads/writes, and finally close the virtual disk.
- vStore assigns a "Virtual Disk ID" to the virtual disk and maps it to a remote disk on a storage server (the virtual disk ID was described previously). This mapping relationship is kept in a mapping table, and stored both in memory and on disk in one embodiment.
- in a read or write request, the Virtual Disk ID is specified implicitly (because the request comes from a previously opened handle) and the sector address is specified explicitly.
- Combining the virtual disk ID and the sector address as one search key to look up the in-memory metadata determines whether the data is cached and, if so, which block group currently caches the data. The following shows an example data structure of the combined search key.
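- The referenced key structure is not reproduced in this extract, so the following is a hypothetical stand-in: the in-memory metadata keyed by the pair (virtual disk ID, sector address aligned down to its block-group boundary):

```python
# Hypothetical stand-in for the combined search key; names, sizes, and the
# dict-based metadata store are assumptions for illustration only.
from typing import Dict, NamedTuple, Optional

SECTOR = 512
GROUP_SIZE = 256 * 1024
SECTORS_PER_GROUP = GROUP_SIZE // SECTOR  # 512 sectors per 256 KB group

class CacheKey(NamedTuple):
    vdisk_id: int    # assigned by vStore when the virtual disk is opened
    group_addr: int  # sector address rounded down to the block-group start

def make_key(vdisk_id: int, sector_addr: int) -> CacheKey:
    return CacheKey(vdisk_id, sector_addr - sector_addr % SECTORS_PER_GROUP)

# in-memory metadata: key -> per-block-group entry (cache location, dirty bits, ...)
metadata: Dict[CacheKey, dict] = {}

def lookup(vdisk_id: int, sector_addr: int) -> Optional[dict]:
    """Returns the cached block group's metadata, or None on a cache miss."""
    return metadata.get(make_key(vdisk_id, sector_addr))

metadata[make_key(7, 0)] = {"cache_row": 1, "dirty_bits": 0}
print(lookup(7, 100) is not None)  # True: sector 100 falls in the first group
```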
- at 406, it is determined whether the 4 KB block corresponding to the requested read data, e.g., data X, is cached. If so, at 408, the local disk is read to retrieve the data. At 410, the data is returned to the requestor. If at 406 it is determined that parts of the requested read data are cached while other parts are not (e.g., 1 KB in the cache and 3 KB on the remote storage server), the cached block group is read from the local disk at 412. At 414, data corresponding to the requested read data is read from the remote disk and returned at 416. At 418, the locally read data and the remotely read data are merged. The merged data is written to cache for later reuse on a cache hit.
- the cache replacement algorithm chooses a location in the cache to hold the requested read data.
- the requested read data is read from the remote storage device at 422 . The data is returned at 424 and written to cache at 426 .
- the Bit Vector is examined to determine whether the old data in the cache entry is partially valid, i.e., part of the data is stored in the cache while the other part is stored on the remote storage server. Partial validity may be determined, for example, by reading the bit vector values for each of the 4 KB blocks in the block group. For instance, if a bit in the bit vector is 0, that part of the data is in the local cache; if it is 1, that part of the data is on remote storage. If it is determined that the existing data in the cache entry is partially valid, the corresponding data from the remote storage device is read at 430.
- the cache entry data is written to remote storage at 434. If the cache entry has partially valid data, the remotely read data (at 430) is merged with the locally read data (at 432) before the data is written to the remote storage at 434.
- the requested read data is read from the remote storage. The read data is returned at 438 to the requestor (e.g., the application that requested it).
- the requested read data retrieved from the remote storage is written to cache.
- the merge at 442 implies a wait for operations on both incoming links (434, 438) to complete before performing the operation on the outgoing link (440). This is used, for example, to guarantee data integrity or to wait for data from both the local disk and the remote storage.
- a difference of read handling in FIG. 4A from write handling shown in FIG. 4B is that vStore can return the data as soon as it is available and continue the rest of the cache operations in background. This is reflected in the miss handling operations (e.g., 420 to 440 ).
- as soon as the remote read (e.g., 422, 436) completes, the data can be returned; the on-disk metadata update and cache data write may be performed afterwards (e.g., 426, 440).
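- A condensed, runnable sketch of the read flow of FIG. 4A (in-memory stand-ins for the on-disk cache and the remote virtual disk; partial hits, the bit vector, and background completion are simplified away, and all names are illustrative):

```python
# Condensed sketch of the read flow of FIG. 4A. A few fixed cache slots stand
# in for the on-disk cache; this is not the patent's code.
remote = {addr: f"remote-{addr}".encode() for addr in range(16)}  # virtual disk
slots = [None] * 4        # each slot: dict(addr=..., data=..., dirty=...)

def slot_for(addr):
    return addr % len(slots)     # placeholder for the base-cache-row placement

def handle_read(addr):
    entry = slots[slot_for(addr)]
    if entry and entry["addr"] == addr:            # cache hit: local disk only
        return entry["data"]
    if entry and entry["dirty"]:                   # dirty victim: flush to NAS first
        remote[entry["addr"]] = entry["data"]
    data = remote[addr]                            # remote read; the data can be
    slots[slot_for(addr)] = dict(addr=addr, data=data, dirty=False)
    return data                                    # returned while the cache update
                                                   # would complete in the background

print(handle_read(5))   # miss -> fetched from remote, then cached
print(handle_read(5))   # hit  -> served from the local cache
```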
- FIG. 4B is a flow diagram illustrating write request handling in one embodiment of the present disclosure.
- a write request (or command) is received to write data (e.g., data X).
- if the data is cached, the data is written to the local storage, i.e., the cache.
- the process returns, for instance, acknowledging successful write to the requestor.
- if the block group is not cached, it is determined whether the block group is dirty, i.e., whether the data content of the block group has been modified. Whether the content has been modified may be determined by reading the metadata associated with the block group and the values of the dirty bits of the 4 KB blocks contained therein.
- the requested write data is written to cache.
- the process returns, for instance, acknowledging successful write to the requestor.
- if the content of the block group has been modified, that data should be written out to the remote storage before the write data can overwrite the existing content of the block group.
- if the content of the block group is dirty (modified), the remotely stored data corresponding to that content is read. This data may be merged with the current content of the block group in the local storage in order to make the local block group content wholly valid.
- the block group's content is read at 468 .
- the content of the block group is written to the remote storage.
- the requested write data is written to cache at the location of the block group.
- the process returns, for instance, acknowledging successful write to the requestor.
- vStore in one embodiment directly writes the data to the cache without accessing the network attached storage. This simplifies the operations for a cache hit and for a cache miss without flush. However, write handling for a cache miss with flush may make several I/O requests. In FIG. 4B, the write handling returns at the end of the entire operation sequence. In the worst case, write handling incurs at most four disk I/Os, which may occur in the case of a cache miss with flush.
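- A companion sketch of the write flow of FIG. 4B under the same simplifications as the read sketch; it shows that new write data goes only to the cache, with a remote write happening only to flush a dirty victim (the "cache miss with flush" worst case):

```python
# Runnable companion sketch for the write flow of FIG. 4B, using the same
# slot-based stand-in for the cache; names and structure are illustrative only.
remote = {addr: f"remote-{addr}".encode() for addr in range(16)}
slots = [None] * 4

def slot_for(addr):
    return addr % len(slots)     # placeholder placement policy

def handle_write(addr, data):
    i = slot_for(addr)
    entry = slots[i]
    if entry and entry["addr"] == addr:            # hit: overwrite locally
        entry["data"], entry["dirty"] = data, True
        return "ack"
    if entry and entry["dirty"]:                   # miss with flush:
        remote[entry["addr"]] = entry["data"]      # write the old group to NAS first
    slots[i] = dict(addr=addr, data=data, dirty=True)   # new data goes to cache only
    return "ack"

print(handle_write(1, b"new"))     # miss without flush
print(handle_write(1, b"newer"))   # hit: local write, marked dirty
```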
- Destaging refers to the process of flushing dirty (modified) data in the cache to the network attached storage.
- the destaging functionality in one embodiment of the present disclosure may be used to keep the proportion of dirty blocks under a specified level. A large number of dirty blocks is potentially harmful to performance because evicting a dirty cache entry delays the cache handling operations significantly due to flushing operations.
- detachment of a virtual disk can also be faster when there are fewer dirty blocks. If a VM wants to terminate or migrate, it has to detach the virtual disk. As part of the detachment process, all the dirty blocks belonging to the detaching storage have to be flushed. Without destaging, the amount of data that has to be transferred can be as large as several gigabytes. Transferring that amount of data takes time and also generates bursty traffic.
- destaging may be triggered when the number of dirty blocks in the cache exceeds a user-specified level, which we call the pollution level. For example, if the pollution level is set to 65%, it means that the user wants to keep the ratio of dirty blocks to total blocks below 65%.
- vStore in one embodiment may determine how many blocks to destage at a given time t.
- the basic idea in one embodiment is to maintain a window size w_t, which indicates the total allowed data transmission size in units of bytes per millisecond (Bpms).
- this window size is the combined data transmission size for both normal remote storage accesses and destaging. It is specified as a rate (Bpms) since the destaging action can be fired at irregular intervals. If w_t increases, it is more likely that normal network attached storage accesses leave more bandwidth available for destaging.
- the control technique for w_t in vStore may adopt the technique used for flow control in FAST TCP and for queue-length adjustment.
- w_t may be adjusted using the network attached storage latency.
- let R be the desired network attached storage latency.
- α is a smoothing factor.
- γ is another smoothing factor for w_t. If the observed remote latency is smaller than R, then w_t will increase, and vice versa. In vStore, we also may consider the local latency, for which a corresponding window v_t is maintained as follows.
- v_t = (1 − γ)·v_{t−1} + γ·(L/L_t)·v_{t−1}, where L is the desired local latency and L_t is the observed local latency.
- Δt is the time length between t and t−1 in milliseconds.
- let B be the block group size and C_t the pending I/O requests at time t, in bytes.
- C_t represents the remote accesses from normal file system operations. Destaging may happen only if d_t>0.
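- A hedged sketch of the destaging window control. The v_t update follows the formula above; the analogous w_t update and the exact expression for d_t (a byte budget from the allowed rate over Δt, minus pending bytes C_t, divided by the block group size B) are assumptions consistent with the surrounding description, not the patent's verbatim equations:

```python
# Hedged sketch of destaging window control; the w_t update and the d_t
# expression are assumptions, and all numeric values are examples only.
def update_window(prev, desired_latency, observed_latency, gamma):
    """w_t or v_t: grow when observed latency is below the target, shrink otherwise."""
    return (1 - gamma) * prev + gamma * (desired_latency / observed_latency) * prev

def blocks_to_destage(w_t, v_t, delta_t_ms, pending_bytes, group_bytes):
    """Assumed d_t: budget = allowed rate (Bpms) * elapsed ms, minus bytes already
    pending for normal remote accesses (C_t); destage only if the result is positive."""
    budget = min(w_t, v_t) * delta_t_ms - pending_bytes
    return max(0, int(budget // group_bytes))

w = update_window(prev=2048.0, desired_latency=20.0, observed_latency=25.0, gamma=0.1)
v = update_window(prev=4096.0, desired_latency=5.0, observed_latency=4.0, gamma=0.1)
print(blocks_to_destage(w, v, delta_t_ms=200, pending_bytes=64 * 1024,
                        group_bytes=256 * 1024))   # -> 1 block group to destage
```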
- vStore may be implemented using Xen's blktap interface.
- Xen is a virtual machine monitor.
- a virtual machine monitor, also referred to as a hypervisor, allows guest operating systems to execute on the same computer hardware concurrently. Other virtual machine monitors may be used for implementing vStore.
- FIG. 5 shows, as an example, the Xen implementation of vStore in one embodiment of the present disclosure.
- the blktap mechanism redirects a VM's disk I/O requests to a tapdisk process 508 running in the userspace of Domain-0.
- a user application 502 reads or writes to the blkfront device 504.
- normally, blkfront connects to blkback and all block traffic is delivered to it. If blktap 506 is enabled, blktap replaces blkback and all block traffic is redirected to the tapdisk process 508. Overall, the blktap mechanism provides a convenient method to intercept block traffic and implement new functionality in user space.
- Xen ships with several types of tapdisks so that the tapdisk process can open the block device using the specified disk type.
- Disk types are simply a set of callback functions such as open, close, read, write, do callback and submit.
- the synchronous I/O type uses normal read/write system calls to handle each incoming block I/O.
- the AIO-based disk type uses the Linux AIO library to issue multiple block requests in a batch.
- vStore may also implement this predefined set of callback functions and register with tapdisk as another disk type.
- vStore 510 may be based on the asynchronous I/O mechanism. For example, vStore submits requests to the Linux AIO library 512 and periodically polls for completed I/Os.
- the internal structure of vStore 510 may be an event-driven architecture.
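- The actual implementation is a tapdisk process written in C on top of the Linux AIO library; the following Python sketch only illustrates the event-driven submit-then-poll pattern (batch submission, periodic reaping of completions), not any real tapdisk or libaio API:

```python
# Illustrative only: mimics the submit-then-poll pattern with Python threads
# standing in for kernel AIO contexts; no Xen or libaio calls are shown.
import concurrent.futures as cf

def complete_request(req, result):
    print(f"completed {req}: {len(result)} bytes")    # per-request callback

def aio_style_loop(requests, do_io):
    pool = cf.ThreadPoolExecutor(max_workers=8)       # stand-in for AIO contexts
    inflight = {pool.submit(do_io, r): r for r in requests}   # submit a batch
    while inflight:
        done, _ = cf.wait(inflight, timeout=0.01,     # periodic completion poll
                          return_when=cf.FIRST_COMPLETED)
        for fut in done:
            complete_request(inflight.pop(fut), fut.result())
    pool.shutdown()

aio_style_loop(["read block 0", "read block 1"],
               do_io=lambda r: b"\x00" * 4096)        # fake 4 KB reads
```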
- a vStore also may be implemented using synchronous I/O in another embodiment.
- the architecture of the present disclosure may also include cloud storage infrastructure with features such as cache block transfer between VM hosts to support fast migration, replication of cache blocks to nearby storage (possibly at a higher level of the hierarchy or in the same rack) within other hosts to support fast restart of VMs on a failed host, and an intelligent workload balancing mechanism between using the local storage and the remote storage for performance and/or cost optimization, e.g., a mechanism to dynamically determine whether to use remote storage or the local cache.
- aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
- the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
- a computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
- a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof.
- a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages, a scripting language such as Perl, VBS or similar languages, and/or functional languages such as Lisp and ML and logic-oriented languages such as Prolog.
- the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- LAN local area network
- WAN wide area network
- Internet Service Provider an Internet Service Provider
- These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
- the computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
- the systems and methodologies of the present disclosure may be carried out or executed in a computer system that includes a processing unit, which houses one or more processors and/or cores, memory and other systems components (not shown expressly in the drawing) that implement a computer processing system, or computer that may execute a computer program product.
- the computer program product may comprise media, for example a hard disk, a compact storage medium such as a compact disc, or other storage devices, which may be read by the processing unit by any techniques known or will be known to the skilled artisan for providing the computer program product to the processing system for execution.
- the computer program product may comprise all the respective features enabling the implementation of the methodology described herein, and which—when loaded in a computer system—is able to carry out the methods.
- Computer program, software program, program, or software in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
- the computer processing system that carries out the system and method of the present disclosure may also include a display device such as a monitor or display screen for presenting output displays and providing a display through which the user may input data and interact with the processing system, for instance, in cooperation with input devices such as the keyboard and mouse device or pointing device.
- the computer processing system may be also connected or coupled to one or more peripheral devices such as the printer, scanner, speaker, and any other devices, directly or via remote connections.
- the computer processing system may be connected or coupled to one or more other processing systems such as a server, other remote computer processing system, network storage devices, via any one or more of a local Ethernet, WAN connection, Internet, etc. or via any other networking methodologies that connect different computing systems and allow them to communicate with one another.
- the various functionalities and modules of the systems and methods of the present disclosure may be implemented or carried out distributedly on different processing systems or on any single platform, for instance, accessing data stored locally or distributedly on the network.
- aspects of the present disclosure may be embodied as a program, software, or computer instructions embodied in a computer or machine usable or readable medium, which causes the computer or machine to perform the steps of the method when executed on the computer, processor, and/or machine.
- a program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform various functionalities and methods described in the present disclosure is also provided.
- the system and method of the present disclosure may be implemented and run on a general-purpose computer or special-purpose computer system.
- the computer system may be any type of known or will be known systems and may typically include a processor, memory device, a storage device, input/output devices, internal buses, and/or a communications interface for communicating with other computer systems in conjunction with communication hardware and software, etc.
- the terms “computer system” and “computer network” as may be used in the present application may include a variety of combinations of fixed and/or portable computer hardware, software, peripherals, and storage devices.
- the computer system may include a plurality of individual components that are networked or otherwise linked to perform collaboratively, or may include one or more stand-alone components.
- the hardware and software components of the computer system of the present application may include and may be included within fixed and portable devices such as desktop, laptop, and/or server.
- a module may be a component of a device, software, program, or system that implements some "functionality", which can be embodied as software, hardware, firmware, electronic circuitry, etc.
Abstract
A virtual storage module operable to run in a virtual machine monitor may include a wait-queue operable to store incoming block-level data requests from one or more virtual machines. In-memory metadata may store information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines. The data stored in local persistent storage replicates a subset of data in one or more virtual disks provided to the virtual machines. The virtual disks are mapped to remote storage accessible via a network connecting the virtual machines and the remote storage. A cache handling logic may be operable to handle the block-level data requests by obtaining the information in the in-memory metadata and making I/O requests to the local persistent storage or the remote storage or combination of the local persistent storage and the remote storage to service the block-level data requests.
Description
- The present application generally relates to computer systems and computer storage, and more particularly to virtual storage and storage architecture.
- Designing a storage system is a challenging task. For instance, in Cloud Computing, a high degree of virtualization increases the demand for storage space, and this requires the use of remote storage. However, uncontrolled access to the remote storage from a large number of virtual machines can easily saturate the networking infrastructure and affect all of the systems using the network.
- More particularly, for example, in IaaS (Infrastructure-as-a-Service) cloud services, the storage needs of VM (Virtual Machine) instances are met through virtual disks (i.e., virtual block devices). However, it is nontrivial to provide virtual disks to VMs in an efficient and scalable way for a couple of reasons. First, a VM host may be required to provide virtual disks for a large number of VMs. It is difficult to ascertain the largest possible storage demands and physically provision them all in the host machine. On the other hand, if the storage spaces for virtual disks are provided through remote storage servers, aggregate network traffic due to storage accesses from VMs can easily deplete the network bandwidth and cause congestion.
- A storage system and method for handling data for virtual machines, for instance, for scalable cloud storage architecture, may be provided. The system, in one aspect, may include a virtual storage module operable to run in a virtual machine monitor. The virtual storage module may include a wait-queue operable to store incoming block-level data requests from one or more virtual machines, and in-memory metadata for storing information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines. The data stored in local persistent storage may be replication of a subset of data in one or more virtual disks provided to the virtual machines, the virtual disks being mapped to remote storage accessible via a network connecting the virtual machines and the remote storage. A cache handling logic may be operable to handle the block-level data requests by obtaining the information in the in-memory metadata and making I/O requests to the local persistent storage or the remote storage or combination of the local persistent storage and the remote storage to service the block-level data requests.
- A method for handling data storage for virtual machines, in one aspect, may include intercepting one or more incoming block-level data requests received by a virtual machine monitor from one or more virtual machines. The method may also include obtaining from in-memory metadata, information associated with data of the block-level data request. The in-memory metadata may store information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines. The data stored in local persistent storage may be replication of a subset of data in one or more virtual disks provided to the virtual machines. The virtual disks may be mapped to remote storage accessible via a network connecting the virtual machines and the remote storage. The method may further include making I/O requests to the local persistent storage or the remote storage or combination of the local persistent storage and the remote storage to service the block-level data requests.
- A computer readable storage medium storing a program of instructions executable by a machine to perform one or more methods described herein also may be provided.
- Further features as well as the structure and operation of various embodiments are described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements.
- FIG. 1 shows the architecture of a scalable Cloud storage system in one embodiment of the present disclosure.
- FIG. 2 shows the architecture of vStore in one embodiment of the present disclosure.
- FIG. 3 illustrates the structure of one cache entry in one embodiment of the present disclosure.
- FIG. 4A is a flow diagram illustrating read request handling in one embodiment of the present disclosure.
- FIG. 4B is a flow diagram illustrating write request handling in one embodiment of the present disclosure.
- FIG. 5 shows, as an example, the Xen implementation of vStore in one embodiment of the present disclosure.
- The present disclosure in one embodiment presents a system (referred to in this disclosure as vStore), which utilizes the host's (e.g., computer server hosting virtual machines) local disk space as a block-level cache for the remote storage (e.g., network attached storage), for example, in order to absorb network traffic from storage accesses. This allows the VMM (Virtual Machine Monitor, a.k.a. hypervisor) to serve VMs' disk input/output (I/O) requests from the host's local disks most of the time, while providing the illusion of much larger storage space for creating new virtual disks. Caching virtual disks at block-level poses special challenges in achieving high performance while maintaining virtual disk semantics. First, after a disk write operation finishes from the VM's perspective, the data should survive even if the host immediately encounters a power failure. That is, the block-level cache should preserve data integrity in the event of host crashes. To that end, cache handling operations in one embodiment of the present disclosure may ensure consistency between on-disk metadata and data to avoid committing incorrect data to the network attached storage (NAS) during recovery from a crash, while minimizing overheads in updating on-disk metadata. Second, as disk I/O performance is dominated by disk seek times, a virtual disk should be kept as sequential as possible in the limited cache space. Unlike memory-based caching schemes, the performance of an on-disk cache is highly sensitive to data layout. The present disclosure in one embodiment may utilize a cache placement policy that maintains a high degree of data sequentiality in the cache as in the original (i.e., remote) virtual disk. Third, the destaging operation that sends dirty pages back to the remote storage server may be self-adaptive and minimize the impact on the foreground traffic.
- In another aspect, a scalable architecture is presented that provides reliable virtual disks (i.e., block devices as opposed to object stores) for virtual machines (VM) in a cloud environment.
- FIG. 1 shows the architecture of a scalable Cloud storage system in one embodiment of the present disclosure. The architecture may include one or more VM-hosting machines (e.g., 102, 104, 106). A VM-hosting machine is a physical machine that hosts a large number of VMs and has limited local storage space. vStore 108 uses local storage 110 as a block-level cache and provides to VMs 112 the illusion of unlimited storage space. vStore 108 may be implemented in hypervisor 114 and provides a persistent cache. vStore 108 performs caching at the block device level rather than the file system level. The hypervisor 114 executes on one or more computer processors and provides a virtual block device to VMs 112, which implies that VMs 112 see raw block devices and are free to install any file systems on top of them. Thus, hypervisor 114 receives block-level requests and redirects them to the remote storage (e.g., 116, 118).
- In one embodiment, a single cache space is provided per machine (e.g., 102). The cache tries to replicate the block layout of remote storage (e.g., 116, 118) in the local cache space (local disk) 110.
- Storage server clusters (e.g., 116, 118) provide network attached storage to physical machines (e.g., 102, 104, 106). They (e.g., 116, 118) can be either dedicated high-performance storage servers or a cluster of servers using commodity storage devices. The interface to the hypervisors 114 can be either block-level or file-level. If it is block-level, an iSCSI-type protocol can be used between storage servers and clients (i.e., hypervisors). If it is file-level, the hypervisor mounts a remote directory structure and keeps the virtual disks as individual files. Regardless of the protocol between hypervisors and storage servers, the interface between VMs and the hypervisor remains at the block level.
- The
directory server 120 holds the location information about the storage server clusters. When a hypervisor 114 wants to attach a virtual disk to a VM, it consults thedirectory server 120 to determine the address of a specific storage server (e.g., 116, 118) that currently stores the virtual disk. - The architecture also includes networking infrastructure. Usually network bandwidth within a rack is well-provisioned, but cross-rack network is usually 5-10 times under-provisioned than that of within-rack network. As a result, uncontrolled storage accesses from VMs can easily deplete the network bandwidth and cause congestion.
- An example configuration may have rack-mounted servers for hosting virtual machines and remote storage servers to provide storage services to the VMs. A rack may contain more than 20 servers and virtual machine monitors such as Xen-3.1.4 hypervisor installed on each of them. Servers may have processors such as two Intel® Xeon™ CPU of 3.40 GHz and have memory, e.g., 2 giga (G) bytes of memory. They can communicate through 1 Gbps link within the rack. Local storage for each server may be about 1 terabytes and they have a network file system (NFS)-mounted shared storage space that is used to hold VM images for all Virtual Machines. Remote storage servers may have physical hard disks attached, e.g., through Serial Advanced Technology Attachment (SATA) interface.
- There may be multiple options when designing a storage system for a Cloud. One solution is to use only local storage. In a Cloud, VMs may use different amounts of storage space, depending on how much the user pays. If every host's local storage space is over-provisioned for the largest possible demand, the cost would be prohibitive. Another solution is to only use network attached storage. That is, a VM's root file system, swap area, and additional data disks are all stored on network attached storage. This solution, however, would incur a large amount of network traffic and disk I/O load on the storage servers.
- Sequential disk access can achieve a data rate of 100 MB/s. Even with pure random access, it can reach 10 MB/s. Since 1 Gbps network can sustain roughly about 13 MB/s, four uplinks to the rack-level switch are not enough to handle even one single sequential access. Note that uplinks to the rack-level network switches are limited in numbers and cannot be easily increased in commodity systems. Even for random disk access, it can only support about five VMs' disk I/O traffic. Even with 10 Gbps networks, it still can hardly support thousands of VMs running in one rack (e.g., typical numbers are 42 hosts per rack, and 32 VMs per host, i.e., 1,344 VMs per rack).
-
vStore 108 takes a hybrid approach that leverages bothlocal storage 110 and network attachedstorage storage VMs 112, but utilizes thelocal storage 110 of ahost 102 to cache data and avoid accessing network attachedstorage - Consider the case of Amazon EC2, where a VM is given one 10 GB virtual disk to store its root file system and another 160 GB virtual disk to store data. The root disk can be stored on local storage due to its small size. The large data disk can be stored on network attached storage and accessed through the vStore cache. Data integrity and performance are two main challenges in the design of vStore. After a disk write operation finishes from the VM's perspective, the data should survive even if the host immediately encounters a power failure. In vStore, system failures can compromise data integrity in several ways. If the host crashes while vStore is in the middle of updating either the metadata or the data and there is no mechanism for detecting the inconsistency between the metadata and the data, after the host restarts, incorrect data may remain in the cache and be written back to the network attached storage. Another case that may compromise data integrity is through violating the semantics of writes. If data is buffered in memory and not flushed to disk after reporting write completion to the VM, a system crash will cause data loss. Taking such semantics in consideration vStore of the present disclosure in one embodiment may be designed to support data integrity.
- The second challenge is to achieve high performance, which conflicts with ensuring data integrity and hence may be designed to minimize performance penalties. The performance of vStore may be affected by several factors: (i) data placement within the cache, (ii) vStore metadata placement on disk, (iii) complication introduced by the vStore logic. For (i), if sequential blocks in a virtual disk are placed far apart in the cache, a sequential read of these blocks incurs a high overhead due to a long disk seek time. Therefore, in one embodiment, vStore keeps a virtual disk as sequential as possible in the limited cache space. For (ii), ideally, on-disk metadata should be small and should not require an additional disk seek to access data and metadata separately. For (iii), one potential overhead is the dependency among outstanding requests. For example, if one request is about to evict one cache entry, then all the requests on that entry must wait. All of these factors may be considered in the design of vStore.
-
FIG. 2 shows the architecture of vStore in one embodiment of the present disclosure. The description herein is based on para-virtualized Xen as an example.VMs 202 generate block requests in the form of (sector address, sector count). Requests arrive at the front-end device driver within theVM 202 after passing through the guest kernel. Then they are forwarded to the back-end driver in Domain-0. The back-end driver issues actual I/O requests to the device, and send responses to theguest VM 202 along the reverse path. - In one embodiment, the
vStore module 204 runs in Domain-0, and extends the function of the back-end device driver.vStore 204 intercepts requests and filters them through its cache handling logic. InFIG. 2 ,vStore 204 internally may include await queue 206 for incoming requests, acache handling logic 208, and in-memory metadata 210. Incoming requests are first put into vStore'swait queue 206. Thewait queue 206 is used in one embodiment because the cache entry that this request needs to use might be under eviction or update triggered by previous requests. After clearing such conflicts, the request is handled by thecache handling logic 208. The in-memory metadata 210 are consulted to obtain information such as block address, dirty bit, and modification time. Depending on the current cache state, actual I/O requests are made to either the cache onlocal storage 212 or the network attachedstorage 214. - I/O Unit: Guest VMs usually operate on 4 KB blocks, but vStore can perform I/Os to and from the network attached storage at a configurable larger unit. A large I/O unit reduces the size of in-memory metadata, as it reduces the number of cache entries to manage. Moreover, a large I/O unit works well with high-end storage servers, which are optimized for large I/O sizes (e.g., 256 KB or even 1 MB). Thus, reading a large unit is as efficient as reading 4 KB. This may increase the incoming network traffic, but our evaluation shows that the subsequent savings outweigh the initial cost. We use the term, block group, to refer to the I/O unit used by the vStore as opposed to the (typically 4 KB) block used by the guest VMs. That is, one block group contains one or more 4 KB blocks.
- Metadata: Metadata holds information about cache entries on disk. Metadata are stored on disk for data integrity and cached in memory for performance. Metadata updates are done in a write-through manner. After a host crashes and recovers, vStore visits each metadata entry on disk and recovers any dirty data that have not been flushed to network attached storage. Table 1 summarizes examples of the metadata fields in one embodiment of the present disclosure.
-
TABLE 1 vStore Metadata. Fields Size Descriptions Virtual 2 Bytes ID assigned by vStore to uniquely identify a Disk ID virtual disk. An ID is unique only within individual hypervisors. Sector 4 Bytes Cache entry's remote address in unit of Address sector. Dirty Bit 1 Bit Set if cache content is modified. Valid Bit 1 Bit Set if cache entry is being used and the corresponding data is in the cache. Lock Bit 1 Bit Set if under modification by a request. Read Count 2 Bytes How many read accesses within a time unit. Write Count 2 Bytes How many write accesses within a time unit. Bit Vector Variable Each bit represents 4 KB within the block group. Set if corresponding 4 KB is valid. The size is (block group)/4 KB bits. Access Time 8 Bytes Most recently accessed time. Total Size <23 Bytes - Virtual Disk identifier (ID) identifies a virtual disk stored on network attached storage. When a virtual disk is detached and reconnected later, cached contents that belong to this disk is identified and reused. Bit Vector has one bit for each 4 KB block in a block group so that the states of 4 KB blocks in the same block group can be changed and tracked individually. Without Bit Vector, the states of 4 KB blocks in the same block group must always be changed together. As a result, when the VM writes to a 4 KB block, vStore must read the entire block group (including all 4 KB blocks in that block group) from network attached storage, merge with the 4 KB new data, and writes the entire block group to cache. With Bit Vector, vStore can write to the 4 KB data directly without fetching the entire block group, and then only change the affected 4 KB block's state in Bit Vector. Our experiments show that Bit Vector helps reduce network traffic when using a large cache unit size.
- Maintaining metadata on disk may compromise performance. A naive implementation may require two disk accesses to handle one write request issued by a VM—one for metadata update and one for writing actual data. In the present disclosure in one embodiment, vStore solves this problem by putting metadata and data together, and updates them in a single write. The details are described below.
- In-memory Metadata: To avoid disk I/Os for reading the on-disk metadata, vStore in one embodiment maintains a complete copy of the metadata in memory and updates them in a write-through manner. One embodiment of the present disclosure use a large block group size (e.g., 256 KB) to reduce the size of the in-memory metadata.
- Cache Structure: vStore in one embodiment of the present disclosure organizes local storage as a set-associative cache with write-back policy by default. We describe the cache as a table-like structure, where a cache set is a column in the table, and a cache row is a row in the table. A cache row includes multiple block groups. A block group has contents coming from one virtual disk, but different block groups in the same cache row may have contents coming from different virtual disks. Block groups in the same cache row are laid out in logically contiguous disk blocks in one embodiment of the present disclosure.
-
FIG. 3 illustrates structure of one cache entry in one embodiment of the present disclosure. A block group includes n number of 4 kilobyte (KB) blocks and each 4 KB blocks have trailers. For instance, each 4 KB block 302 in ablock group 304 has a 512-byte trailer 306 shown inFIG. 3 . Thistrailer 306 in one embodiment includesmetadata 308 and thehash value 310 of the 4 KB data block 302. On a write operation, vStore computes the hash of the 4 KB block 302, and writes the 4 KB block 302 and its 512-byte trailer 306 in a single write operation. If the host crashes during the write operation, after recovery, the hash value helps detect that the 4 KB block and the trailer are inconsistent. The 4 KB block can be safely discarded, because the completion of the write operation has not been acknowledged to the VM yet. When handling a read request, vStore also reads the 512-byte trailer 306 together with the 4KB block 302. As a result, a sequential read of two adjacent blocks issued by the VM is also sequential in the cache. If only the 4 KB data block is read without the trailer, the sequential request would be broken into two sub-requests, spaced apart by 512 bytes. - In one aspect, simple policies like least recently used (LRU) and least frequently used (LFU) may not be suitable for vStore, because they are designed primarily for memory-based cache without consideration of block sequentiality on disk. If two consecutive blocks in a virtual disk are placed at two random locations in vStore's cache, sequential I/O requests issued by the VM become random accesses on the physical disk. In one embodiment, vStore's cache replacement algorithm strives to preserve the sequentiality of a virtual disk's blocks.
- Below, we describe an embodiment of vStore's cache replacement algorithm in detail. We introduce the concept of base cache row of a virtual disk. The base cache row is the default cache row on which the first row of blocks of a virtual disk is placed. Subsequent blocks of the virtual disk are mapped to the subsequent cache rows. For example, if there are two virtual disks Disk1 and Disk2 currently attached to the vStore and the cache associativity is 5 (i.e., there are 5 cache rows), then Disk might be assigned 1 as a base cache row and Disk2 might be assigned 3 to keep them reasonably away from each other. If we assume one cache row is made of ten 128 KB cache groups, Disk2's block at address 1280K will be mapped to row 4 which is the next row from Disk2's base cache row.
- Upon arrival of new data block, vStore in one embodiment determines the cache location in two steps. First, it looks at the cache entry's state whose location is calculated using the base cache row and the block's address. If it is invalid or not dirty, then it is immediately assigned to the cache entry. If dirty, a victim entry is selected based on the scores. Six criteria may be used to calculate the score one embodiment.
-
- Recentness—E.g., the more recently accessed, higher the score.
- Prior Sequentiality—This measures how sequential the cache entry is with respect to the adjacent cache entries. If the cache entry is already sequential, then we prefer to keep it in one embodiment.
- Prior Distance—This measures how far away the cache entry is from the default base cache row. If the entry is located in cache row 2 and the default base cache row of the virtual disk is 1, then the value is 2−1=1.
- Posterior Sequentiality—This measures how sequential it will be if we cache new block. If it becomes sequential, then we prefer this cache entry as a victim.
- Posterior Distance—This measures how far away from the default base cache row it would be if we cache new block. If this distance is far, it is less preferable.
- Dirtiness—If the cache entry is modified, we would like to avoid evicting this entry as much as possible.
- Let xi be each of the six criteria described above, e.g., for i=1 to 6. A score may be computed using equation (1) as follows.
-
S=a 0 ·x 0 +a 1 ·x 1 + . . . +a 5 ·x 5 (1) - Here the coefficient ai represents the weight of each criterion. If all ai is 0 except for a5, the eviction policy becomes equivalent to LRU. Weight coefficients are adjustable according to the preference. In one embodiment, this value (score) is computed for all the cache entry within the cache set and the entry with the lowest score is chosen for eviction.
- Cache Handling Operations
- In one embodiment of the present disclosure, there may be three cases in cache handling—cache hit, miss without flush and miss with flush. In one embodiment, vStore design considers both performance and data integrity in its cache handling operations. Since vStore uses disk as a cache space, cache handling has more disk access than when cache were not used. Excessive disk accesses may degrade the overall performance and reduce the merit of using vStore. In one embodiment of the present disclosure, disk accesses are minimized to make the performance loss tolerable. vStore may address data integrity, in one embodiment as follows. 512 byte trailer to each 4K blocks is added to record hash of it. In order to minimize disk I/O in one embodiment of the present disclosure, we read and write the trailer together. This only increases data size, but does not increase the number of I/O. However, for cache miss handling, additional disk I/O for data integrity may be introduced. In general, such consistency issue complicates overall cache handling and there may be a trade-off between maintaining consistency and performance penalty due to additional disk I/O.
-
FIG. 4A is a flow diagram illustarting a read request handling in one embodiment of the present disclosure.FIG. 4B is a flow diagram illustarting a write request handling in one embodiment of the present disclosure. - READ Handling
-
FIG. 4A illustrates a flow diagram for read cache handling in one embodiment of the present disclosure. At 402, a read request is received. The read request may originate from an application in a VM, for example to read data X. At 404, it is determined whether the block group which stores the data of the read request is already cached. For example, the sector address of the read data is compared with the in-memory metatdata to determine whether the block group is cached already. If it is determined that the block group is cached, the flow logic proceeds to 406, otherwise the flow logic proceeds to 420. - Using a virtual disk involves multiple steps: open the virtual disk, perform reads/writes, and finally close the virtual disk. When the virtual disk was opened, vStore assigns a “Virtual Disk ID” to the virtual disk and maps it to a remote disk on storage server (virtual disk ID was described previously). This mapping relationship is kept in a mapping table, and stored both in memory and on disk in one embodiment. When the VM issues a read request, vStore knows the Virtual Disk ID implicitly (because the request comes from a previously opened handle) and the sector address is specified explicitly. Combining the virtual disk ID and the sector address as one search key to look up the in-memory metadata can determine whether the data is cached and if so which block group currently caches the data. The following shows an example data struc-ture of the combined search key.
-
Virtual 2 Bytes Disk ID Sector Address 4 Bytes - At 406, it is determined whether the 4 KB block corresponding to the requested read data, e.g., data X is cached. If so, at 408, local disk is read to retrieve the data. At 410, the data is returned to the requestor. If at 406, it is determined that parts of the requested read data are cached while other parts are not cached (e.g., 1 KB in the cache and 3 KB on remote storage server), the cached block group from the local disk is read at 412. At 414, data corresponding the reqeusted read data is read from the remote disk and returned at 416. At 418, the locally read data and the remotely read data are merged. The merged data is written to cache for later reuse on a cache hit.
- At 404, if it is determined that the block group corresponding to the requested read data is not cached, the cache replacement algorithm chooses a location in the cache to hold the requested read data. At 420, it is determined whether the old data currently cached at that location is dirty, i.e., the old data of that cache entry needs to be stored or updated in the remote storage since that old data will be evicted from the cache. At 420, if the cache entry is not dirty, the requested read data is read from the remote storage device at 422. The data is returned at 424 and written to cache at 426.
- At 420, if it is determined that the old data in the cache entry is dirty, at 428, Bit Vector is examined to determine whether the old data in the cache entry is partially valid, i.e., part of the data are stored in the cache while the other part are stored on the remote storage server. Partial validity may be determined, for example, by reading the bit vector values for each of the 4 KB blocks in the block group. For instance, if a bit in the bit vector is 0, that part of the data is in local cache. If it is 1 that part of the data is on remote storage. If it is determined that the existing data in the cache entry is partially valid, the corresponding data from the remote storage device is read at 430. At 432, if the entire data of the cache entry is valid, the data is read from the local storage. At 434, the cache entry data is written to remote storate. If the cache entry data has partially valid data, the remotely read data (at 430) is merged with the locally read data (at 432) before the data is written to the remote storage at 434. At 436, the requested read data is read from the remote storage. The read data is returned at 438 to the requestor (e.g., the application that requested it). At 440, the reqesuted read data retrieved from the remote storage is written to cache. Here, the merge at 442 implies a wait for operations on both incoming links (434, 438) to complete, before performing the operation on the outgoing link (440). This is used, for example, to gurantee data integrity or to wait for data from both lock disk and remote storage.
- A difference of read handling in
FIG. 4A from write handling shown inFIG. 4B is that vStore can return the data as soon as it is available and continue the rest of the cache operations in background. This is reflected in the miss handling operations (e.g., 420 to 440). For example, remote read (e.g., 422, 436) may be initiated first. As soon as vStore finishes reading the requested block, it returns with the data (e.g., 424, 438). On-disk metadata update and cache data write may be performed afterwards (e.g., 426, 440). - WRITE Handling
-
FIG. 4B is a flow diagram illustarting a write request handling in one embodiment of the present disclosure. At 450, write request (or command) is received to write data (e.g., data X). At 452, it is determined whether the block group to which the requested write data belongs, is cached, e.g., using virtual disk ID and sector number as the search key to look up the in-memory metadata. At 454, if the data is cached, the data is written to the local storage, i.e., cached. At 456, the process returns, for instance, acknowledging successful write to the requestor. - At 458, if the block group is not cached, it is determined as to whether the block group is dirty, i.e., whether the data content of the block group is modified. Whether the content of the block group is modified may be determined from reading the metadata associated with the block group and the values for the dirty bits of the 4 KB blocks contained therein. At 460, if the content of the block group is determined to be not modified (i.e., not dirty), the requested write data is written to cache. At 462, the process returns, for instance, acknowledging successful write to the requestor.
- If the content of the block group is modified, that data should be written out to the remote storage before the write data can overwrite the existing content of the block group. At 464, if the content of the block group is dirty (modified), it is determined whether the current content of the block group is partially valid. At 466, if the content is only partially valid, the remotely stored data corresponding to that content is read. This data may be merged with the current content of the block group in the local storage in order to make the local block group content wholely valid. At 468, the block group's content is read at 468. At 470, the content of the block group is written to the remote storage. At 472, the requested write data is written to cache at the location of the block group. At 474, the process returns, for instance, acknowledging successful write to the requestor.
- For write requests, vStore in one embodiment directly writes the data to the cache without accessing the network attached storage. This simplifies operations of cache hit and cache miss without flush. But, write handling for cache miss with flush may make several I/O requests. In
FIG. 4B , the write handling returns at the end of entire operation sequences. In the worst case, write handling incurs at most four disk I/Os, which may occur in the case of cache miss with flush. - Destaging
- Destaging refers to the process of flushing dirty (modified) data in the cache to the network attached storage. The destaging functionality in one embodiment of the present disclosure may be used to keep the proportion of dirty blocks under a specified level. Large number of dirty blocks is potentially harmful to the performance because evicting a dirty cache entry delays the cache handling operations significantly due to flushing operations. In addition, detachment of a virtual disk can be faster when there are less number of dirty blocks. If a VM wants to terminate or migrate, it has to detach the virtual disk. As part of the detachment process, all the dirty blocks belonging to the detaching storage has to be flushed. Without destaging, the amount of data that has to be transferred can be as large as orders of several gigabytes. Transferring that amount of data takes time and also generates bursty traffic.
- Mechanism Design
- In one embodimnet of the present disclosure, destaging may be triggered when the number of dirty blocks in the cache exceeds the user-specified level, which we call the pollution level. For example, if the pollution level is set to be 65%, it means that user wants to keep the ratio of dirty blocks to total blocks below 65%.
- Upon destaging, vStore in one embodiment may determine how many blocks to destage at a given time t. Basic idea in one embodiment is to maintain a window size wt which indicates the total allowed data transmission size in unit of bytes per millisecond (Bpms). This window size is the combined data transmission size for both normal remote storage accesses and the destaging. It is specified as a rate (Bpms) since destaging action can be fired at irregularly. If wt increases, then may be more likely that normal network attached storage access would leave more bandwidth available for destaging.
- Control technique for wt in vStore may adopt the technique used for flow control in FAST TCP and for queue lengths adjustment. wt may be adjusted using the network attached storage latency. Let R be the desired network attached storage latency. Let Rt be the exponentially weighted moving average of observed network attached storage latency expressed as Rt=(1−α)R+αRt-1, where α is a smoothing factor. We calculate wt using
-
- where γ is another smoothing factor for wt. If observed remote latency is smaller than R, then wt will increase and vice versa. In vStore, we also may consider the local latency denoted as vt.
- If we let Lt be the latency of local disk, we calculate vt as
-
- We take the minimum of wt and vt as the window size. Next we calculate how many block groups to destage using determined window size. Let dt denote the number of destage I/O to perform at time t, then
-
d t=(min(v t ,w t)×τt −C t)/B (3) - where τt is time length between t and t−1 in millisec, B the block group size and Ct pending I/O requests at time t in bytes. Ct represents the remote access from normal file system operations. Destaging may happen only if dt>0.
- vStore may be implemented using Xen's blktap interface. Xen is a virtual machine montior. Virtual machine monitor, also referred to as hypervisor, allows guest operating systems to excute on the same computer hardware concurrently. Other virtual machine monitors may be used for implementing the vStore.
FIG. 5 shows as an example, the Xen implementation of vStore in one embodiment of the present disclsoure. Blktap mechanism redirects a VM's disk I/O requests to atapdisk process 508 running in the userspace of Domain-0. In a para-virtualized VM,user application 502 reads or writes to theblkfront device 504. Normally blkfront connects to the blkback and all the block traffics are delivered to it. Ifblktap 506 is enabled, blktap replaces blkback and all the block traffics are now redirected to thetapdisk process 508. Overall the blktap mechanism provides convenient method to intercept block traffics and implement new functionalities in the user space. - Xen ships with several types of tapdisks so that tapdisk process can open the block device using the specified disk type. Disk types are simply a set of callback functions such as open, close, read, write, do callback and submit. Among several disk types, synchronous I/O type uses normal read, write system calls to handle each incoming block I/Os. AIO-based disk type uses Linux AIO library to issue multiple block requests in a batch. vStore also may implement those predefined set of callback functions and registers to tapdisk as another type of tapdisk.
vStore 510 may be based on the asynchronous I/O mechanism. For example, vStore submits requests to theLinux AIO library 512 and periodically polls for completed I/Os. Thus, internal structure ofvStore 510 may be an event-driven architecture. A vStore also may be implemented using synchronous I/O in another embodiment. - In another aspect, the architecture of the present disclosure may also include cloud storage infrstructure which has features such as cache block transfer between VM hosts to support fast migration, replication of cache blocks to nearby storage (possibly at higher level of hierarchy or same rack) within other hosts to support fast restart of VMs on a failed host, and an intelligent workload balancing mechanism between using the local stroage and the remote storage for performance and/or cost optimization, e.g., a mechanism to dyanmically determine using remote storage or local cache.
- As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
- Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages, a scripting language such as Perl, VBS or similar languages, and/or functional languages such as Lisp and ML and logic-oriented languages such as Prolog. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- Aspects of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
- The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
- The systems and methodologies of the present disclosure may be carried out or executed in a computer system that includes a processing unit, which houses one or more processors and/or cores, memory and other systems components (not shown expressly in the drawing) that implement a computer processing system, or computer that may execute a computer program product. The computer program product may comprise media, for example a hard disk, a compact storage medium such as a compact disc, or other storage devices, which may be read by the processing unit by any techniques known or will be known to the skilled artisan for providing the computer program product to the processing system for execution.
- The computer program product may comprise all the respective features enabling the implementation of the methodology described herein, and which—when loaded in a computer system—is able to carry out the methods. Computer program, software program, program, or software, in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
- The computer processing system that carries out the system and method of the present disclosure may also include a display device such as a monitor or display screen for presenting output displays and providing a display through which the user may input data and interact with the processing system, for instance, in cooperation with input devices such as the keyboard and mouse device or pointing device. The computer processing system may be also connected or coupled to one or more peripheral devices such as the printer, scanner, speaker, and any other devices, directly or via remote connections. The computer processing system may be connected or coupled to one or more other processing systems such as a server, other remote computer processing system, network storage devices, via any one or more of a local Ethernet, WAN connection, Internet, etc. or via any other networking methodologies that connect different computing systems and allow them to communicate with one another. The various functionalities and modules of the systems and methods of the present disclosure may be implemented or carried out distributedly on different processing systems or on any single platform, for instance, accessing data stored locally or distributedly on the network.
- The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
- The corresponding structures, materials, acts, and equivalents of all means or step plus function elements, if any, in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
- Various aspects of the present disclosure may be embodied as a program, software, or computer instructions embodied in a computer or machine usable or readable medium, which causes the computer or machine to perform the steps of the method when executed on the computer, processor, and/or machine. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform various functionalities and methods described in the present disclosure is also provided.
- The system and method of the present disclosure may be implemented and run on a general-purpose computer or special-purpose computer system. The computer system may be any type of known or will be known systems and may typically include a processor, memory device, a storage device, input/output devices, internal buses, and/or a communications interface for communicating with other computer systems in conjunction with communication hardware and software, etc.
- The terms “computer system” and “computer network” as may be used in the present application may include a variety of combinations of fixed and/or portable computer hardware, software, peripherals, and storage devices. The computer system may include a plurality of individual components that are networked or otherwise linked to perform collaboratively, or may include one or more stand-alone components. The hardware and software components of the computer system of the present application may include and may be included within fixed and portable devices such as desktop, laptop, and/or server. A module may be a component of a device, software, program, or system that implements some “functionality”, which can be embodied as software, hardware, firmware, electronic circuitry, or etc.
- The embodiments described above are illustrative examples and it should not be construed that the present invention is limited to these particular embodiments. Thus, various changes and modifications may be effected by one skilled in the art without departing from the spirit or scope of the invention as defined in the appended claims.
Claims (25)
1. A storage system for handling data for virtual machines, comprising:
a virtual storage module operable to run in a virtual machine monitor, the virtual storage module including at least,
a wait-queue operable to store incoming block-level data requests from one or more virtual machines;
in-memory metadata for storing information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines, the data stored in local persistent storage being replication of a subset of data in one or more virtual disks provided to the virtual machines, the virtual disks being mapped to remote storage accessible via a network connecting the virtual machines and the remote storage; and
a cache handling logic operable to handle the block-level data requests by obtaining the information in the in-memory metadata and making I/O requests to the local persistent storage or the remote storage or combination of the local persistent storage and the remote storage to service the block-level data requests.
2. The system of claim 1 , wherein the in-memory metadata includes at least virtual disk identifier that identifies a virtual disk stored on the remote storage, remote address of the data in the remote storage, a bit vector that indicates whether the data is valid, and a dirty bit that indicates whether the data is modified.
3. The system of claim 2 , wherein the virtual storage module manages block groups and performs I/O requests to the local persistent storage in units of one or more predetermined sized blocks.
4. The system of claim 3 , wherein each block stored in the local persistent storage includes a trailer that stores metadata of the block and hash value of the block used for checking data integrity of data content of the block, wherein after a host crash and recovery, the virtual storage module can examine the trailer to determine a virtual disk that owns said each block stored in the local persistent storage, and determine whether the data content of the block and the hash value are consistent.
5. The system of claim 4 , wherein the data content of the block and the trailer are read and written together in a single disk I/O operation.
6. The system of claim 3 , wherein the virtual storage module organizes the local persistent storage as set-associative cache structured into a table-like structure with rows and columns, each of the rows having multiple block groups wherein the block groups in a same row are laid out in logically contiguous disk blocks, and wherein each block group in the same row can store contents coming from a different virtual disk.
7. The system of claim 6 , wherein the one or more predetermined sized blocks can store data and metadata associated with the data, and wherein the in-memory metadata includes each of the metadata stored in the one or more predetermined sized blocks.
8. The system of claim 7 , wherein the predetermined sized blocks can further store hash value of the data.
9. The system of claim 1 , wherein the cache handling logic replaces data in the local persistent storage based on a score determined from summing weighted values associated with how recently the data was accessed, how sequential the data is with respect to an adjacent data, how far away the data is from a base row, how sequential the data would be if new block is cached, how far away from the base row the data would be if a new block is cached, and whether the data is modified.
10. The system of claim 1 , wherein the virtual storage module automatically destages modified data in the local persistent storage to the remote storage in response to determining that the modified data has reached a threshold.
11. The system of claim 10 , wherein the virtual storage module further determines how many blocks of data to destage at a given time based on total allowed data transmission size including combined data transmission size for both remote storage accesses and destaging.
12. The system of claim 1 , wherein the in-memory metadata are persisted on disk in a write-through manner to guarantee data integrity in an event of a host crash.
13. A method for handling data storage for virtual machines, comprising:
intercepting one or more incoming block-level data requests received by a virtual machine monitor from one or more virtual machines;
obtaining from in-memory metadata, information associated with data of the block-level data request, the in-memory metadata for storing information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines, the data stored in local persistent storage being replication of a subset of data in one or more virtual disks provided to the virtual machines, the virtual disks being mapped to remote storage accessible via a network connecting the virtual machines and the remote storage; and
making I/O requests to the local persistent storage or the remote storage or combination of the local persistent storage and the remote storage to service the block-level data requests.
14. The method of claim 13 , wherein the in-memory metadata includes at least virtual disk identifier that identifies a virtual disk stored on the remote storage, remote address of the data in the remote storage, a bit vector that indicates whether the data is valid, and a dirty bit that indicates whether the data is modified.
15. The method of claim 14 , further including managing block groups and performing I/O requests to the local persistent storage in units of predetermined sized blocks.
16. The method of claim 15 , further including organizing the local persistent storage as set-associative cache structured into a table-like structure with rows and columns, each of the rows having multiple block groups wherein the block groups in a same row are laid out in logically contiguous disk blocks, and wherein each block group in the same row can store contents coming from a different virtual disk
17. The method of claim 16 , wherein the one or more predetermined sized blocks can store data and metadata associated with the data, and wherein the in-memory metadata includes each of the metadata stored in the one or more predetermined sized blocks.
18. The method of claim 17 , wherein the predetermined sized blocks can further store hash value of the data.
19. The method of claim 13 , further including replacing data in the local persistent storage based on a score determined from summing weighted values associated with how recently the data was accessed, how sequential the data is with respect to an adjacent data, how far away the data is from a base row, how sequential the data would be if new block is cached, how far away from the base row the data would be if a new block is cached, and whether the data is modified.
20. The method of claim 13 , further including automatically destaging modified data in the local persistent storage to the remote storage in response to determining that the modified data has reached a threshold.
21. The method of claim 20 , further including determining how many blocks of data to destage at a given time based on total allowed data transmission size including combined data transmission size for both remote storage accesses and destaging.
22. A computer readable storage medium storing a program of instructions executable by a machine to perform a method for handling data storage for virtual machines, comprising:
intercepting one or more incoming block-level data requests received by a virtual machine monitor from one or more virtual machines;
obtaining from in-memory metadata, information associated with data of the block-level data request, the in-memory metadata for storing information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines, the data stored in local persistent storage being replication of a subset of data in one or more virtual disks provided to the virtual machines, the virtual disks being mapped to remote storage accessible via a network connecting the virtual machines and the remote storage; and
making I/O requests to the local persistent storage or the remote storage or combination of the local persistent storage and the remote storage to service the block-level data requests.
23. The computer readable storage medium of claim 22 , wherein the in-memory metadata includes at least virtual disk identifier that identifies a virtual disk stored on the remote storage, remote address of the data in the remote storage, a bit vector that indicates whether the data is valid, and a dirty bit that indicates whether the data is modified.
24. The computer readable storage medium of claim 20 , further including managing block groups and performing I/O requests to the local persistent storage in units of predetermined sized blocks.
25. The computer readable storage medium of claim 24 , further including organizing the local persistent storage as set-associative cache structured into a table-like structure with rows and columns, each of the rows having multiple block groups wherein the block groups in a same row are laid out in logically contiguous disk blocks, wherein each block group in the same row can store contents coming from a different virtual disk, wherein the one or more predetermined sized blocks can store data and metadata associated with the data, and wherein the in-memory metadata includes each of the metadata stored in the one or more predetermined sized blocks.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/986,466 US20120179874A1 (en) | 2011-01-07 | 2011-01-07 | Scalable cloud storage architecture |
US14/014,888 US9401960B2 (en) | 2011-01-07 | 2013-08-30 | Scalable cloud storage architecture |
US15/172,205 US10042760B2 (en) | 2011-01-07 | 2016-06-03 | Scalable cloud storage architecture |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/986,466 US20120179874A1 (en) | 2011-01-07 | 2011-01-07 | Scalable cloud storage architecture |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/014,888 Continuation US9401960B2 (en) | 2011-01-07 | 2013-08-30 | Scalable cloud storage architecture |
Publications (1)
Publication Number | Publication Date |
---|---|
US20120179874A1 true US20120179874A1 (en) | 2012-07-12 |
Family
ID=46456129
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/986,466 Abandoned US20120179874A1 (en) | 2011-01-07 | 2011-01-07 | Scalable cloud storage architecture |
US14/014,888 Expired - Fee Related US9401960B2 (en) | 2011-01-07 | 2013-08-30 | Scalable cloud storage architecture |
US15/172,205 Expired - Fee Related US10042760B2 (en) | 2011-01-07 | 2016-06-03 | Scalable cloud storage architecture |
Family Applications After (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/014,888 Expired - Fee Related US9401960B2 (en) | 2011-01-07 | 2013-08-30 | Scalable cloud storage architecture |
US15/172,205 Expired - Fee Related US10042760B2 (en) | 2011-01-07 | 2016-06-03 | Scalable cloud storage architecture |
Country Status (1)
Country | Link |
---|---|
US (3) | US20120179874A1 (en) |
Families Citing this family (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9565250B2 (en) * | 2014-05-30 | 2017-02-07 | Microsoft Technology Licensing, Llc | Data transfer service |
US9767021B1 (en) * | 2014-09-19 | 2017-09-19 | EMC IP Holding Company LLC | Optimizing destaging of data to physical storage devices |
US10241867B2 (en) | 2014-11-04 | 2019-03-26 | International Business Machines Corporation | Journal-less recovery for nested crash-consistent storage systems |
US10831715B2 (en) | 2015-01-30 | 2020-11-10 | Dropbox, Inc. | Selective downloading of shared content items in a constrained synchronization system |
US9361349B1 (en) | 2015-01-30 | 2016-06-07 | Dropbox, Inc. | Storage constrained synchronization of shared content items |
US20160269501A1 (en) * | 2015-03-11 | 2016-09-15 | Netapp, Inc. | Using a cache cluster of a cloud computing service as a victim cache |
US9852147B2 (en) * | 2015-04-01 | 2017-12-26 | Dropbox, Inc. | Selective synchronization and distributed content item block caching for multi-premises hosting of digital content items |
US9922201B2 (en) | 2015-04-01 | 2018-03-20 | Dropbox, Inc. | Nested namespaces for selective content sharing |
US10963430B2 (en) | 2015-04-01 | 2021-03-30 | Dropbox, Inc. | Shared workspaces with selective content item synchronization |
US10691718B2 (en) | 2015-10-29 | 2020-06-23 | Dropbox, Inc. | Synchronization protocol for multi-premises hosting of digital content items |
US9571573B1 (en) | 2015-10-29 | 2017-02-14 | Dropbox, Inc. | Peer-to-peer synchronization protocol for multi-premises hosting of digital content items |
US9537952B1 (en) * | 2016-01-29 | 2017-01-03 | Dropbox, Inc. | Apparent cloud access for hosted content items |
US10719532B2 (en) | 2016-04-25 | 2020-07-21 | Dropbox, Inc. | Storage constrained synchronization engine |
US10049145B2 (en) | 2016-04-25 | 2018-08-14 | Dropbox, Inc. | Storage constrained synchronization engine |
US9934303B2 (en) * | 2016-04-25 | 2018-04-03 | Dropbox, Inc. | Storage constrained synchronization engine |
US10678578B2 (en) * | 2016-06-30 | 2020-06-09 | Microsoft Technology Licensing, Llc | Systems and methods for live migration of a virtual machine based on heat map and access pattern |
CN106850825B (en) * | 2017-02-23 | 2020-08-07 | 中南大学 | Client block-level cache optimization method in mobile transparent computing environment |
CN107589907B (en) * | 2017-08-10 | 2019-12-13 | 深圳壹账通智能科技有限公司 | Data processing method, electronic device and computer readable storage medium |
US10642526B2 (en) * | 2017-08-28 | 2020-05-05 | Vmware, Inc. | Seamless fault tolerance via block remapping and efficient reconciliation |
US10866963B2 (en) | 2017-12-28 | 2020-12-15 | Dropbox, Inc. | File system authentication |
CN108429813B (en) * | 2018-03-22 | 2021-04-06 | 深圳市网心科技有限公司 | Disaster recovery method, system and terminal for cloud storage service |
US11290531B2 (en) | 2019-12-04 | 2022-03-29 | Dropbox, Inc. | Immediate cloud content item creation from local file system interface |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030014523A1 (en) * | 2001-07-13 | 2003-01-16 | John Teloh | Storage network data replicator |
US7486618B2 (en) * | 2003-05-27 | 2009-02-03 | Oracle International Corporation | Weighted attributes on connections and closest connection match from a connection cache |
US20060075281A1 (en) * | 2004-09-27 | 2006-04-06 | Kimmel Jeffrey S | Use of application-level context information to detect corrupted data in a storage system |
US7958310B2 (en) * | 2008-02-27 | 2011-06-07 | International Business Machines Corporation | Apparatus, system, and method for selecting a space efficient repository |
US9176883B2 (en) * | 2009-04-30 | 2015-11-03 | HGST Netherlands B.V. | Storage of data reference blocks and deltas in different storage devices |
- 2011-01-07: US12/986,466 filed in the US, published as US20120179874A1 (en); status: Abandoned
- 2013-08-30: US14/014,888 filed in the US, published as US9401960B2 (en); status: Expired - Fee Related
- 2016-06-03: US15/172,205 filed in the US, published as US10042760B2 (en); status: Expired - Fee Related
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050021764A1 (en) * | 1999-10-14 | 2005-01-27 | Barrall Geoffrey S. | Apparatus and method for hardware implementation or acceleration of operating system functions |
US20030046493A1 (en) * | 2001-08-31 | 2003-03-06 | Coulson Richard L. | Hardware updated metadata for non-volatile mass storage cache |
US20050125513A1 (en) * | 2003-12-08 | 2005-06-09 | Monica Sin-Ling Lam | Cache-based system management architecture with virtual appliances, network repositories, and virtual appliance transceivers |
US20110055827A1 (en) * | 2009-08-25 | 2011-03-03 | International Business Machines Corporation | Cache Partitioning in Virtualized Environments |
Cited By (133)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8701189B2 (en) | 2008-01-31 | 2014-04-15 | Mcafee, Inc. | Method of and system for computer system denial-of-service protection |
US11550476B2 (en) | 2008-02-28 | 2023-01-10 | Memory Technologies Llc | Extended utilization area for a memory device |
US11829601B2 (en) | 2008-02-28 | 2023-11-28 | Memory Technologies Llc | Extended utilization area for a memory device |
US11907538B2 (en) | 2008-02-28 | 2024-02-20 | Memory Technologies Llc | Extended utilization area for a memory device |
US11494080B2 (en) | 2008-02-28 | 2022-11-08 | Memory Technologies Llc | Extended utilization area for a memory device |
US11775173B2 (en) | 2009-06-04 | 2023-10-03 | Memory Technologies Llc | Apparatus and method to share host system RAM with mass storage memory RAM |
US11733869B2 (en) | 2009-06-04 | 2023-08-22 | Memory Technologies Llc | Apparatus and method to share host system RAM with mass storage memory RAM |
US9021046B2 (en) | 2010-01-15 | 2015-04-28 | Joyent, Inc | Provisioning server resources in a cloud resource |
US8959217B2 (en) | 2010-01-15 | 2015-02-17 | Joyent, Inc. | Managing workloads and hardware resources in a cloud resource |
US10176018B2 (en) * | 2010-12-21 | 2019-01-08 | Intel Corporation | Virtual core abstraction for cloud computing |
US20120158967A1 (en) * | 2010-12-21 | 2012-06-21 | Sedayao Jeffrey C | Virtual core abstraction for cloud computing |
US8555276B2 (en) * | 2011-03-11 | 2013-10-08 | Joyent, Inc. | Systems and methods for transparently optimizing workloads |
US8789050B2 (en) | 2011-03-11 | 2014-07-22 | Joyent, Inc. | Systems and methods for transparently optimizing workloads |
US20120233626A1 (en) * | 2011-03-11 | 2012-09-13 | Hoffman Jason A | Systems and methods for transparently optimizing workloads |
US8732702B2 (en) * | 2011-03-23 | 2014-05-20 | Emc Corporation | File system for storage area network |
US20120246643A1 (en) * | 2011-03-23 | 2012-09-27 | Lei Chang | File system for storage area network |
US20130007737A1 (en) * | 2011-07-01 | 2013-01-03 | Electronics And Telecommunications Research Institute | Method and architecture for virtual desktop service |
US9086897B2 (en) * | 2011-07-01 | 2015-07-21 | Electronics And Telecommunications Research Institute | Method and architecture for virtual desktop service |
US9256374B1 (en) | 2011-08-10 | 2016-02-09 | Nutanix, Inc. | Metadata for managing I/O and storage for a virtualization environment |
US9256475B1 (en) | 2011-08-10 | 2016-02-09 | Nutanix, Inc. | Method and system for handling ownership transfer in a virtualization environment |
US9389887B1 (en) | 2011-08-10 | 2016-07-12 | Nutanix, Inc. | Method and system for managing de-duplication of data in a virtualization environment |
US9354912B1 (en) | 2011-08-10 | 2016-05-31 | Nutanix, Inc. | Method and system for implementing a maintenance service for managing I/O and storage for a virtualization environment |
US11301274B2 (en) * | 2011-08-10 | 2022-04-12 | Nutanix, Inc. | Architecture for managing I/O and storage for a virtualization environment |
US9052936B1 (en) | 2011-08-10 | 2015-06-09 | Nutanix, Inc. | Method and system for communicating to a storage controller in a virtualization environment |
US9256456B1 (en) | 2011-08-10 | 2016-02-09 | Nutanix, Inc. | Architecture for managing I/O and storage for a virtualization environment |
US11314421B2 (en) | 2011-08-10 | 2022-04-26 | Nutanix, Inc. | Method and system for implementing writable snapshots in a virtualized storage environment |
US9575784B1 (en) | 2011-08-10 | 2017-02-21 | Nutanix, Inc. | Method and system for handling storage in response to migration of a virtual machine in a virtualization environment |
US9747287B1 (en) | 2011-08-10 | 2017-08-29 | Nutanix, Inc. | Method and system for managing metadata for a virtualization environment |
US9619257B1 (en) | 2011-08-10 | 2017-04-11 | Nutanix, Inc. | System and method for implementing storage for a virtualization environment |
US9009106B1 (en) | 2011-08-10 | 2015-04-14 | Nutanix, Inc. | Method and system for implementing writable snapshots in a virtualized storage environment |
US9652265B1 (en) * | 2011-08-10 | 2017-05-16 | Nutanix, Inc. | Architecture for managing I/O and storage for a virtualization environment with multiple hypervisor types |
US11853780B2 (en) | 2011-08-10 | 2023-12-26 | Nutanix, Inc. | Architecture for managing I/O and storage for a virtualization environment |
US10359952B1 (en) | 2011-08-10 | 2019-07-23 | Nutanix, Inc. | Method and system for implementing writable snapshots in a virtualized storage environment |
US9037901B2 (en) * | 2011-08-19 | 2015-05-19 | International Business Machines Corporation | Data set autorecovery |
US20130047032A1 (en) * | 2011-08-19 | 2013-02-21 | International Business Machines Corporation | Data set autorecovery |
US8694738B2 (en) | 2011-10-11 | 2014-04-08 | Mcafee, Inc. | System and method for critical address space protection in a hypervisor environment |
US9946562B2 (en) | 2011-10-13 | 2018-04-17 | Mcafee, Llc | System and method for kernel rootkit protection in a hypervisor environment |
US9069586B2 (en) * | 2011-10-13 | 2015-06-30 | Mcafee, Inc. | System and method for kernel rootkit protection in a hypervisor environment |
US20130097356A1 (en) * | 2011-10-13 | 2013-04-18 | Mcafee, Inc. | System and method for kernel rootkit protection in a hypervisor environment |
US8973144B2 (en) | 2011-10-13 | 2015-03-03 | Mcafee, Inc. | System and method for kernel rootkit protection in a hypervisor environment |
US9465700B2 (en) | 2011-10-13 | 2016-10-11 | Mcafee, Inc. | System and method for kernel rootkit protection in a hypervisor environment |
US20130125115A1 (en) * | 2011-11-15 | 2013-05-16 | Michael S. Tsirkin | Policy enforcement by hypervisor paravirtualized ring copying |
US9904564B2 (en) * | 2011-11-15 | 2018-02-27 | Red Hat Israel, Ltd. | Policy enforcement by hypervisor paravirtualized ring copying |
US8782224B2 (en) | 2011-12-29 | 2014-07-15 | Joyent, Inc. | Systems and methods for time-based dynamic allocation of resource management |
US8547379B2 (en) | 2011-12-29 | 2013-10-01 | Joyent, Inc. | Systems, methods, and media for generating multidimensional heat maps |
US11797180B2 (en) | 2012-01-26 | 2023-10-24 | Memory Technologies Llc | Apparatus and method to provide cache move with non-volatile mass memory system |
US10503423B1 (en) | 2012-03-12 | 2019-12-10 | EMC IP Holding Company LLC | System and method for cache replacement using access-ordering lookahead approach |
US9684469B1 (en) * | 2012-03-12 | 2017-06-20 | EMC IP Holding Company LLC | System and method for cache replacement using access-ordering lookahead approach |
US8886857B2 (en) * | 2012-04-06 | 2014-11-11 | Datacore Software Corporation | Data consolidation using a common portion accessible by multiple devices |
US20140304468A1 (en) * | 2012-04-06 | 2014-10-09 | Datacore Software Corporation | Data consolidation using a common portion accessible by multiple devices |
US11226771B2 (en) * | 2012-04-20 | 2022-01-18 | Memory Technologies Llc | Managing operational state data in memory module |
US11782647B2 (en) | 2012-04-20 | 2023-10-10 | Memory Technologies Llc | Managing operational state data in memory module |
US9772866B1 (en) | 2012-07-17 | 2017-09-26 | Nutanix, Inc. | Architecture for implementing a virtualization environment and appliance |
US10684879B2 (en) | 2012-07-17 | 2020-06-16 | Nutanix, Inc. | Architecture for implementing a virtualization environment and appliance |
US10747570B2 (en) | 2012-07-17 | 2020-08-18 | Nutanix, Inc. | Architecture for implementing a virtualization environment and appliance |
US11314543B2 (en) | 2012-07-17 | 2022-04-26 | Nutanix, Inc. | Architecture for implementing a virtualization environment and appliance |
US11093402B2 (en) * | 2012-08-27 | 2021-08-17 | Vmware, Inc. | Transparent host-side caching of virtual disks located on shared storage |
CN103023982A (en) * | 2012-11-22 | 2013-04-03 | 中国人民解放军国防科学技术大学 | Low-latency metadata access method of cloud storage client |
US8943284B2 (en) | 2013-03-14 | 2015-01-27 | Joyent, Inc. | Systems and methods for integrating compute resources in a storage area network |
US8677359B1 (en) | 2013-03-14 | 2014-03-18 | Joyent, Inc. | Compute-centric object stores and methods of use |
US9582327B2 (en) | 2013-03-14 | 2017-02-28 | Joyent, Inc. | Compute-centric object stores and methods of use |
US8826279B1 (en) | 2013-03-14 | 2014-09-02 | Joyent, Inc. | Instruction set architecture for compute-based object stores |
US9104456B2 (en) | 2013-03-14 | 2015-08-11 | Joyent, Inc. | Zone management of compute-centric object stores |
US8881279B2 (en) | 2013-03-14 | 2014-11-04 | Joyent, Inc. | Systems and methods for zone-based intrusion detection |
US8793688B1 (en) | 2013-03-15 | 2014-07-29 | Joyent, Inc. | Systems and methods for double hulled virtualization operations |
US8898205B2 (en) | 2013-03-15 | 2014-11-25 | Joyent, Inc. | Object store management operations within compute-centric object stores |
US8775485B1 (en) | 2013-03-15 | 2014-07-08 | Joyent, Inc. | Object store management operations within compute-centric object stores |
US9092238B2 (en) | 2013-03-15 | 2015-07-28 | Joyent, Inc. | Versioning schemes for compute-centric object stores |
US9075818B2 (en) | 2013-03-15 | 2015-07-07 | Joyent, Inc. | Object store management operations within compute-centric object stores |
US9792290B2 (en) | 2013-03-15 | 2017-10-17 | Joyent, Inc. | Object store management operations within compute-centric object stores |
US9547453B2 (en) | 2013-04-30 | 2017-01-17 | Inuron | Method for layered storage of enterprise data |
US10089009B2 (en) | 2013-04-30 | 2018-10-02 | Inuron | Method for layered storage of enterprise data |
EP2799973A1 (en) * | 2013-04-30 | 2014-11-05 | CloudFounders NV | A method for layered storage of enterprise data |
CN103281407A (en) * | 2013-05-08 | 2013-09-04 | 重庆绿色智能技术研究院 | IP (internet protocol) address remote management system based on Loongson cloud terminal |
US10187479B2 (en) * | 2013-08-26 | 2019-01-22 | Vmware, Inc. | Cloud-scale heterogeneous datacenter management infrastructure |
US20150058444A1 (en) * | 2013-08-26 | 2015-02-26 | Vmware, Inc. | Cloud-scale heterogeneous datacenter management infrastructure |
US10862982B2 (en) | 2013-08-26 | 2020-12-08 | Vmware, Inc. | Cloud-scale heterogeneous datacenter management infrastructure |
US10127062B2 (en) | 2013-10-22 | 2018-11-13 | Citrix Systems, Inc. | Displaying graphics for local virtual machine by allocating textual buffer |
EP3061072A4 (en) * | 2013-10-22 | 2017-07-19 | Citrix Systems Inc. | Method and system for displaying graphics for a local virtual machine |
US10635468B2 (en) | 2013-10-22 | 2020-04-28 | Citrix Systems, Inc. | Displaying graphics for local virtual machine by allocating and mapping textual buffer |
WO2015060831A1 (en) | 2013-10-22 | 2015-04-30 | Citrix Systems Inc. | Method and system for displaying graphics for a local virtual machine |
US9558192B2 (en) * | 2013-11-13 | 2017-01-31 | Datadirect Networks, Inc. | Centralized parallel burst engine for high performance computing |
US10055417B2 (en) * | 2013-11-13 | 2018-08-21 | Datadirect Networks, Inc. | Centralized parallel burst engine for high performance computing |
US20150134780A1 (en) * | 2013-11-13 | 2015-05-14 | Datadirect Networks, Inc. | Centralized parallel burst engine for high performance computing |
US20170177598A1 (en) * | 2013-11-13 | 2017-06-22 | Datadirect Networks, Inc. | Centralized parallel burst engine for high performance computing |
US9740880B1 (en) * | 2013-12-10 | 2017-08-22 | Emc Corporation | Encrypted virtual machines in a cloud |
US9740717B2 (en) * | 2013-12-23 | 2017-08-22 | IC Manage Inc. | Method of operation for a hierarchical file block variant tracker apparatus |
US20160350340A1 (en) * | 2013-12-23 | 2016-12-01 | Roger March | Method of operation for a hierarchical file block variant tracker apparatus |
US20150234617A1 (en) * | 2014-02-18 | 2015-08-20 | University Of Florida Research Foundation, Inc. | Method and apparatus for virtual machine live storage migration in heterogeneous storage environment |
US9195401B2 (en) * | 2014-02-18 | 2015-11-24 | University Of Florida Research Foundation, Inc. | Method and apparatus for virtual machine live storage migration in heterogeneous storage environment |
US9811268B2 (en) | 2014-02-19 | 2017-11-07 | Technion Research & Development Foundation Limited | Memory swapper for virtualized environments |
US10379751B2 (en) | 2014-02-19 | 2019-08-13 | Technion Research & Development Foundation Limited | Memory swapper for virtualized environments |
WO2015125135A1 (en) * | 2014-02-19 | 2015-08-27 | Technion Research & Development Foundation Limited | Memory swapper for virtualized environments |
US20150293830A1 (en) * | 2014-04-15 | 2015-10-15 | Splunk Inc. | Displaying storage performance information |
US9990265B2 (en) * | 2014-04-15 | 2018-06-05 | Splunk Inc. | Diagnosing causes of performance issues of virtual machines |
US10552287B2 (en) * | 2014-04-15 | 2020-02-04 | Splunk Inc. | Performance metrics for diagnosing causes of poor performing virtual machines |
US11645183B1 (en) | 2014-04-15 | 2023-05-09 | Splunk Inc. | User interface for correlation of virtual machine information and storage information |
US20180260296A1 (en) * | 2014-04-15 | 2018-09-13 | Splunk, Inc. | Performance metrics for diagnosing causes of poor performing virtual machines |
US11314613B2 (en) * | 2014-04-15 | 2022-04-26 | Splunk Inc. | Graphical user interface for visual correlation of virtual machine information and storage volume information |
US10465492B2 (en) | 2014-05-20 | 2019-11-05 | KATA Systems LLC | System and method for oil and condensate processing |
US20160103851A1 (en) * | 2014-10-10 | 2016-04-14 | Vencislav Dimitrov | Providing extended file storage for applications |
US10747730B2 (en) * | 2014-10-10 | 2020-08-18 | Sap Se | Providing extended file storage for applications |
US9733849B2 (en) | 2014-11-21 | 2017-08-15 | Security First Corp. | Gateway for cloud-based secure storage |
WO2016081942A3 (en) * | 2014-11-21 | 2016-08-11 | Security First Corp. | Gateway for cloud-based secure storage |
US10031679B2 (en) | 2014-11-21 | 2018-07-24 | Security First Corp. | Gateway for cloud-based secure storage |
CN105988721A (en) * | 2015-02-10 | 2016-10-05 | 中兴通讯股份有限公司 | Data caching method and apparatus for network disk client |
US9582306B2 (en) | 2015-03-31 | 2017-02-28 | At&T Intellectual Property I, L.P. | Method and system to dynamically instantiate virtual repository for any services |
US9952888B2 (en) | 2015-03-31 | 2018-04-24 | At&T Intellectual Property I, L.P. | Method and system to dynamically instantiate virtual repository for any services |
US20170031627A1 (en) * | 2015-07-31 | 2017-02-02 | International Business Machines Corporation | Proxying slice access requests during a data evacuation |
US10339006B2 (en) | 2015-07-31 | 2019-07-02 | International Business Machines Corporation | Proxying slice access requests during a data evacuation |
US10073736B2 (en) * | 2015-07-31 | 2018-09-11 | International Business Machines Corporation | Proxying slice access requests during a data evacuation |
US10853173B2 (en) | 2015-07-31 | 2020-12-01 | Pure Storage, Inc. | Proxying slice access requests during a data evacuation |
US20170103087A1 (en) * | 2015-10-13 | 2017-04-13 | Ca, Inc. | Subsystem dataset utilizing cloud storage |
US20170208149A1 (en) * | 2016-01-20 | 2017-07-20 | International Business Machines Corporation | Operating local caches for a shared storage device |
US10241913B2 (en) * | 2016-01-20 | 2019-03-26 | International Business Machines Corporation | Operating local caches for a shared storage device |
US10467103B1 (en) | 2016-03-25 | 2019-11-05 | Nutanix, Inc. | Efficient change block training |
US11429414B2 (en) | 2016-06-30 | 2022-08-30 | Amazon Technologies, Inc. | Virtual machine management using partially offloaded virtualization managers |
US10127068B2 (en) * | 2016-06-30 | 2018-11-13 | Amazon Technologies, Inc. | Performance variability reduction using an opportunistic hypervisor |
CN106961475A (en) * | 2017-03-14 | 2017-07-18 | 云宏信息科技股份有限公司 | Remote disk sharing method and sharing system based on NBD |
US20180324149A1 (en) * | 2017-05-02 | 2018-11-08 | MobileNerd, Inc. | Cloud based virtual computing system with virtual network tunnel |
US10860480B2 (en) * | 2017-06-30 | 2020-12-08 | EMC IP Holding Company LLC | Method and device for cache management |
CN109213691A (en) * | 2017-06-30 | 2019-01-15 | 伊姆西Ip控股有限责任公司 | Method and apparatus for cache management |
US20190057030A1 (en) * | 2017-06-30 | 2019-02-21 | EMC IP Holding Company LLC | Method and device for cache management |
CN109426548A (en) * | 2017-08-28 | 2019-03-05 | 三星电子株式会社 | Method and system for preventing a dirty virtual machine from running on an undesirable host server |
US10740245B2 (en) * | 2017-10-27 | 2020-08-11 | EMC IP Holding Company LLC | Method, device and computer program product for cache management |
US20190129859A1 (en) * | 2017-10-27 | 2019-05-02 | EMC IP Holding Company LLC | Method, device and computer program product for cache management |
CN109725825A (en) * | 2017-10-27 | 2019-05-07 | 伊姆西Ip控股有限责任公司 | Method, device and computer program product for cache management |
US20190042386A1 (en) * | 2017-12-27 | 2019-02-07 | Intel Corporation | Logical storage driver |
US10635318B2 (en) * | 2017-12-27 | 2020-04-28 | Intel Corporation | Logical storage driver |
US11652883B2 (en) * | 2018-08-25 | 2023-05-16 | Panzura, Llc | Accessing a scale-out block interface in a cloud-based distributed computing environment |
US20200204626A1 (en) * | 2018-08-25 | 2020-06-25 | Panzura, Inc. | Accessing a scale-out block interface in a cloud-based distributed computing environment |
CN112540982A (en) * | 2019-09-20 | 2021-03-23 | Sap欧洲公司 | Virtual database table with updatable logical table pointers |
CN113946286A (en) * | 2021-08-17 | 2022-01-18 | 丝路信息港云计算科技有限公司 | Cloud node block-level caching method, storage device and server |
Also Published As
Publication number | Publication date |
---|---|
US10042760B2 (en) | 2018-08-07 |
US20130346557A1 (en) | 2013-12-26 |
US9401960B2 (en) | 2016-07-26 |
US20160283373A1 (en) | 2016-09-29 |
Similar Documents
Publication | Title |
---|---|
US10042760B2 (en) | Scalable cloud storage architecture | |
US11163699B2 (en) | Managing least recently used cache using reduced memory footprint sequence container | |
US11243708B2 (en) | Providing track format information when mirroring updated tracks from a primary storage system to a secondary storage system | |
US10540279B2 (en) | Server-based persistence management in user space | |
Byan et al. | Mercury: Host-side flash caching for the data center | |
US10291739B2 (en) | Systems and methods for tracking of cache sector status | |
US11157376B2 (en) | Transfer track format information for tracks in cache at a primary storage system to a secondary storage system to which tracks are mirrored to use after a failover or failback | |
JP5951582B2 (en) | Hypervisor I / O staging on external cache devices | |
US11086784B2 (en) | Invalidating track format information for tracks in cache | |
US10970209B2 (en) | Destaging metadata tracks from cache | |
US11188430B2 (en) | Determine whether to rebuild track metadata to determine whether a track format table has a track format code for the track format metadata | |
US10754780B2 (en) | Maintaining track format metadata for target tracks in a target storage in a copy relationship with source tracks in a source storage | |
US20190050339A1 (en) | Invalidating track format information for tracks demoted from cache | |
Kim et al. | Flash-Conscious Cache Population for Enterprise Database Workloads. | |
US11163698B2 (en) | Cache hit ratios for selected volumes using synchronous I/O | |
Tak et al. | Block-level storage caching for hypervisor-based cloud nodes | |
US11237730B2 (en) | Favored cache status for selected volumes within a storage system | |
US11169919B2 (en) | Cache preference for selected volumes within a storage system | |
US10891227B2 (en) | Determining modified tracks to destage during a cache scan | |
US11663144B2 (en) | LRU list reorganization for favored and unfavored volumes | |
US11176052B2 (en) | Variable cache status for selected volumes within a storage system | |
Tak et al. | Designing a Storage Infrastructure for Scalable Cloud Services |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CHANG, RONG N.;TAK, BYUNG C.;TANG, CHUNQIANG;SIGNING DATES FROM 20101222 TO 20110103;REEL/FRAME:025797/0058 |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO PAY ISSUE FEE |