US20130218840A1 - System and method for building a point-in-time snapshot of an eventually-consistent data store - Google Patents
System and method for building a point-in-time snapshot of an eventually-consistent data store
- Publication number
- US20130218840A1 (Application No. US 13/399,467)
- Authority
- US
- United States
- Prior art keywords
- key
- data store
- point
- value
- rows
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1446—Point-in-time backing up or restoration of persistent data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1658—Data re-synchronization of a redundant component, or initial sync of replacement, additional or spare unit
- G06F11/1662—Data re-synchronization of a redundant component, or initial sync of replacement, additional or spare unit the resynchronized component or unit being a persistent storage device
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2308—Concurrency control
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2358—Change logging, detection, and notification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2474—Sequence data queries, e.g. querying versioned data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2477—Temporal data queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/2053—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
- G06F11/2094—Redundant storage or storage space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/80—Database-specific techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/82—Solving problems relating to consistency
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/835—Timestamp
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/84—Using snapshots, i.e. a logical point-in-time copy of the data
Definitions
- Embodiments of the invention generally relate to eventually-consistent data stores. More specifically, embodiments of the invention relate to systems and methods for building a point-in-time snapshot of an eventually-consistent data store.
- a datacenter may consist of hundreds or thousands of server computers in a single building along with high-speed communication lines to connect those servers to the Internet.
- the servers may also be connected to large data stores that consist of thousands of disk drives or other non-volatile storage.
- a “cloud” computing model has enabled companies to purchase computing resources on an as-needed basis from providers such as Amazon®.
- Cloud computing is the delivery of computing resources as a service over a network such as the Internet.
- the company can “lease” use of a virtual data center provided by a third-party provider.
- the provider maintains the hardware at various locations throughout the world, which the company can lease and scale to match the company's needs at any given time.
- cloud storage where the provider leases virtual storage space to various companies or individuals.
- Amazon® Web Services include Amazon® Simple Storage Service (S3) that enables a user to store objects (e.g., videos, documents, etc.) at datacenters around the world using a web interface.
- the user can choose in which geographic region an object is stored and choose an amount of redundancy (i.e., by storing the object at multiple different datacenters) that ensures object availability even if one datacenter goes offline.
- An eventually-consistent data store is a data store that sacrifices consistency for availability and partition tolerance.
- a system may store data redundantly in multiple locations in order to ensure that the data is available despite communication failure between nodes (partition tolerance); however, the system cannot then also ensure that the data is consistent across the multiple nodes.
- Eventually-consistent data stores ensure that requests for data are serviced quickly while not ensuring that the data is consistent across every node where that data may be stored.
- In order to retrieve a consistent snapshot of data from the distributed data store, an administrator must either force a consistent read across all nodes (essentially preventing any requests from being processed by the system during this time) or read separately from the various nodes and reconcile the data at a later time.
- the former poses a large load on the data store and, in some cases, may be impossible to perform given the distributed nature of the data store.
- the latter requires additional services to be implemented in the data store to generate a snapshot of the state of each individual node and the ability to reconcile the data from every node at a later point in time.
- Improved techniques are needed to provide data analysts with a snapshot of the eventually-consistent data store at a particular point-in-time that does not interfere with normal operation of the data store.
- One embodiment of the present invention includes a method for building a point-in-time snapshot of an eventually-consistent data store distributed among a plurality of nodes connected by a network.
- the method includes the steps of receiving a plurality of inconsistent snapshots, wherein each inconsistent snapshot includes one or more rows of key-value pairs associated with the data store and reflects the contents of at least a portion of the data store stored on a particular node of the plurality of nodes, and generating the point-in-time snapshot by resolving the rows of key-value pairs to remove any inconsistent values, wherein the point-in-time snapshot includes a subset of the key-value pairs included in the plurality of inconsistent snapshots.
- Other embodiments include, without limitation, a computer-readable medium that includes instructions that enable a processing unit to implement one or more aspects of the disclosed methods as well as a system configured to implement one or more aspects of the disclosed methods.
- a consistent snapshot of the data store may be generated using back-up copies of a set of inconsistent snapshots automatically generated by each node of the data store.
- Each back-up copy may be resolved to produce a consistent snapshot of a single node, which in the aggregate may be resolved to produce a consistent snapshot of the entire data store.
- Resolving the back-up copies of the inconsistent snapshots may be performed on a related system to generate a point-in-time snapshot without overloading the data store during normal operation.
- users of the data store are free to access the data store uninterrupted while data analysts may perform analysis of various metrics using a consistent view of the data store at a point-in-time in the recent past (such as generated once a day).
- FIG. 1 illustrates a distributed computer network, according to one example embodiment of the invention
- FIG. 2 illustrates the global nature of the distributed computer network, according to one example embodiment of the invention
- FIG. 3 illustrates one node of an eventually-consistent data store, according to one example embodiment of the invention
- FIG. 4 illustrates a cluster configured to implement the eventually-consistent data store of FIG. 3 , according to one example embodiment of the invention
- FIG. 5 illustrates one implementation of an eventually-consistent data store on the cluster of FIG. 4 , according to one example embodiment of the invention
- FIG. 6 illustrates a system for generating a point-in-time snapshot of an eventually-consistent data store, according to one example embodiment of the invention
- FIGS. 7A, 7B, and 7C illustrate various intermediate data structures generated by a cluster to create a point-in-time snapshot of the data store, according to one example embodiment of the invention
- FIG. 8 illustrates a method for building a point-in-time snapshot of an eventually-consistent data store, according to one example embodiment of the invention.
- FIG. 9 illustrates a perpetual process for updating the point-in-time snapshot, according to one example embodiment of the invention.
- Embodiments of the invention provide techniques for building a point-in-time snapshot of an eventually-consistent data store.
- One or more compute nodes collect a plurality of node-specific snapshots generated by a plurality of distributed nodes that implement the eventually-consistent data store. The node-specific snapshots are created at various times relative to the overall data store, such that the individual snapshots may contain inconsistent data.
- one or more compute nodes analyze the plurality of snapshots to generate a consistent snapshot of the entire data store corresponding to a previous point-in-time. The lagging consistency point of the snapshot may be updated based on additional snapshots generated by the plurality of distributed nodes at various intervals.
- FIG. 1 illustrates a distributed computer network 100 , according to one example embodiment of the invention.
- the distributed computer network 100 includes a client computer 181 connected to a cloud computing infrastructure 110 (i.e., “the cloud”) that includes a plurality of compute nodes 131 .
- the client computer 181 may be connected to the cloud 110 via a network 151 such as a LAN (Local Area Network), a WAN (Wide Area Network), or the Internet.
- the cloud 110 provides one or more virtual computing services via standard messaging protocols (e.g., SOAP, REST, etc.) over the network 151 . Examples of virtual computing services may include processing capacity, storage, and relational databases, among many other types of services.
- the cloud 110 is hosted by a cloud services provider such as Amazon®.
- the cloud services provider houses the nodes 131 in various datacenters in different physical locations around the world and enables clients to access the cloud services over the network 151 .
- Amazon® hosts a virtual cloud storage solution called Amazon Simple Storage Service™ (S3) as well as a virtual processing solution called Amazon Elastic Compute Cloud™ (EC2), accessible through the Internet using common transport protocols such as Hypertext Transfer Protocol (HTTP).
- a single organization may host both the cloud 110 and the client computer 181 in a private network.
- Each of the nodes 131 includes a processor (CPU) 121 , a memory 123 , a network interface controller (NIC) 125 , and one or more non-volatile storage devices 128 such as a hard-disk drive, a magnetic tape drive, optical disk drives, a drive array (e.g., RAID), or the like.
- Each node 131 may include an operating system (e.g., Microsoft® Windows™, Linux™, Unix®, etc.) as well as one or more applications stored in memory 123 and running on CPU 121. Some of the applications may provide a software framework for various cloud service architectures, such as a distributed database management system like Apache™ Cassandra or a distributed application system like Apache™ Hadoop.
- each node 131 comprises a blade server, where two or more blade servers are housed in a chassis and share certain resources such as common power supplies and cooling systems.
- Client computer 181 also includes a processor (CPU) 111 , a memory 113 , a NIC 115 , and one or more non-volatile storage devices 118 . Similar to nodes 131 , client computer 181 also includes an operating system as well as one or more applications stored in memory 113 and running on CPU 111 . In one embodiment, client computer 181 may be maintained by a data analyst to analyze the distributed computer network 100 . Client computer 181 may communicate with each of the nodes 131 via network 151 (through NIC 115 and NICs 125 ).
- FIG. 2 illustrates the global nature of the distributed computer network 100 , according to one example embodiment of the invention.
- a cloud services provider may build and maintain datacenters 221 at various locations around the world.
- a first data center (DataCenter_ 0 ) 221 ( 0 ) may be located in a western region of North America
- a second data center (DataCenter_ 1 ) 221 ( 1 ) may be located in an eastern region of North America
- a third data center (DataCenter_ 2 ) 221 ( 2 ) may be located in a European region.
- the distributed nature of the various nodes 131 may decrease latency for servicing requests transmitted to the cloud service from different areas of the world.
- a request generated in France may be serviced by a node in DataCenter_ 2 221 ( 2 ) whereas a request generated in California may be serviced by a node in DataCenter_ 0 221 ( 0 ).
- the latency for receiving a response based on these requests is reduced compared to a network in which every request is serviced by a particular node at a particular location.
- each data center 221 includes three nodes 131 .
- the first data center 221 ( 0 ) includes a first node (Node_ 0 ) 131 ( 0 ), a second node (Node_ 1 ) 131 ( 1 ), and a third node (Node_ 2 ) 131 ( 2 ).
- the second data center 221 ( 1 ) includes a fourth node (Node_ 3 ) 131 ( 3 ), a fifth node (Node_ 4 ) 131 ( 4 ), and a sixth node (Node_ 5 ) 131 ( 5 ).
- the third data center 221 ( 2 ) includes a seventh node (Node_ 6 ) 131 ( 6 ), an eighth node (Node_ 7 ) 131 ( 7 ), and a ninth node (Node_ 8 ) 131 ( 8 ).
- a group of nodes may be referred to as a cluster.
- a multi-data center deployment of a cluster that implements the eventually-consistent data store may include all nine nodes ( 131 ( 0 ), 131 ( 1 ), . . . , 131 ( 8 )) shown in FIG. 2 .
- each cluster may be comprised of three or more nodes.
- nodes of FIGS. 1 and 2 in the cloud 110 may represent physically distinct computing resources, such as individual blade servers, or virtual computing resources such as distinct virtual machines that may share access to a pool of physical computing resources.
- the effect is the same, as some physical computing resources are being used to store the data in multiple physical hardware devices, likely at locations in separate data centers throughout the world.
- FIG. 3 illustrates one node 300 of an eventually-consistent data store, according to one example embodiment of the invention.
- Relational databases typically store data in a series of related tables, each table having a defined structure in relation to other tables. Because of how the data is structured, operations to query or write to the relational database are expensive. The database server is required to do additional work to ensure data integrity among the many related tables.
- In contrast, key-value data stores often offer high availability or scalability at the cost of consistency and the ability to support complex queries. Key-value data stores scale more readily because they support operations that perform well when distributed among many individual nodes. Key-value data stores are simpler to implement than a relational database, but may be harder to query because the relationships between values must be stored explicitly. Rather than associating a user with each entry in a related table, thereby binding all entries related to the same user across multiple tables, each entry in the key-value data store explicitly includes an identifier that specifies the relationship.
- Node 300 implements a persistent distributed multi-dimensional map (i.e., an associative array).
- node 300 implements a structured key-value data store in which each key may map to one or more values.
- the data store is implemented in the cloud 110 via the ApacheTM Cassandra application framework.
- In Cassandra, multiple values are grouped into tables known as "column families".
- a column family is a container mapping row keys to a sorted list of key/value pairs (i.e., “columns”).
- node 300 includes one or more column families 310 defined when a Cassandra database is implemented.
- Each row 311 will be added to the column family 310 .
- Each row includes exactly one key 312 and one or more columns (e.g., 313 , 314 , 315 , etc.).
- Key 312 is a universally unique identifier (UUID) that associates that key 312 to the values stored for each column.
- the UUID may be a 128-bit value that is generated whenever a new row 311 is added to the data store.
- the key 312 may be a string identifier.
- Each column family 310 may be associated with a Column Family Identifier.
- the identifier may be a string such as “Users”, for example.
- Each column (e.g., 313, 314, 315, etc.) is associated with an identifier (e.g., a string) and a timestamp (e.g., 313-1, 314-1, 315-1, etc.).
- Each row 311 in a column family 310 may define a number of different columns that represent various values associated with a particular key 312 .
- each row 311 may include a different set of columns. In other words, each row 311 may include a subset of all columns associated with the column family 310 . In such embodiments, column 313 ( 0 ) may be associated with a “username” identifier, while column 313 ( 1 ) may be associated with a different identifier such as a “password” identifier.
- Data may be added to the data store by various applications running on client computer 181 or the like (such as a mobile phone or tablet computer). For example, a user may enter information on a web page and click a submit button.
- the web page may include a form with text boxes for entering a username, password, and an address.
- When the user clicks submit, a standard RESTful request message is generated and transmitted to a URL associated with the cloud 110 via network 151.
- the message will typically be proxied by a web server and application server to one of the nodes 131 , which causes an entry (i.e., row 311 ) to be added to the data store.
- the node 131 may generate a UUID associated with the request and create a row 311 ( 0 ) in the data store to hold the strings entered on the web page.
- Key 312 ( 0 ) may hold the UUID
- column 313 ( 0 ) may include the string entered on the web page in the username textbox
- column 314 ( 0 ) may include the string entered on the web page in the password textbox
- column 315 ( 0 ) may include the string entered on the web page in the address textbox.
- The data store is persistent, meaning that once a value for a particular key 312 is added to the data store, the value will never be deleted from the data store.
- Instead, to change a value, a new row may be added to the data store using the same key 312.
- Each column may also include a timestamp (e.g., 313 - 1 , 314 - 1 , 315 - 1 ) that indicates a particular time at which the corresponding column data (e.g., 313 , 314 , 315 ) was added to data store.
- When two or more rows sharing the same key 312 include a column associated with the same identifier, the values of those columns may be inconsistent. In such cases, the column associated with the most recent timestamp holds the current value, and all other columns associated with that identifier may be discarded for that particular key 312.
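- The row-and-column model described above can be sketched as follows (a minimal Python illustration; the class names, field names, and the second timestamp are assumptions made for the example, not part of the patent): each row carries one key 312 and a set of columns, each column carries an identifier, a value, and a timestamp, and conflicting values for the same identifier are resolved by keeping the most recent timestamp.

```python
import time
import uuid


class Column:
    """One column: an identifier, a value, and the timestamp of the write."""
    def __init__(self, name, value, timestamp=None):
        self.name = name
        self.value = value
        self.timestamp = timestamp if timestamp is not None else int(time.time())


class Row:
    """One row: a key (e.g., a UUID or a string) mapped to one or more columns."""
    def __init__(self, key=None):
        self.key = key if key is not None else str(uuid.uuid4())
        self.columns = []

    def add_column(self, name, value, timestamp=None):
        self.columns.append(Column(name, value, timestamp))

    def resolve(self):
        """Keep only the most recently written value for each column identifier."""
        latest = {}
        for col in self.columns:
            if col.name not in latest or col.timestamp > latest[col.name].timestamp:
                latest[col.name] = col
        return {name: col.value for name, col in latest.items()}


# Two writes touch the "pass" column at different times; the newer timestamp wins.
row = Row(key="Bob")
row.add_column("name", "Bobby12", timestamp=1325135486)
row.add_column("pass", "BBCali24", timestamp=1325135486)
row.add_column("pass", "bigboy123", timestamp=1325177082)
print(row.resolve())  # {'name': 'Bobby12', 'pass': 'bigboy123'}
```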
- FIG. 4 illustrates a cluster 400 configured to implement the eventually-consistent data store of FIG. 3 , according to one example embodiment of the invention.
- cluster 400 includes 9 nodes 131 located in 3 distinct data centers, which may be visualized as a ring.
- Node_ 0 131 ( 0 ), Node_ 1 131 ( 1 ), and Node_ 2 131 ( 2 ) are located in a first data center
- Node_ 3 131 ( 3 ), Node_ 4 131 ( 4 ), and Node_ 5 131 ( 5 ) are located in a second data center
- Node_ 6 131 ( 6 ), Node_ 7 131 ( 7 ), and Node_ 8 131 ( 8 ) are located in a third data center.
- Each of the nodes 131 is configured to implement the same framework such that requests from any client computer 181 may be processed by any of the nodes 131 of cluster 400 . It will be appreciated that a cluster may include any number of nodes located in one or more data centers.
- Client computer 181 connects to one of the nodes of cluster 400 , such as Node_ 0 131 ( 0 ) via network 151 .
- the node to which client computer 181 connects acts as a coordinator for requests transmitted by client computer 181 to read from or write to the data store. Each request is associated with a particular key 312 and a particular column family 310 .
- the "coordinator" node will analyze the key 312 to determine which node of the cluster is configured to store that particular value, a process known as partitioning. For example, keys 312 may be hashed to generate an MD5 hash value that, in effect, randomly assigns a particular key 312 to one of the nodes 131 of the cluster 400.
- Cluster 400 may also be associated with a replication factor that ensures that data is stored redundantly such that a failure in one node does not stop requests to access the data store from being serviced by the cluster 400 .
- a replication factor of 3 reflects that data is stored on three nodes.
- data will be stored on a primary node associated with a particular key, a secondary node located on the same rack in the data center as the primary node, and a tertiary node located in a different data center from the primary node.
- a request received by Node_ 0 to store a value associated with a key 312 that corresponds to Node_ 0 will cause a row to be added to a column family 310 in Node_ 0 131 ( 0 ), a row to be added to a column family 310 in Node_ 1 131 ( 1 ), and a row to be added to a column family 310 in Node_ 3 131 ( 3 ).
- If a different node received a request related to the same key 312 value, then that node would forward the request to Node_0, Node_1, and Node_3.
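- A minimal sketch of this partitioning and replication scheme, assuming MD5-based partitioning and a replication factor of three; replica placement is simplified to walking the ring, whereas the embodiment above also takes racks and data centers into account (node names follow FIG. 2, everything else is illustrative):

```python
import hashlib

NODES = ["Node_0", "Node_1", "Node_2",   # DataCenter_0
         "Node_3", "Node_4", "Node_5",   # DataCenter_1
         "Node_6", "Node_7", "Node_8"]   # DataCenter_2
REPLICATION_FACTOR = 3


def primary_node_index(key):
    """Hash the key with MD5 and map the digest onto a position in the ring."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return digest % len(NODES)


def replica_nodes(key):
    """Return the primary node plus the next nodes around the ring as replicas."""
    start = primary_node_index(key)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]


print(replica_nodes("Bob"))  # the three nodes responsible for storing key "Bob"
```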
- FIG. 5 illustrates one implementation of an eventually-consistent data store, according to one example embodiment of the invention.
- the data store is implemented using various data structures stored on a plurality of nodes in cluster 400 .
- data associated with a particular key that corresponds to Node_ 0 is replicated on a number of nodes specified by a replication factor for a cluster (e.g., for a replication factor of 3, data is replicated on Node_ 0 131 ( 0 ), Node_ 1 131 ( 1 ), and Node_ 3 131 ( 3 )).
- the nodes in the cluster 400 are peers where any node is configured to service a request from a client computer 181 .
- each node 131 includes an operating system 510 and an application 520 in memory 123 .
- application 520 implements the ApacheTM Cassandra framework that handles requests to access the data store and transmits/receives messages to/from other nodes in the cluster 400 as well as any client computer 181 connected to that node.
- a message is routed to one of the nodes 131 in the cluster 400 .
- the message may contain a request to write to the data store or read from the data store.
- the message may define a number of nodes within the cluster to which the value should be written in order to consider the WRITE successful.
- a WRITE request may be considered successful when a response is received from a single node within the cluster, a quorum of nodes within the cluster, or all nodes within the cluster associated with the key specified by the write request.
- a quorum WRITE request does not reflect that data is written to a quorum of all nodes within the cluster but merely that data is written to a quorum of the nodes associated with storing that particular key.
- a READ request may read a value from a single node within the cluster, a quorum of nodes within the cluster, or all nodes within the cluster.
- Writing to a single node is a fast operation, but fails to ensure that a READ request will return the correct data (e.g., the READ request may be directed to a different node than the node where the data was written).
- Writing to all the nodes is a slow operation, but ensures that a READ request to a single node will always return the correct data.
- Writing to a quorum of nodes is a faster operation than writing to all the nodes while also ensuring that, in conjunction with issuing a quorum READ request, the correct data is always returned to a user, even though a READ is only performed on a subset of the nodes associated with that key.
- Reading a value from a single node is a fast operation, but does not ensure that the returned value is correct (i.e., the correct value may be stored in a different node). Reading a value from a quorum of nodes is a slower operation than reading from a single node, but reading from a quorum of nodes will ensure that the correct value is returned as long as the value was also written to a quorum of nodes.
- the READ request is forwarded to a quorum of the nodes associated with the given key and the correct value is given by the value that was returned with the highest frequency.
- the value associated with the most recent timestamp ( 313 - 1 , 314 - 1 , etc.) is selected as the correct value. Reading a value from all nodes is a slower operation, but will always ensure that the correct value is returned, regardless of whether the value was written to one node, a quorum of nodes, or all nodes within the cluster 400 .
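- As a concrete illustration of quorum resolution, the sketch below (illustrative Python; the replica interface `get(key, column) -> (value, timestamp)` is an assumption, not the patent's API) sends a READ to a quorum of the replicas for a key and returns the value carrying the most recent timestamp:

```python
def quorum(n):
    """Smallest majority of n replicas (e.g., 2 of 3)."""
    return n // 2 + 1


def quorum_read(replicas, key, column):
    """Read from a quorum of replicas and return the most recently written value."""
    responses = []
    for node in replicas[:quorum(len(replicas))]:  # simplified: ask the first quorum
        result = node.get(key, column)             # assumed to return (value, timestamp)
        if result is not None:
            responses.append(result)
    if not responses:
        return None
    # Resolve any inconsistency by selecting the value with the latest timestamp.
    value, _timestamp = max(responses, key=lambda r: r[1])
    return value
```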
- Application 520 is configured to store key-value pairs in the data store and perform backend operations to ensure that the data store is eventually-consistent across every node.
- When a request is received by a node, the request is added to a commit log 530 stored in storage device 128.
- the commit log 530 acts like a buffer that allows requests to be processed asynchronously from when the requests arrive at the node.
- the commit log 530 is stored in storage device 128 so that, in the event of a power failure, requests may be processed by replaying the requests stored in the commit log 530 , ensuring integrity of the data in the data store.
- In other embodiments, requests may be processed as they are received by a node 131, thereby obviating the need for commit log 530.
- a power failure may result in requests not being processed by a node and data being lost within the data store.
- These embodiments improve on latency caused by a disk input/output operation, but sacrifice robustness because pending requests may be lost due to power failure of a single node.
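- A simplified sketch of the commit-log behavior described above (the file path, record format, and class name are assumptions): each request is appended to a durable log before it is applied, so that pending requests can be replayed after a power failure.

```python
import json


class CommitLog:
    """Append-only log of requests, persisted before the requests are applied."""
    def __init__(self, path):
        self.path = path

    def append(self, request):
        # Persist the request before acknowledging it to the client.
        with open(self.path, "a") as log:
            log.write(json.dumps(request) + "\n")

    def replay(self):
        """Yield logged requests, e.g., to re-apply them after a crash."""
        try:
            with open(self.path) as log:
                for line in log:
                    yield json.loads(line)
        except FileNotFoundError:
            return


log = CommitLog("/tmp/commit.log")
log.append({"op": "WRITE", "key": "Bob", "column": "pass", "value": "bigboy123"})
for request in log.replay():
    print(request)  # each pending request would be applied to the in-memory column family
```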
- Node_ 0 131 ( 0 ) includes two column families 310 ( 0 ) and 310 ( 1 ).
- the first column family 310 ( 0 ) may correspond to a database of distinct users
- a second column family 310 ( 1 ) may correspond to data associated with various users specified in the rows of the first column family 310 ( 0 ).
- application 520 stores the WRITE requests to the commit log 530 .
- Application 520 also processes requests from the commit log 530 as the processing capacity of CPU 121 allows.
- To process a request, application 520 adds a row 311 to the column family 310 specified by the WRITE request and, if necessary, transmits a message back to the originating node in cluster 400 indicating that the operation was successful (i.e., in the case where the node did not receive the request directly from a client computer 181, but rather received a request forwarded from another node in cluster 400 with which the client computer 181 has established a direct communications channel).
- Node_ 1 131 ( 1 ) includes two column families, a first column family 310 ( 2 ) that corresponds to the first column family 310 ( 0 ) in Node_ 0 131 ( 0 ) and a second column family 310 ( 3 ) that corresponds to the second column family 310 ( 1 ) in Node_ 0 131 ( 0 ).
- the first column family 310 ( 2 ) in Node_ 1 131 ( 1 ) may contain one or more rows associated with various users. Some of the rows in the first column family 310 ( 0 ) in Node_ 0 131 ( 0 ) may also be stored in the first column family 310 ( 2 ) in Node_ 1 131 ( 1 ).
- However, other rows may only be found in either Node_0 131(0) or Node_1 131(1) due to the eventually-consistent nature of the data store.
- a WRITE request may only specify that data be written to a single node and, therefore, the data may be added to column family 310 ( 0 ) but not column family 310 ( 2 ).
- Node_3 131(3) includes two column families 310(4) and 310(5): the first column family 310(4) corresponds to the first column family 310(0) of Node_0 131(0) and the first column family 310(2) of Node_1 131(1), and the second column family 310(5) corresponds to the second column family 310(1) of Node_0 131(0) and the second column family 310(3) of Node_1 131(1).
- Column families 310 may become very large.
- each node 131 may contain a plurality of column families 310 .
- memory capacity for each node may be insufficient to store all of the data.
- nodes 131 may implement a backing store in storage device 128 to hold portions of column families 310 .
- As a column family 310 grows past a threshold value, some of the rows of column family 310 may be copied into the storage device 128 and removed from memory 123.
- To service a READ request, node 131 may check the portions of the column family 310 in memory 123 as well as the portions of the column family 310 in storage device 128, which may be temporarily read into memory 123 as needed.
- a particular node may be disconnected from the cloud 110 due to a power failure at the data center or a failure in the communications link between nodes. In this case, the WRITE will fail and data for a given key will be inconsistent between the various nodes of the cluster 400 .
- a client computer 181 may request a WRITE operation to only a single node in order to reduce latency, sacrificing consistency in the process.
- application 520 may be configured to periodically repair the data store to ensure that replicas are consistent between nodes.
- This operation may be performed in the background by checking that a value associated with a key in a column family is the same as the value stored in the corresponding column family on a different node associated with that key. If the values are different, then the application 520 will transmit a WRITE request to the node that stores the incorrect data in order to make the nodes consistent across the cluster. Again, this operation may be performed in the background so that client computers attempting to access the data store do not experience high latencies. In other words, over time, an entry added to a column family 310 in one node will eventually be replicated across a number of redundant nodes, even if the data was only written to one node based on the request.
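- The background repair pass might look like the following sketch (the toy Replica class and its get/put interface are assumptions made for illustration): the replicas are compared, and the value with the newest timestamp is written back to any replica holding stale or missing data.

```python
class Replica:
    """Toy in-memory replica used only for this illustration."""
    def __init__(self, data=None):
        self.data = dict(data or {})

    def get(self, key, column):
        return self.data.get((key, column))

    def put(self, key, column, value, timestamp):
        self.data[(key, column)] = (value, timestamp)


def repair(replicas, key, column):
    """Make all replicas agree on the most recently written value for a column."""
    known = [r.get(key, column) for r in replicas if r.get(key, column) is not None]
    if not known:
        return
    newest = max(known, key=lambda entry: entry[1])  # newest timestamp wins
    for node in replicas:
        if node.get(key, column) != newest:
            node.put(key, column, *newest)


a = Replica({("Bob", "pass"): ("BBCali24", 1325135486)})
b = Replica({("Bob", "pass"): ("bigboy123", 1325177082)})
c = Replica()                       # this replica never received the write
repair([a, b, c], "Bob", "pass")
print(a.get("Bob", "pass"), c.get("Bob", "pass"))  # both now ('bigboy123', 1325177082)
```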
- Application 520 is configured to periodically flush the entries in column families 310 to a non-volatile storage device 128 in a data structure known as an SSTable (“sorted string table”) 540 .
- the SSTable 540 data structure is a persistent data structure of key-value pairs that may be associated with a bloom filter.
- a bloom filter enables application 520 to quickly check whether a particular key 312 is likely included within the SSTable 540 (a bloom filter will sometimes return a false positive that a key is included in the table, but will never return a negative response when the key is included in the table).
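- A minimal bloom filter sketch (the bit-array size, hash count, and use of MD5 are arbitrary choices for illustration): a lookup may return a false positive, but never a false negative, which is what allows a node to skip SSTables that definitely do not contain a key.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive several bit positions from independent hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means "probably present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
bf.add("Bob")
print(bf.might_contain("Bob"))     # True
print(bf.might_contain("Carrie"))  # almost certainly False
```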
- Node_ 0 131 ( 0 ) includes two SSTables 540 , a first SSTable 540 ( 0 ) that corresponds to the first column family 310 ( 0 ) in Node_ 0 131 ( 0 ) and a second SSTable 540 ( 1 ) that corresponds to the second column family 310 ( 1 ) in Node_ 0 131 ( 0 ).
- application 520 is configured to periodically flush column families 310 to non-volatile storage 128 in order to permanently store key-value pairs in the data store. Once key-value pairs in column families 310 have been flushed to storage 128 , the requests may be removed from the commit log 530 .
- Application 520 may also be configured to merge multiple SSTables 540 associated with a single column family 310 using a process called compaction.
- application 520 sorts each SSTable 540 and combines rows associated with the same key into a single row in a new SSTable 540 , associating only the most recent entries for each column associated with a unique identifier for that row.
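- The compaction step can be sketched as follows, modeling each SSTable as a list of (key, column, value, timestamp) tuples (an illustrative simplification of the on-disk format): rows from several SSTables are merged, and only the most recent entry for each (key, column) pair survives in the new table.

```python
def compact(sstables):
    """Merge several SSTables into one, keeping the newest entry per (key, column)."""
    newest = {}
    for table in sstables:
        for key, column, value, timestamp in table:
            slot = (key, column)
            if slot not in newest or timestamp > newest[slot][1]:
                newest[slot] = (value, timestamp)
    # Emit a new table sorted by key and then by column identifier.
    return sorted((k, c, v, ts) for (k, c), (v, ts) in newest.items())


old = [("Bob", "pass", "BBCali24", 1325135486), ("Bob", "name", "Bobby12", 1325135486)]
new = [("Bob", "pass", "bigboy123", 1325177082)]
print(compact([old, new]))
# [('Bob', 'name', 'Bobby12', 1325135486), ('Bob', 'pass', 'bigboy123', 1325177082)]
```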
- all SSTables 540 stored in the cluster reflect the data in the eventually-consistent data store at a previous point in time.
- each SSTable 540 provides an inconsistent snapshot of the key-value pairs associated with a column family for a particular node at a given point in time.
- a data analyst may have problems attempting to monitor the state of the data store at a particular point-in-time because of the distributed nature of the data store.
- the data analyst could force a consistent scan across each node 131 in the data store, in effect, blocking any access to the data store while the consistent scan is taking place.
- such a technique may be disruptive to clients attempting to access information in the data store because the nodes will be inaccessible while data is replicated across all nodes and then transmitted to the analyst's computer.
- the present disclosure describes a method and system to generate a snapshot of the data store that provides a lagging consistency point based on the inconsistent SSTables 540 stored in the various nodes of cluster 400 .
- FIG. 6 illustrates a system for generating a point-in-time snapshot of an eventually-consistent data store, according to one example embodiment of the invention.
- the eventually-consistent data store may be implemented on cluster 400 as described above in connection with FIGS. 1-5 .
- In order to generate the point-in-time snapshot from a plurality of inconsistent snapshots (i.e., SSTables 540), a second cluster 600 may be implemented in cloud 110 using different resources, physical or virtual, than cluster 400.
- the number of nodes in cluster 600 may be different than the number of nodes of cluster 400 .
- the nodes in cluster 600 may be similar to the nodes in cluster 400 .
- each of the nodes of cluster 600 also includes a processor, a memory, a NIC, and one or more non-volatile storage devices.
- the nodes may execute different applications that provide an application framework for a distributed compute environment.
- the applications executing on the nodes of cluster 600 may implement the ApacheTM Hadoop distributed compute environment framework.
- the Hadoop framework provides common utilities, a distributed file system, and a MapReduce framework for processing large data sets on a plurality of nodes.
- cluster 600 is configured with one master node 631 and a plurality of slave nodes 632 configured to process the SSTables 540 retrieved from data store.
- the master node 631 is configured to retrieve the SSTables 540 generated by each node in cluster 400.
- the SSTables 540 may correspond to different points-in-time and may be inconsistent from node to node. For example, two nodes may contain related SSTables 540 that store different values for the same key because a WRITE request was configured to only write a value to a single node.
- an SSTable 540 from one node may be generated before a WRITE request is processed whereas a related SSTable 540 from another node may be generated after the WRITE request is processed, leading to an inconsistency in the values stored in the SSTables 540 .
- cluster 400 is configured to back-up all of the SSTables 540 stored in storage devices 128 of each node to a separate long-term storage location.
- the long-term storage location may be, among other implementations, a drive array connected to a private network coupled to cluster 400 or a virtual storage service such as Amazon® S3.
- Application 520 may be configured to copy SSTables 540 stored in storage device 128 to a location in the long-term storage periodically. For example, every night at 3 am, the nodes of cluster 400 may copy the SSTables 540 in storage device 128 to a bucket in the Amazon® S3 cloud storage.
- cluster 600 may be configured to read the copies of SSTables 540 from the back-up storage location instead of the individual nodes in cluster 400 . By reading backup copies instead of the primary copies on cluster 400 , the administrator may avoid causing disruptions to access of the data store.
- the master node 631 retrieves the SSTables 540 (from either cluster 400 or the back-up storage location), the master node 631 implements a distributed MapReduce operation to process the SSTables 540 and generate a point-in-time snapshot of the data store.
- the MapReduce operation splits data into smaller portions for processing and distributes the data to a plurality of processing nodes that perform the same operation on the data.
- the results from all of the slave nodes 632 are then combined to form the total output.
- the master node 631 splits the SSTable 540 into processing tasks.
- a processing task includes at least a portion of the rows 311 in one or more SSTables 540 .
- the master node 631 may split the SSTables 540 into 64 MB chunks for distribution to each of the slave nodes 632 .
- the processing task is then assigned to one of the slave nodes 632 for processing.
- FIGS. 7A, 7B, and 7C illustrate various intermediate data structures generated by cluster 600 to create a point-in-time snapshot of the data store, according to one example embodiment of the invention
- cluster 600 generates a point-in-time snapshot of data store based on a plurality of inconsistent snapshots (i.e., SSTables 540 ) generated by the various nodes of the distributed data store. More specifically, the master node 631 retrieves each of the SSTables 540 associated with a particular column family 310 and combines the SSTables 540 into a single aggregate table 700 that includes one or more rows 311 storing key-value pairs from that column family 310 .
- the table 700 includes 16 rows (e.g., 311 ( 0 )- 311 ( 15 )) of data combined from one or more SSTables 540 generated by the various nodes 131 that implement the data store.
- a first row 311 ( 0 ) includes a key 312 ( 0 ) (“Bob”), a first column 313 ( 0 ) (“name”: “Bobby12”), a timestamp 313 - 1 ( 0 ) (“1325135486”) corresponding to the entry in the first column 313 ( 0 ), a second column 314 ( 0 ) (“pass”: “BBCali24”), and a timestamp 314 - 1 ( 0 ) (“1325135486”) corresponding to the entry in the second column 314 ( 0 ).
- a second row 311 ( 1 ) is associated with the same key 312 as the first row 311 ( 0 ), but the second row 311 ( 1 ) includes a different value in the second column 314 ( 1 ) (“pass”: “bigboy123”) as well as a different timestamp 314 - 1 ( 1 ).
- the second row 311 ( 1 ) corresponds to data added to the data store that reflects an operation for changing a password associated with a user (“Bob”) stored in the data store.
- Table 700 also includes a fourth row 311 ( 3 ) that is a duplicate of the second row 311 ( 1 ), which reflects that the data was written to data store on at least two different nodes 131 (e.g., in response to a quorum WRITE request).
- It will be appreciated that rows 311 in the table 700 may be duplicated (because replicas of the rows 311 are included in SSTables 540 from different nodes) and that multiple rows 311 in the table 700 may map different values to the same key 312 because new values for that key 312 were added to the data store at different times to "replace" old values.
- the sixteen rows 311 of the table 700 are shown in random order, such as the result of a process by which a master node 631 adds each row from an SSTable 540 to the bottom of table 700 .
- the master node 631 may perform a sort of the rows 311 based on the keys 312 associated with the rows 311 to generate a sorted table 710 , as shown in FIG. 7B .
- the master node 631 may insert each row 311 from an SSTable 540 into a sorted location within table 700 , thus maintaining a sorted order as table 700 is generated.
- such techniques may not be as efficient as first generating an unordered table 700 and then performing a distributed sort on the resulting data to generate the sorted table 710 .
- the master node 631 may not sort table 700 at all, distributing portions of the table 700 to the slave nodes 632 for compaction. In such embodiments, master node 631 must perform a further compaction operation on the combined results received from each of the slave nodes 632 in order to ensure that there are no duplicate rows or inconsistent data in the final output.
- the master node 631 performs a MapReduce operation to split the data into smaller portions and distribute the data to the slave nodes 632 for processing.
- master node 631 splits the sorted table 710 up into processing tasks along boundaries defined by the keys 312 associated with the rows 311 of the sorted table 710.
- For example, as shown in FIG. 7B, the master node 631 may split the sorted table 710 up into processing tasks along the boundary formed between the tenth row 311(9) and the eleventh row 311(10), ensuring that the key 312 associated with the last row (e.g., 311(9)) in a first processing task is not the same as the key 312 associated with the first row (e.g., 311(10)) in the next processing task.
- the master node 631 then distributes each of the processing tasks to the slave nodes 632 to perform the distributed MapReduce operation.
- Each of the slave nodes 632 receives a sorted list of rows from table 710 .
- a first slave node 632 ( 0 ) may receive a sorted list including rows 311 ( 0 ) through 311 ( 9 ) from sorted table 710 . Rows 311 ( 0 ) through 311 ( 9 ) correspond to rows associated with the “Bob” and “Carrie” keys.
- a third slave node 632 ( 2 ) may receive a sorted list including rows 311 ( 10 ) through 311 ( 15 ) from sorted table 710 . Rows 311 ( 10 ) through 311 ( 15 ) correspond to rows associated with the “Jim” and “Steve” keys.
- a slave node 632 will process the sorted list of rows to remove any duplicate rows and discard any columns that were “replaced” by entries associated with a more recent timestamp.
- the first slave node 632 ( 0 ) may compact a first processing task (i.e., all rows corresponding to keys 312 “Bob” and “Carrie”) to generate a result that includes two rows, a first row selected from one of the third 311 ( 2 ) through fifth 311 ( 4 ) rows of table 710 and a second row selected from one of the ninth 311 ( 8 ) through tenth 311 ( 9 ) rows of table 710 .
- the third slave node 632 ( 2 ) may compact a second processing task to generate a result that includes two additional rows, a first row selected from one of the eleventh 311 ( 10 ) through thirteenth 311 ( 12 ) rows of table 710 and a second row selected from one of the fourteenth 311 ( 13 ) through sixteenth 311 ( 15 ) rows of table 710 .
- the resulting compacted lists of non-duplicate, consistent data are then transmitted back to the master node 631, which combines the output of the slave nodes to generate the point-in-time snapshot (i.e., table 720), as shown in FIG. 7C.
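- Putting the steps of FIGS. 7A-7C together, the pipeline can be sketched in plain Python (rather than as an actual MapReduce job; the row layout, function names, and sample data are assumptions): rows from all SSTables are aggregated (table 700), sorted by key (table 710), split into processing tasks along key boundaries, compacted per task as a slave node would, and the task outputs are concatenated into the point-in-time snapshot (table 720).

```python
from itertools import groupby


def aggregate(sstables):
    """Table 700: concatenate (key, columns) rows from every SSTable."""
    return [row for table in sstables for row in table]


def sort_by_key(rows):
    """Table 710: sort so that all rows sharing a key are adjacent."""
    return sorted(rows, key=lambda row: row[0])


def split_on_key_boundaries(sorted_rows, num_tasks):
    """Split into processing tasks without splitting any key across two tasks."""
    tasks, current = [], []
    target = max(1, len(sorted_rows) // num_tasks)
    for _key, group in groupby(sorted_rows, key=lambda row: row[0]):
        current.extend(group)
        if len(current) >= target and len(tasks) < num_tasks - 1:
            tasks.append(current)
            current = []
    if current:
        tasks.append(current)
    return tasks


def compact_task(rows):
    """Slave-node work: one output row per key, keeping the newest value per column."""
    out = []
    for key, group in groupby(rows, key=lambda row: row[0]):
        newest = {}
        for _key, columns in group:
            for name, value, timestamp in columns:
                if name not in newest or timestamp > newest[name][1]:
                    newest[name] = (value, timestamp)
        out.append((key, newest))
    return out


def build_snapshot(sstables, num_tasks=2):
    """Table 720: combine the compacted task outputs into the point-in-time snapshot."""
    tasks = split_on_key_boundaries(sort_by_key(aggregate(sstables)), num_tasks)
    snapshot = []
    for task in tasks:  # in the cluster, each task would run on a slave node
        snapshot.extend(compact_task(task))
    return snapshot


sstable_a = [("Bob", [("pass", "BBCali24", 1325135486)]),
             ("Carrie", [("name", "CJ99", 1325140000)])]
sstable_b = [("Bob", [("pass", "bigboy123", 1325177082)])]
print(build_snapshot([sstable_a, sstable_b]))
```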
- the master node 631 converts the point-in-time snapshot 720 to a JSON (JavaScript Object Notation) format, a text-based standard designed for human-readable data exchange.
- a JSON object is an unordered set of name/value pairs enclosed in braces ({ }), where each name is separated from its value by a colon (:). Multiple name/value pairs in the JSON object are separated by commas.
- the point-in-time snapshot may be converted to other types of formats as well, both human-readable and machine-readable, encrypted or non-encrypted, as is known in the art.
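- For example (an illustrative continuation, assuming the in-memory snapshot layout used in the pipeline sketch above), each row of the snapshot can be serialized as one JSON object holding the key and its resolved name/value pairs:

```python
import json

snapshot = [
    ("Bob",    {"name": ("Bobby12", 1325135486), "pass": ("bigboy123", 1325177082)}),
    ("Carrie", {"name": ("CJ99", 1325140000)}),
]

# One JSON object per row: the key plus its resolved column name/value pairs.
for key, columns in snapshot:
    record = {"key": key}
    record.update({name: value for name, (value, _ts) in columns.items()})
    print(json.dumps(record))
# {"key": "Bob", "name": "Bobby12", "pass": "bigboy123"}
# {"key": "Carrie", "name": "CJ99"}
```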
- FIG. 8 illustrates a method 800 for building a point-in-time snapshot 720 of an eventually-consistent data store, according to one example embodiment of the invention.
- the method 800 begins at step 810 where a processing cluster 600 receives one or more inconsistent snapshots of an eventually-consistent data store.
- the cluster 600 implements a distributed compute environment as a service on a set of virtual compute nodes 631 , 632 .
- a master node 631 in cluster 600 retrieves a set of SSTables 540 from a back-up data store coupled to the cluster 600 .
- the master node 631 generates a sorted table 710 of key-value pairs from the set of inconsistent snapshots.
- the master node 631 first generates an unsorted list by combining rows 311 from each SSTable 540 associated with the data store into an aggregate table 700 and then sorts the unsorted, aggregate table 700 to generate a sorted table 710 .
- the master node 631 divides the sorted table 710 of key-value pairs into one or more processing tasks.
- a processing task is a subset of rows from the sorted table 710 .
- the master node 631 divides the sorted table 710 along row boundaries to ensure that rows 311 associated with the same key 312 are directed to the same slave node 632 for processing.
- the master node 631 transmits processing tasks to slave nodes 632 for processing.
- the slave nodes 632 perform a compaction function on the rows of the processing tasks to generate an output table that includes a single row for each unique key 312.
- the slave node 632 selects the most recent columns (e.g., 313 , 314 , etc.) associated with each key 312 to include in the output table, eliminating any duplicate rows associated with the same key 312 .
- the master node 631 receives the results from the plurality of slave nodes 632 .
- the master node 631 combines the results received from the slave nodes 632 to generate a point-in-time snapshot of the eventually-consistent data store.
- the point-in-time snapshot may be updated by iterating through steps 810 - 820 and combining any additional SSTables 540 generated by cluster 400 since a previous point-in-time corresponding to the snapshot 720 .
- FIG. 9 illustrates a perpetual process for updating the point-in-time snapshot 720 , according to one example embodiment of the invention.
- the cluster 600 may generate a new point-in-time snapshot 720 ( n ) using the snapshot 720 ( n - 1 ) from the previous point-in-time as a starting point for generating sorted table 710 .
- Rather than accumulating each and every SSTable 540 from cluster 400, the master node 631 only needs to retrieve any additional SSTables 540 that have been generated by cluster 400 since the previous point-in-time.
- the rows 311 from these new SSTables 540 are then added to the snapshot from the previous point-in-time 720 ( n - 1 ) to generate a new sorted list 710 that is compacted via the slave nodes 632 to generate the new point-in-time snapshot 720 ( n ) corresponding to a consistency point that is more recent than that of the previous point-in-time snapshot 720 ( n - 1 ).
- At step 822, the master node 631 determines whether to update the previous point-in-time snapshot 720. If the previous point-in-time snapshot should be updated, then master node 631 retrieves any SSTables 540 generated by cluster 400 since the previous point-in-time and repeats steps 810-820 to generate a new point-in-time snapshot 720 corresponding to a more recent consistency point. However, if the master node 631 determines that the snapshot 720 does not need to be updated at this time, then method 800 terminates.
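- The incremental update can be sketched as follows (the dict-based snapshot layout and tuple-based SSTable layout are illustrative assumptions): the previous snapshot is treated as the starting point, and only SSTables flushed since the previous consistency point are merged in, with the newest timestamp winning as before.

```python
def update_snapshot(previous_snapshot, new_sstables):
    """Advance the lagging consistency point using only newly generated SSTables.

    previous_snapshot: dict mapping key -> {column: (value, timestamp)}
    new_sstables: iterables of (key, column, value, timestamp) tuples
    """
    snapshot = {key: dict(columns) for key, columns in previous_snapshot.items()}
    for table in new_sstables:
        for key, column, value, timestamp in table:
            columns = snapshot.setdefault(key, {})
            if column not in columns or timestamp > columns[column][1]:
                columns[column] = (value, timestamp)
    return snapshot


previous = {"Bob": {"pass": ("bigboy123", 1325177082)}}
new_sstable = [("Bob", "pass", "NewPass42", 1325200000),
               ("Jim", "name", "JimBob7", 1325210000)]
print(update_snapshot(previous, [new_sstable]))
# {'Bob': {'pass': ('NewPass42', 1325200000)}, 'Jim': {'name': ('JimBob7', 1325210000)}}
```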
- In other embodiments, steps 814-820 of method 800 may be replaced by a single step where a server computer performs the compaction operation on the sorted table 710.
- a server computer may download each of the SSTables 540 from the cloud 110 and generate sorted table 710 locally in a memory or storage device attached to the server computer.
- the server computer could perform the equivalent of the MapReduce operation locally by compacting the sorted table 710 using a single processor.
- aspects of the present invention may be implemented in hardware or software or in a combination of hardware and software.
- One embodiment of the invention may be implemented as a program product for use with a computer system.
- the program(s) of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media.
- Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive or any type of solid-state random-access semiconductor memory) on which alterable information is stored.
Abstract
Description
- 1. Field of the Invention
- Embodiments of the invention generally relate to eventually-consistent data stores. More specifically, embodiments of the invention relate to systems and methods for building a point-in-time snapshot of an eventually-consistent data store.
- 2. Description of the Related Art
- Companies involved in e-commerce typically maintain one or more datacenters to provide the resources to handle customers' needs on the Internet. A datacenter may consist of hundreds or thousands of server computers in a single building along with high-speed communication lines to connect those servers to the Internet. The servers may also be connected to large data stores that consist of thousands of disk drives or other non-volatile storage.
- Lately, a "cloud" computing model has enabled companies to purchase computing resources on an as-needed basis from providers such as Amazon®. Cloud computing is the delivery of computing resources as a service over a network such as the Internet. Instead of the company maintaining the datacenter at a facility owned by the company, the company can "lease" use of a virtual data center provided by a third-party provider. The provider maintains the hardware at various locations throughout the world, which the company can lease and scale to match the company's needs at any given time.
- One aspect of cloud services is cloud storage, where the provider leases virtual storage space to various companies or individuals. For example, Amazon® Web Services (AWS) include Amazon® Simple Storage Service (S3) that enables a user to store objects (e.g., videos, documents, etc.) at datacenters around the world using a web interface. The user can choose in which geographic region an object is stored and choose an amount of redundancy (i.e., by storing the object at multiple different datacenters) that ensures object availability even if one datacenter goes offline.
- An eventually-consistent data store is a data store that sacrifices consistency for availability and partition tolerance. In other words, a system may store data redundantly in multiple locations in order to ensure that the data is available despite communication failure between nodes (partition tolerance); however, the system cannot then also ensure that the data is consistent across the multiple nodes. Eventually-consistent data stores ensure that requests for data are serviced quickly while not ensuring that the data is consistent across every node where that data may be stored.
- In order to retrieve a consistent snapshot of data from the distributed data store, an administrator must either force a consistent read across all nodes (essentially preventing any requests from being processed by the system during this time) or read separately from the various nodes and reconcile the data at a later time. The former poses a large load on the data store and, in some cases, may be impossible to perform given the distributed nature of the data store. The latter requires additional services to be implemented in the data store to generate a snapshot of the state of each individual node and the ability to reconcile the data from every node at a later point in time.
- Improved techniques are needed to provide data analysts with a snapshot of the eventually-consistent data store at a particular point-in-time that does not interfere with normal operation of the data store.
- One embodiment of the present invention includes a method for building a point-in-time snapshot of an eventually-consistent data store distributed among a plurality of nodes connected by a network. The method includes the steps of receiving a plurality of inconsistent snapshots, wherein each inconsistent snapshot includes one or more rows of key-value pairs associated with the data store and reflects the contents of at least a portion of the data store stored on a particular node of the plurality of nodes, and generating the point-in-time snapshot by resolving the rows of key-value pairs to remove any inconsistent values, wherein the point-in-time snapshot includes a subset of the key-value pairs included in the plurality of inconsistent snapshots.
- Other embodiments include, without limitation, a computer-readable medium that includes instructions that enable a processing unit to implement one or more aspects of the disclosed methods as well as a system configured to implement one or more aspects of the disclosed methods.
- One advantage of such techniques is that a consistent snapshot of the data store may be generated using back-up copies of a set of inconsistent snapshots automatically generated by each node of the data store. Each back-up copy may be resolved to produce a consistent snapshot of a single node, which in the aggregate may be resolved to produce a consistent snapshot of the entire data store. Resolving the back-up copies of the inconsistent snapshots may be performed on a related system to generate a point-in-time snapshot without overloading the data store during normal operation. Thus, users of the data store are free to access the data store uninterrupted while data analysts may perform analysis of various metrics using a consistent view of the data store at a point-in-time in the recent past (such as generated once a day).
- So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
- FIG. 1 illustrates a distributed computer network, according to one example embodiment of the invention;
- FIG. 2 illustrates the global nature of the distributed computer network, according to one example embodiment of the invention;
- FIG. 3 illustrates one node of an eventually-consistent data store, according to one example embodiment of the invention;
- FIG. 4 illustrates a cluster configured to implement the eventually-consistent data store of FIG. 3, according to one example embodiment of the invention;
- FIG. 5 illustrates one implementation of an eventually-consistent data store on the cluster of FIG. 4, according to one example embodiment of the invention;
- FIG. 6 illustrates a system for generating a point-in-time snapshot of an eventually-consistent data store, according to one example embodiment of the invention;
- FIGS. 7A, 7B, and 7C illustrate various intermediate data structures generated by a cluster to create a point-in-time snapshot of the data store, according to one example embodiment of the invention;
- FIG. 8 illustrates a method for building a point-in-time snapshot of an eventually-consistent data store, according to one example embodiment of the invention; and
- FIG. 9 illustrates a perpetual process for updating the point-in-time snapshot, according to one example embodiment of the invention.
- Embodiments of the invention provide techniques for building a point-in-time snapshot of an eventually-consistent data store. One or more compute nodes collect a plurality of node-specific snapshots generated by a plurality of distributed nodes that implement the eventually-consistent data store. Each of the node-specific snapshots is created at various times in relation to the overall data store such that the individual snapshots may contain inconsistent data. Then, one or more compute nodes analyze the plurality of snapshots to generate a consistent snapshot of the entire data store corresponding to a previous point-in-time. The lagging consistency point of the snapshot may be updated based on additional snapshots generated by the plurality of distributed nodes at various intervals.
- In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the present invention.
-
FIG. 1 illustrates a distributed computer network 100, according to one example embodiment of the invention. As shown, the distributed computer network 100 includes a client computer 181 connected to a cloud computing infrastructure 110 (i.e., “the cloud”) that includes a plurality of compute nodes 131. The client computer 181 may be connected to the cloud 110 via a network 151 such as a LAN (Local Area Network), a WAN (Wide Area Network), or the Internet. The cloud 110 provides one or more virtual computing services via standard messaging protocols (e.g., SOAP, REST, etc.) over the network 151. Examples of virtual computing services may include processing capacity, storage, and relational databases, among many other types of services. In one embodiment, the cloud 110 is hosted by a cloud services provider such as Amazon®. The cloud services provider houses the nodes 131 in various datacenters in different physical locations around the world and enables clients to access the cloud services over the network 151. For example, Amazon® hosts a virtual cloud storage solution called Amazon Simple Storage Service™ (S3) as well as a virtual processing solution called Amazon Elastic Compute Cloud™ (EC2), accessible through the Internet using common transport protocols such as Hypertext Transfer Protocol (HTTP). In another embodiment, a single organization may host both the cloud 110 and the client computer 181 in a private network. - Each of the
nodes 131 includes a processor (CPU) 121, a memory 123, a network interface controller (NIC) 125, and one or more non-volatile storage devices 128 such as a hard-disk drive, a magnetic tape drive, optical disk drives, a drive array (e.g., RAID), or the like. Each node 131 may include an operating system (e.g., Microsoft® Windows™, Linux™, Unix®, etc.) as well as one or more applications stored in memory 123 and running on CPU 121. Some of the applications may provide a software framework for various cloud service architectures, such as a distributed database management system like Apache™ Cassandra or a distributed application system like Apache™ Hadoop. In one embodiment, each node 131 comprises a blade server, where two or more blade servers are housed in a chassis and share certain resources such as common power supplies and cooling systems. -
Client computer 181 also includes a processor (CPU) 111, a memory 113, a NIC 115, and one or more non-volatile storage devices 118. Similar to nodes 131, client computer 181 also includes an operating system as well as one or more applications stored in memory 113 and running on CPU 111. In one embodiment, client computer 181 may be maintained by a data analyst to analyze the distributed computer network 100. Client computer 181 may communicate with each of the nodes 131 via network 151 (through NIC 115 and NICs 125). -
FIG. 2 illustrates the global nature of the distributedcomputer network 100, according to one example embodiment of the invention. A cloud services provider may build and maintaindatacenters 221 at various locations around the world. As shown inFIG. 2 , a first data center (DataCenter_0) 221(0) may be located in a western region of North America, a second data center (DataCenter_1) 221(1) may be located in an eastern region of North America, and a third data center (DataCenter_2) 221(2) may be located in a European region. The distributed nature of thevarious nodes 131 may decrease latency for servicing requests transmitted to the cloud service from different areas of the world. For example, a request generated in France may be serviced by a node in DataCenter_2 221(2) whereas a request generated in California may be serviced by a node in DataCenter_0 221(0). The latency for receiving a response based on these requests is reduced compared to a network in which every request is serviced by a particular node at a particular location. - In one embodiment, each
data center 221 includes threenodes 131. The first data center 221(0) includes a first node (Node_0) 131(0), a second node (Node_1) 131(1), and a third node (Node_2) 131(2). The second data center 221(1) includes a fourth node (Node_3) 131(3), a fifth node (Node_4) 131(4), and a sixth node (Node_5) 131(5). The third data center 221(2) includes a seventh node (Node_6) 131(6), an eighth node (Node_7) 131(7), and a ninth node (Node_8) 131(8). A group of nodes may be referred to as a cluster. For example, a multi-data center deployment of a cluster that implements the eventually-consistent data store may include all nine nodes (131(0), 131(1), . . . , 131(8)) shown inFIG. 2 . In other embodiments, each cluster may be comprised of three or more nodes. - It will be appreciated that the nodes of
FIGS. 1 and 2 in thecloud 110 may represent physically distinct computing resources, such as individual blade servers, or virtual computing resources such as distinct virtual machines that may share access to a pool of physical computing resources. The effect is the same, as some physical computing resources are being used to store the data in multiple physical hardware devices, likely at locations in separate data centers throughout the world. -
FIG. 3 illustrates one node 300 of an eventually-consistent data store, according to one example embodiment of the invention. Relational databases typically store data in a series of related tables, each table having a defined structure in relation to other tables. Because of how the data is structured, operations to query or write to the relational database are expensive. The database server is required to do additional work to ensure data integrity among the many related tables. In contrast, key-value data stores often offer high availability or scalability at the cost of consistency and the ability to support complex queries. Key-value data stores scale well precisely because their operations distribute naturally among many individual nodes. Key-value data stores are simpler to implement than a relational database, but may be harder to query because the relationships between values must be stored explicitly. Rather than associating a user with each entry in a related table, thus binding all entries related to the same user across multiple tables, each entry in the key-value data store explicitly includes an identifier that specifies a relationship. -
Node 300 implements a persistent distributed multi-dimensional map (i.e., an associative array). In other words, node 300 implements a structured key-value data store in which each key may map to one or more values. In one embodiment, the data store is implemented in the cloud 110 via the Apache™ Cassandra application framework. In Cassandra, multiple values are grouped into tables known as “column families”. A column family is a container mapping row keys to a sorted list of key/value pairs (i.e., “columns”). As shown in FIG. 3, node 300 includes one or more column families 310 defined when a Cassandra database is implemented. As key-value pairs are added to the data store, a row 311 will be added to the column family 310. Each row includes exactly one key 312 and one or more columns (e.g., 313, 314, 315, etc.). Key 312 is a universally unique identifier (UUID) that associates that key 312 to the values stored for each column. For example, the UUID may be a 128-bit value that is generated whenever a new row 311 is added to the data store. In other embodiments, the key 312 may be a string identifier. Each column family 310 may be associated with a Column Family Identifier. The identifier may be a string such as “Users”, for example.
- Columns (e.g., 313, 314, 315, etc.) within each column family 310 are associated with an identifier (e.g., a string) and a timestamp (e.g., 313-1, 314-1, 315-1, etc.). Each row 311 in a column family 310 may define a number of different columns that represent various values associated with a particular key 312. For example, in a “Users” column family 310(0), column 313(0) may be associated with a “username” identifier, column 314(0) may be associated with a “password” identifier, and column 315(0) may be associated with an “address” identifier. In some embodiments, each row 311 may include a different set of columns. In other words, each row 311 may include a subset of all columns associated with the column family 310. In such embodiments, column 313(0) may be associated with a “username” identifier, while column 313(1) may be associated with a different identifier such as a “password” identifier.
- Data may be added to the data store by various applications running on client computer 181 or the like (such as a mobile phone or tablet computer). For example, a user may enter information on a web page and click a submit button. The web page may include a form with text boxes for entering a username, password, and an address. When the user clicks submit, a standard RESTful request message is generated and transmitted to a URL associated with the cloud 110 via network 151. The message will typically be proxied by a web server and application server to one of the nodes 131, which causes an entry (i.e., row 311) to be added to the data store. The node 131 may generate a UUID associated with the request and create a row 311(0) in the data store to hold the strings entered on the web page. Key 312(0) may hold the UUID, column 313(0) may include the string entered on the web page in the username textbox, column 314(0) may include the string entered on the web page in the password textbox, and column 315(0) may include the string entered on the web page in the address textbox. As additional users submit information via the web page, additional RESTful request messages are sent to the cloud 110 and more rows are added to the data store.
- In one embodiment, the data store is persistent, meaning that once a value for a particular key 312 is added to the data store, the value will never be deleted from the data store. In such instances, in order to update a value associated with a particular key 312, a new row may be added to the data store using the same key 312. Each column may also include a timestamp (e.g., 313-1, 314-1, 315-1) that indicates a particular time at which the corresponding column data (e.g., 313, 314, 315) was added to the data store. When two rows share the same key 312, columns associated with the same identifier may be inconsistent. In such cases, the column associated with the most recent timestamp will hold the most recent value, and all other columns associated with that identifier may be discarded for that particular key 312.
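- The timestamp-based reconciliation described above can be illustrated with a short sketch. The Python fragment below is illustrative only and not part of any claimed embodiment; the row layout (a key plus a dictionary mapping each column identifier to a (value, timestamp) pair) is a simplified stand-in for the column family structure of FIG. 3.

```python
# Minimal sketch: merging rows that share the same key 312 by keeping, for
# every column identifier, the value carrying the most recent timestamp.
def resolve(rows):
    """rows: list of {column identifier: (value, timestamp)} for one key."""
    merged = {}
    for row in rows:
        for name, (value, ts) in row.items():
            if name not in merged or ts > merged[name][1]:
                merged[name] = (value, ts)
    return merged

# Two writes for the same key: the later one replaces the "pass" column.
row_a = {"name": ("Bobby12", 1325135486), "pass": ("BBCali24", 1325135486)}
row_b = {"pass": ("bigboy123", 1325137000)}
print(resolve([row_a, row_b]))
# {'name': ('Bobby12', 1325135486), 'pass': ('bigboy123', 1325137000)}
```

The same newest-timestamp-wins rule is what allows inconsistent copies of a row to be reconciled later, both on the nodes themselves and in the snapshot-building process described below.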
- FIG. 4 illustrates a cluster 400 configured to implement the eventually-consistent data store of FIG. 3, according to one example embodiment of the invention. As shown in FIG. 4, cluster 400 includes 9 nodes 131 located in 3 distinct data centers, which may be visualized as a ring. Again, Node_0 131(0), Node_1 131(1), and Node_2 131(2) are located in a first data center, Node_3 131(3), Node_4 131(4), and Node_5 131(5) are located in a second data center, and Node_6 131(6), Node_7 131(7), and Node_8 131(8) are located in a third data center. Each of the nodes 131 is configured to implement the same framework such that requests from any client computer 181 may be processed by any of the nodes 131 of cluster 400. It will be appreciated that a cluster may include any number of nodes located in one or more data centers.
- Client computer 181 connects to one of the nodes of cluster 400, such as Node_0 131(0), via network 151. The node to which client computer 181 connects acts as a coordinator for requests transmitted by client computer 181 to read from or write to the data store. Each request is associated with a particular key 312 and a particular column family 310. The “coordinator” node will analyze the key 312 to determine which node of the cluster is configured to store that particular value, a process known as partitioning. For example, keys 312 may be hashed to generate an MD5 hash value that effectively assigns a particular key 312 at random to one of the nodes 131 of the cluster 400. Cluster 400 may also be associated with a replication factor that ensures that data is stored redundantly such that a failure in one node does not stop requests to access the data store from being serviced by the cluster 400. For example, a replication factor of 3 reflects that data is stored on three nodes.
- In one embodiment, data will be stored on a primary node associated with a particular key, a secondary node located on the same rack in the data center as the primary node, and a tertiary node located in a different data center from the primary node. As shown in FIG. 4, a request received by Node_0 to store a value associated with a key 312 that corresponds to Node_0 will cause a row to be added to a column family 310 in Node_0 131(0), a row to be added to a column family 310 in Node_1 131(1), and a row to be added to a column family 310 in Node_3 131(3). Similarly, if a different node received a request related to the same key 312 value, then that node would forward the request to Node_0, Node_1, and Node_3.
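- A hash-based partitioner of the kind described in the preceding paragraphs can be sketched as follows. The evenly spaced tokens, the ring layout, and the rule of taking the next nodes clockwise as replicas are assumptions made purely for illustration; as noted above, an actual embodiment may also place one replica on the same rack and another in a different data center.

```python
import bisect
import hashlib

# Nine nodes with evenly spaced tokens on a 128-bit MD5 ring (assumed layout).
RING = [(i * (2**128 // 9), f"Node_{i}") for i in range(9)]
TOKENS = [token for token, _ in RING]

def key_token(key):
    """MD5-based token for a key 312, as in the partitioning described above."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def replicas(key, replication_factor=3):
    """The primary node owns the first token >= the key token (wrapping
    around the ring); the remaining replicas are the next nodes clockwise."""
    start = bisect.bisect_left(TOKENS, key_token(key)) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(replication_factor)]

print(replicas("Bob"))   # three consecutive nodes; which ones depends on the MD5 value
```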
FIG. 5 illustrates one implementation of an eventually-consistent data store, according to one example embodiment of the invention. The data store is implemented using various data structures stored on a plurality of nodes incluster 400. Referring back toFIG. 4 , data associated with a particular key that corresponds to Node_0 is replicated on a number of nodes specified by a replication factor for a cluster (e.g., for a replication factor of 3, data is replicated on Node_0 131(0), Node_1 131(1), and Node_3 131(3)). The nodes in thecluster 400 are peers where any node is configured to service a request from aclient computer 181. Returning toFIG. 5 , eachnode 131 includes anoperating system 510 and anapplication 520 inmemory 123. In one embodiment,application 520 implements the Apache™ Cassandra framework that handles requests to access the data store and transmits/receives messages to/from other nodes in thecluster 400 as well as anyclient computer 181 connected to that node. - As data is requested to be added to or retrieved from the data store via one or
more client computers 181 attached to thecloud 110, a message is routed to one of thenodes 131 in thecluster 400. The message may contain a request to write to the data store or read from the data store. For a WRITE request, the message may define a number of nodes within the cluster to which the value should be written in order to consider the WRITE successful. A WRITE request may be considered successful when a response is received from a single node within the cluster, a quorum of nodes within the cluster, or all nodes within the cluster associated with the key specified by the write request. For example, based on acluster 400 associated with a replication factor of 3, writing to a single node requires a response from one of the three nodes associated with a key, writing to a quorum of nodes requires a response from two of the three nodes associated with the key, and writing to all the nodes requires a response from three of the three nodes associated with the key. It will be appreciated that a quorum WRITE request does not reflect that data is written to a quorum of all nodes within the cluster but merely that data is written to a quorum of the nodes associated with storing that particular key. Similarly, a READ request may read a value from a single node within the cluster, a quorum of nodes within the cluster, or all nodes within the cluster. - Writing to a single node is a fast operation, but fails to ensure that a READ request will return the correct data (e.g., the READ request may be directed to a different node than the node where the data was written). Writing to all the nodes is a slow operation, but ensures that a READ request to a single node will always return the correct data. Writing to a quorum of nodes is a faster operation than writing to all the nodes while also ensuring that, in conjunction with issuing a quorum READ request, the correct data is always returned to a user, even though a READ is only performed on a subset of the nodes associated with that key. Reading a value from a single node is a fast operation, but does not ensure that the returned value is correct (i.e., the correct value may be stored in a different node). Reading a value from a quorum of nodes is a slower operation than reading from a single node, but reading from a quorum of nodes will ensure that the correct value is returned as long as the value was also written to a quorum of nodes. When reading from a quorum of nodes, the READ request is forwarded to a quorum of the nodes associated with the given key and the correct value is given by the value that was returned with the highest frequency. In the event of a tie (e.g., two queried nodes, out of the three nodes associated with the key, return different values), the value associated with the most recent timestamp (313-1, 314-1, etc.) is selected as the correct value. Reading a value from all nodes is a slower operation, but will always ensure that the correct value is returned, regardless of whether the value was written to one node, a quorum of nodes, or all nodes within the
cluster 400.
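- The quorum READ reconciliation just described (the most frequently returned value wins, with ties broken by the most recent timestamp) can be expressed compactly. The sketch below is illustrative only; replica responses are modeled as (value, timestamp) pairs for a single column.

```python
from collections import Counter

def reconcile(responses):
    """responses: list of (value, timestamp) pairs returned by the replicas.
    Pick the most frequent value; break ties with the newest timestamp."""
    counts = Counter(value for value, _ in responses)
    newest = {}
    for value, ts in responses:
        newest[value] = max(ts, newest.get(value, 0))
    return max(counts, key=lambda value: (counts[value], newest[value]))

# Two of the three replicas answer a quorum READ with different column values.
print(reconcile([("BBCali24", 1325135486), ("bigboy123", 1325137000)]))
# -> bigboy123 (a tie on frequency, broken by the newer timestamp)
```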
- Application 520 is configured to store key-value pairs in the data store and perform backend operations to ensure that the data store is eventually-consistent across every node. As a request is received by a node, the request is added to a commit log 530 stored in storage device 128. The commit log 530 acts like a buffer that allows requests to be processed asynchronously from when the requests arrive at the node. The commit log 530 is stored in storage device 128 so that, in the event of a power failure, requests may be processed by replaying the requests stored in the commit log 530, ensuring integrity of the data in the data store. In other embodiments, requests may be processed as they are received by a node 131, thereby obviating the need for the commit log 530. However, in such other embodiments, a power failure may result in requests not being processed by a node and data being lost within the data store. These embodiments improve on latency caused by a disk input/output operation, but sacrifice robustness because pending requests may be lost due to a power failure of a single node. - As each
node 131 receives a WRITE request,application 520 adds anew row 311 to theappropriate column family 310 specified in the WRITE request. Again, therow 311 includes a key 312 along with values corresponding to one or more columns (e.g., 313, 314, 315, etc.). As shown inFIG. 5 , Node_0 131(0) includes two column families 310(0) and 310(1). For example, the first column family 310(0) may correspond to a database of distinct users, and a second column family 310(1) may correspond to data associated with various users specified in the rows of the first column family 310(0). As Node_0 131(0) receives WRITE requests that specify the first column family 310(0),application 520 stores the WRITE requests to the commitlog 530.Application 520 also processes requests from the commitlog 530 as the processing capacity ofCPU 121 allows. When a WRITE request is retrieved from commitlog 530,application 520 adds arow 311 to thecolumn family 310 specified by the WRITE request and, if necessary, transmits a message to the originating node incluster 400 indicating that the operation was successful (where the node did not receive a request directly from aclient computer 181 but rather a request that was forwarded from another node incluster 400 to which theclient computer 181 has established a direct communications channel with cluster 400). - As also shown in
FIG. 5 , Node_1 131(1) includes two column families, a first column family 310(2) that corresponds to the first column family 310(0) in Node_0 131(0) and a second column family 310(3) that corresponds to the second column family 310(1) in Node_0 131(0). For example, the first column family 310(2) in Node_1 131(1) may contain one or more rows associated with various users. Some of the rows in the first column family 310(0) in Node_0 131(0) may also be stored in the first column family 310(2) in Node_1 131(1). Other rows may only be found in either Node_0 131(0) or Node_1 131(1) due to the eventually-consistent nature of the data store. For example, a WRITE request may only specify that data be written to a single node and, therefore, the data may be added to column family 310(0) but not column family 310(2). Similar to Node_0 131(0) and Node_1 131(1), Node_3 131(3) includes two column families 310(4) and 310(5), the first column family 310(4) corresponds to the first column family 310(0) of Node_0 131(0) and the first column family of Node_1 131(1) and the second column family 310(5) corresponds to the second column family 310(1) of Node_0 131(0) and the second column family 310(3) of Node_1 131(1). -
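- The write path sketched over the last few paragraphs (log the request, apply it to the in-memory column family, and acknowledge the coordinator) might be outlined as below. The class layout and the in-process list of replicas are assumptions made purely for illustration; a real node 131 would persist the commit log on storage device 128 and exchange messages over the network.

```python
import json
import time

class Node:
    def __init__(self, name):
        self.name = name
        self.commit_log = []        # stand-in for the on-disk commit log 530
        self.column_families = {}   # {cf name: {key: {column: (value, ts)}}}

    def apply_write(self, cf, key, columns):
        ts = int(time.time() * 1_000_000)
        # 1. Buffer the request durably before touching the column family.
        self.commit_log.append(json.dumps({"cf": cf, "key": key,
                                           "columns": columns, "ts": ts}))
        # 2. Apply the request to the in-memory column family.
        row = self.column_families.setdefault(cf, {}).setdefault(key, {})
        for name, value in columns.items():
            row[name] = (value, ts)
        return True                 # acknowledgement back to the coordinator

def coordinate_write(replicas, cf, key, columns, required_acks=2):
    """Forward the WRITE to each replica; succeed once a quorum has acked."""
    acks = sum(1 for node in replicas if node.apply_write(cf, key, columns))
    return acks >= required_acks

nodes = [Node("Node_0"), Node("Node_1"), Node("Node_3")]
print(coordinate_write(nodes, "Users", "Bob", {"name": "Bobby12"}))   # True
```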
Column families 310 may become very large. In addition, eachnode 131 may contain a plurality ofcolumn families 310. Thus, memory capacity for each node may be insufficient to store all of the data. Although not shown explicitly,nodes 131 may implement a backing store instorage device 128 to hold portions ofcolumn families 310. As acolumn family 310 grows past a threshold value, some of the rows ofcolumn family 310 may be copied into thestorage device 128 and removed frommemory 123. When servicing READ requests,node 131 may check the portions of thecolumn family 310 inmemory 123 as well as portions of thecolumn family 310 instorage device 128, portions of which may be temporarily read intomemory 123 to service the READ request. - As the coordinator node transmits WRITE requests to various nodes associated with a
particular key 312, some nodes may not be able to perform the WRITE operation. For example, a particular node may be disconnected from thecloud 110 due to a power failure at the data center or a failure in the communications link between nodes. In this case, the WRITE will fail and data for a given key will be inconsistent between the various nodes of thecluster 400. In addition, aclient computer 181 may request a WRITE operation to only a single node in order to reduce latency, sacrificing consistency in the process. In such cases,application 520 may be configured to periodically repair the data store to ensure that replicas are consistent between nodes. This operation may be performed in the background, by checking that a value associated with a key in a column family is the same as the value stored in a column family on a different node associated with that key. If the values are different, then theapplication 520 will transmit a WRITE request to the node that stores the incorrect data in order to make the nodes consistent across the cluster. Again, this operation may be performed in the background to avoid high latencies to be experienced by client computers attempting to access the data store. In other words, over time, an entry added to acolumn family 310 in one node will eventually be replicated across a number of redundant nodes, even if the data was only written to one node based on the request. - As WRITE requests are processed by the nodes of the cluster, the size of each
column family 310 increases. Application 520 is configured to periodically flush the entries in column families 310 to a non-volatile storage device 128 in a data structure known as an SSTable (“sorted string table”) 540. The SSTable 540 data structure is a persistent data structure of key-value pairs that may be associated with a bloom filter. A bloom filter enables application 520 to quickly check whether a particular key 312 is likely included within the SSTable 540 (a bloom filter will sometimes return a false positive that a key is included in the table, but will never return a negative response when the key is included in the table). SSTables 540 are immutable, meaning that once the SSTables 540 are written to storage device 128, those SSTables 540 are never modified; rather, additional SSTables 540 are written to storage device 128. As shown in FIG. 5, Node_0 131(0) includes two SSTables 540, a first SSTable 540(0) that corresponds to the first column family 310(0) in Node_0 131(0) and a second SSTable 540(1) that corresponds to the second column family 310(1) in Node_0 131(0). Again, application 520 is configured to periodically flush column families 310 to non-volatile storage 128 in order to permanently store key-value pairs in the data store. Once key-value pairs in column families 310 have been flushed to storage 128, the requests may be removed from the commit log 530.
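- A flush of an in-memory column family to a sorted, immutable file, paired with a toy bloom filter, might look like the sketch below. The hashing scheme, bit-array size, and one-JSON-object-per-line file layout are simplifications assumed for the example and are not the on-disk format of any particular implementation.

```python
import hashlib
import json

class BloomFilter:
    """Toy bloom filter: may report false positives, never false negatives."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        return all(self.bits >> position & 1 for position in self._positions(key))

def flush(memtable, path):
    """Write rows sorted by key to an immutable file; return its bloom filter."""
    bloom = BloomFilter()
    with open(path, "w") as sstable:
        for key in sorted(memtable):
            bloom.add(key)
            sstable.write(json.dumps({"key": key, "columns": memtable[key]}) + "\n")
    return bloom

memtable = {"Jim": {"name": ["JimmyJ", 1325200000]},
            "Bob": {"name": ["Bobby12", 1325135486]}}
bloom = flush(memtable, "users-sstable-0.json")
print(bloom.might_contain("Bob"), bloom.might_contain("Zoe"))  # True, likely False
```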
- Application 520 may also be configured to merge multiple SSTables 540 associated with a single column family 310 using a process called compaction, illustrated in the sketch following the next paragraph. In the background, application 520 sorts each SSTable 540 and combines rows associated with the same key into a single row in a new SSTable 540, keeping only the most recent entry for each column identifier in that row. Once a new combined SSTable 540 has been created, the old SSTables 540 may be deleted by application 520. In the aggregate, all SSTables 540 stored in the cluster reflect the data in the eventually-consistent data store at a previous point in time. Thus, each SSTable 540 provides an inconsistent snapshot of the key-value pairs associated with a column family for a particular node at a given point in time.
- A data analyst may have problems attempting to monitor the state of the data store at a particular point-in-time because of the distributed nature of the data store. On one hand, the data analyst could force a consistent scan across each node 131 in the data store, in effect blocking any access to the data store while the consistent scan is taking place. However, such a technique may be disruptive to clients attempting to access information in the data store because the nodes will be inaccessible while data is replicated across all nodes and then transmitted to the analyst's computer. Instead, the present disclosure describes a method and system to generate a snapshot of the data store that provides a lagging consistency point based on the inconsistent SSTables 540 stored in the various nodes of cluster 400.
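- The compaction merge referred to above amounts to a k-way merge of key-sorted SSTables in which rows sharing a key 312 collapse into one and, for each column, only the newest timestamp survives. In the sketch below, in-memory lists stand in for on-disk files and tombstone handling is omitted; it is an illustration, not the compaction routine of any particular implementation.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def compact(sstables):
    """sstables: lists of (key, {column: (value, timestamp)}) rows, each list
    already sorted by key. Returns one merged, key-sorted list of rows."""
    merged = []
    stream = heapq.merge(*sstables, key=itemgetter(0))   # k-way merge by key
    for key, rows in groupby(stream, key=itemgetter(0)):
        columns = {}
        for _, row_columns in rows:                      # newest column wins
            for name, (value, ts) in row_columns.items():
                if name not in columns or ts > columns[name][1]:
                    columns[name] = (value, ts)
        merged.append((key, columns))
    return merged                                        # the new combined SSTable

old_a = [("Bob", {"pass": ("BBCali24", 1325135486)})]
old_b = [("Bob", {"pass": ("bigboy123", 1325137000)}),
         ("Jim", {"name": ("JimmyJ", 1325200000)})]
print(compact([old_a, old_b]))
```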
FIG. 6 illustrates a system for generating a point-in-time snapshot of an eventually-consistent data store, according to one example embodiment of the invention. As shown, the eventually-consistent data store may be implemented oncluster 400 as described above in connection withFIGS. 1-5 . A plurality of inconsistent snapshots (i.e., SSTables 540) is stored oncluster 400. Asecond cluster 600 may be implemented incloud 110 using different resources, physical or virtual, thancluster 400. The number of nodes incluster 600 may be different than the number of nodes ofcluster 400. In one embodiment, the nodes incluster 600 may be similar to the nodes incluster 400. For example, each of the nodes ofcluster 600 also includes a processor, a memory, a NIC, and one or more non-volatile storage devices. However, while each of the nodes executes a similar operating system, the nodes may execute different applications that provide an application framework for a distributed compute environment. In one embodiment, the applications executing on the nodes ofcluster 600 may implement the Apache™ Hadoop distributed compute environment framework. The Hadoop framework provides common utilities, a distributed file system, and a MapReduce framework for processing large data sets on a plurality of nodes. - In one embodiment,
cluster 600 is configured with onemaster node 631 and a plurality ofslave nodes 632 configured to process theSSTables 540 retrieved from data store. Themaster node 631 is configured to retrieve theSStables 540 generated by each node incluster 400. TheSSTables 540 may correspond to different points-in-time and may be inconsistent from node to node. For example, two nodes may contain relatedSSTables 540 that store different values for the same key because a WRITE request was configured to only write a value to a single node. Also, anSSTable 540 from one node may be generated before a WRITE request is processed whereas arelated SSTable 540 from another node may be generated after the WRITE request is processed, leading to an inconsistency in the values stored in theSSTables 540. - In one embodiment,
cluster 400 is configured to back-up all of theSSTables 540 stored instorage devices 128 of each node to a separate long-term storage location. The long-term storage location may be, among other implementations, a drive array connected to a private network coupled to cluster 400 or a virtual storage service such as Amazon® S3.Application 520 may be configured to copySSTables 540 stored instorage device 128 to a location in the long-term storage periodically. For example, every night at 3 am, the nodes ofcluster 400 may copy theSSTables 540 instorage device 128 to a bucket in the Amazon® S3 cloud storage. Then,cluster 600 may be configured to read the copies ofSSTables 540 from the back-up storage location instead of the individual nodes incluster 400. By reading backup copies instead of the primary copies oncluster 400, the administrator may avoid causing disruptions to access of the data store. - As the
master node 631 retrieves the SSTables 540 (from either cluster 400 or the back-up storage location), the master node 631 implements a distributed MapReduce operation to process the SSTables 540 and generate a point-in-time snapshot of the data store. The MapReduce operation splits data into smaller portions for processing and distributes the data to a plurality of processing nodes that perform the same operation on the data. The results from all of the slave nodes 632 are then combined to form the total output. In one embodiment, as the master node 631 receives each SSTable 540, the master node 631 splits the SSTable 540 into processing tasks. A processing task includes at least a portion of the rows 311 in one or more SSTables 540. For example, the master node 631 may split the SSTables 540 into 64 MB chunks for distribution to each of the slave nodes 632. The processing task is then assigned to one of the slave nodes 632 for processing.
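- The splitting step performed by the master node 631 can be pictured as follows. The 64 MB figure comes from the example above; the crude per-row size estimate and the rule of never cutting between two rows that share a key (anticipating the key-boundary splitting described below in connection with FIG. 7B) are assumptions made for the sketch.

```python
def split_into_tasks(sorted_rows, target_bytes=64 * 1024 * 1024):
    """Greedily pack key-sorted (key, columns) rows into tasks of roughly
    target_bytes, never splitting rows that share the same key across tasks."""
    tasks, current, size = [], [], 0
    for key, columns in sorted_rows:
        row_size = len(repr((key, columns)).encode("utf-8"))   # crude estimate
        at_key_boundary = bool(current) and current[-1][0] != key
        if size + row_size > target_bytes and at_key_boundary:
            tasks.append(current)
            current, size = [], 0
        current.append((key, columns))
        size += row_size
    if current:
        tasks.append(current)
    return tasks

rows = [("Bob", {"pass": ("bigboy123", 1325137000)}),
        ("Bob", {"pass": ("BBCali24", 1325135486)}),
        ("Carrie", {"name": ("CarrieM", 1325140000)}),
        ("Jim", {"name": ("JimmyJ", 1325200000)})]
tasks = split_into_tasks(rows, target_bytes=60)   # tiny target just for the demo
print([[key for key, _ in task] for task in tasks])
# The two "Bob" rows always land in the same task.
```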
FIGS. 7A , 7B, and 7C illustrate various intermediate data structures generated bycluster 600 to create a point-in-time snapshot of the data store, according to one example embodiment of the invention. Again,cluster 600 generates a point-in-time snapshot of data store based on a plurality of inconsistent snapshots (i.e., SSTables 540) generated by the various nodes of the distributed data store. More specifically, themaster node 631 retrieves each of theSSTables 540 associated with aparticular column family 310 and combines theSSTables 540 into a single aggregate table 700 that includes one ormore rows 311 storing key-value pairs from thatcolumn family 310. - As shown in
FIG. 7A , the table 700 includes 16 rows (e.g., 311(0)-311(15)) of data combined from one or more SSTables 540 generated by thevarious nodes 131 that implement the data store. For example, a first row 311(0) includes a key 312(0) (“Bob”), a first column 313(0) (“name”: “Bobby12”), a timestamp 313-1(0) (“1325135486”) corresponding to the entry in the first column 313(0), a second column 314(0) (“pass”: “BBCali24”), and a timestamp 314-1(0) (“1325135486”) corresponding to the entry in the second column 314(0). A second row 311(1) is associated with thesame key 312 as the first row 311(0), but the second row 311(1) includes a different value in the second column 314(1) (“pass”: “bigboy123”) as well as a different timestamp 314-1(1). The second row 311(1) corresponds to data added to the data store that reflects an operation for changing a password associated with a user (“Bob”) stored in the data store. Table 700 also includes a fourth row 311(3) that is a duplicate of the second row 311(1), which reflects that the data was written to data store on at least two different nodes 131 (e.g., in response to a quorum WRITE request). - It will be appreciated that the
rows 311 in the table 700 may be duplicated (because replicas of therows 311 are included inSSTables 540 from different nodes) or thatmultiple rows 311 in the table 700 may map different values to thesame key 312 because new values for that key 312 were added to data store at different times to “replace” old values. The sixteenrows 311 of the table 700 are shown in random order, such as the result of a process by which amaster node 631 adds each row from anSSTable 540 to the bottom of table 700. In such a case, once themaster node 631 has added all data from theSSTables 540 to table 700, themaster node 631 may perform a sort of therows 311 based on thekeys 312 associated with therows 311 to generate a sorted table 710, as shown inFIG. 7B . In alternate embodiments, themaster node 631 may insert eachrow 311 from anSSTable 540 into a sorted location within table 700, thus maintaining a sorted order as table 700 is generated. However, such techniques may not be as efficient as first generating an unordered table 700 and then performing a distributed sort on the resulting data to generate the sorted table 710. In yet other embodiments, themaster node 631 may not sort table 700 at all, distributing portions of the table 700 to theslave nodes 632 for compaction. In such embodiments,master node 631 must perform a further compaction operation on the combined results received from each of theslave nodes 632 in order to ensure that there are no duplicate rows or inconsistent data in the final output. - Once the
master node 631 has generated a sorted table 710, themaster node 631 performs a MapReduce operation to split the data into smaller portions and distribute the data to theslave nodes 632 for processing. In one embodiment,master node 631 splits the sorted table 710 up into processing tasks along boundaries defined by thekeys 312 associated with therows 311 of the sorted table 710. For example, as shown inFIG. 7B , themaster node 631 may split the sorted table 710 up into processing tasks along the boundary formed between the tenth row 311(9) and the eleventh row 311(10), ensuring that the key 312 associated with the last row (e.g., 311(9)) in a first processing task is not the same as a key 312 associated with the first row (e.g., 311(10)) in the next processing task. Themaster node 631 then distributes each of the processing tasks to theslave nodes 632 to perform the distributed MapReduce operation. - Each of the
slave nodes 632 receives a sorted list of rows from table 710. For example, a first slave node 632(0) may receive a sorted list including rows 311(0) through 311(9) from sorted table 710. Rows 311(0) through 311(9) correspond to rows associated with the “Bob” and “Carrie” keys. Similarly, a third slave node 632(2) may receive a sorted list including rows 311(10) through 311(15) from sorted table 710. Rows 311(10) through 311(15) correspond to rows associated with the “Jim” and “Steve” keys. A slave node 632 will process the sorted list of rows to remove any duplicate rows and discard any columns that were “replaced” by entries associated with a more recent timestamp. For example, the first slave node 632(0) may compact a first processing task (i.e., all rows corresponding to keys 312 “Bob” and “Carrie”) to generate a result that includes two rows, a first row selected from one of the third 311(2) through fifth 311(4) rows of table 710 and a second row selected from one of the ninth 311(8) through tenth 311(9) rows of table 710. Similarly, the third slave node 632(2) may compact a second processing task to generate a result that includes two additional rows, a first row selected from one of the eleventh 311(10) through thirteenth 311(12) rows of table 710 and a second row selected from one of the fourteenth 311(13) through sixteenth 311(15) rows of table 710. The resulting compacted lists of non-duplicate, consistent data are then transmitted back to the master node 631, which combines the output of the slave nodes to generate the point-in-time snapshot (i.e., table) 720, as shown in FIG. 7C.
- In one embodiment, the master node 631 converts the point-in-time snapshot 720 to a JSON (JavaScript Object Notation) format, a text-based standard designed for human-readable data exchange. A JSON object is an unordered set of name/value pairs enclosed in braces ({ }), with each name separated from its value by a colon (:). Multiple name/value pairs in the JSON object may be separated by commas. It will be appreciated that the point-in-time snapshot may be converted to other types of formats as well, both human-readable and machine-readable, encrypted or non-encrypted, as is known in the art.
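- A minimal sketch of that JSON conversion is shown below; the field names and the one-object-per-line layout are assumptions for illustration, and any equivalent human- or machine-readable encoding could be substituted.

```python
import json

# Resolved rows (one per key 312) as produced by the slave nodes' compaction.
snapshot_720 = [("Bob",    {"name": "Bobby12", "pass": "bigboy123"}),
                ("Carrie", {"name": "CarrieM"})]

with open("point-in-time-snapshot.json", "w") as out:
    for key, columns in snapshot_720:
        # One JSON object per row: name/value pairs in braces, each name
        # separated from its value by a colon, pairs separated by commas.
        out.write(json.dumps({"key": key, "columns": columns}) + "\n")
```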
FIG. 8 illustrates a method 800 for building a point-in-time snapshot 720 of an eventually-consistent data store, according to one example embodiment of the invention. Although the method steps are described in conjunction with the systems ofFIGS. 1 through 7 , persons skilled in the art will understand that any system configured to perform the method steps, in any order, is within the scope of the invention. - As shown, the method 800 begins at
step 810 where a processing cluster 600 receives one or more inconsistent snapshots of an eventually-consistent data store. In one embodiment, the cluster 600 implements a distributed compute environment as a service on a set of virtual compute nodes. The master node 631 in cluster 600 retrieves a set of SSTables 540 from a back-up data store coupled to the cluster 600. At step 812, the master node 631 generates a sorted table 710 of key-value pairs from the set of inconsistent snapshots. In one embodiment, the master node 631 first generates an unsorted list by combining rows 311 from each SSTable 540 associated with the data store into an aggregate table 700 and then sorts the unsorted, aggregate table 700 to generate a sorted table 710.
- At step 814, the master node 631 divides the sorted table 710 of key-value pairs into one or more processing tasks. A processing task is a subset of rows from the sorted table 710. In one embodiment, the master node 631 divides the sorted table 710 along row boundaries to ensure that rows 311 associated with the same key 312 are directed to the same slave node 632 for processing. At step 816, the master node 631 transmits the processing tasks to slave nodes 632 for processing. In one embodiment, the slave nodes 632 perform a compaction function on the rows of the processing tasks to generate an output table that includes a single row for each unique key 312. The slave node 632 selects the most recent columns (e.g., 313, 314, etc.) associated with each key 312 to include in the output table, eliminating any duplicate rows associated with the same key 312. At step 818, the master node 631 receives the results from the plurality of slave nodes 632. At step 820, the master node 631 combines the results received from the slave nodes 632 to generate a point-in-time snapshot of the eventually-consistent data store.
- Due to the fact that the SSTables 540 produced by cluster 400 are immutable, once cluster 600 has created a point-in-time snapshot up to a certain point, the point-in-time snapshot may be updated by iterating through steps 810-820 and combining any additional SSTables 540 generated by cluster 400 since a previous point-in-time corresponding to the snapshot 720.
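- Because the SSTables 540 are immutable, the update never has to re-read old inputs: the previous snapshot simply becomes one more input to the next run. The sketch below assumes the SSTable back-ups are JSON-lines files selected by modification time and that resolve_rows is the same newest-timestamp-wins merge used to build the first snapshot; both are illustrative assumptions rather than details of any embodiment.

```python
import glob
import json
import os

def sstables_since(backup_dir, previous_run_time):
    """Paths of SSTable back-ups written after the previous snapshot run."""
    return [path for path in glob.glob(os.path.join(backup_dir, "*.json"))
            if os.path.getmtime(path) > previous_run_time]

def next_snapshot(previous_rows, new_sstable_paths, resolve_rows):
    """snapshot(n) = resolve(rows of snapshot(n-1) + rows of the new SSTables);
    resolve_rows is the newest-timestamp-wins merge sketched earlier."""
    rows = list(previous_rows)
    for path in new_sstable_paths:
        with open(path) as sstable:
            rows.extend((record["key"], record["columns"])
                        for record in map(json.loads, sstable))
    return resolve_rows(rows)
```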
FIG. 9 illustrates a perpetual process for updating the point-in-time snapshot 720, according to one example embodiment of the invention. As shown inFIG. 9 , oncecluster 600 has generated one point-in-time snapshot 720(n-1) associated with a first point-in-time, thecluster 600 may generate a new point-in-time snapshot 720(n) using the snapshot 720(n-1) from the previous point-in-time as a starting point for generating sorted table 710. Rather than accumulating each and everySSTable 540 fromcluster 400, themaster node 631 only needs to retrieve anyadditional SSTables 540 that have been generated bycluster 400 since the previous point-in-time. Therows 311 from thesenew SSTables 540 are then added to the snapshot from the previous point-in-time 720(n-1) to generate a newsorted list 710 that is compacted via theslave nodes 632 to generate the new point-in-time snapshot 720(n) corresponding to a consistency point that is more recent than that of the previous point-in-time snapshot 720(n-1). - Returning now to
FIG. 8 , method 800 continues atstep 822, where themaster node 631 determines whether to update the previous point-in-time snapshot 720. If the previous point-in-time snapshot should be updated, thenmaster node 631 retrieves anySSTables 540 generated bycluster 400 since the previous point-in-time and repeats steps 810-820 to generate a new point-in-time snapshot 720 corresponding to a more recent consistency point. However, if themaster node 631 determines that thesnapshot 720 does not need to be updated at this time, then method 800 terminates. - The techniques described herein are implemented using a cloud-based virtual computing architecture. However, it will be appreciated that these techniques and systems may also be implemented by a single computer in a non-distributed manner. In such embodiments, steps 814-820 of method 800 may be replaced by a single step where the server computer performs the compaction operation on the sorted table 710. For example, a server computer may download each of the
SSTables 540 from thecloud 110 and generate sorted table 710 locally in a memory or storage device attached to the server computer. Instead of splitting up the sorted table 710 and distributing smaller processing tasks to a plurality of compute nodes, the server computer could perform the equivalent of the MapReduce operation locally by compacting the sorted table 710 using a single processor. Although most eventually-consistent data stores 300 are very large so as to make such a technique inefficient, generating the point-in-time snapshot 720 with a single processor, rather than a distributed compute environment, is within the scope of the present invention. - While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. For example, aspects of the present invention may be implemented in hardware or software or in a combination of hardware and software. One embodiment of the invention may be implemented as a program product for use with a computer system. The program(s) of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive or any type of solid-state random-access semiconductor memory) on which alterable information is stored. Such computer-readable storage media, when carrying computer-readable instructions that direct the functions of the present invention, are embodiments of the present invention.
- In view of the foregoing, the scope of the present invention is determined by the claims that follow.
Claims (20)
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/399,467 US9613104B2 (en) | 2012-02-17 | 2012-02-17 | System and method for building a point-in-time snapshot of an eventually-consistent data store |
PCT/US2013/026513 WO2013123449A1 (en) | 2012-02-17 | 2013-02-15 | System and method for building a point-in-time snapshot of an eventually-consistent data store |
DK13749480.3T DK2815304T3 (en) | 2012-02-17 | 2013-02-15 | SYSTEM AND PROCEDURE FOR BUILDING A TIME SNAPSHOT OF ANY CONSISTENT DATA STOCK |
EP13749480.3A EP2815304B1 (en) | 2012-02-17 | 2013-02-15 | System and method for building a point-in-time snapshot of an eventually-consistent data store |
US15/476,926 US10942812B2 (en) | 2012-02-17 | 2017-03-31 | System and method for building a point-in-time snapshot of an eventually-consistent data store |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/399,467 US9613104B2 (en) | 2012-02-17 | 2012-02-17 | System and method for building a point-in-time snapshot of an eventually-consistent data store |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/476,926 Continuation US10942812B2 (en) | 2012-02-17 | 2017-03-31 | System and method for building a point-in-time snapshot of an eventually-consistent data store |
Publications (2)
Publication Number | Publication Date |
---|---|
US20130218840A1 true US20130218840A1 (en) | 2013-08-22 |
US9613104B2 US9613104B2 (en) | 2017-04-04 |
Family
ID=48983093
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/399,467 Active 2033-05-28 US9613104B2 (en) | 2012-02-17 | 2012-02-17 | System and method for building a point-in-time snapshot of an eventually-consistent data store |
US15/476,926 Active US10942812B2 (en) | 2012-02-17 | 2017-03-31 | System and method for building a point-in-time snapshot of an eventually-consistent data store |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/476,926 Active US10942812B2 (en) | 2012-02-17 | 2017-03-31 | System and method for building a point-in-time snapshot of an eventually-consistent data store |
Country Status (4)
Country | Link |
---|---|
US (2) | US9613104B2 (en) |
EP (1) | EP2815304B1 (en) |
DK (1) | DK2815304T3 (en) |
WO (1) | WO2013123449A1 (en) |
Cited By (81)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130080393A1 (en) * | 2011-09-23 | 2013-03-28 | Red Lambda, Inc. | System and Method for Storing Stream Data in Distributed Relational Tables with Data Provenance |
US20130246568A1 (en) * | 2012-03-15 | 2013-09-19 | Onapp Limited | Data storage system |
US20130282668A1 (en) * | 2012-04-20 | 2013-10-24 | Cloudera, Inc. | Automatic repair of corrupt hbases |
CN103744628A (en) * | 2014-01-27 | 2014-04-23 | 北京奇虎科技有限公司 | SSTable file storage method and device |
US20150169624A1 (en) * | 2013-12-13 | 2015-06-18 | BloomReach Inc. | Distributed and fast data storage layer for large scale web data services |
US20150178309A1 (en) * | 2012-06-05 | 2015-06-25 | International Business Machines Corporation | Preserving a state using snapshots with selective tuple versioning |
US9082127B2 (en) | 2010-03-31 | 2015-07-14 | Cloudera, Inc. | Collecting and aggregating datasets for analysis |
US9081888B2 (en) | 2010-03-31 | 2015-07-14 | Cloudera, Inc. | Collecting and aggregating log data with fault tolerance |
US20150212894A1 (en) * | 2014-01-24 | 2015-07-30 | Commvault Systems, Inc. | Restoring application data from a single snapshot for multiple applications |
US9128949B2 (en) | 2012-01-18 | 2015-09-08 | Cloudera, Inc. | Memory allocation buffer for reduction of heap fragmentation |
US9172608B2 (en) | 2012-02-07 | 2015-10-27 | Cloudera, Inc. | Centralized configuration and monitoring of a distributed computing cluster |
US20150333978A1 (en) * | 2014-05-19 | 2015-11-19 | Inventec Appliances Corp. | System, method and computer readable media storage program therein for allocating cloud resource |
CN105095287A (en) * | 2014-05-14 | 2015-11-25 | 华为技术有限公司 | LSM (Log Structured Merge) data compact method and device |
US9201910B2 (en) | 2010-03-31 | 2015-12-01 | Cloudera, Inc. | Dynamically processing an event using an extensible data model |
EP2980702A1 (en) * | 2014-07-31 | 2016-02-03 | Deutsche Telekom AG | Method for enhancing the generation of a backup copy of data items of a distributed data structure, computer network for enhancing the generation of a backup copy of data items of a distributed data structure, program and computer program product |
US9317572B2 (en) | 2010-03-31 | 2016-04-19 | Cloudera, Inc. | Configuring a system to collect and aggregate datasets |
US9338008B1 (en) | 2012-04-02 | 2016-05-10 | Cloudera, Inc. | System and method for secure release of secret information over a network |
US9342557B2 (en) | 2013-03-13 | 2016-05-17 | Cloudera, Inc. | Low latency query engine for Apache Hadoop |
CN105677673A (en) * | 2014-11-20 | 2016-06-15 | 阿里巴巴集团控股有限公司 | Business processing method, device and system |
US20160217043A1 (en) * | 2015-01-28 | 2016-07-28 | DataStax | Backup to and Restore from an Offsite Backup Location |
US20160217044A1 (en) * | 2015-01-28 | 2016-07-28 | DataStax | Backup to and Clone from an Offsite Backup Location |
US9405692B2 (en) | 2012-03-21 | 2016-08-02 | Cloudera, Inc. | Data processing performance enhancement in a distributed file system |
US20160259794A1 (en) * | 2015-03-03 | 2016-09-08 | Taser International, Inc. | Automated Integration Of Video Evidence With Data Records |
US9448731B2 (en) | 2014-11-14 | 2016-09-20 | Commvault Systems, Inc. | Unified snapshot storage management |
US9471578B2 (en) | 2012-03-07 | 2016-10-18 | Commvault Systems, Inc. | Data storage system utilizing proxy device for storage operations |
US9477731B2 (en) | 2013-10-01 | 2016-10-25 | Cloudera, Inc. | Background format optimization for enhanced SQL-like queries in Hadoop |
US20160321294A1 (en) * | 2015-04-30 | 2016-11-03 | Vmware, Inc. | Distributed, Scalable Key-Value Store |
US9619341B2 (en) | 2003-11-13 | 2017-04-11 | Commvault Systems, Inc. | System and method for performing an image level snapshot and for restoring partial volume data |
US20170109376A1 (en) * | 2015-10-20 | 2017-04-20 | Samsung Sds Co., Ltd. | Method for managing data using in-memory database and apparatus thereof |
US20170111434A1 (en) * | 2015-10-14 | 2017-04-20 | International Business Machines Corporation | Geographically distributed highly available mailbox |
US9632874B2 (en) | 2014-01-24 | 2017-04-25 | Commvault Systems, Inc. | Database application backup in single snapshot for multiple applications |
US9639426B2 (en) | 2014-01-24 | 2017-05-02 | Commvault Systems, Inc. | Single snapshot for multiple applications |
US9648105B2 (en) | 2014-11-14 | 2017-05-09 | Commvault Systems, Inc. | Unified snapshot storage management, using an enhanced storage manager and enhanced media agents |
US9690671B2 (en) | 2013-11-01 | 2017-06-27 | Cloudera, Inc. | Manifest-based snapshots in distributed computing environments |
US9697241B1 (en) | 2015-03-19 | 2017-07-04 | EMC IP Holding Company LLC | Data fabric layer having nodes associated with virtual storage volumes of underlying storage infrastructure layer |
WO2017117595A1 (en) * | 2015-12-31 | 2017-07-06 | Fractal Industries, Inc. | Distributed system for large volume deep web data extraction |
US20170235814A1 (en) * | 2015-12-22 | 2017-08-17 | Jeremy L. Branscome | Method and a System for Efficient Data Sorting |
US9747317B2 (en) | 2012-06-05 | 2017-08-29 | International Business Machines Corporation | Preserving past states of file system nodes |
US9747333B2 (en) | 2014-10-08 | 2017-08-29 | Cloudera, Inc. | Querying operating system state on multiple machines declaratively |
US9753812B2 (en) | 2014-01-24 | 2017-09-05 | Commvault Systems, Inc. | Generating mapping information for single snapshot for multiple applications |
US9774672B2 (en) | 2014-09-03 | 2017-09-26 | Commvault Systems, Inc. | Consolidated processing of storage-array commands by a snapshot-control media agent |
US9886346B2 (en) | 2013-01-11 | 2018-02-06 | Commvault Systems, Inc. | Single snapshot for multiple agents |
US9892123B2 (en) | 2014-01-24 | 2018-02-13 | Commvault Systems, Inc. | Snapshot readiness checking and reporting |
US9898371B2 (en) | 2012-03-07 | 2018-02-20 | Commvault Systems, Inc. | Data storage system utilizing proxy device for storage operations |
US20180067826A1 (en) * | 2013-08-26 | 2018-03-08 | Vmware, Inc. | Distributed transaction log |
US9928002B2 (en) | 2012-04-23 | 2018-03-27 | Commvault Systems, Inc. | Integrated snapshot interface for a data storage system |
US9934382B2 (en) | 2013-10-28 | 2018-04-03 | Cloudera, Inc. | Virtual machine image encryption |
US9979783B2 (en) | 2014-01-21 | 2018-05-22 | Red Hat, Inc. | Distributed coordinated snapshots |
US20180173778A1 (en) * | 2016-12-16 | 2018-06-21 | Linkedin Corporation | Database uniqueness constraints |
US10042716B2 (en) | 2014-09-03 | 2018-08-07 | Commvault Systems, Inc. | Consolidated processing of storage-array commands using a forwarder media agent in conjunction with a snapshot-control media agent |
US20180225321A1 (en) * | 2017-02-09 | 2018-08-09 | Micron Technology, Inc. | Merge tree garbage metrics |
US10114581B1 (en) * | 2016-12-27 | 2018-10-30 | EMC IP Holding Company LLC | Creating a virtual access point in time on an object based journal replication |
WO2018200475A1 (en) * | 2017-04-24 | 2018-11-01 | Reniac, Inc. | System and method to accelerate compaction |
US10210171B2 (en) | 2014-06-18 | 2019-02-19 | Microsoft Technology Licensing, Llc | Scalable eventual consistency system using logical document journaling |
US10229015B2 (en) | 2016-08-30 | 2019-03-12 | Microsoft Technology Licensing, Llc | Quorum based reliable low latency storage |
US10324911B1 (en) * | 2016-09-30 | 2019-06-18 | Virtustream Ip Holding Company Llc | Storage system with bucket contents rebalancer providing adaptive partitioning for database buckets |
US10503753B2 (en) | 2016-03-10 | 2019-12-10 | Commvault Systems, Inc. | Snapshot replication operations based on incremental block change tracking |
US10706106B2 (en) | 2017-02-09 | 2020-07-07 | Micron Technology, Inc. | Merge tree modifications for maintenance operations |
US10719495B2 (en) | 2017-02-09 | 2020-07-21 | Micron Technology, Inc. | Stream selection for multi-stream storage devices |
US10725988B2 (en) | 2017-02-09 | 2020-07-28 | Micron Technology, Inc. | KVS tree |
US10732885B2 (en) | 2018-02-14 | 2020-08-04 | Commvault Systems, Inc. | Block-level live browsing and private writable snapshots using an ISCSI server |
US10754995B2 (en) * | 2017-10-05 | 2020-08-25 | Zadara Storage, Inc. | Consistency across key value stores with shared journals |
US10776211B1 (en) * | 2016-12-27 | 2020-09-15 | EMC IP Holding Company LLC | Methods, systems, and apparatuses to update point in time journal using map reduce to create a highly parallel update |
WO2020232096A1 (en) * | 2019-05-13 | 2020-11-19 | Snowflake Inc. | Journaled tables in database systems |
US10852978B2 (en) | 2018-12-14 | 2020-12-01 | Micron Technology, Inc. | Key-value store using journaling with selective data storage format |
US10909072B2 (en) * | 2018-08-02 | 2021-02-02 | Memverge, Inc. | Key value store snapshot in a distributed memory object architecture |
US10915546B2 (en) | 2018-10-10 | 2021-02-09 | Micron Technology, Inc. | Counter-based compaction of key-value store tree data block |
US10936661B2 (en) | 2018-12-26 | 2021-03-02 | Micron Technology, Inc. | Data tree with order-based node traversal |
KR20210057835A (en) * | 2018-10-10 | 2021-05-21 | 마이크론 테크놀로지, 인크. | Compressed Key-Value Store Tree Data Block Leak |
US11048755B2 (en) | 2018-12-14 | 2021-06-29 | Micron Technology, Inc. | Key-value store tree with selective use of key portion |
US11061609B2 (en) | 2018-08-02 | 2021-07-13 | MemVerge, Inc | Distributed memory object method and system enabling memory-speed data access in a distributed environment |
US20210271653A1 (en) * | 2015-05-07 | 2021-09-02 | Cloudera, Inc. | Mutations in a column store |
US11134055B2 (en) | 2018-08-02 | 2021-09-28 | Memverge, Inc. | Naming service in a distributed memory object architecture |
WO2021257263A1 (en) * | 2020-06-18 | 2021-12-23 | Netflix, Inc. | Techniques for generating a consistent view of an eventually consistent database |
US20220006862A1 (en) * | 2020-07-01 | 2022-01-06 | Jpmorgan Chase Bank, N.A. | Method and system for an object proxy service |
WO2022106977A1 (en) * | 2020-11-18 | 2022-05-27 | Ownbackup Ltd. | Continuous data protection using retroactive backup snapshots |
US11397717B2 (en) * | 2018-05-15 | 2022-07-26 | Palantir Technologies, Inc. | Data storage system and method |
US11429595B2 (en) * | 2020-04-01 | 2022-08-30 | Marvell Asia Pte Ltd. | Persistence of write requests in a database proxy |
US11662909B2 (en) | 2014-11-24 | 2023-05-30 | Pure Storage, Inc | Metadata management in a storage system |
US11797600B2 (en) | 2020-11-18 | 2023-10-24 | Ownbackup Ltd. | Time-series analytics for database management systems |
US12045254B2 (en) * | 2016-10-03 | 2024-07-23 | Ocient Inc. | Randomized data distribution in highly parallel database management system |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8639781B1 (en) * | 2012-10-19 | 2014-01-28 | Dropbox, Inc. | Systems and methods for downloading files |
US9747339B2 (en) | 2015-06-04 | 2017-08-29 | Getgo, Inc. | Server-based management for querying eventually-consistent database |
US11468368B2 (en) | 2015-10-28 | 2022-10-11 | Qomplx, Inc. | Parametric modeling and simulation of complex systems using large datasets and heterogeneous data structures |
US11074652B2 (en) | 2015-10-28 | 2021-07-27 | Qomplx, Inc. | System and method for model-based prediction using a distributed computational graph workflow |
US9858011B2 (en) * | 2015-12-16 | 2018-01-02 | International Business Machines Corporation | Repopulating failed replicas through modified consensus recovery |
US11080242B1 (en) * | 2016-03-30 | 2021-08-03 | EMC IP Holding Company LLC | Multi copy journal consolidation |
US10346062B2 (en) | 2016-11-16 | 2019-07-09 | International Business Machines Corporation | Point-in-time backups via a storage controller to an object storage cloud |
US10789205B1 (en) * | 2017-04-30 | 2020-09-29 | EMC IP Holding Company LLC | Cloud data archiving using promoted objects list |
US10635597B2 (en) | 2018-02-28 | 2020-04-28 | Citrix Systems, Inc. | Read caching with early refresh for eventually-consistent data store |
WO2019171153A1 (en) * | 2018-03-09 | 2019-09-12 | Pratik Sharma | Snapshot service for clusters |
US11012760B2 (en) | 2018-10-03 | 2021-05-18 | Wanjeru Kingori | System and method for branching-plot video content and editing thereof |
US10839377B2 (en) | 2019-01-25 | 2020-11-17 | Coinbase, Inc. | Syncing blockchain nodes with snapshots |
US11080144B2 (en) | 2019-01-25 | 2021-08-03 | Coinbase, Inc. | System and method for managing blockchain nodes |
US11561999B2 (en) * | 2019-01-31 | 2023-01-24 | Rubrik, Inc. | Database recovery time objective optimization with synthetic snapshots |
EP4031990A4 (en) * | 2019-09-17 | 2023-10-11 | Coinbase, Inc. | System and method for managing blockchain nodes |
US11113311B2 (en) | 2019-12-20 | 2021-09-07 | Walmart Apollo, Llc | Technology agnostic system and method for achieving eventually-consistent data replication |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0916607A (en) * | 1995-06-26 | 1997-01-17 | Hitachi Ltd | Method for managing index in data base management system |
US7120572B1 (en) | 2000-01-06 | 2006-10-10 | Sun Microsystems, Inc. | Memory efficient program pre-execution verifier and method |
US7240171B2 (en) * | 2004-01-23 | 2007-07-03 | International Business Machines Corporation | Method and system for ensuring consistency of a group |
US7392324B2 (en) | 2004-08-13 | 2008-06-24 | International Business Machines Corporation | Consistent snapshots of dynamic heterogeneously managed data |
CA2542379A1 (en) | 2006-04-07 | 2007-10-07 | Cognos Incorporated | Packaged warehouse solution system |
US7882071B2 (en) | 2006-08-18 | 2011-02-01 | Isilon Systems, Inc. | Systems and methods for a snapshot of data |
US7685378B2 (en) * | 2007-02-15 | 2010-03-23 | Hitachi, Ltd. | Methods and apparatus for adjusting a journal area for continuous data protection |
US7827350B1 (en) * | 2007-04-27 | 2010-11-02 | Netapp, Inc. | Method and system for promoting a snapshot in a distributed file system |
US9483525B2 (en) * | 2007-04-30 | 2016-11-01 | Microsoft Technology Licensing, Llc | Reducing update conflicts when maintaining views |
US8918365B2 (en) * | 2009-06-19 | 2014-12-23 | Blekko, Inc. | Dedicating disks to reading or writing |
US8655859B2 (en) * | 2010-03-01 | 2014-02-18 | International Business Machines Corporation | Concurrency control for extraction, transform, load processes |
US9015126B2 (en) | 2010-05-22 | 2015-04-21 | Nokia Corporation | Method and apparatus for eventually consistent delete in a distributed data store |
US20130073845A1 (en) * | 2010-05-28 | 2013-03-21 | Nec Corporation | Anonymous credential system, user device, verification device, anonymous credential method, and anonymous credential program |
US8583599B2 (en) * | 2010-11-29 | 2013-11-12 | Ca, Inc. | Reducing data duplication in cloud storage |
US8626713B2 (en) * | 2010-12-08 | 2014-01-07 | International Business Machines Corporation | Multiple contexts in a redirect on write file system |
US9122720B2 (en) * | 2011-06-14 | 2015-09-01 | Microsoft Technology Licensing, Llc | Enriching database query responses using data from external data sources |
WO2013005245A1 (en) * | 2011-07-01 | 2013-01-10 | Hitachi, Ltd. | Storage system and controlling method of the same |
2012
- 2012-02-17 US US13/399,467 patent/US9613104B2/en active Active
2013
- 2013-02-15 DK DK13749480.3T patent/DK2815304T3/en active
- 2013-02-15 EP EP13749480.3A patent/EP2815304B1/en active Active
- 2013-02-15 WO PCT/US2013/026513 patent/WO2013123449A1/en active Application Filing
2017
- 2017-03-31 US US15/476,926 patent/US10942812B2/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030115301A1 (en) * | 2001-11-12 | 2003-06-19 | (Nokia Corporation) | Arrangement of data synchronization in a telecommunications system |
US20110295815A1 (en) * | 2010-05-26 | 2011-12-01 | International Business Machines Corporation | Proactive Detection of Data Inconsistencies in a Storage System Point-in-Time Copy of Data |
Cited By (150)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9619341B2 (en) | 2003-11-13 | 2017-04-11 | Commvault Systems, Inc. | System and method for performing an image level snapshot and for restoring partial volume data |
US9082127B2 (en) | 2010-03-31 | 2015-07-14 | Cloudera, Inc. | Collecting and aggregating datasets for analysis |
US9317572B2 (en) | 2010-03-31 | 2016-04-19 | Cloudera, Inc. | Configuring a system to collect and aggregate datasets |
US9201910B2 (en) | 2010-03-31 | 2015-12-01 | Cloudera, Inc. | Dynamically processing an event using an extensible data model |
US9081888B2 (en) | 2010-03-31 | 2015-07-14 | Cloudera, Inc. | Collecting and aggregating log data with fault tolerance |
US20130080393A1 (en) * | 2011-09-23 | 2013-03-28 | Red Lambda, Inc. | System and Method for Storing Stream Data in Distributed Relational Tables with Data Provenance |
US9390147B2 (en) * | 2011-09-23 | 2016-07-12 | Red Lambda, Inc. | System and method for storing stream data in distributed relational tables with data provenance |
US9128949B2 (en) | 2012-01-18 | 2015-09-08 | Cloudera, Inc. | Memory allocation buffer for reduction of heap fragmentation |
US9172608B2 (en) | 2012-02-07 | 2015-10-27 | Cloudera, Inc. | Centralized configuration and monitoring of a distributed computing cluster |
US9716624B2 (en) | 2012-02-07 | 2017-07-25 | Cloudera, Inc. | Centralized configuration of a distributed computing cluster |
US9898371B2 (en) | 2012-03-07 | 2018-02-20 | Commvault Systems, Inc. | Data storage system utilizing proxy device for storage operations |
US9928146B2 (en) | 2012-03-07 | 2018-03-27 | Commvault Systems, Inc. | Data storage system utilizing proxy device for storage operations |
US9471578B2 (en) | 2012-03-07 | 2016-10-18 | Commvault Systems, Inc. | Data storage system utilizing proxy device for storage operations |
US11431798B2 (en) * | 2012-03-15 | 2022-08-30 | Onapp Limited | Data storage system |
US20130246568A1 (en) * | 2012-03-15 | 2013-09-19 | Onapp Limited | Data storage system |
US9405692B2 (en) | 2012-03-21 | 2016-08-02 | Cloudera, Inc. | Data processing performance enhancement in a distributed file system |
US9338008B1 (en) | 2012-04-02 | 2016-05-10 | Cloudera, Inc. | System and method for secure release of secret information over a network |
US20130282668A1 (en) * | 2012-04-20 | 2013-10-24 | Cloudera, Inc. | Automatic repair of corrupt hbases |
US9842126B2 (en) * | 2012-04-20 | 2017-12-12 | Cloudera, Inc. | Automatic repair of corrupt HBases |
US11269543B2 (en) | 2012-04-23 | 2022-03-08 | Commvault Systems, Inc. | Integrated snapshot interface for a data storage system |
US9928002B2 (en) | 2012-04-23 | 2018-03-27 | Commvault Systems, Inc. | Integrated snapshot interface for a data storage system |
US10698632B2 (en) | 2012-04-23 | 2020-06-30 | Commvault Systems, Inc. | Integrated snapshot interface for a data storage system |
US9747317B2 (en) | 2012-06-05 | 2017-08-29 | International Business Machines Corporation | Preserving past states of file system nodes |
US20150178309A1 (en) * | 2012-06-05 | 2015-06-25 | International Business Machines Corporation | Preserving a state using snapshots with selective tuple versioning |
US9569458B2 (en) * | 2012-06-05 | 2017-02-14 | International Business Machines Corporation | Preserving a state using snapshots with selective tuple versioning |
US9886346B2 (en) | 2013-01-11 | 2018-02-06 | Commvault Systems, Inc. | Single snapshot for multiple agents |
US11847026B2 (en) | 2013-01-11 | 2023-12-19 | Commvault Systems, Inc. | Single snapshot for multiple agents |
US10853176B2 (en) | 2013-01-11 | 2020-12-01 | Commvault Systems, Inc. | Single snapshot for multiple agents |
US9342557B2 (en) | 2013-03-13 | 2016-05-17 | Cloudera, Inc. | Low latency query engine for Apache Hadoop |
US20180067826A1 (en) * | 2013-08-26 | 2018-03-08 | Vmware, Inc. | Distributed transaction log |
US10769036B2 (en) * | 2013-08-26 | 2020-09-08 | Vmware, Inc. | Distributed transaction log |
US9477731B2 (en) | 2013-10-01 | 2016-10-25 | Cloudera, Inc. | Background format optimization for enhanced SQL-like queries in Hadoop |
US9934382B2 (en) | 2013-10-28 | 2018-04-03 | Cloudera, Inc. | Virtual machine image encryption |
US9690671B2 (en) | 2013-11-01 | 2017-06-27 | Cloudera, Inc. | Manifest-based snapshots in distributed computing environments |
US11768739B2 (en) * | 2013-11-01 | 2023-09-26 | Cloudera, Inc. | Manifest-based snapshots in distributed computing environments |
US12007846B2 (en) | 2013-11-01 | 2024-06-11 | Cloudera, Inc. | Manifest-based snapshots in distributed computing environments |
US20200356447A1 (en) * | 2013-11-01 | 2020-11-12 | Cloudera, Inc. | Manifest-based snapshots in distributed computing environments |
EP3080953A4 (en) * | 2013-12-13 | 2017-05-17 | Bloomreach Inc. | Distributed and fast data storage layer for large scale web data services |
US10719562B2 (en) * | 2013-12-13 | 2020-07-21 | BloomReach Inc. | Distributed and fast data storage layer for large scale web data services |
US20150169624A1 (en) * | 2013-12-13 | 2015-06-18 | BloomReach Inc. | Distributed and fast data storage layer for large scale web data services |
EP3080953A1 (en) * | 2013-12-13 | 2016-10-19 | Bloomreach Inc. | Distributed and fast data storage layer for large scale web data services |
US9979783B2 (en) | 2014-01-21 | 2018-05-22 | Red Hat, Inc. | Distributed coordinated snapshots |
US20150212894A1 (en) * | 2014-01-24 | 2015-07-30 | Commvault Systems, Inc. | Restoring application data from a single snapshot for multiple applications |
US9639426B2 (en) | 2014-01-24 | 2017-05-02 | Commvault Systems, Inc. | Single snapshot for multiple applications |
US9753812B2 (en) | 2014-01-24 | 2017-09-05 | Commvault Systems, Inc. | Generating mapping information for single snapshot for multiple applications |
US10671484B2 (en) | 2014-01-24 | 2020-06-02 | Commvault Systems, Inc. | Single snapshot for multiple applications |
US9632874B2 (en) | 2014-01-24 | 2017-04-25 | Commvault Systems, Inc. | Database application backup in single snapshot for multiple applications |
US10223365B2 (en) | 2014-01-24 | 2019-03-05 | Commvault Systems, Inc. | Snapshot readiness checking and reporting |
US9892123B2 (en) | 2014-01-24 | 2018-02-13 | Commvault Systems, Inc. | Snapshot readiness checking and reporting |
US12056014B2 (en) | 2014-01-24 | 2024-08-06 | Commvault Systems, Inc. | Single snapshot for multiple applications |
US10572444B2 (en) | 2014-01-24 | 2020-02-25 | Commvault Systems, Inc. | Operation readiness checking and reporting |
US10942894B2 (en) | 2014-01-24 | 2021-03-09 | Commvault Systems, Inc | Operation readiness checking and reporting |
CN103744628A (en) * | 2014-01-27 | 2014-04-23 | 北京奇虎科技有限公司 | SSTable file storage method and device |
CN105095287A (en) * | 2014-05-14 | 2015-11-25 | 华为技术有限公司 | LSM (Log Structured Merge) data compact method and device |
US20150333978A1 (en) * | 2014-05-19 | 2015-11-19 | Inventec Appliances Corp. | System, method and computer readable media storage program therein for allocating cloud resource |
US10503554B2 (en) * | 2014-05-19 | 2019-12-10 | Inventec Appliances Corp. | System, method and computer readable media storage program therein for allocating cloud resource |
US10210171B2 (en) | 2014-06-18 | 2019-02-19 | Microsoft Technology Licensing, Llc | Scalable eventual consistency system using logical document journaling |
EP2980702A1 (en) * | 2014-07-31 | 2016-02-03 | Deutsche Telekom AG | Method for enhancing the generation of a backup copy of data items of a distributed data structure, computer network for enhancing the generation of a backup copy of data items of a distributed data structure, program and computer program product |
US9774672B2 (en) | 2014-09-03 | 2017-09-26 | Commvault Systems, Inc. | Consolidated processing of storage-array commands by a snapshot-control media agent |
US10044803B2 (en) | 2014-09-03 | 2018-08-07 | Commvault Systems, Inc. | Consolidated processing of storage-array commands by a snapshot-control media agent |
US10891197B2 (en) | 2014-09-03 | 2021-01-12 | Commvault Systems, Inc. | Consolidated processing of storage-array commands using a forwarder media agent in conjunction with a snapshot-control media agent |
US10798166B2 (en) | 2014-09-03 | 2020-10-06 | Commvault Systems, Inc. | Consolidated processing of storage-array commands by a snapshot-control media agent |
US10042716B2 (en) | 2014-09-03 | 2018-08-07 | Commvault Systems, Inc. | Consolidated processing of storage-array commands using a forwarder media agent in conjunction with a snapshot-control media agent |
US11245759B2 (en) | 2014-09-03 | 2022-02-08 | Commvault Systems, Inc. | Consolidated processing of storage-array commands by a snapshot-control media agent |
US10419536B2 (en) | 2014-09-03 | 2019-09-17 | Commvault Systems, Inc. | Consolidated processing of storage-array commands by a snapshot-control media agent |
US9747333B2 (en) | 2014-10-08 | 2017-08-29 | Cloudera, Inc. | Querying operating system state on multiple machines declaratively |
US9448731B2 (en) | 2014-11-14 | 2016-09-20 | Commvault Systems, Inc. | Unified snapshot storage management |
US10628266B2 (en) | 2014-11-14 | 2020-04-21 | Commvault Systems, Inc. | Unified snapshot storage management |
US11507470B2 (en) | 2014-11-14 | 2022-11-22 | Commvault Systems, Inc. | Unified snapshot storage management |
US9648105B2 (en) | 2014-11-14 | 2017-05-09 | Commvault Systems, Inc. | Unified snapshot storage management, using an enhanced storage manager and enhanced media agents |
US10521308B2 (en) | 2014-11-14 | 2019-12-31 | Commvault Systems, Inc. | Unified snapshot storage management, using an enhanced storage manager and enhanced media agents |
US9921920B2 (en) | 2014-11-14 | 2018-03-20 | Commvault Systems, Inc. | Unified snapshot storage management, using an enhanced storage manager and enhanced media agents |
US9996428B2 (en) | 2014-11-14 | 2018-06-12 | Commvault Systems, Inc. | Unified snapshot storage management |
CN105677673A (en) * | 2014-11-20 | 2016-06-15 | 阿里巴巴集团控股有限公司 | Business processing method, device and system |
US11662909B2 (en) | 2014-11-24 | 2023-05-30 | Pure Storage, Inc | Metadata management in a storage system |
US20160217043A1 (en) * | 2015-01-28 | 2016-07-28 | DataStax | Backup to and Restore from an Offsite Backup Location |
US10402275B2 (en) * | 2015-01-28 | 2019-09-03 | DataStax | Backup to and restore from an offsite backup location |
US20160217044A1 (en) * | 2015-01-28 | 2016-07-28 | DataStax | Backup to and Clone from an Offsite Backup Location |
US10402276B2 (en) * | 2015-01-28 | 2019-09-03 | DataStax | Backup to and clone from an offsite backup location |
US20160259794A1 (en) * | 2015-03-03 | 2016-09-08 | Taser International, Inc. | Automated Integration Of Video Evidence With Data Records |
US11237918B2 (en) * | 2015-03-03 | 2022-02-01 | Axon Enterprise, Inc. | Automated integration of video evidence with data records |
US9697241B1 (en) | 2015-03-19 | 2017-07-04 | EMC IP Holding Company LLC | Data fabric layer having nodes associated with virtual storage volumes of underlying storage infrastructure layer |
US10891264B2 (en) * | 2015-04-30 | 2021-01-12 | Vmware, Inc. | Distributed, scalable key-value store |
US20160321294A1 (en) * | 2015-04-30 | 2016-11-03 | Vmware, Inc. | Distributed, Scalable Key-Value Store |
US20210271653A1 (en) * | 2015-05-07 | 2021-09-02 | Cloudera, Inc. | Mutations in a column store |
US20170111434A1 (en) * | 2015-10-14 | 2017-04-20 | International Business Machines Corporation | Geographically distributed highly available mailbox |
US10681113B2 (en) * | 2015-10-14 | 2020-06-09 | International Business Machines Corporation | Geographically distributed highly available mailbox |
US20170109376A1 (en) * | 2015-10-20 | 2017-04-20 | Samsung Sds Co., Ltd. | Method for managing data using in-memory database and apparatus thereof |
US10133757B2 (en) * | 2015-10-20 | 2018-11-20 | Samsung Sds Co., Ltd. | Method for managing data using in-memory database and apparatus thereof |
US10545959B2 (en) * | 2015-12-22 | 2020-01-28 | Teradata Us, Inc. | Method and a system for efficient data sorting |
US20170235814A1 (en) * | 2015-12-22 | 2017-08-17 | Jeremy L. Branscome | Method and a System for Efficient Data Sorting |
WO2017117595A1 (en) * | 2015-12-31 | 2017-07-06 | Fractal Industries, Inc. | Distributed system for large volume deep web data extraction |
US10210255B2 (en) | 2015-12-31 | 2019-02-19 | Fractal Industries, Inc. | Distributed system for large volume deep web data extraction |
US10503753B2 (en) | 2016-03-10 | 2019-12-10 | Commvault Systems, Inc. | Snapshot replication operations based on incremental block change tracking |
US11238064B2 (en) | 2016-03-10 | 2022-02-01 | Commvault Systems, Inc. | Snapshot replication operations based on incremental block change tracking |
US11836156B2 (en) | 2016-03-10 | 2023-12-05 | Commvault Systems, Inc. | Snapshot replication operations based on incremental block change tracking |
US10229015B2 (en) | 2016-08-30 | 2019-03-12 | Microsoft Technology Licensing, Llc | Quorum based reliable low latency storage |
US10324911B1 (en) * | 2016-09-30 | 2019-06-18 | Virtustream Ip Holding Company Llc | Storage system with bucket contents rebalancer providing adaptive partitioning for database buckets |
US12045254B2 (en) * | 2016-10-03 | 2024-07-23 | Ocient Inc. | Randomized data distribution in highly parallel database management system |
US20180173778A1 (en) * | 2016-12-16 | 2018-06-21 | Linkedin Corporation | Database uniqueness constraints |
US10776211B1 (en) * | 2016-12-27 | 2020-09-15 | EMC IP Holding Company LLC | Methods, systems, and apparatuses to update point in time journal using map reduce to create a highly parallel update |
US10114581B1 (en) * | 2016-12-27 | 2018-10-30 | EMC IP Holding Company LLC | Creating a virtual access point in time on an object based journal replication |
US10706106B2 (en) | 2017-02-09 | 2020-07-07 | Micron Technology, Inc. | Merge tree modifications for maintenance operations |
KR20190113942A (en) * | 2017-02-09 | 2019-10-08 | 마이크론 테크놀로지, 인크. | Merge Tree Garbage Metrics |
US10719495B2 (en) | 2017-02-09 | 2020-07-21 | Micron Technology, Inc. | Stream selection for multi-stream storage devices |
CN110291518A (en) * | 2017-02-09 | 2019-09-27 | 美光科技公司 | Merging tree garbage indicators |
US20200334295A1 (en) * | 2017-02-09 | 2020-10-22 | Micron Technology, Inc. | Merge tree garbage metrics |
US10725988B2 (en) | 2017-02-09 | 2020-07-28 | Micron Technology, Inc. | KVS tree |
US20180225321A1 (en) * | 2017-02-09 | 2018-08-09 | Micron Technology, Inc. | Merge tree garbage metrics |
KR102289332B1 (en) | 2017-02-09 | 2021-08-17 | 마이크론 테크놀로지, 인크. | Merge Tree Garbage Metrics |
TWI702506B (en) * | 2017-02-09 | 2020-08-21 | 美商美光科技公司 | System, machine readable medium, and machine-implemenated method for merge tree garbage metrics |
US10706105B2 (en) * | 2017-02-09 | 2020-07-07 | Micron Technology, Inc. | Merge tree garbage metrics |
US11126600B2 (en) * | 2017-04-24 | 2021-09-21 | Reniac, Inc. | System and method to accelerate compaction |
WO2018200475A1 (en) * | 2017-04-24 | 2018-11-01 | Reniac, Inc. | System and method to accelerate compaction |
JP2020536339A (en) * | 2017-10-05 | 2020-12-10 | ザダラ ストレージ インコーポレイテッド | Consistency between key-value stores, including shared journals |
JP7423534B2 (en) | 2017-10-05 | 2024-01-29 | ザダラ ストレージ インコーポレイテッド | Consistency between key-value stores with shared journals |
US10754995B2 (en) * | 2017-10-05 | 2020-08-25 | Zadara Storage, Inc. | Consistency across key value stores with shared journals |
US10732885B2 (en) | 2018-02-14 | 2020-08-04 | Commvault Systems, Inc. | Block-level live browsing and private writable snapshots using an ISCSI server |
US11422732B2 (en) | 2018-02-14 | 2022-08-23 | Commvault Systems, Inc. | Live browsing and private writable environments based on snapshots and/or backup copies provided by an ISCSI server |
US10740022B2 (en) | 2018-02-14 | 2020-08-11 | Commvault Systems, Inc. | Block-level live browsing and private writable backup copies using an ISCSI server |
US11397717B2 (en) * | 2018-05-15 | 2022-07-26 | Palantir Technologies, Inc. | Data storage system and method |
US10909072B2 (en) * | 2018-08-02 | 2021-02-02 | Memverge, Inc. | Key value store snapshot in a distributed memory object architecture |
US11061609B2 (en) | 2018-08-02 | 2021-07-13 | MemVerge, Inc | Distributed memory object method and system enabling memory-speed data access in a distributed environment |
US11134055B2 (en) | 2018-08-02 | 2021-09-28 | Memverge, Inc. | Naming service in a distributed memory object architecture |
KR102665979B1 (en) * | 2018-10-10 | 2024-05-16 | 마이크론 테크놀로지, 인크. | Compressed key-value store tree data blocks leaked |
US11100071B2 (en) | 2018-10-10 | 2021-08-24 | Micron Technology, Inc. | Key-value store tree data block spill with compaction |
CN113168408A (en) * | 2018-10-10 | 2021-07-23 | 美光科技公司 | Data block overflow using compressed key value storage tree |
US10915546B2 (en) | 2018-10-10 | 2021-02-09 | Micron Technology, Inc. | Counter-based compaction of key-value store tree data block |
KR20210057835A (en) * | 2018-10-10 | 2021-05-21 | 마이크론 테크놀로지, 인크. | Compressed Key-Value Store Tree Data Block Leak |
US11599552B2 (en) | 2018-10-10 | 2023-03-07 | Micron Technology, Inc. | Counter-based compaction of key-value store tree data block |
US11334270B2 (en) | 2018-12-14 | 2022-05-17 | Micron Technology, Inc. | Key-value store using journaling with selective data storage format |
US11048755B2 (en) | 2018-12-14 | 2021-06-29 | Micron Technology, Inc. | Key-value store tree with selective use of key portion |
US10852978B2 (en) | 2018-12-14 | 2020-12-01 | Micron Technology, Inc. | Key-value store using journaling with selective data storage format |
US10936661B2 (en) | 2018-12-26 | 2021-03-02 | Micron Technology, Inc. | Data tree with order-based node traversal |
US11657092B2 (en) | 2018-12-26 | 2023-05-23 | Micron Technology, Inc. | Data tree with order-based node traversal |
KR102471196B1 (en) * | 2019-05-13 | 2022-11-28 | 스노우플레이크 인코포레이티드 | Journaled tables in database systems |
US10990576B2 (en) | 2019-05-13 | 2021-04-27 | Snowflake Inc. | Providing snapshots of journal tables |
WO2020232096A1 (en) * | 2019-05-13 | 2020-11-19 | Snowflake Inc. | Journaled tables in database systems |
US10846277B1 (en) | 2019-05-13 | 2020-11-24 | Snowflake Inc. | Journaled tables in database systems |
CN112534396A (en) * | 2019-05-13 | 2021-03-19 | 斯诺弗雷克公司 | Diary watch in database system |
KR20210133995A (en) * | 2019-05-13 | 2021-11-08 | 스노우플레이크 인코포레이티드 | Journaled tables in the database system |
US11080257B2 (en) * | 2019-05-13 | 2021-08-03 | Snowflake Inc. | Journaled tables in database systems |
US10997148B2 (en) | 2019-05-13 | 2021-05-04 | Snowflake Inc. | Processing transactions on journaled tables |
US11429595B2 (en) * | 2020-04-01 | 2022-08-30 | Marvell Asia Pte Ltd. | Persistence of write requests in a database proxy |
WO2021257263A1 (en) * | 2020-06-18 | 2021-12-23 | Netflix, Inc. | Techniques for generating a consistent view of an eventually consistent database |
US20220006862A1 (en) * | 2020-07-01 | 2022-01-06 | Jpmorgan Chase Bank, N.A. | Method and system for an object proxy service |
US11888936B2 (en) * | 2020-07-01 | 2024-01-30 | Jpmorgan Chase Bank, N.A. | Method and system for an object proxy service |
US11797600B2 (en) | 2020-11-18 | 2023-10-24 | Ownbackup Ltd. | Time-series analytics for database management systems |
WO2022106977A1 (en) * | 2020-11-18 | 2022-05-27 | Ownbackup Ltd. | Continuous data protection using retroactive backup snapshots |
US11630816B2 (en) | 2020-11-18 | 2023-04-18 | Ownbackup Ltd. | Continuous data protection using retroactive backup snapshots |
Also Published As
Publication number | Publication date |
---|---|
US10942812B2 (en) | 2021-03-09 |
EP2815304A4 (en) | 2015-12-16 |
EP2815304B1 (en) | 2017-04-26 |
DK2815304T3 (en) | 2017-08-14 |
US9613104B2 (en) | 2017-04-04 |
EP2815304A1 (en) | 2014-12-24 |
US20170206140A1 (en) | 2017-07-20 |
WO2013123449A1 (en) | 2013-08-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10942812B2 (en) | System and method for building a point-in-time snapshot of an eventually-consistent data store | |
US10929428B1 (en) | Adaptive database replication for database copies | |
JP6510112B2 (en) | Datastream Capture and Persistence Policy | |
US11487468B2 (en) | Healing failed erasure-coded write attempts in a distributed data storage system configured with fewer storage nodes than data plus parity fragments | |
CA2929777C (en) | Managed service for acquisition, storage and consumption of large-scale data streams | |
CA2929776C (en) | Client-configurable security options for data streams | |
US10585599B2 (en) | System and method for distributed persistent store archival and retrieval in a distributed computing environment | |
US10706021B2 (en) | System and method for supporting persistence partition discovery in a distributed data grid | |
US11789830B2 (en) | Anti-entropy-based metadata recovery in a strongly consistent distributed data storage system | |
CN112969996A (en) | Tracking intermediate changes in database data | |
US20190188309A1 (en) | Tracking changes in mirrored databases | |
Dwivedi et al. | Analytical review on Hadoop Distributed file system | |
KR20100048130A (en) | Distributed storage system based on metadata cluster and method thereof | |
US11683161B2 (en) | Managing encryption keys under group-level encryption | |
US11880495B2 (en) | Processing log entries under group-level encryption | |
US11093465B2 (en) | Object storage system with versioned meta objects | |
Luo et al. | LAYER: A cost-efficient mechanism to support multi-tenant database as a service in cloud | |
US11962686B2 (en) | Encrypting intermediate data under group-level encryption | |
Cooper et al. | PNUTS to sherpa: Lessons from yahoo!'s cloud database | |
US11899811B2 (en) | Processing data pages under group-level encryption | |
US12050549B2 (en) | Client support of multiple fingerprint formats for data file segments | |
US10817388B1 (en) | Recovery of tree data in a geographically distributed environment | |
Ericson et al. | Survey of storage and fault tolerance strategies used in cloud computing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment | Owner name: NETFLIX, INC., CALIFORNIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SMITH, CHARLES;MAGNUSSON, JEFFREY;ANAND, SIDDHARTH;SIGNING DATES FROM 20120215 TO 20120216;REEL/FRAME:027769/0405 |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY; Year of fee payment: 4 |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY; Year of fee payment: 8 |