WO2017191877A1

WO2017191877A1 - Compression device and method for managing provenance

Info

Publication number: WO2017191877A1
Application number: PCT/KR2016/013271
Authority: WO
Inventors: 유재수; 복경수; 한지은
Original assignee: 충북대학교 산학협력단
Priority date: 2016-05-01
Filing date: 2016-11-17
Publication date: 2017-11-09
Also published as: KR101783791B1

Abstract

The present invention relates to a compression device for managing provenance, the device comprising: a provenance generation unit which generates data provenance by receiving history information and a final document, and by using a provenance model; a pre-encoding unit which pre-encodes character string data of the data provenance into numeric string data, stores same in a pre-encoding table, and outputs numeric string data provenance; a final RDF compression unit which receives the numeric string data provenance, encodes a subject and an object together into a numeric string, encodes a predicate solely and separately into a numeric string, stores same in a final RDF data encoding table, generates an encoding provenance graph by using the data stored in the final RDF data encoding table, extracts repeating graph patterns by using the generated encoding provenance graph, stores the number of extracted graph pattern repetitions in a pattern statistics table, stores, in a graph pattern variable table, a subject or an object of each node of the extracted graph patterns according to the order in which the extracted graph patterns were found, and generates a data pattern compression graph for the final document by using the values stored in the graph pattern variable table; and a provenance pattern compression unit which receives the numeric string data provenance, generates a sub-graph having a repeating pattern with reference to activity data in the numeric string data provenance, stores number-of-times information of the sub-graph having a repeating pattern in a sub-graph statistics table, and if the sub-graph having a repeating pattern occurs a preset number of times or more, determines that the sub-graph having a repeating pattern is a reference pattern.

Description

Compression Device and Method for Provenance Management

The present invention relates to a compression apparatus and method for managing provenance, and more particularly, to a compression apparatus and method for managing the provenance of RDF (Resource Description Framework) documents.

Recently, with the development of computing technology and networks, many users have been rapidly producing and sharing data through the Internet, and various studies are being actively conducted to provide such data as an efficient service.

As the amount of information on the web has exploded, there is a need to automatically recognize and search web documents.

Accordingly, the semantic web has emerged as a next-generation web technology that enables a computer to understand and manipulate the meaning of a document.

The Semantic Web was established as a technical standard by the World Wide Web Consortium (W3C). It expresses information about resources in a distributed environment, together with their relationships and semantics, in the form of an ontology that machines can process, and it is a framework that allows automated machines to handle this information.

Research based on the semantic web is currently active, and the RDF (Resource Description Framework) data structure was developed by the W3C to support it.

RDF is a standard for expressing information about resources on the web. It supports common rules about the semantics, syntax, and structure of heterogeneous data.

RDF is represented as a graph and consists of triples, each made up of a subject, a predicate, and an object.

RDF data on the web has increased as more organizations support Linked Open Data (LOD). Currently, more than 10 public institutions, including the JPO, the National Arboretum, and the National History Compilation Committee, provide LOD.

Since these LODs are represented as RDF data structures, more RDF data will be produced in the future, and as the RDF data increases, it becomes important to store data efficiently.

In addition, as the RDF data is continuously generated and changed, it is necessary to manage the source information of the RDF data, that is, information about where the RDF data came from, who created it, and how it was changed.

Also, by managing the usage history data, it is possible to grasp which user did what and how the RDF data has changed.

Provenance has emerged as metadata for managing the source information of such RDF data and its usage history. Provenance data is metadata representing the source information of the data and its history of use.

As a result, provenance can be used to understand what changes were made to the data and who made them.

As a standard model for managing such provenance data, the PROV model was proposed by the W3C.

The PROV model consists of nodes, which are entities, activities, and agents, and of properties that connect them.

An entity (hereinafter referred to as an 'object') represents an RDF document on the semantic web. An activity represents an action such as changing or deleting a document on the semantic web. Finally, an agent represents an individual or organization that performs an activity.

These nodes are organically connected. Managing provenance data with the standard PROV model improves the compatibility of semantic web data and allows the data to be searched with a standard query language.

Provenance data is composed of graphs that represent history information, and such graphs contain the same data repeatedly.

For this reason, graph compression is required, and since most semantic web data is represented as RDF data, RDF data compression is required as well.

In addition, since provenance must take the user's history information into account, a provenance compression technique based on the flow of provenance is needed.

Recently, research on compressing provenance data has been conducted.

A. Chapman, H. V. Jagadish, and P. Ramanan, "Efficient provenance storage", In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 993-1006, 2008, proposes three decomposition schemes and two inheritance-based functions for managing provenance data. In that paper, the overlapping parts are decomposed and inherited so that identical parts are stored efficiently.

Y. Xie, KM Reddy, D. Feng, Y. Li, and D. D. E. Long, "Evaluation of a hybrid approach for efficient provenance storage", ACM Transactions on Storage, Vol. 9, No. 4, pp. 14, 2013, proposes a compression method that combines web-graph-based compression and dictionary-based encoding to compress provenance data.

General provenance compression schemes manage redundant data by compressing the overlapping portions. However, no compression technique applies the standard provenance model, and because these schemes compress general-purpose data, they are difficult to apply to provenance data composed of RDF. In addition, the predicate part of the RDF data may be lost when it is compressed with an existing provenance compression technique. Furthermore, these schemes manage the provenance data but not the original RDF document.

In general, provenance data can be tens of times larger than the original data, so provenance data accounts for a large amount of the data on the semantic web.

Provenance data is managed in a way that suits each management technique, but it needs to be managed using a standard model so that various users can use it. In addition, existing provenance management techniques do not manage the original document separately and do not consider RDF data. Moreover, existing RDF data compression techniques do not consider the change history.

Accordingly, a technical problem to be solved by the present invention is to provide a compression scheme for efficiently managing a large amount of RDF provenance data.

Another technical problem to be solved by the present invention is to reduce the storage capacity required for RDF provenance data.

According to an aspect of the present invention, there is provided a compression apparatus for managing provenance, including: a provenance generation unit configured to receive history information and a final document and to generate data provenance using a provenance model; a pre-encoding unit connected to the provenance generation unit and configured to pre-encode string data of the data provenance into numeric string data, store the result in a pre-encoding table, and output numeric string data provenance; a final RDF compression unit connected to the pre-encoding unit and configured to receive the numeric string data provenance, encode the subject and the object together into a numeric string, encode only the predicate separately into a numeric string, store the results in a final RDF data encoding table, generate an encoding provenance graph using the data stored in the final RDF data encoding table, extract repeated graph patterns using the generated encoding provenance graph, store the number of repetitions of each extracted graph pattern in a pattern statistics table, store the subject or object of each node of the extracted graph patterns in a graph pattern variable table in the order in which the extracted graph patterns were found, and generate a data pattern compression graph for the final document using the values stored in the graph pattern variable table; and a provenance pattern compression unit connected to the pre-encoding unit and configured to receive the numeric string data provenance, generate sub-graphs having a repeating pattern in the numeric string data provenance on the basis of activity data, store the number of occurrences of each sub-graph having a repeating pattern in a sub-graph statistics table, and, if a sub-graph having a repeating pattern appears a set number of times or more, set that sub-graph as a reference pattern.

The provenance model may include an object node, an agent node, an activity node, and a metadata node having information about a time and a source.

Preferably, the pre-encoding unit encodes agent nodes, metadata nodes, and object nodes and stores the encoding values in a data table, encodes activity nodes and stores the encoding values in an activity table, and encodes attributes and stores the encoding values in a predicate table.

In accordance with another aspect of the present invention, there is provided a compression method for managing provenance, including: generating data provenance using history information and a final document; pre-encoding string data of the data provenance into numeric string data, storing the result in a pre-encoding table, and outputting numeric string data provenance; receiving the numeric string data provenance, encoding the subject and the object together into a numeric string, encoding only the predicate separately into a numeric string, and storing the results in a final RDF data encoding table; generating an encoding provenance graph using the data stored in the final RDF data encoding table; extracting repeated graph patterns using the generated encoding provenance graph and storing the number of repetitions of each extracted graph pattern in a pattern statistics table; storing the subject or object of each node of the extracted graph patterns in a graph pattern variable table in the order in which the extracted graph patterns were found; and generating a data pattern compression graph for the final document using the values stored in the graph pattern variable table.

According to this feature, since the existing PROV model cannot represent the time of a change or the changed original RDF document, this example represents provenance data with an extended PROV model that extends the standard PROV model, and proposes a compression method for managing large RDF provenance data.

In addition, since provenance data is represented as string data, all data of the PROV model is converted to numeric data through pre-encoding, which reduces the amount of storage required.

In addition, unlike the existing PROV model, the extended PROV model manages the final RDF document that is changed or added, making history tracking easier.

Furthermore, since this example, unlike the existing PROV model, manages the final RDF document, the RDF document is compressed through RDF compression so that the final RDF document does not occupy much storage space.

Finally, in this example, the redundant portions around the activity nodes of the PROV model are compressed into sub-graphs, so that the data is compressed and stored in consideration of its usage history.

FIG. 1 is a block diagram of a compression apparatus for provenance management according to an embodiment of the present invention.

FIG. 2 is a flowchart illustrating an operation of a compression apparatus for provenance management according to an embodiment of the present invention.

FIG. 3 shows an example of an extended PROV model of a compression apparatus for provenance management according to an embodiment of the present invention.

FIG. 4A is an example of data provenance generated according to a conventional PROV model.

FIG. 4B is an example of data provenance generated according to an extended PROV model of a compression apparatus for provenance management according to an embodiment of the present invention.

FIG. 5 is an example of data provenance input to a pre-encoding unit of a compression apparatus for provenance management according to an embodiment of the present invention.

FIG. 6 is an example of numeric string data provenance generated by a pre-encoding operation of a compression apparatus for provenance management according to an embodiment of the present invention.

FIG. 7 is an example of an encoding provenance graph generated according to an embodiment of the present invention.

FIG. 8 is an example of a graph pattern extracted from a final RDF document in accordance with an embodiment of the present invention.

FIGS. 9A and 9B are repeated graph patterns extracted from the graph pattern of FIG. 8.

FIG. 10 is a data pattern compression graph for the final RDF document in accordance with an embodiment of the present invention.

FIG. 11 illustrates a process of extracting a sub-graph from numeric string data provenance in accordance with an embodiment of the present invention.

FIG. 12 is an example of a reference pattern according to an embodiment of the present invention.

FIG. 13 is an example of a pattern-compressed provenance graph according to an embodiment of the present invention.

10: Provenance generation unit 20: Provenance compression unit

21: Pre-encoding unit 22: Final RDF compression unit

23: Provenance pattern compression unit 30: Storage unit

Detailed Description

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art may easily implement the present invention. As those skilled in the art would realize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. In the drawings, parts irrelevant to the description are omitted in order to clearly describe the present invention, and like reference numerals designate like parts throughout the specification.

When a component is referred to as being "connected" or "coupled" to another component, it should be understood that the component may be directly connected or coupled to the other component, but that other components may exist in between. On the other hand, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that there is no other component in between.

Next, a compression apparatus and method for provenance management according to an embodiment of the present invention will be described with reference to the accompanying drawings.

In this example, a provenance data compression method based on the provenance model (PROV model) is used to compress the provenance data.

To examine provenance data, it must be compressed in consideration of the passage of time and the change history of the information. In this example, the provenance data compression method extends the existing PROV model so that it can represent RDF data.

Because time is represented in the extended PROV model, the time of each change can be confirmed. Therefore, by using the extended PROV model, the data is compressed in consideration of the history information. In addition, because the provenance data is represented over time, it is possible to see who modified which document and when.

Since most provenance data consists of strings, it is converted into numeric data through dictionary encoding. Because the extended PROV model manages the RDF documents that are changed, these documents take up a lot of space, so RDF compression is applied to reduce the size of each RDF document.

In addition, to reduce the overall size of the provenance data, the provenance pattern compression module extracts the patterns that are used in the same order on the basis of the activity nodes of the PROV model and compresses the provenance data.

As shown in FIG. 1, a compression device for managing provenance according to an embodiment of the present invention (hereinafter referred to as a 'provenance compression device') includes a provenance generation unit 10 that receives history information and a final document, a provenance compression unit 20 connected to the provenance generation unit 10, and a storage unit 30 connected to the provenance compression unit 20.

The provenance generation unit 10 uses the history information and the final document to generate the provenance of the corresponding data, that is, of the final document (hereinafter, 'the provenance of the corresponding data' is referred to as 'data provenance').

The provenance compression unit 20 includes a pre-encoding unit 21 that encodes the generated data provenance from character strings into numeric strings; a final RDF compression unit 22 that is connected to the pre-encoding unit 21 and compresses the final document in the data provenance encoded as numeric strings (hereinafter referred to as 'numeric string data provenance'); and a provenance pattern compression unit 23 that is connected to the pre-encoding unit 21 and extracts and compresses the history information from the numeric string data provenance.

The storage unit 30 is a storage medium that stores the data and information necessary for the operation of the provenance compression device, the data and information generated during the operation, and the like.

In this example, both the history information and the final document have an RDF data structure, and the provenance is also expressed in RDF format. In addition, since documents in RDF format are represented as graphs, RDF documents such as the final RDF document or the original RDF document, as well as the provenance, are represented as graphs.

Therefore, when the data provenance is supplied by the provenance generation unit 10, the pre-encoding unit 21 converts the data of the data provenance from string data into numeric data through a pre-encoding operation.

Because the string data is converted into numeric data, the storage space required for the numeric string data provenance is reduced.

The final RDF compression unit 22 compresses the final document on the semantic web, that is, the final document in RDF form (the final RDF document). Unlike the existing PROV model, this example manages the final document. The final document refers to a document on the semantic web.

The final RDF compression unit 22 performs an encoding operation, as the pre-encoding unit 21 does, but encodes the final document using a method different from the encoding method used in the pre-encoding unit 21.

That is, in final document compression, the subject (S) and the object (O) are encoded together, while the predicate (P) is encoded separately. Identical patterns are then searched for, and the final document is compressed using the patterns found. Here, two patterns are regarded as identical if their predicates (P) are used in the same way, even if their subjects (S) and objects (O) differ.

Finally, the provenance pattern compression unit 23 extracts sub-graphs from the provenance graph on the basis of the activities of the PROV model. After the sub-graphs are extracted, a sub-graph whose frequency reaches a predetermined value or more is treated as a pattern, and the final graph is rewritten using the patterned information.
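
The frequency test described above can be illustrated with a minimal sketch. It assumes that each extracted sub-graph has already been reduced to a hashable tuple of (predicate ID, activity ID) edges; that representation, the function name `find_reference_patterns`, and the parameter `min_count` are hypothetical and are not taken from the specification.

```python
from collections import Counter

# Hypothetical input: one sub-graph per activity occurrence, written as the
# ordered tuple of (predicate ID, activity ID) edges around that activity.
subgraphs = [
    ((1, 2), (2, 2), (3, 2), (4, 2)),
    ((1, 2), (2, 2), (3, 2), (4, 2)),
    ((1, 1), (2, 1)),
    ((1, 2), (2, 2), (3, 2), (4, 2)),
]

def find_reference_patterns(subgraphs, min_count):
    """Count every sub-graph and keep those seen at least min_count times."""
    stats = Counter(subgraphs)  # plays the role of the sub-graph statistics table
    reference = {sg: n for sg, n in stats.items() if n >= min_count}
    return stats, reference

stats, reference = find_reference_patterns(subgraphs, min_count=3)
```

With a threshold of 3, only the sub-graph that occurs three times becomes a reference pattern; the sub-graph that occurs once stays in the statistics table without being promoted.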

The PROV model is a standard model proposed by the W3C to manage provenance data. Provenance data on the semantic web is not compatible when it is managed by different methods, whereas most semantic web data can be expressed with the standard PROV model.

Thus, in this example, a provenance compression method using the PROV model is used to represent the flow of provenance.

The PROV model is a model for managing provenance data and represents the flow of data. The existing PROV model represents conventional provenance data easily, but it is insufficient for representing RDF documents because it has no node for the RDF documents on the web (i.e., documents with an RDF data structure). In addition, although provenance is created over time, the existing model does not record exactly when a change was made.

For this reason, this example extends the existing PROV model and adds a part representing metadata. Unlike the existing model, the extended model reveals the changed parts of an RDF document on the semantic web and the time of the change.

Therefore, the provenance generation unit 10 according to the present example uses the extended PROV model shown in FIG. 3.

The extended PROV model consists of the nodes (N11 to N13) for the already described objects (entities), activities, and agents, the attributes (used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAttributedTo, ActedOneBehalfOf, wasAssociatedWith, time, source), and a metadata node (N14) added to the existing PROV model. As a result, the data provenance generated by the extended PROV model additionally represents when an RDF document was changed and which RDF document was changed.

An agent is an individual or an organization and represents the person or organization that performed an activity.

The metadata consists of a time and a source and identifies when an activity was executed and which RDF document was modified.

An object represents an RDF document, and an activity represents what was done to that RDF document.

In FIG. 3, the 'used' attribute connects an activity to an object in the graph and represents the object (N11) required for the execution of the activity.

The 'wasGeneratedBy' attribute (i.e., branch) connects an object to an activity and indicates that the object was produced as a result of the activity (N12).

The 'wasDerivedFrom' attribute connects an object to the object from which it was derived.

The 'wasInformedBy' attribute represents the exchange, between activities, of an object created by one activity, and the 'wasAttributedTo' attribute represents an agent's influence on an object.

The 'ActedOneBehalfOf' attribute means that an agent acts on behalf of a specific agent. The attribute that connects an agent to an activity is 'wasAssociatedWith'.

The 'time' attribute connects an activity with the time in the metadata so that it is known when the activity was performed. The 'source' attribute links an activity with the source in the metadata and refers to the RDF document on which the activity is performed.

The following Table 1 describes the definitions for the elements used in the extended PROV model shown in FIG. 3.

Class	Subclass	Explanation
Entity	Document	Document composed of RDF
Agent	Person	Individual who performs an activity
Agent	Organization	Organization that performs an activity
Activity	Insert	When RDF data is inserted into an existing document
Activity	Delete	When RDF data is deleted from an existing document
Activity	Modify	When an existing RDF document is transformed into a new RDF document
Activity	Versioning	When an existing RDF document is newly versioned
Metadata	Time	The time when the activity is performed
Metadata	Source	RDF data or RDF documents to be added, deleted, or changed by an activity

In Table 1, the agent, as already explained, is divided into individuals and organizations and is the party that actually performs an activity.

An object is a document having an RDF data structure, that is, an RDF document.

An activity is one of four types: insert, delete, modify, and versioning.

Metadata is generated when an activity is actually performed and represents the time of the activity or the document to be modified (i.e., the source).

For example, if the change history of Wikipedia is expressed in the PROV model, objects represent Wikipedia pages, and agents represent the individuals who change the pages. Activities are actions such as adding content to a page or creating a new page. Finally, in the metadata, the time is the time when a page is modified or added, and the source is the changed content when a page is changed, or the newly added content when a new page is added.

Next, an example of provenance generated using the existing PROV model will be described with reference to FIG. 4A.

Referring to the provenance shown in FIG. 4A, a new 'Document F' is created by inserting 'Document C' and 'Document D' into a document (not shown), and the generated 'Document F' was created by an individual named 'Jieun'.

In addition, referring to FIG. 4A, a new 'Document X' is generated by an individual named 'Line Art' inserting certain content into 'Document F'.

However, in the case of FIG. 4A, it is not known what content was inserted into 'Document F', which part was changed, or when the change was made.

On the other hand, FIG. 4B illustrates data provenance generated using the extended PROV model according to the present example when 'Document F' and 'Document X' are generated through the same process as in FIG. 4A.

That is, referring to the provenance of FIG. 4B, a new 'Document F' is generated by inserting 'Document C' and 'Document D' into a document (not shown), and the metadata for time (M11) and the metadata for source (M12) show that 'Document F' was created on September 02, 2015 and that RDF data was added.

Also, the metadata (M21, M22) show that a new 'Document X' was created on September 03, 2015 by the individual 'Line Art' adding the corresponding RDF data to the newly generated 'Document F'.

As described above, when the changed RDF data is represented using the metadata M12 and M22 representing the source, and the time of the change is recorded using the metadata M11 and M21 representing the time, it is possible to check when the corresponding provenance data was generated.
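
As an illustration of how the time and source metadata attach to an activity, the 'Document F' part of FIG. 4B might be written as a list of triples, as in the sketch below. The identifier `insert#1`, the placeholder string for the source, and the edge directions (which follow the usual W3C PROV reading of each attribute) are assumptions made for this sketch; the figure itself is not reproduced here.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# Hypothetical rendering of the 'Document F' part of FIG. 4B.
provenance = [
    Triple("insert#1", "used", "Document C"),
    Triple("insert#1", "used", "Document D"),
    Triple("Document F", "wasGeneratedBy", "insert#1"),
    Triple("insert#1", "wasAssociatedWith", "Jieun"),
    Triple("insert#1", "time", "2015-09-02"),          # metadata M11
    Triple("insert#1", "source", "<added RDF data>"),  # metadata M12 (placeholder)
]

def metadata_for(activity, triples):
    """Return the time and source metadata attached to one activity."""
    return {
        t.predicate: t.obj
        for t in triples
        if t.subject == activity and t.predicate in ("time", "source")
    }

print(metadata_for("insert#1", provenance))
```

Without the last two triples, the list corresponds to what the conventional PROV model of FIG. 4A can express: who generated the document, but not when or with what content.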

As such, in the present example, the provenance generation unit 10 adds metadata nodes indicating a time and a source to generate data provenance for the document (or data), and applies the data provenance to the provenance compression unit 20 (S10).

As already explained, data provenance consists of string data and can be tens of times larger than the original data.

For example, in the case of Wikipedia, string data is changed by multiple users on a single page. Therefore, a large amount of storage space is required to store the string data of the data provenance.

Therefore, as described above, the pre-encoding unit 21 converts the string data of the data provenance into numeric data (S20).

To this end, the pre-encoding unit 21 analyzes the input provenance data and encodes each node and attribute (branch).

The number of activity nodes and attributes is smaller than the number of other nodes, and provenance pattern compression is based on the activity nodes. For these reasons, the encoding values are stored in three separate tables: a data table that stores the encoded values of agent nodes, metadata nodes, and object nodes; an activity table that stores the encoded values of activity nodes; and a predicate table that stores the encoded values of attributes.

The data table, activity table, and predicate table may be stored in the storage unit 30 or in the pre-encoding unit 21.

In pre-encoding, the input provenance data is analyzed and the data is encoded through text encoding.

Text encoding is divided into three encodings, one for each of the three tables.

That is, text encoding is performed in input order. When data is input, the text is analyzed to check whether already-encoded data for it exists in the encoding table. If no encoded data is found, the data for the nodes and attributes is encoded and stored in the corresponding predicate table, activity table, or data table.

It is assumed that data provenance as shown in FIG. 5 is input to the pre-encoding unit 21.

Referring to the data provenance shown in FIG. 5, it can be seen that a new 'Document A' is generated by inserting the RDF document corresponding to the metadata M31 into the existing 'Document B'.

When data provenance of this type is input, the pre-encoding unit 21 searches the pre-encoding table stored in the storage unit 30.

Table 2 is an example of a pre-encoding table stored in the storage unit 30. To encode 'Document A', the pre-encoding table is first checked for the data, and a new ID is assigned if the data is not present. The new ID is the last ID in the corresponding table plus 1 (ID + 1).

Encoding character strings as numbers through text encoding in the pre-encoding unit 21 reduces the storage size of the provenance data. As in the pre-encoded data of Table 2, 'Document B' is encoded as 1 and '2015.09.01' is encoded as 2. The ID of the input 'Document A' becomes 3, obtained by adding 1 to 2, the last ID of the data table. In addition, since activities are pre-encoded in a separate activity table, 'insert' is assigned ID 2 because the existing 'modify' entry has ID 1.

ID	String	Subclass
1	Document B	Data
2	2015.09.01	Data
3	Document A	Data
4	Jieun	Data
5	[a z b] [a y c]	Data
1	modify	Activity
2	insert	Activity
1	used	Predicate
2	wasAssociatedWith	Predicate
3	time	Predicate
4	source	Predicate
5	wasGeneratedBy	Predicate

In the pre-encoding table shown in [Table 2], the entries of each table (that is, the data table, the activity table, and the predicate table) are stored with identification numbers (IDs) that increase sequentially by 1.

Information about object nodes, agent nodes, and metadata nodes is stored in the data table, information about activity nodes is stored in the activity table, and attribute information is stored in the predicate table.
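
The look-up-then-assign behaviour of the pre-encoding tables can be sketched as follows. This is a minimal illustration, assuming each table is a plain in-memory dictionary; the class name `PreEncoder` and its method are hypothetical and do not come from the specification.

```python
class PreEncoder:
    """Dictionary encoding with three separate ID spaces."""

    def __init__(self):
        # One table per kind of element; each maps a string to its numeric ID.
        self.tables = {"data": {}, "activity": {}, "predicate": {}}

    def encode(self, table, text):
        entries = self.tables[table]
        if text not in entries:
            # IDs are sequential from 1, so the table size equals the last ID.
            entries[text] = len(entries) + 1
        return entries[text]

enc = PreEncoder()
enc.encode("data", "Document B")      # existing entry, ID 1
enc.encode("data", "2015.09.01")      # existing entry, ID 2
enc.encode("activity", "modify")      # existing entry, ID 1

doc_a = enc.encode("data", "Document A")      # new entry: last ID 2 plus 1
insert_id = enc.encode("activity", "insert")  # new entry in the activity table
```

Here `doc_a` is 3 and `insert_id` is 2, matching the worked example for Table 2. Encoding a string that is already present returns its existing ID without changing the table.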

The pre-encoded data is reflected in the graph and in the subsequent encoding.

When each node and attribute is converted to numeric data as in [Table 2], numeric string data provenance in which every component of the PROV model, such as nodes and attributes, is represented by its corresponding number is generated as shown in FIG. 6 and is input to the provenance compression unit 20.

The data provenance generated by the extended PROV model also manages the RDF data that is changed.

Since RDF data is composed of numerous triples, it takes up a lot of capacity. Accordingly, when the RDF data is large it occupies a lot of storage space, so it is compressed before being stored. Also, RDF data generally has fewer distinct predicates than subjects and objects.

Thus, in this example, RDF graphs in the final RDF data (e.g., the final document) that share the same predicate pattern are patterned on the basis of the predicates.

The variables included in a pattern are managed in a variable table created in the storage unit 30, and each piece of final RDF data is converted into the created patterns and stored in compressed form.

To this end, the final RDF compression unit 22 performs a final RDF compression step S20 (see FIG. 2), which comprises an RDF encoding step S31 consisting of an RDF data analysis step S311 and a text encoding step S312, and an RDF pattern compression step S32 consisting of a pattern extraction step S321 and a final document pattern compression step S322.

Initially, there is a final document (i.e., a final RDF document) on the Semantic Web, to which the source in the metadata points.

When the final RDF document is first input via the metadata, its string data is converted into numeric data through the RDF data analysis step S311.

This conversion into numeric data is performed differently from the encoding scheme of the pre-encoding unit 21.

That is, the pre-encoding unit 21 encodes data sequentially in input order, whereas the final RDF compression unit 22 encodes subjects and objects together into one numeric string and encodes predicates separately into another numeric string. The encoded final RDF document is then compressed through RDF pattern compression, in which patterns that use the same predicates are compressed and stored in the storage unit 30.

The operation of the final RDF compression unit 22 will now be described in more detail.

When data is input, the final RDF compression unit 22 searches for the corresponding encoding ID in the final RDF data encoding table stored in the storage unit 30. If the corresponding encoding ID does not exist in the final RDF data encoding table, the data is encoded with a new ID obtained by adding 1 to the last ID.

As described above, unlike the pre-encoding technique, the final RDF compression unit 22 encodes subjects and objects together and encodes only predicates separately. Predicates are encoded in the order in which they are input.

Next, [Table 3] shows an example of the final RDF data encoding table generated by the RDF data analysis step of the final RDF compression unit 22.

In Table 3, the elements (A, B, G, C, O, W, X, P, J, Q, S, H, K, V) listed in the String column of the 'subject, object' subclass are words (i.e., nouns, such as 'article' or 'Kim Young-Chul') used as subjects or objects in the final RDF document. The elements (D, F, G, P, Q, W, S, X) listed in the String column of the 'predicate' subclass are verbs used as predicates in the final RDF document (e.g., 'submit', 'compose'). These nouns and verbs are represented by letters for convenience of illustration.

Subclass	ID	String	ID	String
Subject, object	1	A	8	P
	2	B	9	J
	3	G	10	Q
	4	C	11	S
	5	O	12	H
	6	W	13	K
	7	X	14	V
Predicate	1	D	5	Q
	2	F	6	W
	3	G	7	S
	4	P	8	X

Referring to Table 3, the final RDF document contains a total of 14 different subjects or objects (A, B, G, C, O, W, X, P, J, Q, S, H, K, V) and a total of eight predicates.

For example, if A already exists in the final RDF data encoding table stored in the storage unit 30, the existing entry is retrieved and used. When a subject or object that does not exist in the existing table appears (e.g., K), it is encoded with a new ID of 13, obtained by adding 1 to the last ID of the existing table (e.g., 12 + 1 = 13).

For predicates, an ID is assigned only once, even when the same predicate is repeated many times. For example, even though the predicates 'D' and 'F' are extracted repeatedly, 'D' is assigned the ID 1 and 'F' is assigned the ID 2.
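The encoding rule of the final RDF compression unit 22 can be illustrated with a short sketch. It assumes that the final RDF document is available as a list of (subject, predicate, object) string triples; the function name `encode_triples` and the sample triples are hypothetical and are not taken from the figures.

```python
def encode_triples(triples, node_ids=None, predicate_ids=None):
    """Encode subjects and objects with one shared ID table and predicates with another."""
    node_ids = {} if node_ids is None else node_ids
    predicate_ids = {} if predicate_ids is None else predicate_ids

    def lookup(table, text):
        if text not in table:
            # Not found: assign the last ID + 1 (IDs start at 1).
            table[text] = len(table) + 1
        return table[text]

    encoded = []
    for subject, predicate, obj in triples:
        s_id = lookup(node_ids, subject)
        p_id = lookup(predicate_ids, predicate)
        o_id = lookup(node_ids, obj)
        encoded.append((s_id, p_id, o_id))
    return encoded, node_ids, predicate_ids


# Hypothetical triples; letters stand in for nouns and verbs as in Table 3.
triples = [("A", "D", "B"), ("A", "F", "G"), ("G", "D", "C")]
encoded, node_ids, predicate_ids = encode_triples(triples)
```

Because 'A' and 'G' each occur in both roles but share one table, each keeps a single ID, and the repeated predicate 'D' is assigned its ID only once.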

When the final RDF document has been analyzed and the final RDF data encoding table has been created (S311), the final RDF compression unit 22 proceeds to the text encoding step S312 and reconstructs the provenance graph using the encoded data (that is, generates an encoding provenance graph).

An example of an encoding provenance graph is shown in FIG. 7. In FIG. 7, the value of each node is the ID assigned in the 'subject, object' part, the direction of the arrow connecting two nodes is determined by whether each node is a subject or an object, and the number above each arrow is the ID assigned in the 'predicate' part.
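One possible in-memory form of such a graph is an adjacency list keyed by node ID. The sketch below assumes that each arrow runs from the subject node to the object node, which is an assumption about FIG. 7 rather than something stated in the text; `build_graph` is a hypothetical helper name.

```python
from collections import defaultdict


def build_graph(encoded_triples):
    """Map each subject ID to its outgoing (predicate ID, object ID) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in encoded_triples:
        # Edge: subject --predicate--> object (direction is an assumption).
        graph[subject].append((predicate, obj))
    return dict(graph)


# Hypothetical encoded triples of the form (subject ID, predicate ID, object ID).
graph = build_graph([(1, 1, 2), (1, 2, 3), (3, 1, 4)])
```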

RDF data is characterized by having fewer predicates than subjects and objects and by containing repeated predicate patterns. Here, the same pattern means that only the subject and object variables differ while the order of the predicates is the same. In this example, identical patterns are extracted by treating the subjects and objects as variables.

Accordingly, in the pattern extraction step S321, the final RDF compression unit 22 extracts the graph patterns that appear repeatedly in the encoding provenance graph and stores, in the pattern storage unit, the graph patterns whose number of repetitions is greater than or equal to the set number.

FIG. 8 shows an example of the graph patterns that can be extracted from the final RDF document. In FIG. 8, a pattern (pattern1) consisting of predicate 1 and predicate 2 is repeated three times, and a pattern consisting of predicate 4 and predicate 5 is repeated twice.

Accordingly, two repeated graph patterns (pattern1, pattern2) are extracted, as shown in FIGS. 9A and 9B. The extracted graph patterns and their numbers of repetitions are recorded in a pattern statistics table, as shown in Table 4 below, which is stored in the storage unit 30.

Graph pattern number (ID)	Number of repetitions
pattern 1	3
pattern 2	3
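The extraction and counting of repeated graph patterns can be sketched as follows. The sketch makes a simplifying assumption that is not confirmed by the figures: a pattern is taken to be star-shaped, that is, identified by the sorted predicate IDs on the outgoing edges of one subject node. The function name `extract_patterns`, the parameter `min_count`, and the sample triples are hypothetical; the triples were chosen so that the variable bindings for predicates 1 and 2 match the IDs listed in Table 5.

```python
from collections import defaultdict


def extract_patterns(encoded_triples, min_count):
    """Group subjects by the predicates on their outgoing edges and count repeats."""
    outgoing = defaultdict(list)
    for subject, predicate, obj in encoded_triples:
        outgoing[subject].append((predicate, obj))

    occurrences = defaultdict(list)
    for subject, edges in outgoing.items():
        edges = sorted(edges)
        signature = tuple(predicate for predicate, _ in edges)
        # One row of variable bindings: the subject, then each object.
        occurrences[signature].append((subject, *(obj for _, obj in edges)))

    # Pattern statistics: signature -> number of repetitions, kept if frequent.
    statistics = {sig: len(rows) for sig, rows in occurrences.items()
                  if len(rows) >= min_count}
    bindings = {sig: occurrences[sig] for sig in statistics}
    return statistics, bindings


triples = [(1, 1, 4), (1, 2, 2), (3, 1, 1), (3, 2, 12),
           (9, 1, 10), (9, 2, 8), (5, 3, 6)]
statistics, bindings = extract_patterns(triples, min_count=2)
```

Here the predicate sequence (1, 2) occurs three times and is kept, while the single edge with predicate 3 falls below the set number and is not treated as a pattern.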

Because the patterned and stored graph patterns share only their predicates, the subjects and objects must be managed as variables. The information (subject or object) for each node of the extracted graph patterns (pattern1, pattern2) is therefore stored in a graph pattern variable table of the form shown in [Table 5].

Table 5 is an example of a graph pattern variable table for graph pattern 1 (pattern1) shown in FIG. 9A.

Variable	Order of finding graph pattern	ID
?x	1	1
	2	3
	3	9
?y	1	4
	2	1
	3	10
?z	1	2
	2	12
	3	8

Referring to [Table 5], in the order in which graph pattern 1 was found, the information (that is, the subject or object) assigned to node ?x has the identification numbers (IDs) 1, 3, and 9 (A, G, and J in Table 3), the information assigned to node ?y has the IDs 1, 2, and 3 (A, B, and G in Table 3), and the information assigned to node ?z has the IDs 2, 12, and 8 (B, H, and P in Table 3).

Once the graph pattern variable table has been generated, the final RDF compression unit 22 proceeds to the final document pattern compression step S322 and generates a data pattern compression graph for the final RDF document by using the extracted repeated graph patterns pattern1 and pattern2 (see FIG. 10).

Provenance for the final RDF document is compressed and stored as a data compression graph.

As described above, the data corresponding to each extracted graph pattern is stored in the graph pattern variable table, and each graph pattern ID is stored and managed through the graph pattern variable table and the pattern statistics table. The final RDF compression unit 22 therefore compresses and stores the graph of the final RDF document by storing the nodes replaced on the basis of the extracted repeated graph patterns (see FIG. 10). At this time, the names of the graph patterns are determined based on the table shown in [Table 5], and are assigned in order together with the graph pattern name.
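The replacement of pattern instances by single nodes, and the recovery of the original triples from the variable table, can be sketched as a round trip. The sketch assumes a star-shaped pattern given as a tuple of distinct predicate IDs in ascending order, and an instance naming scheme of the form "pattern1-1", "pattern1-2", and so on; both are illustrative assumptions, as are the function names `compress` and `decompress`.

```python
def compress(encoded_triples, pattern_predicates, name="pattern1"):
    """Replace each instance of the pattern with one named node."""
    by_subject = {}
    for subject, predicate, obj in encoded_triples:
        by_subject.setdefault(subject, []).append((predicate, obj))

    variable_table = []  # one row of bindings per instance, in the order found
    remaining = []       # triples not covered by the pattern
    for subject, edges in by_subject.items():
        edges = sorted(edges)
        if tuple(p for p, _ in edges) == pattern_predicates:
            variable_table.append((subject, *(o for _, o in edges)))
        else:
            remaining.extend((subject, p, o) for p, o in edges)

    # Assumed naming: pattern name plus the order in which the instance was found.
    instances = [f"{name}-{i}" for i in range(1, len(variable_table) + 1)]
    return instances, variable_table, remaining


def decompress(pattern_predicates, variable_table, remaining):
    """Rebuild the original triples from the pattern and its variable table."""
    triples = list(remaining)
    for subject, *objects in variable_table:
        triples.extend((subject, p, o) for p, o in zip(pattern_predicates, objects))
    return triples


# Hypothetical encoded triples; (5, 3, 6) does not match the pattern.
triples = [(1, 1, 4), (1, 2, 2), (3, 1, 1), (3, 2, 12),
           (9, 1, 10), (9, 2, 8), (5, 3, 6)]
instances, variable_table, remaining = compress(triples, (1, 2))
restored = decompress((1, 2), variable_table, remaining)
```

In this sketch the six triples that match the pattern are reduced to three rows of the variable table, and decompression returns the same set of triples, so no information is lost.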

Provenance data is often processed in patterns that repeat identically. For example, document usage shows similar or identical patterns across various documents, such as creating a document and then changing the parts that users need. The provenance pattern compression unit 23 of this example therefore extracts such repeated usage patterns and stores them in compressed form.

The provenance pattern compression unit 23 performs compression in substantially the same manner as the final RDF compression unit 22, except that the object to be processed and the compression rules are different.

The final RDF compression unit 22 extracts identical patterns based on the predicate, whereas the provenance pattern compression unit 23 extracts identical patterns based on the activity node.

When the numeric string data provenance output from the pre-encoding unit 21, in which the string data has been converted into numeric data, is input to the provenance pattern compression unit 23, the provenance pattern compression unit 23 generates subgraphs based on the activities in the numeric string data provenance (S41).

Next, the provenance pattern compression unit 23 stores the generated subgraphs in the subgraph statistics table of the storage unit 30 and extracts the subgraphs that are repeated identically (S42).

For example, if 'change' always occurs after an activity called 'insert' in the history of a document, the provenance graph expressed in the order of 'insert' and 'change' is extracted as a subgraph.

At this time, the provenance pattern compression unit 23 compares the number of occurrences of each extracted subgraph with the set number of times, and if the number of occurrences is equal to or greater than the set number, the subgraph is designated as a reference pattern, compressed, and stored in the storage unit 30.

FIG. 11 illustrates a process of extracting subgraphs from the numeric string data provenance. As illustrated in FIG. 11, the subgraphs are generated based on the activity data.

When subgraphs having a repeated pattern are extracted, a pattern that has not been used recently (i.e., a pattern not used for a predetermined time) is deleted from the storage unit 30.
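The deletion of patterns that have not been used for a predetermined time can be sketched as follows. The sketch assumes that a last-used timestamp is kept for each stored pattern; the function name `evict_stale`, the parameter `max_idle`, and the timestamps are hypothetical.

```python
def evict_stale(last_used, now, max_idle):
    """Keep only the patterns used within the last max_idle time units."""
    return {name: used_at for name, used_at in last_used.items()
            if now - used_at <= max_idle}


# Hypothetical last-used times for two stored patterns.
last_used = {"sub1": 100, "sub2": 10}
kept = evict_stale(last_used, now=120, max_idle=60)
```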

When a repeated subgraph is extracted a set number of times or more, the statistical data related to the subgraph is recorded in a subgraph statistics table of the form shown in [Table 6], which is stored in the storage unit 30.

Subgraph	Number of repetitions
sub1	2
sub2	1
sub3	1
sub4	1

After the subgraphs are generated, the number of occurrences of each subgraph is managed in the subgraph statistics table.

As shown in [Table 6], the number of times each subgraph appears is recorded in the subgraph statistics table. If that number is equal to or greater than the set number, the subgraph is compressed into a reference pattern and stored in the storage unit 30. The set number is specified as a threshold, and its value is changed according to the data being processed. All subgraphs that can be extracted in FIG. 11 are counted in the subgraph statistics table.
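The counting of subgraphs and the selection of reference patterns can be sketched as follows. The sketch assumes that each document history is an ordered list of activity IDs and that a subgraph is an ordered pair of consecutive activities, as in the 'insert' then 'change' example; this pair-based representation and the function names are illustrative assumptions only.

```python
from collections import Counter


def subgraph_statistics(histories):
    """Count each ordered pair of consecutive activities across all histories."""
    counts = Counter()
    for activities in histories:
        for first, second in zip(activities, activities[1:]):
            counts[(first, second)] += 1
    return counts


def reference_patterns(counts, threshold):
    """Return the subgraphs that occur at least `threshold` times."""
    return [subgraph for subgraph, n in counts.items() if n >= threshold]


# Activity IDs follow Table 2: 1 = change, 2 = insertion. Histories are hypothetical.
histories = [[2, 1], [2, 1, 1], [1, 2]]
counts = subgraph_statistics(histories)
references = reference_patterns(counts, threshold=2)
```

Every pair is counted in the statistics table, but only the pair 'insertion' followed by 'change' reaches the threshold of 2 and becomes a reference pattern.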

FIG. 12 shows an example of the reference pattern.

When the reference pattern has been generated, the provenance pattern compression unit 23 compresses the pattern as shown in FIG. 13 and stores it (S43).

FIG. 13 shows a pattern-compressed provenance graph according to the present example.

In this example, when identical subgraphs are found among the extracted subgraphs, the repeated subgraph is stored as a reference pattern. The reference pattern is converted into string data and stored as shown in Table 7. The final result is stored with the matching nodes replaced by the reference pattern, so that the graph of the provenance data is compressed and stored.

Reference pattern 1-1	Reference pattern 1-2	Reference pattern 2-1	Reference pattern 2-2
Document C / Document F / Document X	Document W / Document Q / Document P	Document A / Document P / Document V	Document K / Document Y / Document F

[Table 7] contains two reference patterns (reference pattern 1 and reference pattern 2). The first instance of reference pattern 1 (reference pattern 1-1) is related to document C, document F, and document X, and the second instance of reference pattern 1 (reference pattern 1-2) is related to document W, document Q, and document P.

In addition, the first instance of reference pattern 2 (reference pattern 2-1) is related to document A, document P, and document V, and the second instance of reference pattern 2 (reference pattern 2-2) is related to document K, document Y, and document F.

In this way, the final RDF document itself is processed by the final RDF compression unit 22, and the history information of the final RDF document is processed by the provenance pattern compression unit 23, so that the final RDF document and its history information are managed separately.

Because the existing PROV model does not represent the time of a change or the original RDF document that was changed, this example proposes a compression method for managing large volumes of RDF provenance data, using an extended PROV model that extends the standard PROV model to represent the provenance data.

Although the embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto, and various modifications and improvements made by those skilled in the art using the basic concepts of the present invention defined in the following claims also fall within the scope of the present invention.

Claims

A compression device for managing provenance, comprising:

a provenance generation unit which receives history information and a final document and generates data provenance by using a provenance model;

a pre-encoding unit which is connected to the provenance generation unit, pre-encodes character string data of the data provenance into numeric string data, stores it in a pre-encoding table, and outputs numeric string data provenance;

a final RDF compression unit which is connected to the pre-encoding unit, receives the numeric string data provenance, encodes a subject and an object together into a numeric string, encodes a predicate separately into a numeric string, stores them in a final RDF data encoding table, generates an encoding provenance graph by using the data stored in the final RDF data encoding table, extracts repeating graph patterns by using the generated encoding provenance graph, stores the number of repetitions of the extracted graph patterns in a pattern statistics table, stores, in a graph pattern variable table, a subject or an object for each node of the extracted graph patterns according to the order in which the extracted graph patterns were found, and generates a data pattern compression graph for the final document by using the values stored in the graph pattern variable table; and

a provenance pattern compression unit which is connected to the pre-encoding unit, receives the numeric string data provenance, generates a subgraph having a repeating pattern with reference to activity data in the numeric string data provenance, stores the number of occurrences of the subgraph having a repeating pattern in a subgraph statistics table, and, if the subgraph having a repeating pattern occurs a set number of times or more, determines that the subgraph having a repeating pattern is a reference pattern.
The compression device for managing provenance according to claim 1,

wherein the provenance model includes an object node, an agent node, an activity node, and a metadata node having information on a time and information on a source.
The compression device for managing provenance according to claim 1,

wherein the pre-encoding unit encodes agent nodes, metadata nodes, and object nodes and stores the encoded values in a data table, encodes activity nodes and stores the encoded values in an activity table, and encodes attributes and stores the encoded values in a predicate table.
A compression method for managing provenance, comprising:

receiving history information and a final document and generating data provenance by using a provenance model;

pre-encoding character string data of the data provenance into numeric string data, storing it in a pre-encoding table, and outputting numeric string data provenance;

receiving the numeric string data provenance, encoding a subject and an object together into a numeric string, encoding a predicate separately into a numeric string, storing them in a final RDF data encoding table, and generating an encoding provenance graph by using the data stored in the final RDF data encoding table;

extracting repeating graph patterns by using the generated encoding provenance graph, storing the number of repetitions of the extracted graph patterns in a pattern statistics table, and storing, in a graph pattern variable table, a subject or an object for each node of the extracted graph patterns according to the order in which the extracted graph patterns were found; and

generating a data pattern compression graph for the final document by using the values stored in the graph pattern variable table.
The compression method for managing provenance according to claim 4,

wherein the provenance model includes an object node, an agent node, an activity node, and a metadata node having information on a time and information on a source.