CN108920600B

CN108920600B - Distributed file system metadata prefetching method based on data relevance

Info

Publication number: CN108920600B
Application number: CN201810681784.7A
Authority: CN
Inventors: 许胤龙; 陈友旭; 李诚; 李永坤; 吕敏
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2018-06-27
Filing date: 2018-06-27
Publication date: 2021-07-06
Anticipated expiration: 2038-06-27
Also published as: CN108920600A

Abstract

The invention discloses a distributed file system metadata prefetching method based on data relevance, characterized by comprising the steps of designing an extraction mode and a storage structure for data relevance, prefetching the metadata of associated files, dynamically feeding back data relevance, and dynamically updating data relevance. Compared with the traditional metadata access mode of a distributed file system, the invention provides a lightweight syntax analysis mode for data relevance, extends the metadata structure of the file system to support data relevance, and caches the metadata of associated files in the local client in advance through prefetching, thereby reducing the number of cross-network interactions between the client and the metadata server. In addition, combined with a dynamic feedback mechanism of the client, the invention dynamically adjusts the closeness of associated files according to the file access pattern and uses threshold control to further improve prefetching accuracy, which reduces the occupation of the client cache space, reduces the response delay of metadata access to associated files, and improves metadata service performance.

Description

Distributed file system metadata prefetching method based on data relevance

Technical Field

The invention belongs to the technical field of computer distributed storage systems, and particularly relates to a prefetching method for accelerating metadata access by utilizing data relevance.

Background

With the rapid development of the internet, data volume keeps increasing, and mass data storage has therefore become crucial. A distributed file system uses the physical resources of computer nodes that are deployed in a distributed manner and interconnected by a computer network, provides file system management, and offers users high-speed data access and stable system scalability. A distributed file system comprises three components: a metadata server, a data server and a client. The metadata server manages the metadata of the whole file system, including directory entries and index nodes (inodes); the data server stores the data of the file system; and the client initiates metadata and data requests. In this decoupled architecture, a client that wants to read the data of a file must first interact with the metadata server to perform the corresponding metadata operation and then transfer data with the data server to complete the access. Pages 41 to 54 of the proceedings of the USENIX Annual Technical Conference, published by the USENIX Association in 2000, indicate that at least 50% of user requests access file metadata, so metadata access performance is critical in a distributed file system. Pages 15 to 22 of the proceedings of a USENIX conference on file and storage, published by the USENIX Association in 2016, mention that reference relationships exist between the data of files, so that accessing a file causes access to the files associated with its data. However, existing distributed file system architectures do not consider the relevance among file data in their design. They therefore cannot discover the data relevance among files, which causes frequent interaction between the client and the metadata server and makes it difficult to optimize the metadata access flow of associated files or to reduce their metadata access delay.

Disclosure of Invention

The invention aims to provide a distributed file system metadata prefetching method based on data relevance, which overcomes the defects in the prior art, reduces the cross-network interaction times of a client and a metadata server, shortens the response time of requests and improves the throughput of a system under the condition of ensuring low overhead.

The invention discloses a distributed file system metadata prefetching method based on data relevance, which is characterized by comprising the following steps:

The first step: designing the extraction mode and storage structure of data relevance

According to the syntax format corresponding to the file type, the corresponding reference or link syntax expression is looked up, and a target regular expression is designed based on it. When an application program of a client modifies the data of a file, the data content of the file is parsed with the designed target regular expression to extract the path name of each referenced or linked file (associated file), and the offset at which the associated file path name appears in the data part and the length of the path name are recorded.
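The extraction described above can be illustrated with a minimal Python sketch. The pattern `SRC_PATTERN` and the function `extract_associations` are hypothetical names introduced only for illustration; the pattern is an assumed stand-in for an HTML `src` reference and is not the target regular expression of the invention.

```python
import re

# Assumed pattern for an HTML src="..." reference (illustrative only,
# not the patent's own target regular expression).
SRC_PATTERN = re.compile(r'src\s*=\s*"([^"]+)"')


def extract_associations(content: str):
    """Return (path_name, offset, length) for every referenced file."""
    results = []
    for match in SRC_PATTERN.finditer(content):
        path = match.group(1)
        # match.start(1) is the offset of the path name in the data part.
        results.append((path, match.start(1), len(path)))
    return results


html = '<html><body><img src="/sponsors.png"></body></html>'
associations = extract_associations(html)
for path, offset, length in associations:
    # The recorded offset and length locate the path name in the data.
    print(path, offset, length, html[offset:offset + length] == path)
```

Each tuple carries what the first step records for an associated file: the path name, its offset in the data part, and its length. A different file type would need a different pattern.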

Data relevance is stored in a key-value pair data structure. The key of the key-value pair is the index node number of the associated file, which uniquely identifies the file; the metadata server obtains it by looking up the index node of the corresponding file according to the path name of the associated file, and it occupies 8 bytes. The value of the key-value pair contains three parts: an association score in the range [0,1], the length of the associated file path name, and the offset of the associated file path name in the data part, which occupy 4 bytes, 4 bytes and 8 bytes respectively. The metadata structure of the index node of the distributed file system is extended, and the key-value pairs that store data relevance are saved in the extended attributes of the file's index node, so that the distributed file system supports data relevance. After the client finishes parsing the modified data content, it sends data relevance synchronization information to the metadata server; after receiving the synchronization information, the metadata server persistently updates the data relevance to the storage device.
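As a rough illustration of the 8 + 4 + 4 + 8 byte layout described above, the following Python sketch packs one key-value pair with the standard `struct` module. The little-endian byte order, the use of a 32-bit float for the score, and the names `RECORD`, `pack_association` and `unpack_association` are assumptions made for this sketch; the text only fixes the field sizes.

```python
import struct

# Assumed encoding: 8-byte unsigned inode number (key), then a 4-byte
# float score, a 4-byte unsigned path length and an 8-byte unsigned
# offset (value). "<" selects little-endian with no padding.
RECORD = struct.Struct("<QfIQ")


def pack_association(inode_no, score, path_len, offset):
    """Serialize one data relevance entry into 24 bytes."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("association score must lie in [0, 1]")
    return RECORD.pack(inode_no, score, path_len, offset)


def unpack_association(blob):
    """Return (key, value) where value is (score, path_len, offset)."""
    inode_no, score, path_len, offset = RECORD.unpack(blob)
    return inode_no, (score, path_len, offset)


blob = pack_association(10101586, 0.5, 13, 108)
key, value = unpack_association(blob)
print(RECORD.size, key, value)
```

A byte string of this kind could be stored as the content of an extended attribute of the index node; how the extended attribute is named and written depends on the file system and is not shown here.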

The second step: prefetching metadata of associated files

When a metadata server processes a metadata operation request of a target file initiated by a client, firstly, a directory entry and an index node of the target file are obtained in a metadata cache of the metadata server; after the index node of the target file is obtained, retrieving each extended attribute of the index node, and obtaining the data relevance of the target file;

A threshold T in the range [0,1] is set to represent the minimum closeness between the target file and an associated file that is to be prefetched; prefetching is performed when the association score between the target file and the associated file exceeds T. Each data relevance of the target file is traversed. When the association score in the value part of the key-value pair is greater than the threshold T, the index node number of the associated file is extracted, and the directory entry and index node content of the associated file are looked up in the metadata cache according to that index node number. When the association score is less than or equal to the threshold T, that data relevance is skipped and the next data relevance is examined.

The metadata server constructs a reply message to return the metadata of the target file and the associated files to the client. The directory entry and index node content of the target file are added to the reply message, together with the directory entry and index node content of the associated files found in this second step, and the prefetch mark of the reply message is set to 1 to indicate that the reply message contains metadata of associated files. If the reply message contains no prefetched content, its prefetch mark is set to 0. The metadata server then sends the reply message to the client.
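The second step can be sketched in Python as follows. The `Inode` class, the dictionary-based metadata cache and the reply layout are hypothetical simplifications chosen for illustration, and skipping an associated file that is missing from the metadata cache is an assumption; the text does not describe that case.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    number: int
    name: str
    # Data relevance: associated inode number -> (score, length, offset).
    associations: dict = field(default_factory=dict)


def build_reply(metadata_cache, target_no, threshold):
    """Build a reply holding the target metadata plus prefetched metadata."""
    target = metadata_cache[target_no]
    reply = {"target": target, "prefetched": [], "prefetch_flag": 0}
    for assoc_no, (score, _length, _offset) in target.associations.items():
        # Prefetch only when the score is strictly greater than T.
        # Assumption: entries absent from the cache are simply skipped.
        if score > threshold and assoc_no in metadata_cache:
            reply["prefetched"].append(metadata_cache[assoc_no])
    if reply["prefetched"]:
        reply["prefetch_flag"] = 1  # reply carries associated metadata
    return reply


cache = {
    10101567: Inode(10101567, "/index.html", {10101586: (0.5, 13, 108)}),
    10101586: Inode(10101586, "/sponsors.png"),
}
reply = build_reply(cache, 10101567, threshold=0.3)
print(reply["prefetch_flag"], [inode.name for inode in reply["prefetched"]])
```

With a threshold equal to the stored score (0.5 here) the entry is skipped, because the comparison is strict; raising T therefore trades fewer prefetched entries for less client cache usage.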

the third step: dynamic feedback of data relevance

When a client receives a reply message sent by the metadata server, it first checks whether the prefetch mark of the reply message is set. If the prefetch mark is not set, the client parses the content of the reply message to obtain the directory entry and index node of the target file, caches them in the memory of the client, and links the index node of the target file to its directory entry to establish the logical structure of the target file path.

If the prefetch mark is set, then after parsing the content of the reply message to obtain the directory entry and index node of the target file, the client further parses the subsequent content of the reply message to obtain the directory entry and index node of each associated file, links the index node of the associated file to the directory entry of the associated file, establishes the logical structure of the associated file path, and caches it in the memory of the client. The client records the information of each prefetched associated file, including the index node number of the associated file, the index node number of the target file that triggered the prefetch, the prefetch time and an access mark, and adds this information to the prefetch feedback table of the client. If the prefetched associated file is accessed by a subsequent client request, the access mark of the corresponding associated file in the prefetch feedback table is set to 1; if it is not accessed by a subsequent client request, the access mark is set to 0.
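A minimal Python sketch of this client-side handling is given below. The dictionary layouts and the function names `handle_reply` and `lookup` are hypothetical, and keying the feedback table by the index node number of the associated file is a simplifying assumption of this sketch.

```python
import time


def handle_reply(reply, client_cache, feedback_table, now=None):
    """Cache target metadata and, if flagged, prefetched metadata."""
    now = time.time() if now is None else now
    target = reply["target"]
    client_cache[target["inode"]] = target
    if reply["prefetch_flag"] == 1:
        for assoc in reply["prefetched"]:
            client_cache[assoc["inode"]] = assoc
            # One prefetch record per associated file; access mark starts at 0.
            feedback_table[assoc["inode"]] = {
                "associated_inode": assoc["inode"],
                "target_inode": target["inode"],
                "prefetch_time": now,
                "accessed": 0,
            }


def lookup(inode_no, client_cache, feedback_table):
    """Serve metadata from the client cache and mark prefetched hits."""
    entry = client_cache.get(inode_no)
    if entry is not None and inode_no in feedback_table:
        feedback_table[inode_no]["accessed"] = 1
    return entry


client_cache, feedback_table = {}, {}
reply = {
    "prefetch_flag": 1,
    "target": {"inode": 10101567, "path": "/index.html"},
    "prefetched": [{"inode": 10101586, "path": "/sponsors.png"}],
}
handle_reply(reply, client_cache, feedback_table, now=100.0)
hit = lookup(10101586, client_cache, feedback_table)
print(hit, feedback_table[10101586])
```

The later lookup of the prefetched inode is served from the client cache without contacting the metadata server, and the access mark of its prefetch record changes from 0 to 1.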

A time interval Time in the range [0, N] is set for traversing the client prefetch feedback table. Once every Time seconds, the client traverses all records in the prefetch feedback table one by one and feeds back the access information of the prefetched associated files to the metadata server. If the current time minus the prefetch time of the associated file in the record being traversed is greater than the time interval Time, the client constructs a prefetch feedback request and adds to it the index node number of the associated file, the index node number of the target file that triggered the prefetch, and the access mark. If the current time minus the prefetch time is less than or equal to Time, the record is skipped and the next prefetch record in the table is examined. After all records in the prefetch feedback table have been traversed once, the client sends the constructed prefetch feedback request to the metadata server.
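The age check applied during one traversal can be sketched in Python as follows. The record layout, the function name `collect_feedback` and the second inode number 10101590 are hypothetical; the text does not state whether reported records are removed from the table, so this sketch leaves the table untouched.

```python
def collect_feedback(feedback_table, now, interval):
    """Return feedback entries for records older than the interval."""
    request = []
    for record in feedback_table:
        # Report a record only if it was prefetched more than
        # `interval` seconds ago; younger records are skipped.
        if now - record["prefetch_time"] > interval:
            request.append((
                record["associated_inode"],
                record["target_inode"],
                record["accessed"],
            ))
    return request


table = [
    {"associated_inode": 10101586, "target_inode": 10101567,
     "prefetch_time": 10.0, "accessed": 1},
    # Hypothetical second record, prefetched only 2 seconds ago.
    {"associated_inode": 10101590, "target_inode": 10101567,
     "prefetch_time": 58.0, "accessed": 0},
]
feedback_request = collect_feedback(table, now=60.0, interval=30.0)
print(feedback_request)
```

In a client, a timer would call such a function once every Time seconds and send the resulting request to the metadata server. Waiting until a record is older than the interval gives the prefetched file a chance to be accessed before a verdict is reported.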

the fourth step: dynamic update of data associations

When the metadata server receives a pre-fetching feedback request sent by a client, the pre-fetching records in the request are processed one by one; firstly, inquiring index node information of an associated file and index node information of a target file triggering prefetching according to the index node number of the associated file of each prefetching record and the index node number of the target file triggering prefetching, and retrieving data relevance of the corresponding associated file in the index node information of the target file triggering prefetching to obtain a key value pair of the corresponding associated file;

An adjustment score s in the range [0,1] is set to represent the granularity with which the closeness between the target file and the associated file is adjusted each time. If the access mark in the prefetch record is 1, the association score of the key-value pair is increased by s; if the access mark in the prefetch record is 0, the association score is decreased by s. The prefetch records in the feedback request are traversed one by one, the data relevance in the index node of the target file that triggered the prefetch is updated according to whether the prefetched associated file was accessed, and finally the data relevance is persisted in the storage device of the metadata server.
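The score adjustment of the fourth step can be sketched in Python as follows. The nested dictionary that stands in for the index node store and the function name `apply_feedback` are hypothetical. Clamping the result to [0,1] is an assumption added so that the score stays within its stated range; the text only says that the score is increased or decreased by s. Persisting the result to the storage device is omitted.

```python
def apply_feedback(inodes, feedback_request, s):
    """Adjust association scores according to client feedback."""
    for assoc_no, target_no, accessed in feedback_request:
        associations = inodes[target_no]
        score, length, offset = associations[assoc_no]
        # Raise the score by s on a hit, lower it by s on a miss.
        score = score + s if accessed == 1 else score - s
        # Assumption: keep the score inside its stated range [0, 1].
        score = min(1.0, max(0.0, score))
        associations[assoc_no] = (score, length, offset)


# Target inode number -> {associated inode number -> (score, length, offset)}
inodes = {10101567: {10101586: (0.5, 13, 108)}}
apply_feedback(inodes, [(10101586, 10101567, 1)], s=0.25)
print(inodes[10101567][10101586])
```

Repeated hits push the score above the threshold T and keep the file eligible for prefetching, while repeated misses push it to or below T, after which the file is no longer prefetched.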

The distributed file system metadata prefetching method based on data relevance adopts the operation steps of designing an extraction mode and a storage structure for data relevance, prefetching the metadata of associated files, dynamically feeding back data relevance, and dynamically updating data relevance. Compared with the traditional metadata access mode of a distributed file system, the method of the invention provides a lightweight syntax analysis mode for data relevance, extends the metadata structure of the file system to support data relevance, and caches the metadata of associated files in the local client in advance through prefetching, thereby reducing the number of cross-network interactions between the client and the metadata server. In addition, combined with a dynamic feedback mechanism of the client, the method dynamically adjusts the closeness of associated files according to the file access pattern and uses threshold control to further improve prefetching accuracy, which reduces the occupation of the client cache space, reduces the response delay of metadata access to associated files, and improves metadata service performance.

Compared with the prior art, the distributed file system metadata prefetching method based on data relevance has the following advantages:

1. Because the invention considers the relevance among file data, it designs an extraction mode for file data relevance and extends the storage structure of the distributed file system metadata so that the distributed file system supports data relevance.

2. Because the invention designs the metadata pre-fetching method based on the data relevance and utilizes the threshold control and the dynamic feedback mechanism to further improve the pre-fetching precision, compared with the prior art, the invention reduces the occupation of the cache space of the client, reduces the number of requests of the client and the metadata server interacting across the network and shortens the response delay of the metadata access of the associated file.

Drawings

FIG. 1 is a diagram of a distributed file system architecture.

Fig. 2 is a schematic diagram of the data relevance information extracted from an index.html file.

Fig. 3 is a diagram illustrating a reference or link format corresponding to various syntax types.

Fig. 4 is a schematic diagram of a data association storage structure.

FIG. 5 is a flow chart illustrating a metadata pre-fetching operation performed by the metadata server.

Fig. 6 is a diagram showing a prefetch feedback table structure.

Fig. 7 is a flow chart illustrating the operation of the dynamic feedback mechanism.

FIG. 8 is a flowchart illustrating the general operation of the method for prefetching metadata of an associated file according to the present invention.

Detailed Description

The following describes the data association-based distributed file system metadata prefetching method according to the present invention in further detail by using specific embodiments in conjunction with the accompanying drawings.

Example 1:

the distributed file system metadata prefetching method based on data relevance in the embodiment specifically comprises the following steps:

Fig. 1 shows a schematic diagram of a distributed file system architecture, which includes three components, namely a distributed file system client, a metadata server and a data server, that interact with each other via a network. On the client side, the application interacts with the distributed file system client through the virtual file system; the distributed file system client is responsible for initiating metadata and data requests, while the client cache caches metadata and data to speed up the response to requests. The metadata server includes three parts: a metadata request processing program, a metadata cache and metadata storage. The data server is responsible for providing data access. If the application program needs to read the content of a file, the distributed file system client first interacts with the metadata server to obtain the metadata of the file and caches it in the client cache; it then obtains the data address from the information in the metadata and interacts with the data server to read the file data. To keep metadata and data consistent, the metadata server and the data server interact to update file metadata.

Fig. 2 shows part of the data of an index.html file; the client parses this data content to extract data relevance. Fig. 3 shows the reference or link formats of various syntax types, such as HTML syntax and C++ syntax. A target regular expression src [ ^ ] is designed according to the reference or link format of the HTML syntax given in Fig. 3 and is used to match the data content of the index.html file.

Fig. 4 shows the storage structure of data relevance. The data relevance about the /sponsors.png file extracted from the data of the index.html file is stored as a key-value pair in an extended attribute of the index node of the index.html file. The specific key-value pair is <10101586, (0.5, 13, 108)>, where 10101586 is the index node number of the associated file /sponsors.png and the value holds, in the order defined by the storage structure, the association score, the length of the path name and the offset of the path name in the data part. After the client finishes parsing the modified data content of the index.html file, it sends data relevance synchronization information to the metadata server; after receiving the synchronization information, the metadata server persistently updates the data relevance to the storage device.

The second step: prefetching metadata of associated files

When an application program of a client loads the index.html file, the metadata of the index.html file needs to be accessed. If the local cache does not hold the metadata of the index.html file, the client sends a metadata request to the metadata server. Fig. 5 illustrates the operation flow in which the metadata server processes a metadata request and prefetches the metadata of associated files. First, the metadata server performs operation ① in Fig. 5 and receives the metadata request from the client. Then, in operation ②, the metadata server looks up the directory entry and index node of the target file index.html in the metadata cache and adds them to the reply message; the index node number of index.html is 10101567. After the index node of the target file index.html is obtained, the metadata server checks whether the target file has associated files. If the target file index.html has no associated file, the prefetch flag of the reply message is set to 0 and the reply message is sent to the client. If the target file has an associated file, which is the /sponsors.png file in this embodiment, operation ③ is performed: the data relevance entries are traversed one by one, and for each entry it is determined whether the association score is greater than the threshold T. If the association score is greater than the threshold T, operation ④ is performed: the directory entry and index node of the associated file /sponsors.png are looked up in the metadata cache according to the index node number of the associated file stored in the data relevance and are added to the reply message; the index node number of the /sponsors.png file is 10101586. After all data relevance entries have been traversed, the prefetch flag of the reply message is set to 1, and finally the reply message is sent to the client.

The third step: dynamic feedback of data relevance

When the client receives the reply message sent by the metadata server, it first checks whether the prefetch mark of the reply message is set. If the prefetch mark is 0, the reply message does not contain metadata of prefetched files; the directory entry and inode of the target file are parsed and cached in the client cache, and the logical structure of the directory entry and inode is established. If the prefetch mark is 1, the directory entry and inode of the prefetched associated file are also parsed and cached in the client cache, and their logical structure is established. In this embodiment, the client parses the directory entries and inodes of the target file index.html and the associated file /sponsors.png, caches them in the client cache, and establishes the logical structure of the directory entries and inodes. Because of the relevance among file data, access to the index.html file usually causes access to the associated file /sponsors.png; the client's subsequent access to the metadata of the /sponsors.png file can then be completed in the client cache without interacting with the metadata server, which reduces the number of cross-network interactions with the metadata server.

Fig. 6 shows the structure of the prefetch feedback table. To improve prefetching accuracy, the client maintains a prefetch feedback table organized as shown in Fig. 6. After the client parses the directory entry and index node of an associated file, it adds a prefetch record to the prefetch feedback table. The prefetch record contains the index node number of the associated file, the index node number of the target file that triggered the prefetch, the prefetch time and the access mark. In this embodiment, the prefetch record of the prefetched associated file /sponsors.png is <10101586, 10101567, t₁, 0>, where t₁ is the prefetch time of the /sponsors.png metadata, expressed as the time at which the reply message reaches the client. Because a subsequent metadata operation of the client accesses the metadata of the associated file /sponsors.png, the access mark in its prefetch record in the prefetch feedback table is updated to 1, and the prefetch record becomes <10101586, 10101567, t₁, 1>.

Fig. 7 is a schematic diagram of the operation flow of the dynamic feedback mechanism; the left side of Fig. 7 is the operation flow of the client, and the right side is the operation flow of the metadata server. The client traverses the prefetch feedback table once every Time seconds. It first retrieves a prefetch record from the prefetch feedback table. If the current time minus the prefetch time in the prefetch record is greater than Time, the prefetch record is added to the feedback information of the client; otherwise, the next prefetch record is retrieved and checked. After the prefetch feedback table has been traversed once, the feedback information is sent to the metadata server. The content of the feedback information in this embodiment is <10101586, 10101567, t₁, 1>.

The fourth step: dynamic update of data associations

The metadata server receives the prefetch feedback information sent by the client, as shown in Fig. 7, and then traverses the access information of each prefetch record in the feedback information in turn. If the access mark in a record is 1, the prefetched associated file was accessed by a subsequent client operation, and the association score of the corresponding relevance in the data relevance of the index node of the target file that triggered the prefetch is increased by s. If the access mark in the record is not 1, the prefetched associated file was not accessed by a subsequent client operation, and that association score is decreased by s. When all prefetch records in the feedback information have been traversed, the updated metadata is persisted to the storage of the metadata server. The closeness of each relevance is thus adjusted dynamically according to the access information of the prefetched files, so that metadata of closely associated files is more likely to be prefetched and metadata of loosely associated files is less likely to be prefetched, which further improves the accuracy of metadata prefetching. In this embodiment, since the prefetched /sponsors.png file metadata is accessed by the client, the association score in the key-value pair with key 10101586 is increased by s in the data relevance of the index node numbered 10101567.

Fig. 8 is a schematic diagram of the general operation flow of prefetching associated file metadata according to the method of the present invention, which optimizes the access flow of associated file metadata by using the reference or link relationships between file data. The client is responsible for initiating metadata requests, caching associated file metadata, and dynamically feeding back prefetch access information to the metadata server. The metadata server is responsible for receiving client requests, looking up target file metadata, prefetching associated file metadata and replying to metadata requests, and updating the data relevance in real time according to the prefetch feedback information, thereby completing the prefetching and updating of associated files.

The method of the invention obtains data relevance through syntax analysis and integrates it into the metadata of the distributed file system, so that the distributed file system supports the storage of data relevance. When associated file metadata is prefetched, threshold control improves prefetching accuracy and reduces the occupation of the client cache space. A client dynamic feedback mechanism, driven by the access information of prefetched files, updates the association scores in real time to further improve prefetching accuracy. Compared with a traditional distributed file system, prefetching the metadata of associated files optimizes the metadata access flow of subsequent associated files, reduces the number of cross-network interactions between the client and the metadata server, and shortens the metadata access delay of associated files. Taking the index.html file and the /sponsors.png file in this embodiment as an example, an existing distributed file system cannot perceive the data relevance between files: after accessing the metadata of the index.html file, the client still needs to interact with the metadata server to obtain the metadata of the associated file /sponsors.png, so the client interacts with the metadata server twice in the whole access flow. With the method of the invention, the distributed file system perceives the data relevance in advance, and when the metadata of the index.html file is accessed, the metadata of the associated file /sponsors.png is prefetched to the client together with it, so that the subsequent access to the /sponsors.png metadata is completed in the client cache.

Claims

1. A distributed file system metadata prefetching method based on data relevance is characterized by comprising the following steps:

Inquiring a corresponding reference or link syntax expression according to a syntax format corresponding to the file type, and designing a target regular expression based on the inquired reference or link syntax expression; when an application program of a client modifies data of a file, syntactic analysis is carried out on data content of the file by using a designed target regular expression to extract a file path name which refers to or links an associated file, and meanwhile, the offset of the associated file path name appearing in a data part and the length of the path name are recorded;

storing data relevance by adopting a key-value pair data structure, wherein the key of the key-value pair is the index node number of the associated file and is used for uniquely marking the file, and the key is obtained by the metadata server by looking up the index node of the corresponding file according to the path name of the associated file and occupies 8 bytes; the value of the key-value pair comprises three parts, namely an association score in the range of [0,1], the length of the associated file path name and the offset of the associated file path name in the data part, which respectively occupy 4 bytes, 4 bytes and 8 bytes; extending the metadata structure of the index node of the distributed file system, and storing the key-value pairs that store the data relevance in the extended attributes of the index node of the file so that the distributed file system supports data relevance; after the client finishes parsing the modified data content, the client sends data relevance synchronization information to the metadata server; after the metadata server receives the synchronization information, persistently updating the data relevance to the storage device;

the second step: prefetching metadata of associated files

the third step: dynamic feedback of data relevance

setting a time interval Time in the range of [0, N] for traversing the client prefetch feedback table; once every Time seconds, the client traverses all records in the prefetch feedback table one by one and feeds back the access information of the prefetched associated files to the metadata server; if the current time minus the prefetch time of the associated file in the record being traversed is greater than the time interval Time, constructing a client prefetch feedback request, and adding the index node number of the associated file, the index node number of the target file that triggered the prefetch and the access mark to the feedback request; if the current time minus the prefetch time of the associated file is less than or equal to the time interval Time, skipping the record and traversing the next prefetch record in the prefetch feedback table; after all records in the prefetch feedback table have been traversed once, the client sends the constructed prefetch feedback request to the metadata server;

the fourth step: dynamic update of data associations

setting an adjustment score s in the range of [0,1] to represent the granularity with which the closeness between the target file and the associated file is adjusted each time, and if the access mark in the prefetch record is 1, increasing the association score of the key-value pair by s; if the access mark in the prefetch record is 0, reducing the association score of the key-value pair by s; and traversing the prefetch records in the feedback request one by one, updating the data relevance in the index node of the target file that triggered the prefetch according to the access condition of the prefetched associated file, and finally persisting the data relevance in the storage device of the metadata server.