CN112328839B - Enterprise risk identification method and system based on an enterprise purchase-and-sales relationship graph

Info

Publication number: CN112328839B
Application number: CN202011224147.0A
Authority: CN
Inventors: 王泽皓; 刘雅婷; 马谊骏; 闫凯; 林文辉; 王志刚
Original assignee: Aisino Corp
Current assignee: Aisino Corp
Priority date: 2020-11-05
Filing date: 2020-11-05
Publication date: 2024-02-27
Anticipated expiration: 2040-11-05
Also published as: CN112328839A

Abstract

The invention discloses an enterprise risk identification method and system based on an enterprise purchase-and-sales relationship graph. The method comprises: constructing a data set; constructing a relationship graph, in which each enterprise and each purchase or sales item good in the data set is added to a graph database as a node, and the correspondence between each enterprise and its purchase and sales item goods is added to the graph database as edges; querying, from the constructed relationship graph, the amount share that the purchase and sales item goods shared by any two enterprises hold in each enterprise's total purchase and sales item goods, and calculating the similarity of the enterprises' purchase and sales item goods; and screening enterprises with similar purchase and sales items according to the similarity result and identifying enterprise risk through the enterprises' industry attributes. By calculating purchase and sales item similarity and comparing the industry attributes of enterprises whose purchase and sales items are similar, the method and system identify enterprise risk and help tax supervision departments manage and analyze enterprises more effectively.

Description

Enterprise risk identification method and system based on an enterprise purchase-and-sales relationship graph

Technical Field

The invention relates to an enterprise risk identification method and system based on an enterprise purchase-and-sales relationship graph.

Background

With economic development, taxation plays an increasingly important role. Tax supervision departments need to track industry changes and enterprise development continuously and to analyze and manage enterprises with similar businesses in a unified way, so finding enterprises with similar businesses has become a valuable problem.

Existing methods for finding enterprises with similar purchase and sales items are limited to web searches: the degree of similarity is judged by comparing the textual descriptions of purchase and sales goods among enterprises in the same industry.

Disclosure of Invention

The invention aims to provide an enterprise risk identification method and system based on an enterprise purchase-and-sales relationship graph, which at least solves the technical problems that the data sources used by existing methods for finding enterprises with similar purchase and sales items are inaccurate, and that textual descriptions can hardly summarize an enterprise's purchase and sales situation accurately.

To achieve the above object, in one aspect, the invention provides an enterprise risk identification method based on an enterprise purchase-and-sales relationship graph, comprising:

constructing a data set: acquiring enterprise information from enterprise tax data, performing word segmentation matching on the acquired enterprise information, and establishing the correspondence between each enterprise and its matched purchase and sales item goods;

constructing a relationship graph: adding each enterprise and each purchase or sales item good in the data set to a graph database as a node, and adding the correspondence between each enterprise and its purchase and sales item goods to the graph database as edges;

calculating the purchase and sales item similarity: querying, from the constructed relationship graph, the amount share that the purchase and sales item goods shared by any two enterprises hold in each enterprise's total purchase and sales item goods, and calculating the similarity of the enterprises' purchase and sales item goods;

and screening enterprises with similar purchase and sales items according to the similarity calculation result, and identifying enterprise risk through the enterprises' industry attributes.

Optionally, the data set construction comprises:

acquiring enterprise information from invoice data, the enterprise information mainly comprising a seller taxpayer identification number, a seller taxpayer name, a supplier taxpayer identification number, a supplier taxpayer name, a transaction amount, the goods purchased and sold, and a transaction time;

storing the seller taxpayer names as a word segmentation dictionary for matching the correspondence;

and applying a word segmentation algorithm to the acquired enterprise information to match the purchase and sales item goods, and storing the enterprise name together with the names of the matched goods, thereby establishing the correspondence between each enterprise and its purchase and sales item goods.

Optionally, the enterprise relationship graph is constructed using the distributed graph database JanusGraph.

Optionally, calculating the similarity of the enterprises' purchase and sales item goods comprises:

calculating a purchase-item similarity and a sales-item similarity, wherein the calculation formula of the purchase-item similarity is as follows:

where N is the number of purchase item goods shared between enterprises A and B, ipa_i is defined as the amount share of the i-th shared purchase item good in enterprise A, ipb_i is defined as the amount share of the i-th shared purchase item good in enterprise B, and similar_in is defined as the purchase-item similarity;

the calculation formula of the sales-item similarity is as follows:

where M is the number of sales item goods shared between enterprises A and B, opa_i is defined as the amount share of the i-th shared sales item good in enterprise A, opb_i is defined as the amount share of the i-th shared sales item good in enterprise B, and similar_out is defined as the sales-item similarity;

the calculation formula of the combined purchase and sales item similarity is as follows:

similar = (similar_in + similar_out) / 2

where similar is defined as the combined purchase and sales item similarity.

Optionally, screening enterprises with similar purchase and sales items and identifying enterprise risk through enterprise industry attributes comprises:

screening out the enterprises whose purchase-item, sales-item, or combined purchase and sales item similarity with the target enterprise is greater than a set threshold, taking them as the target enterprise's similar enterprises for purchase items, sales items, and combined purchase and sales items respectively, comparing the industry attributes of these similar enterprises, and identifying enterprise risk.

Optionally, when the correspondence between each enterprise and its purchase and sales item goods is added to the graph database as an edge, information such as the transaction time and transaction amount is added to the attributes of the edge.

Optionally, the purchase and sales item similarity calculation and the query of the correspondence comprise:

querying the purchase and sales item correspondence using the Gremlin graph query language.

In another aspect, the invention also provides an enterprise risk identification system based on an enterprise purchase-and-sales relationship graph, comprising:

a data set construction module, which acquires enterprise information from enterprise tax data, performs word segmentation matching on the acquired enterprise information, and establishes the correspondence between each enterprise and its matched purchase and sales item goods;

a relationship graph construction module, which adds each enterprise and each purchase or sales item good in the data set to a graph database as a node, and adds the correspondence between each enterprise and its purchase and sales item goods to the graph database as edges;

a purchase and sales item similarity calculation module, which queries, from the constructed relationship graph, the amount share that the purchase and sales item goods shared by any two enterprises hold in each enterprise's total purchase and sales item goods, and calculates the similarity of the enterprises' purchase and sales item goods;

and a risk identification module, which screens enterprises with similar purchase and sales items according to the similarity calculation result and identifies enterprise risk through the enterprises' industry attributes.

Optionally, the purchase and sales item goods similarity calculation module uses the following:

the calculation formula of the sales-item similarity is as follows:

where M is the number of sales item goods shared between enterprises A and B, opa_i is defined as the amount share of the i-th shared sales item good in enterprise A, opb_i is defined as the amount share of the i-th shared sales item good in enterprise B, and similar_out is defined as the sales-item similarity;

the calculation formula of the combined purchase and sales item similarity is as follows:

similar = (similar_in + similar_out) / 2

where similar is defined as the combined purchase and sales item similarity.

Further, the purchase and sales item correspondence is queried using the Gremlin graph query language.

The beneficial effects of the invention are as follows:

For the tax field, the invention calculates the similarity of enterprises' purchase and sales items and compares the industry attributes of enterprises whose purchase and sales items are similar in order to identify enterprise risk, helping tax supervision departments manage and analyze risky enterprises more effectively.

Further, an enterprise relationship graph covering massive data is built on the distributed graph database JanusGraph, and graph computation is implemented through the Gremlin query language, which overcomes the shortcomings of traditional databases and the stand-alone Neo4j graph database in graph storage and graph mining and addresses problems such as large data volume and high development cost.

Further, when the purchase and sales item similarity is calculated, the similarity formula is optimized by incorporating transaction amount information as a key parameter.

Additional features and advantages of the invention will be set forth in the detailed description which follows.

Drawings

The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail exemplary embodiments thereof with reference to the attached drawings.

FIG. 1 shows a flow chart of the enterprise risk identification method based on an enterprise purchase-and-sales relationship graph according to the invention;

FIG. 2 is a schematic diagram of the amount share of each good among all goods of enterprise A according to an embodiment of the invention.

Detailed Description

Preferred embodiments of the present invention will be described in more detail below. While the preferred embodiments of the present invention are described below, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

In one aspect, the invention provides an enterprise risk identification method based on an enterprise purchase-and-sales relationship graph, which comprises the following steps.

Constructing a data set: enterprise information is acquired from enterprise tax data, word segmentation matching is performed on the acquired enterprise information, and the correspondence between each enterprise and its matched purchase and sales item goods is established.

Specifically, enterprise information is obtained from invoice data. It mainly comprises a seller taxpayer identification number, a seller taxpayer name, a supplier taxpayer identification number, a supplier taxpayer name, a transaction amount, the goods purchased and sold, and a transaction time.

The seller taxpayer names are stored as a word segmentation dictionary for matching the correspondence. For example, if the obtained enterprise names are company A and company B, company A and company B are stored in the dictionary file. In addition, enterprise abbreviations can be mapped to full enterprise names through manual screening, which improves the ability to match abbreviations.

A word segmentation algorithm is applied to the acquired enterprise information to match the purchase and sales item goods, and the enterprise name is stored together with the names of the matched goods, thereby establishing the correspondence between each enterprise and its purchase and sales item goods. For example, if the information matched for company A includes "goods 1, goods 2 and goods 3", then goods 1, goods 2 and goods 3 can be matched by the word segmentation algorithm using the dictionary, and the correspondence between company A and goods 1, goods 2 and goods 3 is established. The word segmentation algorithm is a conventional algorithm known in the art and is not described further here.
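The matching step can be illustrated with a minimal sketch. It assumes forward maximum matching against a hypothetical dictionary of goods names; the patent only states that a conventional word segmentation algorithm is used, so the function name match_goods, the dictionary contents and the sample text are illustrative assumptions.

```python
def match_goods(text, dictionary):
    """Forward maximum matching: at each position, take the longest
    dictionary entry that starts there (assumed algorithm, not the patent's)."""
    max_len = max((len(word) for word in dictionary), default=0)
    found, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                found.append(candidate)
                i += size
                break
        else:
            # No dictionary entry starts here; move one character forward.
            i += 1
    return found


# Hypothetical dictionary and invoice text, for illustration only.
goods_dictionary = {"goods 1", "goods 2", "goods 3"}
invoice_text = "goods 1, goods 2 and goods 3"

# Correspondence between an enterprise and its matched goods.
company_goods = {"Company A": match_goods(invoice_text, goods_dictionary)}
```

In practice a dedicated segmentation library with a user dictionary would likely replace this hand-written matcher; the sketch only shows how a dictionary turns free text into an enterprise-to-goods correspondence.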

Specifically, the enterprise relationship graph is constructed using the distributed graph database JanusGraph.

It should be noted that the data structure of the enterprise relationship graph is a directed graph. The data set mainly consists of information from the enterprises' invoice data, so it can be converted into a directed graph structure and imported into the JanusGraph database. Building the directed graph structure comprises constructing nodes and constructing edges.

Constructing nodes comprises:

Each enterprise in the data set is added to the graph database as a node with the label nsr, whose attributes include the enterprise name, the enterprise tax number and a unique identification id; each purchase or sales item good in the data set is added to the graph database as a node with the label computability.

Constructing edges comprises:

The correspondence between each enterprise and its purchase and sales goods is added to the graph database as edges, where the edge label for purchase item goods is input and the edge label for sales item goods is output. Information such as the transaction time and transaction amount is added to the edge attributes, which facilitates filtering by time and calculating the amount share of goods at query time. For example, to calculate the amount share of one type of goods among all goods of an enterprise: if the sales items of enterprise A comprise three goods, namely goods 1 (amount a), goods 2 (amount b) and goods 3 (amount c), the amount share of goods 3 is c/(a+b+c).
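The node and edge construction and the amount-share calculation can be sketched with plain Python structures. This is an in-memory illustration under stated assumptions, not the JanusGraph API: the record layout, the node label "goods", the helper amount_shares and the sample amounts are all hypothetical, while the labels nsr, input and output follow the text.

```python
from collections import defaultdict

# (enterprise, goods, edge label, amount, time): illustrative records only.
records = [
    ("Enterprise A", "goods 1", "output", 200.0, "2020-01-05"),
    ("Enterprise A", "goods 2", "output", 300.0, "2020-02-11"),
    ("Enterprise A", "goods 3", "output", 500.0, "2020-03-20"),
]

nodes, edges = {}, []
for enterprise, goods, label, amount, time in records:
    nodes[enterprise] = {"label": "nsr"}   # enterprise node
    nodes[goods] = {"label": "goods"}      # goods node (label name assumed)
    edges.append({"from": enterprise, "to": goods, "label": label,
                  "amount": amount, "time": time})


def amount_shares(edges, enterprise, label):
    """Share of each good in the enterprise's total amount for one edge label."""
    totals = defaultdict(float)
    for edge in edges:
        if edge["from"] == enterprise and edge["label"] == label:
            totals[edge["to"]] += edge["amount"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {goods: amount / grand_total for goods, amount in totals.items()}


# Share of goods 3 is c / (a + b + c) = 500 / (200 + 300 + 500).
shares = amount_shares(edges, "Enterprise A", "output")
```

Keeping the amount and time on the edge, as the text describes, is what lets the share be computed at query time and restricted to a time window if needed.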

Calculating the purchase and sales item similarity: from the constructed relationship graph, the amount share that the purchase and sales item goods shared by any two enterprises hold in each enterprise's total purchase and sales item goods is queried, and the similarity of the enterprises' purchase and sales item goods is calculated.

Specifically, the correspondence query comprises:

Enterprise node query:

The target enterprise is queried through the 'nsr' label, and the node information is obtained using the Gremlin query language. For example, to query the enterprise node whose enterprise name is "Aerospace Information", the following statement may be used:

g.V().hasLabel('nsr').has('NSRMC', 'Aerospace Information').toList()

Query of nodes and edges with similar purchase and sales items:

After the enterprise information is obtained through the enterprise node query, all nodes and edges with similar purchase and sales items within a single level are quickly found through 'id'. Taking "Aerospace Information" as an example, the following statement is used:

g.V('id').outE().otherV().inE().otherV().outE().simplePath().toList()
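The intent of the second traversal (enterprise, then its goods, then other enterprises connected to the same goods) can be mimicked in plain Python. This is a rough sketch over a hypothetical list of (enterprise, goods) pairs, not a Gremlin client; the function name and sample data are assumptions.

```python
def enterprises_sharing_goods(edges, target):
    """Map every other enterprise to the goods it shares with the target."""
    target_goods = {goods for enterprise, goods in edges if enterprise == target}
    shared = {}
    for enterprise, goods in edges:
        if enterprise != target and goods in target_goods:
            shared.setdefault(enterprise, set()).add(goods)
    return shared


# Hypothetical enterprise-to-goods edges, for illustration only.
sample_edges = [
    ("Aerospace Information", "goods 1"),
    ("Aerospace Information", "goods 2"),
    ("Enterprise B", "goods 1"),
    ("Enterprise C", "goods 3"),
]

neighbours = enterprises_sharing_goods(sample_edges, "Aerospace Information")
```

A graph database performs the same two-hop expansion through its indexes, which is the advantage the text attributes to graph storage over a relational join at large data volumes.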

Specifically, the calculation of the similarity of the enterprises' purchase and sales item goods comprises a purchase-item similarity and a sales-item similarity.

where N is the number of purchase item goods shared between enterprises A and B, ipa_i is defined as the amount share of the i-th shared purchase item good in enterprise A, ipb_i is defined as the amount share of the i-th shared purchase item good in enterprise B, and similar_in is defined as the purchase-item similarity.

The calculation formula of the sales-item similarity is as follows:

where M is the number of sales item goods shared between enterprises A and B, opa_i is defined as the amount share of the i-th shared sales item good in enterprise A, opb_i is defined as the amount share of the i-th shared sales item good in enterprise B, and similar_out is defined as the sales-item similarity.

similar = (similar_in + similar_out) / 2

where similar is defined as the combined purchase and sales item similarity.
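The final combination step is a plain average of the two partial similarities. A minimal sketch follows; the function name and the sample values are illustrative, and the two inputs are taken as already computed because their formulas are not reproduced in this text.

```python
def combined_similarity(similar_in, similar_out):
    """similar = (similar_in + similar_out) / 2, as stated in the text."""
    return (similar_in + similar_out) / 2


# Illustrative values only.
similar = combined_similarity(0.5, 0.75)
```

Averaging gives purchase items and sales items equal weight, so two enterprises that sell the same goods but buy entirely different ones score at most half of the maximum.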

Specifically, the enterprises whose purchase-item, sales-item, or combined purchase and sales item similarity with the target enterprise is greater than the set threshold are screened out and taken as the target enterprise's similar enterprises for purchase items, sales items, and combined purchase and sales items respectively. After the similar enterprises are screened out, their industry attributes are compared; if two enterprises with similar purchase and sales goods have clearly different industry attributes, a risk of abnormal transaction information is judged to exist.

The data set construction module is used for acquiring enterprise information, performing word segmentation matching on the acquired enterprise information, and establishing the correspondence between each enterprise and its matched purchase and sales item goods.

For the tax field, the invention calculates the similarity of enterprises' purchase and sales items and compares the industry attributes of enterprises whose purchase and sales items are similar in order to identify enterprise risk, helping tax supervision departments manage and analyze risky enterprises more effectively.

Specifically, the purchase and sales item goods similarity calculation module uses the following:

the calculation formula of the combined purchase and sales item similarity is as follows:

similar = (similar_in + similar_out) / 2

where similar is defined as the combined purchase and sales item similarity.

Specifically, the purchase and sales item similarity calculation and the query of the correspondence are performed using the Gremlin graph query language.

Example 1

Referring to FIG. 1, the invention provides an enterprise risk identification method based on an enterprise purchase-and-sales relationship graph, which comprises the following steps:

S1, acquiring enterprise tax data and constructing a data set: enterprise information is acquired from the enterprise tax data, word segmentation matching is performed on the acquired enterprise information, and the correspondence between each enterprise and its matched purchase and sales item goods is established.

Specifically, enterprise information is acquired from the tax data on invoices. It mainly comprises a seller taxpayer identification number, a seller taxpayer name, a supplier taxpayer identification number, a supplier taxpayer name, a transaction amount, the goods purchased and sold, and a transaction time, as shown in Table a.

Table a

A word segmentation algorithm is applied to the acquired enterprise information to match the purchase and sales item goods, and the enterprise name is stored together with the names of the matched goods, thereby establishing the correspondence between each enterprise and its purchase and sales item goods. For example, if the information matched for company A includes "goods 1, goods 2 and goods 3", then goods 1, goods 2 and goods 3 can be matched by the word segmentation algorithm using the dictionary, and the correspondence between company A and goods 1, goods 2 and goods 3 is established.

S2, building a relationship graph with the graph database JanusGraph: each enterprise and each purchase or sales item good in the data set is added to the distributed graph database JanusGraph as a node, and the correspondence between each enterprise and its purchase and sales item goods is added to the distributed graph database JanusGraph as edges.

The construction node comprises:

The construction edge comprises:

The correspondence between each enterprise and its purchase and sales goods is added to the graph database as edges, where the edge label for purchase item goods is input and the edge label for sales item goods is output. Information such as the transaction time and transaction amount is added to the edge attributes, which facilitates filtering by time and calculating the amount share of goods at query time. For example, to calculate the amount share of one type of goods among all goods of enterprise A: the sales items of enterprise A comprise three goods, namely goods 1 (amount a), goods 2 (amount b) and goods 3 (amount c), so the amount share of goods 3 is c/(a+b+c), as shown in FIG. 2.

S3, performing the correspondence query using the Gremlin query language.

The query specifically comprises:

Enterprise node query:

g.V().hasLabel('nsr').has('NSRMC', 'Aerospace Information').toList()

Query of nodes and edges with similar purchase and sales items:

After the enterprise information is obtained through the enterprise node query, all nodes and edges with similar purchase and sales items within a single level are quickly found through 'id'. Taking "Aerospace Information" as an example, the following statement is used: g.V('id').outE().otherV().inE().otherV().outE().simplePath().toList()

S4, calculating the purchase and sales item similarity: from the constructed relationship graph, the amount share that the purchase and sales item goods shared by any two enterprises hold in each enterprise's total purchase and sales item goods is queried, and the similarity of the enterprises' purchase and sales item goods is calculated.

The calculation comprises the purchase-item similarity and the sales-item similarity of the enterprises. The specific process of calculating the purchase-item similarity is as follows:

The purchase item goods of enterprise A and enterprise B are A: {ia_1, ia_2, …, ia_m} and B: {ib_1, ib_2, …, ib_x} respectively, and their amount shares are A: {ipa_1, ipa_2, …, ipa_m} and B: {ipb_1, ipb_2, …, ipb_x} respectively. Enterprise A and enterprise B have N identical purchase item goods (N ≥ 0; N ≤ m; N ≤ x).
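The inputs to the similarity formula, that is the N shared goods and their amount shares ipa_i and ipb_i, can be gathered as in the following sketch. It only prepares the inputs and deliberately does not implement the similarity formula itself, which is not reproduced in this text; the function name and the sample shares are hypothetical.

```python
def shared_goods_shares(shares_a, shares_b):
    """Return (goods, share in A, share in B) for every good both enterprises have."""
    shared = sorted(set(shares_a) & set(shares_b))
    return [(goods, shares_a[goods], shares_b[goods]) for goods in shared]


# Hypothetical amount shares of purchase item goods, for illustration only.
purchase_shares_a = {"steel": 0.5, "copper": 0.3, "paper": 0.2}   # m = 3
purchase_shares_b = {"steel": 0.4, "copper": 0.6}                 # x = 2

pairs = shared_goods_shares(purchase_shares_a, purchase_shares_b)
n_shared = len(pairs)  # N, which satisfies 0 <= N <= min(m, x)
```

The same helper applies unchanged to sales item goods, yielding M and the pairs (opa_i, opb_i).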

The purchase-item similarity calculation formula is:

The specific process of calculating the sales-item similarity is as follows:

The sales item goods of enterprise A and enterprise B are A: {oa_1, oa_2, …, oa_n} and B: {ob_1, ob_2, …, ob_y} respectively, and their amount shares are A: {opa_1, opa_2, …, opa_n} and B: {opb_1, opb_2, …, opb_y} respectively. Enterprise A and enterprise B have M identical sales item goods (M ≥ 0; M ≤ n; M ≤ y).

The sales-item similarity calculation formula is:

similar = (similar_in + similar_out) / 2

where similar is defined as the combined purchase and sales item similarity.

S5, screening enterprises with similar purchase and sales items.

The enterprises whose purchase-item, sales-item, or combined purchase and sales item similarity with the target enterprise is greater than the set threshold are screened out and taken as the target enterprise's similar enterprises for purchase items, sales items, and combined purchase and sales items respectively.

In this embodiment, the set threshold is 50%. An enterprise relationship graph covering massive data is built on the distributed graph database JanusGraph, and graph computation is implemented through the Gremlin query language, which overcomes the shortcomings of traditional databases and the stand-alone Neo4j graph database in graph storage and graph mining and addresses problems such as large data volume and high development cost.
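The screening step reduces to a threshold filter over precomputed similarities. A minimal sketch, assuming the similarities to the target enterprise are already available in a dictionary; the function name and sample values are illustrative, and the 0.5 default mirrors the 50% threshold of this embodiment.

```python
def screen_similar(similarities, threshold=0.5):
    """Keep enterprises whose similarity to the target is strictly greater
    than the threshold, ordered from most to least similar."""
    kept = [(name, value) for name, value in similarities.items() if value > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept]


# Hypothetical similarities to the target enterprise.
similarities = {"Enterprise B": 0.72, "Enterprise C": 0.41, "Enterprise D": 0.5}
similar_enterprises = screen_similar(similarities)
```

Because the comparison is "greater than", an enterprise sitting exactly at the threshold (Enterprise D here) is not selected.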

S6, identifying enterprise risk through enterprise industry attributes.

The industry attributes of the enterprises with similar purchase and sales items are compared; if two enterprises with similar purchase and sales items have clearly different industry attributes, a risk of abnormal transaction information is judged to exist.
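The industry comparison can be sketched as follows. It assumes, as a simplification, that "clearly different" means the industry labels are not equal; the patent does not define the comparison rule, and the function name, labels and data are hypothetical.

```python
def flag_industry_mismatch(target, similar_enterprises, industry_of):
    """Return the similar enterprises whose industry label differs from the target's.
    Inequality of labels is an assumed stand-in for 'clearly different'."""
    target_industry = industry_of[target]
    return [name for name in similar_enterprises
            if industry_of[name] != target_industry]


# Hypothetical industry attributes, for illustration only.
industry_of = {
    "Target": "wholesale",
    "Enterprise B": "wholesale",
    "Enterprise C": "software",
}

flagged = flag_industry_mismatch("Target", ["Enterprise B", "Enterprise C"], industry_of)
```

A real deployment would more likely compare hierarchical industry codes and tolerate differences within the same top-level category; the sketch shows only the shape of the check.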

Example 2

The invention also provides an enterprise risk identification system based on an enterprise purchase-and-sales relationship graph, which comprises:

For the tax field, by calculating the similarity of enterprises' purchase and sales items, the method and system take into account the weights of an enterprise's primary and secondary purchase and sales goods, so that enterprise risk is identified by comparing the industry attributes of enterprises whose purchase and sales items are similar, helping tax supervision departments manage and analyze risky enterprises more effectively.

the calculation formula of the combined purchase and sales item similarity is as follows:

similar = (similar_in + similar_out) / 2

where similar is defined as the combined purchase and sales item similarity.

For the tax field, by calculating the similarity of enterprises' purchase and sales items, the method and system take into account the weights of an enterprise's primary and secondary purchase and sales goods, so that enterprise risk is identified by comparing the industry attributes of enterprises whose purchase and sales items are similar, and tax supervision departments can manage and analyze risky enterprises more effectively.

The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described.

Claims

1. An enterprise risk identification method based on an enterprise purchase-and-sales relationship graph, characterized by comprising:

screening enterprises with similar purchase and sales items according to the similarity calculation result, and identifying enterprise risk through enterprise industry attributes;

wherein the calculation of the similarity of the enterprises' purchase and sales item goods comprises:

，

where N is the number of purchase item goods shared between enterprises A and B, ipa_i is defined as the amount share of the i-th shared purchase item good in enterprise A, ipb_i is defined as the amount share of the i-th shared purchase item good in enterprise B, and similar_in is defined as the purchase-item similarity;

the calculation formula of the sales-item similarity is as follows:

，

where M is the number of sales item goods shared between enterprises A and B, opa_i is defined as the amount share of the i-th shared sales item good in enterprise A, opb_i is defined as the amount share of the i-th shared sales item good in enterprise B, and similar_out is defined as the sales-item similarity;

similar = (similar_in + similar_out) / 2,

where similar is defined as the combined purchase and sales item similarity.

2. The enterprise risk identification method based on an enterprise purchase-and-sales relationship graph according to claim 1, wherein the data set construction comprises:

acquiring enterprise information from invoice data, the enterprise information comprising a seller taxpayer identification number, a seller taxpayer name, a supplier taxpayer identification number, a supplier taxpayer name, a transaction amount, the goods purchased and sold, and a transaction time;

3. The enterprise risk identification method based on an enterprise purchase-and-sales relationship graph according to claim 1, wherein the enterprise relationship graph is constructed using the distributed graph database JanusGraph.

4. The enterprise risk identification method based on an enterprise purchase-and-sales relationship graph according to claim 1, wherein screening enterprises with similar purchase and sales items according to the similarity calculation result and identifying enterprise risk through enterprise industry attributes comprises:

5. The enterprise risk identification method based on an enterprise purchase-and-sales relationship graph according to claim 1, wherein adding the correspondence between each enterprise and its purchase and sales item goods to the graph database as an edge comprises adding transaction time and transaction amount information to the attributes of the edge.

6. The enterprise risk identification method based on an enterprise purchase-and-sales relationship graph according to claim 1, wherein the purchase and sales item similarity calculation and the query of the correspondence comprise:

7. An enterprise risk identification system based on an enterprise purchase-and-sales relationship graph, comprising:

a risk identification module for screening enterprises with similar purchase and sales items according to the similarity calculation result and identifying enterprise risk through enterprise industry attributes;

wherein the purchase and sales item goods similarity calculation module uses the following:

，

the calculation formula of the sales-item similarity is as follows:

，

where similar is defined as the combined purchase and sales item similarity.

8. The enterprise risk identification system based on an enterprise purchase-and-sales relationship graph according to claim 7, wherein the purchase and sales item similarity calculation and the query of the correspondence comprise: