
CN106533792A - Method and device for monitoring and configuring resources - Google Patents

Method and device for monitoring and configuring resources

Info

Publication number
CN106533792A
Authority
CN
China
Prior art keywords
resource
monitoring
data
ganglia
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611140737.9A
Other languages
Chinese (zh)
Inventor
张侠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruian Technology Co Ltd
Priority to CN201611140737.9A
Publication of CN106533792A
Legal status: Pending (current)


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0803 Configuration setting
    • H04L41/0823 Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
    • H04L41/085 Retrieval of network configuration; Tracking network configuration history
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the invention disclose a method and a device for monitoring and configuring resources. The method comprises: Ganglia dynamically collects information from different functional clusters, scores each piece of node information in a resource monitoring system, and periodically records the scores in a log; Nagios sets alerts of different levels in a resource alerting system, configures plug-ins for different types of outgoing messages with customized message content, obtains the data sent by Ganglia and records them in the log; and Nagios optimizes and redistributes tasks and resources according to the scores of a resource feedback system. In this way, the big data platform is optimized and adjusted based on quantified resource and job information, and a knowledge base of historical data and handling methods is built up, which makes it easy to update the monitoring approach and to handle problems that arise by consulting the knowledge base.

Description

Method and device for monitoring and configuring resources
Technical field
The embodiments of the present invention relate to cluster monitoring, alerting and job tuning in big data, and in particular to a method and device for monitoring and configuring resources.
Background
In big data processing, as the volume of data and the number of servers in data centers grow, the requirements for monitoring and using data and resources rise accordingly. Because clusters keep growing and programs demand ever more resources, the ability to monitor cluster state in real time and give timely feedback on the cluster and on running jobs largely determines the overall function and efficiency of the whole big data platform.
Monitoring the nodes in a cluster, that is, tracking node state, is an important part of cluster management. Ganglia is an application for monitoring cluster nodes and is widely used on the big data and cloud platforms of major Internet companies.
For a system administrator, a network monitoring system serves two main purposes. First, it reveals abnormal server conditions promptly and raises alerts against preset thresholds, for example: insufficient disk space; abnormal increases in CPU and memory usage; a sudden increase in running processes; jobs running noticeably slower than before; abnormal memory usage at some stage of a job that causes the job to fail repeatedly; a node going down; or large parts of the cluster going down. Second, when problems occur in a complex application environment, such as a network outage, an application error or a system crash, the alerts from the monitoring system help locate the problem in the servers and applications quickly, gaining time to fix the fault.
Some key business systems already have monitoring deployed in production, but it has the following limitations:
(1) Limited monitoring items: monitoring is confined to items such as CPU load, memory usage and disk space.
(2) Limited scope: it cannot be extended to other systems or integrate their monitoring data.
(3) Security limitations: it requires directly probing the service ports of other applications and remotely reading system information via the Simple Network Management Protocol (SNMP), which is challenging in businesses with high network security requirements.
Nagios can comprehensively monitor the servers on a network, including the state of the services running on them (Apache, MySQL, FTP, DNS, as well as Hadoop, HBase, Solr, etc.) and the state of server system resources.
The number of business systems on big data application platforms keeps growing, and their integration and interaction increase day by day, so the probability of problems between application systems grows as well. An automated monitoring and feedback system can check the state of platform applications and services in real time, find performance bottlenecks while jobs run, and handle them automatically or raise alerts. This keeps the whole platform running efficiently and reliably, reduces the workload of monitoring staff and system administrators, improves efficiency, helps optimize program design, and reduces the losses caused by failures.
A job scheduling system is an important part of managing a cluster and the jobs running on it, and big data platforms use many of them, such as DAG (directed acyclic graph) scheduling in Hadoop and Spark and workflow scheduling in Oozie. How to combine the scheduling system with resource monitoring, however, is a problem every company has to solve. In addition, in production use, the monitoring system of a big data platform should ideally combine monitoring data with real-time job data, produce scores, feed them back to the responsible engineers and managers, and log the cluster state and job information at the time of feedback as a knowledge base for future reference. System engineers can then gain a deeper understanding of the current cluster state from these data and prepare data for future cluster expansion.
Open-source (and commercial) monitoring software has two main problems:
(1) No single tool can monitor everything that is needed;
(2) These tools must be fully adapted to different customized workloads.
Summary of the invention
The embodiments of the present invention aim to provide a method and device for monitoring and configuring resources that, while monitoring resources and jobs, also use the monitoring results to optimize cluster resource utilization and program performance, and raise alerts promptly to stop losses.
To this end, the embodiments of the present invention adopt the following technical solutions:
In a first aspect, a method for monitoring and configuring resources is provided, the method comprising:
Ganglia dynamically collects information from different functional clusters, scores each piece of node information in resource monitoring, and periodically records the scores in a log;
Nagios sets alerts of different levels in a resource alerting system, configures plug-ins for different types of outgoing messages with customized message content, obtains the data sent by Ganglia, and records them in the log;
Nagios optimizes and redistributes tasks and resources according to the scores of the resource feedback system.
Preferably, before the step in which Ganglia dynamically collects information from different functional clusters, scores each piece of node information in resource monitoring and periodically records the scores in a log, the method further comprises:
Performing add, modify, delete and query operations on monitored devices through host management. Devices can be added either by manual entry or by automatic network topology discovery of all devices in a preset network segment. Automatic discovery requires the user to specify the network segment; all IP addresses in that segment are then scanned by ping, the type of each device found is determined, and the devices are added to the host table.
Preferably, the step in which Ganglia dynamically collects information from different functional clusters comprises:
Ganglia monitors the CPU and memory information of the nodes in a cluster; the resource usage of a program while it runs is determined from the different jobs in the Oozie workflow and their running states; and the required heartbeat frequency is configured in the gmetad configuration file;
If the jobs running on the functional clusters exceed a preset job-count threshold, an optimization strategy for the program is formulated from the resource usage fed back by Ganglia and the states of the different nodes, combined with the jobs and their running conditions in YARN or Mesos.
Preferably, the step in which Ganglia dynamically collects information from different functional clusters comprises:
Obtaining the type of the monitored device, looking up the corresponding services by that type, and displaying, as a list, the services that can be monitored on the device.
Preferably, scoring each piece of node information in resource monitoring comprises:
In each monitoring period, collecting statistics on the data within that period, sampling the data in different time windows, and computing the extremes (maximum and minimum), mean and standard deviation of the sample.
Preferably, sampling the data in different time windows comprises:
First, a reference time $t_1$ is selected. The data $V_i$ generated at time $t_i$ are sampled according to the priority $p_i = w_i / u_i$, where the weight of $V_i$ is $w_i = f(t_i - t_1)$, $f$ is a monotonically non-decreasing function, for example $w_i = e^{a(t_i - t_1)}$ with $a > 0$, and $u_i$ is a random number between 0 and 1.
In a second aspect, a device for monitoring and configuring resources is provided, the device comprising:
a collection module, configured to dynamically collect information from different functional clusters;
a scoring module, configured to score each piece of node information in resource monitoring;
a first logging module, configured to periodically record the scores in a log;
a second logging module, configured to set alerts of different levels in a resource alerting system, configure plug-ins for different types of outgoing messages with customized message content, obtain the data sent by Ganglia and record them in the log;
a distribution module, configured to optimize and redistribute tasks and resources according to the scores of the resource feedback system.
Preferably, the device further comprises:
a processing module, configured to perform, before Ganglia dynamically collects information from different functional clusters, scores each piece of node information in resource monitoring and periodically records the scores in a log, add, modify, delete and query operations on monitored devices through host management. Devices can be added either by manual entry or by automatic network topology discovery of all devices in a preset network segment; automatic discovery requires the user to specify the network segment, scans all IP addresses in it by ping, determines the type of each device found, and then adds the devices to the host table.
Preferably, the collection module is specifically configured to:
monitor the CPU and memory information of the nodes in a cluster; determine the resource usage of a program while it runs from the different jobs in the Oozie workflow and their running states; configure the required heartbeat frequency in the gmetad configuration file; and, if the jobs running on the functional clusters exceed a preset job-count threshold, formulate an optimization strategy for the program from the resource usage fed back and the states of the different nodes, combined with the jobs and their running conditions in YARN or Mesos;
The collection module is further configured to:
obtain the type of the monitored device, look up the corresponding services by that type, and display, as a list, the services that can be monitored on the device.
Preferably, the scoring module is specifically configured to: in each monitoring period, collect statistics on the data within that period, sample the data in different time windows, and compute the extremes (maximum and minimum), mean and standard deviation of the sample;
The scoring module is further configured to:
select a reference time $t_1$ and sample the data $V_i$ generated at time $t_i$ according to the priority $p_i = w_i / u_i$, where the weight of $V_i$ is $w_i = f(t_i - t_1)$, $f$ is a monotonically non-decreasing function, for example $w_i = e^{a(t_i - t_1)}$ with $a > 0$, and $u_i$ is a random number between 0 and 1.
In the method and device for monitoring and configuring resources provided by the embodiments of the present invention, Ganglia dynamically collects information from different functional clusters, scores each piece of node information in resource monitoring and periodically records the scores in a log; Nagios sets alerts of different levels in a resource alerting system, configures plug-ins for different types of outgoing messages with customized message content, obtains the data sent by Ganglia and records them in the log; and Nagios optimizes and redistributes tasks and resources according to the scores of the resource feedback system. The operation of the big data platform is thereby optimized and adjusted based on quantified resource and job information, and a knowledge base of historical data and handling methods is formed, which makes it convenient to update the monitoring approach and allows problems encountered to be handled by consulting the knowledge base.
Brief description of the drawings
Fig. 1 is a flowchart of a method for monitoring and configuring resources according to an embodiment of the present invention;
Fig. 2 is a flowchart of the Ganglia data flow according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a Nagios performance-data processing framework according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a cluster architecture according to an embodiment of the present invention;
Fig. 5 is a flowchart of a method for monitoring and configuring resources according to an embodiment of the present invention;
Fig. 6 is a flowchart of the monitoring configuration function according to an embodiment of the present invention;
Fig. 7 is a functional block diagram of a device for monitoring and configuring resources according to an embodiment of the present invention.
Detailed description of the embodiments
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described here only explain the embodiments of the present invention and do not limit them. It should also be noted that, for ease of description, the drawings show only the parts related to the embodiments of the present invention rather than the entire structure.
Referring to Fig. 1, Fig. 1 is a flowchart of a method for monitoring and configuring resources according to an embodiment of the present invention.
As shown in Fig. 1, the method for monitoring and configuring resources comprises:
Step 101: Ganglia dynamically collects information from different functional clusters, scores each piece of node information in resource monitoring, and periodically records the scores in a log;
As shown in Fig. 2, the Ganglia monitoring system consists of three main parts: gmond, gmetad and ganglia-web. They exchange monitoring data in XDR (External Data Representation, a compact binary encoding) or XML format. Each node in the cluster runs gmond to collect and publish its status information; gmetad periodically polls gmond for the collected information and stores it in RRD (round-robin database) files, which can then be queried and displayed through the web server. gmond imposes very little system load, so it can run on every server in the cluster without affecting user performance. Because the cluster nodes communicate over the network, synchronizing their clocks with NTP avoids jitter between cluster nodes.
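To make the gmond-to-gmetad hop concrete: gmond answers TCP connections (port 8649 by default) with an XML dump of the metrics it knows about, and gmetad polls that dump. The sketch below parses such a dump into per-host metric dictionaries. It assumes the usual CLUSTER/HOST/METRIC element layout with NAME and VAL attributes; the function name and the sample document are hypothetical, and a real deployment would read the XML from the gmond socket rather than from a string.

```python
import xml.etree.ElementTree as ET


def parse_gmond_xml(xml_text: str) -> dict[str, dict[str, str]]:
    """Return {host_name: {metric_name: raw_value}} from a gmond XML dump.

    Values stay as strings because gmond reports every value as an attribute;
    the METRIC element's TYPE attribute tells the caller how to convert it.
    """
    root = ET.fromstring(xml_text)
    hosts: dict[str, dict[str, str]] = {}
    # iter() searches the whole tree, so this works whether HOST elements sit
    # directly under CLUSTER (gmond) or under GRID/CLUSTER (gmetad output).
    for host in root.iter("HOST"):
        hosts[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL") for metric in host.iter("METRIC")
        }
    return hosts


# Hypothetical sample in the shape gmond produces (attributes trimmed).
SAMPLE = """<GANGLIA_XML VERSION="3.7.2" SOURCE="gmond">
  <CLUSTER NAME="hadoop" LOCALTIME="0" OWNER="ops" LATLONG="" URL="">
    <HOST NAME="node01" IP="10.0.0.1" REPORTED="0" TN="5" TMAX="20" DMAX="0">
      <METRIC NAME="load_one" VAL="0.52" TYPE="float" UNITS=" " TN="5" TMAX="70"/>
      <METRIC NAME="mem_free" VAL="1024000" TYPE="float" UNITS="KB" TN="5" TMAX="180"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""
```

Calling `parse_gmond_xml(SAMPLE)` yields one entry for `node01` holding its `load_one` and `mem_free` values, which is the raw material the scoring step works on.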
Secondary development on Ganglia follows an SOA (service-oriented architecture) pattern.
As shown in Fig. 3, Nagios is used for data collection in the big data platform. Because the format of the collected data does not suit day-to-day use and management, the performance data produced by Nagios monitoring must be parsed into data that conform to the daily management specifications and saved to the system database for display.
The design of the performance-processing framework here is as follows: the performance data collected by Nagios are sent over a socket to an in-house middleware program, which parses them into a unified format and then sends them to the system database.
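A minimal sketch of the socket hop between Nagios and the middleware follows. The source does not describe the wire format, so this assumes newline-delimited records of the form `host<TAB>service<TAB>perfdata`; that framing and both function names are hypothetical. The receiver only splits records into fields; parsing the perfdata itself is the middleware's next step.

```python
import socket


def send_records(sock: socket.socket, records: list[tuple[str, str, str]]) -> None:
    """Send (host, service, perfdata) records as tab-separated, newline-terminated lines."""
    payload = "".join(f"{host}\t{service}\t{perf}\n" for host, service, perf in records)
    sock.sendall(payload.encode("utf-8"))


def receive_records(sock: socket.socket) -> list[tuple[str, str, str]]:
    """Read until the peer closes its side, then split each line into its three fields."""
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:  # empty bytes: the sender has closed the connection
            break
        chunks.append(data)
    records = []
    for line in b"".join(chunks).decode("utf-8").splitlines():
        host, service, perfdata = line.split("\t", 2)
        records.append((host, service, perfdata))
    return records
```

In a quick local trial, `socket.socketpair()` gives two connected sockets in one process: send on one end, close it, and `receive_records` on the other end returns the records in order. A production receiver would instead `accept()` connections on a listening socket.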
For the performance-data parsing program, performance-data processing (the process_perf_data option) must be enabled in the Nagios service definition; otherwise no performance data are output. The command that processes performance data is defined in the command file:
Here, 192.168.251.60 is the IP address of the Nagios service in the experiment. As the development approach, a socket-based program is implemented, packaged as a JAR and registered as a service. The method is as follows:
(1) Check the performance data; if it is null, report an error and prompt the user. Otherwise, look up the associated resource component and index, and set the index to 1.
(2) Split the performance data into an array using a regular expression.
(3) Loop over each element of the array and split it on the equals sign: the left side is the metric name; split the right side on semicolons and take the first element as the metric value.
(4) Query whether the monitoring instance exists in the database. If it does not, skip it; if it does, proceed to the next step.
(5) Using the service name and metric name, query whether the metric exists in the database; if it does not, add a new alert type.
(6) Store the metric value in the database.
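Steps (1) to (6) can be sketched as follows. The sketch assumes the standard Nagios plugin performance-data syntax (`label=value[UOM];warn;crit;min;max`, items separated by spaces, labels optionally in single quotes) and uses SQLite with a hypothetical three-table schema in place of the system database. The index bookkeeping in step (1) is omitted because the source does not specify it.

```python
import re
import sqlite3

# One perfdata item: an optionally single-quoted label, '=', then everything up to
# the next whitespace, e.g. load1=0.52;5;10;0 or 'disk /'=41%;80;90
PERF_ITEM = re.compile(r"('[^']*'|[^\s=]+)=(\S+)")
NUMBER = re.compile(r"-?\d*\.?\d+")


def parse_perfdata(perfdata: str) -> list[tuple[str, float]]:
    """Steps (1)-(3): validate, split into items, extract (metric name, value)."""
    if not perfdata or not perfdata.strip():
        raise ValueError("empty performance data")  # step (1): report an error
    metrics = []
    for label, rest in PERF_ITEM.findall(perfdata):  # step (2): regex split
        raw_value = rest.split(";")[0]  # step (3): first ';' field is the value
        match = NUMBER.match(raw_value)  # drops a unit suffix such as % or ms
        if match:  # values such as 'U' (unknown) are skipped
            metrics.append((label.strip("'"), float(match.group())))
    return metrics


def create_schema(conn: sqlite3.Connection) -> None:
    """Hypothetical schema standing in for the system database."""
    conn.executescript("""
        CREATE TABLE instance (name TEXT PRIMARY KEY);
        CREATE TABLE alert_type (service TEXT, metric TEXT, PRIMARY KEY (service, metric));
        CREATE TABLE metric_value (instance TEXT, service TEXT, metric TEXT, value REAL);
    """)


def store_perfdata(conn: sqlite3.Connection, instance: str, service: str,
                   perfdata: str) -> int:
    """Steps (4)-(6); returns how many values were stored (0 for unknown instances)."""
    # Step (4): skip data for instances that are not registered.
    if conn.execute("SELECT 1 FROM instance WHERE name = ?", (instance,)).fetchone() is None:
        return 0
    stored = 0
    for metric, value in parse_perfdata(perfdata):
        # Step (5): register a new alert type for an unseen (service, metric) pair.
        conn.execute("INSERT OR IGNORE INTO alert_type (service, metric) VALUES (?, ?)",
                     (service, metric))
        # Step (6): store the metric value.
        conn.execute("INSERT INTO metric_value (instance, service, metric, value) "
                     "VALUES (?, ?, ?, ?)", (instance, service, metric, value))
        stored += 1
    conn.commit()
    return stored
```

Using `INSERT OR IGNORE` against the primary key folds the "query, then insert if missing" of step (5) into one statement.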
Preferably, the step in which Ganglia dynamically collects information from different functional clusters comprises:
Ganglia monitors the CPU and memory information of the nodes in a cluster; the resource usage of a program while it runs is determined from the different jobs in the Oozie workflow and their running states; and the required heartbeat frequency is configured in the gmetad configuration file;
If the jobs running on the functional clusters exceed a preset job-count threshold, an optimization strategy for the program is formulated from the resource usage fed back by Ganglia and the states of the different nodes, combined with the jobs and their running conditions in YARN or Mesos.
Preferably, the step in which Ganglia dynamically collects information from different functional clusters comprises:
Obtaining the type of the monitored device, looking up the corresponding services by that type, and displaying, as a list, the services that can be monitored on the device.
Preferably, scoring each piece of node information in resource monitoring comprises:
In each monitoring period, collecting statistics on the data within that period, sampling the data in different time windows, and computing the extremes (maximum and minimum), mean and standard deviation of the sample.
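The per-period statistics can be sketched as below: timestamped values are bucketed into fixed-length monitoring periods, and each bucket is summarized by its minimum, maximum, mean and standard deviation. The text does not say whether the population or the sample standard deviation is meant; this sketch assumes the population form (`statistics.pstdev`). Names are illustrative.

```python
import statistics
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodStats:
    minimum: float
    maximum: float
    mean: float
    std_dev: float  # population standard deviation (an assumption)


def period_stats(samples: list[float]) -> PeriodStats:
    """Extremes, mean and standard deviation of one period's sample."""
    if not samples:
        raise ValueError("no samples in this monitoring period")
    return PeriodStats(min(samples), max(samples),
                       statistics.fmean(samples), statistics.pstdev(samples))


def stats_per_period(points: list[tuple[float, float]],
                     period_s: float) -> dict[int, PeriodStats]:
    """Group (timestamp, value) points into periods of period_s seconds and summarize each."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in points:
        buckets[int(ts // period_s)].append(value)  # period index 0, 1, 2, ...
    return {idx: period_stats(vals) for idx, vals in sorted(buckets.items())}
```

For example, points at 0 s and 10 s fall into period 0 and a point at 60 s into period 1 when `period_s` is 60.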
Sampling the data in different time windows comprises:
First, a reference time $t_1$ is selected. The data $V_i$ generated at time $t_i$ are sampled according to the priority $p_i = w_i / u_i$, where the weight of $V_i$ is $w_i = f(t_i - t_1)$, $f$ is a monotonically non-decreasing function, for example $w_i = e^{a(t_i - t_1)}$ with $a > 0$, and $u_i$ is a random number between 0 and 1.
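The rule $p_i = w_i / u_i$ has the form of a priority-sampling key, and $w_i = e^{a(t_i - t_1)}$ is an exponential time weight measured from the reference time, so recent data receive larger priorities. The text does not state how many items are kept, so the sketch below assumes the k items with the largest priorities are retained, as in priority sampling. It ranks by $\log p_i = a(t_i - t_1) - \log u_i$, which gives the same order as $p_i$ but avoids overflow in the exponential when $t_i - t_1$ is large. Function and parameter names are illustrative.

```python
import heapq
import math
import random
from typing import Iterable, Optional


def log_priority(t_i: float, t_1: float, a: float, u_i: float) -> float:
    """log(p_i) for p_i = w_i / u_i with w_i = exp(a * (t_i - t_1))."""
    return a * (t_i - t_1) - math.log(u_i)


def priority_sample(records: Iterable[tuple[float, object]], t_1: float, a: float,
                    k: int, rng: Optional[random.Random] = None) -> list[tuple[float, object]]:
    """Keep the k records (t_i, V_i) with the largest priorities p_i (assumed rule)."""
    if a <= 0:
        raise ValueError("the decay rate a must be positive")
    if rng is None:
        rng = random.Random()
    keyed = []
    for t_i, v_i in records:
        u_i = 1.0 - rng.random()  # random() is in [0, 1), so u_i is in (0, 1]
        keyed.append((log_priority(t_i, t_1, a, u_i), t_i, v_i))
    # nlargest with a key compares only priorities, so V_i need not be orderable.
    top = heapq.nlargest(k, keyed, key=lambda item: item[0])
    return [(t_i, v_i) for _, t_i, v_i in top]
```

A larger `a` concentrates the sample on the newest data; a small `a` keeps the sample closer to uniform over the window.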
Step 102: Nagios sets alerts of different levels in the resource alerting system, configures plug-ins for different types of outgoing messages with customized message content, obtains the data sent by Ganglia, and records them in the log;
Step 103: Nagios optimizes and redistributes tasks and resources according to the scores of the resource feedback system.
Specifically, as shown in Fig. 4, the hardware and software form a cluster environment in which different clusters can form different groups, such as a Hadoop group, a Solr group and a Spark group. The clusters generally run Linux; this design experiment used CentOS 6.4. In the big data platform, the production system consists of different functional cluster components. Because the underlying storage of these components is Hadoop HDFS, Hadoop metrics must be configured so that the monitoring functions of Ganglia and Nagios can be associated with the cluster. Ganglia dynamically collects information from different functional clusters, scores each piece of node information in resource monitoring and periodically records the scores in a log, so that the business can later be reviewed, recorded and adjusted based on the historical data. In the resource alerting system, alerts of different levels can be set, and plug-ins for different types of outgoing messages can be configured with customized message content; Nagios reacts to the data passed in from Ganglia and records them in the corresponding log. The scoring criteria of the resource feedback system can be set as in Table 1, the alert behavior is defined according to these criteria, and tasks and resources can be optimized and redistributed according to the feedback system's scores.
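Alerts of different levels map naturally onto the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), with the first output line optionally carrying performance data after a `|`. The sketch below assumes "higher is worse" thresholds and shows a hypothetical routing table that sends each level to different message plug-ins (for example SMS and email) using a customizable message template. The routing table, template and function names are illustrative, not part of Nagios.

```python
from typing import Optional

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_TEXT = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Hypothetical routing of alert levels to message plug-ins (channels).
ROUTES = {WARNING: ["email"], CRITICAL: ["sms", "email"], UNKNOWN: ["email"]}


def check_threshold(name: str, value: Optional[float],
                    warn: float, crit: float) -> tuple[int, str]:
    """Map a metric value to (exit code, plugin output line); higher values are worse."""
    if value is None:
        code = UNKNOWN
    elif value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    shown = "n/a" if value is None else f"{value:g}"
    # Perfdata after '|' lets Nagios also record the value for later analysis.
    perf = "" if value is None else f" | {name}={value:g};{warn:g};{crit:g}"
    return code, f"{name.upper()} {STATUS_TEXT[code]} - {name} is {shown}{perf}"


def build_notifications(host: str, code: int, detail: str,
                        template: str = "[{level}] {host}: {detail}") -> list[tuple[str, str]]:
    """Render one customized message per channel configured for this alert level."""
    message = template.format(level=STATUS_TEXT[code], host=host, detail=detail)
    return [(channel, message) for channel in ROUTES.get(code, [])]
```

A real plugin script would print the output line and call `sys.exit(code)` so that Nagios picks up the alert level.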
Table 1
The scoring model in Table 1 defaults to a linear relationship here; other models can be substituted according to the actual data and needs.
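The contents of Table 1 are not reproduced here, so the following is only a hypothetical illustration of a default linear scoring model: each usage percentage is mapped linearly from 100 points (idle) down to 0 points (at capacity), per-metric scores are combined as a weighted average, and illustrative score bands decide the alert level. The weights, bounds and band thresholds are assumptions, not values from the patent.

```python
def linear_score(usage: float, low: float = 0.0, high: float = 100.0) -> float:
    """Linear score: 100 at or below `low` usage, 0 at or above `high` usage."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(usage, low), high)
    return 100.0 * (high - clipped) / (high - low)


def node_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-metric linear scores; weights need not sum to 1."""
    total = sum(weights.values())
    return sum(linear_score(metrics[name]) * w for name, w in weights.items()) / total


def score_level(score: float, warn_below: float = 40.0, crit_below: float = 20.0) -> str:
    """Hypothetical score bands that turn a node score into an alert level."""
    if score < crit_below:
        return "CRITICAL"
    if score < warn_below:
        return "WARNING"
    return "OK"
```

For a node at 50% CPU and 90% memory with equal weights, the score is (50 + 10) / 2 = 30, which falls in the WARNING band under these assumed thresholds.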
In the method for monitoring and configuring resources provided by this embodiment of the present invention, Ganglia dynamically collects information from different functional clusters, scores each piece of node information in resource monitoring and periodically records the scores in a log; Nagios sets alerts of different levels in a resource alerting system, configures plug-ins for different types of outgoing messages with customized message content, obtains the data sent by Ganglia and records them in the log; and Nagios optimizes and redistributes tasks and resources according to the scores of the resource feedback system. The operation of the big data platform is thereby optimized and adjusted based on quantified resource and job information, and a knowledge base of historical data and handling methods is formed, which makes it convenient to update the monitoring approach and allows problems encountered to be handled by consulting the knowledge base.
Referring to Fig. 5, Fig. 5 is a flowchart of a method for monitoring and configuring resources according to an embodiment of the present invention.
As shown in Fig. 5, the method for monitoring and configuring resources comprises:
Step 501: add, modify, delete and query operations are performed on monitored devices through host management. Adding includes manual entry and automatic topology discovery of all devices in a preset network segment; automatic discovery requires the user to specify the network segment, scans all IP addresses in it by ping, determines the type of each device found, and then adds the devices to the host table;
Specifically, as shown in Fig. 6, the modules of the monitoring system are designed as follows:
Host and host-group management: (1) host name; (2) network address; (3) monitoring period; (4) contact; (5) notification period.
Service and service-group management: (1) host name; (2) monitoring command; (3) monitoring period, notification contacts, notification period, etc.
Time-rule management: (1) name; (2) specifically defined time periods; (3) date-specific time periods; (4) special dates (for example, holidays on which monitoring is not needed).
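The three management objects can be modeled roughly as below. This is a hypothetical data model, loosely mirroring Nagios host, service and timeperiod objects: a time rule holds weekly windows, per-date override windows and excluded special dates, and a notification is sent only inside the notification period and only if a contact is configured. All class and field names are illustrative.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, time

Window = tuple[time, time]  # [start, end) within one day


@dataclass
class TimePeriod:
    """Hypothetical time rule: name, weekly windows, dated windows, exclusions."""
    name: str
    weekly: dict[int, list[Window]]  # weekday (0 = Monday) -> windows
    dated: dict[date, list[Window]] = field(default_factory=dict)  # per-date overrides
    excluded: set[date] = field(default_factory=set)  # special dates, e.g. holidays

    def is_active(self, moment: datetime) -> bool:
        day = moment.date()
        if day in self.excluded:
            return False
        # A date-specific rule replaces the weekly rule for that day.
        windows = self.dated.get(day, self.weekly.get(moment.weekday(), []))
        now = moment.time()
        return any(start <= now < end for start, end in windows)


@dataclass
class Host:
    name: str
    address: str
    check_period: TimePeriod
    notification_period: TimePeriod
    contacts: list[str] = field(default_factory=list)


@dataclass
class Service:
    host_name: str
    check_command: str
    check_period: TimePeriod
    notification_period: TimePeriod
    contacts: list[str] = field(default_factory=list)


def should_notify(service: Service, moment: datetime) -> bool:
    """Notify only inside the notification period and only if someone is listening."""
    return bool(service.contacts) and service.notification_period.is_active(moment)
```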
Monitored device be added by Host Administration, changed, being deleted and inquiry operation.Addition equipment supports hand Dynamic addition, i.e. manual input device title and IP address;Also support the side of all devices in the automatic topology discovery network segment Formula, finds to need user-specified network section automatically, and acquiescence is the gateway that server is located herein, is then scanned in the way of ping All of IP, judgement scan the type of each equipment, are finally then added to (res_host) in host table.
Resource configuration applies specific monitoring operations to monitored objects. It first looks up the type of the selected device, then finds the corresponding services by that type and displays, as a list, the services the device can be monitored for. After confirmation, the confirmed services are added to the resource instance list, and finally the device and service instances are written to the configuration file.
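A sketch of this flow, assuming the configuration file uses Nagios object-definition syntax (`define host { ... }`, `define service { ... }`) and that the `generic-host` and `generic-service` templates from the Nagios sample configuration are available. The type-to-service table and the check command names are hypothetical; real check commands usually also take `!`-separated arguments.

```python
# Hypothetical mapping from device type to the services it can be monitored for
# (service description -> check command name).
SERVICES_BY_TYPE = {
    "linux_server": {"CPU Load": "check_load", "Disk Usage": "check_disk", "SSH": "check_ssh"},
    "switch": {"Ping": "check_ping"},
}


def monitorable_services(device_type: str) -> list[str]:
    """The list shown to the user for a device of this type."""
    return sorted(SERVICES_BY_TYPE.get(device_type, {}))


def render_config(host: str, address: str, device_type: str, confirmed: list[str]) -> str:
    """Render Nagios host and service definitions for the confirmed services."""
    commands = SERVICES_BY_TYPE.get(device_type, {})
    blocks = [
        "define host {\n"
        f"    use                  generic-host\n"
        f"    host_name            {host}\n"
        f"    address              {address}\n"
        "}\n"
    ]
    for service in confirmed:
        if service not in commands:
            raise ValueError(f"{service!r} is not offered for type {device_type!r}")
        blocks.append(
            "define service {\n"
            "    use                  generic-service\n"
            f"    host_name            {host}\n"
            f"    service_description  {service}\n"
            f"    check_command        {commands[service]}\n"
            "}\n"
        )
    return "\n".join(blocks)
```

The returned text would then be written to a file that Nagios loads through a `cfg_file` or `cfg_dir` entry in its main configuration.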
Step 502: Ganglia dynamically collects information from different functional clusters, scores each piece of node information in resource monitoring, and periodically records the scores in a log;
Step 503: Nagios sets alerts of different levels in the resource alerting system, configures plug-ins for different types of outgoing messages with customized message content, obtains the data sent by Ganglia, and records them in the log;
Step 504: Nagios optimizes and redistributes tasks and resources according to the scores of the resource feedback system.
For presentation, the application data from resource monitoring can be collected and organized, alert information and its sources can be displayed, and reference suggestions can be given for cluster expansion and task allocation; different plug-ins can be configured as needed to obtain different alert types. System staff and programmers can also combine the aggregated information with past monitoring knowledge to evaluate the current running state as a whole and then further allocate resources and tasks. They can also view the resource usage and job execution for a given time period according to how jobs ran, as shown in Fig. 6.
Platform application-service staff can model the system according to the different functional clusters. This mainly covers cluster grouping, scoring-system configuration, post-installation evaluation of different tasks and jobs, and early-warning configuration for different functional clusters (the default alert channels here are SMS or email). Programmers can set up and view the running status of related job programs, and evaluate a job's overall running indicators from key parameters such as job run time, number of CPU cores occupied, memory ratio and peak thread count, using them as important references for improving program design and performance.
Ganglia monitors the CPU and memory information of the nodes in a cluster. Combined with the different jobs in the Oozie workflow and their running states, these two sources determine a program's resource occupation while it runs, and the required heartbeat frequency (typically 30 ms) is configured in the getmad configuration file. If many jobs are still running on the cluster, the resource-occupation information and node states fed back by Ganglia must be combined with the job-running conditions in YARN or Mesos to optimize the job process and adjust job details, and from this the optimization strategy for the program is formulated.
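The decision described above can be sketched as follows, assuming the monitor reports per-node CPU and memory fractions and the scheduler (YARN or Mesos) reports `(job, node, state)` tuples; the input format, the `RUNNING` filter, and the 0.85 load limit are assumptions. When the running-job count exceeds the threshold, the sketch lists the running jobs placed on overloaded nodes, busiest node first, as candidates for optimization.

```python
def optimization_candidates(nodes, jobs, job_threshold, load_limit=0.85):
    """Return running jobs on overloaded nodes, busiest node first.

    nodes: {node: (cpu_fraction, mem_fraction)} from the monitor.
    jobs:  [(job_id, node, state)] from the scheduler.
    Returns [] while the running-job count is within job_threshold.
    """
    running = [(job, node) for job, node, state in jobs if state == "RUNNING"]
    if len(running) <= job_threshold:
        return []

    def load(node):
        # A node is as loaded as its most contended resource.
        return max(nodes[node])

    hot = [(job, node) for job, node in running if load(node) >= load_limit]
    # Busiest node first; job id breaks ties so the order is stable.
    return sorted(hot, key=lambda jn: (-load(jn[1]), jn[0]))


nodes = {"node1": (0.95, 0.60), "node2": (0.40, 0.50), "node3": (0.70, 0.88)}
jobs = [
    ("app_1", "node1", "RUNNING"),
    ("app_2", "node2", "RUNNING"),
    ("app_3", "node3", "RUNNING"),
    ("app_4", "node1", "ACCEPTED"),
]
print(optimization_candidates(nodes, jobs, job_threshold=2))
# [('app_1', 'node1'), ('app_3', 'node3')]
```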
In current cluster job runs, computation and I/O costs mean that obtaining query results and final run results sometimes exceeds the specified time. This is especially true when a Spark cluster is running: it consumes a large amount of memory and can be affected by other jobs running on the same cluster. In that case the program must be optimized according to the feedback information, applying targeted compression strategies (such as Snappy and LZO) and serialization strategies (Protobuf, Kryo, or Avro) to reduce resource consumption.
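In Spark, the serializer and codec mentioned above are selected through configuration properties such as `spark.serializer` (e.g. `org.apache.spark.serializer.KryoSerializer`) and `spark.io.compression.codec` (e.g. `snappy`). Because Snappy, LZO, Protobuf, Kryo, and Avro all need third-party libraries, the self-contained sketch below uses Python's standard `struct` and `zlib` modules as stand-ins to show why a compact binary serialization plus compression reduces the bytes that must be moved and stored; the record layout is hypothetical.

```python
import json
import struct
import zlib

# Hypothetical monitoring records: (timestamp, node id, cpu %, mem %).
records = [(1481500000 + i, i % 8, 50.0 + i % 40, 30.0 + i % 25) for i in range(1000)]

# Text serialization: self-describing but verbose.
as_json = json.dumps(
    [{"ts": t, "node": n, "cpu": c, "mem": m} for t, n, c, m in records]
).encode("utf-8")

# Fixed-schema binary serialization: little-endian uint32 timestamp,
# uint16 node id and two float32 values, 14 bytes per record.
RECORD = struct.Struct("<IHff")
as_binary = b"".join(RECORD.pack(*r) for r in records)

sizes = {
    "json": len(as_json),
    "json+zlib": len(zlib.compress(as_json)),
    "binary": len(as_binary),
    "binary+zlib": len(zlib.compress(as_binary)),
}
for name, size in sizes.items():
    print(f"{name:<12}{size:>8} bytes")

# The binary form decodes back to the original values (all of them are
# exactly representable as float32 here).
decoded = list(RECORD.iter_unpack(as_binary))
```

The binary layout trades self-description for size, which is the same trade Kryo, Protobuf, and Avro make by relying on a registered class or schema.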
Referring to Fig. 7, Fig. 7 is a schematic diagram of the functional modules of a device for monitoring and configuring resources provided by an embodiment of the present invention.
As shown in Fig. 7, the device includes:
Collection module 701, configured to dynamically collect information from the different functional clusters;
Scoring module 702, configured to score each node's information during resource monitoring;
First logging module 703, configured to periodically record the scores in a log;
Second logging module 704, configured to set alerts of different levels in the resource alerting system, configure plug-ins for different message-delivery types together with custom message content, and obtain the data sent by Ganglia and record it in the log;
Allocation module 705, configured to optimize and reallocate tasks and resources according to the scores from the resource feedback system.
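As a sketch of what allocation module 705 might do, the greedy routine below assigns each pending task to the node with the highest remaining score and then lowers that node's score by the task's cost. It assumes that a higher score means more spare capacity and that task costs use the same units as scores; the text defines neither, and the function name is hypothetical.

```python
import heapq


def reallocate(node_scores, tasks):
    """Greedy score-based assignment.

    node_scores: {node: score}, higher = more spare capacity (assumption).
    tasks: {task: cost}, same units as the score (assumption).
    Returns {task: node}.
    """
    # Max-heap via negated scores; node name breaks ties.
    heap = [(-score, node) for node, score in node_scores.items()]
    heapq.heapify(heap)
    plan = {}
    # Place the most expensive tasks first so they land on the best nodes.
    for task, cost in sorted(tasks.items(), key=lambda kv: (-kv[1], kv[0])):
        neg_score, node = heapq.heappop(heap)
        plan[task] = node
        heapq.heappush(heap, (neg_score + cost, node))  # score drops by cost
    return plan


print(reallocate({"node1": 90, "node2": 60, "node3": 30},
                 {"etl": 50, "report": 20, "backup": 40}))
# {'etl': 'node1', 'backup': 'node2', 'report': 'node1'}
```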
Preferably, the device further includes:
A processing module, configured to perform add, modify, delete, and query operations on monitored devices through host management before Ganglia dynamically collects information from the different functional clusters and each node's information is scored and periodically recorded in the log during resource monitoring. The add operation supports both manual entry and automatic topology discovery of all devices in a preset network segment. Automatic discovery works as follows: the user specifies the network segment, all IP addresses in it are scanned by ping, the type of each responding device is determined, and the device is then added to the host table.
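The automatic discovery steps (user-specified segment, ping scan, type detection, insertion into the host table) can be sketched as below. The `ping` helper shells out to the system `ping` command with a single echo request; the device-type classifier is a hypothetical hook, and the usage example stubs out both so it runs offline.

```python
import ipaddress
import platform
import subprocess


def ping(ip):
    """True if one ICMP echo request to `ip` gets a reply (system `ping`)."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        done = subprocess.run(
            ["ping", count_flag, "1", ip],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=3,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return done.returncode == 0


def discover(segment, host_table, probe=ping, classify=lambda ip: "unknown"):
    """Probe every host address in `segment`; add new responders to host_table.

    Returns the newly added addresses in scan order.
    """
    added = []
    for addr in map(str, ipaddress.ip_network(segment, strict=False).hosts()):
        if addr in host_table or not probe(addr):
            continue  # already managed, or no reply
        host_table[addr] = {"type": classify(addr)}
        added.append(addr)
    return added


# Offline usage: stubbed probe and classifier (hypothetical devices).
alive = {"192.168.10.1": "switch", "192.168.10.7": "server"}
hosts = {}
print(discover("192.168.10.0/28", hosts, probe=lambda ip: ip in alive, classify=alive.get))
# ['192.168.10.1', '192.168.10.7']
```

Scanning one address at a time is slow for a full /24; a production scanner would run the probes concurrently.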
Preferably, the collection module 701 is specifically configured to:
monitor the CPU and memory information of the nodes in the cluster, determine a program's resource occupation while it runs according to the different jobs in the Oozie workflow and their running states, and configure the required heartbeat frequency in the getmad configuration file; if the number of jobs running on the different functional clusters exceeds a preset job-count threshold, formulate the optimization strategy for the program by combining the fed-back resource-occupation information and node states with the jobs and their running conditions in YARN or Mesos;
The collection module 701 is further specifically configured to:
obtain the type of each monitored device, look up the corresponding services by that type, and display, in list form, the services that can be monitored on the device.
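A minimal sketch of the type-to-service lookup and list display. The catalogue of device types and services is hypothetical, as is the fallback of offering only a ping check for unknown types.

```python
# Hypothetical catalogue: which checks apply to which device type.
SERVICES_BY_TYPE = {
    "server": ["PING", "CPU load", "Memory", "Disk usage", "SSH"],
    "switch": ["PING", "Interface traffic", "SNMP uptime"],
    "hadoop-node": ["PING", "CPU load", "Memory", "HDFS DataNode", "YARN NodeManager"],
}


def monitorable_services(device_type):
    """Look up the services that can be monitored for a device type."""
    return SERVICES_BY_TYPE.get(device_type, ["PING"])  # assumed fallback


def render_list(host, device_type):
    """Format the services of one device as a numbered list."""
    lines = [f"{host} ({device_type})"]
    lines += [f"  {i}. {svc}" for i, svc in enumerate(monitorable_services(device_type), 1)]
    return "\n".join(lines)


print(render_list("192.168.10.1", "switch"))
# 192.168.10.1 (switch)
#   1. PING
#   2. Interface traffic
#   3. SNMP uptime
```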
Preferably, the scoring module 702 is specifically configured to: collect statistics on the data in each monitoring cycle, sample the data in different time periods, and obtain the extreme values (minimum and maximum), mean, and standard deviation of the statistical sample;
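The per-cycle statistics can be sketched with Python's `statistics` module. The sketch groups `(timestamp, value)` samples into fixed-length cycles and reports minimum, maximum, mean, and population standard deviation per cycle; using the population (rather than sample) standard deviation is an assumption, since the text does not specify which.

```python
import statistics
from collections import defaultdict


def cycle_stats(samples, cycle_s):
    """Group (timestamp, value) samples into cycles of cycle_s seconds and
    return {cycle_start: (min, max, mean, population_stdev)}."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % cycle_s].append(value)
    return {
        start: (min(vals), max(vals), statistics.mean(vals), statistics.pstdev(vals))
        for start, vals in sorted(buckets.items())
    }


cpu = [(0, 40.0), (20, 60.0), (40, 50.0), (60, 90.0), (80, 70.0)]
for start, (lo, hi, mean, sd) in cycle_stats(cpu, 60).items():
    print(f"cycle@{start}s min={lo} max={hi} mean={mean} sd={sd:.2f}")
```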
The scoring module 702 is further specifically configured to:
first select a reference time $t_1$; the data $V_i$ generated at time $t_i$ are sampled according to $p_i = w_i / u_i$, where the weight of $V_i$ is $w_i = f(t_i - t_1)$ and $f$ is a monotonically non-decreasing function; specifically, $w_i = e^{a(t_i - t_1)}$ with $a > 0$, and $u_i$ is a random number between 0 and 1.
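The weighting above has the shape of priority sampling combined with forward (exponential) decay: each item gets a priority $p_i = w_i / u_i$, and newer items receive larger weights. The text does not say how many items are retained, so the sketch assumes a fixed-size sample that keeps the $k$ items with the largest $p_i$ (the function name is hypothetical). Items are ranked by $\log p_i = a(t_i - t_1) - \log u_i$, which preserves the order of $p_i$ while avoiding overflow in `exp()` when $t_i - t_1$ is large.

```python
import heapq
import math
import random


def decayed_priority_sample(items, k, t1, a, rand=None):
    """Keep the k items with the largest p_i = w_i / u_i, where
    w_i = exp(a * (t_i - t1)) and u_i is uniform on (0, 1].

    items: iterable of (t_i, V_i) pairs. Returns the kept pairs sorted by t_i.
    """
    if a <= 0:
        raise ValueError("a must be > 0")
    if k <= 0:
        return []
    rand = rand or (lambda: 1.0 - random.random())  # never returns 0
    heap = []  # min-heap holding the k largest keys seen so far
    for idx, (t_i, v_i) in enumerate(items):
        # log p_i = a*(t_i - t1) - log(u_i): same order as p_i, no overflow.
        key = a * (t_i - t1) - math.log(rand())
        entry = (key, idx, t_i, v_i)  # idx breaks ties before v_i is compared
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [(t_i, v_i) for _, _, t_i, v_i in sorted(heap, key=lambda e: e[2])]


stream = [(t, f"metric@{t}") for t in range(100)]
print(decayed_priority_sample(stream, k=5, t1=0, a=0.05))
```

Because $u_i$ is random, an older item can still be kept when its $u_i$ is very small; a larger $a$ biases the sample more strongly toward recent data.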
In the device for monitoring and configuring resources provided by the embodiments of the present invention, Ganglia dynamically collects information from the different functional clusters, scores each node's information during resource monitoring, and periodically records it in a log; Nagios sets alerts of different levels in the resource alerting system, configures plug-ins for different message-delivery types together with custom message content, and obtains the data sent by Ganglia and records it in the log; Nagios then optimizes and reallocates tasks and resources according to the scores from the resource feedback system. Operations on the big data platform can therefore be optimized and adjusted on the basis of quantified resource and job information. A knowledge base is built from historical data and handling methods, which makes it easy to update the monitoring approach and allows problems to be handled by consulting the knowledge base.
The technical principles of the embodiments of the present invention have been described above with reference to specific embodiments. These descriptions are intended only to explain those principles and shall not be construed in any way as limiting the scope of protection of the embodiments. Based on the explanation herein, those skilled in the art can conceive of other specific embodiments without creative effort, and such embodiments also fall within the scope of protection of the embodiments of the present invention.

Claims (10)

1. A method for monitoring and configuring resources, characterized in that the method comprises:
Ganglia dynamically collecting information from different functional clusters, scoring each node's information during resource monitoring, and periodically recording it in a log;
Nagios setting alerts of different levels in a resource alerting system, configuring plug-ins for different message-delivery types together with custom message content, and obtaining the data sent by Ganglia and recording it in the log;
Nagios optimizing and reallocating tasks and resources according to the scores from the resource feedback system.
2. The method according to claim 1, characterized in that, before Ganglia dynamically collects information from the different functional clusters and each node's information is scored and periodically recorded in the log during resource monitoring, the method further comprises:
performing add, modify, delete, and query operations on monitored devices through host management, wherein the add operation supports manual entry and automatic topology discovery of all devices in a preset network segment, and the automatic discovery comprises: receiving a user-specified network segment, scanning all IP addresses in it by ping, determining the type of each discovered device, and then adding the device to the host table.
3. The method according to claim 1, characterized in that Ganglia dynamically collecting information from the different functional clusters comprises:
Ganglia monitoring the CPU and memory information of the nodes in the cluster, determining a program's resource occupation while it runs according to the different jobs in the Oozie workflow and their running states, and configuring the required heartbeat frequency in the getmad configuration file;
if the number of jobs running on the different functional clusters exceeds a preset job-count threshold, Ganglia formulating the optimization strategy for the program by combining the fed-back resource-occupation information and node states with the jobs and their running conditions in YARN or Mesos.
4. The method according to claim 1, characterized in that Ganglia dynamically collecting information from the different functional clusters comprises:
obtaining the type of each monitored device, looking up the corresponding services by that type, and displaying, in list form, the services that can be monitored on the device.
5. The method according to claim 1, characterized in that scoring each node's information during resource monitoring comprises:
collecting statistics on the data in each monitoring cycle, sampling the data in different time periods, and obtaining the extreme values, mean, and standard deviation of the statistical sample.
6. The method according to claim 5, characterized in that sampling the data in different time periods comprises:
first selecting a reference time $t_1$ and sampling the data $V_i$ generated at time $t_i$ according to $p_i = w_i / u_i$, where the weight of $V_i$ is $w_i = f(t_i - t_1)$ and $f$ is a monotonically non-decreasing function; the $w_i = e^{a(t_i - t_1)}$ with $a > 0$; and $u_i$ is a random number between 0 and 1.
7. A device for monitoring and configuring resources, characterized in that the device comprises:
a collection module, configured to dynamically collect information from different functional clusters;
a scoring module, configured to score each node's information during resource monitoring;
a first logging module, configured to periodically record the scores in a log;
a second logging module, configured to set alerts of different levels in a resource alerting system, configure plug-ins for different message-delivery types together with custom message content, and obtain the data sent by Ganglia and record it in the log;
an allocation module, configured to optimize and reallocate tasks and resources according to the scores from the resource feedback system.
8. The device according to claim 7, characterized in that the device further comprises:
a processing module, configured to perform add, modify, delete, and query operations on monitored devices through host management before Ganglia dynamically collects information from the different functional clusters and each node's information is scored and periodically recorded in the log during resource monitoring, wherein the add operation supports manual entry and automatic topology discovery of all devices in a preset network segment, and the automatic discovery comprises: receiving a user-specified network segment, scanning all IP addresses in it by ping, determining the type of each discovered device, and then adding the device to the host table.
9. The device according to claim 7, characterized in that the collection module is specifically configured to:
monitor the CPU and memory information of the nodes in the cluster, determine a program's resource occupation while it runs according to the different jobs in the Oozie workflow and their running states, and configure the required heartbeat frequency in the getmad configuration file; if the number of jobs running on the different functional clusters exceeds a preset job-count threshold, formulate the optimization strategy for the program by combining the fed-back resource-occupation information and node states with the jobs and their running conditions in YARN or Mesos;
the collection module is further specifically configured to:
obtain the type of each monitored device, look up the corresponding services by that type, and display, in list form, the services that can be monitored on the device.
10. The device according to claim 7, characterized in that the scoring module is specifically configured to: collect statistics on the data in each monitoring cycle, sample the data in different time periods, and obtain the extreme values, mean, and standard deviation of the statistical sample;
the scoring module is further specifically configured to:
first select a reference time $t_1$ and sample the data $V_i$ generated at time $t_i$ according to $p_i = w_i / u_i$, where the weight of $V_i$ is $w_i = f(t_i - t_1)$ and $f$ is a monotonically non-decreasing function; the $w_i = e^{a(t_i - t_1)}$ with $a > 0$; and $u_i$ is a random number between 0 and 1.
CN201611140737.9A 2016-12-12 2016-12-12 Method and device for monitoring and configuring resources Pending CN106533792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611140737.9A CN106533792A (en) 2016-12-12 2016-12-12 Method and device for monitoring and configuring resources

Publications (1)

Publication Number Publication Date
CN106533792A true CN106533792A (en) 2017-03-22

Family

ID=58342011

Country Status (1)

Country Link
CN (1) CN106533792A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050132041A1 (en) * 2003-12-10 2005-06-16 Ashish Kundu Systems, methods and computer programs for monitoring distributed resources in a data processing environment
CN103905253A (en) * 2014-04-04 2014-07-02 浪潮电子信息产业股份有限公司 Server monitoring and management method based on Nagios and BMC
CN104092575A (en) * 2014-07-29 2014-10-08 中国联合网络通信集团有限公司 Resource monitoring method and system
CN105208098A (en) * 2015-08-24 2015-12-30 用友网络科技股份有限公司 Cloud monitoring system realization device and method
CN105260235A (en) * 2015-09-23 2016-01-20 浪潮集团有限公司 Method and device for scheduling resources on basis of application scenarios in cloud platform
CN105718351A (en) * 2016-01-08 2016-06-29 北京汇商融通信息技术有限公司 Hadoop cluster-oriented distributed monitoring and management system
CN105975378A (en) * 2016-05-11 2016-09-28 国网江苏省电力公司 Distributed layering autonomous monitoring and management system facing supercomputer

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108243061A (en) * 2017-10-10 2018-07-03 北京车和家信息技术有限公司 Apparatus monitoring method, device and computer equipment based on Nagios
CN108616421A (en) * 2018-04-13 2018-10-02 郑州云海信息技术有限公司 A kind of condition detection method of multi-node cluster, device and equipment
CN108845865A (en) * 2018-06-28 2018-11-20 郑州云海信息技术有限公司 A kind of monitoring service dispositions method, system and storage medium
CN110795301A (en) * 2018-08-01 2020-02-14 马上消费金融股份有限公司 Job monitoring method, device, terminal and computer storage medium
CN111435319A (en) * 2019-01-15 2020-07-21 阿里巴巴集团控股有限公司 Cluster management method and device
CN109951313B (en) * 2019-01-18 2022-04-19 长江大学 Monitoring device and method for Hadoop cloud platform
CN109951313A (en) * 2019-01-18 2019-06-28 长江大学 A kind of monitoring device and method of Hadoop cloud platform
CN110545326A (en) * 2019-09-10 2019-12-06 杭州数梦工场科技有限公司 Cluster load scheduling method and device, electronic equipment and storage medium
CN110545326B (en) * 2019-09-10 2022-09-16 杭州数梦工场科技有限公司 Cluster load scheduling method and device, electronic equipment and storage medium
CN112291194A (en) * 2020-09-27 2021-01-29 上海赫千电子科技有限公司 State management method and device based on ECU in vehicle-mounted network and intelligent automobile
CN112291194B (en) * 2020-09-27 2022-12-13 上海赫千电子科技有限公司 State management method and device based on ECU in vehicle-mounted network and intelligent automobile
CN112241349A (en) * 2020-10-21 2021-01-19 山东超越数控电子股份有限公司 Method and system for automatically configuring and managing network IP address of whole cabinet server
CN113495840A (en) * 2021-06-22 2021-10-12 北京交通大学 Big data platform testing method based on bottleneck resource positioning and parameter optimization
CN117749645A (en) * 2023-11-29 2024-03-22 北京金诺珩科技发展有限公司 Machine room dynamic IP address data acquisition method
CN117749645B (en) * 2023-11-29 2024-06-04 北京金诺珩科技发展有限公司 Machine room dynamic IP address data acquisition method
CN118394455A (en) * 2024-07-01 2024-07-26 北京科杰科技有限公司 Big data assembly cluster management system based on java language
CN118394455B (en) * 2024-07-01 2024-08-23 北京科杰科技有限公司 Big data assembly cluster management system based on java language

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170322