Practical DevOps for Big Data/Trace Checking

Introduction

Trace checking is an approach for the analysis of system executions that are recorded as sequences of timestamped events. Collected logs are analyzed to establish whether the system logs satisfy a property, usually specified in a logical language. In the positive case, the sampled system behavior conforms with the constraints modeled by the property; conversely, the system behavior simply does not satisfy the property.

When the language that is used to specify properties allows for the usage of temporal operators, trace checking is a way to check the correctness of the ordering of the events occurring in the system and of the time delays between pairs of events. For instance, if a property requires that all the emitting events of a certain node occur not more than ten millisecond after the latest receive event, then checking the property over a trace results in a boolean outcome, which is positive if the distance between two consecutive and ordered pair of emit and receive events is less than ten milliseconds.

The tool that is available in DICE to perform trace-checking (of Storm applications) is DICE-TraCT.

Motivations

Trace checking is especially useful when the aggregated data that are available from the monitoring system are not enough to conclude the correctness of the system executions with respect to some specific criteria. In some cases, in fact, these criteria are application dependent as they are related to some non-functional property of the application itself and they do not depend on the physical infrastructure where the application is executed.

Trace checking is a possible technique to achieve this goal and can be used on purpose to extract information from the executions of a running application.

Logical languages involved in the trace checking analysis are usually extensions of metric temporal logics which offer special operators called aggregating modalities. These operators hold if the trace satisfies particular quantitative features like, for instance, a specific counting property of events in the trace.

According to the DICE vision, trace checking is performed after verification to allow for continuous model refinement. The result obtained through the log analysis confirms or refutes the outcome of the verification task, which is run at design time. The value of the parameters in the design-time model is compared with the value at runtime; if both are “compatible” then the results of verification are valid, otherwise the model must be refined.

Existing solutions

The research area related to runtime verification is very active on trace-checking. Various are the tools that support trace-checking and that provide either on-line or off-line analysis. Trace-checking in DICE is applied off-line but further extensions to the framework might consider also on-line analysis to improve the model-driven refinement that designers can perform. As far as the team knowledge, DICE is (one of) the first attempt which uses trace-checking analysis for supporting model-driven development of DIA. Trace-checking engines are tools that can be used in the assessment process and the choice of a specific one depends on the kind of analysis that designers wants to achieve. The International Competition on Runtime Verification is the best reference to compare available tools and techniques. Some of the tools that were involved in the competition and that might be considered as alternative to the ones used and implemented in DICE are the following:

Rithm2 (paper): efficient parallel algorithm for verifying log traces with an extended version of LTL with counting semantics.
MonPoly: first-order extension of LTL enriched with aggregating modelities.
ARTiMon: a tool for the analyses of flow of timestamped observations that can be used to detect hazards expressed in a temporal logic-based language.

How the tool works

The outcome of the trace-checking analysis is used for refining the model of the application at design time that is verified with D-VerT.

Trace-checking is performed at run-time but it is based on annotated DTSM models, defined at design-time, that contain the topology undergoing the analysis.

DICE-TraCT has been designed to encompass Soloist language Soloist which offers the following class of aggregating modalities:

number of occurrences of an event e in a time window of length d,
maximum/average number of occurrences of an event e, aggregated over right-aligned adjacent non-overlapping subintervals (of size h) in a time window of length d,
average time elapsed between a pair of specific adjacent and alternating events e and e occurring in a time window of length d.

However, some basic functionalities, such as the averaging of event in a given time window, have been developed in a simple trace analyser to improve the flexibility of the tool. The user can therefore exploit the most appropriate engine to perform the log analysis.

The architecture

DICE-TraCT has a client-server architecture: the client component is an Eclipse plugin that is fully integrated with the DICE IDE and the server component is a RESTful web server. The client component manages the transformation from the DTSM model defined by the user to an intermediate JSON object which is then used to invoke the server. The server component, based on the content of the JSON file generates the trace-checking instance for the trace-checking engine.

DICE-TraCT is an Eclipse plugin that can be executed within the DICE IDE. To carry out the trace-checking analysis, Storm topologies have to be specified by means of UML class diagrams and activity diagrams with the DICE::Storm profile. The diagrams can easily be drawn by dragging and dropping Eclipse items that are available in the DICE toolbars whereas the annotations can be specified in the bottom panel of the DICE IDE.

Figure 1. The architectural connectors involving DICE-TraCT

DICE-TraCT can be set through a suitable Run Configuration window allowing users to launch the analysis. The user can specify:

the file containing the model undergoing the analysis,
the time-bound limiting the duration of the time window that is analyzed,
the set of Storm bolts and spouts whose metrics (sigma and avg_emit_rate) will be calculated by means of the topology logs,
the IP address and the port where the DICE-TraCT server is running and
the IP address and the port of the monitoring platform.

The trace-checking process

Trace-checking is performed on a running deployed application that has been designed by means of an abstract model verified by D-VerT (verification tool), and possibly with the results obtained from the simulation and optimization tools. The parameter values that were employed to carry out the analysis with D-VerT are compared with the current values obtained by the analysis of the application log traces performed by DICE-TraCT. Based on the kind of the analysis (counting, average or quantitative) and the structure of the topology, DICE-TraCT retrieves the logs from the monitoring platform (D-Mon) and performs the analysis. When the outcome is available, the result is shown in the DICE IDE to allow the user to improve the model of the application, rerun the verification tool and, based on the outcome of the analysis, change the implementation if needed. The process is iterative and can be repeated until the verification tool does not detect anomalies that might lead the deployed application to undesired behaviours at runtime.

Open challenges

Trace-checking of application logs can be achieved successfully when logs contain suitable entries that enable the evaluation of temporal formulae or metrics. When the property to be checked only refers to events of the application that are tracked by the monitoring service of the data-intensive platform which runs the application, trace-checking can be employed simply by defining the property (or metric) to assess and by running the trace-checker on a selected log history. On the other hand, when the property to be checked refers to application events that are not natively monitored by the framework, trace-checking requires an ad-hoc instrumentation of the application code that allows the monitoring service to log those application-specific events needed for the evaluation of the property of interest.

Application domains

DICE-TraCT currently supports Storm logs analysis.

The current implementation of DICE-TraCT enables the analysis of two distinct parameters which are tightly related to the verification analysis carried out by D-VerT. In particular, the two available metrics that DICE-TraCT can elaborate are:

avg_emit_rate that is the emitting rate of a spout node, defined as the number of tuple emitted per second, and
sigma that is the ratio between the number of tuples produced by a bolt node in output and the number of tuples that it has received in input.

The value of the metrics that is obtained by trace-checking helps designers in tuning the model used for verification with D-VerT.

The trace-checking analysis can be restricted only to a specific time window and consider only the log entries that occur in a limited amount of time. The size determines how many log events are considered to carry out the evaluation of the metrics (The unit measurement is the millisecond).

DICE-TraCT implements three different kind of analysis calculating the metrics. Each one depends on the features of the underlying engines that are used to analyse the logs. Logical languages involved in the trace checking analysis are usually extensions of metric temporal logics which offer special operators called aggregating modalities. These operators are useful to count events in the logs or calculate the average number of the occurrences over a certain time window and compare the value with a given threshold. Therefore, the outcome of the analysis is always a boolean answer that can be either yes or no. DICE-TraCT is based on Soloist but the current implementation of DICE-TraCT can also call a simpler trace-checker than the one for Soloist, that is developed to provide quantitative informations calculated from the logs.

The following kind of analysis are available:

counting: it enables the evaluation of the selected metric by means of the so called Counting formulae of the Soloist logic. This choice will call the Soloists trace-checker.
average: it enables the evaluation of the selected metric by means of the so called Average formulae of the Soloist logic. This choice will call the Soloists trace-checker.
quantitative: it enables the quantitative analysis of the logs which extracts the value of the metric.

The analysis instantiated by the Soloist trace checker is designed to provide a boolean result that is calculated by comparing the number of occurrences of an event in the log with a user defined threshold. The relations <, = and > and the threshold value can be specified via an appropriate configuration window when DICE-TraCT is launched.

A test case - Wikistats

DICE-TraCT is an Eclipse plugin that can be executed within the DICE IDE. To carry out log analysis, Storm topologies have to be specified by means of UML class diagrams and activity diagrams. The diagrams can easily be drawn by dragging and dropping Eclipse items that are available in the DICE toolbars whereas the annotations can be specified in the bottom panel of the DICE IDE.

Figure 2 depicts the Storm topology implementing the Wikistats topology (see the Github repository for implementation details). The diagram is a DTSM model which is also used for the analysis with D-VerT.

Figure 2. Wikistats topology

DICE-TraCT can be set through a Run Configuration window allowing users to launch the analysis. The user can specify:

the file containing the model undergoing the analysis,
the Storm components undergoing analysis,
the IP address and the port where the DICE-TraCT server is running.

The analysis allows for the extraction of model parameters related to verification, i.e., values that are used in the formal model analysed by D-VerT. DICE-TraCT extracts the following parameter values:

sigma: ratio between number of tuples emitted by a bolt and number of tuples received by the bolt.
alpha: average time required to elaborate a tuple by a bolt.
emit rate: average emit rate of a spout.

Figure 3 shows the run configuration for DICE-TraCT and the list of the components defining the Wikistats topology. Moreover, the last column on the right-hand side of the figure shows the result of the analysis for some components of the topology after the execution of DICE-TraCT.

Figure 3. DICE-TraCT run configuration and results

Conclusion

DICE-TraCT has been developed to perform log analysis of Storm applications. It is used to assess the runtime behavior of a deployed application as log traces collected from the monitoring platform are analyzed to certify the adherence to the behavioral model that is assumed at design time. If the runtime behavior does not conform to the design, then the design must be refined and later verified to obtain a new certification of correctness. Trace-checking can be applied to supply information to the verification task carried out by D-VerT. Trace checking extracts from real executions the parameter values of the model used by D-VerT that are not available from the monitoring service of the framework, as they are inherently specific of the modeling adopted for the verification.