Practical DevOps for Big Data/Platform-Independent Modelling



Introduction

DICE provides software architects with a set of core concepts, at the DPIM layer, for specifying the fundamental architecture elements of a Data-Intensive Application (DIA) during the DIA design phase. Designers can use these core elements to quickly assemble the structural view of their Big Data application, highlighting and addressing concerns such as data flow, essential high-level processing properties (e.g., rate, and the properties provided and required by every component), and key data processing needs (e.g., batch or streaming).

DPIM Profile

DPIM includes all the concepts needed to structure a DIA. At the DPIM level we define the high-level topology of the application and its QoS requirements. Elements of the DPIM meta-model fall into two categories:

  1. Active DIA elements, which process the data, such as computation nodes;
  2. Passive DIA elements, which store and visualize the data, such as storage nodes.
 
Picture of the DICE DPIM Profile (Meta-model)

More specifically, the DICE DPIM Profile meta-model shows that DIA elements are essentially aggregates of two sets of components. The first is the "ComputationNode", which is responsible for carrying out computational tasks such as map or reduce in MapReduce. One important attribute of ComputationNode is "computationType", which indicates the type of Big Data processing, i.e., batch processing or stream processing. ComputationNode is further specialized into "SourceNode" and "Visualization" nodes. The SourceNode provides data for processing; in other words, it represents the source of the data that enters the application to be processed. Its "sourceType" attribute further specifies the characteristics of the source. The ultimate goal of a Big Data application is to process data that has high volume and velocity, so SourceNode and ComputationNode belong in DPIM because they are essential parts of every DIA: the SourceNode is the entry point of data into the application, and the ComputationNode is where the data is processed.

Visualization here means presenting the data through graphs computed by data-intensive means, so that the knowledge it carries is conveyed more intuitively and effectively. Although the visualization of Big Data could be done by a separate application, we treat it as a specialization of ComputationNode because visualization is ultimately a data-intensive computation task. Another specialization of ComputationNode is the FilterNode, whose role is to perform any pre-processing or post-processing of data that is needed.

The second key element in the DICE profile is the "StorageNode". As its name suggests, the StorageNode represents the element responsible for storing data, either long term or short term. It is associated with a "Channel", which represents a communication channel in the application. The specification of a Channel captures the channel's restrictions and constraints, as well as characteristics related to the transformation of data, such as information rate and taps. The StorageNode concept in DPIM mainly corresponds to the "database" in the model; in some cases it could also be a "filesystem". The Channel in DPIM represents "Governance and data Integration", which mainly covers the technologies responsible for transferring data, such as message broker systems. The other elements in the model are "DataSpecification" and "QoSRequiredProperty", annotation stubs for specifying, respectively, the type and format of the data and the QoS of the system and its evaluation. These annotations are inherited from MODACloudML[1]. Appendix A[2] specifies the DICE Profile in greater detail. Table 1 summarizes the current list of stereotypes of the DICE Profile at the DPIM level.
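The element hierarchy described above can be sketched as plain Python classes. This is only an illustrative reading of the prose, not the DICE profile itself: the field names of DataSpecification and QoSRequiredProperty, the `source`/`target` endpoints of Channel, the `DIAElement` base class, and the `is_active` helper are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ComputationType(Enum):
    BATCH = "batch"
    STREAM = "stream"


@dataclass
class DataSpecification:
    """Annotation stub for the type and format of data (field names assumed)."""
    data_type: str
    data_format: str


@dataclass
class QoSRequiredProperty:
    """Annotation stub for a QoS requirement (field names assumed)."""
    metric: str
    threshold: float


@dataclass
class DIAElement:
    """Hypothetical common base for DIA elements."""
    name: str
    qos: List[QoSRequiredProperty] = field(default_factory=list)


@dataclass
class ComputationNode(DIAElement):
    computation_type: ComputationType = ComputationType.BATCH


@dataclass
class SourceNode(ComputationNode):
    source_type: str = "unspecified"


@dataclass
class VisualizationNode(ComputationNode):
    pass


@dataclass
class FilterNode(ComputationNode):
    pass


@dataclass
class StorageNode(DIAElement):
    data: Optional[DataSpecification] = None


@dataclass
class Channel:
    """Connector between two elements; the endpoint fields are assumed."""
    source: DIAElement
    target: DIAElement
    rate: Optional[float] = None  # information rate, unit left unspecified


def is_active(element: DIAElement) -> bool:
    """Active elements process data; storage nodes are passive."""
    return isinstance(element, ComputationNode)


# A tiny, made-up topology: source -> filter -> storage.
src = SourceNode("sensor-feed", computation_type=ComputationType.STREAM,
                 source_type="sensor")
flt = FilterNode("clean-up", computation_type=ComputationType.STREAM)
store = StorageNode("archive", data=DataSpecification("positions", "json"))
channels = [Channel(src, flt), Channel(flt, store)]
```

Modelling SourceNode, VisualizationNode and FilterNode as subclasses mirrors the text's statement that they specialize ComputationNode, so any check written against ComputationNode (such as `is_active`) applies to them as well.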

| Stereotype | Description (this stereotype is for model elements representing...) |
| --- | --- |
| DpimComputationNode | DIA components with computation throughput, type of data processing, and maybe expected target technology. |
| DpimFilterNode | Filter nodes that extend general DpimComputationNode with input and output ratios. |
| DpimSourceNode | DIA components with a given storage volume, type of generated data, and data generation rate. |
| DpimStorageNode | DIA component with resource multiplicity, type of stored data, and speed in terms of maximum operations rate. |
| DpimChannel | Connectors that have a maximum speed and that are subject to failures and propagation of errors. |
| DpimScenario | An execution scenario of the DIA, which defines the quality properties of interest and the scenario quality requirements. |

DPIM Example: The Maritime Operations Case Study

In this section, we describe a UML-based design (i.e., Activity Diagram) that is annotated using the DPIM profile. In particular, input parameters are assigned to the mean durations of the action steps (i.e., hostDemand tagged-values) and to the data stream arrival rate (i.e., arrivalRate tagged-value). We show the modelling of a portion of the Maritime Operations case study.
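To give a feel for why these two kinds of input parameters matter, the sketch below combines an arrival rate with per-step host demands using the Utilization Law (utilization = arrival rate × mean service demand). It is not the analysis DICE performs: it assumes each step is served by a single dedicated resource, and every number and every step name other than the parsing step is made up for illustration.

```python
def utilization(arrival_rate_per_s: float, host_demand_ms: float) -> float:
    """Utilization Law: U = arrival rate x mean service demand (in seconds)."""
    return arrival_rate_per_s * (host_demand_ms / 1000.0)


# Hypothetical input parameters (not taken from the case study).
arrival_rate = 50.0  # stream items per second (arrivalRate)
host_demands_ms = {"parse": 4.0, "filter": 10.0, "store": 25.0}  # hostDemand

utilizations = {
    step: utilization(arrival_rate, demand)
    for step, demand in host_demands_ms.items()
}

# A step whose utilization reaches 1.0 cannot keep up with the stream.
saturated = [step for step, u in utilizations.items() if u >= 1.0]
```

With these made-up values the hypothetical "store" step is the only one that saturates, which is the kind of early warning such parameters are meant to enable.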

 
Profiled Activity Diagram

As previously explained in Introduction to Modelling, the DPIM profile relies on the standard MARTE and DAM profiles. DAM is a profile specialized in dependability and reliability analysis, and MARTE offers the GQAM sub-profile, a complete framework for quantitative analysis. Together they match our purpose well: the quality assessment of data-intensive applications. Moreover, MARTE offers the NFP and VSL sub-profiles. The NFP sub-profile describes the non-functional properties of a system, performance in our case. The VSL sub-profile provides a concrete textual language for specifying the values of metrics, constraints, properties, and parameters related to performance. VSL expressions are used in DPIM-profiled models with two main goals: (i) to specify the input parameters of the model and (ii) to specify the performance metric(s) that will be computed for the model (i.e., the output results). An example of a VSL expression for a host demand tagged value of type NFP_Duration is:

expr=$parse (1), unit=ms (2), statQ=mean (3), source=est (4)

This expression specifies that the parsing step (yellow box in the image) demands $parse (1) milliseconds (2) of processing time, whose mean value (3) will be obtained from an estimation in the real system (4). $parse is a variable that can be set to concrete values during the analysis of the model.
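The way a variable such as $parse is bound to a concrete value can be illustrated with a small sketch. This is not a VSL parser: it only handles a flat list of key=value pairs like the one above (without the numbered markers), the helper names are hypothetical, the unit table is an assumed subset, and the value 4.0 is made up.

```python
from typing import Dict

# Assumed subset of duration units, for illustration only.
UNIT_TO_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6}


def parse_tagged_value(text: str) -> Dict[str, str]:
    """Split a flat 'key=value, key=value' annotation into a dict."""
    fields: Dict[str, str] = {}
    for part in text.strip("() ").split(","):
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def duration_seconds(fields: Dict[str, str],
                     variables: Dict[str, float]) -> float:
    """Resolve 'expr' (a $variable or a numeric literal) and convert to seconds."""
    expr = fields["expr"]
    value = variables[expr[1:]] if expr.startswith("$") else float(expr)
    return value * UNIT_TO_SECONDS[fields["unit"]]


annotation = "expr=$parse, unit=ms, statQ=mean, source=est"
fields = parse_tagged_value(annotation)

# Bind $parse to a made-up mean of 4.0 ms for one analysis run.
host_demand = duration_seconds(fields, {"parse": 4.0})
```

Keeping the variable unresolved in the model and supplying its value only at analysis time is what lets the same annotated diagram be evaluated under several candidate host demands.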

Conclusion

DICE UML-based application modelling revolves around DPIM, a new, refined UML profile for specifying the fundamental architecture elements that constitute a Data-Intensive Application (DIA).

References

  1. MODAClouds (2016). MODACloudML Development - Initial Version (Report). http://www.modaclouds.eu/wp-content/uploads/2012/09/MODAClouds_D4.2.1_MODACloudMLDevelopmentInitialVersion.pdf. 
  2. The DICE Consortium (2016). Design and Quality Abstractions - Initial Version (Report). http://wp.doc.ic.ac.uk/dice-h2020/wp-content/uploads/sites/75/2016/02/D2.1_Design-and-quality-abstractions-Initial-version.pdf.