Practical DevOps for Big Data/Deployment-Specific Modelling
Introduction
DIAs and the Big Data assets they manipulate are key to industrial innovation. However, going data-intensive requires much effort not only in design, but also in system and infrastructure configuration and deployment, which still happen via heavy manual fine-tuning and trial-and-error. We outline abstractions and automations that support data-intensive deployment and operation in an automated DevOps fashion, featuring Infrastructure-as-Code and TOSCA.
Infrastructure-as-Code and TOSCA in particular reflect the DevOps tactic of adopting source-code practices in infrastructure design as well. More specifically, Infrastructure-as-Code envisions that the source code for infrastructural designs is defined, versioned, evaluated, and tested just as application code is. TOSCA, the "Topology and Orchestration Specification for Cloud Applications", is the OASIS standard definition language for Infrastructure-as-Code.
The DDSM allows the deployment of DIAs on the Cloud to be expressed using UML.
Technical Overview
On the one hand, IasC is a typical DevOps tactic that offers standard ways to specify the deployment infrastructure and support operations concerns using human-readable notations. The IasC paradigm features: (a) domain-specific languages (DSLs) for Cloud application specification, such as TOSCA, the "Topology and Orchestration Specification for Cloud Applications" standard, to program the way a Cloud application should be deployed; (b) specific executors, called orchestrators, that consume IasC blueprints and automate the deployment they describe.
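To make the blueprint idea concrete, the following is a minimal sketch of such an IasC blueprint in the TOSCA Simple Profile YAML notation, of the kind an orchestrator would consume; the template names and property values are purely illustrative.

```yaml
tosca_definitions_version: tosca_simple_yaml_1_0

description: Illustrative blueprint - a software component hosted on a single VM.

topology_template:
  node_templates:
    app_vm:                              # the virtual machine hosting the component
      type: tosca.nodes.Compute
      capabilities:
        host:
          properties:
            num_cpus: 2
            mem_size: 4 GB

    app_service:                         # the software component deployed on the VM
      type: tosca.nodes.SoftwareComponent
      requirements:
        - host: app_vm                   # hosting relationship resolved by the orchestrator
```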
On the other hand, the DDSM framework is a UML-based modelling framework built on the MODAClouds4DICER meta-model, which is a transposition and extension of the MODACloudsML meta-model adapted for the purposes of data-intensive deployment. MODACloudsML is a language for modelling the provisioning and deployment of multi-cloud applications using a component-based approach. The main motivation for adopting such a language on top of TOSCA is to keep the design methodology TOSCA-independent: the designer does not have to be a TOSCA expert, or even be aware of TOSCA, but can simply follow the proposed methodology. Moreover, MODACloudsML has essentially the same purpose as the TOSCA standard, but it offers a higher level of abstraction and is therefore more user-friendly.
The Figure below shows an extract of the MODAClouds4DICER meta-model. The main concepts are inherited directly from MODACloudsML. A MODACloudsML model is a set of Components, which can be owned by a Cloud provider (ExternalComponents) or by the application provider (InternalComponents). A Component can be an application, a platform, or a physical host. While an ExternalComponent can only provide Ports and ExecutionPlatforms, an InternalComponent can also require them, since it is controlled by the application provider. Ports and ExecutionPlatforms serve as the way to connect Components to each other: ProvidedPorts and RequiredPorts are linked by means of Relationships, while ProvidedExecutionPlatforms and RequiredExecutionPlatforms are linked by means of ExecutionBindings. The latter can be seen as a particular kind of relationship between two Components stating that one of them executes the other.
MODACloudsML has been adapted by extending its elements in order to capture data-intensive-specific concepts, e.g. systems that are typically exploited by data-intensive applications, such as NoSQLStorage solutions and ParallelProcessingPlatforms, the latter usually composed of a MasterNode and one or more SlaveNodes.
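The meta-model itself is defined as a UML/Ecore model rather than in any textual syntax, but the following YAML-like sketch (purely a reading aid, not an actual MODACloudsML notation) may help to fix the main concepts and how they relate.

```yaml
# Illustrative rendering of the MODAClouds4DICER concepts; this is not
# an official MODACloudsML syntax, only a reading aid.
components:
  vm_cluster:                      # ExternalComponent owned by the Cloud provider
    kind: ExternalComponent
    provides:
      execution_platforms: [vm_host]

  processing_platform:             # InternalComponent owned by the application provider
    kind: InternalComponent
    specialises: ParallelProcessingPlatform   # data-intensive extension
    roles: [MasterNode, SlaveNode]
    requires:
      execution_platforms: [vm_host]
    provides:
      ports: [job_submission]

bindings:
  - kind: ExecutionBinding         # vm_cluster executes processing_platform
    provided: vm_cluster.vm_host
    required: processing_platform.vm_host
```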
DDSM UML Deployment Profile
Stemming from the technical overview above, in the following we elaborate on the essential DDSM stereotypes, which are reported in the Table below.
# | Stereotype | Meaning |
---|---|---|
1 | InternalNode | Service managed and deployed by the application owner |
2 | ExternalNode | Service managed and deployed by a third-party provider |
3 | VMsCluster | A cluster of virtual machines |
4 | PeerToPeerPlatform | A data-intensive platform operating according to the peer-to-peer style |
5 | MasterSlavePlatform | A data-intensive platform operating according to the master-slave style |
6 | StormCluster | An instance of a Storm cluster |
7 | CassandraCluster | An instance of a Cassandra cluster |
8 | BigDataJob | The actual DIA to be executed |
9 | JobSubmission | Deployment association between a BigDataJob and its corresponding execution environment |
- DDSM distinguishes between InternalNode, i.e. services that are managed and deployed by the application owner, and ExternalNode, i.e. services that are owned and managed by a third-party provider (see the providerType property of the ExternalNode stereotype). Both the InternalNode and ExternalNode stereotypes extend the UML meta-class Node.
- The VMsCluster stereotype is defined as a specialisation of ExternalNode, since renting computational resources such as virtual machines is one of the main services (so-called Infrastructure-as-a-Service) offered by Cloud providers. VMsCluster also extends the Device UML meta-class, since a cluster of VMs logically represents a single computational resource with processing capabilities, upon which applications and services may be deployed for execution. A VMsCluster has an instances property representing its replication factor, i.e. the number of VMs composing the cluster. VMs in a cluster are all of the same size (in terms of amount of memory, number of cores, and clock frequency), which can be defined by means of the VMSize enumeration.
- Alternatively, the user can specify lower and upper bounds for the VMs' characteristics (e.g. minCore/maxCore, minRam/maxRam), assuming the employed Cloud orchestrator is then able to decide, according to some criteria, the optimal Cloud offer that matches the specified bounds. The VMsCluster stereotype is fundamental to providing DDSM users with the right level of abstraction, so that they can model the deployment of DIAs without having to deal with the complexity exposed by the underlying distributed computing infrastructure. In fact, a user just has to model her clusters of VMs as stereotyped Devices that can host nested InternalNodes representing the hosted distributed platforms. Furthermore, a specific OCL constraint imposes that each InternalNode must be contained in a Device holding the VMsCluster stereotype, since by definition an InternalNode has to be deployed and managed by the application provider, which thus has to provide the necessary hosting resources.
- We then define DIA-specific deployment abstractions, i.e. the PeerToPeerPlatform and MasterSlavePlatform stereotypes, as further specialisations of InternalNode. These two stereotypes allow the modelling language to capture the key differences between the two general types of distributed architecture. For instance, the MasterSlavePlatform stereotype allows a dedicated host to be indicated for the master node, since it might require more computational resources. By extending our deployment abstractions, we implemented a set of technology modelling elements (StormCluster, CassandraCluster, etc.), one for each technology we support. DIA execution engines (e.g. Spark or Storm) also extend the UML ExecutionEnvironment meta-class, so as to distinguish the platforms to which DIA jobs can be submitted. Each technology element allows modelling of deployment aspects that are specific to a given technology, such as platform-specific configuration parameters or dependencies on other technologies, which are enforced by means of OCL constraints when they are mandatory (a sketch after this list illustrates how some of these properties might look once rendered as Infrastructure-as-Code).
- The BigDataJob stereotype represents the actual application that can be submitted for execution to any of the available execution engines. It is defined as a specialisation of the UML Artefact meta-class, since it corresponds to the DIA executable artefact. It allows job-specific information to be specified, for instance the artifactUrl from which the application executable can be retrieved.
- The JobSubmission stereotype, which extends the UML Deployment meta-class, is used to specify additional deployment options of a DIA. For instance, it allows job scheduling options to be specified, such as how many times the job has to be submitted and the time interval between two subsequent submissions. In this way the same DIA job can be deployed in multiple instances with different deployment options. An additional OCL constraint requires each BigDataJob to be connected, by means of a JobSubmission, to a UML ExecutionEnvironment holding a stereotype that extends either the MasterSlavePlatform or the PeerToPeerPlatform stereotype.
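As a reading aid, the sketch below (referenced from the list above) shows how two of these stereotypes and their properties might be rendered in a TOSCA-style blueprint by a model-to-text transformation; the dice.nodes.* type names and the requirement names are illustrative assumptions, not the exact vocabulary of the DICE tooling.

```yaml
# Illustrative TOSCA-style rendering of DDSM stereotype properties;
# all type and requirement names are assumed for the sake of the example.
topology_template:
  node_templates:
    master_vms:                          # <<VMsCluster>> sized via explicit bounds
      type: dice.nodes.VMsCluster
      properties:
        instances: 1                     # single VM dedicated to the master node
        minCore: 4                       # lower/upper bounds left to the orchestrator
        maxCore: 8
        minRam: 8192                     # RAM bounds in MB (assumed unit)
        maxRam: 16384

    worker_vms:                          # <<VMsCluster>> sized via the VMSize enumeration
      type: dice.nodes.VMsCluster
      properties:
        instances: 4
        size: Medium

    processing_platform:                 # <<MasterSlavePlatform>> InternalNode
      type: dice.nodes.MasterSlavePlatform
      requirements:
        - masterHost: master_vms         # dedicated host for the master node
        - host: worker_vms               # slaves hosted on the worker cluster
```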
UML Deployment Modelling: The WikiStats Example
We showcase the defined profile by applying it to model the deployment of a simple DIA that we call Wikistats, a streaming application which processes Wikimedia articles to elicit statistics on their contents and structure. The application features Apache Storm as the stream processing engine and Apache Cassandra as the storage technology. Wikistats is a simple example of a DIA needing multiple, heterogeneous, distributed platforms such as Storm and Cassandra; moreover, Storm depends on Apache Zookeeper. The Wikistats application itself is a Storm application (a streaming job) packaged in a deployable artefact. The Figure below shows the DDSM for the Wikistats example.
In this specific example scenario, all the necessary platforms are deployed within the same cluster of 2 large-sized VMs from an OpenStack installation. Each of the required platform elements is modelled as a Node annotated with the corresponding technology-specific stereotype. In particular, Storm is modelled as an ExecutionEnvironment, as it is the application engine that executes the actual Wikistats application code. At this point, fine-tuning of the Cloud infrastructure and of the various platforms is the key aspect supported by DDSM: the technology stereotypes allow each platform to be configured so that different configurations can easily and quickly be tested over multiple deployments, enabling the continuous architecting of DIAs. The dependency of Storm on Zookeeper is enforced via the previously discussed OCL constraint library, which comes automatically installed with the DDSM profile. The deployment of the Wikistats application is modelled as an Artefact annotated with the BigDataJob stereotype and linked to the StormCluster element using a Deployment dependency stereotyped as a JobSubmission. Finally, the BigDataJob and JobSubmission stereotypes can be used to elaborate details about the Wikistats job and how it is scheduled.
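To give a textual flavour of the resulting deployment description, the sketch below renders the Wikistats topology as a TOSCA-style blueprint fragment; again, the type names, requirement names, and artifact URL are illustrative assumptions rather than the actual output of the DICE tool chain.

```yaml
# Illustrative blueprint fragment for the Wikistats example; names are assumed.
topology_template:
  node_templates:
    openstack_cluster:                    # <<VMsCluster>>: 2 large VMs on OpenStack
      type: dice.nodes.VMsCluster
      properties:
        provider: openstack
        instances: 2
        size: Large

    zookeeper:                            # required by Storm (OCL-enforced dependency)
      type: dice.nodes.ZookeeperCluster
      requirements:
        - host: openstack_cluster

    storm:                                # <<StormCluster>> ExecutionEnvironment
      type: dice.nodes.StormCluster
      requirements:
        - host: openstack_cluster
        - zookeeper: zookeeper

    cassandra:                            # <<CassandraCluster>> storage platform
      type: dice.nodes.CassandraCluster
      requirements:
        - host: openstack_cluster

    wikistats_job:                        # <<BigDataJob>> streaming topology artefact
      type: dice.nodes.BigDataJob
      properties:
        artifactUrl: http://example.org/wikistats-topology.jar   # placeholder URL
      requirements:
        - host: storm                     # <<JobSubmission>> Deployment dependency
```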
Conclusion
DICE UML-based deployment modelling revolves around DDSM, a refined UML profile for specifying Infrastructure-as-Code using simple UML Deployment Diagrams stereotyped with appropriate data-intensive augmentations.