Data Alignment, Aggregation and Vectorisation (DAAV) BB – Design Document – Prometheus-X Components & Services

The purpose of the DAAV building block is to set up data manipulation pipelines to create new datasets by:

Examples of data manipulation

The project is divided into two modules:

Blocks are divided into three groups:

Technical usage scenarios & Features

The first main objective of this building block is to reduce entry barriers for data providers and AI service clients by simplifying the handling of heterogeneous data and its conversion to standard or service-specific formats.

The second main objective is to simplify and optimize data management and analysis in complex environments.

Features/main functionalities

Create a connector from any input data model to any output data model. Examples:

Create an aggregator from different data sources:

Receive pushed data from the PDC.

Tools to describe and explore datasets.

Expert system and AI for automatic data alignment.

Technical usage scenarios

  1. Creation of a new dataset from different sources
    1. Aggregation of data from multiple sources, ensuring data compatibility and standardization. This is useful for data analysts who often work with fragmented datasets.
    2. Providing mechanisms to transform and standardize data formats, such as converting xAPI data to a standardized xAPI format.
    3. Merging datasets with different competency frameworks to create a unified model.
    4. Optimize data manipulation and freshness based on the user's needs.
  2. Give access to specific data based on user rights (contracts and organisation roles)
    1. Paul is a CMO and wants to follow KPIs for each product of his enterprise. He has access to specific data prepared by his IT team.
    2. Arsen is a full-stack developer and wants to verify the integrity of the data generated by his code. He configures his own data for this purpose.
    3. Christophe is an AI developer and wants to prepare his data for training. He creates pipelines with continuous integration.
    4. Fatim is a data analyst and has been given access to a dataset to calculate specific metrics: mean, quantile, min, max, etc.
  3. Manage inputs and outputs for other building blocks or AI services.
    1. Prepare data for Distributed Data Visualization.
    2. Prepare input for, and receive (and store) output from, Data Veracity and Edge Translators.
    3. Create subsets of a main dataset to test performance of an AI service on each set.

Requirements

BB MUST communicate with the catalog API to retrieve contracts.

BB MUST communicate with the PDC to trigger data exchanges with the PDC.

BB MUST communicate with the PDC to get data from the contract and consent BBs.

BB CAN receive data pushed by the PDC.

BB CAN connect to other BBs.

BB MUST expose endpoints to communicate with other BBs.

BB SHOULD be able to process any type of data as input.

Expected request times:

| Request type    | Expected time |
|-----------------|---------------|
| Simple request  | < 100 ms      |
| Medium request  | < 3000 ms     |
| Large request   | < 10000 ms    |

Integrations

Direct Integrations with Other BBs

No other building block interacting with this building block requires specific integration.

Relevant Standards

Data Format Standards

JSON - CSV - NoSQL (mongo, elasticsearch) - SQL - xAPI - Parquet - Archive (tar, zip, 7z, rar).

Mapping to Data Space Reference Architecture Models

DSSC:

*IDS RAM* 4.3.3 Data as an Economic Good

Input / Output Data

DAAV - Input and output Data
DAAV - Sample data

Project interface

The project interface holds all the information required for a DAAV project. The front-end can import and export JSON content that follows this structure.

The back-end can execute the workflow described in this structure. Inside, we have the Data Connectors required to connect to a data source.

A workflow is represented by nodes (Node) with inputs and outputs (NodePort) that can be connected. All nodes are divided into three groups:

Project DAAV interface

This execution may be for a complete run, or simply to test a single node in the chain, while ensuring the hierarchical dependencies between connected nodes are respected.
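
A minimal, illustrative sketch of the kind of structure such a project export could hold, expressed here as Python dataclasses (all names and fields are assumptions, not the final schema):

```python
# Illustrative sketch only: names and fields are assumptions, not the final DAAV schema.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class NodePort:
    name: str
    socket_type: str                 # e.g. "dataframe", "string", "float"

@dataclass
class Node:
    id: str
    type: str                        # e.g. "input.mysql", "transform.merge", "output.csv"
    params: dict[str, Any] = field(default_factory=dict)
    inputs: list[NodePort] = field(default_factory=list)
    outputs: list[NodePort] = field(default_factory=list)

@dataclass
class Connection:
    source_node: str                 # id of the upstream node
    source_port: str
    target_node: str                 # id of the downstream node
    target_port: str

@dataclass
class DaavProject:
    name: str
    data_connectors: list[dict[str, Any]]   # connection information for each data source
    nodes: list[Node]
    connections: list[Connection]
```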

Architecture

Prometheus-X - DAAV Architecture

Front-end workflow editor class diagram

This diagram describes the basic architecture of the front-end, whose purpose is to provide the user with a set of tools to build a processing chain.

This is based on Rete.js, a framework for creating processing-oriented node-based editors.

A workflow is a group of nodes connected through ports (input/output). Each port has a socket type that defines the data format allowed through it, and therefore the connection rules between nodes.

Class diagram

Example of a node-based editor with nodes exposing inputs and/or outputs, and a colored indicator to visualize each node's status.

Example Workflow

Back-end workflow class diagram

This diagram describes the back-end class, which reconstructs a representation of the defined workflow and executes the processing chain, taking dependencies into account.

For each node, we know its type, and therefore its associated processing, as well as its inputs and outputs, and the internal attributes defined by the user via the interface.

Workflow Backend

Dynamic Behaviour

The sequence diagram shows how the component communicates with other components.

---
title: Sequence Diagram Example (Connector Data Exchange)
---

sequenceDiagram
    participant i1 as Input Data Block (Data Source)
    participant ddvcon as PDC
    participant con as Contract Service
    participant cons as Consent Service
    participant dpcon as Data Provider Connector
    participant dp as Participant (Data Provider)
    participant i2 as Transformer Block
    participant i3 as Merge Block
    participant i4 as Output Data Block
    participant enduser as End User
    i1 -) ddvcon: Trigger consent-driven data exchange<br>BY USING CONSENT
    ddvcon -) cons: Verify consent validity
    cons -) con: Verify contract signature & status
    con --) cons: Contract verified
    cons -) ddvcon: Consent verified
    ddvcon -) con: Verify contract & policies
    con --) ddvcon: Verified contract
    ddvcon -) dpcon: Data request + contract + consent
    dpcon -) dp: GET data
    dp --) dpcon: Data
    dpcon --) ddvcon: Data
    ddvcon --) i1: Data
    i1 -) i2: Provide data connection or data
    Note over i2 : setup of transformation
    i2 -) i3: Provide data
    Note over i3 : setup merge with another data source
    i3 -) i4: Provide data
    Note over i4 : new data is available
    enduser -) i4: Read file directly from local filesystem
    enduser -) i4: Read file through SFTP protocol
    enduser -) i4: Read data through REST API 
    enduser -) i4: Read data through database connector 

---
title: Node status - on create or update
---
stateDiagram-v2

   classDef Incomplete fill:yellow
   classDef Complete fill:orange
   classDef Valid fill:green
   classDef Error fill:red
    
    [*] --> Incomplete
    Incomplete --> Complete: parameters are valid
    state fork_state <<choice>>
    Complete --> fork_state : Backend execution
    fork_state --> Valid  
    fork_state --> Error
    %% state fork_state2 <<choice>>
    %% Error --> fork_state2 : User modifies connection/parameter
    %% fork_state2 --> Complete
    %% fork_state2 --> Incomplete
    
    class Incomplete Incomplete
    class Complete Complete
    class Valid Valid
    class Error Error

Backend node execution: the Node parent class implements the function "execute". Each child class has its own function "Process" with its specific treatment.

Inside a workflow, a recursive pattern propagates the execution along parent nodes, as sketched in the Python example after the state diagram below.

---
title: Backend - Node class function "Execute"
---
stateDiagram-v2

   classDef Incomplete fill:yellow
   classDef Complete fill:orange
   classDef Valid fill:green
   classDef Error fill:red
    state EachInput {
        [*] --> ParentNodeStatus
        ParentNodeStatus --> ParentNodeValid
        ParentNodeStatus --> ParentNodeComplete
        ParentNodeStatus --> ParentNodeIncomplete
        ParentNodeIncomplete -->  [*]
        ParentNodeValid -->  [*]
        ParentNodeComplete --> ParentNodeStatus : Parent Node function "Execute"
        ParentNodeStatus --> ParentNodeError
        ParentNodeError --> [*]
    }
    [*] --> NodeStatus
    NodeStatus --> Complete 
   NodeStatus --> Incomplete
    Incomplete --> [*] : Abort
    Complete --> EachInput
    state if_state3 <<choice>>
    EachInput --> if_state3 : Aggregate Result   
    if_state3 --> SomeInputIncomplete
    if_state3 --> AllInputValid
    if_state3 --> SomeInputError    
    SomeInputIncomplete --> [*] : Abort
    AllInputValid --> ProcessNode: function "Process"   
    SomeInputError --> [*] : Abort
    ProcessNode --> Error
    ProcessNode --> Valid
        
    Valid --> [*] :Success
    Error --> [*] :Error
    
    class Incomplete Incomplete
    class SomeInputIncomplete Incomplete
    class ParentNodeIncomplete Incomplete
    class ParentNodeComplete Complete
    class Complete Complete
    class Valid Valid
    class AllInputValid Valid
    class ParentNodeValid Valid
    class Error Error
    class SomeInputError Error
    class ParentNodeError Error
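
A minimal Python sketch of this recursive execution pattern (class, method and status names are illustrative assumptions, not the actual implementation):

```python
# Sketch of the recursive "execute" pattern described by the state diagram above.
from enum import Enum, auto

class Status(Enum):
    INCOMPLETE = auto()   # parameters not yet valid
    COMPLETE = auto()     # ready to run, not yet executed
    VALID = auto()        # executed successfully
    ERROR = auto()        # execution failed

class NodeBlock:
    def __init__(self, parents=None):
        self.parents = parents or []          # upstream nodes feeding this node's inputs
        self.status = Status.INCOMPLETE

    def execute(self) -> Status:
        # Abort immediately if this node's own parameters are incomplete.
        if self.status == Status.INCOMPLETE:
            return Status.INCOMPLETE

        # Recursively execute parent nodes that are ready but not yet executed.
        for parent in self.parents:
            if parent.status == Status.COMPLETE:
                parent.execute()

        # Aggregate the parents' results.
        if any(p.status == Status.INCOMPLETE for p in self.parents):
            return Status.INCOMPLETE          # abort: some input is incomplete
        if any(p.status == Status.ERROR for p in self.parents):
            return Status.ERROR               # abort: some input failed

        # All inputs are valid: run the node-specific treatment.
        try:
            self.process()
            self.status = Status.VALID
        except Exception:
            self.status = Status.ERROR
        return self.status

    def process(self):
        # Overridden by each child class (input, transform and output nodes).
        raise NotImplementedError
```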

Configuration and deployment settings

Various configuration examples:

**.env file:**
MONGO_URI = ''
SQL_HOST = ''

LOG_FILE = "logs/log.txt"
...

**secrets folder (openssl-generated passwords):**
- secrets/db_root_password 
- secrets/elasticsearch_admin

**angular environments:**
- production: true / false
- version: X.X

**Python FastAPI back-end config.ini:**
[dev]
DEBUG=True/False

[edge]
master:True/False
cluster=014
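
As a sketch, the back-end could read such a config.ini with Python's standard configparser (assuming the [dev] and [edge] sections shown above):

```python
# Sketch: reading the config.ini sections shown above with the standard library.
import configparser

config = configparser.ConfigParser()
config.read("config.ini")

debug = config.getboolean("dev", "DEBUG", fallback=False)
is_master = config.getboolean("edge", "master", fallback=False)
cluster = config.get("edge", "cluster", fallback=None)
```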

What are the limits in terms of usage (e.g. number of requests, size of dataset, etc.)?

Work in progress.

Third Party Components & Licenses

Implementation Details

Nodes implementation:

All nodes have a front-end implementation, where the user can set up their configuration, and a back-end implementation that executes the process.

All nodes inherit from the abstract class Nodeblock.

We identify shared functionalities such as:

Inputs Node:

All input nodes inherit from InputDataBlock:

We identify shared functionalities such as:

We need one class of input node for each data-source format (local file, MySQL, Mongo, API, ...).

On the front-end, a factory deduces the correct input instance according to the source connector passed as a parameter.

The node instance exposes one output, and its socket type defines its format:

Connection rules between nodes

Each block input defines a socket, i.e. what kind of data it accepts (float, string, array of strings, etc.).
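
A possible way to express such a compatibility rule (a conceptual sketch, not tied to the actual front-end or back-end implementation):

```python
# Sketch: a connection is allowed only when the socket types are compatible.
# The compatibility table below is purely illustrative.
COMPATIBLE_SOCKETS = {
    "dataframe": {"dataframe"},
    "string": {"string"},
    "float": {"float", "string"},   # e.g. a float output may feed a string input
}

def can_connect(output_socket: str, input_socket: str) -> bool:
    return input_socket in COMPATIBLE_SOCKETS.get(output_socket, set())
```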

On the back-end counterpart class:

The “Process” function implements the business logic. In this case, it retrieves data-source information and populates the output with data schemas and data content.

For huge data content, we can use the Parquet format to physically store intermediate results, likely driven by a shared selector attribute of InputDataBlock.

Each input node may have rules executed at the request level and a front-end widget to set them up.
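
A minimal sketch of an input node back-end class, building on the NodeBlock sketch above (the use of pandas and SQLAlchemy is illustrative, not prescriptive):

```python
# Sketch of input nodes; library choices and class names other than InputDataBlock are assumptions.
import pandas as pd
import sqlalchemy

class InputDataBlock(NodeBlock):
    """Base class for all input nodes (local file, MySQL, Mongo, API, ...)."""

    def __init__(self, connector: dict, parents=None):
        super().__init__(parents)
        self.connector = connector     # data-source connection information
        self.output_schema = None
        self.output_data = None

class MySQLInputBlock(InputDataBlock):
    def process(self):
        engine = sqlalchemy.create_engine(self.connector["uri"])
        df = pd.read_sql(self.connector["query"], engine)
        # Populate the output with the data schema and the data content.
        self.output_schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
        self.output_data = df
        # For huge datasets, the intermediate result could be stored as Parquet:
        # df.to_parquet(f"intermediate/{id(self)}.parquet")
```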

Outputs Node:

All output nodes inherit from OutputDataBlock.

We need one class of output node for each format handled by the BB.

Each output block specifies the inputs it can process.

The block's widgets allow the user to set up the output and visualize its status.

On the back-end counterpart class:

One method executes the node and exports a file or launches transactions to fill the destination.
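
A short sketch of an output node, under the same assumptions as above:

```python
# Sketch of an output node exporting the data produced by its parent to a CSV file.
class OutputDataBlock(NodeBlock):
    """Base class for all output nodes (file export, database load, ...)."""

class CsvOutputBlock(OutputDataBlock):
    def __init__(self, path: str, parents=None):
        super().__init__(parents)
        self.path = path

    def process(self):
        # The single parent node provides the dataframe to export.
        df = self.parents[0].output_data
        df.to_csv(self.path, index=False)
```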

Transform Node:

All nodes called upon to manipulate data inherit from the transformation node.

A transform node can dynamically add inputs or outputs based on the data being manipulated, and uses widgets to expose parameters to the user.

For complex cases, we can imagine an additional modal window to visualize a sample of the data and provide a convenient interface for describing the desired relationships between the data.
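
As an illustration, a merge-style transform node could look like the following sketch (the merge key is a hypothetical example of a parameter exposed to the user through a widget):

```python
# Sketch of a transform node merging two upstream datasets on a user-chosen key.
import pandas as pd

class TransformDataBlock(NodeBlock):
    """Base class for all data-manipulation nodes."""

class MergeBlock(TransformDataBlock):
    def __init__(self, on: str, parents=None):
        super().__init__(parents)
        self.on = on                   # merge key, set by the user via a widget

    def process(self):
        left, right = (p.output_data for p in self.parents[:2])
        self.output_data = pd.merge(left, right, on=self.on, how="inner")
```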

Considerations on revision identifiers

The workflow representation can be likened to a tree representation, where each node can have a unique revision ID representing its contents and children.

With this, we have a convenient pattern to compare and identify modifications between two calls to the back-end API.

If we use intermediate results with physical records such as parquet files, for example, we can avoid certain processing when a node retains the same revision ID between two calls, and reconstruct only those parts where the user has made changes.

By keeping track of the last revision of the root across front-end and back-end calls, it is also possible to detect desynchronization when several instances of a project are open, and so make the user aware of the risk of overwriting.
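
A minimal sketch of such a recursive revision identifier, hashing a node's own content together with its children's revisions (the exact fields to hash are an assumption):

```python
# Sketch: compute a revision ID per node from its own content and its children's revisions.
import hashlib
import json

def revision_id(node: dict) -> str:
    child_revisions = [revision_id(child) for child in node.get("children", [])]
    payload = json.dumps(
        {"type": node["type"], "params": node.get("params", {}), "children": child_revisions},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# A node that keeps the same revision ID between two calls can reuse its cached
# intermediate result (e.g. a Parquet file) instead of being re-processed.
```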

OpenAPI Specification

The back-end will expose a Swagger/OpenAPI description.

For all the entities described, we will follow a REST API (Create / Update / Delete).

API output data blocks (GET)

And, around the project, all business actions:
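
As an illustration, FastAPI routes for such entities could look like the following sketch (routes and models are hypothetical, not the final API):

```python
# Hypothetical sketch of REST endpoints; route names and models are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="DAAV back-end")

class Project(BaseModel):
    name: str
    nodes: list[dict] = []
    connections: list[dict] = []

@app.post("/projects")
def create_project(project: Project):
    return {"status": "created", "project": project}

@app.get("/projects/{project_id}/outputs/{node_id}")
def get_output_data_block(project_id: str, node_id: str):
    # Would return the data produced by the given output node.
    return {"project_id": project_id, "node_id": node_id, "data": []}
```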

Test specification

Internal unit tests

Front-end:

Unit tests for all core class functions:

Back-end:

Unit tests for all endpoints and core class functions.

Core classes:

API endpoints:

Component-level testing

Load testing with K6 to evaluate API performance.

UI test

Front-end: Selenium to test custom node usage, deployment, and interaction with custom parameters.

Partners & roles


Profenpoche (BB leader):

Inokufu :

BME, Cabrilog and Ikigaï are also partners available for beta-testing.

Usage in the dataspace


The DAAV building block can be used as a data transformer to build new datasets from local data or from the Prometheus-X dataspace.

The output can also be shared on the Prometheus-X dataspace.

Example 1: Chain service 001
Example 2: Chain service 002
Example 3: Chain service 003
Example 4: Chain service 004
Example 5: Chain service 005
Example 6: Chain service 006