The purpose of the DAAV building block is to set up data manipulation pipelines that create new datasets by:
The project is divided into two modules:
Blocks are divided into three groups:
The first main objective of this building block is to reduce entry barriers for data providers and AI service clients by simplifying the handling of heterogeneous data and its conversion to standard or service-specific formats.
The second main objective is to simplify and optimize data management and analysis in complex environments.
Create a connector from any input data model to any output data model. Examples:
Create an aggregator from different data sources:
Receive pushed data from PDC.
Tools to describe and explore datasets.
Expert system and AI-based automatic data alignment.
The BB MUST communicate with the catalog API to retrieve contracts.
The BB MUST communicate with the PDC to trigger data exchanges.
The BB MUST communicate with the PDC to get data from the contract and consent BBs.
The BB CAN receive data pushed by the PDC.
The BB CAN connect to other BBs.
The BB MUST expose endpoints to communicate with other BBs.
The BB SHOULD be able to process any type of data as input.
Expected request times:
Type | Expected time |
---|---|
Simple request | < 100 ms |
Medium request | < 3000 ms |
Large request | < 10000 ms |
No other building block interacting with this building block requires specific integration.
JSON - CSV - NoSQL (MongoDB, Elasticsearch) - SQL - xAPI - Parquet - Archive (tar, zip, 7z, rar).
DSSC :
*IDS RAM* 4.3.3 Data as an Economic Good
The Project interface holds all the information required for a DAAV project. The front-end can import and export JSON content that follows this structure.
The back-end can execute the workflow described in this structure. It also contains the Data Connectors required to connect to a data source.
A workflow is represented by nodes (Node) with inputs and outputs (NodePort) that can be connected. All nodes are divided into three groups:
This may be a complete run, or simply a test of a single node in the chain, ensuring the hierarchical dependencies between connected nodes are respected.
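As an illustration, a minimal Python sketch of such a project structure, with hypothetical class and field names (the actual JSON schema may differ), could be:

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical sketch of the project/workflow structure; names are assumptions.
@dataclass
class NodePort:
    name: str                  # e.g. "output", "dataset_in"
    socket_type: str           # data format allowed through this port

@dataclass
class Node:
    id: str
    group: Literal["input", "transform", "output"]    # the three node groups
    inputs: list[NodePort] = field(default_factory=list)
    outputs: list[NodePort] = field(default_factory=list)
    parameters: dict = field(default_factory=dict)    # user-defined attributes

@dataclass
class Connection:
    source_node: str           # id of the upstream node
    source_port: str
    target_node: str           # id of the downstream node
    target_port: str

@dataclass
class Project:
    name: str
    data_connectors: list[dict]    # information required to reach the data sources
    nodes: list[Node] = field(default_factory=list)
    connections: list[Connection] = field(default_factory=list)
```

The front-end would export this structure as JSON, and the back-end would rebuild it to execute the workflow.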
This diagram describes the basic architecture of the front-end, which provides the user with a set of tools to build a processing chain.
This is based on Rete.js, a framework for creating processing-oriented node-based editors.
A workflow is a group of nodes connected through ports (inputs/outputs). Each port has a socket type that defines the data format it can carry, and therefore the connection rules between nodes.
Example of a node-based editor showing nodes with their inputs and/or outputs and a colored indicator visualizing each node's status.
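Although the editor itself is built with Rete.js, the connection rule that the sockets enforce can be sketched in Python as follows (the socket names and the compatibility map are assumptions):

```python
# Hypothetical socket-type compatibility check; the real rules live in the Rete.js editor.
COMPATIBLE_SOCKETS = {
    "dataframe": {"dataframe"},
    "string": {"string"},
    "string[]": {"string[]", "string"},   # assumption: a single string fits a string-array input
}

def can_connect(output_socket: str, input_socket: str) -> bool:
    """Return True if an output of `output_socket` type may feed an input of `input_socket` type."""
    return output_socket in COMPATIBLE_SOCKETS.get(input_socket, set())

assert can_connect("dataframe", "dataframe")
assert not can_connect("dataframe", "string")
```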
The back-end class, which reconstructs a representation of the defined workflow and executes the processing chain, taking dependencies into account.
For each node, we know its type, and therefore its associated processing, as well as its inputs and outputs, and the internal attributes defined by the user via the interface.
The sequence diagram shows how the component communicates with other components.
---
title: Sequence Diagram Example (Connector Data Exchange)
---
sequenceDiagram
participant i1 as Input Data Block (Data Source)
participant ddvcon as PDC
participant con as Contract Service
participant cons as Consent Service
participant dpcon as Data Provider Connector
participant dp as Participant (Data Provider)
participant i2 as Transformer Block
participant i3 as Merge Block
participant i4 as Output Data Block
participant enduser as End User
i1 -) ddvcon: Trigger consent-driven data exchange<br>BY USING CONSENT
ddvcon -) cons: Verify consent validity
cons -) con: Verify contract signature & status
con --) cons: Contract verified
cons -) ddvcon: Consent verified
ddvcon -) con: Verify contract & policies
con --) ddvcon: Verified contract
ddvcon -) dpcon: Data request + contract + consent
dpcon -) dp: GET data
dp --) dpcon: Data
dpcon --) ddvcon: Data
ddvcon --) i1: Data
i1 -) i2: Provide data connection or data
Note over i2 : setup of transformation
i2 -) i3: Provide data
Note over i3 : setup merge with another data source
i3 -) i4: Provide data
Note over i4 : new data is available
enduser -) i4: Read file directly from local filesystem
enduser -) i4: Read file through SFTP protocol
enduser -) i4: Read data through REST API
enduser -) i4: Read data through database connector
---
title: Node status - on create or update
---
stateDiagram-v2
classDef Incomplete fill:yellow
classDef Complete fill:orange
classDef Valid fill:green
classDef Error fill:red
[*] --> Incomplete
Incomplete --> Complete: parameters are valid
state fork_state <<choice>>
Complete --> fork_state : Backend execution
fork_state --> Valid
fork_state --> Error
%% state fork_state2 <<choice>>
%% Error --> fork_state2 : User modifies connection/parameter
%% fork_state2 --> Complete
%% fork_state2 --> Incomplete
class Incomplete Incomplete
class Complete Complete
class Valid Valid
class Error Error
Backend Node Execute: the Node parent class defines the function "Execute"; each child class implements its own function "Process" with its specific treatment.
Inside a workflow, a recursive pattern propagates the execution through the parent nodes.
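Before the detailed state diagram below, here is a minimal Python sketch of this recursive pattern (status names follow the diagrams; class and attribute names are assumptions, not the actual implementation):

```python
from enum import Enum

class Status(Enum):
    INCOMPLETE = "incomplete"
    COMPLETE = "complete"
    VALID = "valid"
    ERROR = "error"

class NodeBlock:
    """Hypothetical parent class: `execute` handles dependencies, `process` is overridden."""

    def __init__(self, parents=None):
        self.parents = parents or []
        self.status = Status.COMPLETE if self.parameters_are_valid() else Status.INCOMPLETE

    def parameters_are_valid(self) -> bool:
        return True  # each node checks its own user-defined parameters

    def execute(self) -> Status:
        if self.status == Status.INCOMPLETE:
            return self.status                    # abort: parameters are not set
        # Recursively execute parents that are complete but not yet valid.
        for parent in self.parents:
            if parent.status == Status.COMPLETE:
                parent.execute()
        if any(p.status == Status.INCOMPLETE for p in self.parents):
            return Status.INCOMPLETE              # abort: some input is incomplete
        if any(p.status == Status.ERROR for p in self.parents):
            return Status.ERROR                   # abort: some input failed
        try:
            self.process()                        # specific treatment of the child class
            self.status = Status.VALID
        except Exception:
            self.status = Status.ERROR
        return self.status

    def process(self) -> None:
        raise NotImplementedError
```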
---
title: Backend - Node class function "Execute"
---
stateDiagram-v2
classDef Incomplete fill:yellow
classDef Complete fill:orange
classDef Valid fill:green
classDef Error fill:red
state EachInput {
[*] --> ParentNodeStatus
ParentNodeStatus --> ParentNodeValid
ParentNodeStatus --> ParentNodeComplete
ParentNodeStatus --> ParentNodeIncomplete
ParentNodeIncomplete --> [*]
ParentNodeValid --> [*]
ParentNodeComplete --> ParentNodeStatus : Parent Node function "Execute"
ParentNodeStatus --> ParentNodeError
ParentNodeError --> [*]
}
[*] --> NodeStatus
NodeStatus --> Complete
NodeStatus --> Incomplete
Incomplete --> [*] : Abort
Complete --> EachInput
state if_state3 <<choice>>
EachInput --> if_state3 : Aggregate Result
if_state3 --> SomeInputIncomplete
if_state3 --> AllInputValid
if_state3 --> SomeInputError
SomeInputIncomplete --> [*] : Abort
AllInputValid --> ProcessNode: function "Process"
SomeInputError --> [*] : Abort
ProcessNode --> Error
ProcessNode --> Valid
Valid --> [*] :Success
Error --> [*] :Error
class Incomplete Incomplete
class SomeInputIncomplete Incomplete
class ParentNodeIncomplete Incomplete
class ParentNodeComplete Complete
class Complete Complete
class Valid Valid
class AllInputValid Valid
class ParentNodeValid Valid
class Error Error
class SomeInputError Error
class ParentNodeError Error
Various configuration examples:
**.env file :**
MONGO_URI = ''
SQL_HOST = ''
LOG_FILE = "logs/log.txt"
...
**secrets folder (passwords generated with openssl):**
- secrets/db_root_password
- secrets/elasticsearch_admin
**angular environments :**
- production: true / false
- version : X.X
**Python FastAPI back-end config.ini:**
[dev]
DEBUG=True/False
[edge]
master=True/False
cluster=014
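As an illustration, the back-end could load such a config.ini with Python's standard configparser (the file path and fallback values are assumptions):

```python
import configparser

# Hypothetical loading of the back-end configuration shown above.
config = configparser.ConfigParser()
config.read("config.ini")

debug = config.getboolean("dev", "DEBUG", fallback=False)
is_master = config.getboolean("edge", "master", fallback=False)
cluster = config.get("edge", "cluster", fallback=None)

print(debug, is_master, cluster)
```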
What are the limits in terms of usage (e.g. number of requests, size of dataset, etc.)?
Work in progress.
All nodes have a front-end implementation where the user can set up their configuration, and a back-end implementation that executes the processing.
All nodes inherit from the abstract class Nodeblock.
We identify shared functionalities such as:
Input Nodes:
All input nodes inherit from InputDataBlock:
We identify shared functionalities such as:
We need one input Node class for each data-source format (local file, MySQL, MongoDB, API, ...).
On the front-end, a factory deduces the correct input instance from the source connector passed as a parameter.
A Node instance exposes one output, and its socket type defines its format:
Connection rules between nodes
Each block input defines a socket, i.e. what kind of data it accepts (float, string, array of strings, etc.).
On the back-end counterpart class:
The "Process" function implements the business logic. In this case, it retrieves the data-source information and populates the output with data schemas and data content.
For huge data content, we can use the Parquet format to store intermediate results physically, likely as a shared selector attribute of InputDataBlock.
Each input Node may have rules executed at the request level and a front-end widget to set them up.
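A minimal sketch of an input node's back-end counterpart, assuming a hypothetical CSV source and the use of pandas (all names except InputDataBlock are assumptions):

```python
import pandas as pd

class InputDataBlock:
    """Hypothetical base class for input nodes; `process` populates the single output."""

    def __init__(self, connector: dict):
        self.connector = connector      # data-source information (path, credentials, ...)
        self.output_schema: dict = {}
        self.output_data = None

    def process(self) -> None:
        raise NotImplementedError

class CsvInputBlock(InputDataBlock):
    """Reads a local CSV file and exposes its schema and content on the output."""

    def process(self) -> None:
        df = pd.read_csv(self.connector["path"])
        self.output_schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
        self.output_data = df           # for huge data, this could be spilled to Parquet instead

# Usage sketch:
# block = CsvInputBlock({"path": "dataset.csv"})
# block.process()
```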
Output Nodes:
All output nodes inherit from OutputDataBlock.
We need one output Node class for each format handled by the BB.
Each output block specifies the inputs it can process.
The block's widgets allow the user to set up the output and visualize its status.
On the back-end counterpart class:
One method executes the node and exports a file or launches transactions to fill the destination.
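A sketch of the corresponding back-end class for an output node, here assuming a hypothetical CSV export (all names except OutputDataBlock are assumptions):

```python
import pandas as pd

class OutputDataBlock:
    """Hypothetical base class for output nodes."""

    accepted_inputs = ("dataframe",)    # socket types this output can process

    def __init__(self, destination: dict):
        self.destination = destination

    def execute(self, data) -> None:
        raise NotImplementedError

class CsvOutputBlock(OutputDataBlock):
    """Exports the incoming dataframe to a CSV file at the configured destination."""

    def execute(self, data: pd.DataFrame) -> None:
        data.to_csv(self.destination["path"], index=False)
```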
Transform Nodes:
All nodes called upon to manipulate data inherit from the transformation node.
A transform node can dynamically add inputs or outputs based on the data being manipulated, and uses widgets to expose parameters to the user.
For complex cases, we can imagine an additional modal window to visualize a sample of the data and provide a convenient interface for describing the desired relationships between data.
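A sketch of a transform node with dynamically added inputs, here a hypothetical column-selection transform (all names are assumptions):

```python
import pandas as pd

class TransformationBlock:
    """Hypothetical base class for transform nodes; ports can be added dynamically."""

    def __init__(self):
        self.inputs = {}            # port name -> incoming dataframe
        self.parameters = {}        # values exposed through front-end widgets

    def add_input(self, name: str) -> None:
        self.inputs[name] = None    # a new port created from the manipulated data

    def process(self) -> pd.DataFrame:
        raise NotImplementedError

class SelectColumnsBlock(TransformationBlock):
    """Keeps only the columns selected by the user in a widget."""

    def process(self) -> pd.DataFrame:
        df = self.inputs["dataset"]
        return df[self.parameters["columns"]]
```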
Considerations on revision identifiers
The workflow representation can be likened to a tree, where each node can have a unique revision ID representing its contents and children.
This gives us a convenient pattern to compare and identify modifications between two calls to the back-end API.
If we store intermediate results physically, for example as Parquet files, we can skip certain processing when a node retains the same revision ID between two calls and reconstruct only the parts where the user has made changes.
By keeping track of the last root revision on both the front-end and back-end sides of a call, it is also possible to detect desynchronization when several instances of a project are open, and thus make the user aware of the risk of overwriting.
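A minimal sketch of such a revision ID, computed as a hash over a node's own content and its children's revisions, in the spirit of a Merkle tree (function and field names are assumptions):

```python
import hashlib
import json

def revision_id(node: dict, children_revisions: list[str]) -> str:
    """Hash the node's own attributes together with its children's revision IDs."""
    payload = json.dumps(
        {"node": node, "children": children_revisions},   # child order is preserved
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# If neither the node nor any of its descendants changed, the revision ID stays identical
# between two calls, and the back-end can reuse the stored intermediate result (e.g. Parquet).
```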
The back-end will expose a Swagger/OpenAPI description.
For all the entities described, we will follow a REST API (Create / Update / Delete).
API output data blocks (GET)
And, around the project, all business actions:
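As an illustration, a FastAPI sketch of such endpoints, which also yields the Swagger/OpenAPI description automatically (paths, models and the in-memory storage are assumptions, not the final API):

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="DAAV back-end (sketch)")

class Project(BaseModel):
    name: str
    workflow: dict = {}              # the exported front-end JSON structure

projects: dict[int, Project] = {}    # in-memory store, for illustration only

@app.post("/projects")
def create_project(project: Project) -> dict:
    project_id = len(projects) + 1
    projects[project_id] = project
    return {"id": project_id}

@app.put("/projects/{project_id}")
def update_project(project_id: int, project: Project) -> dict:
    if project_id not in projects:
        raise HTTPException(status_code=404)
    projects[project_id] = project
    return {"id": project_id}

@app.delete("/projects/{project_id}")
def delete_project(project_id: int) -> None:
    projects.pop(project_id, None)

@app.get("/projects/{project_id}/outputs")
def get_output_data_blocks(project_id: int) -> list[dict]:
    # Hypothetical: return the output data blocks produced by the project's workflow.
    if project_id not in projects:
        raise HTTPException(status_code=404)
    return []
```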
Front-end:
Unit tests for all core class functions:
Back-end:
Unit tests for all endpoints and core class functions.
Core classes:
API endpoints:
Load testing with K6 to evaluate API performance.
Front-end: Selenium to test custom node deployment, usage, and interaction with custom parameters.
Enumerate all partner organizations and specify their roles in the development and planned operation of this component. Do this at a level which a) can be made public, and b) supports the understanding of the concrete technical contributions (instead of "participates in development", specify the functionality, added value, etc.).
Profenpoche (BB leader):
Inokufu :
BME, Cabrilog and Ikigaï are also partners available for beta-testing.
Specify the Dataspace Enabling Service Chain in which the BB will be used. This assumes that during development the block (lead) follows the service chain, contributes to this detailed design, and implements the block to meet the integration requirements of the chain.
The DAAV building block can be used as a data transformer to build new datasets from local data or from the Prometheus dataspace.
The output can also be shared on the Prometheus dataspace.