Socrat Data Integrator
SDI was made for dataflow. It supports highly configurable directed graphs of data routing, transformation, and system mediation logic.
Some of its key features include:
Data Provenance
- Track dataflow from beginning to end
Web-based user interface
- Seamless experience for design, control, and monitoring
- Multi-tenant user experience
Designed for extension
- Build your own processors and more
- Enables rapid development and effective testing
Secure
- SSL, SSH, HTTPS, encrypted content, etc.
- Pluggable fine-grained role-based authentication/authorization
- Multiple teams can manage and share specific portions of the flow
Highly configurable
- Loss tolerant vs guaranteed delivery
- Low latency vs high throughput
- Dynamic prioritization
- Flows can be modified at runtime
- Back pressure
- Scales up to leverage full machine capability
- Scales out with zero-leader clustering model
SDI is a dataflow system based on the concepts of flow-based programming. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. SDI has a web-based user interface for design, control, feedback, and monitoring of dataflows. It is highly configurable along several dimensions of quality of service, such as loss-tolerant versus guaranteed delivery, low latency versus high throughput, and priority-based queuing. SDI provides fine-grained data provenance for all data received, forked, joined, cloned, modified, sent, and ultimately dropped upon reaching its configured end-state.
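To make that provenance model concrete, the sketch below shows one way such events might be represented. The event types mirror the FlowFile lifecycle described above (received, forked, joined, cloned, modified, sent, dropped), but the type and field names are illustrative only, not SDI's actual API.

```java
import java.time.Instant;
import java.util.List;

// Hypothetical event types mirroring the FlowFile lifecycle described above.
enum ProvenanceEventType {
    RECEIVE, FORK, JOIN, CLONE, MODIFY, SEND, DROP
}

// Illustrative provenance record: which FlowFile, what happened, when, and lineage links.
record ProvenanceEvent(
        String flowFileId,          // the FlowFile this event concerns
        ProvenanceEventType type,   // what happened to it
        Instant timestamp,          // when it happened
        List<String> parentIds,     // lineage: FlowFiles this one was derived from
        String details) {           // free-form description of the change
}
```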
The SDI UI provides mechanisms for creating automated dataflows, as well as visualizing, editing, monitoring, and administering those dataflows. The UI can be broken down into several segments, each responsible for a different part of the application's functionality. This section provides screenshots of the application and highlights the different segments of the UI. Each segment is discussed in further detail later in the document.
Put simply, SDI was built to automate the flow of data between systems. While the term ‘dataflow’ is used in a variety of contexts, we use it here to mean the automated and managed flow of information between systems. This problem space has been around ever since enterprises had more than one system, where some of the systems created data and some of the systems consumed data. The problems and solution patterns that emerged have been discussed and articulated extensively. A comprehensive and readily consumed form is found in the Enterprise Integration Patterns [eip].
Some of the high-level challenges of dataflow include:
Systems fail
- Networks fail, disks fail, software crashes, people make mistakes.
Data access exceeds capacity to consume
- Sometimes a given data source can outpace some part of the processing or delivery chain – it only takes one weak link to have an issue.
Boundary conditions are mere suggestions
- You will invariably get data that is too big, too small, too fast, too slow, corrupt, wrong, or in the wrong format.
What is noise one day becomes signal the next
- Priorities of an organization change – rapidly.
- Enabling new flows and changing existing ones must be fast.
Systems evolve at different rates
- The protocols and formats used by a given system can change anytime and often irrespective of the systems around them.
- Dataflow exists to connect what is essentially a massively distributed system of components that are loosely or not-at-all designed to work together.
Compliance and security
- Laws, regulations, and policies change.
- Business to business agreements change.
- System to system and system to user interactions must be secure, trusted, accountable.
Continuous improvement occurs in production
- Replicating production environments in a lab setting is challenging, because real-world complexity is difficult to reproduce.

Dataflow, once considered a difficult necessity, is now crucial to enterprise success thanks to advances in technology. Key drivers include Service Oriented Architecture (SOA), the rise of APIs, the Internet of Things (IoT), and Big Data, each of which places new demands on dataflow management. At the same time, requirements for compliance, privacy, and security continue to grow, necessitating stricter data handling practices. While the patterns of dataflow remain largely the same, the scale, complexity, and need for rapid adaptation have all increased, making edge cases commonplace. SDI is built to address these modern dataflow challenges and to support the evolving demands of enterprise systems.

SDI executes within a JVM on a host operating system.
The primary components of SDI on the JVM are as follows:
Web Server
- The purpose of the web server is to host SDI’s HTTP-based command and control API.
Flow Controller
- The flow controller is the brains of the operation.
- It provides threads for extensions to run on, and manages the schedule of when extensions receive resources to execute.
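As a rough illustration of that scheduling role (a sketch, not SDI's actual implementation), the flow controller can be imagined as a thread pool that periodically grants each extension a chance to run:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal flow-controller sketch: it owns a thread pool and decides when
// each extension gets a turn to execute. All names here are hypothetical.
class FlowControllerSketch {
    private final ScheduledExecutorService pool = Executors.newScheduledThreadPool(4);

    // Schedule an extension's work at a fixed interval on a pooled thread.
    void schedule(Runnable extensionTask, long periodMillis) {
        pool.scheduleAtFixedRate(extensionTask, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        pool.shutdown();
    }
}
```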
Extensions
- There are various types of SDI extensions which are described in other documents.
- The key point here is that extensions operate and execute within the JVM.
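As a hedged sketch of what "extensions execute within the JVM" can mean in practice, an extension might simply be a Java class implementing a small callback interface that the flow controller invokes on its threads. The interface and class names below are illustrative, not SDI's extension API.

```java
// Hypothetical extension contract: the flow controller calls onTrigger()
// on one of its threads whenever the extension is scheduled to run.
interface ExtensionSketch {
    void onTrigger();
}

// Example extension that simply logs each time it is triggered.
class LoggingExtension implements ExtensionSketch {
    @Override
    public void onTrigger() {
        System.out.println("extension triggered at " + java.time.Instant.now());
    }
}
```

Wired together with the scheduling sketch above, usage would look like `new FlowControllerSketch().schedule(new LoggingExtension()::onTrigger, 1000);`.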
FlowFile Repository
- The FlowFile Repository is where SDI keeps track of the state of what it knows about a given FlowFile that is presently active in the flow.
- The implementation of the repository is pluggable.
- The default approach is a persistent Write-Ahead Log located on a specified disk partition.
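The write-ahead idea is that every change to a FlowFile's state is appended durably to disk before it takes effect, so the flow's state can be recovered after a crash. A minimal sketch under assumed names (not the actual repository implementation) might look like:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal write-ahead-log sketch: append each FlowFile state change to a
// log file and force it to durable storage before acknowledging the update.
class FlowFileWalSketch {
    private final Path logFile;

    FlowFileWalSketch(Path logFile) {
        this.logFile = logFile;
    }

    void logStateChange(String flowFileId, String newState) throws IOException {
        String entry = flowFileId + "\t" + newState + System.lineSeparator();
        Files.write(logFile, entry.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND,
                StandardOpenOption.SYNC); // SYNC forces the write to disk
    }
}
```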
Content Repository
- The Content Repository is where the actual content bytes of a given FlowFile live.
- The implementation of the repository is pluggable.
- The default approach is a fairly simple mechanism, which stores blocks of data in the file system.
- More than one file system storage location can be specified so as to get different physical partitions engaged to reduce contention on any single volume.
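To illustrate spreading content across more than one file system location (again a sketch under assumed names, not the actual mechanism), each write could be directed to whichever configured partition a simple strategy selects, for example round-robin:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a content repository spread over several disk partitions.
// Writes rotate through the partitions to reduce contention on any single volume.
class ContentRepositorySketch {
    private final List<Path> partitions;       // configured storage locations
    private final AtomicLong counter = new AtomicLong();

    ContentRepositorySketch(List<Path> partitions) {
        this.partitions = partitions;
    }

    // Store the content bytes of a FlowFile and return where they were placed.
    Path store(byte[] content) throws IOException {
        Path dir = partitions.get((int) (counter.getAndIncrement() % partitions.size()));
        Path target = dir.resolve(UUID.randomUUID().toString());
        return Files.write(target, content);
    }
}
```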
Provenance Repository
- The Provenance Repository is where all provenance event data is stored.
- The repository construct is pluggable with the default implementation being to use one or more physical disk volumes.
- Within each location, event data is indexed and searchable.
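As a small sketch of "indexed and searchable" (reusing the illustrative ProvenanceEvent record from earlier, not SDI's real storage format), events could be appended to a store and indexed by FlowFile identifier so that lineage can be looked up quickly:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a provenance store: events are appended to a list (standing in
// for a disk volume) and indexed by FlowFile id so lineage can be searched.
class ProvenanceRepositorySketch {
    private final List<ProvenanceEvent> events = new ArrayList<>();
    private final Map<String, List<ProvenanceEvent>> byFlowFile = new HashMap<>();

    void record(ProvenanceEvent event) {
        events.add(event);
        byFlowFile.computeIfAbsent(event.flowFileId(), id -> new ArrayList<>()).add(event);
    }

    // Search: return every recorded event for a given FlowFile.
    List<ProvenanceEvent> findByFlowFile(String flowFileId) {
        return byFlowFile.getOrDefault(flowFileId, List.of());
    }
}
```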