D-NET: Building Sustainable Aggregative Data Infrastructures
As a natural consequence of multi-disciplinary research and the strong requirement of immediate access to digital information, research communities have manifested the need to cross-operate over content from several, possibly heterogeneous and autonomous, data sources. This demand gave rise to a novel class of software systems, called aggregative data infrastructures, whose aim is to address the specific data collection, processing, and consumption needs of a community. Examples are repository federation systems in the Digital Library world (e.g., OAIster, BASE, NARCIS, Europeana), or more sophisticated infrastructures where digital content (i.e., payloads) must also be collected and processed before being made available to infrastructure users (e.g., D4Science, HOPE). Organizations willing to build such data infrastructures have to face considerable costs of:
- software realization, i.e., design and development: due to the absence of general-purpose data infrastructure software, "traditional" solutions tend to be realized in-house from scratch. To minimize realization costs, they address only the specific, static requirements for which they were conceived and, as such, are hardly reusable under different scenarios.
- maintenance of the public service (running system), i.e., hardware, system administration, and system refinement: harvesting activities, as well as content curation and refinement of the system to match new requirements, often turn out to be expensive for the average organization in the long term.
The D-NET software toolkit addresses these costs by providing ready-to-use service components that adhere to the following principles:
- Customizability: service components are designed to match functional patterns (e.g., indexing of metadata records), so that they can be customized to the needs of the application domain (e.g., indexing of records conforming to a given proprietary format); moreover, services can be combined in different ways (a LEGO-like approach) to match the functional requirements of the given application (see the index-service sketch after this list).
- Openness: organizations can add new service components, possibly offering business logic not yet provided by the D-NET software toolkit; furthermore, service components are loosely coupled, following SOA principles, so changes to their implementation, e.g., to improve performance or to adopt different underlying technologies, can be made transparently without compromising the running applications.
- Sharability: the same service component, made available by one organization, can be shared and reused by other organizations; this cooperative approach reduces overall costs by sharing hardware and functionality.
- Distribution: service components can be deployed at different sites, thereby improving the robustness, availability, and scalability of the system. For example, multiple replicas of indices can be kept, so as to ensure robustness in the case of server crashes, availability in the case of network failures, and scalability in the case of a large number of concurrent accesses (by workload distribution); a replica-failover sketch follows this list.
- Autonomicity: once the service components have been configured and installed, special services, called Manager Services, can be configured to orchestrate (monitor and master) the pool of services in order to automatically identify and fix misbehaviour or to guarantee Quality of Service at run-time. An example is the maintenance of synchronized replicas of indices or storage; see the Manager Service sketch below.
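To make the Customizability and Openness principles concrete, here is a minimal sketch of a generic index component that can be configured for a proprietary record format and transparently replaced by another implementation of the same interface. All names (MetadataRecord, IndexService, MemoryIndexService) are hypothetical illustrations of the pattern, not actual D-NET APIs.

```java
import java.util.*;

// Hypothetical sketch, not the actual D-NET API.
// A metadata record in some community-specific format.
record MetadataRecord(String id, Map<String, String> fields) {}

// Functional pattern: "indexing of metadata records". Any component
// implementing this interface can be plugged into the infrastructure,
// whatever index technology it uses underneath (SOA loose coupling).
interface IndexService {
    void feed(MetadataRecord record);
    List<String> query(String field, String value);
}

// One possible implementation: a naive in-memory inverted index.
// It could be replaced by, e.g., a Solr-backed component without
// changing the applications that depend on the interface.
class MemoryIndexService implements IndexService {
    private final Map<String, List<String>> inverted = new HashMap<>();
    // Customization point: which fields of the domain format to index.
    private final Set<String> indexedFields;

    MemoryIndexService(Set<String> indexedFields) {
        this.indexedFields = indexedFields;
    }

    public void feed(MetadataRecord record) {
        for (String field : indexedFields) {
            String value = record.fields().get(field);
            if (value != null) {
                inverted.computeIfAbsent(field + ":" + value,
                        k -> new ArrayList<>()).add(record.id());
            }
        }
    }

    public List<String> query(String field, String value) {
        return inverted.getOrDefault(field + ":" + value, List.of());
    }
}

public class CustomizabilityDemo {
    public static void main(String[] args) {
        // Configure the generic component for a proprietary format
        // that exposes "title" and "creator" fields.
        IndexService index = new MemoryIndexService(Set.of("title", "creator"));
        index.feed(new MetadataRecord("rec-1",
                Map.of("title", "D-NET", "creator", "CNR-ISTI")));
        System.out.println(index.query("creator", "CNR-ISTI")); // prints [rec-1]
    }
}
```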
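For the Distribution principle, the following sketch shows one way a client could query replicated index services: replicas are tried in rotating order, spreading the workload of concurrent accesses, and a crashed or unreachable replica is simply skipped. Again, ReplicaSet and the reduced IndexService interface are illustrative assumptions, not D-NET code.

```java
import java.util.*;

// Hypothetical sketch, not the actual D-NET API.
// Reduced view of an index component: only the query operation matters here.
interface IndexService {
    List<String> query(String field, String value);
}

// Client-side view of a set of index replicas deployed at different sites.
class ReplicaSet {
    private final List<IndexService> replicas;
    private int next = 0; // rotating starting point, spreads the workload

    ReplicaSet(List<IndexService> replicas) {
        this.replicas = replicas;
    }

    List<String> query(String field, String value) {
        for (int i = 0; i < replicas.size(); i++) {
            int idx = (next + i) % replicas.size();
            try {
                List<String> result = replicas.get(idx).query(field, value);
                next = (idx + 1) % replicas.size(); // round-robin for scalability
                return result;
            } catch (RuntimeException unreachable) {
                // Server crash or network failure: fall through to the
                // next replica, preserving robustness and availability.
            }
        }
        throw new IllegalStateException("all index replicas are unavailable");
    }
}
```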
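Finally, for the Autonomicity principle, a Manager Service can be sketched as a scheduled loop that monitors a pool of replicas and triggers a corrective action, here re-synchronizing stale index replicas from the most complete live copy. ManagedReplica, ManagerService, and resyncFrom are assumed names for illustration only.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch, not the actual D-NET API.
// What the Manager Service needs to observe and command in each replica.
interface ManagedReplica {
    boolean isAlive();
    long recordCount();
    void resyncFrom(ManagedReplica reference);
}

// A Manager Service orchestrating a pool of index replicas: it
// periodically checks the pool and re-aligns replicas that have
// drifted out of sync, keeping Quality of Service at run-time.
class ManagerService {
    private final List<ManagedReplica> pool;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    ManagerService(List<ManagedReplica> pool) {
        this.pool = pool;
    }

    void start() {
        // Monitor the pool at a fixed interval.
        scheduler.scheduleAtFixedRate(this::checkAndFix, 0, 30, TimeUnit.SECONDS);
    }

    private void checkAndFix() {
        // Take the live replica holding the most records as the reference copy.
        ManagedReplica reference = pool.stream()
                .filter(ManagedReplica::isAlive)
                .max(Comparator.comparingLong(ManagedReplica::recordCount))
                .orElseThrow(() -> new IllegalStateException("no live replica"));
        for (ManagedReplica replica : pool) {
            // Detected misbehaviour: a live replica lagging behind the
            // reference is automatically re-synchronized.
            if (replica != reference && replica.isAlive()
                    && replica.recordCount() < reference.recordCount()) {
                replica.resyncFrom(reference);
            }
        }
    }
}
```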