Data Curation and Enrichment Area

The Data Curation and Enrichment Area contains services that operate over an information space in order to adjust, refine, or add content to it. Such services typically enter the data processing workflow once XML metadata records have been collected from data sources, mapped onto internal XML representations, and stored using one of the services in the Data Storage and Indexing Area (graph store, full-text index, relational database, or metadata store).
Specifically, these services are:

  • Deduplication Service: It efficiently de-duplicates collections of (tens of) millions of entities. In simple terms, the tool automatically identifies records that potentially represent the same entity and, when this is the case, allows data curators to merge them manually or automatically. Curators can configure the notion of similarity that best suits the structure of their collection. Its efficiency outclasses other known solutions thanks to its parallel implementation and Cassandra back-end.
  • Records Tagging Service: It tags large collections of records stored in D-NET Full-Text Index Services (more generally, Solr indices). The tool allows data curators to run searches and search refinements on top of the index and to bulk-tag the search results with a selection of tag terms. Most importantly, data curators can perform a number of tagging actions in a temporary session (end users running queries over the index will not see the changes) before eventually committing the result and making the changes visible in the production index.
  • Text Similarity Service*
  • Citation Service*
  • Classification Service*
  • End-User Feedbacks Service*
  • Metadata Editor Service

(*) Under development for the project OpenAIREplus, to be delivered by 2013.
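
To make the idea behind the Deduplication Service more concrete, the following is a minimal sketch of similarity-based duplicate detection. It is not the D-NET implementation: the function names, the blocking key (first four characters of the title), the Jaccard token similarity, and the 0.8 threshold are all assumptions chosen for illustration, and the sketch runs in memory on a single machine rather than in parallel over a Cassandra back-end.

```python
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets, in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_duplicates(records, field="title", threshold=0.8):
    """Group record ids whose `field` values are similar enough.

    Hypothetical helper: records are first split into blocks by a cheap
    key so that only records in the same block are compared pairwise.
    """
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Blocking: the key is the first four characters of the field (assumption).
    blocks = defaultdict(list)
    for r in records:
        blocks[r[field].lower()[:4]].append(r)

    # Compare pairs inside each block and link those above the threshold.
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if jaccard(tokens(a[field]), tokens(b[field])) >= threshold:
                parent[find(a["id"])] = find(b["id"])

    # Collect the linked records into candidate groups for merging.
    groups = defaultdict(list)
    for r in records:
        groups[find(r["id"])].append(r["id"])
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


RECORDS = [
    {"id": "r1", "title": "Data Curation in Digital Libraries"},
    {"id": "r2", "title": "data curation in digital libraries"},
    {"id": "r3", "title": "Data Curation in Digital Library Systems"},
    {"id": "r4", "title": "Graph Stores for Metadata"},
]

print(find_duplicates(RECORDS))  # [['r1', 'r2']]
```

Lowering the threshold (for example to 0.5) also pulls `r3` into the group, which illustrates why the notion of similarity needs to be configurable per collection. The resulting groups are only candidates: a curator, or an automatic rule, would still decide whether to merge them.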
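
The temporary-session behaviour of the Records Tagging Service can be sketched in the same spirit. The class below is a hypothetical in-memory stand-in, not the D-NET or Solr API: the index is a plain dictionary, search is a substring match, and the names `TaggingSession`, `tag`, `preview`, and `commit` are assumptions. It only shows the principle that staged tags stay invisible to the production index until they are committed.

```python
class TaggingSession:
    """Stage bulk tag changes against an index and apply them on commit."""

    def __init__(self, index):
        # index: dict mapping record id -> {"text": str, "tags": set}
        self.index = index
        # pending: record id -> tags staged in this session only
        self.pending = {}

    def search(self, term):
        """Return the ids of records whose text contains `term`."""
        term = term.lower()
        return sorted(
            rid for rid, rec in self.index.items() if term in rec["text"].lower()
        )

    def tag(self, record_ids, tags):
        """Stage `tags` for every record in `record_ids` (bulk tagging)."""
        for rid in record_ids:
            self.pending.setdefault(rid, set()).update(tags)

    def preview(self, rid):
        """Tags the curator sees in the session: committed plus staged."""
        return self.index[rid]["tags"] | self.pending.get(rid, set())

    def commit(self):
        """Write all staged tags to the index and end the staging state."""
        for rid, tags in self.pending.items():
            self.index[rid]["tags"] |= tags
        self.pending.clear()


index = {
    "a": {"text": "Open access repositories", "tags": set()},
    "b": {"text": "Closed archive", "tags": set()},
    "c": {"text": "Open data policy", "tags": {"policy"}},
}

session = TaggingSession(index)
hits = session.search("open")        # ids of matching records
session.tag(hits, {"open-science"})  # staged, not yet in the index
session.commit()                     # now visible in the index
```

Keeping staged changes in a separate structure means that abandoning a session requires no clean-up of the index, at the cost of merging committed and staged tags whenever the curator previews a record.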