
Welcome to Scicat Ingestor

License: BSD 3-Clause Contributor Covenant

SciCat Ingestor is a versatile application whose primary purpose is to automate the ingestion of new datasets into SciCat.

Scicat Ingestor aims to support FAIR data
by making files visible in SciCat together with their metadata.

The project is composed of two main components:

  • online ingestor

    connects to a Kafka cluster, listens to selected topics for a specific message, and triggers the data ingestion by running the offline ingestor as a background process. At the moment, this component is specific to the ESS IT infrastructure, but there are plans to generalize it as soon as other facilities express interest in adopting it.

    For details, see the online ingestor page.

  • offline ingestor

    can be run by the online ingestor or by an operator. It collects all the necessary metadata and creates a dataset entry in SciCat.

    For details, see the offline ingestor page.
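The hand-off between the two components can be pictured as the listener starting one child process per file and returning immediately. The sketch below is a minimal illustration using only the Python standard library. The function names and the `--data-file` and `--job-id` flags are hypothetical placeholders chosen for this example and do not reflect the project's actual command line; a tiny Python one-liner stands in for the offline ingestor.

```python
import subprocess
import sys
from typing import List


def build_offline_command(executable: List[str], file_path: str, job_id: str) -> List[str]:
    """Assemble the argument list for one ingestion run.

    The flag names are placeholders for this sketch, not the real CLI.
    """
    return [*executable, "--data-file", file_path, "--job-id", job_id]


def launch_in_background(command: List[str]) -> subprocess.Popen:
    """Start the command without waiting for it to finish.

    Returning the Popen handle lets the caller keep consuming messages
    and check on the child process later.
    """
    return subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)


if __name__ == "__main__":
    # Stand-in for the offline ingestor: a program that echoes its arguments.
    stand_in = [sys.executable, "-c", "import sys; print(sys.argv[1:])"]
    process = launch_in_background(
        build_offline_command(stand_in, "/data/run_001.hdf", "job-1")
    )
    output, _ = process.communicate(timeout=30)
    print(process.returncode, output.decode().strip())
```

Running the child as a separate process keeps a slow or failing ingestion from blocking the listener; the real ingestor may manage its child processes differently.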

Key Features

  • Continuously and asynchronously retrieve file information from Kafka.
  • Retrieve metadata from files.
  • Ingest files, along with the retrieved metadata, into SciCat.

Infrastructure around Scicat Ingestor

scicat-ingestor is written for a specific infrastructure setup, shown below:

---
title: Infrastructure around Scicat Ingestor
---
graph LR
    filewriter@{ shape: processes, label: "File Writers" } -.write file.-> storage[(Storage)]
    filewriter --report (wrdn)--> kafkabroker[Kafka Broker]
    ingestor[Scicat Ingestor] -.subscribe (wrdn).-> kafkabroker
    storage -.read file.-> ingestor
    ingestor --report--> log[Gray Log]

| Framework | Required | Description |
|---|---|---|
| Scicat | O | Scicat service that scicat ingestor can ingest files to. |
| Kafka | O | Kafka broker that scicat ingestor can receive write done messages from. All messages are assumed to be serialized as flatbuffer using these schemas: flatbuffer schemas for filewriter. scicat-ingestor uses the python wrapper of those schemas to deserialize messages. Currently only the wrdn schema is used. |
| File Writer | O and X | Any process that can write files and produce write done messages can be used. |
| GrayLog | X | Optional. scicat ingestor has a built-in stdout logging option. |
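Because only the wrdn schema is used, a consumer can cheaply discard unrelated messages before deserializing them. The sketch below relies on the general FlatBuffers layout, where an optional 4-byte file identifier sits at bytes 4 to 7 of the buffer. It assumes that the filewriter schemas declare such an identifier and that it equals `wrdn`; the function names are made up for this example, and actual deserialization should still go through the Python wrapper mentioned in the table.

```python
from typing import Optional

# Assumed identifier of the write done schema.
WRDN_SCHEMA_ID = "wrdn"


def schema_id(payload: bytes) -> Optional[str]:
    """Return the 4-character FlatBuffers file identifier of a message.

    Returns None when the payload is too short to contain one.
    """
    if len(payload) < 8:
        return None
    # Bytes 0-3 hold the root table offset, bytes 4-7 the file identifier.
    return payload[4:8].decode("ascii", errors="replace")


def is_write_done(payload: bytes) -> bool:
    """Tell whether a raw Kafka message value looks like a wrdn message."""
    return schema_id(payload) == WRDN_SCHEMA_ID


if __name__ == "__main__":
    # Hand-built stand-in for a message: offset, identifier, padding.
    fake_message = (8).to_bytes(4, "little") + b"wrdn" + bytes(8)
    print(is_write_done(fake_message))  # True
    print(is_write_done(b"short"))      # False
```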


File Ingesting Sequence

Here is a simple overview of how the ingestion is done.

---
title: File Ingesting Sequence
---

sequenceDiagram
  create participant File Writer
  create actor File
  File Writer --> File: File Written
  loop Ingest Files
    Ingestor -->> Kafka Broker: Subscribe (listening to writing done - wrdn)
    Kafka Broker ->> Ingestor: Writing Done Message (wrdn)
    Note over Ingestor: Parse writing done message
    Ingestor ->> File: Check file
    opt
      Ingestor ->> File: Parse Metadata
    end
    Note over Ingestor: Wrap files and metadata as Scicat Dataset
    critical
      Ingestor ->> Scicat: Ingest File
    end
  end
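The ingestion sequence above can also be read as a small function: check that the reported file exists, optionally parse metadata, wrap both as a dataset, and hand it to SciCat. The sketch below is an illustration under stated assumptions only. `WriteDone`, `Dataset`, and `ingest_one` are hypothetical names, the metadata reader and the SciCat call are passed in as plain callables, and none of this mirrors the project's actual classes or the SciCat API.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Optional


@dataclass
class WriteDone:
    """Hypothetical, simplified view of a parsed wrdn message."""
    job_id: str
    file_path: Path


@dataclass
class Dataset:
    """Hypothetical container pairing a file with its metadata."""
    job_id: str
    file_path: Path
    metadata: dict = field(default_factory=dict)


def ingest_one(
    message: WriteDone,
    send_to_scicat: Callable[[Dataset], None],
    read_metadata: Optional[Callable[[Path], dict]] = None,
) -> bool:
    """Run one pass of the sequence; return False if the file is missing."""
    # "Check file": skip messages that point at a file we cannot see.
    if not message.file_path.is_file():
        return False
    # "Parse Metadata" is optional, matching the opt block in the diagram.
    metadata = read_metadata(message.file_path) if read_metadata else {}
    # "Wrap files and metadata as Scicat Dataset".
    dataset = Dataset(message.job_id, message.file_path, metadata)
    # "Ingest File": the critical step, delegated to the caller's function.
    send_to_scicat(dataset)
    return True


if __name__ == "__main__":
    ingested = []
    with tempfile.TemporaryDirectory() as folder:
        data_file = Path(folder) / "run_001.hdf"
        data_file.write_bytes(b"placeholder")
        ok = ingest_one(
            WriteDone("job-1", data_file),
            send_to_scicat=ingested.append,
            read_metadata=lambda path: {"size_bytes": path.stat().st_size},
        )
    print(ok, ingested[0].metadata)
```

Passing the SciCat call in as a function keeps the sketch independent of any client library and makes the sequence easy to exercise in a test.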



Here is the typical file writing sequence, including when the files are created and opened.

---
title: File Writing Sequence
---
sequenceDiagram
loop File Writing
    File Writer -->> Kafka Broker: Subscribe (run start)
    Kafka Broker ->> File Writer: Run Start
    create actor File
    File Writer ->> File: Create File and Close
    File Writer --> File: Open File as Append Mode
    loop File Writing
        File Writer -->> Kafka Broker: Subscribe Relevant Topics for the run
        Kafka Broker ->> File Writer: Detector Data/Log/etc ...
        File Writer ->> File: Write Data in the File.
    end
    Kafka Broker ->> File Writer: Run Stop
    File Writer --> File: Close File.
    Note over File Writer: Compose wrdn message including id and file path
    File Writer ->> Kafka Broker: Report (writing done - wrdn)
end

Used At

scicat ingestor is mainly maintained by ESS DMSC, but it can be used in any system that has the same infrastructure setup.

European Spallation Source

Quick Start

We do not publish scicat-ingestor to any package index.
You can download it directly from our GitHub page.

git clone https://github.com/SciCatProject/scicat-ingestor.git
cd scicat-ingestor
git fetch origin
git checkout v25.01.0  # Latest Version
pip install -e .  # It will allow you to use entry-points of the scripts,
                  # defined in ``pyproject.toml``, under ``[project.scripts]`` section.
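After the editable install, one way to confirm that the entry points were registered is to list the console scripts of the installed distribution. The sketch below uses only `importlib.metadata` from the standard library. The helper name `console_scripts` is made up for this example, and the distribution name `scicat-ingestor` is an assumption based on the repository name; check `pyproject.toml` for the actual name.

```python
from importlib.metadata import PackageNotFoundError, distribution
from typing import Dict


def console_scripts(distribution_name: str) -> Dict[str, str]:
    """Map command names to their targets for an installed distribution.

    Returns an empty dict when the distribution is not installed.
    """
    try:
        dist = distribution(distribution_name)
    except PackageNotFoundError:
        return {}
    # Entries under [project.scripts] are exposed in the console_scripts group.
    return {
        entry.name: entry.value
        for entry in dist.entry_points
        if entry.group == "console_scripts"
    }


if __name__ == "__main__":
    # "scicat-ingestor" is an assumed distribution name.
    for command, target in console_scripts("scicat-ingestor").items():
        print(f"{command} -> {target}")
```

An empty result means either that the install did not succeed in the current environment or that the distribution is registered under a different name.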

Contribution

Anyone is welcome to contribute to our project.

Please check our developer guide.