This server provides a temporary GDI Starter Kit demo, which is not the final solution that Estonia will be offering in October 2026.
Overview
This server is hosting the GDI Starter Kit to demonstrate the potential GDI solution and workflow. This deployment is provided as part of the onboarding phase by the Estonian GDI team. Technically, this is a single Ubuntu 24.04 LTS server where the services are deployed using docker-compose. The server has 8 CPUs, 16 GB of RAM, and 500 GB of SSD storage.
Deployment follows the default solution of the Starter Kit without significant deviations. For the purpose of demonstration, we use the synthetic dataset EGAD00001003338 here, which is not considered sensitive. To keep things simple, no subdomains are used, and all services sit behind a single nginx web server acting as a reverse proxy.
The Starter Kit components are accessible from the top-level menu. The deployed Docker-based services are the following:
- Beacon API (v2):
  - Database: mongo:8
  - Data ingestion: ghcr.io/ega-archive/beacon2-ri-tools-v2:latest
  - API service: registry.hpc.ut.ee/gdi/starter-kit-beacon2-ri-api:v2.0.1
- Funnel:
  - Software: ghcr.io/mrtamm/funnel:2024-03-03
- Htsget:
  - Software: harbor.nbis.se/gdi/htsget-rs:20240430
- REMS:
  - Database: postgres:16-alpine
  - Software: cscfi/rems:latest
- Sensitive Data Archive (SDA):
  - Object storage: minio/minio:latest
  - Database: neicnordic/sensitive-data-archive:v0.3.165-postgres
  - Messaging: neicnordic/sensitive-data-archive:v0.3.165-rabbitmq
  - SDA microservices: neicnordic/sensitive-data-archive:v0.3.165
  - SDA download: ghcr.io/neicnordic/sensitive-data-archive:v0.3.165-download
Services in the Starter Kit
A quick reminder about the list of services. The deployments of these services can be accessed from the top-level menu. Here the links refer to the corresponding GitHub repositories.
- Catalogue and Access Control / REMS
- For managing a public catalogue of resources (datasets) and access (request/approval) to them.
- Storage and Interfaces / Sensitive Data Archive
- For storing the genomic files – encrypted at rest (using Crypt4gh), decrypted/re-encrypted for authorized requests (through Funnel).
- Data Discovery / Beacon v2
- For enabling genomic data queries (Beacon v2 API) by interested researchers. These queries help researchers estimate the number of individuals in the node with matching genomic sequences.
- Data Reception / Htsget, SDA-Download
- The authorized gateways for requesting the permitted data from the Storage to the Data Analysis environment. In addition, Htsget supports data filtering, so the download process can be optimised.
- Data Analysis / Funnel
- Task execution service where the permitted data can be analysed.
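To make the Data Discovery role concrete, the following is a minimal sketch of the JSON body a Beacon v2 count query might carry. The field names follow the public Beacon v2 framework, but the specific variant coordinates and the target host are illustrative placeholders, not values tied to this deployment.

```python
import json

# Sketch of a Beacon v2 request body asking how many individuals in the
# node carry a given variant. Field names follow the Beacon v2 framework;
# the coordinates below are example values, not data from this server.
beacon_query = {
    "meta": {"apiVersion": "2.0"},
    "query": {
        "requestParameters": {
            "assemblyId": "GRCh38",
            "referenceName": "17",
            "start": [43045704],
            "referenceBases": "G",
            "alternateBases": "A",
        },
        # "count" granularity returns only the number of matches,
        # never record-level data.
        "requestedGranularity": "count",
    },
}

# The payload would be POSTed to the node's g_variants endpoint.
print(json.dumps(beacon_query, indent=2))
```

Because the response granularity is limited to counts, a researcher learns only whether and how often a variant occurs in the node, which is what makes public discovery over sensitive data acceptable.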
Besides these, there are also central services in the GDI:
- Life Science AAI for federated user authentication
- User Portal as the central user interface for the end-users (researchers)
Userflows
Note: the following userflow descriptions are heavily generalised and simplified for novice readers. The final workflows will be more detailed, with more nuances and restrictions.
Data Submission
User roles:
- Data Provider: data uploader
Preconditions:
- Data Provider has genomic files and their index files that are ready for sharing through the GDI.
User activity:
- User authenticates at SDA (component: sda-auth) and obtains a session token (JWT).
- User uses the sda-cli command-line tool for uploading the files to sda-s3inbox.
- SDA processes the files (registers file attributes) and forms a dataset.
- User registers the dataset in REMS as a resource, and creates a public catalogue item, which is associated with an access-request workflow.
- User runs the Beacon pipeline on the VCF file so that searchable data is ingested into the MongoDB database, which the Beacon API uses to respond to genomic queries.
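The ingestion step in essence turns VCF records into queryable JSON documents. The sketch below illustrates that idea only; the actual schema produced by beacon2-ri-tools-v2 is considerably richer, and the field names here are simplified assumptions.

```python
# Illustrative only: a simplified shape of the kind of document a Beacon
# ingestion pipeline could derive from one VCF record and store in MongoDB.
# The real beacon2-ri-tools-v2 schema is richer; these field names are a sketch.
def vcf_record_to_document(chrom, pos, ref, alt, dataset_id):
    return {
        "datasetId": dataset_id,
        "variation": {
            "location": {"chromosome": chrom, "start": pos},
            "referenceBases": ref,
            "alternateBases": alt,
        },
    }

# Example: one record from the demo dataset (coordinates are made up).
doc = vcf_record_to_document("17", 43045704, "G", "A", "EGAD00001003338")
print(doc["variation"]["location"])
```

Once such documents are indexed, a Beacon count query reduces to counting documents whose variation fields match the request parameters.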
Dataset Discovery And Access
User roles:
- User Portal: the frontend and Beacon Network (also a FAIR Data Point API, which is not yet part of the Starter Kit)
- Researcher
- Data Access Committee: people who have to approve or reject data access requests.
Preconditions:
- GDI nodes have their endpoints registered in the User Portal, and the endpoints are accessible to the User Portal.
User (researcher) activity:
- User performs a genomic query in the User Portal.
- User Portal delegates the query to the GDI nodes (Beacon API endpoints), and collects responses to be returned to the user.
- User Portal displays the matching datasets to the user, showing the source node and the number of matching individuals.
- User adds potentially useful datasets to the basket in the User Portal.
- User proceeds to the checkout form where the user fills in the application form for requesting access to the selected datasets. (The form requests information about the purpose and the context of the research.)
- The User Portal submits the form to (central) REMS.
- User can see the status of the application, waiting for it to be reviewed by the Data Access Committee.
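The fan-out and collection step above can be sketched as a small pure function: assuming each node answers a count query with a Beacon-style response summary, the portal keeps only the nodes that reported matches. The response field names are simplified assumptions, not the exact payload of this deployment.

```python
# Sketch of the User Portal's aggregation step: each GDI node answers a
# Beacon count query, and the portal combines the per-node counts for the
# researcher. The response shape here is a simplified assumption.
def aggregate_node_responses(responses):
    results = []
    for node, resp in responses.items():
        summary = resp.get("responseSummary", {})
        if summary.get("exists"):
            results.append({
                "node": node,
                "individuals": summary.get("numTotalResults", 0),
            })
    return results

# Example with two hypothetical nodes; only the node with matches is shown.
demo = {
    "node-ee": {"responseSummary": {"exists": True, "numTotalResults": 12}},
    "node-se": {"responseSummary": {"exists": False}},
}
print(aggregate_node_responses(demo))
```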
User (Data Access Committee) activity:
- User receives a notification about a received application.
- User reviews the application and checks its conformance to the legal and ethical requirements.
- User may request additional details from the applicant.
- Finally, the user approves or rejects the application, and the researcher is notified of the decision.
- On approval, REMS issues a visa to the researcher asserting the user's permission for the specified resources until a certain expiry date.
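The visa mentioned above follows the GA4GH Passport model: a signed JWT whose claims assert controlled access to a dataset until an expiry time. The sketch below shows only the claim structure; the issuer URL, dataset URL, and lifetime are example values, not those produced by this deployment.

```python
import time

# Sketch of the claims inside a GA4GH "ControlledAccessGrants" visa, the
# kind of token REMS can issue on approval. Issuer, subject, dataset URL,
# and lifetime below are placeholders, not values from this deployment.
now = int(time.time())
visa_claims = {
    "iss": "https://rems.example.org",        # placeholder issuer
    "sub": "researcher@lifescience-ri.eu",    # placeholder subject
    "exp": now + 90 * 24 * 3600,              # example: valid for ~90 days
    "ga4gh_visa_v1": {
        "type": "ControlledAccessGrants",
        "value": "https://ega-archive.org/datasets/EGAD00001003338",
        "asserted": now,
        "by": "dac",                          # asserted by the Data Access Committee
    },
}

# Downstream services (SDA download, TES) accept the visa only while unexpired.
print(visa_claims["ga4gh_visa_v1"]["type"])
```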
Data Analysis at GDI Node
User roles:
- Researcher
- Task Execution Service (TES) at GDI node.
Preconditions:
- Researcher has been granted access to dataset files located in the GDI node.
- Researcher has been informed about the location of the TES API.
- Researcher has been informed about the file URLs of the dataset.
- Researcher knows how to use the TES API.
User (researcher) activity:
- User prepares a TES task:
- environment (Docker image)
- command/script to be executed
- input files (references to SDA and Htsget services)
- output files (to be preserved when task completes)
- User authenticates at Life Science AAI and obtains a session token (JWT) to use with the TES API.
- User submits the task to the TES API and checks the status of the task to see if it has completed.
- User can view the log of the task using TES API.
- User can submit multiple tasks.
- When the user wants to collect some output files to their own storage, the user will contact the GDI node (helpdesk) and request the files. The GDI node (including the original data provider) will verify that the files match the data export requirements (general results and not individual-specific sensitive information).
- Eventually, when the data access visa expires, the user will no longer be able to perform data analysis on the data.
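The task the researcher prepares above maps directly onto the GA4GH TES task structure that Funnel implements. The sketch below shows that structure; the container image, command, and file URLs are placeholders, not endpoints of this server.

```python
import json

# Sketch of a GA4GH TES task matching the steps above: an environment
# (Docker image), a command, input files, and output files. The image,
# command, and URLs are placeholders; only the overall structure follows
# the TES specification.
tes_task = {
    "name": "flagstat-demo",
    "inputs": [{
        "url": "https://htsget.example.org/reads/EGAD00001003338/sample1",
        "path": "/data/sample1.bam",
        "type": "FILE",
    }],
    "outputs": [{
        "url": "s3://results-bucket/sample1-flagstat.txt",
        "path": "/data/sample1-flagstat.txt",
        "type": "FILE",
    }],
    "executors": [{
        "image": "samtools:latest",  # placeholder analysis environment
        "command": ["samtools", "flagstat", "/data/sample1.bam"],
    }],
}

# The task would be POSTed to the node's TES endpoint with the JWT in the
# Authorization header; the response carries a task id for status polling.
print(json.dumps(tes_task, indent=2)[:80])
```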
Closing Remarks
Since data submission is a local (node-level) process, consider that each node may choose its own strategy to incorporate existing and future genomic data into the system. This process is very tricky due to the number of file formats, variations in quality/precision, different sample naming practices, very large file sizes, and legal/ethical expectations.
The data analysis workflow specified here is only one possible solution. In practice, researchers may also need graphical interfaces or compute cluster resources for analysing the data. Therefore, in the future, the set of data analysis tools might be more dispersed.
Finally, there is also great interest in using genomic data in clinical settings, and a different data access workflow will be adapted to that scenario. However, as the legal framework is still developing, it is too early to provide a sample workflow for it.
As the GDI is still an ongoing project, the final solution is still under discussion. If you are interested in what the Estonian GDI team is doing and would like to provide ideas or feedback, feel free to send us an email.