add a future work section regarding making the data space tamper proof#2

Open
Rahien wants to merge 1 commit into master from karel/lbron-1599-tamper-proof-dcat

Rahien (Contributor) commented Jun 22, 2026

No description provided.

mirdono (Member) left a comment

Imo the text is a bit of a weird mix between concrete and vague. On one hand, specific technologies are mentioned, with some rather concrete steps of how they are to be used. On the other hand, the responsibilities and data flows are not entirely clear to me. For example, who is responsible for signing which data?

Furthermore, should we not also consider the DCAT data itself? For example, how will we detect someone changing both a distribution's `dcat:downloadURL` and its `http://spdx.org/rdf/terms#checksum`?

### Making the data space tamper-proof

### Possible future work LBLOD related
The data space will provide a couple of DCAT Distributions holding various representations of the space's data sets. When downloading, users will want to be sure that the contents of these sets have not been tampered with. This can be guaranteed for simple downloads by creating a [SHA-256 hash](https://datatracker.ietf.org/doc/html/rfc6234) of the archive's contents, then signing the resulting checksum with the private key of the owner of the dataset and publishing that signed checksum as part of the distribution's DCAT description, for instance using the `http://spdx.org/rdf/terms#checksum` predicate. The public key corresponding to this private key can be published using the owner's DID (see [write-up-verifiable-credentials.md](write-up-verifiable-credentials.md "mention")). Users of the distribution can then check whether it has been tampered with by verifying the signature with the public key found in the owner's DID and confirming that the SHA-256 hash of their download yields the same checksum value.
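The checksum half of this flow can be sketched with the Python standard library. This is a minimal illustration, not the proposed implementation: it assumes the checksum is taken over the raw bytes of the downloaded file, the file name and function names are hypothetical, and the signature check is left out because it depends on the key type published in the owner's DID.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of the file's raw bytes, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path: Path, published_hex: str) -> bool:
    """Compare the local digest with the checksum taken from the DCAT description."""
    local_hex = sha256_of_file(path)
    return hmac.compare_digest(local_hex, published_hex.strip().lower())


def demo():
    """Check a stand-in download before and after its bytes are modified."""
    with tempfile.TemporaryDirectory() as tmp:
        download = Path(tmp) / "distribution.bin"  # hypothetical file name
        download.write_bytes(b"example distribution payload")
        published = sha256_of_file(download)  # stands in for the published value
        intact = matches_published_checksum(download, published)
        download.write_bytes(b"example distribution payload, modified")
        tamper_detected = not matches_published_checksum(download, published)
    return intact, tamper_detected
```

A matching checksum only shows that the download agrees with the published value. It says nothing about who published that value, which is why the signature over the checksum still has to be verified separately.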

Question: What kind of "archives" are we talking about here?

Do you mean actual archive files (zip, tar.gz, ...) or something else? In the case of the former, why bother with manually calculating the checksum and signing that? You are probably better off using something tried and tested like GnuPG to sign the files and then publishing the (contents of the) sig file. Side note: creating a SHA-256 checksum "of the archive's contents", as you say, has the same problem as the triples in the following paragraph: order matters.
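The order-dependence raised in the side note is easy to demonstrate. The sketch below assumes, purely for illustration, that "the archive's contents" means the member files hashed one after another. The function name and sample data are hypothetical.

```python
import hashlib


def digest_of_concatenated_members(members):
    """Hash the member contents one after another, in the order given."""
    digest = hashlib.sha256()
    for content in members:
        digest.update(content)
    return digest.hexdigest()


# Two stand-in archive members; the same data hashed in two different orders.
members = [b"<a> <b> <c> .\n", b"<d> <e> <f> .\n"]
forward = digest_of_concatenated_members(members)
backward = digest_of_concatenated_members(list(reversed(members)))

# Identical content, different order, different digest.
order_matters = forward != backward
```

Hashing the archive file's raw bytes avoids this ambiguity, since the byte sequence of a given file is fixed.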

The same process can be used to certify the correctness of the DCAT distribution regarding a certain dataset. We could construct an [N-Triples](https://www.w3.org/TR/rdf12-n-triples/) file that contains all triples making up the dataset and its distributions in a stable, repeatable fashion, for instance by sorting the triples by subject, then by predicate and then by object, excluding our signature predicate itself. We can then compute a SHA-256 hash of this N-Triples file and sign the result using the private key of the owner of the DCAT description, likely the owner of the dataset or the owner of the data space.
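The sort-then-hash step described above could look roughly like the following. It is a sketch under stated assumptions: the signature predicate IRI and the sample triples are hypothetical, the input is assumed to be one N-Triples statement per line, and blank nodes are not handled (their labels are not stable across serializations, so sorting alone would not make them repeatable).

```python
import hashlib

SIGNATURE_PREDICATE = "<http://example.org/ns#signature>"  # hypothetical IRI


def canonical_ntriples(lines):
    """Return the statements sorted by subject, predicate, object, minus signatures."""
    triples = set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)  # subject, predicate, object plus final " ."
        if len(parts) != 3:
            raise ValueError(f"not an N-Triples statement: {line!r}")
        subject, predicate, rest = parts
        if predicate == SIGNATURE_PREDICATE:
            continue  # the signature cannot be part of what it signs
        triples.add((subject, predicate, rest))
    return "".join(f"{s} {p} {o}\n" for s, p, o in sorted(triples))


def description_digest(lines):
    """SHA-256 over the UTF-8 bytes of the canonical serialization."""
    return hashlib.sha256(canonical_ntriples(lines).encode("utf-8")).hexdigest()


DIST = "<http://example.org/distribution/1>"  # hypothetical distribution IRI
description = [
    f"{DIST} <http://www.w3.org/ns/dcat#downloadURL> <http://example.org/files/dump.ttl> .",
    f'{DIST} <http://purl.org/dc/terms/title> "Example dump" .',
    f'{DIST} {SIGNATURE_PREDICATE} "placeholder-signature" .',
]
# Simulate an attacker redirecting the download URL.
tampered = [description[0].replace("dump.ttl", "other.ttl")] + description[1:]
```

With this construction the digest is independent of the input order and of the signature triple, while a changed `dcat:downloadURL` produces a different digest. Detecting that change still requires the recipient to rebuild the same serialization and to trust the key that signed it.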

Remark: It is unclear to me what exactly is meant to be verified here.

Is this meant to allow recipients of a distribution to verify that the received distribution contains the same data as the actual dataset? If so, this would require that recipients construct an n-triples file starting from the received distribution and check whether that file's checksum/signature matches the published one. This seems rather complicated and very fragile to do.

Side note: to me the last sentence leaves doubt as to which private key should be used, possibly even implying that private keys are passed around to sign data/checksums at the right place. This would be a big no-no.
