add a future work section regarding making the data space tamper proof #2
Rahien wants to merge 1 commit into
mirdono left a comment
IMO the text is an odd mix of concrete and vague. On the one hand, specific technologies are mentioned, with rather concrete steps for how they are to be used. On the other hand, the responsibilities and data flows are not entirely clear to me. For example, who is responsible for signing which data?
Furthermore, should we not also consider the DCAT data itself? For example, how will we detect someone changing a distribution's `dcat:downloadURL` as well as its `http://spdx.org/rdf/terms#checksum`?
### Making the data space tamper-proof

### Possible future work LBLOD related
The data space will provide a couple of DCAT Distributions holding various representations of the space's data sets. When downloading, users will want to be sure that the contents of these sets have not been tampered with. This can be guaranteed for simple downloads by creating a [SHA-256 hash](https://datatracker.ietf.org/doc/html/rfc6234) of the archive's contents, then signing the resulting checksum with the private key of the owner of the dataset and publishing that hash as part of the distribution's DCAT description, for instance using the `http://spdx.org/rdf/terms#checksum` predicate. The public key corresponding to this private key can be published using the owner's DID (see [write-up-verifiable-credentials.md](write-up-verifiable-credentials.md "mention")). Users of the distribution can then easily figure out if the distribution has been tampered with by applying the public key of the owner that they find in the owner's DID to the signature and verifying that the SHA256 of their download results in the same checksum value.
Question: What kind of "archives" are we talking about here?
Do you mean actual archive files (zip, tar.gz, ...) or something else? In the case of the former, why bother with manually calculating the checksum and signing that? You are probably better off using something tried and tested like GnuPG to sign the files and then publishing the (contents of the) sig file. Side note: creating a SHA-256 checksum "of the archive's contents", as you say, has the same problem as the triples in the following paragraph: order matters.
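To illustrate the ordering remark, here is a minimal Python sketch. The member names and contents are made up for illustration, and concatenating member bytes is only one assumed reading of hashing "the archive's contents". It shows that the digest changes with the order in which the members are fed to the hash, unless a fixed order is agreed on.

```python
import hashlib

# Hypothetical archive members; names and contents are invented for this sketch.
members = {
    "b.nt": b'<http://example.org/s> <http://example.org/p> "2" .\n',
    "a.nt": b'<http://example.org/s> <http://example.org/p> "1" .\n',
}


def digest_of_contents(names):
    """Return the SHA-256 hex digest of the members' bytes, fed in the given order."""
    h = hashlib.sha256()
    for name in names:
        h.update(members[name])
    return h.hexdigest()


forward = digest_of_contents(["a.nt", "b.nt"])
reverse = digest_of_contents(["b.nt", "a.nt"])
fixed = digest_of_contents(sorted(members))  # one possible agreed-upon order

print(forward == reverse)  # False: same members, different order
print(forward == fixed)    # True: sorting by name reproduces the same digest
```

Hashing or signing the bytes of the archive file itself avoids the question, since the file has exactly one byte sequence.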
The same process can be used to certify the correctness of the DCAT distribution regarding a certain dataset. We could construct an [N-Triples](https://www.w3.org/TR/rdf12-n-triples/) file that contains all triples making up the dataset and its distributions in a stable, repeatable fashion, for instance by sorting the triples by subject, then by predicate and then by object, excluding our signature predicate itself. We can then take this N-Triples file, again compute a SHA-256 hash of it, and sign the result using the private key of the owner of the DCAT description, likely the owner of the dataset or the owner of the dataspace.
Remark: It is unclear to me what exactly is meant to be verified here.
Is this meant to allow recipients of a distribution to verify that the received distribution contains the same data as the actual dataset? If so, recipients would have to construct an N-Triples file from the received distribution and check whether that file's checksum/signature matches the published one. This seems rather complicated and very fragile.
Side note: to me, the last sentence leaves doubt as to which private key should be used, possibly even implying that private keys are passed around to sign data/checksums in the right place. That would be a big no-no.
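For reference, the sort-then-hash step described in the paragraph, which every recipient would have to repeat exactly, could look roughly like the Python sketch below. It rests on several assumptions that are not in the proposal: the signature predicate IRI is hypothetical, the input has one well-formed triple per line with no trailing comments, and there are no blank nodes. Blank node labels are not stable across serialisations, so a real solution would likely need something like W3C RDF Dataset Canonicalization instead of a plain sort.

```python
import hashlib

# Hypothetical predicate carrying the signature; the proposal does not name one.
SIGNATURE_PREDICATE = "<http://example.org/ns#signature>"


def canonical_digest(ntriples: str) -> str:
    """Sort triples by subject, predicate, object and return a SHA-256 hex digest."""
    triples = set()
    for line in ntriples.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        subject, predicate, rest = line.split(None, 2)
        if predicate == SIGNATURE_PREDICATE:
            continue  # the signature itself is excluded from the hash
        obj = rest.rstrip()[:-1].rstrip()  # drop the terminating "."
        triples.add((subject, predicate, obj))
    canonical = "".join(f"{s} {p} {o} .\n" for s, p, o in sorted(triples))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


DIST = "<http://example.org/dist>"
published = (
    f'{DIST} <http://purl.org/dc/terms/title> "Set" .\n'
    f"{DIST} <http://www.w3.org/ns/dcat#downloadURL> <http://example.org/d.zip> .\n"
    f'{DIST} <http://example.org/ns#signature> "abc" .\n'
)
# Same triples, different order and spacing, signature triple absent.
received = (
    f"{DIST}   <http://www.w3.org/ns/dcat#downloadURL> <http://example.org/d.zip> .\n"
    f'{DIST} <http://purl.org/dc/terms/title> "Set" .\n'
)
# Download URL swapped by an attacker.
tampered = received.replace("http://example.org/d.zip", "http://evil.example/d.zip")

print(canonical_digest(published) == canonical_digest(received))  # True
print(canonical_digest(received) == canonical_digest(tampered))   # False
```

Even in this small form, the result depends on details such as whitespace normalisation and literal escaping, which is where the fragility comes from.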