For the week ending 16 July 2022

Rich Miller
5 min read · Jul 17, 2022

Sources that have caught my attention this week. This week’s offerings in the areas of Data Engineering and Data Regulatory Governance come as no surprise as we prepare the first distribution for design partners of Provenant Data’s Source Data Management (SDM) platform.

Data Engineering

Cerbos’s Secret Ingredients: Protobufs and gRPC

  • The New Stack, 14 July 2022

gRPC is an RPC framework that uses protobufs for data exchange. It uses HTTP/2 as the transport mechanism that allows it to make use of all the speed, security and efficiency features provided by the HTTP/2 spec for interservice (or even interprocess) communication. gRPC also benefits from code generation to make the RPC calls resemble native function calls in the programming language. …

The efficient binary encoding helped us save bandwidth and transmit messages between different processing pipelines quickly and efficiently. (At the scale of dealing with billions of messages, even a few bytes shaved off each message makes a massive difference.) Because encoded protobufs are language-agnostic, they were an ideal format to exchange data between applications written in different languages such as API services written in Go and data-processing pipelines written in Java or Python.
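The excerpt above can be grounded with a small interface sketch. The message and service names below are illustrative, not taken from Cerbos's codebase; the point is that field numbers (not names) go on the wire, which is where the byte savings come from, and that the `service` block is what gRPC's code generators turn into native-looking client and server stubs.

```protobuf
// A minimal sketch; Event and EventService are hypothetical names.
syntax = "proto3";

message Event {
  string id = 1;        // only the field number and value are encoded
  int64 timestamp = 2;  // varint encoding: small values take few bytes
  string payload = 3;
}

service EventService {
  // Generated stubs make this call look like a local function call
  // in Go, Java, Python, or any other supported language.
  rpc Publish (Event) returns (Event);
}
```

Because the binary encoding is defined by the `.proto` file rather than by any one language, a Go API service and a Java or Python pipeline can exchange the same `Event` bytes without translation.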

The Guide to Modern Data Architecture | Future

  • future.a16z.com: Matt Bornstein, Jennifer Li, Martin Casado
  • Originally published in 2020; recently updated.

To help data teams stay on top of the changes happening in the industry, we’re publishing in this post an updated set of data infrastructure architectures. They show the current best-in-class stack across both analytic and operational systems, as gathered from numerous operators we spoke with over the last year. Each architectural blueprint includes a summary of what’s changed since the prior version. …

We’ll also attempt to explain why these changes are taking place. We argue that core data processing systems have remained relatively stable over the past year, while supporting tools and applications have proliferated rapidly. We explore the hypothesis that platforms are beginning to emerge in the data ecosystem, and that this helps explain the particular patterns we’re seeing in the evolution of the data stack.

Data Governance and Regulation

Europe’s Big Tech Law Is Approved. Now Comes the Hard Part

  • WIRED, 08 July 2022

The potential gold standard for online content governance in the EU, the Digital Services Act, is now a reality after the European Parliament voted overwhelmingly for the legislation earlier this week. The final hurdle, a mere formality, is for the European Council of Ministers to sign off on the text in September. …

It will give users real control over and insight into the content they engage with, and offer protections from some of the most pervasive and harmful aspects of our online spaces. …

The focus now turns to implementation, as the European Commission begins in earnest to develop the enforcement mechanisms. The proposed regime is a complex structure in which responsibilities are shared between the European Commission and national regulators, in this case known as Digital Services Coordinators (DSCs). It will rely heavily on the creation of new roles, expansion of existing responsibilities, and seamless cooperation across borders. …

What’s clear is that as of now, there simply isn’t the institutional capacity to enact this legislation effectively.

5 Essential Steps for Building a GDPR-Compliant Data Strategy | Immuta

  • Author: Sophie Stalla-Bourdillon, 22 October 2021

[1] Detect and label sensitive and personal data

GDPR protects personal data by preemptively restricting its use before harm can be done. Therefore, prior to data usage, it’s key to detect and label personal data within an analytics environment.
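As a rough sketch of the "detect and label" step, the snippet below scans tabular rows with regex patterns and tags columns that contain likely personal data. The patterns and function names are hypothetical and deliberately minimal; production detection relies on far richer rules (NER models, dictionaries, checksum validation for national IDs, and so on).

```python
import re

# Hypothetical, minimal patterns -- real detectors are far more thorough.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def label_columns(rows):
    """Scan column values and return a {column: set_of_labels} map."""
    labels = {}
    for row in rows:
        for col, value in row.items():
            for tag, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    labels.setdefault(col, set()).add(tag)
    return labels

rows = [
    {"name": "Ada", "contact": "ada@example.com"},
    {"name": "Grace", "contact": "+1 555 867 5309"},
]
labels = label_columns(rows)  # the "contact" column gets both tags
```

The resulting label map is what downstream policy engines consume: a column tagged `email` can then be masked or restricted before any analyst query runs.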

[2] Review your de-identification capabilities

De-identification is the process by which the link between the data and the individual is altered. However, this does not mean that data custodians and/or data recipients are necessarily relieved of all obligations. … Data controls affect the visibility of the data and include the familiar techniques of tokenization, k-anonymization, and local and global differential privacy. … Context controls, on the other hand, affect the data’s environment and include data access controls and user segmentation, contracts, training, monitoring, and auditing.
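Of the data controls the excerpt lists, tokenization is the simplest to illustrate. The sketch below (key and field names are assumptions, not Immuta's implementation) replaces a direct identifier with a keyed HMAC token: the mapping is consistent, so joins across datasets still work, but reversal requires the key held by the data custodian.

```python
import hashlib
import hmac

SECRET = b"rotate-me"  # hypothetical key, held by the data custodian

def tokenize(value: str) -> str:
    """Replace a direct identifier with a keyed, consistent token.

    The same input always yields the same token (joins still work),
    but recovering the original requires the secret key. This is a
    data control; contracts, training, and auditing are the
    complementary context controls.
    """
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]

record = {"email": "ada@example.com", "purchase": "book"}
deidentified = {**record, "email": tokenize(record["email"])}
```

Note that tokenizing direct identifiers alone does not make data anonymous under GDPR; quasi-identifiers still need techniques such as k-anonymization or differential privacy, as the excerpt notes.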

[3] Build in controls for each data protection goal

Although data security is a key data protection goal under GDPR, the law goes well beyond it and includes purpose limitation, data minimization, data accuracy, transparency, accountability, and fairness.

[4] Make individual data requests executable

Under GDPR there are seven types of individual intervention: data correction, data access, data portability, data deletion, processing restriction, opt-out, and opt-in. … DataOps teams should be able to trigger four intervention functions and generate logs to capture metadata when these functions are being performed:

- Processing termination: the action of stopping the processing. In practice, this will mean that access to a processing domain must be time-based.

- Data deletion: the process by which data is put beyond use and destroyed. Processing termination is thus an essential first step in any data deletion process.

- Data export: the outputting of data for use by other systems. This involves translating the data into a format that can be reused by other systems.

- Data rewriting: the process of replacing attribute values. This function does not necessarily imply that all data analysts should have rewrite permissions for all processing purposes.
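The four intervention functions above can be sketched as follows. Everything here (the in-memory store, the log structure, the function names) is a hypothetical illustration of the pattern: each function performs its action and appends metadata to an audit log, and deletion invokes termination first.

```python
import json
import time

audit_log = []  # in practice, an append-only, tamper-evident store

def log(action, **meta):
    """Capture metadata whenever an intervention function runs."""
    audit_log.append({"action": action, "ts": time.time(), **meta})

def terminate_processing(domain):
    log("processing_termination", domain=domain)
    # here: revoke the time-based access grant for the processing domain

def delete_data(store, key):
    terminate_processing(key)  # termination precedes deletion
    store.pop(key, None)
    log("data_deletion", key=key)

def export_data(store, key):
    log("data_export", key=key)
    return json.dumps(store.get(key))  # a format other systems can reuse

def rewrite_data(store, key, field, value):
    store[key][field] = value  # e.g. servicing a data-correction request
    log("data_rewriting", key=key, field=field)

store = {"user-1": {"email": "ada@example.com"}}
rewrite_data(store, "user-1", "email", "new@example.com")
exported = export_data(store, "user-1")
delete_data(store, "user-1")  # leaves four audit entries behind
```

The audit entries, not the functions themselves, are what demonstrates compliance: a regulator asks for evidence that a deletion request was executed, and the log is that evidence.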

[5] Embed rich compliance-related metadata

GDPR’s record-keeping and transparency obligations require DataOps teams to capture four types of metadata.

- Object metadata, which describes the data elements, which are usually grouped into categories.

- User metadata, which describes the user or recipient having access to the data. In a multi-compute environment, the location of the data user will be of paramount importance to make the claim that data localization requirements are being met and no international transfer is happening.

- Activity metadata, which describes the processing performed upon the data.

- Context metadata, which describes the broader environment in which the data processing takes place, including the purpose for which the processing is being performed, the impact of the processing upon individuals, etc.
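The four metadata types above map naturally onto a per-processing-event record. The field names below are illustrative, not a standard schema; the point is that every processing activity carries all four facets, including the user's location for data-localization claims.

```python
from dataclasses import dataclass

# Illustrative schema only -- field names are assumptions.
@dataclass
class ProcessingRecord:
    object_meta: dict    # the data elements, grouped into categories
    user_meta: dict      # who accessed the data, and from where
    activity_meta: dict  # what processing was performed
    context_meta: dict   # purpose, expected impact on individuals

rec = ProcessingRecord(
    object_meta={"category": "contact data", "elements": ["email"]},
    user_meta={"user": "analyst-7", "location": "EU"},
    activity_meta={"operation": "aggregate"},
    context_meta={"purpose": "service improvement"},
)
```

Capturing `location` in the user metadata is what lets a team assert, per the excerpt, that no international transfer took place for a given processing event.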

--


Rich Miller

Silicon Valley irregular, CEO of Telematica, Inc. and Executive Chair of Provenant Data, Inc.