Schemas (Avro)


The DVS uses schemas for all data it manages. The use of schemas is somewhat controversial, but standardizing on a programmatic way of describing data inside the system has numerous benefits. Schemas (specifically Avro) provide:
1. Automatically updated documentation for data
2. Automatic ways to materialize data streams in alternative data stores (relational DBs, Hadoop, and others), eliminating much of the manual labor involved in actually using our data
3. Checks for overall data integrity
4. Help with modeling and reasoning about changes to data formats organization-wide
5. A space efficient binary serialization format


Internal data producers will be expected to define a schema for their data prior to ingest into the DVS. Schemas will be allowed to evolve using the “backwards compatible” rules for Avro schemas. The DVS manager will be used to enforce backwards compatibility when evolving schemas for the same data stream. Additionally, the DVS manager will provide guidance on how to create and modify schemas to maintain compatibility.

Creating backwards compatible schemas means that fields can be added and dropped, but only to some extent. Adding and dropping fields must be done in a particular way, providing defaults where possible so that older data can be read with newer schemas. Whenever possible, reading should be done with the latest schema registered in the DVS.

In cases where the schema cannot be evolved to handle the application use case, the expectation is that an entirely new versioned topic / data stream will be created with the new schema. Generally this should be reserved for significant changes to the shape or content of messages.


Example
The following is an example of a Customer entity message. This is a JSON message that’s produced whenever a user signs up for our service. It contains data such as the user’s identifier, their preferences, and address information.

{
  "id": 11623,
  "userName": "bob@myemail.com",
  "preferences": ["sporting-goods", "holiday-deals"],
  "address": {
    "streetAddress": "1123 Forester Way",
    "city": "Hoboken",
    "state": "New Jersey",
    "zipCode": 11972
  },
  "eventName": "CustomerCreated",
  "eventDate": "2016-09-18T17:34:02.666Z"
}


If we were to create a DVS ingestor for Customer, we would need to provide an Avro schema. Avro schemas are defined in JSON format and describe the “shape” of the data. An Avro schema for this message might look like the following:

{
	"type": "record",
	"namespace": "my-org.some-data",
	"name": "Customer",
	"hydra.key": "id",
	"fields": [{
			"name": "id",
			"type": "int"
		},
		{
			"name": "userName",
			"type": "string"
		},
		{
			"name": "preferences",
			"type": {
				"type": "array",
				"items": "string"
			}
		},
		{
			"name": "address",
			"type": {
				"type": "record",
				"name": "addressRecord",
				"fields": [{
						"name": "streetAddress",
						"type": "string"
					},
					{
						"name": "city",
						"type": "string"
					},
					{
						"name": "state",
						"type": "string"
					},
					{
						"name": "zipCode",
						"type": "int"
					}
				]
			}
		},
    {
      "name": "eventName",
      "type": "string"
    },
    {
      "name": "eventDate",
      "type": {
        "type": "string",
        "logicalType": "iso-datetime"
      }
    }
  ]
}


This represents a compound Avro data type called a “record” that consists of various “fields”. Each field has a name and a data type. There are a limited number of “primitive” types that cover the most common cases; these are null, boolean, int, long, float, double, bytes, and string. Avro allows arbitrary nesting, so you can use complex types such as records and arrays. While acceptable, the general recommendation is to avoid overly complex data types (lots of nested records and compound types) whenever possible. This recommendation exists merely to simplify the automation of materializing and moving data into equivalent forms in alternative data stores such as relational databases, document stores, and other storage systems.
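As a purely illustrative sketch of that recommendation (not a requirement), the nested address record above could instead be flattened into top-level fields on the Customer record; the flattened field names here are hypothetical:

{
	"name": "addressStreetAddress",
	"type": "string"
},
{
	"name": "addressCity",
	"type": "string"
},
{
	"name": "addressState",
	"type": "string"
},
{
	"name": "addressZipCode",
	"type": "int"
}

Nesting is perfectly valid Avro; the flattened form simply maps more directly onto columns when the stream is materialized into tabular stores.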

The preferences field above shows how to declare a collection/complex built-in type in Avro; map and enum follow the same pattern (a sketch follows below). Notice how the complex type definition is nested inside an outer type. We have also provided an example of how to specify a custom record with address. Avro is recursive, so you can arbitrarily nest records inside one another; the type for address is by itself a valid Avro schema definition.
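The field and symbol names below (attributes, accountTier) are purely illustrative and are not part of the Customer schema; they simply show how map and enum types are declared by nesting the complex type definition inside the field’s type:

{
	"name": "attributes",
	"type": {
		"type": "map",
		"values": "string"
	}
},
{
	"name": "accountTier",
	"type": {
		"type": "enum",
		"name": "AccountTier",
		"symbols": ["FREE", "BASIC", "PREMIUM"]
	}
}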

In the above Avro schema, all of the fields listed are required. When recording data, omitting a required field will result in the ingestor (in strict mode) rejecting the message and returning an error. To make a field optional, mark it as nullable and give it a default value of null. This is accomplished using an Avro union type, which looks like an array of types. Union types allow a field to have multiple potential data types; usually this is used to combine null with the expected data type. A field declared this way is optional, and if it is not provided it will be stored/read as null, much like a nullable column in a relational database.

When evolving schemas, fields can be added while still preserving backwards compatibility. This is accomplished by giving the new field a default value. When the new schema is used to read old records, the default value is used wherever the field is not present. In this way the latest schema can be used to read records written with older compatible schemas.
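For example, an optional field is declared as a union of null and the expected type, with a default of null; the same declaration also makes the field safe to add during schema evolution. The field name middleName below is hypothetical; a full example using inGoodStanding appears in the next section:

{
	"name": "middleName",
	"type": ["null", "string"],
	"default": null
}

Note that null is listed first in the union because Avro requires the default value to match the first type in the union.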

hydra.key and other required fields

There is a top-level schema attribute called hydra.key, which serves a similar purpose to a primary key in the relational model: it identifies the field(s) that uniquely identify a record. The hydra.key is required in all history streams to enable deriving a current-state compacted stream from the history stream. In addition, for Entity Streams with state changes, the eventName and eventDate fields are included to describe the event that occurred along with the timestamp at which it occurred. Including all three is not programmatically enforced, but it is considered a best practice and strongly recommended.


{
	"type": "record",
	"namespace": "my-org.some-data",
	"name": "Customer",
	"hydra.key": "id",
	"fields": [{
			"name": "id",
			"type": "int"
		},
		{
			"name": "userName",
			"type": "string"
		},
		{
			"name": "preferences",
			"type": {
				"type": "array",
				"items": "string"
			}
		},
		{
			"name": "address",
			"type": {
				"type": "record",
				"name": "addressRecord",
				"fields": [{
						"name": "streetAddress",
						"type": "string"
					},
					{
						"name": "city",
						"type": "string"
					},
					{
						"name": "state",
						"type": "string"
					},
					{
						"name": "zipCode",
						"type": "int"
					}
				]
			}
		},
		{
			"name": "inGoodStanding",
			"type": ["null", "boolean"],
			"default": null
		},
		{
			"name": "eventName",
			"type": "string"
		},
		{
			"name": "eventDate",
			"type": {
				"type": "string",
				"logicalType": "iso-datetime"
			}
		}
	]
}


In this example, another field called inGoodStanding has been added. This field is ‘nullable’, accepting either a null or a boolean value, with a default of null. Because of the default value, this is a backwards compatible change to our previous schema: when this schema is used to read old records, where the field didn’t exist, the default value of null is filled in. The field is also optional because it is nullable, which means you can post to the ingestor with the new schema, omit inGoodStanding, and the field will be filled with the default value of null.

Unfortunately, schema evolution is mostly limited to two operations. The first is the addition of a new field with a default value. The second is the removal of a field by marking the existing field as nullable and providing a default value of null. Removing a field in this way isn’t a true removal but rather a logical removal: the field is no longer required to be provided, but it will still appear when reading values (with newer records showing null). This form of removal is useful when a field is no longer provided but there is a desire to still be able to read it in historical records. Both of these operations are considered backwards compatible changes.
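The addition case is shown above with inGoodStanding. For the removal case, if, say, the userName field were no longer going to be provided by producers, a logical removal might look like the following sketch: the field becomes nullable with a default of null, so new records can omit it while historical records remain readable:

{
	"name": "userName",
	"type": ["null", "string"],
	"default": null
}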

Due to limitations in downstream Avro tooling, Avro features such as aliases for renaming fields can’t be used. If the shape of the data must be changed substantially (required field renames, many removals, etc.), the intent is for the producer of the data to provide an entirely new versioned data stream with a new, non-backwards-compatible schema.
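Purely as a hypothetical sketch (the actual naming and versioning convention for new streams is up to the producer and the DVS manager guidance), such a breaking change might be published as a separate stream under a versioned namespace rather than as an evolution of the existing schema; the customerId field below is illustrative of a renamed key:

{
	"type": "record",
	"namespace": "my_org.some_data.v2",
	"name": "Customer",
	"hydra.key": "customerId",
	"fields": [{
			"name": "customerId",
			"type": "string"
		},
		{
			"name": "eventName",
			"type": "string"
		},
		{
			"name": "eventDate",
			"type": {
				"type": "string",
				"logicalType": "iso-datetime"
			}
		}
	]
}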


More On Avro Schemas

Please refer to https://avro.apache.org/docs/current/spec.html and https://docs.confluent.io/current/avro.html for more detail on how to create your schema.