MetaFormat: specifying data structure

UML diagram showing relationship between meta class hierarchies

Figure 1. How the various meta types relate to each other, as UML. This is a simplified overview.

There is a special schema in the git db store which represents the schema of the tables itself; these are identified with the meta. schema name. Optionally there are rows in the schema which represent the schema of the meta. store itself.

One way to understand how this works is to see how it works out with a simple example schema.

Meta Tables

As the intrinsic storage building block, the meta. schema defines the tables that exist in the store.

The meta. schema is rich enough to express itself, as well as a number of useful features such as inheritance, nested tuples, and basic keys.

There are four groups of meta tables:

namespace

The objects in this table are top-level containers for all the schema objects relating to a particular version of the application. When querying, this specifies the name that can be used to specify the schema for objects.

types

The type system uses (strict) inheritance, with all types sharing a common (abstract) base type, meta.types.

As is described later, this also implies that they share a primary key. This differs from Postgres, where inherited types do not share keys. In the meta schema, they also share the same storage path, /meta/types.

Types can be value types (meta.value), tuples (meta.tuple) or storage tables (meta.table).

attributes

Tuples are comprised of an ordered set of attributes. Taken together, these attributes define the tuple type.

Each attribute has a type, which can be a basic type or a tuple, as described above.

keys

There are three types of keys: primary keys, foreign keys, and unique keys. Every key belongs to a table (called the source table), and all the attributes that it refers to are found in the source table. There are no indexes in this version of gitdb.

The Namespace table (`meta.namespace`)

The namespace table provides a local surrogate key for the schema, which is used as the first member of the primary key for all the other types in the meta schema.

The namespace has a URL and revision pair. The URL is a distinguishing feature of the schema. The revision number is increased as the schema is modified going forward. The meta schema itself can store multiple revisions of a single application's database schema, under different names.

Here is a table describing the structure. The first four columns of this table are equivalent to a query like:

select
   attr_index, attr_name, attr_type, attr_required
from meta.attr
where ns_name = 'meta'
  and type_name = 'namespace'

attr_index	attr_name	attr_type	attr_required	Description	Key(s)
0	`ns_name`	`string`	yes	local schema name	Primary
1	`ns_url`	`string`	yes	uniquely identifying URL (may be empty)	Unique (with ns_rev)
2	`ns_rev`	`num`	yes	schema iteration number	Unique (with ns_url)

If you are storing in a verbose encoding such as JSON, then the attribute indices are not important and property names are used instead. For example, the meta schema could be declared with this entry in /meta/namespace.json (or /meta/namespace/meta.json):

{ "ns_name": "meta",
  "ns_url": "http://github.com/samv/Git-DB",
  "ns_rev": 0.1
}

It could also be encoded in binary as:

00000000  0204 6d65 7461 021d 6874 7470 3a2f 2f67 ␂␄meta␂␝http://g
00000010  6974 6875 622e 636f 6d2f 7361 6d76 2f47 ithub.com/samv/G
00000020  6974 2d44 4203 7f01                     it-DB␃␡␁

This shows a curious situation, in that it is possible to include information in the meta tables about the meta schema itself. This is a well-known chicken-and-egg situation, found in type theory, metaprogramming, etc.

To keep things simple when connecting, all that is required is a single row which includes the meta schema URL and revision. If the implementation does not know what that means, it cannot process the rest of the schema metadata, and therefore should not continue. If the data is provided, then it should be compared against the known good data, and any discrepancies treated as a fatal error.
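
The connection check described above might look like the following minimal Python sketch. The names `KNOWN_META`, `SchemaMismatch` and `check_meta_namespace` are hypothetical and introduced only for illustration; the URL and revision are copied from the example entry earlier in this section.

```python
import json

# Values copied from the example namespace entry. A real implementation
# would hard-code whichever meta schema revision it understands.
KNOWN_META = {
    "ns_name": "meta",
    "ns_url": "http://github.com/samv/Git-DB",
    "ns_rev": 0.1,
}


class SchemaMismatch(Exception):
    """Raised when the stored meta namespace row is unknown or inconsistent."""


def check_meta_namespace(raw_json):
    """Compare a stored namespace row against the known-good meta schema."""
    row = json.loads(raw_json)
    for name in ("ns_url", "ns_rev"):
        if name not in row:
            raise SchemaMismatch("namespace row is missing " + name)
        if row[name] != KNOWN_META[name]:
            # Any discrepancy is fatal: the rest of the metadata cannot
            # be interpreted safely.
            raise SchemaMismatch("unrecognised meta schema: %s=%r" % (name, row[name]))
    return row
```

Raising an exception rather than returning a flag is a design choice of this sketch; it makes it hard for a caller to carry on reading a store whose meta schema it does not understand.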

Types (`meta.type`)

Types are abstract, in that you can't just have a type; it has to be a particular kind of type. When reading the row, the kind of type can be distinguished by which attributes it possesses.

The type tuple exists only to keep a registry of type names within a namespace. The primary key for types is their name and not a surrogate index, which makes type renaming more complex, but makes the schema overall nicer to work with. Numbered surrogates are used only for the attr table, as they are necessary for the binary column format.

Column	Type	Nullable	Key(s)
`ns_name`	string	no	Primary (with `type_name`) and Foreign to `namespace`
`type_name`	string	no	Primary (with `ns_name`)
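
Telling the kinds of type apart by their attributes could be sketched as follows. This is an illustration only: it assumes rows are read as dictionaries, that the mere presence of a column (even with a null value) identifies the kind, and that the distinguishing columns are the ones each kind adds in the sections that follow. The names `DISCRIMINATORS` and `kind_of` are hypothetical.

```python
# Column that each concrete kind adds, taken from the tables in this
# document. Order matters: a table row also carries tuple_super, so
# table_path has to be checked before it.
DISCRIMINATORS = [
    ("table_path", "table"),
    ("value_formats", "value"),
    ("enum_values", "enum"),
    ("field_bits", "field"),
    ("tuple_super", "tuple"),
]


def kind_of(row):
    """Return the kind of type that a row from the types store represents."""
    for column, kind in DISCRIMINATORS:
        if column in row:
            return kind
    # Only ns_name and type_name are present: a bare, abstract type.
    raise ValueError("abstract type row: %r" % (row,))
```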

Value Types (`meta.value`)

Value Types are a kind of type that is simple and concrete: they have a single set of allowable column formats, and a well-known set of functions for converting from the value to representations such as the tree format and the column format.

Column	Type	Nullable	Key(s)
`ns_name`	string	no	Primary (with `type_name`); foreign key inherited from `type`
`type_name`	string	no	Primary (with `ns_name`)
`value_formats`	`column_formats`	no	-
`value_dump_f`	string	yes	-
`value_load_f`	string	yes	-
`value_choose_f`	string	yes	-
`value_cmp_f`	string	yes	-
`value_print_f`	string	yes	-
`value_scan_f`	string	yes	-

Value Type definitions are relatively mundane, so the meanings of these columns are described elsewhere, in Value Types. You can include this in your schema to assist others in working with the new data types that get defined, should you feel the need.

Enum Types (`meta.enum`)

Enums are useful enough to include in a first cut.

Column	Type	Nullable	Key(s)
`ns_name`	string	no	Primary (with `type_name`); foreign key inherited from `type`
`type_name`	string	no	Primary (with `ns_name`)
`enum_values`	string[]	no	-

enum_values is a column with 1 dimension, that is, an array. The contents of this array relate to 'int' values.
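
A minimal sketch of one reading of that last sentence, assuming that the integer for each value is simply its position in `enum_values`. The document does not spell out the mapping, so treat this as an assumption; the function names are hypothetical.

```python
def enum_to_int(enum_values, name):
    """Return the assumed integer for a value name: its array position."""
    return enum_values.index(name)


def int_to_enum(enum_values, number):
    """Return the value name stored at the given position."""
    if not 0 <= number < len(enum_values):
        raise ValueError("no enum value numbered %d" % number)
    return enum_values[number]
```

For an invented enum with `enum_values` of `["red", "green", "blue"]`, `enum_to_int` maps `"green"` to 1 under this assumption.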

Field Types (`meta.field`)

Fields are like enums but stored bitwise, either in an int or in a block of bytes. They may be mapped to an array of values, or to a 'FOO|BAR' C-style rendering, depending on the IO functions specified.

Column	Type	Nullable	Key(s)
`ns_name`	string	no	Primary (with `type_name`); foreign key inherited from `type`
`type_name`	string	no	Primary (with `ns_name`)
`field_bits`	string[]	no	-

Fields are included mainly because there is one in the meta schema.
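
The two renderings mentioned for fields can be sketched like this, assuming that entry i of `field_bits` names bit i of an integer, counting from the least significant bit. That bit order and the function names are assumptions made for illustration, not something this document specifies.

```python
def field_to_names(field_bits, value):
    """List the names whose bit is set, lowest bit first."""
    return [name for bit, name in enumerate(field_bits) if (value >> bit) & 1]


def field_to_string(field_bits, value):
    """Render the set bits in the C-style 'FOO|BAR' form."""
    return "|".join(field_to_names(field_bits, value))


def names_to_field(field_bits, names):
    """Pack a collection of names back into an integer."""
    value = 0
    for name in names:
        value |= 1 << field_bits.index(name)
    return value
```

With `field_bits` of `["FOO", "BAR", "BAZ"]`, the integer 3 renders as `FOO|BAR` under these assumptions.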

Tuples (`meta.tuple`)

The tuple table defines what Postgres would call compound data types, and are usually called tables in the classic RDBMS, or perhaps classes or result sets in other contexts.

Similar to the situation in Postgres, they are not necessarily intended for use as real tables, and can be used as column types in the case of nested data types. This allows for deeply nested data in rows, while retaining the ability to use strict typing.

Column	Type	Nullable	Key(s)
`ns_name`	string	no	Primary (with `type_name`); foreign key inherited from `type`
`type_name`	string	no	Primary (with `ns_name`)
`tuple_super`	string[]	yes	Foreign, with `ns_name`, to `type` primary key

Tuples can inherit from other tuples, but they must add at least one required attribute in order to distinguish themselves in storage. This restriction allows row data to be merged between tables by simple concatenation and avoids storing schema information in the data. In other words, types within an inheritance hierarchy share attribute numbering ranges.
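
One way to picture the shared numbering, as a sketch only: assume a single super-tuple, and assume that a sub-tuple's attributes continue counting where the super-tuple's stop, so a sub-tuple row is a super-tuple row with extra columns appended. The helper `number_attributes` is hypothetical, and it checks only that something is added, not that the addition is required.

```python
def number_attributes(super_attrs, own_attrs):
    """Return {attr_index: attr_name} for a sub-tuple.

    super_attrs: attribute names inherited from the super-tuple, in order.
    own_attrs: attribute names the sub-tuple adds; at least one is needed
    so the sub-tuple can be told apart from its super-tuple in storage.
    """
    if not own_attrs:
        raise ValueError("a sub-tuple must add at least one attribute")
    return dict(enumerate(list(super_attrs) + list(own_attrs)))
```

Applied to the `type` columns `ns_name` and `type_name` plus the `enum_values` column of `meta.enum`, this numbers them 0, 1 and 2.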

Storage Tables (`meta.table`)

A tuple with storage is called a table, and this requires a primary key.

By default, its storage path is /ns_name/type_name - but this can be overridden by a non-null value in the table_path column. The only fixed path in the repository is therefore /meta - though you can call this path /_meta or /.meta if you prefer. On connection, they should be tried in that order, and the first one found used. If a value_name is used in the schema, but not listed in meta.types, then 'standard' definitions from the relevant version of the git-db spec may be used.

Column	Type	Nullable	Key(s)
`ns_name`	string	no	Primary (with `type_name`); foreign key inherited from `type`
`type_name`	string	no	Primary (with `ns_name`)
`tuple_super`	string[]	yes	(inherited from tuple)
`table_path`	string	no	none
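
The path rules in this section can be sketched as below. The sketch assumes that the store can report which top-level paths exist (the `existing_paths` argument is a hypothetical stand-in for that lookup) and that an unset `table_path` is represented as `None`.

```python
# Candidate locations for the meta schema, in the order they are tried.
META_ROOTS = ["/meta", "/_meta", "/.meta"]


def find_meta_root(existing_paths):
    """Return the first candidate meta path that exists in the store."""
    for root in META_ROOTS:
        if root in existing_paths:
            return root
    raise LookupError("no meta schema found in store")


def storage_path(ns_name, type_name, table_path=None):
    """Return where a table's rows live: the override, else /ns_name/type_name."""
    if table_path is not None:
        return table_path
    return "/%s/%s" % (ns_name, type_name)
```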

The Attribute table (`meta.attr`)

Tuples are ordered lists of attributes.

Column	Type	Nullable	Key(s)
`ns_name`	string	no	Primary (1/3)
`type_name`	string	no	Primary (2/3); Foreign with `ns_name` to `type` or `tuple`
`attr_index`	int	no	Primary (3/3)
`attr_name`	string	yes	Unique with `ns_name` and `type_name`
`attr_type`	string	yes	Foreign with `ns_name` to `type` primary key index
`attr_dim`	int	yes	none
`attr_required`	bool	no	none

ns_name, type_name

These parts of the primary key are foreign keys to the tuple table (or the type table, though attributes may really only be added to tuple types, as value types cannot have extra attributes encoded). They specify the meta.tuple that the entry describes.

attr_index

This specifies the column number in storage. When new columns are added, they are given a new attr_index number. When they are renamed, the number stays the same, and finally, when they are deleted the entry remains.

attr_name

This is the logical name that is ascribed to the property, and should be the important one as far as applications are concerned. As there is a unique constraint, attributes which are deleted should have attr_name not set.

attr_type

This specifies the type of the attribute. It is a reference to the meta.type hierarchy. Through inheritance, this may also refer to tuples in the meta.tuple table - even itself. Cross-namespace type references are not allowed.

If this field is not set then the attribute is untyped. The standard JSON-like transform rules will apply.

attr_dim

This specifies the number of dimensions of the attribute. If it is 0 or not specified, then the attribute is a regular attribute. If it is 1, then it is an array. If 2 or higher, a multi-dimensional array.

attr_required

The most basic constraint: whether the value can be NULL or not. If not set, then the value may be NULL. No primary key attributes may be NULL.
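
The life cycle of an attribute described above (add, rename, delete) can be sketched as follows. It assumes that the attributes of one tuple are kept in a list whose position equals `attr_index`, which holds only because entries are never removed. The helper names are hypothetical.

```python
def add_attr(attrs, name, attr_type=None, dim=0, required=False):
    """Append a column; it receives the next unused attr_index."""
    attrs.append({
        "attr_index": len(attrs),
        "attr_name": name,
        "attr_type": attr_type,   # None means the attribute is untyped
        "attr_dim": dim,          # 0 = regular attribute, 1 = array, 2+ = multi-dimensional
        "attr_required": required,
    })
    return attrs[-1]["attr_index"]


def rename_attr(attrs, index, new_name):
    """Change the logical name; the column number stays the same."""
    attrs[index]["attr_name"] = new_name


def delete_attr(attrs, index):
    """Keep the entry so its index is not reused, but unset the name
    so that it no longer takes part in the unique constraint."""
    attrs[index]["attr_name"] = None
```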

Constraints and Keys

There are three types of keys: primary keys, unique keys, and foreign keys. These are all specializations of the 'key' tuple:

The Key Tuple (`meta.key`)

Column	Type	Nullable	Key(s)
`ns_name`	string	no	Primary (1/3)
`type_name`	string	no	Primary (2/3); Foreign with `ns_name` to `type` or `tuple`
`key_name`	string	no	Primary (3/3); Unique with `ns_name`
`key_inherit`	bool	yes	This rule doesn't apply to sub-types.
`key_attr_index`	int[]	yes	Foreign with `ns_name` and `type_name` to `attr_ns_name_type_name_attr_index_unique`

Unique Keys (`meta.unique`)

Unique keys add the notion of constraint on the source table.

The unique key has its own table, which has no extra columns compared to the key tuple.
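
The constraint that a unique key places on its source table can be sketched as a check over rows. This assumes rows are sequences indexed by `attr_index`, and it treats a null (`None`) like any other value, which may differ from SQL-style NULL handling. The function name is hypothetical.

```python
def violates_unique(rows, key_attr_index):
    """Return True if two rows hold the same values in the key columns."""
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in key_attr_index)
        if key in seen:
            return True
        seen.add(key)
    return False
```

For the namespace table, the unique key over `ns_url` and `ns_rev` corresponds to a `key_attr_index` of `[1, 2]`.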

Primary Keys (`meta.primary`)

The only difference between the primary key and a unique key is that the primary key is used as the primary identity for rows in the table. It stores one extra fact, the location of the table in the store.

Column	Type	Nullable	Key(s)
`ns_name`, `type_name`, `key_name`, `key_inherit`, `key_attr_index`	various	various	Inherited from 'key' tuple
`primary_path`	string	no	no key, must exist

There are two fields where the path is recorded; this one records the location of the TreeFormat structure. Technically, the table class's table_path attribute records which TreeFormat structure the table saves its rows in.

Foreign Keys (`meta.foreign`)

Finally, there is the foreign key type. Foreign keys refer to another key. If they refer to a unique or primary key, a traditional foreign key relationship is established. Foreign keys which refer to each other imply the existence test, but place no constraint on uniqueness of either side.

Column	Type	Nullable	Key(s)
`ns_name`, `type_name`, `key_name`, `key_inherit`, `key_attr_index`	various	various	Inherited from 'key' tuple
`foreign_key_name`	string	no	Foreign to `key` with `ns_name`

Summary

The schema so far is capable of storing typed and untyped data, as well as achieving several of the various levels of normal form. An important test is that the schema completely describes itself, and naturally fits within itself.

UML diagram showing relationship between all meta classes

Figure 2. A UML diagram of all of the meta types and how they relate to each other.
