deferred features

The following features are on the roadmap, but are not yet close enough to worry about exactly where on it they sit.

This isn't really a TODO list of the next things to do; it is a place to think about "large" features and how they fit logically into the picture.

PARTITIONING, SHARDING AND FAN-OUT

Being able to specify the exact fan-out of the schema in the store may be useful; it can be used as a part of a sharding implementation as well.

That is, say that the primary key of a row is:

(urn:ietf:gitdb:schema, 0.1, 4, 2)

This corresponds to the filename:

urn:ietf:gitdb:schema,0.1,4,2

If the fan-out is specified as the first two keys, the expected filename in the store is then:

urn:ietf:gitdb:schema,0.1/4,2
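
As a rough illustration of the mapping above, here is a minimal Python sketch. The names store_path and fanout are made up for this example, and it assumes key values never contain ',' or '/' (no escaping is attempted).

```python
def store_path(primary_key, fanout=0):
    """Map a primary key tuple to a store filename.

    The first `fanout` key parts become a directory; the rest form the
    leaf name. Hypothetical helper, not part of any existing API.
    """
    parts = [str(p) for p in primary_key]
    if fanout <= 0 or fanout >= len(parts):
        # No fan-out: every key part lives in a single flat filename.
        return ",".join(parts)
    return ",".join(parts[:fanout]) + "/" + ",".join(parts[fanout:])


key = ("urn:ietf:gitdb:schema", "0.1", 4, 2)
print(store_path(key))            # urn:ietf:gitdb:schema,0.1,4,2
print(store_path(key, fanout=2))  # urn:ietf:gitdb:schema,0.1/4,2
```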

This can be useful for partitioning/sharding, where you might have your primary key as some function of the lookup ID, then split by the first 3 nybbles (12 bits) to achieve 4096 shards:

badabadabadabadabaa001

becomes:

bad/abadabadabadabaa001
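
The split can be sketched in a few lines of Python. The helper names shard_path and shard_count are hypothetical, and the key is assumed to already be a hex string produced by whatever function of the lookup ID is chosen. Note that three hex nybbles are 12 bits, so this prefix length yields 16**3 = 4096 possible directories.

```python
def shard_path(key, nybbles=3):
    """Split a hex key into 'prefix/rest', using `nybbles` leading hex digits."""
    if len(key) <= nybbles:
        raise ValueError("key is too short to split")
    return key[:nybbles] + "/" + key[nybbles:]


def shard_count(nybbles):
    """Number of distinct prefixes: each nybble is one hex digit (4 bits)."""
    return 16 ** nybbles


print(shard_path("badabadabadabadabaa001"))  # bad/abadabadabadabaa001
print(shard_count(3))                        # 4096
```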

As each directory has a single checksum which represents the entirety of its contents, nodes need not hold a copy of the other shards; just the current checksum of the contents.
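
To make the idea concrete, the following Python sketch shows a node that holds only one shard in full and still arrives at the same top-level checksum. It is an assumption-laden toy: the hashing scheme (SHA-1 over sorted "name checksum" lines) is invented for the example and is not git's actual tree format, and blob_sum and dir_sum are made-up names.

```python
import hashlib


def blob_sum(data):
    """Checksum of one row's stored bytes."""
    return hashlib.sha1(data).hexdigest()


def dir_sum(entries):
    """Checksum of a directory, given a mapping of entry name -> checksum."""
    h = hashlib.sha1()
    for name in sorted(entries):
        h.update(f"{name} {entries[name]}\n".encode())
    return h.hexdigest()


shards = {
    "bad": {"abadabadabadabaa001": b"row one"},
    "cab": {"ba9e000000000000002": b"row two"},
}

# A node holding every shard computes each directory checksum from its rows.
full_view = {
    shard: dir_sum({name: blob_sum(data) for name, data in rows.items()})
    for shard, rows in shards.items()
}
root_full = dir_sum(full_view)

# A node holding only "bad" keeps just the checksum for "cab", not its rows.
partial_view = {
    "bad": dir_sum({n: blob_sum(d) for n, d in shards["bad"].items()}),
    "cab": full_view["cab"],
}
root_partial = dir_sum(partial_view)

print(root_full == root_partial)  # True
```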

Partitioning could be represented either by abstracting the object database layer, or by a more explicit means such as submodules.

FUNCTIONS

This is probably the next big thing to describe. A function is described by a few things:

  • Its name, which can reasonably be expected to be unique within a schema.
  • The number and type of input arguments and return values.
  • The language which the function is defined in.
  • The actual definition of the function, in said language.
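
One possible shape for such a description, sketched in Python. The class and field names (FunctionDef, arg_types, return_types, Schema.add_function) are invented for illustration and do not come from any existing schema; the function body is stored as opaque text and never evaluated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionDef:
    name: str            # unique within a schema
    arg_types: tuple     # number and type of input arguments
    return_types: tuple  # number and type of return values
    language: str        # language the definition is written in
    body: str            # the definition itself, in that language


class Schema:
    def __init__(self):
        self.functions = {}

    def add_function(self, fn):
        # Enforce the "name is unique within a schema" assumption.
        if fn.name in self.functions:
            raise ValueError(f"function {fn.name!r} is already defined")
        self.functions[fn.name] = fn


schema = Schema()
schema.add_function(
    FunctionDef("lower", ("text",), ("text",), "python", "lambda s: s.lower()")
)
```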

This is a bit of a can of worms, so a prototype which avoids it is probably better.

CHECK CONSTRAINTS

Check constraints are a basic part of data modelling and allow sanity checks to be applied to the data as it goes in. They do, however, require a mechanism for expressing functions, so until that is solved, check constraints will be out of scope.

TRIGGERS

Triggers are a way of making the database act on certain events, such as inserting, deleting or updating a row. They can be used to enforce very domain-specific rules.

INDEX PREDICATES

Sometimes you don't want an index to apply to all rows. All indexes will carry an implied predicate: they only apply to rows where the columns they index are not null. This in itself is useful, but being able to index only a select portion of rows is also very handy.
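
A small Python sketch of the idea, with rows modelled as plain dicts. The function build_index and its predicate parameter are hypothetical; the not-null check plays the role of the implied predicate, and the optional callable is the explicit one.

```python
def build_index(rows, columns, predicate=None):
    """Index rows by the given columns, skipping rows the index does not cover."""
    index = {}
    for row in rows:
        key = tuple(row.get(c) for c in columns)
        if any(value is None for value in key):
            continue  # implied predicate: indexed columns must not be null
        if predicate is not None and not predicate(row):
            continue  # explicit predicate: the row is outside the index
        index.setdefault(key, []).append(row)
    return index


rows = [
    {"id": 1, "email": "a@example.com", "active": True},
    {"id": 2, "email": None, "active": True},
    {"id": 3, "email": "c@example.com", "active": False},
]

by_email = build_index(rows, ["email"])
active_only = build_index(rows, ["email"], predicate=lambda r: r["active"])

print(sorted(by_email))   # [('a@example.com',), ('c@example.com',)]
print(list(active_only))  # [('a@example.com',)]
```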

FUNCTIONAL INDEXES

Normally values are inserted into indexes as is, but a functional index allows some transformation of the columns to produce the value which is considered unique. This is useful for things like case-insensitive (but case-preserving) constraints.
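
For instance, a case-insensitive but case-preserving unique constraint could look roughly like this in Python. The class UniqueFunctionalIndex is made up for the example, and str.casefold stands in for whatever transformation function the schema would actually name.

```python
class UniqueFunctionalIndex:
    """Unique index keyed on transform(row[column]) rather than the raw value."""

    def __init__(self, column, transform):
        self.column = column
        self.transform = transform
        self.entries = {}

    def insert(self, row):
        key = self.transform(row[self.column])
        if key in self.entries:
            raise ValueError(f"duplicate value for {self.column}: {row[self.column]!r}")
        # The row itself is stored untouched, so the original case is preserved.
        self.entries[key] = row


index = UniqueFunctionalIndex("name", str.casefold)
index.insert({"name": "Alice"})

try:
    index.insert({"name": "ALICE"})
    rejected = False
except ValueError:
    rejected = True

print(rejected)                        # True
print(index.entries["alice"]["name"])  # Alice
```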

ARRAY TYPES

Array types allow for more compact class definitions; for example, in the above definitions they would allow some slave tables to be removed entirely (<tt>meta.key:attr</tt> and <tt>meta.class:super</tt>).

SEQUENCES

Sequences are not really useful until there is a way to use them, as with functions. They are also not a very good fit for the distributed approach in many of the distributed computing profiles; GUIDs (or just random strings) are often a better idea.

That being said, if sequences are used as default values and constraints, then the work to merge when two writers use the same sequence number is defined and limited to changing the values in the new rows which were written or the linked rows which were updated.

This is likely to be an acceptable penalty for all but the busiest OLTP systems.
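
The merge work described above can be sketched as follows. This is only an illustration under simplifying assumptions: rows are dicts with an integer "id" drawn from one sequence, linked rows carry a "parent_id", and the function name merge_new_rows and its arguments are invented for the example.

```python
def merge_new_rows(base_next, ours_new, theirs_new, theirs_refs):
    """Merge rows created by two writers who allocated ids from the same sequence.

    base_next   -- next sequence value at the common ancestor
    ours_new    -- rows we created (kept as is)
    theirs_new  -- rows the other writer created (renumbered on collision)
    theirs_refs -- rows the other writer wrote that link to ids via 'parent_id'
    """
    taken = {row["id"] for row in ours_new}
    seen = taken | {row["id"] for row in theirs_new} | {base_next - 1}
    next_value = max(seen) + 1

    remap = {}
    merged = list(ours_new)
    for row in theirs_new:
        new_id = row["id"]
        if new_id in taken:
            # Both writers used this number: give the incoming row a fresh one.
            new_id = next_value
            next_value += 1
            remap[row["id"]] = new_id
        taken.add(new_id)
        merged.append({**row, "id": new_id})

    # Only links written by the other side can point at a renumbered row.
    fixed_refs = [
        {**ref, "parent_id": remap.get(ref["parent_id"], ref["parent_id"])}
        for ref in theirs_refs
    ]
    return merged, fixed_refs, next_value


merged, refs, next_value = merge_new_rows(
    base_next=4,
    ours_new=[{"id": 4, "name": "a"}],
    theirs_new=[{"id": 4, "name": "b"}, {"id": 5, "name": "c"}],
    theirs_refs=[{"parent_id": 4}, {"parent_id": 3}],
)
print([row["id"] for row in merged])  # [4, 6, 5]
print(refs)                           # [{'parent_id': 6}, {'parent_id': 3}]
```

Only the colliding new row and the link that pointed at it are rewritten; rows that existed before the common ancestor are never touched.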

VIEWS

Describing a view requires a good definition of query syntax, joins, expressions, etc. After the abstract query syntax tree is completed, this will be revisited. Materialized views should also come "for free" in this design, without having to write triggers manually.

Skipping storage

There are MetaFormat ways to specify this, but regular git filesystem mechanisms for skipping storage, such as <tt>.gitignore</tt> files, should also be honoured. If such files are discovered and not understood, then the program should probably refuse to run.

FULL PREDICATE TRANSACTION ISOLATION

In this mode, temporary views are created for all queries issued in the transaction. These take up little space, just a few rows in the meta tables for each query; the data returned is not duplicated. Locks are then recorded on the view rows. When merging in this isolation level, all views computed in other transactions must be recomputed on the new data, to show that they still return the same result.
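
A toy Python model of that merge check is sketched below. It is deliberately simplified: queries are plain predicate functions, rows are tuples in a set, and the recorded result is kept in memory for brevity (the design above avoids duplicating the returned data). Transaction, query and can_merge are made-up names.

```python
class Transaction:
    def __init__(self, snapshot):
        self.snapshot = set(snapshot)
        self.views = []  # one (predicate, result) pair per query issued

    def query(self, predicate):
        result = {row for row in self.snapshot if predicate(row)}
        self.views.append((predicate, result))
        return result


def can_merge(txn, merged_rows):
    """True if every query the transaction issued gives the same rows on the new data."""
    return all(
        {row for row in merged_rows if predicate(row)} == result
        for predicate, result in txn.views
    )


def is_open(row):
    return row[1] == "open"


base = {(1, "open"), (2, "closed")}
txn = Transaction(base)
txn.query(is_open)

print(can_merge(txn, base | {(3, "closed")}))  # True: the query result is unchanged
print(can_merge(txn, base | {(3, "open")}))    # False: a new row matches the predicate
```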

This feature is, shall we say, a long way down the roadmap :-).