git db scalability

This standard is intended to scale from small to large deployments, while still giving the convenience of being able to 'check out' repositories and inspect their contents, and remaining fully auditable.

Embedded use case

In the embedded case, the table primitives of a git db implementation are used directly by the application. This allows for useful applications from early implementations, even before query systems are developed.

"embedded" this usually means "phone" these days, and mobile applications should benefit from the small data size requirements of git-db tables and efficient synchronization.

A basic implementation need only care about the encoding and tree format, and its meta. schema can be fixed. On connection, it just has to compare the meta. schema to the one it was built against, or perhaps even just the revision/URL stored in the meta.namespace table.
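
As a sketch of that connection-time check, using pygit2 (libgit2's Python binding); the meta.namespace/revision path and the expected value are illustrative assumptions, not the normative tree format:

    import pygit2

    # Schema revision this build of the application was written against
    # (value and location are assumptions for illustration).
    EXPECTED_REVISION = b"1"

    def open_store(path):
        repo = pygit2.Repository(path)
        tree = repo.revparse_single("HEAD").tree   # latest committed state
        # Hypothetical location of the stored revision inside the
        # meta.namespace table; the real layout comes from the tree format.
        entry = tree["meta.namespace/revision"]
        revision = repo[entry.id].data.strip()
        if revision != EXPECTED_REVISION:
            raise RuntimeError("store schema revision %r != expected %r"
                               % (revision, EXPECTED_REVISION))
        return repo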

SQLite use case

In this case, a C-ISAM-type library like SQLite is engineered to connect to the git db store using shared libraries. In this picture the application would be linked with a git db library and libgit2, and would access the git store directly. Unlike C-ISAM, it could support concurrent writers; also unlike C-ISAM, it would use an append-only store, which is far less prone to corruption.
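
A sketch of what a linked writer's direct access could look like, shown via pygit2 rather than C for brevity. It leans on libgit2 refusing to advance a branch when the new commit's first parent is not the current tip, which gives optimistic concurrency over the append-only store; the table layout and key names here are assumptions:

    import pygit2

    def insert_row(repo, key, encoded_row):
        sig = pygit2.Signature("writer", "writer@example.invalid")
        while True:
            head = repo.revparse_single("HEAD")
            builder = repo.TreeBuilder(head.tree)      # start from current state
            builder.insert(key, repo.create_blob(encoded_row),
                           pygit2.GIT_FILEMODE_BLOB)
            tree_id = builder.write()
            try:
                # libgit2 only advances HEAD if `head` is still the tip,
                # so concurrent writers serialize by retrying; nothing
                # already written is ever overwritten in place.
                return repo.create_commit("HEAD", sig, sig,
                                          "insert %s" % key,
                                          tree_id, [head.id])
            except pygit2.GitError:
                continue   # another writer committed first; rebuild and retry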

The command-line "sqlite3" utility could work unmodified, as it would read the application's table structure from the meta. application schema and thereby know what tables and properties exist in the store.

Distributed use case

In this case, multiple nodes are capable of performing actions on the data. There are many possible configurations, which are explored on the distribution page.

RDBMS use case

In this case, git db tables are engineered underneath an RDBMS as a storage layer. This could take the form of a MySQL storage engine, or even a port of Postgres to the format.

As there is an RDBMS engine present, it handles transactional integrity over the contents itself. The main benefit of this arrangement is replication for readers.
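
On the reader side, replication could be as simple as a periodic fetch, after which the RDBMS serves reads from the fetched tip. A minimal sketch, assuming a remote named origin and a master branch:

    import pygit2

    def refresh_replica(repo):
        # Pull new commits; the store is append-only, so a fetch never
        # rewrites anything the replica already holds.
        repo.remotes["origin"].fetch()
        # Serve subsequent reads from this immutable snapshot.
        tip = repo.lookup_reference("refs/remotes/origin/master").target
        return repo[tip].tree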

Clustered RDBMS use case

In this case, the RDBMS above is also taught how to replicate the store and to negotiate which writers' updates to the database state (i.e., commits) succeed. This allows some horizontal scaling of complex update transactions (assuming every node can hold a clone of the repository).
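
One way to picture that negotiation, sketched with pygit2: each node commits its transaction against the latest shared tip and tries to push; a rejected push means another writer won, so the transaction is re-applied on the new state. The origin/master names and the apply_fn hook are assumptions for illustration:

    import pygit2

    def commit_transaction(repo, apply_fn, message):
        sig = pygit2.Signature("node", "node@example.invalid")
        remote = repo.remotes["origin"]
        while True:
            remote.fetch()
            tip = repo.lookup_reference("refs/remotes/origin/master").target
            # Re-apply the transaction's changes on top of the latest state;
            # apply_fn is a hypothetical hook returning the new tree id.
            tree_id = apply_fn(repo, repo[tip].tree)
            commit_id = repo.create_commit(None, sig, sig, message,
                                           tree_id, [tip])
            repo.references.create("refs/heads/master", commit_id, force=True)
            try:
                remote.push(["refs/heads/master:refs/heads/master"])
                return commit_id   # our update to the database state won
            except pygit2.GitError:
                continue           # lost the race; retry on the new tip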

This is something of a cross between the RDBMS case (where git db gets no visibility of the extra transactional integrity information) and the distributed use case.

Sharded partition use case

In this mode, the store is broken up into submodules, with writes to each submodule managed by a separate, independent worker process or even compute node.

As the tree format specifies that trees can be split into arbitrarily nested levels, data sets with natural partition boundaries (e.g., users, sites, domains) can be sharded by ranges of primary keys: node A might hold keys A-C, node B keys D-G, and so on.
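
A sketch of the routing such a layout implies; the ranges and submodule names are just the example above written down:

    # Range-based shard map: the first character of the primary key picks
    # the submodule, and each submodule is owned by one worker/node.
    SHARDS = [
        ("A", "C", "keys-a-c"),   # node A's submodule
        ("D", "G", "keys-d-g"),   # node B's submodule
        ("H", "Z", "keys-h-z"),   # remaining ranges, and so on
    ]

    def submodule_for(primary_key):
        initial = primary_key[0].upper()
        for low, high, submodule in SHARDS:
            if low <= initial <= high:
                return submodule
        raise KeyError("no shard covers key %r" % primary_key)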

Whole Hadoop use case (Map/Reduce)

Map/Reduce is an approach where a problem is chunked into parts and farmed out to many nodes, each of which computes a partial result ("map"); the partial results are then condensed into a final answer ("reduce"). Hadoop is an open-source implementation of this model.

In this mode, the system is distributed at the object store layer; re-use of low-level components such as libgit2 is less likely, or happens at a lower level still. The simplest implementation would be equivalent in functionality to the "Embedded" case above.

git db has characteristics which suit this approach. As a commit ID summarizes a known and immutable database state, it already solves many of the problems encountered by mutable-state systems. Nodes can compute their results in terms of that commit ID, allowing what are effectively long-running transactions. Update jobs could potentially work incrementally at little extra cost.
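
A sketch of a map job pinned to a commit ID, using pygit2 again; the one-task-per-top-level-entry split and the counting map function are illustrative only:

    import pygit2

    def run_map(repo_path, commit_id, map_fn):
        # Every worker opens the store read-only and pins itself to the
        # same commit ID, so the whole job sees one immutable state;
        # effectively a long-running transaction with no locking.
        repo = pygit2.Repository(repo_path)
        tree = repo[commit_id].tree
        return [map_fn(entry.name, repo[entry.id]) for entry in tree]

    def reduce_parts(parts):
        return sum(parts)

    # e.g. count top-level entries at a pinned state:
    # repo = pygit2.Repository("/path/to/store")
    # state = repo.revparse_single("HEAD").id
    # total = reduce_parts(run_map("/path/to/store", state, lambda n, o: 1))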

For very large data sets, there is no reason to keep the history forever. For ETL systems, you would probably keep only the condensed summaries, the "extracts", in the actual clonable history, and not the raw data.