The Raijin Database engine is an advanced SQL-like database that tries to fill the gap between SQL and Document databases while also providing a solution to today's data challenges.

Vast amounts of data are generated by machines and humans that both old school relational SQL databases and the new breed of NoSQL systems struggle with such as the following.

  • Time series event data, metrics, sensor data, logs and other machine generated data,

  • Clickstream data,

  • E-Commerce data,

  • Stock ticks and financial instrument data,

  • Call Detail Records,

  • Other semi-structured data.

Most SQL systems were designed for transactional data processing. Many of these OLTP systems work with data one tuple at a time and are unsuitable for processing large amounts of data especially when analytical queries are involved. Analytic RDBMS products - mostly those that have a columnar architecture - generally do a much better job but are still plagued by the limitations of SQL when it comes to processing this kind of data. Rigid database schemas get in the way of application development and ongoing schema maintenance becomes a burden. Due to schema rigidity, data needs to be denormalized and stored in different tables. This introduces another complexity cost while performance also suffers.

To come around these limitations IT engineers frequently combine document databases or other NoSQL solutions with RDBMS products in order to store and process denormalized data. Unfortunately most of these document database systems suffer from the same issue that OLTP systems have when it comes to analytical type workloads. Some don't support any analytical queries at all and the sole purpose of using a NoSQL solution is to be able to easily store semi-structured data.

The goal of the Raijin Database engine is to provide a solution to this issue. Backed by our decade of experience working with database systems in this area, Raijin also implements concepts from modern database research in addition to our own novel ideas.


Support for Semi-structured Data

One of the primary goals of Raijin is to natively support semi-structured data to provide the flexibility that made document databases popular.

A big limitation of traditional SQL systems is schema rigidity.  The Raijin database does not require a schema to be defined up-front allowing you to cope with data variety as records may contain different fields. The database does not store NULL values in order to efficiently deal with sparse data.

If you want schema enforcement, explicitly listing the column names and types in the CREATE TABLE statement will also work in order to remain compatible and ensure that your existing SQL code would still work. The difference from traditional SQL systems is that Raijin will accept records with field names that were not defined beforehand and it will allow using this data as if it were a regular column.

This allows denormalized data to be painlessly stored and processed.

Using SQL

SQL is a great declarative query language that all data analysts are familiar with.

The Raijin Database supports SQL as its primary query language while lifting some of SQL's limitations. It tries to adhere to the PostgreSQL syntax and also supports some other dialects that we considered useful.

In addition to lifting the requirement of having to define a schema, Raijin will also allow operations on some types that are forbidden in standard SQL. For example when an operation is attempted on a string (i.e. varchar) type, a conversion is done to deduce the required type and proceed with the operation instead of giving an error. The side-effect of being schemaless allows the same field (i.e. column) to contain different types and this automatic type conversion behavior aims to help with the veracity of the data. This makes Raijin's SQL dialect similar to a weakly typed language.

Exploiting Modern CPU Features

To be able to process large amounts of data, Raijin uses cache-aware algorithms and data structures to exploit the capabilities of modern CPUs. Instead of processing data one tuple at a time, it operates on data blocks. Using vectorized execution backed by optimized SIMD instructions, Raijin ensures that your CPUs are not wasting cycles. The code contains optimizations for the following two SIMD instruction set flavors:

  • SSE2-  supported by most x86 processors.
  • AVX2 - available in recent Intel and AMD CPUs.

Raijin is written in C++ using C++11 features and is compiled into a native binary instead of using an interpreted language that is inefficient at processing large amounts of data.

Built for Analytics

Most NoSQL solutions are inefficient at or totally lack support for analytical queries. Raijin DB supports group by and aggregations using standard SQL syntax.

Vectorized execution and cache-friendly algorithms allow large amounts of data to be operated on. Backed by optimized SIMD instructions and a modern compressed hybrid columnar storage layer Raijin strives to bring the most out of your hardware. This ensures that analytical queries touching millions of records will complete in seconds not hours.

Data Compression

Raijin can store data in a compressed format. Data compression not only saves disk space but provides a performance boost with modern CPUs.  JSON is only an external transport format which can be used to import and export data and is not used internally. Instead, data is stored in a more space efficient binary format which can be compressed using block level data compression.

Raijin also employs data specific compression algorithms such as dictionary, run-length and delta encoding.

Compression can be enabled on a per-table basis.

 

 

Efficient Data Storage

Raijin uses a hybrid columnar data storage. The columnar format combined with vectorized execution greatly increases the data throughput demanded by analytical workloads.

Many traditional SQL solutions use B-Tree based indexing algorithms. While this data structure can work great for OLTP type queries, index maintenance and tuple access greatly inhibits performance when a large amount of data needs to be accessed from secondary storage. This often leads to gradual performance degradation for both read and write operations as the database size grows. Instead of using B-Trees, the Raijin Database engine employs a variant of LSM Trees that performs well for both read and write (append) operations.

The storage layer in Raijin can use partitioning in order to increase data locality. Besides boosting query performance, detaching data partitions can help to quickly and efficiently move data out of the database. In case of time-series data, data partitioning can be used to discard historical data or move data into an archive without the need to run expensive dump operations.

JSON Data Representation

The JSON data format has quickly become the defacto standard for data interchange in recent years.

In order to deal with sparse data, the Raijin Database uses a flat JSON representation for the data records. This is natively supported when loading and querying data unlike in other popular SQL solutions where this got introduced as a bolt-on afterthought. The JSON format is used both in query output as well as when loading data. Note that only flat JSON (i.e. a list of key-value pairs) is supported, unfortunately Raijin can't deal with nested data structures such as arrays and maps. For compatibility with existing SQL code, the traditional CSV format is also supported when loading data.

The support for the JSON format coupled with the REST API allows developers to get data in and out of the database easily, especially when interfacing with web-based applications.

Optimizer Techniques

Being able to crunch data as fast as possible - even if ultra-fast vectorized SIMD execution is used - will still waste precious CPU cycles if the data is uselessly processed.  Our query optimizer employs the following techniques in order to avoid processing data that does not need to in order to make queries even faster:

  • Block level operations and data skipping,
  • Use of block-level metadata to answer queries,
  • Other secret sauce and black magic.

Some of the above is still work in progress. As we enable these features, we expect the database to gain significant performance boost for some queries on large data sets.

The query optimizer in the Raijin Database engine is heavily integrated with the data storage and indexing layer, otherwise all this would be not possible as is the case with several well-known database solutions offering pluggable storage engines. 

REST API

The Raijin database provides an HTTP interface where queries and data can be submitted to. Queries using the standard SQL syntax are simply wrapped in the HTTP request and the database will return a result using JSON. Together with native JSON support this should enable rapid web-application development.

The same HTTP interface can be accessed from a web browser in order to submit SQL statements using a web-form so there is no need to install any client side tools.

The Road Ahead

The Raijin Database engine is nowhere near complete. There is still an extreme amount of work to be done on features currently missing that we intend to implement, including but not limited to the following:

  • Joins,
  • Subselects,
  • Window functions, common table expressions,
  • Update, upsert, delete,
  • Views,
  • A great deal of optimizer wizardry,
  • Full-text search,
  • Scale-out distributed cluster mode,
  • Authentication and access control,
  • Language bindings,

On the other hand there are features found in other SQL solutions that are not in-line with the design goals of Raijin and will probably not be implemented in the near future:

  • Transactions (commit/rollback),
  • Foreign key constraints,
  • Triggers, stored procedures,
  • User defined functions and types,
  • Cursors.