Logical vs Physical Data Types

The rationale behind the Vortex logical type system
Thu Oct 24 2024

Originally posted: https://blog.spiraldb.com/logical-vs-physical-data-types/

Here at Spiral we are working to build data warehousing for the AI era. Today, we’re kicking off our series of deep-dives into one important piece: our open-source file format called Vortex.

The TLDR for Vortex is:

An extensible, state-of-the-art columnar file format
A shared data layout on-disk, in-memory, and over-the-wire, which enables zero-copy and (almost) zero-allocation reads
Mechanisms to perform compute over compressed data
Cascading lightweight compression scheme based on BtrBlocks
Extensible set of compression codecs, shipping with state-of-the-art implementations of FastLanes (integer bit-packing), ALP (floating point), FSST (strings)
Configurable file layout (to row-group or not to row-group…?), allowing writers to tune for fast writes, fast reads, small files, few columns, many columns, over-sized columns, etc.
Carefully designed for forward compatibility. Vortex may well be the last file format you ever need 🙈
Early benchmarks suggest a roughly comparable compression ratio to Parquet (with zstd), with 1-2x write throughput, 2-3x faster scans, and 200x faster random access

The rest of this post will introduce the idea of logical and physical types, and explain why Vortex leans into a logical type system.

Prior Art

The most common setup for analytical data is to use Apache Parquet as a compressed atomic file format, to push-down some row filtering and column pruning to skip irrelevant data, and then to decompress the remainder into memory for further processing.

Historically, this in-memory format has been specific to the compute engine. Spark has its own representation, as does Presto, as does Numpy/Pandas. This meant that sharing data between these systems, for example invoking a Python UDF from Spark, incurred a huge conversion cost.

Apache Arrow was created in 2016 to solve this problem of interoperability by providing a common in-memory layout that can be shared, zero-copy, between languages. It succeeded wildly and in turn spawned a revitalized ecosystem of data tooling with relative high performance.

🐍 The Scientific Python stack demonstrated the enormous value of a shared in-memory format with Numpy arrays underpinning almost all Python data projects including Pandas, SciPy, Xarray, and more. It is perhaps no surprise that Wes McKinney, the creator of Pandas, helped create Apache Arrow to bring these benefits to the wider data ecosystem.

Arrow is a very good solution for communicating and sharing data after it has been scanned, but it is not particularly meant as a storage format. Uncompressed Arrow data can easily occupy 10x the number of bytes compared to the compressed representation of on-disk formats like Parquet.

Vortex is a file format and associated in-memory representation that natively supports compressed data. Compressed Vortex arrays can be loaded from disk, network, object stores, and even memory-mapped with very little overhead, allowing us to choose precisely when and how much data to decompress.

We believe this is the right interface between storage and compute. But the obvious question is... couldn't we just add this to Arrow?

Arrow has Physical Types

Arrow is built around a physical type system. That means each type of array has a strict specification for how the data is laid out in memory.

For example, Arrow represents a StringArray as a single buffer of contiguous string data, and an additional buffer of offsets signed 32-bit integers pointing to the start of each string. The array pa.array(['hello', 'world']) would be stored like this:

StringArray
   offsets: [0, 5, 10]
   data: ['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']

Due to the limit on the maximum value of a 32-bit offset, a single StringArray can only store at most ~2.147GB of string data. To work around this limitation, Arrow also has a LargeStringArray that uses 64-bit offsets (9.2 exabytes if you were wondering), but these are two different and incompatible types.

When returning Arrow data to a compute engine, such as DataFusion, it is typically passed via an ArrowArrayStream — “a streaming source of data chunks, each with the same schema.”

And here lies the fundamental problem with using physical type systems at the boundary of storage and compute: it forces all data in a column to be represented in the same way.

Parquet might compress strings using dictionary encoding for one row group, and run-length encoding for another row group. Arrow supports both DictionaryArray<String> and RunEndArray<String> , but we are forced to decompress into a common representation.

What’s so bad about that?

Logical vs Physical

Almost all modern compute kernels, for example DuckDB and Velox, have custom internal representations for data, allowing them to perform compute over dictionary encoded data and other partially decompressed arrays.

Notably, Arrow’s DataFusion compute engine does not do this, but soon will.

This unlocks some very powerful optimizations. Suppose we have a DictionaryArray with 1 million strings, but only 2 unique values foo and bar. A filter expression of value == "baz" can very quickly rule out any matches by performing two comparisons, rather than 1 million.

In order to read dictionary encoded data directly from a Parquet file, many of these compute engines have implemented their own Parquet readers; examples include Arrow, DuckDB, Velox, Impala, and cuDF (in total, I have found 10 distinct implementations of Parquet!)

Having such a diverse ecosystem is a sign of how successful Parquet has been, but unfortunately this imposes drag on how fast the format can evolve. If readers don’t support a new encoding or feature, then writers are more hesitant to enable it.

By separating logical and physical types at the storage layer (the file format itself), rather than the compute layer, Vortex is able to return data to compute engines in whatever format works best for them, all with minimal conversion overhead.

This helps to future-proof Vortex, allowing us to pick up and support new compression codecs or new compute engines, all while improving today’s performance.

Vortex has Logical Types

Now we know why Vortex opts for a logical type system, let’s see what’s supported. The vortex-dtype crate contains the following logical types:

Integers - signed and unsigned, 8, 16, 32 and 64 bits
Floats - 16, 32 and 64 bits
Bools
UTF-8 and Binary
Structs
Lists
Extension Types - an underlying logical type with optional metadata

There are a few types we’re still missing, for example decimals and unions, but it’s mostly complete.

... and Physical Encodings

In Vortex, the physical representation of an array is called an encoding. Encodings are fully extensible and encapsulate logic for interpreting the array’s memory into the described logical data type. This is a many-to-many relationship; for example, dictionary encoding can represent any of the logical data types by de-duplicating values.

Vortex Core includes a base set of encodings designed to mirror Arrow’s physical types, including all three forms of string encoding: String, LargeString, StringView (German Strings).

Other useful encodings in Vortex include:

Constant - Stores a single scalar value and a length, trivially allowing us to optimize away a lot of compute.
Chunked - Native support for assembling arrays from multiple chunks of the same logical type, without constraints on the physical encoding.
Sparse - Stores a scalar fill value, as well as an array of indices and associated ‘patch’ values. This is very useful when data is almost uniform. For example, if 99% of values pack into 3-bits, but a handful are much larger.
ByteBool - unlike Arrow (and notably the C++ std::vector<bool>), most languages store arrays of booleans using one byte each. This encoding allows us to support zero-copy to byte-per-bool encodings, such as with Numpy arrays.

And of course, a full set of state-of-the-art compressed encodings:

FastLanes BitPacking - packs integers into the minimal number of bits required to represent the actual values. Read more in our post: Life in the FastLanes.
FSST - a technique for string compression that supports fast random access and compressed comparisons. Read more in our post: Compressing strings with FSST
... and many others, including FastLanes Delta, Frame of Reference, Dictionary, RunEnd, Adaptive Lossless Floating Point (ALP), and ZigZag.

All of these encodings can also be composed hierarchically, allowing writers to tune for the best compression ratio, the highest performance, or anything in-between.

Early days

The Vortex project is still in its infancy, but we're excited to engage with the open source community to improve it. If you have feedback or questions, please feel free to file GitHub issues!

P.S. If you stuck around this long, why not send us your CV! We’re hiring for in-office roles in London 🇬🇧 and NYC 🇺🇸.