bigdata - JSONiq
Querying denormalized data
imperative vs declarative
Imperative programming is a programming paradigm that expresses computation as a series of statements that change a program’s state. In imperative programming, the focus is on describing how a program operates step by step. Common imperative languages include C, C++, Java, and Python.
In contrast to imperative programming, declarative programming focuses on describing what the program should accomplish rather than how to achieve it. In a declarative host language, the emphasis is on specifying the desired outcome or properties, and the language itself takes care of the underlying implementation details. Such as SQL, HTML.
motivation
Denormalized data
it is characterized with two features: nestedness, and heterogeneity. In fact, denormalized datasets should not be seen as “broken tables pushed to their limits”, but rather as collections of trees.
Features of a query language
A query language for datasets has three main features: Declarative, Functional, and Set-based. First, it is declarative. This means that the users do not focus on how the query is computed, but on what it should return. Being functional means that the query language is made of composable expressions that nest with each other, like a Lego game. Finally, it is set-based, in the sense that the values taken and returned by expressions are not only single values (scalars), but are large sequences of items (in the case of SQL, an item is a row).
Query languages for denormalized data
For denormalized data though, sadly, the number of languages keeps increasing: the oldest ones being XQuery, JSONiq, but then now also JMESPath, SpahQL, JSON Query, PartiQL, UnQL, N1QL, ObjectPath, JSONPath, ArangoDB Query Language (AQL), SQL++, GraphQL, MRQL, Asterix Query Language (AQL), RQL.
JSONiq as a data calculator
it can perform arithmetics, but also comparison and logic. It is, however, more powerful than a common calculator and supports more complex constructs, for example variable binding.
It also supports all JSON values. Any copy-pasted JSON value literally returns itself.
And, unlike a calculator, it can access storage.
The JSONiq Data Model
Every expression of the JSONiq “data calculator” returns a sequence of items.An item can be either an object, an array, an atomic item, or a function item. A sequence can also be empty. Caution, the empty sequence is not the same logically as a null item.
The general idea of navigation is that it is possible to “dive” into the nested data with dots and square brackets (originally, these were slashes with XPath) – all in parallel: starting with an original collection of objects (or, possibly, a single document), each step (i.e., for each dot and each pair of square brackets) goes down the nested structure and returns another sequence of nested items.
Object lookups (dot syntax)
json-doc(“file.json”).o
Array unboxing (empty square bracket syntax)
We can unbox the array, meaning, extract its members as a sequence of seven object items, with empty square brackets.json-doc(“file.json”).o[].
The dot syntax, in fact, works on sequences, too and will extract the value associated with a key in every object of the sequence (anything else than an object is ignored and thrown away).
Filtering with predicates (simple square bracket syntax)
It is possible to filter any sequence with a predicate, where .c = 3].It is also possible to access the item at position n in a sequence with this same notation: json-doc(“file.json”).o[].a.b[][5].
Array lookup (double square bracket syntax)
To access the n-th member of an array, you can use double-squarebrackets, like so:
json-doc(“file.json”).o[[2]].a.
A common pitfall: Array lookup vs. Sequence predicates
Do not confuse sequence positions (single square brackets) with array positions (double square brackets)!
([1, 2], [3, 4])[2] -> [ 3, 4 ]
([1, 2], [3, 4])[[2]] -> 2 4
Schema discovery
Collections
many datasets are in fact found in the form of large collections of smaller objects (as in document stores). Such collections are access with a function call together with a name or (if reading from a data lake) a path.
Getting all top-level keys
The keys function retrieves all keys. It can be called on the entire sequence of objects and will return all unique keys found (at the top level) in that collection.
Getting unique values associated with a key
With distinct-values, it is then possible to eliminate duplicates and look at unique values:
distinct-values(collection( “https://www.rumbledb.org/samples/git-archive.jsonl“ ).type)
Aggregations
Aggregations can be made on entire sequences with a single function call:The five basic functions are count, sum, avg, min, max.
count(distinct-values(collection( “https://www.rumbledb.org/samples/git-archive.jsonl“ ).type))
Construction
Construction of atomic values
Atomic values that are core to JSON can be constructed with exactly the same syntax as JSON.
Construction of objects and arrays
In fact, one can copy-paste any JSON value, and it will always be recognized as a valid JSONiq query returning that value.
Construction of sequences
Sequences can be constructed (and concatenated) using commas.Increasing sequences of integers can also be built with the to keyword.
Scalar expressions
Sequences of items can have any number of items. A few JSONiq expression (arithmetic, logic, value comparison…) work on the particular case that a sequence has zero or one items.
Arithmetic
JSONiq supports basic arithmetic: addition (+), subtraction (-), division (div), integer division (idiv) and modulo (mod).If the data types are different, then conversions are made automatically. The empty sequence enjoys special treatment: if one of the sides (or both) is the empty sequence, then the arithmetic expression returns an empty sequence (no error). If one of the two sides is null (and the other side is not the empty sequence), then the arithmetic expression returns null. If one of the sides (or both) is not a number, null, or the empty sequence, then a type error is thrown.
String manipulation
String concatenation is done with a double vertical bar: “foo” || “bar”.
Most other string manipulation primitives are available from the rich JSONiq builtin function library:
concat(“foo”, “bar”),string-join((“foo”, “bar”, “foobar”), “-“),substr(“foobar”, 4, 3),
string-length(“foobar”).
Value comparison
Sequences of one atomic item can be compared with eq (equal), ne (not equal), le (lower or equal), gt (greater or equal), lt (lower than) and gt (greater than).
Logic
JSONiq supports the three basic logic expressions and, or, and not. not has the highest precedence, then and, then or.
JSONiq also supports universal and existential quantification:
every $i in 1 to 10 satisfies $i gt 0, some $i in 1 to 10 satisfies $i gt 5.
If one of the two sides is not a sequence of a single Boolean item, then implicit conversions are made. This mechanism is called the Effective Boolean Value (EBV). For example, an empty sequence, or a sequence of one empty string, or a sequence of one zero integer, is considered false. A sequence of one non-empty string, or a sequence or one non-zero integer, or a sequence starting with one object (or array) is considered true.
General comparison
JSONiq has a shortcut for existential quantification on value comparisons. This is called general comparison.
some $i in (1, 2, 3, 4, 5) satisfies $i eq 1 ==
(1, 2, 3, 4, 5) = 1.
Composability
Data flow
A few expressions give some control over the data flow by picking the output or this or that expression based on the value of another expression. This includes conditional expressions. This includes conditional expressions. This also includes try-catch expressions.
Binding variables with cascades of let clauses
Variables in JSONiq start with a dollar sign. It is important to understand that this is not a variable assignment that would change the value of a variable. This is only a declarative binding.
FLWOR expressions
One of the most important and powerful features of JSONiq is the FLWOR expression. It corresponds to SELECT-FROM-WHERE queries in SQL, however, it is considerably more expressive and generic than them in several aspects. In JSONiq the clauses can appear in any order with the exception of the first and last clause. JSONiq supports a let clause, which does not exist in SQL.
In SQL, when iterating over multiple tables in the FROM clause, they “do not see each other”. In JSONiq, for clauses (which correspond to FROM clauses in SQL), do see each other, meaning that it is possible to iterate in higher and higher levels of nesting by referring to a previous for variable.
For clauses
It can thus be seen that the for clause is akin to the FROM clause in SQL, and the return is akin to the SELECT clause. Projection in JSONiq can be made with a project() function call, with the keys to keep. It is possible to implement a join with a sequence of two for clauses and a predicate. Note that allowing empty can be used to perform a left outer join, i.e., to account for the case when there are no matching records in the second collection.
Let clauses
A let clause outputs exactly one outgoing tuple for each incoming tuple (think of a map transformation in Spark). Let clauses also allow for joining the two datasets.
Where clauses
Where clauses are used to filter variable bindings (tuples) based on a predicate on these variables. They are the equivalent to a WHERE clause in SQL.
Order by clauses
Order by clauses are used to reorganize the order of the tuples, but without altering them. They are the same as ORDER BY clauses in SQL.It is also possible, like in SQL, to specify an ascending or a descending order. In case of ties between tuples, the order is arbitrary. But it is possible to sort on another variable in case there is a tie with the first one (compound sorting keys). It is possible to control what to do with empty sequences: they can be considered smallest or greatest.
Group by clauses
Group by clauses organize tuples in groups based on matching keys, and then output only one tuple for each group, aggregating other variables (count, sum, max, min…). It is also possible to opt out of aggregating other (non-grouping-key) variables.
Types
Variable types
Since every value in JSONiq is a sequence of item, a sequence type consists of two parts: an item type, and a cardinality.
Item types can be any of the builtin atomic types. as well as “object”, “array” and the most generic item type, “item”.
Cardinality can be one of the following four:Any number of items (suffix ); for example object, One or more items (suffix +); for example array+,Zero or one item (suffix ?); for example integer,Exactly one item (no suffix); for example boolean?
Type expressions
An “instance of” expression checks whether a sequences matches a sequence type, and returns true or false. A “cast as” expression casts single items to an expected item type.
A “castable as” expression tests whether a cast would succeed (in which case it returns true) or not (false).
A “treat as” expression checks whether its input sequence matches an expected type (like a type on a variable); if it does, the input sequence is returned unchanged. If not, an error is raised.
Types in user-defined functions
JSONiq supports user-defined functions. Parameter types can be optionally specified, and a return type can also be optionally specified.
Validating against a schema
It is possible to declare a schema, associating it with a user-defined type, and to validate a sequence of items against this user-defined type.
Architecture of a query engine
Static phase
When a query is received by an engine, it is text that needs to be parsed. The output of this is a tree structure called an Abstract Syntax Tree. An Abstract Syntax Tree, even though it already has the structure of a tree, is tightly tied to the original syntax. Thus, it needs to be converted into a more abstract Intermediate Representation called an expression tree. Every node in this tree corresponds to either an expression or a clause in the JSONiq language, making the design modular. At this point, static typing takes place, meaning that the engine infers the static type of each expression, that is, the most specific type possible expected at runtime (but without actually running the program). Engines like RumbleDB perform their optimization round on this Intermediate Representation. Once optimizations have been done, RumbleDB decides the mode with which each expression and clause will be evaluated (locally, sequentially, in parallel, in DataFrames, etc). The resulting expression tree is then converted to a runtime iterator tree; this is the query plan that will actually be evaluated by the engine.
Dynamic phase
During the dynamic phase, the root of the tree is asked to produce a sequence of items, which is to be the final output of the query as a whole. Then, recursively, each node in the tree will ask its children to produce sequences of items (or tuple streams). Each node then combines the sequences of items (or tuple streams) it receives from its children in order to produce its own sequence of items according to its semantics, and pass it to its parent.
Materialization
When a sequence of items is materialized, it means that an actual List (or Array, or Vector), native to the language of implementation (in this case Java) is stored in local memory, filled with the items.
Streaming
When a sequence of items (or tuple stream) is produced and consumed in a streaming fashion, it means that the items (or tuples) are produced and consumed one by one, iteratively. But the whole sequence of items (or tuple stream) is never stored anywhere. The classical pattern for doing so is known as the Volcano iterator architecture.
However, there are two problems with this: first, it can take a lot of time to go through the entire sequence (imagine doing so with billions or trillions of items). Second, there are expressions or clauses that are not compatible with streaming (consider, for example, the group by or order by clause, which cannot be implemented without materializing their full input).
Parallel execution (with RDDs)
When a sequence becomes unreasonably large, RumbleDB switches to a parallel execution, leveraging Spark capabilities: the sequences of items are passed and processed as RDDs of Item objects.
Parallel execution (with DataFrames)
The RDD implementation supports heterogeneous sequences by leveraging the polymorphism of Item objects. However, this is not efficient in the case that Items in the same sequence happen to have a regular structure. Thus, if the Items in a sequence are valid against a specific schema, or even against an array type or an atomic type, the underlying physical storage in memory relies on Spark DataFrames instead of RDDs.
To summarize, homogeneous sequences of the most common types are stored in DataFrames, and RDDs are used in all other cases.
Parallel execution (with Native SQL)
In some cases (more in every release), RumbleDB is able to evaluate the query using only Spark SQL, compiling JSONiq to SQL directly instead of packing Java runtime iterators in UDFs. This leads to faster execution, because UDFs are slower than a native execution in SQL. This is because, to a SQL optimizer, UDFs are opaque and prevent automatic optimizations.
bigdata - JSONiq
install_url
to use ShareThis. Please set it in _config.yml
.