SQL(2) - 开窗函数

开窗函数(Window Functions)是SQL中一种强大的功能,用于在一组相关行(窗口)上执行聚合计算,同时保留每行的详细信息。与传统的聚合函数(如 SUM, AVG, COUNT 等)不同,开窗函数不会将结果聚合成单一的行,而是为每一行返回一个计算结果。

基本语法

开窗函数的基本语法如下:

1
2
3
4
5
function_name(expression) OVER (
[PARTITION BY partition_expression]
[ORDER BY sort_expression [ASC | DESC]]
[ROWS BETWEEN start AND end]
)

主要组成部分

  1. 函数名:常见的开窗函数包括 ROW_NUMBER(), RANK(), DENSE_RANK(), LEAD(), LAG(), SUM(), AVG(), MIN(), MAX() 等。
  2. OVER子句:定义窗口的范围。
    • PARTITION BY:将数据分成多个分区,每个分区独立计算。
    • ORDER BY:在每个分区内对数据进行排序。
    • ROWS BETWEEN:定义窗口的行范围。

常见的开窗函数

1. 行号函数

  • ROW_NUMBER():为每个分区内的行分配一个唯一的行号。

    1
    ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY column2)
  • RANK():为每个分区内的行分配一个排名,如果有相同的值,排名会跳过。

    1
    RANK() OVER (PARTITION BY column1 ORDER BY column2)
  • DENSE_RANK():为每个分区内的行分配一个排名,如果有相同的值,排名不会跳过。

    1
    DENSE_RANK() OVER (PARTITION BY column1 ORDER BY column2)

2. 前后行函数

  • LEAD():获取当前行之后的某一行的值。

    1
    LEAD(column, offset, default) OVER (PARTITION BY column1 ORDER BY column2)
  • LAG():获取当前行之前的某一行的值。

    1
    LAG(column, offset, default) OVER (PARTITION BY column1 ORDER BY column2)

3. 聚合函数

  • SUM():计算窗口内的总和。

    1
    SUM(column) OVER (PARTITION BY column1 ORDER BY column2)
  • AVG():计算窗口内的平均值。

    1
    AVG(column) OVER (PARTITION BY column1 ORDER BY column2)
  • MIN():计算窗口内的最小值。

    1
    MIN(column) OVER (PARTITION BY column1 ORDER BY column2)
  • MAX():计算窗口内的最大值。

    1
    MAX(column) OVER (PARTITION BY column1 ORDER BY column2)

示例

假设有一个销售数据表 sales,包含以下字段:id, product_id, sale_date, amount

1. 行号函数示例

1
2
3
SELECT id, product_id, sale_date, amount,
ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY sale_date) as row_num
FROM sales;

这个查询为每个产品的销售记录按日期排序,并分配一个行号。

2. 排名函数示例

1
2
3
SELECT id, product_id, sale_date, amount,
RANK() OVER (PARTITION BY product_id ORDER BY amount DESC) as rank
FROM sales;

这个查询为每个产品的销售记录按金额降序排序,并分配一个排名。

3. 前后行函数示例

1
2
3
4
SELECT id, product_id, sale_date, amount,
LAG(amount, 1, 0) OVER (PARTITION BY product_id ORDER BY sale_date) as prev_amount,
LEAD(amount, 1, 0) OVER (PARTITION BY product_id ORDER BY sale_date) as next_amount
FROM sales;

这个查询为每个产品的销售记录按日期排序,并获取前一行和后一行的金额。

4. 聚合函数示例

1
2
3
SELECT id, product_id, sale_date, amount,
SUM(amount) OVER (PARTITION BY product_id ORDER BY sale_date) as cumulative_sum
FROM sales;

这个查询为每个产品的销售记录按日期排序,并计算累计销售额。

窗口范围

  • ROWS BETWEEN:定义窗口的行范围。
    1
    SUM(amount) OVER (PARTITION BY product_id ORDER BY sale_date ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING)
    这个查询计算每个产品销售记录按日期排序时,当前行及其前后两行的总和。

bigdata - review notes

data cubes

slides

OLTP VS OLAP:

record keeping vs decision support
read-intensive vs write-intensive
detailed individual records vs summarized data
Lots of transactions on small portions of data vs Large portions
of the data.
fully interactive vs slow interactive.
Alt text
ETL: Extract Transform Load

slicing and dicing,Pivoting

Slicing involves selecting a specific “slice” or subset of the data cube by fixing one or more dimensions at a particular value. Dicing involves creating a subcube by selecting specific values for two or more dimensions. It’s like slicing, but you are selecting a rectangular subset of the cube, cutting through more than one dimension. Pivoting is another operation often used alongside slicing and dicing. It involves rotating the data cube to view it from a different perspective by swapping the positions of dimensions.

Aggregation and roll-up

These operations allow users to summarize and view data at different levels of granularity within a multidimensional dataset. Roll-up is a specific form of aggregation that involves moving from a lower level of detail to a higher level by collapsing one or more dimensions(moving up the hierarchy from a finer level of granularity (monthly) to a coarser level (quarterly)).

Two flavors of OLAP: ROLAP and MOLAP

ROLAP (Relational Online Analytical Processing) and MOLAP (Multidimensional Online Analytical Processing) are two different approaches to organizing and processing data in the context of Online Analytical Processing (OLAP) systems. In summary, the main difference between ROLAP and MOLAP lies in how they store and organize data. ROLAP uses relational databases to store multidimensional data in tables, while MOLAP uses specialized databases with a cube structure optimized for efficient multidimensional analysis.

fact table and Satellite table

Fact tables are surrounded by dimension tables in a star schema. They usually have foreign key relationships with dimension tables, linking to various dimensions that provide context to the measures. Satellite tables are used to store additional details about dimension members that are not part of the core structure of the dimension table. Satellite tables typically have a foreign key relationship with the main dimension table, linking to the primary key of the dimension. They can be joined to the main dimension table to enrich the analysis with additional context.

star schema and snow-flake schema

In a star schema, there is a central fact table surrounded by dimension tables. The fact table contains quantitative data (measures), and the dimension tables store descriptive data related to the measures. A snowflake schema is an extension of the star schema, where dimension tables are normalized into multiple related tables. This normalization leads to a structure resembling a snowflake when visualized, as opposed to the star-like structure of a star schema.

SQL querying tables and MDX querying Cubes

MDX stands for Multi-Dimensional
eXpressions.

“Roll-up” and “drill down”

Roll-up involves summarizing data at a higher level of aggregation or moving from a lower level of detail to a higher level. It’s a way of viewing data at a coarser level in the hierarchy. Drill down is the opposite of roll-up. It involves accessing more detailed information or moving from a higher level of aggregation to a lower, more detailed level.

GROUPING SETS, CUBE, ROLLUP

GROUPING SETS is a SQL feature that allows you to specify multiple grouping sets within a single query. The CUBE operation is an extension of GROUP BY that generates all possible combinations of grouped elements. ROLLUP is another extension of GROUP BY that generates subtotals and grand totals for a specified set of columns. It creates a hierarchy of grouping levels, calculating subtotals as it rolls up the hierarchy. Note that the column order in rollup is important.

book

exercise

concepts of fact tables;
pivot table;
SQL query with grouping set, cube…

Graph Databases

sildes

Index-free adjacency

With index-free adjacency, graph databases are designed in such a way that relationships between nodes are directly accessible without the need for an index structure. Instead of traversing indexes to find related nodes, the relationships are stored alongside the nodes, allowing for faster and more efficient traversal.Each node maintains direct pointers or references to its neighboring nodes, making it possible to navigate from one node to another without the need for an explicit index lookup. This design principle is particularly beneficial for scenarios where traversing relationships in a graph is a common and performance-critical operation.

Property Graph and Triple stores (RDF)

Labeled property graphs: ingredients: Nodes, Edges,Properties,Labels.
A property graph is a graph data model that consists of nodes, edges, and properties.RDF is a standard for representing data about resources in the form of subject-predicate-object triples.

Cypher

Cypher is a query language commonly used with property graph databases like Neo4j.

SPARQL

SPARQL is a query language commonly used with RDF triple stores.

Neo4j

Data replication;
sharding:neo4j fabric;
Caching and pages;

RDF

IRI (=URI for all practical purposes);
Alt text
RDF formats:RDF/XML;Turtle;JSON-LD;RDFa;N-Triples.

book

Labeled property graph model vs. triple stores

labeled property graphs enhance mathematical graphs with extra ingredients: properties, and labels.
Labels are some sort of “tags”, in the form of a string, that can be attached to a node or an edge.
Each node and each edge can be associated with a map from strings to values, which represents its properties.

Triple stores are a different and simpler model. It views the graph as nodes and edges that all have labels, but without any properties.
The graph is then represented as a list of edges, where each edge is a triple with the label of the origin node (called the subject), the label of the edge (called the property), and the label of the destination node (called the object).

Labels can be:URIs,Literals, that is, atomic values,Literals are only allowed as objects. Absent, in which case the node is called a blank node. Blank nodes are only allowed as subjects or objects, but not as properties.

cypher

Other graph databases have other means of querying data. Many, including Neo4j, support the RDF query language SPARQL and the imperative, path-based query language Gremlin.

Cypher enables a user (or an application acting on behalf of a user) to ask the database to find data that matches a specific pattern. Colloquially, we ask the database to “find things like this.” And the way we describe what “things like this” look like is to draw them, using ASCII art.

The MATCH clause is at the heart of most Cypher queries.

native graph processing

To understand why native graph processing is so much more efficient than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships. To traverse a network of m steps, the cost of the indexed approach, at O(m log n), dwarfs the cost of O(m) for an implementation that uses index-free adjacency.

With index-free adjacency, bidirectional joins are effectively precomputed and stored in the database as relationships

stores

Neo4j stores graph data in a number of different store files. Each store file contains the data for a specific part of the graph (e.g., there are separate stores for nodes, relationships, labels, and properties). Like most of the Neo4j store files, the node store is a fixed-size record store, where each record is nine bytes in length. Fixed-size records enable fast lookups for nodes in the store file.

CYPHER,Traverser API and Core API

Neo4j’s Core API is an imperative Java API that exposes the graph primitives of nodes, relationships, properties, and labels to the user. When used for reads, the API is lazily evaluated, meaning that relationships are only traversed as and when the calling code demands the next node.

The Traversal Framework is a declarative Java API. It enables the user to specify a set of constraints that limit the parts of the graph the traversal is allowed to visit.

Cypher can be more tolerant of structural changes—things such as variable-length paths help mitigate variation and change.

however, the Traversal Framework tends to perform marginally less well than a well-written Core API query.

Choosing between the Core API and the Traversal Framework is a matter of deciding whether the higher abstraction/lower coupling of the Traversal Framework is sufficient, or whether the close-to-the-metal/higher coupling of the Core API is in fact necessary for implementing an algorithm correctly and in accordance with our performance requirements.

exercise

Querying trees

slides

JSONiq

Data independence with
heterogeneous, denormalized data.
JSONiq Data Model (JDM): Sequences of Items;

Declarative languages,Functional languages and Set-based languages

Declarative languages focus on describing what the program should accomplish, rather than specifying how to achieve it. Functional languages treat computation as the evaluation of mathematical functions and avoid changing state or mutable data.Haskell, Lisp, and Erlang are functional programming languages.
Parts of JavaScript and Python support functional programming paradigms. Set-based languages are a subset of declarative languages that focus on manipulating sets of data.
It’s worth noting that languages can often belong to more than one category. For example, SQL is both declarative and set-based, and functional programming concepts can be integrated into languages that are not purely functional.

FLWOR clauses

query,Abstract Syntax Tree, Expression Tree, Iterator Tree

Materialized execution,Streamed execution,Parallel execution

Materialized execution takes lots of space. Streamed execution takes lots of time.
Parallel execution can take lots of machines.
Execution modes determined statically for every expression and clause.
Alt text
Alt text

Rumble

Traditional RDBMS/warehouses vs. data lakes

book

exercise

Document Stores

slides

JSON and XML

SQL and NoSQL:

NoSQL: validation after the data was populated.

CRUD

Create,Read,Update,Delete

MongoDB

projecting away: not selecting a field.

hash indices,Tree indices (B+-tree),compund index

Limitations of hash indices:
No support for range queries,Hash function not perfect in real life, Space requirements for collision avoidance.
B+-tree: All leaves at same depth, All non-leaf nodes have between 3 and 5 children,But it’s fine if the root has less.

book

exercise

Performance at Large Scales

slides

sources of bottleneck

Memory, CPU, Disk I/O, Network I/O.
Sweet Spot for MapReduce and Spark: Disk I/O.

Latency, Throughput,Response time

Latency: When do I start receiving data.
Throughput:”How fast can we transmit data.
Response time=Latency + Transfer.

speedup,Amdahl’s law,Gustafson’s law

speedup = latency(old)/latency(new).
Gustafson’s law:Constant computing power.
Amdahl’s law: Constant problem size.

Scaling out,Scaling up

tail latency, SLA

book

exercise

Massive Parallel Processing I(MapReduce)

slides

combine and reduce

map task and reduce task, map slot and reduce slot, map phase amd reduce phase

no combine task and combine slot.
map slot = sequential map task; map task = sequential split map;

split(mapreduce) and block(HDFS)

1 split = 1 map task
split(mapreduce)!= block(HDFS)

book

mapreduce model

In MapReduce, the input data, intermediate data, and output data are all made of a large collection of key-value pairs (with the keys not necessarily unique, and not necessarily sorted by key).

MapReduce architecture

In the original version of MapReduce, the main node is called JobTracker, and the worker nodes are called TaskTrackers.
In fact, the JobTracker typically runs on the same machine as the NameNode (and HMaster) and the TaskTrackers on the same machines as the DataNodes (and RegionServers). This is called “bring the query to the data.”

Note that shuffling can start before the map phase is over, but the reduce phase can only start after the map phase is over.

combine

Combining happens during the map phase.
In fact, in most of the cases, the combine function will be identical to the reduce function, which is generally possible if the intermediate key-value pairs have the same type as the output key-value pairs, and the reduce function is both associative and commutative.

MapReduce programming API

In Java, the user needs to define a so-called Mapper class that contains the map function, and a Reducer class that contains the reduce function.
A map function takes in particular a key and a value. Note that it outputs key-value pairs via the call of the write method on the context, rather than with a return statement.
A reduce function takes in particular a key and a list of values.

function,task,slot,phase

A map function is a mathematical, or programmed, function that takes one input key-value pair and returns zero, one or more intermediate key-value pairs.
Then, a map task is an assignment (or “homework”, or “TODO”) that consists in a (sequential) series of calls of the map function on a subset of the input.

There is no such thing as a combine task. Calls of the combine function are not planned as a task, but is called ad-hoc during flushing and compaction.

The map tasks are processed thanks to compute and memory resources (CPU and RAM). These resources are called map slots. One map slot corresponds to one CPU core and some allocated memory. Each map slot then processes one map task at a time, sequentially.

The map phase thus consists of several map slots processing map tasks in parallel.

short-circuiting(split and block)

This is because the DataNode process of HDFS and the TaskTracker process of MapReduce are on the same machine. Thus, getting a replica of the block containing the data necessary to the processing of the task is as simple as a local read. This is called short-circuiting.
split(logical level)!=HDFS block(physical level).

exercise

Resource management

book

YARN

YARN means Yet Another Resource manager. It was introduced as an additional layer that specifically handles the management of CPU and memory resources in the cluster.
YARN, unsurprisingly, is based on a centralized architecture in which the coordinator node is called the ResourceManager, and the worker nodes are called NodeManagers. NodeManagers furthermore provide slots (equipped with exclusively allocated CPU and memory) known as containers.

YARN provides generic support for allocating resources to any application and is application-agnostic. When the user launches a new application, the ResourceManager assigns one of the container to act as the ApplicationMaster which will take care of running the application.

Version 2 of MapReduce works on top of YARN by leaving the job lifecycle management to an ApplicationMaster.
It is important to understand that, unlike the JobTracker, the ResourceManager does not monitor tasks, and will not restart slots upon failure. This job is left to the ApplicationMasters.

Scheduling strategies

FIFO scheduling,Capacity scheduling,Fair scheduling

Dominant Resource Fairness algorithm.

The two (or more) dimensions are projected again to a single dimension by looking at the dominant resource for each user.

Massive Parallel Processing II (Spark)

slides

YARN

ResourceManager + NodeManager; ResourceManager allocates one NodeManager as application master. Application Master communicates with containers. ResourceManager
Does not monitor tasks and Does not restart upon failure. Fault tolerance is on the application master.

spark

Full-DAG query processing.distributed acyclic graph (DAG)
RDD: Resilient Distributed Dataset.

RDD

lazy evaluation:Lazy evaluation means that the execution of transformations on RDDs is deferred until an action is triggered. Instead of immediately executing the transformations, Spark keeps track of the sequence of transformations in the form of a logical execution plan. The actual computation is only performed when an action is called.

A narrow dependency (also known as a narrow transformation) occurs when each partition of the parent RDD contributes to at most one partition of the child RDD. In other words, the number of partitions remains the same before and after the transformation, and each partition of the child RDD depends on a one-to-one relationship with partitions of the parent RDD.

A wide dependency (also known as a wide transformation) occurs when each partition of the parent RDD contributes to multiple partitions of the child RDD. This typically involves operations that require data shuffling or redistribution, such as groupByKey or reduceByKey.

Data Frames

book

Resilient distributed datasets(RDDs)

Resilient means that they remain in memory or on disk on a “best effort” basis, and can be recomputed if need be. Distributed means that, just like the collections of key-value pairs in MapReduce, they are partitioned and spread over multiple machines.
A major difference with MapReduce, though, is that RDDs need not be collections of pairs. Since a key-value pair is a particular example of possible value, RDDs are a generalization of the MapReduce model for input, intermediate input and output.

The RDD lifecycle

Creation,Transformation(Mapping or reducing, in this model, become two very specific cases of transformations.),Action.
transformations:
unary transformations: The filter transformation,The map transformation,The flatMap transformation,The distinct transformation,The sample transformation
Binary transformations: taking the union of two RDDs, take the intersection, take the subtraction.
Pair transformations: Spark has transformations specifically tailored for RDDs of key-value pairs: The key transformation,The values transformation,The reduceByKey transformation, The groupByKey transformation,The sortByKey transformation,The mapValues transformation,The join transformation,The subtractByKey transformation.
Actions: The collect action,The count action,The countByValue action,The take action,The top action,The takeSample action,The reduce action,saveAsTextFile action,saveAsObjectFile action.

Note that Spark, at least in its RDD API, is not aware of any particular format or syntax, i.e., it is up to the user to parse and serialize values to the appropriate text or bytes.

Actions on Pair RDDs: There are actions specifically working on RDDs of key-value pairs:The countByKey action, The lookup action,

Lazy evaluation

It is only with an action that the entire computation pipeline is put into motion, leading to the computation of all the necessary intermediate RDDs all the way down to the final output corresponding to the action.

Physical architecture

There are two kinds of transformations: narrow-dependency transformations and wide-dependency transformations.

Such a chain of narrow-dependency transformations executed efficiently as a single set of tasks is called a stage, which would correspond to what is called a phase in MapReduce.

Optimizations

Pinning RDDs, Pre-partitioning.

DataFrames in Spark

A DataFrame can be seen as a specific kind of RDD: it is an RDD of rows (equivalently: tuples, records) that has relational integrity, domain integrity, but not necessarily (as the name “Row” would otherwise fallaciously suggest) atomic integrity.

Note that Spark automatically infers the schema from discovering the JSON Lines file, which adds a static performance overhead that does not exist for raw RDDs: there is no free lunch.

Unlike the RDD transformation API, there is no guarantee that the execution will happen as written, as the optimizer is free to reorganize the actual computations.

Spark SQL

both GROUP BY and ORDER BY will trigger a shuffle in the system. The SORT BY clause can sort rows within each partition, but not across partitions, i.e., does not induce any shuffling. The DISTRIBUTE BY clause forces a repartition by putting all rows with the same value (for the specified field(s)) into the same new partition.

use both SORT and DISTRIBUTE = the use of another clause, CLUSTER BY.

A word of warning must be given on SORT, DISTRIBUTE and CLUSTER clauses: they are, in fact, a breach of data independence, because they expose partitions.

explode() and lateral view: Lateral views are more powerful and generic than just an explode() because they give more control, and they can also be used to go down several levels of nesting.

book

exercise

Spark’s RDDs are by default recomputed each time you run an action on them. Please note that both persist() and cache() are lazy operations themselves. The caching operation will, in fact, only take place when the first action is called. With successive action calls, the cached RDD will be used.

introduction

book

three Vs: Volume, Variety, Velocity.
Alt text
four more shapes: trees, unstructured, cubes,graphs.
three factors: Capacity,Throughput, Latency.
Alt text

partial function and function

A function is a strict mapping where every element in the domain is mapped to a unique element in the codomain.
A partial function is a mapping where not every element in the domain necessarily has a defined value in the codomain.
For a table, we need to throw in three additional constraints: relational integrity, domain integrity and atomic integrity.

lessons learned from the past

book

natural join, theta join, and outer join,Semi-outer join

A natural join is a type of join that combines rows from two tables based on columns with the same name and data type. The columns used for the join condition are not explicitly specified; instead, the database system automatically identifies the matching columns. A theta join is a generalization of the natural join, where the join condition is explicitly specified using a comparison operator.

Normal forms

The first normal form was already covered earlier: it is in fact atomic integrity.
The second normal form takes it to the next level: it requires that each column in a record contains information on the entire record. The third normal form additionally forbids functional dependencies on anything else than the primary key.

SQL

SQL is a declarative language, which means that the user specifies what they want, and not how to compute it: it is up to the underlying system to figure out how to best execute the query. It is also a set-based language, in the sense that it manipulates sets of records at a time, rather than single values as is common in other languages. It is also, to some limited extent, a functional language in the sense that it contains expressions that can nest in each other (nested queries).

ACID

There are four main properties (often called ACID):Atomicity,Consistency,Isolation,Durability.

exercise

SQL

1NF,2NF,3NF,BCNF;
intersection;

Storing data

book

CAP

(Atomic) Consistency, Availability, Partition tolerance.

document stores

book

Document stores provide a native database management system for semi-structured data. A document store typically specializes in either JSON or XML data, even though some companies (e.g., MarkLogic) offer support for both. It is important to understand that document stores are optimized for the typical use cases of many records of small to medium sizes. Typically, a collection can have millions or billions of documents, while each single document weighs no more than 16 MB (or a size in a similar magnitude).

MongoDB

In MongoDB, the format is a binary version of JSON called BSON.

The API of MongoDB, like many document stores, is based on the CRUD paradigm. CRUD means Create, Read, Update, Delete.

MongoDB automatically adds to every inserted document a special field called “ id” and associated with a value called an Object ID and with a type of its own.

hash indices and tree indices

Hash indices are used to optimize point queries and more generally query that select on a specific value of a field.

Secondary indices

By default, MongoDB always builds a tree index for the id field. Users can request to build hash and tree indices for more fields. These indices are called secondary indices.

exercise

By default, MongoDB creates the _id index, which is an ascending unique index on the _id field, for all collections when the collection is created. You cannot remove the index on the _id field.

Querying denormalized data

book

Features of a query language

First, it is declarative. This means that the users do not focus on how the query is computed, but on what it should return.

Second, it is functional. This means that the query language is made of composable expressions that nest with each other, like a Lego game.

Finally, it is set-based, in the sense that the values taken and returned by expressions are not only single values (scalars), but are large sequences of items (in the case of SQL, an item is a row).

JSONiq

It is possible to filter any sequence with a predicate, where .c = 3]”
To access the n-th member of an array, you can use double-squarebrackets: “json-doc(“file.json”).o[[2]].a”.

Do not confuse sequence positions (single square brackets) with array positions (double square brackets)!

The empty sequence enjoys special treatment: if one of the sides (or both) is the empty sequence, then the arithmetic expression returns an empty sequence (no error).

Note that unlike SQL, JSONiq logic expressions are two-valued and return either true or false.

general comparison

universal and existential quantification: every and some;
JSONiq has a shortcut for existential quantification on value comparisons. This is called general comparison.

FLWOR expressions

One of the most important and powerful features of JSONiq is the FLWOR expression. It corresponds to SELECT-FROM-WHERE queries in SQL.

exercise

Accessing a JSON dataset can be done in two ways depending on the exact format:
If this is a file that contains a single JSON object spread over multiple lines, use json-doc(URL).
If this is a file that contains one JSON object per line (JSON Lines), use json-file(URL).

HDFS

book

HDFS data model

HDFS does not follow a key-value model: instead, an HDFS cluster organizes its files as a hierarchy, called the file namespace. Files are thus organized in directories, similar to a local file system.

key-value model vs file hierarchy

The key-value model and file hierarchy are two different approaches to organizing and accessing data within storage systems. While the key-value model excels in flexibility and quick data access, file hierarchy provides a structured and predictable organization suitable for many traditional storage use cases.

object storage vs block storage

Organizes data as objects, each containing both data and metadata. Objects are stored in a flat address space.
Organizes data as fixed-size blocks, typically within storage volumes.Requires a file system to manage and organize data into files and directories.

object storage and key-value model

Object storage is a data storage architecture that manages data as objects, each containing both data and metadata. Objects are stored in a flat address space without the hierarchy found in traditional file systems. In the key-value model, data is organized as pairs of keys and values. Each key uniquely identifies a value, and the system allows for the efficient retrieval and storage of data based on these key-value pairs.
Object storage systems often use a key-value model internally to manage objects. Each object has a unique identifier (key), and the associated data and metadata form the corresponding value.

physical architecture

In the case of HDFS, the central node is called the NameNode and the other nodes are called the DataNodes. In fact, more precisely, the NameNode and DataNodes are processes running on these nodes.
The NameNode stores in particular three things:the file namespace,a mapping from each file to the list of its blocks, a mapping from each block, represented with its 64-bit identifier, to the locations of its replicas.

exercise

object storage vs block storage

syntax

book

json

JSON stands for JavaScript Object Notation.
JSON is made of exactly six building blocks: strings, numbers, Booleans, null, objects, and arrays.
in JSON, escaping is done with backslash characters (\).
The way a number appears in syntax is called a lexical representation, or a literal.
JSON places a few restrictions: a leading + is not allowed. Also, a leading 0 is not allowed except if the integer part is exactly 0 (in which case it is even mandatory).
Objects are simply maps from strings to values. The keys of an object must be strings.
The JSON standard recommends for keys to be unique within an object.

unicode

Unicode is a standard that assigns a numeric code (called a code point) to each character in order to catalogue them across all languages of the world, even including emojis. The code point must be indicated in base 16.

XML

XML stands for eXtensible Markup Language.
XML’s most important building blocks are elements, attributes, text and comments.
Unlike JSON keys, element names can repeat at will.
At the top-level, a well-formed XML document must have exactly one element.
Attributes appear in any opening elements tag and are basically keyvalue pairs. Values can be either double-quoted or single-quoted. The key is never quoted, and it is not allowed to have unquoted values. Within the same opening tag, there cannot be duplicate keys. Attributes can never appear in a closing tag. It is not allowed to create attributes that start with XML or xml, or any case combination. because this is reserved for namespace declarations.
a single comment alone is not well-formed XML (remember: we need exactly one top-level element).
XML documents can be identified as such with an optional text declaration containing a version number and an encoding.
Another tag that might appear right below, or instead of, the text declaration is the doctype declaration. It must then repeat the name of the top-level element.
Remember that in JSON, it is possible to escape sequences with a backslash character. In XML, this is done with an ampersand (&) character.
Escape sequences can be used anywhere in text, and in attribute values.there are a few places where they are mandatory: In text, & and < MUST be escaped. In double-quoted attribute values, ”, & and < MUST be escaped. In single-quoted attribute values, ’, & and < MUST be escaped.

Namespaces in XML

A namespace is identified with a URI.
The triplet (namespace, prefix, localname) is called a QName (for “qualified name”).
For the purpose of the comparisons of two QNames (and thus of documents), the prefix is ignored: only the local name and the namespace are compared.
First, unprefixed attributes are not sensitive to default namespaces: unlike elements, the namespace of an unprefixed attribute is always absent even if there is a default namespace.

exercise

xml names

Remember:
Element names are case-sensitive.
Element names must start with a letter or underscore.
Element names cannot start with the letters xml (or XML, or Xml, etc).
Element names can contain letters, digits, hyphens, underscores, and periods.
Element names cannot contain spaces.

JSON Key names

The only restriction the JSON syntax imposes on the key names is that “ and \ must be escaped.

Wide column stores

book

HDFS VS Wide column stores

The problem with HDFS is its latency: HDFS works well with very large files (at least hundreds of MBs so that blocks even start becoming useful), but will have performance issues if accessing millions of small XML or JSON files. Wide column stores were invented to provide more control over performance and in particular, in order to achieve high-throughput and low latency for objects ranging from a few bytes to about 10 MB, which are too big and numerous to be efficiently stored as so-called clobs (character large objects) or blobs (binary large objects) in a relational database system, but also too small and numerous to be efficiently accessed in a distributed file system.

object storage VS Wide column stores

a wide column store has additional benefits: a wide column store will be more tightly integrated with the parallel data processing systems.wide column stores have a richer logical model than the simple key-value model behind object storage; wide column stores also handle very small values (bytes and kBs) well thanks to batch processing.

HBase

From an abstract perspective, HBase can be seen as an enhanced keyvalue store, in the sense that:a key is compound and involves a row, a column and a version; keys are sortable; values can be larger (clobs, blobs), up to around 10 MB. On the logical level, the data is organized in a tabular fashion: as a collection of rows. Each row is identified with a row ID. Row IDs can be compared, and the rows are logically sorted by row ID. Column qualifiers are arrays of bytes (rather than strings), and as for row IDs, there is a library to easily create column qualifiers from primitive values.

HBase supports four kinds of low-level queries: get, put, scan and delete. Unlike a traditional key-value store, HBase also supports querying ranges of row IDs and ranges of timestamps.

HBase offers a locking mechanism at the row level, meaning that different rows can be modified concurrently, but the cells in the same row cannot: only one user at a time can modify any given row.

A table in HBase is physically partitioned in two ways: on the rows and on the columns. The rows are split in consecutive regions. Each region is identified by a lower and an upper row key, the lower row key being included and the upper row key excluded. A partition is called a store and corresponds to the intersection of a region and of a column family.

All the cells within a store are eventually persisted on HDFS, in files that we will call HFiles. An HFile is, in fact, nothing else than a (boring) flat list of KeyValues, one per cell. What is important is that, in an HFile, all these KeyValues are sorted by key in increasing order.

HDFS block and HBlock

HBase uses index structures to quickly skip to the position of the HBase block which may hold the requested key. Note HBase block is not to be confused with HDFS block and the underlying file system block.By default, each HBase block is 64KB (configurable) in size and always contains whole key-value pairs, so, if a block needs more than 64KB to avoid splitting a key-value pair, it will just grow.

Log-structured merge trees

Upon flushing, all cells are written sequentially to a new HFile in ascending key order, HBlock by HBlock, concurrently building the index structure. In fact, sorting is not done in the last minute when flushing. Rather, what happens is that when cells are added to memory, they are added inside a data structure that maintains them in sorted order (such as tree maps) and then flushing is a linear traversal of the tree.

Bloom filters

HBase has a mechanism to avoid looking for cells in every HFile. This mechanism is called a Bloom filter. It is basically a black box that can tell with absolute certainty that a certain key does not belong to an HFile, while it only predicts with good probability (albeit not certain) that it does belong to it.

exercise

Bloom filters

Bloom filters are a data structure used to speed up queries, useful in the case in which it’s likely that the value we are looking doesn’t exist in the collection we are querying. Their main component is a bit array with all values initially set to 0. When a new element is inserted in the collection, its value is first run through a certain number of (fixed) hash functions, and the locations in the bit array corresponding to the outputs of these functions are set to 1.

This means that when we query for a certain value, if the value has previously been inserted in the collection then all the locations corresponding to the hash function outputs will certainly already have been set to 1. On the contrary, if the element hasn’t been previously inserted, then the locations may or may not have already been set to 1 by other elements. Then, if prior to accessing the collection we run our queried value through the hash functions, check the locations corresponding to the outputs, and find any of them to be 0, we are guaranteed that the element is not present in the collection (No False Negatives), and we don’t have to waste time looking. If the corresponding locations are all set to 1, the element may or may not be present in the collection (possibility of False Positives), but in the worst case we’re just wasting time.

As you have seen in the task above, HBase has to check all HFiles, along with the MemStore, when looking for a particular key. As an optimisation, Bloom filters are used to avoid checking an HFile if possible. Before looking inside a particular HFile, HBase first checks the requested key against the Bloom filter associated with that HFile. If it says that the key does not exist, the file is not read.

a Bloom filter can produce false positive outcomes. Luckily, it never produces false negative outcomes.

Log-structured merge-tree (LSM tree) (optional)

As opposed to B+-tree which has a time complexity of O(log n) when inserting new elements, n being the total number of elements in the tree, LSM tree has O(1) for inserting, which is a constant cost.

Data models and validation

book

A data model is an abstract view over the data that hides the way it is stored physically. For example, a CSV file should be abstracted logically as a table.

The JSON Information Set

it is possible to take a tree and output it back to JSON syntax. This is called serialization.

The XML Information Set

A fundamental difference between JSON trees and XML trees is that for JSON, the labels (object keys) are on the edges connecting an object information item to each one of its children information items. In XML, the labels (these would be element and attribute names) are on the nodes (information items) directly.

Item types

Also, all atomic types have in common that they have a logical value space and a lexical value space. Atomic types can be in a subtype relationship: a type is a subtype of another type if its logical value space is a subset of the latter.
However, in modern databases, it is customary to support unbounded integers.
Decimals correspond to real numbers that can be written as a finite sequence of digits in base 10, with an optional decimal period.Support for the entire decimal value space can be costly in performance. In order to address this issue, a floating-point standard (IEEE 754) was invented and is still very popular today.
Timestamp values are typically stored as longs (64-bit integers) expressing the number of milliseconds elapsed since January 1, 1970 by convention.
XML Schema, JSound and JSONiq follow the ISO 8601 standard.
The lexical representation of duration can vary, but there is a standard defined by ISO 8601 as well, starting with a P and prefixing sub-day parts with a T.
Maps (not be confused with records, which are similar) are maps from any atomic value to any value, i.e., generalize objects to keys that are not necessarily strings (e.g., numbers, dates, etc). However, unlike records, the type of the values must be the same for all keys.
Alt text

JSound and JSON Schema

JSound is a schema language that was designed to be simple for 80% of the cases, making it particularly suitable in a teaching environment. It is independent of any programming language.
JSON Schema is another technology for validating JSON documents.
The type system of JSON Schema is thus less rich than that of JSound, but extra checks can be done with so-called formats, which include date, time, duration, email, and so on including generic regular expressions.
It is possible to require the presence of a key by adding an exclamation mark in JSound. in JSON Schema, which uses a “required” property associated with the list of required keys to express the same.
In the JSound compact syntax, extra keys are forbidden. Unlike JSound, in JSON Schema, extra properties are allowed by default. JSON Schema then allows to forbid extra properties with the “additionalProperties” property.
There are a few more features available in the compact JSound syntax (not in JSON Schema) with the special characters @, ? and =. The question mark (?) allows for null values (which are not the same as absent values). The arobase (@) indicates that one or more fields are primary keys for a list of objects that are members of the same array. The equal sign (=) is used to indicate a default value that is automatically populated if the value is absent.
Alt text
Note that some values are quoted, which does not matter for validation: validation only checks whether lexical values are part of the type’s lexical space.

Accepting any values in JSound can be done with the type “item”, which contains all possible values. In JSON Schema, in order to declare a field to accept any values, you can use either true or an empty object in lieu of the type.

JSON Schema additionally allows to use false to forbid a field.

In JSON Schema, it is also possible to combine validation checks with Boolean combinations using “anyOf”.JSound schema allows defining unions of types with the vertical bar inside type strings.

In JSON Schema only (not in JSound), it is also possible to do a conjunction (logical and) with “allOf” as well as exclusive with “oneOf” as well as negation with “not”.

XML Schema

all elements in an XML Schema are in a namespace, the XML Schema namespace. The namespace is prescribed by the XML Schema standard and must be this one.

dataframe

There is a particular subclass of semi-structured datasets that are very interesting: valid datasets, which are collections of JSON objects valid against a common schema, with some requirements on the considered schemas. The datasets belonging to this particular subclass are called data frames, or dataframes.
relational tables are data frames, while data frames are not necessarily relational tables: data frames can be (and are often) nested, but they are still relatively homogeneous to some extent.Thus, Data frames are a generalization of (normalized) relational tables allowing for (organized and structured) nestedness.

data format

In fact, if the data is structured as a (valid) data frame, then there are many, many different formats that it can be stored in, and in a way that is much more efficient than JSON. These formats are highly optimized and typically stored in binary form, for example Parquet, Avro, Root, Google’s protocol buffers, etc.

Why is it possible to store the data more efficiently when it is valid and data-frame-friendly? One important reason is that the schema can be stored as a header in the binary format, and the data can be stored without repeating the fields in every record (as is done in textual JSON).

exercise

Dremel

Dremel is a query system developed at Google for deriving data stored in a nested data format such as XML, JSON, or Google Protocol Buffers into column storage, where it can be analyzed faster.

paper notes

Dynamo

CAP: AP.
Alt text

preference list and coordinator

A node handling a read or write operation is known as the coordinator. Typically, this is the first among the top N nodes in the preference list.

quorum-like system

To maintain consistency among its replicas, Dynamo uses a consistency protocol similar to those used in quorum systems. This protocol has two key configurable values: R and W. R is the minimum number of nodes that must participate in a successful read operation. W is the minimum number of nodes that must participate in a successful write operation. Setting R and W such that R + W > N yields a quorum-like system.

To remedy this it does not enforce strict quorum membership and instead it uses a “sloppy quorum”; all read and write operations are performed on the first N healthy nodes from the preference list.

Merkle Trees

A hash tree or Merkle tree is a binary tree in which every leaf node gets as its label a data block and every non-leaf node is labelled with the cryptographic hash of the labels of its child nodes. Some KeyValue stores use Merkle trees for efficiently detecting inconsistencies in data between replicas.

vector clock

Dremel

In contrast to layers such as Pig19 and Hive,16 it executes queries natively without translating them into MR jobs. Lastly, and importantly, Dremel uses a column-striped storage representation, which enables it to read less data from secondary storage and reduce CPU cost due to cheaper compression.

Repetition and definition levels

we define the repetition level as the number of repeated fields in the common prefix (including the first path element identifying the record). The definition level specifies the number of optional and repeated fields in the path (excluding the first path element).A definition level smaller than the maximal number of repeated and optional fields in a path denotes a NULL.

HDFS

Each block replica on a DataNode is represented by two files in the local host’s native file system. The first file contains the data itself and the second file is block’s metadata including checksums for the block data and the block’s generation stamp.

If the NameNode does not receive a heartbeat from a DataNode in ten minutes the NameNode considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The NameNode then schedules creation of new replicas of those blocks on other DataNodes. Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NameNode’s space allocation and load balancing decisions. The NameNode does not directly call DataNodes. It uses replies to heartbeats to send instructions to the DataNodes.

A recently introduced feature of HDFS is the BackupNode. Like a CheckpointNode, the BackupNode is capable of creating periodic checkpoints, but in addition it maintains an inmemory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.

A replica stored on a DataNode may become corrupted because of faults in memory, disk, or network. HDFS generates and stores checksums for each data block of an HDFS file. Checksums are verified by the HDFS client while reading to help detect any corruption caused either by client, DataNodes, or network. When a client creates an HDFS file, it computes the checksum sequence for each block and sends it to a DataNode along with the data. A DataNode stores checksums in a metadata file separate from the block’s data file. When HDFS reads a file, each block’s data and checksums are shipped to the client. The client computes the checksum for the received data and verifies that the newly computed checksums matches the checksums it received. If not, the client notifies the NameNode of the corrupt replica and then fetches a different replica of the block from another DataNode.

The design of HDFS I/O is particularly optimized for batch processing systems, like MapReduce, which require high throughput for sequential reads and writes.

Currently HDFS provides a configurable block placement policy interface so that the users and researchers can experiment and test any policy that’s optimal for their applications.

When a block becomes over replicated, the NameNode chooses a replica to remove. The NameNode will prefer not to reduce the number of racks that host replicas, and secondly prefer to remove a replica from the DataNode with the least amount of available disk space. When a block becomes under-replicated, it is put in the replication priority queue. A background thread periodically scans the head of the replication queue to decide where to place new replicas.

HDFS block placement strategy does not take into account DataNode disk space utilization. This is to avoid placing new—more likely to be referenced—data at a small subset of the DataNodes.

Each DataNode runs a block scanner that periodically scans its block replicas and verifies that stored checksums match the block data. Whenever a read client or a block scanner detects a corrupt block, it notifies the NameNode. The NameNode marks the replica as corrupt, but does not schedule deletion of the replica immediately. Instead, it starts to replicate a good copy of the block. Only when the good replica count reaches the replication factor of the block the corrupt replica is scheduled to be removed. So even if all replicas of a block are corrupt, the policy allows the user to retrieve its data from the corrupt replicas.

A present member of the cluster that becomes excluded is marked for decommissioning. Once a DataNode is marked as decommissioning, it will not be selected as the target of replica placement, but it will continue to serve read requests. The NameNode starts to schedule replication of its blocks to other DataNodes. Once the NameNode detects that all blocks on the decommissioning DataNode are replicated, the node enters the decommissioned state. Then it can be safely removed from the cluster without jeopardizing any data availability.

MapReduce

The master pings every worker periodically.If no response isrecei ved from awork er in acertain amount of time, the master marks the worker as failed.

json

JSON stands for JavaScript Object Notation and was inspired by the object literals of JavaScript.

A JSON value can be an object, array, number, string, true, false, or null.

The JSON syntax does not impose any restrictions on the strings used as names, does not require that name strings be unique, and does not assign any significance to the ordering of name/value pairs.

Numeric values that cannot be represented as sequences of digits (such as Infinity and NaN) are not permitted.

All strings in JSON must be double-quoted.

rumble

a query execution engine for large, heterogeneous, and nested collections of JSON objects built on top of Apache Spark.

xml

XML, unlike HTML, is case-sensitive. is not the same as or .

Every well-formed XML document has exactly one root element.

XML elements can have attributes. An attribute is a name-value pair attached to the element’s start-tag. Names are separated from values by an equals sign and optional whitespace. Values are enclosed in single or double quotation marks.

each element may have no more than one attribute with a given name.

Element and other XML names may contain essentially any alphanumeric character. This includes the standard English letters A through Z and a through z as well as the digits 0 through 9. They may also include these three punctuation characters: _ The underscore ,- The hyphen, . The period.Finally, all names beginning with the string “XML” (in any combination of case) are reserved for standardization in W3C XML-related specifications. XML names may only start with letters, ideograms, or the underscore character. They may not start with a number, hyphen, or period.

exam notes

2022

HTTP command and status code.
Storage type choosing.

HDFS and random access:
For distributed data storage though, and for the use case at hand where we read a large dataset, analyze it, and write back the output as a new dataset, random access is not needed. A distributed file system is designed so that, in cruise mode, its bottleneck will be the data flow (throughput), not the latency. This aspect of the design is directly consistent with a full-scan pattern, rather than with a random access pattern, the latter being strongly latency-bound.

json and comments:In JSON (JavaScript Object Notation), comments are not officially supported.

Hbase and meta table.
if empty json valid against to some schema.
mapreduce split.
mapreduce function: emit.
if reduce function can be used as combine function.

1NF,2NF and 3NF:
A candidate key is a minimal set of attributes that determines the other attributes included in the relation. A non-prime attribute is an attribute that is not part of the candidate key.Informally, the second normal form states that all attributes must depend on the entire candidate key.In other words, non-prime attributes must be functionally dependent on the key(s), but they must not depend on another non-prime attribute. 3NF non-prime attributes depend on “nothing but the key”.

different type comparision in mongodb.

If you have n dimensions, the CUBE operation will generate 2^n combinations.

xml schema:
You have already seen the xs:sequence element, which dictates that the elements it contains must appear in exactly the same order in which they appear within the sequence element.

mongdb atomic operations:
In MongoDB, a write operation is atomic on the level of a single document, even if the operation modifies multiple embedded documents within a single document.
When a single write operation (e.g. db.collection.updateMany()) modifies multiple documents, the modification of each document is atomic, but the operation as a whole is not atomic.

Json schema:
tuple validation.

references

https://vertabelo.com/blog/normalization-1nf-2nf-3nf/

bigdata - Cube Data

On-Line Analytic Processing(OLAP)

On-Line Analytic Processing generally involves highly complex queries that use one or more aggregations.

OLAP and Data Warehouses

Data from many separate databases may be integrated into the warehouse. The warehouse is usually only updated overnight. Data warehouses play an important role in OLAP applications. First, the warehouse is necessary to organize and centralize data in a way that supports OLAP queries. Second, OLAP queries are usually complex and touching much of the data and take too much time to be executed in a transaction-processing system with high throughput requirements.

A Multidimensional View of OLAP Data

In typical OLAP applications there is a central relation or collection of data called the fact table. Often, it helps to think of the objects in the fact table as arranged in a multidimensional space. Two broad directions that have been taken by specialized systems that support cube-structured data for OLAP: ROLAP and MOLAP. ROLAP, which is Relational OLAP. In this approach, data may be stored in relations with a specialized structure called a “star schema”. MOLAP, which is Multidimensional OLAP. A specialized structure “data cude” is used to hold the data, including its aggregates.

Star Schemas

A star schema consists of the schema for the fact table, which links to several other relations, called “dimension tables”.

Slicing and Dicing

A choice of partition for each dimension “dices” the cude. The result is that the cude is divided into smaller cubes that represent groups of points whose statistics are aggregated by a query that performs this partitioning in its “group by” clause. Through the “where” clause, a query has the option of focusing on particular partitions alone one or more dimensions.(on a particular “slice” of the cube).
Alt text

Alt text

Alt text

Alt text

Data Cubes

The formal data cube precomputes all possible aggregates in a systematic way.

The Cube Operator

Given a fact table F, we can define an augmented table CUBE(F) that adds an additional value, denoted , to each dimension. The has the intuitive meaning “any,” and it represents aggregation along the dimension in which it appears.

The Cube Operator in SQL

SQL gives us a way to apply the cube operator within queries. If we add the term “WITH CUBE” to a group-by clause, then we get not only the tuple for each group, but also the tuples that represent aggregation along one or more of the dimensions along which we have grouped. These tuples appear in the result with NULL where we have used *.
Alt text

Alt text

However, SalesRollup would not contain tuples such as
Alt text

bigdata - Graph Database

Why graphs

Now, we do know a way to avoid joins and studied it at length: denormalizing the data to (homogeneous) collections of trees is a way of “pre-computing” the joins statically, so that the data is already joined (via nesting) at runtime.

Now, why is it efficient? Because doing down a tree only necessitates following pointers in memory. But trees cannot have cycles. Graph databases provide a way of generalizing the use in-memory pointers to traverse data to the general case in which cycles are present: this is called “index-free adjacency.”

Kinds of graph databases

There are many different graph database system products on the market, and they can be classified along several dimensions: Labeled property graph model vs. triple stores;Read-intensive vs. write-intensive;Local vs. distributed;Native vs. non-native.

Graph data models

Labeled property graphs

Computer scientists need to go one step further and also design how to store graphs physically. One way of doing so is to create an associative array mapping each node to the list of nodes that it connects to via an edge (adjacency lists). Another storage form is with an adjacency matrix: each row and each column represent a node, and a 0 or a 1 indicate the absence or presence of an edge between the row node and the column node.

Now, this does not quite work for us, because labeled property graphs enhance mathematical graphs with extra ingredients: properties, and labels.
how to “convert” a relational table to a labeled property graph: labels can be seen as table names, nodes as records, and properties as the attribute values for the records. This shows that relational tables can be physically stored as labeled property graphs. Of course, this does not work the other way round: given a graph, it will often not be possible to convert it “back” to a table in this specific way.

Triple stores

Triple stores are a different and simpler model. It views the graph as nodes and edges that all have labels, but without any properties. The graph is then represented as a list of edges, where each edge is a triple with the label of the origin node (called the subject), the label of the edge (called the property), and the label of the destination node (called the object).

Triple stores typically provide SPARQL capabilities to reason about and stored RDF data.

Querying graph data

We will now have a look at query languages for querying graphs, with a focus on Cypher, which is neo4j’s query language.
Alt text

Cypher Philosophy

Cypher enables a user (or an application acting on behalf of a user) to ask the database to find data that matches a specific pattern. Colloquially, we ask the database to “find things like this.” And the way we describe what “things like this” look like is to draw them, using ASCII art.

Like most query languages, Cypher is composed of clauses. The simplest queries consist of a MATCH clause followed by a RETURN clause. There are other clauses we can use in a Cypher query: WHERE,WITH…AS…,CREATE,MERGE,DELETE,SET,UNION,FORWACH and so on.

A Comparison of Relational and Graph Modeling

Relational databases—with their rigid schemas and complex modeling characteristics—are not an especially good tool for supporting rapid change. What we need is a model that is closely aligned with the domain, but that doesn’t sacrifice performance, and that supports evolution while maintaining the integrity of the data as it undergoes rapid change and growth. That model is the graph model.

creating a graph

In practice, we tend to use CREATE when we’re adding to the graph and don’t mind duplication, and MERGE when duplication is not permitted by the domain.

Beginning a Query

In Cypher we always begin our queries from one or more well-known starting points in the graph—what are called bound nodes. Cypher uses any labels and property predicates supplied in the MATCH and WHERE clauses, together with metadata supplied by indexes and constraints, to find the starting points that anchor our graph patterns.

INDEXES AND CONSTRAINTS

To support efficient node lookup, Cypher allows us to create indexes per label and property combinations. For unique property values we can also specify constraints that assure uniqueness.

Neo4j

Neo4j is a graph database with native processing capabilities as well as native graph storage.

Native Graph Processing

Of the many different engine architectures, we say that a graph database has native processing capabilities if it exhibits a property called index-free adjacency.

A database engine that utilizes index-free adjacency is one in which each node maintains direct references to its adjacent nodes. Each node, therefore, acts as a micro-index of other nearby nodes, which is much cheaper than using global indexes. It means that query times are independent of the total size of the graph, and are instead simply proportional to the amount of the graph searched. A nonnative graph database engine, in contrast, uses (global) indexes to link nodes together.

Proponents for native graph processing argue that index-free adjacency is crucial for fast, efficient graph traversals. To understand why native graph processing is so much more efficient than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships. To traverse a network of m steps, the cost of the indexed approach, at O(m log n), dwarfs the cost of O(m) for an implementation that uses index-free adjacency.

Native Graph Storage

Neo4j stores graph data in a number of different store files. Each store file contains the data for a specific part of the graph (e.g., there are separate stores for nodes, relationships, labels, and properties).The division of storage responsibilities—particularly the separation of graph structure from property data—facilitates performant graph traversals, even though it means the user’s view of their graph and the actual records on disk are structurally dissimilar.

Like most of the Neo4j store files, the node store is a fixed-size record store, where each record is nine bytes in length. Fixed-size records enable fast lookups for nodes in the store file. If we have a node with id 100, then we know its record begins 900 bytes into the file. Based on this format, the database can directly compute a record’s location, at cost O(1), rather than performing a search, which would be cost O(log n).

Transactions

Transactions in Neo4j are semantically identical to traditional database transactions. Writes occur within a transaction context, with write locks being taken for consistency purposes on any nodes and relationships involved in the transaction. On successful completion of the transaction, changes are flushed to disk for durability, and the write locks released. These actions maintain the atomicity guarantees of the transaction.

CORE API, TRAVERSAL FRAMEWORK, OR CYPHER?

The Core API allows developers to fine-tune their queries so that they exhibit high affinity with the underlying graph. A well-written Core API query is often faster than any other approach. The downside is that such queries can be verbose, requiring considerable developer effort. Moreover, their high affinity with the underlying graph makes them tightly coupled to its structure. When the graph structure changes, they can often break. Cypher can be more tolerant of structural changes—things such as variable-length paths help mitigate variation and change.

The Traversal Framework is both more loosely coupled than the Core API (because it allows the developer to declare informational goals), and less verbose, and as a result a query written using the Traversal Framework typically requires less developer effort than the equivalent written using the Core API. Because it is a general-purpose framework, however, the Traversal Framework tends to perform marginally less well than a well-written Core API query.

exercise

Neo4j system design is different from mongodb on the consistency. MongoDB is eventually consistent, but neo4j is strong consistency and obey the ACID rules.

Cypher: note that label and property are case sensitive but the clause is not case sensitive.

RDF: Turtle Syntax

Mannual grouping and grouping sets: need to know how to translate between them.
grouping with rollup: rollup(c1,c2,c3) is equal to grouping sets((c1,c2,c3),(c1,c2),(c1),()). The ordering is important.

difference between neo4j and RDF:
Data Model:
Neo4j: Neo4j is a graph database that uses the property graph data model. In this model, nodes represent entities, relationships represent connections between entities, and both nodes and relationships can have key-value pairs as properties.
RDF: RDF is a data model for representing knowledge in the form of subject-predicate-object triples. Each triple represents a statement, and these triples can be used to build a graph of linked data. RDF is more abstract and can be implemented using various storage formats.

Query Language:
Neo4j: Neo4j uses the Cypher query language, which is specifically designed for querying graph databases. Cypher allows users to express graph patterns and relationships in a concise and readable manner.
RDF: RDF data is typically queried using SPARQL (SPARQL Protocol and RDF Query Language). SPARQL is a query language for querying and manipulating RDF data, and it provides powerful capabilities for navigating the graph structure.

Graph Structure:
Neo4j: In Neo4j, the graph is explicit, with nodes, relationships, and properties forming a connected graph structure. The focus is on relationships between entities and their properties.
RDF: RDF represents a graph, but the graph structure is more implicit. Resources are identified by URIs, and relationships are expressed through triples, allowing for the creation of a distributed and linked data web.

Schema:
Neo4j: Neo4j supports a flexible schema where nodes and relationships can have dynamic properties. While it provides some level of schema flexibility, users can define constraints and indexes to enforce certain structures.
RDF: RDF is schema-agnostic, allowing for more dynamic and extensible data representation. Schemas can be defined using RDF vocabularies and ontologies, such as RDFS and OWL.

Use Cases:
Neo4j: Neo4j is often used for applications where relationships and graph structures are central, such as social networks, recommendation engines, and network analysis.
RDF: RDF is commonly used in the context of the Semantic Web for representing and linking diverse data sources, allowing for interoperability and knowledge representation.

references

Robinson, I. et al. (2015). Graph Databases (2nd ed.)

bigdata - JSONiq

Querying denormalized data

imperative vs declarative

Imperative programming is a programming paradigm that expresses computation as a series of statements that change a program’s state. In imperative programming, the focus is on describing how a program operates step by step. Common imperative languages include C, C++, Java, and Python.

In contrast to imperative programming, declarative programming focuses on describing what the program should accomplish rather than how to achieve it. In a declarative host language, the emphasis is on specifying the desired outcome or properties, and the language itself takes care of the underlying implementation details. Such as SQL, HTML.

motivation

Denormalized data

it is characterized with two features: nestedness, and heterogeneity. In fact, denormalized datasets should not be seen as “broken tables pushed to their limits”, but rather as collections of trees.

Features of a query language

A query language for datasets has three main features: Declarative, Functional, and Set-based. First, it is declarative. This means that the users do not focus on how the query is computed, but on what it should return. Being functional means that the query language is made of composable expressions that nest with each other, like a Lego game. Finally, it is set-based, in the sense that the values taken and returned by expressions are not only single values (scalars), but are large sequences of items (in the case of SQL, an item is a row).

Query languages for denormalized data

For denormalized data though, sadly, the number of languages keeps increasing: the oldest ones being XQuery, JSONiq, but then now also JMESPath, SpahQL, JSON Query, PartiQL, UnQL, N1QL, ObjectPath, JSONPath, ArangoDB Query Language (AQL), SQL++, GraphQL, MRQL, Asterix Query Language (AQL), RQL.

JSONiq as a data calculator

it can perform arithmetics, but also comparison and logic. It is, however, more powerful than a common calculator and supports more complex constructs, for example variable binding.
It also supports all JSON values. Any copy-pasted JSON value literally returns itself.
And, unlike a calculator, it can access storage.

The JSONiq Data Model

Every expression of the JSONiq “data calculator” returns a sequence of items.An item can be either an object, an array, an atomic item, or a function item. A sequence can also be empty. Caution, the empty sequence is not the same logically as a null item.

Navigation

The general idea of navigation is that it is possible to “dive” into the nested data with dots and square brackets (originally, these were slashes with XPath) – all in parallel: starting with an original collection of objects (or, possibly, a single document), each step (i.e., for each dot and each pair of square brackets) goes down the nested structure and returns another sequence of nested items.

Object lookups (dot syntax)

json-doc(“file.json”).o

Array unboxing (empty square bracket syntax)

We can unbox the array, meaning, extract its members as a sequence of seven object items, with empty square brackets.json-doc(“file.json”).o[].

Parallel navigation

The dot syntax, in fact, works on sequences, too and will extract the value associated with a key in every object of the sequence (anything else than an object is ignored and thrown away).

Filtering with predicates (simple square bracket syntax)

It is possible to filter any sequence with a predicate, where .c = 3].It is also possible to access the item at position n in a sequence with this same notation: json-doc(“file.json”).o[].a.b[][5].

Array lookup (double square bracket syntax)

To access the n-th member of an array, you can use double-squarebrackets, like so:
json-doc(“file.json”).o[[2]].a.

A common pitfall: Array lookup vs. Sequence predicates

Do not confuse sequence positions (single square brackets) with array positions (double square brackets)!
([1, 2], [3, 4])[2] -> [ 3, 4 ]
([1, 2], [3, 4])[[2]] -> 2 4

Schema discovery

Collections

many datasets are in fact found in the form of large collections of smaller objects (as in document stores). Such collections are access with a function call together with a name or (if reading from a data lake) a path.

Getting all top-level keys

The keys function retrieves all keys. It can be called on the entire sequence of objects and will return all unique keys found (at the top level) in that collection.

Getting unique values associated with a key

With distinct-values, it is then possible to eliminate duplicates and look at unique values:
distinct-values(collection( “https://www.rumbledb.org/samples/git-archive.jsonl“ ).type)

Aggregations

Aggregations can be made on entire sequences with a single function call:The five basic functions are count, sum, avg, min, max.
count(distinct-values(collection( “https://www.rumbledb.org/samples/git-archive.jsonl“ ).type))

Construction

Construction of atomic values

Atomic values that are core to JSON can be constructed with exactly the same syntax as JSON.

Construction of objects and arrays

In fact, one can copy-paste any JSON value, and it will always be recognized as a valid JSONiq query returning that value.

Construction of sequences

Sequences can be constructed (and concatenated) using commas.Increasing sequences of integers can also be built with the to keyword.

Scalar expressions

Sequences of items can have any number of items. A few JSONiq expression (arithmetic, logic, value comparison…) work on the particular case that a sequence has zero or one items.

Arithmetic

JSONiq supports basic arithmetic: addition (+), subtraction (-), division (div), integer division (idiv) and modulo (mod).If the data types are different, then conversions are made automatically. The empty sequence enjoys special treatment: if one of the sides (or both) is the empty sequence, then the arithmetic expression returns an empty sequence (no error). If one of the two sides is null (and the other side is not the empty sequence), then the arithmetic expression returns null. If one of the sides (or both) is not a number, null, or the empty sequence, then a type error is thrown.

String manipulation

String concatenation is done with a double vertical bar: “foo” || “bar”.
Most other string manipulation primitives are available from the rich JSONiq builtin function library:
concat(“foo”, “bar”),string-join((“foo”, “bar”, “foobar”), “-“),substr(“foobar”, 4, 3),
string-length(“foobar”).

Value comparison

Sequences of one atomic item can be compared with eq (equal), ne (not equal), le (lower or equal), gt (greater or equal), lt (lower than) and gt (greater than).

Logic

JSONiq supports the three basic logic expressions and, or, and not. not has the highest precedence, then and, then or.
JSONiq also supports universal and existential quantification:
every $i in 1 to 10 satisfies $i gt 0, some $i in 1 to 10 satisfies $i gt 5.

If one of the two sides is not a sequence of a single Boolean item, then implicit conversions are made. This mechanism is called the Effective Boolean Value (EBV). For example, an empty sequence, or a sequence of one empty string, or a sequence of one zero integer, is considered false. A sequence of one non-empty string, or a sequence or one non-zero integer, or a sequence starting with one object (or array) is considered true.

General comparison

JSONiq has a shortcut for existential quantification on value comparisons. This is called general comparison.

some $i in (1, 2, 3, 4, 5) satisfies $i eq 1 ==
(1, 2, 3, 4, 5) = 1.

Composability

Data flow

A few expressions give some control over the data flow by picking the output or this or that expression based on the value of another expression. This includes conditional expressions. This includes conditional expressions. This also includes try-catch expressions.

Binding variables with cascades of let clauses

Variables in JSONiq start with a dollar sign. It is important to understand that this is not a variable assignment that would change the value of a variable. This is only a declarative binding.

FLWOR expressions

One of the most important and powerful features of JSONiq is the FLWOR expression. It corresponds to SELECT-FROM-WHERE queries in SQL, however, it is considerably more expressive and generic than them in several aspects. In JSONiq the clauses can appear in any order with the exception of the first and last clause. JSONiq supports a let clause, which does not exist in SQL.
In SQL, when iterating over multiple tables in the FROM clause, they “do not see each other”. In JSONiq, for clauses (which correspond to FROM clauses in SQL), do see each other, meaning that it is possible to iterate in higher and higher levels of nesting by referring to a previous for variable.

For clauses

It can thus be seen that the for clause is akin to the FROM clause in SQL, and the return is akin to the SELECT clause. Projection in JSONiq can be made with a project() function call, with the keys to keep. It is possible to implement a join with a sequence of two for clauses and a predicate. Note that allowing empty can be used to perform a left outer join, i.e., to account for the case when there are no matching records in the second collection.

Let clauses

A let clause outputs exactly one outgoing tuple for each incoming tuple (think of a map transformation in Spark). Let clauses also allow for joining the two datasets.

Where clauses

Where clauses are used to filter variable bindings (tuples) based on a predicate on these variables. They are the equivalent to a WHERE clause in SQL.

Order by clauses

Order by clauses are used to reorganize the order of the tuples, but without altering them. They are the same as ORDER BY clauses in SQL.It is also possible, like in SQL, to specify an ascending or a descending order. In case of ties between tuples, the order is arbitrary. But it is possible to sort on another variable in case there is a tie with the first one (compound sorting keys). It is possible to control what to do with empty sequences: they can be considered smallest or greatest.

Group by clauses

Group by clauses organize tuples in groups based on matching keys, and then output only one tuple for each group, aggregating other variables (count, sum, max, min…). It is also possible to opt out of aggregating other (non-grouping-key) variables.

Types

Variable types

Since every value in JSONiq is a sequence of item, a sequence type consists of two parts: an item type, and a cardinality.
Item types can be any of the builtin atomic types. as well as “object”, “array” and the most generic item type, “item”.
Cardinality can be one of the following four:Any number of items (suffix ); for example object, One or more items (suffix +); for example array+,Zero or one item (suffix ?); for example integer,Exactly one item (no suffix); for example boolean?

Type expressions

An “instance of” expression checks whether a sequences matches a sequence type, and returns true or false. A “cast as” expression casts single items to an expected item type.
A “castable as” expression tests whether a cast would succeed (in which case it returns true) or not (false).
A “treat as” expression checks whether its input sequence matches an expected type (like a type on a variable); if it does, the input sequence is returned unchanged. If not, an error is raised.

Types in user-defined functions

JSONiq supports user-defined functions. Parameter types can be optionally specified, and a return type can also be optionally specified.

Validating against a schema

It is possible to declare a schema, associating it with a user-defined type, and to validate a sequence of items against this user-defined type.

Architecture of a query engine

Static phase

When a query is received by an engine, it is text that needs to be parsed. The output of this is a tree structure called an Abstract Syntax Tree. An Abstract Syntax Tree, even though it already has the structure of a tree, is tightly tied to the original syntax. Thus, it needs to be converted into a more abstract Intermediate Representation called an expression tree. Every node in this tree corresponds to either an expression or a clause in the JSONiq language, making the design modular. At this point, static typing takes place, meaning that the engine infers the static type of each expression, that is, the most specific type possible expected at runtime (but without actually running the program). Engines like RumbleDB perform their optimization round on this Intermediate Representation. Once optimizations have been done, RumbleDB decides the mode with which each expression and clause will be evaluated (locally, sequentially, in parallel, in DataFrames, etc). The resulting expression tree is then converted to a runtime iterator tree; this is the query plan that will actually be evaluated by the engine.

Dynamic phase

During the dynamic phase, the root of the tree is asked to produce a sequence of items, which is to be the final output of the query as a whole. Then, recursively, each node in the tree will ask its children to produce sequences of items (or tuple streams). Each node then combines the sequences of items (or tuple streams) it receives from its children in order to produce its own sequence of items according to its semantics, and pass it to its parent.

Materialization

When a sequence of items is materialized, it means that an actual List (or Array, or Vector), native to the language of implementation (in this case Java) is stored in local memory, filled with the items.

Streaming

When a sequence of items (or tuple stream) is produced and consumed in a streaming fashion, it means that the items (or tuples) are produced and consumed one by one, iteratively. But the whole sequence of items (or tuple stream) is never stored anywhere. The classical pattern for doing so is known as the Volcano iterator architecture.

However, there are two problems with this: first, it can take a lot of time to go through the entire sequence (imagine doing so with billions or trillions of items). Second, there are expressions or clauses that are not compatible with streaming (consider, for example, the group by or order by clause, which cannot be implemented without materializing their full input).

Parallel execution (with RDDs)

When a sequence becomes unreasonably large, RumbleDB switches to a parallel execution, leveraging Spark capabilities: the sequences of items are passed and processed as RDDs of Item objects.

Parallel execution (with DataFrames)

The RDD implementation supports heterogeneous sequences by leveraging the polymorphism of Item objects. However, this is not efficient in the case that Items in the same sequence happen to have a regular structure. Thus, if the Items in a sequence are valid against a specific schema, or even against an array type or an atomic type, the underlying physical storage in memory relies on Spark DataFrames instead of RDDs.

To summarize, homogeneous sequences of the most common types are stored in DataFrames, and RDDs are used in all other cases.

Parallel execution (with Native SQL)

In some cases (more in every release), RumbleDB is able to evaluate the query using only Spark SQL, compiling JSONiq to SQL directly instead of packing Java runtime iterators in UDFs. This leads to faster execution, because UDFs are slower than a native execution in SQL. This is because, to a SQL optimizer, UDFs are opaque and prevent automatic optimizations.

bigdata - MongoDB

Document stores

Can we rebuild a similar system for collections of trees, in the sense that we drop all three constraints: relational integrity, domain integrity, and atomic integrity? Document stores bring us one step in this direction.

Challenges

Schema on read

In a relational database management system, it is not possible to populate a table without having defined its schema first. However, when encountering such denormalized data, in the real world, there is often no schema. In fact, one of the important features of a system that deals with denormalized data is the ability to discover a schema.

Making trees fit in tables

Several XML elements (or, likewise, several JSON objects) can be naturally mapped to a relational table with several rows if the collection is flat and homogeneous, but semi-structured data can generally be nested and heterogeneous. if we map nested and heterogeneous into a table,such mapping will at best have to be done for every single dataset, and requires in most cases a schema, whereas we are looking for a generic solution for semistructured data with no a-priori schema information.

Document stores

Document stores provide a native database management system for semi-structured data. Document stores work on collections of records, generalizing the way that relational tables can be seen as collections of rows. It is important to understand that document stores are optimized for the typical use cases of many records of small to medium sizes. Typically, a collection can have millions or billions of documents, while each single document weighs no more than 16 MB (or a size in a similar magnitude). Finally, a collection of documents need not have a schema: it is possible to insert random documents that have dissimilar structures with no problem at all. Most document stores, however, do provide the ability to add a schema. Document stores can generally do selection, projection, aggregation and sorting quite well, but many of them are typically not (yet) optimized for joining collections. In fact, often, their language or API does not offer any joining functionality at all, which pushes the burden to reimplement joins in a host language to the users. This is a serious breach of data independence.

Implementations

There is a very large number of products in the document store space for both JSON and XML, let us mention for example MongoDB, CouchDB, ElasticSearch, Cloudant, existDB, ArangoDB, BaseX, MarkLogic, DocumentDB, CosmosDB, and so on. We will focus, as an example, on MongoDB.

Physical storage

Just like the storage format is optimized for tabular data in a relational database management system, it is optimized for tree-like data in a document store. In MongoDB, the format is a binary version of JSON called BSON. BSON is basically based on a sequence of tokens that efficiently encode the JSON constructs found in a document. The immediate benefit of BSON is that it takes less space in storage than JSON stored as a text file: for example, null, true and false literals need four or five bytes in text format at best, while they can be efficiently encoded as single bytes in BSON. Furthermore, BSON supports additional types that JSON does not have, such as dates.

Querying paradigm (CRUD)

The API of MongoDB, like many document stores, is based on the CRUD paradigm. CRUD means Create, Read, Update, Delete and corresponds to low-level primitives similar to those for HBase. MongoDB supports several host languages to query collections via an API. This includes in particular JavaScript and Python, but many other languages are supported via drivers. We will use JavaScript here because this is the native host language of MongoDB. It is important to note that these APIs are not query languages. MongoDB also provides access to the data via a shall called mongo or, newly, mongosh. This is a simple JavaScript interface wrapped around the MongoDB’s node.js driver.

Populating a collection

To create a collection, one can simply insert a document in it, and it will be automatically created if it does not exist.

MongoDB automatically adds to every inserted document a special field called “ id” and associated with a value called an Object ID and with a type of its own.Object IDs are convenient for deleting or updating a specific document with no ambiguity.

Querying a collection

Scan a collection

Asking for the contents of an entire collection is done with a simple find() call on the previous object:db.collection.find().This function does not in fact return the entire collection; rather, it returns some pointer, called a cursor, to the collection; the user can then iterate on the cursor in an imperative fashion in the host language.

Selection

It is possible to perform a selection on a collection by passing a parameter to find() that is a JSON object:
db.collection.find({ “Theory” : “Relativity” }).

A disjunction (OR) uses a special MongoDB keyword, prefixed with a dollar sign:
db.collection.find( { “$or” : [ { “Theory”:”Relativity” }, { “Last”:”Einstein” } ] } ).

MongoDB offers many other keywords, for example for comparison other than equality:
db.collection.find( { “Publications” : { “$gte” : 100 } } )

Projection

Projections are made with the second parameter of this same find() method. This is done in form of a JSON object associating all the desired keys in the projection with the value 1. db.scientists.find( { “Theory” : “Relativity” }, { “First” : 1, “Last”: 1 } ).

It is also possible to project fields away in the same way with 0s, however 1s and 0s cannot be mixed in the projection parameter, except in the specific above case of projecting away the object ID

Counting

Counting can be done by chaining a count() method call:
db.scientists.find( { “Theory” : “Relativity” } ).count().

Sorting

Sorting can be done by chaining a sort() method call.
db.scientists.find( { “Theory” : “Relativity” }, { “First” : 1, “Last” : 1 } ).sort( { “First” : 1, “Name” : -1 } )

1 is for ascending order and -1 for descending order.It is also possible to add limits and offsets to paginate results also by chaining more method calls:
db.scientists.find( { “Theory” : “Relativity” }, { “First” : 1, “Last” : 1 } ).sort( { “First” : 1, “Name” : -1 } ).skip(30).limit(10).

Note that, contrary to intuition, the order of the calls does not matter, as this is really just the creation of a query plan by providing parameters (in any order).

Duplicate elimination

It is possible to obtain all the distinct values for one field with a distinct() call: db.scientists.distinct(“Theory”)

Querying for heterogeneity

Absent fields

Absent fields can be filtered with: db.scientists.find( { “Theory” : null } ).

Filtering for values across types

db.collection.find( { “$or” : [ { “Theory”: “Relativity” }, { “Theory”: 42 }, { “Theory”: null } ] } )

db.scientists.find( { “Theory” : { “$in” : [“Relativity”, 42, null ] } } ).

MongoDB is also able to sort on fields that have heterogeneous data types. It does so by first order by type in some (arbitrary, but documented) order, and then within each type.

Querying for nestedness

Nestedness in MongoDB is handled in several ad-hoc ways for specific use cases.

Values in nested objects

We saw how to select documents based on values associated with a top-level keys. What about values that are not at the top-level, but in nested objects?
The first solution that might come to mind is something like this: db.scientists.find({ “Name” : { “First” : “Albert” } }) However, this query will not have the behavior many would have expected: instead of finding documents that have a value “Albert” associated with the key “First” in an object itself associated with the top-level key ”Name”, this query looks for an exact match of the entire object.
In order to include documents such as above, MongoDB uses a dot syntax: db.scientists.find({ “Name.First” : “Albert” }).

Values in nested arrays

MongoDB allows to filter documents based on whether a nested array contains a specific value, like so: db.scientists.find({ “Theories” : “Special relativity” }).

Deleting objects from a collection

Objects can be deleted from a collection either one at a time with deleteOne(), or several at a time with deleteMany(): db.scientists.deleteMany( { “century” : “15” } ).

Updating objects in a collection

Documents can be updated with updateOne() and updateMany() by providing both a filtering criterion (with the same syntax as the first parameter of find()) and an update to apply. The command looks like so: db.scientists.updateMany( { “Last” : “Einstein” }, { $set : { “Century” : “20” } } ).

The granularity of updates is per document, that is, a single document can be updated by at most one query at the same time.However, within the same collection, several different documents can be modified concurrently by different queries in parallel.

Complex pipelines

For grouping and such more complex queries, MongoDB provides an API in the form of aggregation pipelines.
db.scientists.aggregate( { $match : { “Century” : 20 }}, { $group : { “Year” : “$year”, “Count” : { “$sum” : 1 } } }, { $sort : { “Count” : -1 } }, { $limit : 5 } ).

Limitations of a document store querying API

Simple use cases are straightforward to handle, however more complex use cases require a lot of additional code in the host language, be it JavaScript or Python. An example is that joins must be taken care of by the end user in the host language: the burden of implementing more complex use cases is pushed to the end user.

Architecture

The architecture of MongoDB follows similar principles to what we covered before: scaling out the hardware to multiple machine, and sharding as well as replicating the data.

Sharding collections

Collections in MongoDB can be sharded. Shards are determined by selecting one or several fields. (Lexicographically-ordered) intervals over these fields then determine the shards. The fields used to shard must be organized in a tree index structure.

Replica sets

A replica set is a set of several nodes running the MongoDB server process. The nodes within the same replica set all have a copy of the same data.

Each shard of each collection is assigned to exactly one replica set. Note that this architecture is not the same as that of HDFS, in which the replicas are spread over the entire cluster with no notion of “walls” between replica sets and no two DataNodes having the exact same block replicas.

Write concerns

When writing (be it delete, update or insert) to a collection, more exactly, to a specific shard of a collection, MongoDB checks that a specific minimum number of nodes (within the replica set that is responsible for the shard) have successfully performed the update.

Motivation

A document store, unlike a data lake, manages the physical data layout. This has a cost: the need to import (ETL) data before it is possible to query it, but this cost comes with a nice benefit: index support, just like relational database management systems.

Hash indices

Hash indices are used to optimize point queries and more generally query that select on a specific value of a field. The general idea is that all the values that a field takes in a specific collection can be hashed to an integer. The value, together with pointers to the corresponding documents, is then placed in a physical array in memory, at the position corresponding to this integer.

Tree indices

Hash indices are great and fast, but have limitations: first, they consume space. The more one wants to avoid hash collisions, the larger the array needs to be. But more importantly, hash indices cannot support range queries. This is because hashes do not preserve the order of the values and distribute them “randomly” in the index array structure.

Range queries are supported with tree indices. Instead of an array, tree indices use some sort of tree structure in which they arrange the possible values of the indexed field, such that the values are ordered when traversing the tree in a depth-first-search manner. More precisely, the structure is called a B+-tree. Unlike a simple binary tree, nodes have a large number of children.

Secondary indices

By default, MongoDB always builds a tree index for the id field. Users can request to build hash and tree indices for more fields. These indices are called secondary indices.

The command for building a hash index looks like so: db.scientists.createIndex({ “Name.Last” : “hash” })

And for a tree index (1 means in ascending order, -1 would be descending): db.scientists.createIndex({ “Name.Last” : 1 }).

When are indices useful

When building indices, it is important to get a feeling for whether a query will be faster or not with this index.

index types

Single Field Indexes

By default, all collections have an index on the _id field. You can add additional indexes to speed up important queries and operations. You can create a single-field index on any field in a document, including:Top-level document fields, Embedded documents ,Fields within embedded documents. When you create an index on an embedded document, only queries that specify the entire embedded document use the index. Queries on a specific field within the document do not use the index. In order for a dot notation query to use an index, you must create an index on the specific embedded field you are querying, not the entire embedded object.

Compound Indexes

Compound indexes collect and sort data from two or more fields in each document in a collection. Data is grouped by the first field in the index and then by each subsequent field.

The order of the indexed fields impacts the effectiveness of a compound index. Compound indexes contain references to documents according to the order of the fields in the index. To create efficient compound indexes, follow the ESR (Equality, Sort, Range) rule. The ESR (Equality, Sort, Range) Rule is to place fields that require exact matches first in your index. Sort follows equality matches because the equality matches reduce the number of documents that need to be sorted. Sorting after the equality matches also allows MongoDB to do a non-blocking sort. “Range” filters scan fields. The scan doesn’t require an exact match, which means range filters are loosely bound to index keys. To improve query efficiency, make the range bounds as tight as possible and use equality matches to limit the number of documents that must be scanned. MongoDB cannot do an index sort on the results of a range filter. Place the range filter after the sort predicate so MongoDB can use a non-blocking index sort.

Compound indexes cannot support queries where the sort order does not match the index or the reverse of the index.

Index prefixes are the beginning subsets of indexed fields. Compound indexes support queries on all fields included in the index prefix. Index fields are parsed in order; if a query omits an index prefix, it is unable to use any index fields that follow that prefix.

Multikey Indexes

Multikey indexes collect and sort data from fields containing array values. Multikey indexes improve performance for queries on array fields.

In a compound multikey index, each indexed document can have at most one indexed field whose value is an array.

exercise

data model

Embedded Data Models

In general, embedding provides better performance for read operations, as well as the ability to request and retrieve related data in a single database operation. Embedded data models make it possible to update related data in a single atomic write operation.

Normalized Data Models

Normalized data models describe relationships using references between documents.

references

https://www.mongodb.com/docs/manual/core/data-model-design/

bigdata - spark

Generic Dataflow Management

MapReduce is very simple and generic, but many more complex uses involve not just one, but a sequence of several MapReduce jobs. Furthermore, the MapReduce API is low-level, and most users need higherlevel interfaces, either in the form of APIs or query languages. This is why, after MapReduce, another generation of distributed processing technologies were invented. The most popular one is the open source Apache Spark.

A more general dataflow model

MapReduce consists of a map phase, followed by shuffling, followed by a reduce phase. In fact, the map phase and the reduce phase are not so different: both involve the computation of tasks in parallel on slots.

Resilient distributed datasets

The first idea behind generic dataflow processing is to allow the dataflow to be arranged in any distributed acyclic graph (DAG).

Alt text

All the rectangular nodes in the above graph correspond to intermediate data. They are called resilient distributed datasets, or in short, RDDs.

A major difference with MapReduce, though, is that RDDs need not be collections of pairs. In fact, RDDs can be (ordered) collections of just about anything: strings, dates, integers, objects, arrays, arbitrary JSON values, XML nodes, etc.The only constraint is that the values within the same RDD share the same static type, which does not exclude the use of polymorphism.

The RDD lifecycle

Creation

RDDs undergo a lifecycle. First, they get created. RDDs can be created by reading a dataset from the local disk, or from cloud storage, or from a distributed file system, or from a database source, or directly on the fly from a list of values residing in the memory of the client using Apache Spark

Transformation

Then, RDDs can be transformed into other RDDs. Mapping or reducing, in this model, become two very specific cases of transformations. However, Spark allows for many, many more kinds of transformations. This also includes transformations with several RDDs as input.

Action

RDDs can also undergo a final action leading to making an output persistent. This can be by outputting the contents of an RDD to the local disk, to cloud storage, to a distributed file system, to a database system, or directly to the screen of the user.

Lazy evaluation

Another important aspect of the RDD lifecycle is that the evaluation is lazy: in fact, creations and transformations on their own do nothing. It is only with an action that the entire computation pipeline is put into motion, leading to the computation of all the necessary intermediate RDDs all the way down to the final output corresponding to the action.

Transformations

Unary transformations

Let us start with unary transformations, i.e., those that take a single RDD as their input.

Binary transformations

There are also transformations that take two RDDs as input.

Pair transformations

Spark has transformations specifically tailored for RDDs of key-value pairs.

Actions

Gathering output locally

The collect action downloads all values of an RDD on the client machine and outputs them as a (local) list. It is important to only call this action on an RDD that is known to be small enough (e.g., after filtering) to avoid a memory overflow.

The count action computes (in parallel) the total number of values in the input RDD. This one is safe even for RDDs with billions of values, as it returns a simple integer to the client.

Writing to sharded datasets

There is also an action called saveAsTextFile that can save the entire RDD to a sharded dataset on Cloud storage (S3, Azure blob storage) or a distributed file system (HDFS).
Binary outputs can be saved with saveAsObjectFile.

Actions on Pair RDDs

Physical architecture

Narrow-dependency transformations

In a narrow-dependency transformation, the computation of each output value involves a single input value. In a narrow-dependency transformation, the computation of each output value involves a single input value.

By default, if the transformation is applied to an RDD freshly created from reading a dataset from HDFS, each partition will correspond to an HDFS block. Thus, the computation of the narrow-dependency transformation mostly involves local reads by short-circuiting HDFS.

In fact, the way this works is quite similar to MapReduce: the sequential calls of the transformation function on each input value within a single partition is called a task. And just like MapReduce, the tasks are assigned to slots. These slots correspond to cores within YARN containers. YARN containers used by Spark are called executors. The processing of the tasks is sequential within each executor, and tasks are executed in parallel across executors. And like in MapReduce, a queue of unprocessed tasks is maintained, and everytime a slot is done, it gets a new task. When all tasks have been assigned, the slots who are done become idle and wait for all others to complete.

Chains of narrow-dependency transformations

In fact, on the physical level, the physical calls of the underlying map/filter/etc functions are directly chained on each input value to directly produce the corresponding final, output value, meaning that the intermediate RDDs are not even materialized anywhere and exist purely logically.

Such a chain of narrow-dependency transformations executed efficiently as a single set of tasks is called a stage, which would correspond to what is called a phase in MapReduce.

Physical parameters

Users can parameterize how many executors there are, how many cores there are per executor and how much memory per executor (remember that these then correspond to YARN container requests).

Shuffling

What about wide-dependency transformations? They involve a shuffling of the data over the network, so that the data necessary to compute each partition of the output RDD is physically at the same location.Thus, on the high-level of a job, the physical execution consists of a sequence of stages, with shuffling happening everytime a new stage begins.

Optimizations

Pinning RDDs

Everytime an action is triggered, all the computations of the ”reverse transitive closure” (i.e., all the way up the DAG through the reverted edges) are set into motion. In some cases, several actions might share subgraphs of the DAG. It makes sense, then, to “pin” the intermediate RDD by persisting it.

Pre-partitioning

Shuffle is needed to bring together the data that is needed to jointly contribute to individual output values. If, however, Spark knows that the data is already located where it should be, then shuffling is not needed.

DataFrames in Spark

Data independence

Unlike a relational database that has everything right off-theshelf, with RDDs, the user has to re-implement all the primitives they need. This is a breach of the data independence principle. The developers behind Spark addressed this issue in a subsequent version of Spark by extending the model with support for DataFrames and Spark SQL, bringing back a widely established and popular declarative, high-level language into the ecosystem.

A specific kind of RDD

A DataFrame can be seen as a specific kind of RDD: it is an RDD of rows (equivalently: tuples, records) that has relational integrity, domain integrity, but not necessarily atomic integrity.

Performance impact

DataFrames are stored column-wise in memory, meaning that the values that belong to the same column are stored together. Furthermore, since there is a known schema, the names of the attributes need not be repeated in every single row, as would be the case with raw RDDs. DataFrames are thus considerably more compact in memory than raw RDDs.

Generally, Spark converts Spark SQL to internal DataFrame transformation and eventually to a physical query plan. An optimizer known as Catalyst is then able to find many ways of making the execution faster.

Input formats

Note that Spark automatically infers the schema from discovering the JSON Lines file, which adds a static performance overhead that does not exist for raw RDDs.

DataFrame transformations

It is also possible, instead of Spark SQL, to use a transformation API similar to (but distinct from) the RDD transformation API.

Unlike the RDD transformation API, there is no guarantee that the execution will happen as written, as the optimizer is free to reorganize the actual computations.

DataFrame column types

DataFrames also support the three structured types: arrays, structs, and maps. As a reminder, structs have string keys and arbitrary value types and correspond to generic JSON objects, while in maps, all values have the same type. Thus, structs are more common than maps.

The Spark SQL dialect

Note that both GROUP BY and ORDER BY will trigger a shuffle in the system, even though this can be optimized as the grouping key and the sorting key are the same. The SORT BY clause can sort rows within each partition, but not across partitions, i.e., does not induce any shuffling. The DISTRIBUTE BY clause forces a repartition by putting all rows with the same value (for the specified field(s)) into the same new partition.
Note that the SORT BY clause is used to return the result rows sorted within each partition in the user specified order. When there is more than one partition SORT BY may return result that is partially ordered. This is different than ORDER BY clause which guarantees a total order of the output.

A word of warning must be given on SORT, DISTRIBUTE and CLUSTER clauses: they are, in fact, a breach of data independence, because they expose partitions.

Spark SQL also comes with language features to deal with nested arrays and objects. First, nested arrays can be navigated with the explode() function. Lateral views are more powerful and generic than just an explode() because they give more control, and they can also be used to go down several levels of nesting. A lateral view can be intuitively described this way: the array mentioned in the LATERAL VIEW clause is turned into a second, virtual table with the rest of the original table is joined. The other clauses can then refer to columns in both the original and second, virtual table.

exercise

RDD

Why RDD should be immutable and lazy: immutable is for lineage.
Why need RDD partitioning: parallel computing and reduce shuffling.

DataFrame API

For nested array,use array_contains.

spark SQL

In jupyter notebook, we can use “%load_ext sparksql_magic” directly.

reference

https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/

bigdata - YARN

Resource management

The first version of MapReduce is inefficient in several respects. For this reason, the architecture was fundamentally changed by adding a resource management layer to the stack, adding one more level of decoupling between scheduling and monitoring. A resource management system, here YARN, is a very important building block not only for a better MapReduce, but also for many other technologies running on a cluster.

Limitations of MapReduce in its first version

The JobTracker has a lot on its shoulders! It has to deal with resource management, scheduling, monitoring, the job lifecycle, and fault tolerance.The first consequence of this is scalability. The second consequence is the bottleneck that this introduces at the JobTracker level, which slows down the entire system. The third issue is that it is difficult to design a system that do many things well. The fourth issue is that resources are statically allocated to the Map or the Reduce phase, meaning that parts of the cluster remain idle during both phases. The fifth issue is the lack of fungibility between the Map phase and the Reduce phase.

YARN

General architecture

YARN means Yet Another Resource manager. It was introduced as an additional layer that specifically handles the management of CPU and memory resources in the cluster.

YARN, unsurprisingly, is based on a centralized architecture in which the coordinator node is called the ResourceManager, and the worker nodes are called NodeManagers. NodeManagers furthermore provide slots (equipped with exclusively allocated CPU and memory) known as containers.

When a new application is launched, the ResourceManager assigns one of the container to act as the ApplicationMaster which will take care of running the application. The ApplicationMaster can then communicate with the ResourceManager in order to book and use more containers in order to run jobs. This is a fundamental change from the initial MapReduce architecture, in which the JobTracker was also taking care of running the MapReduce job.

Alt text

Thus, YARN cleanly separates between the general management of resources and bootstrapping new applications, which remains centralized on the coordinator node, and monitoring the job lifecycle, which is now delegated to one or more ApplicationMasters running concurrently. This means, in particular, that several applications can run concurrently in the same cluster. This ability, known as multi-tenancy, is very important for large companies or universities in order to optimize the use of their resources.

Resource management

In resource management, one abstracts away from hardware by distinguish between four specific resources used in a distributed database system. These four resources are: • Memory • CPU • Disk I/O • Network I/O.
ApplicationMasters can request and release containers at any time, dynamically. A container request is typically made by the ApplicationMasters with a specific demand. If the request is granted by the ResourceManager fully or partially, this is done indirectly by signing and issuing a container token to the ApplicationMaster that acts as proof that the resource was granted. The ApplicationMaster can then connect to the allocated NodeManager and send the token. The NodeManager will then check the validity of the token and provide the memory and CPU granted by the ResourceManager. The ApplicationMaster ships the code (e.g., as a jar file) as well as parameters, which then runs as a process with exclusive use of this memory and CPU.

Job lifecycle management and fault tolerance

Version 2 of MapReduce works on top of YARN by leaving the job lifecycle management to an ApplicationMaster. The ApplicationMaster requests containers for the Map phase, and sets these containers up to execute Map tasks. As soon as a container is done executing a Map task, the ApplicationMaster will assign a new Map task to this container from the remaining queue, until no Map tasks are left.

a container in the Map phase can contain several Map slots. Sharing memory and containers across slots in this way improves the overall efficiency, because setting up a container adds latency. It is thus more efficient to allocate 10 containers of each 4 cores, compared to 40 containers of each 1 core.

In the event of a container crashing during the Map phase, the ApplicationMaster will handle this by re-requesting containers and restarting the failed tasks. In the case that some data is lost in the Reduce phase, it is possible that the entire job must be restarted, because this the only way to recreate the intermediate data is to re-execute the Map tasks.

Scheduling

The ResourceManager keeps track of the list of available NodeManagers (who can dynamically come and go) and their status. Just like in HDFS, NodeManagers send periodic heartbeats to the ResourceManager to give a sign of life. The ResourceManager also offers an interface so that ApplicationMasters can register and send container requests. ApplicationMasters also send periodic heartbeats to the ResourceManager. It is important to understand that, unlike the JobTracker, the ResourceManager does not monitor tasks, and will not restart slots upon failure. This job is left to the ApplicationMasters.

Scheduling strategies

FIFO scheduling

In FIFO scheduling, there is one application at a time running on the entire cluster. When it is done, the next application runs again on the entire cluster, and so on.

Capacity scheduling

In capacity scheduling, the resources of the cluster are partitioned into several sub-clusters of various sizes. Each one of these sub-clusters has its own queue of applications running in a FIFO fashion within this queue. Capacity scheduling also exists in a more “dynamic flavour” in which, when a sub-cluster is not currently used, its resources can be temporarily lent to the other sub-clusters. This is also in the spirit of usage maximization.

Fair scheduling

Fair scheduling involves more complex algorithms that attempt to allocate resources in a way fair to all users of the cluster and based on the share they are normally entitled to. fair scheduling consists on making dynamic decisions regarding which requests get granted and which requests have to wait. Fair scheduling combines several ways to compute cluster shares: Steady fair share: this is the share of the cluster officially allocated to each department. Instantaneous fair share: this is the fair share that a department should ideally be allocated (according to economic and game theory considerations) at any point in time. This is a dynamic number that changes constantly, based on departments being idle: if a department is idle, then the instantaneous fair share of others department becomes higher than their steady fair shares. Current share: this is the actual share of the cluster that a department effectively uses at any point in time. This is highly dynamic. The current share does not necessarily match the instantaneous fair share because there is some inertia in the process: a department might be using more resources while another is idle. When the other department later stops being idle, these resources are not immediately withdrawn from the first department; rather, the first department will stop getting more resources, and the second department will gradually recover these resources as they get released by the first department.

The easiest case of fair scheduling is when only one resource is considered: for example, only CPU cores, or only memory. Things become more complicated when several resources are considered, in practice both CPU cores and memory. This problem was solved game-theoretically with the Dominant Resource Fairness algorithm. The two (or more) dimensions are projected again to a single dimension by looking at the dominant resource for each user.

exercise

motivation

improve hadoop 1.x by adding a layer YARN to separate resource management from data processing.

architecture

NodeMananger is generalized taskTracker.
On ResourceManager, there is an ApplicationManager rsponsible for admiting new jobs and collecting finished jobs and logs.
A scheduler has a global view to assign an ApplicationMaster. The applicationMaster will compute how many resources needed and send a request to the scheduler.

fault tolerance is provided by applicationMaster+NodeManager+HDFS.

bigdata - Map Reduce

Patterns of large-scale query processing

Textual input

We saw that the data we want to query can take many forms. First, it can be billions of lines of text. It can also be plenty of CSV lines. Some other formats (e.g., Parquet, …) can be binary. We also encountered HFiles in Chapter 6, which are lists of keyvalue pairs. In fact, Hadoop has another such kind of key-value-based format called Sequence File, which is simply a list of key-values, but not necessarily sorted by key (although ordered) and with keys not necessarily unique.

Shards

There are several practical motivations for the many-files pattern even in HDFS.

Querying pattern

This is the motivation behind the standard MapReduce pattern: a map-like phase on the entire input, then a shuffle phase on the intermediate data, then another map-like phase (called reduce) producing the entire output

MapReduce Model

In MapReduce, the input data, intermediate data, and output data are all made of a large collection of key-value pairs (with the keys not necessarily unique, and not necessarily sorted by key). The types of the keys and values are known at compile-time (statically), and they do not need to be the same across all three collections.

MapReduce architecture

On a cluster, the architecture is centralized, just like for HDFS and HBase. In the original version of MapReduce, the main node is called JobTracker, and the worker nodes are called TaskTrackers.

In fact, the JobTracker typically runs on the same machine as the NameNode (and HMaster) and the TaskTrackers on the same machines as the DataNodes (and RegionServers). This is called “bring the query to the data.”

As the map phase progresses, there is a risk that the memory becomes full. But we have seen this before with HBase: the intermediate pairs on that machine are then sorted by key and flushed to the disk to a Sequence File. And as more flushes happen, these Sequence Files can be compacted to less of them, very similarly to HBase’s Log-Structured Merge Trees. When the map phase is over, each TaskTracker runs an HTTP server listening for connections, so that they can connect to each other and ship the intermediate data over to create the intermediate partitions ensuring that the same keys are on the same machines.This is the phase called shuffling. Then, the reduce phase can start. When the reduce phase is completed, each output partition will be output to a shard (as we saw, a file named incrementally) in the output destination (HDFS, S3, etc) and in the desired format.

Note that shuffling can start before the map phase is over, but the reduce phase can only start after the map phase is over.

MapReduce input and output formats

Impedance mismatch

MapReduce can read its input from files lying in a data lake as well as directly from a database system such as HBase or a relational database management system. MapReduce only reads and writes lists of key-value pairs, where keys may be duplicates and need not appear in order. However, the inputs we considered are not key-value pairs. So we need an additional mechanism that allows MapReduce to interpret this input as key-value pairs. For tables, whereas relational or in a wide column stores, this is relatively easy: indeed, tables have primary keys, consisting of either a single column or multiple columns. Thus, each tuple can be interpreted as a key-value pair.

Mapping files to pairs

How do we read a (possibly huge) text file as a list of key-value pairs? The most natural way to do so is to turn each line of text in a key value pair1: the value is the string corresponding to the entire line, while the key is an integer that expresses the position (as a number of characters), or offset, at which the line starts in the current file being read.

A few examples

Counting words

Alt text
Alt text

Selecting

This is something easily done by having a map function that outputs a subset of its input, based on some predicate provided by the user.
Alt text
Here we notice that the output of the map phase already gives us the desired result; we still need to provide a reduce function, which is taken trivially as the identity function. This is not unusual (and there are also examples where the map function is trivial, and the reduce function is doing the actual processing).

Projecting

The map function can project this object to an object with less attributes:
Alt text

MapReduce and the relational algebra

As an exercise, try to figure out how to implement a GROUP BY clause and an ORDER BY clause. What about the HAVING clause? Naturally, executing these queries directly in MapReduce is very cumbersome because of the low-level code we need to write.

Combine functions and optimization

In addition to the map function and the reduce function, the user can supply a combine function. This combine function can then be called by the system during the map phase as many times as it sees fit to “compress” the intermediate key-value pairs. Strategically, the combine function is likely to be called at every flush of key-value pairs to a Sequence File on disk, and at every compaction of several Sequence Files into one.

However, there is no guarantee that the combine function will be called at all, and there is also no guarantee on how many times it will be called. Thus, if the user provides a combine function, it is important that they think carefully about a combine function that does not affect the correctness of the output data.

In fact, in most of the cases, the combine function will be identical to the reduce function, which is generally possible if the intermediate key-value pairs have the same type as the output key-value pairs, and the reduce function is both associative and commutative.

MapReduce programming API

Mapper classes

In Java, the user needs to define a so-called Mapper class that contains the map function, and a Reducer class that contains the reduce function. A map function takes in particular a key and a value. Note that it outputs key-value pairs via the call of the write method on the context, rather than with a return statement.

Reducer classes

A reduce function takes in particular a key and a list of values. Note that it outputs key-value pairs via the call of the write method on the context, rather than with a return statement.

Running the job

Finally, a MapReduce job can be created and invoked by supplying a Mapper and Reducer instance to the job. A combine function can also be supplied with the setCombinerClass method. It is also possible to use Python rather than Java, via the socalled Streaming API. The Streaming API is the general way to invoke MapReduce jobs in other languages than Java.

Using correct terminology

Functions

A map function is a mathematical, or programmed, function that takes one input key-value pair and returns zero, one or more intermediate key-value pairs.

A reduce function is a mathematical, or programmed, function that takes one or more intermediate key-value pairs and returns zero, one or more output key-value pairs.

A combine function is a mathematical, or programmed, function that takes one or more intermediate key-value pairs and returns zero, one or more intermediate key-value pairs.

Note that the combine function is an optional local aggregation step that occurs before shuffling and sorting, and its purpose is to reduce the amount of data that needs to be transferred to the reducers. The reduce function, on the other hand, performs the final aggregation and processing based on keys.

Tasks

A map task is an assignment (or “homework”, or “TODO”) that consists in a (sequential) series of calls of the map function on a subset of the input.

A reduce task is an assignment that consists in a (sequential) series of calls of the reduce function on a subset of the intermediate input.

We insist that the calls within a task are sequential, meaning that there is no parallelism at all within a task.

There is no such thing as a combine task. Calls of the combine function are not planned as a task, but is called ad-hoc during flushing and compaction.

Slots

The map tasks are processed thanks to compute and memory resources (CPU and RAM). These resources are called map slots. One map slot corresponds to one CPU core and some allocated memory. The number of map slots is limited by the number of available cores. Each map slot then processes one map task at a time, sequentially. The resources used to process reduce tasks are called reduce slots. So, there is no parallelism either within one map slot, or one reduce slot. In fact, parallelism happens across several slots.

Phases

The map phase thus consists of several map slots processing map tasks in parallel.

blocks vs. splits

HDFS blocks have a size of (at most) 128 MB. In every file, all blocks but the last one have a size of exactly 128 MB. Splits, however, only contain full records: a key-value pair will only belong to one split (and thus be processed by one map task).

bigdata - data model

data model

A data model is an abstract view over the data that hides the way it is stored physically.

The JSON Information Set

The appropriate abstraction for any JSON document is a tree. The nodes of that tree, which are JSON logical values, are naturally of six possible kinds: the six syntactic building blocks of JSON.

These are the four leaves corresponding to atomic values:Strings, Numbers,Booleans, Nulls. As well as two intermediate nodes: Objects, Arrays. These nodes are generally called information items and form the logical building blocks of the model, called information set.

The XML Information Set

A fundamental difference between JSON trees and XML trees is that for JSON, the labels (object keys) are on the edges connecting an object information item to each one of its children information items. In XML, the labels (these would be element and attribute names) are on the nodes (information items) directly.

In XML, there are many more information items: Document information items, Element, Attribute, Character, Comment, Processing instruction, Name space, Unexpanded entity reference, DTD, Unparsed entity, Notation. Here we only focus on documents, elements, attributes, and characters.

Validation

Once documents, JSON or XML, have been parsed and logically abstracted as a tree in memory, the natural next step is to check for further structural constraints.

In a relational database, the schema of a table is defined before any data is populated into the table. Thus, the data in the table is guaranteed, at all times, to fulfil all the constraints of the schema. The schema was enforced at write time (schema on write). A collection of JSON and XML documents out there can exist without any schema and contain arbitrary structures. Validation happens “ex post,” that is, only after reading the data (schema on read).

JSON and XML documents undergo two steps: a well-formedness check: attempt to parse the document and construct a tree representation in memory;(if the first step succeeded) a validation check given a specific schema. Note that, unlike well-formedness, validation is schema dependent: a given well-formed document can be valid against schema A and invalid against schema B.

Item types

A fundamental aspect of validation is the type system. A well-designed type system, in turn, allows for storing the data in much more efficient, binary formats tailored to the model.

Atomic types

Atomic types correspond to the leaf of a tree data model: these are types that do not contain any further nestedness. The kinds of atomic types available are also relatively standard and common to most technologies. Also, all atomic types have in common that they have a logical value space and a lexical value space.

An atomic type also has a (not necessarily injective) mapping from its lexical value space to its logical value space (e.g., mapping the hexadecimal literal x10 to the mathematical integer sixteen).
Atomic types can be in a subtype relationship: a type is a subtype of another type if its logical value space is a subset of the latter.

Strings

In “pure computer science” textbooks, strings are often presented as structured values rather than as atomic values because of their complexity on the physical layer. However, for us data scientists, strings are atomic values.

Numbers: integers

Integers correspond to finite cardinalities (counting) as well as their negative counterparts. In older programming languages, support for integers used to be bounded. However, in modern databases, it is customary to support unbounded integers. Engines can optimize computations for small integers, but might become less efficient with very large integers.

Numbers: decimals

Decimals correspond to real numbers that can be written as a finite sequence of digits in base 10, with an optional decimal period.

Numbers: floating-point

Support for the entire decimal value space can be costly in performance. In order to address this issue, a floating-point standard (IEEE 754) was invented and is still very popular today.

Floating-point numbers are limited both in precision and magnitude (both upper and lower) in order to fit on 32 bits (float) or 64 bits (double). Floats have about 7 digits of precision and their absolute value can be between roughly 10^−37 and 10^37, while doubles have 15 digits of precision and their absolute value can be between roughly 10^−307 and 10^308.

Booleans

The logical value space for the Boolean type is made of two values: true and false as in NoSQL queries, two-valued logic is typically assumed.

Dates and times

Dates are commonly using the Gregorian calendar (with some technologies possibly supporting more) with a year (BC or AD), a month and a day of the month. Times are expressed in the hexagesimal (60) basis with hours, minutes, seconds, where the seconds commonly go all the way to microseconds (six digits after the decimal period). Datetimes are expressed with a year, a month, a day of the month, hours, minutes and (decimal) seconds.

Timestamp values are typically stored as longs (64-bit integers) expressing the number of milliseconds elapsed since January 1, 1970 by convention.

XML Schema, JSound and JSONiq follow the ISO 8601 standard, where lexical values look like so (with many parts optional): 2022-08-07T14:18:00.123456+02:00.

Durations

The lexical representation can vary, but there is a standard defined by ISO 8601 as well, starting with a P and prefixing sub-day parts with a T.
4 days, 3 hours, 2 minutes and 1.123456 seconds: P4DT3H2M1.123456S.

Binary data

Binary data is, logically, simply a sequence of bytes. There are two main lexical representations used in data: hexadecimal and base64. Hexadecimal expresses the data with two hexadecimal digits per byte. Base 64, formally, does the same but in the base 64, which “wastes” less lexical space in the text. It does so by encoding the bits six by six, encoding each sequence of six bits with one base-64 digit.

Null

A schema can either allow, or disallow the null value.
XML also supports null values, but calls them “nil” and does so with a special attribute and no content rather than with a lexical representation

Structured types

Lists

Lists correspond to JSON arrays and are ordered sequences of (atomic or structured) values.

Records

Records, or structs, correspond to JSON objects and are maps from strings to values.

Maps

Maps (not be confused with records, which are similar) are maps from any atomic value to any value, i.e., generalize objects to keys that are not necessarily strings (e.g., numbers, dates, etc).
With a schema, it is possible to restrict the type of the keys, as well as the type of the values. However, unlike records, the type of the values must be the same for all keys.

Sets

Sets are like lists, but without any specific ordering, and without duplicate values.

XML elements and attributes

XML Schema stands apart from most other technologies and formats, in that it does not offer specific support for records and maps; it offers some limited support for lists, but considers them to be simple types, which are “inbetween” atomic types and structured types. n XML Schema, structure is obtained, instead, with elements and attributes, and the machinery for elements and attributes is highly specific to XML.

Type names

Alt text

Sequence types

Cardinality

Many type system give options regarding the number of occurrences of items in a sequence.

Collections vs. nested lists

A collection of items is on the outer level, and can be massively large (billions, trillions of items).

A list (or array) of items, however, usually refers to a nested structure, for example an array nested inside a document or object. Such lists of items are usually restricted in size for reasons of performance and scalability.

It is thus important to keep this subtle difference in mind, in particular, do not confuse a collection of integers with a collection that contains a single array of integers.

JSON validation

Validating flat objects

JSound is a schema language that was designed to be simple for 80% of the cases, making it particularly suitable in a teaching environment.It is independent of any programming language.JSON Schema is another technology for validating JSON documents. The available JSON Schema types are string, number, integer, boolean, null, array and object.

An example for a json document is like:
{ “name” : “Einstein”, “first” : “Albert”, “age” : 142 }
The JSound and the JSON Schema are as follows:
{ “name” : “string”, “first” : “string”, “age” : “integer” }

{ “type” : “object”, “properties” : { “name” : “string”, “first” : “string”, “age” : “number” } }.

The type system of JSON Schema is thus less rich than that of JSound, but extra checks can be done with so-called formats, which include date, time, duration, email, and so on including generic regular expressions.

Requiring the presence of a key

It is possible to require the presence of a key by adding an exclamation mark in JSound. The equivalent JSON Schema uses a “required” property associated with the list of required keys to express the same.

Open and closed object types

In the JSound compact syntax, extra keys are forbidden. The schema is said to be closed. There are ways to define JSound schemas to allow arbitrary additional keys (open schemas), with a more verbose syntax. Unlike JSound, in JSON Schema, extra properties are allowed by default. JSON Schema then allows to forbid extra properties with the “additionalProperties” property.

Nested structures

{ “numbers” : [ “integer” ] }
Every schema can be given a name, turning into a type.
JSound allows for the definition not only of arbitrary array and object types, but also user-defined types.

Primary key constraints, allowing for null, default values

There are a few more features available in the compact JSound syntax (not in JSON Schema) with the special characters @, ? and =. The question mark (?) allows for null values. The arobase (@) indicates that one or more fields are primary keys for a list of objects that are members of the same array. The equal sign (=) is used to indicate a default value that is automatically populated if the value is absent.

Note that validation only checks whether lexical values are part of the type’s lexical space.

Accepting any values

Accepting any values in JSound can be done with the type “item”, which contains all possible values. In JSON Schema, in order to declare a field to accept any values, you can use either true or an empty object in lieu of the type. JSON Schema additionally allows to use false to forbid a field.

Type unions

In JSON Schema, it is also possible to combine validation checks with Boolean combinations. JSound schema allows defining unions of types with the vertical bar inside type strings, like so: “string|array”.

Type conjunction, exclusive or, negation

In JSON Schema only (not in JSound), it is also possible to do a conjunction (logical and), as well as exclusive or (xor), as well as negation.

XML validation

Simple types

All elements in an XML Schema are in a namespace, the XML Schema namespace. It is recommended to stick to the prefix xs, or xsd, which is also quite popular. We do not recommend declaring the XML Schema namespace as a default namespace, because it can create confusion in several respects.

The list of predefined atomic types is the same as in JSound, except that in XML Schema, all these predefined types live in the XML Schema namespace and thus bear the prefix xs as well.

Builtin types

XML Schema allows you to define user-defined atomic types, for example restricting the length of a string to 3 for airport codes, and then use it with an element.

Complex types

It is also possible to constrain structures and the element/attribute/text hierarchy with complex types applying to element nodes.
There are four main kinds of complex types:• complex content: there can be nested elements, but there can be no text nodes as direct children. • simple content: there are no nested elements: just text, but attributes are also possible. • empty content: there are neither nested elements nor text, but attributes are also possible. • mixed content: there can be nested elements and it can be intermixed with text as well.

Attribute declarations

Finally, all types of content can additionally contain attributes. Attributes always have a simple type.

Anonymous types

Finally, it is not mandatory to give a name to all types. Be careful: if there is neither a type attribute nor a nested type declaration, then anything is allowed!

Miscellaneous

Finally, XML Schema documents are themselves XML documents, and can thus be validated against a “schema or schemas”, itself written as an XML Schema.This schema has the wonderful property of being valid against itself.

Data frames

Heterogeneous, nested datasets

The beauty of the JSON data model is that, unlike the relational model and the CSV syntax, it supports nested, heterogeneous datasets, while also supporting as a particular case flat, homogeneous datasets.

Dataframe visuals

There is a particular subclass of semi-structured datasets that are very interesting: valid datasets, which are collections of JSON objects valid against a common schema, with some requirements on the considered schemas. The datasets belonging to this particular subclass are called data frames, or dataframes.

Specifically, for the dataset to qualify as a data frame, firstly, we forbid schemas that allow for open object types. secondly, we forbid schemas that allow for object or array values to be too permissive and allow any values. We, however, include schemas that allow for null values and/or absent values. Relational tables are data frames, while data frames are not necessarily relational tables. Data frames are a generalization of (normalized) relational tables allowing for (organized and structured) nestedness.

exercies

complextType cannot contain character by default but with mixed=”true” it can.

protobuf

convert json-like data to columnar representation(why we want this: make it more efficient to get relevant data rather than get the whole table).

convert the columnar representation back to the original format. Replace the missing field with NULL. It’s a “lossless” conversion.

Dremel

optional: 0 or 1. repeated: 1 or more.