bigdata - MongoDB
Document stores
Can we rebuild a similar system for collections of trees, in the sense that we drop all three constraints: relational integrity, domain integrity, and atomic integrity? Document stores bring us one step in this direction.
Challenges
Schema on read
In a relational database management system, it is not possible to populate a table without having defined its schema first. However, when encountering such denormalized data, in the real world, there is often no schema. In fact, one of the important features of a system that deals with denormalized data is the ability to discover a schema.
Making trees fit in tables
Several XML elements (or, likewise, several JSON objects) can be naturally mapped to a relational table with several rows if the collection is flat and homogeneous, but semi-structured data can generally be nested and heterogeneous. if we map nested and heterogeneous into a table,such mapping will at best have to be done for every single dataset, and requires in most cases a schema, whereas we are looking for a generic solution for semistructured data with no a-priori schema information.
Document stores
Document stores provide a native database management system for semi-structured data. Document stores work on collections of records, generalizing the way that relational tables can be seen as collections of rows. It is important to understand that document stores are optimized for the typical use cases of many records of small to medium sizes. Typically, a collection can have millions or billions of documents, while each single document weighs no more than 16 MB (or a size in a similar magnitude). Finally, a collection of documents need not have a schema: it is possible to insert random documents that have dissimilar structures with no problem at all. Most document stores, however, do provide the ability to add a schema. Document stores can generally do selection, projection, aggregation and sorting quite well, but many of them are typically not (yet) optimized for joining collections. In fact, often, their language or API does not offer any joining functionality at all, which pushes the burden to reimplement joins in a host language to the users. This is a serious breach of data independence.
Implementations
There is a very large number of products in the document store space for both JSON and XML, let us mention for example MongoDB, CouchDB, ElasticSearch, Cloudant, existDB, ArangoDB, BaseX, MarkLogic, DocumentDB, CosmosDB, and so on. We will focus, as an example, on MongoDB.
Physical storage
Just like the storage format is optimized for tabular data in a relational database management system, it is optimized for tree-like data in a document store. In MongoDB, the format is a binary version of JSON called BSON. BSON is basically based on a sequence of tokens that efficiently encode the JSON constructs found in a document. The immediate benefit of BSON is that it takes less space in storage than JSON stored as a text file: for example, null, true and false literals need four or five bytes in text format at best, while they can be efficiently encoded as single bytes in BSON. Furthermore, BSON supports additional types that JSON does not have, such as dates.
Querying paradigm (CRUD)
The API of MongoDB, like many document stores, is based on the CRUD paradigm. CRUD means Create, Read, Update, Delete and corresponds to low-level primitives similar to those for HBase. MongoDB supports several host languages to query collections via an API. This includes in particular JavaScript and Python, but many other languages are supported via drivers. We will use JavaScript here because this is the native host language of MongoDB. It is important to note that these APIs are not query languages. MongoDB also provides access to the data via a shall called mongo or, newly, mongosh. This is a simple JavaScript interface wrapped around the MongoDB’s node.js driver.
Populating a collection
To create a collection, one can simply insert a document in it, and it will be automatically created if it does not exist.
MongoDB automatically adds to every inserted document a special field called “ id” and associated with a value called an Object ID and with a type of its own.Object IDs are convenient for deleting or updating a specific document with no ambiguity.
Querying a collection
Scan a collection
Asking for the contents of an entire collection is done with a simple find() call on the previous object:db.collection.find().This function does not in fact return the entire collection; rather, it returns some pointer, called a cursor, to the collection; the user can then iterate on the cursor in an imperative fashion in the host language.
Selection
It is possible to perform a selection on a collection by passing a parameter to find() that is a JSON object:
db.collection.find({ “Theory” : “Relativity” }).
A disjunction (OR) uses a special MongoDB keyword, prefixed with a dollar sign:
db.collection.find( { “$or” : [ { “Theory”:”Relativity” }, { “Last”:”Einstein” } ] } ).
MongoDB offers many other keywords, for example for comparison other than equality:
db.collection.find( { “Publications” : { “$gte” : 100 } } )
Projection
Projections are made with the second parameter of this same find() method. This is done in form of a JSON object associating all the desired keys in the projection with the value 1. db.scientists.find( { “Theory” : “Relativity” }, { “First” : 1, “Last”: 1 } ).
It is also possible to project fields away in the same way with 0s, however 1s and 0s cannot be mixed in the projection parameter, except in the specific above case of projecting away the object ID
Counting
Counting can be done by chaining a count() method call:
db.scientists.find( { “Theory” : “Relativity” } ).count().
Sorting
Sorting can be done by chaining a sort() method call.
db.scientists.find( { “Theory” : “Relativity” }, { “First” : 1, “Last” : 1 } ).sort( { “First” : 1, “Name” : -1 } )
1 is for ascending order and -1 for descending order.It is also possible to add limits and offsets to paginate results also by chaining more method calls:
db.scientists.find( { “Theory” : “Relativity” }, { “First” : 1, “Last” : 1 } ).sort( { “First” : 1, “Name” : -1 } ).skip(30).limit(10).
Note that, contrary to intuition, the order of the calls does not matter, as this is really just the creation of a query plan by providing parameters (in any order).
Duplicate elimination
It is possible to obtain all the distinct values for one field with a distinct() call: db.scientists.distinct(“Theory”)
Querying for heterogeneity
Absent fields
Absent fields can be filtered with: db.scientists.find( { “Theory” : null } ).
Filtering for values across types
db.collection.find( { “$or” : [ { “Theory”: “Relativity” }, { “Theory”: 42 }, { “Theory”: null } ] } )
db.scientists.find( { “Theory” : { “$in” : [“Relativity”, 42, null ] } } ).
MongoDB is also able to sort on fields that have heterogeneous data types. It does so by first order by type in some (arbitrary, but documented) order, and then within each type.
Querying for nestedness
Nestedness in MongoDB is handled in several ad-hoc ways for specific use cases.
Values in nested objects
We saw how to select documents based on values associated with a top-level keys. What about values that are not at the top-level, but in nested objects?
The first solution that might come to mind is something like this: db.scientists.find({ “Name” : { “First” : “Albert” } }) However, this query will not have the behavior many would have expected: instead of finding documents that have a value “Albert” associated with the key “First” in an object itself associated with the top-level key ”Name”, this query looks for an exact match of the entire object.
In order to include documents such as above, MongoDB uses a dot syntax: db.scientists.find({ “Name.First” : “Albert” }).
Values in nested arrays
MongoDB allows to filter documents based on whether a nested array contains a specific value, like so: db.scientists.find({ “Theories” : “Special relativity” }).
Deleting objects from a collection
Objects can be deleted from a collection either one at a time with deleteOne(), or several at a time with deleteMany(): db.scientists.deleteMany( { “century” : “15” } ).
Updating objects in a collection
Documents can be updated with updateOne() and updateMany() by providing both a filtering criterion (with the same syntax as the first parameter of find()) and an update to apply. The command looks like so: db.scientists.updateMany( { “Last” : “Einstein” }, { $set : { “Century” : “20” } } ).
The granularity of updates is per document, that is, a single document can be updated by at most one query at the same time.However, within the same collection, several different documents can be modified concurrently by different queries in parallel.
Complex pipelines
For grouping and such more complex queries, MongoDB provides an API in the form of aggregation pipelines.
db.scientists.aggregate( { $match : { “Century” : 20 }}, { $group : { “Year” : “$year”, “Count” : { “$sum” : 1 } } }, { $sort : { “Count” : -1 } }, { $limit : 5 } ).
Limitations of a document store querying API
Simple use cases are straightforward to handle, however more complex use cases require a lot of additional code in the host language, be it JavaScript or Python. An example is that joins must be taken care of by the end user in the host language: the burden of implementing more complex use cases is pushed to the end user.
Architecture
The architecture of MongoDB follows similar principles to what we covered before: scaling out the hardware to multiple machine, and sharding as well as replicating the data.
Collections in MongoDB can be sharded. Shards are determined by selecting one or several fields. (Lexicographically-ordered) intervals over these fields then determine the shards. The fields used to shard must be organized in a tree index structure.
Replica sets
A replica set is a set of several nodes running the MongoDB server process. The nodes within the same replica set all have a copy of the same data.
Each shard of each collection is assigned to exactly one replica set. Note that this architecture is not the same as that of HDFS, in which the replicas are spread over the entire cluster with no notion of “walls” between replica sets and no two DataNodes having the exact same block replicas.
Write concerns
When writing (be it delete, update or insert) to a collection, more exactly, to a specific shard of a collection, MongoDB checks that a specific minimum number of nodes (within the replica set that is responsible for the shard) have successfully performed the update.
Motivation
A document store, unlike a data lake, manages the physical data layout. This has a cost: the need to import (ETL) data before it is possible to query it, but this cost comes with a nice benefit: index support, just like relational database management systems.
Hash indices
Hash indices are used to optimize point queries and more generally query that select on a specific value of a field. The general idea is that all the values that a field takes in a specific collection can be hashed to an integer. The value, together with pointers to the corresponding documents, is then placed in a physical array in memory, at the position corresponding to this integer.
Tree indices
Hash indices are great and fast, but have limitations: first, they consume space. The more one wants to avoid hash collisions, the larger the array needs to be. But more importantly, hash indices cannot support range queries. This is because hashes do not preserve the order of the values and distribute them “randomly” in the index array structure.
Range queries are supported with tree indices. Instead of an array, tree indices use some sort of tree structure in which they arrange the possible values of the indexed field, such that the values are ordered when traversing the tree in a depth-first-search manner. More precisely, the structure is called a B+-tree. Unlike a simple binary tree, nodes have a large number of children.
Secondary indices
By default, MongoDB always builds a tree index for the id field. Users can request to build hash and tree indices for more fields. These indices are called secondary indices.
The command for building a hash index looks like so: db.scientists.createIndex({ “Name.Last” : “hash” })
And for a tree index (1 means in ascending order, -1 would be descending): db.scientists.createIndex({ “Name.Last” : 1 }).
When are indices useful
When building indices, it is important to get a feeling for whether a query will be faster or not with this index.
index types
Single Field Indexes
By default, all collections have an index on the _id field. You can add additional indexes to speed up important queries and operations. You can create a single-field index on any field in a document, including:Top-level document fields, Embedded documents ,Fields within embedded documents. When you create an index on an embedded document, only queries that specify the entire embedded document use the index. Queries on a specific field within the document do not use the index. In order for a dot notation query to use an index, you must create an index on the specific embedded field you are querying, not the entire embedded object.
Compound Indexes
Compound indexes collect and sort data from two or more fields in each document in a collection. Data is grouped by the first field in the index and then by each subsequent field.
The order of the indexed fields impacts the effectiveness of a compound index. Compound indexes contain references to documents according to the order of the fields in the index. To create efficient compound indexes, follow the ESR (Equality, Sort, Range) rule. The ESR (Equality, Sort, Range) Rule is to place fields that require exact matches first in your index. Sort follows equality matches because the equality matches reduce the number of documents that need to be sorted. Sorting after the equality matches also allows MongoDB to do a non-blocking sort. “Range” filters scan fields. The scan doesn’t require an exact match, which means range filters are loosely bound to index keys. To improve query efficiency, make the range bounds as tight as possible and use equality matches to limit the number of documents that must be scanned. MongoDB cannot do an index sort on the results of a range filter. Place the range filter after the sort predicate so MongoDB can use a non-blocking index sort.
Compound indexes cannot support queries where the sort order does not match the index or the reverse of the index.
Index prefixes are the beginning subsets of indexed fields. Compound indexes support queries on all fields included in the index prefix. Index fields are parsed in order; if a query omits an index prefix, it is unable to use any index fields that follow that prefix.
Multikey Indexes
Multikey indexes collect and sort data from fields containing array values. Multikey indexes improve performance for queries on array fields.
In a compound multikey index, each indexed document can have at most one indexed field whose value is an array.
exercise
data model
Embedded Data Models
In general, embedding provides better performance for read operations, as well as the ability to request and retrieve related data in a single database operation. Embedded data models make it possible to update related data in a single atomic write operation.
Normalized Data Models
Normalized data models describe relationships using references between documents.
references
bigdata - MongoDB
install_url
to use ShareThis. Please set it in _config.yml
.