bigdata - wide column stores

wide column stores

Wide column stores were invented to provide more control over performance, in particular to achieve high throughput and low latency for objects ranging from a few bytes to about 10 MB. Such objects are too big and too numerous to be efficiently stored as so-called clobs (character large objects) or blobs (binary large objects) in a relational database system, but also too small and too numerous to be efficiently accessed as individual files in a distributed file system.

Why wide column stores

A wide column store is more tightly integrated with parallel data processing systems. Wide column stores have a richer logical model than the simple key-value model behind object storage, and they also handle very small values (bytes and kBs) well thanks to batch processing.

different from RDBMS

HBase differs from a relational database system in several ways:
- It does not have any data model for values, which are just arrays of bytes; since it efficiently handles values up to 10 MB, the values can be nested data in various formats, which breaks the first normal form.
- Tables do not have a schema.
- There is no language like SQL; instead, the API is on a lower level and more akin to that of a key-value store.
- Tables can be very sparse, allowing for billions of rows and millions of columns at the same time; this is another reason why data stored in HBase is denormalized.

BigTable and HBase

Rationale

HBase is an open-source equivalent of Google's BigTable, developed as part of the Hadoop ecosystem.
The data model of HBase is based on the realization that joins are expensive, and that they should be avoided or minimized on a cluster architecture.

The second design principle underlying HBase is that it is efficient to store together what is accessed together. In the big picture, this is a flavor of batch processing, one of the overarching principles in Big Data. Batch processing reduces the impact of latency.

Tables and row IDs

From an abstract perspective, HBase can be seen as an enhanced key-value store, in the sense that: a key is compound and involves a row ID, a column family, a column qualifier and a version; values can be larger (clobs, blobs), up to around 10 MB; and keys are sortable.

A row ID is logically an array of bytes, although there is a library to easily create row ID bytes from specific primitive values. In HBase, the key identifying every cell consists of: the row ID, the column family, the column qualifier, the version.

Column families

The other attributes, called columns, are split into so-called column families. This is a concept that does not exist in relational databases and that allows scaling the number of columns.

Column qualifiers

Columns in HBase have a name (in addition to the column family) called column qualifier, however unlike traditional RDBMS, they do not have a particular type. Column qualifiers are arrays of bytes (rather than strings), and as for row IDs, there is a library to easily create column qualifiers from primitive values. Unlike the values which can be large arrays of bytes (blobs), it is important to keep column families and column qualifiers short, because as we will see, they are repeated a gigantic number of times on the physical layer.

Versioning

HBase generally supports versioning, in the sense that it keeps track of the past versions of the data. As we will see, this is implemented by associating any value with a timestamp, also called version, at which it was created (or deleted). Users can also override timestamps with a value of their choice to have more control over versions.

Logical queries

HBase supports four kinds of low-level queries: get, put, scan and delete. Unlike a traditional key-value store, HBase also supports querying ranges of row IDs and ranges of timestamps.

HBase offers a locking mechanism at the row level, meaning that different rows can be modified concurrently, but the cells in the same row cannot: only one user at a time can modify any given row.
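To make the four query types concrete, here is a minimal sketch using the third-party happybase Python client; the table name 'mytable', the column family 'cf', the row IDs and the Thrift gateway address are made-up placeholders, not part of the notes above.

```python
# Sketch of the four logical query types (get, put, scan, delete) via happybase.
import happybase

connection = happybase.Connection('localhost')   # connects to an HBase Thrift gateway
table = connection.table('mytable')

# put: write one cell (row ID, column family:qualifier, and value are all bytes)
table.put(b'row-42', {b'cf:name': b'Einstein'})

# get: fetch all cells of one row
print(table.row(b'row-42'))

# scan: iterate over a range of row IDs (lower bound included, upper bound excluded)
for key, data in table.scan(row_start=b'row-00', row_stop=b'row-99'):
    print(key, data)

# delete: remove a row (or specific columns of it)
table.delete(b'row-42')
```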

Physical architecture

Partitioning

A table in HBase is physically partitioned in two ways: on the rows and on the columns. The rows are split into consecutive regions. Each region is identified by a lower and an upper row key, the lower row key being included and the upper row key excluded. A partition is called a store and corresponds to the intersection of a region and of a column family.

Network topology

HBase has exactly the same centralized architecture as HDFS. The HMaster and the RegionServers should be understood as processes running on the nodes, rather than the nodes themselves. The HMaster assigns responsibility of each region to one of the RegionServers. There is no need to attribute the responsibility of a region to more than one RegionServer at a time because, as we will see soon, fault tolerance is already handled on the storage level by HDFS. If a region grows too big, for example because of many writes in the same row ID interval, then the region will be automatically split by the responsible RegionServer. If a RegionServer has too many regions compared to other RegionServers, then the HMaster can reassign regions to other RegionServers.

Physical storage

The store is, physically, nothing less than a set of cells. Each cell is thus handled physically as a key-value pair where the key is a (row ID, column family, column qualifier, version) tuple. All the cells within a store are eventually persisted on HDFS, in files that we will call HFiles.

An HFile is, in fact, nothing else than a (boring) flat list of KeyValues, one per cell. What is important is that, in an HFile, all these KeyValues are sorted by key in increasing order, meaning: first sorted by row ID, then by column family (trivially unique for a given store), then by column qualifier, then by version (in decreasing order, recent to old). This means that all versions of a given cell that are in the same HFile are located together, and one of the cells (within this HFile) is the latest. If we zoom in at the bit level, a KeyValue consists of four parts:
- the length of the key (encoded on a constant, known number of bits);
- the length of the value (encoded on a constant, known number of bits);
- the actual key (of variable length);
- the actual value (of variable length).
Why do we not just store the key and the value? Because their lengths can vary: if we do not know their lengths, it is impossible to tell where they stop just by looking at the bits.
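To illustrate why the two fixed-size length fields are needed, here is a simplified, hypothetical length-prefixed encoding in Python; real HFiles use a more detailed binary layout.

```python
# Toy length-prefixed KeyValue: fixed-size lengths first, then variable-size key and value.
import struct

def encode_keyvalue(key: bytes, value: bytes) -> bytes:
    # two fixed-size (4-byte) length fields, then the variable-length key and value
    return struct.pack(">II", len(key), len(value)) + key + value

def decode_keyvalue(buf: bytes, offset: int = 0):
    key_len, value_len = struct.unpack_from(">II", buf, offset)
    start = offset + 8
    key = buf[start:start + key_len]
    value = buf[start + key_len:start + key_len + value_len]
    # also return where the next KeyValue starts, so a flat HFile can be walked
    return key, value, start + key_len + value_len

buf = encode_keyvalue(b"row1/cf:qualifier/42", b"some cell value")
print(decode_keyvalue(buf))
```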

KeyValues, within an HFile, are organized in blocks; to avoid confusing them with HDFS blocks, we will call them HBlocks. HBlocks have a size of 64 kB, but this size is variable: if the last KeyValue goes beyond this boundary, then the HBlock is simply longer and stops wherever the last KeyValue stops. The HFile then additionally contains an index of all HBlocks with their key boundaries. This separate index is loaded in memory prior to reading anything from the HFile.

Log-structured merge trees

When accessing data, HBase generally needs to look everywhere for cells to potentially return: in every HFile, and in memory. As long as there is room in memory, freshly created cells are added in memory. At some point, the memory becomes full (or some other limits are reached). When this happens, all the cells need to be flushed to a brand new HFile. Upon flushing, all cells are written sequentially to a new HFile in ascending key order, HBlock by HBlock, concurrently building the index structure. When cells are added to memory, they are added inside a data structure that maintains them in sorted order (such as a tree map), so that flushing is a linear traversal of the tree.
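The following toy sketch illustrates the idea (not HBase's actual code): cells are buffered in memory and flushed to a new sorted "HFile" once a threshold is reached. A real implementation keeps a tree map so that the flush is a linear traversal; here we simply sort at flush time.

```python
# Minimal LSM-style memstore: buffer cells, flush them sorted once a limit is hit.
MEMSTORE_LIMIT = 4   # tiny threshold, just for illustration

memstore = {}        # key -> value, keys are (row, column, version) tuples
hfiles = []          # each flushed "HFile" is a sorted list of (key, value) pairs

def flush():
    hfile = sorted(memstore.items())   # sequential write in ascending key order
    hfiles.append(hfile)
    memstore.clear()

def put(key, value):
    memstore[key] = value
    if len(memstore) >= MEMSTORE_LIMIT:
        flush()

for i in range(10):
    put((f"row{i:02d}", "cf:q", 1), f"value {i}")

print(len(hfiles), "HFiles flushed,", len(memstore), "cells still in memory")
```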

What happens if the machine crashes and we lose everything in memory? We have a so-called write-ahead-log for this. Before any fresh cells are written to memory, they are written in sequential order (append) to an HDFS file called the HLog. There is one HLog per RegionServer. A full write-ahead-log also triggers a flush of all cells in memory to a new HFile. If there is a problem and the memory is lost, the HLog can be retrieved from HDFS and “played back” in order to repopulate the memory and recreate the sorting tree structure.

After many flushes, the number of HFiles to read from grows and becomes impracticable. For this reason, there is an additional process called compaction that takes several HFiles and outputs a single, merged HFile. Since the cells within each HFile are already sorted, this can be done in linear time, as this is essentially the merge part of the merge-sort algorithm. Compaction is not done arbitrarily but follows a regular, logarithmic pattern. When the memory is flushed again, a standard-size HFile is written, and the two standard-size HFiles are immediately compacted to a double-size HFile.
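A sketch of the merge step, assuming each HFile is represented as a sorted list of (key, value) pairs:

```python
# Compaction as the merge step of merge sort: heapq.merge walks the sorted
# inputs in linear time and yields one merged, still sorted sequence.
import heapq

hfile_a = [("row01", "v1"), ("row03", "v3"), ("row05", "v5")]
hfile_b = [("row02", "v2"), ("row04", "v4")]

compacted = list(heapq.merge(hfile_a, hfile_b, key=lambda kv: kv[0]))
print(compacted)   # one merged HFile, still in ascending key order
```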

Additional design aspects

Bootstrapping lookups

In order to know which RegionServer a client should communicate with to receive cells corresponding to a specific region, there is a main, big lookup table that lists all regions of all tables together with the coordinates of the RegionServer in charge of this region as well as additional metadata.

Caching

In order to improve latency, cells that are normally persisted to HFiles (and thus no longer in memory) can be cached in a separate memory region, with the idea of keeping in the cache those cells that are frequently accessed.

Bloom filters

HBase has a mechanism to avoid looking for cells in every HFile. This mechanism is called a Bloom filter. It is basically a black box that can tell with absolute certainty that a certain key does not belong to an HFile, while it can only say with good probability (but not certainty) that it does belong to it.
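A minimal Bloom filter sketch (not HBase's actual implementation), just to show the "definitely not present" versus "probably present" behaviour:

```python
# Toy Bloom filter: k hash functions set k bits per inserted key; a lookup
# answers "definitely not present" if any probed bit is 0, "probably present" otherwise.
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, key: bytes):
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add(b"row-42")
print(bf.might_contain(b"row-42"))   # True
print(bf.might_contain(b"row-43"))   # False with high probability
```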

Data locality and short-circuiting

When a RegionServer flushes cells to a new HFile, a replica of each (HDFS) block of the HFile is written, by the DataNode process living on the same machine as the RegionServer process, to the local disk. This makes accessing the cells in future reads by the RegionServer extremely efficient, because the RegionServer can read the data directly from the local disk without going through the DataNode process: this is known as short-circuiting in HDFS.

using HBase

After installing HBase, we can use the HBase shell. The shell provides commands such as get, put, scan and delete.
We can use filters with scan to query and filter data.

exercises

bloom filter

A perfect hash function should distribute keys uniformly over the bit array.
A false positive occurs when, for a key that was never inserted, every hash function happens to hit a bit that some other key already set to 1 (i.e., all of them collide).
Deletion only physically happens when compacting; until then, deletes are recorded as markers.
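For reference (a standard result, not derived in the notes above): for a Bloom filter with m bits, k independent, uniformly distributed hash functions and n inserted keys, the false-positive probability is approximately

```latex
% probability that a given bit is still 0 after n insertions with k hashes each
P(\text{bit} = 0) = \left(1 - \tfrac{1}{m}\right)^{kn} \approx e^{-kn/m}
% a false positive requires all k probed bits to be 1
P(\text{false positive}) \approx \left(1 - e^{-kn/m}\right)^{k}
```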

references

https://ghislainfourny.github.io/big-data-textbook/
https://datakaresolutions.com/hbase-quick-guide-to-key-commands/
https://www.datapotion.io/blog/hbase-shell-column-filters/

bigdata - Syntax

syntax

CSV

CSV is a textual format, in the sense that it can be opened in a text editor. CSV means comma-separated values. The main challenge with CSV files is that, in spite of a standard (RFC 4180), in practice there are many different dialects and variations, which limits interoperability. For example, another character can be used instead of the comma (tabulation, semi-colons, etc). Also, when a comma (or the special character used in its stead) needs to actually appear in a value, it needs to be escaped.
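A small illustration of the dialect problem using Python's csv module; the record content is made up. The same record is written with the default comma dialect and with a semicolon dialect; the value containing a comma must be quoted in the first case but not in the second, so a reader has to know which dialect was used.

```python
# Writing and reading the same record under two CSV dialects.
import csv, io

row = ["Einstein, Albert", "1879", "physics"]

for delimiter in [",", ";"]:
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL).writerow(row)
    print(repr(buf.getvalue()))
    # reading it back requires knowing the dialect that was used to write it
    print(next(csv.reader(io.StringIO(buf.getvalue()), delimiter=delimiter)))
```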

Data denormalization

Data denormalization makes a lot of sense in read-intensive scenarios in which not having to join brings a significant performance improvement.

The difference with CSV is that, in JSON, the attributes appear in every tuple, while in CSV they do not appear except in the header line. JSON is appropriate for data denormalization because including the attributes in every tuple allows us to drop the identical support requirement.

The generic name for denormalized data (in the sense of heterogeneous and nested) is “semi-structured data”. Textual formats such as XML and JSON have the advantage that they can both be processed by computers, and can also be read, written and edited by humans. Another very important and characterizing aspect of XML and JSON is that they are standards: XML is a W3C standard. W3C, also known as the World Wide Web Consortium, is the same body that also standardizes HTML, HTTP, etc. JSON is now an ECMA standard, which is the same body that also standardizes JavaScript.

Whichever syntax is used, they have in common the concept of well-formedness. A string is said to be well-formed if it belongs to the language. Concretely, when a document is well-formed XML, it means that it can be successfully opened by an editor as XML with no errors.

JSON

JSON stands for JavaScript Object Notation because its syntax originates from JavaScript; however, it now lives its own life completely independently of JavaScript.

JSON is made of exactly six building blocks: strings, numbers, Booleans, null, objects, and arrays. Strings are simply text. In JSON, strings always appear in double quotes. Obviously, strings could contain quotes, and in order not to confuse them with the surrounding quotes, they need to be differentiated. This is called escaping and, in JSON, escaping is done with the backslash character (\).

JSON generally supports numbers, without explicitly naming any types nor making any distinction between numbers apart from how they appear in syntax. The way a number appears in syntax is called a lexical representation, or a literal. JSON places a few restrictions: a leading + is not allowed. Also, a leading 0 is not allowed except if the integer part is exactly 0.

There are two Booleans, true and false. Arrays are simply lists of values. The concept of list is abstract and mathematical. The concept of array is the syntactic counterpart of a list. Objects are simply maps from strings to values. The concept of object is the syntactic counterpart of a map, i.e., an object is a physical representation of an abstract map that explicitly lists all string-value pairs. The keys of an object must be strings. The JSON standard recommends for keys to be unique within an object.
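A short illustration of the six building blocks, parsed with Python's json module; the document content is made up.

```python
# Parsing a JSON document: objects become dicts, arrays become lists.
import json

doc = """
{
  "name": "Einstein",
  "theories": ["special relativity", "general relativity"],
  "year": 1879,
  "living": false,
  "middle name": null
}
"""

parsed = json.loads(doc)
print(parsed["theories"][1])      # general relativity

# json.loads rejects strings that are not well-formed JSON, e.g. a leading +:
try:
    json.loads('{"n": +1}')
except json.JSONDecodeError as e:
    print("not well-formed:", e)
```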

XML

XML stands for eXtensible Markup Language. It resembles HTML, except that it allows for any tags and that it is stricter in what it allows.

XML’s most important building blocks are elements, attributes, text and comments. XML is a markup language, which means that content is “tagged”. Tagging is done with XML elements. An XML element consists of an opening tag and a closing tag; what is “tagged” is everything in between the opening tag and the closing tag. Tags consist of a name surrounded with angle brackets < … >, and the closing tag has an additional slash in front of the name. There is a convenient shortcut to denote an empty element with a single tag and a slash at the end: for example, <foo/> is equivalent to <foo></foo>.
Unlike JSON keys, element names can repeat at will.

Attributes appear in an opening element tag and are basically key-value pairs. Values can be either double-quoted or single-quoted; the key is never quoted, and it is not allowed to have unquoted values. Within the same opening tag, there cannot be duplicate keys. Attributes can also appear in an empty element tag, but never in a closing tag. It is not allowed to create attributes whose names start with XML or xml, in any case combination.

Text, in XML syntax, is simply freely appearing in elements and without any quotes (attribute values are not text!). Within an element, text can freely alternate with other elements. This is called mixed content and is unique to XML.

Comments in XML look like so: <!-- this is a comment -->. XML documents can be identified as such with an optional text declaration containing a version number and an encoding, like <?xml version="1.0" encoding="UTF-8"?>. The version is either 1.0 or 1.1. Another tag that might appear is the doctype declaration, like <!DOCTYPE person>.

Remember that in JSON, it is possible to escape characters with a backslash. In XML, this is done with an ampersand (&) character. There are exactly five escape sequences pre-defined in XML: &amp; for &, &lt; for <, &gt; for >, &quot; for ", and &apos; for '. Escape sequences can be used anywhere in text and in attribute values. At other places (element names, attribute names, inside comments), they will not be recognized.
There are a few places where they are mandatory: & and < MUST be escaped, and the quote character surrounding an attribute value should also be escaped inside that value.

Namespaces are an extension of XML that allows users to group their elements and attributes in packages, similar to Python modules, Java packages or C++ namespaces. A namespace is identified with a URI. A point of confusion is that XML namespaces often start with http://, but they are not meant to be entered as an address into a browser. A namespace declaration is an attribute in an opening tag, like xmlns="http://www.example.com/ns". If you remember, we saw that attributes starting with xml are forbidden; this is because they are reserved for namespace declarations. What about documents that use multiple namespaces? This is done by associating namespaces with prefixes, which act as shorthands for a namespace (declared with xmlns:prefix="…"). Then, we can use the prefix shorthand in every element that we want to have in this namespace.
So, given any element, it is possible to find its local name, its (possibly absent) prefix, and its (possibly absent) namespace. The triplet (namespace, prefix, local name) is called a QName.

Attributes can also live in namespaces, that is, attribute names are generally QNames. However, there are two very important aspects to consider. First, unprefixed attributes are not sensitive to default namespaces: unlike elements, the namespace of an unprefixed attribute is always absent even if there is a default namespace. Second, it is possible for two attributes to collide if they have the same local name, and different prefixes but associated with the same namespace (but again, we told you: do not do that!).
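A short sketch of elements, attributes, namespaces and the resulting QNames using Python's xml.etree.ElementTree; the namespace URI, prefix and element names are invented for the example.

```python
# Parsing a namespaced XML document; ElementTree reports (namespace, local name)
# in Clark notation, i.e. {uri}localname.
import xml.etree.ElementTree as ET

doc = """
<p:person xmlns:p="http://www.example.com/people" born="1879">
  <p:name>Albert Einstein</p:name>
</p:person>
"""

root = ET.fromstring(doc)
print(root.tag)                  # {http://www.example.com/people}person
print(root.attrib["born"])       # unprefixed attribute: no namespace
print(root.find("{http://www.example.com/people}name").text)
```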

references

https://ghislainfourny.github.io/big-data-textbook/

bigdata - introduction

big data

Big Data is a portfolio of technologies that were designed to store, manage and analyze data that is too large to fit on a single machine, while accommodating the issue of the growing discrepancy between capacity, throughput and latency.

data independence

Data independence means that the logical view on the data is cleanly separated, decoupled, from its physical storage.

architecture

Stack: storage, compute, model, language.

data model

what data looks like and what you can do with it.

data shapes

tables, trees, cubes

table

Row (tuple), column (attribute), primary key, value.

relational tables

Schema: a set of attributes.
Extension: a set/bag/list of tuples.
Three constraints: relational integrity, domain integrity and atomic integrity.
Superkey, candidate key (a minimal superkey).

Database Normalization

In database management systems (DBMS), normal forms are a series of guidelines that help to ensure that the design of a database is efficient, organized, and free from data anomalies.
First Normal Form (1NF):In 1NF, each table cell should contain only a single value, and each column should have a unique name.
Second Normal Form (2NF): 2NF eliminates redundant data by requiring that every non-key attribute be fully dependent on the whole primary key (no partial dependencies on part of a composite key).
Third Normal Form (3NF): 3NF builds on 2NF by additionally requiring that non-key attributes do not depend on other non-key attributes (no transitive dependencies on the key).

Denormalization

Denormalization is a database optimization technique in which we add redundant data to one or more tables. This can help us avoid costly joins in a relational database. Note that denormalization does not mean ‘reversing normalization’ or ‘not to normalize’. It is an optimization technique that is applied after normalization.

SQL

SQL was originally named SEQUEL, for Structured English QUEry Language. SQL is a declarative language, which means that the user specifies what they want, and not how to compute it: it is up to the underlying system to figure out how to best execute the query.

View and Table

A view is the result of an SQL query and is a virtual table, whereas a table is made up of rows and columns that store the information of some object and can be used to retrieve that data whenever required. A view contains no data of its own; it is like a ‘window’ through which data from tables can be viewed or changed. The view is stored as a SELECT statement in the data dictionary. Creating a view fulfills a requirement without storing a separate copy of the data; because a view always takes its data from the base tables, the information seen through it is always accurate and up to date.

SQL:1999 added the WITH clause to define “statement-scoped views”. They are not stored in the database schema: instead, they are only valid in the query they belong to. This makes it possible to improve the structure of a statement without polluting the global namespace. WITH is not a standalone command like CREATE VIEW is: it must be followed by SELECT.

Natural Join and Inner Join

A natural join joins two tables based on attributes that have the same name and datatype; the resulting table contains all the attributes of both tables but keeps only one copy of each common column. An inner join joins two tables on the basis of the column that is explicitly specified in the ON clause; the resulting table contains all the attributes from both tables, including the common columns.

data storage

Stack: storage, encoding, syntax, data models, validation, processing, indexing, data stores, querying, user interfaces.

database and data lake

There are two main paradigms for storing and retrieving data: databases and data lakes. Data can be imported into a database (this is called ETL, for Extract-Transform-Load; ETL is often used as a verb). The data is internally stored in a proprietary format that is optimized to make queries faster; this includes in particular building indices on the data.
On the other hand, data can also just be stored on some file system. This paradigm is called the data lake paradigm and has gained a lot of popularity in the past two decades. Querying is slower, but users can start querying their data without the effort of ETLing it first.

scaling up and scaling out

First, one can buy a bigger machine: more memory, more or faster CPU cores, a larger disk, etc. This is called scaling up. Second, one can buy more, similar machines and share the work across them. This is called scaling out.

Object stores

Amazon’s object storage system is called Simple Storage Service, abbreviated S3. From a logical perspective, S3 is extremely simple: objects are organized in buckets. Buckets are identified with a bucket ID, and each object within a bucket is identified with an Object ID.
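A minimal sketch of this logical model using the boto3 client; the bucket name and object ID are hypothetical, and AWS credentials are assumed to be configured in the environment.

```python
# An object is just a blob of bytes identified by (bucket ID, object ID).
import boto3

s3 = boto3.client("s3")

s3.put_object(Bucket="my-example-bucket", Key="my-object-id", Body=b"hello")

response = s3.get_object(Bucket="my-example-bucket", Key="my-object-id")
print(response["Body"].read())   # b'hello'
```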

CAP theorem

Consistency: at any point in time, the same request to any server returns the same result; in other words, all nodes see the same data.
Availability: the system is available for requests at all times.
Partition tolerance: the system continues to function even if the network linking its machines is occasionally partitioned.
The CAP theorem is basically an impossibility triangle: a system cannot guarantee all three properties at the same time; systems are therefore usually classified as CP, AP or CA.

REST APIs

REST (Representational State Transfer) is an architectural style for designing networked applications. RESTful services often use HTTP as the communication protocol. A client and a server communicating over HTTP interact in terms of methods applied to resources. A resource is referred to with what is called a URI (Uniform Resource Identifier). A client can act on resources by invoking methods, with an optional body. The most important methods are GET, PUT, DELETE and POST.

REST is not a standard or protocol; it is an approach to, or architectural style for, writing APIs. REST is the architectural style, and RESTful is an interpretation of it: if your back-end server has a REST API and you make client-side requests (from a website/application) to this API, then your client is RESTful. Every request returns an HTTP status code. There are a lot of them and they are divided into 5 classes; the first digit indicates which class a code belongs to:
1xx - informational
2xx - success
3xx - redirection
4xx - client error
5xx - server error
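As an illustration of methods and status codes, here is a sketch that acts on a hypothetical resource with the third-party requests library; the URI is made up.

```python
# Acting on a resource with HTTP methods; each response carries a status code.
import requests

uri = "https://api.example.com/objects/42"

r = requests.put(uri, data=b"some content")   # create or replace the resource
print(r.status_code)                          # e.g. 200 or 201 on success

r = requests.get(uri)                         # retrieve the resource
print(r.status_code, r.content)

r = requests.delete(uri)                      # delete the resource
print(r.status_code)                          # e.g. 204: success with no body
```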

Amazon S3 and Azure Blob Storage

(figure: comparison of Amazon S3 and Azure Blob Storage)

Key-value store

A key-value store differs from a typical relational database in three aspects:
- Its API is considerably simpler than that of a relational database (which comes with query languages).
- It does not ensure atomic consistency; instead, it guarantees eventual consistency, which we covered earlier in this chapter.
- A key-value store scales out well, in that it is very fast also at large scales.

Amazon Dynamo

It is itself based (with some modifications) on the Chord protocol, which is a distributed hash table. On the physical level, a distributed hash table is made of nodes (the machines we have in a data center, piled up in racks) that work following a few design principles. The first design principle is incremental stability: new nodes can join the system at any time, and nodes can leave the system at any time, sometimes gracefully, sometimes in a sudden crash. The second principle is symmetry: no node is special in any way. The third principle is decentralization: there is no “central node” that orchestrates the others. The fourth principle is heterogeneity: the nodes may have different CPU power, amounts of memory, etc.

A central aspect of the design of a distributed hash table, and in particular of the Chord protocol, is that every logical key is hashed to a sequence of bits that we will call the ID. In the case of Dynamo, the hash is made of 128 bits (16 bytes). In the Chord protocol, a technology called a finger table is used: each node knows the next node clockwise, as well as the second node, the 4th node, the 8th node, and so on in powers of two. Dynamo changes this design to so-called “preference lists”: each node knows, for every key (or key range), which node(s) are responsible for (and hold a copy of) it. This is done by associating every key (key range) with a list of nodes, by decreasing priority (going down the ring clockwise).
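A toy hash-ring sketch in Python, in the spirit of Chord/Dynamo: node IDs and key IDs live in the same 128-bit hash space, and a key is assigned to the first node whose ID follows it clockwise. The node names are made up, and preference lists/replication are omitted for brevity.

```python
# Placing keys on a hash ring: a key goes to the first node clockwise from its ID.
import bisect
import hashlib

def ring_id(value: str) -> int:
    # 128-bit ID, as with Dynamo's MD5-based hashing
    return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")

nodes = ["node-A", "node-B", "node-C", "node-D"]
ring = sorted((ring_id(n), n) for n in nodes)

def responsible_node(key: str) -> str:
    kid = ring_id(key)
    idx = bisect.bisect_right([nid for nid, _ in ring], kid) % len(ring)
    return ring[idx][1]

for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", responsible_node(key))
```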

Distributed hash tables, including Dynamo, are typically AP. A fundamental conceptual tool in AP systems is the use of vector clocks. Vector clocks are a way to annotate the versions when they follow a DAG structure. A vector clock can logically be seen as a map from nodes (machines) to integers, i.e., the version number is incremented per machine rather than globally.

vector clock
Lamport’s logical clock

Lamport’s logical clock was created by Leslie Lamport. It is a procedure to determine the order in which events occur, and it provides a basis for the more advanced vector clock algorithm. It is needed because there is no global clock in a distributed system.

Implementation rules:
[IR1]: If a -> b (‘a’ happened before ‘b’ within the same process), then Ci(b) = Ci(a) + d.
[IR2]: When process j receives a message carrying timestamp tm (the value of Ci at the send event), it sets Cj = max(Cj, tm + d).

Vector Clocks in Distributed Systems

Vector Clock is an algorithm that generates partial ordering of events and detects causality violations in a distributed system.
How does the vector clock algorithm work :

Initially, all the clocks are set to zero.
Every time an internal event occurs in a process, the value of the process's logical clock in the vector is incremented by 1.
Every time a process receives a message, the value of the process's logical clock in the vector is incremented by 1, and moreover, each element is updated by taking the maximum of the value in its own vector clock and the value in the vector in the received message (for every element).

To sum up, Vector clocks algorithms are used in distributed systems to provide a causally consistent ordering of events but the entire Vector is sent to each process for every message sent, in order to keep the vector clocks in sync.
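A minimal vector clock sketch in Python following the rules above; the process names are made up.

```python
# Vector clocks as maps from process name to counter.
def increment(clock: dict, process: str) -> dict:
    clock = dict(clock)
    clock[process] = clock.get(process, 0) + 1
    return clock

def merge(local: dict, received: dict, process: str) -> dict:
    # element-wise maximum, then increment the receiver's own entry
    merged = {p: max(local.get(p, 0), received.get(p, 0))
              for p in set(local) | set(received)}
    return increment(merged, process)

def happened_before(a: dict, b: dict) -> bool:
    # partial order: a <= b element-wise, and a != b
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

p1 = increment({}, "p1")          # internal event on p1: {"p1": 1}
p2 = merge({}, p1, "p2")          # p2 receives p1's message: {"p1": 1, "p2": 1}
print(happened_before(p1, p2))    # True
print(happened_before(p2, p1))    # False
```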

partial order relation

Note that vector clocks can be compared to each other with a partial order relation ≤. A partial order relation is any relation that is reflexive, antisymmetric, and transitive. A total order relation is a partial order in which every element of the set is comparable with every other element of the set. All total order relations are partial order relations, but the converse is not always true.

references

https://www.geeksforgeeks.org/normal-forms-in-dbms/
https://www.geeksforgeeks.org/denormalization-in-databases/
https://www.geeksforgeeks.org/difference-between-view-and-table/
https://modern-sql.com/feature/with
https://www.geeksforgeeks.org/sql-natural-join/
https://mlsdev.com/blog/81-a-beginner-s-tutorial-for-understanding-restful-api
https://ghislainfourny.github.io/big-data-textbook/
https://www.geeksforgeeks.org/lamports-logical-clock/

bigdata - Distributed file systems

Distributed file systems

requirements of a distributed file system

Going back to our capacity-throughput-latency view of storage, a distributed file system is designed so that, in cruise mode, its bottleneck will be the data flow (throughput), not the latency. We saw that capacity increased much faster than throughput, and that this can be solved with parallelism. We saw that throughput increased much faster than latency decreased, and that this can be solved with batch processing. Distributed file systems support both parallelism and batch processing natively, forming the core part of the ideal storage system accessed by MapReduce or Apache Spark. The origins of such a system go back to the design of GoogleFS, the Google File System. Later on, an open-source version of it was released as part of the Hadoop project, initiated by Doug Cutting at Yahoo, and called HDFS, for Hadoop Distributed File System.

HDFS

HDFS does not follow a key-value model: instead, an HDFS cluster organizes its files as a hierarchy, called the file namespace. Files are thus organized in directories, similar to a local file system. Unlike in S3, HDFS files are furthermore not stored as monolithic black boxes, but HDFS exposes them as lists of blocks. As for the block size: HDFS blocks are typically 64 MB or 128 MB large, and are thus considerably larger than blocks on a local hard drive (around 4 kB). HDFS is designed to run on a cluster of machines.

architecture

HDFS is implemented on a fully centralized architecture, in which one node is special and all others are interchangeable and connected to it.
In the case of HDFS, the central node is called the NameNode and the other nodes are called the DataNodes. Every file is divided into chunks called blocks. All blocks have a size of exactly 128 MB, except the last one which is usually smaller. Each one of the blocks is then replicated and stored on several DataNodes. How many times? This is a parameter called the replication factor. By default, it is 3.

The NameNode is responsible for the system-wide activity of the HDFS cluster. It stores in particular three things:
- the file namespace, that is, the hierarchy of directory names and file names, as well as any access control (ACL) information, similar to Unix-based systems;
- a mapping from each file to the list of its blocks. Each block, in this list, is represented with a 64-bit identifier; the content of the blocks is not on the NameNode;
- a mapping from each block, represented with its 64-bit identifier, to the locations of its replicas, that is, the list of the DataNodes that store a copy of this block.
The DataNodes store the blocks themselves, on their local disks. DataNodes send regular heartbeats to the NameNode. The frequency of these heartbeats is configurable and is by default a few seconds (e.g., 3 s, but this value may change across releases); this is a way to let the NameNode know that everything is alright. Finally, each DataNode also sends, every couple of hours (e.g., 6 h, but this value may change across releases), a full report including all the blocks that it contains. A NameNode never initiates a connection to a DataNode.
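A toy model of this metadata as Python data structures, just to make the three mappings concrete; all paths, block IDs and hostnames are made up, and the block contents themselves live only on the DataNodes.

```python
# NameNode-style metadata: file -> block IDs, and block ID -> replica locations.
namespace = {
    "/data/logs/2024-01-01.log": ["blk_1", "blk_2"],   # file -> ordered block IDs
}
block_locations = {
    "blk_1": ["datanode-3", "datanode-7", "datanode-12"],   # 3 replicas each
    "blk_2": ["datanode-1", "datanode-7", "datanode-9"],
}

# To read a file, a client asks the NameNode for its blocks and their
# locations, then fetches each block directly from one of the DataNodes.
for block in namespace["/data/logs/2024-01-01.log"]:
    print(block, "can be read from any of", block_locations[block])
```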

Finally, DataNodes are also capable of communicating with each other by forming replication pipelines. A pipeline happens whenever a new HDFS file is created. The client does not send a copy of the block to all the destination DataNodes, but only to the first one. This first DataNode is then responsible for creating the pipeline and propagating the block to its counterparts. When a replication pipeline is ongoing and a new block is being written to the cluster, the content of the block is not sent in one single 128 MB packet. Rather, it is sent in smaller packets (e.g., 64 kB) in a streaming fashion via a network protocol.

replicas

Having this in mind, the first replica of the block, by default, gets written to the same machine that the client is running on. The second replica is written on a DataNode sitting in a different rack than the client, that we call B. The third replica is written to another DataNode on the same rack B.And further replicas are written mostly at random, but respecting two simple rules for resilience: at most one replica per node, and at most two replicas per rack.

Fault tolerance

HDFS has a single point of failure: the NameNode. If the metadata stored on it is lost, then all the data on the cluster is lost, because it is not possible to reassemble the blocks into files any more. For this reason, the metadata is backed up. More precisely, the file namespace containing the directory and file hierarchy as well as the mapping from files to block IDs is backed up to a so-called snapshot. Updates to the file system arriving after the snapshot has been made are instead stored in a journal, called the edit log, that lists the updates sorted by time of arrival. The snapshot and edit log are stored either locally or on a network-attached drive (not HDFS itself).

Logging and importing data

Two tools are worth mentioning: Apache Flume lets you collect, aggregate and move log data to HDFS. Apache Sqoop lets you import data from a relational database management system to HDFS.

references

https://ghislainfourny.github.io/big-data-textbook/
https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Data_Replication

Elasticsearch

ELK Stack

Elasticsearch, Logstash and Kibana

Elasticsearch

Elasticsearch is a NoSQL database. When you feed data into Elasticsearch, the data is placed into Apache Lucene indexes.

Apache Lucene

Apache Lucene™ is a high-performance, full-featured search engine library written entirely in Java.

API
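A minimal sketch of indexing and searching a document, assuming the official elasticsearch Python client (8.x API) and a local cluster; the index name and document fields are made up.

```python
# Index one document, then run a full-text match query against it.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(index="podcasts", id="1",
         document={"title": "Big Data basics", "duration_min": 42})

result = es.search(index="podcasts",
                   query={"match": {"title": "big data"}})
print(result["hits"]["total"])
```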

Logstash

Using more than 50 input plugins for different platforms, databases and applications, Logstash can be defined to collect and process data from these sources and send them to other systems for storage and analysis.

project

https://trecpodcasts.github.io/
https://doc.yonyoucloud.com/doc/mastering-elasticsearch/chapter-2/21_README.html
https://cloud.tencent.com/developer/article/1600163
https://www.elastic.co/cn/blog/how-to-improve-elasticsearch-search-relevance-with-boolean-queries
https://www.elastic.co/guide/en/app-search/current/relevance-tuning-guide.html
https://medium.com/mlearning-ai/enhancing-information-retrieval-via-semantic-and-relevance-matching-64973ff81818
https://www.elastic.co/cn/blog/how-to-improve-elasticsearch-search-relevance-with-boolean-queries
https://bigdataboutique.com/blog/optimizing-elasticsearch-relevance-a-detailed-guide-c9efd3
NDCG:
https://www.javatips.net/api/MyMediaLiteJava-master/src/org/mymedialite/eval/measures/NDCG.java

SQL (1) - Basic Usage

SQL (Structured Query Language) is the standard language for managing and manipulating relational databases. Below are some commonly used SQL topics, covering basic queries, data manipulation, data definition and data control.

1. Basic queries

  • SELECT statement: retrieves data from a database.

    SELECT column1, column2, ...
    FROM table_name;

  • WHERE clause: filters records.

    SELECT column1, column2, ...
    FROM table_name
    WHERE condition;

  • AND / OR: combine multiple conditions.

    SELECT column1, column2, ...
    FROM table_name
    WHERE condition1 AND condition2;

  • IN clause: specifies multiple possible values.

    SELECT column1, column2, ...
    FROM table_name
    WHERE column_name IN (value1, value2, ...);

  • BETWEEN clause: selects values within a range.

    SELECT column1, column2, ...
    FROM table_name
    WHERE column_name BETWEEN value1 AND value2;

  • LIKE clause: pattern matching.

    SELECT column1, column2, ...
    FROM table_name
    WHERE column_name LIKE pattern;

2. Sorting and grouping

  • ORDER BY clause: sorts the result set.

    SELECT column1, column2, ...
    FROM table_name
    ORDER BY column1 ASC/DESC;

  • GROUP BY clause: groups rows.

    SELECT column1, column2, AGGREGATE_FUNCTION(column3)
    FROM table_name
    GROUP BY column1, column2;

  • HAVING clause: filters groups after aggregation.

    SELECT column1, AGGREGATE_FUNCTION(column2)
    FROM table_name
    GROUP BY column1
    HAVING AGGREGATE_FUNCTION(column2) condition;

3. Aggregate functions

  • COUNT: counts rows.

    SELECT COUNT(column_name)
    FROM table_name;

  • SUM: computes a sum.

    SELECT SUM(column_name)
    FROM table_name;

  • AVG: computes an average.

    SELECT AVG(column_name)
    FROM table_name;

  • MAX: finds the maximum value.

    SELECT MAX(column_name)
    FROM table_name;

  • MIN: finds the minimum value.

    SELECT MIN(column_name)
    FROM table_name;

4. Data manipulation

  • INSERT statement: inserts new records.

    INSERT INTO table_name (column1, column2, ...)
    VALUES (value1, value2, ...);

  • UPDATE statement: updates existing records.

    UPDATE table_name
    SET column1 = value1, column2 = value2, ...
    WHERE condition;

  • DELETE statement: deletes records.

    DELETE FROM table_name
    WHERE condition;

5. Data definition

  • CREATE TABLE statement: creates a new table.

    CREATE TABLE table_name (
    column1 datatype constraints,
    column2 datatype constraints,
    ...
    );

  • ALTER TABLE statement: modifies a table's structure.

    ALTER TABLE table_name
    ADD column_name datatype;

  • DROP TABLE statement: deletes a table.

    DROP TABLE table_name;

6. Constraints

  • PRIMARY KEY: uniquely identifies each record in a table.

    CREATE TABLE table_name (
    column1 datatype PRIMARY KEY,
    column2 datatype,
    ...
    );

  • FOREIGN KEY: establishes a relationship between tables.

    CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    FOREIGN KEY (column1) REFERENCES other_table_name(column1)
    );

  • UNIQUE: ensures that all values in a column are unique.

    CREATE TABLE table_name (
    column1 datatype UNIQUE,
    column2 datatype,
    ...
    );

  • NOT NULL: ensures that a column cannot contain NULL values.

    CREATE TABLE table_name (
    column1 datatype NOT NULL,
    column2 datatype,
    ...
    );

7. Subqueries

  • Subquery: a query nested inside another query.

    SELECT column1, column2, ...
    FROM table_name
    WHERE column1 = (SELECT column1 FROM another_table WHERE condition);

8. Joins (JOIN)

  • INNER JOIN: returns the records that match in both tables.

    SELECT table1.column1, table2.column2, ...
    FROM table1
    INNER JOIN table2
    ON table1.common_column = table2.common_column;

  • LEFT JOIN: returns all records from the left table, plus the matching records from the right table.

    SELECT table1.column1, table2.column2, ...
    FROM table1
    LEFT JOIN table2
    ON table1.common_column = table2.common_column;

  • RIGHT JOIN: returns all records from the right table, plus the matching records from the left table.

    SELECT table1.column1, table2.column2, ...
    FROM table1
    RIGHT JOIN table2
    ON table1.common_column = table2.common_column;

  • FULL OUTER JOIN: returns all records from both tables, with NULL where there is no match.

    SELECT table1.column1, table2.column2, ...
    FROM table1
    FULL OUTER JOIN table2
    ON table1.common_column = table2.common_column;

9. Transaction control

  • BEGIN TRANSACTION: starts a transaction.

    BEGIN TRANSACTION;

  • COMMIT: commits the transaction.

    COMMIT;

  • ROLLBACK: rolls back the transaction.

    ROLLBACK;

10. Views

  • CREATE VIEW: creates a view.

    CREATE VIEW view_name AS
    SELECT column1, column2, ...
    FROM table_name
    WHERE condition;

  • DROP VIEW: deletes a view.

    DROP VIEW view_name;

11. Indexes

  • CREATE INDEX: creates an index.

    CREATE INDEX index_name
    ON table_name (column1, column2, ...);

  • DROP INDEX: deletes an index.

    DROP INDEX index_name;

12. Stored procedures and functions

  • Stored procedure: a precompiled block of SQL code that can be called repeatedly.

    CREATE PROCEDURE procedure_name
    AS
    BEGIN
    -- SQL statements
    END;

  • Function: a precompiled block of SQL code that returns a value.

    CREATE FUNCTION function_name (parameters)
    RETURNS datatype
    AS
    BEGIN
    -- SQL statements
    RETURN result;
    END;