bigdata - Distributed file systems

Distributed file systems

requirements of a distributed file system

Going back to our capacity-throughput-latency view of storage, a distributed file system is designed so that, in cruise mode, its bottleneck is the data flow (throughput), not latency. We saw that capacity has increased much faster than throughput, and that this can be addressed with parallelism. We saw that throughput has increased much faster than latency has decreased, and that this can be addressed with batch processing. Distributed file systems support both parallelism and batch processing natively, forming the core of the ideal storage system accessed by MapReduce or Apache Spark.

The origins of such systems go back to the design of GoogleFS, the Google File System. Later on, an open-source counterpart was released as part of the Hadoop project, initiated by Doug Cutting at Yahoo, and called HDFS, the Hadoop Distributed File System.

HDFS

HDFS does not follow a key-value model: instead, an HDFS cluster organizes its files as a hierarchy, called the file namespace. Files are thus organized in directories, similar to a local file system. Unlike in S3, HDFS files are furthermore not stored as monolithic black boxes: HDFS exposes them as lists of blocks. As for the block size: HDFS blocks are typically 64 MB or 128 MB in size, and are thus considerably larger than blocks on a local hard drive (around 4 kB). HDFS is designed to run on a cluster of machines.
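As an illustration, here is a minimal Python sketch (not Hadoop code) of how a file's byte range is cut into fixed-size blocks, with only the last block allowed to be smaller. The 300 MB file size is just an example value.

```python
# Toy sketch (not the real HDFS code): splitting a file's byte range into
# HDFS-style blocks. The block size and file size are illustrative values.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a typical HDFS block size

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file yields two full 128 MB blocks and one smaller 44 MB block.
print(split_into_blocks(300 * 1024 * 1024))
```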

architecture

HDFS is implemented on a fully centralized architecture, in which one node is special and all others are interchangeable and connected to it. In the case of HDFS, the central node is called the NameNode and the other nodes are called the DataNodes. Every file is divided into chunks called blocks. All blocks of a file have the same size, equal to the configured block size (128 MB by default), except the last one, which is usually smaller. Each block is then replicated and stored on several DataNodes. How many times? This is a parameter called the replication factor; by default, it is 3.
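For reference, the block size and the replication factor correspond to the HDFS configuration properties dfs.blocksize and dfs.replication (set in hdfs-site.xml). The snippet below is illustrative only: it records typical default values in a Python dictionary and shows the storage cost implied by replication.

```python
# Illustrative only: dfs.blocksize and dfs.replication are actual HDFS
# configuration property names; the values below are the usual defaults.
hdfs_conf = {
    "dfs.blocksize": 128 * 1024 * 1024,  # block size in bytes (128 MB)
    "dfs.replication": 3,                # replication factor
}

def raw_storage_needed(file_size: int, conf: dict = hdfs_conf) -> int:
    """Physical bytes consumed by a file once every block is replicated."""
    return file_size * conf["dfs.replication"]

# A 1 GB file occupies about 3 GB of raw disk space across the DataNodes.
print(raw_storage_needed(1024 ** 3) / 1024 ** 3, "GB")
```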

The NameNode is responsible for the system-wide activity of the HDFS cluster. It stores, in particular, three things:

• the file namespace, that is, the hierarchy of directory names and file names, as well as any access control (ACL) information, similar to Unix-based systems;
• a mapping from each file to the list of its blocks. Each block in this list is represented by a 64-bit identifier; the content of the blocks is not on the NameNode;
• a mapping from each block, represented by its 64-bit identifier, to the locations of its replicas, that is, the list of the DataNodes that store a copy of this block.

The DataNodes store the blocks themselves, on their local disks. DataNodes send regular heartbeats to the NameNode. The frequency of these heartbeats is configurable and defaults to a few seconds (e.g., 3 s, but this value may change across releases); this is a way to let the NameNode know that everything is alright. Finally, each DataNode also sends, every couple of hours (e.g., 6 h, but this value may change across releases), a full report listing all the blocks that it stores. A NameNode never initiates a connection to a DataNode.
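The following toy Python model (purely illustrative, not the actual NameNode implementation) shows the three mappings side by side; all paths, identifiers, and DataNode names are made up.

```python
# Toy model of the NameNode's metadata (illustrative only, not Hadoop code):
# the namespace, the file -> block-IDs mapping, and the
# block-ID -> replica-locations mapping described above.

namespace = {
    "/logs/2023-09-19.log": {"owner": "alice", "permissions": "rw-r--r--"},
}

# Each file maps to an ordered list of 64-bit block identifiers.
file_to_blocks = {
    "/logs/2023-09-19.log": [0x1A2B3C4D5E6F7081, 0x1A2B3C4D5E6F7082],
}

# Each block identifier maps to the DataNodes holding a replica of it.
block_to_datanodes = {
    0x1A2B3C4D5E6F7081: ["datanode-03", "datanode-17", "datanode-18"],
    0x1A2B3C4D5E6F7082: ["datanode-05", "datanode-11", "datanode-12"],
}

# Reading a file then means: look up its blocks, then fetch each block from
# any DataNode holding a replica (block content never passes through the
# NameNode).
for block_id in file_to_blocks["/logs/2023-09-19.log"]:
    print(hex(block_id), "->", block_to_datanodes[block_id])
```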

Finally, DataNodes are also capable of communicating with each other by forming replication pipelines. A pipeline is set up whenever a new block is written to HDFS, that is, whenever a file is created or appended to. The client does not send a copy of the block to all the destination DataNodes, but only to the first one. This first DataNode is then responsible for creating the pipeline and propagating the block to its counterparts. While a replication pipeline is ongoing and a new block is being written to the cluster, the content of the block is not sent as one single 128 MB message: rather, it is sent in smaller packets (e.g., 64 kB) in a streaming fashion over the network.
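Here is a minimal Python sketch of the pipeline idea (not the real DataNode transfer protocol): the client streams the block in small packets to the first DataNode only, and each packet is then forwarded down the pipeline.

```python
# Minimal sketch of a replication pipeline (illustrative only): the client
# streams a block in small packets to the first DataNode, and each packet is
# forwarded downstream until every replica has received it.

PACKET_SIZE = 64 * 1024  # e.g., 64 kB packets

def stream_block(block_data: bytes, pipeline: list) -> None:
    """pipeline is an ordered list of per-DataNode receive buffers."""
    for offset in range(0, len(block_data), PACKET_SIZE):
        packet = block_data[offset:offset + PACKET_SIZE]
        # The client only talks to the first DataNode...
        pipeline[0].extend(packet)
        # ...which forwards the packet to the next one, and so on.
        for i in range(1, len(pipeline)):
            pipeline[i].extend(packet)

datanodes = [bytearray(), bytearray(), bytearray()]  # replication factor 3
stream_block(b"x" * (300 * 1024), datanodes)
print([len(d) for d in datanodes])  # every replica received the full block
```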

replicas

Replica placement takes the cluster's rack topology into account. By default, the first replica of a block is written to the same machine that the client is running on (assuming the client runs on one of the cluster's DataNodes). The second replica is written to a DataNode located in a different rack than the client; call this rack B. The third replica is written to another DataNode in the same rack B. Further replicas are placed mostly at random, but respecting two simple rules for resilience: at most one replica per node, and at most two replicas per rack.
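These rules can be sketched in Python as follows. This is a simplification (the real logic in Hadoop lives in a class called BlockPlacementPolicyDefault and handles many more corner cases); the topology, node names, and rack names are made up, and each rack is assumed to contain at least two DataNodes.

```python
import random

# Sketch of the default placement rules described above (simplified).
# `topology` maps rack name -> list of DataNodes in that rack.

def place_replicas(topology: dict, client_node: str, client_rack: str,
                   replication: int = 3):
    replicas = []

    def rack_of(node):
        return next(r for r, nodes in topology.items() if node in nodes)

    # 1st replica: on the client's own machine (if it is a DataNode).
    replicas.append(client_node)
    # 2nd replica: on a DataNode in a different rack ("rack B").
    rack_b = random.choice([r for r in topology if r != client_rack])
    second = random.choice(topology[rack_b])
    replicas.append(second)
    # 3rd replica: another DataNode in the same rack B.
    replicas.append(random.choice([n for n in topology[rack_b] if n != second]))
    # Further replicas: mostly random, but at most one replica per node
    # and at most two replicas per rack.
    while len(replicas) < replication:
        candidate = random.choice([n for ns in topology.values() for n in ns])
        same_rack = sum(1 for n in replicas if rack_of(n) == rack_of(candidate))
        if candidate not in replicas and same_rack < 2:
            replicas.append(candidate)
    return replicas

topology = {"rackA": ["dn1", "dn2", "dn3"],
            "rackB": ["dn4", "dn5", "dn6"],
            "rackC": ["dn7", "dn8", "dn9"]}
print(place_replicas(topology, client_node="dn1", client_rack="rackA"))
```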

Fault tolerance

HDFS has a single point of failure: the NameNode. If the metadata stored on it is lost, then all the data on the cluster is lost, because it is no longer possible to reassemble the blocks into files. For this reason, the metadata is backed up. More precisely, the file namespace, containing the directory and file hierarchy as well as the mapping from files to block IDs, is backed up to a so-called snapshot. Updates to the file system arriving after the snapshot has been made are instead appended to a journal, called the edit log, which lists the updates sorted by time of arrival. The snapshot and the edit log are stored either locally or on a network-attached drive (not on HDFS itself).
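A minimal Python sketch of this checkpoint-plus-journal idea is shown below (illustrative only; in HDFS the snapshot file is known as the fsimage). The file names and the edit format are made up for the example.

```python
import json

# Sketch of the snapshot + edit-log idea (not the actual NameNode code):
# the namespace is checkpointed to a snapshot, later updates are appended
# to an edit log, and recovery replays the log on top of the snapshot.

SNAPSHOT, EDIT_LOG = "fsimage.json", "edit_log.jsonl"

def save_snapshot(namespace: dict) -> None:
    with open(SNAPSHOT, "w") as f:
        json.dump(namespace, f)
    open(EDIT_LOG, "w").close()  # a fresh checkpoint empties the journal

def log_edit(edit: dict) -> None:
    with open(EDIT_LOG, "a") as f:  # updates are only appended, in order
        f.write(json.dumps(edit) + "\n")

def recover() -> dict:
    """Rebuild the namespace: load the snapshot, then replay the edit log."""
    with open(SNAPSHOT) as f:
        namespace = json.load(f)
    with open(EDIT_LOG) as f:
        for line in f:
            edit = json.loads(line)
            if edit["op"] == "create":
                namespace[edit["path"]] = edit["meta"]
            elif edit["op"] == "delete":
                namespace.pop(edit["path"], None)
    return namespace

save_snapshot({"/logs/a.log": {"blocks": 2}})
log_edit({"op": "create", "path": "/logs/b.log", "meta": {"blocks": 1}})
print(recover())  # both files are present after recovery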

Logging and importing data

Two tools are worth mentioning: Apache Flume lets you collect, aggregate and move log data to HDFS. Apache Sqoop lets you import data from a relational database management system to HDFS.

references

https://ghislainfourny.github.io/big-data-textbook/
https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Data_Replication


