The proliferation of unstructured data has made collecting and handling large, varied data sets at scale a real challenge for organizations. Organizations often grapple with setting up and sizing NoSQL databases when managing unstructured data from internal processes, systems, and customers. Sizing is a central challenge among the broad set of big data challenges such organizations face. How can data engineers and IT managers arrive at the right hardware projections for their data? What factors affect the size, performance, and scalability of the database as data sets grow and customer outreach expands? These are some of the questions we hope to address in this post. Specifically, we will look at HBase, a modern NoSQL database that provides real-time read and write access to large datasets.
HBase is a remarkable tool for indexing and managing massive volumes of data. Apart from its characteristics of random access, consistent reads/writes, flexible schema, et al., HBase is widely acknowledged for its ability to scale horizontally. The rate at which data multiplies in enterprises can be unpredictable. HBase's automatic sharding makes efficient data balancing possible with the addition of new data nodes, i.e., horizontal scaling, without affecting the existing data. Organizations can truly leverage HBase's benefits, however, only when they size the database appropriately, according to the requirements of the application and use cases in the organization. There is therefore no one-size-fits-all guideline for sizing an HBase cluster. That said, a number of key factors influence HBase sizing, and we will discuss these in this post.
Memory required for Data Storage
The size of the data is vital in determining the HBase configuration. However, the physical data size on disk is distinct from, though dependent on, the logical size of the data. Physical data size on disk is:
- Increased by HBase overhead
- Decreased by compression and data block encoding
- Increased by the region server's write-ahead log (WAL)
- Increased by HDFS replication – usually a three times (3x) replication scheme, although this is customizable
Note: At the time of writing this post, the HBase version in common use is 1.1.2
We will do a case study below, to illustrate the factors involved in HBase sizing. Let us start with the following assumptions:
- Volume of data: the estimated intake is 5 million rows per day, with each record assumed to be 1,000 bytes
- Data retention: the number of days the data needs to persist in the database. Assumption: 10 years
- Year-on-year increment: the expected increase in the volume of data with each passing year. Assumption: 10%
- Compression ratio: the expected compressed size relative to the original data. Assumption: 10%
| Year | Rows per year (millions) | Size (GB) | Size incl. overhead (GB) | With compression (GB) | Cumulative (GB) | Machines required |
|------|--------------------------|-----------|--------------------------|-----------------------|-----------------|-------------------|
Table 1.1 Sizing of HBase (year-on-year in rows)
Volume of Data
The volume of data stored in HBase in this example appears in the second column of Table 1.1 above. There is a straightforward relationship between the data collected per day and the total size of data stored. We start from the raw data size, add overhead, and then apply compression to arrive at the data added per year; the cumulative column shows how data builds up in the data store year on year. One example is worked out below for clarity.
Rows per year = 5000000 × 30 × 12 = 1800000000, or 1,800 million (assuming 30-day months)
Considering the initial assumption of 1,000 bytes per row: 1800000000 × 1000 bytes
Converting to GB: 1800000000 × 1000 ÷ 1024³ ≈ 1676.38 GB
With the 10% YoY increment, Year 2 is 1800000000 + (1800000000 × 0.1) = 1,980 million rows
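The arithmetic above can be captured in a short Python sketch (a hedged illustration using this post's assumptions: 30-day months, 1,000-byte records, 10% year-on-year growth, and GB computed as 1024³ bytes):

```python
# Yearly data volume under this post's assumptions (illustrative values).
ROW_SIZE_BYTES = 1000        # assumed size of one record
ROWS_PER_DAY = 5_000_000     # assumed daily ingest
YOY_GROWTH = 0.10            # 10% year-on-year increase
GIB = 1024 ** 3              # bytes in one GiB

def rows_in_year(year):
    """Rows ingested in a given year (year 1 = first year), 30-day months."""
    return ROWS_PER_DAY * 30 * 12 * (1 + YOY_GROWTH) ** (year - 1)

def raw_size_gib(year):
    """Raw (logical) data size for that year, in GiB."""
    return rows_in_year(year) * ROW_SIZE_BYTES / GIB

print(f"Year 1: {rows_in_year(1):,.0f} rows, {raw_size_gib(1):,.2f} GiB")
# Year 1: 1,800,000,000 rows, 1,676.38 GiB
```

Calling `rows_in_year` with later years compounds the 10% growth, which is how the year-on-year rows in Table 1.1 are generated.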
Impact of Overhead on Data Volume
This section describes the calculation behind column 4 of Table 1.1, size (in GB) including overhead.
Overhead in terms of memory indicates the additional memory requirements, apart from the data itself. The HBase key, consisting of the row key, column family, column qualifier, time-stamp and data type, is the indexing key for data storage.
Size with overhead = raw data size × overhead ratio
The overhead ratio itself is derived from the components of the HBase key described above.
Now, we can calculate the size of the record in bytes as shown below:
Fixed part needed by KeyValue format = Key Length + Value Length + Row Length + CF Length + Timestamp + Key Type
Variable part needed by KeyValue format = Row + Column Family + Column Qualifier + Value
Total bytes required = Fixed part + Variable part
HFiles store rows as sorted key-value pairs on disk. HFiles contain a multi-layered index, which allows HBase to seek to the data without scanning the whole file. The calculation of the overhead ratio must also account for these HFile index structures. One illustration is below.
| Parameter | Value |
|-----------|-------|
| #columns (C) | 10 |
| Avg column size (ACS) | 100 bytes |
| Bytes overhead per column (BpC) | 24 |
| #column families | 1 |
| Rowkey size (Rk) | 10 bytes |
| HFile indexes (Hi) | 20% |
Table 1.2 Overhead Assumptions
Overhead ratio = (C x (ACS+BpC)+Rk) ÷ (C x ACS) x (1+Hi)
Overhead and Overhead Ratio for the example above are:
Overhead = (10 × (100+24) + 10) ÷ (10 × 100) × (1+0.2) = 1250/1000 × 1.2 = 1.5
Ratio = Overhead × 100 = 150%
Average column size: record size ÷ number of columns
Note: The volume-of-data section assumes 1,000 bytes per record, so ACS = 1000 ÷ 10 = 100 bytes
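As a quick check, the overhead ratio from the formula above can be computed directly (a sketch; the variable names mirror the abbreviations in Table 1.2):

```python
# Overhead ratio from the Table 1.2 assumptions.
C = 10      # number of columns
ACS = 100   # average column size, bytes
BpC = 24    # bytes of overhead per column
Rk = 10     # rowkey size, bytes
Hi = 0.20   # HFile index overhead (20%)

overhead_ratio = (C * (ACS + BpC) + Rk) / (C * ACS) * (1 + Hi)
print(f"Overhead ratio: {overhead_ratio:.2f}")  # Overhead ratio: 1.50
```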
Compression Effects on Data Volume
Compression helps in reducing the data volume. Depending on the compression scheme adopted, we may have different algorithms in use, and their compression performance will impact the data size. Provided we know the compression ratio, we can calculate the size of the data (considering overheads) as below:
Size (in GB) with compression = Size with overhead * compression ratio
Column 5 = Column 4 * 0.1 => 4190.95 x 0.1 ≈ 419.10 (first record)
Compression ratio: compressed data size ÷ uncompressed data size. In this example it is assumed to be 10%, i.e., the compressed data occupies one-tenth of the original space.
Enabling compression does not by itself improve HBase query performance, and it adds some overhead during data loading. The Gzip codec achieves a better compression ratio than Snappy, at the cost of speed. In general, the heavier the compression, the more time is taken to load and query HBase.
In the following test, the original uncompressed file is 100 GB:
| | Uncompressed | gzip (.gz) | Snappy (.snappy) |
|---|---|---|---|
| Time to load data into HBase table (ms) | 565 | 1029 | 585 |
Table 1.3 HBase workload with compression
Now, based on the earlier assumption of 10%, we can arrive at an estimate for the data size after compression. In Table 1.1, the database size post-compression is provided for the example in question. Note that depending on the type of data being stored and considerations of size and performance, we may use different compression algorithms with different compression ratios.
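With the compression ratio defined as compressed size ÷ uncompressed size, the post-compression size follows mechanically. A sketch, using this example's 10% assumption (the 4190.95 GB input is the first-year size with overhead from Table 1.1):

```python
def compressed_size_gb(size_with_overhead_gb, compression_ratio=0.10):
    """Post-compression size, where compression_ratio = compressed / uncompressed."""
    return size_with_overhead_gb * compression_ratio

# First-year size with overhead from Table 1.1: 4190.95 GB -> roughly 419.10 GB compressed.
print(compressed_size_gb(4190.95))
```

Swapping in a different codec simply means passing a different `compression_ratio`.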
Hardware Requirements: Key Factors
Now that we have understood data volume and compression aspects, we can estimate the hardware requirements.
Machines required = max( ⌈size of compressed data ÷ usable data per machine⌉, minimum number of machines )
Before computing the hardware requirements, let’s look at two important considerations – the region server and the JVM memory management considerations below.
Region Server and JVM Heap considerations
We have calculated the size of compressed data in the earlier sections. It is important to note that the usable data per machine depends on the region server. Region servers are responsible for all read and write requests for the regions they serve, and also split regions that exceed the configured region-size threshold. HBase maintains two cache structures: the "memory store" (memstore) and the "block cache":
- The memory store accumulates data edits as they’re received, buffering them in memory
- The block cache keeps data blocks resident in memory after they’re read
It is also important to note that block caches use the underlying Java Virtual Machine heap in HBase to store cached data. This means that any factor which adversely impacts the JVM's garbage collection (GC) will also impact cache performance and, by extension, database query performance. Naturally, this does not suit the demands of many applications, particularly at large heap sizes. To mitigate such performance issues, HBase includes another in-memory data structure called the BucketCache, which utilizes off-heap memory. This is indicated as Ohc (off-heap cache) in the table below.
| Region Server (RS) best practices | |
|-----------------------------------|---|
| Memory per RS (M) | 30.50 GB |
| Recommended region size (RecRS) | 5.00 GB |
| Block cache (Bc) | 12.20 GB |
| Memstore (Ms) | 12.20 GB |
| Memstore flush size (Mfs) | 128.00 MB |
| #regions per RS | 95.31 |
| Data handled per RS (DpRS) | 476.56 GB |
| Off-heap cache (Ohc) | 95.31 GB |
Table 1.4 Region server best practices
In Table 1.4 above, we assume that 40% of RS memory (M) is allocated to the memstore and another 40% to the block cache – these are the commonly recommended defaults. Now,
Memstore= Memory per RS x Memstore fraction
Block Cache = Memory per RS x Blockcache fraction
Memstore & Blockcache = 30.50 x 0.4 = 12.20 GB
Regions per Region Server
Although it is best to limit the number of regions per region server, we can use a rule of thumb to estimate how many regions each region server should carry:
Number of regions per RS = ( Ms ÷ Mfs ) × 1000, with Ms in GB and Mfs in MB
Data handled per RS = No. of regions per RS x RecRS
The off-heap cache that mitigates dependence on the JVM heap is assumed here to be 20% of the data handled per region server. Our calculations so far therefore give: 1 region server → 476.56 GB of data.
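The region-server arithmetic above can be summarized in a short sketch (values are the assumed best practices from Table 1.4; `Ms` is in GB and `Mfs` in MB, hence the factor of 1000 this post uses):

```python
# Region-server capacity arithmetic from Table 1.4 (assumed best-practice values).
M = 30.50            # memory per region server, GB
RecRS = 5.00         # recommended region size, GB
Mfs = 128.0          # memstore flush size, MB
MEM_FRACTION = 0.4   # fraction of RS memory for the memstore (and for block cache)
OHC_FRACTION = 0.2   # off-heap cache as a fraction of data handled

Ms = M * MEM_FRACTION                  # memstore size, GB
regions_per_rs = Ms * 1000 / Mfs       # Ms converted GB -> MB with this post's factor of 1000
data_per_rs = regions_per_rs * RecRS   # GB of data handled per region server
ohc = data_per_rs * OHC_FRACTION       # off-heap cache, GB

print(f"{regions_per_rs:.2f} regions/RS, {data_per_rs:.2f} GB/RS, Ohc {ohc:.2f} GB")
# 95.31 regions/RS, 476.56 GB/RS, Ohc 95.31 GB
```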
Machine Specifications and Sizing
Finally, it is impossible to fully size an HBase cluster without considering the machine specifications of each node in the cluster. The promise of HBase is the ability to use commodity hardware to enable businesses to scale horizontally. That said, commodity hardware can produce a range of results, depending on the core specification and how the cluster is set up. What we see below is a depiction of a dedicated HBase cluster. Naturally, in real-world scenarios these resources are shared, which forces us to consider factors extraneous to HBase when determining the eventual configuration.
| Parameter | Value |
|-----------|-------|
| #Disks | 6 to 10 |
| Size of each disk (GB) | 600 to 1024 |
Table 1.5 Machine specifications for a HBase-only hypothetical cluster
HBase available memory = 512 GB, on the assumption that the machine is fully dedicated to HBase.
No. of region servers per machine = HBase memory available ÷ (off-heap cache + Memory per RS)
and therefore, for the first example:
#region server per machine = 512 / (95.31+30.50) ≈ 4
Usable data = Region servers per machine * Data handled per region server
Usable data = 4 x 476.56 = 1906.24 GB
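The per-machine figures follow the same pattern (a sketch; the 512 GB of HBase-dedicated memory and the Table 1.4 figures are this section's assumptions):

```python
# Region servers per machine and usable data, under this section's assumptions.
HBASE_MEMORY_GB = 512.0   # machine assumed fully dedicated to HBase
OHC_GB = 95.31            # off-heap cache per region server (Table 1.4)
MEM_PER_RS_GB = 30.50     # heap memory per region server (Table 1.4)
DATA_PER_RS_GB = 476.56   # data handled per region server (Table 1.4)

rs_per_machine = int(HBASE_MEMORY_GB / (OHC_GB + MEM_PER_RS_GB))  # rounded down
usable_data_gb = rs_per_machine * DATA_PER_RS_GB

print(f"{rs_per_machine} region servers/machine, {usable_data_gb:.2f} GB usable")
# 4 region servers/machine, 1906.24 GB usable
```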
Estimating the Number of Machines
Finally, now that we have addressed the many factors and requirements of the HBase cluster, we can narrow down to the number of machines required.
No. of machines required = max( ⌈size of compressed data ÷ usable data per machine⌉, min. no. of machines )
We can estimate the maximum number of machines based on the compressed data size and the usable data per machine as below:
Size of compressed data ÷ Usable data per machine = 419.10 ÷ 1906.24 ≈ 0.22
Max(0.22, 3) = 3
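The final step can be sketched as follows (numbers carried over from the earlier calculations; the minimum of 3 machines is this post's assumption):

```python
import math

def machines_required(compressed_gb, usable_per_machine_gb, min_machines=3):
    """Larger of the data-driven machine count and the minimum cluster size."""
    data_driven = math.ceil(compressed_gb / usable_per_machine_gb)
    return max(data_driven, min_machines)

# First-year compressed data vs. usable data per machine from the example above.
print(machines_required(419.10, 1906.24))  # 3
```

Re-running the function on the cumulative compressed size for later years is how the "machines required" column of Table 1.1 is populated.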
With the above assumptions, the sizing chart indicates that in the 10th year the cluster needs to be scaled horizontally by adding a node to keep up with the application's growing data. Readers should note that the sizing of a NoSQL database like HBase depends largely on the machine configuration, the volume of data, and the tool-specific overhead discussed above; any change in the assumptions made in this post will yield different sizing requirements. To help, a precalculated workbook is attached here – Sizing Sheet (HBase) – so readers can understand and adapt the sizing to their own numbers.
In this post, we have worked through a detailed case study of NoSQL database sizing with HBase, and we have also discussed the many factors that affect the sizing of such databases. The proliferation of big data use cases in the industry requires data engineers to build scalable databases that perform effectively. We considered not only year-on-year data requirements and how they change, but also the impact of the Java Virtual Machine and other memory considerations on HBase performance.