PROJECT B
Ketan Karande
X17100062

I. INTRODUCTION

Today we are experiencing a heavy flow of data coming from varied sources in varied formats. It is important for any organization to store this data for analysis. However, storing such data is difficult because it does not follow a fixed schema, which is a limitation of a generic RDBMS.

To satisfy this requirement, No-SQL databases came into play. However, choosing a No-SQL database can be a difficult task, especially when data retrieval has specific requirements. For this purpose, tools like YCSB are used to benchmark the performance of various No-SQL databases.

The aim of this report is to present the findings of performance testing carried out on HBase and MongoDB using YCSB. In this experiment, a total of 60 operations are conducted across Workload A, Workload B and Workload C, and the performance of HBase and MongoDB is judged on the basis of their runtime, overall throughput, and read and write latencies. This report covers the key features of both storage systems and their architecture, along with the benchmarking results for both using the YCSB tool.

II. KEY CHARACTERISTICS

A. HBASE [1]

HBase is a database developed by Apache. It is a column based database that stores data in key-value pairs. It uses the fault tolerance features of HDFS and forms a crucial component of the Hadoop ecosystem. It can be termed a data store rather than a database, since it lacks some RDBMS features such as typed columns and triggers.

HBase is used to store semi-structured data with differing data types, row sizes, and column sizes. HBase is built on the concepts of the row key and the column family. Each column family groups multiple columns. The columns can be dynamic in nature, but the column families stay static.
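To make the row key and column family concepts concrete, the following is a minimal Python sketch of the data model described above. It is an in-memory toy and not the HBase client API; the class name ToyHBaseTable and the sample row keys, families, and values are hypothetical and chosen only for illustration.

```python
class ToyHBaseTable:
    """Toy model: row key -> column family -> column -> value."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self.column_families = frozenset(column_families)
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        # Writing to an undeclared column family is rejected.
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family!r}")
        # Columns inside a declared family can be created on the fly.
        row = self.rows.setdefault(row_key, {})
        row.setdefault(family, {})[qualifier] = value

    def get(self, row_key, family, qualifier):
        # Missing rows, families, or columns simply yield None.
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)


table = ToyHBaseTable(["info", "stats"])
table.put("user#001", "info", "name", "Asha")
table.put("user#001", "stats", "logins", 12)
# A different row may use a different set of columns in the same family.
table.put("user#002", "info", "email", "ravi@example.com")
```

In this sketch, two rows of the same table carry different columns, while any attempt to write into a family that was not declared at creation time fails.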
The key terminologies in HBase are:
HBase Table: A collection of rows, stored in specific locations termed Regions.
HBase Row: The collected data that is stored in HBase.
Row Key: The identifier under which every entry in HBase is stored.
Column Family: A collection of columns, spanning multiple rows that are addressed by their row keys.

HBase provides a platform to perform real-time read and write operations, provided a Hadoop cluster is installed. HBase does not enforce a predefined schema for columns, so data of any form can be loaded and stored in it. Hence it is easier for users to store data without defining its schema in advance. Since HBase tables are stored in different regions, the regions can automatically split and be redistributed as the data grows [2]. This adds to the scalability of HBase.

B. MongoDB [3]

MongoDB is a schema-free, collection-based No-SQL database that groups data in sets termed collections. A collection is designed to contain different types of data objects. Data objects are also called documents, and they consist of key-value pairs. These documents are stored in a JSON-like format known as BSON, which makes storage and retrieval easy. The BSON format allows a document to store Date and Binary data types, which JSON cannot represent.

MongoDB supports aggregation efficiently. It uses keys to group documents together and aggregates their corresponding values. To solve the scalability problems caused by abrupt growth in data, MongoDB uses sharding. The data is stored in shards that work in parallel, which helps in retrieving data as well as storing it in a distributed manner. Shards are accessed through services called query routers.
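The idea of a query router that directs each request to one of several shards can be sketched in a few lines of Python. This is a deliberately simplified illustration under stated assumptions: the class ToyQueryRouter, the shard names, and the hash-modulo placement rule are all hypothetical and are not how MongoDB itself assigns data to shards.

```python
import hashlib


class ToyQueryRouter:
    """Toy router: hashes a shard key to pick the shard that owns it."""

    def __init__(self, shard_names):
        # Each shard is modelled as a plain dict of key -> document.
        self.shards = {name: {} for name in shard_names}
        self._names = sorted(self.shards)

    def _shard_for(self, shard_key):
        # A stable hash keeps a given key on the same shard every time.
        digest = hashlib.sha256(str(shard_key).encode("utf-8")).hexdigest()
        return self._names[int(digest, 16) % len(self._names)]

    def insert(self, shard_key, document):
        self.shards[self._shard_for(shard_key)][shard_key] = document

    def find(self, shard_key):
        # Only the owning shard is consulted, not the whole cluster.
        return self.shards[self._shard_for(shard_key)].get(shard_key)


router = ToyQueryRouter(["shard0", "shard1", "shard2"])
for user_id in range(100):
    router.insert(user_id, {"_id": user_id, "name": f"user{user_id}"})
```

The point of the sketch is that the client talks only to the router, while the documents themselves are spread over several independent stores.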
While accessing data, MongoDB uses the BSON format, which is lightweight and gives good performance while encoding and decoding data.

MongoDB is a schema-less storage system that can store data that is not in any specific format. This widens the range of real-world applications of MongoDB.

III. DATABASE ARCHITECTURE

A. HBASE [1]

HBase works on a master-slave architecture, where a single master machine handles multiple slave machines. In HBase, sets of rows termed Regions are grouped together and handled by Region Servers. A Region Server can serve multiple regions. These Region Servers act as slaves and are controlled by the HMaster, which acts as the master machine. The HMaster is responsible for assigning regions to the Region Servers and acts as a load balancer for them.

The three main components of HBase are:

1. HMaster [1]
HMaster is the main working head of the HBase cluster. If the active HMaster fails, its work is handed over to a secondary HMaster, which is how failover is handled. Cluster-level operations performed on HBase are the responsibility of the HMaster, which manages and monitors them.

2. Region Server [1]
Region Servers act like data nodes for the HBase cluster. They handle all the read and write caches. The Block Cache is a read cache that evicts the least recently used data when it gets full. Data that has not yet been written to disk is held in the MemStore, which is a write cache.

3. Zookeeper [1]
Zookeeper is a coordination service that holds the information about the Region Servers and the data nodes they are allocated to. In case of node failure, Zookeeper helps redirect requests to the next available Region Server and coordinates the failover. To use HBase, a client must first communicate with Zookeeper through the ZK Quorum before connecting to the Region Servers and the HMaster.
In case of failures, Zookeeper is responsible for detecting the failed nodes so that their recovery can be triggered.

[Figure: HBase architecture. Labels: Client; Zookeeper (ZK Quorum); Region Servers A to D; HMaster.]

B. MongoDB [4]

MongoDB is a No-SQL database that stores all its data as key-value pairs. This data is stored in the BSON format, which is easy to use and process. The key-value pairs are stored in documents, which are grouped into collections; a collection is a set of documents that may hold different data types.

MongoDB works on the principle of a multi-model architecture. In other systems, data must be collected and passed through multiple technologies using complex integration code. MongoDB instead offers a flexible architecture in which the database automatically moves data between multiple storage engines using native replication. This flexibility allows users to combine storage engines according to their deployment needs and workloads. It offers a pluggable storage architecture that can be customised according to the needs, resources, and deployment designs specified by the organization.
This reduces the complexity caused by running multiple databases with different specifications and goals.

MongoDB provides different tools to interact with the database:
Mongo shell: A rich, interactive shell based on JavaScript.
MongoDB Compass: An interactive GUI that makes MongoDB comfortable to use.
These tools help in creating queries in MongoDB without prior knowledge of the MongoDB query language. They help in visualizing the data in both JSON and table formats.

MongoDB architecture also facilitates automatic horizontal sharding, which is responsible for the automatic distribution of data into smaller partitions termed shards. This facility helps in identifying bottlenecks in the system and thus opens scope for repair. It helps in load balancing of the cluster. Unlike with other databases, developers do not have to write their own sharding code, since MongoDB offers automatic shard generation.

[Figure: MongoDB data hierarchy. Labels: MongoDB; Database A, Database B; Collection A, Collection B; Document A, Document B; Key-Value Pairs A to D.]

IV. SECURITY

Nowadays, with the increase in cyber crime and data theft, additional security is a must for any data storage system. Not only retrieval, but user-restricted retrieval of data is the need of the hour. It has been predicted that cyber crime could cost the economy about $6 trillion annually by 2021 if it continues at today's pace.
In this report, the additional security features provided by HBase and MongoDB are discussed to understand how secure it is to use these datastores at a commercial level.

For comparison purposes, we compare the security options provided by both data storage systems using the following features:

1. Authentication
User authentication lets companies give employees restricted access to the data of their respective departments. However, when users need to be given different authentication from within the systems, it is hard to monitor access from all angles.

2. Authorization
After authentication comes authorization, where the actions of users are restricted. Not every authenticated user is authorized to make changes in the database. Generally, authorization is done by creating classes of users, e.g. Admins, Server Head, System Admin, etc.

3. Auditing
Even after granting authentication and authorization to a class of users, it is important for any organization to track what users are doing with their authorizations. It is crucial for any business to check that its resources are not being misused.

4. Encryption
Encryption of the data within a database ensures that the data can be read by authorized personnel only. Even if the database gets hacked, it is important that the data remains encrypted to minimise damage.

A. HBase

1. Authentication [5]
HBase authentication operates at the Remote Procedure Call (RPC) level and is based on the Simple Authentication and Security Layer (SASL). It supports Kerberos, which uses network protocols and secret encrypted keys for authentication. SASL grants authentication on a per-connection basis, which can sometimes put a heavy load on the system and hinder its performance [6].

2. Authorization [5]
The authorization feature in HBase is known as the Access Control List (ACL). ACL was not available until HBase 0.92 (CDH4).
Hence, systems using earlier versions cannot support the authorization operations that an organization may need.

Since HBase depends on HDFS and Zookeeper to handle its operations, including authentication and authorization, the performance of data retrieval or modification is affected every time a user attempts it, because the request first has to pass through Zookeeper before it is processed.

3. Auditing [5]
HBase has dedicated log files for viewing user actions, but since it works in distributed mode, forensic auditing of these logs is difficult, and a lot of time can be spent retrieving the logs from all the HBase nodes.

4. Encryption [5] [7]
Encryption in HBase works by encrypting the entire HFile. It is not yet possible to encrypt a specific set of columns, which is a big limitation of encryption in HBase.

B. MongoDB

1. Authentication [8]
Integrating MongoDB with the organization's information infrastructure enables centralized control over user access. When a user's access is to be revoked, a small change in the central directory blocks that user from accessing data across the whole system. MongoDB can be integrated easily with external mechanisms specifically designed for security.

2. Authorization [8]
In MongoDB it is feasible to expose a certain subset of data to a certain class of employees, and to mask or filter out certain fields based on the employee's role.

3. Auditing [8]
MongoDB's native audit logs help in tracing user actions on the database, tracking every read or write event performed. They provide a forensic record that can be very useful in detecting malicious activity within the organization.

4. Encryption [8]
With MongoDB, it is easy to encrypt data on disk and even in backups, since it uses a dedicated encrypted storage engine that protects the data inside the database.
Because of its natively built encryption engine, the overhead of managing external encryption tools can be avoided.

V. PERFORMANCE TEST PLAN

For benchmarking purposes, the YCSB (Yahoo! Cloud Serving Benchmark) tool is used. YCSB is an open source tool used to evaluate the capabilities of computer programs, mostly No-SQL databases.

The benchmarking of HBase and MongoDB is carried out with a total of 60 operations performed over 3 different workloads (Workload A, Workload B, Workload C), each run twice (to remove noise). The overall test plan is explained by the following steps.

1. Installing Hadoop-2.9.0, HBase-1.2.6 in Pseudo-Distributed Mode, YCSB-0.11.0 and MongoDB-3.2.10.
2. Running each of the 5 record counts twice in HBase for Workload A, Workload B, and Workload C. The average of the 2 outputs for every workload is then taken as the final output, to remove possible noise created during the test. A total of 30 operations are performed here: (5 record counts * 3 workloads) * 2.
3. Running the same record counts twice in MongoDB for Workload A, Workload B, and Workload C. The average of the 2 outputs for every workload is then taken as the final output, to remove possible noise created during the test. A total of 30 operations are performed here: (5 record counts * 3 workloads) * 2.
4. The averaged final outputs are used for plotting and comparing the two data storage systems for each of the 3 chosen workloads. Plots are made in the following manner:
a) Read Average Latency against No. of Read Operations
b) Update Average Latency against No. of Update Operations
c) Overall Throughput against Total No. of Operations
d) Runtime against Total No. of Operations
5. Conclusion

VI. EVALUATION AND RESULTS

NOTE: Every operation count for every workload is run twice, and the average of the 2 results is taken as the final result. This helps in removing noise from the measurements. For example:
At an operation count of 25000 for Workload A in MongoDB, the following 2 iterations of outputs were taken:

[Figure: Output Iteration 1]
[Figure: Output Iteration 2]

FINAL RESULT (for Workload A with record count 25000 in MongoDB) = average of Output Iteration 1 and Output Iteration 2

Runtime = (6019 + 5685)/2 = 5852
Throughput = (4153.513 + 4397.53)/2 ≈ 4275.52
Read latency = (180.795 + 163.088)/2 ≈ 171.94
Read operations = (12516 + 12591)/2 = 12553.5
Write latency = (245.149 + 238.072)/2 ≈ 241.61
Write operations = (12484 + 12409)/2 = 12446.5

Similarly, all other calculations are made and the averaged, noise-reduced output is considered.

A. Workload A

Workload A is a 50% read, 50% write mix.

Sample parameters

1. Read Operation
Graph: Read Average Latency against No. of Read Operations
Visual Aids: MongoDB, HBase
Scale: X-axis: 100-350 (us); Y-axis: 10,000 to 150,000
Theory: Read latency is the time taken by the system to complete a read request initiated by the user. It should be as low as possible to get faster outputs.
Conclusion: MongoDB has lower read latency than HBase; hence MongoDB is better at read operations than HBase.

2. Write Operation
Graph: Write Average Latency against No. of Write Operations
Visual Aids: MongoDB, HBase
Scale: X-axis: 100-800 (us); Y-axis: 10,000 to 150,000
Theory: Write latency is the time taken by the system to complete a write request initiated by the user. It should be as low as possible to get faster outputs.
Conclusion: MongoDB has lower write latency than HBase; hence MongoDB is better at write operations than HBase.

3. Overall Throughput
Graph: Overall Throughput against No. of Record Operations
Visual Aids: MongoDB, HBase
Scale: X-axis: 1,000-6,500 (ops/sec); Y-axis: 10,000 to 300,000
Theory: Overall throughput is the number of operations the system completes in a given time. The higher the throughput, the better the performance of the system.
Conclusion: The overall throughput of MongoDB is much higher than that of HBase.
Hence the operations can be completed much faster in MongoDB, which implies that it is better than HBase.

4. Runtime
Graph: Runtime against Total Workload Records
Visual Aids: MongoDB, HBase
Scale: X-axis: 1,000-150,000 (ms); Y-axis: 10,000 to 300,000
Theory: Runtime is the total time taken by the system to run all the operations for the given record counts. The lower the runtime, the faster the output.
Conclusion: The runtime of MongoDB is much lower than that of HBase. Hence MongoDB is better.

B. Workload B

Workload B is a 95% read, 5% write mix.

Sample parameters

1. Read Operation
Graph: Read Average Latency against No. of Read Operations
Visual Aids: MongoDB, HBase
Scale: X-axis: 100-300 (us); Y-axis: 1,000 to 300,000
Theory: Read latency is the time taken by the system to complete a read request initiated by the user. It should be as low as possible to get faster outputs.
Conclusion: MongoDB has lower read latency than HBase; hence MongoDB is better at read operations than HBase.

2. Write Operation
Graph: Write Average Latency against No. of Write Operations
Visual Aids: MongoDB, HBase
Scale: X-axis: 200-1,400 (us); Y-axis: 1,000 to 16,000
Theory: Write latency is the time taken by the system to complete a write request initiated by the user. It should be as low as possible to get faster outputs.
Conclusion: MongoDB has lower write latency than HBase; hence MongoDB is better at write operations than HBase.

3. Overall Throughput
Graph: Overall Throughput against No. of Record Operations
Visual Aids: MongoDB, HBase
Scale: X-axis: 1,000-8,100 (ops/sec); Y-axis: 10,000 to 300,000
Theory: Overall throughput is the number of operations the system completes in a given time. The higher the throughput, the better the performance of the system.
Conclusion: The overall throughput of MongoDB is much higher than that of HBase.
Hence the operations can be completed much faster in MongoDB, which implies that it is better than HBase.

4. Runtime
Graph: Runtime against Total Workload Records
Visual Aids: MongoDB, HBase
Scale: X-axis: 10,000-100,000 (ms); Y-axis: 10,000 to 300,000
Theory: Runtime is the total time taken by the system to run all the operations for the given record counts. The lower the runtime, the faster the output.
Conclusion: The runtime of MongoDB is much lower than that of HBase. Hence MongoDB is better.

C. Workload C

Workload C is 100% read.

Sample parameters

1. Read Operation
Graph: Read Average Latency against No. of Read Operations
Visual Aids: MongoDB, HBase
Scale: X-axis: 100-250 (us); Y-axis: 1,000 to 300,000
Theory: Read latency is the time taken by the system to complete a read request initiated by the user. It should be as low as possible to get faster outputs.
Conclusion: MongoDB has lower read latency than HBase; hence MongoDB is better at read operations than HBase.

2. Overall Throughput
Graph: Overall Throughput against No. of Record Operations
Visual Aids: MongoDB, HBase
Scale: X-axis: 1,000-9,000 (ops/sec); Y-axis: 10,000 to 300,000
Theory: Overall throughput is the number of operations the system completes in a given time. The higher the throughput, the better the performance of the system.
Conclusion: The overall throughput of MongoDB is much higher than that of HBase. Hence the operations can be completed much faster in MongoDB, which implies that it is better than HBase.

3. Runtime
Graph: Runtime against Total Workload Records
Visual Aids: MongoDB, HBase
Scale: X-axis: 1,000-75,000 (ms); Y-axis: 10,000 to 300,000
Theory: Runtime is the total time taken by the system to run all the operations for the given record counts. The lower the runtime, the faster the output.
Conclusion: The runtime of MongoDB is much lower than that of HBase. Hence MongoDB is better.

VII. CONCLUSION

The overall conclusion of the experiment can be derived from the individual observations of the tests performed on the three workloads, as follows.

1. WORKLOAD A
Since Workload A is an equal mix of read and write operations, of the kind mostly used in stores to monitor recent actions, it is important that the storage system performs equally well in both types of operations. In our experiment, it can be clearly seen that MongoDB outperforms HBase in read as well as write operations. Hence MongoDB is suggested for such workloads.

2. WORKLOAD B
Since Workload B is a read-heavy mix, of the kind mostly used for the tags that social media sites attach to uploaded pictures, it is important for the storage system to perform well in read operations in particular. In the conducted experiment, MongoDB performed much better than HBase in read-heavy operations. Hence MongoDB is advised for businesses that work on read-heavy operations.

3. WORKLOAD C
Since Workload C is a read-only workload, usually used in profile generation, it is essential for the storage system to excel at read operations. In our experiment, MongoDB was superior to HBase in read operations and hence is suggested for such workloads.

FINAL COMMENTS: After considering the experiment results and the literature review on the security offered by both storage systems, MongoDB outperforms HBase in all the tested tasks and hence is the better data storage and management system of the two.

REFERENCES

[1] “Overview of HBase Architecture and its Components” [Online]. Available: https://www.dezyre.com/article/overview-of-hbase-architecture-and-its-components/295
[2] “Key features in HBase” [Online]. Available: http://www.hadooptpoint.org/key-features-in-hbase/
[3] “What are the Key Features of MongoDB?” [Online]. Available: https://www.tutorialsjar.com/key-features-of-mongodb/
[4] “MongoDB Architecture” [Online]. Available: https://www.mongodb.com/mongodb-architecture
[5] “How-to: Enable User Authentication and Authorization in Apache HBase” [Online]. Available: http://blog.cloudera.com/blog/2012/09/understanding-user-authentication-and-authorization-in-apache-hbase/
[6] Frank Pallas, Johannes Gunther, and David Bermbach, “Pick Your Choice in HBase: Security or Performance”, IEEE International Conference on Big Data, 2016
[7] “Apache HBase” [Online]. Available: https://blogs.apache.org/hbase/entry/hbase_cell_security
[8] “MongoDB Security Architecture”, A MongoDB White Paper, 2017