NoSQL Glossary

question

BigTable
answer

BigTable is Google’s proprietary NoSQL database, although the name can also refer to a NoSQL database architecture. BigTable databases have many tables, each of which has many rows. Unlike rows in a relational database, rows in a BigTable database may contain thousands of columns, compound columns, and multiple row versions, and columns do not need to be predefined. The basics of the BigTable architecture are explained in a white paper from Google.
question

Cassandra
answer

An open source database originally developed at Facebook. Cassandra combines architectural elements from BigTable and Dynamo to create a decentralized, massively scalable database.
question

Cluster
answer

A collection of sharded servers. The physical organization of a cluster varies from implementation to implementation.
question

Columnar database
answer

In a columnar database, data is stored by column rather than by row. This model is advantageous for working with aggregations of data and for systems that perform mass updates. Columnar databases excel at analytical processing.
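
As a rough illustration of why column-wise storage suits aggregation, the sketch below pivots a few row records into column lists and sums one column. The sample data and the helper name `to_columns` are made up for this example; real columnar engines use far more elaborate on-disk layouts.

```python
# Hypothetical sample data: three row-oriented records.
rows = [
    {"city": "Oslo", "sales": 10},
    {"city": "Lima", "sales": 25},
    {"city": "Oslo", "sales": 5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# The aggregation reads only the "sales" column and never touches "city".
total_sales = sum(columns["sales"])
```

In the row layout, the same sum would have to walk every record and pick out one field from each.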
question

Compaction
answer

Compaction is the process of removing unused data and merging older data files. At the end of a compaction process, the remaining set of data files contains only the current version of the data.
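
A minimal sketch of the idea, assuming each data file is simply a list of `(key, version, value)` tuples and that a higher version number means newer data. Both assumptions, and the function name `compact`, are illustrative and not taken from any particular database.

```python
def compact(data_files):
    """Merge data files, keeping only the newest version of each key."""
    newest = {}
    for data_file in data_files:
        for key, version, value in data_file:
            # Keep this entry only if it is newer than what we have seen.
            if key not in newest or version > newest[key][0]:
                newest[key] = (version, value)
    return [(key, version, value)
            for key, (version, value) in sorted(newest.items())]


old_file = [("a", 1, "apple"), ("b", 1, "bear")]
new_file = [("a", 2, "avocado"), ("c", 1, "cat")]

# The stale version 1 of key "a" is dropped from the merged file.
merged = compact([old_file, new_file])
```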
question

Document database
answer

Document databases focus on how data is accessed rather than how data is stored. Data access is optimized for discrete documents (typically an entire object graph rather than a single atomic row of data). Document-oriented databases may be physically structured as a columnar, BigTable, or key-value store; the implementation is not as important as the way data is accessed.
question

Dynamo
answer

Dynamo is a massively scalable key-value storage system that was developed at Amazon. Dynamo provides an always-on system that uses sophisticated versioning and conflict resolution techniques to be “self-healing” in the event of network failures. Amazon published the details in the white paper Dynamo: Amazon’s Highly Available Key-value Store.
question

Elasticity
answer

Elastic databases make it trivial to add nodes to a cluster as needed with no downtime. Read and write operations scale linearly as more machines are added.
question

Hadoop
answer

Hadoop is a framework for working with data-intensive distributed applications. Hadoop was based on Google’s MapReduce paper. In addition to MapReduce functionality, Hadoop also provides location awareness and a set of common tools.
question

HBase
answer

A BigTable-style columnar database built on Hadoop. HBase has a large number of features that make it well suited for the enterprise (MapReduce support, elastic storage, massive distribution, and data compression).
question

HDFS
answer

Hadoop Distributed File System: a distributed, location-aware, replicated file system. HDFS has data balancing features; if any single node contains a disproportionate amount of data, the data can be easily redistributed to other nodes. Despite the name, HDFS cannot be directly mounted by an operating system without additional third-party libraries.
question

Key/Value Store
answer

Data is stored as an arbitrary value that is looked up via an arbitrary key. Frequent uses of key/value stores are shopping carts, session state, and other caching mechanisms.
question

MapReduce
answer

An algorithm initially developed at Google for performing parallel data processing. Other MapReduce implementations and frameworks have been developed for different databases. MapReduce workloads can be spread over thousands of nodes and multiple Map and Reduce phases. A Map operation is like a SQL SELECT statement – it produces zero or more results from one or more inputs. A Reduce operation is like a SQL GROUP BY combined with aggregate functions – it combines the results of multiple map operations.
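
The Map and Reduce roles described above can be sketched as a single-process word count. This is only a toy model under stated assumptions: the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) are invented for this example, and a real framework would distribute each phase across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit zero or more (key, value) pairs for one input."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group mapped values by key, much like a GROUP BY."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key with an aggregate (sum)."""
    return key, sum(values)


def map_reduce(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = map_reduce(["big data big tables", "big clusters"])
```

Because each call to `map_phase` depends only on its own document, the Map work could be handed to separate machines without coordination; only the grouping step needs to bring matching keys together.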
question

MongoDB
answer

MongoDB is a scalable, high-performance, document-oriented database. MongoDB has a variety of features designed to bridge the gap between key/value stores and traditional RDBMSes; some of these features are ad hoc querying, secondary indexes, replication, and aggregation.
question

Network partitioning
answer

Network partitioning happens when multiple parts of a cluster become separated from one another due to some type of failure.
question

Node
answer

A single computational unit in a cluster. Typically this would be a server or computer. If there are multiple nodes on a single computer, they may be referred to as virtual nodes.
question

NoSQL
answer

A generic term for any one of a variety of non-relational databases. Originally it didn’t mean much of anything, but there have been attempts to co-opt the term into an acronym standing for “Not Only SQL”.
question

Replication
answer

In Dynamo-based systems, data is written to multiple nodes, also called replicas. N copies of data will be stored in the system. Any time data is written, W replicas need to respond before the write is considered to be a success. Likewise, R replicas need to respond for a read to be considered a success. Replication settings can be tuned for different levels of performance; however, the general rule is that R + W > N (the number of replicas that must answer a read plus the number that must answer a write should be greater than the total number of replicas), so that every read overlaps the most recent successful write on at least one replica.
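
The R + W > N rule can be checked by brute force for small clusters. The sketch below is only an illustration: the function names are hypothetical, and it models replicas as plain integers rather than real nodes. It enumerates every possible set of R readers and W writers and confirms whether they always share at least one replica.

```python
from itertools import combinations


def satisfies_quorum_rule(n, r, w):
    """Return True when R + W > N for a valid configuration."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must each be between 1 and n")
    return r + w > n


def read_always_overlaps_write(n, r, w):
    """Check every read set against every write set for a shared replica."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


# N=3, R=2, W=2: any two replicas read must include one that was written.
strict = read_always_overlaps_write(3, 2, 2)

# N=3, R=1, W=1: a read can land on a replica the write never reached.
loose = read_always_overlaps_write(3, 1, 1)
```

For every valid small configuration, the brute-force answer agrees with the simple arithmetic test, which is why the inequality is used as the rule of thumb.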
question

Riak
answer

Riak is a key/value store based on Amazon’s Dynamo. Riak provides linear scaling, background replication, flexibility, and fault-tolerance.
question

Shard
answer

Not to be confused with chard, a shard is a segment of data. Sharding is a method of spreading data across multiple servers in a cluster to balance storage and CPU load. Shards are typically identified by a sharding key, although the mechanism varies from product to product.
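
One common way to map a sharding key to a shard is to hash the key and take the result modulo the number of shards. The sketch below assumes that scheme purely for illustration; the function name `shard_for` and the sample keys are hypothetical, and actual products may instead use range-based or consistent-hashing approaches.

```python
import hashlib


def shard_for(sharding_key, shard_count):
    """Map a string key to a shard index in the range [0, shard_count)."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # A stable hash is used so the mapping is the same on every machine
    # and across restarts (Python's built-in hash() is randomized for str).
    digest = hashlib.md5(sharding_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical keys: each one is assigned to one of four shards.
placement = {key: shard_for(key, 4) for key in ["user:1", "user:2", "user:3"]}
```

A drawback of plain modulo placement is that changing the shard count remaps most keys, which is one reason some systems prefer other schemes.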
question

Tombstone
answer

A special value written in a database to indicate that a record should be deleted. Data marked with a tombstone will still be present in the database until a compaction occurs.
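
The interplay between tombstones and compaction can be sketched with a toy in-memory store. The class `TinyStore` and its methods are invented for this example and do not mirror any real database API.

```python
# A unique marker object standing in for the tombstone value.
TOMBSTONE = object()


class TinyStore:
    """Toy key/value store where deletes write a tombstone."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        # Nothing is removed yet; the record is only marked as deleted.
        self._data[key] = TOMBSTONE

    def get(self, key):
        value = self._data.get(key, TOMBSTONE)
        return None if value is TOMBSTONE else value

    def compact(self):
        # Compaction is the step that actually discards tombstoned records.
        self._data = {k: v for k, v in self._data.items()
                      if v is not TOMBSTONE}

    def stored_keys(self):
        return sorted(self._data)


store = TinyStore()
store.put("user:1", "Ada")
store.put("user:2", "Grace")
store.delete("user:1")

keys_before = store.stored_keys()  # the tombstoned key still takes up space
store.compact()
keys_after = store.stored_keys()   # only live data remains
```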
