NoSQL

Sections:
1. Background
2. Memcached (not NoSQL)
3. Redis
4. Cassandra
5. MongoDB

1. Background

We have emphasized relational database systems, but there is another major trend called NoSQL databases, which discard the relational concept for the following characteristics:

NoSQL often involve discussions of the CAP Theorem. This is not actually a theorem in the sense of it being something to be proven, but the claim is that no distributed system can provide all three of Consistency, Availability, and Partition tolerance (the C, A, and P properties that give rise to the name CAP).

A traditional relational database focuses on Consistency and Availability. Most major NoSQL systems focus on Availability and Partition Tolerance, though some NoSQL systems do C and A, while others do A and P.

We'll briefly explore three of the most popular NoSQL systems: Redis, Cassandra, and MongoDB. But first we'll start with a different system that is also quite important: Memcached.

2. Memcached

Memcached isn't widely considered a NoSQL system at all, but it's so widely used that it's worth noting. Memcached is a program that basically provides a network-accessible hash table mapping strings to strings. The key feature is that this hash table is stored entirely in memory, so it is extremely fast. Here's an example of how it's used, using a Python library.

import memcache
mc = memcache.Client(['127.0.0.1:11211')

mc.set('credits:Burch''32')
print(mc.get('credits:Burch'))    # displays 32
mc.incr('credits:Burch'delta=4)
print(mc.get('credits:Burch'))    # displays 36

Notice how the memcache.Client constructor takes a list of hosts? That is because a major feature of Memcached is its ability to distribute the hash table's keys across multiple hosts. Distributing in this way provides greater performance — as well as allowing more memory, so that it takes more time when the memory becomes full.

(By the way, when memory does become full, Memcached simply throws out the least recently used keys.)

Since nothing in Memcached in stored on disk, there is no persistence at all. However, it is extremely useful for caching results of queries. For example, suppose we want to find the name associated with a user ID:

name = mc.get('user:93123813:name')
if not name:
    # execute SQL query to retrieve name
    mc.set('user:93123813:name'name)

Caching results of queries in this way can dramatically reduce how busy a database server is, thus improving overall performance. See this page for a short, fun-to-read story of how this works.

3. Redis

Redis is similar to Memcached except:

  1. It provides some persistence features.
  2. It supports a wider range of structures than a simple hash table.
  3. It provides a slightly wider range of options on which keys should be deleted than simply least-recently-used for when memory becomes full.

Redis has bindings in a number of languages. My examples will be using the redis Python library.

import redis

# Redis would run as a process listening on a network port
db = redis.Redis(host='localhost'port=6379)

# The basic Redis structure is basically a hash table from strings to
# strings; the Python library allows us to treat it as a dictionary.
db['credits:Burch'] = '32'        # associates 32 with the name credits:Burch
print(db['credits:Burch'])        # displays 32
db.incr('credits:Burch'4)
print(db['credits:Burch'])        # displays 36

# Redis has a separate structure allowing names to be associated with
# string-string pairs.
db.hmset('student:Burch', {'id':'12345''credits':'32'})
db.hset('student:Burch''first''Carl')
db.hget('student:Burch''credits')  # displays 32

# And it has a structure allowing names to be associated with
# linked lists
db.rpush('students''12345')
db.rpush('students''54321')    # now students is [12345, 54321]
print(db.lpop('students'))       # displays 12345

# And it has a structure allowing names to be associated with sets,
# and another allowing names to be associated with sorted sets.

4. Cassandra

On the other end of the NoSQL spectrum is Cassandra. I describe it as “the other end”, because in fact Cassandra is closely related to relational databases with SQL: We store our data in “tables”, which have a set of rows, each having several columns. (In this respect, it is very similar to another important NoSQL system, Google's BigTable.)

The difference is that rows can be sparse: There might well be some “columns” that in fact are only applicable to a small subset of the rows. This can make a huge difference: For example, a business Web site might well violate all the relational normal forms by having each product have a column for each user's review: review_alice for Alice's review, review_bob for Bob's review, and so on.

Cassandra provides an SQL-oriented query language called CQL. Major differences are that CQL does not support transactions, joins, or subqueries. Also, the key value for each row cannot be modified once it is created.

import cassandra

cluster = cassandra.cluster.Cluster(['10.1.1.3''10.1.1.4'])
session = cluster.connect()

rows = session.execute('SELECT name, passwd FROM users WHERE login = %s',
    [loginid])
for r in rows:
    print(r[0], r[1])

session.execute('INSERT INTO users (login, name, passwd) ' +
    'VALUES (%s %s %s)', ['obama''Barack''20500'])

Cassandra emphasizes its scalabality: It can distribute its data across a large number of systems, with some replication in case some isolated nodes fail.

5. MongoDB

MongoDB is currently the most popular NoSQL system. With MongoDB, information is structured into collections of documents. A document is basically a JSON object, so it is essentially a map of key-value pairs. Esesntially, you can think of a collection as being analogous to a table and a document as being analogous to a row from a table.

// Accessing "students" as a field in "db" creates a collection
// named "students." The "insert" method adds a document into
// the collection.
db.students.insert({ name'Burch'id31414entry2004 });

// The "find" method can take two parameters. The first
// indicates the criterion by which to select documents out of
// the collecion, and the second gives the attributes to
// retrieve out of each document. (In the second parameter, the
// value associated with each key is meaningless, and we
// customarily use 1.) The below retrieves the ID numbers for students
// whose name is Burch.
db.students.find({ name'Burch' }, { id1 });

// MongoDB has a system allowing for more complex queries, such
// as the following that retrieves all documents for which the
// value associated with entry is less than 2007. The documents
// are given unfiltered since the second parameter is missing.
db.students.find({ entry: { $lt2007 } });

// MongoDB suppports indexing by attributes. The following
// creates an index on the "name" key, so that queries based on
// "name" can be executed more efficiently.
db.students.ensureIndex({ name1 });