Data Modeling with Apache Cassandra

This blogpost will explain what a non-relational (or NoSQL) database is, when to use one, and lastly how to model data to create a non-relational database in Apache Cassandra and complete an ETL pipeline for a (fictional) music streaming app.

Imagine the following scenario:

A startup called Sparkify wants to analyze the data they’ve been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, there is no easy way to query the data to generate the results, since the data reside in a directory of CSV files on user activity on the app. They’d like a data engineer to create an Apache Cassandra database which can create queries on song play data to answer the questions. Her role is to create a database for this analysis. She’s expected to model and test her database as well as the ETL pipeline by running queries given to her by the analytics team from Sparkify to compare her output with their expected results.

This was the context given for my project I completed with the Nanodegree program Data Engineering at Udacity – similar to that given for the Postgres project.

Again, let’s lay the foundation for a solid understanding of non-relational databases:

Fundamentals

NoSQL means Not only SQL. NoSQL and non-relational are interchangeable terms, so you probably will hear one or the other. While relational databases are great, they do come with trade-offs: When you’re dealing with a large amount of data, you cannot add more machines in a relational database, but can only scale vertically by adding more storage to the machine itself. That’s why relational dababases are called not distributed databases. Also, relational databases are not designed to handle unstructured data and won’t let you store different data type formats. If you need fast reads and writes, a relational databases will not suffice. Lastly, relational databases have a single point of failure: When that database shuts down, a failover to a backup system must happen, which takes time.

That’s when NoSQL databases come into play. They were developed to be able to handle the limitations of relational databases.

When to use a Non-Relational Database

NoSQL databases are especially useful

  • for working with large sets of distributed data
  • when you need horizontal scalability, i.e. the ability to add more machines or nodes to increase performance and space
  • when you need high throughput, i.e. very fast reads and writes

and

  • when you need a flexible schema, for example a column that does not have to be used by every row
  • when you need high availability and cannot afford a shut down
  • when you need to be able to store different data type formats, such as structured data, semi-structured data, or un-structured data (JSON, XML)

NoSQL Database Types

There are many different ways to implement NoSQL technology. Generally, NoSQL databases include the following families:

  • Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name, or key, together with its value. Examples of key-value stores are Riak and Voldemort.

  • Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents. An example of one such document database is mongoDB.

  • Wide-column stores such as Apache Cassandra and Apache HBase are optimized for queries over large datasets. They store columns of data together rather than rows.

  • Graph stores are used to store information about networks, such as social connections. Graph stores include Neo4J and HyperGraphDB.

Apache Cassandra

Aside from being a backbone for Facebook, Uber, and Netflix, Cassandra is a very scalable and resilient database that is easy to master and simple to configure. Apache Cassandra uses its own query language – CQL – which is similar to SQL. Note that JOINS, GROUP BY, or subqueries are not supported by CQL.

Some terms used in Cassandra differ from those we already know:

A keyspace, for example, is analogous to the term database in a relational database.

Another example is a partition, which is a collection of rows. Cassandra organizes data into partitions; there, each partition consists of multiple columns.

Partitions are stored on a node. Nodes (or servers) are generally part of a cluster where each node is responsible for a fraction of the partitions.

The Primary Key defines how each row can be uniquely identified and how the data is distributed across the nodes in our system. A partition key is responsible for identifying the partition or node in the cluster that stores a row – whereas the purpose of a clustering key (or clustering column) is to store row data within a partition in a sorted order.

When we have only one partition key and no clustering column, it is called a Single Primary Key. Should we use one (or more) partition key(s) and one (or more) clustering column(s) instead, we call it a Compound Primary Key or Composite Primary Key.

Denormalizing the Database

The goal of denormalization in this context is to reduce the amount of time needed to read data. Unlike relational databases, non-relational databases have been optimized for fast reads and writes; therefore, denormalization is a must! Always think about the necessary queries first and design your denormalized schema accordingly. One table per query is a good strategy.

Denormalization changes your application: First, it means data redundancy, which translates to significantly increased storage costs. Second, fixing data inconsistency is now the main job of the application.

Again, Data Modeling in Apache Cassandra is query focused – and that focus needs to be put on the WHERE clause. WHERE allows us to do fast reads. Note that the partition key always needs to be included in the query! The clustering columns can be used to put the results in order.

ETL Pipeline

Extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from, or in a different context than, the sources.

What does the ETL pipeline look like in our case?

  • Running ETL-Pipeline.ipynb first preprocesses the csv file, and then includes Apache Cassandra CREATE and INSERT statements in order to load processed records into relevant tables. The tables are tested by running SELECT statements.
  • The file sql_queries.py contains all SQL queries and is imported into ETL-Pipeline.ipynb.
  • The file functions.py contains the preprocessing function that creates new csv files to be used for Apache Cassandra tables and is imported into ETL-Pipeline.ipynb.

After this is all done, our Apache Cassandra database is finally modeled and ready to be queried!

And that’s pretty much it! Feel free to check out my full notebook if you’re interested.

. . . . . . . . . . . . . .

I hope you enjoyed reading and getting a glimpse at non-relational databases. If you think I’ve missed something or have made an error somewhere, please let me know: bb@brittabettendorf.com