Data Modeling with Postgres

This blog post explains what relational databases are, what data modeling means, and finally how to model user activity data for a (fictional) music streaming app by building such a database and an ETL pipeline in Postgres.

Imagine the following scenario:

A startup called Sparkify wants to analyze the data they’ve been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don’t have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app. They’d like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis. The data engineer’s role is to create a database schema and an ETL pipeline (Extract, Transform, Load) for this analysis. She’s expected to model and test the database as well as the ETL pipeline by running queries given to her by Sparkify’s analytics team and comparing her output with their expected results.

This was the context given for a project I completed as part of Udacity’s Data Engineering Nanodegree program.

First, we need to lay the foundation for a solid understanding of relational databases:

Fundamentals

A database is a set of related data and the way it’s organized. Access to a database is usually provided by a Database Management System (DBMS).

Data modeling is the process of creating a data model for the data to be stored in a database, and it helps to represent that data visually. A data model is a conceptual representation of the data objects, the relationships between them, and rules such as naming conventions, default values, and semantics.

The relational database model was invented by researchers at IBM as a way to make database management systems more independent of any particular application. A relational database organizes data into one or more tables – or relations – of columns and rows, with a unique key identifying each row. Generally, each table represents one entity type, such as customers or products. SQL (Structured Query Language) is the language used across almost all relational database systems for querying and maintaining the database. Some common examples of relational database systems are Oracle, MySQL, PostgreSQL (or Postgres for short), and SQLite.
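
To make this concrete, here is a minimal sketch of one such table and a query against it, written in Python with the psycopg2 driver (the customers table, its columns, and the connection details are made up for illustration):

import psycopg2

# Connect to a local Postgres instance (connection details are placeholders).
conn = psycopg2.connect("host=127.0.0.1 dbname=demo user=postgres password=postgres")
cur = conn.cursor()

# One relation ("table") of columns and rows, with a unique key per row.
cur.execute("""
    CREATE TABLE IF NOT EXISTS customers (
        customer_id SERIAL PRIMARY KEY,  -- the unique key identifying each row
        name        TEXT NOT NULL,
        country     TEXT
    );
""")
cur.execute("INSERT INTO customers (name, country) VALUES (%s, %s);", ("Alice", "DE"))
conn.commit()

cur.execute("SELECT customer_id, name, country FROM customers;")
print(cur.fetchall())
conn.close()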

When to use a Relational Database

Relational databases come in handy when

  • we have a small dataset
  • we need the ability to do JOINS across tables
  • we need database transactions to guarantee validity even in the event of errors or power failures (so-called ACID transactions; a small sketch follows at the end of this section)

and if

  • we want to do aggregations and analytics
  • we have changing business requirements
  • our queries are not known in advance, so we need the flexibility to ask ad hoc questions of the data

Relational databases make it easy to add or alter tables as requirements evolve. They are intuitively organized and simple, as the data is systematically stored and modeled in tabular format. And they are essential when data integrity is our top priority.
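
To illustrate the ACID point from the list above, here is a minimal sketch of a transaction with psycopg2 (the connection details and the accounts table are placeholders, and the table is assumed to exist already): either both inserts become visible together, or, if anything goes wrong, neither does.

import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=demo user=postgres password=postgres")
cur = conn.cursor()

try:
    # psycopg2 opens a transaction implicitly; both inserts belong to it.
    cur.execute("INSERT INTO accounts (owner, balance) VALUES (%s, %s);", ("alice", 100))
    cur.execute("INSERT INTO accounts (owner, balance) VALUES (%s, %s);", ("bob", 50))
    conn.commit()    # make both rows permanent together
except Exception:
    conn.rollback()  # on any error, the database is left exactly as it was
    raise
finally:
    conn.close()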

Structuring the Database

The two biggest data modeling concepts in relational databases are:

  • normalization, and
  • denormalization.

Database normalization is a process used to organize a database into tables and columns. The idea is that a table should be about one specific topic. By limiting a table to one purpose, you reduce the amount of duplicate data contained in your database. To achieve this, we apply a set of established rules; as we do, new tables are formed, and the design progresses from unruly to optimized through several normal forms (first, second, and third normal form being the most common). There are three main reasons to normalize a database: to minimize duplicate data, to minimize or avoid data modification issues, and to simplify queries.
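
As a small sketch of what this means in practice, compare a flat play log that repeats user details in every row with a normalized design that stores them exactly once. All table and column names here are invented for illustration, and the statements are written as Python string constants in the same style the project’s sql_queries.py file (introduced further down) uses.

# Unnormalized: user details are copied into every play, so the same fact
# is stored many times and must be updated in many places.
flat_table = """
    CREATE TABLE plays_flat (
        play_id    SERIAL PRIMARY KEY,
        played_at  TIMESTAMP,
        user_name  TEXT,   -- repeated for every play by the same user
        user_level TEXT    -- repeated as well
    );
"""

# Normalized: each table is about one topic, and plays reference users by key.
users_table = """
    CREATE TABLE users (
        user_id SERIAL PRIMARY KEY,
        name    TEXT,
        level   TEXT
    );
"""

plays_table = """
    CREATE TABLE plays (
        play_id   SERIAL PRIMARY KEY,
        played_at TIMESTAMP,
        user_id   INT REFERENCES users (user_id)
    );
"""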

However, retrieving data from a normalized database can be slow, as queries often need to join many different tables where the different pieces of data are stored. By contrast, when all the pieces of data a query needs are stored in a single place, reading can proceed much more quickly. The majority of modern applications need to retrieve data in the shortest time possible, and that’s when you can consider denormalizing a relational database. As the name suggests, denormalization is the opposite of normalization: when you normalize a database, you organize data to ensure integrity and eliminate redundancies; when you denormalize it, you deliberately put the same data in several places, thus increasing redundancy. “Why denormalize a database at all?” you may ask. The main purpose is to significantly speed up data retrieval, at the cost of extra storage and more effort to keep the duplicated copies consistent when the data changes.
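
Continuing the illustrative sketch from above: reading from the normalized tables requires a JOIN, while a deliberately denormalized copy answers the same question from a single table, at the price of storing the user’s name redundantly.

# Normalized read: compact to store, but needs a join at query time.
normalized_read = """
    SELECT u.name, p.played_at
    FROM plays p
    JOIN users u ON u.user_id = p.user_id;
"""

# Denormalized copy: the name is duplicated into every play on purpose ...
denormalized_table = """
    CREATE TABLE plays_denorm (
        play_id   SERIAL PRIMARY KEY,
        played_at TIMESTAMP,
        user_name TEXT   -- copied from users to avoid the join on reads
    );
"""

# ... so the same question becomes a single-table scan.
denormalized_read = "SELECT user_name, played_at FROM plays_denorm;"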

Database Schema for Sparkify

In our scenario, we have one dataset of songs, where each file is in JSON format and contains metadata about a song and that song’s artist. The second dataset consists of JSON log files that simulate activity logs from a music streaming app (generated based on specified configurations).
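
Here is a minimal sketch of loading these raw files with pandas, assuming the datasets live under data/song_data and data/log_data and that each file contains one JSON record per line (adjust both assumptions to your copy of the data):

import glob
import os

import pandas as pd

# Collect all JSON files under the two (assumed) dataset directories.
song_files = glob.glob(os.path.join("data/song_data", "**", "*.json"), recursive=True)
log_files = glob.glob(os.path.join("data/log_data", "**", "*.json"), recursive=True)

# lines=True treats every line as its own JSON record, which also works
# for song files that contain just a single record.
song_df = pd.read_json(song_files[0], lines=True)
log_df = pd.read_json(log_files[0], lines=True)

print(song_df.columns.tolist())
print(log_df.columns.tolist())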

Using the song and log datasets, the database schema looks like this:

You can see each table’s name, all attributes belonging to that table, and the corresponding datatypes.
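
The diagram itself doesn’t translate well into text, but to give one concrete example: the users table could be declared roughly as sketched below. Only user_id, gender, and level are certain here (they appear in the query at the end of this post); everything else in the sketch is an assumption.

# A rough, partly assumed reconstruction of one table from the diagram.
user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id INT PRIMARY KEY,
        gender  TEXT,
        level   TEXT   -- 'free' or 'paid'
        -- ... plus the remaining attributes shown in the diagram
    );
"""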

ETL Pipeline

Extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from, or in a different context than, the sources.

What does the ETL pipeline look like in our case?

  1. Running the create_tables.py script in a terminal creates and initializes the tables for the sparkifydb database. A second file, sql_queries.py, contains all SQL queries and is imported into create_tables.py.
  2. Running test.ipynb in a Jupyter notebook confirms that the tables were successfully created with the correct columns.
  3. Running the notebook etl.ipynb develops the ETL process for each table and is used to prepare the Python script that will process the full datasets.
  4. In a terminal, the etl.py script is run (python etl.py) to process all the datasets (a simplified sketch of this step follows below the list). The script connects to the Sparkify database, extracts and processes the log_data and song_data, and loads the data into the five tables. The file sql_queries.py contains all SQL queries and is imported into etl.py.
  5. Again, running test.ipynb confirms that the records were successfully inserted into each table.
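
To give a feel for what step 4 boils down to, here is a heavily simplified sketch of the processing loop. The connection credentials, file paths, log field names, and conflict handling are all assumptions, and the real etl.py of course fills all five tables, not just users.

import glob
import os

import pandas as pd
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# One way to handle users appearing in many log lines: upsert on the primary key.
user_insert = """
    INSERT INTO users (user_id, gender, level)
    VALUES (%s, %s, %s)
    ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
"""

for path in glob.glob(os.path.join("data/log_data", "**", "*.json"), recursive=True):
    df = pd.read_json(path, lines=True)           # one event per line (assumed)
    for _, row in df.iterrows():
        if not row.get("userId"):                 # skip events without a user id
            continue
        cur.execute(user_insert, (int(row["userId"]), row["gender"], row["level"]))
    conn.commit()                                 # commit once per processed file

conn.close()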

After this is all done, our Postgres database is finally modeled and ready to be queried! One example of a question we could now answer is: what is the number of paid and free plans by gender? The corresponding SQL statement looks like this:

SELECT COUNT(users.user_id), users.level, users.gender
FROM users
GROUP BY users.level, users.gender;

And the output we retrieve is this:

And that’s pretty much it! Feel free to check my full notebook if you’re interested.

. . . . . . . . . . . . . .

I hope you enjoyed reading and got a glimpse at relational databases. If you think I’ve missed something or have made an error somewhere, please let me know: bb@brittabettendorf.com