This blogpost will explain what relational databases are, what data modeling means, and finally, how to model user activity to create such a database together with an ETL pipeline in Postgres for a (fictional) music streaming app.
Imagine the following scenario:
A startup called Sparkify wants to analyze the data they’ve been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don’t have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app. They’d like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis. Her role is to create a database schema and ETL pipeline (Extract, Transform, Load) for this analysis. She’s expected to model and test her database as well as the ETL pipeline by running queries given to her by the analytics team from Sparkify to compare her output with their expected results.
This was the context given for a project I completed with the Nanodegree program Data Engineering at Udacity.
First, we need to lay the foundation for a solid understanding of relational databases:
A database is a set of related data and the way it’s organized. Access to a database is usually provided by a Database Management System (DBMS).
Data modeling is the process of creating a data model for the data to be stored in a database and helps in the visual representation of data. The data model is a conceptual representation of data objects, the relationship between different data objects, and the rules for naming conventions, default values, semantics.
The relational database model was invented by researchers at IBM as a way to make database management systems more independent of any particular application. A relational database organizes data into one or more tables – or relations – of columns and rows, with a unique key identifying each row. Generally, each table represents one entity type, such as customers or products. SQL (Standard Query Language) is the language used across almost all relational database systems for querying and maintaining the database. Some common examples of relational database systems are Oracle, MySQL, PostgreSQL (or: Postgres), and SQLite.
When to use a Relational Database
Relational databases come in handy when
- we have a small dataset
- we need the ability to do JOINS across tables
- we need database transactions to guarantee validity even in the event of errors or power failures (so-called ACID transactions)
- we want to do aggregations and analytics
- we have changing business requirements
- our queries are not available and therefore we need flexibility
Relational databases are extremely flexible when we need to add or alter tables. They are intuitively organized and simple, as the data is systematically stored and modeled in tabular format. And they are essential when data integrity is our top priority.
Structuring the Database
The two biggest data modeling concepts in relational databases are:
- normalization, and
Database normalization is a process used to organize a database into tables and columns. The idea is that a table should be about a specific topic. By limiting a table to one purpose you reduce the number of duplicate data contained within your database. To achieve these objectives, we’ll use some established rules. As you apply these rules, new tables are formed. The progression from unruly to optimized passes through several normal forms. There are three main reasons to normalize a database. The first is to minimize duplicate data, the second is to minimize or avoid data modification issues, and the third is to simplify queries.
However, retrieving data from a normalized database can be slow, as queries need to address many different tables where different pieces of data are stored. On the contrary, when all pieces of data are stored in a single place, updating can proceed much more quickly. The majority of modern applications need to be able to retrieve data in the shortest time possible. And that’s when you can consider denormalizing a relational database. As the name suggests, denormalization is the opposite of normalization. When you normalize a database, you organize data to ensure integrity and eliminate redundancies. Database denormalization means you deliberately put the same data in several places, thus increasing redundancy. “Why denormalize a database at all?” you may ask. The main purpose of denormalization is to significantly speed up data retrieval.
Database Schema for Sparkify
In our scenario, we have one dataset of songs where each file is in JSON format and contains metadata about a song and the artist of that song. The second dataset consists of log files in JSON format and simulates activity logs from a music streaming app (based on specified configurations).
Using the song and log datasets, the database schema looks like this:
You can see each table’s name, all attributes belonging to that table, and the corresponding datatypes.
Extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from, or in a different context than, the sources.
What does the ETL pipeline look like in our case?
- Running the
create_tables.pyscript in a terminal creates and initializes the tables for the sparkifydb database. Another file
sql_queries.pycontains all SQL queries and is imported into
test.ipynbin a Jupyter notebook confirms that the tables were successfully created with the correct columns.
- Running the notebook
etl.ipynbdevelops the ETL processes for each table and is used to prepare a python script for processing all the datasets.
- In a terminal, the
python etl.pyscript is run to process the all the datasets. The script connects to the Sparkify database, extracts and processes the
song_data, and loads data into the five tables. The file
sql_queries.pycontains all SQL queries and is imported into
- Again, running
test.ipynbconfirms that the records were successfully inserted into each table.
After this is all done, our Postgres database is finally modeled and ready to be queried! One example of a question we now could answer is: What is the number of paid and free plans by gender? The corresponding SQL statement would look like this:
SELECT count(users.user_id), users.level, users.gender FROM users GROUP BY users.level, users.gender
And the output we retrieve is this:
And that’s pretty much it! Feel free to check my full notebook if you’re interested.
. . . . . . . . . . . . . .
I hope you enjoyed reading and got a glimpse at relational databases. If you think I’ve missed something or have made an error somewhere, please let me know: firstname.lastname@example.org