Learn how to avoid long database tables with too many columns by applying a simple data modeling technique. In this post, we'll use the Entity-Attribute-Value pattern to inject flexibility into our architecture.
Data modeling skills are a fundamental building block in software engineering, and they have a substantial influence on the quality of the code we produce. As a developer in the consulting space, I get to help clients with, and learn from, many different kinds of architectures. A few months ago, my team ran into an interesting problem that we were able to solve with a simple data modeling technique. It is formally known as the Entity-Attribute-Value model, and after several iterations of building features on top of it, we are incredibly pleased with how it has allowed our application to evolve. It can also help you avoid a common data modeling anti-pattern.
If you have worked on legacy systems, chances are high that you have encountered this anti-pattern. It rears its ugly head as a ridiculously long database table with dozens (even hundreds) of columns. It is also known as the "God Object" anti-pattern, a reference to the massive application object it produces: one with far too many properties, responsible for far too many things. It is usually the product of poor data modeling practices and years of developers carelessly dumping columns onto the same table, and the scenario that produces it is common enough that it can happen to the best of us.
In a world of evolving requirements and aggressive deadlines, a database schema can quickly slip into this pattern if we do not pay attention to the warning signs. It starts as a normal entity with a core set of well-defined attributes. As time passes, requirements dictate that we keep adding properties to the same entity. These are not what we would consider core attributes, but when they trickle in one at a time every month or so, it is tempting to drop each one in as a new column alongside the core ones. Add a few months or years of development, and you have a table guaranteed to make any DBA cringe. Let me show you how this plays out in a real application.
Let us assume we are building a simple game application. The main characters in the game are superheroes, and our client gives us a small set of core, well-defined attributes that every superhero record must store.
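To make the walkthrough concrete, here is a minimal sketch of that first table, using Python's built-in sqlite3 module. The core columns (name, alias, power_level) are stand-ins I am assuming for illustration; the client's actual list would differ.

```python
import sqlite3

# A minimal sketch of the initial schema. The core columns (name, alias,
# power_level) are illustrative assumptions, not the client's actual list.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE super_hero (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        alias       TEXT,
        power_level INTEGER
    )
""")
```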
We are assured by our client that this will not change, and that at most one or two attributes will be added in the future. Fast-forward 12 iterations, and we have added several more attributes, like sidekick and butler_name, to the super_hero table. Most of these apply only to a specific use case, so they sit at NULL for 99% of the superhero records in the table. In mathematical terms, we have created a sparse matrix. Going back to our linear algebra fundamentals, we know that storing a sparse matrix "as-is" is extremely memory-inefficient, because we end up storing a large number of cells that hold nothing but 0, NULL, or "N/A".
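Continuing the sketch, the table after those 12 iterations might look something like this. The sidekick and butler_name columns come straight from the scenario above; lair_location and catchphrase are hypothetical examples of the same drift, and the sample rows are made up to show the sparseness.

```python
import sqlite3

# The same table after a year of one-off requests. Besides sidekick and
# butler_name, the columns lair_location and catchphrase are hypothetical
# examples of the same drift.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE super_hero (
        id            INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        alias         TEXT,
        power_level   INTEGER,
        sidekick      TEXT,  -- applies to a handful of heroes
        butler_name   TEXT,  -- realistically, to exactly one
        lair_location TEXT,  -- another one-off request
        catchphrase   TEXT   -- and another
    )
""")

# Most rows leave the one-off columns at NULL: a sparse matrix stored densely.
conn.execute("INSERT INTO super_hero (name, alias, sidekick, butler_name) "
             "VALUES ('Batman', 'Bruce Wayne', 'Robin', 'Alfred')")
conn.execute("INSERT INTO super_hero (name, alias) VALUES ('Superman', 'Clark Kent')")
```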
The Entity-Attribute-Value model gives us an efficient way to store a growing number of attributes. However, before adopting it we must analyze our data and check whether our current and future attributes fit either of the following conditions:

- They are sparse: each attribute applies to only a small subset of records, leaving its column NULL for everyone else.
- They are dynamic: new attributes keep arriving as requirements evolve, so we cannot enumerate the full set up front.
If your entity's attributes do not fall into either of these categories, this model may not be the right fit. For our superhero example, we can already see the sparseness of the attributes, so we can apply the Entity-Attribute-Value model by splitting our data into entities, attribute types, and attribute values.
We need to analyze the existing attributes and determine which ones should remain core attributes, and which ones are (and will likely remain) less common across existing and future superheroes. The non-core attribute columns are turned into attribute type rows stored in a reference table (super_hero_attribute_type), and another table stores the sparse attribute values themselves (super_hero_attribute). Our schema now looks like this:
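Here is a minimal sketch of that schema, again using sqlite3. The three table names come from the text; the column names, types, and constraints are my assumptions.

```python
import sqlite3

# A sketch of the Entity-Attribute-Value schema. Table names come from the
# text above; column names, types, and constraints are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Entity: keeps only the core attributes.
    CREATE TABLE super_hero (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        alias       TEXT,
        power_level INTEGER
    );

    -- Attribute: one row per attribute type, instead of one column each.
    CREATE TABLE super_hero_attribute_type (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE  -- e.g. 'sidekick', 'butler_name'
    );

    -- Value: one row per (entity, attribute) pair that actually has a value.
    CREATE TABLE super_hero_attribute (
        id                INTEGER PRIMARY KEY,
        super_hero_id     INTEGER NOT NULL REFERENCES super_hero (id),
        attribute_type_id INTEGER NOT NULL REFERENCES super_hero_attribute_type (id),
        value             TEXT NOT NULL,
        UNIQUE (super_hero_id, attribute_type_id)
    );
""")
```

The UNIQUE constraint on (super_hero_id, attribute_type_id) is an optional design choice that guarantees a hero holds at most one value per attribute type.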
Despite having three tables instead of one, this schema is much more efficient and flexible:

- We only store attribute rows for values that actually exist, so the NULL-riddled sparse matrix disappears.
- Adding a brand-new attribute means inserting a row into super_hero_attribute_type, not running an ALTER TABLE migration.
- The super_hero table stays small, with its well-defined core attributes front and center.

The sketch below shows that flexibility in practice.
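Picking up where the schema sketch left off (this assumes the conn and tables created above), suppose a hypothetical butler_name requirement lands. We satisfy it with plain INSERTs instead of a migration.

```python
# Continues the schema sketch above (assumes conn and the three tables exist).

# A new requirement lands: store a hero's butler. Under the old model this
# meant an ALTER TABLE migration; here it is just data: insert the attribute
# type once, then insert values for the heroes that have one.
conn.execute("INSERT INTO super_hero (name, alias) VALUES ('Batman', 'Bruce Wayne')")
conn.execute("INSERT INTO super_hero_attribute_type (name) VALUES ('butler_name')")
conn.execute("""
    INSERT INTO super_hero_attribute (super_hero_id, attribute_type_id, value)
    SELECT h.id, t.id, 'Alfred'
    FROM super_hero h
    JOIN super_hero_attribute_type t ON t.name = 'butler_name'
    WHERE h.name = 'Batman'
""")

# Reading the sparse attributes back is a straightforward join.
rows = conn.execute("""
    SELECT t.name, a.value
    FROM super_hero_attribute a
    JOIN super_hero_attribute_type t ON t.id = a.attribute_type_id
    JOIN super_hero h ON h.id = a.super_hero_id
    WHERE h.name = 'Batman'
""").fetchall()
print(rows)  # [('butler_name', 'Alfred')]
```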
Asking a client to know what their data will look like in a year is unrealistic; priorities shift, and consumer behavior drives the direction of features. Knowing simple data modeling patterns like this one can help us introduce flexibility and keep technical debt at bay.