BackendDatabaseSQLPostgreSQLBackendTutorial

Relational Database Schema Design: Principles Every Developer Should Know

Poor database schema design is the leading cause of performance problems and data integrity issues in web applications. Learn normalization, relationships, and indexing from first principles.

Abdur Razzak

Abdur Razzak

Full-Stack Web Developer

May 26, 2026 11 min read

Database schema design is one of the most consequential architectural decisions in any application. A well-designed schema makes queries efficient, data consistent, and the codebase easy to maintain. A poorly designed schema creates cascading performance problems as data volumes grow, introduces data integrity issues that corrupt application state, and forces increasingly complex workarounds as the application evolves. Most performance issues in production web applications trace back to schema design decisions made early in development that seemed harmless at small scale but proved catastrophically slow as the database grew to millions of rows. Understanding relational schema design principles enables you to make informed trade-offs between normalization, denormalization, query complexity, and write performance from the beginning of a project rather than discovering the consequences of poor decisions after launch.

Normalization: Eliminating Redundancy

Normalization is the process of organizing a relational schema to reduce data redundancy and improve data integrity. The first normal form requires that every column contains atomic values and there are no repeating groups. The second normal form additionally requires that every non-key column is fully dependent on the entire primary key, eliminating partial dependencies in tables with composite primary keys. The third normal form eliminates transitive dependencies, where a non-key column depends on another non-key column rather than directly on the primary key. A fully normalized schema stores each fact in exactly one place, meaning that when data changes, only one row needs to be updated to keep the entire database consistent. Denormalization, the deliberate violation of normalization rules for performance reasons, is sometimes appropriate but should be a conscious decision with clear justification rather than the accidental result of careless design.

Choosing Primary Keys: Natural vs Surrogate

A primary key uniquely identifies each row in a table. A natural key uses existing data attributes that have inherent uniqueness, such as an email address, a product SKU, or a national identification number. A surrogate key is an artificially generated identifier with no business meaning, such as an auto-incrementing integer or a UUID. Natural keys seem appealing because they carry meaning and eliminate the need for joins when the natural key is also the lookup value. However, natural keys carry significant risks: business data changes, email addresses are updated, product codes are reassigned, and when a natural key changes, every foreign key reference to it must be updated simultaneously. UUID primary keys are globally unique across systems, making them ideal for distributed databases and systems that merge data from multiple sources. Sequential integer keys produce more compact indexes and faster range scans but reveal information about record counts and creation order.

Relationships: One-to-Many and Many-to-Many

The one-to-many relationship is the most common relationship in relational databases. A single author can have many posts, a single order can have many line items, and a single user can have many addresses. Implement this with a foreign key column in the child table pointing to the primary key of the parent table. The many-to-many relationship requires a junction table, also called a join table or associative entity, that holds pairs of foreign keys from both related tables. A user can have many roles and a role can belong to many users: the user_roles table holds user_id and role_id columns, each a foreign key, and the combination of both forms the primary key of the junction table. Junction tables can carry additional attributes of the relationship itself, such as the date a user was assigned a role or who assigned it, making them first-class entities rather than just connectors.

Indexing Strategy: Which Columns to Index

Indexes dramatically accelerate read queries by allowing the database to find rows matching a condition without scanning the entire table, at the cost of slower writes and additional storage space. Every primary key is automatically indexed. Create indexes on foreign key columns to avoid full table scans when performing joins. Index columns that appear in WHERE clauses of frequent queries, particularly equality conditions and range conditions. Composite indexes covering multiple columns can satisfy queries that filter or sort on multiple columns simultaneously, but the column order in a composite index matters: the index is useful only for queries that use the leading columns in the order they appear in the index definition. Avoid indexing columns with very low cardinality, such as a boolean status column with only two distinct values, because the index is rarely selective enough to be faster than a table scan for such columns.

Handling Soft Deletes and Audit Trails

Hard-deleting rows from a production database is often the wrong choice because data relationships, audit requirements, and business recovery needs frequently require the ability to see or restore deleted records. Soft deletes mark records as deleted using a deleted_at timestamp column rather than physically removing the row. Application queries include a WHERE deleted_at IS NULL condition to exclude soft-deleted rows from normal results. One challenge with soft deletes is that unique constraints on active records must account for the possibility that a deleted record has the same value. A partial index that enforces uniqueness only on rows where deleted_at is null solves this elegantly in PostgreSQL. For audit trails that record who changed what and when, create an audit_log table that captures the table name, row identifier, the type of change, the before and after values as JSON, the user who made the change, and the timestamp.

Partitioning and Sharding for Large Tables

As tables grow to hundreds of millions or billions of rows, even well-indexed queries slow down because index structures become large and cache efficiency drops. Table partitioning splits a large table into smaller physical pieces based on a partition key, such as a date range or a hash of the primary key. Queries that filter on the partition key only need to scan the relevant partition rather than the entire table, dramatically improving performance for time-series data and historical records. PostgreSQL supports native declarative partitioning for range, list, and hash partition strategies. Index maintenance operations and vacuum operations run per-partition, reducing lock contention and maintenance windows. Sharding distributes partitions across multiple database servers for write scaling beyond what a single server can handle, introducing distributed transaction complexity that should only be embraced when the simpler alternatives of better indexing and vertical scaling have been exhausted.

Share this article

All posts
#Database#SQL#PostgreSQL#Backend#Tutorial
Abdur Razzak — Full Stack Web Developer

Free Consultation

Got a Project Idea? Let's Talk.