Notion's Crisis: How They Solved Their Database Problem
In 2021, Notion's popularity skyrocketed, but its service became unbearably slow. The problem lay in its unique data model, where everything is a block, which can be a piece of text, an image, or an entire page itself. This structure allows for incredible versatility but also means that even a simple document can result in hundreds or thousands of database entries.
Each block is stored as a row in a Postgres database with its own unique ID. The sheer volume of data eventually caused users to notice increased latency when requesting page data. Notion's single monolithic database could no longer handle the load, and their Postgres vacuum process began to stall consistently, leading to bloated tables and degraded performance.
The Solution: Horizontal Scaling and Sharding
Notion decided to shard their database, creating 32 physical database instances, each with 15 separate logical schemas. Each schema holds its own copies of the core tables, like block, workspace, and comments, for a total of 480 logical shards across the 32 physical databases. Routing is handled at the application level, which determines which shard stores a given piece of data.
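To make the layout concrete, here is a minimal sketch of application-level shard routing. It assumes routing by workspace ID via a stable hash; the hash function, naming, and exact placement scheme are illustrative, not Notion's actual implementation.

```python
import hashlib

NUM_PHYSICAL_DBS = 32      # physical Postgres instances
SCHEMAS_PER_DB = 15        # logical schemas per instance
TOTAL_SHARDS = NUM_PHYSICAL_DBS * SCHEMAS_PER_DB  # 480 logical shards

def shard_for(workspace_id: str) -> tuple[int, int]:
    """Map a workspace ID to (physical_db, logical_schema).

    A stable hash ensures every application server agrees on where
    a workspace's blocks live, with no central lookup required.
    """
    digest = hashlib.sha256(workspace_id.encode()).digest()
    shard = int.from_bytes(digest[:8], "big") % TOTAL_SHARDS
    return shard // SCHEMAS_PER_DB, shard % SCHEMAS_PER_DB

db, schema = shard_for("workspace-1234")
print(f"route to db-{db:02d}, schema_{schema:03d}")
```

Because all of a workspace's blocks hash to the same shard, page loads touch a single database, which is what makes this partitioning scheme effective for Notion's access pattern.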
The Challenges: Data Migration and Connection Limitations
Notion had to migrate their existing data to the new shards while maintaining data consistency. They used Postgres logical replication to continuously apply new changes to the new databases. The process involved setting up three Postgres publications on each existing database, with each publication covering the five logical schemas destined for one new database. Each new database then created a subscription consuming one of the three publications, covering exactly its slice of the data.
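The grouping above (15 schemas split into 3 publications of 5) can be sketched as a small DDL generator. The schema, table, and publication names here are hypothetical; Notion's real naming scheme is not public.

```python
SCHEMAS_PER_DB = 15
PUBLICATIONS_PER_DB = 3
SCHEMAS_PER_PUBLICATION = SCHEMAS_PER_DB // PUBLICATIONS_PER_DB  # 5

# Core tables replicated from each logical schema (illustrative).
TABLES = ["block", "workspace", "comments"]

def publication_ddl(pub_index: int) -> str:
    """Emit CREATE PUBLICATION DDL covering one group of five schemas.

    Listing tables explicitly (FOR TABLE a, b, ...) keeps the
    statement compatible with older Postgres versions.
    """
    start = pub_index * SCHEMAS_PER_PUBLICATION
    tables = ", ".join(
        f"schema_{s:03d}.{t}"
        for s in range(start, start + SCHEMAS_PER_PUBLICATION)
        for t in TABLES
    )
    return f"CREATE PUBLICATION pub_{pub_index} FOR TABLE {tables};"

for i in range(PUBLICATIONS_PER_DB):
    print(publication_ddl(i))
```

On the receiving side, each new database would run a matching `CREATE SUBSCRIPTION ... PUBLICATION pub_N` pointing at the old database, so changes flow continuously until cutover.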
However, testing uncovered a critical issue: each old shard mapped to three new shards, which tripled the number of downstream connections. Notion had to either reduce the number of connections per PgBouncer instance or increase the connection limit by 3x. They chose to increase the limit, verifying that connection counts stayed within bounds before rolling the changes out to production.
The Outcome: Increased Capacity and Improved Performance
The resharding project was a significant success for Notion. Some key outcomes included:
- Increased capacity for future growth
- Improved performance: CPU and IOPS utilization decreased dramatically, with new utilization hovering around 20% during peak traffic compared to the previous 90%
- An architecture positioned to handle continued user growth and growing data demands
In conclusion, Notion's crisis was solved through a combination of horizontal scaling and sharding, careful data migration, and ingenious solutions to connection limitations. Their new architecture has positioned them for continued growth and success.