- Glean heavily tests all layers of their search stack, enabling quick verification of changes and catching bugs early through end-to-end simulations.
- They have revamped their deployment operations to make updates faster, safer, and more reliable, including a fast rollback operation as a safety net.
- They continuously iterate on the performance and reliability of their search engine clusters, aiming to provide the best customer experience.
At Glean, we have a lofty goal of delivering a best-in-class search product.
In a previous post, we talked about how we quickly deploy new ranking code. In this post, we focus on how we test and deploy the search engine cluster.
While this discussion may be technical in nature, we strive to always keep our focus on the top-level goal of improving the customer experience. This means all the engineering we do, even in the infrastructure layers, should result in faster feature delivery, better performance, fewer bugs and interruptions, and, ultimately, greater trust as we consistently hold ourselves to a higher standard.
Fast and reliable simulation and testing
At Glean, we heavily test all layers of our search stack. Each release qualification goes through a bevy of end-to-end search tests on fully loaded deployments. In addition to this, we realized the need to enable all our search engineers to simulate the full search stack within seconds. Every day we aim to ship new improvements while closely guarding the quality and robustness of the search engine.
To that end, we’ve invested heavily in being able to quickly bootstrap an in-memory multi-node search engine within our unit tests. This framework enables us to very quickly verify any changes we make to our search stack. This may involve updates to our index schemas, retrieval and scoring logic, text tokenization, and more.
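To make this concrete, here is a highly simplified sketch of the pattern in Python. It is not Glean’s actual framework; the index, tokenizer, and test below are hypothetical stand-ins meant only to show a search path being exercised end to end (tokenize, index, retrieve, score) entirely inside a unit test.

```python
# Hypothetical sketch: an "embedded" in-memory index that a unit test can build
# and query end to end. Glean's real framework bootstraps a multi-node search
# engine; this only illustrates the testing pattern, not their implementation.
import math
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    # Toy lowercase word tokenizer; a real stack has a much richer tokenization path.
    return re.findall(r"[a-z0-9]+", text.lower())


class InMemoryIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids
        self.docs = {}                    # doc id -> original text

    def add(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query: str) -> list[str]:
        # Retrieve docs matching any query term; score with a simple idf sum.
        scores = defaultdict(float)
        for term in tokenize(query):
            matches = self.postings.get(term, set())
            if not matches:
                continue
            idf = math.log(1 + len(self.docs) / len(matches))
            for doc_id in matches:
                scores[doc_id] += idf
        return [doc for doc, _ in sorted(scores.items(), key=lambda kv: -kv[1])]


def test_end_to_end_query():
    # The whole "engine" is created inside the test, so it runs in milliseconds.
    index = InMemoryIndex()
    index.add("doc1", "Quarterly sales report for EMEA")
    index.add("doc2", "Onboarding guide for new engineers")
    index.add("doc3", "Sales onboarding checklist")

    results = index.search("sales onboarding")
    assert results[0] == "doc3"  # the doc matching both query terms ranks first
    assert set(results) == {"doc1", "doc2", "doc3"}
```

Because the entire index lives in memory and is constructed inside the test, a change to the schema, scoring logic, or tokenizer can be verified with a normal test run rather than a full deployment.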
While every release still goes through a battery of end-to-end tests on fully loaded deployments, we prioritized this fast, lightweight path because:
- The earlier engineers can catch a bug within their local setup, the fewer leaks we’ll have. We all know the arduous experience of having to isolate a bug on a fully loaded setup.
- The quicker engineers can verify a fix, the faster we can help customers get back on track.
- The more we are able to simulate the end-to-end search logic as part of our regular build process, the more we actually deeply understand our product instead of just making assumptions about how things ought to work.
We’ve been pleasantly surprised by how much can be caught through our end-to-end simulations. When working on optimizing our text tokenization path, we discovered that even simple search query tests run against the embedded cluster found issues that weren’t caught by the tokenization unit tests themselves. We love when leaks are caught before the bug is even merged into the codebase!
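To illustrate why an end-to-end query test can catch what a tokenizer-only test misses, here is a hypothetical example (the toy tokenizers below are stand-ins, not our real tokenization code): when the indexing and query paths tokenize text differently, each path looks fine in isolation, and only a query run against the embedded index exposes the mismatch.

```python
# Hypothetical sketch: the indexing and query paths disagree on lowercasing.
# The tokenizer-only test passes, while the end-to-end search test fails and
# surfaces the bug before it is merged.
import re


def tokenize_for_index(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def tokenize_for_query(text: str) -> list[str]:
    # Bug: the query path forgot to lowercase its input.
    return re.findall(r"[A-Za-z0-9]+", text)


def test_tokenizer_only():
    # Passes: the tokens look reasonable when inspected on their own.
    assert tokenize_for_query("Sales report") == ["Sales", "report"]


def test_end_to_end_search():
    # Fails: the query terms never match the indexed terms, so search finds nothing.
    indexed_terms = set(tokenize_for_index("Quarterly Sales report"))
    query_terms = set(tokenize_for_query("Sales"))
    assert indexed_terms & query_terms, "query and index tokenization disagree"
```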
Fast and safe deployment operations
We aim to quickly and reliably deliver thousands of search engines. That is a high-level statement; in practice, it means we evaluate and improve, every day, the processes we use to operate and maintain all of these search engines. This is a constant work in progress that keeps paying off as we continue to expand our customer base.
Recently we invested in an overhaul of all our deploy operations. We reviewed every operation and asked ourselves how we could make every step faster, safer, and more easily testable.
- We’ve revamped our setup and teardown operation to more quickly and safely rebuild search engines across our entire customer base. This allows us to easily move customers to new clusters that have new search features and important performance improvements.
- We’ve added a battery of unit tests for every deployment operation. Discovering a bug while an operation is running against a customer deployment is not a position we want to find ourselves in.
- We’ve added important verification procedures for sensitive steps such as cluster switchover (moving from an old search engine to a new one). We run test queries on any new search engine before we proceed with switchover, and we run the same verification on the newly active search engine before we ever delete the old one, so that a rollback remains possible if we need it.
- We’ve implemented a fast rollback operation in case we discover problems in the new search engine. While we’ve never had to use this on a customer deployment, we have it because we would rather be safe than sorry. A rough sketch of this verify-switch-rollback flow follows after this list.
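To illustrate the shape of those last two safeguards, here is a minimal, hypothetical sketch of a verified switchover with a rollback path. The cluster objects, probe queries, and function names are ours for illustration only and are not Glean’s deployment tooling.

```python
# Hypothetical sketch of a verified cluster switchover with a fast rollback path.
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    docs: dict[str, str] = field(default_factory=dict)

    def search(self, query: str) -> list[str]:
        q = query.lower()
        return [doc_id for doc_id, text in self.docs.items() if q in text.lower()]


def verify(cluster: Cluster, probe_queries: list[str]) -> bool:
    # Run test queries and require non-empty results before trusting the cluster.
    return all(cluster.search(q) for q in probe_queries)


def switch_over(old: Cluster, new: Cluster, probes: list[str]) -> Cluster:
    # 1. Verify the new search engine before any traffic moves.
    if not verify(new, probes):
        raise RuntimeError(f"verification failed on {new.name}; aborting switchover")

    active = new  # 2. Route traffic to the new cluster.

    # 3. Re-verify the now-active cluster before the old one is deleted,
    #    so a fast rollback to the old cluster is still possible.
    if not verify(active, probes):
        active = old  # fast rollback: the old cluster is still intact

    # Only after this point would the old cluster be torn down.
    return active
```

The property worth noting is the ordering: the old search engine is only deleted after the new one has passed verification while active, so rollback stays cheap right up until that point.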
Through these efforts, we’re now spending much less time monitoring and babysitting previously arduous maintenance routines. We’re also catching bugs earlier, which is helping reduce on-call incidents.
Going forward
While these investments have yielded some serious improvements and given us some peace of mind, we’re constantly revisiting and iterating. In future posts, we’ll share more about the other investments we’re making to provide the best search possible. We’re learning every day and excited about the opportunity to build an industry-defining product.
If building or using a best-in-class search product sounds interesting to you, please reach out! We’d love to talk with you.