I’m on paternity leave until the end of the year since my daughter is on the way, and since I have a little time left before things get really busy, I want to reflect on how I’ve grown as an engineer in 2020.
I left Facebook at the end of 2019 to join Rockset, and it has been a fun year. For those who don’t know, Rockset is a real-time analytics database. The company is also a startup with about 30 people at the end of 2020. So there are a lot of things I get to learn, which comes from the combination of a relatively new field and a new working environment.
I’ll separate this note into 2 sections: technical topics that I learned, and some ways I grew personally as an engineer.
Since Rockset is a real-time analytics database, the first topic that comes to mind is columnar storage. I’ve kinda known of columnar storage before: basically store your data by column for fast scans. However, after joining Rockset, I got to actually dive deep into it. How exactly is a field organized? How do you handle updates? What optimizations can you make in order to make scanning fast?
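To make the "store by column" idea concrete, here is a toy sketch in Python. It is not how Rockset lays out data; the records, the field names, and the `to_columns` helper are all made up for illustration, and it assumes every record has the same fields.

```python
from typing import Any

# Row-oriented layout: all fields of one record are stored together.
rows = [
    {"user": "a", "age": 31, "city": "SF"},
    {"user": "b", "age": 25, "city": "NYC"},
    {"user": "c", "age": 40, "city": "SF"},
]


def to_columns(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row records into one list per field (hypothetical helper).

    Assumes every record has the same set of fields, so that position i
    in each column belongs to record i.
    """
    columns: dict[str, list[Any]] = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


# Column-oriented layout: all values of one field are stored together.
columns = to_columns(rows)

# Row scan: walk every record just to pull out a single field.
total_age_row = sum(record["age"] for record in rows)

# Column scan: read only the values of the field we care about.
total_age_col = sum(columns["age"])
```

Both scans give the same answer, but the column version only touches the `age` values. In a real engine the gain comes from contiguous memory and compression, which plain Python lists don't really show.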
There are a bunch of little things I’ve known from school: avoid branch misprediction, cache lines, vectorized execution, etc. But learning is one thing. Seeing it implemented, before and after, and how much it improves performance helps me appreciate it a lot more. Sometimes it’s not about how many different ideas you know of to improve things. It’s the understanding of how much of an impact the idea can have that matters.
I also read a bunch of research papers about columnar databases this year, now that I get to work on one. VLDB, a leading conference in databases, also happened to feature a lot of HTAP systems this year: F1, TiDB-Flash, Alibaba Analytical DB, etc. It’s a lot of fun to read these papers and think about how Rockset’s system compares to those.
Since Rockset uses RocksDB-Cloud, I get to learn about RocksDB! And somehow I became the maintainer of the RocksDB-Cloud repository (I guess because I touched it last).
I have to read a lot of RocksDB code to debug problems and understand how things are implemented internally. There is a lot to learn since this codebase is completely new to me.
Since I get to learn about RocksDB-Cloud, I’m also taking this opportunity to read more about Key-Value stores. There is a lot of research on this topic, but I particularly focus on how compaction scheduling can impact the performance of LSM trees.
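For anyone who hasn't seen compaction before, here is a toy LSM sketch in Python. It has nothing to do with RocksDB's actual implementation: the `ToyLSM` class, its method names, and the `memtable_limit` parameter are all hypothetical, and it ignores deletes, levels, and disk entirely.

```python
class ToyLSM:
    """A toy LSM tree: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []  # newest run first; each run is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a new sorted run."""
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest to oldest, so the latest write wins
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all runs into one, keeping only the newest value per key."""
        merged = {}
        for run in reversed(self.runs):  # oldest first, newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = ToyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)  # memtable is full, flushed as the first run
db.put("a", 3)
db.put("c", 4)  # flushed as a second run; "a" now lives in both runs
runs_before = len(db.runs)
db.compact()
```

The scheduling question shows up even here: the longer you put off `compact()`, the more runs a `get` may have to search, but every compaction rewrites data that was already written once.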
I also learned a bit about other data structures (mostly the B+ tree and its relatives) to see what the pros and cons of LSM trees are compared to the others, and what impact a change in storage medium (from HDD to SSD and now to NVMe) can have on which tree to choose.
SQL Query Engine
Rockset built its own SQL query engine in C++, so I’m taking this opportunity to learn about this as well. I don’t get to contribute much to it – but I get to read the codebase and talk to people who work on it. When I joined, we were still early in our journey to implement the query engine, so it was actually easier to learn about it – as opposed to starting from a full-fledged one. There is less to learn, and I get to understand the limitations of the current implementation and how to improve it in the next version.
This is also one of the reasons why I left Facebook last year: what you learn from scaling a system from a small one to a big one is different from what you learn by arriving at a gigantic one. With a gigantic system, you know how things are done correctly. After all, if a system can handle millions of queries per second, it has to be done right. However, you miss a lot of details on why certain things are built this way – little decisions are made along the way – and what benefits they bring as opposed to other implementations.
Also, one of the perks of working at a startup is that you get to know about almost everything other people are working on. It’s pretty simple to learn about what they’re doing – it’s just a Slack message away! I routinely annoy people by messaging them, “Hey, what you did sounds really cool. Can you explain to me a bit more? Just wanna learn.” Even though it probably brings zero benefit to them.
One of the tasks I did towards the end of this year was to figure out how to eliminate 5xx errors for clients. Sounds pretty simple, I thought – just wait for requests to finish before shutting down the server!
However, as it turned out, this problem opened a whole can of worms: I had to learn how Kubernetes networking works to solve it! Unfortunately, I didn’t even take a networking class in college, so I had to learn basically everything from scratch. (I didn’t even know the difference between a Layer 4 load balancer and a Layer 7 one. What’s Layer 4 even?)
I’ve always taken networking and infrastructure for granted. Back at Facebook, I just requested machines, and they would come up, and I ran my code there. Things just worked. Here, I get to actually understand how all these components work together (Calico, kubelet, kube-proxy, etcd, …). Still not an expert yet, but at least now I know what people are talking about.
The fix for my task was very simple: less than 50 lines of code. But the learning was pretty cool!
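The "just wait for requests to finish before shutting down" idea could be sketched like this in Python. To be clear, this is not the actual fix: the `DrainingServer` class and its methods are hypothetical, and real request handling is replaced by a plain callable.

```python
import threading


class DrainingServer:
    """Tracks in-flight requests so shutdown can wait for them to finish."""

    def __init__(self):
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0
        self._accepting = True

    def handle(self, work):
        """Run one request; returns an HTTP-like status code."""
        with self._lock:
            if not self._accepting:
                return 503  # shutdown already started, request is rejected
            self._in_flight += 1
        try:
            work()
            return 200
        finally:
            with self._idle:
                self._in_flight -= 1
                if self._in_flight == 0:
                    self._idle.notify_all()

    def shutdown(self, timeout=None):
        """Stop accepting new requests, then wait for in-flight ones.

        Returns True if everything drained, False if the timeout hit first.
        """
        with self._idle:
            self._accepting = False
            return self._idle.wait_for(
                lambda: self._in_flight == 0, timeout=timeout
            )


server = DrainingServer()
first = server.handle(lambda: None)     # served normally
drained = server.shutdown(timeout=1.0)  # nothing in flight, returns right away
late = server.handle(lambda: None)      # arrives after shutdown began
```

Note that the late request still gets a 5xx. This sketch only covers the in-process part and says nothing about how traffic gets routed away from the server, which is the part that needed the Kubernetes networking knowledge.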
I like solving problems, but one of the problems I had was that I sometimes understood a problem at a pretty shallow level before suggesting a solution. A lot of times, it turned out to be the wrong solution! This year, I was pushed to understand problems at a much deeper level, a lot of times by questions from my colleagues. It was challenging! There are a lot of things I considered a black box, but in order to answer those questions, or explain the problem clearly, I had to actually learn about those black boxes. And sometimes it turned out I had understood the problem completely wrong. This was quite a wake-up call, but also a growth opportunity.
Give a Public Talk
I gave a talk on Remote Compaction at the RocksDB meetup a few months ago. This was the first time I’ve ever given a talk in the Bay! I was pretty nervous and didn’t answer some of the related questions from the audience well. But I learned quite a bit about public speaking and presentation.
This is something I really appreciate from Rockset: my managers actually encourage me to give these talks. Besides raising awareness for our company, this also benefits me a great deal. This is also a good opportunity to meet others from different companies who work on the same problem.
One lesson I didn’t expect to learn came out of planning. Our team was planning what to do next year, and I, being an over-enthusiastic member, decided to write up a bunch of ideas that could improve the system.
However, the feedback from my manager was that the proposal I wrote was actually pretty one-sided. I tend to look at systems from one angle: how do I improve the performance of this system so that it runs faster and more reliably? I think it is an important angle, but it’s not enough.
There is a lot more to a system than just performance. How debuggable is the system? What kind of visibility into the system do you have when problems arise? Are you alerted on the right things? What kind of tests do you have to ensure the system works across deployments? What kind of tools do you have to debug and fix problems? Having considered these questions, I realize there is a lot we can, and have to, do to improve the system besides just performance.
Previously, because of my one-sided way of looking at things, I tended to get stuck when asked for ways to improve a system. This lesson helps me a lot in my journey to become a more senior engineer.
Personally, I think I grew a lot as an engineer this year. In some ways, I think I have gotten what I hoped for when I left my previous job. I really look forward to learning a lot more next year!