
Preprocessing trillions of tokens with Rust


About Aleph Alpha

Aleph Alpha is a German AI startup, a leader in the field of explainable and trustworthy sovereign AI. They're one of the few major players in the AI space based entirely in Europe.

The challenge

Aleph Alpha wanted to train the next generation of their AI foundation models.
The name of the game in the world of AI is data: You want to train on a large, high-quality dataset to get the best results.

Processing a lot of data is a challenge in itself. When you are dealing with petabytes of text, you can't just spin up a single large server to process it all. That would take forever, and forever is not an option when you're working in the AI space. You have to iterate quickly to improve your models and ensure that you, as well as your customers, are always building with the latest and greatest.

Aleph Alpha had reserved GPU clusters to train their models a few months in advance. The dataset had to be processed and ready to go by the time the clusters were available. The clock was ticking!
The plan: Build a distributed data pipeline in Rust to process the data in a reasonable amount of time.

But, like all startups, Aleph Alpha has to deal with limited resources. Their engineering team had to keep the existing product platform running smoothly: They wouldn't neglect their existing customers for the sake of a new project! That's where we at Mainmatter came into the picture.

Mainmatter's role

We partnered with Aleph Alpha in September 2023 to help them design and implement the data pipeline they needed. We followed a Team reinforcement approach: Our Principal Engineering Consultant, Luca Palmieri, embedded with Aleph Alpha's team for three months to help their project team deliver on its goals.

In particular, we supported the team in four key areas: architecture, infrastructure, Rust and mentoring.

Architecture

To run such a large-scale pipeline to completion in a reasonable amount of time, you need to distribute the workload across multiple machines. That's how you fully leverage the capabilities of modern cloud computing: Going from zero to thousands of CPUs for a few hours, then back to zero.

The system as a whole must also satisfy a variety of other constraints:

  • Fault-tolerance: If a machine fails, the pipeline should be able to recover and continue processing from where it left off, without losing any data.
  • Cost-efficiency: Storage fees, egress fees, compute fees... all of these can add up quickly. The system must be designed with cost-efficiency in mind from the get-go.
  • Data lineage: It should be possible to trace every piece of output data back to its source. This is critical for debugging and auditing purposes.

We worked closely with Aleph Alpha's team to design a system that would satisfy all of these constraints, while being simple to operate and easy to reason about.
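
As a rough illustration of how these constraints shape the design (a sketch for this write-up, not Aleph Alpha's actual schema), a unit of work in such a pipeline might carry everything needed to retry it safely and to trace its outputs:

    /// Illustrative only: one way to model a unit of work so that the
    /// constraints above are addressed by construction.
    #[derive(Debug, Clone)]
    struct WorkItem {
        /// Stable identifier. Processing is idempotent, so the item from a
        /// crashed worker can simply be re-queued and retried (fault-tolerance).
        id: String,
        /// Location of the raw input in object storage. Compute nodes download
        /// only the shards they actually process (cost-efficiency).
        source_object: String,
        /// Identifiers of the upstream artefacts this item was derived from,
        /// carried through every stage of the pipeline (data lineage).
        derived_from: Vec<String>,
    }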

Infrastructure

You can't architect a system in a vacuum: You need to take into consideration the infrastructure it will run on. The underlying provider determines the capabilities and constraints you have to work with, such as the maximum download/upload throughput you can expect from a single data bucket or the cost of moving gigabytes of data from one cloud region to another.

For this data pipeline, we worked with Aleph Alpha's team to assess different cloud providers and pick the one that would best fit the needs of the system as well as the company's long-term strategy: StackIt, a German cloud provider.

We chose to rely on managed versions of Open-Source software. Aleph Alpha benefitted from our extensive experience in setting up these tools, while enjoying the ease of use of a managed service. At the same time, it reduced the overall infrastructure risk for the project: If we were to run into an unsolvable issue with StackIt's managed offering, we could always fall back to the Open-Source version or switch to another provider.

The final infrastructure stack looked like this:

  • Object storage: We relied on StackIt's managed object storage to store hyperparameters, intermediate artefacts and the final output for the pipeline. The interface is compatible with the S3 API, allowing us to rely on AWS' battle-tested Rust SDK.
  • Message broker: We picked RabbitMQ to pass messages between the different components of the pipeline. lapin served well as a Rust client.
  • Metadata storage: We used PostgreSQL to store metadata about the pipeline's progress. We relied on sqlx to interact with the database from Rust.
  • Compute: We relied on StackIt's managed Kubernetes offering to run the pipeline.
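
To give a flavour of what this stack looks like from the Rust side, here is a minimal sketch of how the three managed services are driven from pipeline code. The endpoint URL, bucket, queue and table names are hypothetical, and error handling is reduced to anyhow for brevity:

    use aws_sdk_s3::primitives::ByteStream;
    use lapin::{options::BasicPublishOptions, BasicProperties, Connection, ConnectionProperties};
    use sqlx::PgPool;

    async fn enqueue_shard(doc_batch: &[u8]) -> anyhow::Result<()> {
        // Object storage: the S3-compatible API lets us use the AWS Rust SDK,
        // pointed at the provider's endpoint (hypothetical URL).
        let aws_config = aws_config::from_env()
            .endpoint_url("https://object.storage.example")
            .load()
            .await;
        let s3 = aws_sdk_s3::Client::new(&aws_config);
        s3.put_object()
            .bucket("pipeline-artefacts") // hypothetical bucket
            .key("shards/000001.jsonl.zst")
            .body(ByteStream::from(doc_batch.to_vec()))
            .send()
            .await?;

        // Message broker: tell the next pipeline stage there is work to do.
        let amqp = Connection::connect("amqp://rabbitmq:5672/%2f", ConnectionProperties::default()).await?;
        let channel = amqp.create_channel().await?;
        channel
            .basic_publish(
                "",                // default exchange
                "tokenize-shards", // hypothetical queue
                BasicPublishOptions::default(),
                br#"{"shard":"shards/000001.jsonl.zst"}"#,
                BasicProperties::default(),
            )
            .await?;

        // Metadata storage: record the shard's status so progress can be
        // tracked and resumed after a failure.
        let pool = PgPool::connect("postgres://pipeline@postgres/metadata").await?;
        sqlx::query("INSERT INTO shard_status (key, status) VALUES ($1, $2)")
            .bind("shards/000001.jsonl.zst")
            .bind("queued")
            .execute(&pool)
            .await?;

        Ok(())
    }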

Rust

The entire data pipeline was built in Rust. That's the reason Aleph Alpha reached out to Mainmatter in the first place: They needed someone with deep expertise in Rust to help them deliver the project.

Rust is a great fit for this kind of project: It delivers high and predictable performance, while giving you precise control over the memory layout of your data. That efficiency is critical when dealing with such a large dataset, where you can't afford to waste CPU cycles or RAM. Aleph Alpha's team already had solid Rust experience to build on.

Throughout the project, we came to appreciate a few more advantages of using Rust:

  • Correctness: Rust's type system and borrow checker make it easy to write code that is correct by construction. That's even more important in a project with such an aggressive timeline: you don’t want to waste time debugging runtime errors or memory unsafety bugs. The more static analysis the compiler can do for you, the more confident you can be that your code is correct.
  • Interoperability: Aleph Alpha’s AI researchers were working very closely with us on the project: tuning parameters, testing filters, checking data quality, etc. Researchers, unlike engineers, are rarely familiar with Rust; Python is king in the AI research ecosystem. We tried to make their lives as easy as possible while minimizing the time spent on rewrites: Researchers would prototype in Python and, using Rust’s excellent Python interop capabilities (thanks, pyo3!), we would plug their code into the Rust pipeline to verify its functionality and run it at scale. If the change proved desirable and we needed better performance, we would then port the Python code over to Rust (see the sketch after this list).
  • Ecosystem: A growing number of companies are building core machine learning infrastructure in Rust. As a by-product, you can find high-quality implementations of key algorithms on crates.io. A special mention goes to Hugging Face’s tokenizers crate, a load-bearing component of the final pipeline.
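
As a rough example of the workflow described above, a pipeline stage could call a researcher's Python prototype filter and then count the tokens of every document that survives it with the tokenizers crate. This is a simplified sketch: the module, function and file names are illustrative, and pyo3's embedded-interpreter setup (e.g. its auto-initialize feature) is assumed.

    use pyo3::prelude::*;
    use tokenizers::Tokenizer;

    /// Sketch of a single pipeline step: run a document through a researcher's
    /// Python prototype filter, then count its tokens. `quality_filter.py` and
    /// its `keep_document(text: str) -> bool` function are illustrative names.
    fn process_document(doc: &str, tokenizer: &Tokenizer) -> anyhow::Result<Option<usize>> {
        // Call the Python prototype via pyo3, embedding the script at compile time.
        let keep: bool = Python::with_gil(|py| -> PyResult<bool> {
            let module = PyModule::from_code(
                py,
                include_str!("quality_filter.py"),
                "quality_filter.py",
                "quality_filter",
            )?;
            module.getattr("keep_document")?.call1((doc,))?.extract()
        })?;

        if !keep {
            return Ok(None);
        }

        // Tokenize the surviving document with Hugging Face's `tokenizers` crate
        // to keep track of how many tokens it contributes to the dataset.
        let encoding = tokenizer.encode(doc, false).map_err(|e| anyhow::anyhow!(e))?;
        Ok(Some(encoding.get_ids().len()))
    }

Once a filter proved its worth and became a bottleneck, its Python body could be replaced by an equivalent Rust implementation behind the same function signature, which is exactly the porting step described above.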

Outcome

We successfully preprocessed 4.5 trillion tokens to assemble a high-quality multilingual dataset. The data pipeline was designed, developed and run on time and on budget: Aleph Alpha didn't have to delay their training schedule by a single day. The data was ready to go by the time the GPU cluster was available.

"Working with Mainmatter's experts has been a great experience. They helped us develop a state-of-the-art data pipeline, mentored our internal team and introduced several improvements around our Rust code and infrastructure along the way. I've learned so much, especially during our pairing sessions—it allowed me to improve my technical skills and grow as an engineer."
Andreas Hartel, Senior Engineer at Aleph Alpha
