<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Crafting Software Diary]]></title><description><![CDATA[Crafting Software Diary]]></description><link>https://blogs.lampham.dev</link><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 13:51:37 GMT</lastBuildDate><atom:link href="https://blogs.lampham.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Beyond Big-O: How Hardware Shapes Code Performance.]]></title><description><![CDATA[In programming, code performance tends to be tied to algorithms. We often think of code complexity as the primary factor for code speed. The rule of thumb is:

The fewer operations a program performs, the faster it runs.

While this principle is va...]]></description><link>https://blogs.lampham.dev/beyond-big-o-how-hardware-shapes-code-performance</link><guid isPermaLink="true">https://blogs.lampham.dev/beyond-big-o-how-hardware-shapes-code-performance</guid><category><![CDATA[memory-management]]></category><category><![CDATA[performance]]></category><category><![CDATA[algorithms]]></category><category><![CDATA[Computer Science]]></category><category><![CDATA[Programming Blogs]]></category><dc:creator><![CDATA[Lam PHAM]]></dc:creator><pubDate>Wed, 02 Jul 2025 13:18:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1751467981324/364ba41f-b81c-4838-b6b6-b50d2807af00.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In programming, code performance tends to be tied to algorithms. We often think of code complexity as the primary factor for code speed. The rule of thumb is:</p>
<blockquote>
<p>The fewer operations a program performs, the faster it runs.</p>
</blockquote>
<p>While this principle is valid, it overlooks a critical aspect: performance comparisons only make sense when the environment remains the same. And by <em>environment</em>, I mean the hardware: CPU, GPU, RAM, etc.— because ultimately, code is just instructions; it is the hardware that executes them. Code doesn't just define <em>what</em> tasks to perform, it can also tell the hardware <em>how</em> to do them.</p>
<p>Yet, most developers focus heavily on minimizing the <em>what</em>—reducing the number of operations—while completely ignoring the <em>how</em>. The reason is understandable: optimizing the how requires a solid understanding of hardware-level behavior, and as <em>software</em> engineers, we often:</p>
<ul>
<li><p>Tend to prioritize high-level features and functionality.</p>
</li>
<li><p>Lack familiarity with low-level hardware operations.</p>
</li>
<li><p>May not even realize how much hardware-aware coding can boost performance.</p>
</li>
</ul>
<p>In this article, we'll see how a deeper understanding of hardware can turn even small code changes into significant performance gains.</p>
<h1 id="heading-memory-reading">Memory reading</h1>
<p>Let’s look at a simple example.</p>
<p>Given this array of length <strong>2 073 600</strong> (<strong>1920×1080</strong>):</p>
<blockquote>
<p>Note: this example only works with arrays, not linked lists—  we’ll explore why later in the section.</p>
</blockquote>
<pre><code class="lang-kotlin">private const val COLUMN = 1920
private const val ROW = 1080
private val length = COLUMN * ROW

// Pre-size the list so the backing array is allocated once, then fill it.
private val list = ArrayList&lt;Int&gt;(length).apply {
    for (i in 0 until length) add(i)
}
</code></pre>
<p>Now, let’s write a program that iterates through the entire array and processes each element. We are going to study two approaches:</p>
<ul>
<li><p>The first method iterates through items sequentially: <em>0, 1, 2, 3, 4 …</em></p>
</li>
<li><p>The second method iterates in a leapfrogging pattern: <em>0, 1920, 3840, 7680, …, 1, 1921, 3841, ...</em></p>
</li>
</ul>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Sequential</td><td>Leapfrogging</td></tr>
</thead>
<tbody>
<tr>
<td>0, 1, 2, 3, 4 …</td><td>0, 1920, 3840, 7680, …, 1, 1921, 3841, ...</td></tr>
<tr>
<td><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751395194092/d3fd9cd5-c415-4537-a837-b777fbd6e2ce.png" /></td><td><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751395183902/762de7a5-5f48-4b56-939c-5767650f6c5d.png" /></td></tr>
</tbody>
</table>
</div><p>These two approaches have the same complexity and perform exactly the same number of operations; the only difference is the order in which the items are accessed.</p>
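<p>To make the two access patterns concrete, here is a small Python sketch of both traversals (the original benchmark above was written in Kotlin; in CPython the cache effect is muted by boxed integers, so treat this purely as an illustration of the access order):</p>

```python
COLUMN, ROW = 1920, 1080
length = COLUMN * ROW
data = list(range(length))

def sequential_sum(items):
    # Visits indices 0, 1, 2, 3, ... — walks memory front to back.
    total = 0
    for i in range(len(items)):
        total += items[i]
    return total

def leapfrog_sum(items, column):
    # Visits 0, 1920, 3840, ..., then 1, 1921, 3841, ... — jumps a whole
    # "row" of elements on every step.
    total = 0
    for start in range(column):
        for i in range(start, len(items), column):
            total += items[i]
    return total

# Identical operation counts and identical results; only the order differs.
assert sequential_sum(data) == leapfrog_sum(data, COLUMN)
```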
<p>Before reading further, take a moment to think: what could possibly make one method faster than the other?</p>
<p>...</p>
<p>Alright, let’s reveal the answer. Here’s the benchmark comparison of the two methods:</p>
<blockquote>
<p><strong>Note:</strong> The benchmark was done with the kotlinx-benchmark library in Mode.AverageTime.</p>
</blockquote>
<pre><code class="lang-kotlin">Benchmark                        Mode  Cnt  Score   Error  Units
Benchmark.sequentialIteration    avgt    <span class="hljs-number">5</span>  <span class="hljs-number">0.952</span> ± <span class="hljs-number">0.003</span>  ms/op
Benchmark.leapfroggingIteration  avgt    <span class="hljs-number">5</span>  <span class="hljs-number">3.517</span> ± <span class="hljs-number">0.015</span>  ms/op
</code></pre>
<p>The benchmark shows that the sequential approach completes an operation in approximately <strong>0.952 milliseconds</strong>, while the leapfrogging approach takes about <strong>3.517 milliseconds</strong>. In other words, sequential reading is roughly <strong>3.7 times faster</strong> than leapfrogging (the exact ratio varies depending on the hardware and the data structure used) 🔥🔥.</p>
<p>Wondering why?</p>
<p>To understand the reason, we’ll examine the data reading function—since the only difference between the two methods lies in the <strong>order</strong> in which data is read.</p>
<p>Although memory architecture varies across CPU designs, all systems generally include RAM and multiple layers of cache. When a program requests data, the CPU first checks the cache; if the data isn't there (a cache miss), a whole block containing it is fetched from RAM, which is significantly slower. If future requests hit previously fetched blocks, those slow RAM accesses are avoided.</p>
<p>In short, the fewer memory fetches from RAM, the faster the program runs.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751073021900/0a6583b1-8de1-410a-884c-adc1837ccee4.png" alt class="image--center mx-auto" /></p>
<p>Below is what happens in the two above programs:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Sequential</td><td>Leapfrogging</td></tr>
</thead>
<tbody>
<tr>
<td>0, 1, 2, 3, 4 …</td><td>0, 1920, 3840, 7680, …, 1, 1921, 3841, ...</td></tr>
<tr>
<td><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751395194092/d3fd9cd5-c415-4537-a837-b777fbd6e2ce.png" /></td><td><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751395183902/762de7a5-5f48-4b56-939c-5767650f6c5d.png" /></td></tr>
<tr>
<td><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751382894720/c30ed581-2bd3-441f-8cdc-ddebc7971dad.gif" /></td><td><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751395148903/1a91d715-05b2-42b0-8f5e-052f87e1b86f.gif" /></td></tr>
</tbody>
</table>
</div><p>With the same amount of data requested, sequential reading typically requires far fewer RAM fetches than leapfrogging, making it significantly more efficient.</p>
<blockquote>
<p><strong>Important note:</strong> As mentioned earlier, arrays are the best fit for sequential reading because they store data in <strong>contiguous</strong> memory blocks. <strong>Non-contiguous</strong> structures (like LinkedLists) don’t offer the same performance benefits.</p>
</blockquote>
<p>Best-practice takeaway: Always aim to reduce the number of RAM fetches. With contiguous storage, sequential reading is the most efficient approach. You can optimize further by knowing the capacity of the cache— newly fetched data is loaded into the cache, while older data is evicted to make room—and tailoring the data reading pattern accordingly.</p>
<p>By understanding how cache fetching works, you can significantly improve your program’s performance with just a tiny code change.</p>
<h1 id="heading-numpy-vectorization-operations">Numpy vectorization operations</h1>
<p>I have recently started exploring some Machine Learning basics. While studying the Linear Regression model (with multiple features), I came across a problem that involves multiplying two arrays element-wise:</p>
<p>$$
\mathbf{w} = [w_1, w_2, \ldots, w_n],\quad \mathbf{x} = [x_1, x_2, \ldots, x_n]
$$</p><p>$$
\mathbf{w} \cdot \mathbf{x} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n
$$</p><p>From a programming perspective, this is a straightforward problem. The easiest and most intuitive approach is to use a <em>for loop</em> to multiply each pair of elements and then sum all the products.</p>
<pre><code class="lang-python">f = <span class="hljs-number">0</span>
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n):
    f = f + w[i] * x[i]
</code></pre>
<p>To me, iterating over every element and doing the math seemed inevitable. Therefore, I believed the <em>for loop</em> was the most efficient solution and that no better alternative existed. And then, I got introduced to the <em>Numpy</em> library with its built-in <code>dot()</code> operation:</p>
<pre><code class="lang-python">f = np.dot(w,x)
</code></pre>
<p>At first, I assumed the <code>dot()</code> function was just a fancy wrapper around a <em>for loop</em>—until the instructor made this bold claim:</p>
<blockquote>
<p>Numpy's dot function can be hundreds or even thousands of times faster than a for-loop.</p>
</blockquote>
<p>Wait what? 😮😮 Like how?</p>
<p>Out of curiosity, I looked it up and discovered how it was made possible. It turns out that the <em>Numpy</em> library uses a variety of sophisticated techniques that take advantage of modern hardware capabilities to maximize performance when working with massive numerical data.</p>
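<p>You can verify the claim yourself with a quick, unscientific timing sketch (absolute numbers are machine-dependent, and <code>time.perf_counter</code> is far cruder than a real benchmark harness):</p>

```python
import time
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)
w = rng.random(n)
x = rng.random(n)

def loop_dot(w, x):
    # Plain Python loop: every iteration is interpreted at runtime.
    f = 0.0
    for i in range(len(w)):
        f += w[i] * x[i]
    return f

start = time.perf_counter()
loop_result = loop_dot(w, x)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
dot_result = np.dot(w, x)  # pre-compiled, vectorized
dot_seconds = time.perf_counter() - start

# Both compute the same value; the speedup factor depends on the machine.
print(f"loop: {loop_seconds:.3f}s  np.dot: {dot_seconds:.5f}s  "
      f"speedup: {loop_seconds / dot_seconds:.0f}x")
```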
<h2 id="heading-pre-compiled-code">Pre-compiled code</h2>
<p>Even though <em>Numpy</em> is a Python library, its vectorized methods are pre-compiled C (or sometimes Fortran) code. This compiled code runs much faster than a Python <em>for-loop</em>— where each iteration is interpreted at runtime.</p>
<h2 id="heading-vectorization-simd">Vectorization (SIMD)</h2>
<p>In an ML context, data is usually vectorized, meaning it is transformed into numerical vectors that models can process efficiently. Vectorization allows <strong>computations on entire vectors at once</strong>, leveraging the parallel processing capabilities of modern CPUs and GPUs. This is enabled by <strong>Single Instruction, Multiple Data</strong> (SIMD), a technique where one instruction operates on multiple data points <strong>simultaneously</strong> within a single CPU cycle. For example, in a dot product, several of the products \( w_i x_i \) are computed in parallel per instruction (as many as fit in a SIMD register), and their results are accumulated. This results in significantly <strong>faster computations</strong> and <strong>simpler code</strong>.</p>
<h2 id="heading-efficient-memory-management">Efficient Memory Management</h2>
<h3 id="heading-memory-layout">Memory layout</h3>
<p>Numpy's data structures—primarily the <em>ndarray</em>—store data in <strong>contiguous memory</strong> blocks. As analyzed in the previous section, this memory layout allows for much faster reading and processing of large datasets.</p>
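<p>You can inspect this layout directly. The snippet below checks the contiguity flag and the strides of a small C-ordered array of 8-byte integers (the stride values shown hold for <code>int64</code>):</p>

```python
import numpy as np

# A 2x3 array of 8-byte integers stored in C (row-major) order:
a = np.arange(6, dtype=np.int64).reshape(2, 3)

print(a.flags['C_CONTIGUOUS'])  # True — one contiguous memory block
print(a.strides)                # (24, 8): the next row is 24 bytes away,
                                # the next column only 8 bytes — so a
                                # row-by-row scan touches adjacent bytes
```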
<h3 id="heading-cache-optimization">Cache optimization</h3>
<p>We've seen how utilizing the cache can improve code performance. NumPy uses <strong>blocking/tile algorithms</strong> to maximize <strong>data locality</strong> and <strong>cache utilization</strong>. The idea is to divide computations into smaller, more manageable chunks called <em>blocks</em> or <em>tiles</em>. Each block is processed entirely before moving on to the next, helping ensure that the necessary data stays in the cache for as long as possible.</p>
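<p>NumPy's actual blocking lives in compiled BLAS code, but the tiling idea can be sketched in a few lines of Python (a toy illustration, not NumPy's implementation; the block size of 64 is an arbitrary choice):</p>

```python
import numpy as np

def blocked_matmul(a, b, block=64):
    """Toy blocking/tiling sketch: compute C = A @ B one tile at a time so
    each tile's working set can stay resident in cache while accumulating."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((n, m))
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                # Accumulate the contribution of one A-tile × B-tile pair.
                # (Slices past the edge are clipped, so any shape works.)
                c[i:i+block, j:j+block] += (
                    a[i:i+block, p:p+block] @ b[p:p+block, j:j+block]
                )
    return c
```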
<h2 id="heading-parallelism-and-multi-core-utilization">Parallelism and Multi-Core Utilization</h2>
<p>In addition to parallel data processing <strong>within a single CPU core</strong> (SIMD), Numpy can also use multiple cores: operations like <code>dot()</code> delegate to optimized BLAS backends (such as OpenBLAS or MKL), which are typically multithreaded.</p>
<h2 id="heading-and-more">And more...</h2>
<p>There are still many other algorithms tailored to specific hardware, memory designs, and device architectures. All of them contribute to creating a sophisticated solution for processing large datasets.</p>
<h1 id="heading-takeaways">Takeaways</h1>
<ul>
<li><strong>Algorithm complexity isn't everything</strong>: Reducing the number of operations helps, but performance comparisons are only meaningful when the hardware environment stays constant.</li>
<li><strong>Hardware matters</strong>: A solid understanding of fundamental hardware operations—such as data access, cache behavior, memory management, and CPU architecture—can help you design significantly more efficient solutions.</li>
<li><strong>Code optimization has limits, thus hardware matters even more</strong>: At some point, algorithmic improvements reach their theoretical minimum. Hardware, on the other hand, continues to evolve. Major tech companies invest heavily in better processors, memory, and accelerators—giving you new tools to push performance further.</li>
<li><strong>Hybrid optimization is essential</strong>: For large-scale or performance-critical tasks, neither algorithmic efficiency nor hardware tuning alone is sufficient. Combining both—writing efficient code and aligning it with hardware behavior—yields the best results.</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Optimizing Large Android Project Builds : When Recommended Tunings Aren't Enough 
- Part 2]]></title><description><![CDATA[In the previous part, we discussed why merely following standard advice often isn’t enough to boost your build performance. In this part, we will explore strategies to truly unlock your builds— while also keeping them consistently optimized and ...]]></description><link>https://blogs.lampham.dev/optimizing-large-android-project-builds-when-recommended-tunings-arent-enough-part-2</link><guid isPermaLink="true">https://blogs.lampham.dev/optimizing-large-android-project-builds-when-recommended-tunings-arent-enough-part-2</guid><category><![CDATA[Android]]></category><category><![CDATA[android app development]]></category><category><![CDATA[gradle]]></category><category><![CDATA[gradle profiler]]></category><category><![CDATA[Build tool]]></category><category><![CDATA[jvm]]></category><dc:creator><![CDATA[Lam PHAM]]></dc:creator><pubDate>Fri, 25 Apr 2025 09:02:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745361747508/01c27aa2-5a41-4814-b4a1-c1d2d86cdbef.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://blogs.lampham.dev/optimizing-large-android-project-builds-when-recommended-tunings-arent-enough-part-1">In the previous part</a>, we discussed why merely following standard advice often isn’t enough to boost your build performance. In this part, we will explore strategies to truly unlock your builds— while also keeping them consistently optimized and ensuring no regressions silently sneak back in.</p>
<p><a target="_blank" href="https://blogs.lampham.dev/optimizing-large-android-project-builds-when-recommended-tunings-arent-enough-part-1">Link to Part 1</a></p>
<h1 id="heading-keys-to-improving-build-performance">Keys to improving build performance</h1>
<h2 id="heading-understand-the-tools">Understand the tools</h2>
<p>If you have read through the previous part, you have likely noticed a common pattern behind why simply enabling the recommended features falls short: developers often do not understand <em>how</em> the tools work. They tend to understand <em>what</em> these features do, but not <em>how</em> they do it. And that <em>how</em> is essential when it comes to troubleshooting build issues and getting the most out of each feature.</p>
<p>At the end of the day, improving build time comes down to making way for the existing features to work at their best. There is no need to introduce new techniques, only to adjust the system so that Gradle's features can do what they are designed to do. Understanding the tools deeply is therefore the inevitable path to success.</p>
<p>Take Build Cache as an example: its concept is straightforward, not new, and relatively easy to understand—it skips task execution if there are no changes between builds. Most developers easily grasp the idea, or the <em>what</em>. However, imagine a problem where a second build is launched right after the first one without modifying anything, and yet some tasks still get executed. Knowing the <em>what</em> might help you spot that something's off, but it will not guide you in investigating and fixing it. This is where understanding the <em>how</em> truly matters. Moreover, knowing Build Cache's input-output model helps ensure you do not introduce regressions while extending your build scripts.</p>
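<p>That input-output model is worth seeing in code. Below is a hedged sketch of a custom cacheable Gradle task (the task name and file contents are invented for illustration): Gradle hashes the declared inputs, and when the hash matches a previous build, the declared output is restored from the cache instead of being regenerated. Undeclared inputs—timestamps, absolute paths—are exactly what cause the cache misses discussed later.</p>

```kotlin
// build.gradle.kts (illustrative): a task Gradle can cache because its
// inputs and outputs are fully declared.
@CacheableTask
abstract class GenerateVersionFile : DefaultTask() {
    @get:Input                       // part of the cache key
    abstract val version: Property<String>

    @get:OutputFile                  // what the cache stores and restores
    abstract val outputFile: RegularFileProperty

    @TaskAction
    fun generate() {
        outputFile.get().asFile.writeText("version=${version.get()}")
    }
}

tasks.register<GenerateVersionFile>("generateVersionFile") {
    version.set("1.0.0")
    outputFile.set(layout.buildDirectory.file("version.txt"))
}
```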
<p>In Android ecosystem, understanding the tools starts with understanding <strong>Gradle</strong> itself and its build mechanisms: Gradle Daemon, Build Cache, incremental build, parallelism and so on. Next comes the <strong>Android Gradle Plugin</strong> (a.k.a AGP), another major tool in managing Android-specific builds. A solid grasp of AGP helps you navigate platform specific concerns like build types, variants, minification and others. Beyond that, it’s also essential to understand the <strong>Programming Languages</strong> involved and their compilation processes. Knowing how Java and Kotlin handle compilation—especially features like <em>compilation avoidance</em> and the differences between the two—allows you to write production code that’s much more build-friendly. Lastly, familiarity with the underlying build environment—<strong>the JVM</strong>—can be a great advantage when it comes to optimizing build memory usage or tuning other performance-related aspects.</p>
<h2 id="heading-understand-your-project">Understand your project</h2>
<p>To choose the right tools or setups, knowing your project’s characteristics and what it needs is everything. Get inspired by solutions from other projects, but always remember that your project is different.</p>
<p>For instance, both Android builds and Java servers run on the JVM, but they are optimized for different goals. A Java server JVM is tuned for runtime performance, focusing on high throughput, low latency, and efficient resource management (memory, CPU) to handle concurrent requests and long-running processes. Android builds, on the other hand, are short-lived processes that prioritize build speed, consistency, and toolchain stability.</p>
<p>As a result, a GC optimized for a server JVM may not work well for an Android build. Dalvik (and its successor ART), Android's application runtime, is a great example of this distinction, as it was designed specifically for the mobile runtime environment, with different priorities.</p>
<p>Another example, if your project is structured as a large single module with a high volume of unit tests, <a target="_blank" href="https://docs.gradle.org/current/userguide/performance.html#a_run_tests_in_parallel"><em>enabling parallel test</em></a> is critical and helps significantly speed up your testing task.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745656412286/f29fbcbf-f99a-4727-b446-a062ec15c8cc.png" alt="Parallel test execution" class="image--center mx-auto" /></p>
<p>However, if your project is already structured as a multi-module setup, this option generally won't provide much additional benefit when running all unit tests across the entire project—and in some situations, it may even backfire due to the cost of spawning extra processes.</p>
<p><em>Note that there are still cases where this feature is quite useful even in a multi-module project—I will show an example of that in a later section.</em></p>
<h2 id="heading-understand-build-environment">Understand build environment</h2>
<p>So far, we have seen that tools’ efficiency can vary much depending on projects. Another key factor that contributes significantly to the effectiveness of these solutions is the build environment.</p>
<p>Building in local is different than building in CI, or in Android context, building a debug version is different than building a release one.</p>
<p>Locally, incremental builds and the build cache are incredibly efficient because caches are stored on your machine and you probably have enough space to retain them over time. Furthermore, the differences between consecutive builds are often minimal, which allows the JVM to optimize itself and improve performance across builds. As a result, local builds are often faster, and you have the flexibility to fine-tune them based on your work machine's characteristics.</p>
<p>In contrast, building on CI is a bit trickier. For instance, if you are using ephemeral agents, you don’t have a persistent storage on agents to retain cache data. As a result, incremental build and build cache can’t function as straightforwardly as they do in local environments. In addition, the JVM’s ability to optimize itself over successive runs is completely lost.</p>
<p><em>Note: Actually, while there is no way to preserve incremental builds, you can still apply the build cache in your CI setup. I will talk about this in more detail in the next section.</em></p>
<p>On top of that, CI agents often come with hardware limitations. If you aren't given a well-provisioned setup, it will be challenging to get builds running effectively on these constrained agents.</p>
<p>With that, knowing the characteristics and limitations of your build environment is important— because even the best tools can fall short if the environment isn’t set up to support them.</p>
<h2 id="heading-do-proper-benchmarks-and-monitoring">Do proper benchmarks and monitoring</h2>
<p>If there is no one-size-fits-all solution, and the effectiveness of build tools depends heavily on the project and environment, how can we ensure that a solution actually works well for our use cases?</p>
<p>Well, benchmark it!</p>
<p>Only benchmarking can validate the efficiency of a solution. No theory can guarantee that a specific Garbage Collector will fit your system well— benchmarking is the only reliable way to find out.</p>
<p>Besides, setting up a build-metrics monitoring system is just as important. Like benchmarking, it helps validate a solution's outcomes. However, benchmarking is a tricky business— it requires deep expertise, as it is not always easy to accurately simulate a real-world system. If any part of the setup doesn't reflect your actual production scenario, the benchmark results may be misleading. A monitoring system, on the other hand, captures data from real usage, making it far more trustworthy and fully reflective of how your builds truly perform. The only tradeoff is that a monitoring system takes time to gather data, while benchmarking gives quicker insights.</p>
<p>Beyond validation, a monitoring system also offers other benefits: it provides a holistic view of build health and helps detect bottlenecks or spot areas for potential improvement.</p>
<p>Ultimately, benchmarking and monitoring provide the data-driven foundation needed to validate any build solution—without them, applying new tools should be done with caution.</p>
<h1 id="heading-real-world-scenarios">Real world scenarios</h1>
<p>Enough with the theory; let's dive into a series of practical solutions my team and I applied while optimizing a large-scale Android project with over 500 modules.</p>
<p>Keep in mind that these solutions build on top of the previously mentioned standard tunings, which were already enabled.</p>
<h2 id="heading-remote-build-cache">Remote build cache 😎</h2>
<p>Our main project is backed by a huge volume of unit tests— around 75k of them. We run the unit test CI pipeline as a mandatory check for every commit in Pull Requests (PRs). The full test suite takes almost an hour and a half to complete🔥🔥. This is certainly an unacceptable amount of time for a single PR check.</p>
<p>As explained earlier, having incremental builds and the build cache enabled doesn't shave a single second off CI build time. Therefore, we had to deploy a <strong>remote build cache</strong> system. The core idea remains the same as the local build cache; the only difference is that the cache is stored remotely instead of on local machines. We use S3 as the remote storage to hold the cache and share it across builds. This dramatically reduced our unit test execution time from a <strong>consistent one hour and a half to between several minutes and 40 minutes</strong>🚀🚀— depending on how many tests are skipped thanks to the cache.</p>
<p><em>While unit test was the original reason we set up this remote build cache, all other builds have benefited as well, leading to a significant overall reduction in build times.</em></p>
<p>Note: We don’t share this remote cache with local builds for a couple of reasons:</p>
<ul>
<li><p>Local builds are already well supported by a rich set of local cache.</p>
</li>
<li><p>Local environments tend to have more noisy or inconsistent builds, and pushing cache from these could degrade the overall quality of the shared remote cache.</p>
</li>
<li><p>While offering little benefit to local builds, interacting with the remote cache does cost additional money.</p>
</li>
</ul>
<p>You can find details on how to enable remote build cache <a target="_blank" href="https://docs.gradle.org/current/userguide/build_cache.html#sec:build_cache_configure_remote">here</a>, or consider using <a target="_blank" href="https://gradle.com/develocity/">Develocity</a> solution to help setup and manage it more easily.</p>
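<p>For reference, a minimal remote-node declaration looks roughly like the sketch below. The URL, environment-variable names, and the CI-only push policy are placeholders, not our actual setup; an S3-backed cache is typically exposed through such an HTTP endpoint or via a dedicated S3 cache plugin.</p>

```kotlin
// settings.gradle.kts — hedged sketch of a remote build cache configuration.
buildCache {
    local {
        isEnabled = true                        // keep the local cache too
    }
    remote<HttpBuildCache> {
        url = uri("https://build-cache.example.com/cache/")  // placeholder
        isPush = System.getenv("CI") == "true"  // only CI populates the cache
        credentials {
            username = System.getenv("BUILD_CACHE_USER")
            password = System.getenv("BUILD_CACHE_PASSWORD")
        }
    }
}
```

<p>Restricting <code>isPush</code> to CI is one way to enforce the point above: noisy local builds read from the cache but never pollute it.</p>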
<h2 id="heading-addressing-build-cache-misses">Addressing build cache misses 😎</h2>
<p>After enabling the remote build cache, we still observed an odd behavior: all unit tests in one particular module—the application module—were re-executed even when no changes were introduced in the new builds.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745532030393/480f1ee1-6e6a-4ca9-b542-219fec3a1cdc.png" alt class="image--center mx-auto" /></p>
<p>This is when we learned about <em>build cache misses</em> and how they were affecting our builds. We manually diagnosed the build cache using the <code>-Dorg.gradle.caching.debug=true</code> flag, as suggested <a target="_blank" href="https://docs.gradle.org/current/userguide/build_cache_debugging.html#helpful_data_for_diagnosing_a_cache_miss">here</a>. While this is a time-consuming and challenging approach (I will show a much simpler alternative a bit later), it allowed us to identify two libraries as the root cause: NewRelic and Crashlytics. If you are interested in the details, I opened tickets in both repositories: <a target="_blank" href="https://github.com/newrelic/newrelic-android-agent/issues/168">NewRelic</a> and <a target="_blank" href="https://github.com/firebase/firebase-android-sdk/issues/6000">Crashlytics</a>. Unfortunately, neither library has provided a proper fix so far. We implemented our own workaround for the NewRelic case, but the Crashlytics issue remains unresolved. Luckily, the Crashlytics problem only affects release builds, which represent a smaller portion of the builds in our workflow.</p>
<h2 id="heading-develocity">Develocity 😎</h2>
<p>Up until that point, we had been receiving frequent complaints about local builds hanging for dozens of minutes, even half an hour. Unfortunately, we had no visibility into what happened on other people's machines. Given that we were actively working on improving our build system— migrating tools, troubleshooting build issues, optimizing build pipelines, and so on— it became clear that we were missing a proper monitoring system to support this work. That is when we decided to run a POC with <a target="_blank" href="https://gradle.com/develocity/">Develocity</a>, a tool that could address all of these needs.</p>
<p>Develocity, formerly known as Gradle Enterprise, is a tool that helps monitor all your builds, whether on CI or local machines.</p>
<p>This tool first opened up <a target="_blank" href="https://github.com/gradle/develocity-build-validation-scripts">a far simpler approach</a> to debugging build cache misses. By utilizing it, we discovered numerous other cache misses across different scenarios and resolved them, which ultimately generated an approximate <strong>30%</strong>🚀🚀 reduction in build times across all pipelines.</p>
<p>In general, we use Develocity to:</p>
<ul>
<li><p><strong>Monitor builds</strong> and get a holistic view of the system's health: both locally and in CI.</p>
</li>
<li><p><strong>Improve unit tests speed as well as quality</strong> with <em>Predictive Test Selection</em> and <em>Flaky Test</em> features.</p>
</li>
<li><p><strong>Investigate issues</strong> more effectively. For example, by looking at some of the longest local builds, we found that the hanging-build issue often comes from dependency fetching— usually due to people not being connected to the company VPN, since some dependencies are hosted in our internal artifact repositories.</p>
</li>
<li><p><strong>Identify areas for improvement</strong>. I have a great example of this that I will get to very shortly.</p>
</li>
<li><p><strong>Validate our solutions</strong>. With comprehensive build metrics, it's easy to back up changes with clear data—whether it's through simple charts or key numbers.</p>
</li>
</ul>
<h2 id="heading-build-regression-pipeline">Build regression pipeline 😎</h2>
<p>The example above shows how impactful a cache miss can be. That's why it's important not only to resolve existing cache misses, but also to prevent new ones from being introduced into the project. Investigating cache misses is time-consuming, and we definitely don't want to do it regularly. Hence, recurring jobs that run the <a target="_blank" href="https://github.com/gradle/develocity-build-validation-scripts">build validation scripts</a> help automatically detect cache misses early, ensuring the quality and reliability of our build cache over time.</p>
<p><a target="_blank" href="https://gradle.com/blog/monitoring-build-performance-at-scale/">Here is an example how Netflix did it</a>.</p>
<h2 id="heading-parallel-unit-tests">Parallel unit tests 😎</h2>
<p>In the previous section, I mentioned that <a target="_blank" href="https://docs.gradle.org/current/userguide/performance.html#a_run_tests_in_parallel"><em>parallel</em> <strong><em>test</em></strong> <em>execution</em></a> isn’t beneficial in a multi-module project. This is because when running unit tests across the entire project, each module runs its tests in a separate task. And given that the <a target="_blank" href="https://docs.gradle.org/current/userguide/performance.html#sec:parallel_execution"><em>parallel</em> <strong><em>task</em></strong> <em>execution</em></a> is already enabled, those test tasks will naturally run in parallel anyway.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745535790108/d1d7ec43-306d-4c97-a793-321d32c4807a.png" alt class="image--center mx-auto" /></p>
<p>However, there are still cases where parallel test execution is actually useful. The chart below—taken from the Develocity dashboard— roughly shows what our unit test pipeline looked like at the time:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745535596411/a929b059-0b97-480b-b832-70c748c75a87.png" alt class="image--center mx-auto" /></p>
<p>It is not hard to spot the problem in this chart. The pipeline starts off strong, utilizing all 6 cores in parallel. But as the build progresses, most modules finish quickly—except for two large ones: <code>:payment</code> and <code>:search</code>. These two tasks continue running on just two cores while the remaining cores sit idle.</p>
<p>What a waste of resources!</p>
<p>The solution was to break the test runs of these two heavy modules into smaller chunks— that is, we enabled <em>parallel test execution</em> for <code>:payment</code> and <code>:search</code> <em>only</em>. This allows their tests to run across more cores— in other words, in parallel. This way, we could utilize all the available resources to significantly speed up the build.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745657592136/6fb10d1b-00d3-4019-882f-2cfa712ab1b5.png" alt="Parallel test execution for :payment and :search" class="image--center mx-auto" /></p>
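<p>For reference, enabling parallel test execution per module is essentially a one-liner in each heavy module’s build script. Here is a minimal sketch in Gradle Kotlin DSL; the file path and the fork count are illustrative, so tune them for your machines:</p>
<pre><code class="lang-kotlin">// payment/build.gradle.kts (same idea for :search) — illustrative only
tasks.withType&lt;Test&gt;().configureEach {
    // Split this module's tests across multiple JVM forks.
    // Half the available cores is a common starting point.
    maxParallelForks = (Runtime.getRuntime().availableProcessors() / 2).coerceAtLeast(1)
}
</code></pre>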
<p>Here’s what the build looked like after applying the fix:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745536174306/d13d920b-747d-4d71-b1dd-101b348b6103.png" alt class="image--center mx-auto" /></p>
<p>This piece of work improved the unit test builds involving these two modules by <strong>40%</strong>🚀🚀, excluding cases where the results were already cached. This is a great example of how understanding your project— and having proper monitoring in place— can lead to meaningful improvements in your builds.</p>
<h2 id="heading-garbage-collector-benchmarking">Garbage Collector benchmarking 🤔</h2>
<p><a target="_blank" href="https://developer.android.com/build/optimize-your-build#experiment-with-the-jvm-parallel-garbage-collector">This suggestion</a> from Google also caught our attention— they recommended trying the Parallel GC. We decided to run a benchmark comparing G1 GC and Parallel GC.</p>
<p>We used <a target="_blank" href="https://github.com/gradle/gradle-profiler">gradle-profiler</a> and set up a CI pipeline to run the benchmark overnight (for a benchmark, we run fresh builds without caches, so it often takes hours to finish).</p>
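<p>For illustration, a scenario file for such a benchmark might look like the sketch below. The file name, task, and heap values are assumptions; see the gradle-profiler README for the exact scenario syntax:</p>
<pre><code class="lang-hocon">// gc-benchmark.scenarios: one scenario per GC, identical tasks
g1-gc {
    tasks = ["assembleDebug"]
    jvm-args = ["-XX:+UseG1GC", "-Xmx8g"]
}
parallel-gc {
    tasks = ["assembleDebug"]
    jvm-args = ["-XX:+UseParallelGC", "-Xmx8g"]
}
</code></pre>
<p>The benchmark can then be run with something like <code>gradle-profiler --benchmark --scenario-file gc-benchmark.scenarios</code>.</p>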
<p>Here’s an example showing how the Parallel GC performed on one of our pipelines compared to G1 GC:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745538402146/d0df071b-5a76-42f2-9bc6-95e265330efb.png" alt class="image--center mx-auto" /></p>
<p>Based on the results, <strong>we didn’t find enough evidence to justify switching to Parallel GC</strong> 😞😞. So we’re sticking with G1 GC for now.</p>
<h1 id="heading-takeaways">Takeaways</h1>
<p>There’s no shortcut to optimizing build performance. While Gradle’s default options can help improve 70% of your builds, the remaining 30% will drag them down if you don’t know how to navigate it. The key to overcoming build issues is a deep understanding of the ecosystem:</p>
<ul>
<li><p>Learn to effectively use major tools in the Android build system: <em>Gradle, AGP, Languages (Java/Kotlin), JVM</em></p>
</li>
<li><p>Understand your project from a holistic perspective. This includes setting up proper metrics and monitoring to consistently track the system’s health and spot regression early.</p>
</li>
<li><p>Always experiment with and benchmark the impact of a feature on your project before applying it.</p>
</li>
</ul>
<p>Consistent, informed iteration is what leads to meaningful and sustainable performance gains.</p>
<h1 id="heading-resources">Resources</h1>
<p>For those who want to learn more about Gradle and AGP, I strongly recommend this book:</p>
<ul>
<li><a target="_blank" href="https://www.amazon.com/Extending-Android-Builds-Pragmatic-Gradle-ebook/dp/B0CXMZBZL6">Extending Android Builds: Pragmatic Gradle and AGP Skills with Kotlin</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Optimizing Large Android Project Builds : When Recommended Tunings Aren't Enough - Part 1]]></title><description><![CDATA[Link to Part 2
Overview
In large Android projects—or any Gradle-based project—build speed is often a pain point. This is understandable because building a JVM-based application involves multiple steps, including compiling code and resources, minifyin...]]></description><link>https://blogs.lampham.dev/optimizing-large-android-project-builds-when-recommended-tunings-arent-enough-part-1</link><guid isPermaLink="true">https://blogs.lampham.dev/optimizing-large-android-project-builds-when-recommended-tunings-arent-enough-part-1</guid><category><![CDATA[Android]]></category><category><![CDATA[gradle]]></category><category><![CDATA[Build tool]]></category><category><![CDATA[android app development]]></category><category><![CDATA[android apps]]></category><category><![CDATA[jvm]]></category><category><![CDATA[gradle build tool]]></category><dc:creator><![CDATA[Lam PHAM]]></dc:creator><pubDate>Tue, 22 Apr 2025 22:36:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745361550378/42d91e21-9814-4fd3-b2de-3563aab7fa9d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://blogs.lampham.dev/optimizing-large-android-project-builds-when-recommended-tunings-arent-enough-part-2">Link to Part 2</a></p>
<h1 id="heading-overview">Overview</h1>
<p>In large Android projects—or any Gradle-based project—build speed is often a pain point. This is understandable because building a JVM-based application involves multiple steps, including compiling code and resources, minifying and optimizing the output, and assembling everything into the final artifact. These steps are inherently time-consuming, especially in large codebases with hundreds or even thousands of modules.</p>
<p>Gradle is the go-to build tool for these types of projects. Over the years, it has developed numerous ingenious solutions to optimize build performance. Yet, even with all these features enabled, many app engineers still find themselves frustrated by sluggish build times. This is often where Gradle takes the blame— criticized for being overly complex while failing to meet developers’ expectations.</p>
<p>In Part 1 of this 2-part series, we’ll explore why, despite benefiting from all of Gradle’s optimizations, your builds might still feel inefficient. From there, in Part 2, we will dive into what we can actually do to improve them even further.</p>
<h1 id="heading-why-recommended-tunings-dont-often-help">Why recommended tunings often don’t help</h1>
<p>When searching for ways to optimize a Gradle-based build— whether for an Android project or any other— you will often come across these typical suggestions, commonly found in Gradle’s official documentation or shared through articles, blog posts, and conference talks:</p>
<ul>
<li><p>Update your tools (Gradle, AGP, JDK, Java/Kotlin, and so on) to the latest versions.</p>
</li>
<li><p>Enable build cache: <code>org.gradle.caching=true</code></p>
</li>
<li><p>Use incremental build.</p>
</li>
<li><p>Use compilation avoidance.</p>
</li>
<li><p>Enable parallel execution: <code>org.gradle.parallel=true</code></p>
</li>
<li><p>Enable parallel test execution.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745656088146/9e1e90a9-38d0-4bf3-8bf4-96285315aaa9.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Enable Gradle daemon: <code>org.gradle.daemon=true</code></p>
</li>
<li><p>Increase heap size: <code>org.gradle.jvmargs=-Xmx...M</code></p>
</li>
<li><p>Enable a specific garbage collector, e.g. Parallel GC: <code>org.gradle.jvmargs=-XX:+UseParallelGC</code></p>
</li>
<li><p>Fine-tune JVM options</p>
</li>
<li><p>and so on</p>
</li>
</ul>
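<p>Taken together, these suggestions usually end up as a few lines in <code>gradle.properties</code>. Here is an illustrative sketch; the values are examples, not recommendations:</p>
<pre><code class="lang-properties"># gradle.properties — illustrative only; in recent Gradle versions
# several of these flags are already enabled by default.
org.gradle.caching=true
org.gradle.parallel=true
org.gradle.daemon=true
org.gradle.jvmargs=-Xmx4g -XX:+UseParallelGC
</code></pre>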
<p>These solutions look appealing— one single line and your builds are supposed to be supercharged. What more could we ask for? Here comes the very common mistake: many developers do no more than append these lines to their <code>gradle.properties</code> file and expect a dramatic speedup. The hard truth is that while these options are genuinely effective, the build performance often doesn’t change much.</p>
<h2 id="heading-just-because-you-enabled-it-doesnt-mean-you-made-a-difference">Just because you enabled it doesn’t mean you made a difference</h2>
<p><strong>In new versions of Gradle,</strong> <strong>most of the above options are enabled by default</strong>. In other words, explicitly adding them— without any deeper tuning— usually <strong>doesn’t</strong> make any difference. That’s one of the main reasons why your build time often remains unchanged.</p>
<p>And when you think about it, it makes perfect sense. If simply pasting a few lines could boost your builds, wouldn’t Gradle just enable them out of the box?</p>
<p>The only benefit of adding these lines is to give you the confidence that you haven’t missed anything. Ever found yourself hitting <code>CTRL + C</code> dozens of times before finally pressing <code>CTRL + V</code>? Yeah, it’s kind of like that.</p>
<h2 id="heading-one-size-doesnt-fit-all">One size doesn’t fit all</h2>
<p>The majority of Gradle’s solutions have an impact across many types of projects. However, some are tailored to specific systems, based on factors like project size, system requirements (e.g., memory efficiency, low latency, high throughput), team size, or other considerations.</p>
<p>Let’s take a look at the official Gradle documentation: <a target="_blank" href="https://docs.gradle.org/current/userguide/performance.html#run_the_compiler_as_a_separate_process">it recommends running Java compilation in a separate process</a>. However, a crucial piece of information is tucked away at the very end of the section:</p>
<blockquote>
<p>Forking compilation rarely impacts the performance of small projects. But you should consider it if a single task compiles more than a thousand source files together.</p>
</blockquote>
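<p>For completeness, here is what that forking suggestion looks like in practice— a minimal Gradle Kotlin DSL sketch, worth applying only when a single compile task handles enough sources, per the quote above:</p>
<pre><code class="lang-kotlin">// build.gradle.kts: run Java compilation in a separate process
tasks.withType&lt;JavaCompile&gt;().configureEach {
    options.isFork = true
}
</code></pre>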
<p>Or take the Parallel GC: Google suggests <a target="_blank" href="https://developer.android.com/build/optimize-your-build#experiment-with-the-jvm-parallel-garbage-collector"><strong><em>experimenting and testing</em></strong></a> with the JVM parallel garbage collector. The emphasis on “experiment” is important— GC performance can vary greatly depending on the specific project. That’s why Google recommends testing it first. However, this nuance is often overlooked, and people tend to misinterpret the advice as simply “use the JVM parallel garbage collector.”</p>
<p>When desperately looking for a quick solution, who really has the patience to read all the way through— especially when a shiny code snippet, seemingly the holy grail for all your problems, has already grabbed your attention?</p>
<h2 id="heading-tuning-isnt-about-maxing-out-settings">Tuning isn’t about maxing out settings</h2>
<p>I’ve seen many blogs advising readers to allocate more heap space— which is a perfectly good suggestion, except they often specify a <em>magic value</em> (usually 4GB) without any further explanation.</p>
<pre><code class="lang-bash">org.gradle.jvmargs=-Xmx4g
</code></pre>
<p>Gradle itself uses 2GB in its recommended <a target="_blank" href="https://docs.gradle.org/current/userguide/performance.html#increase_the_heap_size">code snippet</a>.</p>
<p>In a regular JVM, a bigger heap can indeed help reduce garbage collection pause times, increase application throughput, and lower latency. However, for a given system, increasing the heap size eventually stops yielding benefits. It’s also important to keep in mind that device memory sets an upper limit.</p>
<p>Therefore, an efficient max heap size depends on many factors: the project size; the libraries used (e.g., <a target="_blank" href="https://github.com/robolectric/robolectric">Robolectric</a> is famous for being memory-hungry); the device memory; and the environment. In a local build, you don’t want to sacrifice all of your device’s memory to the build, because you still want your browser to play music. In a CI build on a Linux agent, you can allocate all available memory, but you will have to watch out for the <a target="_blank" href="https://neo4j.com/developer/kb/linux-out-of-memory-killer/">OOM Killer</a>, which can terminate your processes at any time.</p>
<p>That said, allocating too little memory can slow down the build process, while giving too much doesn’t always make an impact— and in certain conditions, it can even backfire.</p>
<p>The most important thing is that these articles, blogs, and talks shouldn’t be the primary source for defining performance settings in your own system.</p>
<p><em>Notes: You may assume that hitting the upper limit (the device memory capacity) is rare. However, in practice, the max heap size is often well below the total memory the OS allocates to your build at any given time. For instance, on a machine with 32GB of RAM, you may think it’s safe to set the max heap size to 16GB. However, that heap only represents the heap of the main process— typically the Gradle daemon. Your builds may also spin up other processes, such as Kotlin daemons or Gradle workers, which collectively can push peak memory usage much higher, close to the 32GB upper bound.</em></p>
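<p>To make that note concrete, here is an illustrative split between the main processes on such a machine. The numbers are examples, not recommendations:</p>
<pre><code class="lang-properties"># gradle.properties: the daemon heap is only part of the footprint
org.gradle.jvmargs=-Xmx8g
# The Kotlin daemon is a separate process with its own heap
kotlin.daemon.jvmargs=-Xmx4g
</code></pre>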
<h2 id="heading-not-all-build-flags-are-plug-and-playsome-need-your-code-to-meet-specific-criteria">Not all build flags are plug-and-play—some need your code to meet specific criteria.</h2>
<p>The Gradle build cache, one of Gradle’s greatest features, works based on a task’s inputs and outputs. If the inputs stay unchanged, the task is skipped and the cached outputs are reused. However, if you create a task without any declared inputs or outputs, or if the inputs differ between builds even though nothing has changed, the build cache won’t work.</p>
<p>You may say you never create any custom Gradle tasks in your project. Well, your libraries certainly do!</p>
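<p>To illustrate what a cache-friendly custom task looks like, here is a minimal sketch in Gradle Kotlin DSL. The task name and properties are hypothetical; the point is the declared <code>@Input</code> and <code>@OutputFile</code>, which give Gradle something to hash when computing the cache key:</p>
<pre><code class="lang-kotlin">// Hypothetical cacheable task: writes a version string to a file
@CacheableTask
abstract class StampVersion : DefaultTask() {
    @get:Input
    abstract val version: Property&lt;String&gt;

    @get:OutputFile
    abstract val outputFile: RegularFileProperty

    @TaskAction
    fun run() = outputFile.get().asFile.writeText(version.get())
}

tasks.register&lt;StampVersion&gt;("stampVersion") {
    version.set("1.2.3")
    outputFile.set(layout.buildDirectory.file("version.txt"))
}
</code></pre>
<p>Without those declarations, Gradle has nothing to key the cache entry on, and the task can never be a cache hit.</p>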
<p>The same principle applies to some other features, such as compilation avoidance (this depends mostly on the language— Groovy/Java/Kotlin— not Gradle), where you must write your production code in a way that lets the compiler skip recompiling as much code as possible.</p>
<p>Therefore, enabling these options without crafting your production and build code accordingly won’t yield fruitful results.</p>
<h2 id="heading-wrong-expectation">Wrong expectation</h2>
<p>Sometimes, you just have high expectations for what these options could deliver.</p>
<p>Configuration Cache is another great piece of engineering by Gradle. As of the time of writing, it’s still in active development and not yet enabled by default. Its fundamentals are similar to the Build Cache’s, but it targets the configuration phase. If you’re hoping it will drastically cut down your build time, like the Build Cache does, you’re likely to be disappointed.</p>
<p>In general, the configuration phase takes up only a small portion of the Gradle build lifecycle. Its duration varies with the complexity of the project and the tasks you invoke; however, it typically ranges from hundreds of milliseconds for small projects to seconds, or even a few minutes, for large ones. Hence, compared to the execution phase— which can take dozens of minutes or even hours— the improvement from configuration caching is often hard to notice.</p>
<p>This is not to say the Configuration Cache isn’t worthwhile. It is definitely valuable, especially in large project builds. However, setting the wrong expectations leads to disappointment and misunderstandings about the feature.</p>
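<p>Trying it out is cheap, though: it can be enabled with a single property. This is illustrative; consult the Gradle documentation for your version, since plugin compatibility caveats apply:</p>
<pre><code class="lang-properties"># gradle.properties: opt in to the configuration cache
org.gradle.configuration-cache=true
</code></pre>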
<h1 id="heading-part-1-wrapping-up">Part 1 wrapping up</h1>
<p>All the previous points are not meant to deny the great benefits of the recommended tunings. Simply enabling them (which, by the way, are often on by default) could help optimize 70% of your builds. However, the remaining 30% tends to be the most complex and requires a deeper, more targeted approach to truly improve. In small projects, this part is often negligible and has little impact on developers’ productivity. The real problem arises as the projects grow larger and larger.</p>
<p>In part 2, we will dive into some key strategies to tackle that remaining 30%, backed by some real-world examples.</p>
<p><em>Disclaimer: This 70-30 split is not based on any hard analysis or data. It's just a metaphor, inspired by</em> <a target="_blank" href="https://addyo.substack.com/p/the-70-problem-hard-truths-about"><em>this blog about AI-assisted coding tools</em></a> <em>to convey the idea.</em></p>
]]></content:encoded></item></channel></rss>