
I Tried Using DL To Optimize Python’s Garbage Collection (2/2)

Overview

In the previous blog post, we explored how deep learning models can be used to optimize Python's GC calls. In this post, we'll look at how the model performs with:

  • More training data
  • More complex parameters
  • More epochs
  • Different model architectures
Tip

A thorough analysis with run-over-run metrics and Welch's t-tests for these results can be found here.

System Information

| Property | Value |
| --- | --- |
| Operating System | macOS 14.6 |
| Architecture | arm64 |
| CPU | arm |
| CPU Cores | 8 (logical: 8) |
| Memory | 24.0 GB |
| Disk | 460.4 GB |
| Python Version | 3.14.0 |

Different Cases

Every model has the same training and evaluation load, i.e. `locust -f locustfile.py --headless -u 100 -r 10 --run-time 1m`, unless mentioned otherwise.
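For context, the core idea from part 1 can be sketched as a small control loop: disable CPython's threshold-based collector and let a predictor decide when `gc.collect()` is worth it. The function name, feature set, and threshold policy below are illustrative, not NeuroGC's actual API:

```python
import gc
import time

def neuro_gc_loop(predict, interval=0.05, duration=1.0):
    """Poll GC stats and let a model decide when to collect.

    `predict` is any callable mapping a feature dict to True/False;
    the trained LSTM/transformer/forest would be plugged in here.
    """
    gc.disable()                      # take over from the threshold-based collector
    collected = 0
    deadline = time.monotonic() + duration
    try:
        while time.monotonic() < deadline:
            gen0, gen1, gen2 = gc.get_count()    # tracked objects per generation
            features = {"gen0": gen0, "gen1": gen1, "gen2": gen2}
            if predict(features):
                collected += gc.collect()        # returns unreachable objects found
            time.sleep(interval)
    finally:
        gc.enable()
    return collected

# Stand-in policy: collect whenever gen0 grows past a fixed threshold.
freed = neuro_gc_loop(lambda f: f["gen0"] > 500, interval=0.01, duration=0.1)
```

In the real setup the predictor would consume the same process-level metrics the monitor records, but the loop structure stays the same.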

Case 1 : Complex model

The model is a standard LSTM, but with more layers and a longer sequence length:

```json
"lstm": {
  "input_size": 10,
  "hidden_size": 64,
  "num_layers": 10,               // increased this from 2
  "sequence_length": 100,         // increased this from 10
  "epochs": 100,
  "learning_rate": 0.001,
  "batch_size": 32
}
```
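The post doesn't show the model code, so here is a minimal PyTorch sketch of what this config could describe. The class name, the sigmoid "collect/skip" head, and the use of the last timestep are my assumptions:

```python
import torch
import torch.nn as nn

class GCPredictorLSTM(nn.Module):
    """Maps a window of runtime metrics to a collect/skip score."""

    def __init__(self, input_size=10, hidden_size=64, num_layers=10):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, sequence_length, input_size)
        out, _ = self.lstm(x)
        return torch.sigmoid(self.head(out[:, -1]))   # score from last timestep

model = GCPredictorLSTM()
x = torch.randn(4, 100, 10)               # sequence_length=100 per the config
scores = model(x)                         # shape (4, 1), values in [0, 1]
```

Note that a 10-layer LSTM over 100-step windows is a lot of recurrent compute per prediction, which may explain the CPU regression in the table below.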

Realtime Monitor (pretty 😤) and Raw Results

With and without LSTM NeuroGC (click to enlarge)
| Metric | Without NeuroGC | With NeuroGC | Improvement |
| --- | --- | --- | --- |
| Avg CPU (%) | 36.8 | 37.6 | 🔴 -2.1% |
| Avg Memory (%) | 46.1 | 46.1 | 🟢 +0.1% |
| Avg Disk Read | 8399.86 | 8133.23 | 🟢 +3.2% |
| Avg Disk Write | 5317931.93 | 4209032.94 | 🟢 +20.9% |
| Avg Net Sent | 67313.15 | 63477.44 | 🟢 +5.7% |
| Avg Net Recv | 74458.70 | 86484.33 | 🔴 -16.2% |
| P95 Latency (ms) | 3724.2 | 3751.8 | 🔴 -0.7% |
| P99 Latency (ms) | 4566.0 | 5080.4 | 🔴 -11.3% |
| Avg RPS | 29.4 | 28.2 | 🔴 -4.3% |
| GC Events | 18 | 15 | 🔴 -16.7% |

Case 2: Different model architecture

Here, we change the underlying ML model itself. Again, everything else follows the baseline config.

Case 2a. Using normal feed-forward networks

We use the following config as the baseline:

```json
"feedforward": {
  "hidden_sizes": [64, 32, 16, 8],
  "lookback": 20,
  "epochs": 100,
  "learning_rate": 0.001,
  "batch_size": 32
},
```
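A feed-forward network has no notion of sequence, so the `lookback` window presumably gets flattened into one input vector. A sketch of a builder for this config (the function and the single-logit head are my assumptions):

```python
import torch
import torch.nn as nn

def build_mlp(n_features, lookback=20, hidden_sizes=(64, 32, 16, 8)):
    """Flatten the lookback window, then stack Linear+ReLU layers per the config."""
    layers, width = [nn.Flatten()], n_features * lookback
    for h in hidden_sizes:
        layers += [nn.Linear(width, h), nn.ReLU()]
        width = h
    layers.append(nn.Linear(width, 1))    # single collect/skip logit
    return nn.Sequential(*layers)

mlp = build_mlp(n_features=10)
logits = mlp(torch.randn(4, 20, 10))      # (batch, lookback, features) -> (4, 1)
```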

Realtime Monitor and Raw Results

With and without normal feed-forward NeuroGC (click to enlarge)
| Metric | Without NeuroGC | With NeuroGC | Improvement |
| --- | --- | --- | --- |
| Avg CPU (%) | 39.8 | 43.0 | 🔴 -8.0% |
| Avg Memory (%) | 55.3 | 55.3 | 0.0% |
| Avg Disk Read | 7675.69 | 11055.08 | 🔴 -44.0% |
| Avg Disk Write | 5940870.38 | 8232453.21 | 🔴 -38.6% |
| Avg Net Sent | 93112.14 | 74657.59 | 🟢 +19.8% |
| Avg Net Recv | 95884.71 | 90835.58 | 🟢 +5.3% |
| P95 Latency (ms) | 3855.2 | 4139.6 | 🔴 -7.4% |
| P99 Latency (ms) | 5410.8 | 5231.7 | 🟢 +3.3% |
| Avg RPS | 29.8 | 29.3 | 🔴 -1.8% |
| GC Events | 14 | 18 | 🟢 +28.6% |

Case 2b. Using transformers

We use the following config as the baseline:

```json
"transformer": {
  "d_model": 64,
  "nhead": 4,
  "num_layers": 2,
  "sequence_length": 10,
  "epochs": 100,
  "learning_rate": 0.001,
  "batch_size": 32
}
```
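Again, a hedged PyTorch sketch of what this config could map to: an input projection up to `d_model`, a standard encoder stack, and mean-pooling over the window. The projection and pooling choices are my assumptions, not confirmed by the post:

```python
import torch
import torch.nn as nn

class GCPredictorTransformer(nn.Module):
    def __init__(self, n_features=10, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)      # lift metrics to d_model
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):                 # x: (batch, sequence_length, n_features)
        h = self.encoder(self.proj(x))
        return torch.sigmoid(self.head(h.mean(dim=1)))  # mean-pool over time

model = GCPredictorTransformer()
scores = model(torch.randn(4, 10, 10))    # sequence_length=10 -> shape (4, 1)
```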

Realtime Monitor and Raw Results

With and without transformers NeuroGC (click to enlarge)
| Metric | Without NeuroGC | With NeuroGC | Improvement |
| --- | --- | --- | --- |
| Avg CPU (%) | 37.5 | 33.7 | 🟢 +10.2% |
| Avg Memory (%) | 55.1 | 55.1 | 0.0% |
| Avg Disk Read | 2433.94 | 1638.77 | 🟢 +32.7% |
| Avg Disk Write | 6325593.39 | 5631381.21 | 🟢 +11.0% |
| Avg Net Sent | 116008.92 | 68573.32 | 🟢 +40.9% |
| Avg Net Recv | 98824.23 | 92021.42 | 🟢 +6.9% |
| P95 Latency (ms) | 3780.6 | 3718.8 | 🟢 +1.6% |
| P99 Latency (ms) | 4827.9 | 4913.1 | 🔴 -1.8% |
| Avg RPS | 29.8 | 30.6 | 🟢 +2.9% |
| GC Events | 18 | 17 | 🔴 -5.6% |

Case 2c. Using classical ML algorithms like Random Forest

We use the following config as the baseline:

```json
"classical": {
  "algorithm": "random_forest",
  "n_estimators": 100,
  "max_depth": null,
  "lookback": 20
}
```
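For classical models the usual trick is to turn the metric stream into fixed-size lookback windows. A sketch with scikit-learn, under the assumption that the forest regresses the next metric value (the windowing helper and target are mine):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_windows(series, lookback=20):
    """Turn a 1-D metric series into (window, next value) training pairs."""
    X = np.stack([series[i:i + lookback] for i in range(len(series) - lookback)])
    return X, series[lookback:]

rng = np.random.default_rng(0)
series = rng.random(200)                  # stand-in for recorded runtime metrics
X, y = make_windows(series)
model = RandomForestRegressor(n_estimators=100, max_depth=None).fit(X, y)
pred = model.predict(X[-1:])              # forecast the next metric value
```

Unlike the neural variants, inference here is cheap and needs no GPU, which matters when the predictor runs inside the process it is trying to speed up.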

Realtime Monitor and Raw Results

With and without random forest NeuroGC (click to enlarge)
| Metric | Without NeuroGC | With NeuroGC | Improvement |
| --- | --- | --- | --- |
| Avg CPU (%) | 39.0 | 39.3 | 🔴 -0.8% |
| Avg Memory (%) | 54.4 | 54.5 | 0.0% |
| Avg Disk Read | 2786.16 | 591.43 | 🟢 +78.8% |
| Avg Disk Write | 6037546.50 | 5408297.63 | 🟢 +10.4% |
| Avg Net Sent | 90606.60 | 67460.24 | 🟢 +25.5% |
| Avg Net Recv | 97007.71 | 74088.75 | 🟢 +23.6% |
| P95 Latency (ms) | 3488.8 | 3450.9 | 🟢 +1.1% |
| P99 Latency (ms) | 4587.6 | 4786.3 | 🔴 -4.3% |
| Avg RPS | 30.4 | 29.9 | 🔴 -1.6% |
| GC Events | 16 | 19 | 🟢 +18.8% |

Case 3: With more training data

The training load now runs for 5 minutes; the evaluation load remains the same:

```shell
# Training load (increased to 5 mins)
locust -f locustfile.py --headless -u 100 -r 10 --run-time 5m

# Evaluation load (same as before)
locust -f locustfile.py --headless -u 100 -r 10 --run-time 1m
```

Case 3a. Using normal feed-forward networks

We use the same baseline as before:

```json
"feedforward": {
  "hidden_sizes": [64, 32, 16, 8],
  "lookback": 20,
  "epochs": 100,
  "learning_rate": 0.001,
  "batch_size": 32
},
```

Realtime Monitor and Raw Results

With and without normal feed-forward NeuroGC with more training data (click to enlarge)
| Metric | Without NeuroGC | With NeuroGC | Improvement |
| --- | --- | --- | --- |
| Avg CPU (%) | 36.2 | 45.6 | 🔴 -26.2% |
| Avg Memory (%) | 55.0 | 55.0 | 0.0% |
| Avg Disk Read | 1186.42 | 3065.07 | 🔴 -158.3% |
| Avg Disk Write | 6996910.50 | 5513574.88 | 🟢 +21.2% |
| Avg Net Sent | 101125.85 | 156501.60 | 🔴 -54.8% |
| Avg Net Recv | 120104.51 | 128264.26 | 🔴 -6.8% |
| P95 Latency (ms) | 4038.0 | 3979.7 | 🟢 +1.4% |
| P99 Latency (ms) | 5765.1 | 5523.3 | 🟢 +4.2% |
| Avg RPS | 36.9 | 34.9 | 🔴 -5.5% |
| GC Events | 18 | 20 | 🟢 +11.1% |

Case 3b. Using transformers

We use the same baseline as before:

```json
"transformer": {
  "d_model": 64,
  "nhead": 4,
  "num_layers": 2,
  "sequence_length": 10,
  "epochs": 100,
  "learning_rate": 0.001,
  "batch_size": 32
}
```

Realtime Monitor and Raw Results

With and without transformer NeuroGC with more training data (click to enlarge)
| Metric | Without NeuroGC | With NeuroGC | Improvement |
| --- | --- | --- | --- |
| Avg CPU (%) | 39.3 | 38.4 | 🟢 +2.2% |
| Avg Memory (%) | 55.0 | 55.0 | 🔴 -0.1% |
| Avg Disk Read | 6150.33 | 7740.47 | 🔴 -25.9% |
| Avg Disk Write | 5745533.81 | 6575194.41 | 🔴 -14.4% |
| Avg Net Sent | 75869.39 | 102453.01 | 🔴 -35.0% |
| Avg Net Recv | 91002.29 | 102489.63 | 🔴 -12.6% |
| P95 Latency (ms) | 3579.4 | 3825.1 | 🔴 -6.9% |
| P99 Latency (ms) | 4604.0 | 4856.3 | 🔴 -5.5% |
| Avg RPS | 33.0 | 30.2 | 🔴 -8.5% |
| GC Events | 20 | 16 | 🔴 -20.0% |
Surprisingly worse ...

Case 3c. Using classical ML algorithms like Random Forest

Yeahhh, again, the same baseline as before:

```json
"classical": {
  "algorithm": "random_forest",
  "n_estimators": 100,
  "max_depth": null,
  "lookback": 20
}
```

Realtime Monitor and Raw Results

With and without random forest NeuroGC with more training data (click to enlarge)
| Metric | Without NeuroGC | With NeuroGC | Improvement |
| --- | --- | --- | --- |
| Avg CPU (%) | 35.3 | 33.7 | 🟢 +4.4% |
| Avg Memory (%) | 54.6 | 54.6 | 0.0% |
| Avg Disk Read | 75813.12 | 48446.51 | 🟢 +36.1% |
| Avg Disk Write | 8276695.12 | 5972533.47 | 🟢 +27.8% |
| Avg Net Sent | 119775.59 | 109045.08 | 🟢 +9.0% |
| Avg Net Recv | 130127.70 | 141767.53 | 🔴 -8.9% |
| P95 Latency (ms) | 2879.6 | 3019.3 | 🔴 -4.9% |
| P99 Latency (ms) | 3710.6 | 3820.0 | 🔴 -2.9% |
| Avg RPS | 42.9 | 40.1 | 🔴 -6.6% |
| GC Events | 23 | 20 | 🔴 -13.0% |

Case 3d. [Bonus] Using the updated LSTM from Case 1

```json
"lstm": {
  "input_size": 10,
  "hidden_size": 64,
  "num_layers": 10,               // increased this from 2
  "sequence_length": 100,         // increased this from 10
  "epochs": 100,
  "learning_rate": 0.001,
  "batch_size": 32
},
```

Realtime Monitor and Raw Results

With and without LSTM NeuroGC with more training data (click to enlarge)
| Metric | Without NeuroGC | With NeuroGC | Improvement |
| --- | --- | --- | --- |
| Avg CPU (%) | 34.3 | 32.6 | 🟢 +5.0% |
| Avg Memory (%) | 54.6 | 54.6 | 🟢 +0.1% |
| Avg Disk Read | 23345.75 | 16588.36 | 🟢 +28.9% |
| Avg Disk Write | 7753257.95 | 7022988.96 | 🟢 +9.4% |
| Avg Net Sent | 110566.35 | 129003.40 | 🔴 -16.7% |
| Avg Net Recv | 154011.56 | 142645.32 | 🟢 +7.4% |
| P95 Latency (ms) | 3028.4 | 2962.2 | 🟢 +2.2% |
| P99 Latency (ms) | 3713.1 | 3597.0 | 🟢 +3.1% |
| Avg RPS | 40.5 | 41.0 | 🟢 +1.2% |
| GC Events | 22 | 23 | 🟢 +4.5% |
Yeah boiii 😏

Finally, some greenery 😭

Conclusion

NeuroGC suggests that Python's garbage collector can get a little smarter by learning from application- and process-level metrics. Across all experiments, the deeper LSTM trained on more data stood out, delivering the most consistent gains. While I had expected the transformers and classical algorithms to dominate, the LSTM ended up taking the crown.


Overall, I’d call this experiment a modest success. Next, it might be interesting to explore CPython and see whether training a model at the C level could bring even greater benefits.


That said, CPython already does an impressive job on its own: reference counting frees most objects the moment they become unreachable, and the cyclic collector only has to deal with reference cycles. In languages like Java, where garbage collection pauses can be a bigger bottleneck, approaches like this could have a more significant impact.
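To make that concrete, here is a small CPython demo: acyclic objects die instantly via reference counting, while a reference cycle survives until the `gc` module (the part NeuroGC schedules) steps in:

```python
import gc
import weakref

class Node:
    pass

gc.disable()                 # make the demo deterministic: no automatic collections

# Acyclic object: reference counting frees it immediately, no collector involved.
obj = Node()
ref = weakref.ref(obj)
del obj
assert ref() is None         # gone the moment the last reference died

# A reference cycle keeps both refcounts above zero, so only the collector helps.
a, b = Node(), Node()
a.partner, b.partner = b, a
ref_a = weakref.ref(a)
del a, b
assert ref_a() is not None   # still alive: the cycle leaks...
gc.collect()                 # ...until the cyclic collector breaks it
assert ref_a() is None
gc.enable()
```

So the only work a learned scheduler can actually optimize is cycle collection, which bounds how big the wins can get in Python.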
