I Tried Using DL To Optimize Python’s Garbage Collection (2/2)

Vishvam Feb 9, 2026 7 min read

python deep-learning machine-learning exploratory benchmarks uv

Overview

In the previous post, we explored how deep learning models can be used to optimize Python's GC calls. In this post, we’ll look at how the model performs with:

More training data
More complex parameters
More epochs
Different model architectures
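For context, the idea of letting a model decide when to collect can be sketched roughly as below. This is not NeuroGC's actual code: `run_with_model_driven_gc` and `threshold_predictor` are hypothetical names, and the predictor is a trivial stand-in for a trained model. Only `gc.isenabled()`, `gc.disable()`, `gc.enable()` and `gc.collect()` are real standard-library calls.

```python
import gc
from typing import Callable, Sequence


def run_with_model_driven_gc(
    samples: Sequence[Sequence[float]],
    should_collect: Callable[[Sequence[float]], bool],
) -> int:
    """Replay metric samples and collect only when the predictor says so.

    Returns the number of explicit gc.collect() calls that were made.
    """
    was_enabled = gc.isenabled()
    gc.disable()  # turn off automatic, threshold-based cyclic collection
    collections = 0
    try:
        for sample in samples:
            if should_collect(sample):
                gc.collect()  # explicit full collection
                collections += 1
    finally:
        if was_enabled:
            gc.enable()  # restore the previous behaviour
    return collections


# ASSUMPTION: a hypothetical stand-in for a trained model.
# It asks for a collection whenever the first feature (say, memory %) exceeds 50.
def threshold_predictor(sample: Sequence[float]) -> bool:
    return sample[0] > 50.0


calls = run_with_model_driven_gc([[46.1], [55.3], [54.4], [40.0]], threshold_predictor)
print(calls)
```

Swapping `threshold_predictor` for an LSTM, transformer, or random forest changes only the `should_collect` callable; the surrounding loop stays the same.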

Tip

A thorough analysis, with run-over-run metrics and a Welch's t-test for each of them, can be found here.
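As a rough idea of what a Welch's t-test on two sets of runs involves, here is a minimal standard-library sketch. The helper name `welch_t` and the sample values are made up for illustration and are not taken from the linked analysis; it returns only the t statistic and the Welch-Satterthwaite degrees of freedom, not a p-value.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Illustrative numbers only, e.g. one metric measured over five runs each.
without_neurogc = [1.0, 2.0, 3.0, 4.0, 5.0]
with_neurogc = [2.0, 4.0, 6.0, 8.0, 10.0]

t_stat, dof = welch_t(without_neurogc, with_neurogc)
print(t_stat, dof)
```

Unlike Student's t-test, this version does not assume the two groups share the same variance, which suits noisy load-test runs.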

System Information

Property	Value
Operating System	macOS 14.6
Architecture	arm64
CPU	arm
CPU Cores	8 (logical: 8)
Memory	24.0 GB
Disk	460.4 GB
Python Version	3.14.0

Different Cases

Unless mentioned otherwise, every model is trained and evaluated under the same load: locust -f locustfile.py --headless -u 100 -r 10 --run-time 1m (100 users, spawned at 10 users per second, for 1 minute).

Case 1: Complex model

The model is a standard LSTM, but with more layers and a longer sequence length:

"lstm": {
  "input_size": 10,
  "hidden_size": 64,
  "num_layers": 10,               // increased this from 2
  "sequence_length": 100,         // increased this from 10
  "epochs": 100,
  "learning_rate": 0.001,
  "batch_size": 32
}
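To make `sequence_length` and `input_size` a bit more concrete, here is a small sketch of how a stream of metric samples could be cut into overlapping windows for a sequence model. This assumes a plain sliding window, which may not be exactly how NeuroGC prepares its data; `make_windows` and the synthetic stream are hypothetical.

```python
from typing import List, Sequence


def make_windows(
    samples: Sequence[Sequence[float]], sequence_length: int
) -> List[List[Sequence[float]]]:
    """Slice a metric stream into overlapping windows of sequence_length steps."""
    return [
        list(samples[start:start + sequence_length])
        for start in range(len(samples) - sequence_length + 1)
    ]


# Synthetic stream: 120 samples with 10 features each (input_size = 10).
stream = [[float(t)] * 10 for t in range(120)]
windows = make_windows(stream, sequence_length=100)

# Number of windows, steps per window, features per step.
print(len(windows), len(windows[0]), len(windows[0][0]))
```

Under this assumption, raising `sequence_length` from 10 to 100 means the model sees ten times more history per prediction, but it also needs 100 samples before it can produce its first window.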

Realtime Monitor (pretty 😤) and Raw Results

With and without LSTM NeuroGC

Metric	Without NeuroGC	With NeuroGC	Improvement
Avg CPU (%)	36.8	37.6	🔴 -2.1%
Avg Memory (%)	46.1	46.1	🟢 +0.1%
Avg Disk Read	8399.86	8133.23	🟢 +3.2%
Avg Disk Write	5317931.93	4209032.94	🟢 +20.9%
Avg Net Sent	67313.15	63477.44	🟢 +5.7%
Avg Net Recv	74458.70	86484.33	🔴 -16.2%
P95 Latency (ms)	3724.2	3751.8	🔴 -0.7%
P99 Latency (ms)	4566.0	5080.4	🔴 -11.3%
Avg RPS	29.4	28.2	🔴 -4.3%
GC Events	18	15	🔴 -16.7%
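The Improvement column in these tables appears to be the relative change against the run without NeuroGC, with the sign flipped for metrics where lower is better. Judging by the signs used, Avg RPS and GC Events are treated as higher-is-better. The sketch below reproduces a few rows under that reading; `improvement_pct` is a hypothetical helper, and some published rows differ by 0.1 to 0.2 points, presumably because they were computed from unrounded values.

```python
def improvement_pct(without: float, with_: float, higher_is_better: bool = False) -> float:
    """Relative change vs. the baseline, signed so that positive means better."""
    change = (with_ - without) / without * 100.0
    return change if higher_is_better else -change


# Rows taken from the Case 1 table above.
disk_read = improvement_pct(8399.86, 8133.23)                # lower is better
p99_latency = improvement_pct(4566.0, 5080.4)                # lower is better
gc_events = improvement_pct(18, 15, higher_is_better=True)   # treated as higher is better

print(round(disk_read, 1), round(p99_latency, 1), round(gc_events, 1))
```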

Case 2: Different model architecture

Here, we swap out the underlying ML model itself. Everything else follows the baseline config.

Case 2a. Using normal feed-forward networks

We use the following config as the baseline:

"feedforward": {
  "hidden_sizes": [64, 32, 16, 8],
  "lookback": 20,
  "epochs": 100,
  "learning_rate": 0.001,
  "batch_size": 32
},
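For a sense of scale, the number of trainable parameters implied by `hidden_sizes` can be estimated with a few lines of arithmetic. The input and output widths are assumptions on my part: the sketch assumes 10 features per sample flattened over the 20-sample lookback (200 inputs) and a single output, neither of which is stated in the config. `mlp_parameter_count` is a hypothetical helper.

```python
def mlp_parameter_count(input_width, hidden_sizes, output_width=1):
    """Count weights and biases of a fully connected network."""
    widths = [input_width, *hidden_sizes, output_width]
    # Each dense layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in zip(widths, widths[1:]))


# ASSUMPTION: 10 features per sample, flattened over lookback = 20 samples.
total = mlp_parameter_count(input_width=20 * 10, hidden_sizes=[64, 32, 16, 8])
print(total)
```

Under these assumptions, most of the parameters sit in the first layer, so the lookback length affects model size far more than the later hidden sizes do.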

Realtime Monitor and Raw Results

With and without normal feed-forward NeuroGC

Metric	Without NeuroGC	With NeuroGC	Improvement
Avg CPU (%)	39.8	43.0	🔴 -8.0%
Avg Memory (%)	55.3	55.3	0.0%
Avg Disk Read	7675.69	11055.08	🔴 -44.0%
Avg Disk Write	5940870.38	8232453.21	🔴 -38.6%
Avg Net Sent	93112.14	74657.59	🟢 +19.8%
Avg Net Recv	95884.71	90835.58	🟢 +5.3%
P95 Latency (ms)	3855.2	4139.6	🔴 -7.4%
P99 Latency (ms)	5410.8	5231.7	🟢 +3.3%
Avg RPS	29.8	29.3	🔴 -1.8%
GC Events	14	18	🟢 +28.6%

Case 2b. Using transformers

We use the following config as the baseline:

"transformer": {
  "d_model": 64,
  "nhead": 4,
  "num_layers": 2,
  "sequence_length": 10,
  "epochs": 100,
  "learning_rate": 0.001,
  "batch_size": 32
}

Realtime Monitor and Raw Results

With and without transformer NeuroGC

Metric	Without NeuroGC	With NeuroGC	Improvement
Avg CPU (%)	37.5	33.7	🟢 +10.2%
Avg Memory (%)	55.1	55.1	0.0%
Avg Disk Read	2433.94	1638.77	🟢 +32.7%
Avg Disk Write	6325593.39	5631381.21	🟢 +11.0%
Avg Net Sent	116008.92	68573.32	🟢 +40.9%
Avg Net Recv	98824.23	92021.42	🟢 +6.9%
P95 Latency (ms)	3780.6	3718.8	🟢 +1.6%
P99 Latency (ms)	4827.9	4913.1	🔴 -1.8%
Avg RPS	29.8	30.6	🟢 +2.9%
GC Events	18	17	🔴 -5.6%

Case 2c. Using classical ML algorithms like Random Forest

We use the following config as the baseline:

"classical": {
  "algorithm": "random_forest",
  "n_estimators": 100,
  "max_depth": null,
  "lookback": 20
}
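A random forest expects one fixed-width feature row per example, so a `lookback` of 20 most likely means the last 20 samples are flattened into a single row. That flattening is an assumption about how the config is used, not something the post states, and `flatten_lookback` is a hypothetical helper.

```python
def flatten_lookback(samples, lookback):
    """Turn a stream of per-sample feature lists into flat rows covering `lookback` samples."""
    rows = []
    for end in range(lookback, len(samples) + 1):
        window = samples[end - lookback:end]
        # Concatenate the features of every sample in the window, oldest first.
        rows.append([value for step in window for value in step])
    return rows


# Synthetic stream: 25 samples with 3 features each.
stream = [[float(t), float(t) + 0.1, float(t) + 0.2] for t in range(25)]
rows = flatten_lookback(stream, lookback=20)

# Number of rows and columns per row (20 samples * 3 features).
print(len(rows), len(rows[0]))
```

Each row can then be fed to a tabular model as is; the temporal order is only preserved implicitly through column position.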

Realtime Monitor and Raw Results

With and without random forest NeuroGC

Metric	Without NeuroGC	With NeuroGC	Improvement
Avg CPU (%)	39.0	39.3	🔴 -0.8%
Avg Memory (%)	54.4	54.5	0.0%
Avg Disk Read	2786.16	591.43	🟢 +78.8%
Avg Disk Write	6037546.50	5408297.63	🟢 +10.4%
Avg Net Sent	90606.60	67460.24	🟢 +25.5%
Avg Net Recv	97007.71	74088.75	🟢 +23.6%
P95 Latency (ms)	3488.8	3450.9	🟢 +1.1%
P99 Latency (ms)	4587.6	4786.3	🔴 -4.3%
Avg RPS	30.4	29.9	🔴 -1.6%
GC Events	16	19	🟢 +18.8%

Case 3: With more training data

The training load now runs for 5 minutes. The evaluation load remains the same:

# Training load (increased to 5 mins)
locust -f locustfile.py --headless -u 100 -r 10 --run-time 5m

# Evaluation load (same as before)
locust -f locustfile.py --headless -u 100 -r 10 --run-time 1m

Case 3a. Using normal feed-forward networks

We use the same baseline as before:

"feedforward": {
  "hidden_sizes": [64, 32, 16, 8],
  "lookback": 20,
  "epochs": 100,
  "learning_rate": 0.001,
  "batch_size": 32
},

Realtime Monitor and Raw Results

With and without normal feed-forward NeuroGC with more training data

Metric	Without NeuroGC	With NeuroGC	Improvement
Avg CPU (%)	36.2	45.6	🔴 -26.2%
Avg Memory (%)	55.0	55.0	0.0%
Avg Disk Read	1186.42	3065.07	🔴 -158.3%
Avg Disk Write	6996910.50	5513574.88	🟢 +21.2%
Avg Net Sent	101125.85	156501.60	🔴 -54.8%
Avg Net Recv	120104.51	128264.26	🔴 -6.8%
P95 Latency (ms)	4038.0	3979.7	🟢 +1.4%
P99 Latency (ms)	5765.1	5523.3	🟢 +4.2%
Avg RPS	36.9	34.9	🔴 -5.5%
GC Events	18	20	🟢 +11.1%

Case 3b. Using transformers

We use the same baseline as before:

"transformer": {
  "d_model": 64,
  "nhead": 4,
  "num_layers": 2,
  "sequence_length": 10,
  "epochs": 100,
  "learning_rate": 0.001,
  "batch_size": 32
}

Realtime Monitor and Raw Results

With and without transformer NeuroGC with more training data

Metric	Without NeuroGC	With NeuroGC	Improvement
Avg CPU (%)	39.3	38.4	🟢 +2.2%
Avg Memory (%)	55.0	55.0	🔴 -0.1%
Avg Disk Read	6150.33	7740.47	🔴 -25.9%
Avg Disk Write	5745533.81	6575194.41	🔴 -14.4%
Avg Net Sent	75869.39	102453.01	🔴 -35.0%
Avg Net Recv	91002.29	102489.63	🔴 -12.6%
P95 Latency (ms)	3579.4	3825.1	🔴 -6.9%
P99 Latency (ms)	4604.0	4856.3	🔴 -5.5%
Avg RPS	33.0	30.2	🔴 -8.5%
GC Events	20	16	🔴 -20.0%

Surprisingly worse ...

Case 3c. Using classical ML algorithms like Random Forest

Yeahhh, again, same baseline as before:

"classical": {
  "algorithm": "random_forest",
  "n_estimators": 100,
  "max_depth": null,
  "lookback": 20
}

Realtime Monitor and Raw Results

With and without random forest NeuroGC with more training data

Metric	Without NeuroGC	With NeuroGC	Improvement
Avg CPU (%)	35.3	33.7	🟢 +4.4%
Avg Memory (%)	54.6	54.6	0.0%
Avg Disk Read	75813.12	48446.51	🟢 +36.1%
Avg Disk Write	8276695.12	5972533.47	🟢 +27.8%
Avg Net Sent	119775.59	109045.08	🟢 +9.0%
Avg Net Recv	130127.70	141767.53	🔴 -8.9%
P95 Latency (ms)	2879.6	3019.3	🔴 -4.9%
P99 Latency (ms)	3710.6	3820.0	🔴 -2.9%
Avg RPS	42.9	40.1	🔴 -6.6%
GC Events	23	20	🔴 -13.0%

Case 3d. [Bonus] Using updated LSTM

"lstm": {
  "input_size": 10,
  "hidden_size": 64,
  "num_layers": 10,               // increased this from 2
  "sequence_length": 100,         // increased this from 10
  "epochs": 100,
  "learning_rate": 0.001,
  "batch_size": 32
},

Realtime Monitor and Raw Results

With and without LSTM NeuroGC with more training data

Metric	Without NeuroGC	With NeuroGC	Improvement
Avg CPU (%)	34.3	32.6	🟢 +5.0%
Avg Memory (%)	54.6	54.6	🟢 +0.1%
Avg Disk Read	23345.75	16588.36	🟢 +28.9%
Avg Disk Write	7753257.95	7022988.96	🟢 +9.4%
Avg Net Sent	110566.35	129003.40	🔴 -16.7%
Avg Net Recv	154011.56	142645.32	🟢 +7.4%
P95 Latency (ms)	3028.4	2962.2	🟢 +2.2%
P99 Latency (ms)	3713.1	3597.0	🟢 +3.1%
Avg RPS	40.5	41.0	🟢 +1.2%
GC Events	22	23	🟢 +4.5%

Yeah boiii 😏

Finally, some greenery 😭

Conclusion

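NeuroGC demonstrates that Python’s garbage collector can get a little smarter by learning from application and process-level metrics. Across all experiments, the deeper LSTM trained on more data did best: in Case 3d it improved 9 of the 10 metrics, including P95/P99 latency and RPS, although the latency and throughput gains were only around 1 to 3%. The same LSTM trained on just one minute of load (Case 1) regressed on CPU, latency, and RPS, so the extra training data seems to matter as much as the depth. While I had expected Transformers and classical algorithms to dominate, the LSTM ended up taking the crown.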

Overall, I’d call this experiment a modest success. Next, it might be interesting to explore CPython and see whether training a model at the C level could bring even greater benefits.

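That said, CPython already does an impressive job on its own: reference counting frees most objects the moment their last reference disappears, and the cyclic garbage collector only has to clean up reference cycles. In languages like Java, where garbage collection can be a bigger bottleneck, approaches like this could potentially have a more significant impact.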
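The split between reference counting and the cyclic collector is easy to see in a small experiment. The sketch below is specific to CPython, and the `Node` class is just a throwaway example; it uses weak references to observe when objects actually disappear.

```python
import gc
import weakref


class Node:
    def __init__(self):
        self.other = None


gc.disable()  # keep the cyclic collector from kicking in on its own mid-demo

# Acyclic object: freed by reference counting as soon as the last reference goes away.
solo = Node()
solo_ref = weakref.ref(solo)
del solo
freed_by_refcount = solo_ref() is None

# Two objects pointing at each other: their reference counts never reach zero.
a, b = Node(), Node()
a.other, b.other = b, a
cycle_ref = weakref.ref(a)
del a, b
alive_before_collect = cycle_ref() is not None

gc.collect()  # only the cyclic collector can reclaim the pair
freed_by_collector = cycle_ref() is None

gc.enable()
print(freed_by_refcount, alive_before_collect, freed_by_collector)
```

Only the second case is something a smarter GC schedule can influence, which helps explain why the headroom in Python is fairly small.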
