Performance Degradation Introduced in New Relic PHP Agent v10.13.0.2 #806

theophileds · 2023-12-20T09:21:57Z

Description

CPU usage and latency increased significantly, and the number of php-fpm processes began to fluctuate, after we upgraded the New Relic PHP agent from version 10.0.0.312 to version 10.13.0.2. We attempted to downgrade the agent, but the older version had compatibility issues with PHP 8.2, so we disabled the agent entirely, after which performance improved.

Hypothesis: Hypervisor Clock Settings

When we contacted New Relic support, they suggested a possible connection to the hypervisor clock settings. We switched the clock source to TSC (Time Stamp Counter), but a benchmark showed only a marginal improvement in average duration.
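For anyone reproducing this, a minimal sketch of how the active clock source can be checked from inside a host or container. It assumes a Linux kernel that exposes the standard sysfs clocksource file; the helper name `read_clocksource` and the macro `CLOCKSOURCE_PATH` are illustrative and not part of any New Relic tooling.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Standard Linux sysfs location of the active clock source. */
#define CLOCKSOURCE_PATH \
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"

/* Reads the first line of `path` (e.g. "tsc" or "kvm-clock") into buf,
 * without the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or read. */
int read_clocksource(const char *path, char *buf, size_t len)
{
    FILE *f;

    assert(path != NULL && buf != NULL && len > 1);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0'; /* strip trailing newline, if any */
    return 0;
}
```

Calling `read_clocksource(CLOCKSOURCE_PATH, buf, sizeof buf)` from `main` and printing `buf` should show which clock source a given node is actually using before a benchmark run.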

The benchmark ran 100,000,000 iterations and was repeated a hundred times in two containers: one on a machine configured with the TSC clock source and one on a machine configured with kvm-clock.

for (int i = 0; i < iterations; i++) {
    gettimeofday(&end, NULL);
}
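The loop above is only the core of the benchmark. A self-contained sketch of how the timing around it might look is given below; the start timestamp, the averaging over runs, and the function names `bench_gettimeofday` and `bench_average` are assumptions made for illustration, not the exact harness used for the reported numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Times `iterations` back-to-back gettimeofday() calls and returns the
 * elapsed wall-clock time in seconds. */
double bench_gettimeofday(long iterations)
{
    struct timeval start, end;

    assert(iterations > 0);

    gettimeofday(&start, NULL);
    for (long i = 0; i < iterations; i++) {
        gettimeofday(&end, NULL);
    }
    /* `end` holds the timestamp taken by the final iteration. */
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_usec - start.tv_usec) / 1e6;
}

/* Repeats the benchmark `runs` times and returns the average duration
 * in seconds. */
double bench_average(long iterations, int runs)
{
    double total = 0.0;

    assert(runs > 0);

    for (int r = 0; r < runs; r++)
        total += bench_gettimeofday(iterations);
    return total / runs;
}
```

Under these assumptions, `bench_average(100000000L, 100)` would correspond to the setup described above. Because `gettimeofday` reads the wall clock, a clock adjustment during a run can distort a single measurement, which is one reason to average over many runs.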

Benchmark Results:

TSC-based Configuration: Average Duration 2.321919 seconds
kvm-clock-based Configuration: Average Duration 2.817715 seconds
These results correspond to a decrease of about 17.6% in average duration when using TSC.
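For reference, the relative difference follows directly from the two averages reported above, taking the kvm-clock result as the baseline:

```latex
\[
\frac{2.817715 - 2.321919}{2.817715}
  = \frac{0.495796}{2.817715}
  \approx 0.176
\]
```

Taking the TSC result as the baseline instead, the same gap means kvm-clock took roughly 21.4% longer (0.495796 / 2.321919 ≈ 0.214).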

However, we acknowledge that this benchmark may not accurately mirror the load pattern generated by the New Relic agent. Moreover, when we ran the application itself on TSC, we did not observe any noteworthy improvement in its performance.

Feature Disabling and Version Testing

To pinpoint the source of the issue, extensive testing was conducted, including the disabling of features such as distributed tracing, code-level metrics, and application logging. The performance impact persisted across multiple tests and versions.

newrelic.distributed_tracing_enabled = false
newrelic.code_level_metrics.enabled = false
newrelic.application_logging.enabled = false
newrelic.custom_events.max_samples_stored = 10000

newrelic.daemon.dont_launch = 3
newrelic.daemon.utilization.detect_aws = false
newrelic.daemon.utilization.detect_azure = false
newrelic.daemon.utilization.detect_gcp = false
newrelic.daemon.utilization.detect_pcf = false
newrelic.daemon.utilization.detect_docker = false
newrelic.daemon.app_timeout = "2m"
newrelic.browser_monitoring.auto_instrument = false
newrelic.framework = "symfony4"

newrelic.error_collector.enabled = false
newrelic.transaction_tracer.enabled = false
newrelic.transaction_tracer.detail = 0
newrelic.transaction_tracer.slow_sql = false
newrelic.transaction_events.enabled = false
newrelic.attributes.enabled = false
newrelic.custom_insights_events.enabled = false
newrelic.synthetics.enabled = false
newrelic.datastore_tracer.instance_reporting.enabled = false
newrelic.datastore_tracer.database_name_reporting.enabled = false
newrelic.application_logging.forwarding.enabled = false

Regrettably, these efforts did not result in any substantial improvement. After repeating the experiment multiple times, it became evident that enabling New Relic consistently led to a significant negative impact on performance. This observation persisted across various versions of the New Relic agent, including:

10.7.0.319
10.13.0.2
10.14.0.3

PHP-FPM Processes and CPU Usage

As illustrated in the Grafana metrics screen captures, the tests were conducted in the following sequence with the specified configurations:

New Relic fully disabled
New Relic enabled (All features disabled) with TSC clock
New Relic enabled (All features disabled) with kvm-clock configuration

Conclusion

The upgrade to version 10.13.0.2 introduced a significant performance degradation that cannot be explained by new features or clock source settings alone: the issue persists after switching the clock configuration and after disabling features.

Your Environment

PHP backend applications built on Symfony, Docker image php:8.2.13-fpm
Deployed on EKS 1.24, EC2 instance type: m5.xlarge (Hypervisor Nitro)
Clock configuration tested with TSC and kvm-clock


theophileds · 2023-12-26T09:23:48Z

Additional Experiment with Version 10.15.0.4

Further experiments were conducted with New Relic agent version 10.15.0.4 (under the same newrelic.ini configuration), both enabled and disabled. Unfortunately, no significant improvement was observed in performance.

In terms of memory consumption, we observed an increase of approximately 70 MB per pod when the New Relic agent is enabled, resulting in an average of approximately 375 MB per pod. In comparison, when the agent is disabled, the memory usage averages around 305 MB per pod.

dorain47 · 2024-01-01T09:43:00Z

@theophileds I agree with your observation 💯
The CPU spike has reduced (slightly) for me since 10.15.0.4, but I have been seeing higher memory usage over the last few New Relic agent releases.

theophileds · 2024-01-25T08:38:48Z

Hello,

I have some exciting updates to share with you.

Firstly, we conducted performance tests using the latest version of the New Relic agent, v10.16.0.5, and observed a modest ~5% reduction in CPU overhead.

Additionally, after thorough performance testing, we noticed a significant efficiency improvement when moving to Amazon EC2 C7a instances. These instances use AMD processors, which outperformed the Intel-based instances we tested.

Our comparison involved several machines, including c7a.xlarge (AMD), c7i.xlarge (Intel), c5.xlarge (Intel), and our current m5.xlarge. Attached are screenshots depicting the results.

The c7a.xlarge emerged as the top performer, demonstrating significant performance improvements. Part of this improvement can be attributed to the higher clock frequency of the AMD processors (3.7 GHz per core, compared with 3.5 GHz per core on c5.xlarge instances). However, the size of the gap, particularly given that c7a.xlarge instances still use kvm-clock, hints that AMD's architecture and cache structure may also influence these outcomes.

Winfle · 2024-09-01T20:07:09Z

@theophileds I think the main difference between the current Intel instances and the c7a instances is SMT: on c7a, each vCPU is pinned to an actual CPU core rather than to a CPU thread.
So you have more real cores, which matters especially for heavy tasks that involve high CPU usage.

theophileds added the bug Something isn't working label Dec 20, 2023
