Fix #1748: High db update load because of callback event circuit breaker #1749

Merged on Oct 22, 2024 (3 commits). Showing changes from 1 commit.
9 changes: 5 additions & 4 deletions docs/Configuration-Properties.md
@@ -106,8 +106,9 @@ In certain scenarios, repeatedly attempting to dispatch callback events may be p
receiver's side. To address this, if multiple callback events with the same configuration fail consecutively, the
service temporarily halts further dispatch attempts and marks these events as failed without retrying. The number of
consecutive failures allowed before stopping dispatch is defined by the `failureThreshold` property, while the halt
-period is configurable via the `resetTimeout` property. After this period, a callback dispatch attempt will be made again
-to check the receiver's availability.
+period is configurable via the `failureResetTimeout` property. After this period, a callback dispatch attempt will be
+made again to check the receiver's availability. If the `failureThreshold` is set to `-1`, the functionality is not
+enabled.

PowerAuth dispatches a callback as soon as a change in operation or activation status is detected. Each newly created
callback is passed to a configurable thread pool executor for dispatch. Even if the thread pool's queue is full, the
@@ -132,8 +133,8 @@ to callback events with max attempts set to 1, such callback events are never sc
| `powerauth.service.callbacks.threadPoolMaxSize` | `2` | Maximum number of threads in the thread pool used by the executor. |
| `powerauth.service.callbacks.threadPoolQueueCapacity` | `1000` | Queue capacity of the thread pool used by the executor. |
| `powerauth.service.callbacks.forceRerunPeriod` | | Time period after which a currently processed callback event is considered stale and should be scheduled to rerun. |
-| `powerauth.service.callbacks.failureThreshold` | `200` | The number of consecutive failures allowed for callback events with the same configuration. |
-| `powerauth.service.callbacks.resetTimeout` | `60s` | Time period after which a Callback URL Event will be dispatched, even if failure threshold has been reached. |
+| `powerauth.service.callbacks.failureThreshold` | `200` | The number of consecutive failures allowed for callback events with the same configuration. If set to `-1`, unlimited number of failures is allowed. |
+| `powerauth.service.callbacks.failureResetTimeout` | `60s` | Time period after which a Callback URL Event will be dispatched, even if failure threshold has been reached. |
| `powerauth.service.callbacks.clients.cache.refreshAfterWrite` | `5m` | Callback REST clients are cached and automatically evicted if updated through the Callback Management API on a single node. Time-based refreshing mechanism is a fallback in clustered environments. |
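
The halt-and-probe behaviour that `failureThreshold` and `failureResetTimeout` describe can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the PowerAuth implementation: the class `CallbackCircuitBreakerSketch`, the `Settings` record, and the method `isDispatchHalted` are hypothetical names, and timestamps are passed in explicitly so the logic can be checked without a clock.

```java
import java.time.Duration;
import java.time.Instant;

public class CallbackCircuitBreakerSketch {

    /** Hypothetical holder for the two documented properties. */
    public record Settings(int failureThreshold, Duration failureResetTimeout) {
        /** A threshold of -1 switches the functionality off. */
        public boolean enabled() {
            return failureThreshold != -1;
        }
    }

    /**
     * Returns true while dispatch should be halted: the number of consecutive
     * failures has reached the threshold and the reset timeout has not yet
     * elapsed since the last failure.
     */
    public static boolean isDispatchHalted(Settings settings, int consecutiveFailures,
                                           Instant lastFailure, Instant now) {
        if (!settings.enabled() || lastFailure == null) {
            return false;
        }
        if (consecutiveFailures < settings.failureThreshold()) {
            return false;
        }
        // Once the halt period has passed, one probing attempt is allowed again.
        final Instant haltEnds = lastFailure.plus(settings.failureResetTimeout());
        return now.isBefore(haltEnds);
    }

    public static void main(String[] args) {
        final Settings settings = new Settings(3, Duration.ofSeconds(60));
        final Instant lastFailure = Instant.parse("2024-10-22T10:00:00Z");

        // Two consecutive failures: below the threshold, dispatch continues.
        System.out.println(isDispatchHalted(settings, 2, lastFailure, lastFailure.plusSeconds(5)));
        // Three failures, 30 s after the last one: dispatch is halted.
        System.out.println(isDispatchHalted(settings, 3, lastFailure, lastFailure.plusSeconds(30)));
        // Three failures, 61 s after the last one: a probing attempt is allowed.
        System.out.println(isDispatchHalted(settings, 3, lastFailure, lastFailure.plusSeconds(61)));
    }
}
```

With the default values (`200` and `60s`), the same function would halt dispatch for one minute after the 200th consecutive failure and then let a single event through to test the receiver.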

The backoff period after the `N-th` attempt is calculated as follows:
2 changes: 0 additions & 2 deletions docs/Database-Structure.md
@@ -185,8 +185,6 @@ Stores callback URLs - per-application endpoints that are notified whenever an a
| max_attempts | INTEGER | - | Maximum number of attempts to dispatch a callback. |
| initial_backoff | VARCHAR(64) | - | Initial backoff period before the next send attempt, stored as a ISO 8601 string. |
| retention_period | VARCHAR(64) | - | Minimal duration for which is a completed callback event persisted, stored as a ISO 8601 string. |
-| timestamp_last_failure | DATETIME | - | The timestamp of the most recent failed callback event associated with this configuration. |
-| failure_count | INTEGER | DEFAULT 0 NOT NULL | The number of consecutive failed callback events associated with this configuration. |
| enabled | BOOLEAN | - | Indicator specifying whether the Callback URL should be used. |
| timestamp_created | DATETIME | DEFAULT NOW() NOT NULL | Timestamp when the record was created. |
| timestamp_last_updated | DATETIME | - | Timestamp of the last update of the record via the Callback Management API. |
7 changes: 0 additions & 7 deletions docs/PowerAuth-Server-1.9.0.md
@@ -55,13 +55,6 @@ options for the retry strategy with an exponential backoff algorithm. Namely:

These settings at the individual callback level overrides the global default settings at the application level.

-### Add Columns to Enable Callback Failures Monitoring
-
-Following columns has been added to the `pa_application_callback` table to enable monitoring of callback dispatch
-failures:
-- `failure_count` to hold the number of consecutive failed callbacks of the same configuration, and
-- `timestamp_last_failure` to store the timestamp of the most recent failed callback attempt.
-
### Add Column Indicating If a Callback Is Enabled

A new column `enabled` has been added to the `pa_application_callback` table to indicate whether a Callback URL is
@@ -110,32 +110,6 @@
<createSequence sequenceName="pa_app_callback_event_seq" startValue="1" incrementBy="50" cacheSize="20"/>
</changeSet>

-<changeSet id="8" logicalFilePath="powerauth-java-server/1.9.x/20240704-callback-event-table.xml" author="Jan Pesek">

Review comments on this line:

banterCZ (Member), Oct 18, 2024: @zcgandcomp @korbelm How will we manage our environments? Will we remove these columns manually?

Member: If any manual activity is needed, please create a ticket and I will fix the test environments.

Collaborator, Author: This is about dropping two unused columns, but there will be another issue regarding dropping the FK constraint. I will create a more detailed issue for you later.

-<preConditions onFail="MARK_RAN">
-<not>
-<columnExists tableName="pa_application_callback" columnName="timestamp_last_failure" />
-</not>
-</preConditions>
-<comment>Add timestamp_last_failure column to pa_application_callback table.</comment>
-<addColumn tableName="pa_application_callback">
-<column name="timestamp_last_failure" type="timestamp(6)" />
-</addColumn>
-</changeSet>
-
-<changeSet id="9" logicalFilePath="powerauth-java-server/1.9.x/20240704-callback-event-table.xml" author="Jan Pesek">
-<preConditions onFail="MARK_RAN">
-<not>
-<columnExists tableName="pa_application_callback" columnName="failure_count" />
-</not>
-</preConditions>
-<comment>Add failure_count column to pa_application_callback table.</comment>
-<addColumn tableName="pa_application_callback">
-<column name="failure_count" type="integer" defaultValueNumeric="0">
-<constraints nullable="false" />
-</column>
-</addColumn>
-</changeSet>
-
<changeSet id="10" logicalFilePath="powerauth-java-server/1.9.x/20240704-callback-event-table.xml" author="Jan Pesek">
<preConditions onFail="MARK_RAN">
<not>
Binary file modified docs/images/arch_db_structure.png
10 changes: 0 additions & 10 deletions docs/sql/mssql/migration_1.8.0_1.9.0.sql
@@ -58,16 +58,6 @@ GO
CREATE SEQUENCE pa_app_callback_event_seq START WITH 1 INCREMENT BY 50;
GO

--- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::8::Jan Pesek
--- Add timestamp_last_failure column to pa_application_callback table.
-ALTER TABLE pa_application_callback ADD timestamp_last_failure datetime2(6);
-GO
-
--- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::9::Jan Pesek
--- Add failure_count column to pa_application_callback table.
-ALTER TABLE pa_application_callback ADD failure_count int CONSTRAINT DF_pa_application_callback_failure_count DEFAULT 0 NOT NULL;
-GO
-
-- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::10::Jan Pesek
-- Add enabled column to pa_application_callback table.
ALTER TABLE pa_application_callback ADD enabled bit CONSTRAINT DF_pa_application_callback_enabled DEFAULT 1 NOT NULL;
8 changes: 0 additions & 8 deletions docs/sql/oracle/migration_1.8.0_1.9.0.sql
@@ -46,14 +46,6 @@ CREATE INDEX pa_app_cb_event_ts_del_idx ON pa_application_callback_event(timesta
-- Create a new sequence pa_app_callback_event_seq
CREATE SEQUENCE pa_app_callback_event_seq START WITH 1 INCREMENT BY 50 CACHE 20;

--- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::8::Jan Pesek
--- Add timestamp_last_failure column to pa_application_callback table.
-ALTER TABLE pa_application_callback ADD timestamp_last_failure TIMESTAMP(6);
-
--- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::9::Jan Pesek
--- Add failure_count column to pa_application_callback table.
-ALTER TABLE pa_application_callback ADD failure_count INTEGER DEFAULT 0 NOT NULL;
-
-- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::10::Jan Pesek
-- Add enabled column to pa_application_callback table.
ALTER TABLE pa_application_callback ADD enabled BOOLEAN DEFAULT 1 NOT NULL;
8 changes: 0 additions & 8 deletions docs/sql/postgresql/migration_1.8.0_1.9.0.sql
@@ -46,14 +46,6 @@ CREATE INDEX pa_app_cb_event_ts_del_idx ON pa_application_callback_event(timesta
-- Create a new sequence pa_app_callback_event_seq
CREATE SEQUENCE IF NOT EXISTS pa_app_callback_event_seq START WITH 1 INCREMENT BY 50 CACHE 20;

--- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::8::Jan Pesek
--- Add timestamp_last_failure column to pa_application_callback table.
-ALTER TABLE pa_application_callback ADD timestamp_last_failure TIMESTAMP(6) WITHOUT TIME ZONE;
-
--- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::9::Jan Pesek
--- Add failure_count column to pa_application_callback table.
-ALTER TABLE pa_application_callback ADD failure_count INTEGER DEFAULT 0 NOT NULL;
-
-- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::10::Jan Pesek
-- Add enabled column to pa_application_callback table.
ALTER TABLE pa_application_callback ADD enabled BOOLEAN DEFAULT TRUE NOT NULL;
@@ -18,11 +18,13 @@

package io.getlime.security.powerauth.app.server.configuration;

+import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import io.getlime.security.powerauth.app.server.database.model.entity.CallbackUrlEntity;
import io.getlime.security.powerauth.app.server.service.callbacks.CallbackUrlRestClientCacheLoader;
import io.getlime.security.powerauth.app.server.service.callbacks.model.CachedRestClient;
+import io.getlime.security.powerauth.app.server.service.callbacks.model.FailureStats;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
@@ -56,4 +58,16 @@ public LoadingCache<String, CachedRestClient> callbackUrlRestClientCache(
.build(cacheLoader);
}

+/**
+* Configuration of the cache for gathering failure statistics during callback processing.
+* {@link CallbackUrlEntity#getId()} is used as a cache key.
+*
+* @return Cache for FailureStats.
+*/
+@Bean
+public Cache<String, FailureStats> callbackFailureStatsCache() {
+return Caffeine.newBuilder()
+.build();
+}
+
}
@@ -90,11 +90,11 @@ public class PowerAuthCallbacksConfiguration {
* Number of allowed Callback URL Events failures in a row. When the threshold is reached no other
* events with the same Callback URL configuration will be posted.
*/
-private Integer failureThreshold = 200;
+private int failureThreshold = 200;

/**
* Period after which a Callback URL Event will be dispatched even though failure threshold is reached.
*/
-private Duration resetTimeout = Duration.ofSeconds(60);
+private Duration failureResetTimeout = Duration.ofSeconds(60);

}
@@ -121,18 +121,6 @@ public class CallbackUrlEntity implements Serializable {
@Convert(converter = DurationConverter.class)
private Duration retentionPeriod;

-/**
-* Timestamp of last callback failure.
-*/
-@Column(name = "timestamp_last_failure")
-private LocalDateTime timestampLastFailure;
-
-/**
-* Number of failed callbacks in a row.
-*/
-@Column(name = "failure_count", nullable = false)
-private Integer failureCount;
-
/**
* Whether the callback is enabled and can be used.
*/
@@ -39,22 +39,6 @@ public interface CallbackUrlRepository extends CrudRepository<CallbackUrlEntity,

List<CallbackUrlEntity> findByApplicationIdAndTypeOrderByName(String applicationId, CallbackUrlType type);

-@Modifying
-@Query("""
-UPDATE CallbackUrlEntity c
-SET c.failureCount = c.failureCount + 1, c.timestampLastFailure = :timestampLastFailure
-WHERE c.id = :id
-""")
-void incrementFailureCount(String id, LocalDateTime timestampLastFailure);
-
-@Modifying
-@Query("""
-UPDATE CallbackUrlEntity c
-SET c.failureCount = 0, c.timestampLastFailure = NULL
-WHERE c.id = :id
-""")
-void resetFailureCount(String id);
-
@Modifying
@Query("""
UPDATE CallbackUrlEntity c
@@ -111,7 +111,6 @@ public CreateCallbackUrlResponse createCallbackUrl(CreateCallbackUrlRequest requ
entity.setType(CallbackUrlTypeConverter.convert(request.getType()));
entity.setCallbackUrl(request.getCallbackUrl());
entity.setAttributes(request.getAttributes());
-entity.setFailureCount(0);
final EncryptableString encrypted = callbackUrlAuthenticationEncryptor.encrypt(request.getAuthentication(), entity.getApplication().getId());
entity.setAuthentication(encrypted.encryptedData());
entity.setEncryptionMode(encrypted.encryptionMode());
@@ -18,13 +18,14 @@

package io.getlime.security.powerauth.app.server.service.callbacks;

+import com.github.benmanes.caffeine.cache.Cache;
import io.getlime.security.powerauth.app.server.configuration.PowerAuthCallbacksConfiguration;
import io.getlime.security.powerauth.app.server.database.model.entity.CallbackUrlEntity;
import io.getlime.security.powerauth.app.server.database.model.entity.CallbackUrlEventEntity;
import io.getlime.security.powerauth.app.server.database.model.enumeration.CallbackUrlEventStatus;
import io.getlime.security.powerauth.app.server.database.repository.CallbackUrlEventRepository;
-import io.getlime.security.powerauth.app.server.database.repository.CallbackUrlRepository;
import io.getlime.security.powerauth.app.server.service.callbacks.model.CallbackUrlEvent;
+import io.getlime.security.powerauth.app.server.service.callbacks.model.FailureStats;
import lombok.AllArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;
@@ -47,8 +48,8 @@
public class CallbackUrlEventResponseHandler {

private final CallbackUrlEventRepository callbackUrlEventRepository;
-private final CallbackUrlRepository callbackUrlRepository;
private final PowerAuthCallbacksConfiguration powerAuthCallbacksConfiguration;
+private final Cache<String, FailureStats> callbackFailureStatsCache;

/**
* Handle successful Callback URL Event attempt.
@@ -68,7 +69,7 @@ public void handleSuccess(final CallbackUrlEvent callbackUrlEvent) {
callbackUrlEventEntity.setAttempts(callbackUrlEventEntity.getAttempts() + 1);
callbackUrlEventEntity.setStatus(CallbackUrlEventStatus.COMPLETED);
callbackUrlEventRepository.save(callbackUrlEventEntity);
-callbackUrlRepository.resetFailureCount(callbackUrlEventEntity.getCallbackUrlEntity().getId());
+resetFailureCount(callbackUrlEventEntity.getCallbackUrlEntity().getId());
}

/**
@@ -104,7 +105,7 @@ public void handleFailure(final CallbackUrlEvent callbackUrlEvent, final Throwab
}

callbackUrlEventRepository.save(callbackUrlEventEntity);
-callbackUrlRepository.incrementFailureCount(callbackUrlEntity.getId(), LocalDateTime.now());
+incrementFailureCount(callbackUrlEntity.getId());
}

/**
@@ -125,4 +126,27 @@ private static Duration calculateExponentialBackoffPeriod(final int attempts, fi
return Duration.ofMillis(Math.min(backoffMillis, maxBackoff.toMillis()));
}

+private void incrementFailureCount(final String callbackUrlId) {
+final int failureThreshold = powerAuthCallbacksConfiguration.getFailureThreshold();
+if (failureThreshold == -1) {
+logger.debug("Failure stats are turned off for Callback URL processing");
+return;
+}
+
+callbackFailureStatsCache.asMap().compute(callbackUrlId, (key, cachedFailureStats) -> {
+if (cachedFailureStats == null) {
+return new FailureStats(1, LocalDateTime.now());
+} else {
+return new FailureStats(cachedFailureStats.failureCount() + 1, LocalDateTime.now());
+}
+});
+
+}
+
+private void resetFailureCount(final String callbackUrlId) {
+callbackFailureStatsCache.asMap().computeIfPresent(callbackUrlId,
+(key, cachedFailureStats) -> new FailureStats(0, cachedFailureStats.timestampLastFailure())
+);
+}
+
}
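
The in-memory counters introduced by `incrementFailureCount` and `resetFailureCount` rely on atomic per-key updates of a concurrent map. The sketch below illustrates that pattern in isolation. It makes two assumptions: a plain `ConcurrentHashMap` stands in for the `asMap()` view of the Caffeine cache, and `FailureStats` is declared as a record whose shape is inferred from the accessors used in the diff (its real definition is not part of the changes shown here). The class name `FailureStatsSketch` and the `failureCount(String)` helper are hypothetical.

```java
import java.time.LocalDateTime;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FailureStatsSketch {

    /** Assumed shape of FailureStats, inferred from failureCount() and timestampLastFailure(). */
    public record FailureStats(int failureCount, LocalDateTime timestampLastFailure) {}

    // Stand-in for callbackFailureStatsCache.asMap(); keyed by the Callback URL ID.
    private final ConcurrentMap<String, FailureStats> stats = new ConcurrentHashMap<>();

    /** Atomically creates the entry on the first failure or bumps the counter. */
    public void incrementFailureCount(final String callbackUrlId) {
        stats.compute(callbackUrlId, (key, cached) ->
                new FailureStats(cached == null ? 1 : cached.failureCount() + 1, LocalDateTime.now()));
    }

    /** Resets the counter of an existing entry; an unknown ID is left untouched. */
    public void resetFailureCount(final String callbackUrlId) {
        stats.computeIfPresent(callbackUrlId,
                (key, cached) -> new FailureStats(0, cached.timestampLastFailure()));
    }

    /** Hypothetical read helper: zero when no failure has been recorded yet. */
    public int failureCount(final String callbackUrlId) {
        final FailureStats cached = stats.get(callbackUrlId);
        return cached == null ? 0 : cached.failureCount();
    }

    public static void main(String[] args) throws InterruptedException {
        final FailureStatsSketch sketch = new FailureStatsSketch();
        final ExecutorService pool = Executors.newFixedThreadPool(4);
        // Concurrent failures for the same callback must not lose updates.
        for (int i = 0; i < 1000; i++) {
            pool.submit(() -> sketch.incrementFailureCount("callback-1"));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(sketch.failureCount("callback-1"));

        sketch.resetFailureCount("callback-1");
        System.out.println(sketch.failureCount("callback-1"));
    }
}
```

Because such statistics live in process memory, each server node would keep its own counters and start empty after a restart. That appears to be the trade-off this change accepts in exchange for no longer issuing a database update on every callback result.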
@@ -18,6 +18,7 @@

package io.getlime.security.powerauth.app.server.service.callbacks;

+import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.LoadingCache;
import com.wultra.core.rest.client.base.RestClient;
import com.wultra.core.rest.client.base.RestClientException;
@@ -30,6 +31,7 @@
import io.getlime.security.powerauth.app.server.service.callbacks.model.CachedRestClient;
import io.getlime.security.powerauth.app.server.service.callbacks.model.CallbackUrlConvertor;
import io.getlime.security.powerauth.app.server.service.callbacks.model.CallbackUrlEvent;
+import io.getlime.security.powerauth.app.server.service.callbacks.model.FailureStats;
import io.getlime.security.powerauth.app.server.service.util.TransactionUtils;
import jakarta.annotation.PostConstruct;
import lombok.AllArgsConstructor;
@@ -63,6 +65,7 @@ public class CallbackUrlEventService {
private final CallbackUrlEventRepository callbackUrlEventRepository;
private final CallbackUrlEventResponseHandler callbackUrlEventResponseHandler;
private final LoadingCache<String, CachedRestClient> restClientCache;
+private final Cache<String, FailureStats> callbackFailureStatsCache;

private final PowerAuthServiceConfiguration powerAuthServiceConfiguration;
private final PowerAuthCallbacksConfiguration powerAuthCallbacksConfiguration;
@@ -175,13 +178,21 @@ public int obtainMaxAttempts(final CallbackUrlEntity callbackUrlEntity) {
* @return True if the callback should be processed, false otherwise.
*/
public boolean failureThresholdReached(final CallbackUrlEntity callbackUrlEntity) {
-final Integer failureThreshold = powerAuthCallbacksConfiguration.getFailureThreshold();
-final Duration resetTimeout = powerAuthCallbacksConfiguration.getResetTimeout();
+final String callbackUrlId = callbackUrlEntity.getId();
+final FailureStats failureStats = callbackFailureStatsCache.getIfPresent(callbackUrlId);
+if (failureStats == null) {
+logger.debug("No failure stats available yet for Callback URL processing: id={}", callbackUrlId);
+return false;
+}
+
+final int failureThreshold = powerAuthCallbacksConfiguration.getFailureThreshold();
+final Duration resetTimeout = powerAuthCallbacksConfiguration.getFailureResetTimeout();

-final Integer failureCount = callbackUrlEntity.getFailureCount();
-final LocalDateTime timestampLastFailure = Objects.requireNonNullElse(callbackUrlEntity.getTimestampLastFailure(), LocalDateTime.MAX);
+final int failureCount = failureStats.failureCount();
+final LocalDateTime timestampLastFailure = failureStats.timestampLastFailure();

if (failureCount >= failureThreshold && LocalDateTime.now().minus(resetTimeout).isAfter(timestampLastFailure)) {
+logger.debug("Callback URL reached failure threshold, but before specified reset timeout period, id={}", callbackUrlId);
return false;
}
