Place offset manager in commons #373

Open
wants to merge 51 commits into s3-source-release from KCON-57_place_OffsetManager_in_commons
Conversation


@Claudenw Claudenw commented Dec 16, 2024

Fix for KCON-57

While this looks like a large change, in multiple cases files were migrated from the s3-source-connector to the common module, so those files are counted twice. This change also removes unused classes/files.

Significant changes are in OffsetManager, S3SourceTask, S3SourceRecord and AWSV2SourceClient.

Made OffsetManager generic to handle multiple OffsetManagerEntry types while simplifying access from sources.

Each source should implement an instance of OffsetManager.OffsetManagerEntry that tracks the specific data for that source.

The OffsetManagerEntry is included in the source-specific record (e.g. S3SourceRecord), is updated as processing continues, and is the source of record for many of the S3- and Kafka-specific values (e.g. partition, topic, S3Object key) as well as some dynamic data such as the current record number.

The Transformer was modified to update the OffsetManagerEntry as records are returned.

Due to a bug in Kafka (KAFKA-14947), this implementation cannot guarantee write-once functionality: https://issues.apache.org/jira/browse/KAFKA-14947

Added javadoc.
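
As a rough illustration of the pattern described above, here is a hedged sketch of what a source-specific entry might look like. The constant names and method shapes are borrowed from code excerpts quoted later in this review; the actual OffsetManager.OffsetManagerEntry interface in the PR may differ.

import java.util.HashMap;
import java.util.Map;

// Hedged sketch only: in the PR this class would implement OffsetManager.OffsetManagerEntry;
// the interface details shown here are assumptions for illustration.
public class ExampleOffsetManagerEntry {
    public static final String BUCKET = "bucket";
    public static final String OBJECT_KEY = "objectKey";
    public static final String RECORD_COUNT = "recordCount";

    private final Map<String, Object> data = new HashMap<>();
    private long recordCount;

    public ExampleOffsetManagerEntry(final String bucket, final String objectKey) {
        data.put(BUCKET, bucket);
        data.put(OBJECT_KEY, objectKey);
    }

    /** The key that identifies this entry in the offset topic (bucket + object key). */
    public Map<String, Object> getPartitionMap() {
        return Map.of(BUCKET, data.get(BUCKET), OBJECT_KEY, data.get(OBJECT_KEY));
    }

    /** The values written as the source offset, including dynamic data such as the record count. */
    public Map<String, Object> getProperties() {
        final Map<String, Object> result = new HashMap<>(data);
        result.put(RECORD_COUNT, recordCount);
        return result;
    }

    /** Called as each record is produced, so the offset reflects progress through the object. */
    public void incrementRecordCount() {
        recordCount++;
    }
}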

@Claudenw Claudenw force-pushed the KCON-57_place_OffsetManager_in_commons branch from b5278e0 to 69ea274 Compare December 17, 2024 15:15
@Claudenw Claudenw marked this pull request as ready for review December 19, 2024 08:32
@Claudenw Claudenw requested review from a team as code owners December 19, 2024 08:32
@Claudenw (Contributor Author)

Unit tests pass; there is an issue with the integration tests not picking up the changes in commons.

if (objectListing.isTruncated()) {
    // get the next set of data and create an iterator on it.
    request.setStartAfter(null);
    request.withContinuationToken(objectListing.getContinuationToken());
Contributor:

I am pretty sure the continuation token is all that is required here; you can create a new request and only add the continuation token (it possibly also requires the bucket, though).
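
A minimal sketch of that suggestion, assuming the ListObjectsV2Request type used in the snippet above (the bucketName variable is a placeholder):

// Build a fresh request that carries only the bucket and the continuation token
// returned by the previous listing; other filters are intentionally omitted.
final ListObjectsV2Request nextRequest = new ListObjectsV2Request()
        .withBucketName(bucketName)
        .withContinuationToken(objectListing.getContinuationToken());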

@aindriu-aiven aindriu-aiven left a comment

I had a few comments; some are for future follow-ups, but we should create issues for them so we don't miss them.

    throw new AmazonClientException(e);
}
this.s3ObjectIterator = IteratorUtils.filteredIterator(sourceClient.getIteratorOfObjects(null),
        s3Object -> extractOffsetManagerEntry(s3Object));
Contributor:

Lambda can be replaced with method reference

Suggested change
s3Object -> extractOffsetManagerEntry(s3Object));
this::extractOffsetManagerEntry);

* the Abstract Config to use.
* @return a Stream of SchemaAndValue objects.
*/
public final Stream<SchemaAndValue> getRecords(final IOSupplier<InputStream> inputStreamIOSupplier,
Contributor:

This is looking great, a much simplified version.

@muralibasani muralibasani left a comment

Need to find out why no events are pushed to the Kafka offsets topic.

@@ -119,6 +118,7 @@ public List<SourceRecord> poll() throws InterruptedException {

while (!connectorStopped.get()) {
try {
waitForObjects();
extractSourceRecords(results);
LOGGER.info("Number of records extracted and sent: {}", results.size());
return results;
@muralibasani muralibasani Dec 20, 2024

I have an extract of what is sent to the Kafka offsets topic before this PR and with this PR.

Before this PR:

SourceRecord{
	sourcePartition={bucket=test-bucket0, topic=bytesTest, topicPartition=0},
 	sourceOffset={object_key_s3-source-connector-for-apache-kafka-test-2024-12-20T13:34:01.62052/bytesTest-00000-1734698057527.txt=1}
 }
  ConnectRecord{topic='bytesTest', kafkaPartition=0, key=[B@6e96f788, keySchema=null, value=[B@49e57a97, valueSchema=null, timestamp=null, headers=ConnectHeaders(headers=)}

With this PR:

SourceRecord{
    sourcePartition={partition=0, bucket=test-bucket0, objectKey=s3-source-connector-for-apache-kafka-test-2024-12-20T13:28:08.047694/bytesTest-00000-1734697707480.txt, topic=bytesTest},
    sourceOffset={bucket=test-bucket0, topic=bytesTest, partition=0, objectKey=s3-source-connector-for-apache-kafka-test-2024-12-20T13:28:08.047694/bytesTest-00000-1734697707480.txt, recordCount=0}
}
    ConnectRecord{topic='bytesTest', kafkaPartition=0, key=[B@67e2252f, keySchema=null, value=[B@1d001ae2, valueSchema=null, timestamp=null, headers=ConnectHeaders(headers=)}
  • There are some duplicate keys sent in sourcePartition and sourceOffset, which should be removed.
  • I have tested locally, and no events are pushed to the connect-offset-topic- topic.

I am not sure where the problem is; I am going to debug further. It may be something to do with the new structure.

Contributor Author:

Partition has been changed to contain only bucket and S3Object.key()
Offset has been changed to only contain the number of records produced.

Contributor:

Partition should not contain any information related to objects and keys.

It should only contain partition ids.

I see sourcePartition now has bucket and objectKey. Move them to sourceOffset.
recordCount is part of sourceOffset; create a map for every object key and value to retrieve them.
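
For illustration, a hedged sketch of the layout this comment proposes (values are placeholders, not the PR's actual code): sourcePartition identifies only the partition, while sourceOffset carries the per-object state.

import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;

// Sketch of the proposed split: partition identity vs. per-object progress.
final Map<String, Object> sourcePartition = Map.of("bucket", "test-bucket0", "partition", 0);
final Map<String, Object> sourceOffset = Map.of(
        "objectKey", "some/object/key.txt",
        "recordCount", 42L);
final byte[] valueBytes = new byte[0]; // placeholder payload
final SourceRecord record = new SourceRecord(sourcePartition, sourceOffset,
        "bytesTest", Schema.BYTES_SCHEMA, valueBytes);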

@Claudenw Claudenw mentioned this pull request Dec 20, 2024
 */
@Override
public OffsetManager.OffsetManagerKey getManagerKey() {
    return () -> Map.of(BUCKET, data.get(BUCKET), OBJECT_KEY, data.get(OBJECT_KEY));
Contributor:

Instead of storing the object key as part of the key, it is better to store partition ids in the key.
We will have a smaller number of keys.

I just verified the Lenses S3 source connector and the Adobe S3 source connector, and they store partition ids.

Can we think about this too?

Contributor:

We have the topic.partitions config; our earlier implementation was based on this.

Contributor:

@gharris1727 your suggestion will be helpful here.
According to the javadocs of the OffsetStorageReader.offsets() method, I was thinking we would have to store the topic and partition id in the offset storage keys, at least?

@Override
public OffsetManager.OffsetManagerKey getManagerKey() {
    return () -> Map.of(BUCKET, data.get(BUCKET), TOPIC, data.get(TOPIC), PARTITION, data.get(PARTITION));
}

When we have several objects under the specified topics and partitions, and we want to retrieve the stored offset map, how can we better structure the keys?
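
Regardless of how the partition map is structured, here is a hedged sketch of the bulk lookup being discussed, using the Kafka Connect OffsetStorageReader API (the context variable is the SourceTaskContext handed to the task, and the key contents are placeholders):

import java.util.List;
import java.util.Map;

// Bulk-retrieve stored offsets for a set of partition maps in one call,
// rather than reading every stored offset.
final List<Map<String, Object>> keys = List.of(
        Map.of("bucket", "test-bucket0", "objectKey", "some/object/key.txt"));
final Map<Map<String, Object>, Map<String, Object>> stored =
        context.offsetStorageReader().offsets(keys);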

Contributor Author:

We need to look at a couple of things.

  1. When pulling the data from Kafka we only need the file location (bucket and S3Object key). All other items are currently extracted from the key, so the bucket and key uniquely identify the object in S3 (see the sketch below).
  2. Adding more elements to the key means that we need to extract those items before we can look up the data in the offset manager.

Finally, the implementation for S3 is specific to the S3 source and does not impact the commons OffsetManager implementation.
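
A hedged sketch of the single-object lookup described in point 1; the constant values and the context variable are placeholders, not the PR's code:

import java.util.Map;

// Look up the stored state for one object using only the bucket and S3 object key.
final Map<String, Object> partitionMap =
        Map.of("bucket", "test-bucket0", "objectKey", "some/object/key.txt");
final Map<String, Object> storedOffset = context.offsetStorageReader().offset(partitionMap);
final long recordCount = storedOffset == null
        ? 0L
        : ((Number) storedOffset.getOrDefault("recordCount", 0L)).longValue();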

Contributor:

This is incorrect. We need to have only partition information in sourcePartition; sourceOffset should contain the object keys, record counts, etc.

IntegrationBase.consumeOffsetMessages(consumer).forEach(s -> {
    offsetRecs.merge(s.getKey(), s.getRecordCount(), (x, y) -> x > y ? x : y);
});
// FIXME after KAFKA-14947 is fixed.
@muralibasani muralibasani Dec 24, 2024

But it is already working in the feature branch. Not sure if it's totally related.

Contributor Author:

This has been mostly fixed. There are edge cases where KAFKA-14947 applies.

@Claudenw Claudenw force-pushed the KCON-57_place_OffsetManager_in_commons branch from 0eaf29a to 2637f64 Compare January 9, 2025 13:35
* @return the entry.
*/
public Optional<E> getEntry(final OffsetManagerKey key, final Function<Map<String, Object>, E> creator) {
LOGGER.info("getEntry: {}", key.getPartitionMap());
Contributor:

This should probably be debug, given the number of times we could be accessing this.

Contributor:

Actually, all of these info logs could be debug.

Contributor Author:

fixed

@@ -262,7 +265,7 @@ static Map<String, Object> consumeOffsetMessages(KafkaConsumer<byte[], byte[]> c
for (final ConsumerRecord<byte[], byte[]> record : records) {
Map<String, Object> offsetRec = OBJECT_MAPPER.readValue(record.value(), new TypeReference<>() { // NOPMD
});
messages.putAll(offsetRec);
messages.put((String) offsetRec.get(OBJECT_KEY), offsetRec.get(RECORD_COUNT));
@aindriu-aiven aindriu-aiven Jan 10, 2025

The record value has changed; it now brings back the record count only and not any details of the key.
To get the key we need to change this for loop to:

for (final ConsumerRecord<byte[], byte[]> record : records) {
    Map<String, Object> offsetRec = OBJECT_MAPPER.readValue(record.value(), new TypeReference<>() { // NOPMD
    });
    List<Object> key = OBJECT_MAPPER.readValue(record.key(), new TypeReference<>() { // NOPMD
    });
    // key.get(0) is always the connector name; that could be added as a check here if we wanted.
    Map<String, Object> keyDetails = (Map<String, Object>) key.get(1);
    messages.put((String) keyDetails.get(OBJECT_KEY), offsetRec.get(RECORD_COUNT));
}

Contributor:

Alternatively go to S3OffsetManagerEntry and alter getProperties() as below to put the objectKey back into the value.

@Override
public Map<String, Object> getProperties() {
    final Map<String, Object> result = new HashMap<>(data);
    result.put(RECORD_COUNT, recordCount);
    result.put(OBJECT_KEY, objectKey);
    return result;
}

Contributor Author:

I went with the first option.

* the key for the entry to remove.
*/
public void remove(final OffsetManagerKey key) {
LOGGER.info("Removing: {}", key.getPartitionMap());
Contributor:

Debug here too, please.

@aindriu-aiven aindriu-aiven left a comment

The IntegrationBase needs to be updated, and I had a couple of small questions and nits.

messages.putAll(offsetRec);
final List<Object> key = OBJECT_MAPPER.readValue(record.key(), new TypeReference<>() { // NOPMD
});
final Map<String, Object> keyDetails = (Map<String, Object>) key.get(1);
Contributor:

NIT: add a comment about key.get(0) being the name of the connector the commit is from.

@aindriu-aiven aindriu-aiven left a comment

LGTM thank you

@muralibasani muralibasani left a comment

The way SourceRecord is populated is incorrect; sourcePartition and sourceOffset should have the right information.
sourceOffset with just the recordCount field is not correct at all.
If sourcePartition contains the partition id, sourceOffset would contain all the keys and corresponding offset positions, which makes the map simple and readable.

@Claudenw Claudenw commented Jan 13, 2025

The way SourceRecord is populated is incorrect. sourcePartition and sourceOffset should have the right information. sourceOffset with just recordCount field is not correct at all. If sourcePartition contains partitionId, sourceOffset would contain all the keys and corresponding offset positions which makes the map simple and readable.

The proposed structure leads to 2 problems:

  1. Since Kafka does the updates using the data from the SourceRecord, every SourceRecord will need to include all the keys and corresponding offset positions on every update. The size of this data would continue to increase without bound.

  2. We cannot request data on a subset of records from Kafka; we have to retrieve all the data on every request. The Kafka context object allows us to provide a list of keys (source partitions) and retrieve information about those partitions, so we can still do bulk requests.

@muralibasani (Contributor)

Either way, the size of the map would increase, so that concern should be eliminated. And a process should be in place to remove all the processed objects and offset positions, which makes the map smaller.

The problem I see here is that storing object keys in sourcePartition only makes the map larger, and looking at the javadocs of SourceRecord for sourcePartition and sourceOffset, these maps are not compatible.

@muralibasani muralibasani left a comment

Looks good. A few minor comments.

* @param key
* the key for the entry to remove.
*/
public void remove(final OffsetManagerKey key) {
Contributor:

Suggested change
public void remove(final OffsetManagerKey key) {
public void removeOffsetEntry(final OffsetManagerKey key) {

Contributor:

And I could not find any dependencies for this.

Contributor Author:

This method and the one on line 133 should have the same name, as they do the same thing; they just use a different argument to get the job done.
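
For illustration only, a hedged guess at that delegation; the OffsetManagerKey lambda shape and the method body are assumptions, not the PR's actual code:

// Hypothetical sketch: the SourceRecord variant adapts the record's sourcePartition()
// map into an OffsetManagerKey and delegates to the key-based overload.
public void remove(final SourceRecord sourceRecord) {
    final Map<String, Object> partitionMap = new HashMap<>(sourceRecord.sourcePartition());
    remove((OffsetManagerKey) () -> partitionMap);
}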

* @param sourceRecord
* the SourceRecord that contains the key to be removed.
*/
public void remove(final SourceRecord sourceRecord) {
Contributor:

Suggested change
public void remove(final SourceRecord sourceRecord) {
public void removeEntry(final SourceRecord sourceRecord) {

@@ -19,6 +19,9 @@
import static io.aiven.kafka.connect.common.config.SchemaRegistryFragment.INPUT_FORMAT_KEY;
import static io.aiven.kafka.connect.common.config.SourceConfigFragment.TARGET_TOPICS;
import static io.aiven.kafka.connect.common.config.SourceConfigFragment.TARGET_TOPIC_PARTITIONS;
Contributor:

Can we delete this config, TARGET_TOPIC_PARTITIONS, from SourceConfigFragment, along with all its dependencies from the tests and readme?

Contributor Author:

My understanding was that this option was used to assign specific partitions to the task. I don't have visibility into how it is used. Opened KCON-100 to track this.

Contributor:

I have a clean-up PR for removing unused config; if we need this removed as part of it, I can add it to that.

assertThat(result).isNotPresent();
}

@SuppressWarnings("PMD.TestClassWithoutTestCases") // TODO figure out why this fails.
Contributor:

If this fails, can we create a ticket?

Contributor Author:

PMD thinks this is a test class, but it has nothing annotated with @Test, so it fails with this error. The note is to figure out why and how to get around the problem.

/** The record count for the data map. Extracted here because it is used/updated frequently during processing. */
private long recordCount;

private final String bucket;
Contributor:

In some places there is a mention of 'bucketName'. Can we make it consistent, and the same with topic?

Contributor Author:

The AWS documentation calls the String bucket.
The Kafka documentation calls the String topic.

Changed the code to align with those standards.
