Merge pull request #9 from lensesio/feat/LC-166
Support record field timestamp
andrewstevenson authored Mar 29, 2024
2 parents 9a5b46a + 13d66d1 commit 9b32f24
Showing 19 changed files with 2,001 additions and 55 deletions.
126 changes: 126 additions & 0 deletions InsertFieldTimestampHeaders.md
@@ -0,0 +1,126 @@
# Insert Field Timestamp Headers

## Description


This Kafka Connect Single Message Transform (SMT) facilitates the insertion of date and time components (year, month,
day, hour, minute, second) as headers into Kafka messages using a timestamp field within the message payload. The
timestamp field can be in various valid formats, including long integers, strings, or date objects. The timestamp field
can originate from either the record Key or the record Value. When extracting from the record Key, prefix the field
with `_key.`; otherwise, extract from the record Value by default or explicitly using the field without prefixing. For
string-formatted fields, specify a `format.from.pattern` parameter to define the parsing pattern. Long integer fields
are assumed to be Unix timestamps; the desired Unix precision can be specified using the `unix.precision` parameter.
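As a sketch of the two parsing options described above (the transform alias `fieldTs` is just an illustrative name): if the record Value carries a string field `created_at` such as `2024-03-29 10:37:12`, supply the parsing pattern:

```properties
transforms=fieldTs
transforms.fieldTs.type=io.lenses.connect.smt.header.InsertFieldTimestampHeaders
transforms.fieldTs.field=_value.created_at
transforms.fieldTs.format.from.pattern=yyyy-MM-dd HH:mm:ss
```

If `created_at` were instead a Long holding seconds since the epoch, `transforms.fieldTs.unix.precision=seconds` would replace the pattern setting.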

The headers inserted are of type STRING. By using this SMT, you can partition the data by `yyyy-MM-dd/HH`
or `yyyy/MM/dd/HH`, for example, and only use one SMT.

The list of headers inserted are:

* date
* year
* month
* day
* hour
* minute
* second

All headers can be prefixed with a custom prefix. For example, if the prefix is `wallclock_`, then the headers will be:

* wallclock_date
* wallclock_year
* wallclock_month
* wallclock_day
* wallclock_hour
* wallclock_minute
* wallclock_second

When used with the Lenses connectors for S3, GCS or Azure data lake, the headers can be used to partition the data.
Considering the headers have been prefixed by `_`, here are a few KCQL examples:

```
connect.s3.kcql=INSERT INTO $bucket:prefix SELECT * FROM kafka_topic PARTITIONBY _header._date, _header._hour
connect.s3.kcql=INSERT INTO $bucket:prefix SELECT * FROM kafka_topic PARTITIONBY _header._year, _header._month, _header._day, _header._hour
```

## Configuration

| Name | Description | Type | Default |
|-----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------|--------------|
| `field`               | The field name to resolve the timestamp from. Prefix with `_key.` to read from the record Key, or `_value.` to read from the record Value. Without a prefix, the field is resolved from the record Value. | String |              |
| `format.from.pattern` | Optional `DateTimeFormatter`-compatible pattern for the timestamp. Used to parse the input if the input is a string.                                                                 | String |              |
| `unix.precision`      | Optional. The desired Unix precision for the timestamp: seconds, milliseconds, microseconds, or nanoseconds. Used to parse the input if the input is a Long.                         | String | milliseconds |
| `header.prefix.name` | Optional header prefix. | String | |
| `date.format` | Optional Java date time formatter. | String | yyyy-MM-dd |
| `year.format` | Optional Java date time formatter for the year component. | String | yyyy |
| `month.format` | Optional Java date time formatter for the month component. | String | MM |
| `day.format` | Optional Java date time formatter for the day component. | String | dd |
| `hour.format` | Optional Java date time formatter for the hour component. | String | HH |
| `minute.format` | Optional Java date time formatter for the minute component. | String | mm |
| `second.format` | Optional Java date time formatter for the second component. | String | ss |
| `timezone` | Optional. Sets the timezone. It can be any valid Java timezone. | String | UTC |
| `locale` | Optional. Sets the locale. It can be any valid Java locale. | String | en |

## Example

To use the record Value field named `created_at` as the unix timestamp, use the following:

```properties
transforms=fieldTs
transforms.fieldTs.type=io.lenses.connect.smt.header.InsertFieldTimestampHeaders
transforms.fieldTs.field=_value.created_at
```

To use the record Key field named `created_at` as the unix timestamp, use the following:

```properties
transforms=fieldTs
transforms.fieldTs.type=io.lenses.connect.smt.header.InsertFieldTimestampHeaders
transforms.fieldTs.field=_key.created_at
```

To prefix the headers with `wallclock_`, use the following:

```properties
transforms=fieldTs
transforms.fieldTs.type=io.lenses.connect.smt.header.InsertFieldTimestampHeaders
transforms.fieldTs.field=created_at
transforms.fieldTs.header.prefix.name=wallclock_
```

To change the date format, use the following:

```properties
transforms=fieldTs
transforms.fieldTs.type=io.lenses.connect.smt.header.InsertFieldTimestampHeaders
transforms.fieldTs.field=created_at
transforms.fieldTs.date.format=yyyy-MM-dd
```

To use the timezone `Asia/Kolkata`, use the following:

```properties
transforms=fieldTs
transforms.fieldTs.type=io.lenses.connect.smt.header.InsertFieldTimestampHeaders
transforms.fieldTs.field=created_at
transforms.fieldTs.timezone=Asia/Kolkata
```

To facilitate S3, GCS, or Azure Data Lake partitioning using a Hive-like partition name format, such
as `date=yyyy-MM-dd / hour=HH`, employ the following SMT configuration for a partition strategy.

```properties
transforms=fieldTs
transforms.fieldTs.type=io.lenses.connect.smt.header.InsertFieldTimestampHeaders
transforms.fieldTs.field=created_at
transforms.fieldTs.date.format="date=yyyy-MM-dd"
transforms.fieldTs.hour.format="hour=HH"
```

In the KCQL setting, use the headers as partitioning keys:

```properties
connect.s3.kcql=INSERT INTO $bucket:prefix SELECT * FROM kafka_topic PARTITIONBY _header.date, _header.hour
```
129 changes: 129 additions & 0 deletions InsertRollingFieldTimestampHeaders.md
@@ -0,0 +1,129 @@
# Insert Rolling Field Timestamp Headers

## Description

A Kafka Connect Single Message Transform (SMT) that inserts date, year, month, day, hour, minute and second headers using
a timestamp field from the record payload and a rolling time window configuration. The timestamp field can be in various
valid formats, including long integers, strings, or date objects. The timestamp field
can originate from either the record Key or the record Value. When extracting from the record Key, prefix the field
with `_key.`; otherwise, extract from the record Value by default or explicitly using the field without prefixing. For
string-formatted fields, specify a `format.from.pattern` parameter to define the parsing pattern. Long integer fields
are assumed to be Unix timestamps; the desired Unix precision can be specified using the `unix.precision` parameter.
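As an illustrative sketch (the alias `rollingWindow` is just an example name), a string-formatted record Key field combined with a 15-minute rolling window could be configured as:

```properties
transforms=rollingWindow
transforms.rollingWindow.type=io.lenses.connect.smt.header.InsertRollingFieldTimestampHeaders
transforms.rollingWindow.field=_key.created_at
transforms.rollingWindow.format.from.pattern=yyyy-MM-dd HH:mm:ss
transforms.rollingWindow.rolling.window.type=minutes
transforms.rollingWindow.rolling.window.size=15
```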

The headers inserted are of type STRING. By using this SMT, you can partition the data by `yyyy-MM-dd/HH`
or `yyyy/MM/dd/HH`, for example, and only use one SMT.

The list of headers inserted are:

* date
* year
* month
* day
* hour
* minute
* second

All headers can be prefixed with a custom prefix. For example, if the prefix is `wallclock_`, then the headers will be:

* wallclock_date
* wallclock_year
* wallclock_month
* wallclock_day
* wallclock_hour
* wallclock_minute
* wallclock_second

When used with the Lenses connectors for S3, GCS or Azure data lake, the headers can be used to partition the data.
Considering the headers have been prefixed by `_`, here are a few KCQL examples:

```
connect.s3.kcql=INSERT INTO $bucket:prefix SELECT * FROM kafka_topic PARTITIONBY _header._date, _header._hour
connect.s3.kcql=INSERT INTO $bucket:prefix SELECT * FROM kafka_topic PARTITIONBY _header._year, _header._month, _header._day, _header._hour
```

## Configuration

| Name | Description | Type | Default |
|-----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------|--------------|
| `field`               | The field name to resolve the timestamp from. Prefix with `_key.` to read from the record Key, or `_value.` to read from the record Value. Without a prefix, the field is resolved from the record Value. | String |              |
| `format.from.pattern` | Optional `DateTimeFormatter`-compatible pattern for the timestamp. Used to parse the input if the input is a string.                                                                 | String |              |
| `unix.precision`      | Optional. The desired Unix precision for the timestamp: seconds, milliseconds, microseconds, or nanoseconds. Used to parse the input if the input is a Long.                         | String | milliseconds |
| `header.prefix.name` | Optional header prefix. | String | |
| `date.format` | Optional Java date time formatter. | String | yyyy-MM-dd |
| `year.format` | Optional Java date time formatter for the year component. | String | yyyy |
| `month.format` | Optional Java date time formatter for the month component. | String | MM |
| `day.format` | Optional Java date time formatter for the day component. | String | dd |
| `hour.format` | Optional Java date time formatter for the hour component. | String | HH |
| `minute.format` | Optional Java date time formatter for the minute component. | String | mm |
| `second.format` | Optional Java date time formatter for the second component. | String | ss |
| `timezone` | Optional. Sets the timezone. It can be any valid Java timezone. | String | UTC |
| `locale` | Optional. Sets the locale. It can be any valid Java locale. | String | en |
| `rolling.window.type` | Sets the rolling window unit: `seconds`, `minutes` or `hours`.                                                                                                | String | minutes      |
| `rolling.window.size` | Sets the window size. It can be any positive integer, and depending on the `rolling.window.type` it has an upper bound: 60 for seconds and minutes, and 24 for hours. | Int    | 15           |
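The rolling window rounds a record's timestamp down to the start of its window before the headers are formatted. The SMT's actual implementation is in the accompanying Java sources; the following is only an illustrative, self-contained sketch of the rounding arithmetic for a minutes-based window:

```java
import java.time.Instant;

public class RollingWindowSketch {
  // Illustrative only: round a timestamp down to the start of its
  // rolling window, with the window size expressed in minutes.
  static Instant roundDown(Instant ts, long windowMinutes) {
    long windowMillis = windowMinutes * 60_000L;
    long epochMillis = ts.toEpochMilli();
    return Instant.ofEpochMilli(epochMillis - (epochMillis % windowMillis));
  }

  public static void main(String[] args) {
    // With a 15-minute window, 10:37:12 falls into the window starting at 10:30:00.
    System.out.println(roundDown(Instant.parse("2024-03-29T10:37:12Z"), 15));
  }
}
```

With `rolling.window.size=15`, every record stamped between 10:30:00 and 10:44:59 would therefore produce the same `minute` header value.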

## Example

To use the record Value field named `created_at` with a 15-minute rolling window, use the following configuration:

```properties
transforms=rollingWindow
transforms.rollingWindow.type=io.lenses.connect.smt.header.InsertRollingFieldTimestampHeaders
transforms.rollingWindow.field=created_at
transforms.rollingWindow.rolling.window.type=minutes
transforms.rollingWindow.rolling.window.size=15
```

To prefix the headers with `wallclock_`, use the following:

```properties
transforms=rollingWindow
transforms.rollingWindow.type=io.lenses.connect.smt.header.InsertRollingFieldTimestampHeaders
transforms.rollingWindow.field=created_at
transforms.rollingWindow.header.prefix.name=wallclock_
transforms.rollingWindow.rolling.window.type=minutes
transforms.rollingWindow.rolling.window.size=15
```

To change the date format, use the following:

```properties
transforms=rollingWindow
transforms.rollingWindow.type=io.lenses.connect.smt.header.InsertRollingFieldTimestampHeaders
transforms.rollingWindow.field=created_at
transforms.rollingWindow.header.prefix.name=wallclock_
transforms.rollingWindow.rolling.window.type=minutes
transforms.rollingWindow.rolling.window.size=15
transforms.rollingWindow.date.format="date=yyyy-MM-dd"
```

To use the timezone `Asia/Kolkata`, use the following:

```properties
transforms=rollingWindow
transforms.rollingWindow.type=io.lenses.connect.smt.header.InsertRollingFieldTimestampHeaders
transforms.rollingWindow.field=created_at
transforms.rollingWindow.header.prefix.name=wallclock_
transforms.rollingWindow.rolling.window.type=minutes
transforms.rollingWindow.rolling.window.size=15
transforms.rollingWindow.timezone=Asia/Kolkata
```

To facilitate S3, GCS, or Azure Data Lake partitioning using a Hive-like partition name format, such
as `date=yyyy-MM-dd / hour=HH`, employ the following SMT configuration for a partition strategy.

```properties
transforms=rollingWindow
transforms.rollingWindow.type=io.lenses.connect.smt.header.InsertRollingFieldTimestampHeaders
transforms.rollingWindow.field=created_at
transforms.rollingWindow.rolling.window.type=minutes
transforms.rollingWindow.rolling.window.size=15
transforms.rollingWindow.timezone=Asia/Kolkata
transforms.rollingWindow.date.format="date=yyyy-MM-dd"
transforms.rollingWindow.hour.format="hour=HH"
```

In the KCQL setting, use the headers as partitioning keys:

```properties
connect.s3.kcql=INSERT INTO $bucket:prefix SELECT * FROM kafka_topic PARTITIONBY _header.date, _header.hour
```
2 changes: 2 additions & 0 deletions README.md
@@ -16,6 +16,8 @@ Furthermore, they support [Stream-Reactor](https://github.com/lensesio/stream-re
* [InsertRollingRecordTimestampHeaders](./InsertRollingRecordTimestampHeaders.md) - Inserts date, year, month, day, hour, minute, and second headers using the record timestamp and a rolling time window configuration.
* [InsertRollingWallclockHeaders](./InsertRollingWallclockHeaders.md) - Inserts date, year, month, day, hour, minute, and second headers using the system timestamp and a rolling time window configuration.
* [InsertRecordTimestampHeaders](./InsertRecordTimestampHeaders.md) - Inserts date, year, month, day, hour, minute, and second headers using the record timestamp.
* [InsertFieldTimestampHeaders](./InsertFieldTimestampHeaders.md) - Inserts date, year, month, day, hour, minute, and second headers using a field in the payload, record Key or Value.
* [InsertRollingFieldTimestampHeaders](./InsertRollingFieldTimestampHeaders.md) - Inserts date, year, month, day, hour, minute, and second headers using a field in the payload, record Key or Value and a rolling window boundary.
* [InsertWallclockHeaders](./InsertWallclockHeaders.md) - Inserts date, year, month, day, hour, minute, and second headers using the system clock.
* [TimestampConverter](./TimestampConverter.md) - Converts a timestamp field in the payload, record Key or Value to a different format, and optionally applies a rolling window boundary. An adapted version of the one packed in the Kafka Connect framework.
* [InsertWallclockDateTimePart](./InsertWallclockDateTimePart.md) - Inserts the system clock year, month, day, minute, or seconds as a message header, with a value of type STRING.
17 changes: 17 additions & 0 deletions src/main/java/io/lenses/connect/smt/header/FieldType.java
@@ -0,0 +1,17 @@
/**
* Licensed to the Apache Software Foundation (ASF) under one or more contributor license
* agreements. See the NOTICE file distributed with this work for additional information regarding
* copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance with the License. You may obtain a
* copy of the License at: http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable
* law or agreed to in writing, software distributed under the License is distributed on an "AS IS"
* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License
* for the specific language governing permissions and limitations under the License.
*/
package io.lenses.connect.smt.header;

enum FieldType {
KEY,
VALUE,
TIMESTAMP
}
17 changes: 17 additions & 0 deletions src/main/java/io/lenses/connect/smt/header/FieldTypeConstants.java
@@ -0,0 +1,17 @@
/**
* Licensed to the Apache Software Foundation (ASF) under one or more contributor license
* agreements. See the NOTICE file distributed with this work for additional information regarding
* copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance with the License. You may obtain a
* copy of the License at: http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable
* law or agreed to in writing, software distributed under the License is distributed on an "AS IS"
* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License
* for the specific language governing permissions and limitations under the License.
*/
package io.lenses.connect.smt.header;

public class FieldTypeConstants {
public static final String KEY_FIELD = "_key";
public static final String VALUE_FIELD = "_value";
public static final String TIMESTAMP_FIELD = "_timestamp";
}
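Taken together, the new `FieldType` enum and `FieldTypeConstants` suggest how a configured field name such as `_key.created_at` is mapped to the part of the record it is resolved against. The following is a hypothetical, self-contained sketch of that resolution, mirroring the `_key`/`_value` prefix rules described in the documentation above (not the SMT's actual code):

```java
public class FieldResolutionSketch {
  enum FieldType { KEY, VALUE, TIMESTAMP }

  // Hypothetical sketch: map a configured field name to the record part
  // it should be resolved against. Unprefixed names default to the Value.
  static FieldType resolve(String field) {
    if (field.equals("_key") || field.startsWith("_key.")) {
      return FieldType.KEY;
    }
    if (field.equals("_timestamp")) {
      return FieldType.TIMESTAMP;
    }
    return FieldType.VALUE; // covers both "_value.x" and plain "x"
  }

  public static void main(String[] args) {
    System.out.println(resolve("_key.created_at")); // resolved from the Key
    System.out.println(resolve("created_at"));      // resolved from the Value
  }
}
```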
