
Commit 786e006: Fix Documentation

anmunoz committed Oct 16, 2019
1 parent 55d6f3f commit 786e006
Showing 3 changed files with 9 additions and 9 deletions.
10 changes: 5 additions & 5 deletions docs/processors_catalogue/ngsi_carto_sink.md
@@ -29,7 +29,7 @@ It must be said [PostgreSQL only accepts](https://www.postgresql.org/docs/curren

PostgreSQL [databases name length](http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS) is limited to 63 characters.

Also, because of a Carto requirement, the name must begin with a letter (a-z).

[Top](#top)
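
As a quick illustration of the two constraints above, here is a minimal Python sketch (a hypothetical helper, not part of Draco) that checks a candidate name:

```python
import re

# Hypothetical helper illustrating the naming rules stated above:
# at most 63 characters (PostgreSQL identifier limit) and beginning
# with a letter a-z (Carto requirement). The trailing character set
# [a-z0-9_] is an assumption for this sketch, not a documented rule.
def is_valid_carto_name(name: str) -> bool:
    if len(name) > 63:
        return False
    return re.match(r"^[a-z][a-z0-9_]*$", name) is not None

assert is_valid_carto_name("my_database")
assert not is_valid_carto_name("1database")  # must begin with a letter
```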

@@ -40,7 +40,7 @@ It must be said [PostgreSQL only accepts](https://www.postgresql.org/docs/curren

PostgreSQL [schemas name length](http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS) is limited to 63 characters.

Also, because of a Carto requirement, the name must begin with a letter (a-z).

[Top](#top)

#### PostgreSQL table naming conventions
@@ -85,7 +85,7 @@ The `enable_raw` parameter must be enabled, unless `enable_distance` is not a
- `geo:point`: a point.
- `geo:json`: GeoJSON representing a point.
- `the_geom_webmercator`: exactly the same as `the_geom`, but changing the EPSG reference system to 3857 (see the sketch after this list).
- For each non-geolocated attribute, the insert will contain two additional fields, one named with the `attrName` received and another with the metadata.

[Top](#top)
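
The `the_geom` / `the_geom_webmercator` relationship boils down to a coordinate re-projection. A minimal sketch, assuming `the_geom` holds WGS84 (EPSG:4326) longitude/latitude, of the spherical Web Mercator (EPSG:3857) conversion:

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)

# Hypothetical helper, not Draco code: converts a lon/lat pair, as carried
# by a `geo:point` attribute, to the EPSG:3857 coordinates that would be
# stored in `the_geom_webmercator`.
def to_web_mercator(lon_deg: float, lat_deg: float) -> tuple:
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return (x, y)

print(to_web_mercator(-3.7, 40.42))  # Madrid, roughly (-4.12e5, 4.93e6) metres
```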

@@ -451,7 +451,7 @@ In addition, the same values but for the insertion in Carto


## Administration guide
## Configuration

@@ -470,7 +470,7 @@ NGSIToCarto is configured through the following parameters:
|Parameter |Default |Allowed values |Description |
|---|---|---|---|
|Swap coordinates |false |true, false |`true` swaps the positions of latitude and longitude |
|Enable lowercase |true |true, false |`true` creates the schema and table names in lowercase |
|**Batch size** |10 | |The preferred number of FlowFiles to put to the database in a single transaction |
|Transaction timeout |no |30 |Specifies how errors are handled. By default (`false`), if there is an error the FlowFile is routed to "failure" or "retry". If enabled, failed FlowFiles stay in the input relationship without penalization and are processed repeatedly until processed successfully or removed. |

An example of this configuration can be:
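
For instance, a plausible set of values mirroring the parameter table above (the property names and rendering are assumptions for illustration, not an exact processor export):

```
# Hypothetical property listing for an NGSIToCarto processor;
# names and values mirror the parameter table above.
Swap coordinates    = false
Enable lowercase    = true
Batch size          = 10
Transaction timeout = 30
```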

2 changes: 1 addition & 1 deletion docs/processors_catalogue/ngsi_cassandra_sink.md
@@ -3,7 +3,7 @@
## Functionality

`NGSIToCassandra` is a processor designed to persist NGSI-like context data events within a
[Cassandra server](http://cassandra.apache.org/). Usually, such context data is notified by an
[Orion Context Broker](https://github.com/telefonicaid/fiware-orion) instance, but it could be any other system speaking
the <i>NGSI language</i>.

6 changes: 3 additions & 3 deletions docs/processors_catalogue/ngsi_hdfs_sink.md
@@ -11,7 +11,7 @@ Next sections will explain this in detail.
### Mapping NGSI events to `NGSIEvent` objects
Notified NGSI events (containing context data) are transformed into `NGSIEvent` objects (for each context element a `NGSIEvent` is created; such an event is a mix of certain headers and a `ContextElement` object), independently of the NGSI data generator or the final backend where it is persisted.

This is done at the Draco-ngsi HTTP listeners (in Flume jargon, sources) thanks to [`NGSIRestHandler`](./processors_catalogue/ngsi_rest_handler.md). Once translated, the data (now, as `NGSIEvent` objects) is put into the internal channels for future consumption (see next section).
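
The shape described above (a mix of headers and a `ContextElement`) can be pictured with a minimal Python sketch; the field names here are hypothetical, not the actual Draco classes:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical shapes, for illustration only; the real Draco classes differ.
@dataclass
class ContextElement:
    entity_id: str
    entity_type: str
    attributes: Dict[str, Any] = field(default_factory=dict)

@dataclass
class NGSIEvent:
    headers: Dict[str, str]          # e.g. fiware-service, fiware-servicepath
    context_element: ContextElement  # one NGSIEvent per notified context element
```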

[Top](#top)

@@ -300,15 +300,15 @@ There exists an [issue](https://github.com/telefonicaid/fiware-cosmos/issues/111
[Top](#top)

#### About batching
`NGSIToHDFS` extends `NGSISink`, which provides a built-in mechanism for collecting events from the internal Flume channel. This mechanism allows extending classes to deal only with the persistence details of such a batch of events in the final backend.

What is important regarding the batch mechanism is that it greatly increases the performance of the sink, because the number of writes is dramatically reduced. Consider, for example, a batch of 100 `NGSIEvent`s. In the best case, all these events regard the same entity, which means all the data within them will be persisted in the same HDFS file. If processing the events one by one, we would need 100 writes to HDFS; in this example, only one write is required. Obviously, not all the events will always regard the same unique entity, and many entities may be involved within a batch. But that's not a problem, since several sub-batches of events are created within a batch, one sub-batch per final destination HDFS file (see the sketch below). In the worst case, the 100 events will regard 100 different entities (100 different HDFS destinations), but that will not be the usual scenario. Thus, assuming a realistic number of 10-15 sub-batches per batch, we are replacing the 100 writes of the event-by-event approach with only 10-15 writes.
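
A minimal sketch of that grouping step (a hypothetical helper, not the actual `NGSISink` code):

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

# Group a batch of events into sub-batches keyed by their destination HDFS
# file, so one write per destination replaces one write per event.
def group_into_subbatches(events: Iterable[dict],
                          destination: Callable[[dict], str]) -> Dict[str, List[dict]]:
    subbatches: Dict[str, List[dict]] = defaultdict(list)
    for event in events:
        subbatches[destination(event)].append(event)
    return dict(subbatches)

# 100 events over 3 entities collapse into 3 writes instead of 100.
events = [{"entity": f"room{i % 3}", "value": i} for i in range(100)]
print({k: len(v) for k, v in group_into_subbatches(events, lambda e: e["entity"]).items()})
```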

The batch mechanism adds an accumulation timeout to prevent the sink from staying in an eternal state of batch building when no new data arrives. If such a timeout is reached, the batch is persisted as it is.

Regarding the retries of not persisted batches, a couple of parameters are used. On the one hand, a Time-To-Live (TTL) is used, specifying the number of retries Draco will do before definitely dropping the event. On the other hand, a list of retry intervals can be configured. Such a list defines the first retry interval, then the second retry interval, and so on; if the TTL is greater than the length of the list, then the last retry interval is repeated as many times as necessary (see the sketch below).
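
A minimal sketch of that interval selection (a hypothetical helper, not Draco code):

```python
from typing import List, Optional

# Pick the wait before retry attempt n (0-based): past the end of the
# configured list, the last interval repeats; past the TTL, the event drops.
def retry_interval(n: int, intervals: List[int], ttl: int) -> Optional[int]:
    if n >= ttl:
        return None  # TTL exhausted: definitely drop the event
    return intervals[min(n, len(intervals) - 1)]

intervals = [5, 30, 60]  # seconds; hypothetical configuration
print([retry_interval(n, intervals, ttl=5) for n in range(6)])
# -> [5, 30, 60, 60, 60, None]
```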

By default, `NGSIToHDFS` has a configured batch size and batch accumulation timeout of 1 and 30 seconds, respectively. Nevertheless, as explained above, it is highly recommended to increase at least the batch size for performance purposes. Which are the optimal values? The size of the batch is closely related to the transaction size of the channel the events are taken from (it makes no sense for the first to be greater than the second), and it depends on the number of estimated sub-batches as well. The accumulation timeout will depend on how often you want to see new data in the final storage.

[Top](#top)

