Pronounced Scribe Engine
Scribengin is a highly available (HA), high-performance event/log transport that registers data under defined schemas in a variety of end systems. Scribengin lets you run multiple flows of data from a source to a sink. It tolerates failures of individual nodes and performs a full recovery in the case of a complete system failure.
Reads data from sources:
- Kafka
- AWS Kinesis
Writes data to sinks:
- HDFS, Hbase, Hive with HCat Integration and Elastic Search
Additional:
- Monitoring with Ganglia
- Heartbeat Alerting with Nagios
This is part of NeverwinterDP, the Data Pipeline for Hadoop.
To get your VM up and running:
git clone git://github.com/DemandCube/Scribengin
cd Scribengin/vagrant
vagrant up
For more info on how it all works, take a look at [The DevSetup Guide](https://github.com/DemandCube/Scribengin/blob/master/DevSetup.md)
- Mailing List
- IRC channel #Scribengin on irc.freenode.net
See the [NeverwinterDP Guide to Contributing](https://github.com/DemandCube/NeverwinterDP#how-to-contribute)
The core problem is how a distributed application can reliably, and at scale, write data to multiple destination data systems. This requires the ability to do data mapping and partitioning, with optional filtering, for each destination system.
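As an illustration only, the per-record work just described (mapping, optional filtering, and partitioning) could be expressed in Java roughly like this; the interface and names below are hypothetical assumptions, not the actual Scribengin API:

```java
// Hypothetical sketch (not the actual Scribengin API) of the per-record
// responsibilities a flow has: map the record, optionally filter it,
// then pick the destination partition.
public interface RecordTransformer<IN, OUT> {
  /** Map a raw source event into the schema expected by the destination. */
  OUT map(IN sourceRecord);

  /** Drop records the destination does not need (optional filtering). */
  boolean accept(OUT record);

  /** Choose the destination partition, e.g. an HDFS directory,
   *  an HBase row-key prefix, or a Hive/HCatalog partition value. */
  String partitionKey(OUT record);
}
```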
We are currently reorganizing the code for V2 of Scribengin to make it more modular and better structured.
- A Flow is data being moved from a single source to a single sink
- A Source is a system that data is read from (e.g. Kafka, Kinesis)
- A Sink is a destination system that data is written to (e.g. HDFS, HBase, Hive)
- A Tributary is a portion or partition of the data in a Flow (see the sketch after this list for how these pieces fit together)
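To make these terms concrete, here is a minimal Java sketch of how Source, Sink, Tributary, and Flow could relate to each other; these types are illustrative assumptions, not the actual V2 classes:

```java
import java.util.List;

// Hypothetical illustration of the terminology above; not the real V2 classes.
interface Source {                    // e.g. Kafka, Kinesis
  List<Tributary> tributaries();      // the partitions of the incoming data
}

interface Sink {                      // e.g. HDFS, HBase, Hive/HCatalog
  void write(byte[] record) throws Exception;
}

interface Tributary {                 // one partition of a Flow's data
  byte[] next() throws Exception;     // next record, or null when drained
}

// A Flow moves data from a single Source to a single Sink.
class Flow {
  private final Source source;
  private final Sink sink;

  Flow(Source source, Sink sink) {
    this.source = source;
    this.sink = sink;
  }

  // Drain every tributary of the source into the sink once.
  void runOnce() throws Exception {
    for (Tributary tributary : source.tributaries()) {
      byte[] record;
      while ((record = tributary.next()) != null) {
        sink.write(record);
      }
    }
  }
}
```

Splitting a Flow into Tributaries is what would allow the work to be parallelized and recovered per partition.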
See the [NeverwinterDP Guide to Yarn](https://github.com/DemandCube/NeverwinterDP#Yarn)
PoC
- Storm
- Spark-streaming
- Yarn
- Local Mode (Single Node No Yarn)
- Distributed Standalone Cluster (No-Yarn)
- Hadoop Distributed (Yarn)
There is an open question of how to implement guaranteed delivery of logs to end systems.
- Storm to HCat
- Storm to HBase
- Create a framework for plugging in other destination systems
- Architecture Proposal
- Kafka -> HCatalog
- Notification API
- Notification API to close partitions in HCatalog
- Ganglia Integration
- Nagios Integration
- Unix Man page
- Guide
- Untar and deploy - works out of the box
- CentOS Package
- CentOS Repo Setup and Deploy of CentOS Package
- RHEL Package
- RHEL Repo Setup and Deploy of RHEL Package
- Scribengin/Ambari Deployment
- Scribengin/Ambari Monitoring/Ganglia
- Scribengin/Ambari Notification/Nagios
Yarn Documentation
- Add the remote, call it "upstream":
git remote add upstream git@github.com:DemandCube/Scribengin.git
- Fetch all the branches of that remote into remote-tracking branches, such as upstream/master:
git fetch upstream
- Make sure that you're on your master branch:
git checkout master
- Merge upstream changes into your master branch:
git merge upstream/master