Flowbots is a Ruby-based document processing system designed to handle a variety of file types, with an emphasis on text analysis and content extraction. Supported file types:
- Markdown (with YAML frontmatter)
- Structured text (JSON, JSONL, CSV)
- Documents (PDF)
- Media files (Audio, Video)
- Images
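Content type detection over these formats can be sketched with a simple extension-to-type map. The table and helper below are illustrative assumptions, not Flowbots' actual detection logic (which also records the full MIME type):

```ruby
# Illustrative extension-to-type map; names and coverage are assumptions.
TYPE_BY_EXTENSION = {
  ".md" => "text", ".json" => "text", ".jsonl" => "text", ".csv" => "text",
  ".pdf" => "document",
  ".mp3" => "audio", ".wav" => "audio",
  ".mp4" => "video",
  ".png" => "image", ".jpg" => "image"
}.freeze

# Look up a coarse type label from the file extension, case-insensitively.
def detect_type(path)
  TYPE_BY_EXTENSION.fetch(File.extname(path).downcase, "unknown")
end
```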
Key processing features:
- Markdown section extraction
- YAML frontmatter parsing
- File metadata collection
- Content type detection
- Text statistics
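YAML frontmatter parsing, for example, can be done with Ruby's standard `yaml` library. This sketch (the helper name is hypothetical, not Flowbots' actual API) splits a Markdown file into its frontmatter hash and body text:

```ruby
require "yaml"

# Hypothetical helper: returns [metadata_hash, body_text]. Files without
# a leading "---" block yield an empty hash and the unchanged text.
def parse_frontmatter(text)
  return [{}, text] unless text.start_with?("---\n") && (close = text.index("\n---", 4))

  meta = YAML.safe_load(text[4...close]) || {}
  body = text[(close + 4)..].to_s.lstrip
  [meta, body]
end
```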
Flowbots uses Ohm, an object-hash mapping library for Redis, to manage its data models. Ohm provides a flexible and efficient way to store and retrieve structured data in Redis.
Primary model for handling all types of content with a unified structure.
Document
├── Attributes
│   ├── path        # File path
│   ├── name        # File name
│   ├── type        # File type (e.g., text, audio, video)
│   ├── mime        # MIME type
│   ├── extension   # File extension
│   ├── size        # File size
│   ├── mtime       # Last modification time
│   ├── ctime       # Creation time
│   ├── checksum    # File checksum
│   ├── collection  # Associated collection name
│   ├── content     # Main content
│   └── metadata    # Hash of file metadata
│
├── Indices
│   ├── path
│   ├── name
│   ├── type
│   ├── collection
│   └── checksum
│
└── Methods
    ├── paragraphs()
    ├── sentences()
    ├── words(conditions)
    └── topics()
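In Flowbots, the segmentation methods above return stored model objects; as a rough sketch of the underlying splitting rules only (the delimiters here are assumptions, not the actual implementation):

```ruby
# Illustrative splitting rules; Flowbots' methods return persisted objects.
def split_paragraphs(text)
  text.split(/\n{2,}/).map(&:strip).reject(&:empty?)
end

def split_sentences(text)
  text.split(/(?<=[.!?])\s+/).reject(&:empty?)
end

def split_words(text)
  text.scan(/[[:alpha:]'-]+/)
end
```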
These models remain largely unchanged, but now reference the Document model instead of the former TextFile and Item models.
Organizes related content.
Collection
├── Attributes
│   └── name        # Collection name
│
└── Sets
    └── documents   # Document references
Manages content relationships and categorization.
Topic
├── Attributes
│   ├── name         # Topic name
│   ├── description  # Topic description
│   └── vector       # Topic vector representation
│
└── Collections
    ├── documents   # Associated Documents
    ├── segments    # Associated Segments
    ├── paragraphs  # Associated Paragraphs
    ├── sentences   # Associated Sentences
    └── phrases     # Associated Phrases
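The `vector` attribute suggests similarity-based topic matching; a minimal cosine-similarity helper, purely illustrative (Flowbots' actual matching logic is not shown here):

```ruby
# Cosine similarity between two equal-length numeric vectors:
# dot product divided by the product of the vector magnitudes.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end
```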
# Creating a document with content
document = Document.create(
  path: "/path/to/file.txt",
  name: "file.txt",
  type: "text",
  content: "Some content",
  metadata: { source: "import" }
)

# Processing content
preprocessor = PreprocessDocument.new(document.id)
preprocessor.execute

# Adding topic classification
topic = Topic.find_or_create(name: "Example Topic")
document.update(metadata: document.metadata.merge(topics: [topic.id]))

# Processing media content
audio_document = Document.create(
  path: "/path/to/audio.mp3",
  name: "audio.mp3",
  type: "audio"
)
# Process audio content...
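The file-level attributes such as `size`, `mtime`, and `checksum` can be computed with Ruby's standard library before creating a Document. A sketch, assuming SHA-256 checksums (the helper name is hypothetical):

```ruby
require "digest"

# Hypothetical helper collecting the file-level Document attributes.
def file_attributes(path)
  stat = File.stat(path)
  {
    name: File.basename(path),
    extension: File.extname(path),
    size: stat.size,
    mtime: stat.mtime,
    checksum: Digest::SHA256.file(path).hexdigest
  }
end
```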
This data structure enables:
- Unified content organization
- Rich text analysis and processing
- Media content segmentation
- Topic classification and relationships
- Flexible content relationships
[The rest of the README content remains unchanged]