towards v1.10.0 [BREAKING] (#263)
* towards v1.10.0

* change legacy behavior

* Change in header de-duplication; Refactoring (#264)

* Change in header de-duplication

* refactor enforce utf8 encoding

* more code refactoring

* restructure tests (#265)

* improve tests

* small refactor & performance improvement

* improve chunk handling

* speed-up count_quote_chars

* small performance improvements

* accelerate hash_transformations

* more performance improvements

* coverage

* adding Ruby 3.3 to CI tests
tilo authored Dec 31, 2023
1 parent db257e8 commit 90f3dc1
Showing 61 changed files with 610 additions and 459 deletions.
18 changes: 18 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,24 @@

# SmarterCSV 1.x Change Log

## 1.10.0 (2023-12-31) ⚡ BREAKING ⚡

* BREAKING CHANGES:

  Changed behavior:
  + when `user_provided_headers` are provided:
    * if they are not unique, an exception will now be raised
    * they are taken "as is", no header transformations can be applied
    * when they are given as strings or as symbols, it is assumed that this is the desired format
    * the value of the `strings_as_keys` option will be ignored

  + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
    * this allows automatic disambiguation when processing CSV files with duplicate headers, by appending a number
    * explicitly set this option to `nil` to get the behavior from previous versions.

* performance and memory improvements
* code refactor
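
The `duplicate_header_suffix` change described above can be sketched as follows. This is a minimal standalone illustration that does not load the gem: `resolve_duplicate_headers` and `DuplicateHeadersError` are hypothetical names, and the numbering rule is assumed to follow the `disambiguate_headers` method added in this commit.

```ruby
# Standalone sketch; does not require the smarter_csv gem.
# DuplicateHeadersError is a hypothetical stand-in for SmarterCSV::DuplicateHeaders.
class DuplicateHeadersError < StandardError; end

# A non-nil suffix numbers repeated headers (second occurrence gets "2", and so on).
# A nil suffix leaves the headers alone, so duplicates are rejected instead.
def resolve_duplicate_headers(headers, suffix)
  counts = Hash.new(0)

  if suffix.nil?
    headers.each { |header| counts[header] += 1 }
    duplicates = counts.select { |_, count| count > 1 }.keys
    raise DuplicateHeadersError, "duplicate headers: #{duplicates.join(', ')}" unless duplicates.empty?

    return headers
  end

  headers.map do |header|
    counts[header] += 1
    counts[header] > 1 ? "#{header}#{suffix}#{counts[header]}" : header
  end
end

p resolve_duplicate_headers(%w[name email name], '')   # new default: empty suffix
p resolve_duplicate_headers(%w[name email name], '_')  # custom suffix
```

With the empty-string default the second `name` column becomes `name2`; with `'_'` it becomes `name_2`. Passing `nil` in this sketch raises, which mirrors the pre-1.10.0 behavior the changelog refers to.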

## 1.9.3 (2023-12-16)
* raise SmarterCSV::IncorrectOption when `user_provided_headers` are empty
* code refactor / no functional changes
32 changes: 26 additions & 6 deletions README.md
@@ -2,15 +2,33 @@
# SmarterCSV

[![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)



#### LATEST CHANGES

* Version 1.10.0 has BREAKING CHANGES:

  Changed behavior:
  + when `user_provided_headers` are provided:
    * if they are not unique, an exception will now be raised
    * they are taken "as is", no header transformations can be applied
    * when they are given as strings or as symbols, it is assumed that this is the desired format
    * the value of the `strings_as_keys` option will be ignored

  + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
    * this allows automatic disambiguation when processing CSV files with duplicate headers, by appending a number
    * explicitly set this option to `nil` to get the behavior from previous versions.
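
The rules for `user_provided_headers` listed above can be illustrated with a small standalone sketch. It does not call the gem; `accept_user_provided_headers` and `HeaderError` are hypothetical names, and the exception class the gem actually raises for non-unique headers is not stated in this section.

```ruby
# Standalone sketch; HeaderError and accept_user_provided_headers are hypothetical names.
class HeaderError < StandardError; end

def accept_user_provided_headers(headers)
  raise HeaderError, "user_provided_headers must not be empty" if headers.nil? || headers.empty?

  duplicates = headers.group_by(&:itself).select { |_, group| group.size > 1 }.keys
  raise HeaderError, "user_provided_headers must be unique: #{duplicates.inspect}" unless duplicates.empty?

  # Returned unchanged: no stripping, downcasing, or symbolizing is applied,
  # so strings stay strings and symbols stay symbols.
  headers.dup
end

p accept_user_provided_headers(['First Name', :email])
```

The point of the sketch is that the caller decides the key format: a mixed list of strings and symbols comes back exactly as given, and only emptiness and uniqueness are checked.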

#### Development Branches

* default branch is `main` for 1.x development
* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)

* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
- This is an EXPERIMENTAL branch - DO NOT USE in production

#### Work towards Future Version 2.0
#### Work towards Future Version 2.x

* Work towards SmarterCSV 2.0 is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
* Work towards SmarterCSV 2.x is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.

---------------
@@ -83,10 +101,11 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
00000030 0a 33 38 37 35 39 31 35 30 2c 71 75 69 7a 7a 65 |.38759150,quizze|
00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
```

### Articles
* [Processing 1.4 Million CSV Records in Ruby, fast](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
* [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)

### Examples

Here are some examples to demonstrate the versatility of SmarterCSV.
@@ -281,7 +300,8 @@ The options and the block are optional.
| :headers_in_file | true | Whether or not the file contains headers as the first line. |
| | | Important if the file does not contain headers, |
| | | otherwise you would lose the first line of data. |
| :duplicate_header_suffix | nil | If set, adds numbers to duplicated headers and separates them by the given suffix |
| :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
| | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
| :user_provided_headers | nil | *careful with that axe!* |
| | | user provided Array of header strings or symbols, to define |
| | | what headers should be used, overriding any in-file headers. |
8 changes: 8 additions & 0 deletions lib/smarter_csv.rb
@@ -5,13 +5,21 @@
require "smarter_csv/options_processing"
require "smarter_csv/auto_detection"
require "smarter_csv/variables"
require 'smarter_csv/header_transformations'
require 'smarter_csv/header_validations'
require "smarter_csv/headers"
require "smarter_csv/hash_transformations"
require "smarter_csv/parse"

# load the C-extension:
case RUBY_ENGINE
when 'ruby'
  begin
    if `uname -s`.chomp == 'Darwin'
      #
      # Please report if you see cases where the rake-compiler is building x86_64 code on arm64 cpus:
      # https://github.com/rake-compiler/rake-compiler/issues/231
      #
      require 'smarter_csv/smarter_csv.bundle'
    else
      # :nocov:
91 changes: 91 additions & 0 deletions lib/smarter_csv/hash_transformations.rb
@@ -0,0 +1,91 @@
# frozen_string_literal: true

module SmarterCSV
  class << self
    def hash_transformations(hash, options)
      # there may be unmapped keys, or keys purposely mapped to nil or an empty key..
      # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
      remove_empty_values = options[:remove_empty_values] == true
      remove_zero_values = options[:remove_zero_values]
      remove_values_matching = options[:remove_values_matching]
      convert_to_numeric = options[:convert_values_to_numeric]
      value_converters = options[:value_converters]

      hash.each_with_object({}) do |(k, v), new_hash|
        next if k.nil? || k == '' || k == :""
        next if remove_empty_values && (has_rails ? v.blank? : blank?(v))
        next if remove_zero_values && v.is_a?(String) && v =~ /^(0+|0+\.0+)$/ # values are Strings
        next if remove_values_matching && v =~ remove_values_matching

        # deal with the :only / :except options to :convert_values_to_numeric
        if convert_to_numeric && !limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
          if v =~ /^[+-]?\d+\.\d+$/
            v = v.to_f
          elsif v =~ /^[+-]?\d+$/
            v = v.to_i
          end
        end

        converter = value_converters[k] if value_converters
        v = converter.convert(v) if converter

        new_hash[k] = v
      end
    end

    # def hash_transformations(hash, options)
    #   # there may be unmapped keys, or keys purposely mapped to nil or an empty key..
    #   # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
    #   hash.delete(nil)
    #   hash.delete('')
    #   hash.delete(:"")

    #   if options[:remove_empty_values] == true
    #     hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
    #   end

    #   hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
    #   hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]

    #   if options[:convert_values_to_numeric]
    #     hash.each do |k, v|
    #       # deal with the :only / :except options to :convert_values_to_numeric
    #       next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)

    #       # convert if it's a numeric value:
    #       case v
    #       when /^[+-]?\d+\.\d+$/
    #         hash[k] = v.to_f
    #       when /^[+-]?\d+$/
    #         hash[k] = v.to_i
    #       end
    #     end
    #   end

    #   if options[:value_converters]
    #     hash.each do |k, v|
    #       converter = options[:value_converters][k]
    #       next unless converter

    #       hash[k] = converter.convert(v)
    #     end
    #   end

    #   hash
    # end

    protected

    # acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
    def limit_execution_for_only_or_except(options, option_name, key)
      if options[option_name].is_a?(Hash)
        if options[option_name].has_key?(:except)
          return true if Array(options[option_name][:except]).include?(key)
        elsif options[option_name].has_key?(:only)
          return true unless Array(options[option_name][:only]).include?(key)
        end
      end
      false
    end
  end
end
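
The single-pass approach of the new `hash_transformations` (one `each_with_object` loop instead of several `delete_if` and `each` passes) can be tried out with a simplified standalone sketch. `transform_row` and `skip_conversion?` are hypothetical names, empty values are always dropped here rather than being controlled by an option, and value converters are left out.

```ruby
# Simplified, standalone sketch of the single-pass idea; not the gem's code.

# Returns true when numeric conversion should be skipped for this key,
# based on an :only / :except Hash. Any other setting means "convert all keys".
def skip_conversion?(setting, key)
  return false unless setting.is_a?(Hash)
  return Array(setting[:except]).include?(key) if setting.key?(:except)
  return !Array(setting[:only]).include?(key) if setting.key?(:only)

  false
end

def transform_row(row, convert_to_numeric: true, remove_zero_values: false)
  row.each_with_object({}) do |(key, value), result|
    next if key.nil? || key.to_s.empty?            # drop unmapped keys
    next if value.nil? || value.to_s.strip.empty?  # drop empty values
    next if remove_zero_values && value.is_a?(String) && value.match?(/\A(0+|0+\.0+)\z/)

    if convert_to_numeric && value.is_a?(String) && !skip_conversion?(convert_to_numeric, key)
      case value
      when /\A[+-]?\d+\.\d+\z/ then value = value.to_f
      when /\A[+-]?\d+\z/      then value = value.to_i
      end
    end

    result[key] = value
  end
end

row = { id: '42', price: '9.50', zip: '01234', note: '', nil => 'x' }
p transform_row(row, convert_to_numeric: { except: :zip })
```

In this run `id` and `price` are converted to numbers, `zip` stays a string because it is listed under `:except`, and the empty `note` value and the `nil` key are dropped. Building a new hash in one pass avoids mutating the input while iterating over it.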
63 changes: 63 additions & 0 deletions lib/smarter_csv/header_transformations.rb
@@ -0,0 +1,63 @@
# frozen_string_literal: true

module SmarterCSV
  class << self
    # transform the headers that were in the file:
    def header_transformations(header_array, options)
      header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
      header_array.map!{|x| x.strip} if options[:strip_whitespace]

      unless options[:keep_original_headers]
        header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
        header_array.map!{|x| x.downcase} if options[:downcase_header]
      end

      # detect duplicate headers and disambiguate
      header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
      # symbolize headers
      header_array = header_array.map{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
      # doesn't make sense to re-map when we have user_provided_headers
      header_array = remap_headers(header_array, options) if options[:key_mapping]

      header_array
    end

    def disambiguate_headers(headers, options)
      counts = Hash.new(0)
      headers.map do |header|
        counts[header] += 1
        counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
      end
    end

    # do some key mapping on the keys in the file header
    # if you want to completely delete a key, then map it to nil or to ''
    def remap_headers(headers, options)
      key_mapping = options[:key_mapping]
      # check the type first, so that a non-Hash value raises IncorrectOption rather than NoMethodError
      if !key_mapping.is_a?(Hash) || key_mapping.empty?
        raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
      end

      # if silence_missing_keys are not set, raise error if missing header
      missing_keys = key_mapping.keys - headers
      # if the user passes a list of specific mapped keys that are optional
      missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)

      unless missing_keys.empty? || options[:silence_missing_keys] == true
        raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
      end

      headers.map! do |header|
        if key_mapping.has_key?(header)
          key_mapping[header] # may be nil, which marks the key for deletion
        elsif options[:remove_unmapped_keys]
          nil
        else
          header
        end
      end
      headers
    end
  end
end
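
The key-mapping step in `remap_headers` can be sketched in isolation as shown below. This is a standalone illustration with hypothetical names (`remap` and `KeyMappingError` stand in for the gem's method and `SmarterCSV::KeyMappingError`), and it returns a new array instead of modifying the headers in place.

```ruby
# Standalone sketch; KeyMappingError is a hypothetical stand-in for SmarterCSV::KeyMappingError.
class KeyMappingError < StandardError; end

def remap(headers, key_mapping, remove_unmapped: false, silence_missing: false)
  unless key_mapping.is_a?(Hash) && !key_mapping.empty?
    raise ArgumentError, "key_mapping must be a non-empty Hash"
  end

  # Mapped keys that do not occur in the headers are an error,
  # unless they are silenced (true silences all, an Array silences specific keys).
  missing = key_mapping.keys - headers
  missing -= silence_missing if silence_missing.is_a?(Array)
  unless missing.empty? || silence_missing == true
    raise KeyMappingError, "can not map headers: #{missing.join(', ')}"
  end

  headers.map do |header|
    if key_mapping.key?(header)
      key_mapping[header] # nil marks the column for deletion
    elsif remove_unmapped
      nil
    else
      header
    end
  end
end

headers = %i[first_name last_name internal_id]
p remap(headers, { first_name: :first, internal_id: nil })
# => [:first, :last_name, nil]
```

Headers mapped to `nil` stay in the array as placeholders so that column positions still line up with the data; the `nil` keys are then dropped later, when each row hash is built.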
34 changes: 34 additions & 0 deletions lib/smarter_csv/header_validations.rb
@@ -0,0 +1,34 @@
# frozen_string_literal: true

require 'set'

module SmarterCSV
  class << self
    def header_validations(headers, options)
      check_duplicate_headers(headers, options)
      check_required_headers(headers, options)
    end

    def check_duplicate_headers(headers, _options)
      header_counts = Hash.new(0)
      headers.each { |header| header_counts[header] += 1 unless header.nil? }

      duplicates = header_counts.select { |_, count| count > 1 }

      unless duplicates.empty?
        raise(SmarterCSV::DuplicateHeaders, "Duplicate Headers in CSV: #{duplicates.inspect}")
      end
    end

    def check_required_headers(headers, options)
      if options[:required_keys] && options[:required_keys].is_a?(Array)
        headers_set = headers.to_set
        missing_keys = options[:required_keys].select { |k| !headers_set.include?(k) }

        unless missing_keys.empty?
          raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}"
        end
      end
    end
  end
end
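
The two validations above can be exercised without the gem using the following sketch. `duplicate_headers` and `missing_required` are hypothetical helper names that return the offending keys instead of raising, which makes the checks easier to see in isolation.

```ruby
require 'set'

# Standalone sketch; helper names are hypothetical and nothing here loads the gem.

# Headers that occur more than once. nil headers (deleted columns) are ignored,
# so several deleted columns do not count as duplicates.
def duplicate_headers(headers)
  counts = Hash.new(0)
  headers.each { |header| counts[header] += 1 unless header.nil? }
  counts.select { |_, count| count > 1 }.keys
end

# Required keys that are absent. A Set gives constant-time lookups,
# which matters when both lists are long.
def missing_required(headers, required_keys)
  present = headers.to_set
  required_keys.reject { |key| present.include?(key) }
end

headers = [:id, :email, nil, :email, nil]
p duplicate_headers(headers)
p missing_required(headers, %i[id name])
```

Here `:email` is reported as a duplicate while the two `nil` entries are not, and `:name` is reported as missing.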