towards v1.10.0 [BREAKING] #263

Merged: 24 commits, Dec 31, 2023
1 change: 1 addition & 0 deletions .github/workflows/ruby.yml
@@ -26,6 +26,7 @@ jobs:
- "3.0"
- 3.1
- 3.2
- 3.3
- head
- truffleruby
- truffleruby-head
18 changes: 18 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,24 @@

# SmarterCSV 1.x Change Log

## 1.10.0 (2023-12-31) ⚡ BREAKING ⚡

* BREAKING CHANGES:

  Changed behavior:
  + when `user_provided_headers` are provided:
    * if they are not unique, an exception will now be raised
    * they are taken "as is", no header transformations can be applied
    * when they are given as strings or as symbols, it is assumed that this is the desired format
    * the value of the `strings_as_keys` option will be ignored

  + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
    * this allows automatic disambiguation when processing CSV files with duplicate headers, by appending a number
    * explicitly set this option to `nil` to get the behavior from previous versions.

* performance and memory improvements
* code refactor
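
To make the new `duplicate_header_suffix` default concrete, here is a minimal standalone sketch. It mirrors the counting logic of the `disambiguate_headers` helper added in this PR, but the method name `disambiguate` and the sample headers are illustrative assumptions, not part of the gem's API.

```ruby
# Standalone sketch (not the gem itself): number repeated headers, joining
# the header name and its occurrence count with the given suffix.
def disambiguate(headers, suffix)
  return headers if suffix.nil? # nil switches disambiguation off

  counts = Hash.new(0)
  headers.map do |header|
    counts[header] += 1
    counts[header] > 1 ? "#{header}#{suffix}#{counts[header]}" : header
  end
end

headers = %w[name email email]

default_result = disambiguate(headers, '')  # the new default in 1.10.0
custom_result  = disambiguate(headers, '_') # a custom separator
legacy_result  = disambiguate(headers, nil) # duplicates are left untouched

p default_result
p custom_result
p legacy_result
```

With `nil`, the duplicates survive to the validation step, which is where the previous versions raised an error.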

## 1.9.3 (2023-12-16)
* raise SmarterCSV::IncorrectOption when `user_provided_headers` are empty
* code refactor / no functional changes
32 changes: 26 additions & 6 deletions README.md
@@ -2,15 +2,33 @@
# SmarterCSV

[![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)



#### LATEST CHANGES

* Version 1.10.0 has BREAKING CHANGES:

  Changed behavior:
  + when `user_provided_headers` are provided:
    * if they are not unique, an exception will now be raised
    * they are taken "as is", no header transformations can be applied
    * when they are given as strings or as symbols, it is assumed that this is the desired format
    * the value of the `strings_as_keys` option will be ignored

  + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
    * this allows automatic disambiguation when processing CSV files with duplicate headers, by appending a number
    * explicitly set this option to `nil` to get the behavior from previous versions.
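
The `user_provided_headers` rules above can be sketched without the gem. The helper `accept_user_headers` and the use of `ArgumentError` are hypothetical stand-ins chosen for this illustration; the gem raises its own exception class. The point is only that the headers become the hash keys exactly as given, and that duplicates are rejected.

```ruby
# Hypothetical helper (not the gem's API): accept user-provided headers
# unchanged, but refuse duplicates.
def accept_user_headers(headers)
  duplicates = headers.select { |h| headers.count(h) > 1 }.uniq
  raise ArgumentError, "duplicate headers: #{duplicates.inspect}" unless duplicates.empty?

  headers # returned as given: no downcasing, no symbolizing
end

row = ['Ada', 'ada@example.com']

# Strings stay strings, symbols stay symbols.
string_keyed = accept_user_headers(['First Name', 'Email']).zip(row).to_h
symbol_keyed = accept_user_headers(%i[first_name email]).zip(row).to_h

# Non-unique headers are rejected up front.
duplicate_error =
  begin
    accept_user_headers(%i[email email])
    nil
  rescue ArgumentError => e
    e.message
  end

p string_keyed
p symbol_keyed
p duplicate_error
```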

#### Development Branches

* default branch is `main` for 1.x development
* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)

* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
- This is an EXPERIMENTAL branch - DO NOT USE in production

#### Work towards Future Version 2.0
#### Work towards Future Version 2.x

* Work towards SmarterCSV 2.0 is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
* Work towards SmarterCSV 2.x is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.

---------------
@@ -83,10 +101,11 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
00000030 0a 33 38 37 35 39 31 35 30 2c 71 75 69 7a 7a 65 |.38759150,quizze|
00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
```

### Articles
* [Processing 1.4 Million CSV Records in Ruby, fast ](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
* [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)

### Examples

Here are some examples to demonstrate the versatility of SmarterCSV.
@@ -281,7 +300,8 @@ The options and the block are optional.
| :headers_in_file | true | Whether or not the file contains headers as the first line. |
| | | Important if the file does not contain headers, |
| | | otherwise you would lose the first line of data. |
| :duplicate_header_suffix | nil | If set, adds numbers to duplicated headers and separates them by the given suffix |
| :duplicate_header_suffix | '' | Adds numbers to duplicated headers and separates them by the given suffix. |
| | | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior) |
| :user_provided_headers | nil | *careful with that axe!* |
| | | user provided Array of header strings or symbols, to define |
| | | what headers should be used, overriding any in-file headers. |
8 changes: 8 additions & 0 deletions lib/smarter_csv.rb
@@ -5,13 +5,21 @@
require "smarter_csv/options_processing"
require "smarter_csv/auto_detection"
require "smarter_csv/variables"
require 'smarter_csv/header_transformations'
require 'smarter_csv/header_validations'
require "smarter_csv/headers"
require "smarter_csv/hash_transformations"
require "smarter_csv/parse"

# load the C-extension:
case RUBY_ENGINE
when 'ruby'
  begin
    if `uname -s`.chomp == 'Darwin'
      #
      # Please report if you see cases where the rake-compiler is building x86_64 code on arm64 cpus:
      # https://github.com/rake-compiler/rake-compiler/issues/231
      #
      require 'smarter_csv/smarter_csv.bundle'
    else
      # :nocov:
91 changes: 91 additions & 0 deletions lib/smarter_csv/hash_transformations.rb
@@ -0,0 +1,91 @@
# frozen_string_literal: true

module SmarterCSV
  class << self
    def hash_transformations(hash, options)
      # there may be unmapped keys, or keys purposely mapped to nil or an empty key..
      # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
      remove_empty_values = options[:remove_empty_values] == true
      remove_zero_values = options[:remove_zero_values]
      remove_values_matching = options[:remove_values_matching]
      convert_to_numeric = options[:convert_values_to_numeric]
      value_converters = options[:value_converters]

      hash.each_with_object({}) do |(k, v), new_hash|
        next if k.nil? || k == '' || k == :""
        next if remove_empty_values && (has_rails ? v.blank? : blank?(v))
        next if remove_zero_values && v.is_a?(String) && v =~ /^(0+|0+\.0+)$/ # values are Strings
        next if remove_values_matching && v =~ remove_values_matching

        # deal with the :only / :except options to :convert_values_to_numeric
        if convert_to_numeric && !limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
          if v =~ /^[+-]?\d+\.\d+$/
            v = v.to_f
          elsif v =~ /^[+-]?\d+$/
            v = v.to_i
          end
        end

        converter = value_converters[k] if value_converters
        v = converter.convert(v) if converter

        new_hash[k] = v
      end
    end

    # def hash_transformations(hash, options)
    #   # there may be unmapped keys, or keys purposely mapped to nil or an empty key..
    #   # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
    #   hash.delete(nil)
    #   hash.delete('')
    #   hash.delete(:"")

    #   if options[:remove_empty_values] == true
    #     hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
    #   end

    #   hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
    #   hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]

    #   if options[:convert_values_to_numeric]
    #     hash.each do |k, v|
    #       # deal with the :only / :except options to :convert_values_to_numeric
    #       next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)

    #       # convert if it's a numeric value:
    #       case v
    #       when /^[+-]?\d+\.\d+$/
    #         hash[k] = v.to_f
    #       when /^[+-]?\d+$/
    #         hash[k] = v.to_i
    #       end
    #     end
    #   end

    #   if options[:value_converters]
    #     hash.each do |k, v|
    #       converter = options[:value_converters][k]
    #       next unless converter

    #       hash[k] = converter.convert(v)
    #     end
    #   end

    #   hash
    # end

    protected

    # acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
    def limit_execution_for_only_or_except(options, option_name, key)
      if options[option_name].is_a?(Hash)
        if options[option_name].has_key?(:except)
          return true if Array(options[option_name][:except]).include?(key)
        elsif options[option_name].has_key?(:only)
          return true unless Array(options[option_name][:only]).include?(key)
        end
      end
      false
    end
  end
end
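
The interplay between `:convert_values_to_numeric` and its `:only` / `:except` forms is easier to see in isolation. The sketch below is a standalone approximation of that gate, not the gem's code: `skip_conversion?` and `to_numeric` are hypothetical names, and the regexes use `\A`/`\z` rather than the `^`/`$` anchors in the diff, so that a multi-line value cannot match on a single line.

```ruby
# Hypothetical stand-in for limit_execution_for_only_or_except:
# returns true when numeric conversion should be skipped for `key`.
def skip_conversion?(option, key)
  return false unless option.is_a?(Hash) # plain `true` converts every column

  if option.key?(:except)
    Array(option[:except]).include?(key)
  elsif option.key?(:only)
    !Array(option[:only]).include?(key)
  else
    false
  end
end

# Convert strings that look like floats or integers; leave everything else alone.
def to_numeric(value)
  case value
  when /\A[+-]?\d+\.\d+\z/ then value.to_f
  when /\A[+-]?\d+\z/      then value.to_i
  else value
  end
end

row    = { id: '42', zip: '01234', price: '9.50', note: 'n/a' }
option = { except: :zip } # keep the leading zero of the ZIP code

converted = row.to_h { |k, v| [k, skip_conversion?(option, k) ? v : to_numeric(v)] }

p converted
```

Wrapping the option value in `Array(...)` is what lets callers pass either a single key or a list of keys.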
63 changes: 63 additions & 0 deletions lib/smarter_csv/header_transformations.rb
@@ -0,0 +1,63 @@
# frozen_string_literal: true

module SmarterCSV
  class << self
    # transform the headers that were in the file:
    def header_transformations(header_array, options)
      header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
      header_array.map!{|x| x.strip} if options[:strip_whitespace]

      unless options[:keep_original_headers]
        header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
        header_array.map!{|x| x.downcase} if options[:downcase_header]
      end

      # detect duplicate headers and disambiguate
      header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
      # symbolize headers
      header_array = header_array.map{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
      # doesn't make sense to re-map when we have user_provided_headers
      header_array = remap_headers(header_array, options) if options[:key_mapping]

      header_array
    end

    def disambiguate_headers(headers, options)
      counts = Hash.new(0)
      headers.map do |header|
        counts[header] += 1
        counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
      end
    end

    # do some key mapping on the keys in the file header
    # if you want to completely delete a key, then map it to nil or to ''
    def remap_headers(headers, options)
      key_mapping = options[:key_mapping]
      # check the type first, so that a non-Hash value cannot raise NoMethodError on `empty?`
      if !key_mapping.is_a?(Hash) || key_mapping.empty?
        raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
      end

      # if silence_missing_keys is not set, raise an error for missing headers
      missing_keys = key_mapping.keys - headers
      # if the user passes a list of specific mapped keys that are optional
      missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)

      unless missing_keys.empty? || options[:silence_missing_keys] == true
        raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
      end

      headers.map! do |header|
        if key_mapping.has_key?(header)
          key_mapping[header] # may be nil, which marks the column for deletion
        elsif options[:remove_unmapped_keys]
          nil
        else
          header
        end
      end
      headers
    end
  end
end
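
As a rough illustration of the default header pipeline above (strip quotes, strip whitespace, replace spaces and dashes with underscores, downcase, symbolize, then remap), here is a standalone sketch. The names `normalize_headers` and `remap` and the sample headers are assumptions made for this example; they are not the gem's API, and the sketch ignores options such as `keep_original_headers`.

```ruby
# Hypothetical stand-in for the default header_transformations path.
def normalize_headers(raw_headers, quote_char: '"')
  raw_headers.map do |header|
    header.gsub(quote_char, '')  # drop quote characters
          .strip                 # trim surrounding whitespace
          .gsub(/\s+|-+/, '_')   # runs of whitespace or dashes become one underscore
          .downcase
          .to_sym
  end
end

# Hypothetical stand-in for remap_headers: mapped keys are renamed,
# keys mapped to nil are marked for deletion, all others pass through.
def remap(headers, mapping)
  headers.map { |header| mapping.fetch(header, header) }
end

raw        = ['"First Name"', ' E-Mail ', 'ZIP  Code']
normalized = normalize_headers(raw)
remapped   = remap(normalized, { e_mail: :email, zip_code: nil })

p normalized
p remapped
```

A `nil` header is not removed here; it is the later per-row hash transformation that skips `nil` and empty keys.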
34 changes: 34 additions & 0 deletions lib/smarter_csv/header_validations.rb
@@ -0,0 +1,34 @@
# frozen_string_literal: true

module SmarterCSV
  class << self
    def header_validations(headers, options)
      check_duplicate_headers(headers, options)
      check_required_headers(headers, options)
    end

    def check_duplicate_headers(headers, _options)
      header_counts = Hash.new(0)
      headers.each { |header| header_counts[header] += 1 unless header.nil? }

      duplicates = header_counts.select { |_, count| count > 1 }

      unless duplicates.empty?
        raise(SmarterCSV::DuplicateHeaders, "Duplicate Headers in CSV: #{duplicates.inspect}")
      end
    end

    require 'set'

    def check_required_headers(headers, options)
      if options[:required_keys] && options[:required_keys].is_a?(Array)
        headers_set = headers.to_set
        missing_keys = options[:required_keys].select { |k| !headers_set.include?(k) }

        unless missing_keys.empty?
          raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}"
        end
      end
    end
  end
end
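
The two validations above reduce to a duplicate count and a set difference. A minimal standalone sketch follows, with hypothetical helper names (`duplicate_headers`, `missing_required`) that return the offending keys instead of raising. It assumes Ruby 2.7 or newer for `Enumerable#tally`.

```ruby
require 'set'

# Hypothetical helper: headers that occur more than once, with their counts.
# nil headers are ignored, since they mark columns that will be dropped.
def duplicate_headers(headers)
  headers.compact.tally.select { |_, count| count > 1 }
end

# Hypothetical helper: required keys that are absent from the headers.
# A Set keeps each membership test cheap for wide files.
def missing_required(headers, required_keys)
  present = headers.to_set
  Array(required_keys).reject { |key| present.include?(key) }
end

headers = [:id, :name, :id, nil, nil]

duplicates = duplicate_headers(headers)
missing    = missing_required(headers, %i[id email])

p duplicates
p missing
```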