towards v1.10.0 [BREAKING] (#263)

* towards v1.10.0
* change legacy behavior
* Change in header de-duplication; Refactoring (#264)
* Change in header de-duplication
* refactor enforce utf8 encoding
* more code refactoring
* restructure tests (#265)
* improve tests
* small refactor & performance improvement
* improve chunk handling
* speed-up count_quote_chars
* small performance improvements
* accelerate hash_transformations
* more performance improvements
* coverage
* adding Ruby 3.3 to CI tests

tilo · Dec 31, 2023 · 90f3dc1
1 parent db257e8
commit 90f3dc1
Showing 61 changed files with 610 additions and 459 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,24 @@
 
 # SmarterCSV 1.x Change Log
 
+## 1.10.0 (2023-12-31) ⚡ BREAKING ⚡
+
+  * BREAKING CHANGES:
+
+    Changed behavior:
+     + when `user_provided_headers` are provided:
+       * if they are not unique, an exception will now be raised
+       * they are taken "as is", no header transformations can be applied
+       * when they are given as strings or as symbols, it is assumed that this is the desired format
+       * the value of the `strings_as_keys` option will be ignored
+
+     + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
+       * this allows automatic disambiguation when processing CSV files with duplicate headers, by appending a number
+       * explicitly set this option to `nil` to get the behavior from previous versions.
+
+  * performance and memory improvements
+  * code refactor
+
 ## 1.9.3 (2023-12-16)
   * raise SmarterCSV::IncorrectOption when `user_provided_headers` are empty
   * code refactor / no functional changes
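
Note on the `duplicate_header_suffix` change above: the following is a minimal standalone sketch of the numbering scheme, modelled on the `disambiguate_headers` method added later in this commit. It does not load the gem; the method name `disambiguate` and the sample headers are made up for illustration. An empty string is truthy in Ruby, so the new default `''` enables numbering, while `nil` skips it and leaves duplicates for the validation step to reject.

```ruby
# Standalone sketch (assumption: mirrors the logic shown in this commit's
# header_transformations.rb, without loading the smarter_csv gem).
def disambiguate(headers, suffix)
  return headers if suffix.nil? # nil disables disambiguation entirely

  counts = Hash.new(0)
  headers.map do |header|
    counts[header] += 1
    # the first occurrence keeps its name, later ones get suffix + counter
    counts[header] > 1 ? "#{header}#{suffix}#{counts[header]}" : header
  end
end

headers = %w[name email email email]

with_default    = disambiguate(headers, '')  # => ["name", "email", "email2", "email3"]
with_underscore = disambiguate(headers, '_') # => ["name", "email", "email_2", "email_3"]
with_nil        = disambiguate(headers, nil) # => unchanged, duplicates remain
```

With `nil`, the duplicates survive and would be caught by the duplicate-header check, which matches the "behavior from previous versions" described in the changelog entry.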

diff --git a/README.md b/README.md
@@ -2,15 +2,33 @@
 # SmarterCSV
 
  [![codecov](https://codecov.io/gh/tilo/smarter_csv/branch/main/graph/badge.svg?token=1L7OD80182)](https://codecov.io/gh/tilo/smarter_csv) [![Gem Version](https://badge.fury.io/rb/smarter_csv.svg)](http://badge.fury.io/rb/smarter_csv)
-
+
+
+#### LATEST CHANGES
+
+* Version 1.10.0 has BREAKING CHANGES:
+
+    Changed behavior:
+     + when `user_provided_headers` are provided:
+       * if they are not unique, an exception will now be raised
+       * they are taken "as is", no header transformations can be applied
+       * when they are given as strings or as symbols, it is assumed that this is the desired format
+       * the value of the `strings_as_keys` option will be ignored
+
+     + option `duplicate_header_suffix` now defaults to `''` instead of `nil`.
+       * this allows automatic disambiguation when processing CSV files with duplicate headers, by appending a number
+       * explicitly set this option to `nil` to get the behavior from previous versions.
+
 #### Development Branches
 
 * default branch is `main` for 1.x development
-* 2.x development is on `2.0-development` (check this branch for 2.0 documentation)
+
+* 2.x development is on `2.0-development` (check this branch for 2.0 documentation) 
+  - This is an EXPERIMENTAL branch - DO NOT USE in production
 
-#### Work towards Future Version 2.0
+#### Work towards Future Version 2.x
 
-* Work towards SmarterCSV 2.0 is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
+* Work towards SmarterCSV 2.x is still ongoing, with improved features, and more streamlined options, but consider it as experimental at this time.
   Please check the [2.0-develop branch](https://github.com/tilo/smarter_csv/tree/2.0-develop), open any issues and pull requests with mention of tag v2.0.
 
 ---------------
@@ -83,10 +101,11 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
 00000030  0a 33 38 37 35 39 31 35  30 2c 71 75 69 7a 7a 65  |.38759150,quizze|
 00000040  73 2c 35 36 37 38 0d 0a                           |s,5678..|
 ```
+
 ### Articles
 * [Processing 1.4 Million CSV Records in Ruby, fast ](https://lcx.wien/blog/processing-14-million-csv-records-in-ruby/)
 * [Speeding up CSV parsing with parallel processing](http://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing)
-  
+
 ### Examples
 
 Here are some examples to demonstrate the versatility of SmarterCSV.
@@ -281,7 +300,8 @@ The options and the block are optional.
      | :headers_in_file            |   true   | Whether or not the file contains headers as the first line.                          |
      |                             |          | Important if the file does not contain headers,                                      |
      |                             |          | otherwise you would lose the first line of data.                                     |
-     | :duplicate_header_suffix    |   nil    | If set, adds numbers to duplicated headers and separates them by the given suffix    |
+     | :duplicate_header_suffix    |   ''     | Adds numbers to duplicated headers and separates them by the given suffix.           |
+     |                             |          | Set this to nil to raise `DuplicateHeaders` error instead (previous behavior)        |
      | :user_provided_headers      |   nil    | *careful with that axe!*                                                             |
      |                             |          | user provided Array of header strings or symbols, to define                          |
      |                             |          | what headers should be used, overriding any in-file headers.                         |

diff --git a/lib/smarter_csv.rb b/lib/smarter_csv.rb
@@ -5,13 +5,21 @@
 require "smarter_csv/options_processing"
 require "smarter_csv/auto_detection"
 require "smarter_csv/variables"
+require 'smarter_csv/header_transformations'
+require 'smarter_csv/header_validations'
 require "smarter_csv/headers"
+require "smarter_csv/hash_transformations"
 require "smarter_csv/parse"
 
+# load the C-extension:
 case RUBY_ENGINE
 when 'ruby'
   begin
     if `uname -s`.chomp == 'Darwin'
+      #
+      # Please report if you see cases where the rake-compiler is building x86_64 code on arm64 cpus:
+      # https://github.com/rake-compiler/rake-compiler/issues/231
+      #
       require 'smarter_csv/smarter_csv.bundle'
     else
       # :nocov:

diff --git a/lib/smarter_csv/hash_transformations.rb b/lib/smarter_csv/hash_transformations.rb
@@ -0,0 +1,91 @@
+# frozen_string_literal: true
+
+module SmarterCSV
+  class << self
+    def hash_transformations(hash, options)
+      # there may be unmapped keys, or keys purposely mapped to nil or an empty key..
+      # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
+      remove_empty_values = options[:remove_empty_values] == true
+      remove_zero_values = options[:remove_zero_values]
+      remove_values_matching = options[:remove_values_matching]
+      convert_to_numeric = options[:convert_values_to_numeric]
+      value_converters = options[:value_converters]
+
+      hash.each_with_object({}) do |(k, v), new_hash|
+        next if k.nil? || k == '' || k == :""
+        next if remove_empty_values && (has_rails ? v.blank? : blank?(v))
+        next if remove_zero_values && v.is_a?(String) && v =~ /^(0+|0+\.0+)$/ # values are Strings
+        next if remove_values_matching && v =~ remove_values_matching
+
+        # deal with the :only / :except options to :convert_values_to_numeric
+        if convert_to_numeric && !limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
+          if v =~ /^[+-]?\d+\.\d+$/
+            v = v.to_f
+          elsif v =~ /^[+-]?\d+$/
+            v = v.to_i
+          end
+        end
+
+        converter = value_converters[k] if value_converters
+        v = converter.convert(v) if converter
+
+        new_hash[k] = v
+      end
+    end
+
+    # def hash_transformations(hash, options)
+    #   # there may be unmapped keys, or keys purposely mapped to nil or an empty key..
+    #   # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
+    #   hash.delete(nil)
+    #   hash.delete('')
+    #   hash.delete(:"")
+
+    #   if options[:remove_empty_values] == true
+    #     hash.delete_if{|_k, v| has_rails ? v.blank? : blank?(v)}
+    #   end
+
+    #   hash.delete_if{|_k, v| !v.nil? && v =~ /^(0+|0+\.0+)$/} if options[:remove_zero_values] # values are Strings
+    #   hash.delete_if{|_k, v| v =~ options[:remove_values_matching]} if options[:remove_values_matching]
+
+    #   if options[:convert_values_to_numeric]
+    #     hash.each do |k, v|
+    #       # deal with the :only / :except options to :convert_values_to_numeric
+    #       next if limit_execution_for_only_or_except(options, :convert_values_to_numeric, k)
+
+    #       # convert if it's a numeric value:
+    #       case v
+    #       when /^[+-]?\d+\.\d+$/
+    #         hash[k] = v.to_f
+    #       when /^[+-]?\d+$/
+    #         hash[k] = v.to_i
+    #       end
+    #     end
+    #   end
+
+    #   if options[:value_converters]
+    #     hash.each do |k, v|
+    #       converter = options[:value_converters][k]
+    #       next unless converter
+
+    #       hash[k] = converter.convert(v)
+    #     end
+    #   end
+
+    #   hash
+    # end
+
+    protected
+
+    # acts as a road-block to limit processing when iterating over all k/v pairs of a CSV-hash:
+    def limit_execution_for_only_or_except(options, option_name, key)
+      if options[option_name].is_a?(Hash)
+        if options[option_name].has_key?(:except)
+          return true if Array(options[option_name][:except]).include?(key)
+        elsif options[option_name].has_key?(:only)
+          return true unless Array(options[option_name][:only]).include?(key)
+        end
+      end
+      false
+    end
+  end
+end
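
Note on `hash_transformations`: the rewrite replaces several `delete_if` / `each` passes with one `each_with_object` pass that filters and converts at the same time. The sketch below is a simplified standalone version of that idea, not the gem's code. The names `skip_key?` and `transform_row` are hypothetical, blank detection is reduced to `nil` or whitespace-only strings, and it anchors the numeric patterns with `\A` / `\z` (whole string) instead of the `^` / `$` (per line) used in the diff.

```ruby
# Returns true when numeric conversion should be skipped for this key,
# based on an optional { only: ... } or { except: ... } setting.
def skip_key?(setting, key)
  return false unless setting.is_a?(Hash)

  if setting.key?(:except)
    Array(setting[:except]).include?(key)
  elsif setting.key?(:only)
    !Array(setting[:only]).include?(key)
  else
    false
  end
end

# Single pass: drop unwanted pairs, convert numeric-looking strings.
def transform_row(row, convert_values_to_numeric: true, remove_empty_values: true)
  row.each_with_object({}) do |(key, value), result|
    next if key.nil? || key == '' || key == :""
    next if remove_empty_values && (value.nil? || value.to_s.strip.empty?)

    if convert_values_to_numeric && value.is_a?(String) &&
       !skip_key?(convert_values_to_numeric, key)
      case value
      when /\A[+-]?\d+\.\d+\z/ then value = value.to_f
      when /\A[+-]?\d+\z/      then value = value.to_i
      end
    end

    result[key] = value
  end
end

row = { id: '42', price: '9.50', zip: '02134', note: '', nil => 'dropped' }
converted = transform_row(row, convert_values_to_numeric: { except: :zip })
# => { id: 42, price: 9.5, zip: "02134" }
```

Building a new hash avoids mutating the input while iterating over it, at the cost of one extra allocation per row.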
diff --git a/lib/smarter_csv/header_transformations.rb b/lib/smarter_csv/header_transformations.rb
@@ -0,0 +1,63 @@
+# frozen_string_literal: true
+
+module SmarterCSV
+  class << self
+    # transform the headers that were in the file:
+    def header_transformations(header_array, options)
+      header_array.map!{|x| x.gsub(%r/#{options[:quote_char]}/, '')}
+      header_array.map!{|x| x.strip} if options[:strip_whitespace]
+
+      unless options[:keep_original_headers]
+        header_array.map!{|x| x.gsub(/\s+|-+/, '_')}
+        header_array.map!{|x| x.downcase} if options[:downcase_header]
+      end
+
+      # detect duplicate headers and disambiguate
+      header_array = disambiguate_headers(header_array, options) if options[:duplicate_header_suffix]
+      # symbolize headers
+      header_array = header_array.map{|x| x.to_sym } unless options[:strings_as_keys] || options[:keep_original_headers]
+      # doesn't make sense to re-map when we have user_provided_headers
+      header_array = remap_headers(header_array, options) if options[:key_mapping]
+
+      header_array
+    end
+
+    def disambiguate_headers(headers, options)
+      counts = Hash.new(0)
+      headers.map do |header|
+        counts[header] += 1
+        counts[header] > 1 ? "#{header}#{options[:duplicate_header_suffix]}#{counts[header]}" : header
+      end
+    end
+
+    # do some key mapping on the keys in the file header
+    # if you want to completely delete a key, then map it to nil or to ''
+    def remap_headers(headers, options)
+      key_mapping = options[:key_mapping]
+      if !key_mapping.is_a?(Hash) || key_mapping.empty? # check the type first, so non-Hash values cannot raise NoMethodError
+        raise(SmarterCSV::IncorrectOption, "ERROR: incorrect format for key_mapping! Expecting hash with from -> to mappings")
+      end
+
+      key_mapping = options[:key_mapping]
+      # if silence_missing_keys are not set, raise error if missing header
+      missing_keys = key_mapping.keys - headers
+      # if the user passes a list of specific mapped keys that are optional
+      missing_keys -= options[:silence_missing_keys] if options[:silence_missing_keys].is_a?(Array)
+
+      unless missing_keys.empty? || options[:silence_missing_keys] == true
+        raise SmarterCSV::KeyMappingError, "ERROR: can not map headers: #{missing_keys.join(', ')}"
+      end
+
+      headers.map! do |header|
+        if key_mapping.has_key?(header)
+          key_mapping[header]
+        elsif options[:remove_unmapped_keys]
+          nil
+        else
+          header
+        end
+      end
+      headers
+    end
+  end
+end
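
Note on `header_transformations`: for in-file headers the order is strip quotes, strip whitespace, replace whitespace and dashes with underscores, downcase, symbolize, then apply `key_mapping`. A minimal standalone sketch of that pipeline follows. The method names `normalize_headers` and `remap` and their keyword arguments are hypothetical, and the quote character is removed as a literal string rather than through an interpolated regex as in the diff.

```ruby
# Normalizes raw header strings into symbol keys (simplified sketch).
def normalize_headers(raw_headers, quote_char: '"', downcase: true, symbolize: true)
  raw_headers.map do |header|
    h = header.gsub(quote_char, '').strip # literal match on the quote character
    h = h.gsub(/\s+|-+/, '_')             # runs of whitespace or dashes become one underscore
    h = h.downcase if downcase
    symbolize ? h.to_sym : h
  end
end

# Applies a from -> to mapping; mapping a key to nil marks it for removal.
def remap(headers, key_mapping, remove_unmapped_keys: false)
  headers.map do |header|
    if key_mapping.key?(header)
      key_mapping[header]
    elsif remove_unmapped_keys
      nil
    else
      header
    end
  end
end

raw        = ['"First Name"', ' Last-Name ', 'E-Mail']
normalized = normalize_headers(raw)
# => [:first_name, :last_name, :e_mail]
mapped     = remap(normalized, { e_mail: :email, last_name: nil })
# => [:first_name, nil, :email]
```

The `nil` entries left by the mapping are what the later hash step drops when it skips `nil` keys.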
diff --git a/lib/smarter_csv/header_validations.rb b/lib/smarter_csv/header_validations.rb
@@ -0,0 +1,34 @@
+# frozen_string_literal: true
+
+module SmarterCSV
+  class << self
+    def header_validations(headers, options)
+      check_duplicate_headers(headers, options)
+      check_required_headers(headers, options)
+    end
+
+    def check_duplicate_headers(headers, _options)
+      header_counts = Hash.new(0)
+      headers.each { |header| header_counts[header] += 1 unless header.nil? }
+
+      duplicates = header_counts.select { |_, count| count > 1 }
+
+      unless duplicates.empty?
+        raise(SmarterCSV::DuplicateHeaders, "Duplicate Headers in CSV: #{duplicates.inspect}")
+      end
+    end
+
+    require 'set'
+
+    def check_required_headers(headers, options)
+      if options[:required_keys] && options[:required_keys].is_a?(Array)
+        headers_set = headers.to_set
+        missing_keys = options[:required_keys].select { |k| !headers_set.include?(k) }
+
+        unless missing_keys.empty?
+          raise SmarterCSV::MissingKeys, "ERROR: missing attributes: #{missing_keys.join(',')}"
+        end
+      end
+    end
+  end
+end
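
Note on `header_validations`: the two checks can be sketched in isolation as below. This is an illustration under assumptions, not the gem's implementation. The error classes and the `validate_headers` / `outcome` helpers are hypothetical stand-ins for `SmarterCSV::DuplicateHeaders` and `SmarterCSV::MissingKeys`, and `Enumerable#tally` requires Ruby 2.7 or newer.

```ruby
require 'set'

# Hypothetical error classes for this sketch only.
class DuplicateHeadersError < StandardError; end
class MissingKeysError < StandardError; end

def validate_headers(headers, required_keys: nil)
  # nil entries (headers that were mapped away) are not counted as duplicates
  duplicates = headers.compact.tally.select { |_, count| count > 1 }
  raise DuplicateHeadersError, "duplicate headers: #{duplicates.inspect}" unless duplicates.empty?

  return true unless required_keys.is_a?(Array)

  present = headers.to_set # build the set once for fast membership checks
  missing = required_keys.reject { |key| present.include?(key) }
  raise MissingKeysError, "missing headers: #{missing.join(',')}" unless missing.empty?

  true
end

# Small helper that turns the result into a symbol for easy inspection.
def outcome(headers, **opts)
  validate_headers(headers, **opts) ? :ok : :unexpected
rescue DuplicateHeadersError
  :duplicate
rescue MissingKeysError
  :missing
end

results = [
  outcome([:id, :email, nil, nil]),                    # repeated nil headers are tolerated
  outcome([:id, :email, :email]),                      # real duplicate
  outcome([:id, :email], required_keys: [:id, :name])  # :name is absent
]
# => [:ok, :duplicate, :missing]
```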