Support Numo Gem for performing SVD
**Background:**
The slow step of LSI is computing the SVD (singular value decomposition)
of a matrix. Even with a relatively small collection of documents (say,
about 20 blog posts), the native Ruby implementation is too slow to be
usable (taking hours to complete).

To work around this problem, classifier-reborn allows you to optionally
use the `gsl` gem to make use of the [GNU Scientific
Library](https://www.gnu.org/software/gsl/) when performing matrix
calculations. Computations with this gem perform orders of magnitude
faster than the Ruby-only matrix implementation, and they're fast enough
that using LSI with Jekyll finishes in a reasonable amount of time
(seconds).

Unfortunately, [rb-gsl](https://github.com/SciRuby/rb-gsl) is
unmaintained -- there's a commit on main that makes it compatible with
Ruby 3, but nobody has released the gem so the only way to use rb-gsl
with Ruby 3 right now is to specify the git hash in your Gemfile. See
SciRuby/rb-gsl#67. This will be increasingly
problematic because Ruby 2.7 is now in [security
maintenance](https://www.ruby-lang.org/en/news/2022/04/12/ruby-2-7-6-released/)
and will reach end of life in less than a year.

Notably, `rb-gsl` depends on the
[narray](https://github.com/masa16/narray#new-version-is-under-development---rubynumonarray)
gem. `narray` is deprecated, and the readme suggests using
`Numo::NArray` instead.

**Changes:**
In this PR, my goal is to provide an alternative matrix implementation
that can perform singular value decomposition quickly and works with
Ruby 3. Doing so will make classifier-reborn compatible with Ruby 3
without depending on the unmaintained/unreleased gsl gem. There aren't
many gems that provide fast matrix support for Ruby, but
[Numo](https://github.com/ruby-numo) seems to be more actively
maintained than rb-gsl, and Numo has a working Ruby 3 implementation
that can perform a singular value decomposition, which is exactly what
we need. This requires
[numo-narray](https://github.com/ruby-numo/numo-narray) and
[numo-linalg](https://github.com/ruby-numo/numo-linalg).

My goal is to allow users to (optionally) use classifier-reborn with
Numo/LAPACK the same way they'd use it with GSL. That is, the user
should install the `numo-narray` and `numo-linalg` gems (with their
required C libraries), and classifier-reborn will detect and use these
if they are found.
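
The "detect and use these if they are found" behavior can be sketched in isolation as follows. This is a minimal illustration, not the code in this commit: `detect_svd_backend` is a hypothetical helper name, and the only assumption about the libraries is that `require` raises `LoadError` when a gem is not installed.

```ruby
# Hypothetical helper (not part of classifier-reborn): try each backend's
# libraries in order and return the name of the first backend that loads.
def detect_svd_backend(candidates)
  candidates.each do |name, libs|
    begin
      libs.each { |lib| require lib }
      return name
    rescue LoadError
      next # this backend is not installed; try the next one
    end
  end
  :ruby # nothing loaded, so fall back to the pure-Ruby implementation
end

backend = detect_svd_backend({
  numo: ['numo/narray', 'numo/linalg'],
  gsl: ['gsl']
})
puts "SVD backend: #{backend}"
```

The result depends on which gems are installed on the machine running it. Listing Numo first mirrors the preference order described above.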
mkasberg committed Jun 4, 2022
1 parent fb5da8e commit 914f708
Showing 8 changed files with 90 additions and 30 deletions.
18 changes: 12 additions & 6 deletions .github/workflows/ci.yml
@@ -14,17 +14,17 @@ on:

jobs:
ci:
name: "Run Tests (Ruby ${{ matrix.ruby_version }}, GSL: ${{ matrix.gsl }})"
name: "Run Tests (Ruby ${{ matrix.ruby_version }}, Linalg: ${{ matrix.linalg_gem }})"
runs-on: "ubuntu-latest"
env:
# See https://github.com/marketplace/actions/setup-ruby-jruby-and-truffleruby#matrix-of-gemfiles
BUNDLE_GEMFILE: ${{ matrix.gemfile }}
LOAD_GSL: ${{ matrix.gsl }}
LINALG_GEM: ${{ matrix.linalg_gem }}
strategy:
fail-fast: false
matrix:
ruby_version: ["2.7", "3.0", "3.1", "jruby-9.3.4.0"]
gsl: [true, false]
linalg_gem: ["none", "gsl", "numo"]
# We use `include` to assign the correct Gemfile for each ruby_version
include:
- ruby_version: "2.7"
@@ -39,17 +39,23 @@ jobs:
# Ruby 3.0 does not work with the latest released gsl gem
# https://github.com/SciRuby/rb-gsl/issues/67
- ruby_version: "3.0"
gsl: true
linalg_gem: "gsl"
# Ruby 3.1 does not work with the latest released gsl gem
# https://github.com/SciRuby/rb-gsl/issues/67
- ruby_version: "3.1"
gsl: true
linalg_gem: "gsl"
# jruby-9.3.4.0 doesn't easily build the gsl gem on a GitHub worker. Skipping for now.
- ruby_version: "jruby-9.3.4.0"
gsl: true
linalg_gem: "gsl"
# jruby-9.3.4.0 doesn't easily build the numo gems on a GitHub worker. Skipping for now.
- ruby_version: "jruby-9.3.4.0"
linalg_gem: "numo"
steps:
- name: Checkout Repository
uses: actions/checkout@v3
- name: Install Lapack
if: ${{ matrix.linalg_gem == 'numo' }}
run: sudo apt-get install -y liblapacke-dev libopenblas-dev
- name: "Set up ${{ matrix.label }}"
uses: ruby/setup-ruby@v1
with:
2 changes: 1 addition & 1 deletion .rubocop.yml
@@ -1,7 +1,7 @@
inherit_from: .rubocop_todo.yml

Style/GlobalVars:
AllowedVariables: [$GSL]
AllowedVariables: [$SVD]

Naming/MethodName:
Exclude:
7 changes: 6 additions & 1 deletion Gemfile
@@ -4,4 +4,9 @@ source 'https://rubygems.org'
gemspec name: 'classifier-reborn'

# For testing with GSL support & bundle exec
gem 'gsl' if ENV['LOAD_GSL'] == 'true'
gem 'gsl' if ENV['LINALG_GEM'] == 'gsl'

if ENV['LINALG_GEM'] == 'numo'
gem 'numo-narray'
gem 'numo-linalg'
end
60 changes: 49 additions & 11 deletions lib/classifier-reborn/lsi.rb
@@ -4,16 +4,28 @@
# Copyright:: Copyright (c) 2005 David Fayram II
# License:: LGPL

# Try to load Numo first - it's the most current and the most well-supported.
# Fall back to GSL.
# Fall back to native vector.
begin
raise LoadError if ENV['NATIVE_VECTOR'] == 'true' # to test the native vector class, try `rake test NATIVE_VECTOR=true`
raise LoadError if ENV['GSL'] == 'true' # to test with gsl, try `rake test GSL=true`

require 'gsl' # requires https://github.com/SciRuby/rb-gsl
require_relative 'extensions/vector_serialize'
$GSL = true
require 'numo/narray' # https://ruby-numo.github.io/narray/
require 'numo/linalg' # https://ruby-numo.github.io/linalg/
$SVD = :numo
rescue LoadError
$GSL = false
require_relative 'extensions/vector'
require_relative 'extensions/zero_vector'
begin
raise LoadError if ENV['NATIVE_VECTOR'] == 'true' # to test the native vector class, try `rake test NATIVE_VECTOR=true`

require 'gsl' # requires https://github.com/SciRuby/rb-gsl
require_relative 'extensions/vector_serialize'
$SVD = :gsl
rescue LoadError
$SVD = :ruby
require_relative 'extensions/vector'
require_relative 'extensions/zero_vector'
end
end

require_relative 'lsi/word_list'
@@ -140,7 +152,15 @@ def build_index(cutoff = 0.75)
doc_list = @items.values
tda = doc_list.collect { |node| node.raw_vector_with(@word_list) }

if $GSL
if $SVD == :numo
tdm = Numo::NArray.asarray(tda.map(&:to_a)).transpose
ntdm = numo_build_reduced_matrix(tdm, cutoff)

ntdm.each_over_axis(1).with_index do |col_vec, i|
doc_list[i].lsi_vector = col_vec
doc_list[i].lsi_norm = col_vec / Numo::Linalg.norm(col_vec)
end
elsif $SVD == :gsl
tdm = GSL::Matrix.alloc(*tda).trans
ntdm = build_reduced_matrix(tdm, cutoff)

@@ -201,7 +221,9 @@ def proximity_array_for_content(doc, &block)
content_node = node_for_content(doc, &block)
result =
@items.keys.collect do |item|
val = if $GSL
val = if $SVD == :numo
content_node.search_vector.dot(@items[item].transposed_search_vector)
elsif $SVD == :gsl
content_node.search_vector * @items[item].transposed_search_vector
else
(Matrix[content_node.search_vector] * @items[item].search_vector)[0]
@@ -220,7 +242,8 @@ def proximity_norms_for_content(doc, &block)
return [] if needs_rebuild?

content_node = node_for_content(doc, &block)
if $GSL && content_node.raw_norm.isnan?.all?
if ($SVD == :gsl && content_node.raw_norm.isnan?.all?) ||
($SVD == :numo && content_node.raw_norm.isnan.all?)
puts "There are no documents that are similar to #{doc}"
else
content_node_norms(content_node)
@@ -230,7 +253,9 @@ def proximity_norms_for_content(doc, &block)
def content_node_norms(content_node)
result =
@items.keys.collect do |item|
val = if $GSL
val = if $SVD == :numo
content_node.search_norm.dot(@items[item].search_norm)
elsif $SVD == :gsl
content_node.search_norm * @items[item].search_norm.col
else
(Matrix[content_node.search_norm] * @items[item].search_norm)[0]
@@ -332,7 +357,20 @@ def build_reduced_matrix(matrix, cutoff = 0.75)
s[ord] = 0.0 if s[ord] < s_cutoff
end
# Reconstruct the term document matrix, only with reduced rank
u * ($GSL ? GSL::Matrix : ::Matrix).diag(s) * v.trans
u * ($SVD == :gsl ? GSL::Matrix : ::Matrix).diag(s) * v.trans
end

def numo_build_reduced_matrix(matrix, cutoff = 0.75)
s, u, vt = Numo::Linalg.svd(matrix, driver: 'svd', job: 'S')

# TODO: Better than 75% term (as above)
s_cutoff = s.sort.reverse[(s.size * cutoff).round - 1]
s.size.times do |ord|
s[ord] = 0.0 if s[ord] < s_cutoff
end

# Reconstruct the term document matrix, only with reduced rank
u.dot(::Numo::DFloat.eye(s.size) * s).dot(vt)
end

def node_for_content(item, &block)
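
The singular-value cutoff used in `numo_build_reduced_matrix` above can be illustrated on its own. The following is a sketch of that one step only, assuming plain Ruby arrays in place of the Numo vector returned by the SVD; `truncate_singular_values` is a hypothetical name introduced for this example.

```ruby
# Sketch of the cutoff step only; a plain Ruby array stands in for the
# vector of singular values returned by the SVD.
def truncate_singular_values(singular_values, cutoff = 0.75)
  # Threshold: the value at the cutoff position of the descending-sorted list.
  sorted = singular_values.sort.reverse
  s_cutoff = sorted[(singular_values.size * cutoff).round - 1]
  # Zero every singular value below the threshold, keeping the rest as-is.
  singular_values.map { |value| value < s_cutoff ? 0.0 : value }
end

p truncate_singular_values([5.0, 3.0, 2.0, 1.0])
# => [5.0, 3.0, 2.0, 0.0]
```

With four singular values and the default cutoff of 0.75, the threshold index is `(4 * 0.75).round - 1 = 2`, so the three largest values are kept and the smallest is zeroed before the matrix is reconstructed.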
23 changes: 17 additions & 6 deletions lib/classifier-reborn/lsi/content_node.rb
@@ -29,7 +29,11 @@ def search_vector

# Method to access the transposed search vector
def transposed_search_vector
search_vector.col
if $SVD == :numo
search_vector
else
search_vector.col
end
end

# Use this to fetch the appropriate search vector in normalized form.
@@ -40,7 +44,9 @@ def search_norm
# Creates the raw vector out of word_hash using word_list as the
# key for mapping the vector space.
def raw_vector_with(word_list)
vec = if $GSL
vec = if $SVD == :numo
Numo::DFloat.zeros(word_list.size)
elsif $SVD == :gsl
GSL::Vector.alloc(word_list.size)
else
Array.new(word_list.size, 0)
@@ -51,7 +57,9 @@ def raw_vector_with(word_list)
end

# Perform the scaling transform and force floating point arithmetic
if $GSL
if $SVD == :numo
total_words = vec.sum.to_f
elsif $SVD == :gsl
sum = 0.0
vec.each { |v| sum += v }
total_words = sum
@@ -61,7 +69,7 @@ def raw_vector_with(word_list)

total_unique_words = 0

if $GSL
if [:numo, :gsl].include?($SVD)
vec.each { |word| total_unique_words += 1 if word != 0.0 }
else
total_unique_words = vec.count { |word| word != 0 }
@@ -85,12 +93,15 @@ def raw_vector_with(word_list)
hash[val] = Math.log(val + 1) / -weighted_total
end

vec.collect! do |val|
vec = vec.map do |val|
cached_calcs[val]
end
end

if $GSL
if $SVD == :numo
@raw_norm = vec / Numo::Linalg.norm(vec)
@raw_vector = vec
elsif $SVD == :gsl
@raw_norm = vec.normalize
@raw_vector = vec
else
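
The Numo branch above normalizes with `vec / Numo::Linalg.norm(vec)`, that is, it divides each component by the vector's Euclidean length. A pure-Ruby sketch of the same arithmetic follows; `normalize` is a hypothetical helper, and plain arrays of floats are assumed in place of `Numo::DFloat`. It also shows why an all-zero vector yields NaN components, which is presumably the case the `raw_norm.isnan.all?` check in `lsi.rb` guards against.

```ruby
# Sketch: divide each component by the Euclidean (L2) norm of the vector.
def normalize(vec)
  norm = Math.sqrt(vec.sum { |x| x * x })
  vec.map { |x| x / norm }
end

p normalize([3.0, 4.0])
# => [0.6, 0.8]

# A zero vector has norm 0.0, and 0.0 / 0.0 is NaN for floats.
p normalize([0.0, 0.0]).all?(&:nan?)
# => true
```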
2 changes: 1 addition & 1 deletion test/extensions/matrix_test.rb
@@ -2,7 +2,7 @@

class MatrixTest < Minitest::Test
def test_zero_division
skip "extensions/vector is only used by non-GSL implementation" if $GSL
skip "extensions/vector is only used by non-GSL implementation" if $SVD != :ruby

matrix = Matrix[[1, 0], [0, 1]]
matrix.SV_decomp
2 changes: 1 addition & 1 deletion test/extensions/zero_vector_test.rb
@@ -2,7 +2,7 @@

class ZeroVectorTest < Minitest::Test
def test_zero?
skip "extensions/zero_vector is only used by non-GSL implementation" if $GSL
skip "extensions/zero_vector is only used by non-GSL implementation" if $SVD != :ruby

vec0 = Vector[]
vec1 = Vector[0]
6 changes: 3 additions & 3 deletions test/lsi/lsi_test.rb
@@ -163,7 +163,7 @@ def test_cached_content_node_option
end

def test_clears_cached_content_node_cache
skip "transposed_search_vector is only used by GSL implementation" unless $GSL
skip "transposed_search_vector is only used by GSL implementation" if $SVD == :ruby

lsi = ClassifierReborn::LSI.new(cache_node_vectors: true)
lsi.add_item @str1, 'Dog'
@@ -191,8 +191,8 @@ def test_keyword_search
assert_equal %i[dog text deal], lsi.highest_ranked_stems(@str1)
end

def test_invalid_searching_when_using_gsl
skip "Only GSL currently raises invalid search error" unless $GSL
def test_invalid_searching_with_linalg_lib
skip "Only GSL currently raises invalid search error" if $SVD == :ruby

lsi = ClassifierReborn::LSI.new
lsi.add_item @str1, 'Dog'
