
Chore/change sdr source -- feedback request #252

Open · wants to merge 6 commits into development from chore/change_sdr_source

Conversation

eshadatta
Contributor

@eshadatta eshadatta commented Oct 5, 2016

Hello @NYULibraries/hydra , @sgbalogh ,

Here's another PR for this. I have rewritten this to read the layers.json file, parse out the URLs, and read those. This definitely cuts down on the time. I'd still like to add some sort of imitation paging mechanism, because this file will keep growing. So if anyone has sustainable design patterns in mind for implementing this, I'm all ears. Thanks.

EDITED to add: if everyone is ok with the way it is, please let me know that too.
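
For reference, here's a minimal sketch of the approach (the raw-file URLs and the shape of layers.json are assumptions for illustration; the real code is in git_geo_blacklight_reader.rb):

    require 'multi_json'
    require 'open-uri'

    # Assumed layout: layers.json maps layer ids to paths within the
    # OpenGeoMetadata/edu.nyu repo, each containing a geoblacklight.json.
    BASE_URL = 'https://raw.githubusercontent.com/OpenGeoMetadata/edu.nyu/master'

    def geoblacklight_records
      layers = MultiJson.load(URI.parse("#{BASE_URL}/layers.json").read)
      layers.values.map do |path|
        # one request per layer; this is the part that could be paged
        MultiJson.load(URI.parse("#{BASE_URL}/#{path}/geoblacklight.json").read)
      end
    end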

@coveralls

coveralls commented Oct 5, 2016

Coverage Status

Coverage increased (+0.0005%) to 99.269% when pulling 1fb946f on chore/change_sdr_source into c3c0ecb on development.


@sgbalogh
Contributor

sgbalogh commented Oct 5, 2016

@eshadatta: I just downloaded this to take a closer look at possibilities for pagination. One thing I noticed was this line: https://github.com/NYULibraries/ichabod/blob/chore/change_sdr_source/lib/ichabod/resource_set/source_readers/git_geo_blacklight_reader.rb#L50

Since this is now using layers.json to find the URLs of all of the records, is there any need for a GitHub token?

@eshadatta
Contributor Author

@sgbalogh ,

Yes, the access token allows for more requests to the API; without it there's a rate limit (some small number of requests per hour... I don't remember the exact number). I added it for now because I keep querying the API while I'm actively developing the reader, but I might take it out. If so, the way to figure out whether it's for testing purposes or not will be different, and I'll change that code. But for now, I'd like to keep it in there. The access token is in Configula. Please let me know if you need it and I'll send it to you. Thanks!
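
For illustration, the difference is just an Authorization header on the API request (the env var name here is hypothetical; unauthenticated requests are limited to 60/hour per IP, authenticated ones get a much higher limit):

    require 'net/http'
    require 'uri'

    uri = URI('https://api.github.com/repos/OpenGeoMetadata/edu.nyu/git/trees/master')
    request = Net::HTTP::Get.new(uri)
    # Without this header, GitHub throttles to 60 requests/hour.
    request['Authorization'] = "token #{ENV['GITHUB_ACCESS_TOKEN']}"
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
      http.request(request)
    end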

@sgbalogh
Contributor

sgbalogh commented Oct 5, 2016

@eshadatta oh, I wasn't aware of the rate limit –– that makes sense then! Thanks.

@da70
Contributor

da70 commented Oct 5, 2016

@eshadatta

Some feedback:

git_geo_blacklight_reader.rb

map

  • Consider changing the name to something more specific. It's hard to tell whether "map" here refers to a programming term (a map data structure linking key/value pairs, or a method for running a function over an array) or to a geographic map, since this is a geospatial data set. I saw from running the rspec that this actually returns an array of hashes, with each hash containing the contents of a geoblacklight.json. So I'm guessing "map" refers to a geographic map?
  • Change the json_file variable name to json_files, plural. Or you could remove this variable altogether, since it's never used, provided the method name indicates clearly what data is returned.

read_static_files

  • Change json_file to json_files, plural.
  • Consider doing the array push in this calling code rather than passing the array in as the second argument to read_json_file, and perhaps rename read_json_file to add_to_json_files_array if you decide to keep the argument mutation.

parse_download_files

  • Change data variable name to layer_urls.
  • Remove size variable as it is never used.
  • Change json_file to json_files, plural.
  • Consider doing the array push in this calling code rather than passing the array in as the second argument to read_json_file, and perhaps rename read_json_file to add_to_json_files_array if you decide to keep the argument mutation.

read_json_file - consider renaming it add_to_json_files_array, or else changing it so it doesn't take an array as a second argument to append to, but instead just returns MultiJson.load(contents) (see the sketch below).

read_layer_paths

  • Change name of method to something like get_download_urls_for_layers_data.
  • Change variable name url to layers_list_download_url, and get_layer_path to get_layers_list_download_url -- note that it returns a URL, not a path.

get_layer_path

  • Rename get_layers_list_download_url.
  • Rename get_layer to tree.
  • Rename layer_download_url to layers_list_url. The URLs that are generated from data in the file downloaded from this URL are in fact the layer download URLs. This variable is more like the download URL for the list of layers.
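
Sketch of the read_json_file suggestion (names are illustrative, not the current code):

    # Return the parsed hash instead of pushing onto an array argument.
    def read_json_file(contents)
      MultiJson.load(contents)
    end

    # The calling code then owns the collecting; downloaded_contents
    # stands in for whatever the caller already has in hand.
    json_files = downloaded_contents.map { |contents| read_json_file(contents) }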

Hope this helps.

@eshadatta
Contributor Author

Thanks @sgbalogh for looking into this. Thanks @da70 , I'll look into changing the var names.

@sgbalogh
Contributor

sgbalogh commented Oct 5, 2016

@eshadatta: I was thinking that we could read records (based on the layers.json list) in small batches (maybe ~100) instead of all at once, create ResourceSets for those 100, and then continue iterating through layers.json until there are none left –– that way, there are never more than 100 records stored in memory at a time.
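
Roughly like this (method names are made up, just to show the shape; layer_urls stands in for the URLs parsed from layers.json):

    require 'multi_json'
    require 'open-uri'

    layer_urls.each_slice(100) do |batch|
      records = batch.map { |url| MultiJson.load(URI.parse(url).read) }
      build_resource_sets(records) # create/load ResourceSets for this batch
    end
    # at most ~100 parsed records are held in memory at any one time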

But on second thought, I suspect the slowness might be caused by something else... I'm currently running the load task for this collection on my local machine, and though it took about five minutes to retrieve all 1700 records and store them in memory, it's taking far longer to create the ResourceSets. It's been over 20 minutes since the task got to the ResourceSet creation phase, and my local Ichabod is only showing about 200 of the 1700 records as ingested. I'm not sure paging would avoid this slowness. What do you think?

@eshadatta
Contributor Author

@sgbalogh I wonder if associating each record with a collection is causing the slowness? I can look into it.

@da70
Contributor

da70 commented Oct 5, 2016

@eshadatta, @sgbalogh, I just stepped through the rspec test to see where the slowness was. It took about 7 minutes for the JSON file content to fill up. I presume that was from the cassette, so it would probably take even longer if Ichabod were downloading from GitHub. Maybe instead of one geoblacklight.json file per layer (or whatever the resource is), we could have JSON files for batches of 100? Or are these files not under our control? If they're not, maybe another repo, or an area of the repo, where they've been merged into batch files -- if it's deemed worth it.

I have not run the read method of the GitGeoBlackLightReader all the way through, but I did go through many iterations of the loop, and for me ResourceSet::Resource.new(resource_attributes_from_geoblacklight(gb)) ran pretty fast.
When I did let it run all the way through I actually got an error, but that could be because I'd paused at breakpoints a lot.

@sgbalogh
Contributor

sgbalogh commented Oct 5, 2016

@da70: if retrieving all of the documents individually from GitHub is taking longer than we'd like, then probably the easiest / fastest thing to do would be to grab the zip archive of the entire edu.nyu repo when the task is run, have Ruby stash it in the /tmp dir, and then just traverse the files, destroying the directory at the end of the ingest (something like the sketch below). It would also probably remove the need for a GitHub API key –– though, admittedly, it's not the prettiest solution.
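
Roughly (the archive URL is GitHub's standard one for a branch; everything else here is illustrative):

    require 'open-uri'
    require 'tmpdir'

    ZIP_URL = 'https://github.com/OpenGeoMetadata/edu.nyu/archive/master.zip'

    Dir.mktmpdir('edu.nyu') do |dir|
      zip_path = File.join(dir, 'master.zip')
      IO.copy_stream(URI.parse(ZIP_URL).open, zip_path)
      system('unzip', '-q', zip_path, '-d', dir) or raise 'unzip failed'
      # ... traverse the extracted geoblacklight.json files here ...
    end # mktmpdir removes the directory when the block exits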

@eshadatta: you might be right that the slowness has something to do with collections... I ended up killing my task after over an hour, since I wasn't even halfway through the collection and Ruby was occupying 99% of my CPU cycles. Perhaps there's a way to break around the Hydra methods that send and receive data from Fedora, which would let us figure out whether that's what's taking so much time?

@da70
Contributor

da70 commented Oct 5, 2016

@sgbalogh, I think using the zip file is a perfectly valid solution, and would save a lot of time and machine resources. Would it also eliminate the need for downloading and parsing layers.json?

@sgbalogh
Contributor

sgbalogh commented Oct 6, 2016

@da70: yes, we could avoid using layers.json and just use Ruby's Find class to collect all geoblacklight.json files within the tmp/edu.nyu directory. That's actually exactly what I do to add records into the Solr core which powers GeoBlacklight.
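
E.g., assuming the archive has been extracted under tmp/edu.nyu:

    require 'find'

    # Walk the extracted repo and collect every geoblacklight.json path.
    geoblacklight_files = []
    Find.find('tmp/edu.nyu') do |path|
      geoblacklight_files << path if File.basename(path) == 'geoblacklight.json'
    end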

@eshadatta
Contributor Author

@sgbalogh, sure, I can get the zip files. Where do I get those from? It's easy enough to traverse a directory, and I wouldn't have to worry about tokens.

@da70
Contributor

da70 commented Oct 6, 2016

I did a test to see how long it would take to save 10 of the resources to Fedora. In read_layer_paths I added file_paths = file_paths[0, 10] and then timed how long the loop in the load method of Ichabod::ResourceSet::Base took to complete. The method:

    def load
      ...
      resources.collect do |resource|
        unless resource.is_a?(Resource)
          raise RuntimeError.new("Expecting #{resource} to be a Resource")
        end
        nyucore = resource.to_nyucore
        before_load_methods.each do |before_load_method|
          before_load_method.call(resource, nyucore)
        end
        # if restrictions are specified, assign the value
        unless @set_restrictions.blank?
          nyucore.source_metadata.restrictions = restrictions[@set_restrictions] if restrictions.has_key?(@set_restrictions)
        end
        # assign collection
        nyucore.collection=@collection
        if nyucore.save
          Rails.logger.info("#{nyucore.pid} has been saved to Fedora")
          nyucore
        end
      end.compact
      ...

On my workstation (a Mac Pro with 8 cores and 12 GB RAM), it took 1 minute 4 seconds, i.e. 6.4 seconds per record, with most of that time spent at if nyucore.save. There are currently 1,788 records, which works out to an estimated 191 minutes to load them all into Fedora.

I stepped through nyucore.save a little, and if I had to point at the one call that seemed to take the most time, it would be this one in activesupport-4.1.15/lib/active_support/callbacks.rb, arrived at after calling the persist method in the ActiveFedora Callbacks module:

      def call(*args)
        @before.each { |b| b.call(*args) }
        value = @call.call(*args)
        @after.each { |a| a.call(*args) }
        value
      end

Specifically, the value = @call.call(*args) line where args is an array with one element:

#<struct ActiveSupport::Callbacks::Filters::Environment target=#<Ichabod::NyucoreDatastream @pid="sdr:hdl-handle-net-2451-33914" @dsid="source_metadata" @controlGroup="M" changed="true" @mimeType="application/n-triples" >, halted=false, value=#<Ichabod::NyucoreDatastream @pid="sdr:hdl-handle-net-2451-33914" @dsid="source_metadata" @controlGroup="M" changed="true" @mimeType="application/n-triples" >, run_block=#<Proc:0x007f88fb854ec8@/Users/david/.rvm/gems/ruby-2.1.3@ichabod/gems/rubydora-1.8.1/lib/rubydora/datastream.rb:307>>

It looks like persisting the datastream is what's taking the most time.

@eshadatta
Contributor Author

@da70, thanks for the investigation. If that's the case, then it doesn't really matter how we grab the data. If there's time at the tech meeting, I'll ask people what they think about which method -- zip file or per-layer -- is best. Thanks.

@sgbalogh
Contributor

sgbalogh commented Oct 6, 2016

@eshadatta: You can always grab a zip of the master branch here: https://github.com/OpenGeoMetadata/edu.nyu/archive/master.zip

@da70: It was also taking me approximately 6 seconds per record, on an i7 Mac with 16 GB RAM, so that's useful to hear.

Looks like Fedora really is the culprit then?

@eshadatta
Contributor Author

Update: I will check how long it takes to create resources on an Ichabod commit from before collections were added vs. the current code, and see if there's a difference. Regardless, I will go with the zip method to ingest the data.
