
Chore/change sdr source -- feedback request #252

Open · wants to merge 6 commits into development from chore/change_sdr_source

Conversation

eshadatta
Contributor

@eshadatta eshadatta commented Oct 5, 2016

Hello @NYULibraries/hydra , @sgbalogh ,

Here's another PR for this. I have rewritten this to read the layers.json file, parse out the URLs, and read those. This definitely cuts down on the time. I'd still like to add some sort of imitation paging mechanism, because this file will keep growing. So if anyone has sustainable design patterns in mind for implementing this, I'm all ears. Thanks.

EDITED to add: if everyone is ok with the way it is, please let me know that too.
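
For reference, here's a minimal sketch of the approach (the raw-file URLs and the shape of layers.json are assumptions for illustration; the real code is in git_geo_blacklight_reader.rb):

    require 'multi_json'
    require 'open-uri'

    # Assumed layout: layers.json maps layer ids to paths within the
    # OpenGeoMetadata/edu.nyu repo, each containing a geoblacklight.json.
    BASE_URL = 'https://raw.githubusercontent.com/OpenGeoMetadata/edu.nyu/master'

    def geoblacklight_records
      layers = MultiJson.load(URI.parse("#{BASE_URL}/layers.json").read)
      layers.values.map do |path|
        # one request per layer; this is the part that could be paged
        MultiJson.load(URI.parse("#{BASE_URL}/#{path}/geoblacklight.json").read)
      end
    end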

@coveralls

coveralls commented Oct 5, 2016

Coverage Status

Coverage increased (+0.0005%) to 99.269% when pulling 1fb946f on chore/change_sdr_source into c3c0ecb on development.


@sgbalogh
Contributor

sgbalogh commented Oct 5, 2016

@eshadatta: I just downloaded this to take a closer look at possibilities for pagination. One thing I noticed was this line: https://github.com/NYULibraries/ichabod/blob/chore/change_sdr_source/lib/ichabod/resource_set/source_readers/git_geo_blacklight_reader.rb#L50

Since this is now using layers.json to find the URLs of all of the records, is there any need for a GitHub token?

@eshadatta
Contributor Author

@sgbalogh ,

Yes, the access token allows for more requests to the API; without it there's a rate limit (some small number of requests per hour... I don't remember the exact number). I added it for now because I keep querying the API while I'm actively developing the reader, but I might take it out. If so, the way to figure out whether it's for testing purposes or not will be different, and I'll change that code. But for now, I'd like to keep it in there. The access token is in Configula. Please let me know if you need it and I'll send it to you. Thanks!
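
For illustration, the difference is just an Authorization header on the API request (the env var name here is hypothetical; unauthenticated requests are limited to 60/hour per IP, authenticated ones get a much higher limit):

    require 'net/http'
    require 'uri'

    uri = URI('https://api.github.com/repos/OpenGeoMetadata/edu.nyu/git/trees/master')
    request = Net::HTTP::Get.new(uri)
    # Without this header, GitHub throttles to 60 requests/hour.
    request['Authorization'] = "token #{ENV['GITHUB_ACCESS_TOKEN']}"
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
      http.request(request)
    end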

@sgbalogh
Contributor

sgbalogh commented Oct 5, 2016

@eshadatta oh, I wasn't aware of the rate limit –– that makes sense then! Thanks.

@da70
Contributor

da70 commented Oct 5, 2016

@eshadatta

Some feedback:

git_geo_blacklight_reader.rb

map

  • Consider changing the name to something more specific. It's hard to tell whether "map" here refers to a programming term (a map data structure linking key/value pairs, or a method for running a function over an array) or to a geographic map, since this is a geospatial data set. I saw from running the rspec that this actually returns an array of hashes, with each hash containing the contents of a geoblacklight.json. So I'm guessing "map" refers to a geographic map?
  • Change the json_file variable name to json_files, plural. Or you could remove this variable altogether, since it's never used, provided the method name indicates clearly what data is returned.

read_static_files

  • Change json_file to json_files, plural.
  • Consider doing the array push in this calling code rather than passing the array in as the second argument to read_json_file, and perhaps rename read_json_file to add_to_json_files_array if you decide to keep the argument mutation.

parse_download_files

  • Change data variable name to layer_urls.
  • Remove size variable as it is never used.
  • Change json_file to json_files, plural.
  • Consider doing the array push in this calling code rather than passing the array in as the second argument to read_json_file, and perhaps rename read_json_file to add_to_json_files_array if you decide to keep the argument mutation.

read_json_file - consider renaming it add_to_json_files_array, or else changing it so it doesn't take an array as a second argument to append to, but instead just returns MultiJson.load(contents) (see the sketch below).

read_layer_paths

  • Change name of method to something like get_download_urls_for_layers_data.
  • Change variable name url to layers_list_download_url, and get_layer_path to get_layers_list_download_url -- note that it returns a URL, not a path.

get_layer_path

  • Rename get_layers_list_download_url.
  • Rename get_layer to tree.
  • Rename layer_download_url to layers_list_url. The URLs that are generated from data in the file downloaded from this URL are in fact the layer download URLs. This variable is more like the download URL for the list of layers.
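
Sketch of the read_json_file suggestion (names are illustrative, not the current code):

    # Return the parsed hash instead of pushing onto an array argument.
    def read_json_file(contents)
      MultiJson.load(contents)
    end

    # The calling code then owns the collecting; downloaded_contents
    # stands in for whatever the caller already has in hand.
    json_files = downloaded_contents.map { |contents| read_json_file(contents) }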

Hope this helps.

@eshadatta
Contributor Author

Thanks @sgbalogh for looking into this. Thanks @da70 , I'll look into changing the var names.

@sgbalogh
Contributor

sgbalogh commented Oct 5, 2016

@eshadatta: I was thinking that we could read records (based on the layers.json list) in small batches (maybe ~100) instead of all at once, create ResourceSets for those 100, and then continue iterating through layers.json until there are none left –– that way, there are never more than 100 records stored in memory at a time.
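
Roughly like this (method names are made up, just to show the shape; layer_urls stands in for the URLs parsed from layers.json):

    require 'multi_json'
    require 'open-uri'

    layer_urls.each_slice(100) do |batch|
      records = batch.map { |url| MultiJson.load(URI.parse(url).read) }
      build_resource_sets(records) # create/load ResourceSets for this batch
    end
    # at most ~100 parsed records are held in memory at any one time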

But on second thought, I suspect the slowness might be caused by something else... I'm currently running the load task for this collection on my local machine, and though it took about five minutes to retrieve all 1700 records and store them in memory, it's taking far longer to create the ResourceSets. It's been over 20 minutes since the task got to the ResourceSet creation phase, and my local Ichabod is only showing about 200 of the 1700 records as ingested. I'm not sure paging would avoid this slowness. What do you think?

@eshadatta
Contributor Author

@sgbalogh I wonder if associating each record with a collection is causing the slowness? I can look into it.

@da70
Contributor

da70 commented Oct 5, 2016

@eshadatta, @sgbalogh, I just stepped through the rspec test to see where the slowness was. It took about 7 minutes for the JSON file content to fill up. I presume that was from the cassette, so it would probably take even longer if Ichabod were downloading from GitHub. Maybe instead of one geoblacklight.json file per layer (or whatever the resource is), we could have JSON files for batches of 100? Or are these files not under our control? If they're not, maybe another repo, or an area of the repo, where they've been merged into batch files -- if it's deemed worth it.

I have not run the read method of the GitGeoBlackLightReader all the way through, but I did go through many iterations of the loop, and for me ResourceSet::Resource.new(resource_attributes_from_geoblacklight(gb)) ran pretty fast.
When I did let it run all the way through I actually got an error, but that could be because I'd paused at breakpoints a lot.

@sgbalogh
Contributor

sgbalogh commented Oct 5, 2016

@da70: if retrieving all of the documents individually from GitHub is taking longer than we'd like, then probably the easiest / fastest thing to do would be to grab the zip archive of the entire edu.nyu repo when the task is run, have Ruby stash it in the /tmp dir, and then just traverse the files, destroying the directory at the end of the ingest (something like the sketch below). It would also probably remove the need for a GitHub API key –– though, admittedly, it's not the prettiest solution.
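
Roughly (the archive URL is GitHub's standard one for a branch; everything else here is illustrative):

    require 'open-uri'
    require 'tmpdir'

    ZIP_URL = 'https://github.com/OpenGeoMetadata/edu.nyu/archive/master.zip'

    Dir.mktmpdir('edu.nyu') do |dir|
      zip_path = File.join(dir, 'master.zip')
      IO.copy_stream(URI.parse(ZIP_URL).open, zip_path)
      system('unzip', '-q', zip_path, '-d', dir) or raise 'unzip failed'
      # ... traverse the extracted geoblacklight.json files here ...
    end # mktmpdir removes the directory when the block exits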

@eshadatta: you might be right that the slowness has something to do with collections... I ended up killing my task after over an hour, since I wasn't even halfway through the collection and Ruby was occupying 99% of my CPU cycles. Perhaps there's a way to break around the Hydra methods that send and receive data from Fedora, which would let us figure out whether that's what's taking so much time?

@da70
Contributor

da70 commented Oct 5, 2016

@sgbalogh, I think using the zip file is a perfectly valid solution, and would save a lot of time and machine resources. Would it also eliminate the need for downloading and parsing layers.json?

@sgbalogh
Contributor

sgbalogh commented Oct 6, 2016

@da70: yes, we could avoid using layers.json and just use Ruby's Find class to collect all geoblacklight.json files within the tmp/edu.nyu directory. That's actually exactly what I do to add records into the Solr core which powers GeoBlacklight.
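
E.g., assuming the archive has been extracted under tmp/edu.nyu:

    require 'find'

    # Walk the extracted repo and collect every geoblacklight.json path.
    geoblacklight_files = []
    Find.find('tmp/edu.nyu') do |path|
      geoblacklight_files << path if File.basename(path) == 'geoblacklight.json'
    end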

@eshadatta
Contributor Author

@sgbalogh, sure, I can get the zip files. Where do I get those from? It's easy enough to traverse a directory, and I wouldn't have to worry about tokens.

@da70
Contributor

da70 commented Oct 6, 2016

I did a test to see how long it would take to save 10 of the resources to Fedora. In read_layer_paths I added file_paths = file_paths[0, 10] and then timed how long the loop in the load method of Ichabod::ResourceSet::Base took to complete. The method:

    def load
      ...
      resources.collect do |resource|
        unless resource.is_a?(Resource)
          raise RuntimeError.new("Expecting #{resource} to be a Resource")
        end
        nyucore = resource.to_nyucore
        before_load_methods.each do |before_load_method|
          before_load_method.call(resource, nyucore)
        end
        # if restrictions are specified, assign the value
        unless @set_restrictions.blank?
          nyucore.source_metadata.restrictions = restrictions[@set_restrictions] if restrictions.has_key?(@set_restrictions)
        end
        # assign collection
        nyucore.collection=@collection
        if nyucore.save
          Rails.logger.info("#{nyucore.pid} has been saved to Fedora")
          nyucore
        end
      end.compact
      ...

On my workstation (a Mac Pro with 8 cores and 12 GB RAM), it took 1 minute 4 seconds, i.e. 6.4 seconds per record, with most of that time spent at if nyucore.save. There are currently 1,788 records, which works out to an estimated 191 minutes to load them all into Fedora.

I stepped through nyucore.save a little, and if I had to point at the one call that seemed to take the most time, it would be this one in activesupport-4.1.15/lib/active_support/callbacks.rb, arrived at after calling the persist method in the ActiveFedora Callbacks module:

      def call(*args)
        @before.each { |b| b.call(*args) }
        value = @call.call(*args)
        @after.each { |a| a.call(*args) }
        value
      end

Specifically, the value = @call.call(*args) line where args is an array with one element:

#<struct ActiveSupport::Callbacks::Filters::Environment target=#<Ichabod::NyucoreDatastream @pid="sdr:hdl-handle-net-2451-33914" @dsid="source_metadata" @controlGroup="M" changed="true" @mimeType="application/n-triples" >, halted=false, value=#<Ichabod::NyucoreDatastream @pid="sdr:hdl-handle-net-2451-33914" @dsid="source_metadata" @controlGroup="M" changed="true" @mimeType="application/n-triples" >, run_block=#<Proc:0x007f88fb854ec8@/Users/david/.rvm/gems/ruby-2.1.3@ichabod/gems/rubydora-1.8.1/lib/rubydora/datastream.rb:307>>

It looks like persisting the datastream is what's taking the most time.

@eshadatta
Contributor Author

@da70, thanks for the investigation. If that's the case, then it doesn't really matter how we grab the data. If there's time at the tech meeting, I'll ask people what they think about which method -- zip file or per-layer -- is best. Thanks.

@sgbalogh
Contributor

sgbalogh commented Oct 6, 2016

@eshadatta: You can always grab a zip of the master branch here: https://github.com/OpenGeoMetadata/edu.nyu/archive/master.zip

@da70: It was also taking me approximately 6 seconds per record, on an i7 Mac with 16 GB RAM, so that's useful to hear.

Looks like Fedora really is the culprit then?

@eshadatta
Contributor Author

Update: I will check how long it takes to create resources on an Ichabod commit from before collections were added vs. the current code, and see if there's a difference. Regardless, I will go with the zip method to ingest the data.
