Chore/change sdr source -- feedback request #252
base: development
Conversation
…blacklight records
… files for cuke testing
@eshadatta: I just downloaded this to take a closer look at possibilities for pagination. One thing I noticed was this line: https://github.com/NYULibraries/ichabod/blob/chore/change_sdr_source/lib/ichabod/resource_set/source_readers/git_geo_blacklight_reader.rb#L50. Since this is now using layers.json to find the URLs for all of the records, is there any need for a GitHub token?
Yes, the access token allows for more requests to the API; without it there's a restriction (only a small number of requests per hour, I don't remember the exact figure). I added it for now because I keep querying the API while I'm actively developing the reader, but I might take it out. If I do, the way to figure out whether it's running for testing purposes will of course be different, and I'll change that code. For now, though, I'd like to keep it in there. The access token is in configula. Please let me know if you need it and I'll send it to you. Thanks!
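For context on the rate-limit point: unauthenticated requests to the GitHub API are capped at a small number per hour, while requests sent with a personal access token get a much higher allowance. A minimal, hypothetical sketch of passing a token along with a request; this is not the reader's actual code, and `fetch_json` is an invented name:

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical helper: fetch a JSON document from GitHub, optionally
# authenticating with a personal access token so the request counts
# against the higher authenticated rate limit.
def fetch_json(url, token: nil)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  request['Authorization'] = "token #{token}" if token
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  JSON.parse(response.body)
end
```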
@eshadatta oh, I wasn't aware of the rate restriction; that makes sense then! Thanks.
Some feedback on git_geo_blacklight_reader.rb:
read_json_file: consider renaming.
Hope this helps.
@eshadatta: I was thinking that we could page through the records (based on the …). But on second thought, I'm suspicious that the slowness might be caused by something else. I'm currently running the load task for this collection on my local machine, and though it took about five minutes to retrieve all 1700 records and store them in memory, it's taking far longer to create ResourceSets. It's been over 20 minutes since the task got to the ResourceSet creation phase, and my local Ichabod is only showing about 200 of the 1700 records as ingested. I'm not sure that paging would avoid this slowness. What do you think?
@sgbalogh I wonder if associating each record with a collection is causing the slowness? I can look into it.
@eshadatta, @sgbalogh, I just now stepped through the rspec test to see where the slowness was. It took about 7 minutes for the JSON file content to fill up. I am presuming it came from the cassette, so it would probably take even longer if Ichabod were downloading from GitHub. Maybe instead of one … I have not run the …
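Assuming the cassette mentioned here is a VCR cassette recorded by the specs, replay happens from a local fixture file rather than the network, so timings measured this way can differ from a real GitHub download. A rough sketch of that setup, assuming the vcr and webmock gems; the cassette name and paths are made up:

```ruby
require 'vcr'

VCR.configure do |config|
  config.cassette_library_dir = 'spec/vcr_cassettes'
  config.hook_into :webmock
end

# Inside a spec: HTTP calls made in this block are served from the recorded
# cassette instead of hitting GitHub, so the time measured would reflect
# local playback plus JSON parsing rather than network latency.
VCR.use_cassette('git_geo_blacklight_reader') do
  # reader.read would go here
end
```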
@da70: if the retrieval of all of the documents individually from GitHub is taking longer than we like, then probably the easiest/fastest thing to do would be to grab the zip archive of the entire repo.

@eshadatta: you might be right that the slowness has something to do with collections. I ended up killing my task after over an hour, since I wasn't even halfway through the collection, and Ruby was occupying 99% of my CPU cycles. Perhaps there is a way to break around the Hydra methods that send and receive data from Fedora, which would let us figure out if that's what's taking so much time?
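One way to "break around" the Fedora writes is to time the in-memory build separately from the saves. A sketch only: `reader.read` and the other names follow the snippets in this thread but are stand-ins for whatever the load task actually calls:

```ruby
require 'benchmark'

# Time building the resources in memory (GitHub retrieval + parsing)...
build_seconds = Benchmark.realtime { @resources = reader.read }

# ...separately from persisting a small sample of them to Fedora.
save_seconds = Benchmark.realtime do
  @resources.first(10).each { |resource| resource.to_nyucore.save }
end

puts format('build: %.1fs, save 10 records: %.1fs', build_seconds, save_seconds)
```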
@sgbalogh, I think using the zip file is a perfectly valid solution, and it would save a lot of time and machine resources. Would it also eliminate the need for downloading and parsing layers.json?
@da70: yes, we could avoid using layers.json.
@sgbalogh, sure, I can get the zip files. Where do I get those from? It's easy enough to traverse a directory, and I don't have to worry about tokens.
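If the archive is downloaded and unpacked locally, the "traverse a directory" approach could look something like the sketch below. The extraction path and the geoblacklight.json filename pattern are assumptions about how the OpenGeoMetadata repository is laid out, not verified here:

```ruby
require 'json'

# Walk the unpacked archive and parse every record file found; the path and
# filename pattern are assumptions, adjust to the repo's actual layout.
records = Dir.glob('edu.nyu-master/**/geoblacklight.json').map do |path|
  JSON.parse(File.read(path))
end

puts "found #{records.size} records"
```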
I did a test to see how long it would take to save 10 of the resources to Fedora. In `def load`:

```ruby
...
resources.collect do |resource|
  unless resource.is_a?(Resource)
    raise RuntimeError.new("Expecting #{resource} to be a Resource")
  end
  nyucore = resource.to_nyucore
  before_load_methods.each do |before_load_method|
    before_load_method.call(resource, nyucore)
  end
  # if restrictions are specified, assign the value
  unless @set_restrictions.blank?
    nyucore.source_metadata.restrictions = restrictions[@set_restrictions] if restrictions.has_key?(@set_restrictions)
  end
  # assign collection
  nyucore.collection = @collection
  if nyucore.save
    Rails.logger.info("#{nyucore.pid} has been saved to Fedora")
    nyucore
  end
end.compact
...
```

On my workstation (a Mac Pro with 8 cores and 12 GB of RAM), it took 1 minute 4 seconds. That is about 6.4 seconds per record, and most of the time is spent at … I stepped through:

```ruby
def call(*args)
  @before.each { |b| b.call(*args) }
  value = @call.call(*args)
  @after.each { |a| a.call(*args) }
  value
end
```

Specifically, the …
Looks like it's the persisting of the datastream that is taking the most time.
@da70, thanks for the investigation. If that's the case, then it doesn't really matter how we grab the data. If there's time at the tech meeting, I'll ask people which method they think is best, the zip file or layers.json. Thanks.
@eshadatta: You can always grab a zip of the master branch here: https://github.com/OpenGeoMetadata/edu.nyu/archive/master.zip

@da70: It was also taking me approximately 6 seconds per record, on an i7 Mac with 16 GB of RAM, so that's useful to hear. Looks like Fedora really is the culprit then?
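Assuming the rubyzip gem, grabbing that master.zip and reading the records out of it in a single request could look roughly like this; the geoblacklight.json filename pattern is again an assumption:

```ruby
require 'json'
require 'open-uri'
require 'zip'

# One HTTP request for the whole archive, then read the record files out of
# the zip in memory instead of requesting each document individually.
archive = URI.open('https://github.com/OpenGeoMetadata/edu.nyu/archive/master.zip').read

records = []
Zip::File.open_buffer(archive) do |zip|
  zip.each do |entry|
    next unless entry.name.end_with?('geoblacklight.json')
    records << JSON.parse(entry.get_input_stream.read)
  end
end

puts "read #{records.size} records from one request"
```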
Update: I will check how long it takes to create resources from an Ichabod commit from before the collection change vs. the current code and see if there's a difference. Regardless, I will go with the zip method to ingest the data.
Hello @NYULibraries/hydra, @sgbalogh,
Here's another PR for this. I have rewritten the reader to go through the layers.json file, parse out the URLs, and read those. This definitely cuts down the time. I'd still like to add some sort of imitation paging mechanism, because this collection will keep growing. So if people have sustainable design patterns in mind for implementing this, I'm all ears. Thanks.
EDITED to add: if everyone is ok with the way it is, please let me know that too.
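Not a concrete proposal, just one possible shape for the "imitation paging" idea: since layers.json lists every record, the reader could process it in fixed-size slices so each batch is fetched and loaded before the next one starts. The slice size, and the assumption that layers.json maps layer ids to record paths, are illustrative only:

```ruby
require 'json'

PAGE_SIZE = 100

# Assumes layers.json maps layer ids to record paths, e.g.
# { "nyu_2451_12345" => "handle/2451/12345", ... }
layers = JSON.parse(File.read('layers.json'))

layers.values.each_slice(PAGE_SIZE).with_index(1) do |page, page_number|
  puts "page #{page_number}: #{page.size} records"
  # fetch, build, and load just this slice here before moving on
end
```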