I think it would be really useful if Warcat provided an interface for lazy iteration over a whole WARC file. I imagine it would look something like this:
```python
import warcat

for record in warcat.readrecords('pages.warc.gz'):
    with record.content() as f:
        # process f
        pass
```
I was surprised that the example provided in the documentation:
reads everything into memory, and that there is no easy way to iterate over records without loading the whole file.
In my case, the WARC files take up gigabytes of space, so I want to process them record by record without loading everything into memory.
After reading the source code, I came up with this helper function:
Also, if I could get lxml, BeautifulSoup, and json objects directly from records, something like this:
Then it would be really amazing.
If you agree with the suggested API, I can create a pull request with the implementation.
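To make the idea of lazy, record-by-record iteration concrete, here is a minimal sketch that uses only the Python standard library. It is not Warcat's API and not necessarily how an implementation inside Warcat would look. The names `iter_warc_records` and `read_records` are hypothetical, and the sketch assumes well-formed records with a `Content-Length` header spelled exactly that way.

```python
import gzip
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for one record at a time from a WARC byte stream.

    Hypothetical helper: only the current record is held in memory.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # Read "Name: value" header lines up to the empty line that ends them.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()

        # Assumption: the header is spelled exactly "Content-Length".
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


def read_records(path: str) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Open a .warc or .warc.gz file and lazily yield its records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        yield from iter_warc_records(f)


# Example usage (assumes a file named pages.warc.gz exists):
# for headers, payload in read_records("pages.warc.gz"):
#     print(headers.get("WARC-Type"), len(payload))
```

Each payload is still read into memory in full, so a very large single record would need a file-like wrapper instead, along the lines of the `record.content()` call proposed above.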