Getting Started: Information Storage
**If you haven't grabbed a local copy of our Examples, click here to learn how.**
The following is based on `ai.preferred.crawler.stackoverflow.entity` and `ai.preferred.crawler.stackoverflow.master` in our Examples package.
To get started using your own storage, we need to take a few basic steps:
- Create an object you would like to store (`Listing.java`)
- Create a Session with a Storage Engine (`ListingCrawler.java`)
- Pass the Session into the Crawler (`ListingCrawler.java`)
- Utilise the Storage Engine in the Handler (`ListingHandler.java`)
Since we will be crawling a bunch of job listings, we create a `Listing` object with the properties `url`, `name` and `company`:
```java
import org.apache.commons.lang3.builder.ToStringBuilder;

public class Listing {

  private final String url;

  private final String name;

  private final String company;

  public Listing(String url, String name, String company) {
    this.url = url;
    this.name = name;
    this.company = company;
  }

  public String getUrl() {
    return url;
  }

  public String getName() {
    return name;
  }

  public String getCompany() {
    return company;
  }

  @Override
  public String toString() {
    return ToStringBuilder.reflectionToString(this);
  }

}
```
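As a quick check, constructing a `Listing` with some placeholder values and printing it exercises the reflective `toString()` (the exact hash in the output will differ on your machine):

```java
final Listing listing = new Listing("http://example.com", "Example Job", "Example Coy");

// ToStringBuilder's default style prints something like:
// Listing@1b6d3586[url=http://example.com,name=Example Job,company=Example Coy]
System.out.println(listing);
```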
Later on, in the Handler, we will create a `Listing` object for every job listing we find on the page.
In this example, we will be using a CSV storage engine to store our data. Once again, we emphasise that you may use any storage of your choice.
First, we need to create a session key before `main()`, which we will use to store and retrieve the storage engine.
```java
// Create session keys for CSV printer to print from handler
static final Session.Key<EntityCSVStorage<Listing>> CSV_STORAGE_KEY = new Session.Key<>();
```
We then initialise an `EntityCSVStorage` of type `Listing` and put it into session storage using `session.put()`. We use a try-with-resources block so that the storage engine is closed automatically once crawling is done:
```java
try (final EntityCSVStorage<Listing> printer = new EntityCSVStorage<>(filename, Listing.class)) {

  // Let's init the session, this allows us to retrieve the storage engine in the Handler
  final Session session = Session.builder()
      .put(CSV_STORAGE_KEY, printer)
      .build();

  ...

} catch (IOException e) {
  LOGGER.error("Unable to open file: {}, {}", filename, e);
}
```
Instead of building the Crawler with an empty builder, we now have to set the session using `setSession()`. Let's change `createCrawler()` a little to take in an additional parameter called `session`, like this:
```java
private static Crawler createCrawler(Fetcher fetcher, Session session) {
  // You can look in the builder for the different things you can add
  return Crawler.builder()
      .setFetcher(fetcher)
      .setSession(session)
      .build();
}
```
We also need to change how we initialise our crawler in the main method, by passing in the additional `session` parameter:
```java
try (final Crawler crawler = createCrawler(createFetcher(fileManager), session).start()) {
  ...
}
```
Now that we have initialised the session, it's time to use it in the Handler.
We need to retrieve the storage engine from the session. To do this, we use `session.get()`:
```java
// Get the CSV printer we created
final EntityCSVStorage<Listing> csvStorage = session.get(ListingCrawler.CSV_STORAGE_KEY);
```
We can now use it to write to the CSV file in the Handler, like so:
```java
final Listing listing = new Listing("http://example.com", "Example Job", "Example Coy");
csvStorage.append(listing);
```
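Putting the pieces together, a minimal `ListingHandler` might look like the sketch below. This is only a sketch: the `Handler` signature and `response.getJsoup()` follow the Venom Examples, the import path for `EntityCSVStorage` is an assumption you should adjust to your project, and the CSS selectors and extracted fields are placeholders to adapt to the actual page you crawl:

```java
import java.io.IOException;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import ai.preferred.crawler.EntityCSVStorage; // assumption - adjust to where EntityCSVStorage lives in your project
import ai.preferred.crawler.stackoverflow.entity.Listing;
import ai.preferred.venom.Handler;
import ai.preferred.venom.Session;
import ai.preferred.venom.Worker;
import ai.preferred.venom.job.Scheduler;
import ai.preferred.venom.request.Request;
import ai.preferred.venom.response.VResponse;

public class ListingHandler implements Handler {

  private static final Logger LOGGER = LoggerFactory.getLogger(ListingHandler.class);

  @Override
  public void handle(Request request, VResponse response, Scheduler scheduler,
                     Session session, Worker worker) {
    // Get the CSV printer we created in ListingCrawler
    final EntityCSVStorage<Listing> csvStorage = session.get(ListingCrawler.CSV_STORAGE_KEY);

    // Parse the page with Jsoup
    final Document document = response.getJsoup();

    // Placeholder selectors - adapt them to the structure of the page you crawl
    for (final Element result : document.select(".listing-result")) {
      final String url = result.select("a").attr("abs:href");
      final String name = result.select(".job-title").text();
      final String company = result.select(".company-name").text();

      // Append one row to the CSV file for every listing found
      try {
        csvStorage.append(new Listing(url, name, company));
      } catch (IOException e) {
        LOGGER.error("Unable to store listing: {}", url, e);
      }
    }
  }

}
```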
Now run the crawler and watch `data\stackoverflow.csv` get populated.
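For the hypothetical listing above, the resulting file might look something like this (assuming the storage engine writes one column per bean property, with a header row):

```
url,name,company
http://example.com,Example Job,Example Coy
```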