Getting Started: Information Storage
**If you haven't grabbed a local copy of our Examples, click here to learn how.**
The following is based on `ai.preferred.crawler.stackoverflow.entity` and `ai.preferred.crawler.stackoverflow.master` in our Examples package.
To get started using your own storage, we need to take a few basic steps:
- Create an object you would like to store (`Listing.java`)
- Create a Session with a Storage Engine (`ListingCrawler.java`)
- Pass the Session into the Crawler (`ListingCrawler.java`)
- Utilise the Storage Engine in the Handler (`ListingHandler.java`)
Since we will be crawling a bunch of job listings, we create a `Listing` object with the properties `url`, `name` and `company`:
```java
import org.apache.commons.lang3.builder.ToStringBuilder;

public class Listing {

  private final String url;

  private final String name;

  private final String company;

  public Listing(String url, String name, String company) {
    this.url = url;
    this.name = name;
    this.company = company;
  }

  public String getUrl() {
    return url;
  }

  public String getName() {
    return name;
  }

  public String getCompany() {
    return company;
  }

  @Override
  public String toString() {
    return ToStringBuilder.reflectionToString(this);
  }

}
```
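As a quick check, constructing a `Listing` with some placeholder values and printing it exercises the reflective `toString()` (the exact hash in the output will differ on your machine):

```java
final Listing listing = new Listing("http://example.com", "Example Job", "Example Coy");

// ToStringBuilder's default style prints something like:
// Listing@1b6d3586[url=http://example.com,name=Example Job,company=Example Coy]
System.out.println(listing);
```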
Later on, in the Handler, we will create a `Listing` object for every job listing we find on the page.
In this example, we will be using a CSV storage engine to store our data. Once again, we emphasise that you may use any storage of your choice.
First, we need to create a session key before `main()`, which we will use to store and retrieve the storage engine.
```java
// Create session keys for CSV printer to print from handler
static final Session.Key<EntityCSVStorage<Listing>> CSV_STORAGE_KEY = new Session.Key<>();
```
We then initialise an `EntityCSVStorage` of type `Listing` and put it into session storage using `session.put()`. We use a try-with-resources block so that the storage engine is closed automatically once crawling is done:
```java
try (final EntityCSVStorage<Listing> printer = new EntityCSVStorage<>(filename, Listing.class)) {

  // Let's init the session, this allows us to retrieve the storage engine in the Handler
  final Session session = Session.builder()
      .put(CSV_STORAGE_KEY, printer)
      .build();

  ...

} catch (IOException e) {
  LOGGER.error("Unable to open file: {}, {}", filename, e);
}
```
Instead of building the Crawler with an empty builder, we now have to set the session using `setSession()`. Let's change `createCrawler()` a little to take in an additional parameter called `session`, like this:
```java
private static Crawler createCrawler(Fetcher fetcher, Session session) {
  // You can look in the builder for the different things you can add
  return Crawler.builder()
      .setFetcher(fetcher)
      .setSession(session)
      .build();
}
```
We also need to change how we initialise our crawler in the main method, by passing in the additional `session` parameter:
```java
try (final Crawler crawler = createCrawler(createFetcher(fileManager), session).start()) {
  ...
}
```
Now that we have initialised the session, it's time to use it in the Handler.
We need to retrieve the storage engine from the session. To do this, we use `session.get()`:
```java
// Get the CSV printer we created
final EntityCSVStorage<Listing> csvStorage = session.get(ListingCrawler.CSV_STORAGE_KEY);
```
We can now use it to write to the CSV file in the Handler, like so:
```java
final Listing listing = new Listing("http://example.com", "Example Job", "Example Coy");
csvStorage.append(listing);
```
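Putting the pieces together, a minimal `ListingHandler` might look like the sketch below. This is only a sketch: the `Handler` signature and `response.getJsoup()` follow the Venom Examples, the import path for `EntityCSVStorage` is an assumption you should adjust to your project, and the CSS selectors and extracted fields are placeholders to adapt to the actual page you crawl:

```java
import java.io.IOException;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import ai.preferred.crawler.EntityCSVStorage; // assumption - adjust to where EntityCSVStorage lives in your project
import ai.preferred.crawler.stackoverflow.entity.Listing;
import ai.preferred.venom.Handler;
import ai.preferred.venom.Session;
import ai.preferred.venom.Worker;
import ai.preferred.venom.job.Scheduler;
import ai.preferred.venom.request.Request;
import ai.preferred.venom.response.VResponse;

public class ListingHandler implements Handler {

  private static final Logger LOGGER = LoggerFactory.getLogger(ListingHandler.class);

  @Override
  public void handle(Request request, VResponse response, Scheduler scheduler,
                     Session session, Worker worker) {
    // Get the CSV printer we created in ListingCrawler
    final EntityCSVStorage<Listing> csvStorage = session.get(ListingCrawler.CSV_STORAGE_KEY);

    // Parse the page with Jsoup
    final Document document = response.getJsoup();

    // Placeholder selectors - adapt them to the structure of the page you crawl
    for (final Element result : document.select(".listing-result")) {
      final String url = result.select("a").attr("abs:href");
      final String name = result.select(".job-title").text();
      final String company = result.select(".company-name").text();

      // Append one row to the CSV file for every listing found
      try {
        csvStorage.append(new Listing(url, name, company));
      } catch (IOException e) {
        LOGGER.error("Unable to store listing: {}", url, e);
      }
    }
  }

}
```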
Now run the crawler and watch `data\stackoverflow.csv` get populated.
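For the hypothetical listing above, the resulting file might look something like this (assuming the storage engine writes one column per bean property, with a header row):

```
url,name,company
http://example.com,Example Job,Example Coy
```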