layout | title | categories | parent | weight |
---|---|---|---|---|
post | Modeling your data | SBP | data-access-patterns.html | 1550 |
{% tip %}
Summary: How to model application data for an in-memory data grid.
Author: Shay Hassidim, Deputy CTO, GigaSpaces
Recently tested with GigaSpaces version: XAP 7.1
Last Update: October 2009
{% endtip %}
When moving from a centralized to a distributed data store, your data needs to be partitioned across multiple nodes (also known as partitions). Implementing the partitioning mechanism is not technically hard; however, planning the distribution of your data for scalability and performance requires some thought.
There are several questions which need to be answered when planning for data partitioning:
- What is the information I should store in memory? The answer to this question is not a technical one, and should not be confused with the structure of the data. It is in essence a business question: How much will the data grow over time? For how long should you keep it?
We recommend using the following table for this process:
{: .table .table-bordered}
Data item | Estimated Quantity | Expected Growth | Estimated Object Size |
---|---|---|---|
Data Type A | 100K | 10% | 2K |
Data Type B | 200K | 20% | 4K |
Once you have identified the size and expected growth of your data, you can start thinking about partitioning it; however, there's more to consider before doing that.
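As a rough illustration of how the table above can be turned into a memory estimate, here is a minimal sketch. It assumes that "Expected Growth" is a yearly rate, that "2K" means 2048 bytes, and it ignores indexes, backup copies and JVM overhead. `CapacityEstimate` and `projectedBytes` are hypothetical names, not part of any GigaSpaces API.

```java
// Hypothetical helper for illustration only; not part of any GigaSpaces API.
class CapacityEstimate {

    /**
     * Projects the raw footprint of one data type:
     * quantity * (1 + yearlyGrowth)^years * objectSizeBytes.
     * Assumption: growth compounds yearly; indexes, backups and JVM overhead are ignored.
     */
    static long projectedBytes(long quantity, double yearlyGrowth, long objectSizeBytes, int years) {
        double projectedCount = quantity * Math.pow(1.0 + yearlyGrowth, years);
        return Math.round(projectedCount * objectSizeBytes);
    }

    public static void main(String[] args) {
        int years = 3; // assumed planning horizon
        long typeA = projectedBytes(100_000, 0.10, 2 * 1024, years);
        long typeB = projectedBytes(200_000, 0.20, 4 * 1024, years);
        long mb = 1024 * 1024;
        System.out.println("Data Type A: " + typeA / mb + " MB");
        System.out.println("Data Type B: " + typeB / mb + " MB");
        System.out.println("Total:       " + (typeA + typeB) / mb + " MB");
    }
}
```

The total gives a lower bound on the memory the partitions must provide together; real sizing should add headroom for indexes and backups.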
- What are my application's use cases? While you might be used to modeling your data by the logical relationships between your data items, distributed data requires you to think differently. The rule of thumb is to avoid cross-cluster relationships as much as possible, since they lead to cross-cluster queries and updates, which are usually much slower and less scalable than their local counterparts.
Thinking in terms of traditional relationships ("one to one", "one to many" and "many to many") can be misleading with distributed data. The first question to ask is: How many different associations does each entity have?
If an entity is associated with several containers (parent entities), it can't be embedded within a containing entity without being duplicated. It might also be impossible to store it with all of its containers on the same partition. We have mentioned the concept of embedded relationships above; let us now explain its implications for your application.
A space may store many types of entities. Just as a database may have many tables, a space may have many space classes. In practice there is no limit to the number of entities (space classes or data types) you may store within a given space cluster. Each space class may have an unlimited number of instances (space objects or entries).
Unlike legacy caching products that promote a Map-per-Entity storage model, the space data modeling approach lets you treat all of your application objects naturally, with one global in-memory data source regardless of their data type.
With Embedded Relationships, a parent object physically contains the associated object(s) and there is a strong lifecycle dependency between them: once you delete the containing object, you also delete all of its contained objects. With this type of object association, transactions are always local, since the entire object graph is stored in the same entry within the Space.
Fetching objects from the space when using the Embedded Relationships model is done with a SQLQuery and the `readMultiple` call, or with the IteratorBuilder when the result set is large, where the SQLQuery predicate uses root-level or embedded object properties. With a single SQLQuery you can specify a query that spans objects of different data types that are related to each other or contained in each other. The embedded objects may be elements within an array or any type of collection (List, Map), or just a simple referenced object.
With earlier versions of XAP, when updating an object you had to read the entire object back to the client, get a property value, update it and then write the object back to the space. If you had to update a property within an embedded object, you had to navigate the object graph, access the property within the embedded object and update it before writing the entire space object. When an object has many properties or many nested embedded objects, updating data may impose an overhead, as it involves serializing a large number of properties.
Starting with XAP 9.0, the Change API allows you to overcome this limitation and atomically modify specific properties within the root space object or any embedded object without reading the entire object graph. This reduces the amount of data transferred between the client and the primary space, and also between the primary and the backup when replicating updates.
With the embedded model, updating a nested collection with a large number of elements (as well as adding to or removing from it) should use the Change API, since the default behavior is to replicate the entire space object and its nested collection elements from the primary to the backup (or to other replica primary copies when using the sync-replicated or async-replicated cluster schema). The Change API reduces CPU utilization on the primary, reduces serialization overhead, and reduces garbage collection activity on both the primary and the backup. This can improve overall system stability significantly.
With Non-Embedded Relationships, a parent object is associated with a number of other objects, so you can navigate from one object to others. However, there is no lifecycle dependency between them, so if you delete the referencing object (parent), you don't automatically delete the referenced (child) objects. The association is therefore manifested by storing the child IDs in the parent rather than the actual associated objects. This type of relationship means that you may access the child objects separately, without accessing their parent objects. This approach avoids the need to duplicate a child object when it is referenced by more than one parent object. It might, however, force you to perform multiple space operations when accessing the entire parent-child graph across multiple space cluster partitions.
The following describes the different data modeling options available with Non-Embedded Relationships:
With the Parent-First approach, you first retrieve an initial set of "root space objects", usually using a SQLQuery or a template with the `readMultiple` call, or the IteratorBuilder when the result set is large. You then use metadata stored within these root space objects, such as the ID or IDs of related objects and their routing field value (when these are distributed across multiple remote partitions), to fetch the related (child) objects. Fetch these with the `readById` or `readByIds` calls. Both `readById` and `readByIds` allow you to provide the routing field value, avoiding the need to search the entire cluster for matching objects. You may also use the Change API to modify specific child objects without even reading them first.
With the Child-First approach, you access the referenced (child) objects directly and from these access their parent object. In this flow the child object stores the parent object ID (and routing field value). You query the space for child objects by one or more properties using a SQLQuery or a template with the `readMultiple` call, iterate over the resulting child objects collecting the parent IDs, and read all relevant parent objects with the `readByIds` call.
{% tip %} Since version 9.5, XAP supports projections where you can read specific properties (delta read) instead of reading the entire space object content. This may optimize the data retrieval flow. {% endtip %}
A hybrid of the Parent-First and Child-First approaches has the parent store the IDs of its child objects and each child object store the ID of its parent. With this approach you can choose the data retrieval flow that fits the business logic requirements, which provides greater flexibility. Such a model allows navigating from a child object to its siblings via the common parent using two simple space calls. The downside of this approach is the redundant metadata maintained in memory, the extra updates required when data is deleted, and transactions that span more objects. This impacts the system's concurrency level.
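The Parent-First, Child-First and hybrid navigation flows can be sketched without any space at all. In the following minimal sketch, plain maps stand in for ID lookups against the space; the class and method names (`RelationshipNavigationSketch`, `childrenOf`, `parentOf`, `siblingsOf`) are hypothetical and only illustrate which IDs each side stores and in which order they are followed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in: the maps play the role of ID lookups against the space.
class RelationshipNavigationSketch {

    static class Parent {
        final int id;
        final List<Integer> childIds; // parent stores the IDs of its children
        Parent(int id, List<Integer> childIds) { this.id = id; this.childIds = childIds; }
    }

    static class Child {
        final int id;
        final int parentId; // child stores the ID of its parent
        Child(int id, int parentId) { this.id = id; this.parentId = parentId; }
    }

    final Map<Integer, Parent> parentsById = new HashMap<Integer, Parent>();
    final Map<Integer, Child> childrenById = new HashMap<Integer, Child>();

    void store(Parent parent, Child... children) {
        parentsById.put(parent.id, parent);
        for (Child child : children) {
            childrenById.put(child.id, child);
        }
    }

    // Parent-First: start from the parent and follow the child IDs it stores.
    List<Child> childrenOf(int parentId) {
        List<Child> result = new ArrayList<Child>();
        Parent parent = parentsById.get(parentId);
        if (parent == null) {
            return result;
        }
        for (Integer childId : parent.childIds) {
            Child child = childrenById.get(childId);
            if (child != null) {
                result.add(child);
            }
        }
        return result;
    }

    // Child-First: start from the child and follow the parent ID it stores.
    Parent parentOf(int childId) {
        Child child = childrenById.get(childId);
        return child == null ? null : parentsById.get(child.parentId);
    }

    // Hybrid: child -> common parent -> the parent's other children.
    List<Child> siblingsOf(int childId) {
        List<Child> siblings = new ArrayList<Child>();
        Parent parent = parentOf(childId);
        if (parent == null) {
            return siblings;
        }
        for (Child child : childrenOf(parent.id)) {
            if (child.id != childId) {
                siblings.add(child);
            }
        }
        return siblings;
    }
}
```

Note that the hybrid flow only works because both directions of the association are stored, which is exactly the redundant metadata that must be kept in sync on every update or delete.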
Many times you have an existing application that has evolved around a database as its sole system of record. In such a case you might be using Hibernate (or some other mapping layer) to bridge between the object model your application uses and the relational model the database uses. In other cases you might be using the JDBC API to access the database.
To leverage the space data modeling approach, you will need to adapt your existing application entities to use the right data access routines. The entity classes should be modified to leverage the space data model and API. When the application uses Hibernate, for example, the changes can be made in a manner that is relatively transparent to the application itself. The Moving from Hibernate to Space page provides a simple guide on how to perform these changes at the Data Access Object (DAO) layer. You might be able to automate this process via code generation or bytecode manipulation.
The following example uses the Author and Book entities. Here is what the original Author and Book entities look like:
{: .table .table-bordered}
Author | Book |
---|---|
id:Integer | id:Integer |
lastName:String | title:String |
You can download the code used in the examples below. See `MainEmbeddedOne2Many`, `MainEmbeddedOne2One`, `MainNonEmbeddedOne2Many`, `MainNonEmbeddedOne2One` and `MainJDBC`, which demonstrate each scenario described below.
The examples below can be used with a client accessing a remote space or a "collocated client" running within the space - e.g. a DistributedTask implementation or a service method invoked in a broadcast mode. The collocated client will reduce the serialization and network overhead. When using the collocated client approach with the non-embedded model you should use the same routing field value for associated objects (parent-child).
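To see why associated objects should share a routing field value, consider the following minimal sketch. It assumes, for illustration only, that a partition is chosen from the hash code of the routing value modulo the number of partitions; `RoutingSketch` and `partitionFor` are hypothetical names and do not represent the product's internal implementation.

```java
// Hypothetical sketch; the modulo rule below is an assumption made for illustration.
class RoutingSketch {

    /** Maps a routing value to a partition index in the range [0, partitions). */
    static int partitionFor(Object routingValue, int partitions) {
        // The remainder is always smaller in magnitude than 'partitions', so abs() is safe here.
        return Math.abs(routingValue.hashCode() % partitions);
    }

    public static void main(String[] args) {
        int partitions = 4;
        Integer authorId = 42;
        Integer bookId = 7;

        // The parent is routed by its own ID; the child is routed by the parent ID it stores.
        int authorPartition = partitionFor(authorId, partitions);
        int bookRoutedByAuthorId = partitionFor(authorId, partitions);

        // A child routed by its own ID may land on a different partition.
        int bookRoutedByOwnId = partitionFor(bookId, partitions);

        System.out.println("Author partition:             " + authorPartition);
        System.out.println("Book routed by authorId:      " + bookRoutedByAuthorId);
        System.out.println("Book routed by its own bookId: " + bookRoutedByOwnId);
    }
}
```

When the Book is routed by the Author ID, a collocated client that runs next to the Author finds its Books in the same partition and needs no remote calls.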
In this example there is a One-to-One relationship between the Author and the Book entities: an Author may have one Book.
Users may search for:
- All Book titles written by an Author with a specific last name (there may be multiple matching Authors).
- An Author with a specific Book.
{% comment %}
- All Authors that published Books after a certain date. This may result multiple result.
- All Books that start with the word "peace" regardless their Author last name. {% endcomment %}
When using JDBC to query for all the Books written by an Author with a specific last name, your SQL query would look like this:
{% highlight java %}
select Book.id, Author.id, Author.lastName
from Book, Author
WHERE Author.lastName='AuthorX' AND Book.authorId = Author.id
{% endhighlight %}
The main problem with this approach is the execution time: the more Books or Authors you have, the longer the query takes to execute. Using the Space API with the embedded or non-embedded model, with the queried properties indexed, can provide much better performance that is far less sensitive to the number of Books or Authors.
Let's compare the JDBC approach to the embedded and non-embedded model:
With the embedded model the root Space object is the Author. It has a Book object embedded. The representation of these Entities looks like this:
{% inittab embedded|top %} {% tabcontent The Author Entity %}
{% highlight java %}
import com.gigaspaces.annotation.pojo.SpaceId;
import com.gigaspaces.annotation.pojo.SpaceIndex;

public class Author {
    Integer id;
    String lastName;
    Book book; // embedded: stored inside the Author entry

    @SpaceId
    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    @SpaceIndex
    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }

    // Index the title property of the embedded Book
    @SpaceIndex(path = "title")
    public Book getBook() {
        return book;
    }

    public void setBook(Book book) {
        this.book = book;
    }
}
{% endhighlight %}
{% endtabcontent %}
{% tabcontent The Embedded Book Entity %}
{% highlight java %}
import java.io.Serializable;

// Embedded object: not a space class of its own, so it only needs to be Serializable
public class Book implements Serializable {
    Integer id;
    String title;

    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }
}
{% endhighlight %}
{% endtabcontent %} {% endinittab %}
{% tip %} Note how the Book title property is indexed within the Author class. {% endtip %}
To query for all the Books written by an Author with a specific last name your query code would look like this:
{% highlight java %}
// Find all Authors with the given last name and collect their embedded Books
SQLQuery<Author> query = new SQLQuery<Author>(Author.class, "lastName=?");
query.setParameter(1, "AuthorX");
Author[] authorsFound = space.readMultiple(query);
Set<Book> booksFound = new HashSet<Book>();
for (Author author : authorsFound) {
    booksFound.add(author.getBook());
}
return booksFound;
{% endhighlight %}
To query for an Author with a specific Book title the query would look like this:
{% highlight java %}
// Match on a root-level property and on a property of the embedded Book
SQLQuery<Author> query = new SQLQuery<Author>(Author.class, "lastName=? and book.title=?");
query.setParameter(1, "AuthorX");
query.setParameter(2, "BookX");
Author[] authorsFound = space.readMultiple(query);
return authorsFound;
{% endhighlight %}
With the non-embedded model, the Author and the Book look like this. Note how the ID of the Book, rather than the Book object itself, is stored within the Author. The Book is stored as a separate Space object:
{% inittab non-Embedded %} {% tabcontent The Author Entity %}
{% highlight java %}
import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;
import com.gigaspaces.annotation.pojo.SpaceIndex;

@SpaceClass
public class Author {
    Integer id;
    String lastName;
    Integer bookId; // one-to-one: the Author stores the ID of its single Book

    @SpaceId(autoGenerate=false)
    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    @SpaceIndex
    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }

    public Integer getBookId() {
        return bookId;
    }

    public void setBookId(Integer bookId) {
        this.bookId = bookId;
    }
}
{% endhighlight %}
{% endtabcontent %} {% tabcontent The Book Entity %}
{% highlight java %}
import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;
import com.gigaspaces.annotation.pojo.SpaceIndex;

@SpaceClass
public class Book {
    Integer id;
    Integer authorId; // the Book stores the ID of its Author
    String title;

    @SpaceId(autoGenerate=false)
    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    @SpaceIndex
    public Integer getAuthorId() {
        return authorId;
    }

    public void setAuthorId(Integer authorId) {
        this.authorId = authorId;
    }

    @SpaceIndex
    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }
}
{% endhighlight %}
{% endtabcontent %} {% endinittab %}
To query for all the Books written by an Author with a specific last name, your query code would look like this. Note how `readById` is used:
{% highlight java %}
SQLQuery<Author> query = new SQLQuery<Author>(Author.class, "lastName=?");
query.setParameter(1, "AuthorX");
Author[] authors = space.readMultiple(query);
List<Book> booksFound = new ArrayList<Book>();

// Read each Author's Book via its ID
for (Author author : authors) {
    Book book = space.readById(Book.class, author.getBookId());
    booksFound.add(book);
}
return booksFound;
{% endhighlight %}
{% tip %}
See the Id Queries page for more details on how `readById` can be used.
{% endtip %}
To query for a specific Author with a specific Book title the query code would look like this:
{% highlight java %}
// Find the Books with the given title and collect their Author IDs
SQLQuery<Book> bookQuery = new SQLQuery<Book>(Book.class, "title=?");
bookQuery.setParameter(1, "BookX");
Book[] booksFound = space.readMultiple(bookQuery);
if (booksFound.length == 0) {
    return new Author[0]; // no matching Book, so no matching Author
}
StringBuilder authorIdsForTitle = new StringBuilder();
for (int j = 0; j < booksFound.length; j++) {
    if (j > 0) {
        authorIdsForTitle.append(",");
    }
    authorIdsForTitle.append(booksFound[j].getAuthorId());
}

// Find the Authors with the given last name among those IDs
SQLQuery<Author> query = new SQLQuery<Author>(Author.class,
        "lastName=? AND id IN (" + authorIdsForTitle + ")");
query.setParameter(1, "AuthorX");
Author[] authorsFound = space.readMultiple(query);
return authorsFound;
{% endhighlight %}
In this example there is a One-to-Many relationship between the Author and the Book entities: an Author may write many Books.
Users may search for:
- All Book titles written by an Author with a specific last name (there may be multiple matching Authors).
- An Author with a specific Book.
{% comment %}
- All Authors that published Books after a certain date. This may result multiple result.
- All Books that start with the word "peace" regardless their Author last name. {% endcomment %}
When using JDBC to query for all the Books written by an Author with a specific last name, your query code would look like this:
{% highlight java %}
select Book.id, Author.id, Author.lastName
from Book, Author
WHERE Author.lastName='AuthorX' AND Book.authorId = Author.id
{% endhighlight %}
The main problem with this approach is the execution time: the more Books or Authors you have, the longer the query takes to execute. Using the Space API with the embedded or non-embedded model, with the queried properties indexed, can provide much better performance that is far less sensitive to the number of Books or Authors.
Let's compare the JDBC approach to the embedded and non-embedded model:
With the embedded model the root Space object is the Author. It has a Book collection embedded. The representation of these Entities looks like this:
{% inittab embedded|top %} {% tabcontent The Author Entity %}
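{% highlight java %}
import java.util.List;

import com.gigaspaces.annotation.pojo.SpaceId;
import com.gigaspaces.annotation.pojo.SpaceIndex;

public class Author {
    Integer id;
    String lastName;
    List<Book> books; // embedded: stored inside the Author entry

    @SpaceId
    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    @SpaceIndex
    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }

    // Index the title property of every Book in the embedded collection
    @SpaceIndex(path = "[*].title")
    public List<Book> getBooks() {
        return books;
    }

    public void setBooks(List<Book> books) {
        this.books = books;
    }
}
{% endhighlight %}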
{% endtabcontent %}
{% tabcontent The Embedded Book Entity %}
{% highlight java %}
import java.io.Serializable;

// Embedded object: not a space class of its own, so it only needs to be Serializable
public class Book implements Serializable {
    Integer id;
    String title;

    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }
}
{% endhighlight %}
{% endtabcontent %} {% endinittab %}
{% tip %} Note how the Book title property is indexed within the Author class. {% endtip %}
To query for all the Books written by an Author with a specific last name your query code would look like this:
{% highlight java %}
// Find all Authors with the given last name and collect their embedded Books
Set<Book> booksFound = new HashSet<Book>();
SQLQuery<Author> query = new SQLQuery<Author>(Author.class, "lastName=?");
query.setParameter(1, "AuthorX");
Author[] authorsFound = space.readMultiple(query);
for (Author author : authorsFound) {
    booksFound.addAll(author.getBooks());
}
return booksFound;
{% endhighlight %}
To query for an Author with a specific Book title the query would look like this:
{% highlight java %}
// Match on a root-level property and on a property of any Book in the embedded collection
SQLQuery<Author> query = new SQLQuery<Author>(Author.class, "lastName=? and books[*].title=?");
query.setParameter(1, "AuthorX");
query.setParameter(2, "BookY");
Author[] authorsFound = space.readMultiple(query);
return authorsFound;
{% endhighlight %}
With the non-embedded model, the Author and the Book look like this. Note how the IDs of the Books, rather than the Books themselves, are stored within the Author object. The Books are stored as separate Space objects:
{% inittab non-Embedded %} {% tabcontent The Author Entity %}
{% highlight java %}
import java.util.List;

import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;
import com.gigaspaces.annotation.pojo.SpaceIndex;

@SpaceClass
public class Author {
    Integer id;
    String lastName;
    List<Integer> bookIds; // one-to-many: the Author stores the IDs of its Books

    @SpaceId(autoGenerate=false)
    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    @SpaceIndex
    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }

    public List<Integer> getBookIds() {
        return bookIds;
    }

    public void setBookIds(List<Integer> bookIds) {
        this.bookIds = bookIds;
    }
}
{% endhighlight %}
{% endtabcontent %} {% tabcontent The Book Entity %}
{% highlight java %}
import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;
import com.gigaspaces.annotation.pojo.SpaceIndex;

@SpaceClass
public class Book {
    Integer id;
    Integer authorId; // the Book stores the ID of its Author
    String title;

    @SpaceId(autoGenerate=false)
    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    @SpaceIndex
    public Integer getAuthorId() {
        return authorId;
    }

    public void setAuthorId(Integer authorId) {
        this.authorId = authorId;
    }

    @SpaceIndex
    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }
}
{% endhighlight %}
{% endtabcontent %} {% endinittab %}
To query for all the Books written by an Author with a specific last name, your query code would look like this. Note how `readByIds` is used:
{% highlight java %}
SQLQuery<Author> query = new SQLQuery<Author>(Author.class, "lastName=?");
query.setParameter(1, "AuthorX");
Author[] authors = space.readMultiple(query);
List<Book> booksFound = new ArrayList<Book>();

// Read all of each Author's Books via their IDs
for (Author author : authors) {
    Integer[] ids = author.getBookIds().toArray(new Integer[0]);
    Iterator<Book> bookIter = space.readByIds(Book.class, ids).iterator();
    while (bookIter.hasNext()) {
        booksFound.add(bookIter.next());
    }
}
return booksFound;
{% endhighlight %}
{% tip %}
See the Id Queries page for more details on how `readByIds` can be used.
{% endtip %}
To query for a specific Author with a specific Book title the query would look like this:
{% highlight java %}
// Find the Books with the given title and collect their Author IDs
SQLQuery<Book> bookQuery = new SQLQuery<Book>(Book.class, "title=?");
bookQuery.setParameter(1, "BookX");
Book[] booksFound = space.readMultiple(bookQuery);
if (booksFound.length == 0) {
    return new Author[0]; // no matching Book, so no matching Author
}
StringBuilder authorIdsForTitle = new StringBuilder();
for (int j = 0; j < booksFound.length; j++) {
    if (j > 0) {
        authorIdsForTitle.append(",");
    }
    authorIdsForTitle.append(booksFound[j].getAuthorId());
}

// Find the Authors with the given last name among those IDs
SQLQuery<Author> query = new SQLQuery<Author>(Author.class,
        "lastName=? AND id IN (" + authorIdsForTitle + ")");
query.setParameter(1, "AuthorX");
Author[] authorsFound = space.readMultiple(query);
return authorsFound;
{% endhighlight %}
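The step that turns the Author IDs collected from the matching Books into the `id IN (...)` expression can be isolated into a small helper. The following is a minimal sketch using only the JDK; `AuthorIdFilter` and `buildExpression` are hypothetical names, not part of the Space API. It also covers two edge cases: several Books pointing at the same Author, and no matching Books at all (an empty `IN ()` list is generally not valid SQL).

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Space API.
class AuthorIdFilter {

    /**
     * Builds the Author query expression for the given author IDs.
     * Returns null when there are no IDs, so the caller can skip the second query.
     */
    static String buildExpression(Collection<Integer> authorIds) {
        // De-duplicate while keeping order: several Books may share one Author.
        Set<Integer> uniqueIds = new LinkedHashSet<Integer>(authorIds);
        if (uniqueIds.isEmpty()) {
            return null;
        }
        StringBuilder inList = new StringBuilder();
        for (Integer id : uniqueIds) {
            if (inList.length() > 0) {
                inList.append(",");
            }
            inList.append(id);
        }
        return "lastName=? AND id IN (" + inList + ")";
    }
}
```

The returned string can be passed as the expression of the Author query, with the last name still supplied as a parameter rather than concatenated into the text.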
{% tip %} More examples: see the SQLQuery section for details about querying and indexing embedded entities. See the Parent Child Relationship page for an example of non-embedded relationships. {% endtip %}
In the Pet Clinic application, which is based on the Spring Pet Clinic sample, a Pet is only associated with an Owner. We can therefore store each Pet with its Owner on the same partition. We can even embed the Pet object within the physical Owner entry. However, if a Pet were associated with a Vet as well, we certainly could not embed the Pet in the Vet physical entry (without duplicating each Pet entry), and might not even be able to store the Pet and its Vet in the same partition.
- Embed when an entity is meaningful only within the context of its containing object. For example, in the Pet Clinic application, a Pet has meaning only when it has an Owner. A Pet by itself is meaningless without an Owner in this specific application. There is no business scenario for transferring a Pet from owner to owner or admitting a Pet to a Vet without the owner.
- Embedding may sometimes mean duplicating your data. For example, if you want to reference a certain Visit from both the Pet and Vet class, you'll need to have duplicate Visit entries. So let's look into duplication:
- Duplication means preferring scalability over footprint - the reason to duplicate is to avoid cluster wide transactions and in many cases it's the only way to partition your object in a scalable manner.
- Duplication means higher memory consumption: while memory is considered a low-cost commodity today, duplication still has a price, since you might have two space objects that contain the same data.
- Duplication means more lenient consistency. When you add a Visit to a Pet and Vet for example, you need to update them both. You can do it in one (potentially distributed) transaction, or in two separate transactions, which will scale better but be less consistent. This may be sufficient for many types of applications (e.g. social networks), where losing a post, although undesired, does not incur significant damage. In contrast, this is not feasible for financial applications where every operation should be accounted for.