Lecture material adapted from 7DB-HBase-day1
See hand written notes from class
To get started, make sure you are in env/hbase
. We will use the commands that
I prepared that are in env/hbase/bin
to work with a standalone version of
Hbase.
Start hbase by running
./bin/hbase-start
And we will start a hbase shell with
./bin/hbase-shell
The first thing that we are going to do is create a table, called wiki
, with
one column family called text
create 'wiki', 'text'
We saw earlier that each row is a map of maps of byte arrays, so, lets add some data and then query it. Adding data has the form
put <tableName>, <key>, <columnFamily>:<column>, <data>
For example,
put 'wiki', 'Home', 'text:', 'Welcome to the wiki!'
Question What is the column family name?
Next lets get the homepage
get 'wiki', 'Home', 'text:'
Let's add a few more rows using the schema that we decide
put 'wiki', 'Page1', 'text:', 'This is page 1!'
put 'wiki', 'Page2', 'text:', 'This is page 2!'
put 'wiki', 'Page3', 'text:', 'This is page 3!'
put 'wiki', 'Page4', 'text:', 'This is page 4!'
We can scan over the contents of the table as follows
scan 'wiki'
This is okay for playing around and while development, but be careful about using scans when developing. As you could expect, scans are linear time operations and linear time operations are costly over GBs of data.
Let's explore history a bit, lets make some changes to Page4
put 'wiki', 'Page4', 'text:', 'This is page 4 version a!'
put 'wiki', 'Page4', 'text:', 'This is page 4 version b!'
put 'wiki', 'Page4', 'text:', 'This is page 4 version c!'
We can inspect the current version.
get 'wiki', 'Page4', 'text'
And if all went well, we will see
COLUMN CELL
text: timestamp=1568167309595, value=This is page 4 version c!
1 row(s) in 0.0100 seconds
But what about the history? Let's enable it! To do so, we will use the alter
command, but
to alter the schema, we will need to take the table off line.
disable 'wiki'
alter 'wiki', { NAME => 'text', VERSIONS => org.apache.hadoop.hbase.HConstants::ALL_VERSIONS }
enable 'wiki'
Now, let's try making more changes
put 'wiki', 'Page4', 'text:', 'This is page 4 version d!'
put 'wiki', 'Page4', 'text:', 'This is page 4 version e!'
put 'wiki', 'Page4', 'text:', 'This is page 4 version f!'
And we can explore the history by telling how many versions we would like:
get 'wiki', 'Page4', {COLUMN => 'text:', VERSIONS => 3}
COLUMN CELL
text: timestamp=1568168115717, value=This is page 4 version f!
text: timestamp=1568168115688, value=This is page 4 version e!
text: timestamp=1568168115653, value=This is page 4 version d!
We can also use time ranges (specified in milliseconds since the epoch).
get 'wiki', 'Page4', {TIMERANGE => [1568100000000, 1568168115688], VERSIONS=>4 }
COLUMN CELL
text: timestamp=1568168115653, value=This is page 4 version d!
text: timestamp=1568167951506, value=This is page 4 version c!
2 row(s) in 0.0100 seconds
Let's use everything that we have learned to do something a little more interesting. Lets create a wiki with the following requirements:
- page title is unique
- page can have multiple revisions
- each revision is identified by a timestamp
- each revision contains text
- each revision has an author
- each revision contains optional commit comment
The requirements naturally break down as follows:
- title is key
- two column families:
- text with default column
""
- revision with columns
author
andcomment
- text with default column
Let's update the schema. As before, we will use the alter
command to add the
revision column family.
disable 'wiki'
alter 'wiki', { NAME => 'revision', VERSIONS => org.apache.hadoop.hbase.HConstants::ALL_VERSIONS }
enable 'wiki'
Note that we do not specify the columns, it is up the application layer to maintain the schema.
Great, now, let's make some edits:
put 'wiki', 'Page1', 'text:', 'This is page 1a!'
put 'wiki', 'Page1', 'revision:author', 'Dave M'
put 'wiki', 'Page1', 'revision:commit', 'First edit from Dave'
And, check out the result!
get 'wiki', 'Page1'
COLUMN CELL
revision:author timestamp=1568169525393, value=Dave M
revision:commit timestamp=1568169528613, value=First edit from Dave
text: timestamp=1568169522777, value=This is page 1a!
3 row(s) in 0.0160 seconds
Question What is the problem here?
Well, since we performed the operation in three separate commands, we have different time stamps, this is not ideal. Well, the goods news is that we have been programming in ruby the entire time (via jruby). Let's take advantage of that and build up a little program.
Use your text editor to open a file home/put_multiple_columns.rb
and add the
following code (I will talk though this in class).
include Java
import 'org.apache.hadoop.hbase.client.HTable'
import 'org.apache.hadoop.hbase.client.Put'
import 'org.apache.hadoop.hbase.HBaseConfiguration'
def jbytes(*args)
args.map {|arg| arg.to_s.to_java_bytes}
end
table = HTable.new(HBaseConfiguration.create, "wiki")
p = Put.new(*jbytes("Home"))
p.add(*jbytes("text", "", "This is page 1b!"))
p.add(*jbytes("revision", "author", "Dave M"))
p.add(*jbytes("revision", "commit", "Second edit from Dave"))
table.put(p)
Start bash in the hbase container
> ./bin/hbase-base
And from the bash shell, run the script with:
hbase shell /mnt/put_multiple_columns.rb
Hop back over to our hbase shell and check out the time stamps on our home page. The timestamps are all the same!
get 'wiki', 'Home'
COLUMN CELL
revision:author timestamp=1633356965268, value=Dave M
revision:commit timestamp=1633356965268, value=Second edit from Dave
text: timestamp=1633356965268, value=This is page 1b!
3 row(s) in 0.0140 seconds
To get rid of the table, just drop it
disable 'wiki'
drop 'wiki'
We will spend some time together working though 7DB-HBase Day 2 to finish get you ready for an HBase assignment! But feel free to start before Fri!