Skip to content

Latest commit

 

History

History
213 lines (137 loc) · 6.6 KB

2021-10-04.md

File metadata and controls

213 lines (137 loc) · 6.6 KB

Hbase

Lecture material adapted from 7DB-HBase-day1

Data model

See hand written notes from class

Some hands on

To get started, make sure you are in env/hbase. We will use the commands that I prepared that are in env/hbase/bin to work with a standalone version of Hbase.

Start hbase by running

./bin/hbase-start

And we will start a hbase shell with

./bin/hbase-shell

Getting going

The first thing that we are going to do is create a table, called wiki, with one column family called text

create 'wiki', 'text'

We saw earlier that each row is a map of maps of byte arrays, so, lets add some data and then query it. Adding data has the form

put <tableName>, <key>, <columnFamily>:<column>, <data>

For example,

put 'wiki', 'Home', 'text:', 'Welcome to the wiki!'

Question What is the column family name?

Next lets get the homepage

get 'wiki', 'Home', 'text:'

Let's add a few more rows using the schema that we decide

put 'wiki', 'Page1', 'text:', 'This is page 1!'
put 'wiki', 'Page2', 'text:', 'This is page 2!'
put 'wiki', 'Page3', 'text:', 'This is page 3!'
put 'wiki', 'Page4', 'text:', 'This is page 4!'

We can scan over the contents of the table as follows

scan 'wiki'

This is okay for playing around and while development, but be careful about using scans when developing. As you could expect, scans are linear time operations and linear time operations are costly over GBs of data.

History

Let's explore history a bit, lets make some changes to Page4

put 'wiki', 'Page4', 'text:', 'This is page 4 version a!'
put 'wiki', 'Page4', 'text:', 'This is page 4 version b!'
put 'wiki', 'Page4', 'text:', 'This is page 4 version c!'

We can inspect the current version.

get 'wiki', 'Page4', 'text'

And if all went well, we will see

COLUMN                             CELL
 text:                             timestamp=1568167309595, value=This is page 4 version c!
1 row(s) in 0.0100 seconds

But what about the history? Let's enable it! To do so, we will use the alter command, but to alter the schema, we will need to take the table off line.

disable 'wiki'
alter 'wiki', { NAME => 'text', VERSIONS => org.apache.hadoop.hbase.HConstants::ALL_VERSIONS }
enable 'wiki'

Now, let's try making more changes

put 'wiki', 'Page4', 'text:', 'This is page 4 version d!'
put 'wiki', 'Page4', 'text:', 'This is page 4 version e!'
put 'wiki', 'Page4', 'text:', 'This is page 4 version f!'

And we can explore the history by telling how many versions we would like:

get 'wiki', 'Page4', {COLUMN => 'text:', VERSIONS => 3}
COLUMN                             CELL
 text:                             timestamp=1568168115717, value=This is page 4 version f!
 text:                             timestamp=1568168115688, value=This is page 4 version e!
 text:                             timestamp=1568168115653, value=This is page 4 version d!

We can also use time ranges (specified in milliseconds since the epoch).

get 'wiki', 'Page4',  {TIMERANGE => [1568100000000, 1568168115688], VERSIONS=>4 }
COLUMN                             CELL
 text:                             timestamp=1568168115653, value=This is page 4 version d!
 text:                             timestamp=1568167951506, value=This is page 4 version c!
2 row(s) in 0.0100 seconds

Building a wiki

Let's use everything that we have learned to do something a little more interesting. Lets create a wiki with the following requirements:

  1. page title is unique
  2. page can have multiple revisions
  3. each revision is identified by a timestamp
  4. each revision contains text
  5. each revision has an author
  6. each revision contains optional commit comment

The requirements naturally break down as follows:

  • title is key
  • two column families:
    • text with default column ""
    • revision with columns author and comment

Let's update the schema. As before, we will use the alter command to add the revision column family.

disable 'wiki'
alter 'wiki', { NAME => 'revision', VERSIONS => org.apache.hadoop.hbase.HConstants::ALL_VERSIONS }
enable 'wiki'

Note that we do not specify the columns, it is up the application layer to maintain the schema.

Great, now, let's make some edits:

put 'wiki', 'Page1', 'text:', 'This is page 1a!'
put 'wiki', 'Page1', 'revision:author', 'Dave M'
put 'wiki', 'Page1', 'revision:commit', 'First edit from Dave'

And, check out the result!

get 'wiki', 'Page1'
COLUMN                             CELL
 revision:author                   timestamp=1568169525393, value=Dave M
 revision:commit                   timestamp=1568169528613, value=First edit from Dave
 text:                             timestamp=1568169522777, value=This is page 1a!
3 row(s) in 0.0160 seconds

Question What is the problem here?

Well, since we performed the operation in three separate commands, we have different time stamps, this is not ideal. Well, the goods news is that we have been programming in ruby the entire time (via jruby). Let's take advantage of that and build up a little program.

Use your text editor to open a file home/put_multiple_columns.rb and add the following code (I will talk though this in class).

include Java
import 'org.apache.hadoop.hbase.client.HTable'
import 'org.apache.hadoop.hbase.client.Put'
import 'org.apache.hadoop.hbase.HBaseConfiguration'


def jbytes(*args)
  args.map {|arg| arg.to_s.to_java_bytes}
end

table = HTable.new(HBaseConfiguration.create, "wiki")

p = Put.new(*jbytes("Home"))

p.add(*jbytes("text", "", "This is page 1b!"))
p.add(*jbytes("revision", "author", "Dave M"))
p.add(*jbytes("revision", "commit", "Second edit from Dave"))

table.put(p)

Start bash in the hbase container

> ./bin/hbase-base

And from the bash shell, run the script with:

hbase shell /mnt/put_multiple_columns.rb

Hop back over to our hbase shell and check out the time stamps on our home page. The timestamps are all the same!

get 'wiki', 'Home'
COLUMN                    CELL
 revision:author          timestamp=1633356965268, value=Dave M
 revision:commit          timestamp=1633356965268, value=Second edit from Dave
 text:                    timestamp=1633356965268, value=This is page 1b!
3 row(s) in 0.0140 seconds

Cleanup

To get rid of the table, just drop it

disable 'wiki'
drop 'wiki'

Next time

We will spend some time together working though 7DB-HBase Day 2 to finish get you ready for an HBase assignment! But feel free to start before Fri!