Lecture material adapted from 7DB-HBase-day2
Like last time, we will get our systems running. Make sure you are in env/hbase. We will use the commands that I prepared in env/hbase/bin to work with a standalone version of HBase.
Start HBase by running
./bin/hbase-start
And we will start an HBase shell with
./bin/hbase-shell
The first thing we are going to do is create a table called wiki, with one column family called text:
create 'wiki', 'text'
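If you want to double-check the result, list and describe will show the new table and its column family (output omitted here):
list
describe 'wiki'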
Let's use everything that we have learned to do something a little more interesting. Let's create a wiki with the following requirements:
- page title is unique
- page can have multiple revisions
- each revision is identified by a timestamp
- each revision contains text
- each revision has an author
- each revision contains optional commit comment
The requirements naturally break down as follows:
- title is the key
- two column families:
  - text with a default ("") column
  - revision with columns author and comment
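To make this concrete, here is a purely illustrative sketch, written as a Ruby hash rather than real HBase API calls, of how one logical row would look under the planned schema (the page title and values are made up):
# Illustrative only: one logical wiki row. The row key is the page title;
# individual revisions are distinguished by cell timestamps, not shown here.
{
  "Home" => {
    "text"     => { ""        => "Welcome to the wiki!" },
    "revision" => { "author"  => "Dave M",
                    "comment" => "first edit" }
  }
}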
Let's update the schema. As before, we will use the alter command to add the revision column family.
disable 'wiki'
alter 'wiki', { NAME => 'revision', VERSIONS => org.apache.hadoop.hbase.HConstants::ALL_VERSIONS }
enable 'wiki'
Note that we do not specify the columns; it is up to the application layer to maintain the schema.
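To see what that means in practice, any qualifier inside an existing family can be written without altering the table. The row key and qualifier below are made up purely as an illustration:
put 'wiki', 'SomePage', 'revision:reviewer', 'Another User'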
Great, now, let's make some edits:
put 'wiki', 'Page1', 'text:', 'This is page 1a!'
put 'wiki', 'Page1', 'revision:author', 'Dave M'
put 'wiki', 'Page1', 'revision:commit', 'First edit from Dave'
And, check out the result!
get 'wiki', 'Page1'
COLUMN CELL
revision:author timestamp=1568169525393, value=Dave M
revision:commit timestamp=1568169528613, value=First edit from Dave
text: timestamp=1568169522777, value=This is page 1a!
3 row(s) in 0.0160 seconds
Question: What is the problem here?
Well, since we performed the operation in three separate commands, we have three different timestamps, which is not ideal. The good news is that we have been programming in Ruby the entire time (via JRuby). Let's take advantage of that and build up a little program.
Use your text editor to open a file home/put_multiple_columns.rb and add the following code (I will talk through this in class).
include Java
import 'org.apache.hadoop.hbase.client.HTable'
import 'org.apache.hadoop.hbase.client.Put'
import 'org.apache.hadoop.hbase.HBaseConfiguration'
# convert each argument to a Java byte array, as the HBase client API expects
def jbytes(*args)
  args.map { |arg| arg.to_s.to_java_bytes }
end
table = HTable.new(HBaseConfiguration.create, "wiki")
p = Put.new(*jbytes("Home"))
p.add(*jbytes("text", "", "This is page 1b!"))
p.add(*jbytes("revision", "author", "Dave M"))
p.add(*jbytes("revision", "commit", "Second edit from Dave"))
table.put(p)
And run the command
./bin/hbase-shell /mnt/put_multiple_columns.rb
Hop back over to the shell
./bin/hbase-shell
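Before cleaning up, it is worth verifying that the three cells written by the script now share a single timestamp (unlike the three separate puts for Page1):
get 'wiki', 'Home'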
To get rid of the table, just take it offline and drop it
disable 'wiki'
drop 'wiki'
In this section we will import an XML stream into HBase. An example of the XML:
<page>
  <title>Anarchism</title>
  <id>12</id>
  <revision>
    <id>408067712</id>
    <timestamp>2011-01-15T19:28:25Z</timestamp>
    <contributor>
      <username>RepublicanJacobite</username>
      <id>5223685</id>
    </contributor>
    <comment>Undid revision 408057615 by [[Special:Contributions...</comment>
    <text xml:space="preserve">{{Redirect|Anarchist|the fictional character|
    ... [[bat-smg:Anarkėzmos]]
    </text>
  </revision>
</page>
In general, a basic XML parser has the following form:
import 'javax.xml.stream.XMLStreamConstants'
factory = javax.xml.stream.XMLInputFactory.newInstance
reader = factory.createXMLStreamReader(java.lang.System.in)
while reader.has_next
  type = reader.next
  if type == XMLStreamConstants::START_ELEMENT
    tag = reader.local_name
    # do something with tag
  elsif type == XMLStreamConstants::CHARACTERS
    text = reader.text
    # do something with text
  elsif type == XMLStreamConstants::END_ELEMENT
    # same as START_ELEMENT
  end
end
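Since the skeleton reads from java.lang.System.in, you can exercise it by piping a bit of XML into it. The file name below is just a placeholder for wherever you save the skeleton, and this assumes the ./bin/hbase-shell wrapper passes stdin through; if it does not, run the same pipeline from a container shell with plain hbase shell, as we do for the full import later:
echo '<page><title>Test</title></page>' | ./bin/hbase-shell /mnt/xml_skeleton.rb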
Using the above "bones" of a parser, we can build out a tool that imports data from Wikipedia. You can store the file in home/import_from_wikipedia.rb
include Java
require 'time'
import 'javax.xml.stream.XMLStreamConstants'
import 'org.apache.hadoop.hbase.client.HTable'
import 'org.apache.hadoop.hbase.client.Put'
def jbytes(*args)
  args.map { |arg| arg.to_s.to_java_bytes }
end
factory = javax.xml.stream.XMLInputFactory.newInstance
reader = factory.createXMLStreamReader(java.lang.System.in)
document = nil
buffer = nil
count = 0
table = HTable.new(@hbase.configuration, 'wiki')
table.setAutoFlush(false)
while reader.has_next
  case reader.next
  when XMLStreamConstants::START_ELEMENT
    case reader.local_name
    when 'page' then document = {}
    when /title|timestamp|username|comment|text/ then buffer = []
    end
  when XMLStreamConstants::CHARACTERS
    buffer << reader.text unless buffer.nil?
  when XMLStreamConstants::END_ELEMENT
    case reader.local_name
    when /title|timestamp|username|comment|text/
      document[reader.local_name] = buffer.join
    when 'revision'
      key = document['title'].to_java_bytes
      ts = (Time.parse document['timestamp']).to_i
      p = Put.new(key, ts)
      p.add(*jbytes("text", "", document['text']))
      p.add(*jbytes("revision", "author", document['username']))
      p.add(*jbytes("revision", "comment", document['comment']))
      table.put(p)
      count += 1
      table.flushCommits() if count % 10 == 0
      if count % 500 == 0
        puts "#{count} records inserted (#{document['title']})"
      end
    end
  end
end
table.flushCommits()
exit
Things to note:
- We are building up content in a hash called document, and at some point we write that document to the database.
- Observe the use of flushCommits. By default the client sends each put to HBase as soon as it is made; here we call setAutoFlush(false) so puts accumulate in a client-side write buffer, and we decide when to flush them to the servers.
Before running, let's add some compression. Start the HBase shell and alter the table (as we did before) to enable compression. (We will use gzip, but in general LZO is probably a better bet; however, LZO is a bit more challenging to set up.) We will also enable bloom filters. We will study these in depth later in the semester, but for now it is enough to think of a bloom filter as a probabilistic data structure for testing set membership: if it says an element is not in the set, that is always correct, but it may sometimes claim an element is in the set when it is not.
hbase> alter 'wiki', {NAME=>'text', COMPRESSION=>'GZ', BLOOMFILTER=>'ROW'}
Depending on how you got here, you may have removed the table. If you did, no worries, just create it again with the compression and a bloom filter:
hbase> create 'wiki', {NAME=>'text', COMPRESSION=>'GZ', BLOOMFILTER=>'ROW'}, { NAME => 'revision', VERSIONS => org.apache.hadoop.hbase.HConstants::ALL_VERSIONS }
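As an aside, if bloom filters are new to you, here is a toy Ruby sketch of the idea; this is not how HBase implements them, just the concept of hashing items into a bit array, which gives possible false positives but never false negatives:
require 'digest'

# Toy bloom filter: a fixed-size bit array plus k hash functions.
class ToyBloom
  def initialize(bits = 1024, hashes = 3)
    @bits, @hashes = bits, hashes
    @bitmap = Array.new(bits, false)
  end

  # Derive k bit positions for an item from k salted hashes.
  def indexes(item)
    (1..@hashes).map { |i| Digest::MD5.hexdigest("#{i}:#{item}").to_i(16) % @bits }
  end

  def add(item)
    indexes(item).each { |i| @bitmap[i] = true }
  end

  def maybe_include?(item)
    indexes(item).all? { |i| @bitmap[i] }  # false => definitely absent; true => probably present
  end
end

bf = ToyBloom.new
bf.add('Anarchism')
bf.maybe_include?('Anarchism')  # => true
bf.maybe_include?('Aardvark')   # => almost certainly false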
Now, let's run the script. You will need to run it from a bash shell inside the container (./bin/hbase-bash):
$ curl https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 | \
bzcat | \
hbase shell /mnt/import_from_wikipedia.rb
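If you only want a quick test rather than the whole multi-gigabyte dump, you can cap the decompressed stream; the import will likely end with a parser error when the stream is cut off, but everything parsed up to that point (minus any puts not yet flushed) will be in the table. The 100 MB cutoff below is arbitrary:
$ curl https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 | \
bzcat | \
head -c 100000000 | \
hbase shell /mnt/import_from_wikipedia.rb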
And if all works out, data should be downloading. Now, let's poke around in HBase's file system! Let's start a bash shell in the container (in a new terminal, since the import is still running in the first one):
$ ./bin/hbase-bash
Check out the contents of hbase-1.2.6/conf/hbase-site.xml; we want to find out where the underlying HBase files live:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/testuser/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/testuser/zookeeper</value>
  </property>
</configuration>
We find this info by looking up hbase.rootdir and see that it is /home/testuser/hbase. Let's go check it out:
$ cd /home/testuser/hbase/
In-class discussion and things to check out:
- Run du -h on the contents of data/default as we scrape data (see the example commands after this list). Observe the file structure and how it relates to our tables, and how the folder names change as we load data; the folder names relate to the regions and region servers.
- Check out the folders in the root with "WAL" in their names; these are the files for storing the Write-Ahead Logs. Check if learners know about journaling; if not, future lecture topic.
- Run
scan 'hbase:meta', { COLUMNS => ['info:server', 'info:regioninfo'] }
and observe how the output, key names, ENCODED values, and timestamp values relate to the data in data/hbase/namespace.
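For example, from the root directory (the exact layout varies a bit between HBase versions, but in our standalone 1.2.x setup it should look roughly like this):
$ du -h data/default   # one directory per user table, then per region
$ ls WALs              # one directory per region server, holding its Write-Ahead Logs
$ ls data/hbase        # system tables such as hbase:meta and hbase:namespace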
Next, let's extract a graph from the data we have collected so far. Stop the import script with Ctrl-C, hop over to hbase-shell, and create a links table:
hbase> create 'links', {
NAME => 'to', VERSIONS => 1, BLOOMFILTER => 'ROWCOL'
},{
NAME => 'from', VERSIONS => 1, BLOOMFILTER => 'ROWCOL'
}
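The scanner script we are about to write pulls links out of the article text with a regular expression for wiki link markup ([[Target]] or [[Target|label]]). Here is a quick, standalone illustration of what that pattern captures, using a made-up snippet of wikitext (plain Ruby, safe to try anywhere):
linkpattern = /\[\[([^\[\]\|\:\#][^\[\]\|:]*)(?:\|([^\[\]\|]+))?\]\]/
"See [[Anarchism|anarchist]] thinkers and [[Politics]].".scan(linkpattern)
# => [["Anarchism", "anarchist"], ["Politics", nil]]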
Now create a scanner to extract the links. As with the earlier scripts, save the following as home/generate_wiki_links.rb:
include Java
import 'org.apache.hadoop.hbase.client.HTable'
import 'org.apache.hadoop.hbase.client.Put'
import 'org.apache.hadoop.hbase.client.Scan'
import 'org.apache.hadoop.hbase.util.Bytes'
def jbytes(*args)
  return args.map { |arg| arg.to_s.to_java_bytes }
end
wiki_table = HTable.new(@hbase.configuration, 'wiki')
links_table = HTable.new(@hbase.configuration, 'links')
links_table.setAutoFlush(false)
scanner = wiki_table.getScanner(Scan.new)
linkpattern = /\[\[([^\[\]\|\:\#][^\[\]\|:]*)(?:\|([^\[\]\|]+))?\]\]/
count = 0
while (result = scanner.next())
  title = Bytes.toString(result.getRow())
  text = Bytes.toString(result.getValue(*jbytes('text', '')))
  if text
    put_to = nil
    text.scan(linkpattern) do |target, label|
      unless put_to
        put_to = Put.new(*jbytes(title))
        put_to.setWriteToWAL(false)
      end
      target.strip!
      target.capitalize!
      label = '' unless label
      label.strip!
      put_to.add(*jbytes("to", target, label))
      put_from = Put.new(*jbytes(target))
      put_from.add(*jbytes("from", title, label))
      put_from.setWriteToWAL(false)
      links_table.put(put_from)
    end
    links_table.put(put_to) if put_to
    links_table.flushCommits()
  end
  count += 1
  puts "#{count} pages processed (#{title})" if count % 500 == 0
end
links_table.flushCommits()
exit
Now we can run the script
$ ./bin/hbase-shell /mnt/generate_wiki_links.rb
And we can start looking around
hbase> scan 'links', STARTROW => "Admiral Ackbar", ENDROW => "Admiral of"
hbase> get 'links', 'Addition'
hbase> count 'wiki'
Play around, check out the HBase docs for more, and enjoy! You are now ready to use HBase to do some data crunching on the assignment!