HBase

Lecture material adapted from 7DB-HBase-day2

Running Scripts

Like last time, we will get our systems running. Make sure you are in env/hbase. We will use the commands I prepared in env/hbase/bin to work with a standalone version of HBase.

Start HBase by running

./bin/hbase-start

And start an HBase shell with

./bin/hbase-shell

Getting going

The first thing we are going to do is create a table called wiki with one column family called text:

create 'wiki', 'text'
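
If you want to sanity-check the table before moving on, the shell's list and describe commands will show it:

list
describe 'wiki'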

Building a wiki

Let's use everything that we have learned to do something a little more interesting. Let's create a wiki with the following requirements:

  1. page title is unique
  2. page can have multiple revisions
  3. each revision is identified by a timestamp
  4. each revision contains text
  5. each revision has an author
  6. each revision contains an optional commit comment

The requirements naturally break down as follows:

  • title is key
  • two column families:
    • text with default column ""
    • revision with columns author and comment
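
Conceptually, each page then maps onto a single row, something like this (illustrative values only):

Row key: "Page1"
  text:             -> "This is page 1!"       (earlier text survives as older versions)
  revision:author   -> "Dave M"
  revision:comment  -> "First edit from Dave"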

Let's update the schema. As before, we will use the alter command to add the revision column family.

disable 'wiki'
alter 'wiki', { NAME => 'revision', VERSIONS => org.apache.hadoop.hbase.HConstants::ALL_VERSIONS }
enable 'wiki'

Note that we do not specify the columns; it is up to the application layer to maintain the schema.
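
For example, the application could later start writing to a column we never declared, as long as its column family exists (the reviewer column below is made up purely for illustration):

put 'wiki', 'Page1', 'revision:reviewer', 'Someone Else'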

Great, now, let's make some edits:

put 'wiki', 'Page1', 'text:', 'This is page 1a!'
put 'wiki', 'Page1', 'revision:author', 'Dave M'
put 'wiki', 'Page1', 'revision:commit', 'First edit from Dave'

And, check out the result!

get 'wiki', 'Page1'
COLUMN                             CELL
 revision:author                   timestamp=1568169525393, value=Dave M
 revision:commit                   timestamp=1568169528613, value=First edit from Dave
 text:                             timestamp=1568169522777, value=This is page 1a!
3 row(s) in 0.0160 seconds

Question What is the problem here?

Since we performed the operation as three separate commands, the three cells ended up with different timestamps, which is not ideal: they do not form one consistent revision. The good news is that we have been programming in Ruby the entire time (via JRuby). Let's take advantage of that and build up a little program.
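
You can see the consequence by asking for the row at one of the timestamps shown above (yours will differ): only one of the three cells comes back.

get 'wiki', 'Page1', { TIMESTAMP => 1568169525393 }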

Use your text editor to open a file home/put_multiple_columns.rb and add the following code (I will talk through this in class).

include Java
import 'org.apache.hadoop.hbase.client.HTable'
import 'org.apache.hadoop.hbase.client.Put'
import 'org.apache.hadoop.hbase.HBaseConfiguration'


# Convert each argument to a Java byte array, which the HBase client API expects.
def jbytes(*args)
  args.map { |arg| arg.to_s.to_java_bytes }
end

table = HTable.new(HBaseConfiguration.create, "wiki")

# A single Put holds all three cells, so they are committed together and share one timestamp.
p = Put.new(*jbytes("Home"))

p.add(*jbytes("text", "", "This is page 1b!"))
p.add(*jbytes("revision", "author", "Dave M"))
p.add(*jbytes("revision", "commit", "Second edit from Dave"))

table.put(p)

And run the command

./bin/hbase-shell /mnt/put_multiple_columns.rb
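
If everything worked, a get on the row from the shell should now show all three cells sharing a single timestamp (the exact value will differ on your system):

get 'wiki', 'Home'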

Cleanup

Hop back over to the shell

./bin/hbase-shell

To get rid of the table, just take it offline and drop it

disable 'wiki'
drop 'wiki'
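
As an aside, if you only want to empty a table but keep its schema, the shell also has a truncate command, which disables, drops, and recreates the table for you:

truncate 'wiki'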

Importing data (7DB - HBase - Day 2)

In this section we will import an XML stream into HBase. Here is an example of the XML we will be parsing:

<page>
  <title>Anarchism</title>
  <id>12</id>
  <revision>
    <id>408067712</id>
    <timestamp>2011-01-15T19:28:25Z</timestamp>
    <contributor>
      <username>RepublicanJacobite</username>
      <id>5223685</id>
    </contributor>
    <comment>Undid revision 408057615 by [[Special:Contributions...</comment>
    <text xml:space="preserve">{{Redirect|Anarchist|the fictional character|
... [[bat-smg:Anarkėzmos]]
    </text>
  </revision>
</page>

In general, a basic XML parser has the following form:

import 'javax.xml.stream.XMLStreamConstants'

factory = javax.xml.stream.XMLInputFactory.newInstance
reader = factory.createXMLStreamReader(java.lang.System.in)

while reader.has_next
  type = reader.next

  if type == XMLStreamConstants::START_ELEMENT
    tag = reader.local_name
    # do something with tag
  elsif type == XMLStreamConstants::CHARACTERS
    text = reader.text
    # do something with text
  elsif type == XMLStreamConstants::END_ELEMENT
    # reader.local_name is available here too; handle the closing tag
  end
end

Using the above "bones" of a parser, we can build out a tool that imports data from Wikipedia. You can store the file in home/import_from_wikipedia.rb

include Java

require 'time'

import 'javax.xml.stream.XMLStreamConstants'
import 'org.apache.hadoop.hbase.client.HTable'
import 'org.apache.hadoop.hbase.client.Put'


def jbytes(*args)
  args.map { |arg| arg.to_s.to_java_bytes }
end


factory = javax.xml.stream.XMLInputFactory.newInstance
reader = factory.createXMLStreamReader(java.lang.System.in)

document = nil
buffer = nil
count = 0

# @hbase is provided by the HBase shell, which is running this script.
table = HTable.new(@hbase.configuration, 'wiki')
table.setAutoFlush(false)   # batch writes; we flush explicitly below

while reader.has_next
  case reader.next
  when XMLStreamConstants::START_ELEMENT
    case reader.local_name
    when 'page' then document = {}
    when /title|timestamp|username|comment|text/ then buffer = []
    end
  when XMLStreamConstants::CHARACTERS
    buffer << reader.text unless buffer.nil?
  when XMLStreamConstants::END_ELEMENT
    case reader.local_name
    when /title|timestamp|username|comment|text/
      document[reader.local_name] = buffer.join
    when 'revision'
      # One revision is complete: write it as a single Put, versioned by the
      # revision's own timestamp.
      key = document['title'].to_java_bytes
      ts = (Time.parse document['timestamp']).to_i

      p = Put.new(key, ts)
      p.add(*jbytes("text", "", document['text']))
      p.add(*jbytes("revision", "author", document['username']))
      p.add(*jbytes("revision", "comment", document['comment']))
      table.put(p)

      count += 1
      table.flushCommits() if count % 10 == 0
      if count % 500 == 0
        puts "#{count} records inserted (#{document['title']})"
      end
    end
  end
end

table.flushCommits()
exit

Things to note:

  • We are building up content in a hash called document, and once a revision is complete we write that document to the database.
  • Observe the use of flushing. By default the HBase client buffers writes and flushes them periodically; in this script we disable auto-flushing and decide ourselves when to flush.

Before running, let's add some compression. Start the HBase shell and alter the table (as we did before) to enable compression. (We will use gzip, but in general LZO is probably a better bet; however, LZO is a bit more challenging to set up.) We will also enable Bloom filters. We will study these in depth later in the semester, but for now it is enough to think of a Bloom filter as a probabilistic data structure for set membership: if an element is in the set, the filter will always say so, but it may sometimes claim an element is in the set even when it is not.

hbase> alter 'wiki', {NAME=>'text', COMPRESSION=>'GZ', BLOOMFILTER=>'ROW'}
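
To make the membership idea concrete, here is a toy Bloom filter in plain Ruby. This is only a sketch of the concept, not HBase's implementation, and the bit-array size and hash count are arbitrary:

# A toy Bloom filter, purely to illustrate the idea.
require 'digest'

class ToyBloomFilter
  def initialize(bits = 1024, hashes = 3)
    @bits   = bits
    @hashes = hashes
    @field  = Array.new(bits, false)
  end

  # Derive several bit positions for an item by salting a digest.
  def positions(item)
    (1..@hashes).map do |i|
      Digest::SHA256.hexdigest("#{i}:#{item}").to_i(16) % @bits
    end
  end

  def add(item)
    positions(item).each { |pos| @field[pos] = true }
  end

  # false means "definitely not present"; true means "probably present".
  def maybe_include?(item)
    positions(item).all? { |pos| @field[pos] }
  end
end

bf = ToyBloomFilter.new
bf.add('Anarchism')
puts bf.maybe_include?('Anarchism')    # always true: no false negatives
puts bf.maybe_include?('Not a page')   # almost certainly false, but could be true

False positives become more likely as the filter fills up, which is why real implementations size the bit array based on the number of keys they expect.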

Depending on how you got here, you may have removed the table. If you did, no worries; just create it again with compression and a Bloom filter:

hbase> create 'wiki', {NAME=>'text', COMPRESSION=>'GZ', BLOOMFILTER=>'ROW'}, { NAME => 'revision', VERSIONS => org.apache.hadoop.hbase.HConstants::ALL_VERSIONS }

Now, let's run the script. You will need to run this from a bash shell inside the container (./bin/hbase-bash):

$ curl https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 | \
    bzcat | \
    hbase shell /mnt/import_from_wikipedia.rb
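
The full English Wikipedia dump is enormous. If you just want to exercise the script, one option is to point the same pipeline at a much smaller wiki; for example, the Simple English Wikipedia dump follows the same naming pattern (URL not verified here):

$ curl https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2 | \
    bzcat | \
    hbase shell /mnt/import_from_wikipedia.rb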

If all works out, data should start flowing in. Now, let's poke around in HBase's file system! Let's start a bash shell in the container

$ ./bin/hbase-bash

Check out the contents of hbase-1.2.6/conf/hbase-site.xml; we want to find out where the underlying HBase files live:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/testuser/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/testuser/zookeeper</value>
  </property>
</configuration>

We find this by looking up hbase.rootdir: the underlying files live in /home/testuser/hbase. Let's go check it out:

$ cd /home/testuser/hbase/

In class discussion and things to check out

  • Run du -h on the contents of /data/default as we scrape data (see the example below). Observe the file structure and how it relates to our tables, and how the folder names change as we load data; the folder names correspond to regions.
  • Check out the folders in the root with "WAL" in their names; these are the files for storing the Write-Ahead Logs. Check whether learners know about journaling; if not, it is a future lecture topic.
  • Run scan 'hbase:meta', { COLUMNS => ['info:server', 'info:regioninfo'] } and observe how the output, key names, ENCODED values, and timestamp values relate to the data in data/hbase/namespace.
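
For example, from the container bash (the path assumes the hbase.rootdir shown above):

$ du -h /home/testuser/hbase/data/default/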

Extracting a graph

Next, let's extract a graph from the data we have collected so far. Stop the import script with Ctrl-C, hop over to the HBase shell, and create a links table:

hbase> create 'links', {
  NAME => 'to', VERSIONS => 1, BLOOMFILTER => 'ROWCOL'
}, {
  NAME => 'from', VERSIONS => 1, BLOOMFILTER => 'ROWCOL'
}

And write a script that uses a scanner to extract the links (save it as home/generate_wiki_links.rb):

include Java

import 'org.apache.hadoop.hbase.client.HTable'
import 'org.apache.hadoop.hbase.client.Put'
import 'org.apache.hadoop.hbase.client.Scan'
import 'org.apache.hadoop.hbase.util.Bytes'

def jbytes(*args)
  args.map { |arg| arg.to_s.to_java_bytes }
end

wiki_table = HTable.new(@hbase.configuration, 'wiki')
links_table = HTable.new(@hbase.configuration, 'links')
links_table.setAutoFlush(false)

scanner = wiki_table.getScanner(Scan.new)

# Matches [[Target]] and [[Target|Label]] wiki links, skipping namespaced
# (e.g. [[Wikipedia:...]]) and anchor-only (e.g. [[#Section]]) links.
linkpattern = /\[\[([^\[\]\|\:\#][^\[\]\|:]*)(?:\|([^\[\]\|]+))?\]\]/
count = 0

while (result = scanner.next())
  title = Bytes.toString(result.getRow())
  text = Bytes.toString(result.getValue(*jbytes('text', '')))
  if text
    put_to = nil
    text.scan(linkpattern) do |target, label|
      unless put_to
        put_to = Put.new(*jbytes(title))
        # Skip the write-ahead log for speed; this data can be regenerated.
        put_to.setWriteToWAL(false)
      end

      target.strip!
      target.capitalize!

      label = '' unless label
      label.strip!

      # Outgoing link on the source page's row...
      put_to.add(*jbytes("to", target, label))
      # ...and the corresponding incoming link on the target page's row.
      put_from = Put.new(*jbytes(target))
      put_from.add(*jbytes("from", title, label))
      put_from.setWriteToWAL(false)
      links_table.put(put_from)
    end
    links_table.put(put_to) if put_to
    links_table.flushCommits()
  end

  count += 1
  puts "#{count} pages processed (#{title})" if count % 500 == 0
end

links_table.flushCommits()
exit

Now we can run the script

$ ./bin/hbase-shell /mnt/generate_wiki_links.rb

And we can start looking around

hbase> scan 'links', STARTROW => "Admiral Ackbar", ENDROW => "Admiral of"
hbase> get 'links', 'Addition'
hbase> count 'wiki'

Play around, check out the HBase docs for more, and enjoy! You are now ready to use HBase to do some data crunching on the assignment!