[SPARK-16557][SQL] Remove stale doc in sql/README.md
## What changes were proposed in this pull request?

Most of the documentation in https://github.com/apache/spark/blob/master/sql/README.md is stale. It would be useful to keep the list of projects to explain what's going on, and everything else should be removed.

## How was this patch tested?

N/A

Author: Reynold Xin <[email protected]>

Closes #14211 from rxin/SPARK-16557.
Showing 1 changed file with 1 addition and 74 deletions.
sql/README.md
@@ -1,83 +1,10 @@
Spark SQL
=========

-This module provides support for executing relational queries expressed in either SQL or a LINQ-like Scala DSL.
+This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.

Spark SQL is broken up into four subprojects:
- Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
- Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
- Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
- HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.
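
To make the `SQLContext` entry point above concrete, here is a minimal sketch (assuming a Spark 1.x-style setup with an existing `SparkContext` named `sc`, e.g. from `spark-shell`, and a hypothetical `people.parquet` file) showing both the SQL path and the DataFrame API:

```scala
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc` (e.g. the one provided by spark-shell).
val sqlContext = new SQLContext(sc)

// Load a hypothetical Parquet file as a DataFrame.
val people = sqlContext.read.parquet("people.parquet")

// Register the DataFrame as a temporary table so it can be queried with SQL.
people.registerTempTable("people")

// The same relational query, expressed in SQL and with the DataFrame API.
val adultsSql = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
val adultsDsl = people.filter(people("age") >= 18).select(people("name"))

adultsSql.collect().foreach(println)
```

Both forms are compiled into the same Catalyst logical plan, which is exactly the tree-manipulation layer that `sql/catalyst` provides.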

Other dependencies for developers
---------------------------------
In order to create new Hive test cases (i.e. a test suite based on `HiveComparisonTest`), you will need to set up your development environment based on the following instructions.

If you are working with Hive 0.12.0, you will need to set several environment variables as follows.

```
export HIVE_HOME="<path to>/hive/build/dist"
export HIVE_DEV_HOME="<path to>/hive/"
export HADOOP_HOME="<path to>/hadoop"
```

If you are working with Hive 0.13.1, the following steps are needed:

1. Download Hive [0.13.1](https://archive.apache.org/dist/hive/hive-0.13.1) and set `HIVE_HOME` with `export HIVE_HOME="<path to hive>"`. Please do not set `HIVE_DEV_HOME` (see [SPARK-4119](https://issues.apache.org/jira/browse/SPARK-4119)).
2. Set `HADOOP_HOME` with `export HADOOP_HOME="<path to hadoop>"`.
3. Download all Hive 0.13.1a jars (the Hive jars actually used by Spark) from [here](http://mvnrepository.com/artifact/org.spark-project.hive) and replace the corresponding original 0.13.1 jars in `$HIVE_HOME/lib`.
4. Download the [Kryo 2.21 jar](http://mvnrepository.com/artifact/com.esotericsoftware.kryo/kryo/2.21) (note: the 2.22 jar does not work) and the [Javolution 5.5.1 jar](http://mvnrepository.com/artifact/javolution/javolution/5.5.1) to `$HIVE_HOME/lib`.
5. This step is optional, but when generating golden answer files, if a Hive query fails and you find that Hive tries to talk to HDFS or you find weird runtime NPEs, set the following in your test suite...

```
val testTempDir = Utils.createTempDir()
// We have to use kryo to let Hive correctly serialize some plans.
sql("set hive.plan.serialization.format=kryo")
// Explicitly set fs to local fs.
sql(s"set fs.default.name=file://$testTempDir/")
// Ask Hive to run jobs in-process as a single map and reduce task.
sql("set mapred.job.tracker=local")
```

Using the console
=================
An interactive Scala console can be invoked by running `build/sbt hive/console`.
From here you can execute queries with HiveQL and manipulate DataFrames using the DSL.

```scala
$ build/sbt hive/console

[info] Starting scala interpreter...
import org.apache.spark.sql.catalyst.analysis._
import org.apache.spark.sql.catalyst.dsl._
import org.apache.spark.sql.catalyst.errors._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules._
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.execution
import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.test.TestHive._
import org.apache.spark.sql.hive.test.TestHive.implicits._
import org.apache.spark.sql.types._
Type in expressions to have them evaluated.
Type :help for more information.

scala> val query = sql("SELECT * FROM (SELECT * FROM src) a")
query: org.apache.spark.sql.DataFrame = [key: int, value: string]
```

Query results are `DataFrames` and can be operated on as such.
```
scala> query.collect()
res0: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86], [311,val_311], [27,val_27]...
```

You can also build further queries on top of these `DataFrames` using the query DSL.
```
scala> query.where(query("key") > 30).select(avg(query("key"))).collect()
res1: Array[org.apache.spark.sql.Row] = Array([274.79025423728814])
```