1. SELECT
SELECT [hints, ...] [ALL|DISTINCT] named_expression[, named_expression, ...]
FROM relation[, relation, ...]
[lateral_view[, lateral_view, ...]]
[WHERE boolean_expression]
[aggregation [HAVING boolean_expression]]
[ORDER BY sort_expressions]
[WINDOW named_window[, WINDOW named_window, ...]]
[LIMIT num_rows]
named_expression:
: expression [AS alias]
relation:
| join_relation
| (table_name|query|relation) [sample] [AS alias]
: VALUES (expressions)[, (expressions), ...]
[AS (column_name[, column_name, ...])]
expressions:
: expression[, expression, ...]
Output data from one or more relations. A relation refers to any source of input data. It could be the contents of an existing table (or view), the joined result of two existing tables, or a subquery (the result of another SELECT statement).
1.3 WHERE
Filter rows by predicate.
Like Operator: used to match text values against a pattern using wildcards. If the search expression matches the pattern expression, the LIKE operator returns true (1).
There are two wildcards used in conjunction with the LIKE operator:
- The percent sign (%)
- The underscore (_)
SELECT * FROM table_name
WHERE column LIKE 'XXXX%'   -- matches values that begin with XXXX

SELECT * FROM table_name
WHERE column LIKE '%XXXX%'  -- matches values that contain XXXX anywhere

SELECT * FROM table_name
WHERE column LIKE 'XXXX_'   -- matches XXXX followed by exactly one character
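For instance, a sketch assuming the ipl.iplmatch table queried in section 1.11; this returns the matches whose Location begins with 'Bang' (such as 'Bangalore'):
SELECT * FROM ipl.iplmatch
WHERE Location LIKE 'Bang%';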
1.4 GROUP BY (Aggregate Functions)
It is used in conjunction with the SELECT statement to group together rows in a table that have identical data, either to eliminate redundancy in the output or to compute aggregates that apply to these groups. The GROUP BY clause follows the WHERE clause in a SELECT statement and precedes the ORDER BY clause.
SELECT column_list
FROM table_name
[WHERE conditions]
GROUP BY column1, column2, ..., columnN
[ORDER BY column1, column2, ..., columnN]
Common Aggregate Functions
COUNT() - Returns the number of rows containing non-NULL values in the specified field.
SUM() - Returns the sum of the non-NULL values in the specified field.
AVG() - Returns the average of the non-NULL values in the specified field.
MAX() - Returns the maximum of the non-NULL values in the specified field.
MIN() - Returns the minimum of the non-NULL values in the specified field.
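For instance, a sketch assuming the ipl.iplmatch table queried in section 1.11:
SELECT hometeam, COUNT(*) AS num_matches, MAX(matchnumber) AS last_match
FROM ipl.iplmatch
GROUP BY hometeam
ORDER BY num_matches DESC;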
1.7 WINDOW
Assign an identifier to a window specification. Refer to section 6 for more information on window functions.
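For example, a minimal sketch assuming the bbl.bbl01cs table used in section 1.11; the named window w is defined once with the WINDOW keyword and referenced by OVER:
SELECT batsman_name, SUM(batsman_runs) OVER w AS total_runs
FROM bbl.bbl01cs
WINDOW w AS (PARTITION BY batsman_name);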
1.10 COMMON TABLE EXPRESSIONS (CTE)
A common table expression (CTE) is a temporary result set which you can reference within another SQL statement, including SELECT, INSERT, UPDATE, or DELETE. Common table expressions are temporary in the sense that they only exist during the execution of the query. The following shows the syntax for creating a CTE:
WITH cte_name (column_list) AS (
CTE_query_definition
)
[SELECT statement] | [INSERT statement] | [UPDATE statement] | [DELETE statement];
For example:
WITH cte AS (
    SELECT * FROM ipl.individual
)
SELECT count(*) FROM cte
In this syntax:
- First, specify the name of the CTE, followed by an optional column list.
- Second, inside the body of the WITH clause, specify a query that returns a result set. If you do not explicitly specify the column list after the CTE name, the select list of the CTE query definition becomes the column list of the CTE.
- Third, use the CTE like a table or view in the statement, which can be a SELECT, INSERT, UPDATE, or DELETE.
Common Table Expressions or CTEs are typically used to simplify complex joins and subqueries.
1.11 EXAMPLES:
SELECT * FROM ipl.iplmatch;
SELECT matchnumber,hometeam from ipl.iplmatch;
SELECT DISTINCT matchnumber,hometeam from ipl.iplmatch;
SELECT count(*) from ipl.iplmatch;
SELECT * FROM ipl.iplmatch where Location='Bangalore' and IndexNo=419112;
SELECT * FROM ipl.iplmatch LIMIT 10;
SELECT batsman_name AS batsman, max(batsman_runs) AS max_runs
FROM bbl.bbl01cs
GROUP BY 1
ORDER BY max_runs DESC;

SELECT batsman_name AS batsman, max(batsman_runs) AS max_runs
FROM bbl.bbl01cs
WHERE batsman_name LIKE '%Ma%'
GROUP BY 1
HAVING max_runs >= 4
ORDER BY max_runs DESC;
2. SAMPLING
sample:
| TABLESAMPLE ((integer_expression | decimal_expression) PERCENT)
: TABLESAMPLE (integer_expression ROWS)
Sample the input data. The sample size can be expressed as either a percentage (which must be between 0 and 100) or a fixed number of input rows.
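For example, a sketch assuming the ipl.iplmatch table from section 1.11:
SELECT * FROM ipl.iplmatch TABLESAMPLE (10 PERCENT);
SELECT * FROM ipl.iplmatch TABLESAMPLE (100 ROWS);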
3. JOIN
join_relation:
| relation join_type JOIN relation (ON boolean_expression | USING (column_name[, column_name, ...]))
: relation NATURAL join_type JOIN relation
join_type:
| INNER
| (LEFT|RIGHT) SEMI
| (LEFT|RIGHT|FULL) [OUTER]
: [LEFT] ANTI
3.2 OUTER JOIN
Select all rows from both relations, filling with null values on the side that does not have a match.
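For example, a sketch assuming the TPCH_AZBLB tables used in section 8; customers without orders (and orders without customers) still appear, padded with NULLs:
SELECT b.c_name, a.o_orderkey
FROM TPCH_AZBLB.customer b
FULL OUTER JOIN TPCH_AZBLB.orders a
    ON a.o_custkey = b.c_custkey;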
4. LATERAL VIEW
lateral_view:
: LATERAL VIEW [OUTER] function_name (expressions)
table_name [AS (column_name[, column_name, ...])]
Generate zero or more output rows for each input row using a table-generating function. The most common built-in function used with LATERAL VIEW is explode.
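For example, a minimal sketch assuming a hypothetical match_scores table whose scores column holds an array; each array element becomes its own output row:
SELECT match_id, score
FROM match_scores
LATERAL VIEW explode(scores) t AS score;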
5. AGGREGATION
aggregation:
: GROUP BY expressions [(WITH ROLLUP | WITH CUBE | GROUPING SETS (expressions))]
Group by a set of expressions using one or more aggregate functions. Common built-in aggregate functions include count, avg, min, max, and sum.
5.1 ROLLUP
Create a grouping set at each hierarchical level of the specified expressions.
For instance,
GROUP BY a, b, c WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (a), ()).
The total number of grouping sets will be N + 1, where N is the number of group expressions.
5.2 CUBE
Create a grouping set for each possible combination of set of the specified expressions.
For instance,
GROUP BY a, b, c WITH CUBE is equivalent to GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ()).
The total number of grouping sets will be 2^N, where N is the number of group expressions.
5.3 GROUPING SETS
Perform a group by for each subset of the group expressions specified in the grouping sets.
For instance,
GROUP BY x, y GROUPING SETS (x, y) is equivalent to the result of GROUP BY x unioned with that of GROUP BY y.
5.4 EXAMPLES:
SELECT venue, COUNT(*) AS num_matches
FROM wbbl.wbbl01c
GROUP BY venue;

SELECT venue, AVG(batsman_runs) AS avg_runs
FROM wbbl.wbbl01c
GROUP BY venue;

SELECT venue, fixture, batsman_name
FROM wbbl.wbbl01c
GROUP BY venue, fixture, batsman_name WITH ROLLUP;

SELECT venue, fixture, AVG(batsman_runs)
FROM wbbl.wbbl01c
GROUP BY venue, fixture GROUPING SETS (venue, fixture);
6. WINDOW FUNCTION
window_expression:
: expression OVER window_spec
named_window:
: window_identifier AS window_spec
window_spec:
| window_identifier
: ((PARTITION|DISTRIBUTE) BY expressions
[(ORDER|SORT) BY sort_expressions] [window_frame])
window_frame:
| (RANGE|ROWS) frame_bound
: (RANGE|ROWS) BETWEEN frame_bound AND frame_bound
frame_bound:
| CURRENT ROW
| UNBOUNDED (PRECEDING|FOLLOWING)
: expression (PRECEDING|FOLLOWING)
Compute a result over a range of input rows. A windowed expression is specified using the OVER keyword, which is followed by either an identifier to the window (defined using the WINDOW keyword) or the specification of a window.
6.4 ROWS bound
Express the size of the window in terms of the number of rows before and/or after the current row.
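For example, a sketch assuming the wbbl.wbbl01c table from section 5.4; the frame ROWS BETWEEN 2 PRECEDING AND CURRENT ROW yields a three-row moving average:
SELECT venue, batsman_runs,
       AVG(batsman_runs) OVER (PARTITION BY venue ORDER BY fixture
                               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg
FROM wbbl.wbbl01c;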
7. HINTS
hints:
: /*+ hint[, hint, ...] */
hint:
: hintName [(expression[, expression, ...])]
Hints give the engine additional information that can help it execute a query more efficiently.
For example, you can hint that a table is small enough to be broadcast, which would speed up joins.
You add one or more hints to a SELECT statement inside /*+ … */ comment blocks.
Multiple hints can be specified inside the same comment block, in which case the hints are separated by commas, and there can be multiple such comment blocks. A hint has a name (for example, BROADCAST) and accepts 0 or more parameters.
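For example, a sketch assuming the TPCH_AZBLB tables from section 8; the BROADCAST hint marks the customer side as small enough to broadcast to every node:
SELECT /*+ BROADCAST(b) */ b.c_name, a.o_orderkey
FROM TPCH_AZBLB.orders a
INNER JOIN TPCH_AZBLB.customer b
    ON a.o_custkey = b.c_custkey;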
8. DATA SOURCE VIEW
A view is a virtual table based on the result set of a SQL statement.
A view contains rows and columns, just like a real table.
The fields in a view are fields from one or more real tables in the database.
Using the Zetaris Data Fabric, tables/views from multiple data sources can be used to create a new view.
You can add SQL functions, WHERE clauses, and JOINs to a view and present the data as if it were coming from one single table.
CREATE [OR ALTER] VIEW schema_name.view_name [(column_list)]
AS
select_statement;
In this syntax:
- First, specify the name of the view after the CREATE VIEW keywords. The schema_name is the name of the schema to which the view belongs.
- Second, specify a SELECT statement (select_statement) that defines the view after the AS keyword. The SELECT statement can refer to one or more tables.
If you don't explicitly specify a list of columns for the view, the column list is derived from the SELECT statement.
In case you want to redefine the view, e.g., adding columns to it or removing columns from it, you can use the OR ALTER keywords (CREATE OR ALTER VIEW).
CREATE DATASOURCE VIEW cust_ordr_count AS
SELECT b.c_name,
       count(distinct a.o_orderkey)
FROM TPCH_AZBLB.orders a
INNER JOIN TPCH_AZBLB.customer b
    ON a.o_custkey = b.c_custkey
GROUP BY 1
9. CAST OPERATOR
There are many cases in which you want to convert a value of one data type into another. PostgreSQL provides the CAST operator that allows you to do this.
The following illustrates the syntax of type CAST:
CAST ( expression AS target_type );
In this syntax:
- First, specify an expression, which can be a constant, a table column, or an expression that evaluates to a value.
- Then, specify the target data type to which you want to convert the result of the expression.
9.1 EXAMPLES
SELECT
CAST ('100' AS INTEGER);
If the expression cannot be converted to the target type, PostgreSQL will raise an error. See the following example:
SELECT
CAST ('10C' AS INTEGER);
[Err] ERROR: invalid input syntax for integer:
SELECT
CAST ('2015-01-01' AS DATE),
CAST ('01-OCT-2015' AS DATE);
10. Other String Functions
10.1 CONCAT( string str1, string str2… )
The CONCAT function concatenates all the given strings.
Example: CONCAT('hadoop','-','hive') returns 'hadoop-hive'
CONCAT_WS( string delimiter, string str1, string str2... )
The CONCAT_WS function is similar to the CONCAT function. Here you can also provide the delimiter, which is placed between the strings being concatenated.
Example: CONCAT_WS('-','hadoop','hive') returns 'hadoop-hive'
10.2 FIND_IN_SET( string search_string, string source_string_list )
The FIND_IN_SET function searches for the search string in source_string_list and returns the position of its first occurrence. The source string list must be comma-delimited. The function returns 0 if the first argument contains a comma.
Example: FIND_IN_SET('ha','hao,mn,hc,ha,hef') returns 4
10.3 LENGTH( string str )
The LENGTH function returns the number of characters in a string.
Example: LENGTH('hive') returns 4
10.4 LOWER( string str ), LCASE( string str )
The LOWER or LCASE function converts the string into lower case letters.
Example: LOWER('HiVe') returns 'hive'
10.5 LPAD( string str, int len, string pad )
The LPAD function returns the string with a length of len characters left-padded with pad.
Example: LPAD('hive',6,'v') returns 'vvhive'
10.6 LTRIM( string str )
The LTRIM function removes all the leading spaces from the string.
Example: LTRIM(' hive') returns 'hive'
10.7 REPEAT( string str, int n )
The REPEAT function repeats the specified string n times.
Example: REPEAT('hive',2) returns 'hivehive'
10.8 RPAD( string str, int len, string pad )
The RPAD function returns the string with a length of len characters right-padded with pad.
Example: RPAD('hive',6,'v') returns 'hivevv'
10.9 REVERSE( string str )
The REVERSE function returns the reversed string.
Example: REVERSE('hive') returns 'evih'
10.10 RTRIM( string str )
The RTRIM function removes all the trailing spaces from the string.
Example: RTRIM('hive ') returns 'hive'
10.11 SPACE( int number_of_spaces )
The SPACE function returns the specified number of spaces.
Example: SPACE(4) returns '    ' (a string of four spaces)
10.12 SPLIT( string str, string pat )
The SPLIT function splits the string around the pattern pat and returns an array of strings. You can specify regular expressions as patterns.
Example: SPLIT('hive:hadoop',':') returns ["hive","hadoop"]
10.13 SUBSTR( string source_str, int start_position [,int length] )
The SUBSTR or SUBSTRING function returns a part of the source string from the start position with the specified length of characters. If the length is not given, then it returns from the start position to the end of the string.
Example1: SUBSTR('hadoop',4) returns 'oop'
Example2: SUBSTR('hadoop',4,2) returns 'oo'
10.14 TRIM( string str )
The TRIM function removes both the trailing and leading spaces from the string.
Example: TRIM(' hive ') returns 'hive'
11. Other Useful Functions - Quick Reference
11.1 Mathematical Functions
The following built-in mathematical functions are supported in Lightning; most return NULL when the argument(s) are NULL:
Function Name
acos(double a), acos(DECIMAL a)
Returns the arc cosine of a if -1<=a<=1, or NULL otherwise
Returns double
asin(double a), asin(DECIMAL a)
Returns the arc sine of a if -1<=a<=1, or NULL otherwise
Returns double
ceil(double a), ceiling(double a)
Returns the minimum BIGINT value that is equal to or greater than the argument
Returns bigint
conv(BIGINT num, int from_base, int to_base), conv(STRING num, int from_base, int to_base)
Converts a number from a given base to another
Returns string
exp(double a), exp(DECIMAL a)
Returns e^a, where e is the base of the natural logarithm
Returns double
floor(double a)
Returns the maximum BIGINT value that is equal to or less than the argument
Returns bigint
hex(BIGINT a) hex(string a) hex(BINARY a)
If the argument is an int, hex returns the number as a string in hexadecimal format.
Otherwise, if the argument is a string, it converts each character into its hexadecimal representation and returns the resulting string.
Returns string
log(double base, double a), log(DECIMAL base, DECIMAL a)
Returns the logarithm of the argument a with the given base
Returns double
11.2 The following are built-in String functions
Function Name
base64(binary bin)
Converts the argument from binary to a base-64 string (as of 0.12.0)
Returns string
chr(bigint|double A)
Returns the ASCII character having the binary equivalent to A (as of Hive 1.3.0 and 2.1.0). If A is larger than 256, the result is equivalent to chr(A % 256). Example: select chr(88); returns "X".
Returns string
concat(string|binary A, string|binary B...)
Returns the string or bytes resulting from concatenating the strings or bytes passed in as parameters in order, e.g. concat('foo', 'bar') results in 'foobar'. Note that this function can take any number of input strings.
Returns string
context_ngrams(array<array<string>>, array<string>, int K, int pf)
Returns the top-k contextual N-grams from a set of tokenized sentences, given a string of "context". See StatisticsAndDataMining for more information.
Returns array<struct<string,double>>
concat_ws(string SEP, string A, string B…)
Like concat() above, but with custom separator SEP.
Returns string
concat_ws(string SEP, array<string>)
Like concat_ws() above, but taking an array of strings. (as of Hive 0.9.0)
Returns string
decode(binary bin, string charset)
Decodes the first argument into a String using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’). If either argument is null, the result will also be null. (As of Hive 0.12.0.)
Returns string
elt(N int, str1 string, str2 string, str3 string, ...)
Returns the string at index number N. For example elt(2, 'hello', 'world') returns 'world'. Returns NULL if N is less than 1 or greater than the number of arguments. (See https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_elt)
Returns string
encode(string src, string charset)
Encodes the first argument into a BINARY using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’). If either argument is null, the result will also be null. (As of Hive 0.12.0.)
Returns binary
field(val T,val1 T,val2 T,val3 T,…)
Returns the index of val in the val1,val2,val3,… list or 0 if not found. For example field(‘world’,’say’,’hello’,’world’) returns 3. All primitive types are supported, arguments are compared using str.equals(x). If val is NULL, the return value is 0. (See https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_field)
Returns int
find_in_set(string str, string strList)
Returns the position of the first occurrence of str in strList, where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas, e.g. find_in_set('ab', 'abc,b,ab,c,def') returns 3
Returns int
format_number(number x, int d)
Formats the number X to a format like ‘#,###,###.##’, rounded to D decimal places, and returns the result as a string. If D is 0, the result has no decimal point or fractional part. (as of Hive 0.10.0)
Returns string
get_json_object(string json_string, string path)
Extract json object from a json string based on json path specified, and return json string of the extracted json object. It will return null if the input json string is invalid. NOTE: The json path can only have the characters [0-9a-z], i.e., no upper-case or special characters. Also, the keys cannot start with numbers. This is due to restrictions on Hive column names.
Returns string
in_file(string str, string filename)
Returns true if the string str appears as an entire line in filename.
Returns boolean
initcap(string A)
Returns the string with the first letter of each word in uppercase and all other letters in lowercase. Words are delimited by whitespace. (As of Hive 1.1.0.)
Returns string
instr(string str, string substr)
Returns the position of the first occurrence of substr in str
Returns int
levenshtein(string A, string B)
Returns the Levenshtein distance between two strings (as of Hive 1.2.0). For example, levenshtein(‘kitten’, ‘sitting’) results in 3.
Returns int
locate(string substr, string str[, int pos])
Returns the position of the first occurrence of substr in str after position pos
Returns int
lower(string A) lcase(string A)
Returns the string resulting from converting all characters of A to lower case. For example, lower('fOoBaR') results in 'foobar'.
Returns string
lpad(string str, int len, string pad)
Returns str, left-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. In case of empty pad string, the return value is null.
Returns string
ltrim(string A)
Returns the string resulting from trimming spaces from the beginning (left-hand side) of A, e.g. ltrim(' foobar ') results in 'foobar '
Returns string
ngrams(array<array<string>>, int N, int K, int pf)
Returns the top-k N-grams from a set of tokenized sentences, such as those returned by the sentences() UDAF. See StatisticsAndDataMining for more information.
Returns array<struct<string,double>>
parse_url(string urlString, string partToExtract [, string keyToExtract])
Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO. e.g. parse_url(‘https://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1’, ‘HOST’) returns ‘facebook.com’. Also a value of a particular key in QUERY can be extracted by providing the key as the third argument, e.g. parse_url(‘https://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1’, ‘QUERY’, ‘k1’) returns ‘v1’.
Returns string
printf(String format, Obj… args)
Returns the input formatted according to printf-style format strings (as of Hive 0.9.0)
Returns string
regexp_extract(string subject, string pattern, int index)
Returns the string extracted using the pattern, e.g. regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.
Returns string
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)
Returns the string resulting from replacing all substrings in INITIAL_STRING that match the Java regular expression syntax defined in PATTERN with instances of REPLACEMENT, e.g. regexp_replace("foobar", "oo|ar", "") returns 'fb'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
Returns string
replace(string A, string OLD, string NEW)
Returns the string A with all non-overlapping occurrences of OLD replaced with NEW (as of Hive 1.3.0 and 2.1.0). Example: select replace(“ababab”, “abab”, “Z”); returns “Zab”.
Returns string
rpad(string str, int len, string pad)
Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. In case of empty pad string, the return value is null.
Returns string
rtrim(string A)
Returns the string resulting from trimming spaces from the end (right-hand side) of A, e.g. rtrim(' foobar ') results in ' foobar'
Returns string
sentences(string str, string lang, string locale)
Tokenizes a string of natural language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. The ‘lang’ and ‘locale’ are optional arguments. e.g. sentences(‘Hello there! How are you?’) returns ( (“Hello”, “there”), (“How”, “are”, “you”) )
Returns array<array<string>>
soundex(string A)
Returns soundex code of the string (as of Hive 1.2.0). For example, soundex(‘Miller’) results in M460.
Returns string
str_to_map(text[, delimiter1, delimiter2])
Splits text into key-value pairs using two delimiters. Delimiter1 separates text into K-V pairs, and Delimiter2 splits each K-V pair. Default delimiters are ‘,’ for delimiter1 and ‘=’ for delimiter2.
Returns map<string,string>
substr(string|binary A, int start) substring(string|binary A, int start)
Returns the substring or slice of the byte array of A starting from start position till the end of string A, e.g. substr('foobar', 4) results in 'bar'
Returns string
substr(string|binary A, int start, int len) substring(string|binary A, int start, int len)
Returns the substring or slice of the byte array of A starting from start position with length len, e.g. substr('foobar', 4, 1) results in 'b'
Returns string
translate(string|char|varchar input, string|char|varchar from, string|char|varchar to)
Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string. This is similar to the translate function in PostgreSQL. If any of the parameters to this UDF are NULL, the result is NULL as well (available as of Hive 0.10.0; char/varchar support added as of Hive 0.14.0.)
Returns string
12 Register Logical Datasources
To build a virtual data lake, a user needs to identify the databases to be connected. Lightning supports all JDBC-compliant databases, all well-known NoSQL stores, REST APIs, and CSV and JSON files.
Register Clustered Databases
Register Master Datasources
A user needs to provide the JDBC driver class, URL, and connection credentials, including any extra parameters the database needs.
Syntax:
CREATE DATASOURCE ORCL [DESCRIBE BY "Oracle for Product Master"] OPTIONS (
jdbcdriver "oracle.jdbc.OracleDriver",
jdbcurl "jdbc:oracle:thin:@oracle-master:1521:orcl",
username "scott",
password "tiger",
[key "value"]*)
A real-world example, showing a connection to Oracle on port 1521 of localhost, with a database name of orcl, user name scott, and password tiger:
CREATE DATASOURCE ORACLE DESCRIBE BY "Oracle for Product Master" OPTIONS (
jdbcdriver "oracle.jdbc.OracleDriver",
jdbcurl "jdbc:oracle:thin:@localhost:1521:orcl",
username "scott",
password "tiger")
Add Slave Nodes
If the registered database supports cluster-based computing such as MPP, a user can register slave nodes so that Lightning can query the slave nodes directly rather than routing everything through the master node.
ADD SLAVE DATASOURCE TO ORCL OPTIONS (
jdbcdriver "oracle.jdbc.OracleDriver",
jdbcurl "jdbc:oracle:thin:@oracle-slave1:1521:orcl",
username "scott",
password "tiger",
[key "value"]*)
ADD SLAVE DATASOURCE TO ORCL OPTIONS (
jdbcdriver "oracle.jdbc.OracleDriver",
jdbcurl "jdbc:oracle:thin:@oracle-slave2:1521:orcl",
username "scott",
password "tiger",
[key "value"]*)
ADD SLAVE DATASOURCE TO ORCL OPTIONS (
jdbcdriver "oracle.jdbc.OracleDriver",
jdbcurl "jdbc:oracle:thin:@oracle-slave3:1521:orcl",
username "scott",
password "tiger",
[key "value"]*)
The above example shows registering one master node (oracle-master) and three slave nodes (oracle-slave1, oracle-slave2, and oracle-slave3).
- ORCL is an alias for the target data source, chosen by the user when creating the data source.
- If schema is provided, Lightning will only ingest metadata from that schema.
- If schema_prepended_table is set to true, the schema will be prepended to the table name, since tables with the same name may exist across different schemas. For example, the role table in the zetaris_bi schema will be named zetaris_bi__role.
Register RDBMS Datasource
MS SQL Server
CREATE DATASOURCE MSSQL DESCRIBE BY "MSSQL-2017-linux" OPTIONS (
jdbcdriver "com.microsoft.sqlserver.jdbc.SQLServerDriver",
jdbcurl "jdbc:sqlserver://localhost:1433",
databaseName "DemoData",
username "scott",
password "tiger",
schema "dbo"
)
MySQL
CREATE DATASOURCE MY_SQL DESCRIBE BY "MySQL" OPTIONS (
jdbcdriver "com.mysql.jdbc.Driver",
jdbcurl "jdbc:mysql://127.0.0.1/test_db",
username "scott",
password "tiger"
)
IBM DB2
CREATE DATASOURCE DB2_DB2INST1 DESCRIBE BY "DB2 Sample DB Schema" OPTIONS (
jdbcdriver "com.ibm.db2.jcc.DB2Driver",
jdbcurl "jdbc:db2://127.0.0.1:50000/db_name",
username "db2inst1",
password "db2inst1-pwd",
schema "DB2INST1",
schema_prepended_table "true"
)
Greenplum
CREATE DATASOURCE GREEN_PLUM DESCRIBE BY "GREEN_PLUM" OPTIONS (
jdbcdriver "org.postgresql.Driver",
jdbcurl "jdbc:postgresql://localhost:5432/postgres",
username "gpadmin",
password "pivotal",
schema "public"
)
Teradata
CREATE DATASOURCE TERA_DATA DESCRIBE BY "TERA_DATA" OPTIONS (
jdbcdriver "com.teradata.jdbc.TeraDriver",
jdbcurl "jdbc:teradata://10.128.87.16/DBS_PORT=1025",
username "dbc",
password "dbc",
schema "dbcmngr"
)
Amazon Aurora
CREATE DATASOURCE AWS_AURORA DESCRIBE BY "AWS_AURORA" OPTIONS (
jdbcdriver "com.mysql.jdbc.Driver",
jdbcurl "jdbc:mysql://zet-aurora-cluster.cluster-ckh4ncwbhsty.ap-southeast-2.rds.amazonaws.com/your_db?",
username "your_db_account_name",
password "your_db_account_password"
)
Amazon Redshift
CREATE DATASOURCE REDSHIFT DESCRIBE BY "AWS RedShift" OPTIONS (
jdbcdriver "com.amazon.redshift.jdbc.Driver",
jdbcurl "jdbc:redshift://zetaris.cyzoanxzdpje.ap-southeast-2.redshift.amazonaws.com:5439/your_db_name",
username "your_db_account_name",
password "your_db_account_password"
)
Register NOSQL Datasource
Register Mongo DB
For MongoDB, the five parameters below (host, port, database name, user name, and password) must be provided.
CREATE DATASOURCE MONGO DESCRIBE BY "MongoDB" OPTIONS (
lightning.datasource.mongodb.host "localhost",
lightning.datasource.mongodb.port "27017",
lightning.datasource.mongodb.database "lightning-demo",
lightning.datasource.mongodb.username "",
lightning.datasource.mongodb.password ""
)
Register Cassandra
For Cassandra, there is only one Lightning-specific parameter, which is the key space for the connection. The other parameters start with the "spark.cassandra" prefix and are passed through to the Spark Cassandra connector (https://github.com/datastax/spark-cassandra-connector).
CREATE DATASOURCE CSNDR DESCRIBE BY "Cassandra" OPTIONS (
spark.cassandra.connection.host "localhost",
spark.cassandra.connection.port "9042",
spark.cassandra.auth.username "cassandra",
spark.cassandra.auth.password "cassandra",
lightning.datasource.cassandra.keyspace "lightning_demo"
)
13 Create Physical Datasources
These are data sources based on physical files residing on an on-prem filestore (local filesystem, NFS filesystem) or on cloud filestores like S3 and Azure Blob. One needs to first create a Lightning database and then register the respective files under this namespace.
CREATE LIGHTNING DATABASE AWS_S3 DESCRIBE BY "AWS S3 bucket" OPTIONS (
[key "value"]
)
Ingest a file from the local filesystem
CREATE LIGHTNING FILESTORE TABLE pref FROM HR FORMAT (CSV | JSON)
OPTIONS (path "file path", header "true", inferSchema "true", [key value pair]);
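For example, a sketch with a hypothetical file path; header and inferSchema tell Lightning to skip the header row and derive column types:
CREATE LIGHTNING FILESTORE TABLE pref FROM HR FORMAT CSV
OPTIONS (path "/data/hr/pref.csv", header "true", inferSchema "true");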
Ingest RESTful Service
For a RESTful service that returns JSON, a user needs to provide the endpoint, HTTP method, and encoding type, as well as the schema:
REGISTER REST DATASOURCE TABLE SAFC_USERS FROM SAFC SCHEMA (
uid Long,
gender String,
age Integer,
job String,
ts String)
OPTIONS (
endpoint "/example/users",
method "GET",
requesttype "URLENCODED"
);
Other parameters for the API call, such as a security key, can be provided in the OPTIONS field.
Ingest file from S3
CREATE LIGHTNING FILESTORE TABLE customer FROM TPCH_S3 FORMAT CSV OPTIONS (
PATH "s3n://zetaris-lightning-test/csv-data/tpc-h/customer.csv",
inferSchema "true",
AWSACCESSKEYID "AKIAITGIWHBIPE******A",
AWSSECRETACCESSKEY "EWfnuO/2E8UAA/5v89sxo6hTVefa5U*********"
)
Ingest a file from Azure Blob
CREATE LIGHTNING FILESTORE TABLE customer FROM TPCH_AZBLB FORMAT CSV OPTIONS (
PATH "wasb://zettest-storage-container@zettesstorage.blob.core.windows.net/customer.csv",
inferSchema "true",
fs.azure.account.key.zettesstorage.blob.core.windows.net "bHLzau36KlZ6cYnSrvPzSJVniBDtu819nHTR/+hRyDZEVScQ3wuesst9P5/I7v******"
)
The key name for the security key depends on the Azure Blob container; refer to the Azure Blob service documentation.
14 Ingest Metadata
Once a data source is registered, Lightning will ingest all table, column, and constraint metadata.
Ingest all tables from the data source
REGISTER DATASOURCE TABLES FROM ORCL
This command will connect to the ORCL database and ingest all metadata (tables, columns, foreign keys, indexes, and all other constraints) into the Schema Store.
Ingest a table from the data source
REGISTER DATASOURCE TABLE "USER" [USER_ALIAS] FROM ORCL
This will register the "USER" table, as USER_ALIAS if an alias is provided.
15 Manage Schema Store
Lightning provides various commands to manage metadata in the schema store. These are also available via the RESTful service.
Datasource
Show Datasource
This command shows the data sources registered in the schema store:
SHOW DATASOURCES
View
Lightning supports the view capability with a query definition on a single data source or across multiple data sources.
Create Data Source View
CREATE DATASOURCE VIEW TEEN_AGER FROM ORCL AS
SELECT * FROM USERS WHERE AGE >= 13 AND AGE < 20
The TEEN_AGER view belongs to the ORCL data source. With this capability, a user can create a view with the DBMS's native query, which is really handy:
CREATE DATASOURCE VIEW SALARY_RANK FROM ORCL AS
SELECT department_id, last_name, salary, RANK() OVER (PARTITION BY department_id ORDER BY salary) RANK
FROM employees
WHERE department_id = 60
ORDER BY RANK, last_name
SELECT * FROM ORCL.SALARY_RANK will produce:
DEPARTMENT_ID LAST_NAME SALARY RANK
------------- ------------------------- ---------- ----------
60 Lorentz 4200 1
60 Austin 4800 2
60 Pataballa 4800 2
60 Ernst 6000 4
60 Hunold 9000 5
These views can also be joined with tables in other data sources.
Create Schema Store View
This type of view can span different data sources.
CREATE DATASOURCE VIEW TOP10_MOVIES_FOR_TEENS AS
SELECT movies_from_oracle.title, user_rating.count, user_rating.min, user_rating.max, user_rating.avg
FROM(
SELECT iid, count(*) count, min(pref) min, max(pref) max, avg(pref) avg
FROM TRDT.ratings ratings_from_teradata, PGRS.users users_from_postgres
WHERE users_from_postgres.age >= 13 AND users_from_postgres.age < 20
AND ratings_from_teradata.uid = users_from_postgres.uid
GROUP BY ratings_from_teradata.iid
ORDER BY avg DESC
LIMIT 20
) AS user_rating, ORCL.movies movies_from_oracle
WHERE movies_from_oracle.iid = user_rating.iid
This view can be queried like a normal table:
SELECT * FROM TOP10_MOVIES_FOR_TEENS
16 Run Query
Lightning supports SQL:2003 and can run all 99 TPC-DS queries. As long as a data source is registered in the schema store, a query can span all data sources.
For example, the following query joins across three different data sources (Teradata ↔ Oracle ↔ Cassandra):
SELECT users_from_cassandra.age, users_from_cassandra.gender, movies_from_oracle.title title, ratings_from_teradata.pref, ratings_from_teradata.ts
FROM TRDT.ratings ratings_from_teradata, ORCL.movies movies_from_oracle, CSNDR.users users_from_cassandra
WHERE users_from_cassandra.gender = 'F'
AND ratings_from_teradata.uid = users_from_cassandra.uid
AND movies_from_oracle.iid = ratings_from_teradata.iid
17 Materialization and Cache
For reasons such as query performance, any data source table or view can be materialized by leveraging Zetaris Fusion DB. Lightning also supports a cache capability, where a user can load all data into main memory.
Materialization
For example, the following query materializes all data from the RESTful service into the USERS_FOR_COPY table in Fusion DB:
INSERT INTO FUSIONDB.USERS_FOR_COPY
SELECT uid, gender, age, job, ts FROM SAFC.SAFC_USERS
Cache/Uncache
A user can load all data into main memory by leveraging the cache capability, and can uncache it at any time.
CACHE TABLE pref;
CACHE TABLE ORCL.movies;
The pref and ORCL.movies tables are now cached, and the following query performs a lot better:
SELECT movies_from_oracle.title, hdfs_pref.count, hdfs_pref.min, hdfs_pref.max, hdfs_pref.avg
FROM (
    SELECT iid, count(*) count, min(pref) min, max(pref) max, avg(pref) avg
    FROM pref
    GROUP BY iid
) AS hdfs_pref, ORCL.movies movies_from_oracle
WHERE movies_from_oracle.iid = hdfs_pref.iid
These tables can be uncached at any time:
UNCACHE TABLE pref
UNCACHE TABLE ORCL.movies
18 Statistics
Lightning comes with a CBO (Cost-Based Optimizer) to reduce data shuffling across the cluster. To support this, Lightning keeps statistics for each data source. There are two types of statistics: table-level statistics and column-level statistics.
19 Partitioning
Query performance can be improved by partitioning a table. Partitioning here means that all records are split into multiple partitions, which are processed independently in each worker node.
CREATE DATASOURCE PARTITION ON ORCL.USERS OPTIONS (
COLUMN "UID",
COUNT "2",
LOWERBOUND "1",
UPPERBOUND "6040")
This command makes two partitions based on the "UID" column; LOWERBOUND and UPPERBOUND provide the boundary values for the partitions. The partitioning can be removed by:
DROP DATASOURCE PARTITION ON ORCL.USERS
20 Import CSV file
Lightning supports CSV files and running queries on top of them. A CSV file can be imported by either:
Hive syntax
CREATE EXTERNAL TABLE pref (uid INT, iid INT, pref FLOAT, ts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'csv directory'
With this syntax, a user can import file formats other than CSV (TSV, for example). But note the following:
- Headers are not supported; the CSV header row must be removed.
- LOCATION must be a directory. Create a directory and place the CSV file there.
- For a directory in an S3 bucket, the CSV directory looks like "s3n://mys3bucket/perf/".
OR the CREATE LIGHTNING FILESTORE TABLE syntax described in section 13, which accepts header and inferSchema options.
21 Access Control
User Management
Only the admin user or users in the "admin" role can add or drop users.
Role-Based Access (RBA) Control
Lightning provides Role-Based Access, which limits users to a specific set of data. This is applied at the data source level or at the table level within each data source. Only admin users or equivalents can run these commands.
Privileges
- SELECT privilege - Gives read access to the data source or relation
- INSERT privilege - Gives insert access to the data source or relation
- CACHE privilege - Gives cache access to a relation: (UN)CACHE DATASOURCE TABLE
Predefined Roles
Roles are case-insensitive:
- admin
- none
- all
- default
Grant
A user granted a privilege WITH GRANT OPTION can grant the same privilege on the table to others.
GRANT (SELECT | INSERT | CACHE) ON table_or_view_name TO principal_spec [, principal_spec] ... [WITH GRANT OPTION]
Revoke
REVOKE (SELECT | INSERT | CACHE) ON table_or_view_name FROM principal_spec [, principal_spec] ...
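For example, a sketch using the ORCL.movies table and user ashutosh that appear in the SHOW GRANT sample output below:
GRANT SELECT ON ORCL.movies TO USER ashutosh WITH GRANT OPTION;
REVOKE SELECT ON ORCL.movies FROM USER ashutosh;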
Show Grant
- A wildcard can only be used in the table field. For example, ORCL.* is allowed, but *.* and *.movies are not.
SHOW GRANT [principal_specification] ON (ALL | [TABLE] table_or_view_name)
principal_specification: USER user | ROLE role
It will display:
table_identifier | principal_name | principal_type | privilege | grant_option | grant_time              | grantor
-----------------+----------------+----------------+-----------+--------------+-------------------------+--------
ORCL.movies      | ashutosh       | USER           | DELETE    | false        | 2018-05-07 11:44:12.301 | thejas
ORCL.movies      | ashutosh       | USER           | INSERT    | false        | 2018-05-07 11:44:12.301 | thejas
ORCL.ratings     | ashutosh       | ROLE           | SELECT    | false        | 2018-05-07 11:44:12.301 | thejas