1 應(yīng)用場(chǎng)合
SAP HANA作為一款內(nèi)存數(shù)據(jù)庫(kù)產(chǎn)品, 使得數(shù)據(jù)常駐內(nèi)存, 物理磁盤(pán)的存儲(chǔ)作為數(shù)據(jù)備份與日志記錄, 以防斷電內(nèi)存中數(shù)據(jù)丟失. 這種構(gòu)架大大的縮短了數(shù)據(jù)存取的時(shí)間, 使得SAP HANA很”高速”.
在傳統(tǒng)數(shù)據(jù)模型中,數(shù)據(jù)庫(kù)只是作為存取數(shù)據(jù)一個(gè)工具,對(duì)于類(lèi)似下圖所示的應(yīng)用, 客戶端從Database獲取數(shù)據(jù),然后計(jì)算,最后再把結(jié)果寫(xiě)回Database, 如果數(shù)據(jù)量過(guò)大, 數(shù)據(jù)傳輸?shù)拈_(kāi)銷(xiāo)過(guò)大,并且如果客戶端的內(nèi)存不夠, 計(jì)算分析的過(guò)程也將非常緩慢.
借助于大內(nèi)存的優(yōu)勢(shì), SAP HANA的解決方案是把數(shù)據(jù)敏感的相關(guān)計(jì)算邏輯都移動(dòng)到SAP HANA內(nèi), 從而省去了數(shù)據(jù)傳輸?shù)拈_(kāi)銷(xiāo). 典型的框架如下:

對(duì)于一些簡(jiǎn)單的計(jì)算分析, 可以利用SQLScript腳本完成, SQLScript提供了基本的變量定義語(yǔ)句,流程控制語(yǔ)句. 但是對(duì)于復(fù)雜的分析與計(jì)算, 單純使用SQLScript可能不是特別方便, 比如對(duì)1T的數(shù)據(jù)表作聚類(lèi)分析. 為此 ,SAP HANA提供 AFL (Application Function Library) , 把一些常見(jiàn)的分析任務(wù)用C++實(shí)現(xiàn),作為庫(kù)函數(shù)的形式, 提供給SQLScript調(diào)用,極大地豐富了SQLScript的功能.
2 PAL簡(jiǎn)介
PAL (Predictive Analysis Library)是SAP HANA中AFL (Application Function Library)框架下的一個(gè)函數(shù)庫(kù), 主要用于數(shù)據(jù)預(yù)測(cè)與分析, 提供了很多數(shù)據(jù)挖掘算法的實(shí)現(xiàn). 按應(yīng)用的場(chǎng)景進(jìn)行分類(lèi),PAL函數(shù)主要包括以下類(lèi)別:
? 聚類(lèi)
? 分類(lèi)
? 關(guān)聯(lián)分析
? 時(shí)間序列分析
? 數(shù)據(jù)預(yù)處理
? 統(tǒng)計(jì)分析
? 社會(huì)網(wǎng)絡(luò)分析
具體到每個(gè)類(lèi)別下面, 有常見(jiàn)算法的實(shí)現(xiàn), 比如聚類(lèi)下面的K-means算法.
值得一提的是, AFL是一個(gè)單獨(dú)的包, 需要另外進(jìn)行安裝. 另外AFL 的版本號(hào)需要與SAP HANA的版本號(hào)匹配.
3 基本使用步驟
PAL函數(shù)的使用包括三個(gè)步驟:
(1) 生成AFL_WRAPPER_GENERATOR 與 AFL_WRAPPER_ERASER存儲(chǔ)過(guò)程.
對(duì)于具體的某個(gè)算法, 在使用之前,利用AFL_WRAPPER_GENERATOR生成該算法的一個(gè)包裝器,然后才能進(jìn)行調(diào)用, 可以理解為生成該算法的一個(gè)實(shí)例, AFL_WRAPPER_ERASER作用是刪除這個(gè)算法的實(shí)例.
這兩個(gè)存儲(chǔ)過(guò)程的生成很簡(jiǎn)單. 在AFL插件的目錄下有afl_wrapper_generator.sql, afl_wrapper_eraser.sql兩個(gè)腳文文件, 把它們的內(nèi)容拷貝到SAP HANA Studio的SQL Console,然后執(zhí)行以及即可.然后為用戶分配執(zhí)行權(quán)限:
GRANT EXECUTE ON system.afl_wrapper_generator to USER1;
GRANT EXECUTE ON system.afl_wrapper_eraser to USER1;
這個(gè)步驟只需要首次利用使用AFL時(shí)執(zhí)行一次,對(duì)于后續(xù)其他的算法使用,就不需要執(zhí)行了.
(2) 生成算法的實(shí)例
CALL SYSTEM.AFL_WRAPPER_GENERATOR(
‘<procedure_name>’,
‘<area_name>’,
‘<function_name>’, <signature_table>);
Procedure_name:自定義的名稱(chēng);
Area_name:通常為AFLPAL;
Function_name:算法名稱(chēng);
Signature_table:指定一個(gè)用戶表,作為方法簽名的信息;
(3) 調(diào)用算法實(shí)例
CALL <procedure_name>(
<data_input_table> {,…},
<parameter_table>,
<output_table> {,…}) with overview;
Procedure_name:算法實(shí)例名;
Data_input_table:輸入數(shù)據(jù)表;
Parameter_table:參數(shù)表;
Output_table:輸出表;
4 示例Demo
下面以DBSCAN聚類(lèi)算法來(lái)說(shuō)明說(shuō)PAL算法的調(diào)用過(guò)程.(因測(cè)試機(jī)AFL_WRAPPER_GENERATOR存儲(chǔ)過(guò)程已經(jīng)存在,故第一步不再執(zhí)行,另假定Schema為TEST)
DBSCAN聚類(lèi)算法是一個(gè)基于密度的聚類(lèi)算法,該算法有很好的降噪能力,有關(guān)該算法的更多介紹,請(qǐng)參考http://en./wiki/DBSCAN
/*創(chuàng)建數(shù)據(jù)表類(lèi)型 ,id,屬性1,屬性2 */
CREATE TYPE PAL_DBSCAN_DATA_T AS TABLE ( ID integer, ATTRIB1 double, ATTRIB2
double);
/*創(chuàng)建算法控制參數(shù)類(lèi)型*/
CREATE TYPE PAL_CONTROL_T AS TABLE( NAME varchar(50), INTARGS integer, DOUBLEARGS
double, STRINGARGS varchar(100));
/*結(jié)構(gòu)表類(lèi)型,ID,類(lèi)簇編號(hào)*/
CREATE TYPE PAL_DBSCAN_RESULTS_T AS TABLE( ID integer, RESULT integer);
/*創(chuàng)建方法參數(shù)表*/
CREATE COLUMN TABLE PAL_DBSCAN_PDATA_TBL( "ID" INT, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100) );
/*向參數(shù)表中插入相關(guān)參數(shù)數(shù)據(jù)*/
INSERT INTO PAL_DBSCAN_PDATA_TBL VALUES (1, 'TEST.PAL_DBSCAN_DATA_T', 'in');
INSERT INTO PAL_DBSCAN_PDATA_TBL VALUES (2, 'TEST.PAL_CONTROL_T', 'in');
INSERT INTO PAL_DBSCAN_PDATA_TBL VALUES (3, 'TEST.PAL_DBSCAN_RESULTS_T', 'out');
/*分配權(quán)限*/
GRANT SELECT ON DM_PAL.PAL_DBSCAN_PDATA_TBL to SYSTEM;
/*生成PAL_DBSCAN9算法實(shí)例*/
call SYSTEM.afl_wrapper_eraser('PAL_DBSCAN9');
call SYSTEM.afl_wrapper_generator('PAL_DBSCAN9', 'AFLPAL', 'DBSCAN', PAL_DBSCAN_PDATA_TBL);
/* 創(chuàng)建數(shù)據(jù)表*/
CREATE COLUMN TABLE PAL_DBSCAN_DATA_TBL ( ID integer, ATTRIB1 double, ATTRIB2 double);
/*插入測(cè)試數(shù)據(jù)*/
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(1,0.10,0.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(2,0.11,0.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(3,0.10,0.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(4,0.11,0.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(5,0.12,0.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(6,0.11,0.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(7,0.12,0.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(8,0.12,0.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(9,0.13,0.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(10,0.13,0.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(11,0.13,0.14);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(12,0.14,0.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(13,10.10,10.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(14,10.11,10.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(15,10.10,10.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(16,10.11,10.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(17,10.11,10.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(18,10.12,10.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(19,10.12,10.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(20,10.12,10.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(21,10.13,10.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(22,10.13,10.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(23,10.13,10.14);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(24,10.14,10.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(25,4.10,4.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(26,7.11,7.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(27,-3.10,-3.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(28,16.11,16.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(29,20.11,20.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(30,15.12,15.11);
/*用臨時(shí)表來(lái)存儲(chǔ)算法的輸入?yún)?shù)*/
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_CONTROL_TBL( NAME varchar(50), INTARGS
integer, DOUBLEARGS double, STRINGARGS varchar(100));
/*指定DBSCAN算法的輸入?yún)?shù)*/
/*線程數(shù)18*/
INSERT INTO #PAL_CONTROL_TBL VALUES('THREAD_NUMBER',18,null,null);
/*自動(dòng)確定MINPTS與RADIUS參數(shù)*/
INSERT INTO #PAL_CONTROL_TBL VALUES('AUTO_PARAM',null,null,'true');
/*點(diǎn)與點(diǎn)之間的距離采用Manhattan距離*/
INSERT INTO #PAL_CONTROL_TBL VALUES('DISTANCE_METHOD',1,null,null);
/*結(jié)果表*/
CREATE COLUMN TABLE PAL_DBSCAN_RESULTS_TBL( ID integer, RESULT integer);
/*調(diào)用DBSCAN算法*/
CALL _SYS_AFL.PAL_DBSCAN9(PAL_DBSCAN_DATA_TBL, "#PAL_CONTROL_TBL",
PAL_DBSCAN_RESULTS_TBL) with overview;
/*查看結(jié)果*/
SELECT * FROM PAL_DBSCAN_RESULTS_TBL;
/*創(chuàng)建數(shù)據(jù)表類(lèi)型 ,id,屬性1,屬性2 */
CREATE TYPE PAL_DBSCAN_DATA_T AS TABLE ( ID integer, ATTRIB1 double, ATTRIB2
double);
/*創(chuàng)建算法控制參數(shù)類(lèi)型*/
CREATE TYPE PAL_CONTROL_T AS TABLE( NAME varchar(50), INTARGS integer, DOUBLEARGS
double, STRINGARGS varchar(100));
/*結(jié)構(gòu)表類(lèi)型,ID,類(lèi)簇編號(hào)*/
CREATE TYPE PAL_DBSCAN_RESULTS_T AS TABLE( ID integer, RESULT integer);
/*創(chuàng)建方法參數(shù)表*/
CREATE COLUMN TABLE PAL_DBSCAN_PDATA_TBL( "ID" INT, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100) );
/*向參數(shù)表中插入相關(guān)參數(shù)數(shù)據(jù)*/
INSERT INTO PAL_DBSCAN_PDATA_TBL VALUES (1, 'TEST.PAL_DBSCAN_DATA_T', 'in');
INSERT INTO PAL_DBSCAN_PDATA_TBL VALUES (2, 'TEST.PAL_CONTROL_T', 'in');
INSERT INTO PAL_DBSCAN_PDATA_TBL VALUES (3, 'TEST.PAL_DBSCAN_RESULTS_T', 'out');
/*分配權(quán)限*/
GRANT SELECT ON DM_PAL.PAL_DBSCAN_PDATA_TBL to SYSTEM;
/*生成PAL_DBSCAN9算法實(shí)例*/
call SYSTEM.afl_wrapper_eraser('PAL_DBSCAN9');
call SYSTEM.afl_wrapper_generator('PAL_DBSCAN9', 'AFLPAL', 'DBSCAN', PAL_DBSCAN_PDATA_TBL);
/* 創(chuàng)建數(shù)據(jù)表*/
CREATE COLUMN TABLE PAL_DBSCAN_DATA_TBL ( ID integer, ATTRIB1 double, ATTRIB2 double);
/*插入測(cè)試數(shù)據(jù)*/
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(1,0.10,0.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(2,0.11,0.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(3,0.10,0.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(4,0.11,0.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(5,0.12,0.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(6,0.11,0.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(7,0.12,0.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(8,0.12,0.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(9,0.13,0.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(10,0.13,0.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(11,0.13,0.14);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(12,0.14,0.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(13,10.10,10.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(14,10.11,10.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(15,10.10,10.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(16,10.11,10.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(17,10.11,10.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(18,10.12,10.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(19,10.12,10.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(20,10.12,10.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(21,10.13,10.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(22,10.13,10.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(23,10.13,10.14);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(24,10.14,10.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(25,4.10,4.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(26,7.11,7.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(27,-3.10,-3.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(28,16.11,16.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(29,20.11,20.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(30,15.12,15.11);
/*用臨時(shí)表來(lái)存儲(chǔ)算法的輸入?yún)?shù)*/
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_CONTROL_TBL( NAME varchar(50), INTARGS
integer, DOUBLEARGS double, STRINGARGS varchar(100));
/*指定DBSCAN算法的輸入?yún)?shù)*/
/*線程數(shù)18*/
INSERT INTO #PAL_CONTROL_TBL VALUES('THREAD_NUMBER',18,null,null);
/*自動(dòng)確定MINPTS與RADIUS參數(shù)*/
INSERT INTO #PAL_CONTROL_TBL VALUES('AUTO_PARAM',null,null,'true');
/*點(diǎn)與點(diǎn)之間的距離采用Manhattan距離*/
INSERT INTO #PAL_CONTROL_TBL VALUES('DISTANCE_METHOD',1,null,null);
/*結(jié)果表*/
CREATE COLUMN TABLE PAL_DBSCAN_RESULTS_TBL( ID integer, RESULT integer);
/*調(diào)用DBSCAN算法*/
CALL _SYS_AFL.PAL_DBSCAN9(PAL_DBSCAN_DATA_TBL, "#PAL_CONTROL_TBL",
PAL_DBSCAN_RESULTS_TBL) with overview;
/*查看結(jié)果*/
SELECT * FROM PAL_DBSCAN_RESULTS_TBL;

如果執(zhí)行無(wú)誤,將看到如上圖所示的結(jié)果,記錄被聚成三類(lèi),0,1,-1各代表一個(gè)類(lèi)簇.
5結(jié)束語(yǔ)
本文介紹了SAP HANA中PAL算法的使用,以DBSCAN聚類(lèi)算法作為具體的實(shí)現(xiàn)例子.其他的相關(guān)算法使用流程上與上述流程都相似,主要的工作在于準(zhǔn)備數(shù)據(jù)表,根據(jù)算法的接口文檔定義相關(guān)的參數(shù),并將參數(shù)存入?yún)?shù)表.最后調(diào)用算法實(shí)例即可.
從效率上講,在SAP HANA中使用PAL,一方面利用了大內(nèi)存的優(yōu)勢(shì),另一方面利用了C++作為編譯型語(yǔ)言本身的高效性,如果使用得當(dāng),對(duì)于大數(shù)據(jù)的相關(guān)分析任務(wù),在速度上將會(huì)有一個(gè)很大的飛躍!
[注: 本文的測(cè)試案例所使用的SAP HANA版本為SAP HANA SPS06]
想獲取更多SAP HANA學(xué)習(xí)資料或有任何疑問(wèn),請(qǐng)關(guān)注新浪微博@HANAGeek!我們歡迎你的加入!
|