Snakecy's NOTE



UCI Machine Learning Repository

Posted on 2016-01-15   |   In algorithm

Data set examples for machine learning

Here are data sets from the UCI Machine Learning Repository that I have sorted and practiced with. Some of the abstracts I summarized are posted to my GitHub repository, CUIMachineLearningRepository.

Example:

Image Segmentation

  1. Data set website
    http://archive.ics.uci.edu/ml/datasets/Image+Segmentation
  2. Data set description
    The data used in these experiments were collected by the Vision Group, University of Massachusetts, and were used for image classification. The instances were drawn randomly from a database of 7 outdoor images. The images were hand-segmented to create a classification for every pixel; each instance is a 3x3 region. There are 19 continuous attributes: region-centroid-col, region-centroid-row, region-pixel-count, short-line-density-5, short-line-density-2, vedge-mean, vegde-sd, hedge-mean, hedge-sd, intensity-mean, rawred-mean, rawblue-mean, rawgreen-mean, exred-mean, exblue-mean, exgreen-mean, value-mean, saturatoin-mean, hue-mean. The database contains 2310 samples: 210 for training and 2100 for testing. The data fall into seven classes, as shown in Table 1.
Table 1 Class distribution of the Image Segmentation data
Class | Training | Testing | Total
Brickface | 30 | 300 | 330
Sky | 30 | 300 | 330
Foliage | 30 | 300 | 330
Cement | 30 | 300 | 330
Window | 30 | 300 | 330
Path | 30 | 300 | 330
Grass | 30 | 300 | 330
Total | 210 | 2100 | 2310
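
For a quick start in R, the training and test files can be read straight from the UCI server. This is a minimal sketch, assuming the files are still hosted under machine-learning-databases/image/ and keep their original layout (a few header lines followed by comma-separated records whose first field is the class label); adjust skip if the header differs.

# Minimal sketch for loading the Image Segmentation data in R.
# Assumptions: file locations below and 5 header lines before the records;
# R converts the hyphens in the column names to dots.
base <- "http://archive.ics.uci.edu/ml/machine-learning-databases/image"
cols <- c("class",
          "region-centroid-col", "region-centroid-row", "region-pixel-count",
          "short-line-density-5", "short-line-density-2",
          "vedge-mean", "vegde-sd", "hedge-mean", "hedge-sd",
          "intensity-mean", "rawred-mean", "rawblue-mean", "rawgreen-mean",
          "exred-mean", "exblue-mean", "exgreen-mean",
          "value-mean", "saturatoin-mean", "hue-mean")

train <- read.csv(paste0(base, "/segmentation.data"),
                  skip = 5, header = FALSE, col.names = cols)
test  <- read.csv(paste0(base, "/segmentation.test"),
                  skip = 5, header = FALSE, col.names = cols)

table(train$class)   # expect 30 samples per class (210 in total)
dim(test)            # expect 2100 rows and 20 columns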

Install NoSQL Database

Posted on 2016-01-13   |   In open-source

Redis as a key-value NoSQL database

  • Quick Start Redis

  • Use the key-value DB (Redis)

    sudo apt-get install redis-server
    # test Redis: run redis-cli to enter the client, then
    127.0.0.1:6379> set test 1
    127.0.0.1:6379> get test
  • Or download Redis from the website

    wget http://download.redis.io/releases/redis-3.0.2.tar.gz
  • Or download and build the stable release

    wget http://download.redis.io/redis-stable.tar.gz
    tar xvzf redis-stable.tar.gz
    cd redis-stable
    make
  • Update the redis-server

    • first way

      sudo add-apt-repository ppa:chris-lea/redis-server
      sudo apt-get update
      sudo apt-get install redis-server
    • another way to do this update

      sudo apt-get install -y python-software-properties
      sudo add-apt-repository -y ppa:rwky/redis
      sudo apt-get update
      sudo apt-get install -y redis-server

JSON parsed with Java

Posted on 2016-01-13   |   In open-source

Describes how to parse JSON in Java (under Eclipse), and also in Scala

JSON in Java

Json-lib requires the jars below (versions are not strictly limited):

  • Java-org-json

  • Json-lib

    commons-beanutils-1.7.0.jar  or commons-beanutils-1.9.2.jar
    commons-collections-3.1.jar or commons-collections-3.2.1.jar
    commons-lang-2.5.jar or commons-lang-2.6.jar
    commons-logging-1.1.1.jar or commons-logging-1.2.jar
    ezmorph-1.0.3.jar or ezmorph-1.0.6.jar
    json-lib-2.2.2-jdk15.jar or json-lib-2.4-jdk15.jar
  • Json-smart

    • A comparison of the JSON jar packages
  • simple-json

  • org.json

  • example

    • http://crunchify.com/how-to-write-json-object-to-file-in-java/
    • http://crunchify.com/how-to-read-json-object-from-file-in-java/
// Reference: the JSON format is as follows:
// {"response":{"data":[{"address":"南京市游乐园","province":"江苏","district":"玄武区","city":"南京"}]},"status":"ok"}
// Expected result: 江苏 南京 玄武区 南京市游乐园

// jsonString holds the JSON text shown above
JSONObject dataJson = new JSONObject(jsonString);
JSONObject response = dataJson.getJSONObject("response");
JSONArray data = response.getJSONArray("data");
JSONObject info = data.getJSONObject(0);
String province = info.getString("province");
String city = info.getString("city");
String district = info.getString("district");
String address = info.getString("address");
System.out.println(province + city + district + address);

API configuration with PHP, R and SparkR

Posted on 2016-01-12   |   In cloud-tech

PHP configure


  • How to install nginx on Ubuntu 10.04?
    Append the appropriate stanza to /etc/apt/sources.list. The PGP page explains the signing of the nginx.org release packages.

    • shell script

      deb http://nginx.org/packages/ubuntu/ lucid nginx
      deb-src http://nginx.org/packages/ubuntu/ lucid nginx
    • then do

      sudo -s
      nginx=stable # use nginx=development for latest development version
      add-apt-repository ppa:nginx/$nginx
      apt-get update
      apt-get install nginx
    • Done. Ref.: https://www.nginx.com/resources/wiki/start/topics/tutorials/install/

  • PHP Server Nginx

    • $ cd nginx/html
    • port : 80 –> public port: 8181
    • example-data.cloudapp.net:8181/index.html
  • Docker

    • $ curl -sSL https://get.docker.com/ | sh
    • $ sudo docker run hello-world
    • Add the user to the docker group (with sudo)
      • sudo usermod -aG docker admin
    • Restart the server
      • $ docker run hello-world

R installation


  • R Tutorial
  • Update R on Ubuntu

    • $ sudo gedit /etc/apt/sources.list –> open the .list file
      • sudo vi /etc/apt/sources.list
      • #sudo apt-get install gedit
      • http://cran.r-project.org/bin/linux/ubuntu/
    • add the following line to sources.list
      • deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu/ trusty/
      • trusty for Ubuntu 14.04

    • Secure apt: if you see a GPG key error, run the following lines

      $sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
      $gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
      $gpg -a --export E084DAB9 | sudo apt-key add -
    • Then, perform this command

      • sudo apt-get update
      • sudo apt-get install r-base r-base-dev
  • Import library dplyr & FeatureHashing

    • Requires R version > 3.1.2 (a short usage sketch follows the install commands below)
      $ R
      > install.packages("dplyr")
      > library(dplyr)
      > install.packages("FeatureHashing")
      > library(FeatureHashing)
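
A minimal usage sketch for the two packages (made-up toy data; the column names os, carrier and is_win are illustrative only, and the hash size is deliberately tiny):

# Toy sketch: dplyr filtering plus feature hashing with FeatureHashing.
library(dplyr)
library(FeatureHashing)

df <- data.frame(os      = c("ios", "android", "ios", "other"),
                 carrier = c("cmcc", "unicom", "cmcc", "telecom"),
                 is_win  = c(1, 0, 1, 0))

# keep only labelled rows (dplyr)
df2 <- df %>% filter(is_win %in% c(0, 1))

# hash the categorical features into a sparse design matrix;
# transpose is set explicitly because its default differs between versions
m <- hashed.model.matrix(~ os + carrier, data = df2,
                         hash.size = 2^6, transpose = FALSE)
dim(m)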

R library


  • library

    • Matrix
    • FeatureHashing
      • Important: the latest FeatureHashing version is 0.9, where the default of the transpose parameter is TRUE (in version 0.8 it was FALSE)
    • glmnet
    • dplyr
    • ROC http://web.expasy.org/pROC/ (a quick usage sketch for pROC and jsonlite follows this list)
      • install.packages("pROC")
      • library(pROC)
    • "jsonlite" package, parses JSON files
      • A smarter JSON encoder in R: https://www.opencpu.org/posts/jsonlite-a-smarter-json-encoder/
      • install.packages("jsonlite", repos="http://cran.r-project.org")
      • install.packages("curl")
    • "lubridate", parses dates
    • "json_decode", a function to parse JSON into an array
    • R comparison script (* important)

      df = data.frame(A=c(5,6,7,8), B=c(1,7,5,9))
      with(df,df[A>B,])
    • Censored regression in R

      • Step by step: http://stats.stackexchange.com/questions/149091/censored-regression-in-r
      • Package https://cran.r-project.org/web/packages/AER/
      • How to install packages
        > install.packages("AER", lib = "/my/own/R-packages/")
        > library("AER", lib.loc="/my/own/R-packages/")
        Ref http://www.math.usask.ca/~longhai/software/installrpkg.html
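
A quick usage sketch for two of the packages listed above (assuming pROC and jsonlite are installed; the labels, scores and JSON string are made-up toy data):

# Toy example: AUC with pROC and JSON parsing with jsonlite.
library(pROC)
library(jsonlite)

labels <- c(0, 0, 1, 1, 1, 0, 1, 0)                    # hypothetical binary labels
scores <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5)   # predicted scores

roc_obj <- roc(labels, scores)   # build the ROC curve
auc(roc_obj)                     # area under the curve
plot(roc_obj)                    # plot the curve

# parse a JSON string into R objects
txt <- '{"status":"ok","data":[{"city":"Nanjing","temp":12}]}'
parsed <- fromJSON(txt)
parsed$status      # "ok"
parsed$data$city   # "Nanjing"  (the data array becomes a data frame by default)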

PHP call R API


  • API part

    • PHP API

      • API url http://example-rtb.cloudapp.net:8181/example/api/predict.php
      • PHP calling function methods
      call_user_func()
      call_user_func('a', "111", "222")
      call_user_func_array()
call_user_func_array('a', array("111", "222"))
    • RESTful API based OpenCPU

      • example prediction API
        $ time curl http://localhost:7509/ocpu/library/exampleApi/R/predict_api/json -H "Content-Type:application/json" -d '{"request":["http://*.cloudapp.net:8181/json/req.txt"]}'
    • Install opencpu on Ubuntu cloud server

      • Recommended on Ubuntu 14.04
      #requires ubuntu 14.04 (trusty)
sudo add-apt-repository -y ppa:opencpu/opencpu-1.5
      sudo apt-get update
      sudo apt-get upgrade
      #install opencpu server
      sudo apt-get install -y opencpu
      # optional
sudo apt-get install -y rstudio-server
  • R command

    • Local

      ./bin/sparkR --packages com.databricks:spark-csv_2.11:1.2.0
    • Standalone

      ./bin/sparkR --master spark://example-data01:7077 --packages com.databricks:spark-csv_2.11:1.2.0
  • SparkR MLlib

    • An example for SparkR MLlib
      • Ref https://github.com/AlbanPhelip/SparkR-example
    • Only glm() can be used in SparkR's MLlib (see the sketch after this list)
      ./bin/sparkR --master spark://example-data01:7077 --packages com.databricks:spark-csv_2.11:1.2.0 /home/admin/Rscript_model_test/train_sparkr.R hdfs://masters/Rscript_model/data.txt hdfs://masters/Rscript_model/model.Rds
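
A minimal sketch of what such a SparkR training script might contain (SparkR 1.5-style API with the spark-csv package; is_win, bidfloor, w and h are columns taken from the data set used here, and the paths are placeholders rather than the actual train_sparkr.R script):

# Hypothetical SparkR glm() training sketch, run via ./bin/sparkR ... script.R
library(SparkR)

sc <- sparkR.init(appName = "glm-example",
                  sparkPackages = "com.databricks:spark-csv_2.11:1.2.0")
sqlContext <- sparkRSQL.init(sc)

# placeholder HDFS path; tab-delimited file with a header row
df <- read.df(sqlContext, "hdfs://masters/Rscript_model/data.txt",
              source = "com.databricks.spark.csv",
              header = "true", delimiter = "\t")

# binomial GLM: is_win against a few illustrative features
model <- glm(is_win ~ bidfloor + w + h, data = df, family = "binomial")
summary(model)

preds <- predict(model, newData = df)   # output is expected to carry a "prediction" column
head(select(preds, "prediction"))

sparkR.stop()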

R on SparkR tutorials for Big Data analysis and Machine Learning as IPython/Jupyter notebooks

https://github.com/jadianes/spark-r-notebooks


A good example for R app


http://blog.fens.me/r-app-china-weather/

Random forest in R with parallel training:

# parallel training of a random forest with foreach/doParallel
library(randomForest)
library(foreach)      # provides %dopar%
library(doParallel)   # provides registerDoParallel and loads parallel (makeCluster)
cl <- makeCluster(4)
registerDoParallel(cl)
rf <- foreach(ntree = rep(25000, 4),
              .combine = combine,
              .packages = 'randomForest') %dopar%
  randomForest(Species ~ ., data = iris, ntree = ntree)
stopCluster(cl)

SparkR configure


SparkR is an R package that provides a light-weight frontend to use Apache Spark from R

  • Ref https://github.com/amplab-extras/SparkR-pkg | http://amplab-extras.github.io/SparkR-pkg/
  • Configure the sparkR environment

    • Required
      • openjdk 7, R
    • Steps for set up
      sudo R
      install.packages("rJava")
      install.packages("devtools", dependencies = TRUE)
      # after install rJava & devtools
      library(devtools)
      install_github("amplab-extras/SparkR-pkg", subdir="pkg")
# if rJava fails to build, resolve with:
# sudo apt-get install r-cran-rjava
  • Required system packages

    sudo apt-get install libxml2-dev
    sudo apt-get install libcurl4-openssl-dev
    sudo apt-get install libcurl4-gnutls-dev
    sudo apt-get install curl
# add to /etc/apt/sources.list if needed:
# deb http://cran.rstudio.com/bin/linux/ubuntu trusty/
    sudo apt-get install libssl-dev
  • Start SparkR and load data
./bin/sparkR --packages com.databricks:spark-csv_2.11:1.2.0
sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.2.0")
sc <- sparkR.init(master="spark://example-data01:7077", sparkEnvir=list(spark.executor.memory="10g", spark.cores.max="4"), sparkPackages="com.databricks:spark-csv_2.11:1.2.0" )
sqlContext <- sparkRSQL.init(sc)
dataT <- read.df(sqlContext, "hdfs://masters/Rscript_model/data.txt","com.databricks.spark.csv",header="true",delimiter="\t")
# on example-data01, reading a local file instead:
dataT <- read.df(sqlContext, "/home/admin/data/data.txt","com.databricks.spark.csv",header="true",delimiter="\t", transpose = "true", is.dgCMatrix = "false")
# Ref script:
dataT <- read.df(sqlContext, "/home/admin/data/data.txt","com.databricks.spark.csv",header="true",delimiter="\t")
head(select(dataT,dataT$is_win))
test <- structure(dataT, package="SparkR")
dataT <- read.table(sqlContext, "hdfs://masters/Rscript_model/data.txt","com.databricks.spark.csv",header="true",delimiter="\t")
test <- cbind(select(dataT, "is_win"), select(dataT, "days"), select(dataT, "hours"), select(dataT, "exchange_id"), select(dataT, "app_id"), select(dataT, "publiser_id"), select(dataT, "bidfloor"), select(dataT, "w"), select(dataT, "h"), select(dataT, "os"), select(dataT, "Osv"), select(dataT, "model"), select(dataT, "connectiontype"), select(dataT,"country"), select(dataT, "ua"), select(dataT,"carrier"), select(dataT, "js"), select(dataT, "user"), select(dataT, "carriername"), select(dataT, "app_cat"), select(dataT,"btype"), select(dataT,"mimes"), select(dataT,"badv"), select(dataT,"bcat"))
as.data.frame(test)

Output Format using R


  • pdf format

    pdf(file="myplot.pdf")
    # plotting commands go here, e.g. plot(1:10)
    dev.off()
  • jpeg format

    setwd("path")
    jpeg(file="myplot.jpeg")
    plot(1:10)
    dev.off()
  • png format

    png(file="myplot.png", bg="transparent")
    # plotting commands go here, then close the device
    dev.off()
    #View the png on Ubuntu, by using "gthumb"
    # sudo apt-get install gthumb
    # $gthumb myplot.png
  • bmp format

    bmp("myplot.bmp")
  • PostScript format

    postscript("myplot.ps")
  • Windows image file format

    win.metafile("myplot.wmf")

HDFS configure with Zookeeper (02)

Posted on 2016-01-12   |   In cloud-tech

Constructing the platform for a Big Data project

Configure Zookeeper

  • On each server, configure the Java environment, then set up ZooKeeper:
    $ tar -zxvf zookeeper-3.4.6.tar.gz
    $ mv zookeeper-3.4.6 zookeeper
    $ cd zookeeper/conf
    $ cp zoo_sample.cfg zoo.cfg
    $ vi zoo.cfg
Server role | example-data01 (namenode1) | example-data02 (namenode2) | example-data03 (datanode1) | example-data04 (datanode2)
NameNode | YES | YES | NO | NO
DataNode | NO | NO | YES | YES
JournalNode | YES | YES | YES | NO
ZooKeeper | YES | YES | YES | NO
ZKFC | YES | YES | NO | NO
  • 8080 => example-data01:50070
  • 8000 => example-data02:50070

Hadoop configure

  • hosts

    *.*.*.*  example-data01
    *.*.*.* example-data02
    *.*.*.* example-data03
    *.*.*.* example-data04
  • $ vi ~/.bashrc

    # vim ~/.bashrc
    export ZOO_HOME=/home/admin/zookeeper
    export ZOO_LOG_DIR=/home/admin/zookeeper/logs
    export PATH=$PATH:$ZOO_HOME/bin
    # source ~/.bashrc
  • $ mkdir -p /data/hadoop/zookeeper/{data,logs}

  • $ cp ~/zookeeper/conf/zoo_sample.cfg ~/zookeeper/conf/zoo.cfg
  • $ vim zoo.cfg

    # vim /home/admin/zookeeper/conf/zoo.cfg
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/home/admin/zookeeper/data
    clientPort=2181
    server.1=example-data01:2888:3888
    server.2=example-data02:2888:3888
    server.3=example-data03:2888:3888
  • Create a myid file in the ZooKeeper data directory -> echo 1 > myid

  • $ scp -r zookeeper/* admin@example-data02:~ -> echo 2 > myid
  • $ scp -r zookeeper/* admin@example-data03:~ -> echo 3 > myid
  • $ cp ~/zookeeper to each server ( example-data02, example-data03)
  • on example-data01
    • check status by using
    • $ ./bin/zkCli.sh -server example-data01:2181
    • or $ ./bin/zkCli.sh -server example-data02:2181
    • exit by using “quit”
  • on each node to perform
    • ./bin/zkServer.sh start
    • ./bin/zkServer.sh status

Configure

- example-data01(Namenode) example-data02(Sec-namenode) example-data03(datanode)  example-data04 (datanode)

On example-data01

  • Step 1:

    • vim ~/.bashrc (then copy ~/.bashrc to example-data02 (namenode) and example-data03 (datanode))
# java configure
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib

# hadoop configure
export HADOOP_HOME=/home/admin/hadoop
export PATH=$HADOOP_HOME/bin:$PATH

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

#zookeeper configure

export ZOO_HOME=/home/admin/zookeeper
export PATH=$PATH:$ZOO_HOME/bin
export ZOO_LOG_DIR=/home/admin/zookeeper/logs

#
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_PID_DIR=/home/admin/hadoop/pids
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME

export PATH=$HADOOP_HOME/sbin:$PATH
  • Step 2:
$ mkdir -p /home/admin/hadoop/{pids,storage}
$ mkdir -p /home/admin/hadoop/storage/{hdfs,tmp,journal}
$ mkdir -p /home/admin/hadoop/storage/hdfs/{name,data}
# on example-data03 & example-data04:
$ mkdir -p /datadriver/hdfs/data
$ sudo chown admin /datadriver/hdfs/data/
  • Step 3:

configure the core-site.xml, hadoop-env.sh, hdfs-site.xml

  • configure core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://masters</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/admin/hadoop/storage/tmp</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>example-data01:2181,example-data02:2181,example-data03:2181</value>
</property>
<property>
<name>hadoop.native.lib</name>
<value>true</value>
</property>
</configuration>
  • configure hdfs-site.xml
<configuration>
<property>
<name>dfs.nameservices</name>
<value>masters</value>
</property>
<property>
<name>dfs.ha.namenodes.masters</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/admin/hadoop/storage/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<!-- <value>file:/home/admin/hadoop/storage/hdfs/data</value> -->
<value>file:///datadriver/hdfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>

<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.namenode.rpc-address.masters.nn1</name>
<value>example-data01:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.masters.nn1</name>
<value>example-data01:50070</value>
</property>
<property>
<name>dfs.namenode.rpc-address.masters.nn2</name>
<value>example-data02:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.masters.nn2</name>
<value>example-data02:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://example-data01:8485;example-data02:8485;example-data03:8485/masters</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/home/admin/hadoop/storage/journal</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence(hdfs)
shell(/bin/true)</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/admin/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.masters</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
</configuration>
  • configure hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
  • configure slaves
example-data03
example-data04
  • Step 4:

scp hadoop to each node ( example-data02, example-data03, example-data04)

  • Step 5:

Set up the HDFS cluster with ZooKeeper (demo example)

  • on the namenode1 (example-data01)

    $ hdfs zkfc -formatZK
  • on each zookeeper node

    $ hadoop-daemon.sh start journalnode
    # or: $ ./sbin/hadoop-daemons.sh --hostnames 'example-data01 example-data02 example-data03' start journalnode  # starts the JournalNodes on the listed hosts
  • on the namenode1 (example-data01)

    $ hdfs namenode -format
  • on the namenode1 (example-data01)

    $ ./sbin/hadoop-daemon.sh start namenode
  • on namenode2 (example-data02)

    $ hdfs namenode -bootstrapStandby
  • on namenode2 (example-data02)

    $ ./sbin/hadoop-daemon.sh start namenode
  • on the two namenode (namenode1, namenode2)

    $ ./sbin/hadoop-daemon.sh start zkfc
  • on the all datanode (datanode1, datanode2)

    $ ./sbin/hadoop-daemon.sh start datanode
  • At this point, the following processes should be running on namenode1 (example-data01):

    QuorumPeerMain, NameNode, DFSZKFailoverController, JournalNode

  • Step 6: Testing HDFS functionality

    #  login on web
    example-data01: namenode (active) => example-data01:9000
    website: example-hadoop.cloudapp.net:8080
    example-data02: namenode (standby) => example-data02:9000
    website: example-hadoop.cloudapp.net:8000
    # $ hadoop-daemon.sh start namenode
  • Step 7: Important, execute on namenode1 (example-data01)

    • Before starting HDFS, format ZooKeeper:

      $ hdfs zkfc -formatZK
      Start the HDFS --> $ cd /home/admin/hadoop && sbin/start-dfs.sh
      Stop the HDFS --> $ cd /home/admin/hadoop && sbin/stop-dfs.sh
      $ start-dfs.sh
    • check the status on two namenode

      $ hdfs haadmin -getServiceState nn1 --> active
      $ hdfs haadmin -getServiceState nn2 --> standby
    • if stop-dfs.sh fails with an error

      $ vi ~/.bashrc
      export HADOOP_PID_DIR=/home/hadoop/pids
  • hdfs dfsadmin -report

Spark on Yarn Reference

Configure two additional files: yarn-site.xml & mapred-site.xml
