Resolving the word2vec Installation Error

First attempt at installing word2vec:

root@dev:/# pip install word2vec
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting word2vec
 Downloading https://mirrors.aliyun.com/pypi/packages/ce/51/5e2782b204015c8aef0ac830297c2f2735143ec90f592b9b3b909bb89757/word2vec-0.10.2.tar.gz (60kB)
 100% |████████████████████████████████| 61kB 1.1MB/s
 Complete output from command python setup.py egg_info:
 Traceback (most recent call last):
 File "", line 1, in 
 File "/tmp/pip-install-rj7udqcw/word2vec/setup.py", line 4, in 
 from Cython.Build import cythonize
 ModuleNotFoundError: No module named 'Cython'
----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-rj7udqcw/word2vec/

The fix: install Cython first, then install word2vec again.

root@dev:/# pip install Cython
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting Cython
 Downloading https://mirrors.aliyun.com/pypi/packages/e1/fd/711507fa396064bf716493861d6955af45369d2c470548e34af20b79d4d4/Cython-0.29.6-cp36-cp36m-manylinux1_x86_64.whl (2.1MB)
 100% |████████████████████████████████| 2.1MB 58.3MB/s
Installing collected packages: Cython
Successfully installed Cython-0.29.6
root@dev:/# pip install word2vec
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting word2vec
 Downloading https://mirrors.aliyun.com/pypi/packages/ce/51/5e2782b204015c8aef0ac830297c2f2735143ec90f592b9b3b909bb89757/word2vec-0.10.2.tar.gz (60kB)
 100% |████████████████████████████████| 61kB 1.4MB/s
Requirement already satisfied: cython in /usr/local/lib/python3.6/dist-packages (from word2vec) (0.29.6)
Requirement already satisfied: numpy>=1.9.2 in /usr/local/lib/python3.6/dist-packages (from word2vec) (1.15.4)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from word2vec) (0.20.1)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from word2vec) (1.1.0)
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from word2vec) (1.11.0)
Building wheels for collected packages: word2vec
 Building wheel for word2vec (setup.py) ... done
 Stored in directory: /root/.cache/pip/wheels/6c/41/28/8a47f03d8b1387e2360e13f9719847eb545d0daa5f65d44ef3
Successfully built word2vec
Installing collected packages: word2vec
Successfully installed word2vec-0.10.2
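
In short, the build failed only because Cython was missing from the build environment. The whole fix condenses to the commands below (the import check at the end is an optional sanity test, not part of the original log):

root@dev:/# pip install Cython && pip install word2vec
root@dev:/# python -c "import word2vec"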

Add Hosts for Ansible

System Env

[root@ansible ~]# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
[root@ansible ~]# lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.6.1810 (Core)
Release: 7.6.1810
Codename: Core
[root@ansible ~]# ansible --version
ansible 2.7.7
 config file = None
 configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
 ansible python module location = /usr/local/lib/python3.6/site-packages/ansible
 executable location = /usr/local/bin/ansible
 python version = 3.6.6 (default, Jan 26 2019, 16:53:05) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

Generate the Ansible Server's SSH key pair

[root@ansible ~]# ssh-keygen

Deploy Ansible Server's public key

  • Deploy public key to s1
    [root@ansible ~]# ssh-copy-id -i ~/.ssh/id_rsa.pub root@s1
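  • Verify passwordless login (an optional extra check, not in the original write-up; it assumes s1 resolves and root login over SSH is permitted):
    [root@ansible ~]# ssh root@s1 hostname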
    

Add Hosts for Ansible

[root@ansible ~]# mkdir /etc/ansible
[root@ansible ~]# vim /etc/ansible/hosts
[root@ansible ~]# cat /etc/ansible/hosts
[servers]
s1

Run a test command

[root@ansible ~]# ansible s1 -m ping
s1 | SUCCESS => {
 "changed": false,
 "ping": "pong"
}
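
As an extra illustration (not part of the original session), other ad-hoc modules can be run against the same inventory group, for example:

[root@ansible ~]# ansible servers -m command -a 'uptime'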

How To Install Ansible

System Env

  • OS info
    [root@ansible /]# cat /etc/redhat-release
    CentOS Linux release 7.6.1810 (Core)
    [root@ansible /]# lsb_release -a
    LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
    Distributor ID: CentOS
    Description: CentOS Linux release 7.6.1810 (Core)
    Release: 7.6.1810
    Codename: Core
    

Install Python36

  • upgrade system pkgs
    [root@ansible /]# yum clean all && yum makecache
    [root@ansible /]# yum upgrade -y
    
  • install python36
    [root@ansible /]# yum install epel-release -y
    [root@ansible /]# yum clean all && yum makecache
    [root@ansible /]# yum install python36 python36-devel python36-pip -y
    
  • set pip3.6 as the default pip
    [root@ansible /]# pip3.6 install --upgrade pip
    

Install Ansible

[root@ansible /]# pip install ansible
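
To confirm the installation, an optional check (the expected output should resemble the "ansible --version" block shown in the Add Hosts for Ansible post above):

[root@ansible /]# python3.6 --version
[root@ansible /]# ansible --version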

Run it in Docker

  • write a Dockerfile
    [root@DockerServer /]# cat Dockerfile
    FROM centos:latest
    MAINTAINER Kyle Chen
    ENV LANG C.UTF-8
    ENV DEBIAN_FRONTEND=noninteractive
    RUN yum clean all && \
    yum makecache && \
    yum upgrade -y && \
    yum install vim openssh-server epel-release -y && \
    yum clean all && \
    yum makecache && \
    yum install python36 python36-devel python36-pip -y && \
    echo "set -o vi" >> /etc/bashrc && \
    pip3.6 install --upgrade pip && \
    pip install ansible && \
    echo "PASSWORD" | passwd --stdin USER && \
    systemctl enable sshd
    
  • build an image
    [root@DockerServer /]# docker build . -t ansible
    
  • run image
    [root@DockerServer /]# docker run --privileged -v /sys/fs/cgroup:/sys/fs/cgroup:ro --ip IP --dns DNS --name ansible --hostname ansible -tdi ansible:latest /usr/sbin/init
    
  • ATTENTION:
    You must replace PASSWORD, USER, IP, and DNS above with your own values before running the image. Once the container is running, you can log in with 'ssh USER@IP' and the password you set.
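
For example, with purely illustrative values substituted in (the --ip flag only takes effect on a user-defined Docker network, and the ssh step assumes USER was set to root):

[root@DockerServer /]# docker run --privileged -v /sys/fs/cgroup:/sys/fs/cgroup:ro --ip 172.18.0.10 --dns 8.8.8.8 --name ansible --hostname ansible -tdi ansible:latest /usr/sbin/init
[root@DockerServer /]# ssh root@172.18.0.10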

First Experience with Tensorflow-lite Android Object Detection

Before We Start

  • This article walks you step by step through getting the Tensorflow-lite Android demo running on your own machine; hopefully it helps those who need to ship it in a native Android app.

Preparation

  • Download Android Studio (I simply grabbed the latest version)

Download link: https://developer.android.com/studio/

  • Clone Tensorflow from GitHub
➜ sample ✗ git clone https://github.com/tensorflow/tensorflow.git
Cloning into 'tensorflow'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 479815 (delta 0), reused 6 (delta 0), pack-reused 479808
Receiving objects: 100% (479815/479815), 282.03 MiB | 7.41 MiB/s, done.
Resolving deltas: 100% (385210/385210), done.
Checking out files: 100% (13949/13949), done.

Import the Project and Get Started!

  • Open Android Studio:
  • Don't import the project yet; first click Configure->SDK Manager:
  • Here you need to install the SDK and NDK (selecting Android 9.0 and the NDK is enough; some of the packages shown were already installed on my machine, so just tick them and click OK):
  • Once the SDK and NDK are installed, click Open an existing Android Studio Project:
  • Select the tensorflow/tensorflow/lite/java/demo directory inside the project we just cloned:
  • Then wait a while; importing the project automatically downloads some dependencies. After a moment a prompt appears in the lower-right corner; click Add root:
  • If it stays stuck on this screen, click the small X on the right to cancel the current process, then open Preferences->Appearance & Behavior->System Settings->HTTP Proxy and check your proxy configuration:
  • If you hit a "no toolchains found" error, download the NDK manually and replace the original NDK directory:
    The error:
    Gradle sync failed: No toolchains found in the NDK toolchains folder for ABI with prefix: mips64el-linux-android
     Consult IDE log for more details (Help | Show Log) (1 m 7 s 18 ms)
    

Solution:

  1. Download the NDK for your platform from the official site (I grabbed the latest macOS version): https://developer.android.com/ndk/downloads/?hl=zh-cn
  2. Unzip the archive and copy it to the target directory (macOS is shown here; the idea is the same on Windows and Linux, so please google the details for your platform)
    ➜ ndk ✗ unzip android-ndk-r16b-darwin-x86_64.zip
    ➜ ndk ✗ rm -rvf ~/Library/Android/sdk/ndk-bundle
    ➜ ndk ✗ cp -rvf android-ndk-r16b ~/Library/Android/sdk/ndk-bundle
    

After switching to the manually downloaded NDK, click the sync button in the upper-right corner:

  • After another wait, a dialog pops up asking you to update the Gradle Plugin; click Update:
  • A little later, the lower-right corner reports the following error:
Gradle sync failed: Could not find method jackOptions() for arguments [build_c83lec3ciu5xopt1opuwlj8q5$_run_closure1$_closure6$_closure11@1b729409] on DefaultConfig_Decorated{name=main, dimension=null, minSdkVersion=DefaultApiVersion{mApiLevel=21, mCodename='null'}, targetSdkVersion=DefaultApiVersion{mApiLevel=26, mCodename='null'}, renderscriptTargetApi=null, renderscriptSupportModeEnabled=null, renderscriptSupportModeBlasEnabled=null, renderscriptNdkModeEnabled=null, versionCode=1, versionName=1.0, applicationId=android.example.com.tflitecamerademo, testApplicationId=null, testInstrumentationRunner=null, testInstrumentationRunnerArguments={}, testHandleProfiling=null, testFunctionalTest=null, signingConfig=null, resConfig=null, mBuildConfigFields={}, mResValues={}, mProguardFiles=[], mConsumerProguardFiles=[], mManifestPlaceholders={}, mWearAppUnbundled=null} of type com.android.build.gradle.internal.dsl.DefaultConfig.
 Consult IDE log for more details (Help | Show Log) (40 s 747 ms)

Don't panic; open build.gradle (Module: app) and comment out that block (note the circled part in the screenshot):

Then click the small sync button again:

  • After another short wait, two prompts appear; take them one at a time and click the ticked items:
  • Then we catch another error:
Could not find com.android.tools.build:aapt2:3.2.1-4818971.
Searched in the following locations:
 file:/Users/Kyle/Library/Android/sdk/extras/m2repository/com/android/tools/build/aapt2/3.2.1-4818971/aapt2-3.2.1-4818971.pom
 file:/Users/Kyle/Library/Android/sdk/extras/m2repository/com/android/tools/build/aapt2/3.2.1-4818971/aapt2-3.2.1-4818971-osx.jar
 file:/Users/Kyle/Library/Android/sdk/extras/google/m2repository/com/android/tools/build/aapt2/3.2.1-4818971/aapt2-3.2.1-4818971.pom
 file:/Users/Kyle/Library/Android/sdk/extras/google/m2repository/com/android/tools/build/aapt2/3.2.1-4818971/aapt2-3.2.1-4818971-osx.jar
 file:/Users/Kyle/Library/Android/sdk/extras/android/m2repository/com/android/tools/build/aapt2/3.2.1-4818971/aapt2-3.2.1-4818971.pom
 file:/Users/Kyle/Library/Android/sdk/extras/android/m2repository/com/android/tools/build/aapt2/3.2.1-4818971/aapt2-3.2.1-4818971-osx.jar
 https://jcenter.bintray.com/com/android/tools/build/aapt2/3.2.1-4818971/aapt2-3.2.1-4818971.pom
 https://jcenter.bintray.com/com/android/tools/build/aapt2/3.2.1-4818971/aapt2-3.2.1-4818971-osx.jar
 https://google.bintray.com/tensorflow/com/android/tools/build/aapt2/3.2.1-4818971/aapt2-3.2.1-4818971.pom
 https://google.bintray.com/tensorflow/com/android/tools/build/aapt2/3.2.1-4818971/aapt2-3.2.1-4818971-osx.jar
Required by:
 project :app

Don't panic, something is simply missing. Calmly open build.gradle (Project: demo) and add the line the arrow points at (note the two highlighted places: one is the file name, the other is the added line):

Then click the small sync button again:

  • After another short wait, only one warning is left:
  • Since it is just a warning, the code can still run, so let's keep going and see whether we can build an APK to install and test on a phone:
  • When the "build succeeded" notification appears, click the highlighted link and the folder containing the APK opens automatically:
  • Then copy the APK to the phone and install it (or install it over USB with adb, as sketched below):
  • Done, it runs:
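
If the phone is connected over USB with debugging enabled, the APK can also be installed from the terminal; adb ships with the Android SDK platform-tools, and the path below is only an example of where Gradle typically places the debug build:

➜ demo ✗ adb install app/build/outputs/apk/debug/app-debug.apk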

Wrapping Up

  • Overall there are still pitfalls, but with some patience you can climb out of them. This is only the beginning; there is a long way to go before running models directly on Android. Let's keep at it.

Close Reading of the Watermelon Book (Chapter 9: Clustering) - Prototype Clustering

Before We Start

  • Yesterday we studied distance calculation in clustering; today we continue with prototype clustering.

Prototype Clustering

  • Prototype clustering is also known as 'prototype-based clustering'. This family of algorithms assumes that the cluster structure can be characterized by a set of prototypes, and it is extremely common in real-world clustering tasks. Typically the algorithm first initializes the prototypes and then refines them iteratively. Different prototype representations and different update procedures lead to different algorithms.

The k-Means Algorithm

  • Given a sample set D = {x_1, x_2, ..., x_m}, the 'k-means' algorithm minimizes the squared error of the resulting cluster partition C = {C_1, C_2, ..., C_k} (see the reconstructed objective below).
  • Here mu_i denotes the mean vector of cluster C_i. Intuitively, the smaller the squared error E, the more similar the samples within each cluster.
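
The formula images are missing here; the standard k-means objective the text refers to is

    E = \sum_{i=1}^{k} \sum_{\boldsymbol{x} \in C_i} \lVert \boldsymbol{x} - \boldsymbol{\mu}_i \rVert_2^2,
    \qquad
    \boldsymbol{\mu}_i = \frac{1}{|C_i|} \sum_{\boldsymbol{x} \in C_i} \boldsymbol{x}.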

Learning Vector Quantization

  • Similar to k-means, 'Learning Vector Quantization' (LVQ) also tries to find a set of prototype vectors that characterize the cluster structure. Unlike general clustering algorithms, however, LVQ assumes that the samples carry class labels, and the learning process uses this supervision to assist the clustering.
  • Given a sample set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, where each sample x_j is a feature vector described by n attributes and y_j is the class label of x_j, the goal of LVQ is to learn a set of n-dimensional prototype vectors {p_1, p_2, ..., p_q}, each representing one cluster, with cluster label t_i.
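
For completeness (the formula images are missing), the standard LVQ prototype update moves the prototype p_{i*} nearest to a sample toward it if their labels match and away from it otherwise, with learning rate \eta \in (0, 1):

    p' = p_{i*} + \eta \,(\boldsymbol{x}_j - p_{i*}) \quad \text{if } y_j = t_{i*},
    \qquad
    p' = p_{i*} - \eta \,(\boldsymbol{x}_j - p_{i*}) \quad \text{otherwise}.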

Gaussian Mixture Clustering

  • Unlike k-means and LVQ, which characterize the cluster structure with prototype vectors, Mixture-of-Gaussian clustering expresses the cluster prototypes with a probabilistic model.
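
Concretely (reconstructing the standard definition used in the book), a Gaussian mixture distribution with k components has density

    p_{\mathcal{M}}(\boldsymbol{x}) = \sum_{i=1}^{k} \alpha_i \, p(\boldsymbol{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i),
    \qquad \alpha_i > 0, \quad \sum_{i=1}^{k} \alpha_i = 1,

where each component is a Gaussian with mean \boldsymbol{\mu}_i and covariance \boldsymbol{\Sigma}_i, and \alpha_i is its mixing coefficient.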

Wrapping Up

  • Today we studied prototype clustering; tomorrow we will continue with density-based clustering.

Close Reading of the Watermelon Book (Chapter 9: Clustering) - Performance Measures

Before We Start

  • Yesterday we studied the clustering task; today we continue with performance measures for clustering.

Performance Measures

  • Clustering performance measures are also called clustering 'validity indices'. As with performance measures in supervised learning, we need some measure to evaluate how good a clustering result is; moreover, once we know which measure will ultimately be used, we can take it directly as the optimization objective of the clustering process and thereby obtain results that better meet the requirements.
  • Clustering partitions the sample set D into a number of mutually disjoint subsets, the sample clusters. So what makes a clustering result good? Intuitively, we want 'like to cluster with like': samples in the same cluster should be as similar to each other as possible, while samples in different clusters should be as different as possible. In other words, a good result has high 'intra-cluster similarity' and low 'inter-cluster similarity'.
  • Clustering performance measures fall roughly into two classes. One class compares the clustering result against some 'reference model' and is called an 'external index'; the other examines the clustering result directly without any reference model and is called an 'internal index'.
  • For a data set D = {x_1, x_2, ..., x_m}, suppose clustering yields the partition C = {C_1, C_2, ..., C_k} and a reference model gives the partition C* = {C*_1, C*_2, ..., C*_s}. Let lambda and lambda* denote the cluster label vectors corresponding to C and C*, respectively. Considering samples in pairs, we define the pair counts a, b, c, d (see the reconstruction below).
  • Here the set SS (counted by a) contains the sample pairs that belong to the same cluster in C and also to the same cluster in C*, while the set SD (counted by b) contains the pairs that belong to the same cluster in C but to different clusters in C*; DS and DD are defined analogously. Since each sample pair (x_i, x_j), i < j, can appear in only one of these sets, a + b + c + d = m(m-1)/2 holds. From these counts we obtain the external indices.
  • Clearly, the values of these performance measures all lie in the interval [0, 1], and larger is better. Next, considering only the cluster partition C = {C_1, C_2, ..., C_k} produced by the clustering, we define within-cluster and between-cluster distances.
  • Here dist(., .) computes the distance between two samples and mu denotes the centroid of a cluster. Clearly, avg(C) is the average distance between samples within cluster C, diam(C) is the largest distance between samples within cluster C, d_min(C_i, C_j) is the distance between the closest samples of clusters C_i and C_j, and d_cen(C_i, C_j) is the distance between the centroids of C_i and C_j. From these we obtain the internal indices (both families of indices are reconstructed below).
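
The formula images are missing from this post; the standard definitions the text refers to can be reconstructed as follows. The external indices built from the pair counts are

    \mathrm{JC} = \frac{a}{a + b + c}, \qquad
    \mathrm{FMI} = \sqrt{\frac{a}{a + b} \cdot \frac{a}{a + c}}, \qquad
    \mathrm{RI} = \frac{2(a + d)}{m(m - 1)},

i.e. the Jaccard Coefficient, the Fowlkes-Mallows Index, and the Rand Index. The internal indices built from avg, diam, d_min, and d_cen are

    \mathrm{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{\mathrm{avg}(C_i) + \mathrm{avg}(C_j)}{d_{\mathrm{cen}}(\boldsymbol{\mu}_i, \boldsymbol{\mu}_j)} \right), \qquad
    \mathrm{DI} = \min_{1 \le i \le k} \; \min_{j \neq i} \left( \frac{d_{\min}(C_i, C_j)}{\max_{1 \le l \le k} \mathrm{diam}(C_l)} \right),

i.e. the Davies-Bouldin Index (smaller is better) and the Dunn Index (larger is better).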

Wrapping Up

  • Today we studied performance measures for clustering; tomorrow we will continue with distance calculation.

Close Reading of the Watermelon Book (Chapter 9: Clustering) - The Clustering Task

Before We Start

  • Yesterday we studied diversity in ensemble learning; today we move on to the clustering task.

The Clustering Task

  • In 'unsupervised learning', the label information of the training samples is unknown; the goal is to reveal the intrinsic properties and regularities of the data by learning from unlabeled samples, providing a basis for further data analysis. Among such learning tasks, the most studied and most widely applied is 'clustering'.
  • Clustering tries to partition the samples of a data set into a number of usually disjoint subsets, each called a 'cluster'. Through such a partition, each cluster may correspond to some underlying concept (category). Note that these concepts are unknown to the clustering algorithm beforehand: the clustering process can only form the cluster structure automatically, and the semantics of each cluster must be grasped and named by the user.
  • Formally, suppose the sample set D contains m unlabeled samples, where each sample x_i is an n-dimensional feature vector; a clustering algorithm then partitions D into k disjoint clusters C_1, ..., C_k. Correspondingly, we use lambda_j to denote the 'cluster label' of sample x_j, i.e. x_j belongs to C_{lambda_j}. The clustering result can thus be represented by a cluster label vector lambda with m elements (see the reconstructed notation below).
  • Clustering can serve either as a standalone process, used to discover the intrinsic distribution structure of data, or as a preprocessing step for other learning tasks such as classification. For example, in some business applications one needs to determine the type of new users, but it may be hard for the business to define 'user types' in advance; in that case one can first cluster the user data, define each resulting cluster as a class, and then train a classifier on these classes to determine the type of new users.
  • Based on different learning strategies, people have designed many kinds of clustering algorithms.
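
The formal notation referred to above, reconstructed from the book's standard setup (the original formula images are missing):

    D = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_m\}, \qquad
    \boldsymbol{x}_i = (x_{i1}; x_{i2}; \ldots; x_{in}),

    \{C_l \mid l = 1, 2, \ldots, k\}, \quad
    C_{l'} \cap_{l' \neq l} C_l = \varnothing, \quad
    D = \bigcup_{l=1}^{k} C_l,

    \lambda_j \in \{1, 2, \ldots, k\}, \quad \boldsymbol{x}_j \in C_{\lambda_j}, \qquad
    \boldsymbol{\lambda} = (\lambda_1; \lambda_2; \ldots; \lambda_m).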

Wrapping Up

  • Today we studied the clustering task; tomorrow we will continue with performance measures for clustering.

Big Surprise! Do You Know the Best Location to Rent Out a House on Airbnb?

Above All

  • Before we start the report, there are a few things to know:

    1. This is the first project of Udacity Data Scientist Term 2.

    2. The project uses two datasets: Airbnb host and review data for Seattle and Boston.

    3. This report discusses three questions about these datasets.

Github Repo

Explore and Preprocess Data

  • The dataset covers two main areas, Seattle and Boston. The two are very similar and differ only in a few details. We won't discuss those here, but I noted them as comments in the code; if needed, you can find them in the GitHub repo linked in the previous section.
  • For preprocessing, I went through the features containing NaN values one by one to decide which could be dropped and which could not, and applied one-hot encoding to some features. For example, for the price and available features in df_seattle_clendar_raw (see the code for details), the NaN prices should not be dropped: they simply mean the listing was not available on that date, so it is fine to keep them.

Get the Main Research Data

  • After preprocessing the data, I tried to find patterns in it, so I cut the dataframe down. I kept only the features id, city, price, security_deposit, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, and review_scores_value from df_seattle_listings and df_boston_listings. It looks like this:
  • The price and security_deposit columns need to be converted to a numeric type, so I wrote a function that strips the '$', '.', and ',' characters from the dataframe. Afterwards, the info looks like this:

Question 1: Is there any pattern between price and review_scores_rating?

  • OK, let's start looking for a pattern between price and review_scores_rating.
  • First, we plot a scatter of Price against Review Scores Rating. It looks like this:

Answer 1:

  • The scatter plot of Price against Review Scores Rating tells us something: for higher-priced houses, the distribution of review scores tends to sit higher as well. You will also find that for houses priced below $400, the probability of a score between 70 and 90 is quite high. So don't try too hard to save money; the more expensive listings tend to be worth it.

Question 2: Is there any pattern between the other scores and price?

  • For this question, I wrote a loop to generate the plots, because there are many features to plot one by one.
  • In this report I only show these two plots (the other features showed no relation):

Answer 2:

  • After plotting price against all the score features, it seems that only the price/security_deposit plot shows any effect, apart from the price/review_scores_rating plot analyzed above. From the price/security_deposit plot, the security deposit does not appear to be related to price. Perhaps the expensive listings do not care much about the security deposit because they cater to high-end customers.

Question 3: Is there any pattern between location and price?

  • For this part, I concatenated the two dataframes into one and one-hot encoded the city feature.
  • Before applying one-hot encoding, I needed to check the city values and transform the locations into a consistent format. Then I plotted it:

Answer 3:

  • It seems there are many expensive houses in classes 3 and 20, which correspond to the city centers of Seattle and Boston. They are always convenient for shopping and transportation, so they are worth the price.

In Conclusion:

  • You may notice that I didn't use any ML models to classify patterns or anything else; I found it unnecessary, because the data already tells us a lot. I did some analysis of the Airbnb Seattle and Boston datasets and tried to find patterns involving price and location. It was not entirely easy to explore, but in the end I found that a house in the city center of Seattle or Boston is the best one to list on Airbnb: it rents well and commands a better price.
  • From all of the above, we can see that prices in the city centers are higher than elsewhere, that the higher-priced houses are worth renting, and that the higher-priced houses are not really concerned with the security deposit, probably because they cater to high-end customers. So if you want to buy a house in Seattle or Boston and rent it out on Airbnb, I suggest choosing one in the city center.

Close Reading of the Watermelon Book (Chapter 8: Ensemble Learning) - Diversity

Before We Start

  • Yesterday we studied combination strategies in ensemble learning; today we continue with diversity.

Error-Ambiguity Decomposition

  • To build an ensemble with strong generalization ability, the individual learners should be 'accurate and diverse'. Let us do a simple theoretical analysis. Suppose we use the individual learners h_1, h_2, ..., h_T, combined by weighted averaging into an ensemble H, to perform a regression task f. For a sample x, the 'ambiguity' of learner h_i is defined as the squared difference between its output and the ensemble's output (see the reconstructed equations below).
  • The ambiguity of the ensemble is then the weighted average of the individual ambiguities.
  • Clearly, the ambiguity term characterizes the disagreement of the individual learners on sample x, and thus to some extent reflects their diversity. The squared errors of an individual learner h_i and of the ensemble H are defined analogously.
  • Writing the weighted average of the individual learners' errors with a bar, the ensemble ambiguity on x equals the average individual error minus the ensemble error.
  • Letting p(x) denote the probability density of the samples, the same relation holds over the full sample distribution.
  • Similarly, the generalization error and the ambiguity term of an individual learner h_i over the full distribution are obtained by integrating against p(x).
  • The generalization error of the ensemble is defined in the same way.
  • Finally we obtain the error-ambiguity decomposition: the ensemble's generalization error equals the weighted average of the individual generalization errors minus the weighted average of the individual ambiguities.
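
The formula images are missing from this post; the standard error-ambiguity decomposition the bullets walk through can be reconstructed as follows. For a sample x,

    A(h_i \mid \boldsymbol{x}) = \big(h_i(\boldsymbol{x}) - H(\boldsymbol{x})\big)^2, \qquad
    \bar{A}(h \mid \boldsymbol{x}) = \sum_{i=1}^{T} w_i \, A(h_i \mid \boldsymbol{x}),

    E(h_i \mid \boldsymbol{x}) = \big(f(\boldsymbol{x}) - h_i(\boldsymbol{x})\big)^2, \qquad
    E(H \mid \boldsymbol{x}) = \big(f(\boldsymbol{x}) - H(\boldsymbol{x})\big)^2,

    \bar{A}(h \mid \boldsymbol{x}) = \bar{E}(h \mid \boldsymbol{x}) - E(H \mid \boldsymbol{x}),

and, integrating against the sample density p(x), the generalization errors E_i, E and ambiguities A_i satisfy

    E = \bar{E} - \bar{A}, \qquad
    \bar{E} = \sum_{i=1}^{T} w_i E_i, \qquad
    \bar{A} = \sum_{i=1}^{T} w_i A_i.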

Wrapping Up

  • Today we studied diversity in ensemble learning; tomorrow we will move on to a new chapter and study the clustering task.

Close Reading of the Watermelon Book (Chapter 8: Ensemble Learning) - Combination Strategies

Before We Start

  • Yesterday we studied Bagging and Random Forest in ensemble learning; today we continue with combination strategies.

Combination Strategies

  • Combining learners can bring benefits in three respects. First, statistically, since the hypothesis space of a learning task is often very large, several hypotheses may achieve the same performance on the training set; using a single learner may then generalize poorly because of an unlucky pick, and combining several learners reduces this risk. Second, computationally, learning algorithms often get trapped in local minima, some of which correspond to poor generalization; combining the results of multiple runs lowers the risk of ending up in a bad local minimum. Third, from the viewpoint of representation, the true hypothesis of some learning tasks may not be contained in the hypothesis space considered by the current learning algorithm; a single learner would then certainly be ineffective, whereas combining multiple learners enlarges the corresponding hypothesis space and may thus yield a better approximation, as shown in the figure below:
  • Suppose the ensemble contains T base learners h_1, h_2, ..., h_T, where h_i outputs h_i(x) on sample x. This section introduces several common strategies for combining the h_i.

    Averaging

  • For numerical outputs h_i(x), the most common combination strategy is averaging.
  • Simple averaging (Simple Averaging):
  • Weighted averaging (Weighted Averaging):
  • where w_i is the weight of the individual learner h_i, usually required to satisfy w_i >= 0 and sum_i w_i = 1 (both formulas are reconstructed below).
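
The missing formula images correspond to the standard definitions:

    H(\boldsymbol{x}) = \frac{1}{T} \sum_{i=1}^{T} h_i(\boldsymbol{x})
    \qquad \text{(simple averaging)},

    H(\boldsymbol{x}) = \sum_{i=1}^{T} w_i \, h_i(\boldsymbol{x}),
    \quad w_i \ge 0, \; \sum_{i=1}^{T} w_i = 1
    \qquad \text{(weighted averaging)}.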

    Voting

  • For classification tasks, the learner h_i predicts a label from the label set {c_1, c_2, ..., c_N}, and the most common combination strategy is voting. For ease of discussion, we write the prediction of h_i on sample x as an N-dimensional vector (h_i^1(x); h_i^2(x); ...; h_i^N(x)), where h_i^j(x) is the output of h_i on class label c_j.
  • Majority voting (Majority Voting):
  • That is, a label is predicted if it receives more than half of the votes; otherwise the prediction is rejected.

  • Plurality voting (Plurality Voting):
  • That is, the label with the most votes is predicted; if several labels tie for the most votes, one of them is chosen at random.

  • Weighted voting (Weighted Voting):
  • As with weighted averaging, w_i is the weight of h_i, usually required to satisfy w_i >= 0 and sum_i w_i = 1 (the three voting rules are reconstructed below).
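
The missing formula images correspond to the standard voting rules:

    H(\boldsymbol{x}) =
    \begin{cases}
    c_j, & \text{if } \sum_{i=1}^{T} h_i^j(\boldsymbol{x}) > \dfrac{1}{2} \sum_{k=1}^{N} \sum_{i=1}^{T} h_i^k(\boldsymbol{x}), \\
    \text{reject}, & \text{otherwise}
    \end{cases}
    \qquad \text{(majority voting)},

    H(\boldsymbol{x}) = c_{\arg\max_j \sum_{i=1}^{T} h_i^j(\boldsymbol{x})}
    \qquad \text{(plurality voting)},

    H(\boldsymbol{x}) = c_{\arg\max_j \sum_{i=1}^{T} w_i \, h_i^j(\boldsymbol{x})}
    \qquad \text{(weighted voting)}.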

    Learning Methods

  • When plenty of training data is available, an even more powerful combination strategy is the 'learning method', i.e., combining through another learner. Stacking is the canonical representative. Here we call the individual learners the first-level learners and the learner used for combination the second-level learner or meta-learner. Stacking first trains the first-level learners on the initial training set and then 'generates' a new data set for training the second-level learner: in this new data set, the outputs of the first-level learners are used as the input features, while the labels of the original samples are kept as the labels. The Stacking algorithm is described in the figure below; here we assume the first-level learners are produced by different learning algorithms, i.e., the first-level ensemble is heterogeneous:
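
In symbols (a reconstruction consistent with the description above): from the first-level learners h_1, ..., h_T, each original example (x_i, y_i) is mapped to a meta-example

    \boldsymbol{z}_i = \big(h_1(\boldsymbol{x}_i);\, h_2(\boldsymbol{x}_i);\, \ldots;\, h_T(\boldsymbol{x}_i)\big),
    \qquad
    D' = \{(\boldsymbol{z}_i, y_i)\}_{i=1}^{m},

and the second-level learner is trained on D'.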

Wrapping Up

  • Today we studied combination strategies in ensemble learning; tomorrow we will continue with diversity.