MSBD5002:数据挖掘课程笔记

文章目录
  1. 1. 1 概述
  2. 2. 2 数据
    1. 2.1. Data Objects and Attribute Types
    2. 2.2. Basic Statistical Descriptions of Data
    3. 2.3. Measuring Data Similarity and Dissimilarity
    4. 2.4. Summary
  3. 3. 3 模式挖掘
  4. 4. 3 集成方法

1 概述

  • Why Data Mining?

  • What Is Data Mining?

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data

Alternative names

Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

  • A Multi-Dimensional View of Data Mining
  1. Data to be mined: Knowledge discovery in database (KDD) …….
  2. Knowledge to be mined (or: Data mining functions): Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.

  3. Techniques utilized: ML, stastics

  4. Applications adapted: in Business Intelligence
  • What Kinds of Data Can Be Mined?

Database-oriented data sets and applications

Advanced data sets and advanced applications

  • What Kinds of Patterns Can Be Mined?

Data Mining Function: (1) Generalization (2) Association and Correlation Analysis (3) Classification (4) Cluster Analysis (5) Outlier Analysis

Time and Ordering: Sequential Pattern, Trend and Evolution Analysis

Structure and Network Analysis

Evaluation of Knowledge

  • What Kinds of Technologies Are Used?

confluence-of-data-mining

  • What Kinds of Applications Are Targeted?

Web page analysis: from web page classification, clustering to PageRank & HITS algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)

  • Major Issues in Data Mining

Mining Methodology, User Interaction, Efficiency and Scalability, Diversity of data types, Data mining and society

  • A Brief History of Data Mining and Data Mining Society

  • Summary

2 数据

Data Objects and Attribute Types

Basic Statistical Descriptions of Data

Measuring Data Similarity and Dissimilarity

  • Heuristic
    • Minkowski-form
    • Weighted-Mean-Variance (WMV)
  • Nonparametric test statistics
    • $x^2$ (Chi Square)
    • Kolmogorov-Smirnov (KS)
    • Cramer/von Mises (CvM)
  • Information-theory divergences
    • Kullback-Liebler (KL)
    • Jeffrey-divergence (JD)
  • Ground distance measures
    • Histogram intersection
    • Quadratic form (QF)
    • Earth Movers Distance (EMD)

Example: cosine similarity

Example: Dissimilarity between Binary variables => asymmetric / symmetric binary

Example: Minkowski Distance => 跟我们学的Lp-norm是一样的

Summary

3 模式挖掘

《使用Apriori算法和FP-growth算法进行关联分析》,有代码示例:https://www.cnblogs.com/qwertWZ/p/4510857.html

3 集成方法

《Adaboost - 新的角度理解权值更新策略》推导Adaboost的权重更新公式:https://blog.csdn.net/dream_angel_z/article/details/52348135