
DATA100-L25: Data Regulations

- Impetus for regulation: why *you* should care — because you are going to be a data scientist and product owner!
- Regulations: privacy laws
  - GDPR (General Data Protection Regulation)
  - CCPA (California Consumer Privacy Act)
  - Cybersecurity Law in China
- Deletion can be more difficult than you think 😏
- Data transfers must be regulated too
- Take full advantage of the "regulations"; handle the gray areas carefully 🤔
- Work with our dear NGOs and governmental organizations
- Other regulations / regulatory bodies

DATA100-L23: Decision Trees

- Multiclass classification — but no softmax here 😢
- Decision trees (conceptually); decision tree demo
- Creating decision trees in sklearn (see the lecture code for the visualization)
- Evaluating tree accuracy
- Overfit decision tree example: the tree is too complex to generalize well to new data — too tall and narrow. With more useful features, the tree structure may turn out simpler 🤔
- The decision tree generation algorithm
- Intuitively evaluating split quality: what makes one split "cleaner" than another?
- Entropy: does entropy always shrink as you go down the tree? It can grow!
- Generating trees using entropy: the *weighted* entropy can still decrease!
- Traditional decision tree generation algorithm: all of the data starts in the root node.
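A minimal sketch of the entropy point above, plus a tree fit in sklearn. The split `[0] vs [0, 1]` is a made-up illustrative example (not from the lecture): one child's entropy exceeds the parent's, yet the size-weighted entropy of the split still drops, which is the quantity the generation algorithm actually minimizes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def entropy(labels):
    """Shannon entropy (base 2) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def weighted_entropy(left, right):
    """Size-weighted average entropy of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) * entropy(left) + len(right) * entropy(right)) / n

# A child node's entropy can exceed the parent's...
parent = np.array([0, 0, 1])
left, right = parent[:1], parent[1:]   # hypothetical split: [0] vs [0, 1]
print(entropy(parent))                 # ≈ 0.918
print(entropy(right))                  # 1.0 — larger than the parent!
# ...but the weighted entropy of the split still drops:
print(weighted_entropy(left, right))   # ≈ 0.667

# Fitting a (depth-limited, hence not overfit) tree in sklearn
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.score(X, y))                # training accuracy
```

Capping `max_depth` is one way to avoid the "too tall and narrow" overfit tree from the lecture.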

DATA100-L21: Classification and Logistic Regression I

- Regression vs. classification — the complete guide 😋
- Intuition: the coin flip. Redefine probability — anything satisfying a few axioms qualifies (cf. *Probability Theory and Mathematical Statistics*).
- Deriving the logistic regression model
  - A glance at kNN: this suggests some transformation can recover a linear relationship
  - Consider the probability $p$, then the odds $\frac{p}{1-p}$, then the log odds — hence "generalized linear"
- Graph of averages
- The sigmoid function
  $$ \sigma(t)=\frac{1}{1+e^{-t}} $$
- The logistic regression model; comparison to linear regression
- Parameter estimation
  - Pitfalls of squared loss: non-convex; bounded (MSE ∈ [0, 1]); conceptually questionable — it doesn't match "probability vs. 0/1 labels"
  - Cross-entropy loss:
    $$ -\frac{1}{N}\sum_{i=1}^N\left[y_i\log(p_i)+(1-y_i)\log(1-p_i)\right] $$
  - A loss function should penalize wrong predictions properly!
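A small sketch of the sigmoid and the cross-entropy loss from the formulas above. The probabilities used below are made-up numbers purely to illustrate the boundedness pitfall: squared loss on probabilities can never exceed 1, while cross-entropy punishes a confidently wrong prediction hard.

```python
import numpy as np

def sigmoid(t):
    """The sigmoid function: sigma(t) = 1 / (1 + e^{-t})."""
    return 1 / (1 + np.exp(-t))

def cross_entropy_loss(y, p):
    """Average cross-entropy between 0/1 labels y and predicted probabilities p."""
    eps = 1e-12                   # clip to avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

print(sigmoid(0.0))                                           # 0.5

# True label 1, but the model predicts p = 0.01:
print((1 - 0.01) ** 2)                                        # squared loss ≈ 0.98, capped at 1
print(cross_entropy_loss(np.array([1.0]), np.array([0.01])))  # ≈ 4.6, grows without bound
```

This is the "penalize properly" point: as $p \to 0$ with $y = 1$, cross-entropy $-\log p \to \infty$, while squared loss saturates at 1.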