sample statistics (from last time) — see a probability and mathematical statistics reference (概率论与数理统计)
prediction vs. inference
modeling: assumptions of randomness
the bias-variance tradeoff

$$ model\ risk = observation\ variance + (model\ bias)^2 + model\ variance $$

$$ \mathbb{E}[(Y-\hat{Y}(x))^2] = \sigma^2+(\mathbb{E}[\hat{Y}(x)]-g(x))^2+\mathrm{Var}(\hat{Y}(x)) $$

interpreting slopes: is the slope 0? Use a hypothesis test to decide whether the feature is unrelated to the response.
[Extra] review of the Bootstrap
[Extra] derivation of the Bias-Variance decomposition
https://docs.google.com/presentation/d/1gzgxGO_nbCDajYs7qIpjzjQfJqKadliBOat7Es10Ll8/edit#slide=id.g11df3da7bd7_0_467
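The decomposition above follows in a few lines, assuming $Y = g(x) + \epsilon$ with $\mathbb{E}[\epsilon]=0$, $\mathrm{Var}(\epsilon)=\sigma^2$, and $\epsilon$ independent of the fitted model $\hat{Y}(x)$:

$$ \begin{aligned} \mathbb{E}[(Y-\hat{Y}(x))^2] &= \mathbb{E}[(g(x)+\epsilon-\hat{Y}(x))^2] \\ &= \mathbb{E}[\epsilon^2] + \mathbb{E}[(g(x)-\hat{Y}(x))^2] \quad \text{(cross term vanishes since } \mathbb{E}[\epsilon]=0\text{)} \\ &= \sigma^2 + (\mathbb{E}[\hat{Y}(x)]-g(x))^2 + \mathrm{Var}(\hat{Y}(x)) \end{aligned} $$

The last step splits $\mathbb{E}[(g(x)-\hat{Y}(x))^2]$ around the mean prediction $\mathbb{E}[\hat{Y}(x)]$, giving the squared model bias plus the model variance.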
why databases
structured query language (SQL) 😋
DBMS: database management system
SQL example types
- INT for integers
- REAL for decimals
- TEXT for strings
- BLOB for arbitrary data
- DATETIME for dates and times

Different implementations of SQL support different types.
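A minimal sketch of these types in a table definition, using Python's built-in sqlite3 (the table name `Dragon` and its columns are made up for illustration; SQLite accepts all five type names):

```python
import sqlite3

# In-memory database for demonstration
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One column per type from the list above
cur.execute("""
    CREATE TABLE Dragon (
        id INT,
        weight REAL,
        name TEXT,
        photo BLOB,
        born DATETIME
    )
""")
cur.execute("INSERT INTO Dragon VALUES (1, 9.5, 'hiccup', NULL, '2010-03-26')")
row = cur.execute("SELECT name, weight FROM Dragon").fetchone()
print(row)  # ('hiccup', 9.5)
```

Note that SQLite maps unfamiliar type names (like DATETIME) onto a small set of storage classes via type affinity, which is one example of how type support differs across implementations.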
sql table
Use singular, CamelCase names for SQL tables!
basic sql queries
The `*` wildcard selects all columns:
```sql
SELECT * FROM table_name;
```

Select a subset of columns:

```sql
SELECT column1, column2 FROM table_name;
```

`AS` renames columns:

```sql
SELECT cute AS cuteness,
       smart AS intelligence
FROM table_name;
```

`WHERE` filters rows.
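A runnable sketch combining `AS` and `WHERE`, using Python's sqlite3 (the `Pet` table and its rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Pet (name TEXT, cute INT, smart INT)")
cur.executemany("INSERT INTO Pet VALUES (?, ?, ?)",
                [("ada", 10, 9), ("bob", 3, 7)])

# AS renames columns in the result set; WHERE keeps only matching rows
rows = cur.execute("""
    SELECT name, cute AS cuteness, smart AS intelligence
    FROM Pet
    WHERE cute > 5
""").fetchall()
print(rows)  # [('ada', 10, 9)]
```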
Cross Validation
the holdout method

```python
import numpy as np
from sklearn.utils import shuffle

# Randomly split the data: first 80% for training, last 20% for validation
training_set, dev_set = np.split(shuffle(data), [int(.8 * len(data))])
```

Compare the validation error with the training error to choose the best model.
K-fold cross validation: each of the K folds serves once as the validation set. The holdout method corresponds to performing just one such train/validation split instead of all K.
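A minimal sketch of K-fold cross validation with scikit-learn's `KFold` (the synthetic data and the linear model are just illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors = []
for train_idx, val_idx in kf.split(X):
    # Fit on K-1 folds, validate on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    fold_errors.append(mean_squared_error(y[val_idx], pred))

# The average validation error across folds estimates generalization error
print(np.mean(fold_errors))
```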
Test sets provide an unbiased estimate of the model's performance on new, unseen data.
Regularization
L2 regularization (Ridge)
The smaller the ball (the constraint region for the coefficients), the simpler the model. In the Lagrangian view, the larger $\alpha$ is, the stronger the constraint and the simpler the model (ridge regression).
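To see the "smaller ball" intuition concretely: larger `alpha` shrinks the coefficient norm toward zero. A sketch with scikit-learn's `Ridge` on synthetic data (the data-generating coefficients are made up):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -1.5, 2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Stronger regularization (larger alpha) => smaller coefficient norm
norms = {a: np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_)
         for a in [0.01, 1.0, 100.0]}
print(norms)  # the norm shrinks as alpha grows
```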
scaling data for regularization — standardize the data so all features are on the same scale.
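A sketch of standardizing before a regularized fit, using scikit-learn's `StandardScaler` (the pipeline shape and synthetic scales are assumptions for illustration, not from the notes):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
# Two features on wildly different scales
X = np.column_stack([rng.normal(scale=1.0, size=100),
                     rng.normal(scale=1000.0, size=100)])
y = X[:, 0] + 0.001 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Scaling first means the L2 penalty treats every coefficient fairly
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)

# After standardization each column has mean ~0 and standard deviation ~1
scaled_X = StandardScaler().fit_transform(X)
print(scaled_X.mean(axis=0), scaled_X.std(axis=0))
```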
L1 regularization (Lasso)
The L1 penalty can set some coefficients exactly to zero, effectively performing feature selection.
summary
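A sketch of that sparsity effect with scikit-learn's `Lasso` on synthetic data where only two of ten features matter (the data and `alpha` value are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
# Only the first two features actually influence y
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0.0))
print(lasso.coef_)
print(n_zero)  # most of the irrelevant coefficients are exactly zero
```

Contrast with Ridge, which shrinks coefficients toward zero but almost never makes them exactly zero.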