DATA100-L9: Introduction to Modeling, Simple Linear Regression

HHZZ published on 2024-07-18 included in DATA100

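Regression line and correlation: the least squares regression taught in high school is linear regression.

- Models: "All models are wrong, but some are useful." There is a trade-off between interpretability and accuracy, and a model can be physical or statistical.
- The modeling process: definitions.
- SLR (Simple Linear Regression): be clear about the difference between inputs and parameters. Some statistical models have no parameters at all!
- Loss functions: a metric for how good or bad a prediction is.
- Minimizing average loss (empirical risk): this is an optimization problem!
- Interpreting SLR: slope, Anscombe's quartet. Explain what the parameters mean and predict unseen data.
- Evaluating the model: RMSE, residual plot.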
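The steps listed above (fit by least squares, predict, evaluate with RMSE and residuals) can be sketched in a few lines of NumPy. This is only an illustration: the helper names `fit_slr` and `rmse` and the data values are made up for the example and do not come from the lecture.

```python
import numpy as np

def fit_slr(x, y):
    """Least squares fit of y ≈ intercept + slope * x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]              # correlation coefficient
    slope = r * y.std() / x.std()            # slope = r * sigma_y / sigma_x
    intercept = y.mean() - slope * x.mean()  # the line passes through (mean x, mean y)
    return intercept, slope

def rmse(y, y_hat):
    """Root mean squared error between observations and predictions."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

intercept, slope = fit_slr(x, y)
y_hat = intercept + slope * x   # predictions on the inputs
residuals = y - y_hat           # what a residual plot would show against x
error = rmse(y, y_hat)
```

Plotting `residuals` against `x` gives the residual plot. For a least squares fit with an intercept, the residuals average to zero, so any visible pattern in that plot suggests that a straight line is not the right model.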

DATA100-L8: Visualizations Ⅱ

HHZZ published on 2024-07-16 included in DATA100

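Kernel Density Functions

- KDE mechanics: smoothing in 1D (histograms), going from a rug plot to a histogram; smoothing in 2D (heatmaps, hex plots).
- KDEs in code: sns.distplot(data, kde=True). Note that distplot is deprecated in recent seaborn releases; sns.histplot(data, kde=True) is the replacement.
- Kernel functions and bandwidth: the larger the bandwidth $\alpha$, the smoother the curve. There are other kernel functions as well, for example the triangular kernel, the Epanechnikov kernel, and the boxcar kernel.

Visualization Theory

Keep the purpose of the visualization in mind! Summary statistics alone are neither intuitive enough nor accurate enough.

- Information channels: color, shape, size, position (coordinate), and orientation.
- Harnessing X/Y: do not use different scales for x and y in the same visualization! Keep the proportions moderate.
- Harnessing color: when choosing a colormap (jet, viridis, and so on), prefer a perceptually uniform one. viridis and inferno are perceptually uniform; jet is not. Turbo is a smoother rainbow alternative to jet.
- Harnessing markings: people read aligned bar charts more easily, because they compare lengths along one dimension.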
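To make the KDE mechanics and the effect of the bandwidth concrete, here is a minimal from-scratch sketch with a Gaussian kernel. The function names `gaussian_kernel` and `kde`, the data points, and the two bandwidth values are assumptions chosen for illustration; seaborn's own implementation differs in details such as automatic bandwidth selection.

```python
import numpy as np

def gaussian_kernel(x, z, alpha):
    """Gaussian density centered at data point z with bandwidth alpha."""
    return np.exp(-((x - z) ** 2) / (2 * alpha ** 2)) / np.sqrt(2 * np.pi * alpha ** 2)

def kde(grid, data, alpha):
    """Average one kernel per data point, evaluated at every grid location."""
    grid = np.asarray(grid, dtype=float)[:, None]   # shape (n_grid, 1)
    data = np.asarray(data, dtype=float)[None, :]   # shape (1, n_data)
    return gaussian_kernel(grid, data, alpha).mean(axis=1)

# Made-up observations, for illustration only
data = np.array([2.2, 2.8, 3.7, 5.3, 5.7])
grid = np.linspace(-10, 18, 2801)

narrow = kde(grid, data, alpha=0.3)   # small bandwidth: bumpy curve
wide = kde(grid, data, alpha=2.0)     # large bandwidth: smooth, flatter curve
```

Each kernel integrates to 1 and the kernels are averaged, so the resulting curve is itself a density. Increasing `alpha` spreads every kernel out, which lowers the peaks and smooths the estimate.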

DATA100-lab2

HHZZ published on 2024-07-15 included in DATA100

    # Initialize Otter
    import otter
    grader = otter.Notebook("lab02.ipynb")

Pandas is one of the most widely used Python libraries in data science. In this lab, you will review commonly used data wrangling operations/tools in Pandas. We aim to give you familiarity with:

- Creating DataFrames
- Slicing DataFrames (i.e. selecting rows and columns)
- Filtering data (using boolean arrays and groupby.filter)
- Aggregating (using groupby.agg)

In this lab you are going to use several pandas methods.
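As a rough preview of the four operations the lab lists, the sketch below runs each of them on a tiny DataFrame. The table, its column names, and the thresholds are invented for this example and are not taken from the lab's dataset.

```python
import pandas as pd

# Creating a DataFrame from a dict of columns (made-up data)
df = pd.DataFrame({
    "fruit": ["apple", "banana", "apple", "cherry", "banana"],
    "price": [1.0, 0.5, 1.2, 3.0, 0.6],
    "qty": [10, 20, 5, 2, 30],
})

# Slicing: .loc selects by label, and its row range is inclusive on both ends
first_two = df.loc[0:1, ["fruit", "price"]]

# Filtering rows with a boolean array
cheap = df[df["price"] < 1.0]

# Filtering whole groups: keep only fruits that appear more than once
repeated = df.groupby("fruit").filter(lambda g: len(g) > 1)

# Aggregating per group with named aggregations
summary = df.groupby("fruit").agg(
    mean_price=("price", "mean"),
    total_qty=("qty", "sum"),
)
```

Note the difference between the two filters: the boolean array decides row by row, while `groupby.filter` keeps or drops every row of a group together.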