
DATA100-L9: Introduction to Modeling, Simple Linear Regression

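Regression line and correlation: the least squares method from high school (least squares regression), i.e. linear regression. Model: "all models are wrong, but some are useful"; there is a trade-off between interpretability and accuracy; models can be physical or statistical. The modeling process: definitions. SLR (Simple Linear Regression): be clear about the difference between an input and a parameter; some statistical models have no parameters at all. Loss functions: a metric for how good or bad a prediction is. Minimizing average loss (the empirical risk, i.e. the mean loss over the observed data) is an optimization problem. Interpreting SLR: slope, Anscombe's quartet; explain what the parameters mean and predict unseen data. Evaluating the model: RMSE, residual plot.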
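As a minimal sketch of the ideas in this outline, the snippet below fits a simple linear regression by the closed-form least squares solution, then computes the residuals and the RMSE. The data values and the helper names `fit_slr` and `rmse` are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def fit_slr(x, y):
    """Return (intercept, slope) of the line minimizing mean squared error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]            # correlation coefficient
    slope = r * y.std() / x.std()          # slope = r * sd(y) / sd(x)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

def rmse(y, y_hat):
    """Root mean squared error between observations and predictions."""
    diff = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical toy data, made up for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

intercept, slope = fit_slr(x, y)
y_hat = intercept + slope * x              # predictions on the inputs
residuals = y - y_hat                      # what a residual plot would show
error = rmse(y, y_hat)
```

Plotting `residuals` against `x` gives the residual plot mentioned above; a visible pattern in it suggests the linear model is a poor fit even when the RMSE looks small.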

DATA100-L8: Visualizations Ⅱ

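Kernel density functions. KDE mechanics: smoothing in 1D (histograms), going from rug plot to histogram; smoothing in 2D (heatmaps / hex plots). KDEs in code: sns.distplot(data, kde=True) (deprecated in recent seaborn releases; sns.histplot(data, kde=True) is the current equivalent). Kernel functions and bandwidth: the larger $\alpha$ is, the smoother the curve. Other kernel functions exist as well, for example the triangular, Epanechnikov, and boxcar kernels. Visualization theory: keep the purpose of the visualization in mind; summary statistics alone are neither intuitive nor accurate enough. Information channels: color, shape, size, position (coordinate), and orientation. Harnessing X/Y: do not use different scales for x and y in the same visualization, and keep the aspect ratio moderate. Harnessing color: choose a colormap (jet, viridis, and so on); prefer perceptually uniform colormaps, which jet is not, while Inferno and Turbo are acceptable. Harnessing markings: people read neatly aligned bars (one-dimensional lengths) most easily.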
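To make the KDE mechanics concrete, here is a small sketch that builds a kernel density estimate by hand: one kernel is centered on each data point and the kernels are averaged. The data values, the grid, and the function names are assumptions chosen for illustration; the Gaussian and boxcar kernels follow their usual textbook definitions with bandwidth `alpha`.

```python
import numpy as np

def gaussian_kernel(x, center, alpha):
    """Gaussian kernel centered at `center` with bandwidth (std dev) alpha."""
    return np.exp(-((x - center) ** 2) / (2 * alpha ** 2)) / np.sqrt(2 * np.pi * alpha ** 2)

def boxcar_kernel(x, center, alpha):
    """Boxcar kernel: constant height 1/alpha over a window of width alpha."""
    return np.where(np.abs(x - center) <= alpha / 2, 1.0 / alpha, 0.0)

def kde(grid, data, alpha, kernel=gaussian_kernel):
    """Average one kernel per data point, evaluated on `grid`."""
    grid = np.asarray(grid, dtype=float)
    data = np.asarray(data, dtype=float)
    return np.mean([kernel(grid, xi, alpha) for xi in data], axis=0)

# Hypothetical sample, made up for illustration only.
data = np.array([2.2, 2.8, 3.7, 5.3, 5.7])
grid = np.linspace(-5, 15, 2001)

density_narrow = kde(grid, data, alpha=0.5)   # small bandwidth: bumpy curve
density_wide = kde(grid, data, alpha=2.0)     # large bandwidth: smooth curve
density_boxcar = kde(grid, data, alpha=1.0, kernel=boxcar_kernel)
```

Each estimate integrates to roughly 1 over the grid, and the wider bandwidth yields a flatter, smoother curve, which matches the note about $\alpha$.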

DATA100-lab2

    # Initialize Otter
    import otter
    grader = otter.Notebook("lab02.ipynb")

Pandas is one of the most widely used Python libraries in data science. In this lab, you will review commonly used data wrangling operations and tools in Pandas. We aim to give you familiarity with: creating DataFrames; slicing DataFrames (i.e. selecting rows and columns); filtering data (using boolean arrays and groupby.filter); and aggregating (using groupby.agg). In this lab you are going to use several pandas methods.
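The four operations listed in this lab can be sketched on a tiny table as follows. The DataFrame contents and column names are hypothetical and are not taken from the lab itself; only standard pandas calls are used.

```python
import pandas as pd

# Creating a DataFrame from a dict of columns (made-up data).
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat", "Dan", "Eve"],
    "team": ["red", "blue", "red", "blue", "green"],
    "score": [90, 72, 85, 64, 77],
})

# Slicing: .loc selects by label, and the row slice 0:1 is inclusive.
first_two = df.loc[0:1, ["name", "score"]]

# Filtering with a boolean array: keep rows whose score exceeds 75.
high = df[df["score"] > 75]

# groupby.filter: keep only the rows of teams that have at least two members.
big_teams = df.groupby("team").filter(lambda g: len(g) >= 2)

# groupby.agg: one aggregated value per team.
mean_by_team = df.groupby("team")["score"].agg("mean")
```

Note the difference between the last two: `groupby.filter` returns rows of the original table, while `groupby.agg` returns one row per group.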