Regression line, correlation
High-school least squares (least squares regression), i.e. linear regression
Model: "All models are wrong, but some are useful."
Trade-off between interpretability and accuracy
Physical vs. statistical models
The modeling process: definitions
SLR: Simple Linear Regression. Be clear about the difference between an input and a parameter.
Some statistical models have no parameters at all!
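Loss functions: a metric for how good or bad a prediction is
Minimizing average loss (the empirical risk, i.e. the mean loss over the observed data, as opposed to the expected risk over the whole population). This is an optimization problem!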
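To make the loss and the empirical risk concrete, here is a short sketch. The notation is an assumption, not taken from these notes: $\theta_0$ is the intercept, $\theta_1$ the slope, $r$ the correlation, $\bar{x}, \bar{y}$ the means, and $\sigma_x, \sigma_y$ the standard deviations.

$$
\hat{y}_i = \theta_0 + \theta_1 x_i, \qquad
\hat{R}(\theta_0, \theta_1) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i)
$$

With squared loss $L(y, \hat{y}) = (y - \hat{y})^2$, setting both partial derivatives of $\hat{R}$ to zero gives the least squares solution:

$$
\hat{\theta}_1 = r \frac{\sigma_y}{\sigma_x}, \qquad
\hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}
$$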
Interpreting SLR: slope, Anscombe's quartet. Explain what the parameters mean and predict unseen data.
Evaluating the model: RMSE, residual plot
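A minimal sketch of fitting and evaluating an SLR model in plain Python, assuming squared loss. The helper names `fit_slr` and `rmse` and the toy data are made up for illustration; they are not from the course material.

```python
import math

def fit_slr(x, y):
    """Return (intercept, slope) that minimize mean squared error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def rmse(y, y_hat):
    """Root mean squared error between observations and predictions."""
    return math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat)) / len(y))

# Hypothetical toy data, roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_slr(x, y)
y_hat = [intercept + slope * xi for xi in x]        # fitted values
residuals = [yi - pi for yi, pi in zip(y, y_hat)]   # what a residual plot shows
error = rmse(y, y_hat)
```

The slope reads as the predicted change in `y` per unit increase in `x`. Plotting `residuals` against `x` should show no visible pattern if a line is an adequate model, which is the check that separates the datasets in Anscombe's quartet.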
Kernel Density Functions
KDE mechanics
Smoothing in 1D (histograms): rug -> histogram
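The rug-to-histogram step can be sketched as follows: each observation is dropped into a bin, and the counts are rescaled so the bars have total area 1. The function name `histogram_density`, the bin edges and the data points are illustrative assumptions.

```python
def histogram_density(points, lo, hi, n_bins):
    """Bin points in [lo, hi] and return bar heights with total area 1."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for p in points:
        if lo <= p <= hi:
            # The right edge hi is folded into the last bin.
            idx = min(int((p - lo) / width), n_bins - 1)
            counts[idx] += 1
    total = sum(counts)  # assumed to be non-zero in this sketch
    return [c / (total * width) for c in counts]

# Hypothetical rug of five observations.
rug = [2.2, 2.8, 3.7, 5.3, 5.7]
density = histogram_density(rug, lo=2.0, hi=6.0, n_bins=4)
```

Fewer, wider bins give a smoother but coarser picture of the same rug.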
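Smoothing in 2D (heatmaps / hex plots)
KDEs in code: sns.distplot(data, kde=True). Note that distplot has been deprecated since seaborn 0.11; sns.histplot(data, kde=True) is the replacement.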
Kernel Functions and Bandwidth
The larger the bandwidth $\alpha$, the smoother the curve.
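A minimal sketch of the effect of the bandwidth, assuming a Gaussian kernel with $\alpha$ as its standard deviation. The names `gaussian_kernel` and `kde`, the data and the grid are illustrative choices.

```python
import math

def gaussian_kernel(z, alpha):
    """Gaussian density with mean 0 and standard deviation alpha, evaluated at z."""
    return math.exp(-z ** 2 / (2 * alpha ** 2)) / math.sqrt(2 * math.pi * alpha ** 2)

def kde(points, x, alpha, kernel=gaussian_kernel):
    """Average of one kernel centered on each data point, evaluated at x."""
    return sum(kernel(x - p, alpha) for p in points) / len(points)

# Hypothetical data and evaluation grid from -5.0 to 13.0 in steps of 0.1.
data = [2.2, 2.8, 3.7, 5.3, 5.7]
step = 0.1
grid = [i * step for i in range(-50, 131)]

narrow = [kde(data, g, alpha=0.3) for g in grid]  # small bandwidth: bumpy curve
wide = [kde(data, g, alpha=2.0) for g in grid]    # large bandwidth: smooth curve
```

Both curves integrate to roughly 1 over the grid, but the wide-bandwidth curve has lower, flatter peaks. Because `kde` takes the kernel as an argument, a different kernel function with the same `(z, alpha)` signature could be swapped in.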
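There are of course other kernel functions, for example: triangular kernel, Epanechnikov kernel, boxcar kernel.
Visualization Theory
Keep the purpose of the visualization in mind!
Statistical summaries alone are neither intuitive enough nor accurate enough!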
Information Channels: color, shape, size, position (coordinate), and orientation
Harnessing X/Y: do not use different scales for x and y in the same visualization!
Keep the proportions moderate.
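Harnessing Color: choosing colors and colormaps such as jet, viridis, and so on.
Prefer perceptually uniform colormaps! jet is not one. Inferno is; Turbo is a smoother replacement for jet, although it is not strictly perceptually uniform.
Harnessing Markings: people are better at comparing neatly aligned bars (one-dimensional lengths).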
    # Initialize Otter
    import otter
    grader = otter.Notebook("lab02.ipynb")

Pandas is one of the most widely used Python libraries in data science. In this lab, you will review commonly used data wrangling operations/tools in Pandas. We aim to give you familiarity with:
Creating DataFrames
Slicing DataFrames (i.e. selecting rows and columns)
Filtering data (using boolean arrays and groupby.filter)
Aggregating (using groupby.agg)
In this lab you are going to use several pandas methods.
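The four operations listed above can be sketched on a small, made-up table. The column names and values are hypothetical and are not taken from the lab.

```python
import pandas as pd

# Creating a DataFrame from a dict of columns (hypothetical data).
df = pd.DataFrame({
    "fruit": ["apple", "banana", "apple", "cherry", "banana"],
    "price": [1.0, 0.5, 1.2, 3.0, 0.6],
    "qty": [10, 20, 15, 5, 30],
})

# Slicing: .loc is label-based (end inclusive), .iloc is position-based (end exclusive).
first_two = df.loc[0:1, ["fruit", "price"]]
by_position = df.iloc[0:2, 0:2]

# Filtering rows with a boolean array.
cheap = df[df["price"] < 1.1]

# groupby.filter keeps or drops whole groups; here, fruits that appear more than once.
repeated = df.groupby("fruit").filter(lambda g: len(g) > 1)

# groupby.agg computes one summary row per group (named aggregation).
summary = df.groupby("fruit").agg(
    mean_price=("price", "mean"),
    total_qty=("qty", "sum"),
)
```

Note the difference between the two filters: the boolean array decides row by row, while `groupby.filter` decides group by group and returns the original rows of every group that passes.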