CS186-L9: Sorting and Hashing

HHZZ published on 2024-08-14 included in CS186

Why Sort?

Rendezvous (bringing matching records together): eliminating duplicates (DISTINCT), grouping for summarization (GROUP BY), and the upcoming sort-merge join algorithms.

Ordering: sometimes output must be in a specific order, and sorting is the first step in bulk loading tree indexes.

Problem: sort 100GB of data with 1GB of RAM. Why not virtual memory? It leads to random IO access, which is too slow 😢

Out-of-Core Algorithms ("core" == RAM back in the day)

Streaming in a single pass: data passes through memory, as in MapReduce's "Map" 😎
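To make the 100GB-with-1GB problem concrete, here is a minimal Python sketch of one common out-of-core approach: sort memory-sized chunks into runs on disk, then merge the runs as streams. The names `external_sort` and `chunk_size` are hypothetical, records are assumed to be newline-free strings, and this is an illustration under those assumptions rather than the exact algorithm the lecture develops.

```python
import heapq
import tempfile


def external_sort(records, chunk_size):
    """Yield records in sorted order, holding at most chunk_size of them
    in memory while building each run (hypothetical helper)."""
    runs = []  # one temporary file per sorted run

    def write_run(chunk):
        # Sort one memory-sized chunk and spill it to disk.
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(rec + "\n" for rec in sorted(chunk))
        run.seek(0)
        runs.append(run)

    try:
        chunk = []
        for rec in records:
            chunk.append(rec)
            if len(chunk) == chunk_size:
                write_run(chunk)
                chunk = []
        if chunk:
            write_run(chunk)

        # Merge phase: each run is read back as a stream, so only one
        # record per run needs to be in memory at a time.
        streams = [(line.rstrip("\n") for line in run) for run in runs]
        yield from heapq.merge(*streams)
    finally:
        for run in runs:
            run.close()


if __name__ == "__main__":
    data = ["pear", "apple", "fig", "kiwi", "date"]
    print(list(external_sort(data, chunk_size=2)))
```

The sequential reads and writes in both phases are the point: unlike virtual memory, nothing here triggers random IO.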

CS186-L6: Indices & B+ Tree Refinements

HHZZ published on 2024-08-14 included in CS186

General Notes

Issues to consider in any index structure (not just in the B+ tree):
query support: what class of queries can be supported?
choice of search key: affects how we write the query.
data entry storage: affects the performance of the index.
variable-length key tricks: affect the performance of the index.
cost model for Index vs Heap vs Sorted File.

Query support

Indexes: basic selection <key><op><constant>, with operators such as =, BETWEEN, >, <, >=, <=.
More general selections: the curse of dimensionality 😲 In this lecture, however, we focus only on 1-d range search and equality, with the B+ tree
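As a rough illustration of 1-d equality and range selections of the form <key><op><constant>, the sketch below uses a sorted Python list and binary search as a stand-in for an ordered index. The class `SortedKeyIndex` and its method names are hypothetical, and a sorted list is not a B+ tree; it only shows why keeping keys in order makes these operators cheap to answer.

```python
import bisect


class SortedKeyIndex:
    """Hypothetical stand-in for an ordered 1-d index over search keys."""

    def __init__(self, keys):
        self.keys = sorted(keys)

    def equal(self, k):
        # key = k: all entries between the leftmost and rightmost match.
        lo = bisect.bisect_left(self.keys, k)
        hi = bisect.bisect_right(self.keys, k)
        return self.keys[lo:hi]

    def between(self, low, high):
        # key BETWEEN low AND high (inclusive on both ends, as in SQL).
        lo = bisect.bisect_left(self.keys, low)
        hi = bisect.bisect_right(self.keys, high)
        return self.keys[lo:hi]

    def greater_than(self, k, inclusive=False):
        # key > k, or key >= k when inclusive=True.
        find = bisect.bisect_left if inclusive else bisect.bisect_right
        return self.keys[find(self.keys, k):]

    def less_than(self, k, inclusive=False):
        # key < k, or key <= k when inclusive=True.
        find = bisect.bisect_right if inclusive else bisect.bisect_left
        return self.keys[:find(self.keys, k)]


if __name__ == "__main__":
    idx = SortedKeyIndex([7, 3, 9, 3, 12, 5])
    print(idx.equal(3))
    print(idx.between(5, 9))
    print(idx.greater_than(9, inclusive=True))
```

Each query is one or two binary searches followed by a contiguous scan, which is the same access pattern a B+ tree gives over its leaf level.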

DATA100-lab13: Decision Trees and Random Forests

HHZZ published on 2024-08-13 included in DATA100

    # Initialize Otter
    import otter
    grader = otter.Notebook("lab13.ipynb")

Lab 13: Decision Trees and Random Forests

    # Run this cell to set up your notebook
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap
    import seaborn as sns
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn import tree
    # you may get a warning from importing ensemble.