Chapter 2 of Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow walks through a complete machine learning project from start to finish using a real-world housing dataset. Aurélien Géron’s goal is to demonstrate the practical workflow that ML practitioners follow, emphasizing that success comes from disciplined process, not just model selection.
Big Picture First
The chapter begins by stressing the importance of understanding the business objective before touching the data. Géron frames the example problem as predicting housing prices in California to support a company’s decision-making.
Key ideas:
- Define the problem clearly (supervised regression in this case)
- Identify performance metrics (e.g., RMSE)
- Understand how the model will be used in production
- Establish a baseline expectation
Takeaway: Machine learning projects start with problem framing, not coding.
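The RMSE metric mentioned above can be sketched in a few lines. This is a minimal illustration with made-up prices, not code from the book:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: the typical size of prediction errors,
    in the same units as the target (here, dollars)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy example with invented prices (illustration only)
actual = [200_000, 350_000, 150_000]
predicted = [210_000, 330_000, 160_000]
error = rmse(actual, predicted)
```

Because errors are squared before averaging, RMSE penalizes large mistakes more heavily than small ones, which is usually what you want for price predictions.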
Get the Data
Next, the dataset is loaded and explored. Géron introduces best practices for data acquisition and versioning.
Important steps:
- Download and store the dataset
- Take an initial look at structure and features
- Identify target variable
- Note potential data issues
He emphasizes creating reproducible data pipelines early.
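The "initial look" step might look like the sketch below. The inline CSV is a tiny hypothetical stand-in for the real housing file (which has about 20,640 districts and more columns); the column names mirror the dataset's:

```python
import io
import pandas as pd

# Tiny stand-in for the California housing CSV (hypothetical values)
csv_data = io.StringIO(
    "longitude,latitude,median_income,median_house_value,ocean_proximity\n"
    "-122.23,37.88,8.3252,452600.0,NEAR BAY\n"
    "-122.22,37.86,8.3014,358500.0,NEAR BAY\n"
    "-122.24,37.85,7.2574,352100.0,NEAR BAY\n"
)
housing = pd.read_csv(csv_data)

housing.info()                 # column names, non-null counts, dtypes
summary = housing.describe()   # count/mean/std/min/quartiles/max per numeric column
target = "median_house_value"  # the label the model will predict
```

`info()` and `describe()` are usually enough to spot missing values, odd dtypes, and suspicious ranges before any modeling starts.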
Explore the Data (EDA)
Exploratory Data Analysis (EDA) helps build intuition about the dataset.
Key activities:
- Visualize distributions with histograms
- Use scatter plots to find correlations
- Identify geographical patterns
- Detect anomalies and skewed features
A major insight is that visualization often reveals problems that statistics alone miss.
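The numeric side of these EDA activities can be sketched as follows. The data here is synthetic (income driving price plus noise), standing in for the real dataset, where the same correlation check flags median income as the strongest single predictor:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in: income drives price, plus noise (illustration only)
n = 500
income = rng.uniform(0.5, 15.0, n)
price = 50_000 + 40_000 * income + rng.normal(0, 30_000, n)
housing = pd.DataFrame({"median_income": income,
                        "median_house_value": price,
                        "latitude": rng.uniform(32, 42, n)})

# Correlation of every numeric feature with the target, strongest first
corr = (housing.corr(numeric_only=True)["median_house_value"]
               .sort_values(ascending=False))

# Histograms reveal skew and capped values; skewness is a quick numeric proxy
skew = housing["median_house_value"].skew()
```

Correlation only captures linear relationships, which is exactly why the chapter pairs it with scatter plots: a scatter matrix can reveal caps, clusters, and nonlinear patterns that a single number hides.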
Create a Test Set
One of the most critical lessons in the chapter is to split off a test set early, before in-depth exploration, to avoid data snooping bias, where patterns glimpsed in the test set leak into modeling decisions.
Géron demonstrates:
- Random train/test splitting
- Stratified sampling based on income categories
- Why naive random splits can bias results
Takeaway: Protect the test set to ensure honest evaluation.
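One way to sketch the stratified split (the book also shows `StratifiedShuffleSplit`; the synthetic income values below are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
housing = pd.DataFrame({"median_income": rng.uniform(0.5, 15.0, 1000)})

# Bin continuous income into a few categories, as the chapter does,
# so the split can preserve the income distribution
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

# stratify= keeps each income category's proportion identical
# in train and test, unlike a naive random split
train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)
```

With a naive random split, a small-but-important stratum (say, very high incomes) can end up over- or under-represented in the test set purely by chance, biasing the evaluation.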
Data Preparation
This section shows how raw data becomes model-ready.
Major steps include:
Handling Missing Values
Options discussed:
- Remove rows
- Remove features
- Impute values (median is commonly used)
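The imputation option can be sketched with Scikit-Learn's `SimpleImputer` (the one the chapter uses); the toy values below are invented:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

housing_num = pd.DataFrame({"total_bedrooms": [1.0, 2.0, np.nan, 4.0],
                            "median_income": [3.0, np.nan, 5.0, 8.0]})

# Median imputation: robust to outliers, the option the chapter settles on.
# fit() learns each column's median; transform() fills the holes.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(housing_num),
                      columns=housing_num.columns)
```

Crucially, the imputer is fit on the training set only and then reused to transform the test set, so test-set statistics never leak into preprocessing.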
Feature Engineering
Géron creates new features that improve predictive power, such as:
- Rooms per household
- Bedrooms per room
- Population per household
Key insight: Good features often matter more than complex models.
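The three combined features are simple ratios of existing columns (the numbers below are illustrative district values, not a claim about the real data):

```python
import pandas as pd

housing = pd.DataFrame({"total_rooms": [880, 7099],
                        "total_bedrooms": [129, 1106],
                        "population": [322, 2401],
                        "households": [126, 1138]})

# The three combined features from the chapter
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
```

Raw counts like `total_rooms` mostly measure district size; dividing by `households` turns them into per-household quantities that actually describe the housing stock, which is why these ratios correlate better with price.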
Encoding Categorical Variables
Techniques covered:
- Ordinal encoding
- One-hot encoding (preferred in many cases)
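Both encoders are available in Scikit-Learn; a minimal comparison on a made-up categorical column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

cats = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND",
                                         "NEAR BAY", "ISLAND"]})

# Ordinal encoding assigns arbitrary integers, which implies an ordering
# ("INLAND" < "ISLAND" < "NEAR BAY") that doesn't actually exist
ordinal = OrdinalEncoder().fit_transform(cats)

# One-hot encoding gives each category its own binary column instead
# (.toarray() because the encoder returns a sparse matrix by default)
onehot = OneHotEncoder().fit_transform(cats).toarray()
```

One-hot is preferred here because distance-based and linear models would otherwise treat the arbitrary integer codes as meaningful magnitudes.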
Feature Scaling
Why scaling matters for many algorithms:
- Standardization
- Min-max scaling
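Both scalers side by side, on a toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Min-max scaling squashes values into [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization centers on 0 with unit variance; unlike min-max
# it has no fixed output range, so outliers distort it less
X_std = StandardScaler().fit_transform(X)
```

Scaling matters because gradient-based and distance-based algorithms implicitly weight features by their numeric range; a feature measured in thousands would otherwise dominate one measured in single digits.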
Build a Training Pipeline
A major best practice introduced is using Scikit-Learn pipelines to automate preprocessing and ensure consistency.
Benefits:
- Reproducibility
- Cleaner code
- Reduced risk of data leakage
- Easier experimentation
This is one of the most practically important lessons in the chapter.
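The preprocessing steps above can be chained with `Pipeline` and `ColumnTransformer`, in the spirit of the chapter (the toy data is invented):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

housing = pd.DataFrame({
    "median_income": [3.0, np.nan, 8.0, 5.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})

# Numeric columns: impute, then scale, in a fixed order
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# One transformer per column group; the same fitted object is later
# reused on the test set, which is what prevents leakage
preprocessing = ColumnTransformer([
    ("num", num_pipeline, ["median_income"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])
prepared = preprocessing.fit_transform(housing)
```

Because every statistic (medians, means, category lists) is learned inside `fit`, calling only `transform` on new data guarantees the exact same preprocessing at training and prediction time.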
Select and Train Models
Géron trains several models to establish baselines:
- Linear Regression
- Decision Tree
- Random Forest
He demonstrates that:
- Simple models provide useful baselines
- Complex models can overfit
- Performance must be validated properly
Better Evaluation with Cross-Validation
Instead of relying on a single train/test split, the chapter introduces cross-validation.
Benefits:
- More reliable performance estimates
- Better model comparison
- Reduced variance in evaluation
This step reveals that the Decision Tree badly overfits: its training error is near zero, yet its cross-validation scores turn out no better than plain Linear Regression.
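Cross-validation can be sketched with `cross_val_score` (synthetic data again):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, 200)

tree = DecisionTreeRegressor(random_state=42)

# 10-fold CV: train on 9 folds, score on the held-out fold, 10 times.
# Scikit-Learn scorers follow "greater is better", so RMSE comes back negated.
scores = cross_val_score(tree, X, y,
                         scoring="neg_root_mean_squared_error", cv=10)
cv_rmse = -scores  # one RMSE per fold

mean_rmse, std_rmse = cv_rmse.mean(), cv_rmse.std()
```

Reporting both the mean and the standard deviation across folds is the point: a single split gives one number with no sense of how much it would move on different data.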
Fine-Tune the Model
Hyperparameter tuning is performed using:
- Grid Search
- Randomized Search
Géron shows how systematic tuning improves performance and why blind guessing is inefficient.
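A minimal `GridSearchCV` sketch; the tiny grid below is illustrative (the chapter's grid covers more combinations):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (150, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, 150)

# Every combination of these values is trained and cross-validated
param_grid = {"n_estimators": [10, 30], "max_features": [1, 2, 3]}

search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

best_params = search.best_params_    # the winning combination
best_model = search.best_estimator_  # refit on all the data by default
```

Grid search is exhaustive over a small discrete grid; randomized search samples the space instead, which scales better when there are many hyperparameters or continuous ranges.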
Analyze the Best Model
After selecting the best model (Random Forest), the chapter demonstrates:
- Feature importance analysis
- Error analysis
- Model interpretation
This helps understand why the model works, not just how well.
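Feature importance analysis can be sketched with the forest's built-in impurity-based importances. The data here is synthetic, with one informative feature and one pure-noise feature, so the expected ranking is known by construction:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = pd.DataFrame({
    "median_income": rng.uniform(0, 10, 300),  # strong signal
    "latitude": rng.uniform(32, 42, 300),      # pure noise here
})
y = 5.0 * X["median_income"] + rng.normal(0, 1.0, 300)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances, sorted: shows which features pull their
# weight and which are candidates for dropping
importances = (pd.Series(forest.feature_importances_, index=X.columns)
                 .sort_values(ascending=False))
```

On the real dataset this is how the chapter finds that a few features (and one-hot categories) contribute almost nothing, suggesting they could be dropped.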
Evaluate on the Test Set
Only after all tuning is complete is the model evaluated on the held-out test set.
This provides an unbiased estimate of real-world performance.
Critical principle: never touch the test set until the very end.
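The full discipline fits in a few lines: carve off the test set first, tune without it, evaluate once. A sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, (300, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, 300)

# Test set carved off first and never used during tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

final_model = RandomForestRegressor(n_estimators=30, random_state=42)
final_model.fit(X_train, y_train)

# One evaluation, at the very end: an unbiased estimate of generalization
test_rmse = mean_squared_error(y_test, final_model.predict(X_test)) ** 0.5
```

Evaluating repeatedly on the test set during development would turn it into just another validation set, and the final number would be optimistically biased.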
Key Takeaways
- Machine learning is an end-to-end engineering process
- Problem framing comes before modeling
- Always split data early to avoid leakage
- Feature engineering is extremely powerful
- Pipelines improve reliability and reproducibility
- Cross-validation gives more trustworthy evaluation
- Hyperparameter tuning should be systematic
- Final evaluation must use untouched test data
Bottom line: Chapter 2 teaches that successful machine learning depends far more on disciplined workflow and data preparation than on choosing sophisticated algorithms.