This document presents a comprehensive approach to building a robust machine learning model for predicting B2B buyer intent. We'll walk through connecting to data sources, preprocessing techniques, model training with XGBoost, evaluation methods, and deployment considerations to help data scientists implement effective buyer intent prediction systems.
In B2B sales, identifying accounts likely to convert is crucial for efficient resource allocation. Buyer intent prediction uses machine learning to analyze account behavior, demographics, and third-party signals to forecast conversion likelihood. This enables sales teams to prioritize high-intent accounts, personalize outreach, and increase conversion rates.
Unlike B2C models, B2B buyer intent faces unique challenges: longer sales cycles, multiple stakeholders in decision-making, smaller data volumes, and significant class imbalance. Our approach addresses these challenges through specialized preprocessing, feature engineering, and model tuning techniques tailored to B2B contexts.
The foundation of our buyer intent model is robust data retrieval and preparation. We connect to a Hive data warehouse to access account-level information including behavioral signals (website visits, content downloads), firmographics (industry, company size), and third-party intent scores.
Connect to Hive Warehouse
Establish secure connection to retrieve account data using PyHive
Define Data Specification
Create SQL query targeting relevant account features and conversion labels
Load and Verify Data
Retrieve dataset into pandas DataFrame and validate structure
Handle Missing Values
Apply median imputation for numerical features and mode imputation for categoricals
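The retrieval and validation steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the host, username, and table name (`sales.account_features`) are hypothetical placeholders, and the PyHive import is deferred so the validation helper can be used even where PyHive is not installed.

```python
import pandas as pd

# Hypothetical query; the warehouse schema and table name are assumptions.
ACCOUNT_QUERY = """
SELECT account_id, website_visits, content_downloads, email_opens,
       company_size, revenue, bombora_intent_score, g2_intent_score,
       industry, tech_stack, converted
FROM sales.account_features
"""

def load_account_data(host, port=10000, username="ml_pipeline"):
    # Deferred import so this module still loads where PyHive is absent.
    from pyhive import hive
    conn = hive.Connection(host=host, port=port, username=username)
    try:
        return pd.read_sql(ACCOUNT_QUERY, conn)
    finally:
        conn.close()

def validate_accounts(df):
    # Structural checks described in the text: required columns,
    # unique account IDs, and a binary conversion label.
    required = {"account_id", "industry", "tech_stack", "converted"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if not df["account_id"].is_unique:
        raise ValueError("duplicate account_id values")
    if not df["converted"].isin([0, 1]).all():
        raise ValueError("converted must be binary")
    return df
```

In practice the connection parameters would come from configuration or a secrets manager rather than being hard-coded.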
Our dataset includes behavioral metrics like website visits and content downloads, firmographic data like industry and company size, and third-party signals from sources like Bombora and G2. This comprehensive data foundation enables our model to identify complex patterns associated with buyer intent.
In B2B datasets, missing values are common due to incomplete account information, data collection gaps, or integration issues. Proper handling of these gaps is crucial for model reliability. Our approach uses a statistical imputation strategy tailored to the data type:
Numerical Features
We apply median imputation for numerical columns like website_visits, revenue, and intent scores. The median is preferred over mean imputation because it's robust to outliers, which are common in B2B behavioral data where a few accounts may have extremely high activity levels.
Example columns handled: website_visits, content_downloads, email_opens, company_size, revenue, bombora_intent_score, g2_intent_score
Categorical Features
For categorical columns like industry and tech_stack, we use most_frequent (mode) imputation. This replaces missing values with the most common category in the dataset, which is a simple yet effective approach for nominal categories.
Example columns handled: industry, tech_stack
While XGBoost has built-in capabilities for handling missing values, explicit imputation ensures compatibility with other potential modeling steps and creates more predictable behavior across the entire pipeline. It also allows us to reuse these transformations for preprocessing new data during inference.
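The imputation strategy above can be implemented with scikit-learn's SimpleImputer. This sketch assumes the column names listed earlier; fitting the imputers once and passing them back in at inference time is what makes the transformations reusable on new data.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

NUMERIC_COLS = ["website_visits", "content_downloads", "email_opens",
                "company_size", "revenue", "bombora_intent_score",
                "g2_intent_score"]
CATEGORICAL_COLS = ["industry", "tech_stack"]

def impute(df, num_imputer=None, cat_imputer=None):
    # Fit imputers on first call (training); pass the fitted imputers
    # back in at inference time so the same statistics are reused.
    df = df.copy()
    if num_imputer is None:
        num_imputer = SimpleImputer(strategy="median").fit(df[NUMERIC_COLS])
    if cat_imputer is None:
        cat_imputer = SimpleImputer(strategy="most_frequent").fit(df[CATEGORICAL_COLS])
    df[NUMERIC_COLS] = num_imputer.transform(df[NUMERIC_COLS])
    df[CATEGORICAL_COLS] = cat_imputer.transform(df[CATEGORICAL_COLS])
    return df, num_imputer, cat_imputer
```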
Feature engineering transforms raw data into a format that maximizes model performance. For our buyer intent model, we focus on two critical transformations: categorical encoding and feature definitions.
Most machine learning models require numerical inputs, so we convert categorical features like industry and tech_stack into numerical representations using one-hot encoding. This creates binary (0/1) columns for each unique category, allowing the model to learn specific effects of different industries or technologies.
One-Hot Encoding
Transforms categorical columns into multiple binary features (e.g., industry_Software, industry_Finance) that the model can process
Cardinality Considerations
For high-cardinality features (many unique values), consider alternatives like frequency encoding, target encoding, or embeddings
Feature Selection
Remove identifier columns like account_id from the feature set while preserving them for later reference
After preprocessing, our feature matrix (X) contains 20 columns, while our target vector (y) contains the binary conversion indicator. This transformation prepares our data for efficient model training while preserving the predictive signal in categorical variables.
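One way to express the encoding and feature-selection steps above, using pandas. The function names and the exact column set are illustrative; pd.get_dummies handles the one-hot transformation, while account_id is set aside rather than dropped so scored accounts can be matched back to the CRM.

```python
import pandas as pd

def make_features(df, target="converted", id_col="account_id"):
    ids = df[id_col]                 # kept for later reference, not for modeling
    y = df[target].astype(int)       # binary conversion indicator
    X = df.drop(columns=[id_col, target])
    # One-hot encode the nominal categoricals into binary indicator columns
    # (e.g. industry_Software, industry_Finance).
    X = pd.get_dummies(X, columns=["industry", "tech_stack"])
    return X, y, ids
```

For high-cardinality columns, replacing get_dummies with frequency or target encoding (as noted above) keeps the feature matrix from exploding in width.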
B2B buyer intent prediction typically faces significant class imbalance, with converting accounts (positive class) being much rarer than non-converting accounts (negative class). In our dataset, only 7.5% of accounts convert, creating a ratio of approximately 12.33:1 between negative and positive examples.
This imbalance can cause models to be biased toward predicting the majority class (non-converters), resulting in poor recall for the minority class. To address this challenge, we use XGBoost's scale_pos_weight parameter.
The Problem
In our dataset: 4,625 negative examples vs. only 375 positive examples (7.5% conversion rate)
The Solution
We calculate scale_pos_weight as the ratio of negative to positive examples:
scale_pos_weight = 4,625 ÷ 375 = 12.33
This parameter scales the gradient contribution of positive examples during training, effectively giving them more importance and counteracting the imbalance.
By applying this weighting strategy, we help the model better identify patterns associated with conversion, improving its ability to detect true positive cases while maintaining reasonable precision.
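Rather than hard-coding 12.33, the weight can be computed directly from the labels, so it stays correct as the dataset changes:

```python
import numpy as np

def pos_weight(y):
    # scale_pos_weight = (# negative examples) / (# positive examples)
    y = np.asarray(y)
    return (y == 0).sum() / (y == 1).sum()
```

With the counts from the text (4,625 negatives, 375 positives) this evaluates to roughly 12.33.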
Proper dataset splitting is crucial for reliable model evaluation. We split our data into training (80%) and testing (20%) sets, preserving the class distribution across both sets through stratification.
test_size=0.2: Reserves 20% of data for evaluation
stratify=y: Maintains class proportion in both sets
random_state=42: Ensures reproducibility
Stratification is particularly important for imbalanced datasets like ours. By using stratify=y, we ensure that both training and test sets have the same proportion of converting accounts (7.5%), which gives us a more reliable performance estimate and prevents scenarios where the test set might end up with too few positive examples for meaningful evaluation.
The final split results in 4,000 accounts for training (300 positive, 3,700 negative) and 1,000 accounts for testing (75 positive, 925 negative), maintaining the 7.5% conversion rate in both sets.
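The split described above is a single scikit-learn call. The feature matrix here is a random stand-in with the document's class counts, just to show that stratification reproduces the 7.5% conversion rate exactly in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix: 5,000 accounts, 20 features,
# 375 positives (7.5%) as in the text.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = np.array([1] * 375 + [0] * 4625)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% held out for evaluation
    stratify=y,         # preserve the 7.5% positive rate in both sets
    random_state=42,    # reproducibility
)
```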
XGBoost (Extreme Gradient Boosting) is our algorithm of choice for buyer intent prediction due to its exceptional performance on tabular data, handling of missing values, and built-in support for class imbalance.
We initialize the XGBoost classifier with carefully selected parameters to address the specific challenges of B2B buyer intent prediction:
objective='binary:logistic'
Specifies that we're solving a binary classification problem (convert/don't convert) with logistic loss function
use_label_encoder=False
Suppresses a deprecation warning on XGBoost 1.x; the parameter was removed in XGBoost 2.0 and can be omitted entirely on current versions
eval_metric='logloss'
Defines logarithmic loss as our evaluation metric during training, suitable for binary classification
scale_pos_weight=12.33
Applies our calculated class weight to address the 12.33:1 imbalance between negative and positive examples
While we use default values for other parameters in this implementation, production models might benefit from hyperparameter tuning through grid search or Bayesian optimization to find optimal values for learning_rate, max_depth, min_child_weight, and other XGBoost parameters.
For imbalanced classification problems like buyer intent prediction, accuracy alone is misleading. A model that simply predicts "no conversion" for all accounts could achieve 92.5% accuracy in our dataset but would be useless for identifying high-intent accounts.
We use a comprehensive evaluation framework that provides deeper insight into model performance:
Classification Report
Shows precision (correctness of positive predictions), recall (coverage of true positives), and F1-score (harmonic mean of precision and recall) for each class
ROC-AUC & PR-AUC
ROC-AUC: 0.5173 - Measures ranking ability across all thresholds (0.5 is random)
PR-AUC: 0.0812 - More informative for imbalanced data, focuses on precision-recall tradeoff
Our current model shows slight improvement over random prediction (ROC-AUC > 0.5), but there's substantial room for improvement through feature engineering and parameter tuning.
The low precision and recall for the positive class (0.10 and 0.05) indicate that our model struggles to identify converting accounts. This is a common challenge in highly imbalanced datasets and suggests we should consider additional features or more sophisticated modeling approaches.
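The evaluation framework above can be packaged as a small helper. This is one possible shape for it, using scikit-learn's metrics; average_precision_score serves as the PR-AUC estimate.

```python
import numpy as np
from sklearn.metrics import (classification_report, roc_auc_score,
                             average_precision_score)

def evaluate(y_true, proba, threshold=0.5):
    # Hard predictions for the classification report; ranking metrics
    # (ROC-AUC, PR-AUC) use the raw probabilities.
    preds = (proba >= threshold).astype(int)
    return {
        "report": classification_report(y_true, preds, digits=2),
        "roc_auc": roc_auc_score(y_true, proba),
        "pr_auc": average_precision_score(y_true, proba),
    }
```

For a sales use case, the 0.5 threshold is itself a tunable lever: lowering it trades precision for recall on the rare positive class.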
Understanding why a model makes certain predictions is crucial for gaining business trust and identifying potential issues. SHAP (SHapley Additive exPlanations) values provide a robust framework for interpreting our buyer intent model.
How SHAP Works
SHAP uses game theory to calculate the contribution of each feature to a prediction, showing both the magnitude and direction of each feature's impact. TreeExplainer, which is optimized for tree-based models like XGBoost, computes these explanations efficiently and exactly.
Key Insights
The SHAP summary plot shows features ranked by importance, with each point representing a sample. Color indicates feature value (red = high, blue = low) and position shows impact on prediction.
From the SHAP analysis, we can identify the most influential features for predicting buyer intent. This information helps validate the model's logic (do the important features align with business understanding?), detect potential confounds or leakage variables, and guide future feature engineering efforts by highlighting areas with strong predictive signal.
Machine learning models deployed in production environments face the challenge of data drift: changes in the underlying data distributions that can degrade model performance over time. Implementing a robust monitoring and retraining framework is essential for maintaining model effectiveness.
Performance Monitoring
Track ROC-AUC, PR-AUC, and business KPIs on new data
Drift Detection
Monitor feature and prediction distributions using PSI (Population Stability Index)
Retraining Triggers
Define automatic retraining rules based on performance thresholds and drift metrics
Model Refresh
Retrain model with recent data and update production deployment
Implementing this monitoring cycle requires integration with MLOps tools and infrastructure. Key thresholds to define include minimum acceptable ROC-AUC (e.g., 0.70), maximum acceptable PSI for features and predictions (e.g., 0.20), and regular retraining schedules (e.g., monthly) regardless of performance to capture gradual shifts in buyer behavior patterns.
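The PSI check in the drift-detection step above is straightforward to compute. This is one common formulation (decile bins on the baseline sample); the 0.20 ceiling suggested in the text would be applied to the returned value.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample ('expected',
    e.g. training data) and a new sample ('actual', e.g. this week's scores).
    PSI = sum((actual% - expected%) * ln(actual% / expected%)) over bins."""
    # Bin edges from the baseline's quantiles; outer edges opened to +/-inf
    # so new out-of-range values still land in a bin.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    # Avoid log(0) / division by zero for empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

Running psi over each feature column and over the prediction scores covers both halves of the drift-detection step.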
An overall performance metric can mask significant variations in model effectiveness across different business segments. Segment-based evaluation helps identify where the model performs well and where it struggles, enabling more targeted improvements.
Industry Segments
Evaluate model performance separately for each industry (Software, Manufacturing, Finance, etc.) to identify segments where the model may need specialized features or separate modeling approaches
Company Size Segments
Analyze performance across different company size bands to ensure the model works equally well for SMBs and enterprise accounts
Geographic Segments
Check for performance differences across regions to detect potential biases or market-specific patterns
For each segment, calculate the same evaluation metrics (ROC-AUC, precision, recall) as used for the overall model. Significant performance gaps may indicate the need for segment-specific features, separate models for certain segments, or data augmentation for underrepresented segments where performance is poor.
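The per-segment evaluation described above amounts to a groupby over the scored accounts. A minimal sketch, assuming a DataFrame holding the segment column, the true label, and the model's score (column names here are illustrative):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def segment_auc(df, segment_col, label_col="converted", score_col="intent_score"):
    # ROC-AUC per segment; segments with only one class are skipped
    # because AUC is undefined there.
    rows = []
    for seg, grp in df.groupby(segment_col):
        if grp[label_col].nunique() < 2:
            continue
        rows.append({segment_col: seg,
                     "n": len(grp),
                     "roc_auc": roc_auc_score(grp[label_col], grp[score_col])})
    return pd.DataFrame(rows)
```

The same pattern extends to precision and recall per segment, and to company-size or region columns in place of industry.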
Our initial model evaluation revealed modest performance (ROC-AUC of 0.5173), indicating room for improvement. Here are strategies to enhance model effectiveness:
Feature Enhancement
Add new features like engagement recency, interaction velocity, and competitive research signals
Advanced Feature Engineering
Create interaction terms between features and apply non-linear transformations to capture complex relationships
Hyperparameter Optimization
Conduct systematic tuning of XGBoost parameters using grid search or Bayesian optimization
Ensemble Approaches
Combine multiple models (e.g., XGBoost, LightGBM, neural networks) to leverage diverse learning patterns
Additionally, consider specialized techniques for imbalanced classification such as SMOTE for synthetic minority oversampling, or implementing custom loss functions that penalize false negatives more heavily. For deployment, explore calibrating model outputs to produce more reliable probability estimates that sales teams can use for prioritization.
Implementing a production-ready buyer intent prediction system involves several key components beyond the core model development covered in this document. Here's a roadmap for moving from prototype to production:
Pipeline Orchestration
Integrate the entire workflow (data retrieval, preprocessing, training, evaluation) into an orchestration tool like Airflow, Kubeflow, or Databricks Workflows
Model Registry & Versioning
Implement MLflow to track experiments, register models, manage versions, and store preprocessing artifacts
Deployment Architecture
Design serving infrastructure for batch scoring (for CRM integration) and potentially real-time API endpoints (for website personalization)
Monitoring Dashboard
Create visualization tools for tracking model performance, data drift, and business impact metrics over time
Sales Enablement
Develop integration with sales tools and training materials to help teams effectively use model outputs
Throughout implementation, maintain close collaboration between data scientists, engineers, and business stakeholders to ensure the system addresses real business needs and delivers measurable ROI through improved conversion rates and sales efficiency.