Medial Code Documentation
xgboost.spark.data Namespace Reference

Data Structures

class  PartIter
 

Functions

np.ndarray stack_series (pd.Series series)
 
Optional[np.ndarray] concat_or_none (Optional[Sequence[np.ndarray]] seq)
 
None cache_partitions (Iterator[pd.DataFrame] iterator, Callable[[pd.DataFrame, str, bool], None] append)
 
csr_matrix _read_csr_matrix_from_unwrapped_spark_vec (pd.DataFrame part)
 
DMatrix make_qdm (Dict[str, List[np.ndarray]] data, Optional[int] dev_ordinal, Dict[str, Any] meta, Optional[DMatrix] ref, Dict[str, Any] params)
 
Tuple[DMatrix, Optional[DMatrix]] create_dmatrix_from_partitions (Iterator[pd.DataFrame] iterator, Optional[Sequence[str]] feature_cols, Optional[int] dev_ordinal, bool use_qdm, Dict[str, Any] kwargs, bool enable_sparse_data_optim, bool has_validation_col)
 
np.ndarray pred_contribs (XGBModel model, ArrayLike data, Optional[ArrayLike] base_margin=None, bool strict_shape=False)
 

Variables

 Alias = namedtuple("Alias", ("data", "label", "weight", "margin", "valid", "qid"))
 
 alias = Alias("values", "label", "weight", "baseMargin", "validationIndicator", "qid")
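These aliases map logical dataset roles to the Spark column names used internally. A minimal illustration (definitions copied from above, with assertions added for clarity):

    from collections import namedtuple

    Alias = namedtuple("Alias", ("data", "label", "weight", "margin", "valid", "qid"))
    alias = Alias("values", "label", "weight", "baseMargin", "validationIndicator", "qid")

    # Attribute access resolves a role to its Spark column name.
    assert alias.data == "values"
    assert alias.margin == "baseMargin"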
 

Detailed Description

Utilities for processing Spark partitions.

Function Documentation

◆ cache_partitions()

None xgboost.spark.data.cache_partitions ( Iterator[pd.DataFrame]  iterator,
Callable[[pd.DataFrame, str, bool], None]   append 
)
Extract partitions from a PySpark iterator. `append` is a user-defined function for
accepting a new partition.
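A minimal sketch of the `append` contract, assuming the callback receives one column chunk per call along with its alias name and a flag marking validation rows; the `collected` dict and the key scheme below are hypothetical:

    from typing import Dict, List
    import numpy as np
    import pandas as pd

    collected: Dict[str, List[np.ndarray]] = {}

    # Hypothetical callback matching Callable[[pd.DataFrame, str, bool], None].
    def append(part: pd.DataFrame, name: str, is_valid: bool) -> None:
        # Cache the named column of this partition, keeping training and
        # validation rows in separate buckets.
        if name in part.columns:
            key = ("valid_" if is_valid else "train_") + name
            collected.setdefault(key, []).append(part[name].to_numpy())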

◆ concat_or_none()

Optional[np.ndarray] xgboost.spark.data.concat_or_none ( Optional[Sequence[np.ndarray]]  seq)
Concatenate the arrays if the input sequence is not None; otherwise return None.
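A minimal equivalent sketch, assuming concatenation along the first axis:

    from typing import Optional, Sequence
    import numpy as np

    def concat_or_none_sketch(seq: Optional[Sequence[np.ndarray]]) -> Optional[np.ndarray]:
        # Pass None through untouched; otherwise join the arrays end to end.
        if seq is None:
            return None
        return np.concatenate(seq)

    assert concat_or_none_sketch(None) is None
    assert concat_or_none_sketch([np.ones(2), np.zeros(2)]).shape == (4,)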

◆ create_dmatrix_from_partitions()

Tuple[DMatrix, Optional[DMatrix]] xgboost.spark.data.create_dmatrix_from_partitions ( Iterator[pd.DataFrame]  iterator,
Optional[Sequence[str]]  feature_cols,
Optional[int]  dev_ordinal,
bool  use_qdm,
Dict[str, Any]  kwargs,
bool  enable_sparse_data_optim,
bool  has_validation_col 
)
Create a DMatrix from Spark data partitions.

Parameters
----------
iterator :
    PySpark partition iterator.
feature_cols :
    A sequence of feature names, used only when the RAPIDS plugin is enabled.
dev_ordinal :
    Device ordinal, used when GPU training is enabled.
use_qdm :
    Whether QuantileDMatrix should be used instead of DMatrix.
kwargs :
    Metainfo for DMatrix.
enable_sparse_data_optim :
    Whether sparse data should be unwrapped.
has_validation_col :
    Whether there's validation data.

Returns
-------
Training DMatrix and an optional validation DMatrix.
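A schematic sketch of an invocation from inside a Spark task; `partition_iter` stands in for the Iterator[pd.DataFrame] that Spark's mapInPandas hands to the training task, and the argument values shown are illustrative only:

    # Hypothetical call shape, mapping each parameter to a CPU training setup.
    dtrain, dvalid = create_dmatrix_from_partitions(
        iterator=partition_iter,           # placeholder Iterator[pd.DataFrame]
        feature_cols=None,                 # only consulted with the RAPIDS plugin
        dev_ordinal=None,                  # no GPU assigned to this task
        use_qdm=True,                      # build a QuantileDMatrix for hist training
        kwargs={"missing": float("nan")},  # metainfo forwarded to the DMatrix
        enable_sparse_data_optim=False,
        has_validation_col=False,
    )
    # dvalid is None when has_validation_col is False.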

◆ make_qdm()

DMatrix xgboost.spark.data.make_qdm ( Dict[str, List[np.ndarray]]  data,
Optional[int]  dev_ordinal,
Dict[str, Any]  meta,
Optional[DMatrix]  ref,
Dict[str, Any]  params 
)
Create a QuantileDMatrix from cached partition data, handling the empty-partition case.
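A minimal sketch of the construction, assuming the `data` dict maps the `alias.data` key to cached per-partition arrays that are concatenated before building the QuantileDMatrix; `max_bin` here is an illustrative parameter:

    import numpy as np
    from xgboost import QuantileDMatrix

    # Illustrative Dict[str, List[np.ndarray]] payload: two cached partitions.
    data = {"values": [np.random.rand(8, 4), np.random.rand(8, 4)]}

    X = np.concatenate(data["values"])      # join the cached chunks
    qdm = QuantileDMatrix(X, max_bin=256)   # ref/meta/params omitted for brevity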

◆ pred_contribs()

np.ndarray xgboost.spark.data.pred_contribs ( XGBModel  model,
ArrayLike  data,
Optional[ArrayLike]   base_margin = None,
bool   strict_shape = False 
)
Predict feature contributions for the data using the full model.
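A sketch of the equivalent Booster call, assuming the contributions come from Booster.predict with pred_contribs=True; the toy model below is illustrative:

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(16, 4)
    y = np.random.rand(16)
    model = xgb.XGBRegressor(n_estimators=4).fit(X, y)

    contribs = model.get_booster().predict(
        xgb.DMatrix(X),
        pred_contribs=True,   # SHAP-style per-feature contributions
        strict_shape=False,
    )
    # One column per feature plus a trailing bias column.
    assert contribs.shape == (16, 5)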

◆ stack_series()

np.ndarray xgboost.spark.data.stack_series ( pd.Series  series)
Stack a pandas Series of arrays into a single ndarray.
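A minimal equivalent sketch:

    import numpy as np
    import pandas as pd

    s = pd.Series([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
    stacked = np.stack(s.to_numpy())   # what stack_series effectively does
    assert stacked.shape == (2, 2)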