Medial Code Documentation
DMatrix used for external memory. More...
#include <sparse_page_dmatrix.h>
Public Member Functions | |
SparsePageDMatrix (DataIterHandle iter, DMatrixHandle proxy, DataIterResetCallback *reset, XGDMatrixCallbackNext *next, float missing, int32_t nthreads, std::string cache_prefix) | |
MetaInfo & | Info () override |
const MetaInfo & | Info () const override |
Context const * | Ctx () const override |
bool | SingleColBlock () const override |
DMatrix * | Slice (common::Span< int32_t const >) override |
DMatrix * | SliceCol (int, int) override |
None | __init__ (self, DataType data, Optional[ArrayLike] label=None, *, Optional[ArrayLike] weight=None, Optional[ArrayLike] base_margin=None, Optional[float] missing=None, bool silent=False, Optional[FeatureNames] feature_names=None, Optional[FeatureTypes] feature_types=None, Optional[int] nthread=None, Optional[ArrayLike] group=None, Optional[ArrayLike] qid=None, Optional[ArrayLike] label_lower_bound=None, Optional[ArrayLike] label_upper_bound=None, Optional[ArrayLike] feature_weights=None, bool enable_categorical=False, DataSplitMode data_split_mode=DataSplitMode.ROW) |
None | __del__ (self) |
None | set_info (self, *, Optional[ArrayLike] label=None, Optional[ArrayLike] weight=None, Optional[ArrayLike] base_margin=None, Optional[ArrayLike] group=None, Optional[ArrayLike] qid=None, Optional[ArrayLike] label_lower_bound=None, Optional[ArrayLike] label_upper_bound=None, Optional[FeatureNames] feature_names=None, Optional[FeatureTypes] feature_types=None, Optional[ArrayLike] feature_weights=None) |
np.ndarray | get_float_info (self, str field) |
np.ndarray | get_uint_info (self, str field) |
None | set_float_info (self, str field, ArrayLike data) |
None | set_float_info_npy2d (self, str field, ArrayLike data) |
None | set_uint_info (self, str field, ArrayLike data) |
None | save_binary (self, Union[str, os.PathLike] fname, bool silent=True) |
None | set_label (self, ArrayLike label) |
None | set_weight (self, ArrayLike weight) |
None | set_base_margin (self, ArrayLike margin) |
None | set_group (self, ArrayLike group) |
np.ndarray | get_label (self) |
np.ndarray | get_weight (self) |
np.ndarray | get_base_margin (self) |
np.ndarray | get_group (self) |
scipy.sparse.csr_matrix | get_data (self) |
Tuple[np.ndarray, np.ndarray] | get_quantile_cut (self) |
int | num_row (self) |
int | num_col (self) |
int | num_nonmissing (self) |
"DMatrix" | slice (self, Union[List[int], np.ndarray] rindex, bool allow_groups=False) |
Optional[FeatureNames] | feature_names (self) |
None | feature_names (self, Optional[FeatureNames] feature_names) |
Optional[FeatureTypes] | feature_types (self) |
None | feature_types (self, Optional[FeatureTypes] feature_types) |
Additional Inherited Members | |
missing | |
nthread | |
silent | |
handle | |
feature_names | |
feature_types | |
None | _init_from_iter (self, DataIter iterator, bool enable_categorical) |
DMatrix used for external memory.
External memory support controls memory usage by splitting the data into multiple batches. However, that does not mean we process exactly one batch at a time, which would be terribly slow given that we have to loop through the whole dataset for every tree split. Instead, we pre-fetch pages asynchronously and let the caller decide how many batches it wants to process, by returning each batch as a shared pointer. The caller can process the data with async functions or stage batches, depending on its use case. These optimizations can defeat the purpose of splitting up the dataset: if the caller stages all the batches, memory usage might be even worse than using a single batch. As a result, we must control how many batches can be in memory at any given time.
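The pre-fetch-with-a-bound idea can be sketched as follows. This is a stdlib-only illustration with hypothetical names (`bounded_prefetch`, `batch_source`), not the actual XGBoost implementation: a bounded queue caps how many pre-fetched batches can be in memory while a background thread loads ahead.

```python
import queue
import threading

def bounded_prefetch(batch_source, max_in_memory=3):
    """Yield batches from batch_source while a background thread
    pre-fetches at most max_in_memory batches ahead."""
    q = queue.Queue(maxsize=max_in_memory)  # hard upper bound on buffered batches
    SENTINEL = object()                     # marks the end of the stream

    def producer():
        for batch in batch_source:
            q.put(batch)  # blocks once max_in_memory batches are buffered
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is SENTINEL:
            return
        yield batch

# Usage: iterate lazily; only a bounded number of batches is buffered at once.
batches = list(bounded_prefetch(iter([[1, 2], [3, 4], [5, 6]])))
```

The blocking `put` is what enforces the upper limit: the producer stalls until the consumer drains a batch, so pre-fetching can never race ahead of the bound.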
Right now, writing to the cache is a sequential, blocking operation. Reading from the cache, on the other hand, is async, with a hard-coded limit of 3 pre-fetched pages as a heuristic. So the sparse DMatrix itself can hold at most 7 pages in main memory (possibly of different types) at the same time: 1 page pending write, 3 pre-fetched sparse pages, and 3 pre-fetched dependent pages.
Of course, if the caller decides to retain batches to perform parallel processing, we might end up loading all pages into memory; this is considered a bug in the caller's code. So if an algorithm supports external memory, it must ensure that its queue of async calls has an upper limit.
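The difference between streaming pages and staging them can be demonstrated with a small sketch (hypothetical `Page`/`stream_pages` names; relies on CPython's reference counting to free pages promptly, so it illustrates the caller-side bug rather than XGBoost's internals):

```python
import weakref

class Page:
    """Stand-in for one data page."""
    def __init__(self, rows):
        self.rows = rows

def stream_pages(n_pages):
    for i in range(n_pages):
        yield Page([i] * 4)

live = weakref.WeakSet()  # tracks pages that are still referenced somewhere

# Correct usage: process each page, drop the reference, move on.
for page in stream_pages(100):
    live.add(page)
    _ = sum(page.rows)
# Earlier pages were freed as we went; at most the last one is still alive.
assert len(live) <= 1

# Buggy usage: staging every page keeps the whole dataset in memory,
# defeating the point of external memory.
staged = [p for p in stream_pages(100)]
for p in staged:
    live.add(p)
assert len(live) >= 100  # nothing was freed while staged is held
```

The same accounting applies to shared-pointer pages in C++: as long as the caller holds a reference, the page cannot be evicted, so an unbounded stash of references means an unbounded working set.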
Another assumption we make is that the data is immutable, so the caller must never modify it. The sparse page source returns const pages to enforce this. If you need to change a generated page, such as an Ellpack page, pass parameters into GetBatches to re-generate it instead of trying to modify the pages in place.
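The "const page" guarantee can be mimicked in a stdlib-only sketch (hypothetical `make_page` helper, not the XGBoost API): hand callers a read-only view, so any attempted in-place mutation fails and the only way to get different contents is to build a new page.

```python
def make_page(values):
    """Return a read-only view over a page's buffer, analogous to the
    const pages returned by the sparse page source."""
    buf = bytes(values)     # immutable backing store
    return memoryview(buf)  # read-only view: writes raise TypeError

page = make_page([1, 2, 3])
assert page.readonly

try:
    page[0] = 9             # attempted in-place mutation
except TypeError:
    mutation_rejected = True

# The original contents are untouched; a changed page must be re-generated.
assert mutation_rejected and page[0] == 1
```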
A possible optimization is dropping the sparse page once dependent pages, such as the Ellpack page, have been constructed and cached.