xgboost.data::SparsePageDMatrix Class Reference

DMatrix used for external memory. More...

#include <sparse_page_dmatrix.h>

Inheritance diagram for xgboost.data::SparsePageDMatrix:
xgboost.core.DMatrix

Public Member Functions

 SparsePageDMatrix (DataIterHandle iter, DMatrixHandle proxy, DataIterResetCallback *reset, XGDMatrixCallbackNext *next, float missing, int32_t nthreads, std::string cache_prefix)
 
MetaInfo & Info () override
 
const MetaInfo & Info () const override
 
Context const * Ctx () const override
 
bool SingleColBlock () const override
 
DMatrix * Slice (common::Span< int32_t const >) override
 
DMatrix * SliceCol (int, int) override
 
- Public Member Functions inherited from xgboost.core.DMatrix
None __init__ (self, DataType data, Optional[ArrayLike] label=None, *, Optional[ArrayLike] weight=None, Optional[ArrayLike] base_margin=None, Optional[float] missing=None, bool silent=False, Optional[FeatureNames] feature_names=None, Optional[FeatureTypes] feature_types=None, Optional[int] nthread=None, Optional[ArrayLike] group=None, Optional[ArrayLike] qid=None, Optional[ArrayLike] label_lower_bound=None, Optional[ArrayLike] label_upper_bound=None, Optional[ArrayLike] feature_weights=None, bool enable_categorical=False, DataSplitMode data_split_mode=DataSplitMode.ROW)
 
None __del__ (self)
 
None set_info (self, *, Optional[ArrayLike] label=None, Optional[ArrayLike] weight=None, Optional[ArrayLike] base_margin=None, Optional[ArrayLike] group=None, Optional[ArrayLike] qid=None, Optional[ArrayLike] label_lower_bound=None, Optional[ArrayLike] label_upper_bound=None, Optional[FeatureNames] feature_names=None, Optional[FeatureTypes] feature_types=None, Optional[ArrayLike] feature_weights=None)
 
np.ndarray get_float_info (self, str field)
 
np.ndarray get_uint_info (self, str field)
 
None set_float_info (self, str field, ArrayLike data)
 
None set_float_info_npy2d (self, str field, ArrayLike data)
 
None set_uint_info (self, str field, ArrayLike data)
 
None save_binary (self, Union[str, os.PathLike] fname, bool silent=True)
 
None set_label (self, ArrayLike label)
 
None set_weight (self, ArrayLike weight)
 
None set_base_margin (self, ArrayLike margin)
 
None set_group (self, ArrayLike group)
 
np.ndarray get_label (self)
 
np.ndarray get_weight (self)
 
np.ndarray get_base_margin (self)
 
np.ndarray get_group (self)
 
scipy.sparse.csr_matrix get_data (self)
 
Tuple[np.ndarray, np.ndarray] get_quantile_cut (self)
 
int num_row (self)
 
int num_col (self)
 
int num_nonmissing (self)
 
"DMatrix" slice (self, Union[List[int], np.ndarray] rindex, bool allow_groups=False)
 
Optional[FeatureNames] feature_names (self)
 
None feature_names (self, Optional[FeatureNames] feature_names)
 
Optional[FeatureTypes] feature_types (self)
 
None feature_types (self, Optional[FeatureTypes] feature_types)
 

Additional Inherited Members

- Data Fields inherited from xgboost.core.DMatrix
 missing
 
 nthread
 
 silent
 
 handle
 
 feature_names
 
 feature_types
 
- Protected Member Functions inherited from xgboost.core.DMatrix
None _init_from_iter (self, DataIter iterator, bool enable_categorical)
 

Detailed Description

DMatrix used for external memory.

External memory is used to control memory consumption by splitting the data into multiple batches. That does not mean we actually process exactly one batch at a time, which would be terribly slow given that we have to loop through the whole dataset for every tree split. Instead, we pre-fetch pages asynchronously and let the caller decide how many batches it wants to hold by returning the data as a shared pointer. The caller can process the data with asynchronous functions or stage batches as its use case requires. These two optimizations could defeat the purpose of splitting up the dataset: if all batches are staged, memory usage may be even worse than with a single batch. As a result, we must control how many batches can be in memory at any given time.
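The bounded pre-fetching described above can be sketched in a few lines. This is an illustrative Python sketch, not the library's implementation: `load_page` is a hypothetical stand-in for reading one page from the cache, and the window size mirrors the hard-coded prefetch limit mentioned below.

```python
import concurrent.futures

PREFETCH_LIMIT = 3  # at most this many reads in flight at once


def load_page(idx):
    # Hypothetical stand-in for fetching one cached page from disk.
    return [idx] * 4


def iterate_pages(n_pages, limit=PREFETCH_LIMIT):
    """Yield pages in order while keeping at most `limit` fetches pending."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=limit) as pool:
        # Prime the window with the first `limit` fetches.
        pending = [pool.submit(load_page, i) for i in range(min(limit, n_pages))]
        nxt = len(pending)
        for _ in range(n_pages):
            page = pending.pop(0).result()  # block on the oldest fetch
            if nxt < n_pages:
                # Refill the window so it never exceeds `limit` entries.
                pending.append(pool.submit(load_page, nxt))
                nxt += 1
            yield page


total = sum(sum(p) for p in iterate_pages(8))
```

The key property is that the window is refilled only as pages are consumed, so memory for pending pages stays bounded regardless of how large the dataset is.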

Right now, writing to the cache is a sequential, blocking operation. Reading from the cache, on the other hand, is asynchronous but with a hard-coded limit of 3 pages as a heuristic. So the sparse DMatrix itself can hold at most 7 pages in main memory (possibly of different types) at the same time: 1 page pending write, 3 pre-fetched sparse pages, and 3 pre-fetched dependent pages.

Of course, if the caller decides to retain some batches for parallel processing, we might end up loading all pages into memory, which is also considered a bug in the caller's code. So an algorithm that supports external memory must ensure that its queue of asynchronous calls has an upper limit.

Another assumption we make is that the data is immutable, so the caller should never change it. The sparse page source returns const pages to enforce this. To change a generated page such as Ellpack, pass parameters into GetBatches to re-generate it instead of trying to modify the pages in place.

A possible optimization is dropping the sparse page once dependent pages such as Ellpack have been constructed and cached.
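The constructor above takes a data iterator plus reset and next callbacks (DataIterResetCallback, XGDMatrixCallbackNext), which is how batches are pulled in on demand. A minimal sketch of that protocol, with hypothetical names and no dependency on the real C callbacks, might look like this:

```python
class BatchIterator:
    """Illustrative batch source following a next/reset callback protocol."""

    def __init__(self, batches):
        self._batches = batches
        self._pos = 0

    def next(self):
        # Mirrors XGDMatrixCallbackNext: hand out the next batch,
        # or signal exhaustion with None.
        if self._pos == len(self._batches):
            return None
        batch = self._batches[self._pos]
        self._pos += 1
        return batch

    def reset(self):
        # Mirrors DataIterResetCallback: rewind so the data can be
        # traversed again (e.g. once per boosting iteration).
        self._pos = 0


it = BatchIterator([[1, 2], [3, 4]])
first_pass = []
while (b := it.next()) is not None:
    first_pass.append(b)
it.reset()
```

Because tree construction loops over the whole dataset repeatedly, the reset callback is essential: the consumer rewinds the iterator and replays the batches rather than holding them all in memory.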


The documentation for this class was generated from the following files: