Medial Code Documentation
DMatrix used for external memory. More...
#include <sparse_page_dmatrix.h>
Public Member Functions | |
SparsePageDMatrix (DataIterHandle iter, DMatrixHandle proxy, DataIterResetCallback *reset, XGDMatrixCallbackNext *next, float missing, int32_t nthreads, std::string cache_prefix) | |
MetaInfo & | Info () override |
const MetaInfo & | Info () const override |
Context const * | Ctx () const override |
bool | SingleColBlock () const override |
DMatrix * | Slice (common::Span< int32_t const >) override |
DMatrix * | SliceCol (int, int) override |
None | __init__ (self, DataType data, Optional[ArrayLike] label=None, *, Optional[ArrayLike] weight=None, Optional[ArrayLike] base_margin=None, Optional[float] missing=None, bool silent=False, Optional[FeatureNames] feature_names=None, Optional[FeatureTypes] feature_types=None, Optional[int] nthread=None, Optional[ArrayLike] group=None, Optional[ArrayLike] qid=None, Optional[ArrayLike] label_lower_bound=None, Optional[ArrayLike] label_upper_bound=None, Optional[ArrayLike] feature_weights=None, bool enable_categorical=False, DataSplitMode data_split_mode=DataSplitMode.ROW) |
None | __del__ (self) |
None | set_info (self, *, Optional[ArrayLike] label=None, Optional[ArrayLike] weight=None, Optional[ArrayLike] base_margin=None, Optional[ArrayLike] group=None, Optional[ArrayLike] qid=None, Optional[ArrayLike] label_lower_bound=None, Optional[ArrayLike] label_upper_bound=None, Optional[FeatureNames] feature_names=None, Optional[FeatureTypes] feature_types=None, Optional[ArrayLike] feature_weights=None) |
np.ndarray | get_float_info (self, str field) |
np.ndarray | get_uint_info (self, str field) |
None | set_float_info (self, str field, ArrayLike data) |
None | set_float_info_npy2d (self, str field, ArrayLike data) |
None | set_uint_info (self, str field, ArrayLike data) |
None | save_binary (self, Union[str, os.PathLike] fname, bool silent=True) |
None | set_label (self, ArrayLike label) |
None | set_weight (self, ArrayLike weight) |
None | set_base_margin (self, ArrayLike margin) |
None | set_group (self, ArrayLike group) |
np.ndarray | get_label (self) |
np.ndarray | get_weight (self) |
np.ndarray | get_base_margin (self) |
np.ndarray | get_group (self) |
scipy.sparse.csr_matrix | get_data (self) |
Tuple[np.ndarray, np.ndarray] | get_quantile_cut (self) |
int | num_row (self) |
int | num_col (self) |
int | num_nonmissing (self) |
"DMatrix" | slice (self, Union[List[int], np.ndarray] rindex, bool allow_groups=False) |
Optional[FeatureNames] | feature_names (self) |
None | feature_names (self, Optional[FeatureNames] feature_names) |
Optional[FeatureTypes] | feature_types (self) |
None | feature_types (self, Optional[FeatureTypes] feature_types) |
Additional Inherited Members | |
missing | |
nthread | |
silent | |
handle | |
feature_names | |
feature_types | |
None | _init_from_iter (self, DataIter iterator, bool enable_categorical) |
DMatrix used for external memory.
External memory support controls memory usage by splitting the data into multiple batches. However, that does not mean we process exactly one batch at a time, which would be terribly slow given that we have to loop through the whole dataset for every tree split. Instead, we pre-fetch pages asynchronously and let the caller decide how many batches it wants to process, by returning each batch as a shared pointer. The caller can process the data with async functions or stage batches, depending on its use case. These optimizations can defeat the purpose of splitting up the dataset: if the caller stages all the batches, memory usage might be even worse than using a single batch. As a result, we must control how many batches can be in memory at any given time.
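The pre-fetch-with-a-bound idea can be sketched as follows. This is a stdlib-only illustration with hypothetical names (`bounded_prefetch`, `batch_source`), not the actual XGBoost implementation: a bounded queue caps how many pre-fetched batches can be in memory while a background thread loads ahead.

```python
import queue
import threading

def bounded_prefetch(batch_source, max_in_memory=3):
    """Yield batches from batch_source while a background thread
    pre-fetches at most max_in_memory batches ahead."""
    q = queue.Queue(maxsize=max_in_memory)  # hard upper bound on buffered batches
    SENTINEL = object()                     # marks the end of the stream

    def producer():
        for batch in batch_source:
            q.put(batch)  # blocks once max_in_memory batches are buffered
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is SENTINEL:
            return
        yield batch

# Usage: iterate lazily; only a bounded number of batches is buffered at once.
batches = list(bounded_prefetch(iter([[1, 2], [3, 4], [5, 6]])))
```

The blocking `put` is what enforces the upper limit: the producer stalls until the consumer drains a batch, so pre-fetching can never race ahead of the bound.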
Right now, writing to the cache is a sequential, blocking operation. Reading from the cache, on the other hand, is async, with a hard-coded limit of 3 pre-fetched pages as a heuristic. So the sparse DMatrix itself can hold at most 7 pages in main memory (possibly of different types) at the same time: 1 page pending write, 3 pre-fetched sparse pages, and 3 pre-fetched dependent pages.
Of course, if the caller decides to retain batches to perform parallel processing, we might end up loading all pages into memory; this is considered a bug in the caller's code. So if an algorithm supports external memory, it must ensure that its queue of async calls has an upper limit.
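The difference between streaming pages and staging them can be demonstrated with a small sketch (hypothetical `Page`/`stream_pages` names; relies on CPython's reference counting to free pages promptly, so it illustrates the caller-side bug rather than XGBoost's internals):

```python
import weakref

class Page:
    """Stand-in for one data page."""
    def __init__(self, rows):
        self.rows = rows

def stream_pages(n_pages):
    for i in range(n_pages):
        yield Page([i] * 4)

live = weakref.WeakSet()  # tracks pages that are still referenced somewhere

# Correct usage: process each page, drop the reference, move on.
for page in stream_pages(100):
    live.add(page)
    _ = sum(page.rows)
# Earlier pages were freed as we went; at most the last one is still alive.
assert len(live) <= 1

# Buggy usage: staging every page keeps the whole dataset in memory,
# defeating the point of external memory.
staged = [p for p in stream_pages(100)]
for p in staged:
    live.add(p)
assert len(live) >= 100  # nothing was freed while staged is held
```

The same accounting applies to shared-pointer pages in C++: as long as the caller holds a reference, the page cannot be evicted, so an unbounded stash of references means an unbounded working set.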
Another assumption we make is that the data is immutable, so the caller must never modify it. The sparse page source returns const pages to enforce this. If you need to change a generated page, such as an Ellpack page, pass parameters into GetBatches to re-generate it instead of trying to modify the pages in place.
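The "const page" guarantee can be mimicked in a stdlib-only sketch (hypothetical `make_page` helper, not the XGBoost API): hand callers a read-only view, so any attempted in-place mutation fails and the only way to get different contents is to build a new page.

```python
def make_page(values):
    """Return a read-only view over a page's buffer, analogous to the
    const pages returned by the sparse page source."""
    buf = bytes(values)     # immutable backing store
    return memoryview(buf)  # read-only view: writes raise TypeError

page = make_page([1, 2, 3])
assert page.readonly

try:
    page[0] = 9             # attempted in-place mutation
except TypeError:
    mutation_rejected = True

# The original contents are untouched; a changed page must be re-generated.
assert mutation_rejected and page[0] == 1
```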
A possible optimization is dropping the sparse page once dependent pages, such as the Ellpack page, have been constructed and cached.