Medial Code Documentation
Copyright 2019-2023, XGBoost Contributors.
Namespaces | |
namespace | detail |
Copyright 2023, XGBoost Contributors. | |
Data Structures | |
class | ArrayAdapter |
Adapter for dense array on host, in Python that's numpy.ndarray. | |
class | ArrayAdapterBatch |
class | ArrowColumnarBatch |
struct | ArrowSchemaImporter |
struct | Cache |
Information about the cache including path and page offsets. | |
class | Column |
struct | ColumnarMetaInfo |
struct | COOTuple |
class | CSCAdapter |
class | CSCAdapterBatch |
class | CSCArrayAdapter |
CSC adapter with support for array interface. | |
class | CSCArrayAdapterBatch |
class | CSCPageSource |
class | CSRAdapter |
class | CSRAdapterBatch |
class | CSRArrayAdapter |
Adapter for CSR array on host, in Python that's scipy.sparse.csr_matrix. | |
class | CSRArrayAdapterBatch |
class | DataIterProxy |
class | DataTableAdapter |
class | DataTableAdapterBatch |
class | DenseAdapter |
class | DenseAdapterBatch |
class | DMatrixProxy |
class | EllpackPageSource |
class | ExceHandler |
class | FileAdapter |
FileAdapter wraps dmlc::parser to read files and provide access in a common interface. | |
class | FileAdapterBatch |
class | FileIterator |
An iterator for implementing external memory support with file inputs. | |
class | GHistIndexRawFormat |
class | GradientIndexPageSource |
struct | IsValidFunctor |
class | IterativeDMatrix |
DMatrix type for QuantileDMatrix; the name IterativeDMatrix comes from its construction process. | |
class | IteratorAdapter |
Data iterator that takes a callback to return data, used in the JVM package for accepting a data iterator. | |
struct | LabelsCheck |
class | PageSourceIncMixIn |
class | PrimitiveColumn |
class | RecordBatchesIterAdapter |
class | SimpleBatchIteratorImpl |
class | SimpleDMatrix |
class | SingleBatchInternalIter |
class | SortedCSCPageSource |
class | SparsePageAdapterBatch |
class | SparsePageDMatrix |
DMatrix used for external memory. | |
class | SparsePageFormat |
Format specification of various data formats like SparsePage. | |
struct | SparsePageFormatReg |
Registry entry for sparse page format. | |
class | SparsePageRawFormat |
class | SparsePageSource |
class | SparsePageSourceImpl |
Base class for all page sources. | |
class | TryLockGuard |
struct | WeightsCheck |
Typedefs | |
using | ArrowColumnarBatchVec = std::vector< std::unique_ptr< ArrowColumnarBatch > > |
Enumerations | |
enum | ColumnDType : uint8_t { kUnknown , kInt8 , kUInt8 , kInt16 , kUInt16 , kInt32 , kUInt32 , kInt64 , kUInt64 , kFloat , kDouble } |
Functions | |
None | _warn_unused_missing (DataType data, Optional[FloatCompatible] missing) |
None | _check_data_shape (DataType data) |
bool | _is_scipy_csr (DataType data) |
bytes | _array_interface (np.ndarray data) |
DataType | transform_scipy_sparse (DataType data, bool is_csr) |
DispatchedDataBackendReturnType | _from_scipy_csr (DataType data, FloatCompatible missing, int nthread, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types) |
bool | _is_scipy_csc (DataType data) |
DispatchedDataBackendReturnType | _from_scipy_csc (DataType data, FloatCompatible missing, int nthread, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types) |
bool | _is_scipy_coo (DataType data) |
bool | _is_np_array_like (DataType data) |
Tuple[np.ndarray, Optional[NumpyDType]] | _ensure_np_dtype (DataType data, Optional[NumpyDType] dtype) |
np.ndarray | _maybe_np_slice (DataType data, Optional[NumpyDType] dtype) |
DispatchedDataBackendReturnType | _from_numpy_array (DataType data, FloatCompatible missing, int nthread, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types, DataSplitMode data_split_mode=DataSplitMode.ROW) |
bool | _is_pandas_df (DataType data) |
bool | _is_modin_df (DataType data) |
None | _invalid_dataframe_dtype (DataType data) |
Tuple[Optional[FeatureNames], Optional[FeatureTypes]] | pandas_feature_info (DataFrame data, Optional[str] meta, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types, bool enable_categorical) |
bool | is_nullable_dtype (PandasDType dtype) |
bool | is_pa_ext_dtype (Any dtype) |
bool | is_pa_ext_categorical_dtype (Any dtype) |
bool | is_pd_cat_dtype (PandasDType dtype) |
bool | is_pd_sparse_dtype (PandasDType dtype) |
DataFrame | pandas_cat_null (DataFrame data) |
DataFrame | pandas_ext_num_types (DataFrame data) |
Tuple[np.ndarray, Optional[FeatureNames], Optional[FeatureTypes]] | _transform_pandas_df (DataFrame data, bool enable_categorical, Optional[FeatureNames] feature_names=None, Optional[FeatureTypes] feature_types=None, Optional[str] meta=None, Optional[NumpyDType] meta_type=None) |
DispatchedDataBackendReturnType | _from_pandas_df (DataFrame data, bool enable_categorical, FloatCompatible missing, int nthread, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types) |
bool | _is_pandas_series (DataType data) |
None | _meta_from_pandas_series (DataType data, str name, Optional[NumpyDType] dtype, ctypes.c_void_p handle) |
bool | _is_modin_series (DataType data) |
DispatchedDataBackendReturnType | _from_pandas_series (DataType data, FloatCompatible missing, int nthread, bool enable_categorical, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types) |
bool | _is_dt_df (DataType data) |
Tuple[np.ndarray, Optional[FeatureNames], Optional[FeatureTypes]] | _transform_dt_df (DataType data, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types, Optional[str] meta=None, Optional[NumpyDType] meta_type=None) |
DispatchedDataBackendReturnType | _from_dt_df (DataType data, Optional[FloatCompatible] missing, int nthread, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types, bool enable_categorical) |
bool | _is_arrow (DataType data) |
Callable | record_batch_data_iter (Iterator data_iter) |
DispatchedDataBackendReturnType | _from_arrow (DataType data, FloatCompatible missing, int nthread, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types, bool enable_categorical) |
bool | _is_cudf_df (DataType data) |
bytes | _cudf_array_interfaces (DataType data, list cat_codes) |
Tuple[ctypes.c_void_p, list, Optional[FeatureNames], Optional[FeatureTypes]] | _transform_cudf_df (DataType data, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types, bool enable_categorical) |
DispatchedDataBackendReturnType | _from_cudf_df (DataType data, FloatCompatible missing, int nthread, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types, bool enable_categorical) |
bool | _is_cudf_ser (DataType data) |
bool | _is_cupy_array (DataType data) |
CupyT | _transform_cupy_array (DataType data) |
DispatchedDataBackendReturnType | _from_cupy_array (DataType data, FloatCompatible missing, int nthread, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types) |
bool | _is_cupy_csr (DataType data) |
bool | _is_cupy_csc (DataType data) |
bool | _is_dlpack (DataType data) |
bool | _transform_dlpack (DataType data) |
DispatchedDataBackendReturnType | _from_dlpack (DataType data, FloatCompatible missing, int nthread, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types) |
bool | _is_uri (DataType data) |
DispatchedDataBackendReturnType | _from_uri (DataType data, Optional[FloatCompatible] missing, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types, DataSplitMode data_split_mode=DataSplitMode.ROW) |
bool | _is_list (DataType data) |
DispatchedDataBackendReturnType | _from_list (Sequence data, FloatCompatible missing, int n_threads, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types) |
bool | _is_tuple (DataType data) |
DispatchedDataBackendReturnType | _from_tuple (Sequence data, FloatCompatible missing, int n_threads, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types) |
bool | _is_iter (DataType data) |
bool | _has_array_protocol (DataType data) |
DataType | _convert_unknown_data (DataType data) |
DispatchedDataBackendReturnType | dispatch_data_backend (DataType data, FloatCompatible missing, int threads, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types, bool enable_categorical=False, DataSplitMode data_split_mode=DataSplitMode.ROW) |
None | _validate_meta_shape (DataType data, str name) |
None | _meta_from_numpy (np.ndarray data, str field, Optional[NumpyDType] dtype, ctypes.c_void_p handle) |
None | _meta_from_list (Sequence data, str field, Optional[NumpyDType] dtype, ctypes.c_void_p handle) |
None | _meta_from_tuple (Sequence data, str field, Optional[NumpyDType] dtype, ctypes.c_void_p handle) |
None | _meta_from_cudf_df (DataType data, str field, ctypes.c_void_p handle) |
None | _meta_from_cudf_series (DataType data, str field, ctypes.c_void_p handle) |
None | _meta_from_cupy_array (DataType data, str field, ctypes.c_void_p handle) |
None | _meta_from_dt (DataType data, str field, Optional[NumpyDType] dtype, ctypes.c_void_p handle) |
None | dispatch_meta_backend (DMatrix matrix, DataType data, str name, Optional[NumpyDType] dtype=None) |
TransformedData | _proxy_transform (DataType data, Optional[FeatureNames] feature_names, Optional[FeatureTypes] feature_types, bool enable_categorical) |
None | dispatch_proxy_set_data (_ProxyDMatrix proxy, DataType data, Optional[list] cat_codes, bool allow_host) |
DMLC_REGISTRY_LINK_TAG (sparse_page_raw_format) | |
DMLC_REGISTRY_LINK_TAG (gradient_index_format) | |
std::string | ValidateFileFormat (std::string const &uri) |
DMLC_REGISTRY_FILE_TAG (gradient_index_format) | |
describe ("Raw GHistIndex binary data format.") .set_body([]() | |
bool | ReadHistogramCuts (common::HistogramCuts *cuts, common::AlignedResourceReadStream *fi) |
std::size_t | WriteHistogramCuts (common::HistogramCuts const &cuts, common::AlignedFileWriteStream *fo) |
void | GetCutsFromRef (Context const *ctx, std::shared_ptr< DMatrix > ref, bst_feature_t n_features, BatchParam p, common::HistogramCuts *p_cuts) |
Get quantile cuts from reference (Quantile)DMatrix. | |
void | GetCutsFromEllpack (EllpackPage const &page, common::HistogramCuts *cuts) |
Get quantile cuts from ellpack page. | |
std::shared_ptr< DMatrix > | CreateDMatrixFromProxy (Context const *ctx, std::shared_ptr< DMatrixProxy > proxy, float missing) |
Create a SimpleDMatrix instance from a DMatrixProxy. | |
DMatrixProxy * | MakeProxy (DMatrixHandle proxy) |
template<bool get_value = true, typename Fn > | |
decltype(auto) | HostAdapterDispatch (DMatrixProxy const *proxy, Fn fn, bool *type_error=nullptr) |
Dispatch function call based on input type. | |
std::string | MakeId (std::string prefix, SparsePageDMatrix *ptr) |
std::string | MakeCache (SparsePageDMatrix *ptr, std::string format, std::string prefix, std::map< std::string, std::shared_ptr< Cache > > *out) |
DMLC_REGISTRY_FILE_TAG (sparse_page_raw_format) | |
describe ("Raw binary data format.") .set_body([]() | |
void | TryDeleteCacheFile (const std::string &file) |
void | DevicePush (DMatrixProxy *, float, SparsePage *) |
template<typename T > | |
SparsePageFormat< T > * | CreatePageFormat (const std::string &name) |
Create a sparse page format by name. | |
void | ValidateQueryGroup (std::vector< bst_group_t > const &group_ptr_) |
TEST (FileIterator, Basic) | |
TEST (GradientIndex, ExternalMemoryBaseRowID) | |
TEST (GradientIndex, FromCategoricalBasic) | |
TEST (GradientIndex, FromCategoricalLarge) | |
TEST (GradientIndex, PushBatch) | |
TEST (GHistIndexPageRawFormat, IO) | |
TEST (IterativeDMatrix, Ref) | |
TEST (IterativeDMatrix, IsDense) | |
template<typename Page , typename Iter , typename Cuts > | |
void | TestRefDMatrix (Context const *ctx, Cuts &&get_cuts) |
TEST (ProxyDMatrix, HostData) | |
template<typename S > | |
void | TestSparsePageRawFormat () |
TEST (SparsePageRawFormat, SparsePage) | |
TEST (SparsePageRawFormat, CSCPage) | |
TEST (SparsePageRawFormat, SortedCSCPage) | |
Variables | |
DispatchedDataBackendReturnType | |
str | CAT_T = "c" |
dict | _matrix_meta = {"base_margin", "label"} |
dict | _pandas_dtype_mapper |
dict | pandas_nullable_mapper |
dict | pandas_pyarrow_mapper |
tuple | _ENABLE_CAT_ERR |
constexpr size_t | kAdapterUnknownSize = std::numeric_limits<size_t >::max() |
External data formats should implement an adapter as below. | |
Copyright 2020-2023, XGBoost Contributors.
Copyright 2021-2023, XGBoost Contributors.
Copyright 2022-2023, XGBoost Contributors.
Data dispatching for DMatrix.
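bytes xgboost.data._cudf_array_interfaces | ( | DataType | data, |
list | cat_codes | ||
) |
protected |
Extract CuDF __cuda_array_interface__. This is special as it returns a new list of data and a list of array interfaces. The data is a list of categorical codes that the caller can safely ignore, but the caller must keep a reference to it alive until the array interfaces are no longer in use.
DispatchedDataBackendReturnType xgboost.data._from_cupy_array | ( | DataType | data, |
FloatCompatible | missing, | ||
int | nthread, | ||
Optional[FeatureNames] | feature_names, | ||
Optional[FeatureTypes] | feature_types | ||
) |
protected |
Initialize DMatrix from cupy ndarray.
DispatchedDataBackendReturnType xgboost.data._from_numpy_array | ( | DataType | data, |
FloatCompatible | missing, | ||
int | nthread, | ||
Optional[FeatureNames] | feature_names, | ||
Optional[FeatureTypes] | feature_types, | ||
DataSplitMode | data_split_mode = DataSplitMode.ROW |
||
) |
protected |
Initialize data from a 2-D numpy matrix.
DispatchedDataBackendReturnType xgboost.data._from_scipy_csc | ( | DataType | data, |
FloatCompatible | missing, | ||
int | nthread, | ||
Optional[FeatureNames] | feature_names, | ||
Optional[FeatureTypes] | feature_types | ||
) |
protected |
Initialize data from a CSC matrix.
DispatchedDataBackendReturnType xgboost.data._from_scipy_csr | ( | DataType | data, |
FloatCompatible | missing, | ||
int | nthread, | ||
Optional[FeatureNames] | feature_names, | ||
Optional[FeatureTypes] | feature_types | ||
) |
protected |
Initialize data from a CSR matrix.
np.ndarray xgboost.data._maybe_np_slice | ( | DataType | data, |
Optional[NumpyDType] | dtype | ||
) |
protected |
Handle numpy slice. This can be removed if we use __array_interface__.
None xgboost.data._meta_from_pandas_series | ( | DataType | data, |
str | name, | ||
Optional[NumpyDType] | dtype, | ||
ctypes.c_void_p | handle | ||
) |
protected |
Helper that transforms a pandas series for meta data such as labels.
Tuple[np.ndarray, Optional[FeatureNames], Optional[FeatureTypes]] xgboost.data._transform_dt_df | ( | DataType | data, |
Optional[FeatureNames] | feature_names, | ||
Optional[FeatureTypes] | feature_types, | ||
Optional[str] | meta = None , |
||
Optional[NumpyDType] | meta_type = None |
||
) |
protected |
Validate feature names and types if the data is a data table.
template<typename T >
SparsePageFormat< T > * xgboost::data::CreatePageFormat | ( | const std::string & | name | ) |
inline |
Create a sparse page format by name.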
DispatchedDataBackendReturnType xgboost.data.dispatch_data_backend | ( | DataType | data, |
FloatCompatible | missing, | ||
int | threads, | ||
Optional[FeatureNames] | feature_names, | ||
Optional[FeatureTypes] | feature_types, | ||
bool | enable_categorical = False , |
||
DataSplitMode | data_split_mode = DataSplitMode.ROW |
||
) |
Dispatch data for DMatrix.
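The function list above pairs each `_is_*` predicate with a `_from_*` constructor, which suggests a predicate-then-handler dispatch. The following is a minimal sketch of that pattern only, not the library's implementation: every name in it (`is_list`, `from_sequence`, `BACKENDS`, `dispatch`) is hypothetical, and the handlers return strings instead of DMatrix handles.

```python
from typing import Any, Callable, List, Tuple


# Hypothetical stand-ins for the `_is_*` / `_from_*` pairs listed above.
def is_list(data: Any) -> bool:
    return isinstance(data, list)


def is_tuple(data: Any) -> bool:
    return isinstance(data, tuple)


def is_uri(data: Any) -> bool:
    return isinstance(data, str)


def from_sequence(data: Any) -> str:
    return f"sequence with {len(data)} rows"


def from_uri(data: Any) -> str:
    return f"file at {data}"


# Order matters: the first predicate that matches decides the handler.
BACKENDS: List[Tuple[Callable[[Any], bool], Callable[[Any], str]]] = [
    (is_list, from_sequence),
    (is_tuple, from_sequence),
    (is_uri, from_uri),
]


def dispatch(data: Any) -> str:
    """Route `data` to the first handler whose predicate accepts it."""
    for predicate, handler in BACKENDS:
        if predicate(data):
            return handler(data)
    raise TypeError(f"Not supported type for data: {type(data)}")
```

Keeping the predicates separate from the handlers means a new input type only needs one more pair appended to the table.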
None xgboost.data.dispatch_meta_backend | ( | DMatrix | matrix, |
DataType | data, | ||
str | name, | ||
Optional[NumpyDType] | dtype = None |
||
) |
Dispatch for meta info.
None xgboost.data.dispatch_proxy_set_data | ( | _ProxyDMatrix | proxy, |
DataType | data, | ||
Optional[list] | cat_codes, | ||
bool | allow_host | ||
) |
Dispatch for QuantileDMatrix.
void xgboost::data::GetCutsFromRef | ( | Context const * | ctx, |
std::shared_ptr< DMatrix > | ref, | ||
bst_feature_t | n_features, | ||
BatchParam | p, | ||
common::HistogramCuts * | p_cuts | ||
) |
Get quantile cuts from reference (Quantile)DMatrix.
ctx | The context of the new DMatrix. |
ref | The reference DMatrix. |
n_features | Number of features, used for validation only. |
p | Batch parameter for the new DMatrix. |
p_cuts | Output quantile cuts. |
decltype(auto) xgboost::data::HostAdapterDispatch | ( | DMatrixProxy const * | proxy, |
Fn | fn, | ||
bool * | type_error = nullptr |
||
) |
Dispatch function call based on input type.
get_value | Whether the function Fn accepts an adapter batch or the adapter itself. |
Fn | The type of the function to be dispatched. |
proxy | The proxy object holding the reference to the input. |
fn | The function to be dispatched. |
type_error[out] | Set to true if it's not null and the input data is not recognized by the host. |
bool xgboost.data.is_nullable_dtype | ( | PandasDType | dtype | ) |
Whether dtype is a pandas nullable type.
bool xgboost.data.is_pa_ext_categorical_dtype | ( | Any | dtype | ) |
Check whether dtype is a dictionary type.
bool xgboost.data.is_pa_ext_dtype | ( | Any | dtype | ) |
Return whether dtype is a pyarrow extension type for pandas
bool xgboost.data.is_pd_cat_dtype | ( | PandasDType | dtype | ) |
Wrapper for testing pandas category type.
bool xgboost.data.is_pd_sparse_dtype | ( | PandasDType | dtype | ) |
Wrapper for testing pandas sparse type.
DataFrame xgboost.data.pandas_cat_null | ( | DataFrame | data | ) |
Handle categorical dtype and nullable extension types from pandas.
DataFrame xgboost.data.pandas_ext_num_types | ( | DataFrame | data | ) |
Experimental support for handling pyarrow extension numeric types.
Tuple[Optional[FeatureNames], Optional[FeatureTypes]] xgboost.data.pandas_feature_info | ( | DataFrame | data, |
Optional[str] | meta, | ||
Optional[FeatureNames] | feature_names, | ||
Optional[FeatureTypes] | feature_types, | ||
bool | enable_categorical | ||
) |
Handle feature info for pandas dataframe.
Callable xgboost.data.record_batch_data_iter | ( | Iterator | data_iter | ) |
Data iterator used to ingest Arrow columnar record batches. We are not using class DataIter because it is only intended for building Device DMatrix and external memory DMatrix.
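A callback-style data iterator of this kind can be sketched as a closure over a Python iterator. This is an illustration under stated assumptions, not the actual function: `make_batch_callback` and `sink` are hypothetical names, plain lists stand in for Arrow record batches, and the convention of returning 1 when a batch was delivered and 0 on exhaustion is assumed here.

```python
from typing import Callable, Iterator, List


def make_batch_callback(
    batches: Iterator[List[int]], sink: List[List[int]]
) -> Callable[[], int]:
    """Wrap an iterator in a callback that hands over one batch per call."""

    def _next() -> int:
        try:
            batch = next(batches)
        except StopIteration:
            return 0  # assumed convention: 0 signals that no data is left
        sink.append(batch)  # stand-in for exporting the batch to the consumer
        return 1  # assumed convention: 1 signals that a batch was delivered

    return _next
```

The consumer calls the returned function repeatedly until it yields 0, so it never needs to know what kind of iterator sits behind the callback.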
DataType xgboost.data.transform_scipy_sparse | ( | DataType | data, |
bool | is_csr | ||
) |
Ensure correct data alignment and data type for scipy sparse inputs. Input should be either csr or csc matrix.
xgboost.data.DispatchedDataBackendReturnType |
constexpr size_t xgboost::data::kAdapterUnknownSize = std::numeric_limits<size_t >::max() |
constexpr |
External data formats should implement an adapter as below.
The adapter provides uniform access to data outside xgboost, allowing construction of DMatrix objects from a range of sources without duplicating code.
The adapter object is an iterator that returns batches of data. Each batch contains a number of "lines". A line represents a set of elements from a sparse input matrix, normally a row in the case of a CSR matrix or a column for a CSC matrix. Sparse matrix formats typically allow efficient access to subsets of elements at a time but not efficient lookup of individual elements by random access, hence the "line" abstraction, which lets the sparse matrix return subsets of elements efficiently. Individual elements are described by a COO tuple (row index, column index, value).
This abstraction allows us to read through different sparse matrix formats using the same interface. In particular we can write a DMatrix constructor that uses the same code to construct itself from a CSR matrix, CSC matrix, dense matrix, CSV, LIBSVM file, or potentially other formats. To see why this is necessary, imagine we have 5 external matrix formats and 5 internal DMatrix types where each DMatrix needs a custom constructor for each possible input. The number of constructors is 5*5=25. Using an abstraction over the input data types the number of constructors is reduced to 5, as each DMatrix is oblivious to the external data format. Adding a new input source is simply a case of implementing an adapter.
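The batch / line / COO tuple layering can be illustrated with a small sketch. It is written in Python for brevity and is not the C++ adapter interface: the class names, the `batches()` method and `build_entries` are hypothetical, and the dense sketch treats 0.0 as missing purely as a simplification.

```python
from typing import Iterator, List, Tuple

COOTuple = Tuple[int, int, float]  # (row index, column index, value)
Line = List[COOTuple]              # one row (CSR) or one column (CSC)
Batch = List[Line]


class CSRAdapterSketch:
    """Hypothetical adapter over CSR arrays; yields a single batch."""

    def __init__(self, indptr: List[int], indices: List[int], values: List[float]):
        self.indptr, self.indices, self.values = indptr, indices, values

    def batches(self) -> Iterator[Batch]:
        batch: Batch = []
        for row in range(len(self.indptr) - 1):
            begin, end = self.indptr[row], self.indptr[row + 1]
            batch.append(
                [(row, self.indices[i], self.values[i]) for i in range(begin, end)]
            )
        yield batch


class DenseAdapterSketch:
    """Hypothetical adapter over a list of rows; 0.0 is treated as missing."""

    def __init__(self, rows: List[List[float]]):
        self.rows = rows

    def batches(self) -> Iterator[Batch]:
        yield [
            [(r, c, v) for c, v in enumerate(row) if v != 0.0]
            for r, row in enumerate(self.rows)
        ]


def build_entries(adapter) -> List[COOTuple]:
    """One constructor for every source: it only sees batches of lines."""
    entries: List[COOTuple] = []
    for batch in adapter.batches():
        for line in batch:
            entries.extend(line)
    return entries
```

Because `build_entries` depends only on `batches()`, supporting another input format means writing one more adapter rather than one more constructor per matrix type.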
Most of the below adapters do not need more than one batch as the data originates from an in memory source. The file adapter does require batches to avoid loading the entire file in memory.
An important detail is empty row/column handling. Files loaded from disk do not provide meta information about the number of rows/columns to expect, so this needs to be inferred during construction. Other sparse formats may specify a number of rows/columns, but we can encounter entirely sparse rows or columns, leading to disagreement between the inferred number and the meta-info provided. To resolve this, adapters have methods specifying the number of rows/columns expected; these methods may return zero where these values must be inferred from data. A constructed DMatrix should agree with the input source on numbers of rows/columns, appending empty rows if necessary.
An adapter can return this value (kAdapterUnknownSize) for the number of rows or columns to indicate that the value is currently unknown and should be inferred while passing over the data.
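The shape-resolution rule described above can be sketched as follows. This is an assumption-laden illustration, not the library's logic: `UNKNOWN_SIZE` is a hypothetical stand-in for kAdapterUnknownSize, `resolve_shape` is a made-up helper, and raising an error when the data exceeds a declared size is a choice made for this sketch.

```python
import sys
from typing import Iterable, Tuple

# Hypothetical stand-in for kAdapterUnknownSize (the largest size_t in C++).
UNKNOWN_SIZE = sys.maxsize


def resolve_shape(
    entries: Iterable[Tuple[int, int, float]],
    declared_rows: int = UNKNOWN_SIZE,
    declared_cols: int = UNKNOWN_SIZE,
) -> Tuple[int, int]:
    """Combine the shape declared by an adapter with the shape seen in the data."""
    seen_rows = 0
    seen_cols = 0
    for row, col, _value in entries:
        seen_rows = max(seen_rows, row + 1)
        seen_cols = max(seen_cols, col + 1)

    def pick(declared: int, seen: int, what: str) -> int:
        if declared == UNKNOWN_SIZE:
            return seen  # nothing declared: infer from the data
        if seen > declared:
            raise ValueError(f"data has more {what} than declared")
        return declared  # trailing empty rows/columns are kept

    return pick(declared_rows, seen_rows, "rows"), pick(declared_cols, seen_cols, "columns")
```

With a declared shape of 4 rows and data that only touches the first 2, the result still reports 4 rows, which corresponds to appending empty rows so that the matrix agrees with its source.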
dict xgboost.data.pandas_nullable_mapper |
dict xgboost.data.pandas_pyarrow_mapper |