An input split that allows reading records from one part of the data. The parts are independent of each other and together cover the whole dataset.

#include <dmlc/io.h>

Data Structures
struct	Blob
	a region of memory

Public Member Functions
virtual void	HintChunkSize (size_t)
	Hints to the InputSplit how large a chunk NextChunk should return. This is only a hint and may not be enforced, but InputSplit will try to adjust its internal buffer size to the hinted value.

virtual size_t	GetTotalSize (void)=0
	get the total size of the InputSplit

virtual void	BeforeFirst (void)=0
	reset the position of InputSplit to beginning

virtual bool	NextRecord (Blob *out_rec)=0
	Gets the next record. The returned value is valid until the next call to NextRecord, NextChunk or NextBatch. The caller can modify the memory content of out_rec.

virtual bool	NextChunk (Blob *out_chunk)=0
	Gets a chunk of memory that can contain multiple records. The caller needs to parse the content of the resulting chunk. For a text file, out_chunk can contain data of multiple lines. For recordio, out_chunk can contain multiple records (including headers).

virtual bool	NextBatch (Blob *out_chunk, size_t)
	Gets a chunk of memory that can contain multiple records, with a hint for how many records are needed. The caller needs to parse the content of the resulting chunk. For a text file, out_chunk can contain data of multiple lines. For recordio, out_chunk can contain multiple records (including headers).

virtual	~InputSplit (void) DMLC_THROW_EXCEPTION
	destructor

virtual void	ResetPartition (unsigned part_index, unsigned num_parts)=0
	Resets the InputSplit to a certain part id. The InputSplit will point to the head of the newly specified segment. This feature may not be supported by every implementation of InputSplit.

Static Public Member Functions
static InputSplit *	Create (const char *uri, unsigned part_index, unsigned num_parts, const char *type)
	factory function: create input split given a uri

static InputSplit *	Create (const char *uri, const char *index_uri, unsigned part_index, unsigned num_parts, const char *type, const bool shuffle=false, const int seed=0, const size_t batch_size=256, const bool recurse_directories=false)
	factory function: create input split given a uri for input and index

Detailed Description

An input split that allows reading records from one part of the data. The parts are independent of each other and together cover the whole dataset.

See InputSplit::Create for the definition of a record.
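
To illustrate how independent parts can cover a whole dataset, the following is a minimal sketch that assigns each part a byte range by even division. `PartRange` is a hypothetical helper and not part of dmlc-core. A real InputSplit would also have to move each boundary onto a record boundary, which this sketch does not attempt.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper, not part of dmlc-core: byte range [begin, end)
// assigned to part `part_index` out of `num_parts` for `total` bytes.
std::pair<std::size_t, std::size_t> PartRange(std::size_t total,
                                              unsigned part_index,
                                              unsigned num_parts) {
  assert(num_parts > 0 && part_index < num_parts);
  // Ceiling division: parts near the end may be shorter or even empty.
  std::size_t step = (total + num_parts - 1) / num_parts;
  std::size_t begin = std::min(step * part_index, total);
  std::size_t end = std::min(step * (part_index + 1), total);
  return {begin, end};
}
```

With total = 10 and num_parts = 3 this yields the ranges [0, 4), [4, 8) and [8, 10), which are disjoint and together cover all 10 bytes.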

Member Function Documentation

◆ BeforeFirst()

virtual void dmlc::InputSplit::BeforeFirst ( void )

pure virtual

reset the position of InputSplit to beginning

Implemented in dmlc::io::SingleThreadedInputSplit, dmlc::InputSplitShuffle, dmlc::io::InputSplitBase, dmlc::io::SingleFileSplit, and dmlc::io::IndexedRecordIOSplitter.

◆ Create() [1/2]

InputSplit * dmlc::InputSplit::Create	(	const char *	uri,
		const char *	index_uri,
		unsigned	part_index,
		unsigned	num_parts,
		const char *	type,
		const bool	shuffle = `false`,
		const int	seed = `0`,
		const size_t	batch_size = `256`,
		const bool	recurse_directories = `false`
	)

static

factory function: create input split given a uri for input and index

Parameters

uri	the uri of the input, can contain hdfs prefix
index_uri	the uri of the index, can contain hdfs prefix
part_index	the part id of current input
num_parts	total number of splits
type	type of record. Possible types: "text", "recordio", "indexed_recordio".
	"text": text file, each line is treated as a record; the input split will split on '\n' or '\r'.
	"recordio": binary recordio file, see recordio.h.
	"indexed_recordio": binary recordio file with index, see recordio.h.
shuffle	whether to shuffle the output from the InputSplit, supported only by "indexed_recordio" type. Defaults to "false"
seed	random seed to use in conjunction with the "shuffle" option. Defaults to 0
batch_size	a hint to InputSplit for the intended number of examples to return per batch. Used only by the "indexed_recordio" type
recurse_directories	whether to recursively traverse directories

Returns: a new input split

See also: InputSplit::Type
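
As an illustration of how a `seed` can make shuffled output reproducible, the sketch below shuffles record indices with a seeded generator. `ShuffledOrder` is a hypothetical helper. This page does not describe how dmlc-core actually shuffles "indexed_recordio" input, so treat this as a generic technique rather than the library's method.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper, not part of dmlc-core: a visiting order for
// `num_records` records that depends only on `seed`.
std::vector<std::size_t> ShuffledOrder(std::size_t num_records, int seed) {
  std::vector<std::size_t> order(num_records);
  // Start from the identity order 0, 1, ..., num_records - 1.
  std::iota(order.begin(), order.end(), std::size_t{0});
  // Seeding the engine makes the permutation repeatable across runs
  // built against the same standard library.
  std::mt19937 rng(static_cast<std::mt19937::result_type>(seed));
  std::shuffle(order.begin(), order.end(), rng);
  return order;
}
```

Calling `ShuffledOrder(n, seed)` twice with the same arguments returns the same permutation, which is the property a fixed seed is meant to provide.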

◆ Create() [2/2]

InputSplit * dmlc::InputSplit::Create	(	const char *	uri,
		unsigned	part_index,
		unsigned	num_parts,
		const char *	type
	)

static

factory function: create input split given a uri

Parameters

uri	the uri of the input, can contain hdfs prefix
part_index	the part id of current input
num_parts	total number of splits
type	type of record. Possible types: "text", "recordio", "indexed_recordio".
	"text": text file, each line is treated as a record; the input split will split on '\n' or '\r'.
	"recordio": binary recordio file, see recordio.h.
	"indexed_recordio": binary recordio file with index, see recordio.h.

Returns: a new input split

See also: InputSplit::Type

◆ GetTotalSize()

virtual size_t dmlc::InputSplit::GetTotalSize ( void )

pure virtual

get the total size of the InputSplit

Implemented in dmlc::InputSplitShuffle, dmlc::io::InputSplitBase, dmlc::io::SingleFileSplit, and dmlc::io::SingleThreadedInputSplit.

◆ HintChunkSize()

virtual void dmlc::InputSplit::HintChunkSize ( size_t )

inline virtual

Hints to the InputSplit how large a chunk NextChunk should return. This is only a hint and may not be enforced, but InputSplit will try to adjust its internal buffer size to the hinted value.

Reimplemented in dmlc::InputSplitShuffle, dmlc::io::InputSplitBase, dmlc::io::SingleFileSplit, and dmlc::io::SingleThreadedInputSplit.

◆ NextBatch()

virtual bool dmlc::InputSplit::NextBatch	(	Blob *	out_chunk,
		size_t
	)

inline virtual

Gets a chunk of memory that can contain multiple records, with a hint for how many records are needed. The caller needs to parse the content of the resulting chunk. For a text file, out_chunk can contain data of multiple lines. For recordio, out_chunk can contain multiple records (including headers).

This function ensures that the chunk contains no partial record. The caller can modify the memory content of out_chunk. The memory is valid until the next call to NextRecord, NextChunk or NextBatch.

Parameters

out_chunk used to store the result

Returns: true if the next record was read successfully, false if the end of the split was reached

See also: InputSplit::Create for definition of record; RecordIOChunkReader to parse recordio content from out_chunk

Reimplemented in dmlc::io::IndexedRecordIOSplitter.

◆ NextChunk()

virtual bool dmlc::InputSplit::NextChunk ( Blob * out_chunk )

pure virtual

Gets a chunk of memory that can contain multiple records. The caller needs to parse the content of the resulting chunk. For a text file, out_chunk can contain data of multiple lines. For recordio, out_chunk can contain multiple records (including headers).

This function ensures that the chunk contains no partial record. The caller can modify the memory content of out_chunk. The memory is valid until the next call to NextRecord, NextChunk or NextBatch.

Usually NextRecord is sufficient. NextChunk can be used by some multi-threaded parsers to parse the input content.

Parameters

out_chunk used to store the result

Returns: true if the next record was read successfully, false if the end of the split was reached

See also: InputSplit::Create for definition of record; RecordIOChunkReader to parse recordio content from out_chunk

Implemented in dmlc::InputSplitShuffle, dmlc::io::InputSplitBase, dmlc::io::SingleFileSplit, dmlc::io::SingleThreadedInputSplit, and dmlc::io::IndexedRecordIOSplitter.
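
The guarantee that a chunk never ends in the middle of a record, combined with a chunk size that is only a hint, can be sketched for newline-terminated text as follows. `ChunkEnd` is a hypothetical helper written for this illustration. It is not dmlc-core code, and for brevity it treats only '\n' as a record terminator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of dmlc-core: given data starting at
// `begin` and a size hint in bytes, return the end offset of a chunk
// that contains only whole '\n'-terminated records.
std::size_t ChunkEnd(const std::string &data, std::size_t begin,
                     std::size_t hint) {
  assert(begin <= data.size() && hint > 0);
  std::size_t limit = std::min(begin + hint, data.size());
  if (limit == data.size()) return limit;  // the rest of the data fits
  // Walk back to just past the last terminator inside the window.
  std::size_t end = limit;
  while (end > begin && data[end - 1] != '\n') --end;
  if (end > begin) return end;
  // One record is longer than the hint: grow the chunk rather than
  // cut the record, which is why the size is a hint and not a bound.
  end = limit;
  while (end < data.size() && data[end - 1] != '\n') ++end;
  return end;
}
```

A caller would repeatedly take `[begin, ChunkEnd(data, begin, hint))` as the next chunk and then set `begin` to the returned offset until it reaches `data.size()`.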

◆ NextRecord()

virtual bool dmlc::InputSplit::NextRecord ( Blob * out_rec )

pure virtual

Gets the next record. The returned value is valid until the next call to NextRecord, NextChunk or NextBatch. The caller can modify the memory content of out_rec.

For text, out_rec contains a single line. For recordio, out_rec contains the content of one record (with the header stripped).

Parameters

out_rec used to store the result

Returns: true if the next record was read successfully, false if the end of the split was reached

See also: InputSplit::Create for definition of record

Implemented in dmlc::InputSplitShuffle, dmlc::io::InputSplitBase, dmlc::io::SingleFileSplit, dmlc::io::SingleThreadedInputSplit, and dmlc::io::IndexedRecordIOSplitter.
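
The read loop implied by this contract can be sketched without the library. Below, `Blob` and `ToyTextSplit` are stand-ins defined only for this example. The `dptr`/`size` field names are an assumption about the real Blob, and the toy reader skips empty lines, which may differ from the actual "text" implementation. The point is the calling pattern: loop until NextRecord returns false, and copy each record before the next call invalidates it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the Blob struct (field names are an assumption).
struct Blob {
  void *dptr;
  std::size_t size;
};

// Toy in-memory reader that mimics the NextRecord contract for "text".
class ToyTextSplit {
 public:
  explicit ToyTextSplit(std::string data) : data_(std::move(data)), pos_(0) {}

  // Reset the read position to the beginning.
  void BeforeFirst() { pos_ = 0; }

  // Point out_rec at the next non-empty line; return false at the end.
  bool NextRecord(Blob *out_rec) {
    assert(out_rec != nullptr);
    // Skip terminators left over from the previous record.
    while (pos_ < data_.size() && IsEol(data_[pos_])) ++pos_;
    if (pos_ == data_.size()) return false;
    std::size_t begin = pos_;
    while (pos_ < data_.size() && !IsEol(data_[pos_])) ++pos_;
    out_rec->dptr = &data_[begin];
    out_rec->size = pos_ - begin;
    return true;
  }

 private:
  static bool IsEol(char c) { return c == '\n' || c == '\r'; }
  std::string data_;
  std::size_t pos_;
};

// Drain the split, copying each record because the blob is only
// valid until the next read call.
std::vector<std::string> ReadAllRecords(ToyTextSplit *split) {
  std::vector<std::string> records;
  Blob rec;
  while (split->NextRecord(&rec)) {
    records.emplace_back(static_cast<const char *>(rec.dptr), rec.size);
  }
  return records;
}
```

After the loop finishes, calling `BeforeFirst()` makes the same records available again from the start.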

◆ ResetPartition()

virtual void dmlc::InputSplit::ResetPartition	(	unsigned	part_index,
		unsigned	num_parts
	)

pure virtual

Resets the InputSplit to a certain part id. The InputSplit will point to the head of the newly specified segment. This feature may not be supported by every implementation of InputSplit.

Parameters

part_index	The part id of the new input.
num_parts	The total number of parts.

Implemented in dmlc::io::SingleFileSplit, dmlc::io::SingleThreadedInputSplit, dmlc::InputSplitShuffle, dmlc::io::InputSplitBase, and dmlc::io::IndexedRecordIOSplitter.

The documentation for this class was generated from the following files:

External/xgboost/dmlc-core/include/dmlc/io.h
External/xgboost/dmlc-core/src/io.cc
