Medial Code Documentation
dmlc::io::InputSplitBase Class Reference (abstract)

class to construct input split from multiple files

#include <input_split_base.h>

Inheritance diagram for dmlc::io::InputSplitBase:
Base class: dmlc::InputSplit. Derived classes: dmlc::io::IndexedRecordIOSplitter, dmlc::io::LineSplitter, dmlc::io::RecordIOSplitter.

Data Structures

struct  Chunk
 helper struct to hold chunk data with internal pointer to move along the record
 

Public Member Functions

virtual void BeforeFirst (void)
 reset the position of InputSplit to beginning
 
virtual void HintChunkSize (size_t chunk_size)
 hint the InputSplit how large a chunk it should return when implementing NextChunk. This is only a hint and may not be enforced, but InputSplit will try to adjust its internal buffer size to the hinted value.
 
virtual size_t GetTotalSize (void)
 get the total size of the InputSplit
 
virtual bool NextRecord (Blob *out_rec)
 get the next record. The returned value is valid until the next call to NextRecord, NextChunk or NextBatch. The caller can modify the memory content of out_rec.
 
virtual bool NextChunk (Blob *out_chunk)
 get a chunk of memory that can contain multiple records; the caller needs to parse the content of the resulting chunk. For a text file, out_chunk can contain multiple lines; for recordio, out_chunk can contain multiple records (including headers).
 
virtual void ResetPartition (unsigned rank, unsigned nsplit)
 reset the InputSplit to a certain part id. The InputSplit will point to the head of the newly specified segment. This feature may not be supported by every implementation of InputSplit.
 
virtual bool ReadChunk (void *buf, size_t *size)
 read a chunk of data into buf. The data can span multiple records, but cannot contain partial records.
 
bool ExtractNextChunk (Blob *out_rchunk, Chunk *chunk)
 extract next chunk from the chunk
 
virtual bool ExtractNextRecord (Blob *out_rec, Chunk *chunk)=0
 extract next record from the chunk
 
virtual bool IsTextParser (void)=0
 query whether this object is a text parser
 
virtual bool NextChunkEx (Chunk *chunk)
 fill the given chunk with new data without using internal temporary chunk
 
virtual bool NextBatchEx (Chunk *chunk, size_t)
 fill the given chunk with new batch of data without using internal temporary chunk
 
Public Member Functions inherited from dmlc::InputSplit
virtual bool NextBatch (Blob *out_chunk, size_t)
 get a chunk of memory that can contain multiple records, with a hint for how many records are needed; the caller needs to parse the content of the resulting chunk. For a text file, out_chunk can contain multiple lines; for recordio, out_chunk can contain multiple records (including headers).
 
virtual ~InputSplit (void) DMLC_THROW_EXCEPTION
 destructor
 

Static Public Attributes

static const size_t kBufferSize = 2UL << 20UL
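
The shift expression is easy to misread as 1 << 20. The following compile-time check uses a hypothetical stand-in constant (`kBufferSizeSketch`, not the dmlc header) to confirm that it evaluates to 2 × 2^20 = 2,097,152. This page does not state the unit of the value, so the sketch makes no claim about bytes versus words.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in that mirrors the expression shown for kBufferSize.
constexpr std::size_t kBufferSizeSketch = 2UL << 20UL;

// 2 << 20 is 2 * 2^20, i.e. 2,097,152 (twice 1,048,576).
static_assert(kBufferSizeSketch == 2UL * 1024UL * 1024UL,
              "2UL << 20UL should equal 2 * 1024 * 1024");
```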
 

Protected Member Functions

void Init (FileSystem *fs, const char *uri, size_t align_bytes, const bool recurse_directories=false)
 initialize the base before doing anything
 
virtual size_t SeekRecordBegin (Stream *fi)=0
 seek to the beginning of the first record in current file pointer
 
virtual const char * FindLastRecordBegin (const char *begin, const char *end)=0
 find the last occurrence of record header
 
std::vector< URI > ConvertToURIs (const std::string &uri)
 split string list of files into vector of URIs
 
size_t Read (void *ptr, size_t size)
 same as stream.Read
 

Protected Attributes

FileSystem * filesys_
 FileSystem.
 
std::vector< size_t > file_offset_
 byte-offset of each file
 
size_t offset_curr_
 get the current offset
 
size_t offset_begin_
 beginning of offset
 
size_t offset_end_
 end of the offset
 
std::vector< FileInfo > files_
 information about files
 
SeekStream * fs_
 current input stream
 
size_t file_ptr_
 file pointer of which file to read on
 
size_t file_ptr_end_
 file pointer where the end of file lies
 
Chunk tmp_chunk_
 temporary chunk
 
size_t buffer_size_
 buffer size
 

Additional Inherited Members

Static Public Member Functions inherited from dmlc::InputSplit
static InputSplit * Create (const char *uri, unsigned part_index, unsigned num_parts, const char *type)
 factory function: create input split given a uri
 
static InputSplit * Create (const char *uri, const char *index_uri, unsigned part_index, unsigned num_parts, const char *type, const bool shuffle=false, const int seed=0, const size_t batch_size=256, const bool recurse_directories=false)
 factory function: create input split given a uri for input and index
 

Detailed Description

class to construct input split from multiple files

Member Function Documentation

◆ BeforeFirst()

void dmlc::io::InputSplitBase::BeforeFirst ( void  )
virtual

reset the position of InputSplit to beginning

Implements dmlc::InputSplit.

Reimplemented in dmlc::io::IndexedRecordIOSplitter.

◆ ExtractNextChunk()

bool dmlc::io::InputSplitBase::ExtractNextChunk ( Blob *  out_rchunk,
Chunk *  chunk 
)

extract next chunk from the chunk

Parameters
out_rchunk  the output chunk
chunk  the chunk information
Returns
true if a non-empty chunk is extracted; false if the chunk has already been fully consumed

◆ ExtractNextRecord()

virtual bool dmlc::io::InputSplitBase::ExtractNextRecord ( Blob *  out_rec,
Chunk *  chunk 
)
pure virtual

extract next record from the chunk

Parameters
out_rec  the output record
chunk  the chunk information
Returns
true if a non-empty record is extracted; false if the chunk has already been fully consumed

Implemented in dmlc::io::LineSplitter, dmlc::io::RecordIOSplitter, and dmlc::io::IndexedRecordIOSplitter.

◆ FindLastRecordBegin()

virtual const char * dmlc::io::InputSplitBase::FindLastRecordBegin ( const char *  begin,
const char *  end 
)
protected, pure virtual

find the last occurrence of record header

Parameters
begin  beginning of the buffer
end  end of the buffer
Returns
a pointer in [begin, end] indicating the head of the last record

Implemented in dmlc::io::LineSplitter, dmlc::io::RecordIOSplitter, and dmlc::io::IndexedRecordIOSplitter.
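
For a line-oriented format, the contract above can be sketched as a backward scan for a line terminator. `FindLastLineBegin` is a hypothetical free function written for illustration, not the code of any of the listed implementations, and it assumes that `'\n'` and `'\r'` are the only record terminators.

```cpp
#include <cassert>

// Hypothetical sketch, not the dmlc implementation.
// Scans backwards from `end` and returns the position just after the last
// line terminator, i.e. the head of the last (possibly partial) record.
// Returns `begin` if no terminator is found after the first byte.
inline const char* FindLastLineBegin(const char* begin, const char* end) {
  assert(begin != end);  // the buffer must not be empty
  for (const char* p = end - 1; p != begin; --p) {
    if (*p == '\n' || *p == '\r') return p + 1;
  }
  return begin;
}
```

When the buffer ends exactly on a terminator, the function returns `end`, which is why the documented range is closed on both sides.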

◆ GetTotalSize()

virtual size_t dmlc::io::InputSplitBase::GetTotalSize ( void  )
inline, virtual

get the total size of the InputSplit

Implements dmlc::InputSplit.

◆ HintChunkSize()

virtual void dmlc::io::InputSplitBase::HintChunkSize ( size_t  )
inline, virtual

hint the InputSplit how large a chunk it should return when implementing NextChunk. This is only a hint and may not be enforced, but InputSplit will try to adjust its internal buffer size to the hinted value.

Reimplemented from dmlc::InputSplit.

◆ Init()

void dmlc::io::InputSplitBase::Init ( FileSystem *  fs,
const char *  uri,
size_t  align_bytes,
const bool  recurse_directories = false 
)
protected

initialize the base before doing anything

Parameters
fs  the filesystem pointer
uri  the URI of the files
align_bytes  the head of a split must be a multiple of align_bytes; this also checks that file sizes are multiples of align_bytes
recurse_directories  recursively traverse directories

◆ IsTextParser()

virtual bool dmlc::io::InputSplitBase::IsTextParser ( void  )
pure virtual

query whether this object is a text parser

Returns
true if this object represents a text parser; false if it represents a binary parser

Implemented in dmlc::io::LineSplitter, dmlc::io::RecordIOSplitter, and dmlc::io::IndexedRecordIOSplitter.

◆ NextBatchEx()

virtual bool dmlc::io::InputSplitBase::NextBatchEx ( Chunk *  chunk,
size_t   
)
inline, virtual

fill the given chunk with new batch of data without using internal temporary chunk

Reimplemented in dmlc::io::IndexedRecordIOSplitter.

◆ NextChunk()

virtual bool dmlc::io::InputSplitBase::NextChunk ( Blob *  out_chunk)
inline, virtual

get a chunk of memory that can contain multiple records; the caller needs to parse the content of the resulting chunk. For a text file, out_chunk can contain multiple lines; for recordio, out_chunk can contain multiple records (including headers).

This function ensures there is no partial record in the chunk. The caller can modify the memory content of out_chunk; the memory is valid until the next call to NextRecord, NextChunk or NextBatch.

Usually NextRecord is sufficient; NextChunk can be used by multi-threaded parsers to parse the input content.

Parameters
out_chunk  used to store the result
Returns
true if the next chunk was successfully retrieved; false if the end of the split was reached
See also
InputSplit::Create for definition of record
RecordIOChunkReader to parse recordio content from out_chunk

Implements dmlc::InputSplit.

Reimplemented in dmlc::io::IndexedRecordIOSplitter.

◆ NextChunkEx()

virtual bool dmlc::io::InputSplitBase::NextChunkEx ( Chunk *  chunk)
inline, virtual

fill the given chunk with new data without using internal temporary chunk

Reimplemented in dmlc::io::IndexedRecordIOSplitter.

◆ NextRecord()

virtual bool dmlc::io::InputSplitBase::NextRecord ( Blob *  out_rec)
inline, virtual

get the next record. The returned value is valid until the next call to NextRecord, NextChunk or NextBatch. The caller can modify the memory content of out_rec.

For text, out_rec contains a single line. For recordio, out_rec contains the content of one record (with the header stripped).

Parameters
out_rec  used to store the result
Returns
true if the next record was successfully retrieved; false if the end of the split was reached
See also
InputSplit::Create for definition of record

Implements dmlc::InputSplit.

Reimplemented in dmlc::io::IndexedRecordIOSplitter.
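
The calling pattern implied by this contract (loop until false, copy the data before the next call) can be sketched with stand-in types. `BlobSketch` and `ToyLineSplit` below are hypothetical and only imitate the shape of the interface on an in-memory string; they are not the dmlc classes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a blob: a pointer plus a size.
struct BlobSketch {
  void* dptr;
  std::size_t size;
};

// Hypothetical toy splitter over an in-memory string, one record per line.
class ToyLineSplit {
 public:
  explicit ToyLineSplit(std::string data) : data_(std::move(data)), pos_(0) {}

  // Rewinds to the first record.
  void BeforeFirst() { pos_ = 0; }

  // Points out_rec at the next line (terminator excluded).
  // Returns false once the input is exhausted.
  bool NextRecord(BlobSketch* out_rec) {
    if (pos_ >= data_.size()) return false;
    const std::size_t nl = data_.find('\n', pos_);
    const std::size_t stop = (nl == std::string::npos) ? data_.size() : nl;
    out_rec->dptr = &data_[pos_];
    out_rec->size = stop - pos_;
    pos_ = (nl == std::string::npos) ? data_.size() : nl + 1;
    return true;
  }

 private:
  std::string data_;
  std::size_t pos_;
};

// Drains the splitter, copying each record right away because the blob is
// only guaranteed to stay valid until the next call.
inline std::vector<std::string> CollectRecords(ToyLineSplit* split) {
  std::vector<std::string> out;
  BlobSketch rec;
  while (split->NextRecord(&rec)) {
    out.emplace_back(static_cast<const char*>(rec.dptr), rec.size);
  }
  return out;
}
```

Calling `BeforeFirst()` on the toy object and draining it again yields the same records, which mirrors the reset behaviour documented for BeforeFirst.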

◆ ReadChunk()

bool dmlc::io::InputSplitBase::ReadChunk ( void *  buf,
size_t *  size 
)
virtual

read a chunk of data into buf. The data can span multiple records, but cannot contain partial records.

Parameters
buf  the memory region of the buffer; should be properly aligned to 64 bits
size  the maximum size of the memory; after the function returns, it stores the size of the chunk
Returns
whether end of file was reached

Reimplemented in dmlc::io::IndexedRecordIOSplitter.
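
One way to satisfy the "no partial records" requirement is to cut a freshly read buffer at the last record boundary and carry the remainder over to the next read. The sketch below does this for newline-terminated records; `SplitAtLastNewline` is a hypothetical helper and makes no claim about how ReadChunk is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper, not part of dmlc.
// Returns {complete, leftover}: `complete` ends on a record boundary and
// `leftover` holds the trailing partial record to prepend to the next read.
inline std::pair<std::string, std::string> SplitAtLastNewline(
    const std::string& buffer) {
  const std::size_t pos = buffer.find_last_of('\n');
  if (pos == std::string::npos) {
    // No boundary at all: nothing can be handed out yet.
    return std::make_pair(std::string(), buffer);
  }
  return std::make_pair(buffer.substr(0, pos + 1), buffer.substr(pos + 1));
}
```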

◆ ResetPartition()

void dmlc::io::InputSplitBase::ResetPartition ( unsigned  part_index,
unsigned  num_parts 
)
virtual

reset the InputSplit to a certain part id. The InputSplit will point to the head of the newly specified segment. This feature may not be supported by every implementation of InputSplit.

Parameters
part_index  The part id of the new input.
num_parts  The total number of parts.

Implements dmlc::InputSplit.

Reimplemented in dmlc::io::IndexedRecordIOSplitter.
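
To illustrate how a part id can map to a segment of the input, the sketch below computes an aligned byte range for one part. `PartitionRange` and its rounding scheme are assumptions made for illustration; this page does not specify the arithmetic that the class uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper, not part of dmlc.
// Returns the [begin, end) byte range of part `part_index` out of
// `num_parts`, with the per-part step rounded up to a multiple of
// `align_bytes` and both bounds clamped to `total`.
inline std::pair<std::size_t, std::size_t> PartitionRange(
    std::size_t total, unsigned part_index, unsigned num_parts,
    std::size_t align_bytes) {
  assert(num_parts > 0 && part_index < num_parts && align_bytes > 0);
  std::size_t step = (total + num_parts - 1) / num_parts;       // ceiling
  step = (step + align_bytes - 1) / align_bytes * align_bytes;  // align up
  const std::size_t begin = std::min<std::size_t>(step * part_index, total);
  const std::size_t end =
      std::min<std::size_t>(step * (part_index + 1), total);
  return std::make_pair(begin, end);
}
```

A raw byte range like this can start in the middle of a record, so a real splitter would still have to move each bound forward to a record head, which is presumably the role of SeekRecordBegin.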

◆ SeekRecordBegin()

virtual size_t dmlc::io::InputSplitBase::SeekRecordBegin ( Stream *  fi)
protected, pure virtual

seek to the beginning of the first record in current file pointer

Returns
how many bytes we read past

Implemented in dmlc::io::LineSplitter, dmlc::io::RecordIOSplitter, and dmlc::io::IndexedRecordIOSplitter.


The documentation for this class was generated from the following files: