Medial Code Documentation
dmlc::io::IndexedRecordIOSplitter Class Reference

class that splits the recordIO file by record

#include <indexed_recordio_split.h>

Inheritance diagram for dmlc::io::IndexedRecordIOSplitter:
dmlc::io::InputSplitBase dmlc::InputSplit

Public Member Functions

 IndexedRecordIOSplitter (FileSystem *fs, const char *uri, const char *index_uri, unsigned rank, unsigned nsplit, const size_t batch_size, const bool shuffle, const int seed=0)
 
bool IsTextParser (void) override
 query whether this object is a text parser
 
bool ExtractNextRecord (Blob *out_rec, Chunk *chunk) override
 extract next record from the chunk
 
bool ReadChunk (void *buf, size_t *size) override
 read a chunk of data into buf; the data can span multiple records, but cannot contain partial records
 
bool NextChunk (Blob *out_chunk) override
 get a chunk of memory that can contain multiple records. The caller needs to parse the content of the resulting chunk: for text files, out_chunk can contain data of multiple lines; for recordio, out_chunk can contain multiple records (including headers)
 
void BeforeFirst (void) override
 reset the position of InputSplit to beginning
 
bool NextBatch (Blob *out_chunk, size_t n_records) override
 get a chunk of memory that can contain multiple records, with a hint for how many records are needed. The caller needs to parse the content of the resulting chunk: for text files, out_chunk can contain data of multiple lines; for recordio, out_chunk can contain multiple records (including headers)
 
bool NextRecord (Blob *out_rec) override
 get the next record; the returned value is valid until the next call to NextRecord, NextChunk or NextBatch. The caller can modify the memory content of out_rec
 
void SetRandomSeed (size_t seed)
 
void SetBatchSize (int batch_size)
 
bool NextChunkEx (Chunk *out_chunk) override
 fill the given chunk with new data without using internal temporary chunk
 
bool NextBatchEx (Chunk *out_chunk, size_t n_records) override
 fill the given chunk with new batch of data without using internal temporary chunk
 
- Public Member Functions inherited from dmlc::io::InputSplitBase
virtual void HintChunkSize (size_t chunk_size)
 hint to the InputSplit how large a chunk it should return when implementing NextChunk. This is only a hint and may not be enforced, but InputSplit will try to adjust its internal buffer size to the hinted value
 
virtual size_t GetTotalSize (void)
 get the total size of the InputSplit
 
bool ExtractNextChunk (Blob *out_rchunk, Chunk *chunk)
 extract next chunk from the chunk
 
- Public Member Functions inherited from dmlc::InputSplit
virtual ~InputSplit (void) DMLC_THROW_EXCEPTION
 destructor
 

Protected Member Functions

size_t SeekRecordBegin (Stream *fi) override
 seek to the beginning of the first record in current file pointer
 
const char * FindLastRecordBegin (const char *begin, const char *end) override
 find the last occurrence of record header
 
virtual void ReadIndexFile (FileSystem *fs, const std::string &index_uri)
 
void ResetPartition (unsigned rank, unsigned nsplit) override
 reset the InputSplit to a certain part id. The InputSplit will point to the head of the newly specified segment. This feature may not be supported by every implementation of InputSplit.
 
- Protected Member Functions inherited from dmlc::io::InputSplitBase
void Init (FileSystem *fs, const char *uri, size_t align_bytes, const bool recurse_directories=false)
 initialize the base before doing anything
 
std::vector< URI > ConvertToURIs (const std::string &uri)
 split string list of files into vector of URIs
 
size_t Read (void *ptr, size_t size)
 same as stream.Read
 

Protected Attributes

std::vector< std::pair< size_t, size_t > > index_
 
std::vector< size_t > permutation_
 
bool shuffle_
 
size_t current_index_
 
size_t index_begin_
 
size_t index_end_
 
size_t batch_size_
 
size_t n_overflow_
 
const int kRandMagic = 111
 
std::mt19937 rnd_
 
- Protected Attributes inherited from dmlc::io::InputSplitBase
FileSystem * filesys_
 FileSystem.
 
std::vector< size_t > file_offset_
 byte-offset of each file
 
size_t offset_curr_
 get the current offset
 
size_t offset_begin_
 beginning of offset
 
size_t offset_end_
 end of the offset
 
std::vector< FileInfo > files_
 information about files
 
SeekStream * fs_
 current input stream
 
size_t file_ptr_
 file pointer of which file to read on
 
size_t file_ptr_end_
 file pointer where the end of file lies
 
Chunk tmp_chunk_
 temporary chunk
 
size_t buffer_size_
 buffer size
 

Additional Inherited Members

- Static Public Member Functions inherited from dmlc::InputSplit
static InputSplit * Create (const char *uri, unsigned part_index, unsigned num_parts, const char *type)
 factory function: create input split given a uri
 
static InputSplit * Create (const char *uri, const char *index_uri, unsigned part_index, unsigned num_parts, const char *type, const bool shuffle=false, const int seed=0, const size_t batch_size=256, const bool recurse_directories=false)
 factory function: create input split given a uri for input and index
 
- Static Public Attributes inherited from dmlc::io::InputSplitBase
static const size_t kBufferSize = 2UL << 20UL
 

Detailed Description

class that splits the recordIO file by record

Member Function Documentation

◆ BeforeFirst()

void dmlc::io::IndexedRecordIOSplitter::BeforeFirst ( void  )
override virtual

reset the position of InputSplit to beginning

Reimplemented from dmlc::io::InputSplitBase.
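
The attributes listed above (permutation_, shuffle_, current_index_, index_begin_, index_end_, rnd_) suggest that resetting the position may also redraw the visiting order when shuffling is enabled. The following is a minimal sketch of that idea, not the library's implementation: ToyCursor and its Next method are hypothetical stand-ins, and the reshuffle-on-reset behavior is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical stand-in for the cursor state of an indexed splitter.
// Field names mirror the protected attributes, but the logic is assumed.
struct ToyCursor {
  std::size_t index_begin;
  std::size_t index_end;
  bool shuffle;
  std::mt19937 rnd;
  std::vector<std::size_t> permutation;
  std::size_t current = 0;

  ToyCursor(std::size_t begin, std::size_t end, bool do_shuffle, unsigned seed)
      : index_begin(begin), index_end(end), shuffle(do_shuffle), rnd(seed) {
    assert(begin <= end);
    BeforeFirst();
  }

  // Reset to the start of this partition; redraw the order when shuffling.
  void BeforeFirst() {
    if (shuffle) {
      permutation.resize(index_end - index_begin);
      std::iota(permutation.begin(), permutation.end(), index_begin);
      std::shuffle(permutation.begin(), permutation.end(), rnd);
      current = 0;  // position inside permutation
    } else {
      current = index_begin;  // position inside the index itself
    }
  }

  // Write the id of the next record to *out; false when the partition is done.
  bool Next(std::size_t* out) {
    if (shuffle) {
      if (current >= permutation.size()) return false;
      *out = permutation[current++];
    } else {
      if (current >= index_end) return false;
      *out = current++;
    }
    return true;
  }
};
```

In this sketch every record id in [index_begin, index_end) is visited exactly once per pass, and calling BeforeFirst starts a new pass.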

◆ ExtractNextRecord()

bool dmlc::io::IndexedRecordIOSplitter::ExtractNextRecord ( Blob * out_rec,
Chunk * chunk 
)
override virtual

extract next record from the chunk

Parameters
out_rec: the output record
chunk: the chunk information
Returns
true if a non-empty record is extracted, false if the chunk has already finished its life

Implements dmlc::io::InputSplitBase.

◆ FindLastRecordBegin()

const char * dmlc::io::IndexedRecordIOSplitter::FindLastRecordBegin ( const char *  begin,
const char *  end 
)
override protected virtual

find the last occurrence of record header

Parameters
begin: beginning of the buffer
end: end of the buffer
Returns
the pointer between [begin, end] indicating the last record head

Implements dmlc::io::InputSplitBase.
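
To illustrate what finding the last record header can look like, here is a hedged sketch that scans a buffer backwards for a record start. It assumes the RecordIO layout of a 32-bit magic word followed by a 32-bit length word whose upper 3 bits hold a continuation flag, with flags 0 and 1 marking the start of a record. The constant kMagicSketch, the helper LoadU32 and the function name are illustrative, and the sketch assumes headers sit at 4-byte offsets from begin and are stored in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed RecordIO magic word; treat as illustrative.
constexpr std::uint32_t kMagicSketch = 0xced7230a;

// Read one 32-bit word without violating alignment rules.
inline std::uint32_t LoadU32(const char* p) {
  std::uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

// Return a pointer to the last record header in [begin, end),
// or begin when no header is found.
const char* FindLastRecordBeginSketch(const char* begin, const char* end) {
  assert(begin <= end);
  const std::size_t nwords = static_cast<std::size_t>(end - begin) / 4;
  if (nwords < 2) return begin;
  // Visit word pairs (i, i + 1) from the back of the buffer to the front.
  for (std::size_t i = nwords - 1; i-- > 0;) {
    if (LoadU32(begin + 4 * i) != kMagicSketch) continue;
    const std::uint32_t lrec = LoadU32(begin + 4 * (i + 1));
    const std::uint32_t cflag = lrec >> 29;  // upper 3 bits
    if (cflag == 0 || cflag == 1) return begin + 4 * i;
  }
  return begin;
}
```

A splitter can use such a pointer to cut a buffer at a record boundary so that a chunk never ends in a partial record.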

◆ IsTextParser()

bool dmlc::io::IndexedRecordIOSplitter::IsTextParser ( void  )
inline override virtual

query whether this object is a text parser

Returns
true if this object represents a text parser; false if it represents a binary parser

Implements dmlc::io::InputSplitBase.

◆ NextBatch()

bool dmlc::io::IndexedRecordIOSplitter::NextBatch ( Blob * out_chunk,
size_t n_records 
)
override virtual

get a chunk of memory that can contain multiple records, with a hint for how many records are needed. The caller needs to parse the content of the resulting chunk: for text files, out_chunk can contain data of multiple lines; for recordio, out_chunk can contain multiple records (including headers).

This function ensures there won't be a partial record in the chunk. The caller can modify the memory content of out_chunk; the memory is valid until the next call to NextRecord, NextChunk or NextBatch.

Parameters
out_chunk: used to store the result
n_records: hint for how many records are needed
Returns
true if we can successfully get the next record, false if we reached the end of the split
See also
InputSplit::Create for definition of record
RecordIOChunkReader to parse recordio content from out_chunk

Reimplemented from dmlc::InputSplit.

◆ NextBatchEx()

bool dmlc::io::IndexedRecordIOSplitter::NextBatchEx ( Chunk * chunk,
size_t n_records 
)
override virtual

fill the given chunk with new batch of data without using internal temporary chunk

Reimplemented from dmlc::io::InputSplitBase.

◆ NextChunk()

bool dmlc::io::IndexedRecordIOSplitter::NextChunk ( Blob * out_chunk)
override virtual

get a chunk of memory that can contain multiple records. The caller needs to parse the content of the resulting chunk: for text files, out_chunk can contain data of multiple lines; for recordio, out_chunk can contain multiple records (including headers).

This function ensures there won't be a partial record in the chunk. The caller can modify the memory content of out_chunk; the memory is valid until the next call to NextRecord, NextChunk or NextBatch.

Usually NextRecord is sufficient; NextChunk can be used by some multi-threaded parsers to parse the input content.

Parameters
out_chunk: used to store the result
Returns
true if we can successfully get the next record, false if we reached the end of the split
See also
InputSplit::Create for definition of record
RecordIOChunkReader to parse recordio content from out_chunk

Reimplemented from dmlc::io::InputSplitBase.

◆ NextChunkEx()

bool dmlc::io::IndexedRecordIOSplitter::NextChunkEx ( Chunk * chunk)
inline override virtual

fill the given chunk with new data without using internal temporary chunk

Reimplemented from dmlc::io::InputSplitBase.

◆ NextRecord()

bool dmlc::io::IndexedRecordIOSplitter::NextRecord ( Blob * out_rec)
inline override virtual

get the next record; the returned value is valid until the next call to NextRecord, NextChunk or NextBatch. The caller can modify the memory content of out_rec.

For text, out_rec contains a single line. For recordio, out_rec contains one record's content (with the header stripped).

Parameters
out_rec: used to store the result
Returns
true if we can successfully get the next record, false if we reached the end of the split
See also
InputSplit::Create for definition of record

Reimplemented from dmlc::io::InputSplitBase.
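
Because the returned record is only valid until the next call, callers that want to keep records need to copy them out inside the read loop. The sketch below shows that calling pattern with a self-contained stand-in: ToySplit is a hypothetical in-memory reader, and the Blob struct here is a local assumption modelled on a pointer plus size pair, so neither is the dmlc type itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Local stand-in for a blob of memory: pointer plus size (assumed layout).
struct Blob {
  void* dptr;
  std::size_t size;
};

// Hypothetical in-memory reader that reuses one internal buffer,
// so each NextRecord call invalidates the previously returned Blob.
class ToySplit {
 public:
  explicit ToySplit(std::vector<std::string> recs) : recs_(std::move(recs)) {}

  bool NextRecord(Blob* out_rec) {
    assert(out_rec != nullptr);
    if (pos_ >= recs_.size()) return false;
    buf_ = recs_[pos_++];  // overwrites the previous record's bytes
    out_rec->dptr = &buf_[0];
    out_rec->size = buf_.size();
    return true;
  }

  void BeforeFirst() { pos_ = 0; }

 private:
  std::vector<std::string> recs_;
  std::string buf_;
  std::size_t pos_ = 0;
};

// Drain the split, copying each record before asking for the next one.
std::vector<std::string> ReadAll(ToySplit* split) {
  std::vector<std::string> out;
  Blob rec;
  while (split->NextRecord(&rec)) {
    out.emplace_back(static_cast<const char*>(rec.dptr), rec.size);
  }
  return out;
}
```

The same loop shape applies to a second pass: call BeforeFirst, then read again until NextRecord returns false.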

◆ ReadChunk()

bool dmlc::io::IndexedRecordIOSplitter::ReadChunk ( void *  buf,
size_t *  size 
)
override virtual

read a chunk of data into buf; the data can span multiple records, but cannot contain partial records

Parameters
buf: the memory region of the buffer, should be properly aligned to 64 bits
size: the maximum size of memory; after the function returns, it stores the size of the chunk
Returns
whether end of file was reached

Reimplemented from dmlc::io::InputSplitBase.
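
The contract above (size is a capacity on input and the chunk length on output, and only whole records are copied) can be sketched with an in-memory file and an index of records. This is an illustration under assumptions, not the library's code: ReadChunkSketch is a hypothetical function, the index entries are assumed to be (offset, length) pairs, and the return convention (true while records remain, false once the index is exhausted) is chosen for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Copy as many whole records as fit into buf.
// On input *size is the buffer capacity; on output it is the bytes written.
// Returns false only when no records remain. A return of true with
// *size == 0 means the next record does not fit and the buffer must grow.
bool ReadChunkSketch(
    const std::vector<char>& file,
    const std::vector<std::pair<std::size_t, std::size_t>>& index,
    std::size_t* cursor, void* buf, std::size_t* size) {
  assert(cursor != nullptr && buf != nullptr && size != nullptr);
  if (*cursor >= index.size()) {
    *size = 0;
    return false;
  }
  const std::size_t capacity = *size;
  std::size_t written = 0;
  char* dst = static_cast<char*>(buf);
  while (*cursor < index.size()) {
    const std::size_t offset = index[*cursor].first;
    const std::size_t length = index[*cursor].second;
    assert(offset + length <= file.size());
    if (written + length > capacity) break;  // never copy a partial record
    std::memcpy(dst + written, file.data() + offset, length);
    written += length;
    ++*cursor;
  }
  *size = written;
  return true;
}
```

Stopping before a record that does not fit, rather than truncating it, is what keeps every chunk parseable on its own.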

◆ ResetPartition()

void dmlc::io::IndexedRecordIOSplitter::ResetPartition ( unsigned  part_index,
unsigned  num_parts 
)
override protected virtual

reset the InputSplit to a certain part id. The InputSplit will point to the head of the newly specified segment. This feature may not be supported by every implementation of InputSplit.

Parameters
part_index: The part id of the new input.
num_parts: The total number of parts.

Reimplemented from dmlc::io::InputSplitBase.
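
For an indexed splitter, one plausible way to assign a part is to divide the record index into num_parts contiguous ranges of near-equal size. The sketch below shows that arithmetic only; PartitionRecords and Range are hypothetical names, and the ceiling-division scheme is an assumption rather than a statement of what the library does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Half-open range of record ids [begin, end) owned by one part.
struct Range {
  std::size_t begin;
  std::size_t end;
};

// Split n_records into num_parts contiguous ranges using ceiling division.
// Trailing parts may be shorter or empty when records run out.
Range PartitionRecords(std::size_t n_records, unsigned part_index,
                       unsigned num_parts) {
  assert(num_parts > 0 && part_index < num_parts);
  const std::size_t step = (n_records + num_parts - 1) / num_parts;
  const std::size_t begin =
      std::min<std::size_t>(n_records, static_cast<std::size_t>(part_index) * step);
  const std::size_t end = std::min<std::size_t>(n_records, begin + step);
  return Range{begin, end};
}
```

With 10 records and 3 parts this gives the ranges [0, 4), [4, 8) and [8, 10), so every record belongs to exactly one part.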

◆ SeekRecordBegin()

size_t dmlc::io::IndexedRecordIOSplitter::SeekRecordBegin ( Stream * fi)
override protected virtual

seek to the beginning of the first record in current file pointer

Returns
how many bytes we read past

Implements dmlc::io::InputSplitBase.


The documentation for this class was generated from the following files: