stream.scanner

local stream = require "stream"
local scanner = stream.scanner.new{ stream = system.in_ }
scanner:get_line()

This class abstracts formatted buffered textual input as an AWK-inspired scanner. The input stream is broken into records, and each record may be further broken down into fields.

get_line() is used to get the next record. Surplus data read from the stream is kept in the buffer to be used in the next call to get_line().

When EOF is found on the stream, the buffered data is returned as the last record. To differentiate records finished on EOF from records finished on record_separator, check self.record_terminator.

You may change the parsing rules (e.g. record and field separators) once get_line() returns.

Line-based protocols

Many commonly-used internet protocols are line-based, which means that they have protocol elements that are delimited by the character sequence "\r\n". Examples include HTTP, SMTP and FTP.

To easily parse these protocols, you may set a scanner object with record_separator="\r\n". Then, get_line() will return a new line each time it is called. If the field separator/pattern is also specified, the line will be broken into a table made of the fields.

New buffers will be allocated as more space is needed until a specified maximum (or an unspecified maximum default).

Combining strategies

You may also use different parsers & algorithms to consume some parts of the stream. For instance, HTTP starts as a line-delimited textual protocol. Once the header section is consumed, the body payload is determined by rules extracted out of the headers. For "content-length" defined message bodies, you read a fixed amount of bytes to consume it.

In such scenario, you may use scanner to parse the header section, and, once it’s time to read the body, use the method buffer() to retrieve already buffered data. Just be sure to call remove_line() before calling buffer() so the last line of the header section doesn’t get mixed up with the body. Then it’ll be a matter of calling stream.read_all(3em) (or several calls to read_some()) to consume the body.

Once it’s time to parse the header section for the next message in the stream, you can call set_buffer() to pass the buffered data back to the scanner.

Functions

new(opts: table|nil) → scanner

Set attributes required by scanner.mt, set opts's metatable to scanner.mt and returns opts. If opts is nil, then a new table is returned.

You MUST set the stream attribute (before or after the call to new()) before using scanner's methods.

Optional attributes to opts:

record_separator: string|regex = "\n"

The pattern used to split records.

Regexes must be used with care on streaming content. For instance, if you set record_separator to the regex /abc(XYZ)?/, it is possible that "XYZ" will not match just because it wasn’t buffered yet even if it’ll appear in the next calls to read() on the stream.

Other tools such as GAWK suffer from the same constraint. Some regexes engines offer special support when working on streaming content, but they don’t solve the whole problem as it’d be impossible to differentiate “max record size reached” from “record_separator not found” if an attempt were made to use this support.

field_separator: string|regex|function|nil

If non-nil, defines how to split fields. Otherwise, the whole line/record is returned as is.

Check regex.split() to understand how fields are separated. In short, field_separator defines what fields are not.

On functions, the function is used to split the fields out of the line/record and its return is passed through.

field_pattern: regex|nil

Defines what fields are (as opposed to field_separator that defines what fields are not). It must be a regex. Check regex.patsplit() for details.

trim_record: boolean|string = false

Whether to strip linear whitespace (if string is given, then it’ll define the list of whitespace characters) from the beginning and end of each record.

buffer_size_hint: integer|nil

The initial size for the buffer. As is the case for every hint, it might be ignored.

max_record_size: integer = unspecified

The maximum size for each record/buffer.

with_awk_defaults(read_stream) → scanner

Returns a scanner acting on stream that has the semantics from AWK defaults:

record_separator

"\n"

trim_record

true

field_separator

A regex that describes a sequence of linear whitespace.

get_line(self) → byte_span|byte_span[]

Reads next record buffering any bytes as required and returns it. If field_separator, or field_pattern were set, the record’s extracted fields are returned.

It also sets self.record_terminator to the record separator just read. On end of streams that don’t include the record separator, self.record_terminator will be set to an empty byte_span (or an empty string if record separator was specified as a string).

It also increments self.record_number by one on success (it is initially zero).

buffered_line(self) → byte_span

Returns current buffered record without extracting its fields. It works like AWK’s $0 variable.

Precondition

A record must be present in the buffer from a previous call to get_line().

remove_line(self)

Removes current record from the buffer and sets self.record_terminator to nil.

Precondition

A record must be present in the buffer from a previous call to get_line().

buffer(self) → byte_span, integer

Returns the buffer + the offset where the read data begins.

The returned buffer’s capacity may be greater than its length.

set_buffer(self, buf: byte_span[, offset: integer = 1])

Set buf as the new internal buffer.

buf's capacity will indicate the usable part of the buffer for IO ops and buf's length (after slicing from offset) will indicate the buffered data.

Previously buffered record and self.record_terminator are discarded.
Example
local buffered_data = buf:slice(offset)
scanner:set_buffer(buf, offset)