For the latest stable version, please use Emilua API 0.10!
stream.scanner
local stream = require "stream"
local system = require "system"
local scanner = stream.scanner.new{ stream = system.in_ }
local line = scanner:get_line() -- next record read from standard input
This class abstracts formatted, buffered textual input as an AWK-inspired scanner. The input stream is broken into records, and each record may be further broken down into fields.
get_line() is used to get the next record. Surplus data read from the stream is kept in the buffer to be used in the next call to get_line().
When EOF is found on the stream, the buffered data is returned as the last record. To differentiate records finished on EOF from records finished on record_separator, check self.record_terminator.
You may change the parsing rules (e.g. record and field separators) once get_line() returns.
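The buffering behaviour described above can be modelled in a few lines of plain Lua. This is only an illustrative sketch, not Emilua's implementation: new_model, feed and next_record are hypothetical helpers, and plain strings stand in for the stream and for the scanner's buffer.

```lua
-- Plain-Lua model of record splitting; NOT Emilua's implementation.
-- `new_model`, `feed` and `next_record` are hypothetical names.
local function new_model(record_separator)
    return { buffer = "", record_separator = record_separator }
end

-- Append a chunk as if it had just been read from the stream.
local function feed(model, chunk)
    model.buffer = model.buffer .. chunk
end

-- Returns the next record and the terminator that ended it, or nil if
-- the separator is not buffered yet. When `eof` is true, the remaining
-- data is returned as the last record with an empty terminator.
local function next_record(model, eof)
    local first, last = model.buffer:find(model.record_separator, 1, true)
    if first then
        local record = model.buffer:sub(1, first - 1)
        model.buffer = model.buffer:sub(last + 1) -- keep the surplus
        return record, model.record_separator
    elseif eof and #model.buffer > 0 then
        local record = model.buffer
        model.buffer = ""
        return record, ""
    end
    return nil
end

local m = new_model("\r\n")
feed(m, "first\r\nsec")             -- one full record plus surplus data
local r1, t1 = next_record(m)       -- record finished on the separator
local pending = next_record(m)      -- nil: "sec" has no separator yet
feed(m, "ond")
local r2, t2 = next_record(m, true) -- record finished on EOF
```

The empty terminator returned for the last record plays the role that self.record_terminator plays in the real scanner.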
Line-based protocols
Many commonly used internet protocols are line-based, which means that they have protocol elements delimited by the character sequence "\r\n". Examples include HTTP, SMTP and FTP.
To easily parse these protocols, you may set up a scanner object with record_separator="\r\n" (the default). Then, get_line() will return a new line each time it is called. If the field separator/pattern is also specified, the line will be broken into a table made of the fields.
New buffers will be allocated as more space is needed, up to max_record_size if it is specified (or up to an unspecified default maximum otherwise).
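As a rough sketch of how this might be used for an HTTP-style request line, the helper below reads one line and names its fields. read_request_line is a hypothetical helper, not part of Emilua. It assumes the scanner was created with a field separator of " " and that each returned field can be converted with tostring(), which is an assumption about byte_span. A stand-in object replaces the real scanner so the sketch can be exercised without a stream.

```lua
-- Hypothetical helper: reads one line and interprets its fields as an
-- HTTP request line such as "GET /index.html HTTP/1.1".
-- Assumes `scanner` was created with field_separator = " ", so that
-- get_line() returns a table of fields, and that each field can be
-- converted with tostring() (an assumption about byte_span).
local function read_request_line(scanner)
    local fields = scanner:get_line()
    if #fields ~= 3 then
        return nil, "malformed request line"
    end
    return {
        method = tostring(fields[1]),
        target = tostring(fields[2]),
        version = tostring(fields[3]),
    }
end

-- With Emilua, the scanner would be built roughly like this
-- (`sock` is a placeholder for an already connected stream):
--   local stream = require "stream"
--   local scanner = stream.scanner.new{ stream = sock, field_separator = " " }

-- Stand-in object so the sketch can run without a real stream.
local fake_scanner = {
    get_line = function(self)
        return { "GET", "/index.html", "HTTP/1.1" }
    end,
}
local request = read_request_line(fake_scanner)
print(request.method, request.target, request.version)
```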
Combining strategies
You may also use different parsers and algorithms to consume some parts of the stream. For instance, HTTP starts as a line-delimited textual protocol. Once the header section is consumed, the body payload is determined by rules extracted from the headers. For message bodies defined by "content-length", you read a fixed number of bytes to consume them.
In such a scenario, you may use scanner to parse the header section and, once it’s time to read the body, use the method buffer() to retrieve already buffered data. Just be sure to call remove_line() before calling buffer() so the last line of the header section doesn’t get mixed up with the body. Then it’ll be a matter of calling stream.read_all(3em) (or making several calls to read_some()) to consume the body.
Once it’s time to parse the header section for the next message in the stream, you can call set_buffer() to pass the buffered data back to the scanner.
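The hand-off from header parsing to body reading might look like the sketch below. consume_header_section is a hypothetical helper, not part of Emilua. It assumes that no field separator is set, and that the values returned by get_line() support the length operator and tostring(), which is an assumption about byte_span. The string-backed stand-ins at the end exist only so the sketch can run without a real stream.

```lua
-- Hypothetical helper, not part of Emilua. Assumes no field separator
-- is set, so get_line() returns each line as one value that supports
-- `#` and tostring() (an assumption about byte_span).
local function consume_header_section(scanner)
    local headers = {}
    while true do
        local line = scanner:get_line()
        if #line == 0 then
            break -- the empty line ends the header section
        end
        headers[#headers + 1] = tostring(line)
    end
    scanner:remove_line() -- keep the empty line out of the body data
    local buf, offset = scanner:buffer()
    -- Body bytes that were already buffered; the remainder still has
    -- to be read from the stream itself.
    return headers, buf:slice(offset)
end

-- String-backed stand-ins so the sketch runs without a real stream.
local lines = { "Host: example.com", "Content-Length: 5", "" }
local fake_buffer = {
    data = "hel",
    slice = function(self, from) return self.data:sub(from) end,
}
local fake_scanner = {
    index = 0,
    get_line = function(self)
        self.index = self.index + 1
        return lines[self.index]
    end,
    remove_line = function(self) self.removed = true end,
    buffer = function(self) return fake_buffer, 1 end,
}
local headers, body_start = consume_header_section(fake_scanner)
```

Any bytes left over after the body has been read belong to the next message, which is where set_buffer() comes in.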
Functions
new(opts: table|nil) → scanner
Sets the attributes required by scanner.mt, sets opts's metatable to scanner.mt, and returns opts. If opts is nil, then a new table is returned.
You MUST set the stream attribute (before or after the call to new()) before using scanner's methods.
Optional attributes for opts:
record_separator: string|regex = "\r\n"
-
The pattern used to split records.
Regexes must be used with care on streaming content. For instance, if you set record_separator to the regex /abc(XYZ)?/, it is possible that "XYZ" will not match just because it wasn’t buffered yet, even if it’ll appear in the next calls to read() on the stream. Other tools such as GAWK suffer from the same constraint. Some regex engines offer special support when working on streaming content, but they don’t solve the whole problem, as it’d be impossible to differentiate “max record size reached” from “record_separator not found” if an attempt were made to use this support.
field_separator: string|regex|function|nil
-
If non-nil, defines how to split fields. Otherwise, the whole line/record is returned as is.
Check regex.split() to understand how fields are separated. In short, field_separator defines what fields are not.
If a function is given, it is used to split the fields out of the line/record, and its return value is passed through.
field_pattern: regex|nil
-
Defines what fields are (as opposed to field_separator, which defines what fields are not). It must be a regex. Check regex.patsplit() for details.
trim_record: boolean|string = false
-
Whether to strip linear whitespace from the beginning and end of each record. If a string is given, it defines the list of whitespace characters.
buffer_size_hint: integer|nil
-
The initial size for the buffer. As is the case for every hint, it might be ignored.
max_record_size: integer = unspecified
-
The maximum size for each record/buffer.
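The difference between describing what fields are not (field_separator) and what fields are (field_pattern) can be illustrated in plain Lua. This is a sketch under stated assumptions: Lua patterns stand in for Emilua's regexes, split_by_separator and split_by_pattern are hypothetical helpers, and the exact behaviour of regex.split() and regex.patsplit() may differ in details.

```lua
-- field_separator style: describe what fields are NOT (the gaps).
-- The separator pattern must not match the empty string.
local function split_by_separator(record, separator_pattern)
    local fields, pos = {}, 1
    while true do
        local first, last = record:find(separator_pattern, pos)
        if not first then
            fields[#fields + 1] = record:sub(pos)
            return fields
        end
        fields[#fields + 1] = record:sub(pos, first - 1)
        pos = last + 1
    end
end

-- field_pattern style: describe what fields ARE (the matches).
local function split_by_pattern(record, field_pattern)
    local fields = {}
    for field in record:gmatch(field_pattern) do
        fields[#fields + 1] = field
    end
    return fields
end

local record = "alice:42::x"
local by_sep = split_by_separator(record, ":")   -- keeps the empty field
local by_pat = split_by_pattern(record, "[^:]+") -- skips the empty field
```

In this model, splitting on the separator yields four fields (one of them empty), while matching the field pattern yields only the three non-empty ones.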
with_awk_defaults(read_stream) → scanner
Returns a scanner acting on read_stream that has the semantics of AWK’s defaults:
record_separator
-
"\n"
trim_record
-
true
field_separator
-
A regex that describes a sequence of linear whitespace.
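The effect of these defaults on a single record can be modelled in plain Lua. This is only a sketch: awk_fields is a hypothetical helper, Lua patterns stand in for Emilua's regex, and linear whitespace is assumed here to mean spaces and tabs.

```lua
-- Plain-Lua model of the AWK-like defaults applied to one record:
-- trim surrounding whitespace, then take the runs of non-whitespace
-- characters as fields. Hypothetical helper, not Emilua's code.
local function awk_fields(record)
    local trimmed = record:match("^[ \t]*(.-)[ \t]*$")
    local fields = {}
    for field in trimmed:gmatch("[^ \t]+") do
        fields[#fields + 1] = field
    end
    return fields
end

local fields = awk_fields("  root   0\t/bin/sh  ")
for i, field in ipairs(fields) do
    print(i, field)
end
```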
get_line(self) → byte_span|byte_span[]
Reads the next record, buffering any bytes as required, and returns it. If field_separator or field_pattern was set, the record’s extracted fields are returned.
It also sets self.record_terminator to the record separator just read. At the end of streams that don’t include the record separator, self.record_terminator will be set to an empty byte_span (or an empty string if the record separator was specified as a string).
It also increments self.record_number by one on success (it is initially zero).
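A caller that needs to know whether a record was finished by the separator or by the end of the stream could wrap get_line() along these lines. read_record is a hypothetical helper, and applying the length operator to record_terminator is an assumption about byte_span (it is certainly valid for strings). The stand-in scanner only mimics the documented attributes.

```lua
-- Hypothetical helper: reads one record and reports whether it was
-- finished by the end of the stream rather than by the separator.
-- Assumes `#` works on record_terminator (true for strings; an
-- assumption for byte_span).
local function read_record(scanner)
    local record = scanner:get_line()
    local at_eof = #scanner.record_terminator == 0
    return record, at_eof, scanner.record_number
end

-- Stand-in that mimics a scanner returning its last record on EOF.
local fake_scanner = {
    record_number = 0,
    get_line = function(self)
        self.record_number = self.record_number + 1
        self.record_terminator = ""
        return "last line without separator"
    end,
}
local record, at_eof, number = read_record(fake_scanner)
```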
buffered_line(self) → byte_span
Returns the current buffered record without extracting its fields. It works like AWK’s $0 variable.
Precondition
A record must be present in the buffer from a previous call to |
remove_line(self)
Removes the current record from the buffer and sets self.record_terminator to nil.
Precondition
A record must be present in the buffer from a previous call to |
buffer(self) → byte_span, integer
Returns the buffer and the offset where the read data begins.
The returned buffer’s capacity may be greater than its length.
set_buffer(self, buf: byte_span[, offset: integer = 1])
Sets buf as the new internal buffer.
buf's capacity indicates the usable part of the buffer for IO ops, and buf's length (after slicing from offset) indicates the buffered data.
The previously buffered record and self.record_terminator are discarded.
-- The data treated as already buffered is buf:slice(offset)
local buffered_data = buf:slice(offset)
scanner:set_buffer(buf, offset)