Prerequisites
Introduction
The Get Next Line project is a project in C programming that requires reading a file line by line, while handling multiple file descriptors and maintaining the state of each file descriptor.
In this study, we will discuss how to write an optimized version of the GNL project.
Implementation
Let’s split the problem into smaller parts.
We have to:
- Read a file line by line
- Save the lines in a buffer
- Check for New Line
- Return the line up to the New Line character if it exists
- Handle last line and EOF (end of file) cases
I will not guide you through the full implementation of the GNL project, but I will provide some tips on how to write efficient code. Writing it by yourself will help you understand the concepts better.
Optimization
Since we can’t know the BUFFER_SIZE
in advance, it can be even 1
.
If the BUFFER_SIZE
is 1
, we will have to call the read
function for every single byte, which is extremely inefficient because the read
function is a system call - it’s expensive in terms of time.
Therefire we cannot optimize the read
function, but surely we can optimize how often we reallocate memory for the buffer.
When I was first writing the GNL project, I was reallocating memory for the buffer in each iteration of the loop - it was an easy solution, since we could assume that we always have enough memory for the next read. But let’s imagine the following scenarios:
1. Expanding the buffer ⚠️
- We have a file with one big line of text (let’s say 1GB)
- We have a
BUFFER_SIZE
of1
- We are half way through the line, and for each byte we have to reallocate memory for the buffer - so we will have to reallocate 500MB of memory - this is extremely inefficient, and this is done for every single byte.
2. Returning the line and moving memory ⚠️
- Let’s say we have found a new line, and we have to return the line up to the new line character.
- If we reallocate or move the memory for the buffer when we find a new line, in case of big lines, we again will have to move a lot of memory.
Solution
Both issues can be solved by using a Struct to store the buffer data, buffer capacity, buffer end index, and buffer’s start index.
struct s_buffer
{
char *data;
size_t capacity;
size_t start;
size_t end;
};
data
- the memorycapacity
- the maximum size of the bufferstart
- the start index of the bufferend
- the end index of the buffer
By implementing such struct, we can avoid reallocating memory for the buffer in each iteration of the loop, and we can avoid moving memory when we find a new line. But how?
1.Expanding the buffer ✅
- Now using the struct, we can allocate memory for the buffer only when we need it. My approach was like this:
- If the buffer is empty (start == end), we allocate initial size of the buffer, which is
BUFFER_SIZE
* 4. Let’s sayBUFFER_SIZE
is16
, so we allocate64
bytes. - If the buffer has not enough space for the next read (
BUFFER_SIZE
+ 1), we reallocate memory for the buffer, doubling the capacity of the buffer.
- If the buffer is empty (start == end), we allocate initial size of the buffer, which is
2.Returning the line and moving memory ✅
- When we find a new line, we don’t have to move memory. We can just return the line up to the new line character, and update the
start
index of the buffer to the index of the new line character + 1. This way we can continue to work with existing buffer memory until we need to expand it.
Testing
When you finish the implementation, you can test your GNL project with this gnlTester by Tripouille.
If everything is working correctly, you should see the following output:
[Mandatory]
[BUFFER_SIZE = 1]:
Invalid fd: 1.OK 2.OK 3.OK
files/empty: 1.OK 2.OK
files/nl: 1.OK 2.OK
files/41_no_nl: 1.OK 2.OK
files/41_with_nl: 1.OK 2.OK 3.OK
files/42_no_nl: 1.OK 2.OK
files/42_with_nl: 1.OK 2.OK 3.OK
files/43_no_nl: 1.OK 2.OK
files/43_with_nl: 1.OK 2.OK 3.OK
files/multiple_nlx5: 1.OK 2.OK 3.OK 4.OK 5.OK 6.OK
files/multiple_line_no_nl: 1.OK 2.OK 3.OK 4.OK 5.OK 6.OK
files/multiple_line_with_nl: 1.OK 2.OK 3.OK 4.OK 5.OK 6.OK
files/alternate_line_nl_no_nl: 1.OK 2.OK 3.OK 4.OK 5.OK 6.OK 7.OK 8.OK 9.OK 10.OK
files/alternate_line_nl_with_nl: 1.OK 2.OK 3.OK 4.OK 5.OK 6.OK 7.OK 8.OK 9.OK 10.OK
files/big_line_no_nl: 1.OK 2.OK
files/big_line_with_nl: 1.OK 2.OK
stdin: 1.OK 2.OK 3.OK 4.OK 5.OK 6.OK 7.OK 8.OK 9.OK 10.OK
The most important part is on line 17 and 18. If you see OK
for these tests, your GNL project is efficient and working correctly.