My previous entry about the Nokia 5110/3310 LCD was more of a storytelling session than a dive into the technical parts or the source code. Well, speaking of source code, the GitHub repository does cover the main usage concepts, but I still wanted to add a few paragraphs here now.
After finishing the initial proof of concept and dumping the example code on GitHub, I had another look and tried to see what parts could be optimized. After all, the very core idea was to reduce the program memory size needed to display XBM image based animations, so why stop there.
I was curious to see if I could fit the crappy-face-with-wobbling-eyes example animation into an AVR chip with only 2kB of flash memory, like an 8-pin ATtiny25. That, however, requires one crucial adjustment regardless of program memory size.
Cutting down on RAM usage
The original example targeted an ATmega328 with 32kB program memory and 2kB RAM. So if this goal succeeds, the whole example would fit into that RAM alone - which makes no sense in practice, and it doesn't work that way, but it puts things in perspective. Especially in perspective what a spoiled fat cat these amounts of resources can make you.
Sticking with the ATtiny25 as example target, the available RAM amount drops to a sad 128 bytes. That amount couldn't even store a full-length tweet, let alone the full display memory content. In fact, this is how far 128 byte display memory would get us:
Not to mention that filling up all the RAM like that would leave no space for the stack. Kiss your local variables and function calls goodbye right here and now.
Long story short, the LCD memory content can't be buffered inside RAM with that little of it (and shouldn't be, even with just slightly more RAM than data, i.e. 512 bytes, due to stack growth). Luckily that step isn't really necessary, and the LCD memory can be updated by reading the image data directly from the program memory and sending it straight to the LCD.
So let's take the example's initial implementation that keeps the LCD memory in RAM, copies the given data into it and then sends the whole thing to the LCD via SPI:
/* internal LCD memory buffer */
unsigned char nokia_lcd_memory[LCD_MEMORY_SIZE];

void nokia_lcd_fullscreen(const uint8_t data[])
{
    memcpy_P(nokia_lcd_memory, (PGM_P) data, 504);
    nokia_lcd_update();
}

static void nokia_lcd_update(void)
{
    uint8_t x;
    uint8_t y;

    for (y = 0; y < LCD_Y_RES / 8; y++) {
        spi_send_command(0x80);     // set X addr to 0x00
        spi_send_command(0x40 | y); // set Y addr to y
        for (x = 0; x < LCD_X_RES; x++) {
            spi_send_data(nokia_lcd_memory[y * LCD_X_RES + x]); // send data
        }
    }
}
Now let's get rid of all the nokia_lcd_memory related parts, which in the end will be simply one function directly sending the given image data to the LCD:
void nokia_lcd_fullscreen(const uint8_t data[])
{
    uint8_t x;
    uint8_t y;

    for (y = 0; y < LCD_Y_RES / 8; y++) {
        spi_send_command(0x80);     // set X addr to 0x00
        spi_send_command(0x40 | y); // set Y addr to y
        for (x = 0; x < LCD_X_RES; x++) {
            /* read data straight from PROGMEM variable and send it */
            spi_send_data(pgm_read_byte(&(data[y * LCD_X_RES + x])));
        }
    }
}
The nokia_lcd_memory char array is however also used within the frame diff update functions for animations. I'll come back to that later. For now, we'll be using full screen animations, i.e. call ./xbm2nokia.sh -g. Yeah, this is completely counterproductive as it will more than double the program memory size, but bear with me.
ELF binary sections and sizes
Assuming full screen animations were created from all 9 frames x1.xbm-x9.xbm, the avr-size command will reveal the differences:
$ avr-size v?/example.elf
   text    data     bss     dec     hex filename
   5054       0     523    5577    15c9 v1/example.elf
   5038       0      19    5057    13c1 v2/example.elf
A few words on what each section means:
text: stores the actual compiled program code
data: stores all initial values for global variables that aren't zero
bss: defines the size of all global variables that are uninitialized or initialized with zero
dec: the total size of the executable, including text, data and bss
hex: the same as dec, just as a hexadecimal value
As we can see, the bss size decreased drastically, by 504 bytes, the size of the internal LCD memory buffer. If you're wondering why data and bss are separated: the bss section won't actually make it into the final Hex file that is flashed to the microcontroller. The C runtime library will - among other things - take care that the RAM is properly initialized before calling main(), by copying the data content to it, and zeroing everything in the bss section. So only the location and size of that zero-initialized memory section need to be known.
To verify that the bss section won't be flashed, have a look at the final Hex file created from the ELF file, which is what ends up in the microcontroller's program memory. If you know the Intel Hex file format, you might be able to just read the file directly; alternatively, avr-size works here as well:
$ avr-size v?/example.hex
   text    data     bss     dec     hex filename
      0    5054       0    5054    13be v1/example.hex
      0    5038       0    5038    13ae v2/example.hex
Well, it's not an ELF file, so everything is shown as data section content, but it's still good enough to see that these numbers are in both cases exactly the same as the ELF file's text section. Since there is nothing stored in the data section, this adds up. If you have doubts, just add a global variable like int abcd = 123; and watch the resulting 2 bytes (since int is 16 bit on AVR) show up in the data section accordingly.
But regardless of whether it's part of the to-be-flashed Hex file or not, the bss content will still take up its share of the RAM at runtime.
Coming back to animations
Well, as we can see from the avr-size output, eliminating the code that copies the image data into the internal buffer saved us a whopping 16 bytes in program memory. Too bad that's going to change.
As I mentioned before, the functions that handle the frame diffs for animations still use the nokia_lcd_memory buffer. There are actually two different implementations of this functionality: one that sends the full buffer content via SPI after updating all its content based on the frame diff, and one that sends the new display data to the display while it is processing the diff data and updating the internal buffer.
Getting rid of the internal buffer leaves us with option two as the only choice. This however requires additional address calculations that will increase the program size by around 160 bytes.
This way, we also won't need the separation between those two choices any longer, so everything related to the NOKIA_GFX_ANIMATION_FULL_UPDATE preprocessor define can be removed from both nokia_lcd.h and nokia_lcd.c, and all that will be left is:
void
nokia_lcd_update_diff(const struct nokia_gfx_frame *frame)
{
    uint16_t i;
    ...
        diff.data = pgm_read_byte(&(frame->diffs[i].data));
        /* nokia_lcd_memory[diff.addr] = diff.data; */
    ...
}
Okay, time to recreate the image data for animations and recompile:
/path/to/example$ ../xbm2nokia -a x?.xbm
/path/to/example$ cd v1
/path/to/example/v1$ make distclean
/path/to/example/v1$ make
/path/to/example$ cd ../v2
/path/to/example/v2$ make distclean
/path/to/example/v2$ make
/path/to/example/v2$ cd ..
/path/to/example$ avr-size v?/example.elf
   text    data     bss     dec     hex filename
   2210       0     523    2733     aad v1/example.elf
   2352       0      19    2371     943 v2/example.elf
/path/to/example$
So, we got rid of 504 bytes of RAM and gained 142 bytes of program memory. This sucks, but if RAM is scarce, we have no choice - and in this case, anything under 1kB RAM wouldn't be enough otherwise (as there's usually nothing between 512 and 1024 bytes with AVRs).
Adjusting for specific hardware
Before digging into assembly code and analyzing the compiled binaries, some other places for optimization can be considered, especially when taking a closer look at the hardware. In this case, it's just the LCD connected to the AVR. And that's exactly the point: it's the only thing connected. There is no other SPI slave to communicate with, so whenever we communicate via SPI, it's always with the LCD.
So, there is no reason to control the chip select / chip enable signal, we can simply connect the LCD's CE pin to ground (since it's active low) and remove all code toggling its state.
Getting rid of chip select handling
Handling the signal is done via preprocessor macros that just toggle the pin. The easiest way is to simply delete what the macros are doing, i.e. change
#define spi_cs_high() do { PORTB |= (1 << PB2); } while (0)
#define spi_cs_low() do { PORTB &= ~(1 << PB2); } while (0)
to
#define spi_cs_high()
#define spi_cs_low()
Or alternatively, fully delete every occurrence of either macro - the result will be the same: 20 bytes less content in the text section. (I did seem to run into some signal stability issues here, as the animation was a few pixels off in some places. I might just blame it on the shitty breadboard.)
An in-between alternative to the previous handling and grounding the pin is to simply set the pin output to zero once, before sending the first command in the display init function. This way there's no need to toggle the pin around every single command or data transfer.
A second candidate here is the LCD reset signal. Instead of using a microcontroller pin to drive the signal on startup, the LCD reset pin can be connected to an RC element that will keep the signal in a low state long enough during power-up.
Getting rid of LCD reset handling
Connecting an RC element (as in resistor and capacitor) between the supply voltage and the LCD reset pin will delay the signal rise long enough to count as an actual reset signal. According to the LCD driver's datasheet, a low pulse of at least 3μs is required for this. Googling for "RC delay calculator" or "RC time calculator" should point you to formulas and online calculators to figure out good R and C values - ladyada's old website has one for example.
I had a 22kΩ resistor and a 4.7μF capacitor lying next to me - that combination works most of the time. Sometimes, however, the display remains blank and I have to try again with a power cycle. Keep in mind: while basically anything over the required 3μs is okay, the LCD driver needs to be out of the reset state by the time the first command is sent via SPI.
Removing the pin handling: 4 bytes off.
Removing the delay between the signal changes: 12 more bytes.
Removing the nokia_lcd_reset() function altogether: 6 bytes!
Yeah, we're obviously not getting anywhere with that.
But then, if really thinking about a device like the ATtinyX5 with just 8 pins, reducing the number of required pins can be yet another crucial part.
You will not believe what happens next!
The only pin left to look at is the D/#C signal that toggles between command and data transfer with SPI. The actual SPI data and clock signals are fully in control of the AVR's SPI hardware module, so there isn't much to do about those pins.
Since the D/#C signal needs to be high for data and low for commands, it has to be controlled from the microcontroller firmware. But looking at my initial implementations of spi_send_data() and spi_send_command(), some parts are identical in both functions. In fact, everything but setting the D/#C signal level itself is identical. This calls for the next logical step: combine the common parts.
This could end up as something like this:
static void
spi_send(uint8_t data)
{
    SPDR = data;
    while (!(SPSR & (1 << SPIF))) {
        /* wait */
    }
}

static void
spi_send_command(uint8_t command)
{
    spi_dc_low();
    spi_send(command);
}

static void
spi_send_data(uint8_t data)
{
    spi_dc_high();
    spi_send(data);
}
So at this point we had reached 2310 bytes of text section size. Now, with this change in place, we get this:
   text    data     bss     dec     hex filename
   2310       0      19    2329     919 example.elf
Yes. As it turns out, the compiler optimization was doing this re-arrangement all along.
To see the real impact of this change, we can disable the compiler optimization by removing the -Os flag from the CFLAGS variable. This will actually result in a compiler warning, as it's generally a bad idea to disable compiler optimization.
   text    data     bss     dec     hex filename
   3536       0      19    3555     de3 example--duplicate-parts.elf
   3528       0      19    3547     ddb example--shared-common-part.elf
Well, no significant difference, but still: separating the shared common part into its own function saves a few bytes. And so, very clearly, does the compiler optimization, so let's put that flag back into the Makefile.
There probably is a lesson in here. Not every seemingly obvious optimization will actually have an impact? Something like that. And, well, I guess when it comes to assembly code in general, the compiler will probably do a better job than I ever could. At least from a code size point of view, but that's the point here - speed optimization would be a different story.
Digging into the binary and assembly files
So those were the obvious hardware related adjustments. As I mentioned in the initial article on this, changing the animation part to split frames, and therefore being able to store the 16-bit address data in 8 bits, would be another improvement on the code size.
But what comes next? What would be our options after all the (more or less) obvious parts have been optimized?
To answer that, you'll have to know where the program-space-hungry parts actually are. Two places to look are each source file's assembly listing, i.e. what the compiler turns the C code into, and the linker's map file, which shows more details on the resulting binary's memory layout.
The example code's Makefile is set up to generate both of those, i.e. .lst files with each source file's assembly code, and a .map file from the linker. There are other tools as well; for example, avr-objdump and avr-readelf will give similar information on the binary file.
Well, any one of those tools and files is worth its own article, so I won't go into details on them at this point. Just have a look at the .lst and .map files to get some idea. It's all standard gcc related stuff, so documentation is plentifully available on the internet.
Final words
Code optimization can be a tedious task, and at some point it might be wise to just settle on compromises - provided size constraints allow that (or speed constraints, but again, speed optimization is a different story). If you want to use a non-SMD ATmega with a small enough footprint (like a DIP-28 package), 32kB program memory is as good as it gets. If your current code needs more than that, you have to get creative.
And if your code needs less than that, well, you might be able to use a smaller member of the family and save a few cents on the component. But if you're a hobbyist, you probably won't really benefit from those savings and are better off with the bigger devices in the long run.
But regardless of actual size constraints, code optimization is worth a look, because in order to optimize your code, you need to know what you're doing, and you need to know your system. This works the other way around as well: trying to optimize your code will force you to get to know your system, and it's definitely a good way to learn new tricks.
All this is especially true for embedded systems and microcontrollers, but it can also help you write more efficient code on less constrained systems. Careful though, balance is key here; sacrificing all forms of readability and understandable logic to gain 1% is probably a bad idea unless you're at 101% capacity.