Last modified 2017-09-25 00:25:11 CDT

Getting Started with Cortex M0/0+/3, CMSIS, and GNU Tools

Update 01/26/2015: added lpc11uxx template, courtesy of usedbytes

Update 10/10/2013: upgraded to CMSIS v3.20 core; added lpc8xx,lpc11xx,lpc17xx templates

Git: git clone git://github.com/vsergeev/cmsis-templates.git

Github: http://github.com/vsergeev/cmsis-templates

This is a page where I’ve tried to consolidate everything I’ve learned about Cortex M0/0+/3, and present a tutorial for getting up and running with Cortex M0/0+/3, CMSIS, and GNU Tools. I have no affiliation with ARM or any Cortex M0/0+/3 vendor. Please email me with any errors on this page and I will be happy to correct them!

I’ll be using the LPC1313 as an example here, but most of this is general to any Cortex M0/0+/3 microcontroller.

Step 1: Acquire an appropriate ARM-GCC Toolchain

The first step is acquiring an ARM GCC toolchain that can target the Cortex M3 with -mcpu=cortex-m3 or the Cortex M0/0+ with -mcpu=cortex-m0. You can either go about building one yourself, and in that case there are many guides and nearly automatic scripts that can be found with your favorite search engine, or you can download a pre-built one. I’m going to take the quick approach and just recommend the latest pre-built CodeSourcery Lite arm-none-eabi-g** toolchain.

1. Download the CodeSourcery Lite arm-none-eabi toolchain, “IA32 GNU/Linux TAR” package here: http://www.mentor.com/embedded-software/sourcery-tools/sourcery-codebench/editions/lite-edition/arm-eabi

2. Extract the toolchain to your favorite place and add its bin directory to your path:

$ tar xvfj arm-2013.05-23-arm-none-eabi-i686-pc-linux-gnu.tar.bz2
$ export PATH=$PATH:/path/to/arm-2013.05/bin

The toolchain binaries will now be available with the arm-none-eabi- prefix in this session of your shell.

Step 2: Download and Configure CMSIS

1. Download and Extract the vendor-flavored CMSIS package for your chip

Or just clone my CMSIS template git repository for lpc8xx, lpc11xx, lpc13xx, and lpc17xx devices:

$ git clone git://github.com/vsergeev/cmsis-templates.git

CMSIS is ARM’s solution to abstracting the time-consuming and reference-manual-digging initialization of and low-level interface to the Cortex M0/0+/3 core. Specifically, if you remember the PLL configuration for an ARM7TDMI core microcontroller, this is precisely one kind of initialization sequence CMSIS takes care of for you. Instead, you just set a few appropriate defines specifying your clock / PLL configuration, and call SystemInit(). Another example is enabling and disabling NVIC interrupts.

CMSIS comes in two parts. The first part, called “Core Peripheral Access Layer”, is the generic Cortex M0/0+/3 core abstraction, which is essentially the register definitions for the Cortex M0/0+/3 core and a whole host of inline functions to easily manage the NVIC (to enable/disable interrupts, change interrupt priorities, etc.), configure the SysTick timer, and invoke useful Thumb-2 intrinsics (like WFI). This is implemented by the header files in core/.
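To give a feel for what those inline NVIC helpers do under the hood: enabling an interrupt boils down to setting one bit in an array of 32-bit set-enable registers. The sketch below models only that arithmetic on plain integers, so it can be tried on a host machine. The names nvic_iser_index() and nvic_iser_mask() are hypothetical helpers of mine, not CMSIS functions.

```c
#include <stdint.h>

/* Which 32-bit set-enable register holds the bit for this IRQ number.
 * Hypothetical helper for illustration, not part of CMSIS. */
uint32_t nvic_iser_index(uint32_t irqn)
{
    return irqn >> 5;                 /* 32 interrupts per register */
}

/* The single bit to write within that register. */
uint32_t nvic_iser_mask(uint32_t irqn)
{
    return (uint32_t)(1UL << (irqn & 0x1FUL));
}
```

On the target you would not write this yourself; the CMSIS call NVIC_EnableIRQ() performs the equivalent register write for you.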

The second part of CMSIS, called “Device Peripheral Access Layer”, is device-specific and supplied by the vendor. This part is at minimum one source and header file pair for implementing clock initialization and any other routines the vendor feels would be useful to do for you, and one header file that characterizes all of the peripherals the vendor has attached to the Cortex M0/0+/3 core in their chip. This is implemented by the files in <device>/.

The latest CMSIS versions also provide a feature-full DSP library, which implements many math, transform, matrix, filtering, and control signal processing routines.

Some vendors have taken the liberty to provide a whole library of peripheral drivers (e.g. LPCOpen), but I think that things quickly become opaque and frustrating when you are relying on tens of files implementing all the drivers you don’t need, allocating interrupt handlers behind your back, and RPCing to the kitchen sink next door.

At this point, you’ll end up with 7 important files (in the CM3 example):

2. Configure your desired clock settings

Looking into system_LPC13xx.c, the definitions on lines 137-156 and 234 may require adjustment for your clock setup. Specifically, the *_SETUP defines are all 1/0 booleans specifying whether that clock or PLL should be enabled and configured. The corresponding *_Val defines specify multiplier and divider values, which are documented in the chip’s user manual.

AHBCLKCTRL_Val defines which peripherals outside the core will be clocked. It is important to set the corresponding clock bit if you want to use SPI, ADC, etc. peripherals. __XTAL defines the external crystal frequency in hertz. In general, follow the comments and datasheet, enable/disable the clocks you need, and set the multipliers/dividers for your target frequency.

The setting configured in the template git repository is to use the internal 12MHz RC oscillator multiplied up to a 72MHz system clock with the PLL for the LPC13xx. This is a safe starting default.

After you set the appropriate constants, initializing all of the clocks (and possibly some other peripherals if they are specified among those constants) on your chip requires only one call to SystemInit(). In the template, main.c does this for you before entering a loop to blink an LED.
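As a worked example of picking the multiplier/divider values, here is a small host-side sketch for the 12MHz to 72MHz case. It rests on my reading of the LPC13xx user manual, so treat these as assumptions to verify for your chip: the PLL control value packs MSEL = M - 1 in bits 4:0 and PSEL in bits 6:5 (P = 1, 2, 4, 8), and the PLL's internal oscillator, running at Fout x 2 x P, has to stay between 156MHz and 320MHz. pll_ctrl_value() is a hypothetical helper, not something found in system_LPC13xx.c.

```c
#include <stdint.h>

/* Assumed limits of the PLL's internal oscillator (check the user manual) */
#define FCCO_MIN_HZ 156000000ULL
#define FCCO_MAX_HZ 320000000ULL

/* Returns a PLL control value (MSEL in bits 4:0, PSEL in bits 6:5),
 * or -1 if fout_hz is not reachable from fin_hz.
 * Hypothetical helper for illustration only. */
int32_t pll_ctrl_value(uint32_t fin_hz, uint32_t fout_hz)
{
    uint32_t m, psel;

    if (fin_hz == 0 || fout_hz % fin_hz != 0)
        return -1;

    m = fout_hz / fin_hz;             /* feedback multiplier M */
    if (m < 1 || m > 32)
        return -1;

    for (psel = 0; psel < 4; psel++) {
        uint64_t p = 1ULL << psel;    /* post divider P = 1, 2, 4, 8 */
        uint64_t fcco = (uint64_t)fout_hz * 2ULL * p;
        if (fcco >= FCCO_MIN_HZ && fcco <= FCCO_MAX_HZ)
            return (int32_t)((psel << 5) | (m - 1));
    }
    return -1;
}
```

Under those assumptions, 12MHz in and 72MHz out gives M = 6 and P = 2, i.e. a control value of 0x25.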

Sidenote

CMSIS defines a neat precedent for accessing registers that vendors mimic in their device header file. Instead of enumerating every individual peripheral register on its own, in the style of #define U0RBR (*((volatile unsigned int *)0x40008000)), to yield a list of 300 registers like in the old days (see lpc21xx.h), CMSIS groups each peripheral into a single C structure whose correctly offset members represent all of the individual registers of that peripheral.

Here is a code segment from core_cm3.h that describes the SysTick family of registers:

//  * Copyright (C) 2009 ARM Limited. All rights reserved.

/** @addtogroup CMSIS_CM3_SysTick CMSIS CM3 SysTick
  memory mapped structure for SysTick
  @{
 */
typedef struct
{
  __IO uint32_t CTRL;                         /*!< Offset: 0x00  SysTick Control and Status Register */
  __IO uint32_t LOAD;                         /*!< Offset: 0x04  SysTick Reload Value Register       */
  __IO uint32_t VAL;                          /*!< Offset: 0x08  SysTick Current Value Register      */
  __I  uint32_t CALIB;                        /*!< Offset: 0x0C  SysTick Calibration Register        */
} SysTick_Type;

...
/* Memory mapping of Cortex-M3 Hardware */
#define SCS_BASE            (0xE000E000UL)                            /*!< System Control Space Base Address */
#define SysTick_BASE        (SCS_BASE +  0x0010UL)                    /*!< SysTick Base Address              */
#define SysTick             ((SysTick_Type *)       SysTick_BASE)     /*!< SysTick configuration struct      */

In your code, instead of using SYSTICKCTRL, SYSTICKLOAD, SYSTICKVAL, SYSTICKCALIB independently as you would in the mass-#define style, you can refer to the individual registers in C with SysTick->CTRL, SysTick->LOAD, SysTick->VAL, SysTick->CALIB. This results in much cleaner code.
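To make the struct-overlay idea concrete without needing hardware, here is a host-side sketch. Demo_SysTick_Type mirrors the layout of the SysTick_Type above with the __IO/__I macros spelled out, and an ordinary global stands in for the memory-mapped block. Every Demo_/DEMO_/demo_ name is made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Same member order as SysTick_Type, qualifiers written out by hand */
typedef struct {
    volatile uint32_t CTRL;           /* offset 0x00 */
    volatile uint32_t LOAD;           /* offset 0x04 */
    volatile uint32_t VAL;            /* offset 0x08 */
    volatile const uint32_t CALIB;    /* offset 0x0C, read-only */
} Demo_SysTick_Type;

/* Stand-in for the hardware registers. On the target this would be a
 * cast of the peripheral base address instead of a real variable. */
Demo_SysTick_Type demo_systick_regs;
#define DEMO_SYSTICK (&demo_systick_regs)

/* Program the reload value, clear the counter, then set CTRL bits
 * 2..0 (core clock source, interrupt enable, counter enable). */
void demo_systick_start(uint32_t reload)
{
    DEMO_SYSTICK->LOAD = reload;
    DEMO_SYSTICK->VAL  = 0;
    DEMO_SYSTICK->CTRL = (1UL << 2) | (1UL << 1) | (1UL << 0);
}
```

The offsetof() macro is a handy way to double check that a struct like this lines up with the offsets in the manual, e.g. offsetof(Demo_SysTick_Type, CALIB) should be 0x0C.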

Step 3: Create a Linker Script

The next step is to create a linker script, lpc1313.dld, defining where the interrupt vectors, code, data, and bss sections will go, as well as the entry point and the top of the stack.

ENTRY(Reset_Handler)

/* Memory configuration for LPC1313FBD48 */

MEMORY
{
    flash   :   ORIGIN = 0x00000000, LENGTH = 32k
    sram    :   ORIGIN = 0x10000000, LENGTH = 8k
}

_end_stack = 0x10001FFC;

SECTIONS {
    . = ORIGIN(flash);

    vectors :
    {
        *(.vectors)
    } >flash

    .text :
    {
        *(.text)
        *(.rodata)
        *(.rodata*)
        _end_text = .;
    } >flash

    .data :
    {
        _start_data = .;
        *(.data)
        _end_data = .;
    } >sram AT >flash

    . = ALIGN(4);

    .bss :
    {
        _start_bss = .;
        *(.bss)
        _end_bss = .;
    } >sram

    . = ALIGN(4);

    _start_stack = .;
}

_end = .;
PROVIDE(end = .);

This linker script is pretty general and ordinary. The SRAM origin, SRAM size, Flash origin, Flash size, and top of stack probably will need to be configured for your particular device.
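When adapting the top of stack to another part, the arithmetic is just SRAM origin plus SRAM length. The script above puts it one word below the end of SRAM (0x10001FFC). Since the Cortex-M stack is full-descending (the stack pointer is decremented before each push), the very end of SRAM should work as an initial value too, and the ARM procedure call standard expects the stack pointer to be 8-byte aligned at function calls. A minimal sketch of that calculation, with stack_top() being a hypothetical helper name:

```c
#include <stdint.h>

/* One-past-the-end address of a RAM region, aligned down to 8 bytes.
 * Hypothetical helper, only meant to show the arithmetic. */
uint32_t stack_top(uint32_t ram_origin, uint32_t ram_length)
{
    return (ram_origin + ram_length) & ~(uint32_t)7;
}
```

For the LPC1313 memory map above (origin 0x10000000, length 8k) this works out to 0x10002000.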

The line >sram AT >flash is essential in declaring the data section. This tells the linker that the initialized values of the data section will be stored in Flash, but addressed as if they were in SRAM. It also means that you are responsible for copying these initialized values from Flash to SRAM at some point before your program uses them.

Step 4: Write the Startup Code

The startup code of a Cortex M0/0+/3 microcontroller is responsible for defining the interrupt vectors and the initial stack pointer, copying the data section from flash to RAM, and clearing the bss section in RAM. This can all be done in C (no assembly) for the Cortex M0/0+/3.

The list of Cortex M0/0+/3 core interrupt vectors is available in ARM’s technical reference manual for the core (e.g. Cortex-M3 Technical Reference Manual, pg. 98), but will also be listed in your chip’s user manual under the Cortex core section. The vendor/device specific interrupt vectors are described in your chip’s user manual under the NVIC interrupt sources section.

The first thing we’ll do in C is declare the interrupt handler prototypes with attributes weak and an alias to a single dummy handler. The weak attribute lets these functions get overridden if declared anywhere else, such as your main program.

Next we’ll define the vector table, and give it the compiler attribute to assign it to the .vectors section of our linker script. The first value of the vector table is actually the top of the stack pointer, so we’ll pull this variable in from our linker script (notice the extern definitions at the top for linker script variables).

Next we have a Reset_Handler() function. This is the entry point for the whole system, and the second entry in the interrupt vector table. The reset handler will have a short loop to copy the initialized values of the data section stored in Flash to their corresponding region in SRAM (these are placed immediately after the .text section, which is why we start reading from memory location _end_text), and a short loop to clear the BSS section in memory. Finally, we call main() to start our actual program.

The dummy handler is just an infinite while loop.

This C code is essentially the equivalent of the assembly bootstrap required for ARM7TDMI core microcontrollers.

/* startup.c */

#include <stdint.h>

/* Addresses pulled in from the linker script */
extern uint32_t _end_stack;
extern uint32_t _end_text;
extern uint32_t _start_data;
extern uint32_t _end_data;
extern uint32_t _start_bss;
extern uint32_t _end_bss;

/* Application main() called in reset handler */
extern int main(void);

#define WEAK_ALIAS(x) __attribute__ ((weak, alias(#x)))

/* Cortex M3 core interrupt handlers */
void Reset_Handler(void);
void NMI_Handler(void) WEAK_ALIAS(Dummy_Handler);
void HardFault_Handler(void) WEAK_ALIAS(Dummy_Handler);
void MemManage_Handler(void) WEAK_ALIAS(Dummy_Handler);
void BusFault_Handler(void) WEAK_ALIAS(Dummy_Handler);
void UsageFault_Handler(void) WEAK_ALIAS(Dummy_Handler);
void SVC_Handler(void) WEAK_ALIAS(Dummy_Handler);
void DebugMon_Handler(void) WEAK_ALIAS(Dummy_Handler);
void PendSV_Handler(void) WEAK_ALIAS(Dummy_Handler);
void SysTick_Handler(void) WEAK_ALIAS(Dummy_Handler);

/* LPC13xx specific interrupt handlers */
void WAKEUP_Handler(void) WEAK_ALIAS(Dummy_Handler);
void I2C_Handler(void) WEAK_ALIAS(Dummy_Handler);
void TIMER_16_0_Handler(void) WEAK_ALIAS(Dummy_Handler);
void TIMER_16_1_Handler(void) WEAK_ALIAS(Dummy_Handler);
void TIMER_32_0_Handler(void) WEAK_ALIAS(Dummy_Handler);
void TIMER_32_1_Handler(void) WEAK_ALIAS(Dummy_Handler);
void SSP_Handler(void) WEAK_ALIAS(Dummy_Handler);
void UART_Handler(void) WEAK_ALIAS(Dummy_Handler);
void USB_IRQ_Handler(void) WEAK_ALIAS(Dummy_Handler);
void USB_FIQ_Handler(void) WEAK_ALIAS(Dummy_Handler);
void ADC_Handler(void) WEAK_ALIAS(Dummy_Handler);
void WDT_Handler(void) WEAK_ALIAS(Dummy_Handler);
void BOD_Handler(void) WEAK_ALIAS(Dummy_Handler);
void EINT3_Handler(void) WEAK_ALIAS(Dummy_Handler);
void EINT2_Handler(void) WEAK_ALIAS(Dummy_Handler);
void EINT1_Handler(void) WEAK_ALIAS(Dummy_Handler);
void EINT0_Handler(void) WEAK_ALIAS(Dummy_Handler);

void Dummy_Handler(void);

/* Stack top and vector handler table */
void *vector_table[] __attribute__ ((section(".vectors"))) = {
    &_end_stack,
    Reset_Handler,
    NMI_Handler,
    HardFault_Handler,
    MemManage_Handler,
    BusFault_Handler,
    UsageFault_Handler,
    0,
    0,
    0,
    0,
    SVC_Handler,
    DebugMon_Handler,
    0,
    PendSV_Handler,
    SysTick_Handler,

    /* LPC13xx specific interrupt vectors */
    WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler,
    WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler,
    WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler,
    WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler,
    WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler,
    WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler,
    WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler,
    WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler,
    WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler,
    WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler, WAKEUP_Handler,
    I2C_Handler,
    TIMER_16_0_Handler,
    TIMER_16_1_Handler,
    TIMER_32_0_Handler,
    TIMER_32_1_Handler,
    SSP_Handler,
    UART_Handler,
    USB_IRQ_Handler,
    USB_FIQ_Handler,
    ADC_Handler,
    WDT_Handler,
    BOD_Handler,
    EINT3_Handler,
    EINT2_Handler,
    EINT1_Handler,
    EINT0_Handler,
};

void Reset_Handler(void) {
    uint8_t *src, *dst;

    /* Copy with byte pointers to obviate unaligned access problems */

    /* Copy data section from Flash to RAM */
    src = (uint8_t *)&_end_text;
    dst = (uint8_t *)&_start_data;
    while (dst < (uint8_t *)&_end_data)
        *dst++ = *src++;

    /* Clear the bss section */
    dst = (uint8_t *)&_start_bss;
    while (dst < (uint8_t *)&_end_bss)
        *dst++ = 0;

    main();
}

void Dummy_Handler(void) {
    while (1)
        ;
}

In our sample program, we initialize the system clock with a call to the CMSIS SystemInit(), and then we use the CMSIS SysTick_Config() function to set up and enable the SysTick interrupt, passing the number of core clock ticks between each SysTick interrupt as the argument (in our case, the equivalent of 1 millisecond). We override the default dummy SysTick_Handler() in startup.c by defining one of our own, whose sole purpose is to increment a global variable msTicks on every tick. The delay_ms() function uses msTicks to determine when “ms” number of milliseconds have passed. The main routine also configures 4 GPIOs as outputs, and enters an infinite loop of turning on the LEDs, delaying, turning off the LEDs, delaying; in other words, blinking the LEDs.

/* main.c */
#include <stdint.h>

#include "LPC13xx.h"

volatile uint32_t msTicks = 0;

void SysTick_Handler(void) {
    msTicks++;
}

void delay_ms(uint32_t ms) {
    uint32_t now = msTicks;
    while ((msTicks-now) < ms)
        ;
}

int main(void) {
    SystemInit();
    SysTick_Config(SystemCoreClock/1000);

    LPC_GPIO3->DIR = (1<<3)|(1<<2)|(1<<1)|(1<<0);

    while (1) {
        LPC_GPIO3->DATA = (1<<3)|(1<<2)|(1<<1)|(1<<0);
        delay_ms(500);
        LPC_GPIO3->DATA = 0;
        delay_ms(500);
    }
}
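One detail in delay_ms() is worth spelling out: it compares (msTicks-now) against ms rather than comparing msTicks against now+ms. With unsigned 32-bit arithmetic the subtraction wraps modulo 2^32, so the delay keeps working when msTicks rolls over (after roughly 49.7 days at 1 ms per tick). Here is a host-runnable sketch of the same idea; elapsed_ticks() and deadline_reached() are illustrative names of mine, not part of the template.

```c
#include <stdint.h>

/* Ticks elapsed between two readings of a free-running 32-bit counter.
 * Unsigned subtraction wraps modulo 2^32, so this stays correct even
 * if the counter rolled over between start and now. */
uint32_t elapsed_ticks(uint32_t start, uint32_t now)
{
    return now - start;
}

/* Non-zero once at least 'wait' ticks have passed since 'start'. */
int deadline_reached(uint32_t start, uint32_t now, uint32_t wait)
{
    return elapsed_ticks(start, now) >= wait;
}
```

For example, a start reading of 0xFFFFFFF0 and a current reading of 0x00000010 is 0x20 ticks apart, even though the current value is numerically smaller.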

We can compile, link, and produce an Intel HEX formatted binary of the entire program with:

$ make
arm-none-eabi-gcc -fno-common -mcpu=cortex-m3 -mthumb -Os  -Icore/ -Ilpc13xx/ -Wall -Wextra -Wcast-align -Wcast-qual -Wimplicit -Wpointer-arith -Wswitch -Wredundant-decls -Wreturn-type -Wshadow -Wunused -c lpc13xx/system_LPC13xx.c -o lpc13xx/system_LPC13xx.o
arm-none-eabi-gcc -fno-common -mcpu=cortex-m3 -mthumb -Os  -Icore/ -Ilpc13xx/ -Wall -Wextra -Wcast-align -Wcast-qual -Wimplicit -Wpointer-arith -Wswitch -Wredundant-decls -Wreturn-type -Wshadow -Wunused -c startup.c -o startup.o
arm-none-eabi-gcc -fno-common -mcpu=cortex-m3 -mthumb -Os  -Icore/ -Ilpc13xx/ -Wall -Wextra -Wcast-align -Wcast-qual -Wimplicit -Wpointer-arith -Wswitch -Wredundant-decls -Wreturn-type -Wshadow -Wunused -c main.c -o main.o
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -Os -nostartfiles -Wl,-Map=blink.map -Tlpc1313.dld -nostdlib lpc13xx/system_LPC13xx.o startup.o main.o -o blink.elf
arm-none-eabi-objcopy -R .stack -R .bss -O ihex blink.elf blink.hex
arm-none-eabi-objcopy -R .stack -R .bss -O binary -S blink.elf blink.bin
$

Additional Resources

Introduction/Transitioning to Cortex M3:

Technical Resources:

Why I like Cortex M3

So what’s all the hype over the Cortex M3? In the Additional Resources section above, you can find a number of introductory Cortex M3 articles, but I’ve summarized here the key points that make the Cortex M3 core worthwhile. Most of these features are from the ARM7TDMI -> Cortex M3 transition perspective, but many are also applicable for comparison with other microcontroller cores.

In a nutshell:

Deterministic Interrupt Response Latency

The Cortex M3 predecessor core, ARM7TDMI, and many other microcontroller cores take a variable number of clock cycles for the processor to finish up what it’s doing and enter the interrupt handler for the corresponding interrupt. The variable number may depend on the clock cycles required to finish the current instruction (a hardware divide typically being the worst-case scenario), to save elements of the processor state, and some other things. In the case of ARM7TDMI, this number was 24-42 cycles.

The consequence of variable interrupt latency is timing jitter in any actions by the interrupt handler. If you are using the interrupt handler, for example, to generate a clocked external interface (SPI, PCM, etc., but possibly not a good idea in the first place) that you don’t have a hardware peripheral for, a variable interrupt latency may cause unacceptable jitter on the outputs of this interface.

The Cortex M3 has a flat 12 cycle interrupt latency.
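To put these cycle counts into time, here is a tiny conversion sketch. cycles_to_ns() is a hypothetical helper, and evaluating both cores at the same 72MHz clock is purely my assumption for the sake of comparison.

```c
#include <stdint.h>

/* Duration of a number of core clock cycles in nanoseconds, rounded
 * down. Hypothetical helper for back-of-the-envelope estimates. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000ULL) / core_hz);
}
```

At 72MHz, 12 cycles is about 167 ns, and an 18 cycle spread (the width of the 24-42 cycle range above) would be 250 ns of jitter.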

Stack-based Interrupt Handling

The Cortex M3 seems to achieve this lower interrupt latency with several layers of magic, one of which is its Harvard architecture: it has separate memory interfaces to the instruction and data memory and can read/write from/to these simultaneously. But a key difference, departing from the ARM7TDMI core’s approach, is using the stack to save processor state (register file, flags, etc.) instead of switching register banks by changing processor modes for the particular interrupt. If you recall, the ARM7TDMI had six (!) processor modes: System/User, FIQ, Supervisor, Abort, IRQ, Undefined. The core would swap into one of the last five modes when handling an interrupt.

The Cortex M3 also naturally supports nested interrupts, and higher priority interrupts interrupting lower priority ones, as they are just a stack push away. Also, processing another pending interrupt after finishing a preceding one is just 6 cycles (and called “tail-chaining”).

Just two processor modes: Thread and Handler

The Cortex M3 core has cut away all of the extraneous processor modes used in the ARM7TDMI for interrupt handling and is left with just two processor modes: Thread (privileged/unprivileged) and Handler (privileged). Interrupts are handled in handler mode, normal code in thread mode. Couldn’t be simpler.

No Assembly Bootstrap Required

One of the more tedious things about using an ARM7TDMI microcontroller is the required assembly bootstrap, to get the processor ready for your actual code. I’m sure this has been a development burden to get code up and running, as well as intimidating for those not comfortable with plowing into an instruction set if it’s their first time using that core. The good news is that, by design, the Cortex M3 requires zero assembly bootstrap. Everything can be written in pure C, and it gets even better…

Conventionally, the ARM7TDMI assembly bootstrap would entail defining branch statements at the interrupt vectors for the various interrupt handlers, setting up the stack and enabling/disabling interrupts in each of the six processor modes, copying the data section from flash to RAM, and clearing the bss section in RAM. This had to be done in assembly (except for the last two steps), because the interrupt vectors required explicit single-instruction branch statements at their locations, and because the stack has to be set up before entering C code, in case the C code decides to use it.

Cortex M3 has done away with this requirement entirely. The interrupt vectors have changed from branch instructions to a simple table of interrupt handler addresses. As I mentioned before, the Cortex M3 has only two execution modes, Thread and Handler. On reset, the Cortex M3 starts in privileged Thread mode, and at the very least you need to set up the stack in this mode to be ready to execute C code. Cortex M3 goes about automatically doing this in a very clever way: the very first entry of the vector table is not an interrupt handler address at all, but actually the initial top value of the stack pointer. In other words, defining this vector table and putting it at the right place (0x00000000 in flash) is all you need to fully define the initial state of the processor to execute your actual program code. This translates to a void * table in C and a compiler attribute to slip it in the right section defined in your linker script (see the Step 4: Write the Startup Code section above). The data copying from flash to RAM and clearing bss can also be implemented easily in C.

This is the cool part: when an interrupt occurs, the Cortex M3 automatically saves the entire processor state to the stack upon entry, and restores it upon exit, and does not require any software-based clean up from you. ARM7TDMI required a fancy software epilogue to every interrupt handler, for restoring the CPSR from SPSR, and branching to some offset to the link register depending on a table (pg. 2-16 of ARM7TDMI Reference Manual) to return to the program code. However, with the Cortex M3, interrupt handlers are just normal C functions! You don’t need to figure out your compiler’s semantics on declaring an interrupt handler (is it __irq or __attribute__ ((interrupt)) or __attribute__ ((interrupt("IRQ"))) ? I never remember), because interrupt handlers for the Cortex M3 are regular functions.

SysTick Peripheral

A master tick in an embedded system is a very common application of a timer, whether it is as simple as a rudimentary timestamp, building a polling delay loop, or implementing preemptive context switching for an RTOS. Instead of wasting one of the two or three highly-configurable (PWM, event capture, etc.) timer peripherals implemented by chip vendors on such a menial task, the Cortex M3 gives you a timer integrated directly in the processor core called SysTick. This is a 24-bit timer, requires setting up only a control register and reload value register, and has a dedicated interrupt (a part of the standard Cortex M3 interrupts). The CMSIS core provides all the functions needed to set it up.
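Because the counter is only 24 bits wide, there is an upper bound on the tick period you can ask for. The sketch below works out a reload value and rejects periods that don't fit; it is a simplified host-side model in the spirit of the range check done by SysTick_Config(), and systick_reload() is a hypothetical name, not a CMSIS function.

```c
#include <stdint.h>

#define SYSTICK_MAX_RELOAD 0x00FFFFFFUL   /* 24-bit counter */

/* Computes the reload value for a tick rate of tick_hz with a core
 * clock of core_hz. Returns 0 and stores the value on success, or 1
 * if the requested period cannot be represented.
 * Hypothetical helper for illustration only. */
int systick_reload(uint32_t core_hz, uint32_t tick_hz, uint32_t *reload)
{
    uint32_t ticks;

    if (tick_hz == 0 || core_hz < tick_hz)
        return 1;

    ticks = core_hz / tick_hz;        /* core cycles per tick */
    if (ticks - 1 > SYSTICK_MAX_RELOAD)
        return 1;

    *reload = ticks - 1;              /* counter counts reload..0 */
    return 0;
}
```

At 72MHz a 1 ms tick needs a reload value of 71999, which fits comfortably. The longest period at that clock is roughly 233 ms, so a 1 second tick has to be built up in software from shorter ones.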

Memory Protection Unit (MPU)

The Cortex M3 core has an optional memory protection component called the Memory Protection Unit (MPU). Although it is technically optional, it seems like most Cortex M3 microcontrollers have it implemented. Basically, this peripheral lets you define 8 protected memory regions. Each region spans a power of 2 bytes (starting at 32 bytes), and regions of 256 bytes or more can be subdivided into 8 equal subregions. One obvious (and probably the most intended) application of this is separating the address space of processes in an RTOS, so each can have its own instruction, data, and stack regions, and restricted access to privileged hardware peripherals. During a context switch, the supervisor can switch to a different memory region corresponding to the swapped-in process. Any violation of the memory protection results in a MemManage fault, which can be dealt with in its corresponding interrupt handler. This seems completely sufficient for a small RTOS (in fact FreeRTOS has a Cortex M3 port that supports it), but is altogether pretty unique for the 8/16/32-bit embedded microcontroller market.
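The power-of-2 restriction shows up directly in how a region size is encoded: as I understand the ARMv7-M MPU, the size field N of a region means 2^(N+1) bytes, with 32 bytes (N = 4) as the minimum. A small host-side sketch of that encoding, where mpu_size_field() is a hypothetical helper and not a CMSIS function:

```c
#include <stdint.h>

/* Returns the MPU size field N such that the region spans 2^(N+1)
 * bytes, or -1 if region_bytes is below 32 or not a power of two.
 * Hypothetical helper for illustration only. */
int mpu_size_field(uint32_t region_bytes)
{
    int exponent = 0;

    if (region_bytes < 32 || (region_bytes & (region_bytes - 1)) != 0)
        return -1;

    while ((region_bytes >> exponent) > 1)
        exponent++;                   /* exponent = log2(region_bytes) */

    return exponent - 1;
}
```

So a 1 KB region encodes as 9, while something like 48 bytes is rejected and has to be rounded up to 64 bytes.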

Bit-banding

Unlike its predecessor, the Cortex M3 has bit-addressable memory. The Cortex M3 maps a 1 MB SRAM or peripheral register “bit-band” region into a 32 MB “alias” memory region. Each consecutive 32-bit word in the “alias” memory region refers to each consecutive bit in the “bit-band” region (which explains the size relationship: every bit becomes a 32-bit word, so 1 MB <-> 32 MB). This is kind of a hack, as the upper 31 bits of each word in the “alias” region are completely ignored while bit 0 is the mapped bit. But these bit accesses are atomic, which can be better than the read-modify-write approach in some applications (semaphores).
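The mapping from a bit to its alias word is simple arithmetic: alias base + (byte offset x 32) + (bit number x 4). Here is a host-side sketch using the standard Cortex M3 region addresses (SRAM bit-band at 0x20000000 with alias at 0x22000000, peripherals at 0x40000000 with alias at 0x42000000). bitband_alias() is a hypothetical helper of mine, and whether a particular chip actually places its SRAM or peripherals inside these regions is something to check in its manual.

```c
#include <stdint.h>

#define SRAM_BB_BASE    0x20000000UL
#define SRAM_BB_ALIAS   0x22000000UL
#define PERIPH_BB_BASE  0x40000000UL
#define PERIPH_BB_ALIAS 0x42000000UL
#define BB_REGION_SIZE  0x00100000UL  /* each bit-band region is 1 MB */

/* Returns the alias word address for bit 'bit' at address 'addr', or 0
 * if the address lies outside both bit-band regions.
 * Hypothetical helper for illustration only. */
uint32_t bitband_alias(uint32_t addr, uint32_t bit)
{
    if (bit > 31)
        return 0;

    if (addr >= SRAM_BB_BASE && addr - SRAM_BB_BASE < BB_REGION_SIZE)
        return (uint32_t)(SRAM_BB_ALIAS + (addr - SRAM_BB_BASE) * 32UL + bit * 4UL);

    if (addr >= PERIPH_BB_BASE && addr - PERIPH_BB_BASE < BB_REGION_SIZE)
        return (uint32_t)(PERIPH_BB_ALIAS + (addr - PERIPH_BB_BASE) * 32UL + bit * 4UL);

    return 0;
}
```

On the target, writing 1 or 0 to the returned address (through a volatile uint32_t pointer) sets or clears just that bit. Note that an address like 0x10000000, where the linker script above places the LPC1313 SRAM, falls outside the SRAM bit-band region, so the helper returns 0 for it.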

Unaligned Data Access

The Cortex M3 supports unaligned data memory access, at the expense of some clock cycles (unaligned accesses have to be internally converted into aligned accesses). This means that you can squeeze the most out of your data memory, if this memory is a top concern.
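One place where this pays off is packed structures. The sketch below uses the GCC packed attribute (an assumption about your compiler, though it is the same attribute syntax used in startup.c) to drop the padding between a byte and a 32-bit field; the struct and function names are made up for illustration.

```c
#include <stdint.h>

/* Natural layout: the compiler pads 'tag' so 'value' is 4-byte aligned */
struct sample_padded {
    uint8_t  tag;
    uint32_t value;
};

/* Packed layout: no padding, so 'value' sits at an unaligned offset */
struct sample_packed {
    uint8_t  tag;
    uint32_t value;
} __attribute__((packed));

/* Reads the unaligned member. The compiler knows the struct is packed
 * and emits whatever access sequence is legal for the target core. */
uint32_t read_packed_value(const struct sample_packed *s)
{
    return s->value;
}
```

With a typical ABI the padded struct takes 8 bytes and the packed one 5. Keep in mind that this applies to the Cortex M3: to my knowledge the Cortex M0/0+ cores fault on unaligned accesses, so on those the compiler has to fall back to byte-wise accesses for packed members.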

Single Instruction Set: Thumb-2

ARM7TDMI featured two instruction sets: 32-bit ARM and 16-bit Thumb. I think the general idea was that the 32-bit ARM instruction set would give you full leverage of 32-bit processing power when you needed it, and the 16-bit Thumb instruction set would allow for high speed and high code density. The complexity of ARM/Thumb interworking and deciding what code should be ARM or Thumb means that making the most out of both of these instruction sets can be a non-trivial optimization problem.

The Cortex M3 simplifies things in this department by using only one instruction set, Thumb-2, which is a superset of Thumb (meaning just about all existing Thumb code will run without problems). Some of the instructions that extend Thumb to make up the Thumb-2 instruction set can have 32-bit opcodes, so this instruction set actually has a mix of 16-bit/32-bit opcodes. According to the shiny bar charts ARM puts out, the Thumb-2 instruction set apparently does much better than ARM and Thumb in both performance and code size, so it’s a win-win.

Miscellaneous features: Single-cycle multiply, hardware divide, easy sleeping, Serial Wire Debug

Cortex M3 / Thumb-2 also adds a few hardware goodies like single-cycle 32-bit by 32-bit multiply and two cycle multiply accumulate instructions (useful for signal processing), and hardware divide instructions (2-12 cycles). In addition, Thumb-2 makes sleeping so easy you don’t have an excuse not to use it: the single instruction WFI - wait for interrupt, puts the core into sleep mode by disabling the system clock to the core until an interrupt occurs. If you have an interrupt-driven system, you can throw this in your idle while loop, slap on a Green Energy sticker and call it a day. (Not really).

Additional options related to this functionality are a sleep-on-exit mode to automatically enter sleep mode after finishing executing an interrupt handler, and a “deep sleep” mode that turns off the system clock, PLL, and flash upon entering sleep. These are configurable in the System Control Register (SCR). The LPC13xx also clock-gates just about every peripheral it can, so you can disable the clocks to peripherals connected to the AHB bus completely at runtime by configuring its SYSAHBCLKCTRL register.
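For reference, the two sleep options are single bits in the SCR: SLEEPONEXIT is bit 1 and SLEEPDEEP is bit 2. The host-side sketch below just computes the new register value from the old one; scr_configure() and the SCR_ macro names are hypothetical (the CMSIS core headers provide their own mask definitions).

```c
#include <stdint.h>

#define SCR_SLEEPONEXIT (1UL << 1)    /* sleep again after leaving a handler */
#define SCR_SLEEPDEEP   (1UL << 2)    /* use deep sleep as the sleep mode    */

/* Returns 'scr' with the two sleep option bits set or cleared as
 * requested, leaving all other bits untouched.
 * Hypothetical helper for illustration only. */
uint32_t scr_configure(uint32_t scr, int sleep_on_exit, int deep_sleep)
{
    scr &= ~(uint32_t)(SCR_SLEEPONEXIT | SCR_SLEEPDEEP);
    if (sleep_on_exit)
        scr |= SCR_SLEEPONEXIT;
    if (deep_sleep)
        scr |= SCR_SLEEPDEEP;
    return scr;
}
```

On the target, the result would be written back with something like SCB->SCR = scr_configure(SCB->SCR, 1, 0); followed by __WFI() in the idle loop.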

The Cortex M3 also introduces a new debug interface alongside JTAG intended for low-pin count devices called Serial Wire Debug (SWD). SWD is a 2-wire interface, which could make throwing a debug port on small systems very practical. It is essentially a lower pin-count serial frontend to JTAG.

Cortex M3 Microcontrollers

The current microcontroller implementations of Cortex M3 look very nifty. The main vendors of Cortex M3 microcontrollers seem to be: Atmel, [Luminary Micro (owned by TI)](http://www.luminarymicro.com/), NXP, STMicroelectronics.

It’s pretty much standard for all of these ARM microcontrollers (including ARM7TDMI) to include a small bootloader in ROM that runs when a certain pin is low during reset, to allow flash programming through the UART (and there is free software on the host side to facilitate this, my favorite is FlashMagic, which works great with wine in Linux). So, flashing an ARM microcontroller is actually much easier than flashing an AVR/PIC, because you don’t need a special dongle—you can just use a serial port (or more likely a USB <-> UART adapter). This is convenient, but in terms of speed, JTAG will be faster with the right dongle.

Another hardware convenience that seemed to start with the later ARM7TDMI microcontrollers is only needing one, typically 3.3V, supply voltage for the entire chip. Some ARM7TDMI microcontrollers required a separate 1.8V supply for the core, and a 3.3V supply for the I/O, but these days it appears that all of the vendors are integrating an on-board voltage regulator for the core. This means that powering a Cortex M0/0+/3 microcontroller is as easy as powering any other 3.3V 8-bit/16-bit microcontroller.

One truly interesting hardware feature available from some of the vendors above is an internal high frequency (12MHz) RC oscillator. Atmel, NXP, and STMicro are all offering chips with these. This means you don’t even need a crystal, and you can connect the output of these internal oscillators to the internal PLL for 48/72/etc MHz. You can get away with running a single LQFP48 Cortex M3 microcontroller with only a 3.3V supply, a couple decoupling caps, and a jumper to pull down the bootloader pin for flashing.

Finally, the price. Surprisingly, some of these Cortex M3 microcontrollers cost as little as $2.50 qty 1. The high speed, many peripheral, high pin count ones can run up to $12.

Here is a laundry list of specifications for the LPC1313 I’m using in one of my projects. This is pretty much the most low-end LQFP-packaged Cortex M3 microcontroller by NXP:

The LPC1343 is essentially the same thing (and same package), but with an integrated USB controller. It runs $4.51 qty. 1. A neat thing NXP has been doing on some of their more recent microcontrollers with USB controllers is including an entire USB software stack in memory-mapped ROM on the device. So instead of needing to implement this stack yourself (and use up flash memory in the process), you can literally just call functions that already reside in the on-chip ROM.

I like LQFP48 because it seems like the last big pin-count package that is still comfortable to solder. LQFP64 is doable, but it makes me nervous. But, I know there are also people out there that have no issues using stencils, solder paste, and a toaster oven, or a more legitimate setup, and can solder LQFP64 and the QFN packages without issues.

AVR/PIC/MSP430?

So I’ve only been talking about the Cortex M3 for this entire page, but I should qualify its role with saying that the popular 8-bit/16-bit microcontrollers like Atmel AVR/Microchip PIC/TI MSP430 still have their place. For small footprint, extremely low power, and very low cost, they can’t be beat yet. But, if you could use the clock speed and the 32-bit performance, I think a Cortex M3 microcontroller is the way to go.

Creative Commons License