View previous topic :: View next topic |
Author |
Message |
asmboy
Joined: 20 Nov 2007 Posts: 2128 Location: albany ny
|
times ten in fewest CPU cycles? |
Posted: Tue Nov 12, 2013 3:58 pm |
|
|
I need to do the fastest possible multiplication by ten for unsigned 16 bit INTS with a value range of 0-4095 ( output of 12bit ADC) and make process decisions as rapidly as i can after getting a new reading.
over the range of input values ( 0-4095) , code for *10 can profile out to
a few hundred cpu cycles. so i created a dedicated function for this that does what i need deterministically, irregardless of input ,
in a constant 25 instructions.
every cycle counts in this application and
what i'm not sure of is if it can be done more efficiently yet?
Anybody?
Code: |
unsigned int16 tenx(unsigned int16 in){
unsigned int16 a;
a=in<<3; // *8
a +=in; // *9
return ((a +in)); //*10
}
|
|
|
|
temtronic
Joined: 01 Jul 2010 Posts: 9282 Location: Greensville,Ontario
|
|
Posted: Tue Nov 12, 2013 4:18 pm |
|
|
While slowly reading your request, I was thinking shift and add twice..hmmm..scolled down to your function and it's the same !!
I don't think there is a faster way to do it
So great minds do think alike !!
hth
jay |
|
|
dmendesf
Joined: 31 Dec 2005 Posts: 32
|
|
Posted: Tue Nov 12, 2013 7:21 pm |
|
|
The fastest way I can think is to trade space for speed:
Code: |
const int16 tablemult[4096] = {0,10,20,30,40, ...... 40950};
a = tablemult[b]; // a = b*10;
|
|
|
|
dmendesf
Joined: 31 Dec 2005 Posts: 32
|
|
Posted: Tue Nov 12, 2013 7:48 pm |
|
|
You can also trade less space for less speed using this:
k = a*256+b // b from 0 to 255, a from 0 to 15
10 * k = 10*a*256 + 10*b
Code: |
union data_holder {
int8 data_byte[2];
int16 data_long;
} data_in, data_out;
const int16 tablemult[256] = {0, 10, 20, 30 ... 2550};
data_out.long = tablemult[data_in.data_byte[0]]; // = 10*b
data_out.data_byte[2] += tablemult[data_in.data_byte[1]]; // = 10*b+10*a*256... data_out.data_byte[2] started with at most (2550 / 256) = 9.
//since data_in.data_byte[1] is at most 15, the result is at most 159, and fits a byte
|
result is at data_out.data_long |
|
|
asmboy
Joined: 20 Nov 2007 Posts: 2128 Location: albany ny
|
|
Posted: Tue Nov 12, 2013 8:30 pm |
|
|
nice ideas above.
1-having less than fastest speed is a nogo,
2- the pic16 chip in question (selected for its unique hardware functions)
only has 8k worth of 'word space'
and a table of 4096 'words' of int16 data
leaves appx zero code space for the rest of the program.
the table lookup would surely be the bomb if i was using an 18f46k80,
which , i am not. |
|
|
dmendesf
Joined: 31 Dec 2005 Posts: 32
|
|
Posted: Tue Nov 12, 2013 8:43 pm |
|
|
I think that the second option I posted can be faster than your approach and it needs "only" 512 words of memory. You should benchmark it and also a small variation using make8 to acess 16 bit variables... I donĀ“t know if CCS compiler is smart enough to do all pointer arithmetic in compile time for the union... maybe the make8 is the key for it. |
|
|
alan
Joined: 12 Nov 2012 Posts: 357 Location: South Africa
|
|
Posted: Tue Nov 12, 2013 11:34 pm |
|
|
Hi asmboy
if you put all in the return with another shift, it shaves some instructions off on my 4.141.
Code: | return (((in<<3) +(in<<1))); |
Regards
Alan |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19605
|
|
Posted: Wed Nov 13, 2013 12:04 am |
|
|
This is one of those cases where using a function is less efficient...
The compiler knows that it can perform *8, and *2, using rotations. The optimiser handles this, so:
Code: |
#define tentimes(x) ((x*8)+(x*2))
void main()
{
int16 val=something;
int16 result;
result=tentimes(val);
}
|
Avoids the need to transfer values to temporary variables, and is quicker than the function solutions. In fact I think it may even beat the table lookup. It takes a sometimes surprising number of cycles to access a table (either ROM or RAM) on the PIC.
As a comment, I remember one of the CCS examples does this.
Just took the time to compare the _total_ instruction count for 'tenx' (original version), 'tenx' (as coded by alan), 'tentimes', and a table lookup. This is the time including the calls, setup etc., on a PIC16.
tenx, uses 41 instructions including the setup.
tenx (alan), uses 38 including setup.
tablemult, uses 30.
tentimes as a #define, uses 24.
The count given by asmboy in his original remark, is ignoring the time to call the function, and the time to load the temporary variables.
I 'rest my case'.....
Best Wishes |
|
|
ckielstra
Joined: 18 Mar 2004 Posts: 3680 Location: The Netherlands
|
|
Posted: Wed Nov 13, 2013 5:07 am |
|
|
Instead of using a macro I tried to #inline the function because I like the parameter type checking from functions. Too bad it seems like #inline isn't working in my v4.147.
We now only had a look at just one line of your algorithm. Multiplying a sampled value by 10 'smells' like there are more optimizations possible given a look at the total algorithm. |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19605
|
|
Posted: Wed Nov 13, 2013 5:51 am |
|
|
4.147?....
I thought 4.141 was the last.
Anyway on a version where it does work, it still adds the extra instructions to do an extra move to an extra variable.
If one is prepared to use two lines, then the most efficient I can think of with the compiler, is:
Code: |
result=val*2;
result+=result*4;
|
Gets it down to 19 instructions.
Best Wishes |
|
|
jeremiah
Joined: 20 Jul 2010 Posts: 1362
|
|
Posted: Wed Nov 13, 2013 8:08 am |
|
|
I checked on PCD 5.013 and inlining appears to work. It's still not as efficient as the macro. I found that making the input a reference parameter shaves off the unnecessary moves, but there is still a single "extra" instruction that I assume comes from their code generation algorithm (or maybe it does have a purpose...I didn't go too deeply into it)
Code: |
.................... #define tentimes(x) ((x*8)+(x*2))
....................
.................... #inline unsigned int16 anotherTenTimes(unsigned int16 x){
.................... return ((x*8)+(x*2));
*
39EA: MOV 8AC,W5
39EC: SL W5,#3,W5
39EE: MOV 8AC,W0
39F0: SL W0,#1,W0
39F2: ADD W0,W5,W0
39F4: MOV W0,0
.................... }
....................
.................... #inline unsigned int16 yetAnotherTenTimes(unsigned int16 &x){
.................... return ((x*8)+(x*2));
*
39F8: MOV 8AA,W5
39FA: SL W5,#3,W5
39FC: MOV 8AA,W0
39FE: SL W0,#1,W0
3A00: ADD W0,W5,W0
3A02: MOV W0,0
.................... }
....................
|
Code: |
.................... test = tentimes(test);
39DA: MOV 8AA,W5
39DC: SL W5,#3,W5
39DE: MOV 8AA,W0
39E0: SL W0,#1,W0
39E2: ADD W0,W5,W0
39E4: MOV W0,8AA
.................... test = anotherTenTimes(test);
39E6: PUSH 8AA
39E8: POP 8AC
*
39F6: MOV W0,8AA
.................... test = yetAnotherTenTimes(test);
*
3A04: MOV W0,8AA
|
Though, in PCD a mult instruction only takes one cycle, so this is just for academic purposes |
|
|
bkamen
Joined: 07 Jan 2004 Posts: 1615 Location: Central Illinois, USA
|
|
Posted: Wed Nov 13, 2013 9:07 am |
|
|
jeremiah wrote: |
Though, in PCD a mult instruction only takes one cycle, so this is just for academic purposes |
PIC18's are surprisingly fast too having their 8x8 dedicated hardware multiplier.
(although it's not a single instruction) _________________ Dazed and confused? I don't think so. Just "plain lost" will do. :D |
|
|
asmboy
Joined: 20 Nov 2007 Posts: 2128 Location: albany ny
|
|
Posted: Wed Nov 13, 2013 9:42 am |
|
|
many thanks to all who contributed.
i have to use the 16f1509 for its NCO feature , found in none of the 18f family
and also have to live with a reduced 10 bit ADC and needing to use external EEPROM as well.
how microchip can release a chip with so many unique and desirable functions , yet omit EEPROM is a mystery and a frustration to this fellow.
the macro define solution is the icing on the cake!
special thanks to Mr. T for your singularly focused analysis. |
|
|
temtronic
Joined: 01 Jul 2010 Posts: 9282 Location: Greensville,Ontario
|
|
Posted: Wed Nov 13, 2013 12:13 pm |
|
|
It could be that a customer wanted a special PIC with those specific features and low,low cost ,thus the '16F1509' was born.
It wouldn't be the first time a 'new product' was added to a manufacturer's line. Heck I did that decades ago with energy control products.
These days it's super easy to 'make-my-PIC', all you need is tell uChip what you need and some cash.
hth
jay |
|
|
|