
Limits of multithreading

From time to time users approach us with questions about multithreading, such as:

  1. Why does my formula not run 32 times faster on a 16-core / 32-thread computer?
  2. Will a 16-core processor be twice as fast as an 8-core one?
  3. Why does my CPU not show 100% usage?

All of these questions stem from a lack of understanding of multithreading and of the laws that govern computing in general.

In this article we will try to address some of those misunderstandings and misconceptions. We assume that the reader has already read “Efficient use of multithreading” in the AmiBroker Users’ Guide and is fully aware of how work is distributed across many threads in the Analysis window. We also assume that the reader has already read the “Performance tuning” chapter of the guide. These two parts of the manual explain fundamental concepts and are essential to understanding what is written below.

Another piece of fundamental reading is the Wikipedia article on Amdahl’s law, which explains the theoretical speedup limit of any multi-threaded program. In short, Amdahl’s law says that if 95% of your program runs in multiple threads and only 5% of it is serial (single-threaded), the maximum achievable speedup is 20x (20 times), regardless of how many CPUs and cores you have.
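To see where the 20x figure comes from, here is a minimal sketch of Amdahl’s formula. Python is used purely for illustration (it is not part of AmiBroker), and the function names are our own.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Theoretical speedup when parallel_fraction of the work runs on `threads` threads."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


def amdahl_limit(parallel_fraction):
    """Upper bound on the speedup as the thread count grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    for threads in (1, 2, 4, 8, 16, 32):
        print(threads, round(amdahl_speedup(0.95, threads), 2))
    print("limit", round(amdahl_limit(0.95), 2))
```

For a program that is 95% parallel, the formula gives about 9.1x on 16 threads and about 12.5x on 32 threads, so doubling the thread count does not come close to doubling the speed.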

Let us focus on Analysis window performance (Exploration/Scan/Backtesting/Optimization). Any operation in the Analysis window involves:

  1. preparing data (reading data from the database, compressing it to the selected interval, filtering, padding, etc.)
  2. setting up the AFL engine for execution (setting up built-in arrays and stops, parsing your formula)
  3. executing your formula (in a backtest, for example, this is the first phase of the backtest run, done on every symbol)
  4. per-symbol processing of the output of your formula (in a backtest, this is sorting signals by position score)
  5. post-processing (in a portfolio backtest, for example, this is the portfolio backtest phase that is done once per backtest, NOT for every symbol)

AmiBroker is a highly parallel, multithreaded application, so most of the steps are done in multiple threads. Specifically, only the first and the last steps (1 and 5) are serial; the rest (2, 3 and 4) are parallel. It is worth noting that steps 1-4 are done for every symbol, while step 5 is done only once for all symbols. In addition, the program spends some time handling the UI (updating controls such as the progress bar and reacting to your mouse and keyboard input), which is of course done in the single (main) UI thread.

[Image: Multithreading 1]

There is one exception, a special case: individual optimization. In individual optimization step 1 is done only once (for one symbol), and all the other steps 2-5 (including the last one) are done in multiple threads.

[Image: Multithreading 2]

This is where Amdahl’s law kicks in. By adding threads, cores or processors you can only shorten the parallel parts (2..4 or 2..5), and ultimately you are limited by the speed of data access. You can’t backtest faster than you can read and prepare the data.

As for data access: the database is a shared resource, no matter where it resides. If it resides on a hard disk, it is a single physical device that does not speed up as the number of CPUs increases. If it resides in RAM, it is still a single physical RAM that has limited bandwidth and fixed latency, regardless of how many processors you throw into the mix. Even if it is in the L3 (Level 3) cache on the processor, it is still a single L3 cache shared by multiple cores. It is also worth noting that the L3 cache, even on the most modern processors, operates at half the speed of the core, so a single core can actually saturate the bandwidth of the L3 cache if it does nothing but read or write large chunks of data. In many cases this means that the processor must wait for memory, unless it is doing complex computations involving only a minimal amount of data. Here, for example, are real-world measurements for the triple-channel RAM controller on an Intel i7 920 CPU (measured using the memtest86 program):

Data location   Bandwidth [MB/sec]
L1              52408
L2              30722
L3              24521
RAM             11879

Only the L1 cache runs at full core speed. As you can see, the L3 cache has roughly half the bandwidth of the L1 cache, and RAM has roughly a quarter of it. Of course disk speeds (even SSD) are a far cry from the 11 GB/sec offered by RAM.
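The ratios quoted above can be checked directly from the measured figures. The following Python sketch (illustration only, with names of our own choosing) expresses each level’s bandwidth as a fraction of the L1 figure.

```python
# Bandwidth figures in MB/sec, copied from the table above.
BANDWIDTH_MB_PER_SEC = {"L1": 52408, "L2": 30722, "L3": 24521, "RAM": 11879}


def relative_to_l1(bandwidths):
    """Return each level's bandwidth as a fraction of the L1 figure."""
    l1 = bandwidths["L1"]
    return {level: value / l1 for level, value in bandwidths.items()}


ratios = relative_to_l1(BANDWIDTH_MB_PER_SEC)
for level, ratio in ratios.items():
    print(f"{level}: {ratio:.2f} of L1 bandwidth")
```

On these numbers L3 comes out at about 0.47 and RAM at about 0.23 of the L1 bandwidth.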

In the case of a portfolio backtest, the final phase (portfolio backtesting) is done once per backtest for all symbols, so naturally it runs in a single thread (as opposed to the first phase, which is done for every symbol in parallel).

Knowing all this, you may wonder how to use that knowledge in practice.

For example, it lets you understand the limits of achievable speed gains for a given formula, plan your hardware purchases, or find ways to improve run times.

As we have learned, the only parts that can be sped up by adding more cores are those that run in parallel (in multiple threads). In practice this means your AFL formula code. What is more, the more time is spent in the parallel part, the better the formula scales across multiple cores. This means that simple formulas DO NOT scale very well, because they are too simple to put enough strain on the CPU and are mainly memory (data access) bound. All your simple moving average crossovers are just too simple to keep the CPU busy for long, especially when there is not much data to process.

Let us take this trivial formula for example:

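period = Optimize( "period", 10, 2, 102, 0.2 ); // step 0.2 is assumed: it gives about 500 optimization steps, matching the output below
Buy = Cross( C, MA( C, period ) );
Sell = Cross( MA( C, period ), C );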

and run Optimize->Individual Optimize on a symbol that has 2000 quotes.

Now switch to the “Info” tab in the Analysis window and you will see output like this (this example comes from a 4-core / 8-thread Intel i7; all times are in seconds):

Individual optimize started.
Completed in 0.49 seconds. Number of rows: 500
( Timings: data: 0.11, setup: 0.00, afl: 0.28, job: 2.97, lock: 0.00,
pbt: 0.00, UI thread: 0.11, worker threads: 3.26/3.26 )

So our 500-step optimization on 2000 quotes took less than half a second. The output contains some cryptic numbers, and you may wonder what they mean. Here is the explanation (for backtest/optimization):

a) data – time spent accessing/preparing the data
b) setup – time spent preparing AFL engine
c) afl – time spent executing your formula (first phase of backtest)
d) job – post processing (here signals are collected and trading simulation is performed in case of individual optimize)
e) lock – time spent waiting in critical section / lock accessing shared signal table
f) pbt – portfolio backtesting code (not used in individual optimization)
g) UI thread – time spent in UI thread in total (data + pbt + UI handling) – single threaded time
h) worker threads – time spent in worker (parallel) threads (setup+afl+job+lock) – multi-threaded time
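To make the breakdown above concrete, here is a small Python sketch (illustration only, not part of AmiBroker) that splits a Timings line into its fields and adds up the worker-thread components. The function name and the regular expression are our own assumptions about how such a line could be parsed.

```python
import re

# Sample "Info" tab output, copied from the 8-thread run shown above.
INFO_LINE = (
    "( Timings: data: 0.11, setup: 0.00, afl: 0.28, job: 2.97, lock: 0.00, "
    "pbt: 0.00, UI thread: 0.11, worker threads: 3.26/3.26 )"
)


def parse_timings(text):
    """Extract 'name: seconds' pairs from a Timings line into a dict."""
    pairs = re.findall(r"([A-Za-z ]+?):\s*([0-9.]+)", text)
    return {name.strip(): float(value) for name, value in pairs}


timings = parse_timings(INFO_LINE)

# Worker-thread time is described above as setup + afl + job + lock.
worker_sum = sum(timings[name] for name in ("setup", "afl", "job", "lock"))
print(f"components add up to {worker_sum:.2f}s, reported: {timings['worker threads']:.2f}s")
```

The components add up to 3.25 seconds against the reported 3.26, presumably because each component is rounded to two decimal places before being printed.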

At first it may look surprising that the “worker threads” time is 3.26, which is far longer than the entire optimization took (0.49 seconds). But this time is the SUM of the times spent in all 8 threads, which ran in parallel. Each ran for 3.26/8 = 0.4075 seconds, and with only one thread running it would take 3.26 seconds. Now you suddenly realize the power of multi-threading!

So it would seem that our formula ran (0.11 + 3.26)/0.49 = 6.8 times faster than it would on a single core.

You may ask why not 8x? We had 8 threads, didn’t we?

The first reason is Amdahl’s law: the serial time (0.11 sec) is constant and limits our speedup no matter how many threads you throw at it. But there is something more.

Let us check how much time it would really take if we limited execution to one thread only. Try running with a #pragma statement limiting the number of threads:

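#pragma maxthreads 1
period = Optimize( "period", 10, 2, 102, 0.2 ); // step 0.2 is assumed: it gives about 500 optimization steps, matching the output below
Buy = Cross( C, MA( C, period ) );
Sell = Cross( MA( C, period ), C );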

Suddenly the result is:

Individual optimize started.
Completed in 1.62 seconds. Number of rows: 500
( Timings: data: 0.07, setup: 0.00, afl: 0.10, job: 1.37, lock: 0.00,
pbt: 0.00, UI thread: 0.07, worker threads: 1.47/1.47 )

What? The entire optimization took just 1.62 seconds when run in a single thread, which is just 3.3 times slower than the multi-threaded run, not 6.8x as we calculated earlier. Why is the worker thread time 1.47 when it was 3.26 before? What happened?

There are a couple of reasons for that:
a) Hyper-threading – as soon as you exceed the CPU core count and start to rely on hyper-threading (running 2 threads on a single core), you find out that it does not deliver 2x performance. Unless your code does complicated things that keep the FPU busy, such as lots of trigonometric functions or other number crunching, hyper-threading will not give you 2x performance. On simple tasks it struggles to deliver +30%.
b) Turbo boost – modern CPUs have different settings for single-core and multi-core turbo boost. The effect is that the CPU can raise its clock to 4 GHz when running on a single core only, but is limited to 3.5 GHz when running multi-threaded code. This limits multi-threaded performance and speeds up single-threaded applications.
c) Concurrent L3 cache / RAM access – when multiple cores run code that accesses the L3 cache or RAM, they compete for access, which slows them down.

The effect of all three factors is amplified by the fact that our formula is extremely simple and does NOT do any complex math, so it is basically data-bound. This is why single-core execution was not as bad as we expected.

But what would happen if we increased the number of bars (keeping the formula the same)? Let us try with 12000 bars of data (6 times more data than previously):

8-threads:

Individual optimize started.
Completed in 1.61 seconds. Number of rows: 500
( Timings: data: 0.18, setup: 0.00, afl: 0.81, job: 11.57, lock: 0.00,
pbt: 0.00, UI thread: 0.19, worker threads: 12.38/12.38 )

1-thread:

Individual optimize started.
Completed in 6.90 seconds. Number of rows: 500
( Timings: data: 0.10, setup: 0.00, afl: 0.28, job: 6.48, lock: 0.00,
pbt: 0.00, UI thread: 0.10, worker threads: 6.76/6.76 )

First, we observe that although we used 6x more data, the time in the multi-threaded case increased from 0.49 to 1.61 seconds, which is only 3.29x. Second, we see that 8-threaded execution is now 6.90/1.61 = 4.29 times faster than single-threaded.
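The comparison can be reproduced from the “Completed in” times alone. Here is a short Python sketch (illustration only, with names of our own) that computes the measured speedup for both data sizes.

```python
# "Completed in" wall-clock times in seconds, copied from the runs above.
RUNS = {
    2000: {"threads_8": 0.49, "threads_1": 1.62},
    12000: {"threads_8": 1.61, "threads_1": 6.90},
}


def measured_speedup(single_thread_seconds, multi_thread_seconds):
    """How many times faster the multi-threaded run finished."""
    return single_thread_seconds / multi_thread_seconds


for bars, times in RUNS.items():
    speedup = measured_speedup(times["threads_1"], times["threads_8"])
    print(f"{bars} bars: {speedup:.2f}x faster with 8 threads")

growth = RUNS[12000]["threads_8"] / RUNS[2000]["threads_8"]
print(f"8-thread run time grew {growth:.2f}x for 6x more data")
```

Measuring the wall-clock ratio this way, rather than dividing summed thread times by elapsed time, is what exposed the difference between the apparent and the real speedup earlier.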

Why does the multi-threaded run now perform and scale better?

Simply put, we loaded the CPU with more work. That is the general rule: the more work you place on the CPU, the more time is spent in the parallel section and the more you gain from multi-threading.

So what would happen if you put the CPU to some really heavy work? It is surprisingly difficult to load an i7 CPU so heavily that it sits busy doing calculations without much memory access. You would really need to use functions that do heaps of calculations on very small chunks of data that sit in the L1 cache all the time, or use transcendental math functions that require the FPU to spend far more than a single cycle to derive a result. Let us try a combination of raising to a power, decimal logarithm and arc sine.

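period = Optimize( "period", 10, 2, 501, 1 ); // step 1 is assumed: it gives the 500 rows reported below
Buy = Cross( C, MA( C, period ) );
Sell = Cross( MA( C, period ), C );
// add some math to force i7 CPU to sweat a little bit
for( i = 0; i < 100; i++ ) x = acos( log( period ) ); // x is a placeholder name for the unused result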

Once you run this, you will see AmiBroker saturating your CPU for the first time (on my end it uses 99% of the CPU). The results are:

8 threads:

Individual optimize started.
Completed in 39.39 seconds. Number of rows: 500
( Timings: data: 0.14, setup: 0.00, afl: 302.73, job: 9.14, lock: 0.00,
pbt: 0.00, UI thread: 0.14, worker threads: 311.87/311.87 )

1 thread:

Individual optimize started.
Completed in 251.27 seconds. Number of rows: 500
( Timings: data: 0.12, setup: 0.00, afl: 243.92, job: 6.59, lock: 0.00,
pbt: 0.00, UI thread: 0.12, worker threads: 250.51/250.51 )

Now you can see that 8-threaded execution was 251.27/39.39 = 6.38 times faster than single-threaded.

This is almost perfect scaling with hyper-threading – remember that a hyper-threaded thread is NOT as fast as a separate-core thread. To prove that, we can run the same code on 4 threads:

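#pragma maxthreads 4
period = Optimize( "period", 10, 2, 501, 1 ); // step 1 is assumed: it gives the 500 rows reported below
Buy = Cross( C, MA( C, period ) );
Sell = Cross( MA( C, period ), C );
// add some math to force i7 CPU to sweat a little bit
for( i = 0; i < 100; i++ ) x = acos( log( period ) ); // x is a placeholder name for the unused result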

With four threads we get:

Individual optimize started.
Completed in 64.63 seconds. Number of rows: 500
( Timings: data: 0.13, setup: 0.00, afl: 250.22, job: 6.91, lock: 0.00,
pbt: 0.00, UI thread: 0.13, worker threads: 257.12/257.12 )

So 4-thread performance was 251.27/64.63 = 3.89 times faster than single-thread. And look at the “worker threads” time: it is very close to the single-thread time (257s vs 250s). This proves our point that, apart from the effect of RAM and L3 congestion and a slightly slower turbo boost clock, full-core threads scale perfectly as long as your formula gives them some real work.
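One way to put the three heavy-math runs side by side is to compute the speedup and the per-thread efficiency (speedup divided by thread count). The Python sketch below is an illustration only; the function name and the notion of “efficiency” used here are our own.

```python
# Wall-clock "Completed in" times in seconds from the heavy-math runs above,
# keyed by the number of threads used.
COMPLETED_SECONDS = {1: 251.27, 4: 64.63, 8: 39.39}


def scaling_report(completed):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread run."""
    baseline = completed[1]
    report = {}
    for threads, seconds in completed.items():
        speedup = baseline / seconds
        report[threads] = (speedup, speedup / threads)
    return report


for threads, (speedup, efficiency) in scaling_report(COMPLETED_SECONDS).items():
    print(f"{threads} threads: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

On these numbers the 4 full-core threads run at about 97% efficiency, while the 8 hyper-threaded threads drop to about 80%.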

Note: in all these tests we did NOT include the impact of disk speed, because we ran a single-symbol individual optimization, which runs entirely from RAM.

The bottom line: despite the marketing hype, buying a 32-thread CPU does not buy you 32x performance. Real-world performance depends on many factors, including formula complexity, whether it is heavy on math or not, the amount of data, RAM speed, on-chip cache sizes, the turbo boost clock differences between single-thread and multi-thread configurations, and so on. The devil is in the details and there are no simple answers. I always say: do not assume. Assumptions are not facts. Unless you measure something, you don’t know.

Wrong close price in Yahoo data (no more?)

July 22, 2017 UPDATE:
Today Yahoo changed the order of columns (swapped “Close” with “Adjusted Close”).
To fix this, you need to edit the aqh.format file (in the Formats subdirectory).

The old format line

$FORMAT Date_DMY,Open,High,Low,Skip,Close,Volume

should be changed to:

$FORMAT Date_DMY,Open,High,Low,Close,Skip,Volume

A ready-to-use aqh.format file can be downloaded from here (right click and choose “Save target as…”). Place it in the “Formats” subdirectory.
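To illustrate what the change does, here is a simplified Python sketch of how a $FORMAT line pairs field names with CSV columns. It is only a model for illustration, not the actual importer: the function name is ours, the sample row is hypothetical, and we assume that a “Skip” field simply means the column is ignored.

```python
OLD_FORMAT = "$FORMAT Date_DMY,Open,High,Low,Skip,Close,Volume"
NEW_FORMAT = "$FORMAT Date_DMY,Open,High,Low,Close,Skip,Volume"


def map_row(format_line, csv_row):
    """Pair each CSV column with its $FORMAT field name, dropping Skip columns."""
    fields = format_line.replace("$FORMAT", "", 1).strip().split(",")
    values = [value.strip() for value in csv_row.split(",")]
    return {
        field: value
        for field, value in zip(fields, values)
        if field != "Skip"
    }


# Hypothetical row, for illustration only: columns 5 and 6 hold two different closes.
ROW = "21-07-2017,10.00,11.00,9.50,10.40,10.80,1000"

print(map_row(OLD_FORMAT, ROW)["Close"])  # Close is read from column 6
print(map_row(NEW_FORMAT, ROW)["Close"])  # Close is read from column 5
```

The only difference between the two format lines is which of the two close columns is imported and which is skipped.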

You may have noticed over the last few days that for some symbols (stocks and ETFs that pay dividends) the close price is below the open, high and low prices.

This is a new Yahoo error caused by the fact that they added “dividend adjustments” incorrectly:
http://forum.amibroker.com/t/amiquote-3-15-is-coming/171/18

In the period May 17 – June 10, Yahoo was NOT adjusting for dividends, so OHLC prices were adjusted for splits only.

Charts looked correct (albeit without dividend adjustments).

Then people sent complaints about the missing dividend adjustments and Yahoo started adjusting for dividends, but… only the Close price, leaving High, Low and Open adjusted for splits only.

Here is an example of EWJ data downloaded from Yahoo:

02-08-2010, 39.000000, 39.360001, 38.880001, 9.840000, 35.239758, 4488800

The columns are as follows:

Date, AdjOpen, AdjHigh, AdjLow, RawClose, AdjCloseForSplitsAndDividends, Volume

As you can see, the close price adjusted for splits and dividends (35.239758) is lower than the adjusted low (38.880001).

So this is where the mess is coming from.

Yahoo is adjusting the High, Low and Open fields for splits only, but at the same time it is adjusting the Close field for both splits and dividends.

That is why you are getting a Close field below the OHL fields and strange-looking charts.
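A simple sanity check can flag such rows automatically. The Python sketch below is an illustration only: the field names follow the column list given above, and the function names are our own.

```python
def parse_row(line):
    """Split one comma-separated row into named fields (the date is kept as text)."""
    names = ["date", "adj_open", "adj_high", "adj_low",
             "raw_close", "adj_close", "volume"]
    values = [part.strip() for part in line.split(",")]
    row = dict(zip(names, values))
    for name in names[1:]:
        row[name] = float(row[name])
    return row


def close_outside_range(row):
    """True when the adjusted close is not between the adjusted low and high."""
    return not (row["adj_low"] <= row["adj_close"] <= row["adj_high"])


# The EWJ row quoted above.
ewj = parse_row("02-08-2010, 39.000000, 39.360001, 38.880001, 9.840000, 35.239758, 4488800")
print(close_outside_range(ewj))
```

For the EWJ row the check reports a problem, because the adjusted close of 35.239758 lies below the adjusted low of 38.880001.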

Another example, this time straight from the Yahoo web site.

Link:
https://finance.yahoo.com/quote/IP/history?period1=804549600&period2=807055200&interval=1d&filter=history&frequency=1d

What you get:

[Image: Wrong data on Yahoo]

The Adjusted Close is clearly lower than the Adjusted Low, because they subtract dividends from the Close price but don’t do that for the High, Low and Open prices.

This needs to be fixed by Yahoo on their site. You need to complain to Yahoo to get them to fix their mess.

There is already a thread that describes this data error on the Yahoo web site, https://forums.yahoo.net/t5/Yahoo-Finance-help/Historical-Data-Split-and-Dividend-Adjustment-Errors-in-EEM/td-p/279800, but apparently Yahoo does not see or understand it yet.

There are two possible ways to fix that on Yahoo’s side:
1. Yahoo needs to adjust the Open, High and Low fields the same way as it adjusts Close.
..or..
2. Yahoo needs to send RawOpen, RawHigh and RawLow instead of the adjusted ones. Then adjusted OHL can be calculated on the fly by deriving the adjustment ratio from AdjClose/Close.

Method 2 was used in the past (prior to the May 17, 2017 changes). It is the best solution, as it gives easy access to both adjusted and unadjusted OHLC. But either method would do, as long as the OHLC fields are adjusted the same way (i.e. for both splits AND dividends).
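Method 2 amounts to scaling the raw open, high and low by the same ratio that maps the raw close to the adjusted close. A minimal Python sketch, for illustration only, with a function name of our own and a hypothetical bar rather than real Yahoo data:

```python
def adjust_ohl(raw_open, raw_high, raw_low, raw_close, adj_close):
    """Scale raw open/high/low by the ratio that maps raw close to adjusted close."""
    ratio = adj_close / raw_close
    return raw_open * ratio, raw_high * ratio, raw_low * ratio


# Hypothetical raw bar, for illustration only (not real Yahoo data).
adj_open, adj_high, adj_low = adjust_ohl(
    raw_open=20.0, raw_high=22.0, raw_low=19.0, raw_close=21.0, adj_close=10.5
)
print(adj_open, adj_high, adj_low)
```

Because all four prices are scaled by the same ratio, the adjusted close always stays between the adjusted low and the adjusted high, just as the raw close stays between the raw low and high.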

How to install AmiQuote 3.14 correctly

UPDATE: AmiQuote 3.21 public version is available now.

AmiQuote 3.21 has been announced here: http://www.amibroker.com/devlog/2017/07/11/amiquote-3-21-released/

AmiQuote 3.21 is not a standalone installer; it requires AmiBroker to be installed on your computer, and the installation path should point to the folder where AmiBroker is installed.

To run the installer, use one of these links:
http://www.amibroker.com/bin/aq3210.exe (119KB) – 32 bit version
http://www.amibroker.com/bin/aq3210x64.exe (143KB) – 64 bit version

Make sure to download the correct version. If you have 32-bit AmiBroker, use 32-bit AmiQuote. If you have 64-bit AmiBroker, use 64-bit AmiQuote. If you have both, install both.

32-bit version of AmiBroker

The correct folder is automatically detected if AmiBroker is installed so you don’t need to change it. The default installation path is the following:

[Image: default installation path]

64-bit version of AmiBroker

The correct folder is automatically detected if AmiBroker is installed so you don’t need to change it. The default installation path is the following:

[Image: default installation path]

TROUBLESHOOTING:

  1. If you get a “Failed to update registry, use REGEDIT” message, you need to run AmiQuote once with administrator rights. To do so, right-click the AmiQuote icon and select “Run As Administrator”. Do this just once. Don’t run as administrator all the time, because automatic import won’t work.
  2. If you are not able to download more than 2 years’ worth of data, it means you entered a “From” date that is too early. For example, if you enter 1900 as the “From” date you won’t get more than the most recent 2 years. But if you enter a more reasonable starting date, such as the year 2000, you will be able to download 17 years’ worth of data.
  3. AmiQuote 3.20 uses a new method that should be independent of the installed IE version. However, if the download hangs on Windows XP, it is because the new Yahoo Finance pages are incompatible with Windows XP, and specifically with Internet Explorer 9 or lower that Windows XP shipped with. Windows XP is obsolete and no longer supported by Yahoo. You need at least Internet Explorer 10 installed (Windows 7).
  4. If it still does not work, you did something wrong. The program works fine as long as the above steps are followed PRECISELY. Rinse and repeat until it clicks.

AmiQuote Yahoo Historical stopped working

In May 2017 Yahoo Finance started making changes to their web services. During this time certain services may be interrupted or broken.

Three AmiQuote functionalities are affected by the recent Yahoo changes:

  1. Yahoo Historical download – the old CSV download API is broken at Yahoo Finance. A fix is available in v3.14 and higher
  2. Yahoo Fundamental Extra – the API has been changed. A fix is available in v3.14 and higher
  3. Yahoo Intraday download – the API does not respond. No fix is available at the moment

To be able to use Yahoo data again, you need to update to the most recent AmiQuote version, announced at http://www.amibroker.com/devlog/ and available in the download section at http://www.amibroker.com/download.html.

Purchases from Hong Kong

Customers from Hong Kong sometimes have problems ordering, because there are no ZIP codes in Hong Kong and the SWREG system that we use requires a ZIP code. If you are paying by credit card, entering 0000 or NA in the ZIP code field is accepted. But if you are paying with PayPal, the order may be rejected if your PayPal account does not have a matching ZIP code. Therefore, if you are using PayPal you need to follow these instructions:

  1. Update your PayPal account billing address with ‘0000’ in the Postal Code/ZIP field
  2. Now you can use 0000 as the ZIP code on the SWREG ordering page

Support response times

AmiBroker’s technical support staff deals every day with a very wide range of subjects, from simple installation questions, lost registration details and password reminders to complex things like C++ programming or esoteric issues that occur, say, once a month or only when the program is loaded with dozens of gigabytes of data.

Support response times to these different inquiries obviously vary a lot.

Typically we answer basic questions within 24 hours on weekdays (Monday-Friday).

Very simple questions may even be answered within minutes if you happen to ask them while our support staff is in the office (we are in the GMT+1 time zone). If you are in a different time zone, we may be asleep, so you may need to wait until the next day.

This response time applies to questions that are already covered in our Official Knowledge Base, Users’ Knowledge Base, Users’ Manual or internal documentation/resources. It is a good idea to check those resources yourself, as you are very likely to find the answer much more quickly.

For more complex questions that require formulas to be written or checked/verified, the response time may be longer (48 hours), as long as the check can be done by our regular support staff.

Some complex issues/questions cannot be solved or answered by support staff alone, and they are then escalated to development. Keep in mind that development is 100% busy all the time; we are not sitting here doing nothing. Development is an ongoing, all-day job, and those complex support issues must wait in the queue. Also, since some issues/questions require lots of work (setting up an environment to mimic the customer’s setup, testing, single-step debugging sessions, going through millions of lines of code), they may easily take days or even weeks to complete. If development finds that the issue is due to a software problem, the problem is either fixed at once or scheduled for fixing. This is a process. So please do not expect a “next day response” for these kinds of issues. You are also not going to get constant e-mails/updates like “we are working on it”, because we are always working on something in the queue. Please be patient; things are being worked on constantly.

How to register AmiQuote and AFL Code Wizard

AmiQuote and AFL Code Wizard are separate applications, so their registration is separate from registering AmiBroker and requires entering the unlock code in the Help->Register menu of AmiQuote or AFL Code Wizard respectively. The unlock codes are delivered in the transaction receipt generated after the purchase (sent by SWREG, ShareIt or another payment processor).

In order to register these programs, it is necessary to launch AmiQuote or AFL Code Wizard first.

AmiQuote can be launched e.g. from the Windows Start menu or by double-clicking the Quote (Quote.exe) program in the AmiBroker/AmiQuote folder.

Once the program is running, we need to enter the unlock code in the Help->Register AmiQuote menu.

Then we can enter our name and the unlock code, and press the Update button.

AFL Code Wizard can be launched from the Analysis menu inside AmiBroker.

After the program is launched, it is necessary to select the Help->Registration details item from the menu.

Then we can enter our name and the unlock code, and press the Update button.

Calling custom user functions in our code

The AFL language allows us to define reusable functions that can be used in our formulas. The following chapter of the manual explains the procedure in detail: http://amibroker.com/guide/a_userfunctions.html

When we want to call such a function in our formula, we should add the function definition to our code, so that AmiBroker can identify and interpret the custom keyword properly. Consequently, if we use the function in multiple chart panes, each of the formulas should contain the function definition first.

// custom function definition
function myMACD( array, fast, slow )
{
   return EMA( array, fast ) - EMA( array, slow );
}

// use of function
Plot( myMACD( High, 12, 26 ), "Open MACD", colorRed );

Since we may define a large group of our own functions, pasting the definitions manually may not be very convenient. To avoid that, we can use the #include statement and group our definitions in a separate AFL file, which is pulled in with a single statement from our main code.

To create such a file we should do the following:

  1. Create a new formula. The preferred location is the Include folder in the Charts window, but we can in fact choose any custom location for the file.

  2. We can also rename the file to a descriptive name, for example myfunctions.afl.

  3. Now we can edit the file, paste our function definitions, and save the file:
    function myMACD( array, fast, slow )
    {
       return EMA( array, fast ) - EMA( array, slow );
    }
  4. Now in our main file we only need a reference to the myfunctions.afl file:
    // include our definitions
    #include <myfunctions.afl>

    // use of function
    Plot( myMACD( High, 12, 26 ), "Open MACD", colorRed );

We don’t have to specify the path, because we saved our formula in the folder that is specified as the ‘default include path’ in Tools–>Preferences–>AFL.

In other cases we should provide the full path to the file. #include is a pre-processor command, so here we use single backslashes in the path:

#include "C:\Program Files\AmiBroker\AFL\common.afl"

More information about the #include command can be found at:

http://www.amibroker.com/guide/afl/_include.html

Where does AmiQuote save downloaded data?

AmiQuote is a companion program shipped with AmiBroker that allows downloading data from free sources such as Yahoo Finance, Google Finance and others. Since it is a separate application, it can work independently of AmiBroker, and it saves data in text files stored in the Destination Folder defined in the Tools->Settings window:

[Image: aq download folder]

AmiQuote can also communicate with AmiBroker using OLE automation and automatically import downloaded data into AmiBroker if the Automatic Import option is selected:

[Image: aq download folder]

AmiQuote will import data into the database that is open in AmiBroker at the time of import.

Additionally, if more than one instance of AmiBroker is open at the same time with different databases loaded, AmiQuote will communicate with the instance that was launched first and will import data into the database open in that instance.

Do not exceed real-time symbol limit

When we subscribe to a real-time data source such as eSignal or IQFeed, our subscription package determines how many symbols we can access in real time simultaneously. The plugin configuration in File->Database Settings->Configure should match the subscription limit.

IQFeed:

[Image: rt config]

eSignal:

[Image: rt config]

As explained in the Users’ Guide here: http://www.amibroker.com/guide/h_rtsource.html, although AmiBroker is able to handle more symbols in the database than the streaming limit, we should not exceed the RT subscription limit in continuous screening during session hours.

This is because if we try to access more symbols than our subscription covers, it requires a lengthy process that includes:

  1. removing the oldest symbol from the streaming list
  2. adding the new one
  3. triggering backfill for the newly added stock to fill the historical data from the last valid update that we already have
  4. streaming and displaying RT data.

This process is then repeated for each new symbol that is included in the screening. As a result, this may cause various problems, with the data source unable to handle that many backfill requests in a short time. Additionally, data vendors may proactively protect their servers from abuse of the streaming limits.

Therefore, it is highly recommended to stay within the subscription limits for real-time operation and scanning to avoid problems.
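The cost of exceeding the limit can be illustrated with a simplified model of the streaming list. The Python sketch below is an illustration only and does not describe the behaviour of any particular data plugin: the function name, the symbol names and the assumption that the oldest symbol is always the one evicted are ours.

```python
from collections import deque


def count_backfills(symbols_to_scan, streaming_limit):
    """Count how often a symbol has to be added to the streaming list.

    Simplified model: the list holds at most streaming_limit symbols, and
    accessing a symbol that is not on it evicts the oldest entry, adds the
    new symbol and triggers a backfill for it.
    """
    streaming = deque()
    backfills = 0
    for symbol in symbols_to_scan:
        if symbol in streaming:
            continue  # already streaming, no extra work
        if len(streaming) >= streaming_limit:
            streaming.popleft()  # step 1: remove the oldest symbol
        streaming.append(symbol)  # step 2: add the new one
        backfills += 1  # step 3: a backfill is triggered for it
    return backfills


watchlist = [f"SYM{i}" for i in range(150)]  # hypothetical symbols

# Two consecutive scan passes over the same 150-symbol watchlist.
print(count_backfills(watchlist * 2, streaming_limit=200))  # within the limit
print(count_backfills(watchlist * 2, streaming_limit=100))  # over the limit
```

In this model, a watchlist that fits within the limit is backfilled once (150 requests over two passes), while a watchlist larger than the limit has to backfill every symbol again on every pass (300 requests over two passes).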
