TIOBE Programming Community Index Definition

Because there are many questions about how the TIOBE index is assembled, this page is devoted to its definition. In essence, the calculation comes down to counting hits for the search query

+"<language> programming"

The following sections explain which search engines qualify, which programming languages qualify, and exactly how the ratings are calculated.

Search Engines

The TIOBE index is calculated using 25 search engines. These are the 25 highest-ranked websites on Alexa that meet the following conditions:

Based on these criteria, the most important Alexa-ranked search engines currently qualify as follows:

Programming Language

This section clarifies what counts as a programming language for the TIOBE index. There are 3 requirements, all of which must hold:

Programming languages that are very similar are grouped together. Currently, the rating of a grouping is calculated from the maximum of the hits of its individual languages. In the future we will do a better job and take the union (in the set-theoretic sense) of all the hits.
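The current grouping rule (take the maximum of the member languages' hits) can be sketched in a few lines of Python. This is only an illustration: the function name, the language names and the hit counts are hypothetical and do not come from the TIOBE data.

```python
def grouped_hits(hits_by_language: dict[str, int], group: list[str]) -> int:
    """Return the hit count of a grouping as the maximum over its members."""
    member_hits = [hits_by_language[name] for name in group if name in hits_by_language]
    if not member_hits:
        raise ValueError("no hit counts available for this grouping")
    return max(member_hits)


# Hypothetical hit counts for a single search engine (illustrative numbers only).
example_hits = {"LangA": 1200, "LangA.NET": 800, "LangB": 500}

# "LangA" and "LangA.NET" are assumed to belong to the same grouping.
group_total = grouped_hits(example_hits, ["LangA", "LangA.NET"])
print(group_total)
```

Taking the maximum avoids counting the same page twice when it mentions several members of a grouping, but it undercounts pages that mention only one of the smaller members, which is why a union of the hits would be more accurate.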

The rules for which languages are grouped have been formalized as follows:

Two mechanisms are used to filter out false positives. First, a confidence is defined for each language. By default the confidence is 100%, but for some difficult search queries, such as "Basic Programming", it is lower. In addition to the confidence, exceptions or mandatory additions are sometimes used to weed out false positives.

The following table contains all tracked programming languages, including their groupings, confidences and exceptions.

Ratings

The ratings are calculated by counting hits on the most popular search engines. The search query used is

+"<language> programming"

The number of hits determines the rating of a language. The counted hits are normalized per search engine over all languages in the list. In other words, all languages together have a score of 100%.

Let's define "hits(SE)" as the sum of the number of hits for all languages for search engine SE, and "hits(PL,SE)" as the number of hits for programming language PL for search engine SE.

Possible false positives for a query are already filtered out in the definition of "hits(PL,SE)". This is done by using a manually determined confidence factor per query. For example, a query such as "Basic programming" also returns pages that contain "Improve your basic programming skills in Java". The first 100 pages per search engine are checked for possible false positives, and the result is used to define the confidence factor. If this factor is 90%, then only 90% of the hits are used for "hits(PL,SE)". An overview of the confidence factors can be found in the groupings table.
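As a rough sketch of how a confidence factor could be applied to raw hit counts, consider the Python fragment below. The text only says that the first 100 pages are checked by hand. Computing the factor as the fraction of checked pages that are true positives is an assumption made here for illustration, and all function names and numbers are hypothetical.

```python
def estimate_confidence(page_is_relevant: list[bool]) -> float:
    """Assumed rule: confidence = share of checked pages that are true positives."""
    if not page_is_relevant:
        raise ValueError("at least one checked page is required")
    return sum(page_is_relevant) / len(page_is_relevant)


def adjusted_hits(raw_hits: int, confidence: float) -> float:
    """Scale the raw hit count by the confidence factor to obtain hits(PL,SE)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    return raw_hits * confidence


# Hypothetical manual check: 90 of the first 100 pages are about the language.
checked_pages = [True] * 90 + [False] * 10
confidence = estimate_confidence(checked_pages)

# Hypothetical raw hit count reported by one search engine.
hits_pl_se = adjusted_hits(1000, confidence)
print(confidence, hits_pl_se)
```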

The ratings are calculated with the following formula:

(hits(PL,SE1)/hits(SE1) + ... + hits(PL,SEn)/hits(SEn))/n

where n is the number of search engines used.
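The formula can be illustrated with a small Python sketch. It assumes the confidence-adjusted hit counts are already available as a nested dictionary. The data layout, the function name, the engine and language names and the numbers are all hypothetical.

```python
def rating(language: str, hits: dict[str, dict[str, float]]) -> float:
    """Average, over all search engines, of the language's share of that engine's hits.

    `hits` maps search engine -> {language -> hits(PL,SE)}.
    """
    if not hits:
        raise ValueError("at least one search engine is required")
    shares = []
    for engine, per_language in hits.items():
        total = sum(per_language.values())  # hits(SE)
        if total == 0:
            raise ValueError(f"search engine {engine!r} returned no hits")
        shares.append(per_language.get(language, 0.0) / total)
    return sum(shares) / len(shares)  # divide by n, the number of engines


# Hypothetical hit counts for two search engines and two languages.
example = {
    "SE1": {"LangA": 60.0, "LangB": 40.0},
    "SE2": {"LangA": 30.0, "LangB": 70.0},
}
rating_a = rating("LangA", example)  # (60/100 + 30/100) / 2
rating_b = rating("LangB", example)  # (40/100 + 70/100) / 2
print(f"{rating_a:.2%} {rating_b:.2%}")
```

Because each engine's hits are normalized before averaging, the ratings of all languages add up to 100%, and an engine that reports very large absolute hit counts does not dominate the result.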

Artifacts or ideas for improving the calculation of the TIOBE index are gratefully received (tpci@tiobe.com).