Tuesday, November 9, 2010

Character Frequency Analysis for Turkish (Türkçe)

1. Introduction to character frequency analysis:

Frequency analysis, which is used to break classical ciphers, is the study of how often each character of a language appears in written text. It relies on the fact that every written language shows a characteristic pattern of letter frequencies: some letters appear often and others rarely. Common combinations of characters (digrams and trigrams) can be used to break a classical cipher in the same way.

2. Method of collecting samples:

This report describes a character frequency analysis carried out on the Turkish language. For this analysis I used 35 different samples of Turkish text from the following sources:

i. Wikipedia

Samples were taken from different kinds of articles such as “Earth”, “Sports” and “Turkish people”.

ii. Web sites written in Turkish

a. http://tdkterim.gov.tr/bts/ [On 11/03/2010]

b. http://www.turkcebilgi.com/sozluk/ [On 11/03/2010]

iii. Google Translate http://translate.google.lk/# [On 10/03/2010]

Some of the samples were created using Google Translate.

Each sample contains more than 300 words. Together the 35 samples contain about 16,505 words and 108,851 Turkish characters.
















Figure 1 [Result of the frequency analysis program as percentages]

3. Method used to count characters:

A Java console application was developed for this purpose. The path of a folder containing text files with the Turkish samples is given as the input. The program then reads all the text files and counts the frequency of each Turkish character. Its output is a list of frequencies for each character [Fig 1]. It also prints a bar graph made of stars so that the character frequencies can be compared [Fig 2].
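The original program is not reproduced in this post. The following is only a minimal sketch of how such a counter could look, under several assumptions: the class and method names (TurkishFrequency, countLetters, toPercentages, stars, readFolder) are hypothetical, the samples are assumed to be UTF-8 encoded .txt files, upper and lower case are counted together, and one star is printed per percentage point. The non-ASCII Turkish letters are written as Unicode escapes so that the file compiles under any source encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class TurkishFrequency {
    // The 29 lower-case Turkish letters. Escapes used: U+00E7 (c-cedilla),
    // U+011F (soft g), U+0131 (dotless i), U+00F6 (o-umlaut),
    // U+015F (s-cedilla), U+00FC (u-umlaut).
    static final String ALPHABET =
            "abc\u00e7defg\u011fh\u0131ijklmno\u00f6prs\u015ftu\u00fcvyz";

    // Turkish lower-casing maps 'I' to dotless i and dotted capital I to 'i'.
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    /** Counts each Turkish letter in the text, ignoring case and non-letters. */
    static Map<Character, Long> countLetters(String text) {
        Map<Character, Long> counts = new TreeMap<>();
        for (char ch : text.toLowerCase(TURKISH).toCharArray()) {
            if (ALPHABET.indexOf(ch) >= 0) {
                counts.merge(ch, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Converts raw counts to percentages of all counted letters. */
    static Map<Character, Double> toPercentages(Map<Character, Long> counts) {
        long total = 0;
        for (long c : counts.values()) {
            total += c;
        }
        Map<Character, Double> result = new TreeMap<>();
        for (Map.Entry<Character, Long> e : counts.entrySet()) {
            result.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return result;
    }

    /** Builds a bar of stars, one star per percentage point (rounded). */
    static String stars(double percent) {
        StringBuilder sb = new StringBuilder();
        for (long i = 0; i < Math.round(percent); i++) {
            sb.append('*');
        }
        return sb.toString();
    }

    /** Reads every .txt file in the folder as UTF-8 and joins the contents. */
    static String readFolder(Path folder) throws IOException {
        StringBuilder all = new StringBuilder();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder, "*.txt")) {
            for (Path file : files) {
                all.append(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
                all.append('\n');
            }
        }
        return all.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TurkishFrequency <folder>");
            return;
        }
        String text = readFolder(Paths.get(args[0]));
        Map<Character, Double> percentages = toPercentages(countLetters(text));
        // Entries are printed in code point order, not Turkish alphabetical order.
        for (Map.Entry<Character, Double> e : percentages.entrySet()) {
            System.out.printf("%c %8.5f%% %s%n", e.getKey(), e.getValue(), stars(e.getValue()));
        }
    }
}
```

Lower-casing with the Turkish locale matters here: with a default English locale, ‘I’ would be folded into ‘i’ instead of ‘ı’, which would distort the counts of both letters.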

4. Results:

The results of the program [Table 1] were imported into MS Office Excel 2007 to draw the graph shown in [Fig 3].

Figure 2 [Result of the frequency analysis program as a star bar graph]

Figure 3 [Percentage frequencies as a bar graph]

According to the results, the characters can be divided into three frequency ranges.

i. High frequency range (6% - 11%) [a e i n r l]

Sum of the percentages is 50.73725%

ii. Medium frequency range (2% - 6%) [k ı d t m u y s ü o b]

Sum of the percentages is 38.1788%

iii. Low frequency range (0% - 2%) [ş z g v ç ğ h ö c p f]

Sum of the percentages is 11.08396%
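The grouping into ranges can be sketched in a few lines of Java. This is only an illustration, not the method actually used for the report: the class name FrequencyRanges and its methods are hypothetical, a value that falls exactly on a boundary is assumed to belong to the higher range, and the three percentages in main are placeholders rather than the measured values from Table 1.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FrequencyRanges {
    /** Names the range of a percentage; boundary values go to the higher range (assumption). */
    static String rangeOf(double percent) {
        if (percent >= 6.0) {
            return "high";
        }
        if (percent >= 2.0) {
            return "medium";
        }
        return "low";
    }

    /** Groups letters by range, keeping the input order inside each group. */
    static Map<String, List<Character>> group(Map<Character, Double> percentages) {
        Map<String, List<Character>> groups = new LinkedHashMap<>();
        groups.put("high", new ArrayList<>());
        groups.put("medium", new ArrayList<>());
        groups.put("low", new ArrayList<>());
        for (Map.Entry<Character, Double> e : percentages.entrySet()) {
            groups.get(rangeOf(e.getValue())).add(e.getKey());
        }
        return groups;
    }

    /** Adds up the percentages of the given letters. */
    static double sum(Map<Character, Double> percentages, List<Character> letters) {
        double total = 0.0;
        for (char letter : letters) {
            total += percentages.get(letter);
        }
        return total;
    }

    public static void main(String[] args) {
        // Placeholder percentages for illustration only, not the values of Table 1.
        Map<Character, Double> sample = new LinkedHashMap<>();
        sample.put('a', 10.0);
        sample.put('k', 4.0);
        sample.put('j', 0.1);
        Map<String, List<Character>> groups = group(sample);
        for (Map.Entry<String, List<Character>> e : groups.entrySet()) {
            System.out.printf("%s %s sum=%.5f%%%n",
                    e.getKey(), e.getValue(), sum(sample, e.getValue()));
        }
    }
}
```

As a check on the figures reported above, 50.73725 + 38.1788 + 11.08396 = 100.00001, so the three ranges cover all counted characters up to rounding.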

Table 1 [Percentages of frequencies in descending order]

5. Observations and analysis:

The following table (Table 2) gives the normal Turkish letter frequencies according to the book “Advances in Information Systems” [ref 4].

Table 2 [Percentage character frequencies in the order given by Advances in Information Systems, Tatyana Yakhno]

The results and the data from the book “Advances in Information Systems” [ref 4] agree on the six characters with the highest frequencies (a, e, i, n, r, l) and the three characters with the lowest frequencies (p, f, j). The letter d is in ninth place in both the results and the book.

The other characters appear in an order that deviates only slightly from the order given in the referenced book.

For example, in the descending order of the observed frequencies the seventh letter is ‘k’ and the eighth is ‘ı’, while these two letters are interchanged in the referenced book. The same happens for the letters ‘c’ and ‘ö’: in the observed results the twenty-fifth letter is ‘ö’ and the twenty-sixth is ‘c’, whereas they are interchanged in the book.

According to Table 1, the character with the highest frequency is ‘a’ and the one with the lowest is ‘j’. In the high frequency range the first three characters are all vowels. The sum of the percentages of the vowels in that range is 28.5776%.

The following categorization can also be applied to the results:

i. Vowels [a e i ı o ö u ü]

Vowels are generally among the most frequently used characters in most languages. According to the results, the percentage frequencies of the Turkish vowels add up to 42.45896%.

ii. High frequency consonants [n r l k d t]

Sum of the frequencies is 30.48295%. This agrees with the general characteristic of a language having a set of frequently used consonants.

iii. Low frequency consonants [m y s b ş z g v ç ğ h c p f j]

Sum of the frequencies is only 27.05809%.

Bibliography

1. Character frequency analysis

http://en.wikipedia.org/wiki/Frequency_analysis [on 02/11/2010]

2. Letter frequency analysis

http://library.thinkquest.org/28005/flashed/thelab/cryptograms/frequency.shtml [on 02/11/2010]

3. Turkish alphabet

http://www.onlineturkish.com/alphabet.asp [on 02/11/2010]

4. Advances in Information Systems, Tatyana Yakhno

http://goo.gl/T5IQA [on 02/11/2010]

Monday, July 5, 2010

An easy way to get rid of java.lang.OutOfMemoryError: PermGen space, in JDeveloper 11g


When applications are deployed again and again in JDeveloper, java.lang.OutOfMemoryError: PermGen space occurs after some time and stops your application from being deployed.


After this error occurs, the java.exe process must be killed before the next deployment. This is easily done with a .bat file that kills the process (the .bat file is attached to this post).
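The attached file is not reproduced in this text. As a rough sketch only, and assuming a Windows machine where the standard taskkill command is available, such a .bat file could contain something like the following. Note that it ends every process named java.exe on the machine, not only the one started by JDeveloper, so other running Java applications would be closed as well.

```bat
@echo off
rem Sketch only: forcefully end every process whose image name is java.exe.
taskkill /F /IM java.exe
```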




With the “External Tools” option in JDeveloper, killing this process takes only a single click!

JDeveloper >> Tools >> External Tools >> New >> Next

Browse to the .bat file, then Next >> Next

In the Integration window, select where you want the icon or menu item to appear.


Then click Finish.

Now you can see the icon appearing in the main window.



Whenever you get java.lang.OutOfMemoryError: PermGen space, just click the newly created icon and the java.exe process will be terminated.