Spin-offs and derivatives of the Media Arabic Text Collection


Do you want to benefit more from the Media Arabic Text Collection? If the answer is yes, please continue to read about the additional and supporting materials and spin-offs.

The following sets of data are available for you:
  1. Overview of the content of the MATC
  2. Links to supporting materials
  3. Spin-offs containing useful files and lists

1 Overview of the content of the MATC
The current content of the MATC (Nov 2020) is:
Number of texts in MATC: 45
Total number of words in MATC: 7000
Number of texts and words per subset:
  Tunisia: 17 texts, 2024 words
  Palestine: 19 texts, 2575 words
  Short texts: 9 texts, 3000 words
[Word counts are approximate.]

These numbers are based on the information available in the Catalog file. You can always check this page for new content in the MATC. A downloadable PDF contains two overviews of the texts in the MATC. In these tables the texts are sorted by ascending lexical difficulty and by ascending syntactic difficulty. The rankings are based on the number of glosses (lexical) or syntactic codes (syntactic) in a text relative to its number of words. In this way you can identify the easiest texts, or the most difficult ones if you are up for a challenge.
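
As a rough illustration of this kind of ranking, here is a minimal Python sketch. It assumes that difficulty is simply the number of glosses divided by the number of words; the exact formula behind the PDF overviews is not stated here. The text IDs and counts are invented for the example and are not taken from the Catalog.

```python
# Hypothetical records: (text ID, word count, number of glosses).
# IDs and figures are invented for illustration, not real MATC data.
texts = [
    ("TXT_A", 150, 12),
    ("TXT_B", 120, 30),
    ("TXT_C", 200, 10),
]


def difficulty(words, markers):
    """Return markers per word; a higher ratio suggests a harder text."""
    return markers / words


def rank_by_difficulty(records):
    """Sort records in ascending order of difficulty (easiest first)."""
    return sorted(records, key=lambda r: difficulty(r[1], r[2]))


ranking = rank_by_difficulty(texts)
for text_id, words, glosses in ranking:
    print(f"{text_id}: {difficulty(words, glosses):.3f} glosses per word")
```

The same function would work for the syntactic ranking if the third field held the number of syntactic codes instead of glosses.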



2 Links to supporting materials
Any news for users of the MATC can be found on the News page. You can check that page whenever you want to know whether new content is available. As already mentioned, you can also check the Catalog page.
For all information concerning the Top Media Vocabulary you can check the TMV page. From that page you can also download various PDFs containing the vocabulary.

3 Spin-offs containing useful files and lists
Two important derivatives are available for users of the MATC. The first is a list showing in which texts you can find certain less frequent syntactic codes. The second spin-off is a table containing all phraseology items from all texts in the MATC.

Occurrences of less frequent syntactic codes
The first derivative is a table providing information about the occurrences of the less frequent, more complex syntactic codes. Imagine you want to see some examples of a less frequent syntactic phenomenon, for example a relative clause with a participle as verb (type R4). It may then be useful to know that text TUN027 contains 4 relative clauses of type R4. I have prepared an overview of a limited set of syntactic codes and their occurrences in the texts of the MATC. The overview is available as a PDF and covers the following codes:
C3, EL, L4, L8, L9, M3, M4, MA1, MA3, MA4, MA5, R4, S4, V2.
If a code is not mentioned in this overview, this implies the code does not occur in the present collection of the MATC.



In the overview you will find:
Identical syntactic codes grouped together, in column A
The text in which you can find that syntactic code, in column B
The sequence number of the syntactic code in that text (SYNxxx), in column C. Comparing this number with the total number of syntactic codes in the text, which you can see in the about tab, gives you an indication of where in the text the desired syntactic code occurs, i.e., in the beginning, the middle or the end.
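
The comparison described for column C can be sketched in a few lines of Python. This is only an illustration: dividing the text into three equal parts is an assumption, and the function name and the sample numbers are invented.

```python
def locate(syn_number, total_codes):
    """Estimate which third of a text a syntactic code falls in.

    syn_number:  the numeric part of SYNxxx from column C
    total_codes: total number of syntactic codes in the text
    The split into equal thirds is an assumption made for this sketch.
    """
    if not 1 <= syn_number <= total_codes:
        raise ValueError("syn_number must be between 1 and total_codes")
    position = syn_number / total_codes
    if position <= 1 / 3:
        return "beginning"
    if position <= 2 / 3:
        return "middle"
    return "end"


# Invented example: a text with 90 syntactic codes in total.
print(locate(12, 90))
print(locate(45, 90))
print(locate(80, 90))
```

The estimate is only approximate, because syntactic codes are not necessarily spread evenly through a text.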

All Phraseology in one table
The second derivative is even more important and useful. I have made available an Excel table containing all phraseology (collocations and multiword expressions, or MWEs).
Since the phraseology topic has grown over time, I decided to move it to a separate page. Please continue reading there.

Arabic Media Text Collection by Jan Hoogland is licensed under CC BY-NC-ND 4.0