User:Pfctdayelise/Using Wikipedia as a resource for computational linguistics

The aim of this book is to outline some areas of research in computational linguistics and natural language processing where Wikipedia, and by extension the other Wikimedia projects, has the potential to be a valuable resource. It is not intended to serve as an introduction to either of these fields and does not assume any knowledge of Wikipedia.

Description of Wikimedia projects


Wiktionary, Wikinews, Commons, Wikibooks, Wikisource, Wikiquote. Languages. Meta and Commons have direct translations of help (etc.) pages. Who contributes? Growth.

 
Possibly dv.wp (the Dhivehi Wikipedia) is the smallest project that has a translated logo. It has fewer than 200 articles and probably fewer than 20 legitimate registered users. Dhivehi is the official language of the Maldives and has fewer than 300,000 speakers.

Database dumps

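As a rough illustration of working with a dump, the sketch below streams pages out of a MediaWiki XML export using only Python's standard library. The miniature inline sample, the helper names `local_name` and `iter_pages`, and the export namespace version are assumptions made for illustration; real dumps are compressed, far larger, and may use a different schema version.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature dump; a real one would be read from a decompressed file.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Maldives</title>
    <revision><text>An island country.</text></revision>
  </page>
  <page>
    <title>Dhivehi</title>
    <revision><text>#REDIRECT [[Maldivian language]]</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code does not depend on the schema version."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs one page at a time, without loading the whole dump."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the memory held by the finished page


pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP.encode("utf-8"))))
for title, text in pages:
    print(title, len(text))
```

Streaming with `iterparse` rather than parsing the whole tree is the design choice that matters here, since a full dump is unlikely to fit in memory.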

Post-processing (pre-processing?) tools


Description of the English Wikipedia

  • License
  • Accessible - database dumps
  • Coverage - biased towards pop-culture and geek topics (best coverage), WikiProjects
  • Format - Manual of Style (MOS), but not reliably followed
  • Featured articles (FAs), cleanup tags
  • Redirects (RDRs)

Categories


Disambiguation pages


Possible tasks


Word sense disambiguation


Word and phrase translation


Web mining, data mining


Machine translation


Geospatial term disambiguation and named entity recognition


Image analysis (?)


Synonymy, abbreviations (redirects)

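One way redirects might feed a list of synonyms and abbreviations: the wikitext of a redirect page begins with `#REDIRECT [[Target]]`, so grouping redirect titles by their target gives candidate alternative names for that target. The sketch below assumes (title, wikitext) pairs are already available; the sample pairs and the function name `redirect_groups` are hypothetical.

```python
import re
from collections import defaultdict

# Matches the leading "#REDIRECT [[Target]]" of a redirect page (case-insensitive),
# stopping before any "#section" or "|label" part of the link.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_groups(pages):
    """Map each redirect target to the list of titles that redirect to it."""
    groups = defaultdict(list)
    for title, text in pages:
        match = REDIRECT_RE.match(text)
        if match:
            groups[match.group(1).strip()].append(title)
    return dict(groups)


# Made-up sample pages, for illustration only.
SAMPLE_PAGES = [
    ("NLP", "#REDIRECT [[Natural language processing]]"),
    ("Natural-language processing", "#redirect [[Natural language processing]]"),
    ("Natural language processing", "Natural language processing is a field ..."),
]

groups = redirect_groups(SAMPLE_PAGES)
for target, names in groups.items():
    print(target, "<-", names)
```

Redirects also cover misspellings and loosely related topics, so groups like these are best treated as candidates to be filtered rather than as a finished synonym list.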