(Workshop Series #4) Introduction to natural language processing and topic modeling with Python

[Workshop Series #4 of Introduction to Computational Social Science Methods with Python]

Workshop Details

Date: Wednesday, 12 April 2023

Time: 14:00 – 17:00

Venue: Koç University, Rumelifeneri Campus; LECTURE ROOM: SOS 238
(NOTE: Please be aware that the event's format may change to ONLINE depending on regulation changes announced by the Higher Education Council (YÖK).)

Workshop Language: English

Instructors: Dr. Nicolò Gozzi & Dr. N. Gizem Bacaksizlar Turbic

Course description

Documents and full texts have a long history as data in the social sciences. Computational Social Science is also concerned with new forms of text data that can be collected from digital platforms and the web. All such datasets are expressions of natural language, which brings methods from computational linguistics and machine learning, such as Natural Language Processing (NLP) and automated content analysis, to center stage.

In the workshop, we will give an introduction to how text data can be preprocessed and analyzed in Python. In particular, we will discuss how information can be extracted from raw texts using regular expressions, how words can be reduced to their base forms, what language models are and how they allow us to extract meaningful pieces of symbolic communication such as n-grams, how grammatical parts of speech (e.g., nouns, verbs) can be identified, and how all these steps combine into a text preprocessing pipeline. At the end of such a pipeline stands a document-term matrix that is ready for analysis.

For analysis, we will introduce Latent Dirichlet Allocation (LDA), a topic modeling method for fully automated content analysis that reduces the dimensionality of the document-term matrix. It assumes that documents are generated from topics and infers topics as groups of words. As data, we will use a popular text corpus still to be determined. The workshop will alternate between live-coding demonstrations and periods in which participants apply that knowledge in context, both using Jupyter Notebooks. We will use spaCy and Gensim, two standard Python libraries for NLP and topic modeling.
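To give a feel for the preprocessing steps described above, here is a minimal sketch using only the Python standard library. It is not the workshop material: the toy corpus, the stopword list, and all function names are made up for illustration, and the workshop itself will rely on spaCy and Gensim for these steps. The sketch extracts hashtags with a regular expression, tokenizes the text, builds bigrams (n-grams with n = 2), and assembles a small document-term matrix of word counts.

```python
import re
from collections import Counter

# Toy corpus invented for illustration; not the workshop's dataset.
docs = [
    "Topic models find topics in large text collections. #NLP",
    "Regular expressions extract hashtags like #NLP and #CSS from raw text.",
    "Text collections from the web are noisy text data.",
]

# A tiny, hand-picked stopword list (an assumption, not a standard list).
STOPWORDS = {"in", "from", "the", "and", "are", "like", "a", "of"}


def extract_hashtags(text):
    """Pull hashtags out of raw text with a regular expression."""
    return re.findall(r"#\w+", text)


def tokenize(text):
    """Drop hashtags, lowercase, keep alphabetic tokens, remove stopwords."""
    text = re.sub(r"#\w+", " ", text)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def bigrams(tokens):
    """Return adjacent word pairs (n-grams with n = 2)."""
    return list(zip(tokens, tokens[1:]))


def document_term_matrix(token_lists):
    """Build a count matrix: one row per document, one column per term."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[t] for t in vocab])
    return vocab, rows


hashtags = [extract_hashtags(d) for d in docs]
tokens = [tokenize(d) for d in docs]
vocab, dtm = document_term_matrix(tokens)
```

Note that this sketch does no lemmatization, so "topic" and "topics" end up in separate columns. Reducing words to their base forms, as a library such as spaCy can do, would merge them. A matrix like `dtm` is the kind of input a topic model such as LDA takes.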

Target group

Undergraduate and master's students, doctoral candidates, postdoctoral researchers, and experienced researchers who want to be introduced to the practice of Computational Social Science.

Requirements

Participants are expected to know the basics of Python and have at least some experience using it.

For the workshops, participants should bring a working system on which they can run Jupyter Notebooks. We will be using Python 3.9 and several standard libraries that are either part of the Anaconda 2022.10 distribution or can be installed on top of it. A list of the libraries participants will need, with their versions, will be circulated before the workshops.

We recommend that participants install Anaconda 2022.10. Feel free to also work in a cloud environment like Google Colab. Consult this link for more detailed instructions on how to set up your computing environment.
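As a quick self-check before the workshop, a short script along the following lines can report the Python version and whether the main libraries are installed. This is only a sketch: the package list is an assumption based on the libraries named in the course description (plus `notebook` for Jupyter), and the official list of libraries and versions will be the one circulated by the organizers.

```python
import sys
from importlib import metadata

# Assumed package names, taken from the libraries mentioned in the
# description; the circulated list may differ.
REQUIRED = ["spacy", "gensim", "notebook"]


def check_environment(packages):
    """Return the Python version and installed version (or None) per package."""
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


report = check_environment(REQUIRED)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Running this in a Jupyter Notebook cell or from a terminal prints one line per entry; anything shown as "not installed" would need to be added to the environment.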