Natural Language Processing Certificate

About the Program

The 12-credit-hour Natural Language Processing (NLP) Certificate will give undergraduate students the confidence and training they need in natural language processing: teaching computers to use language by extracting knowledge from text, and then using that knowledge in meaningful ways. The certificate will signal to employers that students have dedicated the time and energy necessary to develop the skills and confidence for working with this type of data.

The Certificate will serve a diverse student population, training both technically minded and less technically minded students in the basic skills necessary for gathering insights from NLP data.

Students in this certificate program will choose among the core course options and complete at least one elective course. Up to 6 units may be shared with a degree requirement (major, minor, General Education) or second certificate.

Declare my Certificate 

Learning Outcomes

  • Students will be able to critically analyze a data science problem to determine how natural language processing techniques might be applied
  • Students will be able to code a variety of natural language processing algorithms and techniques and apply them to specific data science problems

Required Courses

  • 12 units are required for the certificate 
  • Up to 6 units may be shared with a degree requirement (major, minor, General Education) or second certificate.

  • All students, including Information Science, Computer Science, and Linguistics major students, may only 'double use' 6 units towards another program of study (major, minor, General Education, or another certificate)

Each student will choose one of ISTA 130 (4 units, description below), CSC 110 (4 units), LING 201 (3 units), or LING 408 (3 units)

An introduction to computational techniques and using a modern programming language to solve current problems drawn from science, technology, and the arts. Topics include control structures, elementary data structures, and effective program design and implementation techniques. Weekly laboratory.

**Programming-intensive Course, College Algebra recommended

AND each student will choose either ISTA 355 (3 units) or LING 388 (3 units)

Natural language processing (NLP) is the study of how we can teach computers to use language by extracting knowledge from text, and then use that knowledge in some meaningful way. In this introductory course, we will examine the fundamental components on which natural language processing systems are built, including frequency distributions, part-of-speech tagging, syntactic parsing, semantics and analyzing meaning, search, introductory information and relation extraction, and structured knowledge resources. We will also examine pragmatic concerns in processing raw text from real-world sources.

Fundamentals of natural language processing and computational linguistics.

AND each student will take LING/ISTA/CSC 439 (3 units)

This course introduces the key concepts underlying statistical natural language processing. Students will learn a variety of techniques for the computational modeling of natural language, including: n-gram models, smoothing, Hidden Markov models, Bayesian Inference, Expectation Maximization, Viterbi, Inside-Outside Algorithm for Probabilistic Context-Free Grammars, and higher-order language models. Graduate-level requirements include assignments of greater scope than undergraduate assignments. In addition to being more in-depth, graduate assignments are typically longer and additional readings are required.

Elective Courses

Complete at least 3 units from the following courses (ISTA course descriptions below):

  • LING 408 (3 units)  
  • LING 438 (3 units) 
  • LING 478 (3 units) 
  • ISTA 131 (3 units) 
  • ISTA 455 (4 units)  
  • ISTA 456 (3 units) 
  • CSC 483 (3 units) 

At the core of Information Science lies the digital data that is the object of study. This course aims to introduce the tools, techniques, and issues involved with the handling of this data: where it comes from, how to store and retrieve it, how to extract knowledge from the data via analysis, and the social, ethical, and legal issues involved in its use. Throughout the course, students will be given hands-on experience with actual datasets from a variety of sources including social media and citizen science projects, as well as experience with common tools for analysis and visualization. Students will also examine topical case studies involving legal and ethical issues surrounding data.

Most web data today consists of unstructured text. This course will cover the fundamental knowledge necessary to organize such texts, search them in a meaningful way, and extract relevant information from them. This course will teach natural language processing through the design and development of end-to-end natural language understanding applications, including sentiment analysis (e.g., is this review positive or negative?), information extraction (e.g., extracting named entities and their relations from text), and question answering (retrieving exact answers to natural language questions such as “What is the capital of France?” from large document collections). We will use several natural language processing toolkits, such as NLTK and Stanford’s CoreNLP. The main programming language used in the course will be Python, but code written in Java or Scala will be accepted as well.

Most web data today consists of unstructured text. The existence of this data matters little, however, unless it is made available in a way that lets users quickly find the information relevant to their needs. This course will cover the fundamental knowledge necessary to build systems that do so, including web crawling, index construction and compression, boolean, vector-based, and probabilistic retrieval models, text classification and clustering, link analysis algorithms such as PageRank, and computational advertising. Students will also complete one programming project, in which they will construct a complex application that combines multiple algorithms into a system that solves real-world problems.