NLP 6840: Natural Language Processing
Project Suggestions

1. Mapping comments to code

Background: Many programming tasks are routine and repetitive. It would be useful to have a "program synthesis" model that takes as input a problem description in natural language (English) and outputs code that solves that problem. One approach for building such a model would be to train it on a dataset of (English, code) pairs, but then the question is where to get a large dataset of such training examples. One possible answer is in the comments. Programmers routinely label code with English descriptions, so one could create a dataset for training an automatic program synthesis system by extracting (comment, code segment) pairs from large software projects. Given a comment, the corresponding code segment starts right after the comment. The end of the code segment, however, can be less clear, especially for fine-grained comments inside functions.

Objective: Train a model that takes as input a comment in a source code file and determines the corresponding code segment. A possible approach is to use an RNN to train a classifier that first optionally goes over the comment, then goes over the lines of code following the comment and classifies each of them as either being associated with the comment (positive) or not (negative). At test time, the RNN would stop at the first line of code that it classifies as negative. The system could be trained on raw sequences of code tokens, or on higher-level features associated with a line of code that exploit the syntactic structure of the code, e.g. is this a new line, is this line at the same level as the first line after the comment, is there another comment immediately following this line.
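
The test-time procedure and the line-level features described above can be sketched as follows. This is only an illustration under several assumptions: the source file is Python (so a comment starts with "#"), "a new line" is read as a blank line, and the trained RNN is replaced by a hypothetical hand-written rule, toy_rule. All function and feature names are invented for this sketch.

```python
from typing import Callable, Dict, List


def indent_of(line: str) -> int:
    """Number of leading whitespace columns (tabs expanded to 4 spaces)."""
    expanded = line.expandtabs(4)
    return len(expanded) - len(expanded.lstrip())


def line_features(lines: List[str], i: int, first: int) -> Dict[str, bool]:
    """Features for line i, relative to `first`, the first code line after the comment."""
    stripped = lines[i].strip()
    following = lines[i + 1].strip() if i + 1 < len(lines) else ""
    non_blank = stripped != ""
    return {
        # Assumption: "is this a new line" means the line is blank.
        "is_blank": not non_blank,
        "same_level_as_first": non_blank and indent_of(lines[i]) == indent_of(lines[first]),
        "shallower_than_first": non_blank and indent_of(lines[i]) < indent_of(lines[first]),
        # Assumption: Python-style comments that start with "#".
        "next_is_comment": following.startswith("#"),
    }


def extract_segment(
    lines: List[str],
    comment_idx: int,
    is_positive: Callable[[Dict[str, bool]], bool],
) -> List[str]:
    """Collect lines after the comment until the first one classified as negative."""
    first = comment_idx + 1
    end = first
    while end < len(lines) and is_positive(line_features(lines, end, first)):
        end += 1
    return lines[first:end]


def toy_rule(features: Dict[str, bool]) -> bool:
    """Hypothetical stand-in for the trained classifier: stop at a blank line or a dedent."""
    return not (features["is_blank"] or features["shallower_than_first"])


source = [
    "def f(xs):",
    "    # sum the positive values",
    "    total = 0",
    "    for x in xs:",
    "        if x > 0:",
    "            total += x",
    "",
    "    # report the result",
    "    print(total)",
]
segment = extract_segment(source, 1, toy_rule)
print("\n".join(segment))
```

In a real system, is_positive would be the trained model's decision for the current line, possibly conditioned on the comment and on the lines seen so far, rather than a fixed rule over the features of one line.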

2. Coreference in mathematical statements

Objective: Adapt an existing ML-based coreference resolution system to solve coreference in mathematical statements, e.g. proofs, as shown on slides 23 to 32 of the Introduction lecture. This would entail some manual annotation of mathematical proofs (some already done), designing and implementing coreference features that are specific to math statements, adding these features to the existing ML-based system, and training and evaluating the resulting system. A substantial amount of work has already been done. For more details, contact me at <>.

3. Semantic parsing of mathematical statements

Objective: Take a mathematical proof as input and translate it into an equivalent Coq proof. Dr. Juedes has created a small dataset of pairs (mathematical statement, Coq program), using a coded template.

4. Extraction of text relevant to a citation

Objective: Given a citation of paper C in another paper P, extract all the sentences in P that are directly relevant to the paper C, i.e. they mention C or they discuss content (methods, results) from paper C. For this, you would have to create a dataset and design and train an NLP model.

5. Platonic verses

Objective: Based on an idea from an undergraduate Theater student.

6. Explainable NLP

Objective: Use Layer-wise Relevance Propagation or other similar methods to determine which parts of the input are most relevant for the classification decisions of an NLP system.
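
As a point of comparison, the sketch below uses token occlusion, a simpler perturbation-based attribution method, not the propagation method named above. Each token is removed in turn and the drop in the classifier's score is taken as that token's relevance. The scorer toy_score and its weights are hypothetical placeholders for a real NLP model.

```python
from typing import Callable, List


def occlusion_relevance(
    tokens: List[str], score: Callable[[List[str]], float]
) -> List[float]:
    """Relevance of each token = score of the full input minus score without that token."""
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def toy_score(tokens: List[str]) -> float:
    """Hypothetical sentiment scorer: sum of hand-picked word weights."""
    weights = {"great": 2.0, "terrible": -2.0}
    return sum(weights.get(t, 0.0) for t in tokens)


tokens = ["the", "movie", "was", "great"]
relevance = occlusion_relevance(tokens, toy_score)
for token, r in zip(tokens, relevance):
    print(f"{token}\t{r:+.1f}")
```

Occlusion treats the model as a black box and needs one forward pass per token, whereas propagation-based methods use the network's internal structure, so the two can serve as sanity checks for each other.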