Processing the text data

The goal is to extract relevant features that can be used for modeling, by applying techniques such as Syntactic Dependency Parsing and Bag-of-words. Text Vectorization is then applied to convert the text into a numeric representation for fitting the model.
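As a rough illustration of the Bag-of-words and vectorization steps, the sketch below counts word occurrences against a shared vocabulary using only the Python standard library. The helper names (`tokenize`, `build_vocabulary`, `vectorize`) and the two sample documents are assumptions made for this example; a real pipeline would more likely rely on a dedicated library.

```python
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    stripped = (w.strip(".,!?;:\"'").lower() for w in text.split())
    return [w for w in stripped if w]

def build_vocabulary(docs):
    """Map every distinct token to a column index (alphabetical order)."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}

def vectorize(doc, vocab):
    """Turn one document into a vector of token counts."""
    counts = Counter(tokenize(doc))
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:  # tokens outside the vocabulary are ignored
            vec[vocab[tok]] = n
    return vec

# Hypothetical sample documents, chosen only for illustration
docs = ["Mohammed is telling the truth", "He said I am telling the truth"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Each document becomes a fixed-length list of integers, which is the kind of numeric representation a model can be fitted on. Word order is lost, which is why a structural technique such as dependency parsing is a useful complement.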

Syntactic Dependency Parsing:

This approach analyzes the grammatical structure of a sentence by identifying the relationships between its words. It represents a sentence as a directed graph, with each word as a node and each grammatical relationship as an edge. The resulting structure is useful for tasks such as semantic role labeling.

e.g. "Mohammed is telling the truth, because he said I am telling the truth"

telling(ROOT)
  ________|________
 |               is
 |          _____|_____
 |         |         truth
 |         |       ___|____
 |         |      |       ,
 |         |      |       |
 |         |      |    because
 |         |      |       |
 |         |      |       |
 |         |      |       said
 |         |      |    ___|___
 |         |      |   |      am
 |         |      |   |      |
 |         |      |   |  telling
 |         |      |   |      |
Mohammed   .      the  I    truth
-----------------------------------
telling(VP) 
   └── Mohammed(NP) 
   └── is(VP) 
       └── truth(NP) 
   └── because(S) 
       └── said(VP) 
           └── he(NP) 
           └── am(VP) 
               └── telling(VP) 
                   └── truth(NP)

We can use this output to generate a set of features that capture the syntactic relationships between words in the sentence. For example, one possible set of features could include:

    The number of noun phrases (NP) in the sentence
    The number of verb phrases (VP) in the sentence
    The distance between the subject of the main verb ("Mohammed") and the predicate ("telling")
    The distance between the subject of the subordinate verb ("he") and the predicate ("said")
    The presence or absence of a circular reference between the two predicates ("telling" and "said")

JSON format:

{'tokens': ['Mohammed', 'is', 'telling', 'the', 'truth', ',', 'because', 'he', 'said', 'I', 'am', 'telling', 'the', 'truth'],
 'dependencies': [('ROOT', 0, 3),
                  ('nsubj', 3, 1),
                  ('aux', 3, 2),
                  ('det', 5, 4),
                  ('dobj', 3, 5),
                  ('punct', 3, 6),
                  ('mark', 9, 7),
                  ('nsubj', 9, 8),
                  ('advcl', 3, 9),
                  ('nsubj', 12, 10),
                  ('aux', 12, 11),
                  ('ccomp', 9, 12),
                  ('det', 14, 13),
                  ('dobj', 12, 14)]
}
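
To show how such a structure maps onto the directed graph described earlier, the sketch below builds an adjacency list from (label, head, dependent) triples and computes each word's depth below ROOT. It assumes 1-based token positions with 0 standing for ROOT, and it uses a hand-written set of triples for the example sentence rather than real parser output. The function names are hypothetical.

```python
from collections import defaultdict

tokens = ["Mohammed", "is", "telling", "the", "truth", ",", "because",
          "he", "said", "I", "am", "telling", "the", "truth"]

# Hand-written (label, head, dependent) triples; positions are 1-based, 0 = ROOT
dependencies = [
    ("ROOT", 0, 3), ("nsubj", 3, 1), ("aux", 3, 2), ("det", 5, 4),
    ("dobj", 3, 5), ("punct", 3, 6), ("mark", 9, 7), ("nsubj", 9, 8),
    ("advcl", 3, 9), ("nsubj", 12, 10), ("aux", 12, 11), ("ccomp", 9, 12),
    ("det", 14, 13), ("dobj", 12, 14),
]

def build_graph(dependencies):
    """Adjacency list: head position -> list of (label, dependent position)."""
    children = defaultdict(list)
    for label, head, dep in dependencies:
        children[head].append((label, dep))
    return children

def word(tokens, position):
    """Translate a 1-based position into its token (0 is the artificial ROOT)."""
    return "ROOT" if position == 0 else tokens[position - 1]

def depths(children, start=0):
    """Depth of every node reachable from start, via iterative traversal."""
    result = {}
    stack = [(start, 0)]
    while stack:
        node, d = stack.pop()
        result[node] = d
        for _, dep in children.get(node, []):
            stack.append((dep, d + 1))
    return result

graph = build_graph(dependencies)
depth = depths(graph)
root_position = graph[0][0][1]  # the single dependent of ROOT

print("root word:", word(tokens, root_position))
for pos in sorted(depth):
    print(pos, word(tokens, pos), depth[pos])
```

Depth in the graph is one more candidate feature: the embedded clause ("I am telling the truth") sits deeper than the main clause, which reflects how far a statement is nested inside reported speech.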