This repository has been archived by the owner on Oct 4, 2022. It is now read-only.

Port all the text researches to tree researches #2078

Closed
atimmer opened this issue Dec 20, 2018 · 4 comments
Labels: backlog, component: parse tree, enhancement, innovation (Innovative issue. Relating to performance, memory or data-flow.), owner: business

Comments

@atimmer
Contributor

atimmer commented Dec 20, 2018

The following researches exist now. After each research, I note what needs to be done for it.

In general I want to:

  • Drop all the verbs from the research names. A research is not something you do, it is an objective fact about the text.
  • Change count researches to plain researches. The count can easily be retrieved with .length whenever it is needed, and the actual results are far more useful than the count.
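
To illustrate the second point, here is a minimal sketch of a plain research whose count is derived by the caller. The function name `wordsResearch`, the whitespace-based word splitting, and the word object shape (`text`, `startOffset`, `endOffset`) are assumptions made for this example, not the actual implementation.

```javascript
// Hypothetical "words" research: returns word objects instead of a count.
// Splitting on whitespace is a simplification for illustration only.
function wordsResearch( text ) {
	const words = [];
	const wordPattern = /\S+/g;
	let match;
	while ( ( match = wordPattern.exec( text ) ) !== null ) {
		words.push( {
			text: match[ 0 ],
			startOffset: match.index,
			endOffset: match.index + match[ 0 ].length,
		} );
	}
	return words;
}

// An assessment that only needs the count derives it from the results.
const words = wordsResearch( "A research is an objective fact" );
const wordCount = words.length;
```

The same result array can serve an assessment that needs the count and one that needs the positions, which is the argument for dropping the dedicated count researches.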

List of current research

  • urlLength, unchanged
  • wordCountInText, change to words research. Count can easily be retrieved by doing .length in the assessment.
  • findKeywordInPageTitle, rename to keywordInPageTitle.
  • calculateFleschReading, rename to fleschReadingEase, change it so it uses the new syllableCount research.
  • getLinkStatistics, drop in favor of links research
  • getLinks, rename to links, change so it returns a complete object instead of only a URL.
  • linkCount, drop in favor of links research
  • imageCount, change to images research. Count can easily be retrieved by doing .length in the assessment. The research should return image objects with all the data for an image, so it is a lot more useful.
  • altTagCount, drop in favor of images research. Logic for counting how many images have or lack alt tags should be in the assessment.
  • matchKeywordInSubheadings, drop in favor of a combination of the headings and keywords research. The assessment can first get references to all the headings in the text and then ask if the keyword has been matched in these headings by calling getResearchForNode( 'keywords', heading ).
  • keywordCount, change to keywords research. Count can easily be retrieved by doing .length in the assessment. The research should return keyword objects with keyword, startOffset, endOffset, isExactMatch properties.
  • getKeywordDensity, rename to keywordDensity, change to use the keywords and words research.
  • stopWordsInKeyword, unchanged
  • stopWordsInUrl, unchanged
  • metaDescriptionLength, unchanged
  • keyphraseLength, unchanged
  • keywordCountInUrl, unchanged
  • findKeywordInFirstParagraph, drop in favor of a combination of the paragraphs and keywords research. The assessment can first get a reference to the first paragraph and then ask if the keyword has been matched in this paragraph by calling getResearchForNode( 'keywords', firstParagraph ).
  • metaDescriptionKeyword, unchanged
  • pageTitleWidth, unchanged
  • getWordComplexity, drop in favor of the words research. A word object should include the number of syllables in the word.
  • getParagraphLength, drop in favor of a combination of the paragraphs and words research. An assessment can get all the paragraphs with the paragraph research and then retrieve the words for a specific paragraph by calling getResearchForNode( 'words', paragraph ).
  • countSentencesFromText, change to sentences research.
  • countSentencesFromDescription, rename to descriptionSentences
  • getSubheadingTextLengths, drop in favor of a combination of the headings and words research
  • findTransitionWords, rename to transitionWords
  • passiveVoice, rename research file to passiveVoice
  • getSentenceBeginnings, we should probably not change this until we implement the linguistic tree
  • relevantWords, adapt for the tree, but this is going to require a very specifically tuned recursion strategy. We can also choose to refactor this later and only make this work on the root node for now.
  • readingTime, unchanged, but adapt for the tree
  • getTopicDensity, drop in favor of a combination of matchedTopics and words
  • topicCount, rename to matchedTopics, should return an object with properties matchedTopic, startOffset, endOffset.
  • sentences, unchanged, but works for the tree.
  • keyphraseDistribution, refactor to use the new words/keywords/matchedTopics research
  • morphology/buildKeywordsForms, rename to keywordForms
  • functionWordsInKeyphrase, unchanged
  • h1s, drop in favor of the headings research. That one can then easily be filtered based on the level (in this case 1)
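
Several items in this list move logic from the research into the assessment. The sketch below shows how that could look for matchKeywordInSubheadings, h1s and altTagCount. Only the call shape `getResearchForNode( name, node )` and the keyword object properties come from the list above. The `createResearcher` helper, the node shapes, and the naive case-insensitive substring matching are assumptions for illustration.

```javascript
// Hypothetical stand-in for a tree researcher. Only the call shape
// getResearchForNode( name, node ) is taken from the issue text.
function createResearcher( keyword ) {
	const researches = {
		// Returns keyword match objects for one node (naive substring matching).
		keywords( node ) {
			const matches = [];
			const needle = keyword.toLowerCase();
			if ( needle.length === 0 ) {
				return matches;
			}
			const haystack = node.text.toLowerCase();
			let index = haystack.indexOf( needle );
			while ( index !== -1 ) {
				const endOffset = index + needle.length;
				matches.push( {
					keyword,
					startOffset: index,
					endOffset,
					isExactMatch: node.text.slice( index, endOffset ) === keyword,
				} );
				index = haystack.indexOf( needle, endOffset );
			}
			return matches;
		},
	};
	return {
		getResearchForNode( name, node ) {
			return researches[ name ]( node );
		},
	};
}

// Assumed results of the headings and images researches.
const headings = [
	{ type: "Heading", level: 1, text: "Parse tree researches" },
	{ type: "Heading", level: 2, text: "Why structure helps" },
	{ type: "Heading", level: 2, text: "Porting the tree researches" },
];
const images = [ { src: "a.png", alt: "A tree" }, { src: "b.png", alt: "" } ];

const researcher = createResearcher( "tree" );

// Replaces matchKeywordInSubheadings: ask the keywords research per heading.
const subheadingsWithKeyword = headings
	.filter( ( heading ) => heading.level > 1 )
	.filter( ( heading ) => researcher.getResearchForNode( "keywords", heading ).length > 0 );

// Replaces h1s: filter the headings research on level.
const h1s = headings.filter( ( heading ) => heading.level === 1 );

// Replaces altTagCount: the assessment counts images that lack an alt text.
const imagesWithoutAlt = images.filter( ( image ) => ! image.alt );
```

Each assessment stays a short filter over shared research results, so no dedicated research per assessment is needed.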

As a result of the above changes we also need to introduce some new research:

  • words, returns all the words for a specific node. Every word has a
  • keywords, returns all the keywords for a specific node.
  • subheadings, returns a reference to all the Heading nodes in the tree.
  • paragraph, returns a reference to all the Paragraph nodes in the tree.
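
As a rough sketch of the last two researches, the following walks a tree and collects references to nodes of a given type. The tree shape (`type`, `children`, `level`, `text`) and the helper name `collectNodes` are hypothetical, since the actual tree structure is not described in this issue.

```javascript
// Hypothetical tree shape: every node has a type and optional children.
const tree = {
	type: "Root",
	children: [
		{ type: "Heading", level: 1, text: "Title" },
		{ type: "Paragraph", text: "First paragraph." },
		{
			type: "Section",
			children: [
				{ type: "Heading", level: 2, text: "Subheading" },
				{ type: "Paragraph", text: "Second paragraph." },
			],
		},
	],
};

// Depth-first walk that collects references (not copies) to matching nodes.
function collectNodes( node, type, found = [] ) {
	if ( node.type === type ) {
		found.push( node );
	}
	for ( const child of node.children || [] ) {
		collectNodes( child, type, found );
	}
	return found;
}

const headingNodes = collectNodes( tree, "Heading" );
const paragraphNodes = collectNodes( tree, "Paragraph" );
```

Because the results are references into the tree, an assessment could pass any of them straight to getResearchForNode to run another research on just that node.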

atimmer added this to the StructuredTree milestone on Dec 20, 2018
atimmer added the innovation label on Dec 21, 2018
atimmer added the backlog label on Feb 1, 2019

@igorschoester
Member

I checked the current issues with the label requires structured data / html parser to match them to these researches. The goal is to make it easier to double check if the issues are indeed fixed after/while creating the tree researches.

This is a work in progress; at this time there are 6 issues left to match.

To match still

getSentenceBeginnings

headings/subheadings

getParagraphLength

getSubheadingTextLengths

getKeywordDensity

keyphraseDistribution

links

images

@atimmer
Contributor Author

atimmer commented Mar 6, 2019

List with priority, based on a deliberation with @moorscode.

Sub Headings Keyword

Internal Links

Text Competing Links (currently disabled)

Keyphrase Distribution

Keyword Density

Sentence Beginnings

Subheading Distribution Too Long (currently disabled)

Text Images

Outbound Links
Passive Voice
Sentence Length In Text
Transition Words
Keyphrase Length
Single H1

No marking

Introduction Keyword
Keyword Stop Words
Function Words In Keyphrase
Text Presence
Sentence Length In Description
Flesch Reading Ease
Meta Description Keyword
Meta Description Length
Page Title Width
Taxonomy Text Length
Url Keyword
Url Length
Url Stop Words
Text Length
Title Keyword

Assessment disabled

Word Complexity (assessment disabled)

@manuelaugustin
Contributor

manuelaugustin commented Mar 19, 2019

Other issues that should be solved after the implementation of the tree (not specific to a given assessment, but possibly researches):

@omarreiss
Contributor

Closing all parse tree issues.
