Loss of format or data when li contains p #769

Open
ezscode opened this issue Dec 16, 2024 · 1 comment · May be fixed by #772
Labels
bug Something isn't working

ezscode commented Dec 16, 2024

Take this URL as an example:

https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview


1. The li formatting is ignored here; the item is output as if it were a plain p node:

<li>
  <p>Indexing strategies that load and refresh at scale, for all of your content, at the frequency you require.</p>
</li>

2. Not all of the li content is extracted here:

<ul>
  <li><a href="https://aka.ms/azai/py" data-linktype="external">Python</a></li>
  <li><a href="https://aka.ms/azai/net" data-linktype="external">.NET</a></li>
  <li><a href="https://aka.ms/azai/js" data-linktype="external">JavaScript</a></li>
  <li><a href="https://aka.ms/azai/java" data-linktype="external">Java</a></li>
</ul>
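
To make "can't get all li content" checkable, here is a minimal sketch that lists the text of every li in the source HTML and reports which ones are absent from the extracted output. It uses only the standard library; `ListItemCollector`, `list_item_texts` and `missing_items` are hypothetical helper names, not part of trafilatura, and the extracted string in the usage lines is an illustrative stand-in rather than real trafilatura output.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the visible text of every top-level <li> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []    # text of each completed <li>
        self._depth = 0    # > 0 while inside at least one <li>
        self._buffer = []  # text chunks of the <li> being read

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            if self._depth == 1:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Normalize whitespace so <p> wrappers and indentation do not matter.
                self.items.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def list_item_texts(html):
    """Return the text of each <li> in the given HTML string."""
    parser = ListItemCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def missing_items(html, extracted):
    """Return list items whose text does not occur in the extracted output.

    This is a plain substring check, so it is only a rough signal.
    """
    return [text for text in list_item_texts(html) if text not in extracted]


SAMPLE = """<ul>
  <li><a href="https://aka.ms/azai/py" data-linktype="external">Python</a></li>
  <li><a href="https://aka.ms/azai/net" data-linktype="external">.NET</a></li>
  <li><a href="https://aka.ms/azai/js" data-linktype="external">JavaScript</a></li>
  <li><a href="https://aka.ms/azai/java" data-linktype="external">Java</a></li>
</ul>"""

print(list_item_texts(SAMPLE))  # ['Python', '.NET', 'JavaScript', 'Java']
# Illustrative stand-in for an extraction that kept only the first item:
print(missing_items(SAMPLE, "[Python](https://aka.ms/azai/py)"))
```

Passing the real `trafilatura.extract` result as `extracted` would show which of the four links were dropped.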

3. The list formatting is lost after the first li element here:

<ul>
  <li>Azure AI Foundry, <a href="/en-us/azure/ai-studio/concepts/retrieval-augmented-generation" data-linktype="absolute-path">use a vector index and retrieval augmentation</a>.</li>
  <li>Azure OpenAI, <a href="/en-us/azure/ai-services/openai/concepts/use-your-data" data-linktype="absolute-path">use a search index with or without vectors</a>.</li>
  <li>Azure Machine Learning, <a href="/en-us/azure/machine-learning/how-to-create-vector-index" data-linktype="absolute-path">use a search index as a vector store in a prompt flow</a>.</li>
</ul>

The result is:

- Azure AI Foundry,
[use a vector index and retrieval augmentation](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/retrieval-augmented-generation). - Azure OpenAI,
[use a search index with or without vectors](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/use-your-data). - Azure Machine Learning,
[use a search index as a vector store in a prompt flow](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-vector-index).

I expect the following:

- Azure AI Foundry,
[use a vector index and retrieval augmentation](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/retrieval-augmented-generation). 
- Azure OpenAI,
[use a search index with or without vectors](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/use-your-data). 
- Azure Machine Learning,
[use a search index as a vector store in a prompt flow](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-vector-index).
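
The difference between the two outputs above can be expressed as a small check: count the lines that begin with a list marker. This is only a sketch of how the expectation could be asserted in a test; `count_list_items` is a hypothetical helper, and `ACTUAL` and `EXPECTED` are the two outputs quoted above.

```python
def count_list_items(markdown):
    """Count lines that start with a '- ' list marker (leading spaces allowed)."""
    return sum(1 for line in markdown.splitlines() if line.lstrip().startswith("- "))


ACTUAL = (
    "- Azure AI Foundry,\n"
    "[use a vector index and retrieval augmentation](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/retrieval-augmented-generation). - Azure OpenAI,\n"
    "[use a search index with or without vectors](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/use-your-data). - Azure Machine Learning,\n"
    "[use a search index as a vector store in a prompt flow](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-vector-index).\n"
)

EXPECTED = (
    "- Azure AI Foundry,\n"
    "[use a vector index and retrieval augmentation](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/retrieval-augmented-generation).\n"
    "- Azure OpenAI,\n"
    "[use a search index with or without vectors](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/use-your-data).\n"
    "- Azure Machine Learning,\n"
    "[use a search index as a vector store in a prompt flow](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-vector-index).\n"
)

print(count_list_items(ACTUAL))    # 1: only the first marker starts a line
print(count_list_items(EXPECTED))  # 3: one marker per list item
```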

My code is below:

import trafilatura 

def url2md_tfl(url):

    print('-- url2md : ', url)
    if not url:
        return ''

    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        print('xx trafilatura failed to fetch the page:', url)
        return ''

    output_format = "txt"
    # 'txt', 'csv', 'json', 'xml', or 'xmltei'.

    meta = trafilatura.extract_metadata(downloaded)
    # The title can be missing, so fall back to an empty string.
    title = (meta.title or '').strip() if meta else ''

    result = trafilatura.extract(downloaded, url, 
            record_id=None, no_fallback=False,
            favor_precision=True, favor_recall=False,
            include_comments=False, output_format=output_format,
            tei_validation=False, target_language=None,
            include_tables=True, include_images=True, include_formatting=True,
            include_links=True, deduplicate=False,
            date_extraction_params=None,
            only_with_metadata=False, with_metadata=False,
            max_tree_size=None, url_blacklist=None, 
            author_blacklist=None, settingsfile=None )

    if result is None or not result.strip():
        print('xx trafilatura get url err :', url)
        return ''
    
    md_str = '# ' + title + '\n\n' + result  
    return md_str 
 
url = 'https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview'
md_str = url2md_tfl(url)
save_path = '/Users/xx/Downloads/t41-1.md'
with open(save_path, 'w', encoding='utf-8') as f:
    f.write(md_str)

adbar added the bug label Dec 18, 2024

adbar (Owner) commented Dec 18, 2024

There are several things happening here:

  1. HTML formatting is odd/unexpected (not critical)
  2. The link filter removes too much text
  3. This only happens when links are included; without links the list looks fine

unsleepy22 pushed a commit to unsleepy22/trafilatura that referenced this issue Jan 3, 2025