
chatbot

This repository is under active development. It is part of ongoing research at Illinois Institute of Technology, Chicago, on using LLMs in chatbot/agent development.

The web scraping pipeline has been refined substantially over the past couple of weeks (as of Jan 15, 2024). The scraping logic is currently split across two main files:

  1. url_crawling.py: The MultiCrawler class in this module accepts multiple base_urls. For each base_url, a list of URLs is built by crawling that URL and its child URLs recursively, up to the max_depth parameter. This is handled by the Url_Crawl class, which inherits from LangChain's RecursiveURLLoader. Note that LangChain's source code has been tweaked so that URLs belonging to the same top-level (root) domain are also collected. Out of the box, LangChain sticks strictly to the base_url; because of this, we previously had to identify subdomains ourselves and feed them in separately to obtain their child URLs. Our modification extends the original functionality so that subdomains lying under a given base_url are followed automatically.

    • input format: a list of tuples [(base_url, max_depth), ...]; each tuple contains a base_url along with the maximum recursion depth for collecting child URLs
    • output format: a CSV file containing a single column of the consolidated URLs (an illustrative sketch follows these bullets)
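
The following is a minimal, self-contained sketch of the crawling flow described above. It does not use the repository's MultiCrawler / Url_Crawl classes or the tweaked LangChain loader; the requests/BeautifulSoup crawl, the same_root_domain heuristic, and the url column name in the output CSV are illustrative assumptions.

```python
import csv
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def same_root_domain(url: str, base_url: str) -> bool:
    """Keep URLs whose hostname falls under the base URL's root domain
    (e.g. bulletin.iit.edu matches iit.edu). Naive two-label heuristic."""
    base_host = urlparse(base_url).hostname or ""
    root = ".".join(base_host.split(".")[-2:])
    host = urlparse(url).hostname or ""
    return host == root or host.endswith("." + root)


def crawl(base_url: str, max_depth: int) -> set[str]:
    """Breadth-first crawl of base_url, following same-root-domain links
    up to max_depth levels deep."""
    seen = {base_url}
    queue = deque([(base_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            child = urljoin(url, a["href"]).split("#")[0]
            if child not in seen and same_root_domain(child, base_url):
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


if __name__ == "__main__":
    # Input format: list of (base_url, max_depth) tuples.
    targets = [("https://www.iit.edu", 2), ("https://bulletin.iit.edu", 1)]
    urls: set[str] = set()
    for base_url, max_depth in targets:
        urls |= crawl(base_url, max_depth)
    # Output format: one-column CSV of consolidated URLs (filename is illustrative).
    with open("consolidated_urls.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        writer.writerows([u] for u in sorted(urls))
```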
  2. webscrapping.py: This module scrapes webpages iteratively, reading URLs from the CSV file produced by url_crawling.py. There are two aspects to the scraping method:

    i. scraping based on the type of webpage: for IIT webpages, we have designed a customized scraper using the BeautifulSoup library, whereas other webpages are handled with the Unstructured.io library.

    ii. splitting: This is our research area. Three classes are defined in this module. The primary orchestrator is the WebScraper class, whose use_split parameter controls whether scraped pages are stored in per-split_id directories. Before scraping, the URLs are split into categories according to the kind of information they contain; for IIT, some URLs pertain to admissions, others to academics, and so on. We curated five categories using our collaborative judgement, and an additional split_id column was added manually to the CSV file generated by url_crawling.py. When use_split = True in WebScraper, each URL is scraped and the resulting text file is saved in the directory corresponding to its split_id (see the sketch after this list).

    • input format: a CSV file, with or without a split_id column
    • output format: text files
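
Below is a rough sketch of the split-aware scraping loop, assuming the input CSV has a url column and, optionally, a split_id column. The scrape_iit_page and scrape_other_page helpers are hypothetical stand-ins for the repository's customized BeautifulSoup scraper and its Unstructured.io path; directory and file names are illustrative.

```python
import csv
from pathlib import Path
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup


def scrape_iit_page(url: str) -> str:
    """Stand-in for the customized BeautifulSoup scraper used for IIT pages:
    here we simply extract the main-content text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    main = soup.find("main") or soup.body or soup
    return main.get_text(separator="\n", strip=True)


def scrape_other_page(url: str) -> str:
    """Stand-in for the Unstructured.io path used for non-IIT pages."""
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)


def run(csv_path: str, out_dir: str = "scraped", use_split: bool = True) -> None:
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    for i, row in enumerate(rows):
        url = row["url"]  # assumed column name
        # Fall back to a single directory when split_id is absent or splitting is disabled.
        split_id = row.get("split_id", "all") if use_split else "all"
        target_dir = Path(out_dir) / str(split_id)
        target_dir.mkdir(parents=True, exist_ok=True)
        host = urlparse(url).hostname or ""
        scraper = scrape_iit_page if host.endswith("iit.edu") else scrape_other_page
        try:
            text = scraper(url)
        except requests.RequestException:
            continue
        (target_dir / f"page_{i}.txt").write_text(text, encoding="utf-8")


if __name__ == "__main__":
    run("consolidated_urls.csv", use_split=True)
```

With use_split=True and a split_id column present, each page's text lands in a subdirectory named after its category; otherwise everything is written to a single directory.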
