Results 1 - 10 of 26 for Extraction (0.06 sec)
fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/impl/PdfExtractor.java
* * <p>The extractor runs text extraction in a separate thread with a configurable timeout * to prevent hanging on problematic PDF files. It also extracts metadata from the PDF * document and includes it in the extraction result. * * <p>Features: * <ul> * <li>Text extraction from PDF pages</li> * <li>Embedded document extraction</li> * <li>Annotation extraction (file attachments)</li>
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Sun Nov 23 12:19:14 UTC 2025 - 12.8K bytes - Viewed (0) -
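The PdfExtractor snippet above describes running text extraction in a separate thread with a configurable timeout. A minimal sketch of that general pattern, using only `java.util.concurrent`, might look like the following. The class name `TimeoutExtractionSketch`, the method `extractWithTimeout`, and the fallback parameter are hypothetical and are not taken from PdfExtractor itself.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not the actual PdfExtractor implementation.
public class TimeoutExtractionSketch {

    /**
     * Runs the extraction task on a worker thread and waits at most
     * timeoutMillis for it. Returns the fallback value on timeout.
     */
    static String extractWithTimeout(Callable<String> task, long timeoutMillis, String fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The task took too long: interrupt it and give up on this document.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Extraction failed", e.getCause());
        } finally {
            // Interrupts the worker if it is still running.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast task finishes within the limit.
        System.out.println(extractWithTimeout(() -> "page text", 1000, ""));
        // A slow task is cancelled and the fallback is returned.
        System.out.println(extractWithTimeout(() -> {
            Thread.sleep(5000);
            return "late";
        }, 100, "<timed out>"));
    }
}
```

Cancelling with `future.cancel(true)` only interrupts the worker thread; a parser that ignores interrupts would keep running, so a real extractor may need additional safeguards.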
fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/ExtractorBuilder.java
* The builder allows setting parameters such as MIME type, filename, extractor name, maximum content length, * and cache file size to optimize the extraction process. * * <p> * The main purpose of this class is to simplify the extraction process by providing a fluent interface * for configuring the extraction parameters and handling the underlying complexities of content processing,
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Sun Jul 06 02:13:03 UTC 2025 - 10.1K bytes - Viewed (0) -
fess-crawler/src/test/java/org/codelibs/fess/crawler/extractor/impl/ExtractorResourceManagementTest.java
.singleton("textExtractor", TextExtractor.class); } /** * Test that MsWordExtractor properly closes resources on successful extraction. */ public void test_MsWordExtractor_closesResourcesOnSuccess() throws IOException { final MsWordExtractor extractor = container.getComponent("msWordExtractor");
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Mon Nov 24 03:59:47 UTC 2025 - 10.4K bytes - Viewed (0) -
README.md
## Overview **Fess Crawler** is a powerful, flexible Java-based web crawling framework designed for enterprise-scale content extraction and processing. Built with a modular architecture, it supports multiple protocols (HTTP/HTTPS, File System, FTP, SMB, Cloud Storage) and provides extensive content extraction capabilities from various document formats. ### Key Features
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Sun Aug 31 05:32:52 UTC 2025 - 15.3K bytes - Viewed (0) -
fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/impl/CsvExtractor.java
/** * Extracts text content and metadata from CSV files. * This extractor provides better structured data extraction compared to Tika's generic text extraction. * * <p>Features: * <ul> * <li>Automatic delimiter detection (comma, tab, semicolon, pipe)</li> * <li>Header row detection and extraction</li> * <li>Column name to data value association</li> * <li>Quoted field handling</li>
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Thu Dec 11 08:38:29 UTC 2025 - 12.8K bytes - Viewed (0) -
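The CsvExtractor snippet lists automatic delimiter detection among comma, tab, semicolon, and pipe, together with quoted field handling. One common heuristic, sketched below, counts each candidate outside double quotes and picks the one that appears the same non-zero number of times on every sample line. This is an assumption about how such detection could work, not CsvExtractor's actual algorithm, and `DelimiterDetectionSketch` and its methods are hypothetical names.

```java
import java.util.List;

// Hypothetical illustration; not the actual CsvExtractor implementation.
public class DelimiterDetectionSketch {

    private static final char[] CANDIDATES = { ',', '\t', ';', '|' };

    /** Counts occurrences of the delimiter that are not inside double quotes. */
    static int countOutsideQuotes(String line, char delimiter) {
        int count = 0;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // an escaped "" toggles twice, leaving the state unchanged
            } else if (c == delimiter && !inQuotes) {
                count++;
            }
        }
        return count;
    }

    /**
     * Picks the candidate with the highest per-line count among those whose
     * count is identical on every sample line. Defaults to a comma.
     */
    static char detectDelimiter(List<String> sampleLines) {
        char best = ',';
        int bestScore = 0;
        for (char candidate : CANDIDATES) {
            int first = -1;
            boolean consistent = true;
            for (String line : sampleLines) {
                int n = countOutsideQuotes(line, candidate);
                if (first < 0) {
                    first = n;
                } else if (n != first) {
                    consistent = false;
                }
            }
            if (consistent && first > bestScore) {
                best = candidate;
                bestScore = first;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detectDelimiter(List.of("a;b;c", "1;2;3")));
        System.out.println(detectDelimiter(List.of("name|note", "x|\"a,b\"")));
    }
}
```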
fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/impl/TikaExtractor.java
* <li>Handling resource names and content types</li> * <li>Retrying extraction without resource name or content type if the initial attempt fails</li> * <li>Extracting text from metadata if the main content extraction fails</li> * <li>Reading content as plain text if all other methods fail</li> * <li>Applying post-extraction filters</li> * <li>Handling Tika exceptions, including zip bomb exceptions</li> * </ul> *
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Sun Nov 23 12:19:14 UTC 2025 - 30.8K bytes - Viewed (0) -
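The TikaExtractor snippet describes a chain of fallbacks: retry without the resource name or content type, then try metadata, then read the content as plain text. The general shape of such a chain can be sketched as an ordered list of attempts where the first non-blank result wins. The class `FallbackChainSketch` and method `firstSuccessful` are hypothetical, and the sketch makes no Tika calls; the real class's control flow may differ.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;

// Hypothetical illustration of a fallback chain; not the actual TikaExtractor code.
public class FallbackChainSketch {

    /**
     * Tries each extraction attempt in order and returns the first result
     * that is neither null nor blank. Failed attempts are skipped.
     */
    static Optional<String> firstSuccessful(List<Callable<String>> attempts) {
        for (Callable<String> attempt : attempts) {
            try {
                String text = attempt.call();
                if (text != null && !text.isBlank()) {
                    return Optional.of(text);
                }
            } catch (Exception e) {
                // This attempt failed; fall through to the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> result = firstSuccessful(List.<Callable<String>>of(
                () -> { throw new IllegalStateException("parser failed"); }, // stands in for the initial parse
                () -> "",                                                    // stands in for an empty retry
                () -> "plain text"));                                        // stands in for the last-resort read
        System.out.println(result.orElse("<nothing extracted>"));
    }
}
```

Swallowing every exception keeps the sketch short; a production chain would likely log each failure and let fatal conditions (such as the zip bomb exceptions the snippet mentions) propagate.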
CLAUDE.md
**Fess Crawler** is a Java-based web crawling framework for enterprise content extraction. ### Essential Info - **Language**: Java 21+ - **Build**: Maven 3.x - **License**: Apache 2.0 - **DI**: LastaFlute DI - **Repo**: https://github.com/codelibs/fess-crawler ### Tech Stack - **HTTP**: Apache HttpComponents 4.5+ - **Extraction**: Apache Tika 3.0+, POI 5.3+, PDFBox 3.0+ - **Testing**: JUnit 4, UTFlute, Mockito 5.7.0
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Fri Nov 28 17:31:34 UTC 2025 - 10.7K bytes - Viewed (0) -
fess-crawler/src/main/java/org/codelibs/fess/crawler/transformer/impl/HtmlTransformer.java
this.propertyMap = propertyMap; } /** * Gets the map of child URL extraction rules. * * @return the child URL rule map */ public Map<String, String> getChildUrlRuleMap() { return childUrlRuleMap; } /** * Sets the map of child URL extraction rules. * * @param childUrlRuleMap the child URL rule map to set */
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Sat Nov 29 07:42:33 UTC 2025 - 30.5K bytes - Viewed (0) -
src/main/java/org/codelibs/fess/helper/DocumentHelper.java
/** * Helper class for document processing and manipulation in the Fess search system. * This class provides utilities for processing document content, titles, and digests, * handling text normalization, content extraction, and similar document hash encoding/decoding. * It also manages document processing requests and integrates with the crawler system. * */ public class DocumentHelper {
Registered: Sat Dec 20 09:19:18 UTC 2025 - Last Modified: Fri Nov 28 16:29:12 UTC 2025 - 17.4K bytes - Viewed (0) -
fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/impl/ApiExtractor.java
* * @param in the input stream to extract text from * @param params additional parameters * @return the extracted data * @throws ExtractException if extraction fails */ @Override public ExtractData getText(final InputStream in, final Map<String, String> params) { if (logger.isDebugEnabled()) { logger.debug("Accessing {}", url); }
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Mon Nov 24 03:59:47 UTC 2025 - 12.2K bytes - Viewed (0)