Results 1 - 10 of 58 for Extraction (0.62 sec)
fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/impl/PdfExtractor.java
* * <p>The extractor runs text extraction in a separate thread with a configurable timeout * to prevent hanging on problematic PDF files. It also extracts metadata from the PDF * document and includes it in the extraction result. * * <p>Features: * <ul> * <li>Text extraction from PDF pages</li> * <li>Embedded document extraction</li> * <li>Annotation extraction (file attachments)</li>
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Sun Nov 23 12:19:14 UTC 2025 - 12.8K bytes - Viewed (0) -
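The PdfExtractor snippet above describes running text extraction in a separate thread with a configurable timeout. A minimal sketch of that general pattern follows, using only `java.util.concurrent`. The class `TimeoutExtractionSketch` and the method `extractWithTimeout` are hypothetical names chosen for illustration; this is not the PdfExtractor implementation, which may be structured differently.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration class; not part of fess-crawler.
class TimeoutExtractionSketch {

    /** Runs the task on a worker thread and gives up after timeoutMillis. */
    static String extractWithTimeout(final Callable<String> task, final long timeoutMillis) {
        // Daemon worker so a stuck extraction cannot keep the JVM alive.
        final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            final Thread t = new Thread(r, "extraction-worker");
            t.setDaemon(true);
            return t;
        });
        final Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (final TimeoutException e) {
            future.cancel(true); // interrupt the worker if it is still running
            throw new IllegalStateException("Extraction timed out after " + timeoutMillis + " ms", e);
        } catch (final InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Extraction was interrupted", e);
        } catch (final ExecutionException e) {
            throw new IllegalStateException("Extraction failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(final String[] args) {
        // Fast task: completes well within the timeout.
        System.out.println(extractWithTimeout(() -> "page text", 1000L));
        // Slow task: stands in for a problematic file that would otherwise hang.
        try {
            extractWithTimeout(() -> {
                Thread.sleep(5000L);
                return "never returned";
            }, 100L);
        } catch (final IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The sketch reports a timeout as an unchecked exception for brevity; a real extractor would more likely translate it into its own exception type.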
fess-crawler/src/test/java/org/codelibs/fess/crawler/extractor/impl/FilenameExtractorEnhancedTest.java
} /** * Test extraction with null parameters map. */ public void test_getText_withNullParams() { final InputStream in = new ByteArrayInputStream(new byte[0]); final ExtractData result = filenameExtractor.getText(in, null); assertNotNull(result); assertEquals("", result.getContent()); } /** * Test extraction with empty parameters map. */
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Mon Nov 24 03:59:47 UTC 2025 - 7K bytes - Viewed (0) -
fess-crawler/src/test/java/org/codelibs/fess/crawler/extractor/impl/TextExtractorEnhancedTest.java
assertTrue("Error message should indicate extraction failure", e.getMessage().contains("Failed to extract")); } finally { // Reset to default encoding textExtractor.setEncoding("UTF-8"); } } /** * Test extraction with empty input stream. */ public void test_getText_emptyInputStream_returnsEmptyContent() {
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Mon Nov 24 03:59:47 UTC 2025 - 8.9K bytes - Viewed (0) -
fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/impl/JsonExtractor.java
* This extractor provides better structured data extraction compared to Tika's generic text extraction. * * <p>Features: * <ul> * <li>Structured text extraction with key-value pairs</li> * <li>Top-level field extraction as metadata</li> * <li>Nested structure flattening with configurable depth</li> * <li>Array element extraction</li> * <li>Configurable field separator and array formatting</li> * </ul>
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Sun Nov 23 03:46:53 UTC 2025 - 9.7K bytes - Viewed (0) -
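The JsonExtractor snippet above lists key-value text output, nested structure flattening with a configurable depth, array element extraction, and a configurable field separator. The sketch below illustrates those ideas on an already-parsed structure of `Map` and `List` values. The class `FlattenSketch`, its `flatten` method, and the `key: value` and `[index]` output formats are assumptions made for illustration; the actual output format and options of JsonExtractor are not shown in the snippet.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of fess-crawler.
class FlattenSketch {

    /** Flattens nested maps and lists into "path: value" lines down to maxDepth levels. */
    static List<String> flatten(final Map<String, Object> root, final int maxDepth, final String separator) {
        final List<String> lines = new ArrayList<>();
        walk("", root, 1, maxDepth, separator, lines);
        return lines;
    }

    private static void walk(final String prefix, final Object value, final int depth, final int maxDepth,
            final String separator, final List<String> lines) {
        if (value instanceof Map && depth <= maxDepth) {
            // Descend into an object, joining keys with the separator.
            for (final Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                final String key = prefix.isEmpty() ? String.valueOf(e.getKey()) : prefix + separator + e.getKey();
                walk(key, e.getValue(), depth + 1, maxDepth, separator, lines);
            }
        } else if (value instanceof List && depth <= maxDepth) {
            // Emit each array element under an indexed path.
            final List<?> list = (List<?>) value;
            for (int i = 0; i < list.size(); i++) {
                walk(prefix + "[" + i + "]", list.get(i), depth + 1, maxDepth, separator, lines);
            }
        } else {
            // Scalar, or a container beyond the depth limit: emit its string form as-is.
            lines.add(prefix + ": " + value);
        }
    }

    public static void main(final String[] args) {
        final Map<String, Object> author = new LinkedHashMap<>();
        author.put("name", "Ann");
        author.put("tags", List.of("a", "b"));
        final Map<String, Object> root = new LinkedHashMap<>();
        root.put("title", "Guide");
        root.put("author", author);
        flatten(root, 3, ".").forEach(System.out::println);
    }
}
```

With a depth limit of 1, the same input yields `title: Guide` followed by a single line holding the unflattened `author` value, which shows how the limit bounds the number of generated fields.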
fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/ExtractorBuilder.java
* The builder allows setting parameters such as MIME type, filename, extractor name, maximum content length, * and cache file size to optimize the extraction process. * * <p> * The main purpose of this class is to simplify the extraction process by providing a fluent interface * for configuring the extraction parameters and handling the underlying complexities of content processing,
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Sun Jul 06 02:13:03 UTC 2025 - 10.1K bytes - Viewed (0) -
fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/impl/MarkdownExtractor.java
* This extractor provides better structured data extraction compared to Tika's generic text extraction. * * <p>Features: * <ul> * <li>YAML front matter metadata extraction</li> * <li>Heading structure extraction</li> * <li>Link URL extraction</li> * <li>Code block content extraction</li> * <li>Clean text conversion from Markdown</li> * <li>Configurable encoding</li>
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Sun Nov 23 03:46:53 UTC 2025 - 8.2K bytes - Viewed (0) -
fess-crawler/src/test/java/org/codelibs/fess/crawler/extractor/impl/ExtractorResourceManagementTest.java
.singleton("textExtractor", TextExtractor.class); } /** * Test that MsWordExtractor properly closes resources on successful extraction. */ public void test_MsWordExtractor_closesResourcesOnSuccess() throws IOException { final MsWordExtractor extractor = container.getComponent("msWordExtractor");
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Mon Nov 24 03:59:47 UTC 2025 - 10.4K bytes - Viewed (0) -
fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/impl/LhaExtractor.java
* * @param in the input stream containing the LHA archive * @param params extraction parameters * @return the extracted text data * @throws CrawlerSystemException if the input stream is null * @throws ExtractException if an error occurs during extraction * @throws MaxLengthExceededException if the extracted content size exceeds the maximum limit */ @Override
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Sun Nov 23 12:19:14 UTC 2025 - 5.9K bytes - Viewed (0) -
fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/impl/AbstractXmlExtractor.java
/** * Default character encoding for content extraction. */ protected String encoding = Constants.UTF_8; /** * The preload size for charset detection. */ protected int preloadSizeForCharset = 2048; /** * Indicates whether comment tags should be ignored during extraction. */ protected boolean ignoreCommentTag = false; /**
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Sun Nov 23 12:19:14 UTC 2025 - 8.6K bytes - Viewed (0) -
fess-crawler/src/test/java/org/codelibs/fess/crawler/extractor/impl/EXTRACTOR_TESTS_README.md
**Key Test Areas**: - Resource closure on successful extraction (MS Office extractors) - Resource closure on failed extraction - Improved error messages with context - Input validation using `validateInputStream()` **Covered Extractors**: - MsWordExtractor - MsExcelExtractor - MsPowerPointExtractor - TextExtractor **Test Count**: 8 tests **Key Scenarios**: - ✅ Successful extraction closes resources properly
Registered: Sat Dec 20 11:21:39 UTC 2025 - Last Modified: Wed Nov 19 08:55:01 UTC 2025 - 5.7K bytes - Viewed (0)