extraction - Code Search

README.md

## Overview

**Fess Crawler** is a powerful, flexible Java-based web crawling framework designed for enterprise-scale content extraction and processing. Built with a modular architecture, it supports multiple protocols (HTTP/HTTPS, File System, FTP, SMB, Cloud Storage) and provides extensive content extraction capabilities from various document formats.

### Key Features

Created: Sun Apr 12 03:50:13 GMT 2026

- Last Modified: Sun Aug 31 05:32:52 GMT 2025

- 15.3K bytes


github.com/minio/minio

internal/s3select/jstream/README.md

# jstream

[![GoDoc](https://godoc.org/github.com/bcicen/jstream?status.svg)](https://godoc.org/github.com/bcicen/jstream)


`jstream` is a streaming JSON parser and value extraction library for Go.

Unlike most JSON parsers, `jstream` is document position- and depth-aware. This enables the extraction of values at a specified depth, eliminating the overhead of allocating encompassing arrays or objects; e.g.:

Using the below example document:

Created: Sun Apr 05 19:28:12 GMT 2026

- Last Modified: Mon Sep 23 19:35:41 GMT 2024

- 3.2K bytes


github.com/codelibs/fess-crawler

CLAUDE.md

**Fess Crawler** is a Java-based web crawling framework for enterprise content extraction.

### Essential Info

- **Language**: Java 21+
- **Build**: Maven 3.x
- **License**: Apache 2.0
- **DI**: LastaFlute DI
- **Repo**: https://github.com/codelibs/fess-crawler

### Tech Stack

- **HTTP**: Apache HttpComponents 4.5+ and 5.x (switchable)
- **Extraction**: Apache Tika, POI, PDFBox

Created: Sun Apr 12 03:50:13 GMT 2026

- Last Modified: Thu Mar 12 03:39:20 GMT 2026

- 8.1K bytes


github.com/codelibs/fess

src/main/java/org/codelibs/fess/crawler/transformer/FessXpathTransformer.java

        }
        return new URL(currentUrl);
    }

    /**
     * Gets child URL extraction rules from configuration.
     *
     * @param responseData the response data from crawling
     * @param resultData the result data
     * @return stream of tag-attribute pairs for URL extraction
     */
    @Override

Created: Tue Mar 31 13:07:34 GMT 2026

- Last Modified: Thu Mar 12 01:46:45 GMT 2026

- 55.3K bytes


github.com/codelibs/fess

src/main/java/org/codelibs/fess/crawler/transformer/FessFileTransformer.java

            throw new FessSystemException("Could not find extractorFactory.");
        }
        final Extractor extractor = extractorFactory.getExtractor(responseData.getMimeType());
        if (logger.isDebugEnabled()) {
            logger.debug("url={}, extractor={}", responseData.getUrl(), extractor);
        }
        return extractor;
    }
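
The snippet above resolves an extractor from the response MIME type. A minimal, self-contained sketch of that lookup pattern follows, assuming a fallback extractor is used when no specific one is registered. `ExtractorLookupSketch`, `TextExtractor`, and `register` are hypothetical names for illustration only; they are not the Fess `ExtractorFactory` or `Extractor` API.

```java
import java.util.HashMap;
import java.util.Map;

class ExtractorLookupSketch {

    /** Hypothetical stand-in for an extractor; not the real Fess interface. */
    interface TextExtractor {
        String extract(byte[] content);
    }

    private final Map<String, TextExtractor> byMimeType = new HashMap<>();
    private final TextExtractor fallback;

    /** The fallback may be null, in which case unknown MIME types are an error. */
    ExtractorLookupSketch(final TextExtractor fallback) {
        this.fallback = fallback;
    }

    void register(final String mimeType, final TextExtractor extractor) {
        byMimeType.put(mimeType, extractor);
    }

    /** Returns the extractor registered for the MIME type, else the fallback. */
    TextExtractor getExtractor(final String mimeType) {
        final TextExtractor extractor = mimeType == null ? null : byMimeType.get(mimeType);
        if (extractor != null) {
            return extractor;
        }
        if (fallback == null) {
            throw new IllegalStateException("No extractor available for " + mimeType);
        }
        return fallback;
    }

    public static void main(final String[] args) {
        final ExtractorLookupSketch lookup = new ExtractorLookupSketch(content -> "generic text");
        lookup.register("text/plain", content -> new String(content));

        System.out.println(lookup.getExtractor("text/plain").extract("hello".getBytes()));
        System.out.println(lookup.getExtractor("application/pdf").extract(new byte[0]));
    }
}
```

Keeping the fallback inside the lookup means callers never handle a missing extractor themselves; they see an exception only when nothing at all can process the document.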

Created: Tue Mar 31 13:07:34 GMT 2026

- Last Modified: Fri Nov 28 16:29:12 GMT 2025

- 3.5K bytes


github.com/codelibs/fess

src/main/java/org/codelibs/fess/crawler/transformer/FessStandardTransformer.java

    }

    /**
     * Gets the appropriate extractor for the given response data.
     * Selects an extractor based on the MIME type or falls back to the Tika extractor.
     *
     * @param responseData the response data containing the document to extract
     * @return the extractor instance for processing the document
     * @throws FessSystemException if no suitable extractor can be found
     */
    @Override

Created: Tue Mar 31 13:07:34 GMT 2026

- Last Modified: Fri Nov 28 16:29:12 GMT 2025

- 3.8K bytes


github.com/gradle/gradle

.teamcity/scripts/CheckWrapper.java

    private static final Pattern ALLOWED_WRAPPER_VERSION =
        Pattern.compile("^[0-9.]+(-(rc|milestone|m)-[0-9]+)?$");

    // Keep the same extraction semantics as the old sed:
    //   sed 's/.*gradle-\(.*\)-[a-z]*\.[a-z]*/\1/'
    private static final Pattern WRAPPER_VERSION_EXTRACT =
        Pattern.compile(".*gradle-(.*)-[a-z]*\\.[a-z]*");
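
To illustrate how the two patterns above could work together, here is a small sketch that extracts a version from a distribution file name and then validates it. The class name `WrapperVersionSketch`, the helper methods, and the sample URLs are assumptions for illustration; how `CheckWrapper.java` actually applies the patterns is not shown in the snippet.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WrapperVersionSketch {

    // Both patterns are copied from the snippet above.
    private static final Pattern ALLOWED_WRAPPER_VERSION =
        Pattern.compile("^[0-9.]+(-(rc|milestone|m)-[0-9]+)?$");
    private static final Pattern WRAPPER_VERSION_EXTRACT =
        Pattern.compile(".*gradle-(.*)-[a-z]*\\.[a-z]*");

    /**
     * Hypothetical helper: returns the captured version, or the input unchanged
     * when the pattern does not match (mirroring what sed's s/// would do).
     */
    static String extractVersion(final String distributionUrl) {
        final Matcher matcher = WRAPPER_VERSION_EXTRACT.matcher(distributionUrl);
        return matcher.matches() ? matcher.group(1) : distributionUrl;
    }

    /** Hypothetical helper: true for releases, release candidates, and milestones. */
    static boolean isAllowed(final String version) {
        return ALLOWED_WRAPPER_VERSION.matcher(version).matches();
    }

    public static void main(final String[] args) {
        final String[] samples = {
            "https://services.gradle.org/distributions/gradle-8.5-bin.zip",
            "https://services.gradle.org/distributions/gradle-8.5-rc-1-bin.zip",
            "https://services.gradle.org/distributions-snapshots/gradle-8.5-20240101000000+0000-bin.zip"
        };
        for (final String url : samples) {
            final String version = extractVersion(url);
            System.out.println(version + " allowed=" + isAllowed(version));
        }
    }
}
```

The greedy capture group backtracks to the last `-` that is followed by a lowercase suffix such as `bin.zip`, so a version like `8.5-rc-1` is captured whole, while a timestamped snapshot version fails the allow-list check.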

Created: Wed Apr 01 11:36:16 GMT 2026

- Last Modified: Tue Jan 20 03:53:25 GMT 2026

- 6.4K bytes


github.com/apache/maven

impl/maven-cli/src/test/java/org/apache/maven/cling/invoker/mvnup/goals/GAVUtilsTest.java

/**
 * Tests Artifact extraction, computation, and parent resolution functionality.
 */
@DisplayName("GAVUtils")
class GAVUtilsTest {

    @BeforeEach
    void setUp() {}

    private UpgradeContext createMockContext() {
        return TestUtils.createMockContext();
    }

    @Nested
    @DisplayName("Artifact Extraction")
    class GAVExtractionTests {

        @Test

Created: Sun Apr 05 03:35:12 GMT 2026

- Last Modified: Tue Nov 18 18:03:26 GMT 2025

- 17.3K bytes


github.com/codelibs/fess

src/test/java/org/codelibs/fess/crawler/transformer/AbstractFessFileTransformerTest.java

import org.codelibs.fess.Constants;
import org.codelibs.fess.crawler.entity.ResponseData;
import org.codelibs.fess.crawler.exception.CrawlingAccessException;
import org.codelibs.fess.crawler.extractor.Extractor;
import org.codelibs.fess.mylasta.direction.FessConfig;
import org.codelibs.fess.unit.UnitFessTestCase;
import org.codelibs.fess.util.ComponentUtil;
import org.junit.jupiter.api.Test;

Created: Tue Mar 31 13:07:34 GMT 2026

- Last Modified: Thu Jan 15 12:54:47 GMT 2026

- 8.1K bytes


github.com/codelibs/fess

src/main/java/org/codelibs/fess/crawler/transformer/AbstractFessFileTransformer.java

    /**
     * Get the extracted data.
     * @param extractor The extractor.
     * @param in The input stream.
     * @param params The parameters.
     * @return The extracted data.
     */
    protected ExtractData getExtractData(final Extractor extractor, final InputStream in, final Map<String, String> params) {
        try {
            return extractor.getText(in, params);
        } catch (final RuntimeException e) {

Created: Tue Mar 31 13:07:34 GMT 2026

- Last Modified: Fri Nov 28 16:29:12 GMT 2025

- 25.7K bytes

